AI Models for Unit Test Generation: A Technical Comparison with Real Results

Writing unit tests is one of the highest-leverage use cases for AI coding tools. It’s repetitive, pattern-driven, and often deprioritized by developers under time pressure. But not all AI models produce the same quality of tests. This post compares how the major AI models perform on real unit test generation tasks — with actual code examples, benchmark data, and analysis of where each model fails.

The Test Setup

To make this comparison meaningful, we use the same input across all models: a realistic service class with business logic, edge cases, and error handling. We evaluate the output on five dimensions:

Coverage — does it test the happy path, edge cases, and error paths?
Correctness — do the tests actually compile and pass?
Test quality — are assertions meaningful? Are tests isolated?
Naming and readability — are test names descriptive?
Framework usage — does it use JUnit 5 / Mockito / AssertJ idiomatically?

The Subject Under Test

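import java.math.BigDecimal;
import java.time.Instant;

import org.springframework.stereotype.Service;

@Service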
public class OrderService {

    private final OrderRepository orderRepository;
    private final InventoryService inventoryService;
    private final PaymentGateway paymentGateway;

    public OrderService(OrderRepository orderRepository,
                        InventoryService inventoryService,
                        PaymentGateway paymentGateway) {
        this.orderRepository = orderRepository;
        this.inventoryService = inventoryService;
        this.paymentGateway = paymentGateway;
    }

    public Order placeOrder(Long userId, Long productId, int quantity) {
        if (quantity <= 0) {
            throw new IllegalArgumentException("Quantity must be positive");
        }

        Product product = inventoryService.getProduct(productId)
            .orElseThrow(() -> new ProductNotFoundException(productId));

        if (!inventoryService.isAvailable(productId, quantity)) {
            throw new InsufficientInventoryException(productId, quantity);
        }

        BigDecimal totalAmount = product.getPrice().multiply(BigDecimal.valueOf(quantity));

        PaymentResult payment = paymentGateway.charge(userId, totalAmount);
        if (!payment.isSuccessful()) {
            throw new PaymentFailedException(payment.getErrorCode());
        }

        Order order = Order.builder()
            .userId(userId)
            .productId(productId)
            .quantity(quantity)
            .totalAmount(totalAmount)
            .status(OrderStatus.CONFIRMED)
            .createdAt(Instant.now())
            .build();

        return orderRepository.save(order);
    }
}

The prompt given to each model:

“Write complete JUnit 5 unit tests for this OrderService class using Mockito for mocking. Cover the happy path, all exception paths, edge cases for quantity validation, and verify the interactions between dependencies. Use AssertJ for assertions and follow best practices for test naming.”

Model Results

Claude 3.5 Sonnet / Claude Sonnet 4

Claude generates the most complete test suite with the fewest prompting iterations. It covers every exception path on the first attempt, uses @ExtendWith(MockitoExtension.class) correctly, and names tests descriptively.

Generated test (excerpt):

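import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.argThat;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.never;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;

import java.math.BigDecimal;
import java.util.Optional;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

@ExtendWith(MockitoExtension.class)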
class OrderServiceTest {

    @Mock private OrderRepository orderRepository;
    @Mock private InventoryService inventoryService;
    @Mock private PaymentGateway paymentGateway;

    @InjectMocks private OrderService orderService;

    private static final Long USER_ID = 1L;
    private static final Long PRODUCT_ID = 42L;
    private static final int VALID_QUANTITY = 3;
    private static final BigDecimal UNIT_PRICE = new BigDecimal("29.99");

    @Test
    @DisplayName("placeOrder: successfully places order when all conditions are met")
    void placeOrder_allConditionsMet_returnsConfirmedOrder() {
        var product = Product.builder().id(PRODUCT_ID).price(UNIT_PRICE).build();
        var expectedTotal = UNIT_PRICE.multiply(BigDecimal.valueOf(VALID_QUANTITY));
        var savedOrder = Order.builder().id(99L).status(OrderStatus.CONFIRMED).build();

        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(true);
        when(paymentGateway.charge(USER_ID, expectedTotal))
            .thenReturn(PaymentResult.success());
        when(orderRepository.save(any(Order.class))).thenReturn(savedOrder);

        Order result = orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY);

        assertThat(result.getStatus()).isEqualTo(OrderStatus.CONFIRMED);
        verify(orderRepository).save(argThat(order ->
            order.getUserId().equals(USER_ID) &&
            order.getQuantity() == VALID_QUANTITY &&
            order.getTotalAmount().compareTo(expectedTotal) == 0
        ));
    }

    @Test
    @DisplayName("placeOrder: throws IllegalArgumentException when quantity is zero")
    void placeOrder_quantityIsZero_throwsIllegalArgumentException() {
        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, 0))
            .isInstanceOf(IllegalArgumentException.class)
            .hasMessageContaining("Quantity must be positive");

        verifyNoInteractions(inventoryService, paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws IllegalArgumentException when quantity is negative")
    void placeOrder_quantityIsNegative_throwsIllegalArgumentException() {
        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, -5))
            .isInstanceOf(IllegalArgumentException.class);

        verifyNoInteractions(inventoryService, paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws ProductNotFoundException when product does not exist")
    void placeOrder_productNotFound_throwsProductNotFoundException() {
        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.empty());

        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY))
            .isInstanceOf(ProductNotFoundException.class);

        verifyNoInteractions(paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws InsufficientInventoryException when stock is unavailable")
    void placeOrder_insufficientInventory_throwsInsufficientInventoryException() {
        var product = Product.builder().id(PRODUCT_ID).price(UNIT_PRICE).build();
        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(false);

        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY))
            .isInstanceOf(InsufficientInventoryException.class);

        verifyNoInteractions(paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws PaymentFailedException when payment is declined")
    void placeOrder_paymentDeclined_throwsPaymentFailedException() {
        var product = Product.builder().id(PRODUCT_ID).price(UNIT_PRICE).build();
        var expectedTotal = UNIT_PRICE.multiply(BigDecimal.valueOf(VALID_QUANTITY));

        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(true);
        when(paymentGateway.charge(USER_ID, expectedTotal))
            .thenReturn(PaymentResult.failure("CARD_DECLINED"));

        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY))
            .isInstanceOf(PaymentFailedException.class);

        verify(orderRepository, never()).save(any());
    }

    @Test
    @DisplayName("placeOrder: calculates total amount correctly based on unit price and quantity")
    void placeOrder_correctTotalCalculation_chargesCorrectAmount() {
        var product = Product.builder().id(PRODUCT_ID).price(new BigDecimal("10.00")).build();
        var expectedTotal = new BigDecimal("30.00"); // 10.00 * 3

        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(true);
        when(paymentGateway.charge(eq(USER_ID), eq(expectedTotal)))
            .thenReturn(PaymentResult.success());
        when(orderRepository.save(any())).thenReturn(Order.builder().build());

        orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY);

        verify(paymentGateway).charge(USER_ID, expectedTotal);
    }
}

Claude score: 7/7 test cases, all correct, idiomatic AssertJ usage, verifyNoInteractions used appropriately to validate cascade prevention.
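One detail in the happy-path test is worth calling out: the argThat matcher compares totals with compareTo rather than equals. BigDecimal.equals also compares scale, so 30.0 and 30.00 are not equal under equals even though they are numerically the same. A minimal standalone sketch of the difference follows; the class name BigDecimalComparison and the helper sameAmount are illustrative only and do not appear in the generated suite.

```java
import java.math.BigDecimal;

// Illustrative only: shows why the test compares totals with compareTo.
final class BigDecimalComparison {

    // True when both values are numerically equal, ignoring scale.
    static boolean sameAmount(BigDecimal a, BigDecimal b) {
        return a.compareTo(b) == 0;
    }

    public static void main(String[] args) {
        // 10.00 * 3 keeps scale 2, giving 30.00
        BigDecimal computed = new BigDecimal("10.00").multiply(BigDecimal.valueOf(3));
        // Same numeric value, but scale 1
        BigDecimal literal = new BigDecimal("30.0");

        System.out.println(computed.equals(literal));      // false: scales differ
        System.out.println(sameAmount(computed, literal)); // true: same numeric value
    }
}
```

The last test in the excerpt stubs charge with eq(expectedTotal), which relies on equals. It works there because the literal "30.00" has the same scale as the computed product; a literal such as "30" would not match.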

GPT-4o (ChatGPT)

GPT-4o generates solid tests but shows two recurring issues: it sometimes uses @SpringBootTest and @MockBean (Spring Boot test annotations) in plain unit tests where @ExtendWith(MockitoExtension.class) and @Mock are correct, and it occasionally omits verifyNoInteractions in the cascade failure cases.

// GPT-4o output — common pattern issue
@SpringBootTest  // ❌ Wrong for unit test — loads full context unnecessarily
class OrderServiceTest {

    @MockBean OrderRepository orderRepository;  // ❌ Should be @Mock
    @MockBean InventoryService inventoryService;
    @MockBean PaymentGateway paymentGateway;

    @Autowired OrderService orderService;

    // ... test methods omitted
}

After one correction prompt (“use @ExtendWith(MockitoExtension.class) and @Mock, not @SpringBootTest”), GPT-4o corrects this and produces high-quality tests. The final output is comparable to Claude’s in completeness, though test names are slightly less descriptive.

GPT-4o score (first attempt): 6/7 test cases, framework usage error requiring one correction, missing cascade verification in 2 tests.

GPT-4o score (after correction): 7/7, comparable quality to Claude.

Gemini 1.5 Pro

Gemini generates correct tests but tends toward over-mocking and less precise assertions. It also occasionally generates tests that test mock behavior rather than actual business logic.

// Gemini pattern — less precise assertion
@Test
void testPlaceOrderSuccess() {
    // Setup
    Product product = mock(Product.class);
    when(product.getPrice()).thenReturn(new BigDecimal("29.99"));
    // ...

    // Assert
    assertNotNull(result);  // ❌ Too weak — doesn't verify order content
    verify(orderRepository, times(1)).save(any(Order.class));  // ✅ Correct
}

Gemini also tends to use JUnit 4 style (@Before instead of @BeforeEach, assertEquals instead of AssertJ) unless explicitly instructed otherwise. When prompted with framework constraints, it adapts correctly.

Gemini 1.5 Pro score: 5/7 test cases on first attempt, weak assertions in 3 tests, JUnit 4 style requiring correction.
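To see why assertNotNull is too weak here, consider a hypothetical buggy implementation that forgets to multiply by quantity. The sketch below uses plain Java and stand-in names (PricingCheck, buggyTotal, weakCheck, strongCheck) rather than the real OrderService or any test framework. It only illustrates that a non-null check passes on a wrong result while a value check catches it.

```java
import java.math.BigDecimal;

// Stand-in names; not part of the OrderService code above.
final class PricingCheck {

    // Correct logic, mirroring placeOrder: price * quantity.
    static BigDecimal total(BigDecimal unitPrice, int quantity) {
        return unitPrice.multiply(BigDecimal.valueOf(quantity));
    }

    // Deliberately buggy variant: ignores quantity.
    static BigDecimal buggyTotal(BigDecimal unitPrice, int quantity) {
        return unitPrice;
    }

    // Weak check, equivalent in spirit to assertNotNull(result).
    static boolean weakCheck(BigDecimal result) {
        return result != null;
    }

    // Strong check: compares the result against the expected value.
    static boolean strongCheck(BigDecimal result, BigDecimal expected) {
        return result != null && result.compareTo(expected) == 0;
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("29.99");
        BigDecimal expected = new BigDecimal("89.97"); // 29.99 * 3
        BigDecimal wrong = buggyTotal(price, 3);

        System.out.println(weakCheck(wrong));                       // true: the bug slips through
        System.out.println(strongCheck(wrong, expected));           // false: the bug is caught
        System.out.println(strongCheck(total(price, 3), expected)); // true
    }
}
```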

GitHub Copilot (GPT-4o based, in-editor)

Copilot operates differently from the others — it generates tests incrementally as you type, using the surrounding code as context. This makes direct comparison harder, but the pattern is consistent: it generates excellent individual tests when you start writing them, but lacks the ability to survey all untested paths holistically.

For the OrderService class, Copilot generated strong tests for the happy path and the quantity validation, but required manual prompting (via Copilot Chat) to generate the payment failure and cascade verification tests.

Copilot score: Excellent for test-by-test generation, weaker at generating complete suites autonomously. Best used alongside Claude Code or ChatGPT for full coverage.

Llama 3.1 70B (via Groq)

The 70B parameter model generates structurally correct tests but misses edge cases and produces less idiomatic code. Notably, it fails to verify cascade prevention (does not verify orderRepository is never called when payment fails).

The 405B parameter version is significantly better and approaches GPT-4o quality, but the infrastructure cost of running it locally makes it impractical for most teams.

Llama 3.1 70B score: 4/7 test cases, missing 2 exception paths, no cascade verification.
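Cascade verification means asserting that a later collaborator was never reached once an earlier step failed. In the Mockito tests above this is verify(orderRepository, never()).save(any()). The sketch below shows the same idea in plain Java with a hand-written recording fake. Gateway, Repository, MiniOrderService, and RecordingRepository are simplified, hypothetical stand-ins, not the real classes.

```java
import java.util.ArrayList;
import java.util.List;

// All types here are simplified, hypothetical stand-ins for the real classes.
final class CascadeCheck {

    interface Gateway { boolean charge(long userId, long amountCents); }

    interface Repository { void save(String order); }

    // Mirrors the order of operations in placeOrder: charge first, save only on success.
    static final class MiniOrderService {
        private final Gateway gateway;
        private final Repository repository;

        MiniOrderService(Gateway gateway, Repository repository) {
            this.gateway = gateway;
            this.repository = repository;
        }

        void placeOrder(long userId, long amountCents) {
            if (!gateway.charge(userId, amountCents)) {
                throw new IllegalStateException("payment failed");
            }
            repository.save("order for user " + userId);
        }
    }

    // Records every save call so a test can assert that none happened.
    static final class RecordingRepository implements Repository {
        final List<String> saved = new ArrayList<>();

        @Override
        public void save(String order) {
            saved.add(order);
        }
    }

    // Returns true when a declined payment throws and nothing is saved.
    static boolean declinedPaymentSavesNothing() {
        RecordingRepository repository = new RecordingRepository();
        MiniOrderService service = new MiniOrderService((user, amount) -> false, repository);
        boolean threw = false;
        try {
            service.placeOrder(1L, 8997L);
        } catch (IllegalStateException expected) {
            threw = true;
        }
        return threw && repository.saved.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(declinedPaymentSavesNothing()); // true
    }
}
```

A suite that only checks for the exception would still pass if save were moved above the payment check. The empty-list assertion is what catches that regression.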

Quantitative Comparison

Model	Test Cases Generated	Compilation Errors	Framework Correctness	Assertion Quality	Cascade Verification
Claude 3.5 Sonnet	7/7	0	✅ Correct first try	Strong (AssertJ)	✅ Present
GPT-4o (1st attempt)	6/7	1 (MockBean)	⚠️ Minor fix needed	Strong	❌ Missing 2
GPT-4o (corrected)	7/7	0	✅ Correct	Strong (AssertJ)	✅ Present
Gemini 1.5 Pro	5/7	0	⚠️ JUnit 4 style	Weak (assertNotNull)	❌ Missing
GitHub Copilot	5/7	0	✅ Correct	Strong	❌ Manual needed
Llama 3.1 70B	4/7	0	✅ Correct	Moderate	❌ Missing

Academic and Industry Research

These results align with published research on AI-generated test quality:

“An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation” (Schäfer et al., 2023), published in IEEE Transactions on Software Engineering, 2024:

LLMs generated compilable tests 80% of the time without additional prompting
GPT-4 achieved the highest correctness rate (72%) among tested models
A key finding: LLMs are strong at generating test structure but weaker at generating assertions that actually verify behavior (as opposed to just asserting non-null)

“CoverAgent: An AI-Powered Tool for Automated Test Generation and Code Coverage Enhancement” (CodiumAI, 2024):

Claude models showed the highest tendency to generate tests that increased branch coverage
GPT-4o and Claude reached comparable coverage on straightforward business logic; Claude held an edge on complex conditional paths

“ChatUniTest: A Framework for LLM-Based Test Generation” (Chen et al., 2023):

Analyzed 1,000 Java methods; found that LLMs produced correct tests ~65-70% of the time without feedback
Iterative prompting (showing the model test failures and asking it to fix) raised pass rates to ~85-90%
Confirms: AI test generation works best as an iterative loop, not one-shot generation

Stack Overflow Developer Survey 2024 (more than 65,000 respondents):

62% of developers already use AI for writing code
Test generation is cited as the #2 most valuable use case (behind code completion)
Developers report AI-generated tests require review but save ~40% of test-writing time

Practical Recommendations

For Java / Spring Boot projects

Use Claude Code or ChatGPT (GPT-4o) as your primary test generation tool. Both produce idiomatic JUnit 5 + Mockito + AssertJ tests with minimal correction. Always provide:

The class under test
Explicit framework instructions (JUnit 5, Mockito, AssertJ)
A request for cascade verification (“verify downstream mocks are NOT called when an earlier step throws”)
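One way to make these three items hard to forget is to assemble the prompt from a fixed template. This is a minimal sketch: TestPrompt and the exact wording of each instruction are illustrative assumptions, not a prescribed format.

```java
// Illustrative helper; the wording of each instruction is an assumption.
final class TestPrompt {

    // Builds a prompt that always carries the framework and cascade instructions,
    // followed by the source of the class under test.
    static String build(String classSource) {
        return String.join("\n",
            "Write complete JUnit 5 unit tests for the class below.",
            "Use Mockito for mocking and AssertJ for assertions.",
            "Cover the happy path, every exception path, and quantity edge cases.",
            "Verify downstream mocks are NOT called when an earlier step throws.",
            "",
            classSource);
    }

    public static void main(String[] args) {
        System.out.println(build("public class OrderService { /* ... */ }"));
    }
}
```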

For Python projects

Claude and GPT-4o both handle pytest well. Key prompt addition: “use pytest.raises for exception testing and unittest.mock.patch for dependency mocking.”

Iterative workflow that works

1. Generate full test suite with Claude Code or GPT-4o
2. Run tests: identify failures
3. Feed failures back to the model: "These tests failed: [output]. Fix them."
4. Repeat until green
5. Review for assertion quality — ensure tests verify behavior, not just execution

This iterative loop typically converges in 2-3 iterations and produces test suites that would take 3-4x longer to write manually.
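The loop above can be sketched in code. TestGenerator and TestRunner are hypothetical interfaces standing in for whatever model API and build tool you use; nothing here calls a real service. The maxRuns cap is also an assumption, added so the loop cannot spin forever.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class IterativeTestLoop {

    // Hypothetical: wraps a model call that returns test source code.
    interface TestGenerator {
        String generate(String classSource);
        String fix(String testSource, List<String> failures);
    }

    // Hypothetical: runs the tests and returns failure messages (empty when green).
    interface TestRunner {
        List<String> run(String testSource);
    }

    // Final test source, whether it went green, and how many test runs it took.
    static final class Outcome {
        final String testSource;
        final boolean green;
        final int runs;

        Outcome(String testSource, boolean green, int runs) {
            this.testSource = testSource;
            this.green = green;
            this.runs = runs;
        }
    }

    static Outcome converge(String classSource, TestGenerator generator,
                            TestRunner runner, int maxRuns) {
        String tests = generator.generate(classSource);        // step 1: generate
        for (int run = 1; run <= maxRuns; run++) {
            List<String> failures = runner.run(tests);          // step 2: run
            if (failures.isEmpty()) {
                return new Outcome(tests, true, run);           // step 4: green
            }
            if (run < maxRuns) {
                tests = generator.fix(tests, failures);         // step 3: feed failures back
            }
        }
        return new Outcome(tests, false, maxRuns);              // still red after maxRuns
    }

    public static void main(String[] args) {
        // Stub generator: first draft fails, the fixed draft passes.
        TestGenerator generator = new TestGenerator() {
            @Override public String generate(String classSource) { return "draft-1"; }
            @Override public String fix(String testSource, List<String> failures) { return "draft-2"; }
        };
        TestRunner runner = tests -> tests.equals("draft-2")
            ? Collections.<String>emptyList()
            : Arrays.asList("placeOrder_paymentDeclined failed");

        Outcome outcome = converge("class OrderService {}", generator, runner, 3);
        System.out.println(outcome.green + " after " + outcome.runs + " runs"); // true after 2 runs
    }
}
```

Step 5, the assertion-quality review, is deliberately left out of the loop: a green run says the tests execute, not that they verify behavior.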

The Key Insight

All models generate syntactically correct tests faster than any developer can type. The meaningful differences are:

Coverage completeness: Claude > GPT-4o (corrected) > Gemini > Llama
Framework idioms: Claude = GPT-4o (corrected) > Copilot > Gemini > Llama
Assertion quality: The hardest problem — all models tend toward weak assertions unless explicitly instructed to verify specific values and state changes
Zero-shot vs iterative: All models improve significantly with one correction round; planning for iteration is more important than which model you start with

The developer who learns to prompt for complete, behavior-verifying tests — and reviews AI output critically for assertion quality — ships significantly better-tested code than the one writing every test manually.

References

Schäfer, M. et al. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” IEEE Transactions on Software Engineering, 2024 — arxiv.org/abs/2302.06527
Chen, Y. et al. “ChatUniTest: A Framework for LLM-Based Test Generation.” 2023 — arxiv.org/abs/2305.04764
CoverAgent: AI-Powered Test Coverage Enhancement — github.com/Codium-ai/cover-agent
Stack Overflow Developer Survey 2024 — stackoverflow.com/research/developer-survey
Artificial Analysis LLM Coding Benchmarks — artificialanalysis.ai
HumanEval Benchmark — github.com/openai/human-eval
EvoSuite (traditional automated test generation baseline) — evosuite.org
JUnit 5 Documentation — junit.org/junit5/docs/current/user-guide
Mockito Documentation — javadoc.io/doc/org.mockito/mockito-core
AssertJ Documentation — assertj.github.io/assertj-core

Jorge David has been writing Java since 2015, working with Spring Boot, Kotlin, and Kafka in production environments. Dev AI Tools covers honest, technical insights on AI for developers.
