AI Models for Unit Test Generation: A Technical Comparison with Real Results
Writing unit tests is one of the highest-leverage use cases for AI coding tools. It’s repetitive, pattern-driven, and often deprioritized by developers under time pressure. But not all AI models produce the same quality of tests. This post compares how the major AI models perform on real unit test generation tasks — with actual code examples, benchmark data, and analysis of where each model fails.
The Test Setup
To make this comparison meaningful, we use the same input across all models: a realistic service class with business logic, edge cases, and error handling. We evaluate the output on five dimensions:
- Coverage — does it test the happy path, edge cases, and error paths?
- Correctness — do the tests actually compile and pass?
- Test quality — are assertions meaningful? Are tests isolated?
- Naming and readability — are test names descriptive?
- Framework usage — does it use JUnit 5 / Mockito / AssertJ idiomatically?
The Subject Under Test
@Service
public class OrderService {

    private final OrderRepository orderRepository;
    private final InventoryService inventoryService;
    private final PaymentGateway paymentGateway;

    public OrderService(OrderRepository orderRepository,
                        InventoryService inventoryService,
                        PaymentGateway paymentGateway) {
        this.orderRepository = orderRepository;
        this.inventoryService = inventoryService;
        this.paymentGateway = paymentGateway;
    }

    public Order placeOrder(Long userId, Long productId, int quantity) {
        if (quantity <= 0) {
            throw new IllegalArgumentException("Quantity must be positive");
        }

        Product product = inventoryService.getProduct(productId)
                .orElseThrow(() -> new ProductNotFoundException(productId));

        if (!inventoryService.isAvailable(productId, quantity)) {
            throw new InsufficientInventoryException(productId, quantity);
        }

        BigDecimal totalAmount = product.getPrice().multiply(BigDecimal.valueOf(quantity));
        PaymentResult payment = paymentGateway.charge(userId, totalAmount);

        if (!payment.isSuccessful()) {
            throw new PaymentFailedException(payment.getErrorCode());
        }

        Order order = Order.builder()
                .userId(userId)
                .productId(productId)
                .quantity(quantity)
                .totalAmount(totalAmount)
                .status(OrderStatus.CONFIRMED)
                .createdAt(Instant.now())
                .build();

        return orderRepository.save(order);
    }
}
The prompt given to each model:
“Write complete JUnit 5 unit tests for this OrderService class using Mockito for mocking. Cover the happy path, all exception paths, edge cases for quantity validation, and verify the interactions between dependencies. Use AssertJ for assertions and follow best practices for test naming.”
Model Results
Claude 3.5 Sonnet / Claude Sonnet 4
Claude generates the most complete test suite with the fewest prompting iterations. It handles all exception paths without being asked explicitly, uses @ExtendWith(MockitoExtension.class) correctly, and names tests descriptively.
Generated test (excerpt):
@ExtendWith(MockitoExtension.class)
class OrderServiceTest {

    @Mock private OrderRepository orderRepository;
    @Mock private InventoryService inventoryService;
    @Mock private PaymentGateway paymentGateway;
    @InjectMocks private OrderService orderService;

    private static final Long USER_ID = 1L;
    private static final Long PRODUCT_ID = 42L;
    private static final int VALID_QUANTITY = 3;
    private static final BigDecimal UNIT_PRICE = new BigDecimal("29.99");

    @Test
    @DisplayName("placeOrder: successfully places order when all conditions are met")
    void placeOrder_allConditionsMet_returnsConfirmedOrder() {
        var product = Product.builder().id(PRODUCT_ID).price(UNIT_PRICE).build();
        var expectedTotal = UNIT_PRICE.multiply(BigDecimal.valueOf(VALID_QUANTITY));
        var savedOrder = Order.builder().id(99L).status(OrderStatus.CONFIRMED).build();

        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(true);
        when(paymentGateway.charge(USER_ID, expectedTotal))
                .thenReturn(PaymentResult.success());
        when(orderRepository.save(any(Order.class))).thenReturn(savedOrder);

        Order result = orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY);

        assertThat(result.getStatus()).isEqualTo(OrderStatus.CONFIRMED);
        verify(orderRepository).save(argThat(order ->
                order.getUserId().equals(USER_ID) &&
                order.getQuantity() == VALID_QUANTITY &&
                order.getTotalAmount().compareTo(expectedTotal) == 0
        ));
    }

    @Test
    @DisplayName("placeOrder: throws IllegalArgumentException when quantity is zero")
    void placeOrder_quantityIsZero_throwsIllegalArgumentException() {
        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, 0))
                .isInstanceOf(IllegalArgumentException.class)
                .hasMessageContaining("Quantity must be positive");

        verifyNoInteractions(inventoryService, paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws IllegalArgumentException when quantity is negative")
    void placeOrder_quantityIsNegative_throwsIllegalArgumentException() {
        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, -5))
                .isInstanceOf(IllegalArgumentException.class);

        verifyNoInteractions(inventoryService, paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws ProductNotFoundException when product does not exist")
    void placeOrder_productNotFound_throwsProductNotFoundException() {
        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.empty());

        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY))
                .isInstanceOf(ProductNotFoundException.class);

        verifyNoInteractions(paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws InsufficientInventoryException when stock is unavailable")
    void placeOrder_insufficientInventory_throwsInsufficientInventoryException() {
        var product = Product.builder().id(PRODUCT_ID).price(UNIT_PRICE).build();

        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(false);

        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY))
                .isInstanceOf(InsufficientInventoryException.class);

        verifyNoInteractions(paymentGateway, orderRepository);
    }

    @Test
    @DisplayName("placeOrder: throws PaymentFailedException when payment is declined")
    void placeOrder_paymentDeclined_throwsPaymentFailedException() {
        var product = Product.builder().id(PRODUCT_ID).price(UNIT_PRICE).build();
        var expectedTotal = UNIT_PRICE.multiply(BigDecimal.valueOf(VALID_QUANTITY));

        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(true);
        when(paymentGateway.charge(USER_ID, expectedTotal))
                .thenReturn(PaymentResult.failure("CARD_DECLINED"));

        assertThatThrownBy(() -> orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY))
                .isInstanceOf(PaymentFailedException.class);

        verify(orderRepository, never()).save(any());
    }

    @Test
    @DisplayName("placeOrder: calculates total amount correctly based on unit price and quantity")
    void placeOrder_correctTotalCalculation_chargesCorrectAmount() {
        var product = Product.builder().id(PRODUCT_ID).price(new BigDecimal("10.00")).build();
        var expectedTotal = new BigDecimal("30.00"); // 10.00 * 3

        when(inventoryService.getProduct(PRODUCT_ID)).thenReturn(Optional.of(product));
        when(inventoryService.isAvailable(PRODUCT_ID, VALID_QUANTITY)).thenReturn(true);
        when(paymentGateway.charge(eq(USER_ID), eq(expectedTotal)))
                .thenReturn(PaymentResult.success());
        when(orderRepository.save(any())).thenReturn(new Order());

        orderService.placeOrder(USER_ID, PRODUCT_ID, VALID_QUANTITY);

        verify(paymentGateway).charge(USER_ID, expectedTotal);
    }
}
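Claude score: 7/7 test cases, all correct, with idiomatic AssertJ usage. verifyNoInteractions is used appropriately to validate cascade prevention, that is, to confirm that no downstream dependency is called once an earlier step has failed.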
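One detail in the excerpt is worth spelling out: the argThat matcher compares totals with compareTo rather than equals. The sketch below uses only the JDK and shows why that matters for BigDecimal. The class and method names are illustrative and not part of any generated output. Mockito's default argument matching relies on equals, so the stubs on paymentGateway.charge only match because the expected total is built with the same scale as the value the service computes.

```java
import java.math.BigDecimal;

class BigDecimalComparisonDemo {

    /** equals() is true only when both the numeric value and the scale match. */
    static boolean sameValueAndScale(BigDecimal a, BigDecimal b) {
        return a.equals(b);
    }

    /** compareTo() == 0 is true whenever the numeric values match, regardless of scale. */
    static boolean sameNumericValue(BigDecimal a, BigDecimal b) {
        return a.compareTo(b) == 0;
    }

    public static void main(String[] args) {
        // Same arithmetic as the service: unit price times quantity gives 30.00 (scale 2).
        BigDecimal computed = new BigDecimal("10.00").multiply(BigDecimal.valueOf(3));
        BigDecimal sameScale = new BigDecimal("30.00");
        BigDecimal otherScale = new BigDecimal("30.0");

        System.out.println(sameValueAndScale(computed, sameScale));  // true
        System.out.println(sameValueAndScale(computed, otherScale)); // false
        System.out.println(sameNumericValue(computed, otherScale));  // true
    }
}
```

When reviewing generated tests, an expected total written with a different scale than the computed one is a likely source of stubs that silently fail to match.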
GPT-4o (ChatGPT)
GPT-4o generates solid tests but shows two recurring issues: it sometimes uses @MockBean (a Spring Boot test annotation) in plain unit tests where @Mock is correct, and it occasionally omits verifyNoInteractions in the cascade failure cases.
// GPT-4o output — common pattern issue
@SpringBootTest // ❌ Wrong for unit test — loads full context unnecessarily
class OrderServiceTest {

    @MockBean OrderRepository orderRepository; // ❌ Should be @Mock
    @MockBean InventoryService inventoryService;
    @MockBean PaymentGateway paymentGateway;
    @Autowired OrderService orderService;

    // ... test methods omitted in this excerpt
}
After one correction prompt (“use @ExtendWith(MockitoExtension.class) and @Mock, not @SpringBootTest”), GPT-4o corrects this and produces high-quality tests. The final output is comparable to Claude’s in completeness, though test names are slightly less descriptive.
GPT-4o score (first attempt): 6/7 test cases, framework usage error requiring one correction, missing cascade verification in 2 tests.
GPT-4o score (after correction): 7/7, comparable quality to Claude.
Gemini 1.5 Pro
Gemini generates correct tests but tends toward over-mocking and less precise assertions. It also occasionally generates tests that test mock behavior rather than actual business logic.
// Gemini pattern — less precise assertion
@Test
void testPlaceOrderSuccess() {
    // Setup
    Product product = mock(Product.class);
    when(product.getPrice()).thenReturn(new BigDecimal("29.99"));
    // ...

    // Assert
    assertNotNull(result); // ❌ Too weak — doesn't verify order content
    verify(orderRepository, times(1)).save(any(Order.class)); // ✅ Correct
}
Gemini also tends to fall back to older idioms (JUnit 4's @Before instead of @BeforeEach, JUnit's built-in assertEquals instead of AssertJ) unless explicitly instructed otherwise. When prompted with framework constraints, it adapts correctly.
Gemini 1.5 Pro score: 5/7 test cases on first attempt, weak assertions in 3 tests, JUnit 4 style requiring correction.
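To make the weak-assertion problem concrete, here is a minimal, dependency-free sketch. SimpleOrder, buggyPlaceOrder, and the two check methods are hypothetical stand-ins invented for illustration; they are not the article's Order class or any model's output. The point is that a non-null check passes even when the total is computed incorrectly, while a check on the fields the business logic owns does not.

```java
import java.math.BigDecimal;

class AssertionStrengthDemo {

    /** Hypothetical, simplified stand-in for the article's Order. */
    static final class SimpleOrder {
        final int quantity;
        final BigDecimal totalAmount;

        SimpleOrder(int quantity, BigDecimal totalAmount) {
            this.quantity = quantity;
            this.totalAmount = totalAmount;
        }
    }

    /** Deliberately buggy: forgets to multiply the unit price by the quantity. */
    static SimpleOrder buggyPlaceOrder(BigDecimal unitPrice, int quantity) {
        return new SimpleOrder(quantity, unitPrice);
    }

    /** Correct calculation, mirroring the multiplication in placeOrder. */
    static SimpleOrder correctPlaceOrder(BigDecimal unitPrice, int quantity) {
        return new SimpleOrder(quantity, unitPrice.multiply(BigDecimal.valueOf(quantity)));
    }

    /** Weak check, equivalent in spirit to assertNotNull(result). */
    static boolean weakCheck(SimpleOrder result) {
        return result != null;
    }

    /** Strong check: verifies the values the business logic is responsible for. */
    static boolean strongCheck(SimpleOrder result, int expectedQuantity, BigDecimal expectedTotal) {
        return result != null
                && result.quantity == expectedQuantity
                && result.totalAmount.compareTo(expectedTotal) == 0;
    }

    public static void main(String[] args) {
        BigDecimal unitPrice = new BigDecimal("10.00");
        BigDecimal expectedTotal = new BigDecimal("30.00");

        SimpleOrder buggy = buggyPlaceOrder(unitPrice, 3);
        System.out.println(weakCheck(buggy));                     // true: the bug goes unnoticed
        System.out.println(strongCheck(buggy, 3, expectedTotal)); // false: the bug is caught
        System.out.println(strongCheck(correctPlaceOrder(unitPrice, 3), 3, expectedTotal)); // true
    }
}
```

In a real suite the strong check would be written as AssertJ assertions on the returned order or as an argThat matcher on the saved one, as in the Claude excerpt.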
GitHub Copilot (GPT-4o based, in-editor)
Copilot operates differently from the others — it generates tests incrementally as you type, using the surrounding code as context. This makes direct comparison harder, but the pattern is consistent: it generates excellent individual tests when you start writing them, but lacks the ability to survey all untested paths holistically.
For the OrderService class, Copilot generated strong tests for the happy path and the quantity validation, but required manual prompting (via Copilot Chat) to generate the payment failure and cascade verification tests.
Copilot score: Excellent for test-by-test generation, weaker at generating complete suites autonomously. Best used alongside Claude Code or ChatGPT for full coverage.
Llama 3.1 70B (via Groq)
The 70B parameter model generates structurally correct tests but misses edge cases and produces less idiomatic code. Notably, it fails to verify cascade prevention (does not verify orderRepository is never called when payment fails).
The 405B parameter version is significantly better and approaches GPT-4o quality, but the infrastructure cost of running it locally makes it impractical for most teams.
Llama 3.1 70B score: 4/7 test cases, missing 2 exception paths, no cascade verification.
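For readers unfamiliar with the term, the following sketch shows what cascade verification checks, using a hand-rolled recording fake instead of Mockito so that it runs on the JDK alone. Gateway, Repository, RecordingRepository, and the simplified placeOrder are hypothetical reductions of the article's types, chosen only to keep the example short. With Mockito, the same check is verify(orderRepository, never()).save(any()).

```java
class CascadeVerificationDemo {

    /** Hypothetical one-method stand-in for PaymentGateway. */
    interface Gateway {
        boolean charge(long userId, long amountCents);
    }

    /** Hypothetical one-method stand-in for OrderRepository. */
    interface Repository {
        void save(String order);
    }

    /** Hand-rolled recording fake: counts how often save() is called. */
    static final class RecordingRepository implements Repository {
        int saveCalls = 0;

        @Override
        public void save(String order) {
            saveCalls++;
        }
    }

    /** Same ordering as the real placeOrder: charge first, save only on success. */
    static void placeOrder(Gateway gateway, Repository repository, long userId, long amountCents) {
        if (!gateway.charge(userId, amountCents)) {
            throw new IllegalStateException("Payment failed");
        }
        repository.save("order for user " + userId);
    }

    /** Returns the number of save() calls observed when the payment is declined. */
    static int saveCallsWhenPaymentDeclined() {
        RecordingRepository repository = new RecordingRepository();
        Gateway decliningGateway = (userId, amountCents) -> false;
        try {
            placeOrder(decliningGateway, repository, 1L, 8997L);
            throw new AssertionError("expected the payment failure to propagate");
        } catch (IllegalStateException expected) {
            // The failure path is the behavior under test.
        }
        return repository.saveCalls;
    }

    /** Returns the number of save() calls observed when the payment succeeds. */
    static int saveCallsWhenPaymentAccepted() {
        RecordingRepository repository = new RecordingRepository();
        Gateway acceptingGateway = (userId, amountCents) -> true;
        placeOrder(acceptingGateway, repository, 1L, 8997L);
        return repository.saveCalls;
    }

    public static void main(String[] args) {
        System.out.println(saveCallsWhenPaymentDeclined()); // 0
        System.out.println(saveCallsWhenPaymentAccepted()); // 1
    }
}
```

A test that only asserts the exception type would still pass if the save happened before the charge; counting the downstream calls is what catches that ordering bug.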
Quantitative Comparison
| Model | Test Cases Generated | Compilation Errors | Framework Correctness | Assertion Quality | Cascade Verification |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 7/7 | 0 | ✅ Correct first try | Strong (AssertJ) | ✅ Present |
| GPT-4o (1st attempt) | 6/7 | 1 (MockBean) | ⚠️ Minor fix needed | Strong | ❌ Missing 2 |
| GPT-4o (corrected) | 7/7 | 0 | ✅ Correct | Strong (AssertJ) | ✅ Present |
| Gemini 1.5 Pro | 5/7 | 0 | ⚠️ JUnit 4 style | Weak (assertNotNull) | ❌ Missing |
| GitHub Copilot | 5/7 | 0 | ✅ Correct | Strong | ❌ Manual needed |
| Llama 3.1 70B | 4/7 | 0 | ✅ Correct | Moderate | ❌ Missing |
Academic and Industry Research
These results align with published research on AI-generated test quality:
“An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation” (Schäfer et al., 2023) — published at ICSE 2024:
- LLMs generated compilable tests 80% of the time without additional prompting
- GPT-4 achieved the highest correctness rate (72%) among tested models
- A key finding: LLMs are strong at generating test structure but weaker at generating assertions that actually verify behavior (as opposed to just asserting non-null)
“CoverAgent: An AI-Powered Tool for Automated Test Generation and Code Coverage Enhancement” (CodiumAI, 2024):
- Claude models showed the highest tendency to generate tests that increased branch coverage
- GPT-4o and Claude reached comparable coverage on straightforward business logic; Claude held an edge on complex conditional paths
“ChatUniTest: A Framework for LLM-Based Test Generation” (Chen et al., 2023):
- Analyzed 1,000 Java methods; found that LLMs produced correct tests ~65-70% of the time without feedback
- Iterative prompting (showing the model test failures and asking it to fix) raised pass rates to ~85-90%
- Confirms: AI test generation works best as an iterative loop, not one-shot generation
Stack Overflow Developer Survey 2024 (more than 65,000 respondents):
- 62% of developers already use AI for writing code
- Test generation is cited as the #2 most valuable use case (behind code completion)
- Developers report AI-generated tests require review but save ~40% of test-writing time
Practical Recommendations
For Java / Spring Boot projects
Use Claude Code or ChatGPT (GPT-4o) as your primary test generation tool. Both produce idiomatic JUnit 5 + Mockito + AssertJ tests with minimal correction. Always provide:
- The class under test
- Explicit framework instructions (JUnit 5, Mockito, AssertJ)
- A request for cascade verification (“verify downstream mocks are NOT called when an earlier step throws”)
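The three ingredients above can be assembled mechanically. The helper below is a small sketch under that assumption: TestPromptBuilder and its wording are hypothetical and were not part of the comparison, so treat the text as a starting point to adapt rather than a benchmarked prompt.

```java
import java.util.List;

class TestPromptBuilder {

    /**
     * Assembles a test-generation prompt from the class source, the framework
     * list, and an explicit request for cascade verification.
     */
    static String build(String classSource, List<String> frameworks) {
        return String.join("\n",
                "Write complete unit tests for the class below.",
                "Use these frameworks: " + String.join(", ", frameworks) + ".",
                "For every failure path, verify downstream mocks are NOT called"
                        + " when an earlier step throws.",
                "",
                classSource);
    }

    public static void main(String[] args) {
        String prompt = build(
                "public class OrderService { /* class source goes here */ }",
                List.of("JUnit 5", "Mockito", "AssertJ"));
        System.out.println(prompt);
    }
}
```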
For Python projects
Claude and GPT-4o both handle pytest well. Key prompt addition: “use pytest.raises for exception testing and unittest.mock.patch for dependency mocking.”
Iterative workflow that works
1. Generate full test suite with Claude Code or GPT-4o
2. Run tests: identify failures
3. Feed failures back to the model: "These tests failed: [output]. Fix them."
4. Repeat until green
5. Review for assertion quality — ensure tests verify behavior, not just execution
This iterative loop typically converges in 2-3 iterations and produces test suites that would take 3-4x longer to write manually.
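Steps 1 to 4 of the workflow can be sketched as a bounded loop. TestGenerator and TestRunner are hypothetical interfaces standing in for whichever model client and build tool you use; no real API is implied, and the stubs in main exist only to make the sketch runnable. Step 5, the review for assertion quality, remains a human task and is not modeled here.

```java
import java.util.Optional;

class IterativeTestGenerationLoop {

    /** Hypothetical wrapper around whichever model or tool generates the tests. */
    interface TestGenerator {
        String generate(String prompt);
    }

    /** Hypothetical wrapper around the build tool: empty when green, failure output otherwise. */
    interface TestRunner {
        Optional<String> run(String testSource);
    }

    /** Outcome of the loop: last test source, rounds used, and whether the suite went green. */
    static final class Result {
        final String testSource;
        final int rounds;
        final boolean green;

        Result(String testSource, int rounds, boolean green) {
            this.testSource = testSource;
            this.rounds = rounds;
            this.green = green;
        }
    }

    static Result converge(TestGenerator generator, TestRunner runner,
                           String initialPrompt, int maxRounds) {
        String tests = generator.generate(initialPrompt);      // step 1: generate the suite
        for (int round = 1; round <= maxRounds; round++) {
            Optional<String> failures = runner.run(tests);     // step 2: run the tests
            if (!failures.isPresent()) {
                return new Result(tests, round, true);         // step 4: stop when green
            }
            if (round < maxRounds) {
                // step 3: feed the failure output back to the model
                tests = generator.generate(
                        "These tests failed: " + failures.get() + ". Fix them.\n\n" + tests);
            }
        }
        return new Result(tests, maxRounds, false);            // gave up: needs manual attention
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stub generator: returns a new "version" of the suite on every call.
        TestGenerator stubGenerator = prompt -> "suite-v" + (++calls[0]);
        // Stub runner: the first version fails, every later version passes.
        TestRunner stubRunner = source -> source.equals("suite-v1")
                ? Optional.of("1 test failed")
                : Optional.<String>empty();

        Result result = converge(stubGenerator, stubRunner,
                "Write JUnit 5 tests for OrderService", 3);
        System.out.println(result.green + " after " + result.rounds + " rounds"); // true after 2 rounds
    }
}
```

Capping the number of rounds keeps a suite that never converges from looping indefinitely; a non-green result is the signal to step in manually.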
The Key Insight
All models generate syntactically correct tests faster than any developer can type. The meaningful differences are:
- Coverage completeness: Claude > GPT-4o (corrected) > Gemini > Llama
- Framework idioms: Claude = GPT-4o (corrected) > Copilot > Gemini > Llama
- Assertion quality: The hardest problem — all models tend toward weak assertions unless explicitly instructed to verify specific values and state changes
- Zero-shot vs iterative: All models improve significantly with one correction round; planning for iteration is more important than which model you start with
The developer who learns to prompt for complete, behavior-verifying tests — and reviews AI output critically for assertion quality — ships significantly better-tested code than the one writing every test manually.
References
- Schäfer, M. et al. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” ICSE 2024 — arxiv.org/abs/2302.06527
- Chen, Y. et al. “ChatUniTest: A Framework for LLM-Based Test Generation.” 2023 — arxiv.org/abs/2305.04764
- CoverAgent: AI-Powered Test Coverage Enhancement — github.com/Codium-ai/cover-agent
- Stack Overflow Developer Survey 2024 — stackoverflow.com/research/developer-survey
- Artificial Analysis LLM Coding Benchmarks — artificialanalysis.ai
- HumanEval Benchmark — github.com/openai/human-eval
- EvoSuite (traditional automated test generation baseline) — evosuite.org
- JUnit 5 Documentation — junit.org/junit5/docs/current/user-guide
- Mockito Documentation — javadoc.io/doc/org.mockito/mockito-core
- AssertJ Documentation — assertj.github.io/assertj-core
Jorge David has been writing Java since 2015, working with Spring Boot, Kotlin, and Kafka in production environments. Dev AI Tools covers honest, technical insights on AI for developers.