PHP, Zend Framework and Other Crazy Stuff
Posts tagged mutation testing
Lies, Damned Lies and Code Coverage: Towards Mutation Testing
Jan 14th
I spent the vast majority of 2014 not contributing to open source, so I kicked off 2015 by making Humbug available on Github.
About Humbug
Humbug is a Mutation Testing framework for PHP. Essentially, it injects deliberate defects into your source code, designed to emulate programmer errors, and then checks whether your unit tests notice. If they notice, good. If they don’t notice, bad. All quite straightforward. Humbug will log which defects were not noticed by your unit tests, complete with diffs, and provide some basic metric scores so that you can fuel your Github badge mania someday.
You can try out Humbug today, though it remains a work in progress (PHPUnit only) and certain combinations of code, tests and moon phase may result in “issues”. Do try it though, it’s polishing up very nicely and I’m looking forward to a stable release. The readme has more information.
This article, however, is mostly reserved to explain why I wrote Humbug, and why Mutation Testing is so badly needed in PHP. There are a few closing words on Mutation Testing performance, which has traditionally been a concern impeding its adoption.
Code Coverage
Code Coverage is a measure of how many lines of code a unit test suite has executed. Not tested. Executed. If lines of code are not executed then, it logically follows, they were not tested by the test suite. This might be bad. There are probably tests missing. Off to the editor with you!
This distinction between testing and executing is all important, and something I feel that we’ve lost sight of in PHP when we’re busy decorating our Github pages with nice green 100% badges and talking about imposing 100% Code Coverage.
Let’s imagine that you have 100% Code Coverage. That’s actually a lie. More specifically, you actually have 100% Line Coverage. PHP_CodeCoverage and XDebug are incapable, at this time, of measuring Statement Coverage, Branch Coverage, and Condition Coverage. Your 100% score is only 25% of the story. Let’s call it 10% because I’m mean and there are other forms of Coverage that I have not mentioned.
Your Code Coverage is now 10%.
You know, I think I was too generous, and I’ve arbitrarily assigned a score. This simply will not do and it’s unfair to those who, in reality, might have 11% Code Coverage. We’ll have to take a more scientific approach.
Your Code Coverage is now 0% pending scientific research and peer review.
We can rephrase all of the above as follows: Line Coverage is an indicator of where source code was definitely not executed by any test. It does not indicate that a line was tested, or even fully exercised, merely that something on that line was executed at least once.
Taking this at face value, we can invent a problem to provide more illumination on how Line Code Coverage can mislead us:
if ($i >= 5) {
    // do something
}
The above is a condition where there are three possible outcomes. $i will either be greater than 5, equal to 5 or less than 5. Two of these possibilities will evaluate the expression to true, the other to false. This suggests that we need 3 tests – one for each of the outcomes. We also need to be very specific. What if an error changes the 5 to a 6? Testing if 10 passes would be a bad test. What if it were changed to a 4? Then not testing with values of 4 or 5 would make for bad tests. It’s not all random integers we want in such tests – their selection should be deliberately targeting the boundary of the condition so as to avoid writing overly positive tests that are unlikely to ever fail.
Writing just one test that executes the above line will still leave us two tests short of where we should be. How do we know when those two tests are missing? Line Code Coverage will give us a 100% score for writing between 0 and 33% of the expected effective tests.
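As a minimal sketch of what those three boundary-targeting tests might look like, assume the condition is wrapped in a hypothetical function called isAtLeastFive(). Plain checks are used instead of PHPUnit so the snippet runs on its own.

```php
<?php
// Hypothetical function wrapping the condition from the example above.
function isAtLeastFive(int $i): bool
{
    return $i >= 5;
}

// One check per outcome, each chosen at the boundary of the condition.
$cases = [
    4 => false, // just below: fails if the 5 is changed to a 4
    5 => true,  // on the boundary: fails if ">=" becomes ">" or the 5 becomes a 6
    6 => true,  // just above: fails if ">=" becomes "<" or "=="
];

foreach ($cases as $input => $expected) {
    if (isAtLeastFive($input) !== $expected) {
        throw new RuntimeException("Unexpected result for input $input");
    }
}
echo "all boundary checks passed\n";
```

A single check with an input of 10 would execute the same line and earn the same Line Coverage, but it would keep passing if the 5 were changed to a 6.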
Dave Marshall recently wrote about Code Coverage with another real life example.
Line Code Coverage in PHP is simply not fit for our purposes. Being the sole possible Code Coverage type in PHP at present does not excuse it from being a misleading, inflated, and overly trusted metric that is easily fooled by writing bad tests and relying on coincidental execution.
The more insidious problem is that relying solely on Code Coverage as a measure of test quality, which is what we often end up doing, is attempting to automate an intellectual task. You can’t simply run a magic report and leave your brain at home. Your brain is very much required when assessing test suite effectiveness.
Measuring Unit Test Quality
Above, I made a distinction between code that was executed and code that was tested. Code coverage is an assertion that code is executed. It’s entirely possible to attain 100% Code Coverage, yet test absolutely nothing at all. The probable methods of achieving this are through tests which make no assertions, positive tests with long odds of failure, and coincidental execution by tests not specifically targeting the line (see PHPUnit’s @covers annotation).
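To illustrate the gap between executed and tested, here is a small sketch. The function applyDiscount() and both "tests" are hypothetical and written as plain functions rather than PHPUnit test cases.

```php
<?php
// Hypothetical function under test: 10% off for five or more items.
function applyDiscount(int $priceInCents, int $quantity): int
{
    if ($quantity >= 5) {
        return intdiv($priceInCents * 9, 10);
    }
    return $priceInCents;
}

// Executes every line of applyDiscount(), so Line Coverage would be 100%,
// yet it verifies nothing about the results.
function testThatOnlyExecutes(): bool
{
    applyDiscount(10000, 10);
    applyDiscount(10000, 1);
    return true; // "passes" no matter what applyDiscount() returns
}

// Executes the same lines and also checks the results at the boundary.
function testThatVerifies(): bool
{
    return applyDiscount(10000, 5) === 9000
        && applyDiscount(10000, 4) === 10000;
}

var_dump(testThatOnlyExecutes(), testThatVerifies());
```

Both functions would produce identical coverage reports. Only the second one would start failing if the discount logic were broken.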
This needn’t be intentional! It’s quite easy to overlook tests and that’s why we use Code Coverage to help us identify missing tests. We just can’t rely on it as our sole means of ensuring test existence. Better Code Coverage would help us find a lot more missing tests, but it’s still solely a measure of execution.
So, given a test suite with 100% Line Coverage, how can we examine the test suite and arrive at any conclusion as to its quality and effectiveness in preventing regressions?
This is where Mutation Testing shines.
Mutation Testing
Imagine our original example:
if ($i >= 5) {
    // do something
}
During Mutation Testing, Humbug would introduce three subtle defects, i.e. mutations. It would mutate the “>=” to each of “>” and “<”. It would also mutate the “5” to “6”. Depending on the nature of the code block, this should result in unexpected behaviour that your unit tests, if written well, should have assertions against. Occasionally, a mutation is equivalent to the original statement (e.g. perhaps $i is hardcoded to >5 and it’s not actually settable from a test) but we would expect the false positive rate to be minimal.
For each mutation, noting that only one is applied at a time, we run the relevant unit tests. If a defect causes a test failure, error or a timeout (infinite loops may occur infrequently with a mutation) then we can assert that this particular defect is tested. If the tests all pass, we can assert that this defect was not tested and we can log it for investigation. A new test would now be needed to cover that defect unless, of course, it’s a provable false positive.
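The process can be sketched in a few lines. This is not how Humbug is implemented: the original condition and its three mutants are modelled here as closures, and each "test suite" is a hypothetical callable that returns true when all of its checks pass.

```php
<?php
// The three mutants of "$i >= 5" described above, modelled as closures.
$mutants = [
    '>= to >' => fn(int $i): bool => $i > 5,
    '>= to <' => fn(int $i): bool => $i < 5,
    '5 to 6'  => fn(int $i): bool => $i >= 6,
];

// A weak suite: one positive test far from the boundary.
$weakSuite = fn(callable $check): bool => $check(10) === true;

// A stronger suite that targets the boundary.
$strongSuite = fn(callable $check): bool =>
    $check(4) === false && $check(5) === true && $check(6) === true;

// Applies one mutant at a time and returns the names of those the suite missed.
function survivingMutants(array $mutants, callable $suite): array
{
    $survivors = [];
    foreach ($mutants as $name => $mutant) {
        if ($suite($mutant)) { // the suite still passes, so the mutant escaped
            $survivors[] = $name;
        }
    }
    return $survivors;
}

print_r(survivingMutants($mutants, $weakSuite));   // '>= to >' and '5 to 6' escape
print_r(survivingMutants($mutants, $strongSuite)); // nothing escapes
```

Both suites execute the condition, so Line Coverage cannot tell them apart. The list of surviving mutants can.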
We are no longer playing games with execution statistics. We’re actually measuring the effectiveness of a test suite, and improving its effectiveness over time. The provided scores, taken with a double pinch of salt, assist in gauging how bad or good defect detection is by calculating the ratio of detected mutations to the total generated and the total covered by tests (yes, we contrast to Code Coverage). The logs are the more valuable output, offering diffs for each undetected mutation. These can be examined (by an actual living entity) to see where new tests might be needed.
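The two ratios mentioned above are simple to compute. The sketch below uses a hypothetical helper and made-up numbers; it is not Humbug's actual reporting code, and the key names are my own.

```php
<?php
// Hypothetical helper: detected mutations as a percentage of all generated
// mutations, and as a percentage of those falling on lines covered by tests.
function mutationScores(int $detected, int $totalGenerated, int $coveredByTests): array
{
    return [
        'of_all_mutations' => $totalGenerated > 0
            ? round(100 * $detected / $totalGenerated, 1)
            : 0.0,
        'of_covered_mutations' => $coveredByTests > 0
            ? round(100 * $detected / $coveredByTests, 1)
            : 0.0,
    ];
}

// Made-up example: 1000 mutations generated, 800 on covered lines, 600 detected.
print_r(mutationScores(600, 1000, 800)); // 60% of all, 75% of covered
```

The gap between the two figures shows how much of the shortfall comes from code that no test executes at all, as opposed to code that is executed but poorly tested.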
Your Code Coverage metrics essentially tell me nothing about the effectiveness of your unit test suite. They only tell me that your unit tests executed stuff. Your Mutation Testing scores, on the other hand, give me some ballpark estimates on the real effectiveness of those same tests.
Performance
I can’t sign off without mentioning Mutation Testing performance.
Traditionally, Mutation Testing has been ridiculously slow, often running the entire test suite for every single mutation. On one library this morning, I generated close to 1000 mutations. The test suite typically took 5 seconds to run. Doing the math is close to crazy: 1000 full runs at 5 seconds each is roughly 83 minutes. The solution implemented by Humbug was to take something I criticised (ahem, Code Coverage) and use its data to only run tests which execute the mutated line. It takes around 2 minutes for Mutation Testing of that library. In another example, a library with ~5000 tests running in 3 minutes took around 12 minutes to mutation test (~1.5k mutations were generated).
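The idea of using coverage data to narrow down the tests can be sketched as follows. The data structure and the names in it (Cart.php, testAddsItems and so on) are invented for illustration and do not reflect Humbug's internal format.

```php
<?php
// Hypothetical coverage data: which tests executed which lines of which file.
$coverage = [
    'testAddsItems'    => ['Cart.php' => [10, 11, 12]],
    'testRemovesItems' => ['Cart.php' => [20, 21]],
    'testTotals'       => ['Cart.php' => [11, 30, 31]],
];

// Returns only the tests that executed the mutated line.
function testsCoveringLine(array $coverage, string $file, int $line): array
{
    $selected = [];
    foreach ($coverage as $test => $files) {
        if (in_array($line, $files[$file] ?? [], true)) {
            $selected[] = $test;
        }
    }
    return $selected;
}

print_r(testsCoveringLine($coverage, 'Cart.php', 11)); // testAddsItems, testTotals
print_r(testsCoveringLine($coverage, 'Cart.php', 99)); // empty: nothing to run
```

A mutation on a line that no test executes needs no test run at all. It can be reported as not covered straight away.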
I expect to improve on that even more and enable specific class targeting as a future feature. It would be even faster if we had improved Code Coverage in PHP. And, as always, your mileage will most definitely vary – performance is influenced by the mutation count and the performance of both code and tests. Slow tests, in particular, while ordered to run last may have a significant impact.
Tools like Humbug are no longer restricted to academic papers.
I bring this up because performance is clearly one huge reason why Mutation Testing hasn’t already become commonplace despite its very obvious benefits. You won’t be mutation testing all the time, but running it occasionally for your entire test suite, or at least a few times for each new testable class, is quite reasonable and within current reach. Implementing filters and other focus aids would allow for even more dynamic and regular usage alongside your testing framework to keep feedback regular and fast.
I’ll blog more specifically about Humbug in time as development rolls on.
What is Mutation Testing?
Aug 2nd
Some time ago, in between working on Zend Framework, I booted up a couple of libraries that I really wanted to integrate into my workflow. Recently, I’ve been putting these through the grind mill so they can be properly released and supported for public consumption across PEAR. Just as Mockery fell out of older work on PHPMock, Mutagenesis will fall out of another project called MutateMe. This is a short introductory article as to what Mutagenesis will do and why. In other words, what the heck is Mutation Testing?
First, some background.
The most common means of measuring confidence in a test suite is the Code Coverage metric. Code Coverage essentially checks, on a per class basis, how many of the lines of code in the class are executed by a test suite and expresses this as a percentage. For example, a Code Coverage of 85% means 85% of the lines of code in a class were executed and 15% were not. The greater the number of lines of code executed, the more confidence one can presumably have that a test suite is doing its job, i.e. verifying class behaviour, preventing the introduction of bugs, supporting refactoring, and so on.
I have a huge and insurmountable problem with Code Coverage. For starters, my average Code Coverage is closer to 80% than the 90% expected of projects such as Zend Framework. The gap is explained by me not testing what I call “braindead” functions, i.e. methods which are either ridiculously simple, where a malfunction would quickly become self-evident, or which are marginalised (on the borders of deprecation). So Code Coverage actually increases the amount of work I need to do for very little gain and a lot of frustration.
Secondly, Code Coverage is easy to spoof or misinterpret. Since it’s a metric measuring the execution of source code, you need only…well…execute the source code. It’s a simple matter to construct a series of wonderfully useless tests to do just that and obtain a high Code Coverage result - it’s done all the time in my experience once someone’s patience in writing quality unit tests runs out. It is particularly evident in cases where unit tests are written after the source code is completed - a still too common practice in PHP. The less villainous flipside is that certain nuggets of source code are fundamentally difficult to test. For example, a complex algorithm suffering from poor documentation may make composing a suitable unit test near impossible. The rollout of OAuth was filled with such examples.
This leads into my opinion of Code Coverage. I view the venerable Code Coverage metric as a near pointless exercise. While it may tell how much source code a test suite exercises, it tells you nothing about the actual quality of those unit tests. They could be good tests, sort-of-good tests or absolutely horrendous tests - Code Coverage will never tell you either way. I say near pointless because there are precious few alternatives. We need something to give us a reason to trust and have confidence in test suites and Code Coverage is easy to implement and has been a part of PHPUnit since forever. So, by and large, we make do. We measure Code Coverage just to make certain some kind of unit testing was performed.
Is there nothing better?
A good unit test serves a simple purpose. It verifies a behaviour of an object. In PHP, we’re more likely to verify umpteen million behaviours in a single test (count your assertions!) but we’ll let that slide. Since a test verifies behaviour, it follows that a test should fail when that behaviour is changed. If a test does not fail when class behaviour is changed, it also follows that the original behaviour was not fully tested, i.e. there is a gaping hole in our test suite whether due to a flawed or missing test that could allow bugs entry into our application. So, to really stick unit tests under a microscope to assess their quality and our confidence in them, we need to introduce changes into the source code under test and see if the unit test suite can or cannot detect them.
This process is known as Mutation Testing. Mutagenesis is a Mutation Testing framework for PHP 5.3+.
Mutation Testing, as you have probably surmised, is not a super-complex activity. You take a set of source code and compile a list of possible “mutations” that are likely to break the behaviour of the source code. Then, you apply one mutation to that source to create a “mutant”, i.e. a copy of the source code with the mutation change applied. Next, you run the source code’s test suite against the mutant and see if any tests fail. If a test fails, celebrate - the mutation was detected so your tests were, in this instance, adequate. If no test fails, curse the Gods - the mutation was not detected and you’ll need to figure out whether a new test is needed or an old one modified/corrected. Rinse and repeat the above for each mutation you’ve compiled.
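The "compile mutations, apply one at a time" step can be sketched with PHP's built-in tokenizer. This is only an illustration of the idea, not Mutagenesis code: the function name generateMutants() is hypothetical, and it handles a single mutation type (swapping ">=" for ">").

```php
<?php
// Builds one mutant per ">=" token found in the source.
// Each mutant is a full copy of the source with exactly one change applied.
function generateMutants(string $source): array
{
    $tokens = token_get_all($source);
    $mutants = [];
    foreach ($tokens as $index => $token) {
        // Multi-character tokens are arrays of [id, text, line].
        if (is_array($token) && $token[0] === T_IS_GREATER_OR_EQUAL) {
            $copy = $tokens;
            $copy[$index][1] = '>';
            // Reassemble the token stream into source code.
            $mutants[] = implode('', array_map(
                fn($t) => is_array($t) ? $t[1] : $t,
                $copy
            ));
        }
    }
    return $mutants;
}

$source = '<?php function check($i) { return $i >= 5; }';
print_r(generateMutants($source));
```

Each string returned would then be written somewhere the test suite can load it in place of the original, and the suite run against it. That part is left out here because it depends entirely on how the code under test is loaded.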
Mutations are typically quite simple such as replacing operators, booleans, strings and other scalar values with either an opposing form or a random value. Expressions might also be reversed or driven to zero to give an opposing boolean or zero value. Making such minor changes seems like a minor irritation but behind every serious flaw in an application is one or more smaller contributing errors. If your test cases can detect the potentially contributing errors, then there’s an excellent chance it would detect the bigger ones anyway. This is known as the Coupling Effect in Mutation Testing.
Some of you will be vaguely aware of Mutation Testing. In terms of implementations, Ruby has heckler, Python has Pester, and Java has Jumbler, Jester and a couple of others. Those who prefer Microsoft’s technologies can use Nester. There’s a running rhyme apparent since so much is inspired by the original Jester framework for Java. To my knowledge, Mutagenesis will be the only Mutation Testing framework for PHP (though I sincerely wish I was wrong).
Examining those libraries, you eventually realize a few problems with Mutation Testing which explain its lack of popularity until relatively recently: performance is a concern and Mutation Testing requires a Human Brain to complete the process.
Performance is a concern because each mutation requires a test suite to be executed. Imagine a set of classes from which you extract 100 possible mutations, coupled with a test suite that takes 5 minutes to run. A basic Mutation Testing framework (e.g. Ruby’s heckler) would therefore take 500 minutes to complete a Mutation Testing session. That’s 8.3 hours of continuous Mutation Testing. Mutation Testing for Zend Framework would be very interesting.
Similar to Jumbler for Java, Mutagenesis will utilise a few heuristics (shortcuts) to significantly improve performance without compromising results. We only need one single test to fail in order to rule that a mutation was detected and killed, so we can do a few things to boost performance:
1. Terminate the test suite on first failure/error or exception.
2. Execute test cases in order of execution time ascending (fastest first; slowest last).
3. Prioritise execution of last test case to detect a mutant to take advantage of same-class detection.
4. Log which tests detect which mutations, and prioritise those associations in subsequent runs.
The effect of the above is to speed up Mutation Testing by a significant degree. The final heuristic ensures that for gradually changing source code and tests, the first Mutation Testing process might take a while but subsequent runs will be significantly faster making them far more usable in a Test-Driven Development setting. Mutation Testing is best served with a healthy dose of efficiency.
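A rough sketch of how the ordering and early-termination heuristics might fit together is shown below. Everything in it is an assumption made for illustration: each test is a hypothetical record holding a name, a last known duration in seconds, and a callable that returns true on pass and false on failure.

```php
<?php
// Orders tests so that those which detected mutants in earlier runs come
// first, followed by the rest from fastest to slowest.
function orderTests(array $tests, array $previousKillers): array
{
    usort($tests, function (array $a, array $b) use ($previousKillers): int {
        $aKilled = in_array($a['name'], $previousKillers, true);
        $bKilled = in_array($b['name'], $previousKillers, true);
        if ($aKilled !== $bKilled) {
            return $aKilled ? -1 : 1;
        }
        return $a['seconds'] <=> $b['seconds'];
    });
    return $tests;
}

// Runs tests in order and stops at the first failure.
// Returns the name of the test that detected the mutant, or null if none did.
function firstDetectingTest(array $orderedTests): ?string
{
    foreach ($orderedTests as $test) {
        if (!($test['run'])()) {
            return $test['name'];
        }
    }
    return null;
}

$tests = [
    ['name' => 'slowIntegration', 'seconds' => 4.0, 'run' => fn(): bool => false],
    ['name' => 'fastUnit',        'seconds' => 0.1, 'run' => fn(): bool => true],
    ['name' => 'boundaryCheck',   'seconds' => 0.5, 'run' => fn(): bool => false],
];

$ordered = orderTests($tests, ['boundaryCheck']);
print_r(array_column($ordered, 'name'));  // boundaryCheck, fastUnit, slowIntegration
var_dump(firstDetectingTest($ordered));   // boundaryCheck
```

In this example the mutant is detected after running a single half-second test, and the four-second test never runs.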
The second reason for its lack of popularity is that Mutation Testing can’t analyse the logic of the source code under test. For example, an expression might accept any integer less than 10 to evaluate to TRUE. If the input from another class were 7, and a mutation were generated to swap this for a 9, then the associated unit test would still pass (the mutation of switching 7 for 9 still allows the <10 expression to evaluate to TRUE). If you recall, if a mutant passes a test suite then we assume either the presence of a flawed test or the lack of a suitable test. Obviously, as the above suggests, this isn’t always the case. Mutation Testing can and often will report false positives.
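The 7-versus-9 example looks like this in code. The function name isBelowTen() is hypothetical and stands in for whatever expression does the comparison.

```php
<?php
// Hypothetical condition: any integer less than 10 evaluates to true.
function isBelowTen(int $value): bool
{
    return $value < 10;
}

$originalInput = 7; // value supplied by another class in the original code
$mutatedInput  = 9; // the mutation swaps 7 for 9

// Observable behaviour is identical, so any test asserting on it still
// passes and the mutant is reported as undetected.
var_dump(isBelowTen($originalInput)); // bool(true)
var_dump(isBelowTen($mutatedInput));  // bool(true)
```

No sensible test can distinguish the two, which is why a human has to review the surviving mutants instead of treating each one as a missing test.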
Ruling out false positives, coupled with the need to improve test suites to detect more mutations, makes Mutation Testing a source of extra work. Who likes extra work least? Programmers, especially the lazy kind.
Mutation Testing is not a far fetched idea. The principles are sound and it beats the pants off Code Coverage when it comes to measuring what confidence we can have in our testing suites. It is still hampered, as a methodology, by the lack of good implementations in other programming languages. Mutagenesis, by adopting implementation heuristics from Java’s Jumbler, should avoid that fate and offer a decent framework in PHP that performs as well as can be expected.
Once it’s released…of course. Mutagenesis is in development but should see a fresh release in a couple of weeks alongside Mockery. I’ll be looking forward to seeing how people perceive it. Mutation Testing has zero presence in PHP to date but having something to complement Code Coverage can’t do any harm!