On 1 January 2015, I first pushed Humbug onto GitHub, and three months later it is reaching a state where I can prep for the release of 1.0.0. Release early, release often! I haven’t publicised it much, so this is my first Humbug-specific blog post since I started writing code on December 21st, with a season-appropriate name becoming my chosen namespace.

https://github.com/padraic/humbug

About Humbug

Humbug is a Mutation Testing framework intended to measure the true effectiveness of test suites and provide sufficient information to allow for their improvement.

You may already be familiar with the concept. In Mutation Testing, defects which emulate simple programmer errors are introduced into source code (your canonical code is untouched) and the relevant unit tests are run to see if they notice the defect. The more defects that are noticed, the more effective the test suite is presumed to be. The methodology relies on the theory that a quantity of relatively simple defects, either in isolation or combined, provide as much useful information as would a series of more complex defects.

You can find a comprehensive (and growing) list of the types of defects Humbug creates here: https://github.com/padraic/humbug#mutators

The traditional tool in PHP for measuring test suite quality is Code Coverage. Whereas Code Coverage measures execution statistics (without regard for how unit tests are written), Humbug provides a reliable and conservative assessment of how effective your test suites are at fulfilling their objective: the detection of regressions, and an accurate and complete description of implemented behaviour.

The differences between Code Coverage and Mutation Testing can be quite obvious. On a library I run Humbug on frequently, the Code Coverage is 65%. After running Humbug for several minutes to its conclusion, the resulting Mutation Score Indicator (MSI) is 47%, a quite stark difference of 18 percentage points.

Why the discrepancy? Since Code Coverage only cares about what lines you execute, it’s blissfully unaware of any other essential information: the content of a line of code, logical branches and paths, the likely errors that might arise, whether the unit tests were written poorly or well. It ignores all of these factors which are essential to assessing the real effectiveness of a test suite.

As a result, Code Coverage’s importance as a test quality metric is overestimated. Merely executing lines of code is not a good indicator of test suite quality, and it really only informs you of what parts of your code are definitely not tested. It’s possible to reach a 100% Code Coverage score with the most horrible unit tests imaginable. There are other limitations, such as Code Coverage in PHP generally being Line based and not measuring statement or branch coverage.

I don’t want to give the impression that a 0-100 score is the full extent of Humbug’s purpose: it also produces detailed logs of defects (with sufficient information to replicate them outside Humbug) which go undetected by a test suite, allowing you to write new targeted tests that better document the actually implemented behaviour to support refactoring and prevent unnoticeable regressions.

Hmm, I should probably explain how to use it now…

Installing Humbug

Humbug requires PHP 5.4 or later and only works, for now, with PHPUnit. I’ll be looking into phpspec/behat support in the near future.

Humbug is available on Packagist as `humbug/humbug` to install globally via Composer, or you can clone it and run a composer install, but the simplest way to get it is to just download the PHAR. Given I’m a security freak, the PHAR is cryptographically signed (hence the additional public key download) and delivered over HTTPS. You can move or rename these, so long as both files are kept together.

wget https://padraic.github.io/humbug/downloads/humbug.phar
wget https://padraic.github.io/humbug/downloads/humbug.phar.pubkey

If you wish to make humbug.phar directly executable:

chmod +x humbug.phar

The PHAR is self-updating using the following command:

./humbug.phar self-update

I manually update the central PHAR as new functionality or bug fixes are added. This will track the development version in the run up to 1.0.0. Thereafter, I expect there to be a choice between stable versions or development versions when updating your PHAR copy.

Once you have Humbug somewhere, you’ll need a guinea pig. Assessing Humbug’s performance (a lot more on that later) on a huge repository of code is probably not the greatest idea ever. Pick something more moderate, where you can get used to the size vs performance ratio, and navigate to its base directory. From there:

./humbug.phar configure

Follow the steps presented as questions to generate a configuration file. The main pieces of information needed are the location(s) of the source code being tested, the directory from which to run tests (if not the base directory), the timeout to apply for any one test run (defaults to 10s – used to kill infinite loops arising from certain mutations) and any directories which should be excluded from mutation testing within the source code directories you chose (e.g. Tests if under the src hierarchy). This will generate a humbug.json.dist configuration file. You may also write it manually; it’s relatively simple.

{
    "timeout": 10,
    "source": {
        "directories": [
            "src"
        ]
    },
    "logs": {
        "text": "humbuglog.txt",
        "json": "humbuglog.json"
    }
}

Running Humbug

Mutation Testing itself is just the default command, so now run:

./humbug.phar

Humbug operates in several stages. The first is to run the test suite normally to ensure that it’s in a passing state. At this stage, Humbug also collects data on the tests: execution times, code coverage and the junit log. This data is utilised later for optimisation purposes to ensure we can eliminate tests where they don’t exercise the specific line of code being mutated and also execute the fastest tests first. As a result, it’s essential that all tests are currently passing – not only for the data, but because unavoidable failures would play havoc with the mutation testing results. If they don’t pass, Humbug will terminate the process and show an extract of the TAP formatted output indicating the failing test.

Note: Humbug is still a relatively young framework, so there are also edge cases where it will terminate even on a passing test suite. We’ll gradually deal with those cases over time.

The second stage analyses the source code, breaking it into tokens which are fed into a queue of Mutation Operators (aka Mutators). These Mutator objects check every token to see if they are capable of applying a specific mutation at that point in the source code, returning a simple boolean as confirmation either way.
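A rough sketch of that token-by-token check is below. It uses Python's standard tokenize module as a stand-in for PHP's tokenizer. The two mutator classes and the find_mutation_points helper are hypothetical names for illustration, not Humbug's actual classes.

```python
import io
import tokenize


class GreaterThanOrEqual:
    """Hypothetical mutator: would turn `>=` into `>`."""

    def mutates(self, tok):
        return tok.type == tokenize.OP and tok.string == ">="


class TrueValue:
    """Hypothetical mutator: would turn `True` into `False`."""

    def mutates(self, tok):
        return tok.type == tokenize.NAME and tok.string == "True"


def find_mutation_points(source, mutators):
    """Return (mutator name, line, column) for every token a mutator accepts."""
    points = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        for mutator in mutators:
            if mutator.mutates(tok):  # each mutator gives a plain yes/no per token
                points.append((type(mutator).__name__, tok.start[0], tok.start[1]))
    return points


SOURCE = "def is_adult(age, override=True):\n    return override and age >= 18\n"
points = find_mutation_points(SOURCE, [GreaterThanOrEqual(), TrueValue()])
for name, line, column in points:
    print(f"{name} applies at line {line}, column {column}")
```

Every token is offered to every mutator, so one source file can yield many independent mutation points.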

In the third stage, all of the information gathered to date is sent to the assembly line (also known as the God Method that gets an F on Scrutinizer). In the example I mentioned earlier, over 650 mutations are generated. In this third stage, we set up configuration files and a StreamHandler (which intercepts includes) for a separate process. In this separate process, the StreamHandler intercepts the inclusion of the original file, and replaces it with a mutated form (the mutant containing the current mutation). The test suite is then run within this separate process to see how it responds to the mutant’s presence.
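The idea of testing a mutant in its own process can be sketched as follows. Note the swapped-in technique: instead of intercepting includes as the StreamHandler does, this Python sketch simply writes the mutated source into a scratch directory and runs a test script there. The file names, the run_tests_against helper and the result labels are all assumptions, and fatal errors are not separated from ordinary test failures here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

ORIGINAL = "def is_adult(age):\n    return age >= 18\n"
MUTANT = ORIGINAL.replace(">=", ">", 1)  # the single mutation under test
TEST = "from lib import is_adult\nassert is_adult(18)\nassert not is_adult(17)\n"


def run_tests_against(source, timeout=10):
    """Run the test script in a child process; return 'pass', 'fail' or 'timeout'."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "lib.py").write_text(source)
        Path(tmp, "test_lib.py").write_text(TEST)
        try:
            proc = subprocess.run(
                [sys.executable, "test_lib.py"],
                cwd=tmp,
                capture_output=True,
                timeout=timeout,  # guards against infinite loops caused by a mutation
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    return "pass" if proc.returncode == 0 else "fail"


baseline = run_tests_against(ORIGINAL)
mutant_result = run_tests_against(MUTANT)
print("baseline:", baseline)
print("mutant:", "escaped" if mutant_result == "pass" else "detected")
```

Running the mutant in a child process means a crash or a hang caused by the mutation cannot take the main process down with it.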

All of the optimisations also kick in so that only relevant tests are executed for a given mutation. The PHPUnit output is then reported back to the main Humbug process to assess and collect the result. Progress is rendered as a series of dots and letters until we have finally iterated across all of the available mutations. In this case, 653 mutations, including 653 PHPUnit runs, take a total of around 3 minutes on my local VPS. Given the code coverage of 65%, it would be over 5 minutes if the unit tests were more complete.

The fourth and final stage is rendering a result summary and writing any requested logs. This includes the now simple calculation of the Mutation Score Indicator, the primary metric referred to in Mutation Testing.

Here’s a sample of the resulting command line output:

 _  _            _
| || |_  _ _ __ | |__ _  _ __ _
| __ | || | '  \| '_ \ || / _` |
|_||_|\_,_|_|_|_|_.__/\_,_\__, |
                          |___/
Humbug version 1.0-dev
Humbug running test suite to generate logs and code coverage data...
  361 [==========================================================] 28 secs
Humbug has completed the initial test run successfully.
Tests: 361 Line Coverage: 64.86%
Humbug is analysing source files...
Mutation Testing is commencing on 78 files...
(.: killed, M: escaped, S: uncovered, E: fatal error, T: timed out)
.....M.M..EMMMMMSSSSMMMMMSMMMMMSSSE.ESSSSSSSSSSSSSSSSSM..M.. |   60 ( 7/78)
...MM.ES..SSSSSSSSSS...MMM.MEMME.SSSS.............SSMMSSSSM. |  120 (12/78)
M.M.M...TT.M...T.MM....S.....SSS..M..SMMSM.......T...M...... |  180 (17/78)
MM...M...ESSSEM..MMM.M.MM...SSS.SS.M.SMMMMMMM..SMMMMS....... |  240 (24/78)
.........SMMMSMMMM.MM..M.SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS |  300 (26/78)
SSSSSSSSM..E....S......SS......M.SS..S..M...SSSSSSSS....MEM. |  360 (37/78)
.M....MM..SM..S..SSSSSSSS.EM.S.E.M............M.....M.SM.M.M |  420 (45/78)
..M....MMS...MMSSS................M.....EME....SEMS...SSSSSS |  480 (52/78)
SSSSS.EMSSSSM..M.MMMM...SSE.....MMM.M..MM..MSSSSSSSSSSSSSSSS |  540 (60/78)
SSS....SSSSSSSSMM.SSS..........S..M..MSSMS.SSSSSSSSSSSSSSSSS |  600 (68/78)
......E...M..........SM.....M..MMMMM.MMMMMSSSSSSSM.SS
653 mutations were generated:
     283 mutants were killed
     218 mutants were not covered by tests
     130 covered mutants were not detected
      18 fatal errors were encountered
       4 time outs were encountered
Out of 435 test covered mutations, 70% were detected.
Out of 653 total mutations, 47% were detected.
Out of 653 total mutations, 67% were covered by tests.
Remember that some mutants will inevitably be harmless (i.e. false positives).
Humbug results are being logged as JSON to: log.json
Humbug results are being logged as TEXT to: log.txt

Interpreting Humbug

You might recognise the summary results as being the example I explained earlier. Code coverage is 65% but the Mutation Score Indicator (MSI) is 47% (though it needs to be emphasised in output). I didn’t quite explain what the MSI is, so I’ll do so now.

A Mutation Score (MS) is a simple calculation of Detected Mutations as a percentage of Total Mutations, i.e. the more defects a test suite detects, the better its score. There is however a certain flexibility as to what constitutes both of these values.
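The percentages in the sample output can be reproduced from the five counts it lists. This small Python sketch assumes that fatal errors and timeouts count as detections alongside killed mutants, which is what the reported figures imply. The variable names are mine.

```python
# Counts taken from the sample summary above.
killed, uncovered, escaped, errors, timeouts = 283, 218, 130, 18, 4

total = killed + uncovered + escaped + errors + timeouts
detected = killed + errors + timeouts  # assumption: errors and timeouts count as detected
covered = total - uncovered

msi = round(100 * detected / total)            # detected as a share of all mutations
covered_msi = round(100 * detected / covered)  # detected as a share of covered mutations
mutation_coverage = round(100 * covered / total)

print(total, covered, detected)             # 653 435 305
print(msi, covered_msi, mutation_coverage)  # 47 70 67
```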

Humbug doesn’t generate every single mutation possible. Humbug also doesn’t eliminate false positive results which may arise from Mutant Equivalents, i.e. when a generated defect behaves identically to the original source code. Some of this will make its way into later Humbug versions as time (and PHP’s cooperation) allows. Certain other things like eliminating Equivalents can actually be quite hard and resource/time intensive. If Mutation Testing needs to perform well, this simply can’t be tolerated.

Rather than make a false claim, Humbug therefore reports the more nebulous and conservative Mutation Score Indicator (which is par for the course in the real world). It indicates what your actual Mutation Score might be, but it’s not definitive. An MSI of 47% may be slightly understated as a result of this uncertainty. It also means that a perfect score of 100% is very likely unobtainable except in very simple straightforward cases.

My Kingdom For A Diff

The logs generated by Humbug are essentially a collection of file name, line number, mutator type and a diff demonstrating how a specific mutation is applied, all categorised by the result type. The purpose is to allow you to review mutations which were covered by tests but went undetected, apply them as needed, and write tests allowing you to detect the defect represented by the mutation should it or a related behavioural regression ever occur.

The logs can be generated as human-readable text, and in JSON format to be consumed by other services or your own creations. We also generate, optionally, a number of JSON logs which cache results and other information necessary to perform Incremental Analysis (IA) which is an experimental feature mentioned in the next section.

Performance

In a nutshell, performance is the reason why Mutation Testing is not regularly used in any programming language. Running your entire test suite for potentially thousands of possible defects can indeed be extremely slow, so focus has remained on ad-hoc manual reviews and code coverage to assess test suite quality.

Humbug, like some other Mutation Testing frameworks, pursues performance optimisations even where it may have a small cost to accuracy. In our prior example, we generated 653 mutations with a test suite that normally takes 29 seconds on average. That would suggest a runtime of roughly 5.26 hours. In reality, Humbug completes the mutation testing in around 3 minutes. Your mileage may vary.
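The naive estimate is simple arithmetic, worked through here in Python. The second figure is my own addition: it shows what the estimate would be if only the 435 covered mutants from the sample output triggered a full suite run.

```python
mutations = 653
covered_mutations = 435  # from the sample output: total minus uncovered
suite_seconds = 29

naive_hours = mutations * suite_seconds / 3600
covered_only_hours = covered_mutations * suite_seconds / 3600

print(f"{naive_hours:.2f} hours")         # 5.26 hours
print(f"{covered_only_hours:.2f} hours")  # 3.50 hours
```

Skipping uncovered mutants alone still leaves hours of work, so most of the saving has to come from running fewer and faster tests per mutation.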

One significant optimisation is that Humbug uses Code Coverage for its actual purpose: assessing what lines of code are tested, and what tests actually exercise those lines. This eliminates “run the whole test suite per defect” since we can select only relevant tests and then order by their execution times to run the fastest relevant tests as a priority. It also means that we can skip running any tests where a mutated line’s Code Coverage is zero. There are other smaller tweaks (both to speed and memory utilisation) and running faster tests first eliminates most of the costs of slow tests.
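A minimal sketch of that selection step might look like the following. It is Python for illustration only. The TimedTest record, the select_tests helper and the sample data are hypothetical, and Humbug's real data structures will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedTest:
    name: str
    seconds: float      # execution time recorded during the initial run
    covered: frozenset  # (file, line) pairs the test executed


def select_tests(tests, file, line):
    """Return the tests that execute file:line, fastest first."""
    relevant = [t for t in tests if (file, line) in t.covered]
    return sorted(relevant, key=lambda t: t.seconds)


suite = [
    TimedTest("test_slow_integration", 4.2,
              frozenset({("src/Cart.php", 10), ("src/Cart.php", 11)})),
    TimedTest("test_fast_unit", 0.01, frozenset({("src/Cart.php", 10)})),
    TimedTest("test_unrelated", 0.02, frozenset({("src/User.php", 3)})),
]

for_line_10 = [t.name for t in select_tests(suite, "src/Cart.php", 10)]
for_line_99 = select_tests(suite, "src/Cart.php", 99)

print(for_line_10)  # ['test_fast_unit', 'test_slow_integration']
print("uncovered, skip the test run" if not for_line_99 else "covered")
```

Because a mutant is detected as soon as any one test fails, putting the fast tests first means the slow ones often never need to run.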

Additional undiscovered optimisations (micro or otherwise) may yet be possible, although the more obvious targets are understandably the focus at this point. Certain other optimisations such as parallel processes and more fine grained test selection are dependent on the test suite being sufficiently well designed (which is relatively rare), so these have gone unimplemented for the moment as we chase optimisations applicable to the broadest base possible.

Incremental Analysis

Incremental Analysis (IA) is an experimental feature in progress to incorporate caching into Humbug. The principle is to cache results, incrementally updating them as the source and test code changes, to eliminate the upfront cost of iterating across all possible mutations when unnecessary.
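One way to picture the caching principle is a result store keyed on a fingerprint of the source and test code. This is a speculative Python sketch of the general idea, not a description of how IA is actually implemented. The ResultCache class, the fingerprint helper and the mutation identifier format are all invented for the example.

```python
import hashlib


def fingerprint(*texts):
    """Hash several strings together into one hex digest."""
    digest = hashlib.sha256()
    for text in texts:
        digest.update(text.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


class ResultCache:
    def __init__(self):
        self._results = {}

    def result_for(self, source, tests, mutation_id, run_mutant):
        """Reuse a stored result unless the source or the tests have changed."""
        key = (fingerprint(source, tests), mutation_id)
        if key not in self._results:
            self._results[key] = run_mutant()
        return self._results[key]


runs = []


def fake_run():
    runs.append(1)  # stands in for an expensive test run
    return "killed"


cache = ResultCache()
cache.result_for("src v1", "tests v1", "Cart.php:10:GreaterThan", fake_run)
cache.result_for("src v1", "tests v1", "Cart.php:10:GreaterThan", fake_run)  # cache hit
cache.result_for("src v2", "tests v1", "Cart.php:10:GreaterThan", fake_run)  # source changed
print(len(runs))  # 2
```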

While it won’t be stable for an initial release, IA promises to bring run times down to the point where Humbug is more usable with larger projects, and capable of being used more frequently in general.

In Closing

Where now? Humbug 1.0.0 is intended to get the ball rolling on Mutation Testing with PHPUnit, allow for some experimentation with Incremental Analysis, and start attracting issues for the inevitable bugs and various PHPUnit setups that exist out in the wild.

Other than the inevitable bug fixes and basic maintenance, the next obvious target is phpspec and behat, taking on the issue of how diverse we’ve become in describing behaviour instead of merely verifying all things.

This is largely a question of increasing Humbug’s detection net to what people actually use in real life. In a setup with behat, phpspec and PHPUnit, some mutants may escape a PHPUnit oriented approach by design. Humbug will need to take all of the various tools into account in those cases.

In the meantime, I hope Humbug terrifies your unit tests ;).