In January of this year, I had the idea of writing a HTML Sanitiser for PHP. Why not? All PHP has is HTMLPurifier and a bunch of random solutions that are about as secure as the average wooden gate. If you think that’s harsh, wait for my next blog post ;) . HTMLPurifier is the only secure by default HTML Sanitiser in PHP. Fact. But the darn thing is gigantic and slow. That has never stopped me using it (for years), even if I had to do a little funky engineering so I could minimise the performance hit. Other developers, however, have often abandoned HTMLPurifier, falling into the trap of believing that alternative solutions will serve them just as well.

That’s the state of HTML Sanitisation in PHP - pick a big slow library that crushes Cross-Site Scripting and Phishing attacks, or use yet another regular expression based sanitiser that a) barely manages a fraction of HTMLPurifier’s features and b) can probably be exploited by any scriptkiddie working with a stack of data cards. It says an awful lot about security standards among PHP developers that such delusions are uncomprehendingly rampant.

In case you haven’t noticed, I’m biased. Sue me.

I have opined since forever that regular expression sanitisers are nothing short of insane. Since the problem with HTMLPurifier is speed and size, I started thinking about ways to build something like HTMLPurifier that was fast, small and almost as feature packed as HTMLPurifier. At first, this sounds like an impossible task. The typical suggestion is to use regular expressions, but I’m not completely insane…yet. Instead I borrowed a concept called a DOM Filter and chucked in a helpful dose of HTML Tidy. The result was Wibble.

Wibble is basically a DOM Filter. It loads up HTML into PHP DOM, applies a set of filters against all nodes in the DOM, passes the output through HTML Tidy, and then hands it back to the user - sanitised and well-formed. It’s almost stupid in its obviousness. Better, this allows Wibble to skip regular expression dependence. It operates far more like HTMLPurifier by relying on a DOM representation (no string parsing to funk around with) partnered with Tidy for cleanup.

Of course, there have to be regular expressions somewhere. And whitelists. And other stuff. Wibble is really an amalgamation of borrowed concepts. It’s hard to be too original in HTML Sanitisation because originality is a good way to shoot yourself in the foot (hence regex is EVIL!), so I wasn’t going to spend too long digging my own grave when there is a wealth of sanitisation resources in the programming world. Wibble’s approach borrows elements from Ruby’s loofah, Python’s HTML5Lib, and Java’s AntiSamy. Wibble mixes and matches from the useful design elements each of these offers, serving them up on top of PHP’s DOM and Tidy extensions with its own distinctive twists.

I completed the first Wibble prototype recently, so I figured that with something that was at that 90% point where the remaining 10% would be in-depth sanity testing, cleanup and documentation, it was time to see how it compared to some other PHP solutions (HTMLPurifier and HtmLawed). I had some fairly conservative performance objectives so the results came as a pleasant surprise.

If you are a benchmark fiend, you can download and independently fiddle with my benchmark process from http://github.com/padraic/wibble-benchmarks. Note that the current benchmark uses a Wibble prototype - there are additional elements that need to be added over time. The benchmark currently uses three sample snippets of HTML: Small (blog comment size), Medium (markup heavy with limited textual content), and Big (markup light with lots of textual content). It operates by filtering each HTML sample 200 times with each benchmarked HTML sanitisation solution. Each iteration includes the instantiation and setup phases of each solution (where relevant) to reflect the most likely real world experience of using sanitisation as a once off (non-repeating in same request) process. I use PEAR’s Benchmark package to record the aggregate run time per loop of sanitisation tasks. All operations occur within one single PHP process with HTMLPurifier caching enabled (Wibble and HtmLawed do not use caching). Each solution is configured as close as possible to target total stripping of all HTML from the content.

You can view a sample result at http://gist.github.com/468426.

The results show that both Wibble and HtmLawed outperform HTMLPurifier by a very wide margin. Wibble underperforms HtmLawed by a variable margin - from twice as slow on small to medium sized input, to four times slower on large inputs with minimal HTML tags. In Wibble’s slowest benchmark, it outperformed HTMLPurifier by a factor of four.

Wibble intent is to try and replicate the completeness of HTMLPurifier, so it’s speed deficit when compared to HtmLawed is expected (when stripping all tags). There is not a lot to be done to improve this specific benchmark result since Wibble does a lot of stuff behind the scenes like encoding normalisation, DOM manipulation and HTML tidying. It also does all three of these things far more consistently and completely than HtmLawed is capable of.

So how does Wibble match up against Big Daddy? Wibble is a prototype, so obviously it still has ground to gain in terms of features with HTMLPurifier. But on the most significant points it only has one specific problem - it’s not HTML 5 ready. Neither DOM or Tidy support HTML 5, though you can “pretend” it’s HTML 4.01 (or even XHTML 1.0) for HTML 5 fragments so long as you are aware Tidy will strip unsupported HTML 5 tags and attributes.

The other points are syncing up with HTMLPurifier quite nicely. Wibble will santitise all HTML by default using strict filters (i.e. by default it strips every tag and only outputs plain text). It handles multiple encodings including conversion if necessary. It outputs standards compliant (other than HTML 5) HTML or XHTML. It fixes all the usual page breaking stuff like unclosed tags and illegal tag nesting. It is entirely reliant on whitelists and strict validation rather than blacklists and loose reconstructive parsing. It includes minimal regular expression usage (only needed for attribute and CSS validation) based on regular expressions widely used and tested in other languages. While testing will (and must) continue, it has so far proven resistant to XSS and Phishing attacks. This can’t be absolutely assured until sufficient testing has been performed.

Otherwise, it will be interesting to see the final version of Wibble. HTMLPurifier has a tough reputation to follow, but having something which can even up the odds and do it with a pronounced advantage in speed will be really nice. Well, until someone needs to install it on CentOS ;) .