HTML Sanitisation (defined below) has been with us for a long time, ever since the first genius came up with the idea of allowing potentially untrustworthy third-party HTML to be dynamically patched into their own markup. The years have not been kind to that idea, and third-party HTML inclusion remains one of the most complex and underappreciated vectors for security vulnerabilities.

In this article, I take a look at some of the solutions PHP developers rely upon to perform HTML Sanitisation, mostly because few others have examined or written about these solutions in any great detail (at least publicly). HTML Sanitisation has a very low profile in PHP: it’s rarely mentioned, usually not understood all that well, and the solutions in this area deserve more deliberate attention. It’s also valuable research, since I am writing my own HTML Sanitisation library (bias alert!) for a future Zend Framework 2.0 proposal, and knowing what the competition is up to does no harm. Finally, I was simply curious. Nobody seems in any great hurry to look closely at these HTML Sanitisation solutions, despite the fact that there are other developers (I think) who wouldn’t touch most of them with a barge pole.

One somewhat remarkable example, just to illustrate why I figured this article was worth the time, is HTMLPurifier’s Comparison page, where HTMLPurifier is measured against a number of other HTML Sanitisers. The comparison is remarkable because it seems inclined to give HTMLPurifier’s competitors the benefit of the doubt. Unfortunately, this means the analysis is often flawed and its conclusions suspect. It also helps legitimise other solutions in the minds of readers by assuming they are safe. Not that this reflects on HTMLPurifier’s functionality, incidentally, which I have always maintained is the only HTML Sanitiser worth looking at.

Back on track…

What is HTML Sanitisation?

HTML is an amazingly dangerous thing. It can contain Javascript, CSS, malformed markup, or even gigantic images that laugh at your dual 32″ monitor setup. Each of these, in its own way, can damage the experience of an end user, whether it be by Cross-Site Scripting (XSS), Phishing, or simply mangling the page until it’s unusable and/or defaced with scriptkiddie jibes.

There are two ways of dealing with these threats to the HTML output of an application: escape output so that the only HTML rendered by the browser is the application’s own (anything else being neutered into HTML entities), or sanitise output so that any additional HTML it contains, which a browser would render, is stripped of any potentially damaging markup. This article concerns the second option.
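For contrast, the first option is mechanical and easy to get right: every HTML special character is converted to an entity so the browser treats the whole string as inert text. A quick illustration using Python’s html.escape, which is roughly equivalent to PHP’s htmlspecialchars():

```python
import html

# Untrusted input containing live markup.
untrusted = '<script>alert("xss")</script>'

# Escaping converts the special characters to entities;
# the browser now renders this as plain text, not a script.
escaped = html.escape(untrusted)
print(escaped)  # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

Escaping is the right default whenever you don’t need the input to render as HTML; sanitisation is only for the harder case where some markup must survive.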

HTML Sanitisation may therefore be defined as any means of filtering HTML to ensure that a) Cross-Site Scripting (XSS) vulnerabilities are removed, b) Phishing vulnerabilities are removed, c) the HTML is well formed and adheres to an acceptable HTML standard, and d) the HTML contains no obvious means of breaking expected web page rendering.

I won’t claim this is a perfect definition but it covers most of the salient points you’ll likely encounter.

So there are, broadly speaking, four primary objectives of HTML Sanitisation, each of which is capable of preventing damage to end users or web application functionality (including Javascript-powered client side functionality). Each is, in its own way, quite a difficult proposition requiring suitable tools and specialised knowledge. However, with some objectives we can measure our success somewhat reliably. The question this article asks is: how well do HTML Sanitisers in PHP measure up to these objectives?

The Candidates

Since this is intended as a brief examination (just a few million words long!), I decided to select four candidate HTML Sanitisers meeting certain conditions. These conditions included:

1. Having a release at some point in the past two years;
2. Describing itself as a HTML sanitiser/filter to prevent Bad Things;
3. Having a design clearly in line with an intent to filter XSS/Phishing; and
4. Having no publicly acknowledged long standing security vulnerabilities.

The great part about applying these conditions is that I pretty much eliminated stacks of HTML Sanitisers (as some might claim them to be). They also eliminate anything users might misconstrue as a HTML Sanitiser (for example, PHP’s strip_tags() function or Zend Framework’s Zend_Filter_StripTags class). What we are left with is pretty thin on the ground, but it fits what I’d expect a reasonably well-informed PHP developer to reach for. From what remained, I selected four candidates (or maybe these were the only four left - I’ll never tell):

1. PEAR’s HTML_Safe
2. htmLawed
3. WordPress’ Kses
4. HTMLPurifier

With four candidates in tow, I proceeded to examine each against the four objectives of HTML Sanitisation I noted earlier. Just to emphasise to readers, this examination was not so in-depth as to identify every possible flaw or issue with each candidate. My intent was to attempt to locate one security vulnerability and assess each candidate in general terms for the other non-security related HTML Sanitisation objectives.

Before I go any further, let me clarify that all security vulnerabilities discovered were notified almost immediately to the parties responsible for each candidate solution. All such parties confirmed receipt of these reports within one week, and all were given a period (approx. six or more weeks up to today) in which to apply fixes, make new releases, update documentation, perform additional security reviews, etc. I’ve sat on this article for a long time. Regardless of the effectiveness of any such actions (or lack of action, as the case may be), this article discloses all perceived security vulnerabilities discovered during my examination, whether or not all parties agree with my opinion. In some cases a disclosure may clearly indicate a fundamental flaw in the underlying design of the candidate. In these cases, it was emphasised to the responsible parties that reported vulnerabilities were limited to the scope of my examination and that I believed it was likely that additional and possibly related vulnerabilities remained unreported but easily discoverable as a result of public disclosure. This concludes my rendition of “Cover Your Ass” ;) .

Note that all discussions below relate to whichever version of each candidate solution was initially examined (before any fixed releases). I have noted resolutions as necessary for each.


PEAR’s HTML_Safe

To get us started, PEAR’s HTML_Safe is one of the older candidates despite its examined release being in April 2010 (the previous release being a beta in 2005). HTML_Safe’s description states that “This parser strips down all potentially dangerous content within HTML”. HTML_Safe operates on the basis of parsing HTML with regular expressions and applying filtering logic which is dependent on predefined blacklists of potentially harmful elements, attributes and CSS properties.

Unfortunately, HTML_Safe’s blacklists prove to be its undoing. The problem with blacklists is that they require constant attention and updates for new problems. A cursory examination of the CSS property blacklist showed that it omitted many browser-specific CSS properties such as -ms-behavior. The -ms-behavior CSS property (specific to Internet Explorer 8) may contain as a value a URI reference to a locally hosted HTC file (which contains executable Javascript). While such a file would need to exist on the local domain, this is obviously a security vulnerability in that it allows the execution of any arbitrary Javascript an attacker can store or reuse on the local domain, thus opening up XSS possibilities.
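The failure mode is structural, not a one-off oversight. A minimal sketch (not HTML_Safe’s actual code, and the blacklist entries here are illustrative) shows why: a blacklist only catches what its authors anticipated, so every vendor-prefixed alias that arrives later sails through by default.

```python
# Illustrative blacklist in the spirit of HTML_Safe's approach.
CSS_BLACKLIST = {"behavior", "expression", "-moz-binding"}

def blacklist_allows(prop: str) -> bool:
    """Return True if a blacklist-based filter would let the property through."""
    return prop.lower() not in CSS_BLACKLIST

# The unprefixed property is caught...
assert not blacklist_allows("behavior")
# ...but the IE8-specific alias, absent from the list, slips straight through.
assert blacklist_allows("-ms-behavior")
```

A whitelist inverts the default: anything not explicitly known-safe is rejected, so new browser features fail closed instead of open.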

HTML_Safe also has another vulnerability shared by practically all HTML Sanitisation solutions based on the use of regular expressions. Regular expression parsing typically assumes that all HTML special characters are encoded in ASCII (an encoding subset which is common across other encodings such as ISO-8859-1 and UTF-8). However, UTF-7 encodes the greater than (>) and less than (<) characters differently. This means that typical regular expression parsing does not detect these characters when encoded in UTF-7. If you can't detect them, you can't sanitise them! This sanitisation bypass requires a secondary vulnerability where an attacker either forces a webpage containing unsanitised UTF-7 encoded markup to be rendered with a charset of UTF-7 (IE has vulnerabilities here, as do some versions of Firefox) or where the target application actually allows a user to select a nonsensical custom character set (as happened with Google and Yahoo when struck with this same exploit, and several CMS applications as recently as last Spring).

Finally, HTML_Safe's blacklisting also misses out on CSS properties not directly tied to XSS (e.g. position), but which may be used to perform Phishing attacks by using CSS styling to alter or overlay HTML elements. This could, for example, allow an attacker to re-style their injected HTML to replace (as in positioning above) specific page elements (if not the entire page). This can lead to Clickjacking among other forms of Phishing.

In terms of HTML well-formedness, HTML_Safe does not necessarily emit standards-compliant or well-formed HTML, nor does it check for other common page-breaking tactics such as overlarge images.

HTML_Safe may well be the least secure of our four candidates. Its use of blacklisting, its relative age, and a lack of peer review have left it woefully outdated for the task of HTML Sanitisation. In short, it should be avoided at all costs and my main request to PEAR at the time of reporting the above vulnerabilities was to seriously consider removing it from PEAR. Personally, I find it almost tragic that a library of such limited capability may benefit from PEAR’s reputation and lead to users trusting it over far more secure alternatives.

Of the issues noted above, the UTF-7 vulnerability has been resolved in a new release. The CSS blacklist has not yet been revised, though I remain confident that the security advisory will go out any day now. PEAR does do security advisories, right? My recommendation, as originally suggested, remains that HTML_Safe should be removed or overhauled, given that it is not up to the task of HTML Sanitisation in its current condition. Blacklisting is simply the worst possible approach to HTML Sanitisation.


htmLawed

Our second candidate is, at least according to Google’s omnipotence, one of the more popular open source standalone HTML Sanitisers. htmLawed has garnered a reputation as something of a rebellious spirit built for speed, a stark contrast to the slow, resource-intensive operation of HTMLPurifier with which it is often compared. htmLawed’s description states that it is “PHP code to purify & filter HTML” and that it can “make HTML markup in text secure and standard-compliant”.

The problem with htmLawed is that its operation is not much more than a very short stone’s throw away from HTML_Safe’s. On the face of it, htmLawed is significantly more complex than HTML_Safe, being loosely based on Kses (an older HTML Sanitisation script). Its documentation is also huge, finely detailed, and packed full of options. Its source code is complex and heavily obscured (presumably to discourage anyone from trying to examine it too closely). But complexity and options do not a HTML Sanitiser make.

Similar to HTML_Safe, htmLawed carries no functionality to thwart character encoding based attacks. The UTF-7 exploit as described for PEAR’s HTML_Safe works perfectly well.

In addition, htmLawed has a number of oddities which are not vulnerabilities but may potentially become such in the future. These relate to attempts by htmLawed to fix insecure attribute values and CSS values. I have no idea why htmLawed tries to fix them (just remove them!), but the results come very close to enabling reconstruction-based attacks (i.e. where you can trick a fixing mechanism into reconstructing a more obvious attempt at a vulnerability as a less obvious but equally dangerous one). I can’t say these are anything more than oddities, however. One example is that you can get the parser to create a CSS string containing something like “exp ress ion” (expression properties allow the execution of Javascript in Internet Explorer). The spaces in the string prevent it from being used by all current browsers, but what if a new browser release decides to ignore a little whitespace? Several other oddities also exist but are either harmless or fixed in the most recent htmLawed release.
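The general shape of a reconstruction attack is worth spelling out, because it applies to any filter that rewrites input rather than discarding it. Here is a deliberately naive single-pass “fixer” (purely illustrative, not htmLawed’s code): because it strips the dangerous token in one pass, nesting the token defeats it, and removing the inner copy reassembles the outer one.

```python
# Illustrative single-pass "fixer": removes the dangerous token once through.
def strip_expression_once(css_value: str) -> str:
    return css_value.replace("expression", "")

# A plain occurrence is removed as expected.
assert strip_expression_once("width: expression(alert(1))") == "width: (alert(1))"

# But nesting the token defeats a single pass: stripping the inner
# "expression" glues "exp" and "ression" back together.
assert strip_expression_once("expexpressionression(alert(1))") == "expression(alert(1))"
```

The standard fixes are to reject the whole value outright, or to re-run the filter until the output reaches a fixed point; rewriting in a single pass is exactly what opens the door.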

Back firmly in current reality, htmLawed’s CSS filtering is seriously flawed. For example, it does not filter out the CSS behavior property which leads to the same result as it would for HTML_Safe’s allowance of -ms-behavior, the execution of local domain hosted HTC files containing Javascript.

And, again, htmLawed is open to the exact same Phishing and Clickjacking vulnerabilities as HTML_Safe, allowing all CSS position/height/width and other properties capable of re-styling web pages.

htmLawed is not quite as easy to fool as HTML_Safe, but both share a remarkably similar lack of attention to specific areas of HTML Sanitisation. The conclusion I’ve come to is that both of these libraries (and others in the wild) are reading from the same book. That might sound ridiculous until you consider that these libraries all seem to revolve around the year 2005 or so - the same period when Kses was King of HTML Sanitisation. There’s no originality in writing new libraries in various styles which all follow the exact same assumptions and knowledge level.

htmLawed performs tag balancing and other cleanup tasks. Nevertheless, it will not necessarily output well-formed or standards-compliant HTML. You should bring ext/tidy to the party as a post-processor.

My main gripe with htmLawed, besides the above, is that it does not appear these issues are going to be fixed any time soon. The script’s author has taken the approach that htmLawed filters HTML, and only HTML. Its CSS filtering cannot be trusted and is not secure, and this IS part of HTML Sanitisation (HTML contains CSS!). Nevertheless, htmLawed’s documentation has been updated since my reports to clarify what developers may need to do before using htmLawed (i.e. normalise input character encoding, ensure all CSS is stripped/disabled or sanitised by another process, etc.). In short, htmLawed is not even remotely in the race as a fully featured HTML Sanitiser. It lacks too many features, or uses features which are obviously incomplete, and it pushes too many complex responsibilities onto end users who, we all know, will never bother to follow through.

The issues noted above have not, to date, been fully resolved to my satisfaction, though the updated documentation means I can’t call them vulnerabilities anymore. While documentation updates and minor fixes have resulted in some improvement and clarification, it appears that fixes for the fundamental flaws in CSS sanitisation and character encoding will not arrive until some future unscheduled major revision of htmLawed (encoding has been documented as an end user concern instead). In effect, the security vulnerabilities reported resulted not in fixes, but in pushing the fixes back to end users to deal with and denying that they are a responsibility of htmLawed. In my book, that simply removes htmLawed entirely as a usable HTML Sanitiser. Pushing responsibility back onto users may let you off the security vulnerability hook, but it does call the entire description and goals of htmLawed into question.

Users are advised to follow the documentation to the letter (read it in detail!) in disabling all CSS styling and enabling safe mode for htmLawed (safe mode is disabled by default). Using ext/iconv to do some minimal character encoding normalisation is also highly recommended. As with PEAR, there has been no formal security advisory issued, and the release notes for the latest release make no reference to any security vulnerability having existed.

WordPress (Kses)

Candidate three is WordPress 3.0, or more specifically the Kses script bundled with WordPress that is used internally for HTML Sanitisation. It should be noted that while the original Kses is stuck in 2005 (and should be avoided like the plague), WordPress have heavily updated their internal copy. This creates a viable HTML Sanitisation solution which is simple to extract for personal use (it just needs a handful of extra functions borrowed from an includes file), and which may well be the de facto winner in the popularity stakes simply because it’s a core part of everyone’s favourite blogging platform.

What could possibly go wrong? Well, surprisingly not as much as I was half expecting after my earlier candidates!

WordPress’ Kses proved far more challenging than the previous two candidates. It’s obvious that the widespread use of WordPress has enforced constant peer review and improvement. Nevertheless, where there’s a will (and a quiet afternoon), there’s a way.

WordPress’ Kses, first of all, offers no protection against character encoding attacks when used as a standalone sanitiser, so the previously described UTF-7 XSS attack works quite well. I do NOT consider this a security vulnerability, since the use of Kses within WordPress prevents manipulation of the HTML’s declared charset (though presumably this protection could be mangled by some WordPress template author). Nevertheless, anyone using Kses in standalone mode should note that character encoding normalisation is necessary and should be implemented.

Fast forward three coffees, and it finally clicked that Kses did have one obvious flaw: it uses a little blacklisting for CSS filtering. In short, it removes all CSS attribute values which contain any of the characters /, \, * and (. It’s actually very clever (in that it’s indiscriminately simple) since it removes CSS using these characters altogether. Unfortunately, it never considered the right curly brace, }. Under all versions of Internet Explorer, in either quirks or standards mode, the right curly brace is treated as a CSS declaration terminator (i.e. just like a semi-colon). This means that something like “position: absolute; top: 5px; left: 10px;” can be written for Internet Explorer as “position: absolute} top: 5px} left: 10px}”. IE extends another unwanted helping hand… The result is that such CSS styling values are not intercepted by Kses (which relies on semi-colon terminators), and may allow for Phishing and Clickjacking attacks where Kses-filtered user input is rendered to an end user from WordPress.
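The trap catches any filter that assumes semicolons delimit declarations. Kses’ real check is character-based, but the failure mode is the same either way; the sketch below (purely illustrative, not WordPress code, with a hypothetical property whitelist) shows how a brace-terminated declaration hides inside a chunk whose leading property looks harmless:

```python
# Hypothetical allowed-property list for illustration only.
ALLOWED_PROPERTIES = {"color", "font-weight", "text-align"}

def naive_css_filter(style: str) -> str:
    """Keep only declarations whose property name is whitelisted,
    assuming ';' is the one and only declaration terminator."""
    kept = []
    for decl in style.split(";"):
        if ":" not in decl:
            continue
        prop = decl.split(":", 1)[0].strip().lower()
        if prop in ALLOWED_PROPERTIES:
            kept.append(decl.strip())
    return "; ".join(kept)

# Written normally, the disallowed declaration is dropped:
assert "position" not in naive_css_filter("font-weight: bold; position: absolute;")

# But IE also treats '}' as a terminator, so both declarations arrive as ONE
# chunk whose property name ("font-weight") looks allowed - and the trailing
# 'position: absolute}' rides through uninspected.
assert "position" in naive_css_filter("font-weight: bold} position: absolute}")
```

The robust fix is to tokenise CSS properly (or reject any value containing characters outside a strict safe set, braces included) rather than splitting on a single terminator character.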

After those three coffees and some head banging, I thanked my numb brain for finally giving me something and left well enough alone. Later on, I also reported to the WordPress team that Internet Explorer in quirks mode (doesn’t work in standards mode) also accepted the equals sign in place of a colon, i.e. “position: absolute;” could be written as “position=absolute;” or even “position=absolute}”. This is of far less concern than the previous issue - and more of a funny aside on IE’s incredible silliness.

As with the previous candidates, Kses will not always output well-formed HTML, nor will it check for other page-breaking tactics. Again, this will require some custom checks and perhaps ext/tidy in some cases.

WordPress’ Kses is surprisingly (well, it should be!) adept at the HTML Sanitisation game. If it bundled encoding normalisation, and packed HTML tidying and a few other bits and pieces I’d almost use it myself. Almost. At the end of the day, however, I just don’t trust regular expressions enough.

The issues noted above were fixed in WordPress 3.0.1 released in late July 2010 (and weeks beforehand in SVN). The reported Phishing issue was mysteriously absent from the list of 54 fixed issues for 3.0.1, which isn’t really all that surprising (would not have been a public issue anywhere) but it would have been nice to see a public disclosure from the vendor before I published this article.


HTMLPurifier

HTMLPurifier is the brainchild (no doubt with Angelina Jolie) of Edward Z. Yang. In short, it is nothing remotely like the other HTML Sanitisation candidates. It bundles a HTML parser/validator/supercomputer (rumour says it might be The Stig’s brain). HTMLPurifier describes itself as follows: “HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C’s specifications.”

Short answer. I completely failed to find anything that got past it. I tried. Then I tried some more. Then I tried multiple caffeine injections but to no avail. It just sat there looking smug. HTMLPurifier is seemingly impervious to pretty much anything. Worse, it refused to produce any mangled attributes, or reconstructions, or anything else I normally expect of a typical HTML Sanitisation solution.

Out of four candidates, HTMLPurifier was the only one to successfully meet all four objectives of HTML Sanitisation. Its only problem? It eats RAM and sucks CPU cycles far more than any other candidate - a point which at least some competing HTML Sanitisers are happy to cite as a disadvantage, since there’s little else they can hang on it.

Since I cannot just leave it there, it is worth noting that this does not mean HTMLPurifier is invincible. The problem with the world of Cross-Site Scripting is that new ways to attack users are constantly evolving. HTMLPurifier has fixed several vulnerabilities in the past, and these are publicly disclosed by the author (as security advisories, even). Many of them are reported directly by discoverers of new vectors, which adds to HTMLPurifier’s security in one respect, since it gets such attention. The author is also quite proactive about locating and researching fixes for possible future vulnerabilities.

HTMLPurifier is, quite simply, the only fully functioning and fully featured HTML Sanitiser in PHP. It literally stands alone.

What Does It All Mean?

Besides the fact that PHP developers are sheep to the slaughter? It means that HTML Sanitisation is incredibly misunderstood, even by the authors of HTML Sanitisation solutions. It means that such solutions receive minimal peer review from individuals who are knowledgeable about security issues. It probably means that the days of combating XSS, Phishing, and other concerns linked to HTML Sanitisation are far from over, at least for PHP.

Consider the nature of the vulnerabilities. None of them are particularly obscure or poorly documented in the public domain. There are several excellent sources of XSS/Phishing vectors for browsers, but it truly appears that everyone relies entirely on just one - the XSS (Cross Site Scripting) Cheatsheet - and it’s blatantly obvious that it’s the only source of XSS vectors utilised by most solutions in whatever passes for their testing (I doubt anyone checks the forum for more, either). This is a limited-exposure problem: if you focus on the same subset of possible exploits and ignore everything else, you are entirely at the mercy of a single source of knowledge that omits quite a lot (the XSS Cheatsheet doesn’t contain all possible exploits - far from it!). For example, the CSS vector used against WordPress’ Kses is clearly documented on the HTML5 Security Cheatsheet but not mentioned anywhere on the more commonly referenced XSS Cheatsheet. It’s a very old CSS vector.

In addition, many HTML Sanitisation solutions share common approaches and vulnerabilities. I was only looking for one vulnerability, but cross checks netted those in common. This is one indication of an isolated evolutionary line - everybody feeds off everybody else’s work in the area with no attempt to look outside the PHP house to see what the weather is like over in Rubyville or Javatown (it looks really sunny in both incidentally) or even just to keep up to date on new resources, ideas and research. More interestingly, all this in-feeding seems to start with Kses in 2005. It’s like the past six years dropped into a timeless black hole insulated from the harsh realities of Terra Firma even as tidal forces slowly tore them apart. The singularity being a library from 2005 that is no longer maintained (outside of WordPress).

What is surprising is that this isn’t surprising to me anymore. It’s par for the course in PHP, and that hasn’t changed since the 90s. I love PHP to bits, but it’s a language that is tragically short on security expertise, one where a security expert can be nothing more than someone who read a book long ago and who has no incentive to move beyond the obvious or do something extraordinary - like using Google. If this article achieves nothing more than a blip on some faraway radar, forgotten the next day as the download counts for HTML_Safe, htmLawed, and all the other solutions I haven’t examined go unchanged, then at least I got a blip. It’s a start. Maybe it means a few fewer XSS reports on Bugtraq, or a few fewer people choosing an insecure solution because it claims to be fast and routinely misleads people about its efficacy in between taking senseless potshots at the only reliable solution out there. Maybe a few more developers will question all these miraculous HTML Sanitisers and stop taking their summary descriptions at face value.

Developers need to start using their heads when it comes to selecting any security related solution. It’s truly amazing to see developers recommend something purely on the basis of speed, or to watch them argue against all logic in support of an insecure option and just plain ignore the security implications (or assume there are none, or point at an RFC and misquote/misread it). The whole point of a security related solution IS security. If you want to compromise security to gain performance then at least be brutally honest about what it entails for the applications you build.

Do yourself a favour, use HTMLPurifier. And Ambush Commander, update the fracking comparison page already! ;)