Regex HTML Sanitisation: Off With Its Head!
Defeating Cross-Site Scripting
The solutions which prevent and defend against XSS in HTML are commonly known:
1. If you inject data into HTML (e.g. a template), and cannot be 110% sure it never crossed paths with a malicious user, you escape it. In PHP this means passing such data through a function like htmlspecialchars(), always remembering to pass the character encoding of your output as the third parameter.
2. An alternative exists for cases where you do not determine the HTML markup of output, for example, when aggregating content from RSS or Atom feeds, from web service API responses, from HTML emails, from user comments where HTML is allowed, or even from the output of HTML transformers such as libraries which translate BBCode, Markdown or some other intermediate format into HTML. These alternatives are usually called HTML Sanitisers or XSS Cleaners.
The first case is simple, easy to execute, and very difficult to spoof. Its main problem is that it requires foreknowledge of the character encoding of the output since HTML special characters may differ between encodings. A simple example of this encoding difference is found by comparing UTF-8 and UTF-7. Whereas UTF-8 is US ASCII compatible, UTF-7 is not. Escaping UTF-7 markup as if it were UTF-8 would cause the escaping mechanism to fail in detecting the angle brackets that enclose HTML tags, since UTF-7 can represent them as different byte sequences. Obviously, such a failure is a potential disaster – especially if your output supports a UTF-7 encoding, or if it never specifies a character encoding at all, either via a header or an HTML meta tag, since this may allow some browsers (cough…Internet Explorer) to guess the wrong encoding to use.
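A minimal sketch of the first case, assuming your output is UTF-8. The helper name escape_html() is hypothetical; htmlspecialchars(), ENT_QUOTES and ENT_SUBSTITUTE are standard PHP. The point is simply that the encoding passed to the escaper and the encoding declared to the browser should agree.

```php
<?php
// Hypothetical helper (not from the original article): escape a value for
// output into HTML, with the output encoding stated explicitly.
function escape_html(string $value, string $encoding = 'UTF-8'): string
{
    // ENT_QUOTES escapes both double and single quotes.
    // ENT_SUBSTITUTE replaces invalid byte sequences with U+FFFD instead of
    // returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, $encoding);
}

// Declare the same encoding to the browser so it never has to guess.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo escape_html('<script>alert("xss")</script>'), PHP_EOL;
// prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

Wrapping the call in one helper is a design choice, not a requirement: it just makes it harder to forget the flags or the encoding at an individual call site.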
The second case is complex. There are no easy solutions or single paragraph pearls of wisdom you can rely on. Instead of a simple function, you need a library of code capable of parsing HTML and handling character encoding differences. Then you need a friendly API so programmers aren’t buried in the complexity of the task, a whitelist and whitelist limiter to defend against misconfiguration, knowledge of every HTML standard since the dawn of time and up-to-the-minute advice on emerging HTML quirks across all browsers (even the ones you think are no longer used). After that, you are not done. You’re only beginning. You’re going to need a parser and lexer, a character encoding handler, an HTML tidier so you don’t break stuff, a posse of XSS wizards to tell you when you’re failing, and enough unit tests that if Sebastian Bergmann knew what you were doing, even he would jump out the nearest window screaming. Even then – you probably missed something.
The second case is complex. Horribly complex. Yes, I’m repeating myself.
PHP developers have responded to this complex second case by creating HTML Sanitisers/XSS Cleaners. They pored over the theory behind Cross-Site Scripting like ants. Yes, their combined intellect was activated, bringing the force of the PHP language to bear in ensuring all programmers are sufficiently protected. One unusually bright spark, Edward Z. Yang, succeeded and HTMLPurifier was born. An open source library which quashed XSS with the kind of force usually reserved for inter-galactic warfare.
The Truth Is Out There…If You Bother Looking
But HTMLPurifier was not alone. There were others.
People began to realise even before HTMLPurifier turned up that XSS used an interesting tactic. It usually added a string into an even bigger string. It was a realisation of epic proportions because it emerged that PHP had support for removing strings from even bigger strings. This marvellous invention was known as Perl Compatible Regular Expressions (or Regex for short). Regex based HTML sanitisers were a boon to the PHP community. They offered a fast means of stripping HTML of naughty XSS injections, leaving behind safe and pristine HTML suitable for output. These solutions thrived and became an essential tool in every PHP programmer’s toolbox. To this day, they are universally popular and can be found in applications, libraries and frameworks used by thousands of PHP programmers every day.
Regex based HTML Sanitisers do have one teeny tiny little wrinkle. They don’t work.
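To make the problem concrete, here is a deliberately naive filter of the kind described above. It is a hypothetical illustration written for this post, not code taken from any of the libraries I examined, and the function name naive_strip_scripts() is made up. It shows two of the simplest ways a "remove the bad string" approach falls over.

```php
<?php
// Hypothetical naive filter, for illustration only. It is NOT taken from
// any real sanitiser.
function naive_strip_scripts(string $html): string
{
    // Remove <script>...</script> blocks, case-insensitively, across newlines.
    // preg_replace() makes a single pass over the input.
    return (string) preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
}

// 1. Nested tags: deleting the inner pair reassembles a working outer tag.
$nested = '<scr<script></script>ipt>alert(1)</script>';
echo naive_strip_scripts($nested), PHP_EOL;
// prints: <script>alert(1)</script>

// 2. No <script> tag needed at all: event handler attributes pass untouched.
$handler = '<img src="x" onerror="alert(1)">';
echo naive_strip_scripts($handler), PHP_EOL;
// prints: <img src="x" onerror="alert(1)">
```

Looping the replacement until nothing changes, or adding another pattern for on* attributes, patches these two inputs only. Every such patch is a blacklist entry, and the attacker gets to pick from everything not yet on the list.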
I’ve just spent a year researching these…things…on and off. First, I examined all the prominent examples of standalone HTML sanitisers. They all had vulnerabilities and I previously blogged about several. Next, I tried solutions integrated in applications and a few suggested alternatives. They all had vulnerabilities. Next…well, I can’t disclose those for another five weeks. But, they all had vulnerabilities too. There have been no exceptions.
I also feel it’s important to address the behaviour of those who maintain Regex based solutions. Most of those who received my reports were more than happy to fix the reported vulnerabilities. A small number drifted down the path of ignoring my reports, downplaying my reports or being openly hostile. Even though all were informed that I was offering examples and not a comprehensive list of vulnerabilities, nobody took the hint that this would require an in-depth review of their source code. As far as I am aware, all the solutions examined remain publicly available for download. All of them still call themselves HTML Sanitisers or Cleaners.
When Will We Ever Learn?
Right about this time, I would normally sign off and let you go. Not today. After my umpteenth report (this one containing eight examples of vulnerabilities), I’ve decided to end my research. I don’t regret pursuing it because, frankly, nobody else has bothered doing something similar. It needed to be done. It should have been done years ago but apparently nobody cared enough to dig around for the truth before now (myself included) or their study/articles never saw widespread attention. What started out as a quick expedition to serve my own curiosity about whether PHP needed yet another HTML sanitiser has ended up uncovering what I now suspect is a massive problem. A problem which has been ignored, sidelined and is subject to a continuing campaign of deliberate or unintentional misinformation which conceals a simple truth.
Regex based HTML sanitisation is an utter failure. A disgrace. A terrifying menace to PHP security which must, at all costs, be eradicated. Now. I don’t care if you swear blind your Regex solution works – it doesn’t and you’re either insane, ignorant or a fucking moron to believe otherwise. Your shit is useless. If you keep offering it for download and pushing its “benefits” on users, you are perpetuating a practice of inflicting security vulnerabilities on those users who trust you. I don’t believe that you are all doing this deliberately but you are going to have to start asking yourself some tough questions about what exactly you’re doing to the PHP community because of your ignorance. What is PHP? That joke of a horribly insecure language that sucks compared to everything else or a glowing example of best security practice? Pick one.
If you are a user of such a Regex based HTML Sanitiser, replace it with HTMLPurifier. HTMLPurifier is a library that is reliable, peer reviewed, uses a sound strategy and is phenomenally open about the infrequent vulnerabilities reported against it. There is no such thing as a perfectly secure solution for XSS filtering of HTML but HTMLPurifier gives that statement a run for its money. If HTMLPurifier is honestly too slow for what you need to do, then start hoping you have a dead simple use case which might scrape by with a very restrictive solution. There is no alternative choice.
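For anyone making the switch, a rough sketch of basic HTMLPurifier usage follows. It assumes the library was installed with Composer (package ezyang/htmlpurifier) and that vendor/autoload.php sits next to the script. The wrapper name purify_comment() and the particular whitelist are illustrative assumptions, not recommendations. Check the library's own documentation for the settings that suit your markup.

```php
<?php
// Assumption: HTML Purifier installed via Composer (ezyang/htmlpurifier),
// with vendor/autoload.php in the same directory as this script.
require_once __DIR__ . '/vendor/autoload.php';

// Hypothetical wrapper; the whitelist below is only an example.
function purify_comment(string $dirty): string
{
    $config = HTMLPurifier_Config::createDefault();
    // Encoding of both the input and the purified output.
    $config->set('Core.Encoding', 'UTF-8');
    // Whitelist: only these elements and attributes survive.
    $config->set('HTML.Allowed', 'p,b,i,a[href]');

    $purifier = new HTMLPurifier($config);

    return $purifier->purify($dirty);
}

// The onclick attribute and the script element are not on the whitelist,
// so neither should appear in the output.
echo purify_comment('<p onclick="evil()">Hello <b>world</b></p><script>alert(1)</script>');
```

In a real application you would probably build the HTMLPurifier object once and reuse it, since constructing the configuration on every call adds to the cost.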
In approximately five to six weeks (grace period for fixes), I’ll summarise the final set of vulnerabilities I have sent out about HTML Sanitisers.