Monday, August 9. 2010
HTML Sanitisation: The Devil's In ... Posted by Pádraic Brady
in PHP General, PHP Security, Zend Framework at
22:00
Comments (19) Trackbacks (0) Defined tags for this entry: html sanisation, htmlpurifier, phishing, php general, php security, rantings, xss, zend framework
HTML Sanitisation: The Devil's In The Details (And The Vulnerabilities)
HTML Sanitisation (defined below) has been with us for a long time, ever since the first genius who came up with the idea of allowing potentially untrustworthy third party HTML to be dynamically patched into their own markup. The years have not taken this kindly, and third-party HTML inclusion has remained one of the most complex and underappreciated vectors for security vulnerabilities.
In this article, I take a look at some of the solutions PHP developers rely upon to perform HTML Sanitisation, mostly because few others have done so or written about such solutions in any great detail (at least publicly). HTML Sanitisation has a very low profile in PHP. It's rarely mentioned and usually not understood all that well, so examining some of the solutions in this area with more deliberate attention is worth doing. Also, it's valuable research since I am writing my own HTML Sanitisation library (bias alert!) for a future Zend Framework 2.0 proposal. Knowing what the competition is up to does no harm! Finally, I was simply curious. Nobody seems too pushed to look closely at all these HTML Sanitisation solutions despite the fact that there are other developers (I think) who wouldn't touch most of them with a barge pole.

One somewhat remarkable example, just to illustrate why I figured this article was worth the time, is HTMLPurifier's Comparison analysis where HTMLPurifier is compared against a number of other HTML Sanitisers. The comparison is remarkable because it seems inclined to err on the side of giving HTMLPurifier's competitors the benefit of the doubt. Unfortunately, this means the analysis is often flawed and its conclusions suspect. Also, it assists in legitimising other solutions in the minds of readers by making assumptions of safety. Not that this reflects on HTMLPurifier's functionality, incidentally, which I have always maintained is the only HTML Sanitiser worth looking at. Back on track…

What is HTML Sanitisation?

HTML is an amazingly dangerous thing. It can contain Javascript, CSS, or malformed markup, or even gigantic images that laugh at your dual 32" monitor setup. Each of these, in its own way, can damage the experience of an end user, whether it be by Cross-Site Scripting (XSS), Phishing or simply mangling the page until it's unusable and/or defaced with scriptkiddie jibes.
There are two ways of dealing with these threats to the HTML output of an application: escaping output so that the only HTML rendered by the browser is the application's (anything else being neutered by HTML entities), or sanitising output so that any additional HTML it contains, that is renderable by a browser, is stripped of any potentially damaging markup. This article concerns the second option.

HTML Sanitisation may therefore be defined as any means of filtering HTML to ensure that a) Cross-Site Scripting (XSS) vulnerabilities are removed, b) Phishing vulnerabilities are removed, c) the HTML is well formed and adheres to an acceptable HTML standard, and d) the HTML contains no obvious means of breaking expected web page rendering. I won't claim this is a perfect definition but it covers most of the salient points you'll likely encounter.

So there are, broadly speaking, four primary objectives of HTML Sanitisation, any one of which is capable of preventing damage to end users or web application functionality (including javascript powered client side functionality). Each is, in its own way, quite a difficult proposition requiring suitable tools and specialised knowledge. However, with some objectives we can measure our success somewhat reliably. The question of this article being: how well do HTML Sanitisers in PHP measure up to these objectives?

The Candidates

Since this is intended as a brief examination (just a few million words long!), I decided to select four candidate HTML Sanitisers meeting certain conditions. These conditions included:

1. Having a release at some point in the past two years;
2. Describing itself as a HTML sanitiser/filter to prevent Bad Things;
3. Having a design clearly in line with an intent to filter XSS/Phishing; and
4. Having no publicly acknowledged long standing security vulnerabilities.

The great part about applying these conditions is that I pretty much eliminated stacks of HTML Sanitisers (as some might claim them as being).
Outside of those, it also eliminates anything users might misconstrue as a HTML Sanitiser (for example PHP's strip_tags() function or Zend Framework's Zend_Filter_StripTags class). What we are left with is pretty thin on the ground, but fits what I'd expect a reasonably educated PHP developer to swing with. From what remained, I selected four candidates (or maybe these were the only four left - I'll never tell):

1. PEAR's HTML_Safe
2. htmLawed
3. Wordpress' Kses
4. HTMLPurifier

With four candidates in tow, I proceeded to examine each against the four objectives of HTML Sanitisation I noted earlier. Just to emphasise to readers, this examination was not so in-depth as to identify every possible flaw or issue with each candidate. My intent was to attempt to locate one security vulnerability and assess each candidate in general terms for the other non-security related HTML Sanitisation objectives.

Before I go any further, let me clarify that all security vulnerabilities discovered were reported almost immediately to the parties responsible for each candidate solution. All such parties confirmed receipt of these reports within one week, and all were given a period (approx. six or more weeks to today) in which to apply fixes, make new releases, update documentation, perform additional security reviews, etc. I've sat on this article for a long time. Regardless of the effectiveness of any such actions (or lack of action as the case may be), this article discloses all perceived security vulnerabilities discovered during my examination whether or not all parties agree with my opinion. In some cases a disclosure may clearly indicate a fundamental flaw in the underlying design of the candidate.
In these cases, it was emphasised to the responsible parties that reported vulnerabilities were limited to the scope of my examination and that I believed it was likely that additional and possibly related vulnerabilities remained unreported but easily discoverable as a result of public disclosure. This concludes my rendition of "Cover Your Ass".

Note that all discussions below relate to whichever version of each candidate solution was initially examined (before any fixed releases). I have noted resolutions as necessary for each.

PEAR's HTML_Safe

To get us started, PEAR's HTML_Safe is one of the older candidates despite its examined release being in April 2010 (the previous release being a beta in 2005). HTML_Safe's description states that "This parser strips down all potentially dangerous content within HTML". HTML_Safe operates on the basis of parsing HTML with regular expressions and applying filtering logic which is dependent on predefined blacklists of potentially harmful elements, attributes and CSS properties.

Unfortunately, HTML_Safe's blacklists prove its undoing. The problem with blacklists is that they require constant attention and updates for new problems. A cursory examination of the CSS property blacklist showed that it omitted many browser specific CSS properties such as -ms-behavior. The -ms-behavior CSS property (specific to Internet Explorer 8) may contain as a value a URI reference to a locally hosted HTC file (which contains executable Javascript). While such a file would need to exist on the local domain, this is obviously a security vulnerability in that it allows the execution of any arbitrary Javascript an attacker can store or reuse on the local domain, thus opening up XSS possibilities.

HTML_Safe also has another vulnerability shared by practically all HTML Sanitisation solutions based on the use of regular expressions.
Regular expression parsing typically assumes that all HTML special characters are encoded in ASCII (an encoding subset which is common across other encodings such as ISO-8859-1 and UTF-8). However, UTF-7 encodes the greater than (>) and less than (<) characters differently. This means that typical regular expression parsing does not detect these characters when encoded in UTF-7. If you can't detect them, you can't sanitise them! This sanitisation bypass requires a secondary vulnerability where an attacker either forces a webpage containing unsanitised UTF-7 encoded markup to be rendered with a charset of UTF-7 (IE has vulnerabilities here, as do some versions of Firefox) or where the target application actually allows a user to select a nonsensical custom character set (as happened with Google and Yahoo when struck with this same exploit, and several CMS applications as recently as last Spring). Finally, HTML_Safe's blacklisting also misses out on CSS properties not directly tied to XSS (e.g. position), but which may be used to perform Phishing attacks by using CSS styling to alter or overlay HTML elements. This could, for example, allow an attacker to re-style their injected HTML to replace (as in positioning above) specific page elements (if not the entire page). This can lead to Clickjacking among other forms of Phishing. In terms of HTML well formedness, HTML_Safe does not necessarily emit standards compliant or well formed HTML, nor does it check some other common page breaking tactics such as overlarge images as a precaution. HTML_Safe may well be the least secure of our four candidates. Its use of blacklisting, its relative age, and a lack of peer review have left it woefully outdated for the task of HTML Sanitisation. In short, it should be avoided at all costs and my main request to PEAR at the time of reporting the above vulnerabilities was to seriously consider removing it from PEAR. 
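To make the UTF-7 bypass described above concrete, here is a minimal sketch. The function naiveStripScriptTags() is hypothetical and deliberately simplistic; it is not taken from HTML_Safe or any other library discussed here, and only stands in for "a filter that looks for literal ASCII angle brackets". The payload uses the standard UTF-7 forms "+ADw-" for "<" and "+AD4-" for ">".

```php
<?php
// Illustrative only: a deliberately naive regex-based tag stripper.
// The function name is hypothetical and not part of any real library.
function naiveStripScriptTags(string $html): string
{
    // Remove <script>...</script> blocks, then any remaining tags,
    // by searching for literal ASCII angle brackets.
    $html = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
    return preg_replace('#<[^>]*>#', '', $html);
}

// Markup using ordinary ASCII angle brackets is caught.
$ascii = 'Hello <script>alert(1)</script>world';

// The same markup in UTF-7: "+ADw-" is "<" and "+AD4-" is ">".
// There is not a single literal angle bracket for the regex to find.
$utf7 = 'Hello +ADw-script+AD4-alert(1)+ADw-/script+AD4-world';

$cleanAscii = naiveStripScriptTags($ascii);
$cleanUtf7  = naiveStripScriptTags($utf7);

echo $cleanAscii, PHP_EOL; // the script block has been removed
echo $cleanUtf7, PHP_EOL;  // the UTF-7 payload passes through untouched
```

The payload only becomes dangerous if the browser ends up interpreting the page as UTF-7, which is why normalising input to a known encoding and always declaring an explicit charset for the output matter as much as the filter itself.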
Personally, I find it almost tragic that a library of such limited capability may benefit from PEAR's reputation and lead to users trusting it over far more secure alternatives.

Of the issues noted above, the UTF-7 vulnerability has been resolved in a new release. The CSS blacklist has not yet been revised though I remain confident that the security advisory will go out any day now. PEAR does do security advisories, right? My recommendation, as originally suggested, remains that HTML_Safe should be removed, or overhauled, given it is not up to the task of HTML Sanitisation in its current condition. Blacklists are simply the worst approach ever to HTML sanitisation.

htmLawed

Our second candidate is, at least based on Google's omnipotence, one of the more popular open source standalone HTML Sanitisers. htmLawed has garnered a reputation as something of a rebellious spirit built for speed, a stark contrast to the slow resource intensive operation of HTMLPurifier with which it is often compared. htmLawed's description states that it is "PHP code to purify & filter HTML" and that it can "make HTML markup in text secure and standard-compliant".

The problem with htmLawed is that its operation is not much more than a very short stone's throw away from HTML_Safe. On the face of it, htmLawed is significantly more complex than HTML_Safe, being loosely based on Kses (an older HTML Sanitisation script). Its documentation is also huge, finely detailed, and packed full of options. Its source code is also complex and heavily obscured (presumably to discourage anyone from trying to examine it too closely). But...complexity and options a HTML Sanitiser does not make.

Similar to HTML_Safe, htmLawed carries no functionality to thwart character encoding based attacks. The UTF-7 exploit as described for PEAR's HTML_Safe works perfectly well. In addition, htmLawed has a number of oddities which are not vulnerabilities but may potentially become such in the future.
These are related to attempts by htmLawed to fix insecure attribute values and CSS values. I have no idea why htmLawed tries to fix them (just remove them!), but the results come very close to enabling reconstructive logic based attacks (i.e. where you can trick a fixing mechanism to reconstruct a more obvious attempt at a vulnerability into a less obvious but equally dangerous one). I can't say these are anything more than oddities however. One example is that you can get the parser to create a CSS string containing something like "exp ress ion" (expression properties allow the execution of Javascript). The spaces in the string prevent it from being used by all current browsers, but what if a new browser release decides to ignore a little whitespace? Several other oddities also exist but are either harmless or fixed in the most recent htmLawed release.

Back firmly in current reality, htmLawed's CSS filtering is seriously flawed. For example, it does not filter out the CSS behavior property which leads to the same result as it would for HTML_Safe's allowance of -ms-behavior: the execution of local domain hosted HTC files containing Javascript. And, again, htmLawed is open to the exact same Phishing and Clickjacking vulnerabilities as HTML_Safe, allowing all CSS position/height/width and other properties capable of re-styling web pages.

htmLawed is not quite as easy to fool as HTML_Safe, but both share a remarkably similar lack of attention to specific areas of HTML Sanitisation. The conclusion I've come to hold is that both of these libraries (and others in the wild) are reading from the same book. Which might sound ridiculous before you consider that these libraries all seem to revolve around the year 2005 or so - the same period when Kses was King of HTML Sanitisation. There's no originality when you write new libraries in various styles which all follow the exact same assumptions and knowledge level.

htmLawed performs tag balancing and other cleanup tasks.
Nevertheless, it will not necessarily output well formed or standards compliant HTML. You should bring ext/tidy to the party as a post-processor.

My main gripe with htmLawed, besides the above, is that it does not appear these are going to be fixed any time soon. The script's author has taken the approach that htmLawed filters HTML, and only HTML. Its CSS filtering can't be trusted and is not secure, and this IS part of HTML Sanitisation (HTML contains CSS!). Nevertheless, htmLawed documentation has been updated since my reports to clarify what developers may need to do before using htmLawed (i.e. normalise input character encoding, ensure all CSS is stripped/disabled or sanitised by another process, etc.). In short, htmLawed is not even remotely in the race as a fully featured HTML Sanitiser. It is lacking too many features, or using features which are obviously incomplete, and it is pushing too many complex responsibilities on end users who we all know will never bother doing them.

The issues noted above have not, to date, been fully resolved to my satisfaction though the updated documentation means I can't call them vulnerabilities anymore. While documentation updates and minor fixes have resulted in some improvement and clarification, it appears that fixes for the fundamental flaws in CSS sanitisation and character encoding will not be implemented until some future unscheduled major revision of htmLawed (encoding has been documented as an end user concern instead). In effect, the security vulnerabilities reported resulted not in fixes, but in pushing the fixes back to the end users to deal with and denying them as a responsibility of htmLawed. In my book, that simply removes htmLawed entirely as a usable HTML sanitiser. Pushing responsibility back to users may let you off the security vulnerability hook but it does call the entire description and goals of htmLawed into question.
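To illustrate why a whitelist fares better than the CSS blacklists criticised above, here is a rough sketch of a whitelist filter for a style attribute value. The function filterStyleWhitelist() and the list of allowed properties are assumptions made up for this example; they are not part of HTML_Safe, htmLawed, Kses or HTMLPurifier, and this is in no way a substitute for a real sanitiser. The idea is simply that anything not explicitly known to be harmless is dropped, so a property nobody thought to blacklist (behavior, -ms-behavior, position) never gets through.

```php
<?php
// Hypothetical helper for illustration; not taken from any library.
// Keeps only declarations whose property is on an explicit whitelist.
function filterStyleWhitelist(string $style, array $allowedProperties): string
{
    // Reject the whole value unless it consists solely of conservative
    // characters. Braces, parentheses, slashes, backslashes, asterisks
    // and equals signs all cause outright rejection.
    if (!preg_match('/\A[a-zA-Z0-9 :;#%.,-]*\z/', $style)) {
        return '';
    }

    $kept = [];
    foreach (explode(';', $style) as $declaration) {
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue; // not a "property: value" pair
        }
        $property = strtolower(trim($parts[0]));
        $value    = trim($parts[1]);

        // Keep the declaration only if the property is whitelisted and
        // the value is non-empty and contains no further colon.
        if ($value !== ''
            && strpos($value, ':') === false
            && in_array($property, $allowedProperties, true)
        ) {
            $kept[] = $property . ': ' . $value;
        }
    }

    return implode('; ', $kept);
}

// An example whitelist (an assumption for this sketch).
$allowed = ['color', 'font-weight', 'text-align'];

$safe     = filterStyleWhitelist('color: #c00; font-weight: bold', $allowed);
$behavior = filterStyleWhitelist('color: red; behavior: url(evil.htc)', $allowed);
$position = filterStyleWhitelist('position: absolute; top: 0; color: red', $allowed);

echo $safe, PHP_EOL;
echo $behavior, PHP_EOL;
echo $position, PHP_EOL;
```

The design choice here is to fail closed: an unknown property or an unexpected character costs the user some styling, whereas with a blacklist the same gap costs a vulnerability.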
Users are advised to ensure that they follow the documentation to the letter (read it in detail!) in disabling all CSS styling and enabling safe mode for htmLawed (sanitisation is disabled by default for this one). Using ext/iconv to do some minimal character encoding normalisation is also highly recommended. As with PEAR, there has been no formal security advisory issued, and the release notes for the latest release make no reference to any security vulnerability existing.

Wordpress (Kses)

Candidate three is Wordpress 3.0, or more specifically the Kses script bundled with Wordpress that is used internally for HTML Sanitisation. It should be noted that while the original Kses is stuck in 2005 (and should be avoided like the plague), Wordpress have heavily updated their internal copy. This creates a viable HTML Sanitisation solution which is simple to extract for personal use (it just needs a handful of extra functions borrowed from an includes file), and which may well be the de facto winner in the popularity stakes just because it's a core part of everyone's favourite blogging platform. What could possibly go wrong?

Well, surprisingly not as much as I was half expecting after my earlier candidates! Wordpress' Kses proved far more challenging than the previous two candidates. It's obvious that the widespread use of Wordpress has enforced constant peer review and improvement. Nevertheless, where there's a will (and a quiet afternoon), there's a way.

Wordpress' Kses first of all offers no protection when used as a standalone sanitiser against character encoding attacks, thus the previously described UTF-7 XSS attack works quite well. I do NOT consider this a security vulnerability since the use of Kses within Wordpress prevents manipulation of the HTML's declared charset (though presumably this protection could be mangled by some Wordpress template author).
Nevertheless, for anyone using Kses in standalone mode you should note that character encoding normalisation is necessary and should be implemented.

Fast forward three coffees, and it finally clicked that Kses did have one obvious flaw in that it uses a little blacklisting for CSS filtering. In short, it removes all CSS attribute values which contain any of the characters /, \, * and (. It's actually very clever (in that it's indiscriminately simple) since it just removes CSS using these characters altogether. Unfortunately, it never considered the use of the right-handed curly brace, }. Under all versions of Internet Explorer, in either quirks or standards mode, the right curly brace is treated as a CSS terminator (i.e. just like a semi-colon). This means that something like "position: absolute; top: 5px; left: 10px;" can be written for Internet Explorer as "position: absolute} top: 5px} left: 10px}". IE extends another unwanted helping hand... The result is that such CSS styling values are not intercepted by Kses (which relies on semi-colon terminators), and may allow for Phishing and Clickjacking attacks where Kses filtered user input is rendered to an end user from Wordpress.

After those three coffees and some head banging, I thanked my numb brain for finally giving me something and left well enough alone. Later on, I also reported to the Wordpress team that Internet Explorer in quirks mode (it doesn't work in standards mode) also accepted the equals sign in place of a colon, i.e. "position: absolute;" could be written as "position=absolute;" or even "position=absolute}". This is of far less concern than the previous issue - and more of a funny aside on IE's incredible silliness.

As with previous candidates, Kses will not always output well formed HTML nor will it check for other page breaking tactics. Again, this will require some custom checks and perhaps ext/tidy in some cases.

Wordpress' Kses is surprisingly (well, it should be!)
adept at the HTML Sanitisation game. If it bundled encoding normalisation, and packed HTML tidying and a few other bits and pieces I'd almost use it myself. Almost. At the end of the day, however, I just don't trust regular expressions enough.

The issues noted above were fixed in Wordpress 3.0.1 released in late July 2010 (and weeks beforehand in SVN). The reported Phishing issue was mysteriously absent from the list of 54 fixed issues for 3.0.1, which isn't really all that surprising (it would not have been a public issue anywhere) but it would have been nice to see a public disclosure from the vendor before I published this article.

HTMLPurifier

HTMLPurifier is the brain child (no doubt with Angelina Jolie) of Edward Z. Yang. In short it is nothing remotely like the other HTML Sanitisation candidates. It bundles a HTML parser/validator/supercomputer (rumour says it might be The Stig's brain). HTMLPurifier describes itself with "HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications."

Short answer. I completely failed to find anything that got past it. I tried. Then I tried some more. Then I tried multiple caffeine injections but to no avail. It just sat there looking smug. HTMLPurifier is seemingly impervious to pretty much anything. Worse, it refused to produce any mangled attributes, or reconstructions, or anything else I normally expect of a typical HTML Sanitisation solution. Out of four candidates, HTMLPurifier was the only one to successfully meet all four objectives of HTML Sanitisation.

Its only problem? It eats RAM and sucks CPU cycles far more than any other candidate.
A point which at least some HTML Sanitisers may note as a disadvantage since they can't hang it on anything else.

Since I cannot just leave it there, and since it is always worth noting, this does not mean that HTMLPurifier is invincible. The problem with the world of Cross-Site Scripting is that there are constantly evolving new ways to attack users. HTMLPurifier has fixed several vulnerabilities in the past, and these are publicly disclosed by the author (as security advisories even). Many of them are reported directly by discoverers of new vectors, which adds to HTMLPurifier's security in one respect since it gets such attention. The author is also quite proactive about locating and researching fixes to possible future vulnerabilities. HTMLPurifier is, quite simply, the only fully functioning and fully featured HTML Sanitiser in PHP. It literally stands alone.

What Does It All Mean?

Besides the fact that PHP developers are sheep to the slaughter? It means that HTML Sanitisation is incredibly misunderstood even by the authors of HTML Sanitisation solutions. It means that such solutions have minimal peer review by individuals who are relatively knowledgeable of security issues. It probably means that the days of combating XSS and Phishing and other concerns linked to HTML Sanitisation are far from over, at least for PHP.

Consider the nature of the vulnerabilities. None of them are particularly obscure or poorly documented in the public domain. There are several excellent sources of XSS/Phishing vectors for browsers, but it truly appears that everyone relies entirely on just one - the XSS (Cross Site Scripting) Cheatsheet. It's blatantly obvious that it's the only source for XSS vectors utilised by most solutions in whatever passes for their testing (I doubt anyone checks the ha.ckers.org forum for more either).
This is a limited exposure issue - if you focus on the same subset of possible exploits and ignore anything else, you are entirely at the mercy of a single source of knowledge that omits quite a lot (the XSS Cheatsheet doesn't contain all possible exploits - far from it!). For example, the CSS vulnerability approach for Wordpress Kses is clearly documented on the HTML5 Security Cheatsheet but not mentioned anywhere on the more commonly referenced XSS Cheatsheet. It's a very old CSS vector. In addition, many HTML Sanitisation solutions share common approaches and vulnerabilities. I was only looking for one vulnerability, but cross checks netted those in common. This is one indication of an isolated evolutionary line - everybody feeds off everybody else's work in the area with no attempt to look outside the PHP house to see what the weather is like over in Rubyville or Javatown (it looks really sunny in both incidentally) or even just to keep up to date on new resources, ideas and research. More interestingly, all this in-feeding seems to start with Kses in 2005. It's like the past six years dropped into a timeless black hole insulated from the harsh realities of Terra Firma even as tidal forces slowly tore them apart. The singularity being a library from 2005 that is no longer maintained (outside of Wordpress). What is surprising, is that this isn't surprising to me anymore. It's par for the course in PHP and that hasn't changed since the 90s. I love PHP to bits, but it's a language that is tragically short on security expertise, and where a security expert can be nothing more than someone who read a book long ago and who has no incentive to move beyond the obvious or do something extraordinary - like using Google. If this article achieves nothing more than a blip on some faraway radar forgotten the next day as the download count for HTML_Safe, htmLawed, and all the other solutions I haven't examined, goes unchanged then at least I got a blip. It's a start. 
Maybe it's a few less XSS reports on Bugtraq or a few less people choosing an insecure solution because it claims to be fast and routinely misleads people about its efficacy in between taking senseless potshots at the only reliable solution out there. Maybe a few more developers will question all these miraculous HTML sanitisers and stop believing their summary descriptions at face value.

Developers need to start using their heads when it comes to selecting any security related solution. It's truly amazing to see developers recommend something purely on the basis of speed, or to watch them argue against all logic in support of an insecure option and just plain ignore the security implications (or assume there are none, or point at an RFC and misquote/misread it). The whole point of a security related solution IS security. If you want to compromise security to gain performance then at least be brutally honest about what it entails for the applications you build.

Do yourself a favour, use HTMLPurifier. And Ambush Commander, update the fracking comparison page already!

Friday, October 31. 2008
Example Zend Framework Blog ... Posted by Pádraic Brady
in Irishisms, PHP General, PHP Security, Zend Framework at
18:16
Comments (16) Trackback (1) Defined tags for this entry: design patterns, htmlpurifier, irishisms, maugrim, php general, php security, tutorial, zend framework
Example Zend Framework Blog Application Tutorial: Parts 1-8 Revisited
By now many readers are aware of the all-consuming mega tutorial I've been writing illustrating one method of writing a blog application with the Zend Framework. What started initially as a possible book project switched over to a more open process of blog posts with a future PDF version as a standard reference project.
Back during the Summer I had to put the series on hold, but it's time to get kicking again. So to spark the revival here's a quickie tour to allow readers to catch up on my intentions since Parts 1 to 8 were originally published. If you're looking for the series, here are the relevant links! Subsequent parts will follow soon.

Edit: Part 9 will be reposted within a few days. I located my text file backup of the original (preferable to relying on a third party snapshot sans corrections).

Part 1: Introductory Planning
Part 2: The MVC Application Architecture
Part 3: A Simple Hello World Tutorial
Part 4: Setting the Design Stage with Blueprint CSS Framework and Zend_Layout
Part 5: Creating Models with Zend_Db and adding an Administration Module
Part 6: Introduction to Zend_Form and Authentication with Zend_Auth
Part 7: Authorisation with Zend_Acl and Revised Styling
Part 8: Creating and Editing Blog Entries with a dash of HTMLPurifier

My plans for the revival require a short detour to the original articles. It was always my intention to run a multi-step process. First I would write the code as quickly as possible. Secondly I would create a blog post which bumps the maximum entry size for Serendipity. Third I would transfer the entry into a longer more detailed Docbook format for easy transfer to HTML and PDF. Fourthly I would make the new formats available on a donation funded website. Plans being plans - sometimes they go awry or get delayed. That's what happened over the Summer when my time was occupied, too occupied to even keep my blog online! Over the next month I will polish the first 8 entries, and proceed with the final chapters.
I have a rough plan of what future parts remain (more than 1, less than 100). Obviously a lot happens in programming over the course of months (sometimes weeks) so for every part you can currently read, you can expect an updated version which reflects my own, and others', growing body of habitual programming practices where the Zend Framework is concerned. One debatably major update will be to polish areas I simplified too much (like setting up database connections) since at the end of the day, simplification has led to subtle inefficiencies (like database connections loading even if a database is not required).

Monday, February 11. 2008
Zend_Feed: Getting Started With ... Posted by Pádraic Brady
in PHP General, PHP Security, Zend Framework at
15:55
Zend_Feed: Getting Started With Aggregating RSS/Atom Content
One of the components I spent some time working with recently was Zend_Feed, which was interesting at first, then a little irritating, and eventually compliant. In this entry I explore Zend_Feed from the perspective of someone aggregating RSS and Atom feeds with a view to building a database of uniquely identified content for later presentation in a "Planet" style application.
My first overall assessment is that Zend_Feed needs work. It is a wonderful component that can simplify your life immeasurably, but it's up against competition from third party libraries like MagpieRSS which do a better job at the one malfeasant facet of blog feeds: invalid, malformed and non-standard RSS and Atom XML. What Zend_Feed needs to breach the threshold of usability is more ability to handle the various problems you meet parsing RSS and Atom to a common range of data. It also badly needs improved documentation with a focus on examples of real use - for example I can't find a single documentation snippet or blog tutorial showing how to get an entry's actual HTML content using Zend_Feed which is problematic, possibly symptomatic, and as you'll see unintuitive.

Don't feel too disheartened - it's still a powerful package with sufficient utility to get you well on your way. This tutorial is intended to cover some basics to get you started. In fact all we create here is a simple command line script to aggregate content frequently (e.g. just set up cron to run it every hour or so) into a database for later presentation.

Setting Up Database And Models

So what common data do I want for each entry in an Atom or RSS feed?
    CREATE TABLE IF NOT EXISTS `blog` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `url` tinytext collate utf8_unicode_ci NOT NULL,
      `feedurl` tinytext collate utf8_unicode_ci NOT NULL,
      `title` tinytext collate utf8_unicode_ci NOT NULL,
      `author` tinytext collate utf8_unicode_ci NOT NULL,
      `modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

You can insert into this table as you wish. One row per blog you intend aggregating. Here's some sample data:

    url: http://blog.astrumfutura.org
    feedurl: http://blog.astrumfutura.com/feeds/index.rss2
    title: Maugrim The Reaper's Blog
    author: Pádraic Brady

The other fields update without manual intervention. A database table for entries may be something like:

    CREATE TABLE IF NOT EXISTS `entry` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `blog_id` int(11) NOT NULL DEFAULT '0',
      `guid` tinytext collate utf8_unicode_ci NOT NULL,
      `title` tinytext collate utf8_unicode_ci NOT NULL,
      `url` tinytext collate utf8_unicode_ci NOT NULL,
      `description` text collate utf8_unicode_ci NOT NULL,
      `date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
      `creator` tinytext collate utf8_unicode_ci NOT NULL,
      `content` text collate utf8_unicode_ci NOT NULL,
      `modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      `hash` varchar(32) collate utf8_unicode_ci NOT NULL,
      PRIMARY KEY (`id`),
      FULLTEXT KEY `search` (`title`,`description`,`content`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Now that we have database tables to work with, we can use Zend_Db to create some Models.
We'll store these in the usual ./application/models directory, in keeping with the default Zend Framework directory structure, as Blog.php and Entry.php:

Blog.php

    class Blog extends Zend_Db_Table
    {
        protected $_name = 'blog';
    }

Entry.php

    class Entry extends Zend_Db_Table
    {
        protected $_name = 'entry';
    }

The Aggregator Script Foundation

The aggregator script is a pretty simple one executed by php on the command line. Here's a basic foundation to start from, stored at "./scripts/Zend/Aggregate.php":

    // get root directory for app
    $root = dirname(dirname(dirname(__FILE__)));

    // set the include path for this script
    set_include_path(
        $root . '/library' . PATH_SEPARATOR // The ZF
        . $root . '/application/models' . PATH_SEPARATOR // Models
        . $root . '/vendor' . PATH_SEPARATOR // Other
        . get_include_path()
    );

    // setup autoloading
    require_once 'Zend/Loader.php';
    Zend_Loader::registerAutoload();

    // get initial required non-ZF classes
    require_once 'Blog.php';
    require_once 'Entry.php';
    require_once 'HTMLPurifier.php';

    class Zend_Aggregate
    {

        // bootstrap this class and commence aggregation
        public static function main()
        {
            // Mini-Bootstrap
            $config = new Zend_Config_Ini(
                dirname(dirname(dirname(__FILE__))) . '/config/config.ini',
                'general'
            );
            $db = Zend_Db::factory($config->db->adapter, $config->db->toArray());
            $db->query("SET NAMES 'utf8'");
            Zend_Db_Table::setDefaultAdapter($db);

            // Models
            $blogTable = new Blog;
            $entryTable = new Entry;

            // Aggregator Startup
            $aggregator = new self;
            $aggregator->aggregate($blogTable, $entryTable);
        }

        public function __construct() {}

        public function aggregate(Blog $blogTable = null, Entry $entryTable = null)
        {
            if (!$blogTable) {
                $blogTable = new Blog;
            }
            if (!$entryTable) {
                $entryTable = new Entry;
            }
            $blogs = $blogTable->fetchAll();

            $client = new Zend_Http_Client;
            $client->setConfig(
                array('timeout'=>30)
            );

            foreach ($blogs as $blog) {
                try {
                    $feed = Zend_Feed::import($blog->feedurl);
                } catch (Zend_Feed_Exception $e) {
                    echo "Failed importing feed: {$e->getMessage()}\n";
                    exit();
                }
                foreach ($feed as $item) {
                    $entryData = $this->parseEntry($item, $blog);
                    $this->_syncToDatabase($entryData, $blog, $entryTable);
                }
            }
        }

        protected function parseEntry(Zend_Feed_Entry_Abstract $item, Zend_Db_Table_Row $blog) {}

        protected function _syncToDatabase() {}

    }

    // execute the main() method
    Zend_Aggregate::main();

The source code above is not difficult. We use a static main() method call at the end of the file to initiate the process. The main() method sets up items like configuration, the database connection, and Model instantiation, then creates a new instance of Zend_Aggregate and calls aggregate() on it.

Inside Zend_Aggregate::aggregate() we get a list of all blogs stored in the database. For each of these blogs, we import the feed from its stored feed URL. The result of Zend_Feed::import() can be iterated across to get each individual entry or item. We pass each entry/item to the Zend_Aggregate::parseEntry() method - which bears overall responsibility for what to do next.

Using Zend_Feed to get common data for RSS/Atom entries

So what would parseEntry() do?
Well, its main role is to parse the feed and assemble a collection of common data we require - see the list above ;). So it is also responsible for ensuring we get that data in a common format regardless of whether the source was RSS or Atom. Here's the first thing we put in parseEntry():

    $dom = $item->getDOM();

Yes, resolving the differences between sources into a common format will often require that we use DOM. This gets interesting later since Zend_Feed appears to do two completely inconsistent things: it strips several XML namespaces from the source (particularly "content" for RSS), but leaves in most others (e.g. "dc" for RSS). There's an alternative access method for namespaced elements which is plain weird, but which Nick Halstead reminded me of.

Let's hit our list though. First item is getting some sort of unique id for each entry - usually it's a URL reference. RSS and Atom don't agree here, but both have some sort of id. For RSS it's the "guid" element, while Atom has an "id" element. So into Zend_Aggregate::parseEntry() we add:

    // get a unique id
    $guid = '';
    if ($item->guid()) {
        $guid = $item->guid();
    } elseif ($item->id()) {
        $guid = $item->id();
    } else {
        $guid = $item->link();
    }

So if we can't find one of those IDs, we'll just use the item's URL, which is probably just about enough. In most cases though there will be an ID.

Next up we need a title. Thankfully, both RSS and Atom at least agree on this:

    // fetch a title
    $title = '';
    if ($item->title()) {
        $title = $item->title();
    } else {
        $title = $blog->title;
    }

In case I was wrong, I'll just insert the blog title instead as a placeholder.

Next up is a description. Atom rarely has something for this...in which case we'll just take the entry title.
    // get a description or similar
    $description = '';
    if ($item->description()) {
        $description = $item->description();
    } else {
        $description = $title;
    }

If you're following this, note that all single elements can have their enclosed values retrieved by simply calling "$item->element()" as a method. Calling "$item->element" (as a property) actually returns another Zend_Feed_Element object which we can use to traverse into element children if required.

Now for some HTML content. The problem with the content is that it comes in two forms. Content as HTML in RSS is encoded (think something like htmlentities()) while the content in Atom is merely enclosed in a CDATA block. So firstly, if it's RSS we can decode it before use. The second part is ensuring the HTML is relatively safe from XSS, and finally that it all follows the same HTML standard. I've delegated all of this mind-boggling decision making to the excellent HTMLPurifier library.

Now the interesting thing about content is that in RSS it's enclosed by a "content:encoded" element, i.e. it uses the "content" namespace with a URL of "http://purl.org/rss/1.0/modules/content/". Zend_Feed handles this by stripping out the namespace so we're left with an "encoded" element. Atom on the other hand just has a "content" element to start with (the HTML in a CDATA block).

    // normalise content
    $contentOriginal = '';
    $content = '';
    if ($item->encoded()) {
        $contentOriginal = html_entity_decode($item->encoded(), ENT_QUOTES, 'UTF-8');
    } elseif ($item->content()) {
        $contentOriginal = $item->content();
    }

    // Purify and normalise content to XHTML 1.0 Transitional
    $purifier = new HTMLPurifier();
    $content = $purifier->purify($contentOriginal);

We also retain the original content unchanged. Later we use an md5 hash of it to detect changes (e.g. maybe the author edits their entry) and update our entries in the database.
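To make the two ideas above concrete without needing Zend_Feed or HTMLPurifier installed, here is a minimal sketch in plain PHP. The helper name normaliseRawContent() and the sample strings are made up for illustration; only html_entity_decode() and md5() are real calls taken from the snippet above. It assumes the RSS content arrives entity-encoded and the Atom content arrives as raw HTML, as described in the text.

```php
<?php
// Hypothetical helper (not part of Zend_Feed): bring entry content to raw HTML.
// $isEntityEncoded mirrors the RSS case above; Atom content is used as-is.
function normaliseRawContent($raw, $isEntityEncoded)
{
    if ($isEntityEncoded) {
        return html_entity_decode($raw, ENT_QUOTES, 'UTF-8');
    }
    return $raw;
}

// Sample payloads, invented for this example
$rssContent  = '&lt;p&gt;Hello &quot;world&quot;&lt;/p&gt;';
$atomContent = '<p>Hello "world"</p>';

$fromRss  = normaliseRawContent($rssContent, true);
$fromAtom = normaliseRawContent($atomContent, false);

// Both forms should now be the same HTML string
var_dump($fromRss === $fromAtom);

// Change detection: hash the unpurified original and compare it with the
// hash kept from the previous aggregation run
$storedHash = md5($fromRss);
$edited     = '<p>Hello "world", edited</p>';
var_dump(md5($edited) !== $storedHash);
```

Hashing the original rather than the purified content means a change in purifier configuration does not make every stored entry look edited.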
Next, we need a URL to the specific entry we're parsing here:

    // fetch entry item link (adjust if href holds it)
    $link = '';
    if ($item->link()) {
        $link = $item->link();
    } else {
        $links = $dom->getElementsByTagName('link');
        // guard against entries with no link element at all
        if ($links->length > 0) {
            $link = $links->item(0)->getAttribute('href');
        }
    }

Since some RSS feeds like Planet-PHP's include the link in a href attribute, rather than the link element's nodeValue, we need to do a few acrobatics with DOM just in case.

Getting the author or creator's name is another pain in the ass. RSS 2.0 often uses a dc:creator element, with a "dc" namespace with URL of "http://purl.org/dc/elements/1.1/". Unlike its earlier stripping of the namespace from RSS content:encoded, Zend_Feed doesn't do it here at all...:(. How naughty! So we have to parse out the dc elements ourselves assuming they exist, or otherwise search for an author or creator element ourselves, remembering that for Atom it's actually a name element child of author... I feel your confusion ;).

    // get the author name
    $author = '';
    $creators = $dom->getElementsByTagNameNS(
        'http://purl.org/dc/elements/1.1/', 'creator'
    );
    // item(0) is null when the feed has no dc:creator element
    $creator = $creators->length > 0 ? $creators->item(0)->nodeValue : '';
    if ($creator) {
        $author = $creator;
    } elseif ($item->author() && is_string($item->author())) {
        $author = $item->author();
    } elseif ($item->author->name()) {
        $author = $item->author->name();
    } else {
        $author = $blog->author;
    }

If all else fails, we'll just take the author name from the blog database entry.

In the above I'm using the DOM to extract a dc:creator value. Here's another Zend_Feed snippet of a shortcut - if the DOM is not your thing, you can instead get the value of $creator using:

    $dccreator = strval($item->{'dc:creator'});
    if ($dccreator && !empty($dccreator)) {
        $author = $dccreator;
    }

I love shortcuts - but here it's almost, if not quite, worse than using the DOM. You still need to strval() the resulting Zend_Feed_Element value returned, check if it's empty or not, and only then assign it.
Still, using this without DOM can be an advantage if DOM is one of those less than familiar extensions. To be honest, I think Zend_Feed really desperately needs a simple get() method to centralise value fetching without all these gymnastics...

Let's not forget a published date!

    // get a publication date and normalise
    $date = '';
    $dcdates = $dom->getElementsByTagNameNS(
        'http://purl.org/dc/elements/1.1/', 'date'
    );
    // item(0) is null when the feed has no dc:date element
    $dcdate = $dcdates->length > 0 ? $dcdates->item(0)->nodeValue : '';
    if ($dcdate) {
        $date = $dcdate;
    } elseif ($item->pubDate()) {
        $date = $item->pubDate();
    } elseif ($item->published()) {
        $date = $item->published();
    } elseif ($item->created()) {
        $date = $item->created();
    } elseif ($item->updated()) {
        $date = $item->updated();
    } elseif ($item->modified()) {
        $date = $item->modified();
    }
    $date = $this->_normaliseDate($date);

Yeah, dates are worse than content sometimes... We'll need to normalise the date from the differing RSS and Atom forms.

    protected function _normaliseDate($date)
    {
        // replace the ISO 8601 "T" separator with a space
        $date = preg_replace("/([0-9])T([0-9])/", "$1 $2", $date);
        // turn a "+01:00" style offset into "+0100"
        $date = preg_replace("/([\+\-][0-9]{2}):([0-9]{2})/", "$1$2", $date);
        $time = strtotime($date);
        // treat dates more than an hour in the future as "now"
        if (($time - time()) > 3600) {
            $time = time();
        }
        $date = gmdate("Y-m-d H:i:s O", $time);
        return $date;
    }

That's another method for you to add to the class... Moving right along, we have two final parseEntry() parts:

    // get a unique content hash to detect future content changes
    $hash = '';
    $arrayContent = array($title, $contentOriginal, $link);
    $stringContent = implode(' ', $arrayContent);
    $hash = md5($stringContent);

    // put together result object
    $result = new stdClass;
    $result->guid = $guid;
    $result->blog_id = $blog->id;
    $result->title = $title;
    $result->url = $link;
    $result->description = $description;
    $result->date = $date;
    $result->creator = $author;
    $result->content = $content;
    $result->hash = $hash;
    return $result;

You could return an array instead, I just like objects. ;) The rest of the tutorial after the jump!
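To see what the date normalisation above does to typical input, here is a small standalone sketch. The function normaliseDate() is a copy of the _normaliseDate() logic pulled out of the class so it runs on its own, and the two sample date strings are invented for illustration: one in the RFC 822 style RSS uses, one in the ISO 8601 style Atom uses, both describing the same instant.

```php
<?php
// Standalone copy of the _normaliseDate() logic, so it can be run without
// the rest of the aggregator class.
function normaliseDate($date)
{
    // "2008-02-11T16:55:00" becomes "2008-02-11 16:55:00"
    $date = preg_replace('/([0-9])T([0-9])/', '$1 $2', $date);
    // "+01:00" becomes "+0100"
    $date = preg_replace('/([\+\-][0-9]{2}):([0-9]{2})/', '$1$2', $date);
    $time = strtotime($date);
    // dates more than an hour in the future are clamped to the current time
    if (($time - time()) > 3600) {
        $time = time();
    }
    return gmdate('Y-m-d H:i:s O', $time);
}

// Made-up sample dates for the same moment in time
$samples = array(
    'rss'  => 'Mon, 11 Feb 2008 15:55:00 +0000',
    'atom' => '2008-02-11T16:55:00+01:00',
);

foreach ($samples as $format => $raw) {
    echo $format, ': ', normaliseDate($raw), "\n";
}
```

Both lines print 2008-02-11 15:55:00 +0000, since gmdate() always reports the time in UTC whatever offset the feed supplied.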
Continue reading "Zend_Feed: Getting Started With Aggregating RSS/Atom Content"


