PHP, Zend Framework and Other Crazy Stuff
HTML Sanitisation Benchmarking With Wibble (ZF Proposal)
Jul 8th
In January of this year, I had the idea of writing an HTML Sanitiser for PHP. Why not? All PHP has is HTMLPurifier and a bunch of random solutions that are about as secure as the average wooden gate. If you think that’s harsh, wait for my next blog post. HTMLPurifier is the only secure-by-default HTML Sanitiser in PHP. Fact. But the darn thing is gigantic and slow. That has never stopped me using it (for years), even if I had to do a little funky engineering so I could minimise the performance hit. Other developers, however, have often abandoned HTMLPurifier, falling into the trap of believing that alternative solutions will serve them just as well.
That’s the state of HTML Sanitisation in PHP – pick a big slow library that crushes Cross-Site Scripting and Phishing attacks, or use yet another regular expression based sanitiser that a) barely manages a fraction of HTMLPurifier’s features and b) can probably be exploited by any scriptkiddie working with a stack of data cards. It says an awful lot about security standards among PHP developers that such delusions are uncomprehendingly rampant.
In case you haven’t noticed, I’m biased. Sue me.
I have opined since forever that regular expression sanitisers are nothing short of insane. Since the problem with HTMLPurifier is speed and size, I started thinking about ways to build something like HTMLPurifier that was fast, small and almost as feature packed as HTMLPurifier. At first, this sounds like an impossible task. The typical suggestion is to use regular expressions, but I’m not completely insane…yet. Instead I borrowed a concept called a DOM Filter and chucked in a helpful dose of HTML Tidy. The result was Wibble.
Wibble is basically a DOM Filter. It loads up HTML into PHP DOM, applies a set of filters against all nodes in the DOM, passes the output through HTML Tidy, and then hands it back to the user – sanitised and well-formed. It’s almost stupid in its obviousness. Better, this allows Wibble to skip regular expression dependence. It operates far more like HTMLPurifier by relying on a DOM representation (no string parsing to funk around with) partnered with Tidy for cleanup.
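To make that pipeline concrete, here is a rough sketch of the DOM filter idea using PHP’s DOM and Tidy extensions. It is not Wibble’s actual code or API: the function names (sketch_sanitise, sketch_filter_node), the whitelist handling and the decision to strip every attribute are assumptions made purely for illustration, and encoding normalisation is left out entirely.

[geshi lang=php]
<?php
// Hypothetical sketch of a DOM filter. NOT Wibble's real implementation.

// Walk the children of $node, keeping only whitelisted tags.
function sketch_filter_node(DOMNode $node, array $allowedTags)
{
    // Copy the child list first because the tree is modified while walking it.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if (!($child instanceof DOMElement)) {
            if (!($child instanceof DOMText)) {
                $node->removeChild($child); // comments, processing instructions, etc.
            }
            continue; // text nodes are kept as they are
        }
        $name = strtolower($child->nodeName);
        if ($name === 'script' || $name === 'style') {
            $node->removeChild($child); // drop the element and its contents
            continue;
        }
        sketch_filter_node($child, $allowedTags); // filter descendants first
        if (in_array($name, $allowedTags, true)) {
            // Whitelisted tag: keep it, but strip every attribute (assumption).
            while ($child->attributes->length > 0) {
                $child->removeAttributeNode($child->attributes->item(0));
            }
        } else {
            // Not whitelisted: keep the children, drop the tag itself.
            while ($child->firstChild) {
                $node->insertBefore($child->firstChild, $child);
            }
            $node->removeChild($child);
        }
    }
}

function sketch_sanitise($html, array $allowedTags = array(), $runTidy = true)
{
    if (trim($html) === '') {
        return '';
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // malformed input is expected
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing ended up in the document body
    }
    sketch_filter_node($body, $allowedTags);

    // Serialise the filtered fragment (the children of <body>).
    $output = '';
    foreach ($body->childNodes as $child) {
        $output .= $dom->saveHTML($child);
    }
    // Final cleanup pass through Tidy, if the extension is loaded.
    if ($runTidy && function_exists('tidy_repair_string')) {
        $output = tidy_repair_string($output, array('show-body-only' => true), 'utf8');
    }
    return $output;
}

// Strict by default: an empty whitelist leaves only text behind.
$dirty = '<p onclick="x()">Hi <b>there</b><script>alert(1)</script></p>';
echo sketch_sanitise($dirty), "\n";
echo sketch_sanitise($dirty, array('p')), "\n";
[/geshi]

The point of the sketch is the shape of the process: parse once, make every decision against DOM nodes rather than strings, and leave well-formedness to Tidy at the end.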
Of course, there have to be regular expressions somewhere. And whitelists. And other stuff. Wibble is really an amalgamation of borrowed concepts. It’s hard to be too original in HTML Sanitisation because originality is a good way to shoot yourself in the foot (hence regex is EVIL!), so I wasn’t going to spend too long digging my own grave when there is a wealth of sanitisation resources in the programming world. Wibble’s approach borrows elements from Ruby’s loofah, Python’s HTML5Lib, and Java’s AntiSamy. Wibble mixes and matches from the useful design elements each of these offers, serving them up on top of PHP’s DOM and Tidy extensions with its own distinctive twists.
I completed the first Wibble prototype recently, so I figured that with something that was at that 90% point where the remaining 10% would be in-depth sanity testing, cleanup and documentation, it was time to see how it compared to some other PHP solutions (HTMLPurifier and HtmLawed). I had some fairly conservative performance objectives so the results came as a pleasant surprise.
If you are a benchmark fiend, you can download and independently fiddle with my benchmark process from http://github.com/padraic/wibble-benchmarks. Note that the current benchmark uses a Wibble prototype – there are additional elements that need to be added over time. The benchmark currently uses three sample snippets of HTML: Small (blog comment size), Medium (markup heavy with limited textual content), and Big (markup light with lots of textual content). It operates by filtering each HTML sample 200 times with each benchmarked HTML sanitisation solution. Each iteration includes the instantiation and setup phases of each solution (where relevant) to reflect the most likely real world experience of using sanitisation as a once off (non-repeating in same request) process. I use PEAR’s Benchmark package to record the aggregate run time per loop of sanitisation tasks. All operations occur within one single PHP process with HTMLPurifier caching enabled (Wibble and HtmLawed do not use caching). Each solution is configured as close as possible to target total stripping of all HTML from the content.
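The loop at the heart of that methodology is simple enough to sketch. The version below uses microtime() instead of PEAR’s Benchmark package, and strip_tags() stands in for a real sanitiser. Both substitutions, and the sketch_benchmark name, are my own assumptions for illustration and are not taken from the benchmark repository.

[geshi lang=php]
<?php
// Hypothetical timing loop: setup is repeated on every iteration so the
// instantiation cost is counted, as it would be for a once-off sanitisation.
function sketch_benchmark($factory, $html, $iterations = 200)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $sanitiser = call_user_func($factory); // instantiate and configure
        call_user_func($sanitiser, $html);     // filter the sample
    }
    return microtime(true) - $start;           // aggregate seconds for the loop
}

// Stand-in "sanitiser": a factory returning a callable built on strip_tags().
$factory = function () {
    return function ($html) {
        return strip_tags($html);
    };
};

$seconds = sketch_benchmark($factory, '<p>Nice <b>post</b>!</p>');
printf("200 iterations took %.4f seconds\n", $seconds);
[/geshi]

Swapping the factory for one that builds a real sanitiser keeps the setup cost inside the timed loop.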
You can view a sample result at http://gist.github.com/468426.
The results show that both Wibble and HtmLawed outperform HTMLPurifier by a very wide margin. Wibble underperforms HtmLawed by a variable margin – from twice as slow on small to medium sized input, to four times slower on large inputs with minimal HTML tags. In Wibble’s slowest benchmark, it outperformed HTMLPurifier by a factor of four.
Wibble’s intent is to try and replicate the completeness of HTMLPurifier, so its speed deficit when compared to HtmLawed is expected (when stripping all tags). There is not a lot to be done to improve this specific benchmark result since Wibble does a lot of stuff behind the scenes like encoding normalisation, DOM manipulation and HTML tidying. It also does all three of these things far more consistently and completely than HtmLawed is capable of.
So how does Wibble match up against Big Daddy? Wibble is a prototype, so obviously it still has ground to gain in terms of features with HTMLPurifier. But on the most significant points it only has one specific problem – it’s not HTML 5 ready. Neither DOM nor Tidy supports HTML 5, though you can “pretend” it’s HTML 4.01 (or even XHTML 1.0) for HTML 5 fragments so long as you are aware Tidy will strip unsupported HTML 5 tags and attributes.
The other points are syncing up with HTMLPurifier quite nicely. Wibble will sanitise all HTML by default using strict filters (i.e. by default it strips every tag and only outputs plain text). It handles multiple encodings including conversion if necessary. It outputs standards compliant (other than HTML 5) HTML or XHTML. It fixes all the usual page breaking stuff like unclosed tags and illegal tag nesting. It is entirely reliant on whitelists and strict validation rather than blacklists and loose reconstructive parsing. It includes minimal regular expression usage (only needed for attribute and CSS validation) based on regular expressions widely used and tested in other languages. While testing will (and must) continue, it has so far proven resistant to XSS and Phishing attacks. This can’t be absolutely assured until sufficient testing has been performed.
Otherwise, it will be interesting to see the final version of Wibble. HTMLPurifier has a tough reputation to follow, but having something which can even up the odds and do it with a pronounced advantage in speed will be really nice. Well, until someone needs to install it on CentOS.
Self-Contained Reusable Zend Framework Modules With Standardised Configurators
Sep 13th
It was during last week, while writing out a draft chapter for Zend Framework: Survive The Deep End, that I found myself hitting a conceptual wall. If you are familiar with Zend Framework, you likely understand the concept of a Module in some detail. A Module is, in theory, a reusable collection of controllers, views and other classes which is packaged in its own directory for simpler copying or separate treatment in a version control system like git or subversion.
The problem I had lay in demonstrating this fabled reusability. The more I tried to, the more I found myself throwing out cautions, warnings and advice on what to avoid doing. When it came to using Zend_Application, the trend continued since Zend_Application (a great component otherwise!) is just badly documented and explained. So off went another section just to try and explain its often confusing terminology. If you read the source code it all makes sense, but if you don’t, the disconnect between the explanations and a user’s expectations is obvious.
Reusability Rules
Zend Framework developers have, for better or worse, been ignoring the potential of modules for an interminably long time. It’s not that big of a surprise given the focus of the framework has always been to present a use-at-will architecture which relies on loose coupling and independent components. Tight integration through overarching features (which don’t break the framework’s impressive orthogonality) like a command line tool or initialisation tools has long been neglected until very recently. Zend Framework 1.8 saw the long needed introduction of Zend_Application which offers standardised bootstrapping. Zend_Tool is another ongoing effort on the command line side.
The most typical example of a module in the literature is also the worst: an administration backend. It’s a logical module since it’s a completely separate system to the frontend, but it’s the worst example because it is so very rarely reusable. Not every logical separation is reusable – the two are distinct concepts. You could equally have a logical module which is itself composed of several reusable modules and one non-reusable module. By definition, an administration backend is closely tied to CRUD operations against the application’s domain model (at least to start with). Since each application will be different, the administration backend will be too.
A far better example of a reusable module is something much narrower and focused. Consider a module dealing with User Management, or Paypal IPN integration, or implementing a blog aggregator. These are each common needs which, depending on the application, may require little change from implementation to implementation. Drop them in, configure them, integrate them, and you can have them working with few issues. Unfortunately, we keep focusing our module efforts on obviously non-reusable things like administration backends. Losing sight of the potential reuse of smaller subsystems will lead us to develop them over and over again without even noticing this as a problem.
For the Zend Framework, this would be a big win. Rather than having developers re-implement commonly used web application systems it would encourage the distribution of third-party modules which would benefit from open source licensing and feedback. Imagine your next application requiring a minimal blog or integration with Paypal IPN and finding a third party module which does the trick so you can save some development time.
Achieving Reusability
When we discuss achieving reusability there are several factors and features covered when it comes to modules:
1. They are separated into their own parent directory.
2. They can apply specific configuration when accessed.
3. They require no special integration work.
4. Their classes are automatically available to the host application.
5. They are not required to contain controllers or views.
It’s not an exhaustive list. Items 1, 4 and 5 are already a reality. Zend Framework modules do live in a module directory; using Zend_Application and some conventions, their classes are autoloaded on demand; and they are not required to contain controllers and views. A module may exist which merely offers models, helpers and some default forms.
So our path to reusable modules is hampered by items 2 and 3. Modules currently don’t have on-access configuration unless we impose it through various means. This flows into integration work which is commonly needed to achieve this in the first place.
The Layout Example: Integration Through Front Controller Plugins
A simple example, taking our example administration backend (an “admin” module) is that of switching layouts. Suppose our main application uses a professional design but our administration backend uses a very simple minimal one. How do we switch layouts when the admin module is accessed so the correct layout template is applied?
An initial expectation might be to try this from our application.ini file (if using Zend_Application) using:
[geshi lang=css]; Default Module
resources.layout.layout = "default"
resources.layout.layoutPath = APPLICATION_PATH "/views/layouts"
; Admin Module
admin.resources.layout.layout = "default"
admin.resources.layout.layoutPath = APPLICATION_PATH "/modules/admin/views/layouts"[/geshi]
Ah, module configuration! This is a very common first attempt since the expectation is that a module framed configuration will kick in only for that module.
Alas, this will not work even if it looks blatantly obvious that it should (expectations again). Module configuration here is used during the bootstrapping process which occurs before a request is routed, i.e. we can’t know what module the request relates to yet because our routes are not yet applied. So any module configuration of this type is actually applied to the same resources as the previous set of settings, i.e. module configuration overwrites the main resource configuration. The example above replaces the layout and layout path across the entire application with that of the admin module. Visiting any module, including the default module, will show the last configured layout being applied no matter what module prefixing you use.
What possible good is having this confusing module configuration then? Well, it’s useful to pass custom options to your module’s bootstrap class for something. Beyond that, I can’t think of many other use cases. You could, for example, use it to register module-hosted plugins, classes, etc but that’s just as easily done without the module name prefixed to the option. Note, this is my own ignorance speaking – I haven’t seen any detailed examples using this.
In the meantime, how do we ensure the layout is only switched if the module is accessed? The above configuration won’t work, obviously. Well, we first need to know what module is being dispatched to, so it must be done after routing has taken place. The most obvious location for our switching logic is therefore a front controller plugin which implements the preDispatch() method (i.e. it’s executed just before any controller is called, giving us an opportunity to re-configure some resources like Zend_Layout).
Here’s an example plugin for this. It’s the simplest possible version – I’ve seen some examples which forget that Zend_Layout already offers a plugin we can subclass to keep things simple.
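[geshi lang=php]
class ZFExt_Controller_Plugin_LayoutSwitcher
    extends Zend_Layout_Controller_Plugin_Layout
{
    /**
     * Point the layout at the dispatched module's own layout directory.
     */
    public function preDispatch(Zend_Controller_Request_Abstract $request)
    {
        // Convention: every module keeps its layouts in <module>/views/layouts
        $this->getLayout()->setLayoutPath(
            Zend_Controller_Front::getInstance()->getModuleDirectory(
                $request->getModuleName()
            ) . '/views/layouts'
        );
        // Convention: every module names its layout template "default"
        $this->getLayout()->setLayout('default');
    }
}[/geshi]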
It’s not a perfect class – every single module must follow the convention of using the same layout path and layout name. We could also add some logic to skip the default module since this would configure it twice for no reason. But it works! When we access the “admin” module, the layout path will be set to /application/modules/admin/views/layouts and the layout template used will be default.phtml. The default module’s path will likewise reflect its original configuration.
To get this working, let’s add a new layout resource option so our custom plugin replaces the default one from Zend_Layout:
[geshi lang=css]; Default Module
resources.layout.layout = "default"
resources.layout.layoutPath = APPLICATION_PATH "/views/layouts"
resources.layout.pluginClass = "ZFExt_Controller_Plugin_LayoutSwitcher"[/geshi]
This does work, by the way. Add the following test alongside a directory _modules containing a readable subdirectory _modules/admin and it will pass.
[geshi lang=php]
class ZFExt_Controller_Plugin_LayoutSwitcherTest extends PHPUnit_Framework_TestCase
{
    protected $plugin = null;
    protected $request = null;

    public function setUp()
    {
        // Register the fixture modules so getModuleDirectory('admin') resolves
        Zend_Controller_Front::getInstance()->addModuleDirectory(dirname(__FILE__) . '/_modules');
        $this->plugin = new ZFExt_Controller_Plugin_LayoutSwitcher(
            new Zend_Layout
        );
        $this->request = new Zend_Controller_Request_Http;
    }

    public function tearDown()
    {
        Zend_Controller_Front::getInstance()->resetInstance();
    }

    public function testSwitchesLayoutNameIfAdminModuleDispatched()
    {
        $this->request->setModuleName('admin');
        // Pass the request created in setUp(), not an undefined local variable
        $this->plugin->preDispatch($this->request);
        $this->assertEquals('default', $this->plugin->getLayout()->getLayout());
    }

    public function testSwitchesLayoutPathIfAdminModuleDispatched()
    {
        $this->request->setModuleName('admin');
        $this->plugin->preDispatch($this->request);
        $this->assertEquals(dirname(__FILE__) . '/_modules/admin/views/layouts',
            $this->plugin->getLayout()->getLayoutPath());
    }
}[/geshi]
How about something different? What if our main application uses HTML 5 and our administration backend uses XHTML 1.0 Transitional? Damn, we need another plugin. Worse, this time we can’t reduce it to a convention since a doctype can be anything and we have no way of predicting it. We could set it on the module layout, but layouts are rendered last – it would still not be applied to page level templates or partials. Forms would be messed up, for example. Same goes for the character encoding of our views (messed up escaping).
So slap in another plugin to handle doctype switching, and another to handle encoding changes. Why not add another just for fun so we can handle connecting to a module’s shared database? Then there’s the case where… Alright…enough of that. The point is a simple one. We are adding custom plugins all over the place to integrate modules into our application. These plugins will not be reusable, will require editing for different modules, and will need to be rewritten between applications. We need something more structured.
Integration: Modular Pre-Dispatch Configuration
As we can see, integration efforts are tricky. Relying on custom plugins and trying to wrestle the bootstrap system into submission are a lot of trouble to go through. Zend_Application and bootstrapping may not offer us a good solution for integration, but they do give us the roadmap.
Zend_Application defines bootstrap classes which are used to initialise resources before routing takes place. Keeping it simple, we need to reconfigure resources after routing but before dispatching occurs. We may also need to initialise different resources if they are used by a module, but not the main application. What we need is something like bootstrapping that occurs after routing. After a bit of thought, we might come to the conclusion that the current Resource classes of Zend_Application could live parallel to counterparts that exist not to initialise a Resource, but to modify a pre-initialised Resource by resetting its configuration. These counterparts, which mirror Resources, are what I term Configurators (maybe not the best name).
Take a Configurator class for Layouts as an example:
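[geshi lang=php]
class ZFExt_Application_Module_Configurator_Layout
    extends Zend_Application_Resource_ResourceAbstract
{
    public function init()
    {
        // Fetch the Layout resource that normal bootstrapping already created
        $layout = $this->getBootstrap()->getResource('Layout');
        // Overwrite its settings with the options sourced from the module
        $layout->setOptions($this->getOptions());
    }
}[/geshi]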
Our Configurator actually extends from Zend_Application_Resource_ResourceAbstract demonstrating its close relationship to a Resource. However, it does not create and initialise a Resource – it merely modifies the existing one by injecting a new configuration sourced from a module.
Where does this replacement configuration come from? I’ve decided to use a simple convention. If you want a module to impose its own configuration when, and only when, it is accessed then create that configuration in a file called module.ini located at /application/modules/admin/configs/module.ini. The configuration file could be any supported format, but I’ve used the INI format for simplicity. This would look like (using the typical environmental groups):
[geshi lang=css][production]
; Standard Resource Options
resources.layout.layout = "default"
resources.layout.layoutPath = APPLICATION_PATH "/modules/admin/views/layouts"
[staging : production]
[testing : production]
[development : production][/geshi]
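If the dotted keys and the bare APPLICATION_PATH constant look like magic, the following framework-free sketch shows roughly what happens when such a file is read. It uses PHP’s own parse_ini_string(), which expands constants inside INI values, plus a small helper that unfolds dotted keys into nested arrays. The helper name (sketch_expand_dotted_keys) and the APPLICATION_PATH value are assumptions for illustration, this is not Zend_Config_Ini’s actual implementation, and section inheritance such as [staging : production] is deliberately ignored.

[geshi lang=php]
<?php
// Hypothetical value; in a real application this is defined in index.php.
define('APPLICATION_PATH', '/var/www/app');

$ini = <<<'INI'
[production]
resources.layout.layout = "default"
resources.layout.layoutPath = APPLICATION_PATH "/modules/admin/views/layouts"
INI;

// Turn array('a.b.c' => 'x') into array('a' => array('b' => array('c' => 'x'))).
function sketch_expand_dotted_keys(array $flat)
{
    $nested = array();
    foreach ($flat as $key => $value) {
        $cursor =& $nested;
        foreach (explode('.', $key) as $part) {
            if (!isset($cursor[$part]) || !is_array($cursor[$part])) {
                $cursor[$part] = array();
            }
            $cursor =& $cursor[$part]; // descend one level
        }
        $cursor = $value; // the innermost key holds the actual value
        unset($cursor);   // break the reference before the next key
    }
    return $nested;
}

// Second argument true: keep the [production] section as its own array.
$sections = parse_ini_string($ini, true);
$options  = sketch_expand_dotted_keys($sections['production']);

echo $options['resources']['layout']['layoutPath'], "\n";
[/geshi]

The nested array under the resources key has the same shape as the options each Configurator receives.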
So, we have the Configurator class, and the configuration file it will use. Let’s bind these together. We’ll start by putting in place a class whose role is to use a collection of options loaded from module.ini to instantiate and run a set of Configurator classes.
[geshi lang=php]
class ZFExt_Application_Module_Configurator
{
    /** @var Zend_Application_Bootstrap_Bootstrapper */
    protected $_bootstrap = null;

    /** @var Zend_Config */
    protected $_config = null;

    public function __construct(Zend_Application_Bootstrap_Bootstrapper $bootstrap,
        Zend_Config $config)
    {
        $this->_bootstrap = $bootstrap;
        $this->_config = $config;
    }

    public function run()
    {
        // Each key under "resources" names a Resource to reconfigure
        $resources = array_keys($this->_config->resources->toArray());
        foreach ($resources as $resourceName) {
            $options = $this->_config->resources->$resourceName;
            // e.g. "layout" maps to ZFExt_Application_Module_Configurator_Layout
            $configuratorClass = 'ZFExt_Application_Module_Configurator_' . ucfirst($resourceName);
            $configurator = new $configuratorClass($options);
            $configurator->setBootstrap($this->_bootstrap);
            $configurator->init();
        }
    }
}[/geshi]
As you can see, it is very simple. It takes a configuration, detects what Resources it applies to, instantiates relevant Configurators and executes them. It could be improved a lot by allowing for custom Resources and other such customisations but for now the basics will do nicely.
Earlier, we mentioned that the application only becomes aware of the current module when the request is routed. Therefore, to get this working we need to trigger the Configurators after routing (or prior to request dispatching). We also need to check if the current module has a module.ini file and also ensure we skip over the default module (our main application space might be reusable so this is an arguable point and probably should be allowed for).
We’ll accomplish this using a front controller plugin:
[geshi lang=php]
class ZFExt_Controller_Plugin_ModuleConfigurator
    extends Zend_Controller_Plugin_Abstract
{
    public function preDispatch(Zend_Controller_Request_Abstract $request)
    {
        $front = Zend_Controller_Front::getInstance();
        $bootstrap = $front->getParam('bootstrap');
        $moduleName = $request->getModuleName();
        // The default module keeps the application.ini configuration
        if ($moduleName == $front->getDefaultModule()) {
            return;
        }
        $moduleDirectory = $front->getModuleDirectory($moduleName);
        $configPath = $moduleDirectory . '/configs/module.ini';
        // Modules without a module.ini are simply left alone
        if (file_exists($configPath)) {
            if (!is_readable($configPath)) {
                throw new Exception('module.ini not readable for module "' . $moduleName . '"');
            }
            $config = new Zend_Config_Ini($configPath, $bootstrap->getEnvironment());
            $configurator = new ZFExt_Application_Module_Configurator(
                $bootstrap, $config
            );
            $configurator->run();
        }
    }
}[/geshi]
If you’re still with me, and can piece this story together, you achieve a workflow as follows for the admin module when accessed from any URI like http://example.com/admin. I’ve skipped steps where not relevant.
1. Normal bootstrapping is completed with the layout being initially set using the application.ini options.
2. The request is routed. The module name “admin” is set internally.
3. ZFExt_Controller_Plugin_ModuleConfigurator::preDispatch() is called before dispatching commences (getting it done before other plugins can be addressed in the future).
4. The plugin detects /application/modules/admin/configs/module.ini and loads it as a Zend_Config instance.
5. The plugin instantiates ZFExt_Application_Module_Configurator, passes it the configuration and original bootstrap, and calls the new object’s run() method.
6. The Module Configurator assesses the configuration for resource names. For each resource detected, it instantiates a Resource Configurator like ZFExt_Application_Module_Configurator_Layout.
7. The Resource Configurator is executed and applies the new configuration to the existing Layout Resource thus overwriting the original configuration.
8. Dispatching occurs – the admin module’s requested action is rendered with the correct admin layout.
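For anyone who wants to see the workflow above in motion without wiring up a full application, here is a framework-free toy model of it. Nothing in it is Zend Framework code: SketchLayout stands in for an already bootstrapped resource such as Zend_Layout, the $moduleConfigs array stands in for each module’s parsed module.ini, and sketch_pre_dispatch() plays the role of the plugin and the Configurators combined. All of these names and paths are hypothetical.

[geshi lang=php]
<?php
// Toy stand-in for a resource that was initialised during bootstrapping.
class SketchLayout
{
    public $options = array();

    public function setOptions(array $options)
    {
        // Later options overwrite earlier ones, key by key.
        $this->options = array_merge($this->options, $options);
        return $this;
    }
}

// Runs after routing and before dispatch: reconfigure existing resources
// with the options of the module that is about to be dispatched.
function sketch_pre_dispatch($moduleName, array $resources, array $moduleConfigs,
    $defaultModule = 'default')
{
    if ($moduleName === $defaultModule || !isset($moduleConfigs[$moduleName])) {
        return; // default module, or no module configuration: nothing to do
    }
    foreach ($moduleConfigs[$moduleName] as $resourceName => $options) {
        if (isset($resources[$resourceName])) {
            $resources[$resourceName]->setOptions($options);
        }
    }
}

// Normal bootstrapping with the application-wide options.
$layout = new SketchLayout();
$layout->setOptions(array(
    'layout'     => 'default',
    'layoutPath' => '/app/views/layouts',
));

// Stand-in for the parsed module.ini files, keyed by module name.
$moduleConfigs = array(
    'admin' => array(
        'layout' => array('layoutPath' => '/app/modules/admin/views/layouts'),
    ),
);

// Pretend the router resolved the request to the admin module.
sketch_pre_dispatch('admin', array('layout' => $layout), $moduleConfigs);

echo $layout->options['layoutPath'], "\n";
[/geshi]

A request routed to the default module leaves the layout untouched, which is the behaviour the module-prefixed application.ini options failed to deliver.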
By itself, this seems like a lot of trouble to go through – except what if it becomes a Zend Framework feature? All of a sudden, countless custom plugins will meet their death and be replaced by a simple configuration file!
Conclusion
The goal of this article was to highlight the problems of achieving reusable modules and implement, as a proof of concept, at least part of the solution with an eye to encouraging greater discussion of where to go from here. I, for one, would love to see this included in the Zend Framework so we can get over the trend of relying on custom plugins and evolve towards a more standardised means of configuring modules.
If you’ve enjoyed the article please do add a comment and make suggestions on what could be improved or added as a feature. If I get enough positive feedback I’ll move this into a formal proposal (preferably with a partner or two). If you’re interested in collaborating on a proposal addressing this let me know!
