PHP, Zend Framework and Other Crazy Stuff
Archive for February, 2008
Zend_Feed: Getting Started With Aggregating RSS/Atom Content
Feb 11th
One of the components I spent some time working with recently was Zend_Feed, which was interesting at first, then a little irritating, and eventually compliant. In this entry I explore Zend_Feed from the perspective of someone aggregating RSS and Atom feeds with a view to building a database of uniquely identified content for later presentation in a “Planet” style application.
My first overall assessment is that Zend_Feed needs work. It is a wonderful component that can simplify your life immeasurably, but it’s up against competition from third party libraries like MagpieRSS which do a better job at the one malfeasant facet of blog feeds: invalid, malformed and non-standard RSS and Atom XML. What Zend_Feed needs to breach the threshold of usability is more ability to handle the various problems you meet parsing RSS and Atom to a common range of data. It also badly needs improved documentation with a focus on examples of real use – for example I can’t find a single documentation snippet or blog tutorial showing how to get a entry’s actual HTML content using Zend_Feed which is problematic, possibly symptomatic, and as you’ll see unintuitive. Don’t feel too disheartened – it’s still a powerful package with sufficient utility to get you well on your way.
This tutorial is intended to cover some basics to get you started. In fact all we create here is a simple command line script to aggregate content frequently (e.g. just set up cron to run it every hour or so) into a database for later presentation.
Setting Up Database And Models
So what common data do I want for each entry in an Atom or RSS feed?
- Unique Identifier
- Author
- Title
- URL
- Publish Date
- Description
- HTML Content
Based on this, we can predict a decent Model for an Entry table in a database. More on that later, but for now, how to get this data from a feed? The first thing is to have a database or other source to get the URL for each RSS or Atom feed. I’m using a MySQL table as follows:
[geshi lang=sql]CREATE TABLE IF NOT EXISTS `blog` (
`id` int(11) NOT NULL auto_increment,
`url` tinytext collate utf8_unicode_ci NOT NULL,
`feedurl` tinytext collate utf8_unicode_ci NOT NULL,
`title` tinytext collate utf8_unicode_ci NOT NULL,
`author` tinytext collate utf8_unicode_ci NOT NULL,
`modified` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;[/geshi]
You can insert into this table as you wish. One row per blog you intend aggregating. Here’s some sample data:
url: http://blog.astrumfutura.org
feedurl: http://blog.astrumfutura.com/feeds/index.rss2
title: Maugrim The Reaper’s Blog
author: Pádraic Brady
The other fields update without manual intervention.
A database table for entries may be something like:
[geshi lang=sql]CREATE TABLE IF NOT EXISTS `entry` (
`id` int(11) NOT NULL auto_increment,
`blog_id` int(11) NOT NULL default ’0′,
`guid` tinytext collate utf8_unicode_ci NOT NULL,
`title` tinytext collate utf8_unicode_ci NOT NULL,
`url` tinytext collate utf8_unicode_ci NOT NULL,
`description` text collate utf8_unicode_ci NOT NULL,
`date` datetime NOT NULL default ’0000-00-00 00:00:00′,
`creator` tinytext collate utf8_unicode_ci NOT NULL,
`content` text collate utf8_unicode_ci NOT NULL,
`modified` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`hash` varchar(32) collate utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
FULLTEXT KEY `search` (`title`,`description`,`content`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;[/geshi]
Now that we have database tables to work with, we can use Zend_Db to create some Models. We’ll store these in the usual ./application/models directory in keeping with the default Zend Framework directory structure as Blog.php and Entry.php:
Blog.php
[geshi lang=php]class Blog extends Zend_Db_Table
{
protected $_name = ‘blog’;
}[/geshi]
Entry.php
[geshi lang=php]class Entry extends Zend_Db_Table
{
protected $_name = ‘entry’;
}[/geshi]
The Aggregator Script Foundation
The aggregator script is a pretty simple one executed by php on the command line. Here’s a basic foundation to start from stored to “./scripts/Zend/Aggregate.php”:
[geshi lang=php]// get root directory for app
$root = dirname(dirname(dirname(__FILE__)));
// set the include path for this script
set_include_path(
$root . ‘/library’ . PATH_SEPARATOR // The ZF
. $root . ‘/application/models’ . PATH_SEPARATOR // Models
. $root . ‘/vendor’ . PATH_SEPARATOR // Other
. get_include_path()
);
// setup autoloading
require_once ‘Zend/Loader.php’;
Zend_Loader::registerAutoload();
// get initial required non-ZF classes
require_once ‘Blog.php’;
require_once ‘Entry.php’;
require_once ‘HTMLPurifier.php’;
class Zend_Aggregate
{
// bootstrap this class and commence aggregation
public static function main()
{
// Mini-Bootstrap
$config = new Zend_Config_Ini(
dirname(dirname(dirname(__FILE__))) . ‘/config/config.ini’, ‘general’
);
$db = Zend_Db::factory($config->db->adapter, $config->db->toArray());
$db->query(“SET NAMES ‘utf8′”);
Zend_Db_Table::setDefaultAdapter($db);
// Models
$blogTable = new Blog;
$entryTable = new Entry;
// Aggregator Startup
$aggregator = new self;
$aggregator->aggregate($blogTable, $entryTable);
}
public function __construct() {}
public function aggregate(Blog $blogTable = null, Entry $entryTable = null)
{
if (!$blogTable) {
$blogTable = new Blog;
}
if (!$entryTable) {
$entryTable = new Entry;
}
$blogs = $blogTable->fetchAll();
$client = new Zend_Http_Client;
$client->setConfig(
array(‘timeout’=>30)
);
foreach ($blogs as $blog) {
try {
$feed = Zend_Feed::import($blog->feedurl);
} catch (Zend_Feed_Exception $e) {
echo “Failed importing feed: {$e->getMessage()}\n”;
exit();
}
foreach($feed as $item) {
$entryData = $this->parseEntry($item, $blog);
$this->_syncToDatabase($entryData, $blog, $entryTable);
}
}
}
protected function parseEntry(Zend_Feed_Entry_Abstract $item, Zend_Db_Table_Row $blog)
{}
protected function _syncToDatabase() {}
}
// execute the main() method
Zend_Aggregate::main();[/geshi]
The source code above is not difficult. We use a static main() method call at the end of the file to initiate the process. The main() method creates a new instance of Zend_Aggregate, set’s up items like configuration, database connection, and Model instantiation, and finally calls aggregate() on the new object.
Inside, Zend_Aggregate::aggregate() we get a list of all blogs stored in the database. For each of these blogs, we fetch its relevent feed URL. The result of Zend_Feed::import() can be iterated across to get each individual entry or item. On each entry/item we pass it to the Zend_Aggregate::parseEntry() method – which bears overall responsibility for what to do next.
Using Zend_Feed to get common data for RSS/Atom entries
So what would parseEntry() do? Well, it’s main role is to parse the feed and assemble a collection of common data we require – see the list above
. So it is also responsibility for ensuring we get that data in a common format regardless of whether the source was RSS or Atom. Here’s the first thing we put in parseEntry():
[geshi lang=php] $dom = $item->getDOM();[/geshi]
Yes, a lot of the resolution of source comparison and common format will likely require we use DOM. This gets interesting later since Zend_Feed appears to do two completely incoherent things. Firstly, it strips source of several XML namespaces (particularly “content” for RSS), but leaves in most others (e.g. “dc” for RSS). There’s an alternative access method for namespaced elements which is plain weird, but which Nick Halstead reminded me of.
Let’s hit our list though.
First item is getting some sort of unique id for each entry – usually it’s a URL reference. RSS and Atom don’t agree here, but both have some sort of id. For RSS it’s the “guid” element, while Atom has an “id” element. So into Zend_Aggregate::parseEntry() we add:
[geshi lang=php] // get a unique id
$guid = ”;
if ($item->guid()) {
$guid = $item->guid();
} elseif ($item->id()) {
$guid = $item->id();
} else {
$guid = $item->link();
}[/geshi]
So if we can’t find one of those IDs, we’ll just use the item’s URL which is probably just about enough. In most cases though there will be an ID.
Next up we need a title. Thankfully, both RSS and Atom at least agree on this:
[geshi lang=php] // fetch a title
$title = ”;
if($item->title()) {
$title = $item->title();
} else {
$title = $blog->title;
}[/geshi]
In case I was wrong, I’ll just insert the blog title instead as a placeholder. Next up is a description. Atom rarely has something for this…in which case we’ll just take the entry title.
[geshi lang=php] // get a description or similar
$description = ”;
if ($item->description()) {
$description = $item->description;
} else {
$description = $title;
}[/geshi]
If you’re following this, note that all single elements can have their enclosed values retrieved by simple calling “$item->element()” as a method. Calling “$item->element” (as a property) actually returns another Zend_Feed_Entry_Abstract object which we can use to traverse into element children if required.
Now for some HTML content. The problem with the content is that it comes in two forms. Content as HTML in RSS is encoded (think something like htmlentities()) while the content in Atom is merely enclosed in a CDATA block. So firstly, if it’s RSS we can decode it before use. The second part is ensuring the HTML is relatively safe from XSS, and finally that it all follows the same HTML standard. In this mind boggling decision making I’ve just delegated to the excellent HTMLPurifier library.
Now the interesting thing about content is that in RSS it’s enclosed by a “content:encoded” element, i.e. it uses the “content” namespace with a URL of “http://purl.org/rss/1.0/modules/content/”. Zend_Feed handles this by stripping out the namespace so we’re left with an “encoded” element. Atom on the other hand just has a “content” element to start with (the HTML in a CDATA block).
[geshi lang=php] // normalise content
$contentOriginal = ”;
$content = ”;
if ($item->encoded()) {
$contentOriginal =
html_entity_decode($item->encoded(), ENT_QUOTES, ‘UTF-8′);
} elseif ($item->content()) {
$contentOriginal = $item->content();
}
// Purify and normalise content to XHTML 1.0 Transitional
$purifier = new HTMLPurifier();
$content = $purifier->purify($contentOriginal);[/geshi]
We also retain the original content unchanged. Later we use a md5 hash of it to detect changes (e.g. maybe the author edits their entry) and update our entries on the database.
Next, we need a URL to the specific entry we’re parsing here:
[geshi lang=php] // fetch entry item link (adjust if href holds it)
$link = ”;
if($item->link()) {
$link = $item->link();
} else {
$links = $dom->getElementsByTagName(‘link’);
$link = $links->item(0)->getAttribute(‘href’);
}[/geshi]
Since some RSS feeds like Planet-PHP’s include the link in a href attribute, rather than the link element’s nodeValue, we need to a few acrobatics with DOM just in case.
Getting the author or creator’s name is another pain in the ass. RSS 2.0 often uses a dc:creator element, with a “dc” namespace with URL of “http://purl.org/dc/elements/1.1/”. Unlike with Zend_Feed’s previous RSS content:encoded stripping of the namespace – it doesn’t do it here at all…:(. How naughty! So we have to parse out the dc elements ourselves assuming they exist, or otherwise search for an author or creator element ourselves, remembering that for Atom it’s actually a name element child of author… I feel your confusion
.
[geshi lang=php] // get the author name
$author = ”;
$creators = $dom->getElementsByTagNameNS(
‘http://purl.org/dc/elements/1.1/’,
‘creator’
);
$creator = $creators->item(0)->nodeValue;
if($creator) {
$author = $creator;
} elseif($item->author() && is_string($item->author())) {
$author = $item->author();
} elseif($item->author->name()) {
$author = $item->author->name();
} else {
$author = $blog->author;
}[/geshi]
If all else fails, we’ll just take the author name from the blog database entry. In the above I’m using the DOM to extract a dc:creator value. Here’s another Zend_Feed snippet of a shortcut – if the DOM is not your thing, you can instead get the value of $creator using:
[geshi lang=php] $dccreator = strval($item->{‘dc:creator’});
if($dccreator && !empty($dccreator)) {
$author = $dccreator;
}[/geshi]
I love shortcuts – but here it’s almost if not quite worse than using the DOM. You still need to strval() the resulting Zend_Feed_Element value returned, check if it’s empty or not, and only then assign it. Still, using this without DOM can be an advantage if DOM is one of those less than familiar extensions. To be honest, I think Zend_Feed really desperately needs a simple get() method to centralise value fetching without all these gymnastics…
Let’s not forget a published date!
[geshi lang=php] // get a publication date and normalise
$date = ”;
$dcdates = $dom->getElementsByTagNameNS(
‘http://purl.org/dc/elements/1.1/’,
‘date’
);
$dcdate = $dcdates->item(0)->nodeValue;
if($dcdate) {
$date = $dcdate;
} elseif ($item->pubDate()) {
$date = $item->pubDate();
} elseif ($item->published()) {
$date = $item->published();
} elseif ($item->created()) {
$date = $item->created();
} elseif ($item->updated()) {
$date = $item->updated();
} elseif ($item->modified()) {
$date = $item->modified();
}
$date = $this->_normaliseDate($date);[/geshi]
Yeah, dates are worse than content sometimes… We’ll need to normalise the date from the differing RSS and Atom forms.
[geshi lang=php] protected function _normaliseDate($date)
{
$date = preg_replace(“/([0-9])T([0-9])/”, “$1 $2″, $date);
$date = preg_replace(“/([\+\-][0-9]{2}):([0-9]{2})/”, “$1$2″, $date);
$time = strtotime($date);
if (($time – time()) > 3600) {
$time = time();
}
$date = gmdate(“Y-m-d H:i:s O”, $time);
return $date;
}[/geshi]
Another method for you to append to the file…
Moving right along, we have two final parseEntry() parts:
[geshi lang=php] // get a unique content hash to detect future content changes
$hash = ”;
$arrayContent = array($title, $contentOriginal, $link);
$stringContent = implode(‘ ‘, $arrayContent);
$hash = md5($stringContent);
// put together result object
$result = new stdClass;
$result->guid = $guid;
$result->blog_id = $blog->id;
$result->title = $title;
$result->url = $link;
$result->description = $description;
$result->date = $date;
$result->creator = $author;
$result->content = $content;
$result->hash = $hash;
return $result;[/geshi]
You could return an array either, I just like objects.
The rest of the tutorial after the jump! More >
The Zend Framework and Microformats: Zend_Microformat Proposed
Feb 5th
I sometimes get stick for complaining about the Zend Framework, but it’s only because I care
.
To show some more adoration, I’ve proposed Zend_Microformat on the ZF Proposal Wiki at Zend_Microformat – Pádraic Brady. I’ve also started a proposal page for Zend_Microformat_Xfn.
I’ve had Microformats on the back burner as a suitable proposal since last Summer, and since it still seems like a long overdue proposal to round out the Zend Frameworks web services offerings, I’m shooting it out there.
If you’re not familiar with Microformats, they are a collection of data encodings generally included in websites using highly simple syntaxes in plain old HTML, XHTML or XML. One of the most prevelant these days is the XHTML Friends Network (XFN) microformat which is extremely common these days and can provide a wide field of collectable data about social network users and their network of friends and colleagues. Others you may be familiar with include hCard and hCalendar.
You can read more about Microformats at http://microformats.org/about/.
The new proposal is now open for community comments, and I’ll spend some of tomorrow adding more use cases showing how the Zend_Microformat component could be used in practice for both parsing and generating Microformatted data.
