PHP, Zend Framework and Other Crazy Stuff
PHP Applications using UTF-8 – should we believe them?
It’s not every day you see these two terms side by side. PHP has poor UTF-8 support. Well, actually it’s non-existent once you ignore the optional libraries that many shared hosts probably won’t compile into PHP. Many will know all about setting up HTML pages for UTF-8; for many it’s the latest programming mantra. Unfortunately just setting a content-type is not enough. That works completely for English webpages – and only for English webpages using unaccented Roman characters [a-z0-9].
So what is this mysterious character encoding and why is it so very important to get it right?
The problem is the ISO-8859 family (numbered up to ISO-8859-16; ISO-8859-15 was the one introduced to add the Euro symbol), ASCII, and dozens of other language-specific and custom character encodings that support native character sets. In short it’s a case of too many cooks. To unify all these variants and offer a single standard supporting all character sets (and therefore all languages) we have Unicode, and by extension the UTF-8 encoding of the Unicode character set. A UTF-8 encoded file can represent anything from standard ASCII (read English) to Hindi and more. For a multi-lingual web application this is the best thing since fire was invented – or would be if PHP supported UTF-8 and PHP developers figured out that string functions in PHP at present are doomed to failure with UTF-8.
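To make the "anything from ASCII to Hindi" point concrete, here is a minimal sketch of how a single Unicode code point becomes one to four UTF-8 bytes. The function name `codepoint_to_utf8` is made up for this illustration, and the sketch does no validation (surrogates and out-of-range values are not rejected).

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Illustration only: no validation of surrogates or values above U+10FFFF.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {            // 1 byte: identical to ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x52)), "\n";    // U+0052 "R": 52
echo bin2hex(codepoint_to_utf8(0xE1)), "\n";    // U+00E1 a-acute: c3a1
echo bin2hex(codepoint_to_utf8(0x939)), "\n";   // U+0939 Devanagari HA: e0a4b9
```

The ASCII range comes out as a single unchanged byte, which is why English-only text looks the same in ASCII, ISO-8859-1 and UTF-8.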
Consider the example of a Sourceforge page – https://sourceforge.net/users/maugrim_t_r/ . Yes, that’s my profile. However there’s a small problem. The page insists it is encoded in UTF-8 (view the source). That’s not true. You see, a simple European accented character in my first name is not displayed – it’s replaced with a ?. Why? Because it’s not UTF-8 encoded. Ha! So much for internationalisation on Sourceforge. It claims to be UTF-8 – but it’s obviously not UTF-8. The truth is that only parts (if any) are actually UTF-8. The rest is most likely being handled as ISO-8859-1 or plain ASCII, i.e. English.
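The mismatch is easy to reproduce. The sketch below assumes the accented name arrives as ISO-8859-1 bytes, checks whether those bytes are well-formed UTF-8, and re-encodes them. Both function names are hypothetical; the validity check relies on `preg_match()` returning false when the subject is not valid UTF-8 under the `u` modifier.

```php
<?php
// Returns true when $bytes is well-formed UTF-8. With the u modifier PCRE
// validates the subject first, and preg_match() returns false on bad input.
function looks_like_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8 without needing
// iconv or mbstring. Every Latin-1 byte value equals its Unicode code point.
function latin1_to_utf8($bytes)
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        $out .= ($b < 0x80)
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

$latin1 = "P\xE1draic";               // the name as ISO-8859-1 bytes
var_dump(looks_like_utf8($latin1));   // bool(false)

$utf8 = latin1_to_utf8($latin1);
var_dump(looks_like_utf8($utf8));     // bool(true)
echo bin2hex($utf8), "\n";            // 50c3a16472616963
```

A page that sends the first string with a UTF-8 charset gives the browser an invalid byte sequence, which is what typically ends up rendered as a question mark or box.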
One could blame me – I’m a bad person who uses a name containing a Gaelic accented character (equivalent to the French a-acute character). Why not blame the French while we’re at it, or the rest of those pesky non-English speaking people?
It’s because of this that many self-proclaimed web apps purporting to support internationalisation or localisation (abbreviated I18N for the former and L10N for the latter) are being unrealistic. Many will support I18N for English, and for a subset of European languages where accented characters can be given an English character variant, or maybe if they use a template system with UTF-8 encoded templates that require no manipulation – i.e. no database storage or use of PHP string functions. Some will require that the mbstring library be available for PHP.
There are still problems. Take a simple example. I create a form accepting a username of no more than 32 characters. So someone from India drops by and enters a Hindi username – 32 Hindi characters. Now the form is submitted. Everything is great, right? We just check for 32 characters. But wait – there is no character length function in PHP! And no, strlen() is not that – not at all.
strlen() does NOT count characters – it counts BYTES. Never forget that distinction lest it burn you. Now UTF-8 is a multi-byte encoding. Keeping in mind that UTF-8 incorporates single-byte ASCII, we can say the string “Reaper” has 6 bytes, equalling 6 characters. This is what most of us already assume – it’s the evil assumption. But what about Hindi? Hindi uses multi-byte characters.
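A quick way to see the byte/character gap without any optional extension: the sketch below counts code points by counting every byte that is not a UTF-8 continuation byte. The helper name `utf8_char_count` is made up for this example, and it assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper: count UTF-8 code points by counting every byte that
// is NOT a continuation byte (0x80-0xBF). Assumes $s is valid UTF-8.
function utf8_char_count($s)
{
    return preg_match_all('/[^\x80-\xBF]/', $s, $unused);
}

$samples = array(
    'Reaper',             // ASCII only
    "P\xC3\xA1draic",     // the a-acute is two bytes in UTF-8
    'हिन्दी',              // Devanagari: three bytes per code point
);

foreach ($samples as $s) {
    printf("%d bytes, %d characters\n", strlen($s), utf8_char_count($s));
}
// Prints:
// 6 bytes, 6 characters
// 8 bytes, 7 characters
// 18 bytes, 6 characters
```

"Characters" here means Unicode code points. The Hindi word is drawn as fewer visible clusters than six, because some of its code points are combining marks, so even a correct code point count is only an approximation of what a reader would call a character.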
A 32 character Hindi string will actually contain a lot more than 32 bytes – Devanagari takes 3 bytes per character in UTF-8, so that is 32 * 3 bytes, i.e. 96 bytes, and some scripts need up to 4 bytes per character, i.e. 128 bytes to represent a 32 character string! Result – our filtering logic fails miserably, a victim of our false assumption that the whole world speaks English and writes with Roman characters.
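Here is a rough sketch of the username check done both ways. The validator name and its 32 character default are taken from the example above but are otherwise hypothetical; the test string is 32 copies of one Devanagari letter, which takes three bytes in UTF-8.

```php
<?php
// Hypothetical validator: accept a username of 1 to $max code points.
// Invalid UTF-8 makes preg_match_all() return false, so it is rejected too.
function username_length_ok($name, $max = 32)
{
    // u: treat pattern and subject as UTF-8, so "." matches a code point.
    // s: let "." match newlines as well, so nothing is left uncounted.
    $chars = preg_match_all('/./su', $name, $unused);
    return $chars !== false && $chars >= 1 && $chars <= $max;
}

$hindi = str_repeat("\xE0\xA4\xA8", 32);   // 32 x U+0928, 3 bytes each

var_dump(strlen($hindi));                  // int(96)
var_dump(strlen($hindi) <= 32);            // bool(false): the byte check rejects it
var_dump(username_length_ok($hindi));      // bool(true): it is 32 characters
```

Note that any storage behind the form (a database column, for instance) has to be sized with the same distinction in mind.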
There are other deeply set practices that fail to account for UTF-8. Perl supports UTF-8 in RegExp, but PHP’s PCRE functions treat patterns and subjects as raw bytes unless you remember to add the u modifier. How many of us use regular expressions in form validation? Probably the same number who use strlen() to count characters… There are other examples. Basically you could list all the string functions.
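For what it is worth, PHP’s preg_* functions do accept a `u` pattern modifier that switches PCRE into UTF-8 mode; it is simply off by default and easy to forget. The sketch below contrasts byte mode and UTF-8 mode on a typical "letters only" validation. It assumes a PCRE build with Unicode property support (`\p{L}`, `\p{M}`).

```php
<?php
$name = "P\xC3\xA1draic";   // the name encoded as UTF-8

// Byte mode: the a-acute is two bytes, neither of which is in a-z.
var_dump(preg_match('/^[a-z]+$/i', $name));            // int(0)

// UTF-8 mode with a Unicode property: any letter from any script.
var_dump(preg_match('/^\p{L}+$/u', $name));            // int(1)

// "." matches one byte without u and one code point with it.
var_dump(preg_match('/^.$/', "\xC3\xA1"));             // int(0)
var_dump(preg_match('/^.$/u', "\xC3\xA1"));            // int(1)

// Devanagari needs combining marks (\p{M}) allowed alongside letters.
var_dump(preg_match('/^[\p{L}\p{M}]+$/u', 'हिन्दी'));   // int(1)
```

The familiar `[a-z]` style of validation is the regular-expression equivalent of using strlen() to count characters: it quietly assumes English.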
There is a solution however – we can compile in (or enable through php.ini for dynamic loading) the mbstring library for PHP. However this has one flaw – it’s not a default library, which means many shared hosts will not actually support it. So much for PHP’s UTF-8 support. The other solution is to wait for PHP6, which is planned to have native Unicode support. But then what about all those hosts who’ll persist in using PHP4 because PHP5 is not fully backwards compatible?
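Since mbstring cannot be relied on everywhere, one hedge is to detect it at runtime and fall back to something that works in a stock build. The wrapper below is only a sketch: `portable_strlen` is a made-up name, and the fallback assumes PCRE was built with UTF-8 support.

```php
<?php
// Hypothetical wrapper: use mbstring when the host provides it, otherwise
// count code points with a PCRE pattern in UTF-8 mode.
function portable_strlen($s)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    $n = preg_match_all('/./su', $s, $unused);
    // Invalid UTF-8 makes PCRE give up; fall back to a plain byte count.
    return ($n === false) ? strlen($s) : $n;
}

echo extension_loaded('mbstring') ? "mbstring loaded\n" : "mbstring missing\n";
echo portable_strlen("P\xC3\xA1draic"), "\n";   // 7
echo portable_strlen('Reaper'), "\n";           // 6
```

The two paths agree on valid UTF-8 but may differ on malformed input, so it is worth validating the encoding before trusting either count.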
In short, people using PHP and outputting UTF-8 HTML while claiming to support I18N are (in the majority) completely ignorant of the problem, and it only works out if you’re a native English speaker. After all, for an English speaker there is zero practical difference between using ISO-8859-1, ASCII and UTF-8. It’s the rest of the world that must jump through hoops. Even worse – much of the UTF-8 HTML out there is not even encoded in UTF-8. Often it’s just a typical ISO-8859 document (written on Windows XP in almost all cases) with a UTF-8 content-type charset – this largely explains why my name is commonly corrupted with question marks and boxes on many English sites claiming (falsely) to be UTF-8 encoded.
My name is not P?draic or Pdraic – it’s Pádraic!
Posted by Pádraic Brady on February 28, 2006 at 6:16 pm, filed under Uncategorized.

