By Matt Butcher
utf-8
QueryPath and Character Sets: Converting content with mb_convert_encoding()
Submitted by matt on Mon, 2010-05-03 09:43
QueryPath can be used to crawl the web, parsing web pages and gleaning information. But the HTML of remote websites is not always as pristine and standards compliant as we would like, and one thing that can be particularly frustrating is determining the encoding of a document. (This gets substantially more complicated when HTTP headers list one encoding and HTML meta tags list another, which is a common configuration error.)
QueryPath is primarily a library for working with XML and HTML, and it assumes that you know from the outset what character set your document uses. That is not always a safe assumption. Here is one way around the problem: rather than writing code to determine a document's character set, let PHP's built-in multibyte functions do it for you (assuming PHP was compiled with the mbstring extension). Passing 'auto' as the source encoding tells mb_convert_encoding() to try to detect the source encoding itself before converting.
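<?php
require 'QueryPath/QueryPath.php';

$url = 'http://mopy.fr/';

// Fetch the page and convert it to ISO-8859-1, letting mbstring guess the source encoding.
$contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto');

// Tolerate sloppy markup: suppress parser warnings.
$opts = array('ignore_parser_warnings' => TRUE);
print @qp($contents, 'title', $opts)->text() . PHP_EOL;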
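Automatic detection can guess wrong, so a more conservative variant may be useful when you have a reasonable idea of what the non-UTF-8 pages are likely to be. The following is a minimal sketch, not part of QueryPath: the helper name example_to_utf8() and the ISO-8859-1 fallback are assumptions made for illustration. It keeps content that is already valid UTF-8 and converts everything else from the fallback encoding.

```php
<?php
// Hypothetical helper (not part of QueryPath): normalize raw bytes to UTF-8.
// Assumption: anything that is not valid UTF-8 is encoded in $fallback.
function example_to_utf8($bytes, $fallback = 'ISO-8859-1')
{
    // Valid UTF-8 (which includes plain ASCII) passes through unchanged.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise reinterpret the bytes as the fallback encoding.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}
```

The returned string could then be handed to qp() in place of $contents. Note the limitation: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so this check is a heuristic, not a guarantee.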
XML, character sets, and setting the right encoding
Submitted by matt on Tue, 2009-05-05 16:09
Working with incorrectly encoded XML documents is painful.
Today I encountered an XML document that did not declare (in its XML header) what encoding it used. If a document has no explicit encoding in its XML declaration, and no byte order mark or external hint such as an HTTP header, it must be treated as UTF-8. But in this case, the document was actually encoded in some variant of ISO-8859-1 (it appeared to have snippets of MS Word generated HTML copied and pasted into it). When it encountered bytes above 0x7F (so-called high ASCII characters) that did not form valid UTF-8 sequences, the parser (rightly) choked.
Here's what the declaration looked like:
<?xml version="1.0"?>
Because it is encoded as an ISO-8859-1 document, it should have looked like this:
<?xml version="1.0" encoding="iso-8859-1"?>
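To see the difference the declaration makes, here is a small sketch that feeds the same ISO-8859-1 bytes to DOMDocument twice, once without and once with an encoding declaration. The helper name example_xml_parses() and the sample documents are made up for illustration; the first document is expected to be rejected and the second accepted.

```php
<?php
// Hypothetical helper: report whether libxml (via DOMDocument) accepts the XML.
function example_xml_parses($xml)
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok !== false;
}

// "caf\xE9" is "café" in ISO-8859-1; a lone 0xE9 byte is not valid UTF-8.
$undeclared = "<?xml version=\"1.0\"?><r>caf\xE9</r>";
$declared   = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><r>caf\xE9</r>";

var_dump(example_xml_parses($undeclared));
var_dump(example_xml_parses($declared));
```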
So what do you do in a case like this? PHP's (actually, libxml's) parser does not allow you to explicitly override the XML declaration. Consider what might appear to be a working option:
$doc = new DOMDocument('1.0', 'ISO-8859-1');
$doc->load('my/broken/doc.xml');
This will fail (as will attempts to load the document with SimpleXML). The document's own (implicit) UTF-8 declaration overrides the settings on the DOMDocument object. Setting $doc->encoding is ineffective for the same reason.
Running an automated replacement of the XML declaration is dangerous, though. Just because one document was ISO-8859-1, I cannot assume that all will be. Thus, building a solution could involve using iconv or other similar tools. Indubitably, the correct route is to have the XML producer correctly set the encoding (or correctly convert the contents to UTF-8). Barring that, though, iconv is your best bet.
Here's how to correct encoding errors from the command line. This should work on most UNIX-like systems, including Mac OS X:
$ iconv -f 'iso-8859-1' -t 'utf-8' bad-iso8859-1.xml > utf-8.xml
In this case, iconv will read the original file, convert it from ISO-8859-1 to UTF-8, and then write the results to utf-8.xml.
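The same conversion can be done inside PHP with the iconv() function, assuming the iconv extension is available. The sketch below is one possible approach rather than a definitive fix: the helper name example_load_converted_xml() is hypothetical, and it only makes sense for documents whose XML declaration names no encoding, because the declaration itself is left untouched.

```php
<?php
// Hypothetical helper: convert bytes from a known source encoding to UTF-8,
// then parse them. Assumes the XML declaration names no encoding (or UTF-8),
// since the declaration is not rewritten.
function example_load_converted_xml($bytes, $from = 'ISO-8859-1')
{
    $utf8 = iconv($from, 'UTF-8', $bytes);
    if ($utf8 === false) {
        return null; // conversion failed
    }

    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}
```

For the broken file discussed above, this could be called as example_load_converted_xml(file_get_contents('my/broken/doc.xml')). The caller still has to know, or correctly guess, the source encoding, which is the same caveat that applies to the command-line version.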








