xml

15 Feb

A QueryPath script for checking on a sitemap

in php, programming, querypath, sitemap, xml

I've been tuning our sitemap during the last few months, and one thing I needed was a quick tool to check on the effectiveness of various sitemap generation strategies.

To do this, I wrote a quick QueryPath script (see a full-sized image of the output). The script is explained below.

The code is pretty straightforward. It simply retrieves a URL, parses the sitemap contents, and then sorts them. Finally, it displays the top 100 entries. I've tested it on sitemaps with over 20,000 items. While it is a little slow on such a large document, it works fine.

#!/usr/bin/env php
<?php
require 'QueryPath/QueryPath.php';
 
define('MAX_ITEMS', 100);
 
$sitemap = 'http://example.com/sitemap.xml';
 
$urls = array();
print "Parsing sitemap...\n";
$qp = qp($sitemap, ':root>url>loc');
$size = $qp->size();
$max = $size > MAX_ITEMS ? MAX_ITEMS : $size;
printf("Found %d entries; printing top %d\n\n", $size, $max);
 
try {
  foreach ($qp as $url) {
    $loc = $url->text();
    $score = $url->nextAll('priority')->text();
    $urls[$loc] = $score;
  }
} catch (Exception $e) {
  print $e->getMessage();
}
 
// Sort by score, highest first, keeping the URL keys.
arsort($urls);
 
$filter = "%d: %0.5f  %s\n";
 
// Print the top $max entries, numbered from 1.
$i = 0;
foreach ($urls as $uri => $score) {
  if ($i++ == $max) break;
  printf($filter, $i, $score, $uri);
}
?>

Basically, the script above simply fetches all of the URLs out of the sitemap, and then sorts them by their corresponding score. Only the top MAX_ITEMS (100) are shown.
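For comparison, the same fetch, sort, and truncate logic can be sketched without QueryPath, using PHP's bundled SimpleXML extension. This is only an illustration under a few assumptions: the helper name top_sitemap_entries() and the inline sample sitemap are made up for this example, and entries without a priority element fall back to 0.5, the default defined by the sitemap protocol.

```php
<?php
// Hypothetical helper (not part of the script above): rank sitemap URLs
// by their <priority> value, highest first, and keep the top $max.
function top_sitemap_entries(string $xml, int $max): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new InvalidArgumentException('Could not parse sitemap XML.');
    }

    $urls = array();
    foreach ($doc->url as $url) {
        // The sitemap protocol defines 0.5 as the default priority.
        $priority = isset($url->priority) ? (float) $url->priority : 0.5;
        $urls[(string) $url->loc] = $priority;
    }

    arsort($urls); // Sort by score, descending, keeping the URL keys.
    return array_slice($urls, 0, $max, true);
}

// Made-up sample data for illustration.
$sample = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc><priority>0.3</priority></url>
  <url><loc>http://example.com/b</loc><priority>0.9</priority></url>
  <url><loc>http://example.com/c</loc></url>
</urlset>
XML;

$rank = 0;
foreach (top_sitemap_entries($sample, 2) as $loc => $score) {
    printf("%d: %0.5f  %s\n", ++$rank, $score, $loc);
}
```

Casting the priority to a float before sorting avoids relying on PHP's comparison rules for numeric strings.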

15 Feb

5 Differences: Moving from XML Sitemap module to Google's Sitemap Generators

in drupal, google, python, seo, sitemap, xml

For a large site that I maintain, we recently disabled the XML Sitemap module (we're using the 1.x branch) and switched to the Google Sitemap Generators tool (the Python one). We have noticed a few unsurprising things, and a few very surprising things.

We identified five big differences (all positive) that we have seen since moving to the Google Sitemap Generators Python tool.

01 Jun

Escaping JavaScript in QueryPath

in html, javascript, php, querypath, xml

Sometimes the HTML you parse with QueryPath will contain JavaScript or other embedded scripting languages. And sometimes such scripts will contain characters that the XML parser might misinterpret as XML or HTML structures.

There are two ways to escape such content -- both of which are standard, and are often done regardless of whether or not you are using QueryPath.

The first method, which is preferred when working with HTML, is to enclose any scripts inside of HTML comments:

<html>
<head>
< script>
<!--
// Script goes here
-->
< /script>
</head>
<body></body>
</html>

(Extra spacing has been added in the example above to keep the tags from being stripped by this blog's formatter. Those spaces should not be present in your code.)

The comment enclosure will prevent the HTML parser from parsing the contents of the script.

In other cases, XML CDATA sections may be a better fit for your needs:

<html>
<head>
< script>
<![CDATA[
// Script goes here
]]>
< /script>
</head>
<body></body>
</html>

CDATA sections will be readily available in the parsed DOM, but the contents of a CDATA section will not be parsed and interpreted. It is therefore safe to embed JavaScript as well as XML/HTML-like tags.
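To see the CDATA behavior in isolation, here is a small sketch that uses PHP's bundled DOM extension directly rather than QueryPath. The document fragment is made up for the example; the point is only that the script text survives intact and that nothing inside the CDATA section is turned into elements.

```php
<?php
// A made-up XHTML-style fragment: the script body contains "<" and "&&",
// which would be markup errors outside of a CDATA section.
$xml = '<html><head><script><![CDATA[
if (a < b && b > c) { document.write("<b>hi</b>"); }
]]></script></head><body></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$script = $doc->getElementsByTagName('script')->item(0);

// The CDATA content is available as plain, unparsed text.
print trim($script->textContent) . "\n";

// No <b> element was created from the text inside the CDATA section.
print $doc->getElementsByTagName('b')->length . "\n";
```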

With these two strategies, you should have the tools necessary to prevent embedded scripts from causing QueryPath parse errors.

28 May

Executing a SPARQL Query from QueryPath

in php, programming, querypath, semantic web, sparql, xml

The Semantic Web. It is a concept that has sparked heated debate for years. While the debate may continue to rage for some time, there are already a host of technologies that can be used to build advanced applications based on XML technology. In this article, we will see how the SPARQL query language can be used to retrieve XML information from remote semantic databases (usually called SPARQL endpoints).

QueryPath already contains all of the tools necessary for running a SPARQL query and handling the results. This is not because QueryPath has been specially fitted to the task, but because SPARQL uses technologies that are widely supported: XML and HTTP. Since QueryPath can be used to make HTTP requests and then digest the XML results, we can use it to execute SPARQL queries and handle the results.

In this article, we will look at a basic SPARQL query, and see how we can use QueryPath to execute it and parse the returned results.

While SPARQL will be introduced here, it is far too robust a language to be explained in a short article. One starting point is the SPARQL Working Group home page.

The queries presented in this article will be run against DBPedia, a semantic version of Wikipedia. It makes all of the content from Wikipedia available as semantic content.

The SPARQL Query: A Brief Anatomy

Let's begin by looking at the SPARQL query that we will be running:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?name ?label
WHERE {
  ?uri foaf:name ?name .
  ?uri rdfs:label ?label
  FILTER (?name = "The Beatles")
  FILTER (lang(?label) = "en")
}

The query above begins by defining two prefixes:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

A prefix is a convenient method for representing a namespace URI with a short string. Above, we create one for the Friend of a Friend namespace (foaf:) and one for the RDF Schema namespace (rdfs:). Now, whenever we need to represent entities from those two schemata, we can just use the short prefix instead of the full URI.

The next part of the code above is the actual query:

SELECT ?uri ?name ?label
WHERE {
  ?uri foaf:name ?name .
  ?uri rdfs:label ?label
  FILTER (?name = "The Beatles")
  FILTER (lang(?label) = "en")
}

If you have developed SQL before, this should look vaguely familiar. It functions similarly to a SQL SELECT operation.

We are going to use the URI a lot, and it is easy to get hung up on the URI as a URL expressing a location. However, you are better off thinking of the URI as a unique identifier for an object -- a unique identifier that just happens to also be "dereferenceable". We can, in fact, use the URI to access information over the network (in this case).

Here's what the query above does, phrased in plain English:

  1. Select the uri, name, and label
  2. where...
  3. the uri has the name ?name (or, where the uri's name is stored in ?name)
  4. the uri has a label ?label
  5. the name is "The Beatles"
  6. the language of the label is English

There are a few things to note about the structure of the query.

First, remember that the URI (?uri) is just a unique identifier. It functions somewhat like a primary key for each object we query.

Second, the items that begin with question marks (?) are variables. Their value is assigned when the query is being executed.

Third, the items in the WHERE clause are not simply restrictive, as they are in SQL. In fact, the purpose of lines 3 and 4 in the list above is not so much to limit the items returned as to express a relationship between items. The general pattern of lines 3 and 4 is:

?subject ?relationship ?object

So ?uri foaf:name ?name can be understood to mean "Some object ID (subject) named (relationship) some name (object)". As you may have guessed, foaf:name expresses the relationship "is named". Likewise, rdfs:label expresses the relationship "is labeled".

Assuming that we did not have the two FILTER functions, the query would simply return all objects (together with their names and labels) that had a name and a label.

The FILTER function is used to limit what content is returned. Above, we used two filters:

  FILTER (?name = "The Beatles")
  FILTER (lang(?label) = "en")

The first filter says that the value of ?name must match (exactly) the string "The Beatles". Keep in mind that a given item may have multiple foaf:name items. The filter need only match one of the items.

The second filter requires that the label's language be English. RDFS labels in the DBPedia database tend to have attributes indicating the language of the label. We are only interested in the English language content. If we omit this filter from the query above, we will see results in Chinese, German, and Spanish, as well as other languages.

Putting this all together, then, our query will return the URI, the name, and the label for any URIs in the database that...

  • Have a name
  • Have a label
  • Have a name that is "The Beatles"
  • Have a label that is in English.
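If it helps to see those four conditions as ordinary code, here is a toy model in PHP. Everything in it is an assumption made for illustration: the triples are invented, the ex: identifiers are placeholders, and select_labels() is a hypothetical function. A real SPARQL engine does not evaluate queries with nested loops like this; the sketch only mirrors the meaning of the two triple patterns and the two filters.

```php
<?php
// Made-up triples: array(subject, relationship, object, language tag).
$triples = array(
    array('ex:Beatles', 'foaf:name',  'The Beatles',        null),
    array('ex:Beatles', 'rdfs:label', 'The Beatles',        'en'),
    array('ex:Beatles', 'rdfs:label', 'Die Beatles',        'de'),
    array('ex:Stones',  'foaf:name',  'The Rolling Stones', null),
    array('ex:Stones',  'rdfs:label', 'The Rolling Stones', 'en'),
);

// Hypothetical stand-in for the query: join the name and label patterns
// on ?uri, then apply the two FILTER conditions.
function select_labels(array $triples, string $wantedName, string $wantedLang): array
{
    $rows = array();
    foreach ($triples as $nameTriple) {
        list($uri, $rel, $name) = $nameTriple;
        // Pattern "?uri foaf:name ?name" plus FILTER (?name = ...).
        if ($rel !== 'foaf:name' || $name !== $wantedName) {
            continue;
        }
        foreach ($triples as $labelTriple) {
            list($uri2, $rel2, $label, $lang) = $labelTriple;
            // Pattern "?uri rdfs:label ?label" plus FILTER (lang(?label) = ...).
            if ($uri2 === $uri && $rel2 === 'rdfs:label' && $lang === $wantedLang) {
                $rows[] = array('uri' => $uri, 'name' => $name, 'label' => $label);
            }
        }
    }
    return $rows;
}

foreach (select_labels($triples, 'The Beatles', 'en') as $row) {
    print implode(' | ', $row) . "\n";
}
```

Dropping the language check from the inner condition would also return the German label, which is the effect described above for omitting the second filter.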

Next, we're ready to see how this query can be run against a remote, publicly available SPARQL endpoint (server) from QueryPath.

Running the Query

The query is, by far, the most complex aspect of our sample code. Here's what the entire code looks like:

<?php
require '../src/QueryPath/QueryPath.php';
 
// We are using the dbpedia database to execute a SPARQL query.
 
// URL to DB Pedia's SPARQL endpoint.
$url = 'http://dbpedia.org/sparql';
 
// The SPARQL query to run.
$sparql = '
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?uri ?name ?label
  WHERE {
    ?uri foaf:name ?name .
    ?uri rdfs:label ?label
    FILTER (?name = "The Beatles")
    FILTER (lang(?label) = "en")
  }
';
 
// We first set up the parameters that will be sent.
$params = array(
  'query' => $sparql,
  'format' => 'application/sparql-results+xml',
);
 
// DB Pedia wants a GET query, so we create one.
$data = http_build_query($params);
$url .= '?' . $data;
 
// Next, we simply retrieve, parse, and output the contents.
$qp = qp($url, 'head');
 
// Get the headers from the resulting XML.
$headers = array();
foreach ($qp->children('variable') as $col) {
  $headers[] = $col->attr('name');
}
 
// Get rows of data from result.
$rows = array();
$col_count = count($headers);
foreach ($qp->top()->find('results>result') as $row) {
  $cols = array();
  $row->children();
  for ($i = 0; $i < $col_count; ++$i) {
    $cols[$i] = $row->branch()->eq($i)->text();
  }
  $rows[] = $cols;
}
 
// Turn data into table.
$table = '<table><tr><th>' . implode('</th><th>', $headers) . '</th></tr>';
foreach ($rows as $row) {
  $table .= '<tr><td>';
  $table .= implode('</td><td>', $row);
  $table .= '</td></tr>';
}
$table .= '</table>';
 
// Add table to HTML document.
qp(QueryPath::HTML_STUB, 'body')->append($table)->writeHTML();
?>

While the code may look complex at first blush, it is actually a straightforward tool.

We will begin by taking a quick glance at the first part of the script:

  // URL to DB Pedia's SPARQL endpoint.
  $url = 'http://dbpedia.org/sparql';
 
  // The SPARQL query to run.
  $sparql = '
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?uri ?name ?label
    WHERE {
      ?uri foaf:name ?name .
      ?uri rdfs:label ?label
      FILTER (?name = "The Beatles")
      FILTER (lang(?label) = "en")
    }
  ';
 
  // We first set up the parameters that will be sent.
  $params = array(
    'query' => $sparql,
    'format' => 'application/sparql-results+xml',
  );
 
  // DB Pedia wants a GET query, so we create one.
  $data = http_build_query($params);
  $url .= '?' . $data;
 

The snippet above shows all of the preparation we must make to run the query.

We begin with a base $url, which points to the DBPedia SPARQL endpoint. Next we write our SPARQL query. The query above is the same as the one we saw earlier in this article.

With the query and the base URL, we need to build a full URL to access the remote server. This is done with the $params array. There we create the name/value pairs that will be condensed into a GET string by http_build_query(). Note that we set the MIME type as the value of the $params['format'] entry. This is to tell the remote server what kind of data we expect to have returned.

A SPARQL query need not return information encoded as XML. Other data formats are equally capable of representing SPARQL query results. XML is probably the most widely used, though, and is the easiest for us to parse.

In the last line of the snippet above, we assemble our base URL and query params into a complete URL.
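To make the encoding step concrete, here is a small sketch of what http_build_query() does to such parameters. The short query string is a placeholder chosen to keep the output readable, and the separator is passed explicitly so the result does not depend on local PHP configuration.

```php
<?php
// A deliberately short placeholder query, not the one used in this article.
$params = array(
    'query'  => 'SELECT ?s WHERE { ?s ?p ?o } LIMIT 1',
    'format' => 'application/sparql-results+xml',
);

// Passing '&' explicitly avoids depending on the arg_separator.output setting.
$query_string = http_build_query($params, '', '&');
$url = 'http://dbpedia.org/sparql?' . $query_string;

print $url . "\n";

// Round trip: decoding the query string gives back the original parameters.
parse_str($query_string, $decoded);
print_r($decoded);
```

Characters such as spaces, question marks, braces, slashes, and plus signs are all percent-encoded (or, for spaces, turned into +), which is why it is safer to let http_build_query() assemble the string than to concatenate it by hand.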

Next, we need to execute the query and handle the results.

<?php
  // Next, we simply retrieve, parse, and output the contents.
  $qp = qp($url, 'head');
 
  // Get the headers from the resulting XML.
  $headers = array();
  foreach ($qp->children('variable') as $col) {
    $headers[] = $col->attr('name');
  }
 
  // Get rows of data from result.
  $rows = array();
  $col_count = count($headers);
  foreach ($qp->top()->find('results>result') as $row) {
    $cols = array();
    $row->children();
    for ($i = 0; $i < $col_count; ++$i) {
      $cols[$i] = $row->branch()->eq($i)->text();
    }
    $rows[] = $cols;
  }
?>

We begin by creating a new QueryPath object, stored in $qp. Based on the CSS query, we can see that it will be pointed to the head element in the returned results. This element will contain the names of each of the returned variables.

From there, we build an array of $headers, getting the name of each returned variable. These we will use to generate the headers in our table. The headers come back in variable elements, and each variable has a name attribute. To fetch them, then, we select the variables and loop through them, retrieving the name attribute of each.

Next comes the fancy part. We need to loop through each result and fetch each variable out of each result. Or, to use the table metaphor we SQL developers are familiar with, we loop through each row, and fetch each column of data. This is all accomplished in this foreach loop:

foreach ($qp->top()->find('results>result') as $row) {
  $cols = array();
  $row->children();
  for ($i = 0; $i < $col_count; ++$i) {
    $cols[$i] = $row->branch()->eq($i)->text();
  }
  $rows[] = $cols;
}  

When this loop is finished, there will be an array of rows, each of which will have an array of columns. The index of the columns should match the index of the headers array. That is how we correlate headers to columns. You may also notice that we use QueryPath's branch() method in combination with eq() so that we can (relatively cheaply) get the text for each column.
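Matching columns by index works for this query because every result binds all three variables. As a hedged alternative, the sketch below keys each value by the name attribute of its binding element instead, which stays aligned even if a result omits a binding (as can happen with OPTIONAL patterns). It uses PHP's DOM extension rather than QueryPath; the function name sparql_rows() and the sample XML are invented for the example and are not actual DBPedia output.

```php
<?php
// Hypothetical helper: turn SPARQL XML results into headers plus rows
// keyed by variable name.
function sparql_rows(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new InvalidArgumentException('Could not parse SPARQL results.');
    }

    $headers = array();
    foreach ($doc->getElementsByTagName('variable') as $var) {
        $headers[] = $var->getAttribute('name');
    }

    $rows = array();
    foreach ($doc->getElementsByTagName('result') as $result) {
        // Start every row with an empty cell for each declared variable.
        $row = array_fill_keys($headers, '');
        foreach ($result->getElementsByTagName('binding') as $binding) {
            $row[$binding->getAttribute('name')] = trim($binding->textContent);
        }
        $rows[] = $row;
    }
    return array($headers, $rows);
}

// Made-up sample in the SPARQL Query Results XML Format.
$sample = '<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="uri"/><variable name="name"/><variable name="label"/></head>
  <results>
    <result>
      <binding name="uri"><uri>http://example.org/band/1</uri></binding>
      <binding name="name"><literal>The Beatles</literal></binding>
      <binding name="label"><literal xml:lang="en">The Beatles</literal></binding>
    </result>
    <result>
      <binding name="uri"><uri>http://example.org/band/2</uri></binding>
      <binding name="label"><literal xml:lang="en">Example</literal></binding>
    </result>
  </results>
</sparql>';

list($headers, $rows) = sparql_rows($sample);
print_r($rows);
```

In the second sample result the name binding is missing, so an index-based lookup would have placed the label in the name column; keyed by name, the label stays where it belongs and the name cell is simply empty.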

With this complete, the next thing to do is format the table output:

<?php
// Turn data into table.
$table = '<table><tr><th>' . implode('</th><th>', $headers) . '</th></tr>';
foreach ($rows as $row) {
  $table .= '<tr><td>';
  $table .= implode('</td><td>', $row);
  $table .= '</td></tr>';
}
$table .= '</table>';
 
// Add table to HTML document.
qp(QueryPath::HTML_STUB, 'body')->append($table)->writeHTML();
?>

The code above is straightforward. We are taking the data returned from the SPARQL query and formatting it into an HTML table, looping through each row of data.

On the final line, we create a new QueryPath object using the QueryPath::HTML_STUB stub document. We add our new table to that, and then write the HTML document to the web browser.
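One caveat worth sketching: the values come from a remote server, and a label containing characters such as & or < may produce broken markup once it is dropped into the table string. A minimal variation, assuming a hypothetical helper named rows_to_table(), escapes each cell with htmlspecialchars() before assembling the table:

```php
<?php
// Hypothetical helper: build the same table markup, but escape every cell.
function rows_to_table(array $headers, array $rows): string
{
    $escape = function ($value) {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    };

    $table = '<table><tr><th>'
        . implode('</th><th>', array_map($escape, $headers))
        . '</th></tr>';
    foreach ($rows as $row) {
        $table .= '<tr><td>'
            . implode('</td><td>', array_map($escape, $row))
            . '</td></tr>';
    }
    return $table . '</table>';
}

// Made-up data containing characters that are special in HTML.
print rows_to_table(array('name'), array(array('Simon & Garfunkel <live>'))) . "\n";
// prints: <table><tr><th>name</th></tr><tr><td>Simon &amp; Garfunkel &lt;live&gt;</td></tr></table>
```

The resulting string can be passed to append() exactly as $table is in the code above.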

Conclusion

This article illustrates how QueryPath can be used to execute SPARQL queries against remote semantic databases, and how QueryPath can then use the results. SPARQL is a complex language, and the introduction here has been brief. However, with such a robust query language at your disposal, and with QueryPath's HTTP, XML, and HTML capabilities, you can make use of the semantic web from your web applications.

05 May

XML, character sets, and setting the right encoding

in dom, encoding, iconv, os x, php, utf-8, xml

Working with incorrectly encoded XML documents is painful.

Today I encountered an XML document that did not declare (in its XML declaration) what encoding it used. If a document has no encoding declaration, no byte order mark, and no encoding information from a higher-level protocol, it must be treated as UTF-8. But in this case, the document was actually encoded in some variant of ISO-8859-1 (it appeared to have snippets of MS Word generated HTML copied and pasted into it). When encountering high ASCII characters (bytes above 0x7F) in the document, the parser (rightly) choked.

Here's what the declaration looked like:

<?xml version="1.0"?>

Because it is encoded as an ISO-8859-1 document, it should have looked like this:

<?xml version="1.0" encoding="iso-8859-1"?>

So what do you do in a case like this? PHP's (actually, libxml's) parser does not allow you to explicitly override the XML declaration. Consider what might appear to be a working option:

$doc = new DOMDocument('1.0', 'ISO-8859-1');
$doc->load('my/broken/doc.xml');

This will fail (as will attempts to load the document with SimpleXML). The document's own (implicit) UTF-8 declaration will override the settings for the DOMDocument object. Similarly, trying to set $doc->encoding will also be ineffective for the same reason.

Running an automated replacement across every document is dangerous, though. Just because one document was ISO-8859-1, I cannot assume that all will be. Thus, building a solution could involve using iconv or other similar tools. Indubitably, the correct route is to have the XML producer correctly set the encoding (or correctly convert the contents to UTF-8). Barring that, though, iconv is your best bet.

Here's how to correct encoding errors from the command line. This should work on most UNIX-like systems, including Mac OS X:

$ iconv -f 'iso-8859-1' -t 'utf-8' bad-iso8859-1.xml > utf-8.xml

In this case, iconv will read the original file, convert it from ISO-8859-1 to UTF-8, and then write the results to utf-8.xml.
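The same conversion can be done inside PHP with the iconv() function before handing the bytes to the parser. The sketch below assumes you already know the source really is ISO-8859-1 (the function will not detect that for you), and the sample document is made up for the example.

```php
<?php
// Made-up document: "\xE9" is an e-acute encoded as ISO-8859-1, and the
// declaration does not name an encoding.
$bytes = "<?xml version=\"1.0\"?>\n<doc>caf\xE9</doc>";

// Collect parse errors instead of emitting warnings.
libxml_use_internal_errors(true);

// Attempt 1: load the raw bytes, which the parser treats as UTF-8.
$raw = new DOMDocument();
$rawLoaded = $raw->loadXML($bytes);
print 'Raw load: ' . ($rawLoaded ? 'ok' : 'failed') . "\n";
libxml_clear_errors();

// Attempt 2: convert to UTF-8 first, then load.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $bytes);
$doc = new DOMDocument();
$fixedLoaded = $doc->loadXML($utf8);
print 'Converted load: ' . ($fixedLoaded ? 'ok' : 'failed') . "\n";

if ($fixedLoaded) {
    // The text is now valid UTF-8.
    print $doc->documentElement->textContent . "\n";
}
```

As with the command-line version, this is only safe when the source encoding is known; converting a document that is already UTF-8 would corrupt it.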

QueryPath

in dom, html, php, programming, querypath, xml

QueryPath is a tool for manipulating HTML and XML documents in PHP using a chainable interface. It is similar to jQuery in that respect.

A typical QueryPath script looks something like this:

<?php
require_once 'QueryPath/QueryPath.php';