querypath

01 Sep

VOTE: A potentially major change to QueryPath.

in drupal, php, programming, querypath

TL;DR: There's an experimental version of QueryPath 3 for you to try and let me know what you think: https://github.com/technosophos/querypath/zipball/3.0.0-experimental-fin...

Either respond at support-querypath@googlegroups.com or to @querypath on Twitter.

Read on for the longer explanation.

05 May

QueryPath in Practice: Migrating ICANN.org to Drupal

in drupal, html, import, migrate, php, programming, querypath

The Four Kitchens blog is running a story on how they used QueryPath and the Migrate module to migrate over 10,000 pages of content, in many different languages, into Drupal. I love to hear stories about the creative ways developers use QueryPath to accomplish complex tasks. A huge thanks to Mark Theunissen for the detailed write-up.

In related news, the new QueryPath 3 engine is just about done, and will make monster imports like this much faster.

26 Aug

Data URLs and QueryPath: How to embed images into XML or HTML

in dataurl, html, php, programming, querypath

QueryPath 2.1 is adding support for writing files directly into URLs using Data URLs. What this means is that you can encode and embed images or other documents straight into your HTML or XML.

Here's a simple example from the QueryPath 2.1 unit tests:

<?php
$xml = '<?xml version="1.0"?><root><item/></root>';
qp($xml, 'item')->dataURL('secret', 'Hi!', 'text/plain');
?>

The above will generate an XML fragment that looks like this:

<?xml version="1.0"?>
<root>
  <item secret="data:text/plain;base64,SGkh"/>
</root>

The important part there is the attribute secret="data:text/plain;base64,SGkh". This attribute includes an embedded text document with the contents Hi!. What we've done is encode the data and inject it as a document inside of the XML.

Sure, that's novel... but what would we want to use that for? How about adding images directly into a document?
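
As a rough sketch of the idea, a value of the same shape can be built with PHP's built-in base64_encode() and DOM extension. The data_url() helper is a hypothetical name for illustration and is not part of QueryPath, whose own implementation may differ. The image bytes are a placeholder string standing in for the contents of a real file.

```php
<?php
// Hypothetical helper (not part of QueryPath): wrap raw bytes in a base64 data URL.
function data_url(string $bytes, string $mime): string {
    return 'data:' . $mime . ';base64,' . base64_encode($bytes);
}

// Same input as the unit-test example above: the text "Hi!" as text/plain.
print data_url('Hi!', 'text/plain') . PHP_EOL;

// Placeholder bytes (an assumption). For a real image, read the file instead,
// for example with file_get_contents(), and pass the matching MIME type.
$bytes = 'GIF89a';

// Attach the data URL to an <img> element using PHP's DOM extension.
$doc = new DOMDocument();
$img = $doc->createElement('img');
$img->setAttribute('src', data_url($bytes, 'image/gif'));
$doc->appendChild($img);

print $doc->saveXML($img) . PHP_EOL;
```

The first print statement yields data:text/plain;base64,SGkh, the same value as the secret attribute shown earlier.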

26 Aug

Reflections on Google Summer of Code

in drupal, google, gsoc, querypath, quiz

This was the second year that I have been involved as a mentor for Google's Summer of Code program. And in both cases, I've worked as a mentor for Drupal. Last year, I worked with sivaji on a project involving the Quiz module. This year, I worked with eabrand on QueryPath and the QueryPath module.

In both cases, the projects were highly successful. I'm thrilled to have had the opportunity to work with two very gifted up-and-coming developers.

I think one of the most critical questions to ask of any program like GSOC is whether or not it produces the results (pedagogical and professional) that it is after. With both Sivaji and Emily, the answer is a resounding yes.

  • Since finishing his GSOC project, Sivaji has begun his professional life as a web developer focused on Drupal. Recently, he and his colleagues started E-ndicus, a Drupal-focused software development company in his home town of Chennai.
  • Emily is now a software engineer at HP. She continues to contribute to QueryPath, and was just this week featured on Google's blog. Last week, she joined me on the Drupal Dojo QueryPath session, too.

I doubt either of these individuals learned much from me during our GSOC projects. More than anything, it just takes hard work, persistence, and attention to detail to finish a GSOC project. But I've certainly learned a lot from them. And both Quiz and QueryPath have benefited enormously from the work of these two.

18 Aug

Slides for my Dojo presentation: "QueryPath: It's like PHP jQuery in Drupal!"

in querypath

I posted the slides from yesterday's Drupal Dojo presentation. These should be much more readable than the video feed.

16 Aug

Drupal Dojo: "QueryPath: It's like PHP jQuery in Drupal!"

in drupal, gsoc, programming, querypath

On August 17th at 12 PM EDT (9 AM PDT), I will be doing the Drupal Dojo session, "QueryPath: It's like PHP jQuery in Drupal!". To sign up, head over to the webinar signup.

I'm particularly excited about this for three reasons:

  1. Emily will be joining me to talk about her GSoC project.
  2. We will be discussing QueryPath 2.1 and the new Drupal 7 QueryPath module.
  3. The totally gorgeous new QueryPath logo (designed by Michael Mesker) will be unveiled.

This has been an exciting summer for QueryPath, and this webinar will preview many of the QueryPath technologies that are on the cusp of being released.

08 Aug

A PHP jQuery Library: QueryPath Overview

in javascript, jquery, php, programming, querypath

jQuery is a JavaScript library for efficiently working with HTML and CSS. Its chainable and compact API has made it a popular choice for web developers seeking to quickly build rich web applications. But did you know there is a PHP jQuery library? QueryPath is a PHP implementation of jQuery's interface. It provides all of the DOM manipulation functions, a full CSS selector engine, and as many of jQuery's other features as can practically be implemented server-side. But that's not all. This powerful library delivers many server-side features designed to make working with XML services simple, robust, and reliable.

03 May

QueryPath and Character Sets: Converting content with mb_convert_encoding()

in encoding, querypath, utf-8

QueryPath can be used to crawl the web, parsing web pages and gleaning information. But the HTML of remote websites is not always as pristine and standards compliant as we would like, and one thing that can be particularly frustrating is determining the encoding of a document. (This gets substantially more complicated when HTTP headers list one encoding and HTML meta tags list another, a common configuration error.)

QueryPath is primarily a library for working with XML and HTML, but it assumes that you know from the outset what character set your document uses. This is not always a good assumption to make. Here is one way to circumvent the problem: rather than write code to find out a document's character set, use PHP's built-in functions (assuming the mbstring extension is compiled in) to do this for you.

<?php
require 'QueryPath/QueryPath.php';
 
$url = 'http://mopy.fr/';
$contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto');
$opts = array('ignore_parser_warnings' => TRUE);
 
print @qp($contents, 'title', $opts)->text() . PHP_EOL;
?>

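
A related sketch, under different assumptions from the snippet above: instead of relying on 'auto' detection, check whether the bytes are already valid UTF-8 and otherwise convert from one assumed fallback encoding. The to_utf8() helper and the ISO-8859-1 fallback are hypothetical choices for illustration, not part of QueryPath, and the result would still need to be handed to qp() for parsing.

```php
<?php
// Hypothetical helper (not part of QueryPath): normalize a byte string to UTF-8.
// Requires the mbstring extension.
function to_utf8(string $bytes, string $fallback = 'ISO-8859-1'): string {
    // Bytes that are already well-formed UTF-8 are returned untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise assume the fallback encoding. This is a guess, not a detection.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";      // "café" encoded as ISO-8859-1
$utf8   = "caf\xC3\xA9";  // "café" encoded as UTF-8

// Both inputs end up as the same UTF-8 byte sequence.
var_dump(to_utf8($latin1) === $utf8);
var_dump(to_utf8($utf8) === $utf8);
```

The trade-off is that a document in some third encoding is silently misread as the fallback, so this only suits sources where the likely encodings are known in advance.
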
15 Feb

A QueryPath script for checking on a sitemap

in php, programming, querypath, sitemap, xml

I've been tuning our sitemap during the last few months, and one thing I needed was a quick tool to check on the effectiveness of various sitemap generation strategies.

To do this, I wrote a quick QueryPath script (see a full-sized image of the output). The script is explained below.

The code is pretty straightforward. It simply retrieves a URL, parses the sitemap contents, and then sorts them. Finally, it displays the top 100 entries. I've tested it on sitemaps with over 20,000 items. While it is a little slow on such a large document, it works fine.

#!/usr/bin/env php
<?php
require 'QueryPath/QueryPath.php';

define('MAX_ITEMS', 100);

$sitemap = 'http://example.com/sitemap.xml';

$urls = array();
print "Parsing sitemap...\n";
// Select every <loc> element inside a <url> under the document root.
$qp = qp($sitemap, ':root>url>loc');
$size = $qp->size();
$max = $size > MAX_ITEMS ? MAX_ITEMS : $size;
printf("Found %d entries; printing top %d\n\n", $size, $max);

try {
  // Map each location to the value of its sibling <priority> element.
  foreach ($qp as $url) {
    $loc = $url->text();
    $score = $url->nextAll('priority')->text();
    $urls[$loc] = $score;
  }
} catch (Exception $e) {
  print $e->getMessage();
}

// Sort by score, highest first, keeping the URL keys.
arsort($urls);

$filter = "%d: %0.5f  %s\n";

// Initialize the counter explicitly to avoid an undefined-variable notice.
$i = 0;
foreach ($urls as $uri => $score) {
  if ($i++ == $max) break;
  printf($filter, $i, $score, $uri);
}
?>

Basically, the script above simply fetches all of the URLs out of the sitemap, and then sorts them by their corresponding score. Only the top MAX_ITEMS (100) are shown.
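
To try the sort-and-truncate step without a network connection or QueryPath installed, here is a minimal sketch that runs the same logic against an inline sitemap. It swaps in PHP's built-in SimpleXML for QueryPath, and the top_urls() function and the three sample URLs are made up for illustration.

```php
<?php
// Sample data (an assumption): a tiny sitemap with three entries.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc><priority>0.5</priority></url>
  <url><loc>http://example.com/b</loc><priority>1.0</priority></url>
  <url><loc>http://example.com/c</loc><priority>0.8</priority></url>
</urlset>
XML;

// Hypothetical helper: return the $max highest-priority URLs as loc => score.
function top_urls(string $xml, int $max): array {
    $urls = [];
    foreach (simplexml_load_string($xml)->url as $url) {
        $urls[(string) $url->loc] = (float) $url->priority;
    }
    // Sort by score, highest first, keeping the URL keys.
    arsort($urls);
    return array_slice($urls, 0, $max, true);
}

foreach (top_urls($xml, 2) as $loc => $score) {
    printf("%0.5f  %s\n", $score, $loc);
}
```

With the sample data, the loop prints the /b entry first and the /c entry second, and drops /a.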

19 Jan

QueryPath on WebMonkey

in php, programming, querypath

It just came to my attention that a WebMonkey article (Parsing HTML? There's an App for That) from a few months ago suggested using QueryPath as an alternative to attempting to parse HTML by hand.

Appropriately, last week I wrote a QueryPath script to analyze a site and extract all links so that I could feed them to Siege and simulate something like a real load against a server. It's nice to be able to easily extract data from HTML.