Parsing HTML, the easy way

November 18, 2008

One of the toughest aspects of transitioning an existing site to Drupal can be content migration. While there's an entire category of modules available for import/export tasks, sometimes you just need to bite the bullet and parse some HTML.

One of the tools I like to use for this is the PHP Simple HTML DOM Parser, which lets you work with a page through the Document Object Model (DOM). For our purposes, all this means is that the document is modeled as a tree – each element, such as a <div> or a <p>, has a parent, children, and siblings, and you can search for them. (I'm sure this is already familiar to many readers, especially if you use jQuery.) Compared to an event-driven parser, I find that using the DOM results in smaller code that's easier to write.
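To make the parent/children/siblings vocabulary concrete, here is a minimal sketch. It uses PHP's built-in DOM extension rather than Simple HTML DOM, purely so that it runs without any extra files. The markup, the id, and the class name are invented for illustration.

```php
<?php
// Illustrative markup only; "wrap" and "post" are made-up names.
$markup = '<div id="wrap"><h3>Nov 18, 2008</h3>'
        . '<div class="post"><p>Hello</p><p>World</p></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Search the tree for the first <div class="post">.
$xpath = new DOMXPath($doc);
$post = $xpath->query('//div[@class="post"]')->item(0);

// Parent: the enclosing <div id="wrap">.
$parent_id = $post->parentNode->getAttribute('id');

// Sibling: the <h3> that sits immediately before the post.
$date_text = $post->previousSibling->textContent;

// Children: the two <p> elements inside the post.
$child_tags = [];
foreach ($post->childNodes as $child) {
  $child_tags[] = $child->nodeName;
}

echo $parent_id, ' | ', $date_text, ' | ', implode(',', $child_tags), "\n";
```

Simple HTML DOM exposes the same relationships under different names, for example prev_sibling() where the built-in extension uses previousSibling.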

The parser's SourceForge page really has all the examples you need – but I'll provide a small example of how I use it.

include('simple_html_dom.php');
 
$html = file_get_html('old_page.html');
 
# find every element with class "post" (here, each <div class="post">)
foreach($html->find('.post') as $post) {
 
  # locate the post date, and convert it to ISO 8601 format
  $date_string = $post->prev_sibling()->innertext;
  $date = date_format(date_create($date_string), "Y-m-d");
 
  $title = $post->find('.post-title', 0)->innertext;
 
  # get rid of empty divs
  foreach($post->find('div') as $div) {
    if ($div->innertext == '') {
      $div->outertext = '';
    }
  }
 
  $body = $post->find('.post-body', 0)->innertext;
 
  # this is a wrapper for Drupal's node_save(), but could be anything
  post_save ($title, $date, $body);
}

While Simple HTML DOM advertises the ability to handle errors in the HTML, I prefer to run the markup through HTML Tidy first. That way I can be sure there aren't any gremlins in the markup.
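One way to do that cleanup step from PHP is the tidy extension. This is only a sketch, not necessarily how I run Tidy in practice: the helper name tidy_markup() and the sample markup are hypothetical, the configuration options are just one reasonable choice, and the function falls back to returning the input untouched when the extension isn't installed.

```php
<?php
// Hypothetical helper: repair markup with the tidy extension when available.
function tidy_markup(string $html): string {
  if (!function_exists('tidy_repair_string')) {
    // tidy extension not loaded; hand the markup back unchanged
    return $html;
  }
  $config = [
    'output-xhtml'   => true, // emit well-formed XHTML
    'show-body-only' => true, // skip the generated <html>/<head> wrapper
    'wrap'           => 0,    // don't re-wrap long lines
  ];
  $clean = tidy_repair_string($html, $config, 'utf8');
  return $clean === false ? $html : $clean;
}

// Made-up example of sloppy markup: the <p> and the outer <div> are never closed.
$dirty = '<div class="post"><p>Unclosed paragraph<div>stray</div>';
$clean = tidy_markup($dirty);

echo $clean, "\n";
```

The cleaned string can then be handed to Simple HTML DOM's str_get_html() instead of reading the file with file_get_html().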

Comments

I find BeautifulSoup

I find BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) much nicer for parsing HTML, and it deals with bad HTML very well.

Hpricot

Cool! Should you find yourself in a Ruby environment, I highly recommend Hpricot.

DOM & SimpleXML

The method used in Drupal 7 core for SimpleTest is described below. It loads the HTML with DOM, which can cope with HTML soup, and then hands the result to PHP's SimpleXML library. I have found this to work very nicely. SimpleXML provides very nice traversal support and XPath support (very powerful).

$htmlDom = new DOMDocument();
// loadHTML() is an instance method; @ hides warnings about malformed markup
if (@$htmlDom->loadHTML($string)) { // Or loadHTMLFile(), w/e
  $this->elements = simplexml_import_dom($htmlDom);
}
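As a rough illustration of the XPath support mentioned above, here is a standalone sketch. The markup and class names are invented for the example and do not come from SimpleTest.

```php
<?php
// Made-up markup for illustration.
$string = '<div class="post"><h2 class="post-title">First post</h2>'
        . '<div class="post-body"><p>Hello</p></div></div>';

// Load with DOM first, then switch to SimpleXML for querying.
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($string);
$elements = simplexml_import_dom($htmlDom);

// Collect every post title in the document.
$titles = [];
foreach ($elements->xpath('//div[@class="post"]/h2[@class="post-title"]') as $h2) {
  $titles[] = (string) $h2;
}

// Grab the first paragraph of the post body.
$paragraph = (string) $elements->xpath('//div[@class="post-body"]/p')[0];

echo implode(', ', $titles), ' / ', $paragraph, "\n";
```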
