Copyright © 2004–2010 OpenSourcery, LLC. This work is licensed under a Creative Commons Attribution 3.0 United States License.
One of the toughest aspects of transitioning an existing site to Drupal can be content migration. While there's an entire category of modules available for import/export tasks, sometimes you just need to bite the bullet and parse some HTML.
One of the tools I like to use for this is the PHP Simple HTML DOM Parser, which lets you work with a page through the Document Object Model (DOM). For our purposes, all this means is that the document is modeled as a tree: each element, such as a <div> or a <p>, has a parent, children, and siblings, and you can search for them. (I'm sure this is already familiar to many readers, especially if you use jQuery.) Compared to an event-driven parser, I find that using the DOM results in smaller code that's easier to write.
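To make the tree idea concrete, here is a minimal sketch of walking from one element to its parent, previous sibling, and first child. Note that it uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, purely so that it runs without any extra files. The markup and variable names are made up for illustration.

```php
<?php
// Hypothetical markup standing in for a page from the old site.
$markup = '<div id="main"><h2>Posted 2009-03-14</h2>'
        . '<div class="post"><p>Hello</p></div></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Search the tree for the first <div class="post">.
$xpath = new DOMXPath($doc);
$post = $xpath->query('//div[@class="post"]')->item(0);

// From that one node we can reach its relatives in the tree.
$parent_id    = $post->parentNode->getAttribute('id');   // enclosing <div>
$sibling_text = $post->previousSibling->textContent;     // the <h2> just before it
$child_tag    = $post->firstChild->nodeName;             // the <p> inside it

printf("parent: %s\nprevious sibling: %s\nfirst child: %s\n",
       $parent_id, $sibling_text, $child_tag);
```

Simple HTML DOM exposes the same relationships through its own methods, such as find() and prev_sibling().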
The SourceForge page really has all the examples you need, but I'll provide a small example of how I use it.
include('simple_html_dom.php');

$html = file_get_html('old_page.html');

# find every <div class=post> in the document
foreach ($html->find('.post') as $post) {

  # locate the post date, and convert it to ISO 8601 format
  $date_string = $post->prev_sibling()->innertext;
  $date = date_format(date_create($date_string), "Y-m-d");

  $title = $post->find('.post-title', 0)->innertext;

  # get rid of empty divs
  foreach ($post->find('div') as $div) {
    if ($div->innertext == '') {
      $div->outertext = '';
    }
  }

  $body = $post->find('.post-body', 0)->innertext;

  # this is a wrapper for Drupal's node_save(), but could be anything
  post_save($title, $date, $body);
}
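One thing to watch in the date step: date_create() returns false for text it cannot parse, and treats an empty string as "now", so a malformed page can end up with an error or with today's date. Below is a minimal sketch of a more defensive conversion. The helper name to_iso_date() is hypothetical and not part of Simple HTML DOM or Drupal, and the strip_tags() call assumes the date element might contain inline markup.

```php
<?php
// Hypothetical helper: convert scraped text to an ISO 8601 date (Y-m-d).
// Returns null when the text is empty or cannot be parsed as a date.
function to_iso_date($date_string) {
    // innertext may carry inline tags and stray whitespace.
    $clean = trim(strip_tags($date_string));

    // An empty string would otherwise be read as the current time.
    if ($clean === '') {
        return null;
    }

    $date = date_create($clean);
    if ($date === false) {
        return null;
    }

    return date_format($date, 'Y-m-d');
}
```

In the loop above you could call to_iso_date($date_string) and skip or log any post where it returns null, rather than saving it with a wrong date.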
Tagged as: content migration, Drupal, import, parser