Our New Blog – A Groovy Journey

After several years of running our blog on Pebble, we’ve made the move to WordPress, and it’s pretty exciting! But how did we get here? It turns out that migrating a blog from an unsupported platform is not very difficult. All you need is a bit of programming know-how, and a couple of hours later you’ll be migrated!

Pebble stores all of its data in XML files on the server, and WordPress can import data in the WordPress eXtended RSS (WXR) format. XML to XML is pretty straightforward; you just need to pick a language! I figured I would try out Groovy, since it seemed to offer some nice APIs for processing and producing XML.

It took just a couple of minutes to install Groovy v1.8.2 using MacPorts.

I started out by reading in the Pebble XML:

def inputSource = new InputSource(new FileReader("myxmlfile"))
inputSource.setEncoding('UTF-8')
def categories = new XmlSlurper().parse(inputSource)

You’ll notice the use of the InputSource in order to set the encoding. I’m not sure if there was a better way to let the slurper know that this was a UTF-8 XML file, but it was not enough that the XML file itself declared that encoding.
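A possible explanation, going by the SAX InputSource documentation: when the InputSource wraps a character stream (a Reader), the parser ignores both setEncoding and the encoding named in the XML declaration, and a FileReader decodes with the platform default charset. The sketch below shows two alternatives under that assumption. The temporary sample file and its contents are made up for illustration.

```groovy
import groovy.xml.*

// Write a small UTF-8 sample file so the sketch is self-contained (made-up content).
def sample = File.createTempFile('pebble-sample', '.xml')
sample.deleteOnExit()
sample.write("<?xml version='1.0' encoding='UTF-8'?>\n" +
             "<blogEntry><title>Sm\u00f6rg\u00e5sbord</title></blogEntry>", 'UTF-8')

// Option 1: decode the bytes explicitly as UTF-8 before they reach the parser.
def fromReader = sample.withReader('UTF-8') { reader ->
    new XmlSlurper().parse(reader)
}

// Option 2: hand over the File itself, so the parser reads raw bytes
// and can honour the encoding in the XML declaration.
def fromFile = new XmlSlurper().parse(sample)

assert fromReader.title.text() == 'Sm\u00f6rg\u00e5sbord'
assert fromFile.title.text() == 'Sm\u00f6rg\u00e5sbord'
```

Either option avoids depending on the default charset of the machine that runs the script.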

Next up was building the new XML structure. Again pretty simple to start:

def writer = new StringWriter()
def builder = new MarkupBuilder(writer)
// The opening brace of the closure has to sit on the same line as the closing parenthesis
builder.rss(
    'xmlns:excerpt':'http://wordpress.org/export/1.1/excerpt/',
    'xmlns:content':'http://purl.org/rss/1.0/modules/content/',
    'xmlns:wfw':'http://wellformedweb.org/CommentAPI/',
    'xmlns:dc':'http://purl.org/dc/elements/1.1/',
    'xmlns:wp':'http://wordpress.org/export/1.1/') {
  channel { ... }
}

Here’s where I got stuck. I really wanted to write classes that would return the different chunks of XML and then just put everything together, but I couldn’t find a way to do that. So builder became an instance variable in my class, and I split the work into methods, rather than classes, that each add to builder. I still felt that the solution was ugly, but I moved on.
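One possible way to get closer to the class-per-chunk idea, offered as a sketch rather than a definitive answer: MarkupBuilder keeps track of the element it is currently inside, so a helper object can be handed the builder and append its own children at that point. The BlogInfo and Item classes, their writeTo method, and the sample values are hypothetical and not part of the script in this post.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical helper classes: each one appends its own chunk
// to whatever builder it is handed.
class BlogInfo {
    String title
    String link

    void writeTo(builder) {
        builder.title(title)
        builder.link(link)
    }
}

class Item {
    String title
    String body

    void writeTo(builder) {
        builder.item {
            // Calling the builder explicitly avoids any clash with the title property
            builder.title(title)
            builder.'content:encoded' {
                builder.mkp.yieldUnescaped("<![CDATA[${body}]]>")
            }
        }
    }
}

def writer = new StringWriter()
def builder = new MarkupBuilder(writer)
def chunks = [new BlogInfo(title: 'Example Blog', link: 'https://example.org'),
              new Item(title: 'Hello', body: '<p>Hi</p>')]

builder.rss {
    channel {
        // Each chunk is written as a child of <channel>
        chunks.each { it.writeTo(builder) }
    }
}

println writer
```

Since every chunk only depends on the builder it receives, each class could be exercised on its own with a fresh MarkupBuilder and StringWriter.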

The next problem was to create one import file for the entire Pebble export structure. It turns out that wasn’t a problem:

// exportDirectoryRoot is the Pebble export directory; FILES is groovy.io.FileType.FILES (static import)
new File(exportDirectoryRoot).eachFileRecurse(FILES) {
  def fileName = it.name
  if(fileName.endsWith('.xml')) {...}
}

In step one I found the XML files, then I read them in and verified that they were indeed blog entry files:

 def blogEntry = new XmlSlurper().parse(inputSource)
 if( blogEntry.name() == "blogEntry") {...}

That’s all it took! Running the script uncovered some corrupt Pebble files, so I added a bit of primitive error handling that basically ignores the corrupt files:

try {
  def inputSource = new InputSource(new FileReader(it))
  inputSource.setEncoding('UTF-8')
  def blogEntry = new XmlSlurper().parse(inputSource)
  if(isABlogEntry(blogEntry)) buildItem(fileName, blogEntry)
} catch (all) {
  println "could not process: " + it.canonicalPath
}

I already knew which errors I would face, so I didn’t worry about the item being incompletely built, as that was a non-issue for this migration. The final step was outputting my XML to the file system:

File file = new File("converted.xml")
file.write("<?xml version='1.0' encoding='UTF-8'?>\n" + writer.toString())
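One thing worth double-checking here, as a hedged aside: File.write without a charset argument uses the platform default encoding, which may not match the UTF-8 named in the XML declaration. Groovy's File.write also accepts a charset name. The sketch below writes to a temporary file, and rssString is a stand-in for writer.toString().

```groovy
// Stand-in for the generated RSS, with a couple of non-ASCII characters
def rssString = '<rss><channel><title>Sm\u00f6rg\u00e5sbord</title></channel></rss>'

// A temporary file keeps the sketch from overwriting a real converted.xml
def file = File.createTempFile('converted', '.xml')
file.deleteOnExit()

// Write the bytes as UTF-8 so they agree with the XML declaration
file.write("<?xml version='1.0' encoding='UTF-8'?>\n" + rssString, 'UTF-8')

// Reading the file back as UTF-8 returns the original characters
assert file.getText('UTF-8').contains('Sm\u00f6rg\u00e5sbord')
```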

I ran the migration in chunks, and for about 150 entries it takes just a couple of seconds to run. You’ll find the entire class here:

#!/usr/bin/env groovy

import groovy.xml.*
import java.text.SimpleDateFormat
import static groovy.io.FileType.FILES
import org.xml.sax.InputSource

class Converter {
  def commentId = 1
  def exportDirectoryRoot
  def builder

  static main( args) {
    if (args && args.length == 1) {
      def converter = new Converter()
      converter.exportDirectoryRoot = args[0]
      def rssString = converter.buildRssXml()
      converter.save(rssString)
    } else {
      println "Usage: groovy Converter.groovy the-exported-root-pebble-directory"
    }
  }

  def save(rssString) {
    File file = new File("converted.xml")
    // Write as UTF-8 explicitly so the bytes match the encoding named in the XML declaration
    file.write("<?xml version='1.0' encoding='UTF-8'?>\n" + rssString, 'UTF-8')
  }

  def buildRssXml() {
    def writer = new StringWriter()
    builder = new MarkupBuilder(writer)
    builder.rss('xmlns:excerpt':'http://wordpress.org/export/1.1/excerpt/',
                'xmlns:content':'http://purl.org/rss/1.0/modules/content/',
                'xmlns:wfw':'http://wellformedweb.org/CommentAPI/',
                'xmlns:dc':'http://purl.org/dc/elements/1.1/',
                'xmlns:wp':'http://wordpress.org/export/1.1/') {
      channel {
        buildBlogInfo()
        buildAllItems()
      }
    }

    writer.toString()
  }

  def buildBlogInfo() {
    builder.'wp:wxr_version' '1.1'
    builder.title 'Crisp\'s Blog'
    builder.link 'https://blog.crisp.se'
    builder.description 'From the Crisp Consultants'
    //No categories!
    //buildCategories()
  }

  def buildCategories(){
    new File(exportDirectoryRoot).eachFileMatch FILES, ~/categories\.xml/, {
      def inputSource = new InputSource(new FileReader(it))
      inputSource.setEncoding('UTF-8')
      def categories = new XmlSlurper().parse(inputSource)
      if(categories.name() == "categories"){
        categories.category.each { buildCategoryDeclaration(it) }
      }
    }
  }

  def buildCategoryDeclaration(category) {
    if(category.id != "/") {
      builder.'wp:category' {
        'wp:category_nicename' category.tags
        'wp:cat_name' { mkp.yieldUnescaped("<![CDATA[${category.name}]]>") }
      }
    }
  }

  def buildAllItems(){
    new File(exportDirectoryRoot).eachFileRecurse(FILES) {
      def fileName = it.name
      if(fileName.endsWith('.xml')) {
        try {
          def inputSource = new InputSource(new FileReader(it))
          inputSource.setEncoding('UTF-8')
          def blogEntry = new XmlSlurper().parse(inputSource)
          if(isABlogEntry(blogEntry))
            buildItem(fileName, blogEntry)
        } catch (Throwable ex) {
          ex.printStackTrace()
          println "could not process: " + it.canonicalPath
        }
      }
    }
  }

  def isABlogEntry(blogEntry) {
    blogEntry.name() == "blogEntry"
  }

  def buildItem(fileName, blogEntry) {
    builder.item {
      title "${blogEntry.title}"
      pubDate "${blogEntry.date}"
      'content:encoded' {mkp.yieldUnescaped( "<![CDATA[${blogEntry.body}]]>") }
      'excerpt:encoded' {mkp.yieldUnescaped( "<![CDATA[${blogEntry.excerpt}]]>") }
      'wp:post_date' formatWordPressDate("${blogEntry.date}")
      'wp:comment_status' 'open'
      'wp:post_name' formatWordPressPostName(fileName)
      'wp:status' blogEntry.state == "published"? "publish" : "draft"
      'wp:post_type' 'post'
      blogEntry.tags.each { buildTags(it) }
      blogEntry.comment.each { buildComment(it) }
      blogEntry.category.each { buildCategories(it, "post_tag")}
    }
  }

  def formatWordPressPostNiceName(title) {
    title.replaceAll("([^a-zA-Z0-9 ])", "").trim().replaceAll(" ", "_").toLowerCase()
  }

  def formatWordPressPostName(fileName) {
    fileName.minus(".xml")
  }

  //"category" or "post_tag"
  def buildCategories(pebbleCategory, domain){
    if("${pebbleCategory}"?.trim()){
      buildAWordPressCategory(domain, "${pebbleCategory}".minus("/"))
    }
  }

  def buildTags(pebbleTags) {
    if("${pebbleTags}"?.trim()){
      "${pebbleTags}".split(" ").each { buildAWordPressCategory("post_tag",it)}
    }
  }

  def buildAWordPressCategory(domainType, value) {
    builder.category(domain:domainType, nicename:value.toLowerCase()) { mkp.yieldUnescaped( "<![CDATA[" + value + "]]>") }
  }

  def formatWordPressDate(pebbleDate) {
    Date originalDate = new SimpleDateFormat("d MMM yyyy HH:mm:ss:SS Z").parse(pebbleDate)
    return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(originalDate)
  }

  def buildComment(comment){
    builder.'wp:comment' {
      'wp:comment_id' commentId++
      'wp:comment_author' {mkp.yieldUnescaped( "<![CDATA[${comment.author}]]>") }
      'wp:comment_author_email' "${comment.email}"
      'wp:comment_author_url' "${comment.website}"
      'wp:comment_content' {mkp.yieldUnescaped( "<![CDATA[${comment.body}]]>") }
      'wp:comment_author_IP' "${comment.ipAddress}"
      'wp:comment_date' formatWordPressDate("${comment.date}")
      'wp:comment_approved' comment.state == "approved"? "1" : "0"
    }
  }
}

What would I do differently? I would definitely use TDD even for this little one-time script. I slid too easily from writing a little bit of code to understand Groovy, to a working script! Everybody makes mistakes, and some tests for formatting differences between Pebble and WordPress would have been great, just to make sure everything ended up where it was supposed to be, instead of discovering issues by doing an import to WordPress. I’m also still not happy that I wasn’t able to break up my class and functionality, it makes more advanced error handling much more difficult.
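As an illustration of the kind of formatting test meant here, a minimal sketch using plain Groovy assert statements. It is a variation of formatWordPressDate, not the method from the class above: the locale and the output time zone are pinned (both are assumptions added for the test) so the result does not depend on the machine running it. The sample dates are made up to fit the parse pattern used in the class.

```groovy
import java.text.SimpleDateFormat

// Variation of formatWordPressDate with locale and time zone made explicit
def formatWordPressDate(String pebbleDate, TimeZone zone) {
    def input = new SimpleDateFormat('d MMM yyyy HH:mm:ss:SS Z', Locale.ENGLISH)
    def output = new SimpleDateFormat('yyyy-MM-dd HH:mm:ss', Locale.ENGLISH)
    output.timeZone = zone
    output.format(input.parse(pebbleDate))
}

def utc = TimeZone.getTimeZone('UTC')

// 10:15:30 at +0200 is 08:15:30 in UTC
assert formatWordPressDate('12 Oct 2011 10:15:30:0 +0200', utc) == '2011-10-12 08:15:30'

// Crossing midnight also moves the date back a day (and here, a year)
assert formatWordPressDate('1 Jan 2010 00:05:00:0 +0100', utc) == '2009-12-31 23:05:00'
```

Pinning Locale.ENGLISH matters for the month abbreviation: with a different default locale, parsing "Oct" could fail.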

So, that’s all it took! Here we are with a WordPress blog. I would love to hear about best practices using Groovy, and how I could have written a more “beautiful” script. Please share your ideas!

Get in touch via my homepage if you have questions or comments!
