Working with large XML files

I recently ran up against a large XML file and the legacy code that manipulated it. The code had mysteriously stopped working, there were no unit tests, and the whole XML file sat on a single line with no line breaks. What do you do in this situation?

Loading the 30MB file in any kind of editor made the editor slow, and trying to format the file left the editors unresponsive. There’s no point in running grep on a one-line XML file, because every match returns the entire file. Writing a cleanup script to add line breaks would make life better, but why reinvent the wheel? Googling yielded several Windows programs at first, but digging further and asking around on social networks finally produced two fantastic tools for Unix/Linux systems!

xmllint
The first and easiest tool to use turns out to be xmllint, which is part of libxml2! It’s most likely already available with your Unix/Linux distribution; just type xmllint at the command prompt to check. Running xmllint --format my_file.xml > my_file_formatted.xml will add line breaks and indentation where they make sense in your XML file. Now you can easily grep the file, and you’re halfway there!
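If xmllint is not at hand, a similar reformatting can be sketched in Python with only the standard library. This is a minimal sketch, not a replacement for xmllint: the function name format_xml and the sample XML are made up for illustration, and minidom loads the whole document into memory, so it is only reasonable for files that fit comfortably in RAM.

```python
import xml.dom.minidom


def format_xml(xml_text: str, indent: str = "  ") -> str:
    """Return xml_text re-serialized with line breaks and indentation.

    Hypothetical helper for illustration; the whole document is parsed
    into memory first.
    """
    dom = xml.dom.minidom.parseString(xml_text)
    pretty = dom.toprettyxml(indent=indent)
    # toprettyxml can emit whitespace-only lines when the input already
    # contains some formatting, so drop them.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


if __name__ == "__main__":
    # Made-up one-line document standing in for the real file.
    one_line = (
        "<catalog><book id='1'><title>A</title></book>"
        "<book id='2'><title>B</title></book></catalog>"
    )
    print(format_xml(one_line))
```

Once every element starts on its own line, grep results point at a specific line instead of returning the entire file.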

XML-Twig
This Perl module was the icing on the cake! You can download it here: http://search.cpan.org/dist/XML-Twig/ and then all you need to do is install it using the provided Makefile.PL and you’re up and running. How did this help? Well, among the many tools that ship with XML-Twig there’s one called xml_split. Just type xml_split my_file.xml and it will split your large, unreadable file into smaller, manageable ones. You can find more information about the usage here: http://search.cpan.org/dist/XML-Twig/tools/xml_split/xml_split
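To illustrate the idea of splitting, here is a rough Python sketch that writes each direct child of the root element to its own file while streaming through the input. It is not how xml_split is implemented; the function name split_top_level and the part-01.xml naming scheme are assumptions made for this example. Note also that ElementTree may rewrite namespace prefixes when it serializes.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_top_level(source, out_dir, prefix="part"):
    """Write each direct child of the root element to its own file.

    Hypothetical helper: returns the list of paths written, in
    document order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    depth = 0
    root = None
    # iterparse streams the file, so the full tree is never held at once.
    for event, elem in ET.iterparse(str(source), events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element seen is the document root
            depth += 1
        else:
            depth -= 1
            if depth == 1:  # a direct child of the root has just closed
                path = out_dir / f"{prefix}-{len(written) + 1:02d}.xml"
                ET.ElementTree(elem).write(
                    str(path), encoding="utf-8", xml_declaration=True
                )
                written.append(path)
                root.clear()  # drop finished children to keep memory low
    return written
```

Calling split_top_level("my_file.xml", "parts") would leave one small file per top-level element in the parts directory, each of which opens quickly in an editor.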

With just these two tools it’s easy to write unit tests against smaller sections of the XML file. It’s also easier to find the problematic sections, make changes, and test the fix!
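A unit test against one small section might look like the sketch below. Everything in it is hypothetical: extract_price stands in for whatever the legacy code actually does, and the book fragment is invented rather than taken from the real file.

```python
import unittest
import xml.etree.ElementTree as ET

# Invented fragment, standing in for one piece cut out of the big file.
FRAGMENT = '<book id="42"><title>Example</title><price>9.99</price></book>'


def extract_price(book_xml: str) -> float:
    """Hypothetical stand-in for the legacy code under test."""
    book = ET.fromstring(book_xml)
    # findtext returns None when <price> is missing, and float(None)
    # raises TypeError.
    return float(book.findtext("price"))


class ExtractPriceTest(unittest.TestCase):
    def test_reads_price_from_fragment(self):
        self.assertEqual(extract_price(FRAGMENT), 9.99)

    def test_missing_price_raises(self):
        with self.assertRaises(TypeError):
            extract_price("<book id='1'><title>No price</title></book>")
```

Saved in a test module, this can be run with python -m unittest. Each fragment that triggers the bug can then become its own small test case.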

What tools do you use?

Get in touch via my homepage if you have questions or comments!

6 responses on “Working with large XML files”

  1. Looks nice, but definitely didn’t meet my definition of easy on my Mac. I was missing the pcre headers and the boost headers, and half an hour later I’m still not up and running. The xerces-c headers are the latest thing I’m missing. I’ll have to finish the install when I have the time and then I’ll try it out. Thanks for the suggestion!

  2. Absolutely! My problem was that there were no line breaks in the files. Once I had the files formatted nicely they were definitely manageable in Emacs 🙂

  3. Excellent info, just in time for me. Using XML::Twig to stream-format a huge document. Thanks for mentioning this and thanks to the author of XML::Twig.
