Going Zotero: A reflection on XML and interoperability

When I was younger, and keen as hell about XML as the solution to everything, and working on my PhD, I wrote a bibliographic reference management system. This was circa 2002, and I badly needed to procrastinate from working on my dissertation. There’s nothing like being productive on another project to make you feel good about putting something off. I was juggling a couple of hundred references, plus notes. I looked at the options available at the time (EndNote, RefWorks) and was not impressed with them, or with any other off-the-shelf reference manager. So I wrote my own. I looked at how some of the other systems worked, and made one that was ‘better.’

My tool was called zNote, because it ran on the Zope web development framework. It had two views: a form-based editor and a modular set of formatted outputs (I had two), and it stored all its data in an XML document type that I designed. I was pretty proud of it, and felt it was better than anything else out there. I developed it basically to serve my own needs, but I did release the code under an open license, and even attracted one collaborator, a guy who contributed a set of XSLT stylesheets for doing more output formats. I continued using it until I finished my degree, and then never particularly went back to it. Zope faded away as a platform, and every paper I’ve written since has had its own ad-hoc bibliography, even though I know better.

A couple of years ago I got wind of Zotero, the browser-based, open source reference manager that was developed at George Mason University, and which has gathered an enormous community of users. I installed it, and liked how it worked, but never really got around to using it seriously. Part of my reluctance was that I had stopped writing in a word processor, and so the automatic formatting hooks didn’t work for me anymore.

That changed for me last year when I started working seriously in Markdown, with Pandoc, because Pandoc has a lovely bibliography formatter built in. All it needs is a structured source of references, and it can pull them in and do all the nice magic. So I started figuring out how to do it right. This spring, I started moving all my references into Zotero, working from my current writing backwards. I got through a few articles this way.

I really wanted to go back and retrieve all those references from my zNote collection – about 450 or so. I figured it wouldn’t be too hard, and it wasn’t. I spent a couple of hours on it this afternoon and am happy to say I have joined the twenty-first century.

XML saves the day, right? Yes and no… my zNote tool used XML as its native format, and I had a big XML file with everything in it, so it was easy enough to get started. I needed to produce BibTeX, a venerable bibliographic format that is supported by just about everything. BibTeX is not an XML-based format, but it’s not too much of a jump. In fact, when I was working on the conversion today, I realized that I must have looked at BibTeX back in the day, because the basic architecture of zNote was fairly similar. My XML format was much more verbose and explicit, with nested descriptive element names for everything, whereas BibTeX is just a set of key-value pairs, with some cleverly nested curly braces for things like names.

With a series of regular expressions, I soon had my zNote file converted to a BibTeX file that Zotero could import. I felt pretty good about the usefulness of that original XML. But then I looked at my BibTeX file, and realized that I now had more or less all of my information in a simpler and more readable format – and a format spoken by literally dozens of different systems. Turns out the XML wasn’t that big a deal. Having the data in an open, well-described, structured format was essential. The fact that it had pointy brackets and nested descriptive tags didn’t really amount to much in the long run.
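For the curious, here is a minimal sketch of what a regular-expression conversion of this kind could look like in Python. The zNote schema isn’t reproduced in this post, so the element names below (`reference`, `author`, `surname`, `forename`, `title`, `year`, `publisher`) and the sample record are hypothetical placeholders, not the real format, and the function name `to_bibtex` is made up for illustration.

```python
import re

# Hypothetical zNote-style record: the element names and the data are
# placeholders, since the real zNote schema is not shown in this post.
SAMPLE = """
<reference id="doe2002" type="book">
  <author><surname>Doe</surname><forename>Jane</forename></author>
  <title>An Example Title</title>
  <year>2002</year>
</reference>
"""

# One pattern per structural level: whole record, author names, simple fields.
RECORD = re.compile(
    r'<reference id="([^"]+)" type="([^"]+)">(.*?)</reference>', re.S
)
AUTHOR = re.compile(
    r"<author>\s*<surname>(.*?)</surname>\s*<forename>(.*?)</forename>\s*</author>",
    re.S,
)
FIELD = re.compile(r"<(title|year|publisher)>(.*?)</\1>", re.S)


def to_bibtex(xml_text):
    """Turn every (hypothetical) <reference> record into a BibTeX entry."""
    entries = []
    for key, kind, body in RECORD.findall(xml_text):
        # BibTeX joins multiple authors with " and ".
        authors = " and ".join(f"{s}, {f}" for s, f in AUTHOR.findall(body))
        fields = [("author", authors)] if authors else []
        fields += FIELD.findall(body)
        lines = ",\n".join(
            f"  {name} = {{{value.strip()}}}" for name, value in fields
        )
        entries.append(f"@{kind}{{{key},\n{lines}\n}}")
    return "\n\n".join(entries)


if __name__ == "__main__":
    print(to_bibtex(SAMPLE))
```

Regular expressions are fine for a one-off job on a file whose shape you know by heart; for anything messier, a real XML parser would be the safer choice.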

As I’ve argued before, on the web, interoperability trumps technical correctness. The point of XML was repurposability, because back in the day, that was really hard to do. Today, what really drives value is the ability for stuff to flow easily online, from system to system. Simple, open, common formats (like BibTeX, like HTML) work better than bespoke formats, even when the latter arguably do a better job of describing the data.

Now I have all my old refs (and notes!) in Zotero. Zotero does a nice job of giving me a UI on all that data; I can tag it and sort it and collect it and format it (Zotero supports hundreds of export formats). Most importantly, I can easily flow it out in any format I like. Via BibTeX[1], I can integrate it right into Pandoc, so my reference collection is on tap as I write. When the next great reference manager comes along, it’ll go there too. Interoperable formats are great… whether they’re XML or not.

  1. Note that Pandoc’s citation processor can handle about a dozen different reference formats. I tried a bunch of them and came to the conclusion that good old BibTeX is the easiest to use, because Zotero automatically makes nice, straightforward (custom) ‘citekeys’ for BibTeX references. E.g., “@mcluhan1965”.
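To make the citekey idea concrete, here is a small Python sketch of how an author-plus-year key of that shape might be derived. This is not Zotero’s actual algorithm, which I haven’t looked at; the function name `citekey` and the normalization rules are my own assumptions.

```python
import re
import unicodedata


def citekey(surname, year):
    """Build an author-year key such as 'mcluhan1965'.

    Assumed rules (not Zotero's own): strip accents, drop anything that
    is not an ASCII letter, lowercase, then append the year.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", surname)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    letters = re.sub(r"[^A-Za-z]", "", ascii_name).lower()
    return f"{letters}{year}"


if __name__ == "__main__":
    print("@" + citekey("McLuhan", 1965))
```

In a Pandoc document, a key like this is what follows the `@` in a citation.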