Why do journals provide broken BibTeX files?

Tech-stuff Science · 24 February 2013

I have already written about my love for BibTeX and BibDesk for referencing. One of the best things about them is that you can often fetch bibliographic information straight from the journals’ websites, Google Scholar or other sources.

However, whilst this is great most of the time, there are many cases where journals deliver broken BibTeX that BibDesk cannot parse, or that is simply not well formed. What follows is a brief rant on journals’ low-quality BibTeX downloads and a few general observations.

When downloading BibTeX citations, illegal characters often appear in the cite keys and names, or brackets are missing, which throws errors. Sometimes BibDesk can fix them; sometimes it cannot parse them at all. Often the URL field also contains a DOI URL, which duplicates the DOI field. This leads packages like apacite to create false URL fields. I have also noticed that, particularly in Cambridge University Press journals, names or titles are in all caps, which doesn’t go too well with certain BibTeX packages either. Often file extensions are missing, confusing Opera (but apparently not Chrome). Lastly, the cite keys provided by journals are often completely useless. For those not using BibTeX, these are the nicknames you assign to an article, so that when you reference it later, you just call it by its nickname, either by writing it out or by looking it up in LyX. So you want them to be short and memorable, usually the first author and a date, maybe the beginning of the title. But what’s BJDP:BJDP2052? I’m sure you guessed it:

Wang, B., Low, J., Jing, Z., & Qinghua, Q. (2012). Chinese preschoolers’ implicit and explicit false-belief understanding. British Journal of Developmental Psychology, 30(1), 123–140. doi: 10.1111/j.2044-835X.2011.02052.x
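To make the cite key and duplicate-URL complaints concrete, here is a minimal Python sketch of the clean-up one might do by hand. The function names, the author-year-first-word key scheme and the `url` value in the sample entry are my own assumptions for illustration; this is not what BibDesk or any journal actually does.

```python
import re
import unicodedata


def make_cite_key(first_author_surname, year, title=""):
    """Build a short key such as 'wang2012chinese' (hypothetical scheme)."""
    def ascii_slug(text):
        # Strip accents, then keep only ASCII letters and digits.
        decomposed = unicodedata.normalize("NFKD", text)
        ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
        return re.sub(r"[^A-Za-z0-9]", "", ascii_only).lower()

    words = title.split()
    first_word = ascii_slug(words[0]) if words else ""
    return f"{ascii_slug(first_author_surname)}{year}{first_word}"


def drop_duplicate_url(fields):
    """Return a copy without 'url' when it merely repeats the DOI."""
    cleaned = dict(fields)
    doi = cleaned.get("doi", "").strip().lower()
    url = cleaned.get("url", "").strip().lower()
    if doi and "doi.org/" in url and url.endswith(doi):
        del cleaned["url"]
    return cleaned


# Sample entry based on the article above; the 'url' value is an assumed
# example of a DOI resolver link, not copied from the journal's file.
entry = {
    "author": "Wang, B. and Low, J. and Jing, Z. and Qinghua, Q.",
    "year": "2012",
    "title": "Chinese preschoolers' implicit and explicit false-belief understanding",
    "doi": "10.1111/j.2044-835X.2011.02052.x",
    "url": "http://dx.doi.org/10.1111/j.2044-835X.2011.02052.x",
}

print(make_cite_key("Wang", 2012, entry["title"]))
print(sorted(drop_duplicate_url(entry)))
```

A key built this way is at least something you can remember and type, and a URL that points somewhere other than the DOI resolver is left alone.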

All this is annoying, particularly since BibTeX is a well-documented file format, and these things should be easy to fix. But they happen too frequently to write bug reports to the respective journals, and I am not sure whom to address them to either. I have made a couple of observations, although I haven’t quantified them by actually counting:

Older articles often have more errors than newer ones.
Open Access journals like PLoS are generally better at providing correct citations. (But that might follow from the previous point, given that OA journals tend to be younger as well.)
Google Scholar’s output parses more reliably, but it is consistently missing information like the DOI and the first names. (The former is now an APA requirement, and the latter helps disambiguate authors with common surnames, e.g. Yu, Clark, and Smith.)

Curiously, when a BibTeX file isn’t parsing, you might as well try the RIS. RIS is a referencing format used by a wide range of reference management software, but its standard is not as clearly specified. BibDesk reads RIS as well, and actually does a pretty good job. With a few exceptions (fewer errors than I get with BibTeX), it gets all the bibliographic information right, and even generates useful cite keys, consisting of first author:year. So I found myself downloading RIS files more often than I fetched the BibTeX. But why is this? I mused about this on Twitter, and got a reply from Adam R. Maxwell, one of the contributors to the BibDesk software:

“@antipattern Right; there are a lot of heuristics in BibDesk to deal with badly formed RIS. The BibTeX parser is strict to avoid data loss.” (Adam R. Maxwell, @maxwellarm, February 14, 2013)

“@antipattern in contrast, the RIS code is ad hoc and has workarounds for various bastardized RIS forms that have been seen in the wild” (Adam R. Maxwell, @maxwellarm, February 15, 2013)
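The difference between a strict parser and one full of workarounds can be sketched in a few lines of Python. This is purely an illustration of the idea and not BibDesk's actual code: both function names and the particular workarounds (ignoring case and spacing around the hyphen, folding unrecognised lines into the previous field) are assumptions of mine. The only thing taken from the RIS format itself is the usual line layout of a two-character tag, two spaces, a hyphen and a space.

```python
import re

# Canonical RIS line: two-character tag, two spaces, hyphen, space, value.
STRICT_LINE = re.compile(r"^([A-Z][A-Z0-9])  - (.*)$")
# Tolerant variant: any case, any amount of whitespace around the hyphen.
LENIENT_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9])\s*-\s*(.*)$")


def parse_ris_strict(text):
    """Raise ValueError on the first line that is not canonical RIS."""
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        match = STRICT_LINE.match(line)
        if match is None:
            raise ValueError(f"line {number} is not valid RIS: {line!r}")
        records.append((match.group(1), match.group(2).strip()))
    return records


def parse_ris_lenient(text):
    """Salvage what can be read; treat unrecognised lines as continuations."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = LENIENT_LINE.match(line)
        if match is not None:
            records.append((match.group(1).upper(), match.group(2).strip()))
        elif records:
            # No tag found: append the text to the previous field's value.
            tag, value = records[-1]
            records[-1] = (tag, f"{value} {line.strip()}".strip())
    return records


# A deliberately sloppy record: lower-case tag, missing space, wrapped title.
messy = (
    "ty - JOUR\n"
    "AU  -Wang, B.\n"
    "TI  - Chinese preschoolers' implicit and explicit\n"
    "   false-belief understanding\n"
    "PY  - 2012\n"
    "ER  - "
)

for tag, value in parse_ris_lenient(messy):
    print(tag, value)
```

Calling `parse_ris_strict(messy)` raises a `ValueError` on the very first line, whereas the lenient version recovers every field. The price of leniency is that it has to guess, and a wrong guess silently corrupts the data, which is presumably what being strict "to avoid data loss" is about.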

So there you have it: for those familiar with HTML and browser engines, BibDesk parses BibTeX in strict mode, and thus breaks down whenever it encounters an error. When reading RIS, however, it switches to quirks mode, and therefore still renders broken RIS files. The fault ultimately lies with the journals, which provide malformed reference files. This should be easy to fix: it just means escaping special characters properly and providing better naming conventions. Since the bibliographic information is already there, it should be easy to generate functioning BibTeX and RIS files.
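As a rough idea of how little is needed on the publisher's side, here is a minimal Python sketch that turns metadata a journal already holds into a BibTeX entry. Everything in it is an assumption for illustration: the function names, the set of characters allowed in a cite key, and the choice to leave `doi` and `url` unescaped (many styles typeset these verbatim). It also assumes the values contain no LaTeX markup of their own, and it leaves non-ASCII characters untouched, which suits UTF-8-aware toolchains but not necessarily classic BibTeX.

```python
import re

# Characters LaTeX treats specially in running text, with escaped forms.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}

# Fields assumed to be typeset verbatim by the bibliography style.
VERBATIM_FIELDS = {"doi", "url"}


def escape_latex(text):
    """Escape each special character once, working character by character."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def format_bibtex(entry_type, cite_key, fields):
    """Render one entry; reject cite keys outside a conservative character set."""
    if not re.fullmatch(r"[A-Za-z0-9:_.\-]+", cite_key):
        raise ValueError(f"unsafe cite key: {cite_key!r}")
    lines = [f"@{entry_type}{{{cite_key},"]
    for name, value in fields.items():
        if name not in VERBATIM_FIELDS:
            value = escape_latex(value)
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


fields = {
    "author": "Wang, B. and Low, J. and Jing, Z. and Qinghua, Q.",
    "title": "Chinese preschoolers' implicit and explicit false-belief understanding",
    "journal": "British Journal of Developmental Psychology",
    "year": "2012",
    "volume": "30",
    "number": "1",
    "pages": "123--140",
    "doi": "10.1111/j.2044-835X.2011.02052.x",
}
print(format_bibtex("article", "wang2012chinese", fields))
```

Escaping one character at a time avoids the classic mistake of escaping the backslashes that an earlier replacement has just inserted.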

On a completely different note, has anyone considered extending the PDF standard to contain bibliographic information? That would make things a lot easier!
