Hack4ac: Text-mining and analyzing author contributions in PLOS articles
I had an awesome Saturday at Hack4ac! It was all about the amazing things that we can do with open access to scientific content.
In particular, the focus was on papers and data licensed under the CC-BY license, which lets you not only read the papers yourself, but also write programs to efficiently mine them for connections and clusters and patterns. It seems obvious that you should be able to mine anything you can read, but that's unfortunately not the case — it may require special permissions from the copyright holder (publisher), who may drag their heels. The CC-BY license is one way of making sure that our right to mine is protected.
And fortunately there's a lot of CC-BY content out there already! My team's task was to mine detailed author contributions from open access papers on PLOS and PeerJ — these data record which authors did which jobs on each paper. Most PLOS papers have a semi-structured author contributions section, like:
Conceived and designed the experiments: HQ JKC AR NH. Performed the experiments: HQ JKC AR MP. Analyzed the data: HQ JKC AR MP NH. Contributed reagents/materials/analysis tools: CH. Wrote the paper: HQ JKC AR NH.
and PeerJ exposes similar data in a structured form through their API.
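To give a flavour of what parsing that semi-structured text involves, here is a minimal Python sketch (not the code we wrote on the day). The function name `parse_contributions` is made up for illustration, and it assumes every statement has the shape "Role: INITIALS INITIALS." as in the example above; real sections are messier than this.

```python
import re

def parse_contributions(text):
    """Map each role to the list of author initials credited with it.

    Assumes each statement looks like "Role description: AB CD EF."
    (an assumption based on the single example quoted above).
    """
    roles = {}
    # Role = everything up to the colon; initials = everything up to the next full stop.
    for match in re.finditer(r"([^:.]+):\s*([^.]+)\.", text):
        role = match.group(1).strip()
        initials = match.group(2).split()
        roles[role] = initials
    return roles

example = (
    "Conceived and designed the experiments: HQ JKC AR NH. "
    "Performed the experiments: HQ JKC AR MP. "
    "Analyzed the data: HQ JKC AR MP NH. "
    "Contributed reagents/materials/analysis tools: CH. "
    "Wrote the paper: HQ JKC AR NH."
)

for role, authors in parse_contributions(example).items():
    print(role, "->", authors)
```

A regular expression is enough for the tidy case, but it would likely need extra handling for initials containing full stops or for free-text contribution statements.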
In six short hours, our team wrote programs to download and parse these author contribution data for about half of the 80 thousand articles on PLOS. All of our code to download the data, process them, and present the results is open source. We used GitHub both to collaborate on the code and to publish our results.
I worked with mfenner to get the PLOS data into R for analysis. We used RStudio in a literate programming style, with both R code and Markdown documentation in the same file. RStudio uses the knitr package to take the fairly nice code (we were in a hurry!) and produce even nicer, more readable output. The result is a document that contains the final figures and numbers and all of the code needed to produce them from the raw data, which improves transparency and makes it more likely that minor errors will get spotted quickly.
My favourite graph answers the question: if you did only one thing on a paper, what was it?
Looks like it's usually ‘Performed the experiments’, which brings back memories of late nights in the (computer) lab! It's also interesting that so many people just wrote the paper and did nothing else; I wonder how that worked.
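The counting behind a graph like that is simple enough to sketch. This is an illustrative Python version, not our R analysis: the function name `single_role_counts` and the role-to-initials dictionary layout are assumptions made for the example, and the one paper in it is the sample contributions section quoted earlier.

```python
from collections import Counter

def single_role_counts(papers):
    """Tally, across papers, the role held by each author who has exactly one role.

    `papers` is a list of dicts mapping role -> list of author initials
    (a hypothetical layout chosen for this sketch).
    """
    tally = Counter()
    for roles in papers:
        # Invert role -> authors into author -> set of roles for this paper.
        by_author = {}
        for role, authors in roles.items():
            for author in authors:
                by_author.setdefault(author, set()).add(role)
        # Count only the authors credited with a single role.
        for author_roles in by_author.values():
            if len(author_roles) == 1:
                tally[next(iter(author_roles))] += 1
    return tally

paper = {
    "Conceived and designed the experiments": ["HQ", "JKC", "AR", "NH"],
    "Performed the experiments": ["HQ", "JKC", "AR", "MP"],
    "Analyzed the data": ["HQ", "JKC", "AR", "MP", "NH"],
    "Contributed reagents/materials/analysis tools": ["CH"],
    "Wrote the paper": ["HQ", "JKC", "AR", "NH"],
}

print(single_role_counts([paper]))
```

In that sample paper only CH did a single job, so the tally has one entry; run over thousands of papers, the same loop gives the bars of the graph.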
Of course, we're only just scratching the surface here. Do papers with a more even division of labour get more or fewer citations? How do the roles of authors change as they progress through their careers? Fork the code and find out!
Also be sure to check out ScienceGist, a website for crowd-sourced summaries of papers, which won the day (we came second!). It's an awesome idea!