Tim Stuart     About     Archive     Publications     Feed

bioRxiv

A look at bioRxiv preprints

Tim Stuart
2 March 2016

After posting a my first preprint to bioRxiv a few weeks ago, I have been periodically checking the number of views and PDF downloads. I became interested to see how many downloads or views the preprints on bioRxiv typically get, but this type of information isn’t actually available. What are the all-time top bioRxiv preprints? How many people are reading bioRxiv preprints on average? No-one knows! Altmetric must track this data, as it will tell you how a particular preprint ranks in relation to others, but that data hasn’t been made publicly available (as far as I can tell).

Not to worry, a little python script I wrote will scrape this information from the biorxiv website and record it in a file. I ran it on the 29th February this year and collected a snapshot of biorxiv metrics at that time.

Let’s take a look at the data.

library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)

data <- read_tsv("./data/biorxiv_data_2016_2_29.tsv")
collection_date <- ymd("2016_2_29")

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':	3095 obs. of  6 variables:
##  $ Title              : chr  "Copy number variants in the sheep genome detected using multiple approaches" "Productive infection of field strains of avian coronavirus infectious bronchitis virus in chicken peripheral blood-derived mono"| __truncated__ "Recurrent selection explains genomic regions of high relative but low absolute differentiation in the greenish warbler ring spe"| __truncated__ "A watershed model of individual differences in fluid intelligence" ...
##  $ URL                : chr  "http://biorxiv.org/content/early/2016/02/26/041475" "http://biorxiv.org/content/early/2016/02/26/041558" "http://biorxiv.org/content/early/2016/02/26/041467" "http://biorxiv.org/content/early/2016/02/26/041368" ...
##  $ Abstract views     : int  131 NA 321 70 161 570 714 276 428 318 ...
##  $ PDF views          : int  10 NA 14 3 8 102 106 45 65 42 ...
##  $ Original submission: chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  $ Current submission : chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 6
##   .. ..$ Title              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ URL                : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Abstract views     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PDF views          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Original submission: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Current submission : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

As you can see, the python script gathers data on the title, url, number of abstract views, number of PDF downloads, original upload date, and the date that the current version was uploaded (this will be the same as the original upload date in many cases).

Let’s add the age of the preprint and whether it was revised (true/false).

data <-  data %>%
  mutate(Age = collection_date - ymd(`Original submission`),
         Revised = `Original submission` != `Current submission`)

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':	3095 obs. of  8 variables:
##  $ Title              : chr  "Copy number variants in the sheep genome detected using multiple approaches" "Productive infection of field strains of avian coronavirus infectious bronchitis virus in chicken peripheral blood-derived mono"| __truncated__ "Recurrent selection explains genomic regions of high relative but low absolute differentiation in the greenish warbler ring spe"| __truncated__ "A watershed model of individual differences in fluid intelligence" ...
##  $ URL                : chr  "http://biorxiv.org/content/early/2016/02/26/041475" "http://biorxiv.org/content/early/2016/02/26/041558" "http://biorxiv.org/content/early/2016/02/26/041467" "http://biorxiv.org/content/early/2016/02/26/041368" ...
##  $ Abstract views     : int  131 NA 321 70 161 570 714 276 428 318 ...
##  $ PDF views          : int  10 NA 14 3 8 102 106 45 65 42 ...
##  $ Original submission: chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  $ Current submission : chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  $ Age                :Class 'difftime'  atomic [1:3095] 3 3 3 3 3 3 3 3 26 4 ...
##   .. ..- attr(*, "units")= chr "days"
##  $ Revised            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

Submissions over time

bioRxiv has supposedly been growing in popularity, let’s take a look at the rate of submissions over time.

First group into weeks:

weekly <- data %>%
  mutate(weeks_past = ceiling(Age / 7),
         `Submission week` = collection_date - weeks(weeks_past)) %>%
  group_by(`Submission week`) %>%
  summarise(Submissions = n())

ggplot(weekly, aes(`Submission week`, Submissions)) +
  geom_point(stat = "identity") +
  geom_smooth() +
  ggtitle("bioRxiv submissions per week") +
  theme_bw()

So the rate of submission really is increasing.

Some bioRxiv metrics

bioRxiv allows you to submit a revision at any time (until the paper is published in a peer-reviewed journal). How many preprints are revised at some stage?

data %>%
  group_by(Revised) %>%
  summarise(Count = n(), Percentage = Count/nrow(data) * 100) %>%
  knitr::kable(., align = 'c')

Revised Count Percentage ——— ——- ———— FALSE 2204 71.21163
TRUE 891 28.78837

So only 28% are ever revised.

What’s the average number of abstract views and PDF downloads for preprints?

abstr_mn <- mean(data$`Abstract views`, na.rm = T)
abstr_med <- median(data$`Abstract views`, na.rm = T)
pdf_mn <- mean(data$`PDF views`, na.rm = T)
pdf_med <- median(data$`PDF views`, na.rm = T)

The mean number of abstract views was 485, and the median was 297. For PDF views, it was 134 and 64.

The mean seems to be pulled up by some outliers with very high numbers of views. Let’s take a look at what those are.

Top 10 bioRxiv preprints

data %>%
  arrange(-`PDF views`) %>%
  select(Title, `Abstract views`, `PDF views`, Age, Revised) %>%
  head(., 10) %>%
  knitr::kable(., align = 'l')

Title | Abstract views | PDF views | Age | Revised ———————————————————————————————————————————- ————— ———- ——— ——– Massive migration from the steppe is a source for Indo-European languages in Europe |17385 |15170 | 384 days | FALSE
Revised estimates for the number of human and bacteria cells in the body |23174 |9625 | 54 days | FALSE
Accelerating Scientific Publication in Biology |12206 |5504 | 233 days | TRUE
The genome of the tardigrade Hypsibius dujardini |22074 |4084 | 90 days | TRUE
Real time selective sequencing using nanopore technology. |6906 |4039 | 26 days | FALSE
Simple multi-trait analysis identifies novel loci associated with growth and obesity measures |4362 |2905 | 236 days | FALSE
A vision for ubiquitous sequencing |2435 |2735 | 298 days | TRUE
TP53 copy number expansion correlates with the evolution of increased body size and an enhanced DNA damage response in elephants |12275 |2502 | 146 days | TRUE
Reconstructing Genetic History of Siberian and Northeastern European Populations |2079 |2410 | 134 days | TRUE
Genome variation and meiotic recombination in Plasmodium falciparum: insights from deep sequencing of genetic crosses |559 |2408 | 206 days | TRUE

All interesting papers, most of which are still not published in journals.

Distribution of views

gather(data=data,key=,value=,na.rm=FALSE,) Let’s first filter out preprints less than 10 days old, as these probably have an artificially low number of views.

older_than_10 <- filter(data, Age > 10)
quantile(older_than_10$`PDF views`, na.rm =TRUE)
##    0%   25%   50%   75%  100% 
##     0    33    65   127 15170
ggplot(older_than_10, aes(`PDF views`)) +
  geom_histogram(bins = 100) + theme_bw()

This would be better looked at on a log scale:

ggplot(older_than_10, aes(log(`PDF views`))) +
  ggtitle("PDF views") +
  geom_density(fill = 'blue', alpha = 0.5) + theme_bw() +
  geom_vline(xintercept = log(pdf_med), col = 'red') +
  geom_vline(xintercept = log(331))  # my paper

ggplot(older_than_10, aes(log(`Abstract views`))) +
  ggtitle("Abstract views") +
  geom_density(fill = 'blue', alpha = 0.5) + theme_bw() +
  geom_vline(xintercept = log(abstr_med), col = 'red') +
  geom_vline(xintercept = log(1678))  # my paper

The red line in both plots show the median, the black line show the datapoint for my own preprint posted a few weeks ago. It seems to have done slightly better than most others.

There’s plenty more I could look at, but for now this satisfies my basic curiosity. I’ve posted the data and code on GitHub, so interested people can download and play with it themselves.