bioRxiv 2017 update

I first looked at the bioRxiv submission data back in March 2016. A lot has changed since then, and bioRxiv has grown nearly 5-fold. Time for an update.

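library(data.table)  # fread()
library(dplyr)       # %>%, mutate(), group_by(), summarise(), ...
library(lubridate)   # ymd(), weeks()
library(ggplot2)     # plotting

collection_date <- ymd("2017-10-04")

# Age is the time between original submission and the collection date, in days
dat <- fread("~/Documents/GitHub/biorxivData/data/biorxiv_data_2017_10_04.tsv") %>%
  mutate(Age = collection_date - ymd(`Original submission`),
         Revised = `Original submission` != `Current submission`)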

Submissions over time

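# Bin submissions into 7-day windows counted back from the collection date
weekly <- dat %>%
  mutate(weeks_past = ceiling(as.numeric(Age, units = "days") / 7),
         `Submission week` = collection_date - weeks(weeks_past)) %>%
  group_by(`Submission week`) %>%
  summarise(Submissions = n())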

ggplot(weekly, aes(`Submission week`, Submissions)) +
  geom_point() +
  geom_smooth() +
  ggtitle("bioRxiv submissions per week")

Last year, weekly submissions peaked at around 60. Now the peak is five times higher, hitting 300 per week earlier in 2017.
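
The week binning behind the plot is easy to check on a toy example. This is a minimal sketch in base R; the `toy_dates` vector is invented for illustration and is not bioRxiv data. Each submission's age in days is rounded up to whole weeks and then counted back from the collection date.

```r
# Toy illustration of the weekly binning; the dates are invented.
collection_date <- as.Date("2017-10-04")
toy_dates <- as.Date(c("2017-10-03", "2017-09-28", "2017-09-27", "2017-09-20"))

age_days   <- as.numeric(collection_date - toy_dates)  # age of each submission in days
weeks_past <- ceiling(age_days / 7)                    # round up to whole weeks
week_start <- collection_date - 7 * weeks_past         # date that labels each bin

# Count submissions per bin
week_counts <- table(week_start)
print(week_counts)
```

Note that a submission exactly 7 days old still falls in the first bin, while one submitted on the collection date itself would get a bin of its own (week 0).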

How many of these submissions get revised?

dat %>% 
  group_by(Revised) %>%
  summarise(n = n(), `%` = n/nrow(dat) * 100)

This is almost exactly the same percentage as I found last year.

2017 highlights

What have been the most popular preprints so far this year?

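# Days elapsed in 2017 at collection time (named to avoid masking lubridate::days())
days_in_2017 <- collection_date - ymd("2017-01-01")

# Ten most-viewed preprints first submitted on or after 1 January 2017
dat %>%
  filter(Age <= days_in_2017) %>%
  arrange(desc(`PDF views`)) %>%
  head(10)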

Some really topical stuff (not surprisingly): p-value controversies, single cell genomics, the index-switching catastrophe, and the recent Venter debacle.


The data is available on my GitHub to explore.