# bioRxiv

2016-03-01 7 minute read

After posting my first preprint to bioRxiv a few weeks ago, I have been periodically checking its number of views and PDF downloads. I became interested in how many downloads or views preprints on bioRxiv typically get, but this type of information isn’t actually available. What are the all-time top bioRxiv preprints? How many people are reading bioRxiv preprints on average? No one knows! Altmetric must track this data, as it will tell you how a particular preprint ranks in relation to others, but that data hasn’t been made publicly available (as far as I can tell).

Not to worry: a little Python script I wrote scrapes this information from the bioRxiv website and records it in a file. I ran it on 29 February this year and collected a snapshot of bioRxiv metrics at that time.

Let’s take a look at the data.

library(readr)  # provides read_tsv(), used below
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
##     date
data <- read_tsv("../data/biorxiv/biorxiv_data_2016_2_29.tsv")
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   URL = col_character(),
##   Abstract views = col_double(),
##   PDF views = col_double(),
##   Original submission = col_character(),
##   Current submission = col_character()
## )
collection_date <- ymd("2016_2_29")

str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3095 obs. of  6 variables:
##  $ Title              : chr  "Copy number variants in the sheep genome detected using multiple approaches" "Productive infection of field strains of avian coronavirus infectious bronchitis virus in chicken peripheral bl"| __truncated__ "Recurrent selection explains genomic regions of high relative but low absolute differentiation in the greenish "| __truncated__ "A watershed model of individual differences in fluid intelligence" ...
##  $ URL                : chr  "http://biorxiv.org/content/early/2016/02/26/041475" "http://biorxiv.org/content/early/2016/02/26/041558" "http://biorxiv.org/content/early/2016/02/26/041467" "http://biorxiv.org/content/early/2016/02/26/041368" ...
##  $ Abstract views     : num  131 NA 321 70 161 570 714 276 428 318 ...
##  $ PDF views          : num  10 NA 14 3 8 102 106 45 65 42 ...
##  $ Original submission: chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  $ Current submission : chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Title = col_character(),
##   ..   URL = col_character(),
##   ..   Abstract views = col_double(),
##   ..   PDF views = col_double(),
##   ..   Original submission = col_character(),
##   ..   Current submission = col_character()
##   .. )

As you can see, the Python script gathers the title, URL, number of abstract views, number of PDF downloads, original upload date, and the date that the current version was uploaded (in many cases this is the same as the original upload date).

Let’s add the age of the preprint and whether it was revised (true/false).

data <- data %>%
  # column names containing spaces must be wrapped in backticks
  mutate(Age = collection_date - ymd(`Original submission`),
         Revised = `Original submission` != `Current submission`)

str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3095 obs. of  8 variables:
##  $ Title              : chr  "Copy number variants in the sheep genome detected using multiple approaches" "Productive infection of field strains of avian coronavirus infectious bronchitis virus in chicken peripheral bl"| __truncated__ "Recurrent selection explains genomic regions of high relative but low absolute differentiation in the greenish "| __truncated__ "A watershed model of individual differences in fluid intelligence" ...
##  $ URL                : chr  "http://biorxiv.org/content/early/2016/02/26/041475" "http://biorxiv.org/content/early/2016/02/26/041558" "http://biorxiv.org/content/early/2016/02/26/041467" "http://biorxiv.org/content/early/2016/02/26/041368" ...
##  $ Abstract views     : num  131 NA 321 70 161 570 714 276 428 318 ...
##  $ PDF views          : num  10 NA 14 3 8 102 106 45 65 42 ...
##  $ Original submission: chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  $ Current submission : chr  "2016_02_26" "2016_02_26" "2016_02_26" "2016_02_26" ...
##  $ Age                : 'difftime' num  3 3 3 3 ...
##   ..- attr(*, "units")= chr "days"
##  $ Revised            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
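
To make the two derived columns concrete, here is a minimal base R sketch on three made-up rows (the dates are illustrative, not taken from the scraped file). It assumes the same underscore-separated date format and uses `as.Date()` in place of lubridate's `ymd()`, so it runs without extra packages.

```r
# Hypothetical rows in the same "YYYY_MM_DD" format as the scraped file
toy <- data.frame(
  original = c("2016_02_26", "2015_07_11", "2016_01_06"),
  current  = c("2016_02_26", "2015_09_01", "2016_01_06"),
  stringsAsFactors = FALSE
)
collection_date <- as.Date("2016-02-29")

# Age: days between the original upload and the snapshot date
toy$Age <- collection_date - as.Date(toy$original, format = "%Y_%m_%d")

# Revised: TRUE when the current version was uploaded on a different date
toy$Revised <- toy$original != toy$current

print(toy)
```

One caveat of this definition: a revision uploaded on the same day as the original would not be counted as revised.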

### Submissions over time

bioRxiv has supposedly been growing in popularity, so let’s take a look at the rate of submissions over time.

First group into weeks:

weekly <- data %>%
  # convert the difftime to plain days before binning into 7-day weeks
  mutate(weeks_past = ceiling(as.numeric(Age, units = "days") / 7),
         `Submission week` = collection_date - weeks(weeks_past)) %>%
  group_by(`Submission week`) %>%
  summarise(Submissions = n())

ggplot(weekly, aes(`Submission week`, Submissions)) +
  geom_point(stat = "identity") +
  geom_smooth() +
  ggtitle("bioRxiv submissions per week") +
  theme_bw()
## geom_smooth() using method = 'loess' and formula 'y ~ x'

So the rate of submission really is increasing.
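
The weekly binning is easy to misread, so here is a small base R sketch of how `ceiling(Age / 7)` maps ages to week labels. The ages are made up for illustration, and plain date arithmetic stands in for lubridate's `weeks()`.

```r
collection_date <- as.Date("2016-02-29")

# Hypothetical preprint ages in days at the time of the snapshot
age_days <- c(0, 1, 3, 7, 8, 14, 15)

# Ages 1-7 fall in week 1, ages 8-14 in week 2, and so on;
# an age of 0 (uploaded on the snapshot day) gets a bin of its own
weeks_past <- ceiling(age_days / 7)

# Each bin is labelled with the date that many whole weeks before the snapshot
submission_week <- collection_date - 7 * weeks_past

print(data.frame(age_days, weeks_past, submission_week))
print(table(submission_week))
```

Each label is the earliest submission date in its bin. Because the snapshot day forms a bin by itself, the most recent point in a plot like the one above may look artificially low.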

### Some bioRxiv metrics

bioRxiv allows you to submit a revision at any time (until the paper is published in a peer-reviewed journal). How many preprints are revised at some stage?

data %>%
group_by(Revised) %>%
summarise(Count = n(), Percentage = Count/nrow(data) * 100) %>%
knitr::kable(., align = 'c')
| Revised | Count | Percentage |
|:-------:|:-----:|:----------:|
|  FALSE  | 2204  |  71.21163  |
|  TRUE   |  891  |  28.78837  |

So only about 29% of preprints had been revised at the time of the snapshot.

What’s the average number of abstract views and PDF downloads for preprints?

abstr_mn  <- mean(data$`Abstract views`, na.rm = TRUE)
abstr_med <- median(data$`Abstract views`, na.rm = TRUE)
pdf_mn    <- mean(data$`PDF views`, na.rm = TRUE)
pdf_med   <- median(data$`PDF views`, na.rm = TRUE)

The mean number of abstract views was 485, and the median was 297. For PDF views, the mean was 134 and the median was 64.

The mean seems to be pulled up by some outliers with very high numbers of views. Let’s take a look at what those are.
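
As a quick illustration of how a single extreme value separates the mean from the median, here is a toy example in base R. The numbers are made up, chosen only to resemble a right-skewed distribution of view counts.

```r
# Made-up, right-skewed view counts: four typical preprints and one blockbuster
views <- c(33, 64, 65, 127, 15170)

print(mean(views))    # dominated by the single large value
print(median(views))  # the middle value, unaffected by the size of the maximum

# Dropping the largest value collapses the mean but barely moves the median
without_top <- views[-which.max(views)]
print(mean(without_top))
print(median(without_top))
```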

### Top 10 bioRxiv preprints

data %>%
  arrange(desc(`PDF views`)) %>%
  head(10) %>%  # keep only the ten most-downloaded preprints
  select(Title, `Abstract views`, `PDF views`, Age, Revised) %>%
  knitr::kable(., align = 'l')
| Title | Abstract views | PDF views | Age | Revised |
|:------|:---------------|:----------|:----|:--------|
| Massive migration from the steppe is a source for Indo-European languages in Europe | 17385 | 15170 | 384 days | FALSE |
| Revised estimates for the number of human and bacteria cells in the body | 23174 | 9625 | 54 days | FALSE |
| Accelerating Scientific Publication in Biology | 12206 | 5504 | 233 days | TRUE |
| The genome of the tardigrade Hypsibius dujardini | 22074 | 4084 | 90 days | TRUE |
| Real time selective sequencing using nanopore technology. | 6906 | 4039 | 26 days | FALSE |
| Simple multi-trait analysis identifies novel loci associated with growth and obesity measures | 4362 | 2905 | 236 days | FALSE |
| A vision for ubiquitous sequencing | 2435 | 2735 | 298 days | TRUE |
| TP53 copy number expansion correlates with the evolution of increased body size and an enhanced DNA damage response in elephants | 12275 | 2502 | 146 days | TRUE |
| Reconstructing Genetic History of Siberian and Northeastern European Populations | 2079 | 2410 | 134 days | TRUE |
| Genome variation and meiotic recombination in Plasmodium falciparum: insights from deep sequencing of genetic crosses | 559 | 2408 | 206 days | TRUE |

All interesting papers, most of which are still not published in journals.

### Distribution of views

Let’s first filter out preprints that are 10 days old or younger, as these probably have an artificially low number of views.

older_than_10 <- filter(data, Age > 10)
quantile(older_than_10$`PDF views`, na.rm = TRUE)
##    0%   25%   50%   75%  100%
##     0    33    65   127 15170
ggplot(older_than_10, aes(`PDF views`)) +
  geom_histogram(bins = 100) + theme_bw()
## Warning: Removed 1 rows containing non-finite values (stat_bin).

This would be better looked at on a log scale:

ggplot(older_than_10, aes(log(`PDF views`))) +
  ggtitle("PDF views") +
  geom_density(fill = 'blue', alpha = 0.5) + theme_bw() +
  geom_vline(xintercept = log(pdf_med), col = 'red') +
  geom_vline(xintercept = log(331))  # my paper
## Warning: Removed 6 rows containing non-finite values (stat_density).

ggplot(older_than_10, aes(log(`Abstract views`))) +
  ggtitle("Abstract views") +
  geom_density(fill = 'blue', alpha = 0.5) + theme_bw() +
  geom_vline(xintercept = log(abstr_med), col = 'red') +
  geom_vline(xintercept = log(1678))  # my paper
## Warning: Removed 2 rows containing non-finite values (stat_density).
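
The "non-finite values" warnings under the density plots presumably come from preprints with missing counts (`NA`) or zero views, since `log(0)` is `-Inf` and ggplot2 drops such rows. The sketch below uses made-up counts to show the effect, plus two common workarounds. Neither was used for the plots in this post.

```r
# Made-up view counts, including a zero and a missing value
views <- c(0, NA, 3, 64, 331, 15170)

logged <- log(views)
print(logged)             # -Inf for the zero, NA for the missing value
print(is.finite(logged))  # FALSE for both, so both rows would be dropped

# Workaround 1: log(1 + x) keeps zeros on the scale (the NA stays NA)
shifted <- log1p(views)
print(shifted)

# Workaround 2: explicitly keep only the positive, non-missing counts
positive_only <- views[!is.na(views) & views > 0]
print(log(positive_only))
```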

The red line in both plots shows the median, and the black line shows the data point for my own preprint, posted a few weeks ago. It seems to have done slightly better than most others.
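
To put a number on "better than most", one option is an empirical percentile rank. This is a sketch using base R's `ecdf()` on made-up counts. The value 331 is the PDF view count of my preprint from the plot above, and everything else is illustrative.

```r
# Made-up PDF view counts standing in for the real column
pdf_views <- c(0, 10, 33, 64, 65, 127, 331, 2408, 15170)

# ecdf() returns a function giving the fraction of values <= its argument
rank_fn <- ecdf(pdf_views)

# Fraction of preprints with at most 331 PDF views
my_rank <- rank_fn(331)
print(my_rank)
```

Applying the same call to the `PDF views` column of `older_than_10` would give the corresponding fraction for the real snapshot.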

There’s plenty more I could look at, but for now this satisfies my basic curiosity. I’ve posted the data and code on GitHub, so interested people can download and play with it themselves.