Tim Stuart     About     Archive     Code     Publications     Feed

Installing Magic

Installing magic

Recently a method for imputing single cell gene expression matricies was posted on biorxiv by David van Dijk et al., called magic (Markov Affinity-based Graph Imputation of Cells). I’ve been analysing single cell RNA-seq data recently, and this method looks like it could be useful when trying to find co-transcriptional networks, as single cell data suffers from dropout which makes finding co-transcriptional networks hard.

read more

R demo

Table of Contents

Getting started

Clone the repo if you haven’t already:

git clone https://github.com/timoast/dac.git

Install RStudio.

Install the following packages:

read more

smRNA analysis notes

I recently analysed some smRNA data for a paper I’m working on. These are my analysis notes.

I used previously published data for Brachypodium, from this paper:

Garvin DF, Schmutz J, Rokhsar D, Bevan MW, Barry K, Lucas S, et al. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature. 2010;463: 763–768. doi:10.1038/nature08747

First step is to download the data:

$ wget ftp://ftp-trace.ncbi.nlm.nih.gov//sra/sra-instant/reads/ByStudy/sra/SRP/SRP001/SRP001895/SRR035616/SRR035616.sra
$ fastq-dump SRR035616.sra
$ pigz SRR035616.fastq
read more

WGBS Analysis notes -- BS-seeker2

Whole genome bisulfite sequencing analysis notes for BS-seeker2.

Step 1: Trim adapters and low quality bases


| seqtk trimfq -l 50 - \
| pigz > filtered_reads.fq.gz
read more


A look at bioRxiv preprints

Tim Stuart
2 March 2016

After posting a my first preprint to bioRxiv a few weeks ago, I have been periodically checking the number of views and PDF downloads. I became interested to see how many downloads or views the preprints on bioRxiv typically get, but this type of information isn’t actually available. What are the all-time top bioRxiv preprints? How many people are reading bioRxiv preprints on average? No-one knows! Altmetric must track this data, as it will tell you how a particular preprint ranks in relation to others, but that data hasn’t been made publicly available (as far as I can tell).

read more

A paper a day

In an effort to read more papers this year, I’m going to read a paper (or something paper-like) each day for the remainder of the year, and post each paper below as I go.

Dec 31

Galanter JM, Gignoux CR, Oh SS, Torgerson D, Pino-Yanes M, Thakur N, et al. Differential methylation between ethnic sub-groups reflects the effect of genetic ancestry and environmental exposures. eLife. doi:10.7554/eLife.20532

read more

Lorne Genome 2016 presentation

This week I presented some recent work at the Lorne Genome conference. It was my first time at Lorne and it’s a fantastic meeting with lots of interesting talks. It was great to see so many people getting interested in transposon biology too!

Here are a few links to the work I was presenting:

TEPID TE presence/absence variant discovery software:






read more

Extract reads from bam file by read name

While there are very fast and easy ways to extract reads from a bam file according to mapping location, extracting reads by read name is more difficult.

Simple methods, like using grep, are incredibly slow if you want to look for more than a few reads.

Luckily, pysam allows you to index a bam file by read name (using pysam.IndexedReads(AlignmentFile)) while keeping the sort order. However, the index is stored in memory so this can use a lot of RAM (my tests with a 5.7 GB bam file used about 9 GB RAM).

I wrote a small python script (below) that uses pysam to extract reads by read name from a bam file.

Extracting 10 reads from a 5.7 GB bam file, just using grep is slightly faster than the python script:

read more

Estimation of insert size from sam / bam files

I’ve been looking for a simple way to estimate paired-end insert sizes from mapped data. There are a few more complicated tools available (I think picard will do it), as well as simple scripts different people have written that don’t exclude outliers. However outliers due to discordant read alignments will hugely change the mean, so need to be removed.

As I couldn’t find anything that was a nice middle ground, I ended up writing my own simple python script that will take input from stdin and exclude outliers more that two standard deviations from the median, and print the mean and standard deviation to stdout.

It can be used to estimate the insert size from sam or bam files:

$ head -10000 mapped.sam | python mean_size.py
220 35
$ samtools view mapped.bam | head -10000 | python mean_size.py
220 35
read more

MethyC-Seq Analysis Notes

Preliminary steps


  1. Enter a screen
screen -S bcl-conversion
  1. Navigate to directory with run (eg. 130909_SNL119_0105_AC2GYKACXX)
  2. Check sample sheet configured correctly. If only one adapter in lane, remove adapter sequence from sample sheet.
  3. Run the following (modified with correct run name, sample sheet etc). Can change final value to change number of reads in files:
read more