Processing scATAC-seq data

In this tutorial we’ll demonstrate a simple pipeline for processing scATAC-seq data using Sinto, bwa, tabix, and Genrich.

Download data

For this example we’ll download scATAC-seq data from Chen, Lake, and Zhang (2019) This dataset actually contains both gene expression and DNA accessibility measurements for each cell, but we can process the DNA accessibility data exactly as though it was a typical scATAC-seq experiment. In this example we’ll just process one replicate, but there are 12 replicates in total in the original data.

wget https://sra-pub-src-1.s3.amazonaws.com/SRR9672090/SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R1_001.fastq.gz.1
wget https://sra-pub-src-1.s3.amazonaws.com/SRR9672090/SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R2_001.fastq.gz.1
wget https://sra-pub-src-1.s3.amazonaws.com/SRR9672090/SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R3_001.fastq.gz.1

Attach cell barcodes

In this experiment the cell barcodes were sequenced using the first 12 bp of read 1, and the paired-end genomic DNA reads sequenced using read 2 and read 3. We can use the sinto barcode function to transfer the cell barcodes from read 1 into the read names in read 2 and read 3. Storing the cell barcode in the read name is an easy way to track which reads came from which cells.

This step is much faster when using unzipped fastq files, so we’ll first unzip the fastq files we downloaded earlier.

gzip -S .gz.1 -d SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R1_001.fastq.gz.1
gzip -S .gz.1 -d SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R2_001.fastq.gz.1
gzip -S .gz.1 -d SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R3_001.fastq.gz.1

sinto barcode -b 12 --barcode_fastq SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R1_001.fastq \
    --read1 SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R2_001.fastq \
    --read2 SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R3_001.fastq

Function run_barcode called with the following arguments:

barcode_fastq       SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R1_001.fastq
read1       SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R2_001.fastq
read2       SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R3_001.fastq
bases       12
prefix
suffix
func        <function run_barcode at 0x7f28c93714c0>

Function completed in  8.0 m 13.35 s

Map

We can map the paired-end genomic reads using any aligner we like. In this example we’ll use bwa mem. First we need to download the fasta file for the mm10 genome and create a genome index for bwa:

wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gzip -d mm10.fa.gz
bwa index mm10.fa

Next we can map the reads with bwa mem:

bwa mem -t 8 mm10.fa SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R2_001.barcoded.fastq \
    SC_20190509A_AdMsBrain_SNARE_chromatin_S16_R3_001.barcoded.fastq \
    | samtools view -b - > aln.bam

# sort and index bam file
samtools sort -@ 8 aln.bam -o aln.sort.bam
samtools index -@ 8 aln.sort.bam

# look at some of the aligned reads
samtools view aln.sort.bam | head -n 3

AACCTCCAACTG:D00611:698:CDN4HANXX:5:1106:1580:94969 69      chr1    3000052 0       *       =       3000052 0       TCACATTCTCAGTGCACAATAGAACCCCTTACCTCCAATCCAGAGTAAACAAAGAGTCTACCACAAACACACATC     /</<///</BB/////<//////<///<//<<////<B<BBB///////<<//////<<//<////</<<FF<</     MC:Z:32M43S     AS:i:0  XS:i:0
AACCTCCAACTG:D00611:698:CDN4HANXX:5:1106:1580:94969 137     chr1    3000052 0       32M43S  =       3000052 0       TTGAAGGTCTGGTAGAACTCTGCATTAAACCCCGACTCCAGTTGGAGGTTGTACTCTGCGTTGATACCACTTTTT     BBB/BFFFBFBFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFB<///////     NM:i:0  MD:Z:32 AS:i:32 XS:i:32
TCCGCCGGAAAC:D00611:697:CD0V6ANXX:7:2116:7385:7732  117     chr1    3000098 0       *       =       3000098 0       AAGACGGCATACGAGATTCGCCTTAGTCTCGTGGGCTCGGAGATGTGTATAAGAGACATATACACTCAGCTTTAT     FFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB     MC:Z:33M42S     AS:i:0  XS:i:0

Create a fragment file

Finally we create a fragment file for the dataset using sinto fragments

sinto fragments -b aln.sort.bam -p 8 -f fragments.bed --barcode_regex "[^:]*"

Function run_fragments called with the following arguments:

bam aln.sort.bam
fragments   fragments.bed
min_mapq    30
nproc       8
barcodetag  CB
cells       None
barcode_regex       [^:]*
use_chrom   (?i)^chr
max_distance        5000
chunksize   500000
func        <function run_fragments at 0x7ff6976b5dc0>

Function completed in  7.0 m 30.24 s

# sort, compress, and index
sort -k1,1 -k2,2n fragments.bed > fragments.sort.bed
bgzip -@ 8 fragments.sort.bed
tabix -p bed fragments.sort.bed.gz

# clean up
rm fragments.bed

# take a look at the output
gzip -dc fragments.sort.bed.gz | head

chr1        3003930 3003960 ATGGTGACTCAT    1
chr1        3010068 3010136 CTCGGGTTTAGC    5
chr1        3010538 3010589 GAAACGCAAAGT    8
chr1        3011663 3011770 AACTGGGCTATC    2
chr1        3011684 3011728 CTATTGTGCATA    12
chr1        3012702 3012739 ATCAGAGACGGC    6
chr1        3012724 3012775 TGCGGACGGTGG    3
chr1        3012743 3012804 TGACATCCCCGA    2
chr1        3019505 3019588 CACGCTCGTCTT    1
chr1        3021571 3021620 GTATTCGTCGGG    9

Call peaks

We can call peaks using all cells combined using the mapped bam file and Genrich. Genrich requires reads to be sorted by queryname, so we first re-sort the bam file and then run Genrich using ATAC-seq mode (-j) to find peaks.

samtools sort -n -@ 8 aln.sort.bam -o aln.qname.bam
Genrich -j -t aln.qname.bam -o peaks.bed -v

# look at some of the peaks
head peaks.bed

chr1        3094859 3095377 peak_0  1000    .       913.693115      5.720295        -1      374
chr1        3119581 3120170 peak_1  1000    .       1186.086792     6.389130        -1      420
chr1        3120272 3120746 peak_2  1000    .       965.524841      6.166295        -1      285
chr1        3121354 3121654 peak_3  1000    .       580.268433      4.948659        -1      129
chr1        3292563 3292985 peak_4  1000    .       897.739197      5.319649        -1      257
chr1        3299692 3299934 peak_5  1000    .       489.207764      5.154057        -1      98
chr1        3309998 3310407 peak_6  1000    .       829.686340      6.153750        -1      210
chr1        3322437 3322775 peak_7  1000    .       397.093872      3.806610        -1      128
chr1        3360973 3361167 peak_8  1000    .       262.217590      3.911148        -1      120
chr1        3369527 3369829 peak_9  1000    .       373.499207      4.094081        -1      149

Downstream analysis

Downstream analysis steps, including quantifying counts in each peak for each cell, can be performed using Signac.