Grouping scATAC-seq reads

Tim Stuart 2019-04-01 2 minute read

Often in the analysis of single-cell ATAC-seq data it’s helpful to create a “pseudo-bulk” accessibility profile for each individual cell type or cluster within the dataset. This can allow visualization of the separate accessibility tracks, and peak-calling on the pure cell types that can often enable the identification of rare peaks that are obscured when looking at the whole population.

However, there are few tools I know of that provide a nice solution to the basic problem of splitting a BAM file into given groups of cell barcodes. The simple solutions, like using grep or awk, are incredibly slow (see https://divingintogeneticsandgenomics.rbind.io/post/split-a-10xscatac-bam-file-by-cluster/). A while ago I wrote some scripts to do this using pysam, with the ability to use multiple cores. We used this code in our recent preprint to create separate tracks for different cell types in the brain.

I decided to put this code into a small package that will be for general single-cell data processing, called sinto. Right now, this package has only one function, filterbarcodes, that allows a BAM to be split based on a given set of cell barcodes. This works both for 10x Genomics BAM files with cell barcode read tags, and other formats with the cell barcode stored in the readname (like sci-ATAC-seq). My plan is for this package to grow over time, adding new features as the need arises. One thing I would like to add is the ability to add read tags for each cluster, rather than splitting the BAM, as some genome browsers (like IGV) support splitting the tracks by a read tag.

Install sinto through pip:

pip install sinto