User guide

Create scATAC-seq fragments file

An ATAC-seq fragment file can be created from a BAM file using the fragments command. The fragment file contains the position of each Tn5 integration site, the cell barcode associated with the fragment, and the number of times the fragment was sequenced. PCR duplicates are collapsed.

 sinto fragments [-h] -b BAM -f FRAGMENTS [-m MIN_MAPQ] [-p NPROC]
                    [-t BARCODETAG] [-c CELLS]
                    [--barcode_regex BARCODE_REGEX] [--use_chrom USE_CHROM]
                    [--max_distance MAX_DISTANCE]

 Create ATAC-seq fragment file from BAM file

 optional arguments:
 -h, --help            show this help message and exit
 -b BAM, --bam BAM     Input bam file (must be indexed)
 -f FRAGMENTS, --fragments FRAGMENTS
                         Name and path for output fragments file. Note that the
                         output is not sorted or compressed. To sort the output
                         file use sort -k 1,1 -k2,2n
 -m MIN_MAPQ, --min_mapq MIN_MAPQ
                         Minimum MAPQ required to retain fragment (default =
 -p NPROC, --nproc NPROC
                         Number of processors (default = 1)
 -t BARCODETAG, --barcodetag BARCODETAG
                         Read tag storing cell barcode information (default =
 -c CELLS, --cells CELLS
                         Path to file containing cell barcodes to retain, or a
                         comma-separated list of cell barcodes. If None
                         (default), use all cell barcodes present in the BAM
 --barcode_regex BARCODE_REGEX
                         Regular expression used to extract cell barcode from
                         read name. If None (default), extract cell barcode
                         from read tag. Use "[^:]*" to match all characters up
                         to the first colon.
 --use_chrom USE_CHROM
                         Regular expression used to match chromosomes to be
                         included in output. Default is "(?i)^chr" to match all
                         chromosomes starting with "chr", case insensitive
 --max_distance MAX_DISTANCE
                         Maximum distance between integration sites for the
                         fragment to be retained. Allows filtering of
                         implausible fragments that likely result from
                         incorrect mapping positions. Default is 5000 bp.
--chunksize CHUNKSIZE
                        Number of BAM file entries to iterate over before
                        collapsing the fragments and writing to disk. Higher
                        chunksize will use more memory but will be faster.

Fragment file format

The fragment file is a BED format file containing the positions of Tn5 integration sites, the cell barcode that the DNA fragment originated from, and the number of times the fragment was sequenced. See the 10x Genomics website for a further description of the fragment file format.

This is a convenient compressed form of the most useful data generated in a scATAC-seq experiment. The fragments file generated by Sinto needs to be sorted, block-gzip compressed (bgzip), and indexed using tabix.

For example:

sort -k 1,1 -k2,2n frags.bed > frags.sort.bed
bgzip frags.sort.bed
tabix -p bed frags.sort.bed.gz
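As a rough illustration of the file layout, the following Python sketch parses fragment lines and orders them the same way as the sort command above (chromosome name, then numeric start position). The five-column layout (chromosome, start, end, cell barcode, read count) is assumed from the 10x Genomics fragment format, and the barcodes and coordinates are made-up values.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int
    end: int
    barcode: str
    count: int


def parse_fragment(line: str) -> Fragment:
    """Parse one tab-delimited fragment line (assumed five-column layout)."""
    chrom, start, end, barcode, count = line.rstrip("\n").split("\t")
    return Fragment(chrom, int(start), int(end), barcode, int(count))


def sort_fragments(fragments):
    """Order by chromosome name, then numeric start (like sort -k 1,1 -k2,2n)."""
    return sorted(fragments, key=lambda f: (f.chrom, f.start))


# Hypothetical example lines, not real data
lines = [
    "chr2\t500\t720\tAAACGAAAGT-1\t1",
    "chr1\t10500\t10680\tAAACGAAAGT-1\t2",
    "chr1\t980\t1100\tCCCTGATTCA-1\t1",
]
ordered = sort_fragments(parse_fragment(line) for line in lines)
for frag in ordered:
    print(*frag, sep="\t")
```

Python compares strings by code point, which corresponds to running sort under the C locale; sort in other locales may order chromosome names differently.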

How the fragment file is generated

Generating the fragment file involves the following steps in order:

  1. Remove soft-clipped bases from the alignment position.

  2. Extract cell barcode sequence associated with the fragment.

  3. Adjust alignment positions for the 9 bp Tn5 shift by applying +4/-5 to the start and end position of the paired reads.

  4. Remove fragments where either read has a MAPQ score less than the specified cutoff.

  5. Remove fragments where the fragment size is greater than the specified maximum.

  6. Collapse PCR duplicates:

    1. Count the frequency of each fragment for each cell barcode.

    2. Within a cell barcode, collapse fragments that share a start or end coordinate on the same chromosome.

    3. Across all cell barcodes, collapse fragments that share the exact start and end coordinates on the same chromosome.

    4. Assign the fragment to the most abundant cell barcode.

    5. Record the read count for the collapsed fragment.

  7. Write fragments to file. Note that fragments are not sorted or compressed.
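The steps above can be sketched in simplified form in Python. This is not Sinto's implementation: the function names are hypothetical, only the Tn5 shift (step 3) and the exact-coordinate collapsing across barcodes (steps 6.1, 6.3, 6.4 and 6.5) are shown, and the within-barcode collapsing of fragments that share a single coordinate (step 6.2) is left out. Recording the total number of reads across all barcodes is an assumption about how the count is reported.

```python
from collections import Counter, defaultdict


def tn5_shift(start, end):
    """Apply the +4/-5 adjustment to a fragment's start and end (step 3)."""
    return start + 4, end - 5


def collapse_exact(fragments):
    """Collapse fragments that share exact coordinates across barcodes.

    fragments: iterable of (chrom, start, end, barcode) tuples, one per
    sequenced read pair. Returns {(chrom, start, end): (barcode, count)}.
    """
    # Step 6.1: count how often each fragment was seen per cell barcode
    per_barcode = Counter(fragments)

    # Step 6.3: group the per-barcode counts by coordinates
    by_coords = defaultdict(Counter)
    for (chrom, start, end, barcode), n in per_barcode.items():
        by_coords[(chrom, start, end)][barcode] += n

    collapsed = {}
    for coords, barcode_counts in by_coords.items():
        # Step 6.4: assign the fragment to the most abundant barcode
        best_barcode, _ = barcode_counts.most_common(1)[0]
        # Step 6.5: record the read count (assumption: total over all barcodes)
        collapsed[coords] = (best_barcode, sum(barcode_counts.values()))
    return collapsed


# Hypothetical read pairs, not real data
reads = [
    ("chr1", 100, 300, "AAA"),
    ("chr1", 100, 300, "AAA"),
    ("chr1", 100, 300, "CCC"),
    ("chr1", 400, 650, "CCC"),
]
result = collapse_exact(reads)
print(result)
print(tn5_shift(100, 300))
```

In this toy input the fragment at chr1:100-300 was seen twice with barcode AAA and once with CCC, so it is assigned to AAA.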

Additional arguments for the fragments function

Number of processors: --nproc

Multiple cores can be used by specifying the --nproc argument. Note that each process will load fragments from a single chromosome into memory, so the more processors used, the more memory required. At a minimum, enough memory to hold the fragments from the largest chromosome is required. The amount of memory this corresponds to will depend on sequencing depth and genome size. There is no point specifying more processors than the number of chromosomes.

Minimum mapping quality: --min_mapq

The minimum allowed mapping quality (MAPQ) can be set using --min_mapq. Depending on the aligner used, the MAPQ value can mean different things. Cellranger-atac uses bwa-mem for alignment, which follows the SAM spec and reports Phred scores as MAPQ values:

MAPping Quality. It equals -10 log10 Pr {mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
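To make the Phred scale concrete, the definition quoted above can be inverted to give the probability that a mapping position is wrong. The sketch below is only a worked version of that formula; the function names are hypothetical.

```python
import math
from typing import Optional


def mapq_to_error_prob(mapq: int) -> Optional[float]:
    """Probability that the mapping position is wrong, from a Phred-scaled MAPQ."""
    if mapq == 255:
        # 255 means the mapping quality is not available
        return None
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(prob: float) -> int:
    """MAPQ = -10 log10 Pr{mapping position is wrong}, rounded to nearest integer."""
    return round(-10 * math.log10(prob))


for q in (10, 20, 30):
    print(q, mapq_to_error_prob(q))
```

Under this definition a cutoff of 30 retains only reads whose reported chance of being mismapped is at most 1 in 1000.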

Cell barcode tag: --barcodetag

Different methods may use different tags to store the cell barcode. Cellranger uses the CB tag, which is set as the default for Sinto. Other methods may use different tags, for example SNARE-seq uses the XC tag. You can work out what tag is used by looking at part of the BAM file: samtools view aln.bam | head.
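When inspecting the samtools view output, it can help to know that optional fields follow the 11 mandatory SAM columns and take the form TAG:TYPE:VALUE. The Python sketch below lists the tags on a single record; the function name and the example record are made up for illustration.

```python
def read_tags(sam_line: str) -> dict:
    """Return {tag: value} for the optional fields of one SAM text record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory in SAM
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


# Hypothetical record as printed by samtools view (tab-delimited)
record = "\t".join([
    "read1", "99", "chr1", "100", "60", "50M", "=", "250", "200",
    "*", "*", "NM:i:0", "CB:Z:AAACGAAAGT-1",
])
tags = read_tags(record)
print(tags)
```

Here the barcode is stored under CB, so the default would work; if it appeared under a different tag such as XC, that tag name would be passed to --barcodetag.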

Cell barcode regex: --barcode_regex

Some methods store the cell barcode in the read name rather than under a read tag. If this is the case, you can use a regular expression to extract the cell barcode from the read name. For example, if the first section of your read name up until the first : character corresponds to the cell barcode sequence, you can specify --barcode_regex "[^:]*" to correctly match the cell barcodes.
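A quick way to check a pattern before running Sinto is to try it on a read name in Python. The sketch below assumes the pattern is applied with re.search and that the matched text is taken as the barcode; how Sinto applies the expression internally is not shown here, and the read name is invented.

```python
import re


def barcode_from_read_name(read_name, pattern=r"[^:]*"):
    """Return the part of the read name matched by the pattern, or None."""
    match = re.search(pattern, read_name)
    return match.group() if match else None


# Hypothetical read name with the barcode before the first colon
name = "AAACGAAAGT:NB500:1:1101:1000:2000"
print(barcode_from_read_name(name))
```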

Choosing chromosomes to include: --use_chrom

Often a genome build contains scaffolds that are not typically used in downstream analysis. This option allows you to specify a regular expression to match chromosome names that will be retained in the output. By default, all chromosomes starting with “chr” are retained, case insensitive (i.e., “Chr” and “CHR” are also retained).
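The effect of the default pattern can be previewed in Python. This is a sketch that assumes the expression is applied to each chromosome name with re.search; the names in the list are examples only.

```python
import re


def keep_chromosomes(names, pattern=r"(?i)^chr"):
    """Return the chromosome names that the pattern matches."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Hypothetical sequence names from a BAM header
names = ["chr1", "Chr2", "CHRX", "GL000219.1", "scaffold_12", "chrUn_example"]
kept = keep_chromosomes(names)
print(kept)
```

Note that the default keeps any name beginning with "chr", including unplaced contigs that carry that prefix. A stricter expression, for example "^chr[0-9XY]+$", could be supplied to exclude them.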

Set the maximum distance between Tn5 integration sites: --max_distance

Incorrect alignment can sometimes generate implausible fragment coordinates. Since we know there is an upper limit to the size of a DNA molecule that can be sequenced on the Illumina platform, very large fragments over 5 kb in size likely originate from incorrect read mapping. We can remove these to reduce the impact of mapping artefacts on the downstream analysis by setting the --max_distance parameter. Fragments larger than this value will not be included in the output file.

Set the maximum number of fragments to hold in memory before collapsing: --chunksize

The fragments algorithm iterates through a position-sorted BAM file and stores fragment information as it iterates through the paired reads. Once all the reads at a genomic locus have been read, the fragments covering that locus can be PCR-collapsed. Sinto performs this step in chunks to balance speed and memory use. The --chunksize parameter controls how many fragments are able to be held in memory before they get collapsed and written to a file. Setting a larger value should require more memory but the function will complete faster.

Filter cell barcodes from BAM file

Reads for a subset of cells can be extracted from a BAM file using the filterbarcodes command. This requires a position-sorted, indexed BAM file, and a file containing a list of cell barcodes to retain.

 sinto filterbarcodes [-h] -b BAM -c CELLS -o OUTPUT [-t] [-s]
                         [-p NPROC] [--barcode_regex BARCODE_REGEX]
                         [--barcodetag BARCODETAG]

Filter reads based on input list of cell barcodes

optional arguments:
-h, --help            show this help message and exit
-b BAM, --bam BAM     Input bam file (must be indexed)
-c CELLS, --cells CELLS
                        File or comma-separated list of cell barcodes. Can be
                        gzip compressed
-t, --trim_suffix     Remove trailing 2 characters from cell barcode in BAM
-p NPROC, --nproc NPROC
                        Number of processors (default = 1)
--barcode_regex BARCODE_REGEX
                        Regular expression used to extract cell barcode from
                        read name. If None (default), extract cell barcode
                        from read tag. Use "[^:]*" to match all characters up
                        to the first colon.
--barcodetag BARCODETAG
                        Read tag storing cell barcode information (default =

The input “cells” file should be a tab-delimited text file with cell barcodes in the first column and the groups the cell belongs to in the second column. This could be the cluster number, for example. A cell can belong to multiple groups specified in the file using a comma-separated list of groups. If multiple groups are provided, reads from that cell will be copied to the output BAM file for each of the groups.

Example input “cells” file:


The names of the output BAM files are determined by the name of each group in the input cells file. The example file above would generate three bam files, named A.bam, B.bam, and C.bam. Note that reads from the fourth cell would appear in both B.bam and A.bam.
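To illustrate how a cells file of this kind maps to output files, the Python sketch below parses a hypothetical four-cell file that matches the description above (groups A, B and C, with the fourth cell in both A and B). The barcodes are invented, and the parsing shown is not Sinto's own code.

```python
from collections import defaultdict


def parse_cells(lines):
    """Map each group name to the list of cell barcodes assigned to it."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        barcode, group_field = line.split("\t")
        # A cell may list several groups separated by commas
        for group in group_field.split(","):
            groups[group].append(barcode)
    return dict(groups)


# Hypothetical cells file: barcode <TAB> group(s)
cells = [
    "AAACGAAAGT-1\tA",
    "CCCTGATTCA-1\tB",
    "GGGTTACCAG-1\tC",
    "TTTGCATGAC-1\tA,B",
]
groups = parse_cells(cells)
output_files = sorted(f"{group}.bam" for group in groups)
print(output_files)
print(groups)
```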

Add read tags to BAM file

Read tags can be added to a BAM file according to which cell the read belongs to using the addtags command. This requires a position-sorted and indexed BAM file, and a file specifying the tags to be added to each cell, for example:

sinto addtags [-h] -b BAM -f TAGFILE -o OUTPUT [-t] [-s] [-p NPROC]
                    [-m MODE]

Add read tags to reads from individual cells

optional arguments:
-h, --help            show this help message and exit
-b BAM, --bam BAM     Input bam file (must be indexed)
-f TAGFILE, --tagfile TAGFILE
                        Tab-delimited file containing cell barcode, tag to be
                        added, and tag identity. Can be gzip compressed
-o OUTPUT, --output OUTPUT
                        Name for output BAM file
-t, --trim_suffix     Remove trailing 2 characters from cell barcode in BAM
-s, --sam             Output sam format (default bam output)
-p NPROC, --nproc NPROC
                        Number of processors (default = 1)
-m MODE, --mode MODE  Either tag (default) or readname. Some BAM files store
                        the cell barcode in the readname rather than under a
                        read tag

This will add a CI tag, with the tag set to A, B, or C depending on the cell barcode sequence.
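As an illustration of the three-column tag file described in the help text (cell barcode, tag to be added, tag identity), the Python sketch below parses a hypothetical file that would assign a CI tag of A, B or C. The barcodes are invented, and rendering the tag as a string-typed (Z) SAM field is an assumption.

```python
def parse_tagfile(lines):
    """Map each cell barcode to a {tag: value} dictionary."""
    tags = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        barcode, tag, value = line.split("\t")
        tags.setdefault(barcode, {})[tag] = value
    return tags


# Hypothetical tag file: barcode <TAB> tag <TAB> tag identity
tagfile = [
    "AAACGAAAGT-1\tCI\tA",
    "CCCTGATTCA-1\tCI\tB",
    "GGGTTACCAG-1\tCI\tC",
]
cell_tags = parse_tagfile(tagfile)
for barcode, entries in cell_tags.items():
    for tag, value in entries.items():
        # Assumption: the tag is written as a string-typed SAM field
        print(barcode, f"{tag}:Z:{value}")
```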