User guide

Create scATAC-seq fragments file

An ATAC-seq fragment file can be created from a BAM file using the fragments command. The fragment file contains the position of each Tn5 integration site, the cell barcode associated with the fragment, and the number of times the fragment was sequenced. PCR duplicates are collapsed.

 sinto fragments [-h] -b BAM -f FRAGMENTS [-m MIN_MAPQ] [-p NPROC]
                    [-t BARCODETAG] [-c CELLS]
                    [--barcode_regex BARCODE_REGEX] [--use_chrom USE_CHROM]
                    [--max_distance MAX_DISTANCE]

 Create ATAC-seq fragment file from BAM file

 optional arguments:
 -h, --help            show this help message and exit
 -b BAM, --bam BAM     Input bam file (must be indexed)
 -f FRAGMENTS, --fragments FRAGMENTS
                         Name and path for output fragments file. Note that the
                         output is not sorted or compressed. To sort the output
                         file use sort -k 1,1 -k2,2n
 -m MIN_MAPQ, --min_mapq MIN_MAPQ
                         Minimum MAPQ required to retain fragment (default =
                         30)
 -p NPROC, --nproc NPROC
                         Number of processors (default = 1)
 -t BARCODETAG, --barcodetag BARCODETAG
                         Read tag storing cell barcode information (default =
                         "CB")
 -c CELLS, --cells CELLS
                         Path to file containing cell barcodes to retain, or a
                         comma-separated list of cell barcodes. If None
                         (default), use all cell barocodes present in the BAM
                         file.
 --barcode_regex BARCODE_REGEX
                         Regular expression used to extract cell barcode from
                         read name. If None (default), extract cell barcode
                         from read tag. Use "[^:]*" to match all characters up
                         to the first colon.
 --use_chrom USE_CHROM
                         Regular expression used to match chromosomes to be
                         included in output. Default is "(?i)^chr" to match all
                         chromosomes starting with "chr", case insensitive
 --max_distance MAX_DISTANCE
                         Maximum distance between integration sites for the
                         fragment to be retained. Allows filtering of
                         implausible fragments that likely result from
                         incorrect mapping positions. Default is 5000 bp.
--chunksize CHUNKSIZE
                        Number of BAM file entries to iterate over before
                        collapsing the fragments and writing to disk. Higher
                        chunksize will use more memory but will be faster.

Fragment file format

The fragment file is a BED format file containing the positions of Tn5 integration sites, the cell barcode that the DNA fragment originated from, and the number of times the fragment was sequenced. See the 10x Genomics website for a further description of the fragment file format.

This is a convenient compressed form of the most useful data generated in a scATAC-seq experiment. The fragments file generated by Sinto needs to be sorted, block-gzip compressed (bgzip), and indexed using tabix.

For example:

sort -k 1,1 -k2,2n frags.bed > frags.sort.bed
bgzip frags.sort.bed
tabix -p bed frags.sort.bed

How the fragment file is generated

Generating the fragment file involves the following steps in order:

  1. Remove soft-clipped bases from the alignment position.

  2. Extract cell barcode sequence associated with the fragment.

  3. Adjust alignment positions for the 9 bp Tn5 shift by applying +4/-5 to the start and end position of the paired reads.

  4. Remove fragments where either read has a MAPQ score less than the specified cutoff.

  5. Remove fragments where the fragment size is greater than the specified maximum.

  6. Collapse PCR duplicates:

    1. Count the frequency of each fragment for each cell barcode.

    2. Within a cell barocode, collapse fragments that share a start or end coordinate on the same chromosome.

    3. Across all cell barcodes, collapse fragments that share the exact start and end coordinates on the same chromosome.

    4. Assign the fragment to the most abundant cell barcode.

    5. Record the read count for the collapsed fragment.

  7. Write fragments to file. Note that fragments are not sorted or compressed.

Additional arguments for the fragments function

Number of processors: --nproc

Multiple cores can be used by specifying the --nproc argument. Specifying multiple processors will parallelize across chromosomes. Currently, at most one thread is used per chromosome, so there is no point specifiying more processors than the number of chromosomes.

Minimum mapping quality: --min_mapq

The minimum allowed mapping quality (MAPQ) can be set using --min_mapq. Depending on the aligner used, the MAPQ value can mean different things. Cellranger-atac uses bwa-mem for alignment, which follows the SAM spec and reports Phred scores as MAPQ values:

MAPping Quality. It equals -10 log10 Pr {mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

Cell barcode tag: --barcodetag

Different methods may use different tags to store the cell barcode. Cellranger uses the CB tag, which is set as the default for Sinto. Other methods may use different tags, for example SNARE-seq uses the XC tag. You can work out what tag is used by looking at part of the BAM file: samtools view aln.bam | head.

Cell barcode regex: --barcode_regex

Some methods store the cell barcode in the read name rather than under a read tag. If this is the case, you can use a regular expression to extract the cell barcode from the read name. For example, if the first section of your read name up until the first : character corresponds to the cell barcode sequence, you can specify --barcode_regex [^:]* to correcly match the cell barcodes.

Choosing chromosomes to include: --use_chrom

Often a genome build might contain several scaffolds that are not typically used in downstream analysis. This option allows you to specify a regular expression to match chromosome names that will be retained in the output. By default, all chromosomes starting with “chr” are retained, case insensitive (ie, “Chr”, and “CHR” are also retained).

Set the maximum distance between Tn5 integration sites: --max_distance

Incorrect alignment can sometimes generate implausible fragment coordinates. Since we known there is an upper limit to the size of a DNA molecule that can be sequenced on the Illumina platform, very large fragments over 5 kb in size likely originate from incorrect read mapping. We can remove these to reduce the impact of mapping artefacts on the downstream analysis by setting the --max_distance parameter. Fragments larger than this value will not be included in the output file.

Set the maximum number of fragments to hold in memory before collapsing: --chunksize

The fragments algorithm iterates through a position-sorted BAM file and stores fragment information as it iterates through the paired reads. Once all the reads at a genomic locus have been read, the fragments covering that locus can be PCR-collapsed. Sinto performs this step in chunks to balance speed and memory use. The --chunksize parameter controls how many fragments are able to be held in memory before they get collapsed and written to a file. Setting a larger value should require more memory but the function will complete faster.

Filter cell barcodes from BAM file

Reads for a subset of cells can be extracted from a BAM file using the filterbarcodes command. This requires a position-sorted, indexed BAM file, and a file containing a list of cell barcodes to retain.

 sinto filterbarcodes [-h] -b BAM -c CELLS -o OUTPUT [-t] [-s]
                         [-p NPROC] [--barcode_regex BARCODE_REGEX]
                         [--barcodetag BARCODETAG]

Filter reads based on input list of cell barcodes

optional arguments:
-h, --help            show this help message and exit
-b BAM, --bam BAM     Input bam file (must be indexed)
-c CELLS, --cells CELLS
                        File or comma-separated list of cell barcodes. Can be
                        gzip compressed
-t, --trim_suffix     Remove trail 2 characters from cell barcode in BAM
                        file
-p NPROC, --nproc NPROC
                        Number of processors (default = 1)
--barcode_regex BARCODE_REGEX
                        Regular expression used to extract cell barcode from
                        read name. If None (default), extract cell barcode
                        from read tag. Use "[^:]*" to match all characters up
                        to the first colon.
--barcodetag BARCODETAG
                        Read tag storing cell barcode information (default =
                        "CB")

The input “cells” file should be a tab-delimited text file with cell barcodes in the first column and the groups the cell belongs to in the second column. This could be the cluster number, for example. A cell can belong to multiple groups specified in the file using a comma-separated list of groups. If multiple groups are provided, reads from that cell will be copied to the output BAM file for each of the groups.

Example input “cells” file:

TGGCAATGTTGAAGCG-1  A
GACCAATCACCATTCC-1  A
CAGGATTCAGAACTTC-1  B
GAACCTAAGAGAGGTA-1  B,A
ACATGGTGTAGACGCA-1  C
CCCTGATTCGGATAGG-1  C

The names of the output BAM files are determined by the name of each group in the input cells file. The example file above would generate three bam files, named A.bam, B.bam, and C.bam. Note that reads from the fourth cell would appear in both B.bam and A.bam.

Add read tags to BAM file

Read tags can be added to a BAM file according to which cell the read belongs to using the addtags command. This requires a position-sorted and indexed BAM file, and a file specifying the tags to be added to each cell, for example:

TGGCAATGTTGAAGCG-1  CI      A
GACCAATCACCATTCC-1  CI      A
CAGGATTCAGAACTTC-1  CI      B
GAACCTAAGAGAGGTA-1  CI      B
ACATGGTGTAGACGCA-1  CI      C
CCCTGATTCGGATAGG-1  CI      C
sinto addtags [-h] -b BAM -f TAGFILE -o OUTPUT [-t] [-s] [-p NPROC]
                    [-m MODE]

Add read tags to reads from individual cells

optional arguments:
-h, --help            show this help message and exit
-b BAM, --bam BAM     Input bam file (must be indexed)
-f TAGFILE, --tagfile TAGFILE
                        Tab-delimited file containing cell barcode, tag to be
                        added, and tag identity. Can be gzip compressed
-o OUTPUT, --output OUTPUT
                        Name for output BAM file
-t, --trim_suffix     Remove trail 2 characters from cell barcode in BAM
                        file
-s, --sam             Output sam format (default bam output)
-p NPROC, --nproc NPROC
                        Number of processors (default = 1)
-m MODE, --mode MODE  Either tag (default) or readname. Some BAM file store
                        the cell barcode in the readname rather than under a
                        read tag

This will add a CI tag, with the tag set to A, B, or C depending on the cell barcode sequence.

Convert read tag to read group

Read groups can be added to a SAM/BAM file based on an arbitrary read tag using the tagtorg command. Let’s assume we have a SAM file called input.sam with the following contents:

@HD VN:1.5  SO:coordinate
@SQ SN:20   LN:63025520
@RG ID:rg1  SM:sample_1     LB:1    PU:1    PL:ILLUMINA
r002        0       20      9       30      3S6M1P1I4M      *       0       0       AAAAGATAAGGATA  *       CB:Z:AAAA-1     RG:Z:rg1
r003        0       20      9       30      3S6M1P1I4M      *       0       0       AAAAGATAAGGATA  *       CB:Z:CCCC-1     RG:Z:rg1

We would like to assign each read to a separate read group according to the value of it’s CB tag. First, we need a list of tag values that we expect to see:

AAAA-1
CCCC-1

Let us assume that the barcodes are stored in a file called barcodes.txt. Then we can replace the read groups in the SAM file using the command:

sinto tagtorg -b input.sam -f barcodes.txt

This will print the following SAM file to screen:

@HD VN:1.5  SO:coordinate
@SQ SN:20   LN:63025520
@RG ID:rg1  SM:sample_1     LB:1    PU:1    PL:ILLUMINA
@RG ID:rg1:CCCC-1   SM:sample_1:CCCC-1      LB:1    PU:1    PL:ILLUMINA
@RG ID:rg1:AAAA-1   SM:sample_1:AAAA-1      LB:1    PU:1    PL:ILLUMINA
r002        0       20      9       30      3S6M1P1I4M      *       0       0       AAAAGATAAGGATA  *       CB:Z:AAAA-1     RG:Z:rg1:AAAA-1
r003        0       20      9       30      3S6M1P1I4M      *       0       0       AAAAGATAAGGATA  *       CB:Z:CCCC-1     RG:Z:rg1:CCCC-1

Two new @RG tags have been added to the header with SM fields that are cell barcode-specic. The two reads r002 and r003 have been assigned new RG tags according to their cell barcode.

usage: sinto tagtorg [-h] -b BAM [--tag TAG] -f TAGFILE [-o OUTPUT] [-O O]

Append a read tag to the read group ID of each read. Also appends the read tag
to the SM field of the read group.

optional arguments:
  -h, --help            show this help message and exit
  -b BAM, --bam BAM     Input SAM/BAM file, '-' reads from stdin
  --tag TAG             Read tag to extract the value from that is appended to
                        the read group. Default is 'CB', the tag that is used
                        in 10x sequencing to identify cells.
  -f TAGFILE, --tagfile TAGFILE
                        List of expected tag values. Reads with tag values
                        that are not in this list are not altered.
  -o OUTPUT, --output OUTPUT
                        Output SAM/BAM file, '-' outputs to stdout (default
                        '-')
  -O OUTPUTFORMAT, --outputformat OUTPUTFORMAT
                        Output format. One of 't' (SAM), 'b' (BAM), or 'u'
                        (uncompressed BAM) ('t' default)

Add cell barcodes to FASTQ read names

Cell barcodes from one FASTQ file can be added to the read names of another, or the same, FASTQ file using the barcode command. This is useful when processing raw single-cell sequencing data, as the cell barcode information can easily be propagated to the aligned BAM file by encoding the cell barcode in the read name. Both gzipped and uncompressed FASTQ files are supported as input. Running on uncompressed FASTQ is usually much faster than running on gzipped FASTQ files.

Running this command will generate new gzipped FASTQ files with the read names modified to contain the cell barcode sequence at the beginning of the read name, separated from the original read name by a : character. The output files will be the name of the input file with .barcoded.fastq.gz at the end of the file name.

sinto barcode [-h] --barcode_fastq BARCODE_FASTQ --read1 READ1
                 [--read2 READ2] -b BASES [--prefix PREFIX]
                 [--suffix SUFFIX]

Add cell barcode sequences to read names in FASTQ file.

optional arguments:
-h, --help            show this help message and exit
--barcode_fastq BARCODE_FASTQ
                        FASTQ file containing cell barcode sequences
--read1 READ1         FASTQ file containing read 1
--read2 READ2         FASTQ file containing read 2
-b BASES, --bases BASES
                        Number of bases to extract from barcode-containing
                        FASTQ
--prefix PREFIX       Prefix to add to cell barcodes
--suffix SUFFIX       Suffix to add to cell barcodes

Additional arguments for the barcode function

Bases: --bases

This controls how many bases from the read containing the cell barcode are used. Bases are counted from the beginning of the read sequence in the FASTQ file. For example, --bases 12 will extract the first 12 sequenced bases from the read and use it as the cell barcode.

Barcode read file: --barcode_fastq

FASTQ file with reads containing the cell barcode sequence.

Read 1 and read 2: --read1 and --read2

FASTQ files containing reads to which the cell barcode information will be added. Note that these files must contain the same number of reads as the barcode-containing FASTQ file, and the reads must appear in the same order.

Example

Take the following two FASTQ files as an example. The first contains cell barcode sequences and the second we want to add those sequences to the read name.

barocde_file.fastq.gz:

@D00611:697:CD0V6ANXX:5:2301:1176:2478 1:N:0:TATCCTCT
CAATACACTATATGGGAGACGTTTTTTTTT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFF
@D00611:697:CD0V6ANXX:5:2301:1480:2408 1:N:0:TATCCTCT
CAGAGACGTAAACAATGGCGGTTTTTTTTT
+
B<BBBFFFFFFFFFFFFFFFFFFFFFFFFF
@D00611:697:CD0V6ANXX:5:2301:1361:2447 1:N:0:TATCCTAT
AGTCTCGCCACATGGGGGGGATTTTTTTTT

read1.fastq.gz:

@D00611:697:CD0V6ANXX:5:2301:1176:2478 2:N:0:TATCCTCT
GATTTACACAGATGATATGTTTCTATTGCCTGCTTGGGATGGGGGTGGGAGGCAGAGTCCATCTACCTCTCTAAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@D00611:697:CD0V6ANXX:5:2301:1480:2408 2:N:0:TATCCTCT
GTGCCTTTGACTTTAGCTAGGCGACAGGGGACGAGTCCATTAGCATACNNNGTAAATTGCTGTTGTCTGTTTTTG
+
<////////B/B/////<//////<///////////<///////////###////////////<///////////
@D00611:697:CD0V6ANXX:5:2301:1361:2447 2:N:0:TATCCTAT
TAATACATGACGGTGTCTTAGTAGCACTTACTATGCACAGGTTAAGACCTGTCTCTTATACACATCTCCGAGCCC

After running sinto barcode with -b 12 to extract the first 12 bases of the barcode sequence we have a new file called read1.barcoded.fastq.gz:

@CAATACACTATA:D00611:697:CD0V6ANXX:5:2301:1176:2478 2:N:0:TATCCTCT
GATTTACACAGATGATATGTTTCTATTGCCTGCTTGGGATGGGGGTGGGAGGCAGAGTCCATCTACCTCTCTAAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@CAGAGACGTAAA:D00611:697:CD0V6ANXX:5:2301:1480:2408 2:N:0:TATCCTCT
GTGCCTTTGACTTTAGCTAGGCGACAGGGGACGAGTCCATTAGCATACNNNGTAAATTGCTGTTGTCTGTTTTTG
+
<////////B/B/////<//////<///////////<///////////###////////////<///////////
@AGTCTCGCCACA:D00611:697:CD0V6ANXX:5:2301:1361:2447 2:N:0:TATCCTAT
TAATACATGACGGTGTCTTAGTAGCACTTACTATGCACAGGTTAAGACCTGTCTCTTATACACATCTCCGAGCCC