User guide
Create scATAC-seq fragments file
An ATAC-seq fragment file can be created from a BAM file using the fragments
command.
The fragment file contains the position of each Tn5 integration site, the cell barcode
associated with the fragment, and the number of times the fragment was sequenced.
PCR duplicates are collapsed.
sinto fragments [-h] -b BAM -f FRAGMENTS [-m MIN_MAPQ] [-p NPROC] [-t BARCODETAG]
[-c CELLS] [--barcode_regex BARCODE_REGEX] [--use_chrom USE_CHROM]
[--max_distance MAX_DISTANCE] [--min_distance MIN_DISTANCE]
[--chunksize CHUNKSIZE] [--shift_plus SHIFT_PLUS]
[--shift_minus SHIFT_MINUS] [--collapse_within]
Create ATAC-seq fragment file from BAM file
optional arguments:
-h, --help show this help message and exit
-b BAM, --bam BAM Input bam file (must be indexed)
-f FRAGMENTS, --fragments FRAGMENTS
Name and path for output fragments file. Note that the output is
not sorted or compressed. To sort the output file use sort -k 1,1
-k2,2n
-m MIN_MAPQ, --min_mapq MIN_MAPQ
Minimum MAPQ required to retain fragment (default = 30)
-p NPROC, --nproc NPROC
Number of processors (default = 1)
-t BARCODETAG, --barcodetag BARCODETAG
Read tag storing cell barcode information (default = "CB")
-c CELLS, --cells CELLS
Path to file containing cell barcodes to retain, or a comma-
separated list of cell barcodes. If None (default), use all cell
barocodes present in the BAM file.
--barcode_regex BARCODE_REGEX
Regular expression used to extract cell barcode from read name. If
None (default), extract cell barcode from read tag. Use "[^:]*" to
match all characters up to the first colon.
--use_chrom USE_CHROM
Regular expression used to match chromosomes to be included in
output. Default is "(?i)^chr" to match all chromosomes starting
with "chr", case insensitive
--max_distance MAX_DISTANCE
Maximum distance between integration sites for the fragment to be
retained. Allows filtering of implausible fragments that likely
result from incorrect mapping positions. Default is 5000 bp.
--min_distance MIN_DISTANCE
Minimum distance between integration sites for the fragment to be
retained. Allows filtering of implausible fragments that likely
result from incorrect mapping positions. Default is 10 bp.
--chunksize CHUNKSIZE
Number of BAM file entries to iterate over before collapsing the
fragments and writing to disk. Higher chunksize will use more
memory but will be faster.
--shift_plus SHIFT_PLUS
Number of bases to shift Tn5 insertion position by on the forward
strand.
--shift_minus SHIFT_MINUS
Number of bases to shift Tn5 insertion position by on the reverse
strand.
--collapse_within Take cell barcode into account when collapsing duplicate fragments.
Setting this flag means that fragments with the same coordinates
can be identified provided they originate from a different cell
barcode.
Fragment file format
The fragment file is a BED format file containing the positions of Tn5 integration sites, the cell barcode that the DNA fragment originated from, and the number of times the fragment was sequenced. See the 10x Genomics website for a further description of the fragment file format.
This is a convenient compressed form of the most useful data generated in a scATAC-seq experiment. The fragments file generated by Sinto needs to be sorted, block-gzip compressed (bgzip), and indexed using tabix.
For example:
sort -k 1,1 -k2,2n frags.bed > frags.sort.bed
bgzip frags.sort.bed
tabix -p bed frags.sort.bed
How the fragment file is generated
Generating the fragment file involves the following steps in order:
Extract cell barcode sequence associated with the fragment.
Adjust alignment positions for the 9 bp Tn5 shift by applying +4/-5 to the start and end position of the paired reads.
Remove fragments where either read has a MAPQ score less than the specified cutoff.
Remove fragments where the fragment size is greater than the specified maximum.
Collapse PCR duplicates:
Count the frequency of each fragment for each cell barcode.
Within a cell barcode, collapse fragments that share a start or end coordinate on the same chromosome.
Across all cell barcodes, collapse fragments that share the exact start and end coordinates on the same chromosome.
Assign the fragment to the most abundant cell barcode.
Record the read count for the collapsed fragment.
Write fragments to file. Note that fragments are not sorted or compressed.
Note that setting the --collapse_within
parameter will change how step 5
is handled.
Additional arguments for the fragments function
Number of processors: --nproc
Multiple cores can be used by specifying the --nproc
argument.
Specifying multiple processors will parallelize across chromosomes. Currently,
at most one thread is used per chromosome, so there is no point specifiying
more processors than the number of chromosomes.
Minimum mapping quality: --min_mapq
The minimum allowed mapping quality (MAPQ) can be set using --min_mapq
.
Depending on the aligner used, the MAPQ value can mean different things.
Cellranger-atac uses bwa-mem
for alignment, which follows the SAM spec and
reports Phred scores as MAPQ values:
MAPping Quality. It equals -10 log10 Pr {mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
Cell barcode tag: --barcodetag
Different methods may use different tags to store the cell barcode.
Cellranger uses the CB
tag, which is set as the default for Sinto.
Other methods may use different tags, for example SNARE-seq uses the XC
tag.
You can work out what tag is used by looking at part of the BAM file:
samtools view aln.bam | head
.
Cell barcode regex: --barcode_regex
Some methods store the cell barcode in the read name rather than under a read tag.
If this is the case, you can use a regular expression to extract the cell barcode
from the read name. For example, if the first section of your read name
up until the first :
character corresponds to the cell barcode sequence,
you can specify --barcode_regex [^:]*
to correcly match the cell barcodes.
Choosing chromosomes to include: --use_chrom
Often a genome build might contain several scaffolds that are not typically used in downstream analysis. This option allows you to specify a regular expression to match chromosome names that will be retained in the output. By default, all chromosomes starting with “chr” are retained, case insensitive (ie, “Chr”, and “CHR” are also retained).
Set the maximum distance between Tn5 integration sites: --max_distance
Incorrect alignment can sometimes generate implausible fragment coordinates.
Since we known there is an upper limit to the size of a DNA molecule that
can be sequenced on the Illumina platform, very large fragments over 5 kb
in size likely originate from incorrect read mapping. We can remove these
to reduce the impact of mapping artefacts on the downstream analysis
by setting the --max_distance
parameter. Fragments larger than
this value will not be included in the output file.
Set the maximum number of fragments to hold in memory before collapsing: --chunksize
The fragments algorithm iterates through a position-sorted BAM file and stores
fragment information as it iterates through the paired reads. Once all the
reads at a genomic locus have been read, the fragments covering that locus can
be PCR-collapsed. Sinto performs this step in chunks to balance speed and memory
use. The --chunksize
parameter controls how many fragments are able to be
held in memory before they get collapsed and written to a file. Setting a larger
value should require more memory but the function will complete faster.
Change the Tn5 shift applied: --shift_plus
and --shift_minus
The fragments algorithm adjusts Tn5 integration positions based on the 9 bp stagger that is introduced when Tn5 integrates into the DNA. By default, a +4/-5 bp shift is applied to account for this. Different shifts can be applied by setting these parameters.
Change PCR duplicate removal strategy: --collapse_within
PCR duplicates are identified as fragments that share the same start and end
coordinates. By default (and for mostly historical reasons), the cell barcode
is not taken into account when collapsing PCR duplicates. To only consider
fragments as duplicates if they share the same start and end coordinate and
originate from the same cell barcode, the --collapse_within
parameter can
be used.
Filter cell barcodes from BAM file
Reads for a subset of cells can be extracted from a BAM file using the filterbarcodes
command.
This requires a position-sorted, indexed BAM file, and a file containing a list of cell barcodes to retain.
sinto filterbarcodes [-h] -b BAM -c CELLS [-t] [-p NPROC]
[--barcode_regex BARCODE_REGEX]
[--barcodetag BARCODETAG] [--outdir OUTDIR] [-s]
Filter reads based on input list of cell barcodes
optional arguments:
-h, --help show this help message and exit
-b BAM, --bam BAM Input bam file (must be indexed)
-c CELLS, --cells CELLS
File or comma-separated list of cell barcodes. Can be
gzip compressed
-t, --trim_suffix Remove trail 2 characters from cell barcode in BAM
file
-p NPROC, --nproc NPROC
Number of processors (default = 1)
--barcode_regex BARCODE_REGEX
Regular expression used to extract cell barcode from
read name. If None (default), extract cell barcode
from read tag. Use "[^:]*" to match all characters up
to the first colon.
--barcodetag BARCODETAG
Read tag storing cell barcode information (default =
"CB")
--outdir OUTDIR Output file directory
-s, --sam Output sam format (default bam output)
The input “cells” file should be a tab-delimited text file with cell barcodes in the first column and the groups the cell belongs to in the second column. This could be the cluster number, for example. A cell can belong to multiple groups specified in the file using a comma-separated list of groups. If multiple groups are provided, reads from that cell will be copied to the output BAM file for each of the groups.
Example input “cells” file:
TGGCAATGTTGAAGCG-1 A
GACCAATCACCATTCC-1 A
CAGGATTCAGAACTTC-1 B
GAACCTAAGAGAGGTA-1 B,A
ACATGGTGTAGACGCA-1 C
CCCTGATTCGGATAGG-1 C
The names of the output BAM files are determined by the name of each group in the
input cells file. The example file above would generate three bam files,
named A.bam
, B.bam
, and C.bam
. Note that reads from the fourth cell
would appear in both B.bam
and A.bam
.
Convert read tag to read group
Read groups can be added to a SAM/BAM file based on an arbitrary read tag using the
tagtorg
command. Let’s assume we have a SAM file called input.sam
with the following contents:
@HD VN:1.5 SO:coordinate
@SQ SN:20 LN:63025520
@RG ID:rg1 SM:sample_1 LB:1 PU:1 PL:ILLUMINA
r002 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * CB:Z:AAAA-1 RG:Z:rg1
r003 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * CB:Z:CCCC-1 RG:Z:rg1
We would like to assign each read to a separate read group according to the value of
it’s CB
tag. First, we need a list of tag values that we expect to see:
AAAA-1
CCCC-1
Let us assume that the barcodes are stored in a file called barcodes.txt
.
Then we can replace the read groups in the SAM file using the command:
sinto tagtorg -b input.sam -f barcodes.txt
This will print the following SAM file to screen:
@HD VN:1.5 SO:coordinate
@SQ SN:20 LN:63025520
@RG ID:rg1 SM:sample_1 LB:1 PU:1 PL:ILLUMINA
@RG ID:rg1:CCCC-1 SM:sample_1:CCCC-1 LB:1 PU:1 PL:ILLUMINA
@RG ID:rg1:AAAA-1 SM:sample_1:AAAA-1 LB:1 PU:1 PL:ILLUMINA
r002 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * CB:Z:AAAA-1 RG:Z:rg1:AAAA-1
r003 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * CB:Z:CCCC-1 RG:Z:rg1:CCCC-1
Two new @RG tags have been added to the header with SM fields that are cell barcode-specic. The two reads r002 and r003 have been assigned new RG tags according to their cell barcode.
usage: sinto tagtorg [-h] -b BAM [--tag TAG] -f TAGFILE [-o OUTPUT] [-O O]
Append a read tag to the read group ID of each read. Also appends the read tag
to the SM field of the read group.
optional arguments:
-h, --help show this help message and exit
-b BAM, --bam BAM Input SAM/BAM file, '-' reads from stdin
--tag TAG Read tag to extract the value from that is appended to
the read group. Default is 'CB', the tag that is used
in 10x sequencing to identify cells.
-f TAGFILE, --tagfile TAGFILE
List of expected tag values. Reads with tag values
that are not in this list are not altered.
-o OUTPUT, --output OUTPUT
Output SAM/BAM file, '-' outputs to stdout (default
'-')
-O OUTPUTFORMAT, --outputformat OUTPUTFORMAT
Output format. One of 't' (SAM), 'b' (BAM), or 'u'
(uncompressed BAM) ('t' default)
Copy/move read tag to another read tag
Read tags can be renamed or copied to anthor read tag using the tagtotag
command.
Let’s assume we have a SAM file called input.sam
with the following contents:
@HD VN:1.5 SO:coordinate
@SQ SN:20 LN:63025520
r002 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * CB:Z:AAAA-1
r003 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * CB:Z:CCCC-1
We would like to rename the CB tag to another arbitrary tag, let’s call it xx. If we run the following command:
sinto tagtotag --from CB --to xx --delete --bam - -o -
This will print the following SAM file to screen:
@HD VN:1.5 SO:coordinate
@SQ SN:20 LN:63025520
r002 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * xx:Z:AAAA-1
r003 0 20 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * xx:Z:CCCC-1
The two CB tags have been renamed to xx. If we wish to keep the original CB tag, then
we can drop --delete
from the command.
usage: sinto tagtotag [-h] -b BAM --from FROM_ --to TO [--delete] [-o OUTPUT]
[-O OUTPUTFORMAT]
Copies BAM entries to a new file while copying a read tag to another read tag
and optionally deleting the originating tag.
optional arguments:
-h, --help show this help message and exit
-b BAM, --bam BAM Input SAM/BAM file, '-' reads from stdin
--from FROM_ Read tag to copy from.
--to TO Read tag to copy to.
--delete Delete originating tag after copy (i.e. move).
-o OUTPUT, --output OUTPUT
Output SAM/BAM file, '-' outputs to stdout (default
'-')
-O OUTPUTFORMAT, --outputformat OUTPUTFORMAT
Output format. One of 't' (SAM), 'b' (BAM), or 'u'
(uncompressed BAM) ('t' default)
Add cell barcodes to FASTQ read names
Cell barcodes from one FASTQ file can be added to the read names of another, or the same,
FASTQ file using the barcode
command. This is useful when processing raw single-cell
sequencing data, as the cell barcode information can easily be propagated to the aligned
BAM file by encoding the cell barcode in the read name. Both gzipped and uncompressed
FASTQ files are supported as input. Running on uncompressed FASTQ is usually much faster
than running on gzipped FASTQ files.
Running this command will generate new gzipped FASTQ files with the read names modified to
contain the cell barcode sequence at the beginning of the read name, separated from the
original read name by a :
character. The output files will be the name of the input
file with .barcoded.fastq.gz
at the end of the file name.
sinto barcode [-h] --barcode_fastq BARCODE_FASTQ --read1 READ1
[--read2 READ2] -b BASES [--prefix PREFIX]
[--suffix SUFFIX] [--whitelist WHITELIST]
Add cell barcode sequences to read names in FASTQ file.
optional arguments:
-h, --help show this help message and exit
--barcode_fastq BARCODE_FASTQ
FASTQ file containing cell barcode sequences
--read1 READ1 FASTQ file containing read 1
--read2 READ2 FASTQ file containing read 2
-b BASES, --bases BASES
Number of bases to extract from barcode-containing
FASTQ
--prefix PREFIX Prefix to add to cell barcodes
--suffix SUFFIX Suffix to add to cell barcodes
--whitelist WHITELIST
Text file containing barcode whitelist
Additional arguments for the barcode function
Bases: --bases
This controls how many bases from the read containing the cell barcode are used.
Bases are counted from the beginning of the read sequence in the FASTQ file. For
example, --bases 12
will extract the first 12 sequenced bases from the read
and use it as the cell barcode.
Barcode read file: --barcode_fastq
FASTQ file with reads containing the cell barcode sequence.
Read 1 and read 2: --read1
and --read2
FASTQ files containing reads to which the cell barcode information will be added. Note that these files must contain the same number of reads as the barcode-containing FASTQ file, and the reads must appear in the same order.
Example
Take the following two FASTQ files as an example. The first contains cell barcode sequences and the second we want to add those sequences to the read name.
barocde_file.fastq.gz
:
@D00611:697:CD0V6ANXX:5:2301:1176:2478 1:N:0:TATCCTCT
CAATACACTATATGGGAGACGTTTTTTTTT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFF
@D00611:697:CD0V6ANXX:5:2301:1480:2408 1:N:0:TATCCTCT
CAGAGACGTAAACAATGGCGGTTTTTTTTT
+
B<BBBFFFFFFFFFFFFFFFFFFFFFFFFF
@D00611:697:CD0V6ANXX:5:2301:1361:2447 1:N:0:TATCCTAT
AGTCTCGCCACATGGGGGGGATTTTTTTTT
read1.fastq.gz
:
@D00611:697:CD0V6ANXX:5:2301:1176:2478 2:N:0:TATCCTCT
GATTTACACAGATGATATGTTTCTATTGCCTGCTTGGGATGGGGGTGGGAGGCAGAGTCCATCTACCTCTCTAAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@D00611:697:CD0V6ANXX:5:2301:1480:2408 2:N:0:TATCCTCT
GTGCCTTTGACTTTAGCTAGGCGACAGGGGACGAGTCCATTAGCATACNNNGTAAATTGCTGTTGTCTGTTTTTG
+
<////////B/B/////<//////<///////////<///////////###////////////<///////////
@D00611:697:CD0V6ANXX:5:2301:1361:2447 2:N:0:TATCCTAT
TAATACATGACGGTGTCTTAGTAGCACTTACTATGCACAGGTTAAGACCTGTCTCTTATACACATCTCCGAGCCC
After running sinto barcode
with -b 12
to extract the first 12 bases of the barcode sequence
we have a new file called read1.barcoded.fastq.gz
:
@CAATACACTATA:D00611:697:CD0V6ANXX:5:2301:1176:2478 2:N:0:TATCCTCT
GATTTACACAGATGATATGTTTCTATTGCCTGCTTGGGATGGGGGTGGGAGGCAGAGTCCATCTACCTCTCTAAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@CAGAGACGTAAA:D00611:697:CD0V6ANXX:5:2301:1480:2408 2:N:0:TATCCTCT
GTGCCTTTGACTTTAGCTAGGCGACAGGGGACGAGTCCATTAGCATACNNNGTAAATTGCTGTTGTCTGTTTTTG
+
<////////B/B/////<//////<///////////<///////////###////////////<///////////
@AGTCTCGCCACA:D00611:697:CD0V6ANXX:5:2301:1361:2447 2:N:0:TATCCTAT
TAATACATGACGGTGTCTTAGTAGCACTTACTATGCACAGGTTAAGACCTGTCTCTTATACACATCTCCGAGCCC