Bioinformatics snippets

2017-09-14 3 minute read

Trim reads

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -o output.fq.gz input.fq.gz

Map RNA-seq

# build index
STAR --runThreadN 10 \
     --runMode genomeGenerate \
     --genomeDir genome \
     --genomeFastaFiles genome.fa \
     --sjdbGTFfile genes.gtf
     
# map
STAR --runThreadN 20 \
     --genomeDir genome \
     --alignIntronMax 5000 \
     --alignIntronMin 10 \
     --readFilesCommand zcat \
     --readFilesIn input.fq.gz

Pseudomap RNA-seq with kallisto

# build index 
kallisto index -i index.idx Arabidopsis_thaliana.TAIR10.cds.all.fa

#quantify
for filename in *.fq.gz; do
    samplename=(${filename//.fq.gz/ })
    kallisto quant -i index.idx -o $samplename -b 100 --single -l 150 -s 20 $filename -t 4
done

Create coverage track from bam file

bamCoverage is part of deepTools.

bamCoverage -b reads.bam -o coverage.bw

Take command-line arguments

In a shell script

Required flags are followed by :

index=  proc=  path=  

while getopts x:pA opt; do
  case $opt in
  x)
      index=$OPTARG
      ;;
  p)
      proc=$OPTARG
      ;;
  A)
      path=${OPTARG%/}
      ;;
  esac
done
shift $((OPTIND - 1))

In a python script

from argparse import ArgumentParser

version = pkg_resources.require("program")[0].version
parser = ArgumentParser(description='program description')
group = parser.add_mutually_exclusive_group()
parser.add_argument('--version', action='version', version='%(prog)s '+str(version))
parser.add_argument('--option', help='option description', required=False, default=False, action='store_true')
parser.add_argument('-n', '--name', help='sample name', required=True)

GNU Screen

Start screen: screen -S [screen name]
List screens: screen -list
Detach: screen -d
Attach: screen -r
Close screen: ctr-a-d
Kill screen: ctr-a :quit

Python list comprehension

If ever declaring and empty list then using .append, it can be done with list comprehension.

list3 = []
for x in list2:
    if x[0] in list1:
        list3.append(x)

list3 = [x for x in list2 if x[0] in list1]

Programming with dplyr

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

See http://dplyr.tidyverse.org/articles/programming.html

dplyr uses non-standard evaluation which is convienent for interactive use, but makes programming diffucult.

To write functions using dplyr you need to quote the quote the function input using quo(), or quote the input within the function using enquo(). Then, in the dplyr call, unquote the input using !!. Note that this is only possible with dplyr version >= 0.5.0.9. Here’s an example:

group_x_summarize <- function(df, x) {
  my_var <- enquo(x)
  df %>% 
    group_by(!!my_var) %>% 
    summarize(count = n())
}
group_x_summarize(iris, Species)
## # A tibble: 3 x 2
##   Species    count
##   <fct>      <int>
## 1 setosa        50
## 2 versicolor    50
## 3 virginica     50
group_x_summarize(iris, Sepal.Length) %>% head
## # A tibble: 6 x 2
##   Sepal.Length count
##          <dbl> <int>
## 1          4.3     1
## 2          4.4     3
## 3          4.5     1
## 4          4.6     4
## 5          4.7     2
## 6          4.8     5

If you want to change the output names as well, you need to do this:

group_x_summarize <- function(df, x, out_name) {
  my_var <- enquo(x)
  my_name <- quo_name(out_name)
  df %>% 
    group_by(!!my_var) %>% 
    summarize(!!my_name := n())
}
group_x_summarize(iris, Species, "species_count")
## # A tibble: 3 x 2
##   Species    species_count
##   <fct>              <int>
## 1 setosa                50
## 2 versicolor            50
## 3 virginica             50

Note the use of := rather than =.