De novo genome / transcriptome assembly

De novo genome assembly

# copy datasets
rsync -avz ngschool.local:/ngschool/de_novo_assembly .

# enter directory
cd de_novo_assembly

# !! IMPORT BASH VARIABLES IN EVERY NEW WINDOW !!
source /ngschool/.bashrc

Datasets

# list directory content
ls -lah . */
  1. fastq/                FastQ files
    1. genomics
      1. 600_1/2.fq.gz        paired-end; 2x 100bp @ 600bp insert; 100X
      2. 5000_1/2.fq.gz      mate-pairs; 2x 50bp @ 5000bp insert; 10X
    2. transcriptomics
      1. ranseq.fq.gz        single-end; stranded reads from zebrafish
  2. ref/                reference sequence from C. parapsilosis CDC317 (100 kb)
    1. ref.fa

Quality filtering

Check quality of your reads

# FastQC can be executed with graphical interface
fastqc 
# and select FastQ reads                     File >> Open… 

# or use bash mode to process individuals files
fastqc fastq/600_1.fq.gz

# or all files at once
fastqc fastq/*.fq.gz
## *.fq.gz is regex patter: * means any number of any characters 
## thus using *.fq.gz with program will result in processing of all files that end with `.fq.gz`

In the case you have multi-core processor and numerous FastQ files to process, you may execute FastQC on multiple cores ie. `fastqc -t 4` will run 4 processes. To see all FastQC parameters, execute `fastqc -h`. 

Finally, you can see FastQC report(s) using web browser: 

firefox fastq/600_1.fq_fastqc.html &

# or to see all reports
firefox fastq/*_fastqc.html &

###
# older versions of FastQC may create subdirectory for each FastQ file, then do: 
firefox fastq/600_1.fq_fastqc/fastqc_report.html &

# or to see all reports
firefox fastq/*_fastqc/fastqc_report.html &

Explore quality statistics for sequenced reads

Have a look at the good and bad data examples. 
Have you noticed decreasing quality toward the read ends? 
Do you know why this happens?

Trim / remove reads with poor quality bases

filterReads.py -pHv -q 20 -l 31 -o fastq/600 -i fastq/600_?.fq.gz
## you can check all parameters using `filterReads.py -h`

De novo genome assembly

Note, you can open new terminal window and run HTOP in order to monitor processor and memory consumption by each process. In addition, you can add `time ` in front of each command to see how much CPU time each assembler used. 

SPAdes

# prepare & enter working directory
mkdir spades
cd spades

# assemble using paired-end (PE) reads
spades.py --only-assembler -o 600 -1 ../fastq/600_1.fq.gz -2 ../fastq/600_2.fq.gz
## you can check all parameters using `spades.py -h`
## SPAdes offers read-correction algorithm, but beside being very resources demanding, it usually brings little improvement to the final assembly, so we disable read-correction with --only-assembler parameter

# assemble using PE & mate piars (MP)
spades.py --only-assembler -o 600_5000 --pe1-1 ../fastq/600_1.fq.gz --pe1-2 ../fastq/600_2.fq.gz --pe2-1 ../fastq/5000_1.fq.gz --pe2-2 ../fastq/5000_2.fq.gz

Note, in our examples both, PE and MP reads are in FR orientation. This is for simplicity. Normally, MP reads are in RF orientation, then you need to execute SPAdes with `--mp1-1 ../fastq/5000_1.fq.gz --mp1-2 ../fastq/5000_2.fq.gz`.

Ray

# prepare & enter working directory
cd ..
mkdir ray
cd ray

# assemble using PE reads
Ray -o 600 -p ../fastq/600_?.fq.gz

IDBA

# prepare & enter working directory
cd ..
mkdir idba
cd idba

# convert FastQ to shuffled FastA
fastq2shuffledFasta.py ../fastq/600_?.fq.gz > 600.fa

# assemble using PE reads
idba --num_threads 2 -o 600 -r 600.fa

Platanus

cd ..
mkdir platanus
cd platanus

platanus assemble -m 1 -f <(zcat ../fastq/600_1.fq.gz) <(zcat ../fastq/600_2.fq.gz) -t 2 -o 600
platanus scaffold -c 600_contig.fa -b 600_contigBubble.fa -IP <(zcat ../fastq/600_1.fq.gz) <(zcat ../fastq/600_2.fq.gz) -t 2 -o 600
platanus gap_close -c 600_scaffold.fa -IP <(zcat ../fastq/600_1.fq.gz) <(zcat ../fastq/600_2.fq.gz) -t 2 -o 600

Get basic assembly statistics

cd ../
fasta_stats.py -i spades/600*/scaffolds.fasta ray/600/Scaffolds.fasta idba/600/scaffold.fa platanus/600_scaffold.fa

Do you understand the metrics ie. N50, N90?

Which program was the fastest, used the least memory? And which was the slowest / the most memory hungry? 

Dealing with heterozygous genomes / hybrids

# prepare & enter working directory
mkdir hetero
cd hetero

dipSPAdes

# assemble using paired-end (PE) reads
dipspades.py --only-assembler -o 600 -1 ../fastq/600_1.fq.gz -2 ../fastq/600_2.fq.gz 

# assemble using PE & mate piars (MP)
dipspades.py --only-assembler -o 600_5000 --pe1-1 ../fastq/600_1.fq.gz --pe1-2 ../fastq/600_2.fq.gz --pe2-1 ../fastq/5000_1.fq.gz --pe2-2 ../fastq/5000_2.fq.gz

Redundans

# improve SPAdes assembly using PE and MP libraries
/ngschool/src/redundans/redundans.py -v -i ../fastq/600_?.fq.gz ../fastq/5000_?.fq.gz -f ../spades/600/contigs.fasta -o redundans --sspacebin $SSPACEBIN
/ngschool/src/redundans/redundans.py -v -i ../fastq/600_?.fq.gz ../fastq/5000_?.fq.gz -f ../spades/600_5000/contigs.fasta -o redundans2 --sspacebin $SSPACEBIN

Note, redundans may be installed in another path in your system (ie. ~/src/redundans/redundans.py), so you may have to adapt the command accordingly. To see all options either run `~/src/redundans/redundans.py -h` or visit https://github.com/lpryszcz/redundans. You can play with other assemblies ie. Ray or IDBA or skip PE or MP libraries to see the effect of such changes. Visit https://github.com/lpryszcz/redundans/tree/master/test#run-statistics to get more info about the process 

Can you see which assembly statistics are improved by each step of heterozygous genome assembly pipeline? 

# get statistics
fasta_stats.py -i 600*/dipspades/consensus_contigs.fasta redundans*/scaffolds.filled.fa

Can you see the differences between the programs?
 

Compare assemblies using whole genome alignment

You can assess the accuracy of assembled contigs by aligning them back onto reference sequence. This is only possible, when you have reference genome (or genome of some closely-related species) available. 

cd ../

# index reference genome (need to be done only once per each reference)
lastdb ref/ref.fa ref/ref.fa

# align & generate dotplot on the fly
lastal -f TAB ref/ref.fa spades/600/scaffolds.fasta | last-dotplot - spades.png
# open plot
eog spades.png &

# align & generate dotplot on the fly
lastal -f TAB ref/ref.fa hetero/600_5000/dipspades/consensus_contigs.fasta | last-dotplot - dipspades.png
# open plot
eog dipspades.png &

# align & generate dotplot on the fly
lastal -f TAB ref/ref.fa hetero/redundans/scaffolds.filled.fa | last-dotplot - redundans.png
# open plot
eog redundans.png &

 

Compare assemblies using Quast

Quast evaluates genome assemblies and report several metrics. 

quast.py -R ref/ref.fa hetero/600/spades/scaffolds.fasta hetero/600/dipspades/consensus_contigs.fasta hetero/redundans/contigs.fa hetero/redundans/scaffolds.filled.fa

# open the results
firefox quast_results/latest/report.html &

Explore the results, you can have a look at contig browser (press `View in Icarus contig browser`). 

De novo transcriptome assembly

Quality filtering

Check quality of your reads

# FastQC can be executed with graphical interface
fastqc 
# and select FastQ reads                     File >> Open… 

# or use bash mode 
fastqc fastq/rnaseq.fq.gz


Explore quality statistics for sequences reads

Have a look at the good and bad data examples. 
Have you noticed decreasing quality toward the read ends? 
Do you know why this happens?

Trim / remove reads with poor quality bases

filterReads.py -Hv -q 20 -l 31 -o fastq/rnaseq -i fastq/rnaseq.fq.gz
## you can check all parameters using `filterReads.py -h`

Run Trinity

# check all options with `Trinity -h`

# run Trinity
Trinity --seqType fq --output trinity --single fastq/rnaseq.fq.gz --max_memory 1G --SS_lib_type F
## important parameters --genome_guided_bam if you have genome assembled and want to use is for guided transcript discovery

# get statistics
fasta_stats.py -i trinity/Trinity.fasta

Predict Open Reading Frames (ORFs)

cd trinity

# run transdecoder
TransDecoder.LongOrfs -t Trinity.fasta
## important parameters are -G (genetic code) and -S (stranded RNAseq)

# get transcripts stats
grep ">" Trinity.fasta.transdecoder_dir/longest_orfs.pep | cut -f2 -d" " | sort | uniq -c
     10 type:3prime_partial
     27 type:5prime_partial
     10 type:complete
     69 type:internal

# get homologs or protein domains from PFAM (not covered by the tutorial)
hmmscan --cpu 1 --domtblout pfam.domtblout /opt/interproscan-5.19-58.0/data/pfam/29.0/pfam_a.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.out
# note, you will need to download PFAM db and set proper /path/to/Pfam-A.hmm

# get final transcripts / peptides
TransDecoder.Predict -t Trinity.fasta --retain_pfam_hits pfam.domtblout 

## optionally add blastP hist with --retain_blastp_hits blastp.outfmt6

 

slides