#NGSchool2018 Materials

Materials in this book are reproduced as internal material for participants of the Summer School in Bioinformatics & NGS Data Analysis (#NGSchool2018). If you want to use any of the materials included here for other purposes, please ask the individual contributors for permission.

# log in to a remote server (if using your own laptop is not possible)
ssh username@vm1.ngschool.xyz
# or
ssh username@147.228.242.14
# alternatively, log in to the second remote server
ssh username@vm2.ngschool.xyz
# or
ssh username@147.228.242.15
# download miniconda
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
# install miniconda
bash Miniconda2-latest-Linux-x86_64.sh
# add path to the pre-configured shared environments
# create config file in the home folder
touch .condarc
# put the following text into the file and save (use your favorite text editor)
envs_dirs:
 - /mnt/shared_conda/envs
# restart bash so that the new settings take effect
bash
# check if everything is ok and you see the envs
conda info -e
# activate needed env
source activate nameoftheenvironment
# change to the /mnt directory
cd /mnt
# check the content
ls -al
# load the humanSV settings into the current shell
source /mnt/humanSV/.humansv

 

 

Leszek Pryszcz Mon, 04/09/2018 - 13:56

Preparation for the #NGSchool2018

The course is open to people without any background in Computational Biology, but everyone should be familiar with the basics of working on the command line (Linux), programming and statistics. Therefore, please complete the mandatory courses before attending the #NGSchool. In addition, if you are interested in other aspects, you are welcome to continue with some of the supplementary courses.

If you run into any problems, feel free to post in the #NGSchool2018 group.

Mandatory on-line courses

Supplementary courses

Prerequisites

Can I work remotely (SSH) or use VirtualBox? 

We will have access to HPC nodes and one server locally, so in principle remote working will be possible, but installing Ubuntu on your laptop is strongly recommended. You will be able to connect to your own remote machine (e.g. at work); just remember that working remotely comes with limitations, e.g. no or slow graphical rendering.

Alternatively, you can install Ubuntu in VirtualBox or install it as a Windows program using Wubi. Note that these alternatives come with certain limitations, so a standard installation is recommended.

Resources

Alina Frolova Thu, 09/06/2018 - 10:15

Guidelines for Speakers

General things

  1. A lecture is usually 1 hour long; don't forget to leave time for questions.
  2. Workshop slots are typically 3 hours. This should include a theoretical introduction (often 30 min will be more than enough) and practical exercises (here, the more the better ;) ).
  3. Hackathons might require a short intro with slides as well; we will have projectors in the lab rooms for this purpose.
  4. Each student will have a laptop with Ubuntu installed. Please prepare your exercises so that they can run even on older laptops in a reasonable amount of time, e.g. for the de novo assembly workshop we used a 100 kb region of one chromosome.
  5. If some students have trouble with their laptops that cannot be solved in a reasonable amount of time, we will provide them with an account on the virtual machine.
  6. Please add the software you will need in your workshop/hackathon to this list. We'll ask all participants to install it upfront (and will also install it on our HPC nodes).

Depositing data

Please deposit the dataset you need in the dedicated GitHub repo. You can create a pull request or ask us to give you access. If the files are over 100 MB, please add a link to the dataset here.

Creating materials

Adding new content

  1. Navigate to the materials page.
  2. Select `Workshops`, `Lectures` or `Hackathons`.
  3. Navigate to your page and edit the content.
  4. Upload slides.
  5. Press `Save and publish`.

Adding nicely formatted source code / snippets of code

  1. Change `Text format` to `Full HTML`.
  2. Press the `Insert code snippet` button.
  3. Select the syntax language.
  4. Paste your code and press OK.

Quick Edit: In-place editing of content

Drupal offers Quick Edit, which lets you edit any content and see its effect in the current browser window. I strongly recommend using it, as it's very handy. You can access it by clicking the small pencil symbol in the top right corner of any content and selecting `Quick edit`.

 

Leszek Pryszcz Mon, 04/09/2018 - 13:54

Lectures

Alina Frolova Mon, 09/10/2018 - 18:57

Precision Oncology. Applications of tumor genome and transcriptome sequencing. Liquid biopsy - Stephan Ossowski

Stephan Ossowski Mon, 09/10/2018 - 22:08

Pushing state-of-the art in transcriptomics and metagenomics on the road to personalized medicine - Paweł Łabaj

Paweł P Łabaj Mon, 09/10/2018 - 22:13

Molecular biology methods in forensics - Kamil Januszkiweicz

Alina Frolova Mon, 09/10/2018 - 22:15

Magnusiomyces/Saprochaete genome project + our experiences with MinION sequencing - Jozef Nosek

Jozef Nosek Mon, 09/10/2018 - 22:16

Next generation sequencing in clinical diagnosis - what do we need to improve analysis - Monika Goś

mgos Mon, 09/10/2018 - 22:17

MinION explained: principles, running and hacking - Leszek Pryszcz

Leszek Pryszcz Fri, 05/31/2019 - 16:23
slides

Insights into the rare and the small: genome sequencing of non-model organisms - Rosa Fernandez

Rosa Fernández Mon, 09/10/2018 - 22:18

Genome sequencing reveals how DNA repair is organized in human cells - Fran Supek

Fran Supek Mon, 09/10/2018 - 22:19

Workshops

Alina Frolova Mon, 09/10/2018 - 18:34

De-novo genome assembly - Robert Vaser

Data set

The read set needed for this workshop is freely available here, but you should instead download the archive from Dropbox. It contains the same read set together with intermediate results produced during the workshop, and a smaller data set for the nanopolish tutorial taken from the nanopolish wiki pages. To unpack it, run:

tar -xzvf ngsschool2018_data.tar.gz

and you should get the following files:

  • escherichia_coli_K12_MG1655.fasta - reference genome
  • escherichia_coli_map006_r7_3.fastq - Oxford Nanopore read set consisting of ~25k reads with average length of ~10k (genome coverage around 55x)
  • escherichia_coli_map006_r7_3_layout.fasta - raw assembly after layout step obtained by using rala on overlaps produced by minimap2
  • escherichia_coli_map006_r7_3_consensus.fasta - assembly after consensus step obtained by applying two iterations of racon on the raw assembly from layout step
  • escherichia_coli_map006_r7_3_nanopolished.fasta - assembly after polishing step obtained by applying nanopolish on the assembly obtained from consensus step
  • escherichia_coli_map006_r7_3_illumina_polished.fasta - assembly after polishing step obtained by applying one iteration of racon with Illumina reads on the assembly obtained from consensus step
  • escherichia_coli_r9_4.fastq - Oxford Nanopore read subset used in nanopolish tutorial
  • escherichia_coli_r9_4_draft.fasta - draft assembly used in nanopolish tutorial
  • escherichia_coli_r9_4_fast5_files/ - directory with Oxford Nanopore signals matching the read subset used in nanopolish tutorial
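
The coverage quoted for the Oxford Nanopore read set can be sanity-checked with simple arithmetic: total sequenced bases divided by genome length. The sketch below uses the approximate read count and read length from the list above. The genome length of 4,641,652 bp is an assumption, namely that the provided reference matches the published size of E. coli K-12 MG1655.

```python
# Rough expected coverage = (number of reads * mean read length) / genome length.
# The read count and mean length are the approximate values quoted in the file list.
# GENOME_LENGTH is an assumption: the published size of E. coli K-12 MG1655.

def expected_coverage(num_reads: int, mean_read_length: float, genome_length: int) -> float:
    """Return the average number of times each genome base is covered by reads."""
    return num_reads * mean_read_length / genome_length


GENOME_LENGTH = 4_641_652  # bp, assumed length of the reference genome
coverage = expected_coverage(25_000, 10_000, GENOME_LENGTH)
print(f"{coverage:.1f}x")  # 53.9x, i.e. roughly the ~55x quoted above
```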

You can check more about file formats here: FASTA, FASTQ.

Required tools

Download and compile all tools listed below. Those that have (B) after their name should be available through bioconda, which makes them easier to install. Otherwise, follow the installation part of the READMEs at the respective GitHub pages.

Note: to prepare rala for the workshop execute the following commands:

git clone --recursive https://github.com/rvaser/rala
cd rala
git checkout workshop
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

De novo genome assembly of Escherichia coli

We will now try to assemble the Oxford Nanopore dataset of Escherichia coli and assess the quality and completeness of the assembly. We will use the OLC (overlap-layout-consensus) approach: we first find all pairwise overlaps between reads, create an assembly graph, simplify it to obtain linear stretches of fragments called contigs, and at the end increase the accuracy of our draft assembly with various types of data.

Note: to speed up the execution you can run almost every tool with the option -t <number_of_threads>.

Overlap step

To obtain all pairwise overlaps in PAF format (more about the format here), run minimap2 with the following command:

minimap2 -x ava-ont escherichia_coli_map006_r7_3.fastq escherichia_coli_map006_r7_3.fastq > overlaps.paf
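
To get a feeling for what overlaps.paf contains, here is a minimal sketch that splits one PAF line into its 12 mandatory tab-separated columns. The field names in `PafRecord` and the example line are made up for illustration; they are not taken from the workshop data.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (field names chosen for this sketch)."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    num_matches: int
    alignment_length: int
    mapping_quality: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line; optional tag columns after the 12th are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
    )


# Hypothetical overlap between two reads, invented for this example.
example = "read_1\t10000\t6000\t9900\t+\tread_2\t12000\t100\t4000\t3000\t3900\t0\ttp:A:S"
rec = parse_paf_line(example)
overlap_len = rec.query_end - rec.query_start
print(rec.query_name, rec.target_name, overlap_len)  # read_1 read_2 3900
```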

Layout step

An assembly graph is built using rala with the following command (execute the same command after each change of the src/main.cpp file described below):

rala escherichia_coli_map006_r7_3.fastq overlaps.paf

This will save the graph into assembly_graph.gfa in GFA format (more about the format here), which can be visualized in Bandage. The initial assembly graph is quite complex and it is hard to reconstruct the genome from it. Therefore, several simplification methods are applied to the graph in sequential order: transitive edge removal, tip removal and bubble popping. This leaves us with a few junctions in the graph, for which we will draw pile-ograms, i.e. plots of the per-base coverage of a read computed from the overlap file. To apply the simplification methods and print the pile-ograms, change the src/main.cpp file in rala accordingly:

// build the assembly graph
auto graph = rala::createGraph(input_paths[0], input_paths[1], num_threads);
graph->construct(false, false);

// simplify the assembly graph
graph->remove_transitive_edges();
while (true) {
    uint32_t num_changes = 0;
    num_changes += graph->remove_tips();
    num_changes += graph->remove_bubbles();
    if (num_changes == 0) {
        break;
    }
}

// print the assembly graph
graph->print_gfa("assembly_graph.gfa");
graph->print_json("assembly_graph.json"); // prints pile-ograms of reads that tangle the graph

// extract contigs
std::vector<std::unique_ptr<rala::Sequence>> contigs;

After compiling and running rala again, we will obtain the assembly graph drawn below.

Assembly graph with several simplification methods applied
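
To illustrate the first simplification method named above, here is a toy sketch of transitive edge removal on a tiny directed graph. It is not rala's implementation, which works on overlap coordinates; the dictionary-of-sets graph and the read names are assumptions made for this example. An edge u -> w is treated as transitive when some other node v provides the path u -> v -> w.

```python
from typing import Dict, Set


def remove_transitive_edges(graph: Dict[str, Set[str]]) -> int:
    """Remove every edge u->w that is implied by a path u->v->w; return the count."""
    to_remove = set()
    for u, successors in graph.items():
        for v in successors:
            for w in graph.get(v, set()):
                # u->w is redundant because u->v and v->w already connect u to w
                if w in successors and w != v and w != u:
                    to_remove.add((u, w))
    # delete only after the scan, so the decision is based on the original graph
    for u, w in to_remove:
        graph[u].discard(w)
    return len(to_remove)


# Toy example: read A overlaps B and C, and B overlaps C, so A->C is transitive.
reads = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
removed = remove_transitive_edges(reads)
print(removed, sorted(reads["A"]))  # 1 ['B']
```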

This run will also create the assembly_graph.json file which contains our pile-ograms. To print them run:

python misc/plotter.py assembly_graph.json

The misc/plotter.py script can be found in the rala directory. If any errors arise, you might need to install matplotlib. You should be able to observe special types of pile-ograms, i.e. those with a sudden drop in coverage and those with hills whose peaks rise above the average base coverage. To determine whether those reads are chimeric or contain a repeat region which might induce false overlaps, extract one or several of them from the read set and visualize the alignment in Gepard. Below is an example of pile-ograms, including the read with id 5678.

Pile-ogram of the read with id 5678

To extract this particular read from the whole set run the following command:

head -n 22714 escherichia_coli_map006_r7_3.fastq | tail -n 2 > test.fasta

where 22714 was obtained by multiplying the read id (5678) by 4, since each FASTQ record spans four lines, and adding 2 (don't forget to manually change the first character of the sequence header from @ to >).
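
The same extraction can be sketched in Python, which also takes care of the @ to > replacement. This assumes that every FASTQ record occupies exactly four lines and that the read id is the 0-based position of the read in the file, which is what the head/tail arithmetic implies. The function name and the toy reads are made up for illustration.

```python
import io
from typing import TextIO


def fastq_record_to_fasta(fastq: TextIO, read_index: int) -> str:
    """Return read number `read_index` (0-based) as a two-line FASTA record."""
    header_line = 4 * read_index  # 0-based line number of the '@' header
    header = None
    for line_number, line in enumerate(fastq):
        if line_number == header_line:
            header = line.rstrip("\n")
            if not header.startswith("@"):
                raise ValueError("expected a FASTQ header starting with '@'")
        elif line_number == header_line + 1:
            # swap the leading '@' for '>' and keep only the sequence line
            return ">" + header[1:] + "\n" + line.rstrip("\n") + "\n"
    raise IndexError("read index is past the end of the file")


# Toy FASTQ with three reads, invented for this example.
toy = "@r0\nACGT\n+\nIIII\n@r1\nGGCC\n+\nIIII\n@r2\nTTAA\n+\nIIII\n"
fasta = fastq_record_to_fasta(io.StringIO(toy), 1)
print(fasta, end="")
# >r1
# GGCC
```

For the workshop read set you would open escherichia_coli_map006_r7_3.fastq instead of the toy string and pass 5678 as the index.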

In order to remove chimeric reads and false overlaps induced by repeat regions, and finally obtain the reconstructed genome, change the src/main.cpp file to look like this:

// build the assembly graph
auto graph = rala::createGraph(input_paths[0], input_paths[1], num_threads);
graph->construct(true, true); // enable chimeric removal and false overlap removal

// simplify the assembly graph
graph->remove_transitive_edges();
while (true) {
    uint32_t num_changes = 0;
    num_changes += graph->remove_tips();
    num_changes += graph->remove_bubbles();
    if (num_changes == 0) {
        break;
    }
}

// print the assembly graph
graph->print_gfa("assembly_graph.gfa");
graph->print_json("assembly_graph.json");

// extract contigs
std::vector<std::unique_ptr<rala::Sequence>> contigs;
graph->create_unitigs(); // join linear paths of the graph into one vertex
graph->extract_contigs(contigs);
for (const auto& it: contigs) {
    fprintf(stdout, ">%s\n%s\n", it->name().c_str(), it->data().c_str());
}

Run make in the build folder and execute the following command to obtain the raw assembly:

rala escherichia_coli_map006_r7_3.fastq overlaps.paf > layout.fasta

You can check the accuracy of the raw assembly with dnadiff (from the MUMmer package) by running:

dnadiff escherichia_coli_K12_MG1655.fasta layout.fasta

and opening the created out.report file.

Consensus step

In order to increase the accuracy of the raw assembly we need to use a consensus/polishing tool with the whole read set. First we need to map all the reads to the raw assembly with the following command:

minimap2 -x map-ont layout.fasta escherichia_coli_map006_r7_3.fastq > m_1.paf

and afterwards run the polisher with:

racon escherichia_coli_map006_r7_3.fastq m_1.paf layout.fasta > consensus_1.fasta

Racon can be run iteratively to further increase the accuracy, so we will do one more iteration with:

minimap2 -x map-ont consensus_1.fasta escherichia_coli_map006_r7_3.fastq > m_2.paf

racon escherichia_coli_map006_r7_3.fastq m_2.paf consensus_1.fasta > consensus_2.fasta

You can check the accuracy now with dnadiff as well (it should be around 99.3%).
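
The two polishing rounds above follow a fixed pattern: map the reads to the current assembly, polish it, then use the polished output as the next target. The sketch below only generates the command lines for a chosen number of rounds and does not run anything. The function name is hypothetical; the file naming (m_N.paf, consensus_N.fasta) copies the convention used above.

```python
from typing import List


def polishing_commands(reads: str, draft: str, iterations: int) -> List[str]:
    """Build the minimap2 + racon command lines for repeated polishing rounds."""
    commands = []
    target = draft
    for i in range(1, iterations + 1):
        paf = f"m_{i}.paf"
        polished = f"consensus_{i}.fasta"
        commands.append(f"minimap2 -x map-ont {target} {reads} > {paf}")
        commands.append(f"racon {reads} {paf} {target} > {polished}")
        target = polished  # the next round polishes the output of this one
    return commands


cmds = polishing_commands("escherichia_coli_map006_r7_3.fastq", "layout.fasta", 2)
for command in cmds:
    print(command)
```

With two iterations this prints the same four commands as listed above.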

Further polishing

To increase the accuracy even further, we can use signal-level data with nanopolish. As the current version only supports Albacore-basecalled reads (the read set we have was basecalled with an older basecaller), we can only try nanopolish on the smaller dataset escherichia_coli_r9_4.fastq. We will follow the tutorial from here and run the following commands to polish the small draft assembly escherichia_coli_r9_4_draft.fasta:

# link reads with their corresponding signals
nanopolish index -d escherichia_coli_r9_4_fast5_files/ escherichia_coli_r9_4.fastq

# map the raw reads to the draft assembly and prepare the alignment file
minimap2 -ax map-ont \
  escherichia_coli_r9_4_draft.fasta \
  escherichia_coli_r9_4.fastq > alignments.sam

samtools view -b alignments.sam > alignments.bam
samtools sort -o alignments.sorted.bam alignments.bam
samtools index alignments.sorted.bam

# polish the draft assembly file and transform the result to FASTA format
nanopolish variants --consensus \
  --reads escherichia_coli_r9_4.fastq \
  --bam alignments.sorted.bam \
  --genome escherichia_coli_r9_4_draft.fasta \
  -o polished.vcf

nanopolish vcf2fasta -g escherichia_coli_r9_4_draft.fasta polished.vcf > polished.fasta

You can check with dnadiff the accuracy and number of differences between Escherichia coli reference vs draft assembly and reference vs nanopolished assembly.

For further steps you can use the already nanopolished assembly escherichia_coli_map006_r7_3_nanopolished.fasta.

Another way to increase the accuracy is to use more accurate short reads from second-generation sequencing, with either pilon or racon. The Dropbox archive already contains the polished assembly obtained by polishing the consensus_2.fasta file with racon and Illumina data. Optionally, you can try this yourself by downloading the short reads from here and joining the files together. Afterwards, run the following commands to polish the assembly:

minimap2 -x sr consensus_2.fasta illumina_reads.fastq > m_3.paf

racon illumina_reads.fastq m_3.paf consensus_2.fasta > consensus_3.fasta

Quality assessment

Quality and completeness assessment of the assembly can be done with BUSCO and ideel. In order to run BUSCO you need to download the database of orthologs for Proteobacteria, obtainable from here. To find the number of single-copy orthologs in an assembly (e.g. after the 2nd iteration of polishing with racon) run

busco -i consensus_2.fasta -m genome -l <path_to_proteobacter_database> -o consensus_2

You can do that for the reference genome, layout, consensus, nanopolished assembly and illumina polished assembly for comparison.

For ideel you will have to install prodigal and diamond (run make install after cloning the repositories with git), download the UniProt TrEMBL database (obtainable here), index it with diamond and create a folder called genomes. Put all assembly subresults obtained before (including the reference) into the genomes folder, but change the extensions to .fa (or change the Snakefile of ideel to accept .fasta files). Run snakemake (if unavailable, install it with sudo apt-get install snakemake) from the ideel folder, which will create the histograms of protein length ratios in the hists folder. An example histogram can be seen below.

Histogram of query/top hit protein lengths for the nanopolished assembly
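
The quantity plotted in such a histogram is the ratio between the length of a predicted protein and the length of its top database hit. A ratio near 1 suggests a full-length prediction, while ratios well below 1 are commonly read as proteins truncated by errors remaining in the assembly. The sketch below computes these ratios for a few made-up (query length, hit length) pairs; the function names, the example values and the 5% tolerance are assumptions for illustration, not part of ideel.

```python
from typing import Iterable, List, Tuple


def length_ratios(hits: Iterable[Tuple[int, int]]) -> List[float]:
    """Return query_length / top_hit_length for every (query, hit) length pair."""
    return [query_len / hit_len for query_len, hit_len in hits if hit_len > 0]


def fraction_full_length(ratios: List[float], tolerance: float = 0.05) -> float:
    """Fraction of proteins whose length is within `tolerance` of their top hit."""
    if not ratios:
        return 0.0
    close = sum(1 for ratio in ratios if abs(ratio - 1.0) <= tolerance)
    return close / len(ratios)


# Made-up (query length, top hit length) pairs in amino acids.
hits = [(300, 300), (148, 310), (420, 415), (95, 280)]
ratios = length_ratios(hits)
print(f"{fraction_full_length(ratios):.2f}")  # 0.50
```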

 

Robert Vaser Mon, 09/10/2018 - 18:32

Algorithms for nucleic sequence alignment - Andrey Prjibelski

Practice slides are here.

Slides with alignment algorithms are here.

Andrey Prjibelski Mon, 09/10/2018 - 18:59

Variants calling based on long reads - Fritz Sedlazeck

Presentation slides can be found here.

fsedlazeck Mon, 09/10/2018 - 19:00

Graph data structures for personalized genomics - Andre Kahles

Andre Kahles Mon, 09/10/2018 - 19:01

Detection of RNA modifications & dealing with raw Nanopore data (fast5) - Eva Maria Novoa Pardo

Eva Maria Novoa Mon, 09/10/2018 - 19:02

Long reads and Mendelian diseases - Alba Sanchis-Juan

The link to the workshop can be found here.

Alba Sanchis Juan Mon, 09/10/2018 - 19:09

Hackathons

Alina Frolova Mon, 09/10/2018 - 18:58

Direct RNA-seq

Leszek Pryszcz Mon, 09/10/2018 - 22:22

Human mitochondrial DNA sequencing in forensic applications

mgos Mon, 09/10/2018 - 22:22

De-novo fungal genomes

Broňa Brejová Mon, 09/10/2018 - 22:23

Bioinformatics analysis for fungal genome hackathon

Plan of bioinformatics analyses

Here is a brief outline of the planned stages of bioinformatics analyses done during the hackathon together with suggested tools. 

Done before MinION run

  1. Illumina read quality assessment, trimming (FastQC, Trimmomatic)
  2. Assembly of Illumina reads (SPAdes)
  3. Assembly of transcripts from RNA-seq data (Trinity)
  4. Preliminary annotation and gene finder training (Maker, Augustus)

Diagnostics done on the first data collected during the MinION run

  1. Steps 1-3 below

Done after MinION run

  1. Basecalling MinION raw reads (Albacore)
  2. Aligning basecalled reads to the SPAdes assembly, selecting those that have at least a 200 bp alignment (last)
  3. Collecting read statistics for all reads and for the aligned reads
  4. Assembly of MinION reads (Canu)
  5. Polishing the assembly using Illumina data (BWA mem, pilon)
  6. Manual finishing of the assembly, alignments of reads before and after (BWA mem, last)
  7. Protein-coding gene annotation (Augustus)
  8. Annotation of other features, such as repeats, RNA genes etc. (tRNAscan-SE, Rfam, RepeatModeler, ...)
  9. Additional analyses, such as annotation of gene function, phylogeny
  10. Manual annotation improvements
  11. Submission of data to ENA
Broňa Brejová Thu, 09/13/2018 - 22:24

Human microbiome analysis based on 16S sequencing

mmankowska Mon, 09/10/2018 - 22:23

Nanopore for human structural variation

tgambin Mon, 09/10/2018 - 22:24