#NGSchool2017 Workshops

Materials in this book are reproduced as internal material for participants of the Summer School in Bioinformatics & NGS Data Analysis (#NGSchool2017). If you want to use any of the materials included here for other purposes, please ask the individual contributors for permission.

# login to local server
ssh USERNAME@192.168.1.111

# syncing the data
rsync -avz --exclude="*.git/" USERNAME@192.168.1.111:/ngschool/2017 ~/ngschool
 # and type your password 

 


Preparation for the #NGSchool2017

 

The course is open to people without any background in computational biology, but everyone should be familiar with the basics of working in the command line (Linux), programming and statistics. Therefore, please complete the mandatory courses before attending the #NGSchool. In addition, if you are interested in other aspects, you are welcome to continue with some of the supplementary courses.

In case of any problems, feel free to post in the #NGSchool2017 group.

Mandatory on-line courses

Supplementary courses

Prerequisites

Can I work remotely (SSH) or use VirtualBox? 

We will set up a server locally, so in principle remote working will be possible, but installing Ubuntu on your laptop is strongly recommended. You will be able to connect to your remote machine (i.e. at work), just remember we'll be in a rural region of the High Tatras: the Internet connection there is stable, but rather slow, especially if ~40 people want to use it at the same time.

Alternatively, you can install Ubuntu in VirtualBox or install it as a Windows program using Wubi. Note: these alternatives come with certain limitations, so a standard installation is recommended.

Resources


Day 0

Arrivals, opening, introduction to Linux/NGS & short talks #1.


Day 1

Short talks #2


Molecular data integration

0. Preparation

This WORKSHOT will involve integrating three publicly available prostate cancer expression datasets (GEO IDs: GSE32269 and GSE8218; ArrayExpress ID: E-TABM-26) generated with the Affymetrix U133A array platform. Affymetrix raw data are stored in CEL files (one CEL file per sample).

NOTE: the provided data were limited to a subset of samples (representing normal, tumour and metastatic phenotypes) to reduce the analysis computation time.

 

The overall pipeline consists of six main steps:

1. Quality control

2. Data combination and normalisation

3. Study effects assessment

4. Data adjustment for known study effects

5. Differential expression analysis

6. Results summary and visualisation

 

Data integration pipeline

 

Target files

To perform the analysis you will need to use a so-called target file for each dataset. The target file is tab-delimited and contains three columns in the following order: (1) Name, (2) FileName, (3) Target:

Name	      FileName	         Target
GSE32269_1    [file name].CEL    Tumour
GSE32269_2    [file name].CEL    Tumour
GSE32269_3    [file name].CEL    Normal

...
GSE32269_18   [file name].CEL    Normal

 

Name = sample name

FileName = name of the CEL file (including the path to the directory)

Target = sample phenotype

 

You will also need to use one target file for the combination of all datasets, which contains four columns: (1) Name, (2) FileName, (3) Target, (4) Dataset:

Name	      FileName	                                  Target    Dataset
GSE32269_1    [directory with CEL files]/[file name].CEL  Tumour    GSE32269
GSE32269_2    [directory with CEL files]/[file name].CEL  Tumour    GSE32269
...

E-TABM-26_1   [directory with CEL files]/[file name].CEL  Tumour    E-TABM-26
E-TABM-26_2   [directory with CEL files]/[file name].CEL  Tumour    E-TABM-26

...

GSE8218_1     [directory with CEL files]/[file name].CEL  Normal    GSE8218
GSE8218_2     [directory with CEL files]/[file name].CEL  Normal    GSE8218

Dataset = name of the dataset from which the corresponding sample is derived
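
Before running anything, it is worth sanity-checking the target files in R. A minimal sketch (assuming the combined file is saved as 'target_all.txt' in the data folder and the paths in FileName are relative to the current directory):

##### Quick sanity check of the combined target file
targets_all <- read.table("data/target_all.txt", header = TRUE, sep = "\t", as.is = TRUE)

##### Expect four columns: Name, FileName, Target, Dataset
stopifnot(identical(colnames(targets_all), c("Name", "FileName", "Target", "Dataset")))

##### Check that all referenced CEL files exist
missing <- targets_all$FileName[!file.exists(targets_all$FileName)]
if (length(missing) > 0) warning("Missing CEL files: ", paste(missing, collapse = ", "))

##### Sample counts per phenotype and dataset
table(targets_all$Target, targets_all$Dataset)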

 

=========================================================================================================

 


1. Quality control

First, set the working directory to the folder where you will perform all the analyses

setwd("[path to the analysis folder]")

 

Load functions

The following functions will be used to distinguish samples from different datasets in clustering and principal component analysis (PCA) plots

##### Assign colours to analysed datasets
getDatasetsColours <- function(datasets) {
    
    ##### Predefined selection of colours for datasets
    datasets.colours <- c("bisque","orange","firebrick","lightslategrey","darkseagreen","darkcyan","dodgerblue")
    
    f.datasets <- factor(datasets)
    vec.datasets <- datasets.colours[1:length(levels(f.datasets))]
    datasets.colour <- rep(0,length(f.datasets))
    for(i in 1:length(f.datasets))
    datasets.colour[i] <- vec.datasets[ f.datasets[i] == levels(f.datasets)]
    
    return( list(vec.datasets, datasets.colour) )
}


##### Assign colours to analysed groups
getTargetsColours <- function(targets) {
    
    ##### Predefined selection of colours for groups
    targets.colours <- c("red","blue","green","darkgoldenrod","darkred","deepskyblue")
    
    f.targets <- factor(targets)
    vec.targets <- targets.colours[1:length(levels(f.targets))]
    targets.colour <- rep(0,length(f.targets))
    for(i in 1:length(f.targets))
    targets.colour[i] <- vec.targets[ f.targets[i] == levels(f.targets)]
    
    return( list(vec.targets, targets.colour) )
}

 

Load libraries

Load R libraries required for the analysis.

library(affy)
library(sva)
library(arrayQualityMetrics)
library(gcrma)
library(limma)
library(hgu133a.db)
library(gplots)

You will also need a piece of code to perform coloured clustering ('a2R_code.R')

source("../scripts/a2R_code.R")

 

Quality control

Load data and perform QC (with arrayQualityMetrics package) for each dataset separately

Start with GSE32269 dataset (18 samples in total, 9 from primary tumour and 9 from metastatic tissues)

##### Go to the folder with data for GSE32269 dataset
setwd("data/GSE32269")


##### Read annotation file
pd <- read.AnnotatedDataFrame(filename = "target.txt" ,header = TRUE, row.name = "Name", sep = "\t")


##### Read CEL files into an Affybatch
dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)

##### Data quality control using arrayQualityMetrics package
##### NOTE: Computationally intensive step, may take a few minutes to complete
arrayQuality <- arrayQualityMetrics(expressionset = dat, outdir = "QC", reporttitle = "arrayQualityMetrics report for GSE32269", force = TRUE, do.logtransform = TRUE, intgroup = "Target")

 

The QC report is stored in the specified directory [directory for QC report] and can be viewed in a web browser by opening the 'index.html' file (see an example index.html). Various QC metrics are reported, and suggested outliers, if detected, are indicated.

The QC report helps identify poor-quality samples (in the case of the GSE32269 dataset there are three samples of evidently poor quality: GSE32269_10, GSE32269_12 and GSE32269_15)

##### Record detected samples to be excluded from downstream analysis
samples2exclude <- c("GSE32269_10","GSE32269_12","GSE32269_15")

 

Do the same for the other two datasets:

E-TABM-26 with 14 samples in total, 7 from normal and 7 from tumour tissues

GSE8218 with 20 samples in total, 10 from normal and 10 from tumour tissues

##### Go to the folder with data for E-TABM-26 dataset
setwd("../../data/E-TABM-26")


##### Read annotation file
pd <- read.AnnotatedDataFrame(filename = "target.txt", header = TRUE, row.name = "Name", sep = "\t")


##### Read CEL files into an Affybatch
dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)

##### Data quality control using arrayQualityMetrics package
##### NOTE: Computationally intensive step, may take a few minutes to complete
arrayQuality <- arrayQualityMetrics(expressionset = dat, outdir = "QC", reporttitle = "arrayQualityMetrics report for E-TABM-26", force = TRUE, do.logtransform = TRUE, intgroup = "Target")


##### Go to the folder with data for GSE8218 dataset
setwd("../../data/GSE8218")


##### Read annotation file
pd <- read.AnnotatedDataFrame(filename = "target.txt", header = TRUE, row.name = "Name", sep = "\t")


##### Read CEL files into an Affybatch
dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)

##### Data quality control using arrayQualityMetrics package
##### NOTE: Computationally intensive step, may take a few minutes to complete
arrayQuality <- arrayQualityMetrics(expressionset = dat, outdir = "QC", reporttitle = "arrayQualityMetrics report for GSE8218", force = TRUE, do.logtransform = TRUE, intgroup = "Target")

 

Based on the QC reports, three samples from the E-TABM-26 dataset and one sample from the GSE8218 dataset seem to be of insufficient quality: E-TABM-26_10, E-TABM-26_11, E-TABM-26_12 and GSE8218_12

##### Record detected samples to be excluded from downstream analysis
samples2exclude <- c(samples2exclude,"E-TABM-26_10","E-TABM-26_11","E-TABM-26_12","GSE8218_12")

 

=========================================================================================================


2. Data combination and normalisation

Once poor-quality arrays have been identified, load all arrays across studies (ignoring the poor-quality samples) and normalise them collectively to maximise data consistency.

First, assign colours to datasets and groups using the aforementioned functions

##### Change working directory to the folder with data
setwd("..")

##### Read file with information for all datasets
datasetsInfo <- read.table(file = "target_all.txt", sep = "\t", as.is = TRUE, header = TRUE, row.names = 1)

##### Ignore samples identified as outliers
datasetsInfo <- datasetsInfo[setdiff(rownames(datasetsInfo), as.vector(samples2exclude)),]

##### Assign different colours for samples from individual datasets
datasets <- datasetsInfo$Dataset

datasets.colour <- getDatasetsColours(datasets)


##### Assign different colours for samples representing individual groups
targets <- datasetsInfo$Target

targets.colour <- getTargetsColours(targets)


##### Record number of datasets
datasets_No <- max(as.numeric(factor(datasets)))

 

Load data

Load the data (ignoring poor quality samples) across all datasets (using the target file for the combination of all datasets)

##### Read annotation files for all datasets
pd <- read.AnnotatedDataFrame(filename = "target_all.txt", header = TRUE, row.name = "Name", sep = "\t")

##### Ignore samples identified as outliers
pd <- pd[setdiff(sampleNames(pd), as.vector(samples2exclude))]


##### Read CEL files into an Affybatch
dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)


##### Create folder for results
system("mkdir ../results")

##### Change working directory to results folder
setwd("../results")

 

Normalisation

Perform normalisation collectively on all data using the GC-RMA method

##### Perform normalisation with GC-RMA method
##### NOTE: Computationally intensive step, it may take a few minutes to complete
gcrma = gcrma(dat)

 

Generate plots demonstrating data distribution before and after normalisation

##### Generate signal intensity plots before and after GC-RMA normalisation
pdf("Norm_density.pdf")

#####  Before normalisation
hist(dat, col = datasets.colour[[2]], lty = 1)

#####  After normalisation
hist(gcrma, col = datasets.colour[[2]], lty = 1)
dev.off()


##### Generate box-and-whisker plot
pdf("Norm_boxplot.pdf", pointsize = 8, width = 0.2*length(pd$FileName), height = 6)
par(mar=c(13, 4, 3, 2))

#####  Before normalisation
boxplot(dat, col = datasets.colour[[2]], las = 2)

#####  After normalisation
boxplot(exprs(gcrma), col = datasets.colour[[2]], main = "GC-RMA normalised", las = 2)
dev.off()

 

You should get the following plots (in the results folder):

file: 'Norm_boxplot.pdf'

Signal intensity plot before normalisation

... and after normalisation

Signal intensity plot after normalisation

file: 'Norm_density.pdf':

Histogram before normalisation (different data distributions for individual studies are evident)


... and after normalisation

Histogram after normalisation

Write normalised expression data into a file ('Norm_data.exp')

write.exprs(gcrma, file = "Norm_data.exp", sep = "\t")

 

=========================================================================================================


3. Study effects assessment

Clustering analysis

Start with unsupervised clustering analysis to evaluate expression profile similarities among samples in the context of dataset and phenotypic annotation

#####  Extract matrix of expression values
data <- exprs(gcrma)


#####  Compute distance matrix and hierarchical clustering
d.usa <- dist(t(data), method = "euc")
h.usa <- hclust(d.usa, method = "ward.D2")


#####  Generate coloured dendrogram (indicate datasets)
pdf("Study_effects_cluster_datasets.pdf", width = 0.3*ncol(data), height = 6)
h.usa$labels = colnames(data)
par(mar = c(2,2,2,6))
A2Rplot(h.usa, fact.sup = datasets, box = FALSE, show.labels = TRUE, col.up = "black", main = "", k = length(levels(factor(datasets))) ) # k = changes the detail/number of subgroups shown
dev.off()


#####  Generate coloured dendrogram (indicate groups)
pdf("Study_effects_cluster_targets.pdf", width = 0.3*ncol(data), height = 6)
h.usa$labels = colnames(data)
par(mar = c(2,2,2,6))
A2Rplot(h.usa, fact.sup = targets, box = FALSE, show.labels=TRUE, col.up = "black", main="", k = length(levels(factor(targets))) ) # k = changes the detail/number of subgroups shown.
dev.off()


#####  Generate dendrogram
pdf("Study_effects_dendrogram.pdf", width = 0.2*ncol(data)+2, height = 6, pointsize = 12)
par(mar = c(2,5,2,0))
plot(h.usa, xlab = "", labels = paste(colnames(data), targets, sep = "       "), hang = -1, main="")
dev.off()

 

You should get the following plots:

file 'Study_effects_cluster_datasets.pdf':

Dendrogram with samples coloured based on their dataset annotation (dataset-driven clustering is evident)


 

 

file 'Study_effects_cluster_targets.pdf':

Dendrogram with samples coloured based on their phenotypic annotation (samples representing various biological groups are mixed)


 

 

file 'Study_effects_dendrogram.pdf':

... and dendrogram with sample names


 

 

Principal components analysis

Perform principal components analysis (PCA) to identify key components of variability in combined expression data

#####  Keep only probes with standard deviation > 0.1 across all samples
rsd <- apply(data, 1, sd)
data <- data[rsd > 0.1,]


#####  Perform PCA
data_pca <- prcomp(t(data), scale=TRUE)


#####  Get variance importance for all principal components
importance_pca <- summary(data_pca)$importance[2,]
importance_pca <- paste(round(100*importance_pca, 2), "%", sep="")


#####  Set point symbols so that each represents single dataset on the PCA plots
pchs <- rep(list(c(16,1),c(15,0),c(17,2),c(16,10),c(15,7),c(16,13),c(16,1)),10)


#####  Generate PCA plot
pdf("Study_effects_PCA.pdf")
plot(data_pca$x[,1], data_pca$x[,2], type = "n", xlab = paste("PC1 (",importance_pca[1],")", sep = ""), ylab = paste("PC2 (",importance_pca[2],")", sep = ""), ylim =c ( min(data_pca$x[,2]),max(data_pca$x[,2]) + (abs(min(data_pca$x[,2])) + abs(max(data_pca$x[,2])))/4 ))

#####  Use different shape for each dataset
for (i in 1:datasets_No) {
    points(data_pca$x[,1][as.numeric(factor(datasets)) == i], data_pca$x[,2][as.numeric(factor(datasets)) == i], pch = pchs[[i]][1], col = targets.colour[[2]][as.numeric(factor(datasets)) == i])
    points(data_pca$x[,1][as.numeric(factor(datasets)) == i], data_pca$x[,2][as.numeric(factor(datasets)) == i], pch = pchs[[i]][2], col = "black")
}

#####  Adding the legend
legend("topleft", legend = levels(factor(targets)), pch = 16, col = targets.colour[[1]], box.col = "transparent")
legend("topleft", legend = c(rep("", length(targets.colour[[1]]))), pch = 1, box.col = "transparent")
dev.off()

 

This should generate the following PCA plot ('Study_effects_PCA.pdf')

PCA plot

In this plot, the first principal component is clearly explained by study-specific data variation.
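
One rough way to quantify this (not part of the original pipeline) is to test how strongly PC1 is associated with the dataset labels:

#####  Rough check: how much of PC1 is explained by dataset membership?
summary(aov(data_pca$x[,1] ~ factor(datasets)))

A very small p-value here confirms that the leading component of variation is study-driven.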

 

=========================================================================================================


4. Data adjustment for study effects

ComBat

Adjust data for batch effects using ComBat

#####  Specify known batches
batch <- as.vector(datasets)

#####  Create model matrix for outcome of interest and other covariates beside batch
f.model <- factor(targets, levels = levels(factor(targets)))

mod <- model.matrix(~f.model)


#####  Adjust data for batch effects using ComBat
data_combat <- ComBat(dat = data, batch = batch, mod = mod)

 

Produce a box plot of the combined datasets

pdf("Study_effects_boxplot.pdf", pointsize = 8, width = 0.2*ncol(data), height = 6)
par(mar = c(13, 4, 3, 2))

#####  Before adjusting for study effects
boxplot(data.frame(data), col = datasets.colour[[2]], las = 2)

#####  After adjusting for study effects
boxplot(data.frame(data_combat), col = datasets.colour[[2]], main = "Batch effect adjusted", las = 2)
dev.off()

 

The box plots demonstrate that, globally, the expression profiles have not changed after applying ComBat ('Study_effects_boxplot.pdf')

Box plot before...

 

 

... and after adjusting the data for study effects

 

... but how about unsupervised clustering and principal components analyses?

#####  Compute distance matrix and hierarchical clustering
d.usa <- dist(t(data_combat), method = "euc")
h.usa <- hclust(d.usa, method = "ward.D2")


##### Generate coloured dendrogram (indicate datasets)
pdf("Study_effects_cluster_datasets_ComBat.pdf", width = 0.3*ncol(data), height = 6)
h.usa$labels = colnames(data_combat)
par(mar = c(2,2,2,6))
A2Rplot(h.usa, fact.sup = datasets, box = FALSE, show.labels = TRUE, col.up = "black", main="", k=length(levels(factor(datasets))) ) # k = changes the detail/number of subgroups shown.
dev.off()


##### Generate coloured dendrogram (indicate groups)
pdf("Study_effects_cluster_targets_ComBat.pdf", width = 0.3*ncol(data), height = 6)
h.usa$labels = colnames(data_combat)
par(mar = c(2,2,2,6))
A2Rplot(h.usa, fact.sup = targets, box = FALSE, show.labels = TRUE, col.up = "black", main=" ", k = length(levels(factor(targets))) ) # k = changes the detail/number of subgroups shown.
dev.off()


##### Generate dendrogram
pdf("Study_effects_dendrogram_ComBat.pdf", width = 0.2*ncol(data)+2, height = 6, pointsize = 12)
par(mar = c(2,5,2,0))
plot(h.usa, xlab = "", labels = paste(colnames(data_combat), targets, sep="       "), hang = -1, main = "")
dev.off()

 

These produce a dendrogram with samples coloured based on their dataset annotation (samples from different datasets are now mixed)

file 'Study_effects_cluster_datasets_ComBat.pdf':


 

 

dendrogram with samples coloured based on their phenotypic annotation (phenotype-driven clustering is evident)

file 'Study_effects_cluster_targets_ComBat.pdf':


 

 

... and dendrogram with sample names

file 'Study_effects_dendrogram_ComBat.pdf':


 

 

The clustering results show that after applying ComBat the batch effects were reduced and biological variability is now maintained across datasets.

 

Generate PCA plot for the study effects-adjusted data

#####  Keep only probes with standard deviation > 0.1 across all samples
rsd <- apply(data_combat, 1, sd)
data_combat <- data_combat[rsd > 0.1,]


#####  Perform PCA
data_combat_pca <- prcomp(t(data_combat), scale=TRUE)


#####  Get variance importance for all principal components
importance_pca <- summary(data_combat_pca)$importance[2,]
importance_pca <- paste(round(100*importance_pca, 2), "%", sep="")


#####  Generate PCA plot
pdf("Study_effects_PCA_ComBat.pdf")
plot(data_combat_pca$x[,1], data_combat_pca$x[,2], type = "n", xlab = paste("PC1 (",importance_pca[1],")", sep = ""), ylab = paste("PC2 (",importance_pca[2],")", sep = ""), ylim = c( min(data_combat_pca$x[,2]), max(data_combat_pca$x[,2]) + (abs(min(data_combat_pca$x[,2]))+ abs(max(data_combat_pca$x[,2])))/4 ))

#####  Use different shape for each dataset
for (i in 1:datasets_No) {
    points(data_combat_pca$x[,1][as.numeric(factor(datasets)) == i], data_combat_pca$x[,2][as.numeric(factor(datasets)) ==i ], pch = pchs[[i]][1], col = targets.colour[[2]][as.numeric(factor(datasets)) == i])
    points(data_combat_pca$x[,1][as.numeric(factor(datasets)) == i], data_combat_pca$x[,2][as.numeric(factor(datasets)) == i], pch = pchs[[i]][2], col = "black")
}

#####  Adding the legend
legend("topleft", legend = levels(factor(targets)), pch = 16, col = targets.colour[[1]], box.col = "transparent")
legend("topleft", legend = c(rep("", length(targets.colour[[1]]))), pch = 1, box.col = "transparent")
dev.off()

 

The PCA plot ('Study_effects_PCA_ComBat.pdf')


 

 

...demonstrates that the biological effects observed after batch correction are stronger than the study-specific effects.

 

Such pre-processed data is now ready for differential expression analysis.

 

=========================================================================================================

 


5. Differential expression analysis

Non-specific filtering

First, reduce the dimensionality of the data by performing non-specific filtering: eliminate genes with low expression variability across samples, as they are unlikely to provide information about the phenotype of interest

#####  Use 60% of all genes with the highest expression variance across samples in the non-specific filtering step
filterThreshold <- round(0.6*nrow(data_combat), digits = 0)


rsd <- apply(data_combat, 1, sd)
sel <- order(rsd, decreasing=TRUE)[1:filterThreshold]
data_combat <- data_combat[sel,]

 

Model matrix

Next, define the study design and fit the linear model

#####  Create a model matrix representing the study design, with rows corresponding to arrays and columns to coefficients to be estimated
f.model <- factor(targets, levels = levels(factor(targets)))

mod <- model.matrix(~0 + f.model)

colnames(mod) <- levels(factor(targets))


#####  Fit linear model for each gene given a series of arrays
fit <- lmFit(data_combat, design = mod)


#####  Create matrix of possible comparisons
comb <- combn(levels(factor(targets)), 2)


#####   Record the number of groups and comparisons
targetsNo <- length(levels(factor(targets)))
combNo <- choose(targetsNo, 2) # number of pairwise comparisons: n!/((n-r)!(r!))

contrasts <- NULL
contrastNames <- NULL


#####  Create string with possible contrasts
for (i in 1:combNo) {
    
    contrasts <- c(contrasts, paste(paste(comb[1,i], comb[2,i], sep="vs"),paste(comb[1,i], comb[2,i], sep="-"), sep="="))
    contrastNames[i] <- paste(comb[1,i], comb[2,i], sep=" vs ")
}

contrasts <- paste(contrasts, collapse=", ")

#####  Create contrasts of interest
func = "makeContrasts"
arguments = paste(contrasts, "levels=mod",sep = ", ")

contrast.matrix <- eval(parse(text = paste(func, "(", arguments, ")", sep = "")))

 

Empirical Bayes statistics

Now fit the individual contrasts to the linear model and apply empirical Bayes statistics

#####  Fit contrasts to linear model
fitContrasts <- contrasts.fit(fit, contrast.matrix)

#####  Apply empirical Bayes statistics
eb <- eBayes(fitContrasts)

 

=========================================================================================================


6. Results summary and visualisation

Set the preferred p-value (adjusted for multiple-hypothesis testing) and log2 fold-change thresholds for calling differentially expressed genes

pThreshold = 0.0001
lfcThreshold = 2

 

Results summary

Annotate probesets and write the results into files (one per comparison)

#####  ... do it for each comparison
for (i in 1:ncol(eb$p.value) ) {
    
    topGenes <- topTable(eb, coef = colnames(eb)[i], adjust = "BH",  sort.by = "P", number = nrow(data_combat), genelist = rownames(data_combat))
    
    ##### Retrieve probesets annotation information
    probeid <- topGenes$ID
    
    SYMBOL <- as.character(unlist(lapply(mget(probeid,env=hgu133aSYMBOL), function (symbol) { return(paste(symbol,collapse="; ")) } )))
    NAME <- as.character(unlist(lapply(mget(probeid,env=hgu133aGENENAME), function (name) { return(paste(name,collapse="; ")) } )))
    CHR <- as.character(unlist(lapply(mget(probeid,env=hgu133aCHR), function (Chr) { return(paste(Chr,collapse="; ")) } )))
    MAP <- as.character(unlist(lapply(mget(probeid,env=hgu133aMAP), function (MAP) { return(paste(MAP,collapse="; ")) } )))
    
    ##### Merge probes annotation with statistics
    Annot <- data.frame(probeid, CHR, MAP, SYMBOL, NAME, row.names = NULL)
    Annot <- merge(Annot, topGenes, by.x = "probeid", by.y = "ID" )
    
    ##### Write annotated results into a file
    write.csv(Annot, file=paste("Integ_", colnames(eb$p.value)[i], "_topTable.csv", sep=""), row.names=FALSE)
}

 

This will produce one file (ending with 'topTable.csv') per comparison. In this case we perform only one comparison, so a single file is generated: 'Integ_NormalvsTumour_topTable.csv'.

 

Do the same again, but this time consider only differentially expressed genes, according to the pre-defined threshold values (one file ending with 'DE.csv' is created per comparison)

#####  Record differentially expressed genes
topGenes.index <- NULL

for (i in 1:ncol(eb$p.value) ) {
    
    topGenes <- topTable(eb, coef = colnames(eb)[i], adjust = "BH",  sort.by = "P", p.value = pThreshold, lfc = lfcThreshold, number = nrow(data_combat), genelist = rownames(data_combat))
    
    ##### Retrieve probesets annotation information
    probeid <- topGenes$ID
    
    SYMBOL <- as.character(unlist(lapply(mget(probeid,env=hgu133aSYMBOL), function (symbol) { return(paste(symbol,collapse="; ")) } )))
    NAME <- as.character(unlist(lapply(mget(probeid,env=hgu133aGENENAME), function (name) { return(paste(name,collapse="; ")) } )))
    CHR <- as.character(unlist(lapply(mget(probeid,env=hgu133aCHR), function (Chr) { return(paste(Chr,collapse="; ")) } )))
    MAP <- as.character(unlist(lapply(mget(probeid,env=hgu133aMAP), function (MAP) { return(paste(MAP,collapse="; ")) } )))
    
    ##### Merge probes annotation with statistics
    Annot <- data.frame(probeid, CHR, MAP, SYMBOL, NAME, row.names = NULL)
    Annot <- merge(Annot, topGenes, by.x = "probeid", by.y = "ID" )
    
    ##### Write annotated results into a file
    write.csv(Annot, file=paste("Integ_", colnames(eb$p.value)[i], "_DE.csv", sep=""), row.names=FALSE)
    
    #####  Record differentially expressed genes
    topGenes.index <- c(topGenes.index,rownames(topGenes))
}

 

P-value distribution

Generate a histogram of p-values to get a general sense of how the test behaved across all genes. This should help to diagnose potential problems associated with multiple hypothesis testing.

for (i in 1:ncol(eb$p.value) ) {
    
    pdf(file = paste("Integ_", colnames(eb$p.value)[i], "_P_hist.pdf", sep = ""))
    histogram <- hist(eb$p.value[,i], breaks = seq(0,1, by = 0.01), main = contrastNames[i], xlab = "p-value")
    abline(v = pThreshold, col = "red")
    dev.off()
}

 

This will produce one file (ending with 'P_hist.pdf') per comparison.

file 'Integ_NormalvsTumour_P_hist.pdf':

The histogram shows a large proportion of differentially expressed genes (p-value < 0.0001, indicated by the vertical red line). This suggests that applying a multiple-hypothesis-testing adjustment procedure is appropriate for these data.
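
To complement the histogram, limma's decideTests() can quickly count the genes passing both thresholds in each contrast (a quick check, not part of the original script):

#####  Count genes passing both thresholds in each contrast
summary(decideTests(eb, adjust.method = "BH", p.value = pThreshold, lfc = lfcThreshold))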

 

Volcano plot

Produce a volcano plot for each contrast (files ending with 'volcano_plot.pdf')

#####  Generate volcano plots of log2 fold-changes versus significance (adjusted p-values, label top 10 genes)
for (i in 1:ncol(eb$p.value) ) {
    
    topGenes <- topTable(eb, coef = colnames(eb)[i], adjust = "BH", sort.by = "none", number = nrow(data_combat))
    
    pdf(file = paste("Integ_", colnames(eb$p.value)[i], "_volcano_plot.pdf", sep = ""))
    plot(topGenes[,"logFC"], -log2(topGenes[,"adj.P.Val"]), pch = 16, cex = 0.5, xlab = "Log2 fold-change", ylab = "-log2(adjusted p-value)", main = contrastNames[i], col = "grey")
    
    #####  Highlight genes with logFC above specified threshold
    points(topGenes[abs(topGenes[, "logFC"]) > lfcThreshold, "logFC"], -log2(topGenes[abs(topGenes[, "logFC"]) > lfcThreshold, "adj.P.Val"]), cex = 0.5, pch = 16)
    abline(h = -log2(pThreshold), col = "red", lty = 2)
    
    #####  Label top 10 most significant genes
    ord <- order(-log2(topGenes[, "adj.P.Val"]), decreasing = TRUE)
    top10 <- ord[1:10]
    text(topGenes[top10, "logFC"], -log2(topGenes[top10, "adj.P.Val"]), labels = rownames(data_combat[top10,]), cex = 0.6, col = "blue")
    dev.off()
}

 

The volcano plot shows the number of genes with log2 fold-change > 2 or < -2 (black points) and adjusted p-value < 0.0001 (above the dashed red line)

file 'Integ_NormalvsTumour_volcano_plot.pdf':


 

 

Clustering analysis and heatmap

Finally, perform clustering analysis and produce heatmaps (two, with exactly the same clustering results but different colour scales, for your preference)

#####  Select top differentially expressed genes
topGenes <- data_combat[unique(topGenes.index),]


#####  Compute distance matrix and hierarchical clustering
hr <- hclust(as.dist(1-cor(t(topGenes), method = "pearson")), method = "ward.D2")
hc <- hclust(as.dist(dist(t(topGenes), method = "euclidean")), method = "ward.D2")


#####  Generate dendrogram
pdf("Integ_DE_dendrogram.pdf", width = 0.2*ncol(data)+2, height = 6, pointsize = 12)
par(mar = c(2,5,2,0))
plot(hc, xlab="", labels=paste(colnames(data_combat), targets, sep="       "), hang = -1, main="")
dev.off()


#####  ...heatmap (blue-red colour scale)
pdf("Integ_DE_heatmap_blue_red.pdf", width = 6, height = 10, pointsize = 12)
heatmap.2(as.matrix(topGenes), Rowv = as.dendrogram(hr), Colv=as.dendrogram(hc), col = colorRampPalette(c("blue","white","red"))(100), scale = "row", ColSideColors = targets.colour[[2]], margins = c(2, 6), labRow = "", labCol = "", trace = "none", key = TRUE)

#####  Add the legend
legend("topright", legend = levels(factor(targets)), fill = targets.colour[[1]], box.col = "transparent", cex=1.2)
dev.off()


#####  ...heatmap (green-red colour scale)
pdf("Integ_DE_heatmap_green_red.pdf", width = 6, height = 10, pointsize = 12)
heatmap.2(as.matrix(topGenes), Rowv = as.dendrogram(hr), Colv=as.dendrogram(hc), col = greenred(75), scale = "row", ColSideColors = targets.colour[[2]], margins = c(2, 6), labRow = "", labCol = "", trace = "none", key = TRUE)

#####  Add the legend
legend("topright", legend = levels(factor(targets)), fill = targets.colour[[1]], box.col = "transparent", cex=1.2)
dev.off()

 

You should get the following clustering results:

file 'Integ_DE_dendrogram.pdf':

dendrogram with sample names


 

 

... and two heatmaps, one in a blue-red colour scale and the other in a green-red colour scale, with blue and green indicating down-regulation in the respective heatmaps, and red indicating up-regulation of the corresponding genes (probesets).

files 'Integ_DE_heatmap_blue_red.pdf' and 'Integ_DE_heatmap_green_red.pdf':

 


 

 

=========================================================================================================


Day 2

De novo genome and transcriptome assembly

This class includes:

  • Next Generation Sequencing basics

  • NGS data QC and filtering

  • Bacterial de novo genome assembly from conventional and single-cell data

  • Genome assembly QC

  • RNA-Seq QC and filtering

  • De novo transcriptome assembly from RNA-Seq data

  • Transcriptome assembly QC

Google Drive folder with slides and handout materials: https://docs.google.com/presentation/d/1SdrSOjERb1FLFPefvPNKldVxiosMyOy…

 


4 - RNA-Seq, transcriptome assembly and its evaluation

slides

Day 3

Detection of structural variations

slides

CNV detection from WES data - hands on

Outline

  • Preliminary Task 0 - prepare working environment
  • Task 1 - calculate and analyze coverage for toy WES data
  • Task 2 - run CODEX pipeline on 99 WES samples from 1000 Genomes Project
  • Task 3 - Extract CNVs from 1000 Genomes SVs dataset
  • Task 4 - Compare CODEX data and 1000 Genomes SVs dataset
  • Task 5 - CNV annotations
  • Task 6 - Detecting homozygous and hemizygous deletions

Preliminary Task 0 - prepare working environment

a) Create a working directory, e.g. "mkdir ~/ngschool/2017/CNV_detection/", and a data directory, e.g. "mkdir ~/ngschool/2017/CNV_detection/data/". Later we will refer to this directory as "workDir".

b) Copy the data from the pendrive to workDir/data/, then cd into workDir/data/

c) Uncompress all files:

find . -name '*.tar.gz' -exec tar -xzvf {} \;

d) Examine the data folder.

e) Run R, load the libraries below and make sure all of them load without errors

library(data.table)
library(parallel)
library(RCurl)
library(gdata)
library(Hmisc)
library(matrixStats)
library(DNAcopy)
library(GenomicRanges)
library(Rsubread) 
library(WES.1KG.WUGSC)
library(CODEX)

# set working directory to workDir
workDir <- "~/ngschool/2017/CNV_detection/"
setwd(workDir)

#set number of available cores
cores <- 4

Task 1 - calculate and analyze coverage for toy WES data

a) Load the toy WES BAMs and calculate coverage using the getcoverage() function from the CODEX package.

The toy data contain 46 BAMs with reads from a small fragment of chr22.

The corresponding BED file contains definitions of 100 exome targets (see the content of bedFile, e.g. fread(bedFile))

dirPath <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- as.matrix(read.table(file.path(dirPath, "sampname")))
bedFile <- file.path(dirPath, "chr22_400_to_500.bed")
chr <- 22
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile,  sampname = sampname, projectname = "CODEX_demo", chr)
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr
coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength

The object Y contains the number of reads overlapping each exon target, with samples in columns and targets in rows.

b) For each sample calculate average read depth per target

Y_ac <- apply(Y, 2, function(x){(100*x)/width(ref)}) # depth = 100 bp read length * count / target width
colnames(Y_ac) <- sampname

c) Examine the distribution of read depth. Note that it is bimodal. What does this mean?

plot(density(Y_ac))

 

Distribution of read depth

d) Calculate median coverage per sample

summary(apply(Y_ac, 2, median))
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  21.06   24.04   25.29   27.37   27.08   76.03 

 

Question 1.1:  Which sample has the lowest and which has the highest coverage?

Question 1.2:  Which targets have a median read depth (across all samples) below 1x? 
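
Hints (one possible way to answer these in R):

# Q1.1: samples with the lowest and highest median coverage
medPerSample <- apply(Y_ac, 2, median)
names(which.min(medPerSample))
names(which.max(medPerSample))

# Q1.2: targets with median read depth (across samples) below 1x
which(apply(Y_ac, 1, median) < 1)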
 


Task 2 - run CODEX pipeline on 99 WES samples from 1000 Genomes Project

a) Set project name, and create output directory

projectname <- "TGP99"
outputDir <- paste0(workDir,"codex_output/")
dir.create(outputDir)

b) Load and transform pre-calculated coverage data for 99 samples

cfiles <- dir(paste0(workDir, "data/coverage/"), "*bam.coverage*")
cdf <- rbindlist(mclapply(cfiles, function(f){
  print(f);
  df <- fread(paste0(workDir, "data/coverage/",f));
  df$SampleName <- strsplit(f, "\\.")[[1]][1];df 
},mc.cores=cores))
colnames(cdf) <- c("Chr", "Start", "Stop", "ReadCount", "SampleName")
cdf <- cdf[order(cdf$SampleName, cdf$Chr, cdf$Start, cdf$Stop),]
dim(cdf)
#[1] 19170063        5
head(cdf)
#   Chr Start  Stop ReadCount SampleName
#1:   1 14642 14882        64    NA06985
#2:   1 14943 15063        54    NA06985
#3:   1 15751 15990         2    NA06985
#4:   1 16599 16719         0    NA06985
#5:   1 16834 17074         0    NA06985
#6:   1 17211 17331        16    NA06985

c) Set path to file with target data, extract sample names

bedFile <- paste0(workDir, "data/bed/20130108.exome.targets.bed")
sampname <- unique(cdf$SampleName)

d) Let's analyze chr20 

chr <- "20"
targetsChr <-     cdf[which(cdf$Chr==chr & cdf$SampleName == cdf$SampleName[1]),c("Chr", "Start",  "Stop")] 
selChr <- cdf[which(cdf$Chr==chr),] 

e) Create Y matrix

Y <- t(do.call(rbind,lapply(sampname, function(s){selChr$ReadCount [which(selChr$SampleName == s)]})))
colnames(Y) <- sampname
rownames(Y) <- 1:nrow(Y)
dim(Y)
#[1] 4702   99
dim(targetsChr)
#[1] 4702    3
ref <- IRanges(start = targetsChr$Start, end = targetsChr$Stop)

f) Calculate GC content and mapability for each target

gc <- getgc(chr, ref)
mapp <- getmapp(chr, ref)

g) Perform quality control for samples and targets

Question 2.1: How does GC content influence read depth?

Y_ac <- apply(Y, 2, function(x){(100*x)/width(ref)})
plot(gc, Y_ac[,2], ylab="read depth", xlab="GC content")
# apply smoothing 
plot(smooth.spline(gc, Y_ac[,2]), col="blue", type="l", lwd=3, ylab="read depth", xlab="GC content")

read depth vs GC content

Remove low-quality targets:

mapp_thresh <- 0.9 # remove exons with mappability < 0.9
cov_thresh_from <- 20 # remove exons covered by less than 20 reads
cov_thresh_to <- 4000 #  remove exons covered by more than 4000 reads
length_thresh_from <- 20 # remove exons of size < 20
length_thresh_to <- 2000 # remove exons of size > 2000
gc_thresh_from <- 20 # remove exons with GC < 20
gc_thresh_to <- 80 # or GC > 80
qcObj <- qc(Y, sampname, chr, ref, mapp, gc, 
            cov_thresh = c(cov_thresh_from, cov_thresh_to), 
            length_thresh = c(length_thresh_from, length_thresh_to), 
            mapp_thresh = mapp_thresh, 
            gc_thresh = c(gc_thresh_from, gc_thresh_to))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc; gc_qc <- qcObj$gc_qc
mapp_qc <- qcObj$mapp_qc; ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat

Question 2.2: How many exons were excluded in QC? 


Question 2.3: Try to change the GC thresholds to (30, 70). How many exons are excluded now?
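
Hint (a possible one-liner, comparing target counts before and after QC):

# Q2.2/2.3: number of exons excluded by QC
nrow(Y) - nrow(Y_qc)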

h) Fit Poisson latent factor model for normalization of the read depth data (this may take a while)
 

normObj <- normalize(Y_qc, gc_qc, K = 1:9)
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K

Determine the number of latent factors. AIC, BIC, and deviance reduction plots are generated in a .pdf file. CODEX reports all three statistical metrics (AIC, BIC, percent of variance explained) and uses BIC as the default method to determine the number of Poisson factors. Since false positives can be screened out through a closer examination of the post-segmentation data, whereas CNV signals removed in the normalization step cannot be recovered, CODEX opts for a more conservative normalization that, when in doubt, uses a smaller value of K.

choiceofK(AIC, BIC, RSS, K, filename = paste(projectname, "_", chr, "_choiceofK", ".pdf", sep = ""))

Examine plots in PDF file.

Question 2.4: Which K is best according to AIC, BIC and RSS?

i) Fit the Poisson log-likelihood ratio based segmentation procedure to determine CNV regions across all samples. For germline CNV detection, CODEX uses the "integer" mode; for CNV detection involving large recurrent chromosomal aberrations in mixed populations (e.g. somatic CNV detection in cancer), CODEX opts to use the "fraction" mode.

lmax = maximum CNV length, in number of exons, to be returned.
 

# choose the number of latent factors to use (BIC is the CODEX default)
optK <- K[which.max(BIC)]

finalcall <- CODEX::segment(Y_qc, Yhat, optK = optK, K = K, sampname_qc, ref_qc, chr, lmax = 200, mode = "integer")
finalcall <- data.frame(finalcall, stringsAsFactors=F)
finalcall$targetCount <- as.numeric(finalcall$ed_exon) - as.numeric(finalcall$st_exon)
dim(finalcall)
#[1] 73 13

The output has 13 columns with rows corresponding to CNV events. Columns include:

  • sample_name (sample name), 
  • chr (chromosome), 
  • cnv (deletion or duplication), 
  • st_bp (CNV start position in base pairs; the start position of the first exon in the CNV), 
  • ed_bp (CNV end position in base pairs; the end position of the last exon in the CNV),
  • length_kb (CNV length in kb), 
  • st_exon (the first exon after QC in the CNV; integer value numbered in qcObj$ref_qc),
  • ed_exon (the last exon after QC in the CNV; integer value numbered in qcObj$ref_qc), 
  • raw_cov (raw coverage),
  • norm_cov (normalized coverage), 
  • copy_no (copy number estimate), 
  • lratio (likelihood ratio of CNV event versus copy-neutral event),
  • mBIC (modified BIC value, used to determine the stop point of segmentation)

 

Question 2.5: How many calls are there?

Question 2.6: How many deletions and how many duplications?  

Question 2.7: Are there any homozygous CNVs?
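
Hints: a few one-liners for exploring the call set (note that copy_no is stored as character after the data.frame conversion, hence the as.numeric()):

# Q2.5: total number of calls
nrow(finalcall)

# Q2.6: deletions vs duplications
table(finalcall$cnv)

# Q2.7: homozygous deletions should have copy number estimates close to 0
finalcall[as.numeric(finalcall$copy_no) < 0.5, ]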

j) There is no plot function defined in CODEX, so let's define one:

plotCall <- function(calls,i , Y_qc, Yhat_opt){
  startIdx <- as.numeric(calls$st_exon[i])
  stopIdx <- as.numeric(calls$ed_exon[i])
  sampleName <- calls$sample_name[i]
  wd <- 20
  startPos <- max(1,(startIdx-wd))
  stopPos <- min((stopIdx+wd), nrow(Y_qc))
  selQC <- Y_qc[startPos:stopPos,]
  selQC[selQC ==0] <- 0.00001
  selYhat <- Yhat_opt[startPos:stopPos,]
  matplot(matrix(rep(startPos:stopPos, ncol(selQC)), ncol=ncol(selQC)), log(selQC/selYhat,2), type="l",lty=1, col="dimgrey",  lwd=1, xlab="exon nr", ylab="logratio(Y/Yhat)")
  lines(startPos:stopPos,log( selQC[,sampleName]/ selYhat[,sampleName],2), lwd=3, col="red")
}

We are plotting logratio of observed read count vs. normalized read count.

Let's see how it works:

plotCall (finalcall, 1, Y_qc, Yhat[[optK]])

Sample deletion called by CODEX

Question 2.8: Assuming the deletion is real, do you think it could be clinically relevant?

Try to examine plots of other calls, e.g. generate plots for deletions with likelihood ratio > 100.


Task 3 - Extract CNVs from 1000 Genomes SVs dataset

These SVs were detected using a joint analysis of WGS and WES data. Besides CNVs, the dataset includes other types of SVs, e.g. inversions.

In principle, the result of such an analysis should be more accurate than that of a CNV caller using WES data only.

a) We are going to extract the CNVs so that we can use them as reference data in our analysis.

mergedSVs <- fread(paste0(workDir, "data/TGP_SV/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf"), skip=55)
sv <- mergedSVs [grep ("SVTYPE=DEL|SVTYPE=DUP|SVTYPE=CNV" , mergedSVs$INFO),]
dim(mergedSVs)
#[1] 68818  2513
dim(sv)
#[1] 51233  2513

Question 3.1: How many structural variants are there?

Question 3.2: How many of them are CNVs? 

b) Extract columns that correspond to our 99 samples

sampleNames <- sampname
sv <- sv[,c(1:9,which(names(sv) %in% sampleNames)),with=F]

c) Remove CNVs for which all genotypes are 0|0

sv <- sv[apply(sv[,10:108,with=F], 1, function(x){any(x!="0|0")}),]

Question 3.3: How many different CNVs are there in our 99 samples?

d) Extract END position from INFO column

sv$END <- sapply(strsplit(sv$INFO, ";"), function(x){as.numeric(gsub("END=","",x[grep ("^END=",x)]))})

e) Next, we want to focus on CNVs that encompass at least one exon target

# create GRanges objects for sv and bed
bed <- fread(bedFile)
library(GenomicRanges)
svGr <- GRanges(sv$"#CHROM", IRanges(sv$POS, sv$END))
bedGr <- GRanges(bed$V1, IRanges(bed$V2, bed$V3))

# CNV should completely encompass at least one target region
mm <- as.matrix(findOverlaps(bedGr,svGr, type="within"))
sv$targetCount <- sapply(1:nrow(sv), function(x){length(which(mm[,2]==x))})
svOv <- sv[sv$targetCount > 0,]
dim(svOv)
#[1] 1022  110

f) Flatten the VCF and extract deletions and duplications.

tgpCNVs <- data.frame(do.call(rbind, mclapply(1:nrow(svOv), function(i){
    print(i)
    row <- svOv[i,]
    alt <- strsplit(row$ALT , ",")[[1]]
    firstAllele <- as.numeric(sapply(strsplit(unlist(row[,10:108]), "|"), function(x){x[1]}))
    secondAllele <- as.numeric(sapply(strsplit(unlist(row[,10:108]), "|"), function(x){x[3]}))
    delIdx <- NULL
    if (alt[1] == "<CN0>"){   delIdx <- which(firstAllele == 1 | secondAllele == 1)}
    delAlleleFreq <- (length(which(firstAllele %in% which(alt %in% paste0("<CN0>")) )) + length(which(secondAllele %in% which(alt %in% paste0("<CN0>")) ) ) )/ (2*99)
    
   
    dupIdx <- NULL 
    dupIdx <- which(firstAllele %in% which(alt %in% paste0("<CN",2:10,">")) | secondAllele %in% which(alt %in% paste0("<CN",2:10,">")))
    dupAlleleFreq <- (length(which(firstAllele %in% which(alt %in% paste0("<CN",2:10,">")) )) + length(which(secondAllele %in% which(alt %in% paste0("<CN",2:10,">")) ) ) )/ (2*99)
    
    
    dups <- do.call(rbind, lapply(dupIdx, function(j){
        sampleName <- colnames(row[,10:108])[j]
        c(sampleName = sampleName, svType="DUP", CHROM = row$"#CHROM", POS= row$POS, END=row$END,targetCount=row$targetCount, AF =dupAlleleFreq)
    }))
    dels <-      do.call(rbind, lapply(delIdx, function(j){
        sampleName <- colnames(row[,10:108])[j]
        c(sampleName = sampleName, svType="DEL", CHROM = row$"#CHROM", POS= row$POS, END=row$END, targetCount=row$targetCount,AF =delAlleleFreq)
    }))
  
    rbind(dups, dels)
},mc.cores=cores)), stringsAsFactors=F)
tgpCNVs$POS <- as.numeric(tgpCNVs$POS)
tgpCNVs$END <- as.numeric(tgpCNVs$END)
tgpCNVs$targetCount <- as.numeric(tgpCNVs$targetCount)
tgpCNVs$AF <- as.numeric(tgpCNVs$AF)
tgpCNVs$Length <- tgpCNVs$END - tgpCNVs$POS
table(tgpCNVs$svType)
#  DEL  DUP 
# 4575 5665 

Each row in tgpCNVs corresponds to a single occurrence of a CNV in a given sample

head(tgpCNVs)
#  sampleName svType CHROM    POS    END targetCount          AF Length
#1    NA18639    DUP     1 668630 850204           4 0.005050505 181574
#2    NA07357    DUP     1 713044 755966           2 0.030303030  42922
#3    NA11918    DUP     1 713044 755966           2 0.030303030  42922
#4    NA11995    DUP     1 713044 755966           2 0.030303030  42922
#5    NA12872    DUP     1 713044 755966           2 0.030303030  42922
#6    NA12873    DUP     1 713044 755966           2 0.030303030  42922

g) Examine the sizes of the extracted CNVs

plot(density(tgpCNVs$Length[tgpCNVs$svType=="DEL"]),xlim=c(1,0.3e6), col="red", main="Distribution of CNV sizes", lwd=3)
lines(density(tgpCNVs$Length[tgpCNVs$svType=="DUP"]), col="blue", lwd=3)
legend("topright",c("deletions", "duplications"), col=c("red", "blue"), lwd=3)

Distribution of CNV sizes

Question 3.4: What is the difference in size between deletions and duplications?

h) Perform a similar comparison for CNV allele frequencies (see the sketch below).

Question 3.5: Which type of CNV has more rare events?
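
A sketch for the allele-frequency comparison, mirroring the size-distribution plot above:

# compare allele frequency distributions of deletions and duplications
plot(density(tgpCNVs$AF[tgpCNVs$svType=="DEL"]), col="red", main="Distribution of CNV allele frequencies", lwd=3)
lines(density(tgpCNVs$AF[tgpCNVs$svType=="DUP"]), col="blue", lwd=3)
legend("topright", c("deletions", "duplications"), col=c("red", "blue"), lwd=3)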

 

i) We often observe an increased number of false-positive events in segmental duplications (or Low Copy Repeats, LCRs) due to poor mappability of reads in these regions.

Therefore, we will ignore CNVs that overlap with LCRs in our comparative analysis. First, we need to annotate our reference set of CNVs with the segmental duplication track (downloaded from the UCSC genome browser).
 

segDups <- fread(paste0(workDir,"data/segDups/seg_dups_hg19"))
segDupsGr <- GRanges(gsub("chr", "",c(segDups$chrom, segDups$otherChrom)), IRanges(c(segDups$chromStart, segDups$otherStart), c(segDups$chromEnd, segDups$otherEnd)))
tgpCNVsGr <- GRanges(tgpCNVs$CHROM, IRanges(tgpCNVs$POS, tgpCNVs$END))
tgpCNVs$segDups <- countOverlaps(tgpCNVsGr, segDupsGr)>0

Question 3.6: How many CNVs overlap with segmental duplications?
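
Hint:

# Q3.6: CNVs overlapping segmental duplications
table(tgpCNVs$segDups)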


 

Task 4 - Compare CODEX data and 1000 Genomes SVs dataset

 

a) Load pre-calculated CODEX calls for all chromosomes

files <- dir(paste0(workDir, "data/codex_output_all/"), "txt$")
codexOut <- rbindlist (lapply(paste0(workDir, "data/codex_output_all/",files), fread))
codexOut$targetCount <- codexOut$ed_exon - codexOut$st_exon
codexOut$Length <- codexOut$ed_bp - codexOut$st_bp

# and normalized / raw coverage data required by the plot function

files <- dir(paste0(workDir, "data/codex_output_all/"), "RData$")
normData <- lapply(c(1:22,"X", "Y"), function(chr){
  load(paste0(workDir, "data/codex_output_all/TGP99_",chr,"_image.RData"))
  list(Y_qc= dataToSave$Y_qc, Yhat = dataToSave$Yhat )
})
names(normData) <- c(1:22,"X", "Y")

b) Annotate with segmental duplications

codexOutGr <- GRanges(codexOut$chr, IRanges(as.numeric(codexOut$st_bp), as.numeric(codexOut$ed_bp)))
segDups <- fread(paste0(workDir,"data/segDups/seg_dups_hg19"))
segDupsGr <- GRanges(gsub("chr", "",c(segDups$chrom, segDups$otherChrom)), IRanges(c(segDups$chromStart, segDups$otherStart), c(segDups$chromEnd, segDups$otherEnd)))
codexOut$segDups <- countOverlaps(codexOutGr, segDupsGr)>0

c) Examine number of calls per chromosome

par(mfrow=c(2,1))
barplot(table(codexOut$chr[codexOut$cnv=="del"])[c(1:22, "X", "Y")], main="number of deletions - codex")
barplot(table(tgpCNVs$CHROM[tgpCNVs$svType=="DEL"])[c(1:22, "X", "Y")], main="number of deletions - 1000Genomes SVs")

Deletion counts in CODEX and reference dataset

 

d) Now, let's define a function that calculates sensitivity and precision.

Question 4.1: Why not specificity? How could we define it for CNV data?

evaluate <- function(tgpData, codexData, returnIdx=F){
    tgpGr <- GRanges(paste0(tgpData$sampleName, "_", tgpData$CHROM), IRanges(as.numeric(tgpData$POS), as.numeric(tgpData$END)))
    codexGr <-  GRanges(paste0(codexData$sample_name, "_",codexData$chr), IRanges(as.numeric(codexData$st_bp), as.numeric(codexData$ed_bp)))
    tpIdx <- suppressWarnings(which(countOverlaps(codexGr,tgpGr) >0))
    fnIdx <- suppressWarnings(which(countOverlaps(tgpGr, codexGr) ==0))
    fpIdx <- suppressWarnings(which(countOverlaps( codexGr,tgpGr) ==0))
    tp <- (length(tpIdx))
    fn <- (length(fnIdx))
    fp <- (length(fpIdx))
    sensitivity <- tp/(tp+fn)
    precision <- tp /(tp+fp)
    if(!returnIdx) return (c(sensitivity=sensitivity, precision=precision))
    return(list(sensitivity=sensitivity, precision=precision, tpIdx=tpIdx, fnIdx=fnIdx, fpIdx=fpIdx))
}

e) Evaluate deletions and duplications separately

codexOut_filtered <- codexOut [!codexOut$segDups  ,]
tgpCNVs_filtered <- tgpCNVs[!tgpCNVs$segDups ,]
codexDels <- codexOut_filtered[codexOut_filtered$cnv =="del",]
codexDups <- codexOut_filtered[codexOut_filtered$cnv =="dup",]
tgpDels <- tgpCNVs_filtered[tgpCNVs_filtered$svType=="DEL" ,]
tgpDups <- tgpCNVs_filtered[tgpCNVs_filtered$svType=="DUP",]

# calc sensitivity and precision
evaluate(tgpDels, codexDels)
#sensitivity   precision 
#  0.3276027   0.2175016 
evaluate(tgpDups, codexDups)
#sensitivity   precision 
# 0.16777409  0.09955643 

f) Examine deletion false positives

# extract tp and fp calls
tp <- codexDels[(evaluate(tgpDels, codexDels, returnIdx=T))$tpIdx,]
fp <- codexDels[(evaluate(tgpDels, codexDels, returnIdx=T))$fpIdx,]

Plot and compare an example TP and FP call

par(mfrow=c(1,2))
i <- 1; plotCall(tp, i,normData[[tp$chr[i]]]$Y_qc, normData[[tp$chr[i]]]$Yhat)
# and compare to 
i <- 1; plotCall(fp, i,normData[[fp$chr[i]]]$Y_qc, normData[[fp$chr[i]]]$Yhat)

Example of TP and FP call

 

Question 4.2: Is there a significant difference in lratio between TP and FP calls?
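
One way to check (a sketch; the as.numeric() is a precaution in case lratio was read as character):

# compare lratio between TP and FP deletion calls
boxplot(as.numeric(tp$lratio), as.numeric(fp$lratio), names = c("TP", "FP"), ylab = "lratio")
wilcox.test(as.numeric(tp$lratio), as.numeric(fp$lratio))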

What do the best-quality calls look like? Let's plot the TP call with the highest lratio.

dev.off(); i <- which.max(tp$lratio); plotCall(tp, i,normData[[tp$chr[i]]]$Y_qc, normData[[tp$chr[i]]]$Yhat)

Deletion with the highest lratio

Question 4.3: What is the estimated copy number for this deletion? 

Check if the copy number reported in the 1000 Genomes SV track is the same. All genotypes for this deletion are also available here:

https://genome.ucsc.edu/cgi-bin/hgc?hgsid=606627727_m7OgPnk3tPihLDtgQxY…

Question 4.4: What percentage of 2504 samples are homozygous for this deletion? How many are heterozygous? What is the total frequency for this deletion?

Question 4.5: Calculate the fraction of 99 samples having this deletion and see how many of them are homozygous and heterozygous. Compare to the frequencies reported in 1000 Genomes SV track.

g) Analyze false negatives

fnTgp <- tgpDels[(evaluate(tgpDels, codexDels, returnIdx=T))$fnIdx,]
tpTgp <- tgpDels[-(evaluate(tgpDels, codexDels, returnIdx=T))$fnIdx,]

See the difference in allele frequency between TP and FN calls. FN calls tend to be more common:

boxplot(tpTgp$AF, fnTgp$AF, names=c("TP", "FN"), ylab="AF")

Allele frequencies in TP and FN calls

Question 4.6: What would be the sensitivity if we excluded reference calls with AF > 0.3? Note that common CNVs are usually not clinically relevant. (See the hint below.)
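
Hint (reusing the evaluate() function defined above):

# Q4.6: sensitivity and precision after excluding common reference deletions (AF > 0.3)
evaluate(tgpDels[tgpDels$AF <= 0.3, ], codexDels)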


Task 5 - CNV annotations

a) Load gene annotations

rf <- fread(paste0(workDir, "data/refFlat/refFlat.txt"))
head(rf)

b) Annotate CODEX CNVs with gene symbols

rfGr <- GRanges(gsub("chr", "",rf$V3), IRanges(rf$V5, rf$V6))
mm <- data.table(as.matrix(findOverlaps(rfGr, codexOutGr)))
mm$Gene <- rf$V1[mm$queryHits]
mm[,Genes:=paste(Gene,collapse=","),by="subjectHits"]
mm[,GeneCount:=length(Gene),by="subjectHits"]
codexOut$Genes[mm$subjectHits] <- mm$Genes
codexOut$GeneCount[mm$subjectHits] <- mm$GeneCount

c) Indicate known autosomal dominant disease genes

knownADGenes <- readLines("https://raw.githubusercontent.com/macarthur-lab/gene_lists/master/lists/all_ad.tsv")
codexOut$knownADGene <- sapply( strsplit(codexOut$Genes, ","), function(x){length(intersect(x,knownADGenes))>0})

d) Plot frequencies of single-gene deletions encompassing known AD genes
 

barplot(sort(table(codexOut$Genes[(codexOut$knownADGene & codexOut$GeneCount ==1 & codexOut$cnv=="del")]),decreasing=T) )

e) Annotate with data from Database of Genomic Variants (DGV)

 

dgv <- fread(paste0(workDir,"data/DGV/dgvSupporting.txt"))
dim(dgv)
#[1] 6668715      23

dgvDels <- dgv[which(dgv$V11 %in% c("deletion","loss")),]
dgvDups <- dgv[which(dgv$V11 %in% c("duplication","gain")),]
dgvDelsGr <- GRanges(gsub("chr", "",dgvDels$V2), IRanges(dgvDels$V3, dgvDels$V4))
dgvDupsGr <- GRanges(gsub("chr", "",dgvDups$V2), IRanges(dgvDups$V3, dgvDups$V4))
codexOut$dgvDelCount <-    countOverlaps (codexOutGr, dgvDelsGr)
codexOut$dgvDupCount <-    countOverlaps (codexOutGr, dgvDupsGr)

How many deletions encompass a known AD gene and have no overlap with DGV deletions? (See the hint below.)
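
A possible one-liner, assuming the annotations added in steps b)-e) above:

# deletions encompassing a known AD gene with no overlapping DGV deletions
sum(codexOut$knownADGene & codexOut$cnv == "del" & codexOut$dgvDelCount == 0, na.rm = TRUE)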

 

Task 6 - Detecting homozygous and hemizygous deletions

 

############################################################
# Running HMZDelFinder on 50 samples from 1000 genomes #
############################################################

# set project and data directory
# replace workDir with the location where you want to store experiment results
dataDir <- paste0(workDir, "data/HMZDelFinder_data/")



# load HMZDelFinder source code
# Note: source ("https://....")  does not work on some platforms
eval( expr = parse( text = getURL("https://raw.githubusercontent.com/BCM-Lupskilab/HMZDelFinder/master/src/HMZDelFinder.R") ))


# set/create other paths and identifiers
bedFileHMZ <- paste0(dataDir, "tgp_hg19.bed") # set path to BED file
outputDir <- paste0(workDir, "HMZDelFinder_output/"); if (!file.exists(outputDir)){dir.create(outputDir)} # create output directory
plotsDir <- paste0(workDir, "HMZDelFinder_output/plots/"); if (!file.exists(plotsDir)){dir.create(plotsDir)} # create output plots directory
rpkmFiles <- dir(paste0(dataDir, "TGP/"), "rpkm2.txt$")   # list of RPKM file names
rpkmFids <- gsub(".rpkm2.txt", "", rpkmFiles)             # list of sample identifiers
rpkmPaths <- paste0(dataDir, "TGP/", rpkmFiles)           # list of paths to RPKM files
aohDir <- paste0(workDir, "HMZDelFinder_output/AOH/"); if (!file.exists(aohDir)){dir.create(aohDir)} 
aohRDataOut <- paste0(workDir, "HMZDelFinder_output/AOH/extAOH_small.RData") # temporary file to store AOH data

############
## NOTE 1 ## 
############
## To use your own WES data and create RPKM files from BAM files one can use the calcRPKMsFromBAMs function,
## e.g.:
#pathToBams <- "/your/path/to/bamfiles/" 
#bamFiles <- paste0(pathToBams, dir(pathToBams, "bam$"))
#rpkmDir <- dataDir  # place to store RPKM files
#sampleNames <- sapply(strsplit(dir(pathToBams, "bam$"), "[/\\.]"), function(x){x[length(x)-1]}) # sample identifiers
#calcRPKMsFromBAMs(bedFileHMZ, bamFiles, sampleNames, rpkmDir, 4)
##

############
## NOTE 2 ## 
############
## In this example, we are not performing AOH filtering and VCF files are not required.
## If one wants to perform AOH filtering, then two lists need to be prepared:
## vcfPaths - the list of paths to VCF files
## vcfFids - the list of sample identifiers corresponding to the VCF files (same order)
## e.g.:
# vcfFiles <- dir (vcfDir,"vcf.bz2$", recursive=T, include.dirs=FALSE)
# vcfFids <- sapply(strsplit(vcfFiles,"[/\\.]"), function(x){x[2]})
# vcfPaths <- paste0(inputDir, vcfFids, "/", sapply(strsplit(vcfFiles,"/"), function(x){x[2]}))
##


########################################
# THRESHOLDS
#
# See description of HMZDelFinder function for details
########################################
is_cmg <- FALSE 		# only for the CMG project - otherwise use FALSE
lowRPKMthreshold <- 0.65	# RPKM threshold  
maxFrequency <- 0.05	# max frequency of HMZ deletion; default = 0.005
minAOHsize <- 1000		# min AOH size
minAOHsig <- 0.45		# min AOH signal threshold
mc.cores <- 4 			# number of cores
vR_id <- "VR"			# ID from the VCF FORMAT field indicating the number of variant reads; for other variant callers it could be "AD"
tR_id <- "DP"			# ID from the VCF FORMAT field indicating the total number of reads 
filter <- "PASS"		# for other variant callers this may be '.'

# running HMZDelFinder
results <- runHMZDelFinder (NULL,		# vcfPaths - paths to VCF files for AOH analysis (not used for 1000 genomes) 
							NULL,		# vcfFids - sample identifiers corresponding to VCF files (not used for 1000 genomes) 
							rpkmPaths, 	# paths to RPKM files 
							rpkmFids,	# sample identifiers corresponding to RPKM files
							mc.cores,	# number of CPU cores
							aohRDataOut,# temp file to store AOH data
							bedFileHMZ,	# BED file with target regions
							lowRPKMthreshold, # RPKM threshold 
							minAOHsize, # min AOH size
							minAOHsig,	# min AOH signal threshold
							is_cmg,		# flag used for CMG-specific annotations; TRUE if samples are from the BHCMG cohort, FALSE otherwise
							vR_id, 		# ID for 'the number of variant reads' in the VCF FORMAT column (default='VR')
							tR_id,		# ID for 'the total number of reads' in the VCF FORMAT column (default='DP')
							filter)		# only variants with this value in the VCF FILTER column will be used in AOH analysis 
					
					
# saving results in csv files
write.csv(results$filteredCalls, paste0(outputDir,"hmzCalls.csv"), row.names=F )

# plotting deletions
lapply(1:nrow(results$filteredCalls),function(i){
			plotDeletion (results$filteredCalls, i, results$bedOrdered, results$rpkmDtOrdered,  lowRPKMthreshold, plotsDir, mainText=""  )})
	
## Selected columns from the results$filteredCalls object:					
#					Chr     Start      Stop   Genes Start_idx     FID
#					1:   5 140235634 140236833 PCDHA10    133937 NA11919
#					2:   X  47918257  47919256  ZNF630    167263 NA18856
#					3:  11   7817521   7818489   OR5P2     27561 NA19137
#					4:  11   7817521   7818489   OR5P2     27561 NA19236
#					5:   9 107379729 107380128  OR13C9    161704 NA19473
#					6:   1 196795959 196796135   CFHR1     15101 NA20798
#					7:   5  42629139  42629205     GHR    130161 NA07347
#					8:   5  42629139  42629205     GHR    130161 NA12342
#					9:   5  42629139  42629205     GHR    130161 NA19213
#					10:  16  55866915  55866967    CES1     64208 NA18553
## NOTE: Deletions of CES1, CFHR1 and OR13C9 are located within segmental duplications, and thus they were not reported in the manuscript


					

Example of a homozygous deletion detected by HMZDelFinder

tgambin Thu, 09/07/2017 - 10:39

RNA Structure Probing

RNA Structure Probing

0. [due to server constraints do not run this job] Annotate the secondary structure of human mt-SSU rRNA based on the crystal structure (PDB: 3J9M). 

a. Go to http://rnapdbee.cs.put.poznan.pl

b. Get the 3J9M structure and press Run

c. Output:

Secondary structure topology resolved & encoded by Hybrid Algorithm:
>strand_A
GCUAAACCUAGCCCCAAACCCACUCCACCUUACUACCAGACAACCUUAGCCAAACCAUUU
ACCCAAAUAAAGUAUAGGCGAUAGAAAUUGAAACCUGGCGCAAUAGAUAUAGUACCGCAA
GGGAAAGAUGAAAAAUUAUAACCAAGCAUAAUAUAGCAAGGACUAACCCCUAUACCUUCU
GCAUAAUGAAUUAACUAGAAAUAACUUUGCAAGGAGAGCCAAAGCUAAGACCCCCGAAAC
CAGACGAGCUACCUAAGAACAGCUAAAAGAGCACACCCGUCUAUGUAGCAAAAUAGUGGG
AAGAUUUAUAGGUAGAGGCGACAAACCUACCGAGCCUGGUGAUAGCUGGUUGUCCAAGAU
AGAAUCUUAGUUCAACUUUAAAUUUGCCCACAGAACCCUCUAAAUCCCCUUGUAAAUUUA
ACUGUUAGUCCAAAGAGGAACAGCUCUUUGGACACUAGGAAAAAACCUUGUAGAGAGAGU
AAAAAAUUUAACACCCAUAGUAGGCCUAAAAGCAGCCACCAAUUAAGAAAGCGUUCAAGC
UCAACACCCACUACCUAAAAAAUCCCAAACAUAUAACUGAACUCCUCACACCCAAUUGGA
CCAAUCUAUCACCCUAUAGAAGAACUAAUGUUAGUAUAAGUAACAUGAAAACAUUCUCCU
CCGCAUAAGCCUGCGUCAGAUUAAAACACUGAACUGACAAUUAACAGCCCAAUAUCUACA
AUCAACCAACAAGUCAUUAUUACCCUCACUGUCAACCCAACACAGGCAUGCUCAUAAGGA
AAGGUUAAAAAAAGUAAAAGGAACUCGGCAAAUCUUACCCCGCCUGUUUACCAAAAACAU
CACCUCUAGCAUCACCAGUAUUAGAGGCACCGCCUGCCCAGUGACACAUGUUUAACGGCC
GCGGUACCCUAACCGUGCAAAGGUAGCAUAAUCACUUGUUCCUUAAAUAGGGACCUGUAU
GAAUGGCUCCACGAGGGUUCAGCUGUCUCUUACUUUUAACCAGUGAAAUUGACCUGCCCG
UGAAGAGGCGGGCAUAACACAGCAAGACGAGAAGACCCUAUGGAGCUUUAAUUUAUUAAU
GCAAACAGUACCUAACAAACCCACAGGUCCUAAACUACCAAACCUGCAUUAAAAAUUUCG
GUUGGGGCGACCUCGGAGCAGAACCCAACCUCCGAGCAGUACAUGCUAAGACUUCACCAG
UCAAAGCGAACUACUAUACUCAAUUGAUCCAAUAACUUGACCAACGGAACAAGUUACCCU
AGGGAUAACAGCGCAAUCCUAUUCUAGAGUCCAUAUCAACAAUAGGGUUUACGACCUCGA
UGUUGGAUCAGGACAUCCCGAUGGUGCAGCCGCUAUUAAAGGUUCGUUUGUUCAACGAUU
AAAGUCCUACGUGAUCUGAGUUCAGACCGGAGUAAUCCAGGUCGGUUUCUAUCUACUUUC
AAAUUCCUCCCUGUACGAAAGGACAAGAGAAAUAAGGCCUACUUCACAAAGCGCCUUCCC
CCGUAAAUGAUAUCAUCUCAACUUAGUAUUAUACCCACACCCACCCAAGAACAGGGUUU
((.....(..((((.........................-...........(...[..).
..----..(.....((.]...))....)..-----.)).)).............((....
))....).....(...)........)).........(.(((.......))).[.(.(...
.....).).((((.(.........((..((..((....))...)).).)...((.(...(
(((.(.....(((((.((((.(((.--.[)))...(((..((((.......{)))).)))
..).))).))))).(.((.......))...)..).))))......).))..((((((...
.((.(....)).).((.(((((.((((.....((...----...)).....)))).))))
)..))..((((((((((......))))))))))...(((......)))....(((.(((.
..(..............(.(...((.........))...).)....(..(((......))
))......---------....................)...))))))........)))))
)...(((((......))))).........)).)))(......((..(((....)))(((.
..((((......(.(((((..-----...---.)))))......(((...((((......
.........(..)....)))).......))).)...((......)).))))......)))
..((((....((((((..((..(....(........((((((.].((.........))..
..((((((.............))))))....(((.(((..(((((....)))..)))))(
((((.----...)))))........((..........(((((.......)))))..)..)
....)))....)).))))....)..))).).))))).)))).)).....)).{..(((((
.....]}.))))).......((..(((.((......((((.(((((....((((.(((.(
(((.......-------------------------------...))))).))))))...(
((......[)))((((((..........]))))))(.((.....((...(((....[..)
))...))....-------)).........(((....))).....)]......)))..)))
)))......((((..(.((((((.(.((.......)))..))))))...).)).).)...
...((((.((((((..((.((((((......))))))...)).(((((.....)))))..
...)))))..)...))).).....(((((((....))).))))..}.))..))).))..-
--(((.(((......(....).....))).)))(((((...(((....))).)))))...
...(((................))).----.......----.((((.......)))).-
>strand_B
CCAGAGUGUAGCUUAACACAAAGCACCCAACUUACACUUAGGAGAUUUCAACUUAACUUG
ACCGCUCUGACCA
--((((((..(((...----.))).((.((.......)).))....-((((------)))
)-))))))..---
>strand_u
CA
..
>strand_AA
AAUAGGUUUGGUCCUAGCCUUUCUAUUAGCUCUUAGUAAGAUUACACAUGCAAGCAUCCC
CGUUCCAGUGAGUUCACCCUCUAAAUCACCACGAUCAAAAGGAACAAGCAUCAAGCACGC
AGCAAUGCAGCUCAAAACGCUUAGCCUAGCCACACCCCCACGGGAAACAGCAGUGAUUAA
CCUUUAGCAAUAAACGAAAGUUUAACUAAGCUAUACUAACCCCAGGGUUGGUCAAUUUCG
UGCCAGCCACCGCGGUCACACGAUUAACCCAAGUCAAUAGAAGCCGGCGUAAAGAGUGUU
UUAGAUCACCCCCUCCCCAAUAAAGCUAAAACUCACCUGAGUUGUAAAAAACUCCAGUUG
ACACAAAAUAGACUACGAAAGUGGCUUUAACAUAUCUGAACACACAAUAGCUAAGACCCA
AACUGGGAUUAGAUACCCCACUAUGCUUAGCCCUAAACCUCAACAGUUAAAUCAACAAAA
CUGCUCGCCAGAACACUACGAGCCACAGCUUAAAACUCAAAGGACCUGGCGGUGCUUCAU
AUCCCUCUAGAGGAGCCUGUUCUGUAAUCGAUAAACCCCGAUCAACCUCACCACCUCUUG
CUCAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACAAAGUAAGCGCAAG
UACCCACGUAAAGACGUUAGGUCAAGGUGUAGCCCAUGAGGUGGCAAGAAAUGGGCUACA
UUUUCUACCCCAGAAAACUACGAUAGCCCUUAUGAAACUUAAGGGUCGAAGGUGGAUUUA
GCAGUAAACUAAGAGUAGAGUGCUUAGUUGAACAGGGCCCUGAAGCGCGUACACACCGCC
CGUCACCCUCCUCAAGUAUACUUCAAAGGACAUUUAACUAAAACCCCUACGCAUUUAUAU
AGAGGAGACAAGUCGUAACAUGGUAAGUGUACUGGAAAGUGCACUUGGACGAAC
-[.[[[[..(((((..]]]]]((((((.((..((((((((..(.(((.(((..(......
)......(((.......(((.(..(((-----))).)..)))...((((....((..(((
(....))).)))......)))).((...))))).(((....))).....)))))))....
..(.(((...(((((....)))).))))).)).)))))).....((((([[[.....(((
.....((.]]])).......)))..)))))..)).))))))--..((([[...(.((((.
..(((...---------...(((((((.....(((.(((.((........))..))).))
).....).....(.((....))))))))).....)))....))))....((((((...((
...((((.........))))...))))))))......{...(.(((((----..]]..))
))).))))).((.......((((....)))).....))...)))))(((((.(((((((.
..(((.(((((..((((((((((.....((........))..........(((((((...
....((.(((...((((((.(....((..(((((.....))).......))..)).....
(((....))).....).).)))...)).)))))....)))))))..)).)))))))).(.
..((((.....)))).----(....(((((((.....--)))))))....)..)..))))
).....(((((((.........))))))).....)))...)))))))))).))..(.(..
((.(...((((((...((((..((..(.(..--((.....))...).)..).)..)))).
.))))))..--).))...)..)..(((((((((....)))))))))}.......
>strand_AX
g
.

d. For the convenience of subsequent analysis, I have converted the structure annotation of strand_AA to a non-standard notation with upper-/lower-case encoding (mt-rnr1_str.txt).

1. If you haven't found them on the USB stick, download 3 FASTQ files and one txt file from https://tinyurl.com/ngschool2017 [144 MB total]
  a. Set up a variable with the working directory where you have stored your downloaded files. For example:
working_dir="~/NGSchool2017/Data"
cd $working_dir
  b. You should have files named:
K562denatured_subset.fastq.gz                 
k562vitro_subset.fastq.gz
K562vivo1_dms300_subset.fastq.gz
mt-rnr1_str.txt
2. Prepare the reference sequence and a mapping index
  a. Copy the sequence of the human mitochondrial genome (NC_012920.1) from the USB stick or download it, either:
    i. Manually:

Go to https://www.ncbi.nlm.nih.gov/nuccore/251831106, then: Send to -> File -> FASTA. Rename the file to chrM.fa and place it in the 'Template' directory.

    ii. Automatically:
mkdir Template
cd Template
wget -O chrM.fa "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id=NC_012920.1"

  b. Open the downloaded FASTA file in a text editor (e.g. vim) and change the name of the reference sequence to "chrM"
  c. Prepare a bowtie2 index for mapping the reads:
bowtie2-build chrM.fa chrM

3. Map reads to chrM

    cd $working_dir
    for d in K562denatured k562vitro K562vivo1_dms300
    do
    
    bowtie2 -p 2 -x ./Template/chrM -U "$d"_subset.fastq.gz 2> "$d"_bowtie_out.txt | gzip > "$d".sam.gz
    
    done
4. Inspect the SAM files
    zless -S k562vitro.sam.gz
  a. There are many unmapped reads! Why?
  b. To filter out the unmapped reads we can use the flag attribute. Which flag values should we keep/discard? See the website explaining SAM flags: https://broadinstitute.github.io/picard/explain-flags.html
  c. See how the flags are described in the samtools program:
    samtools flags
  d. Remove all unmapped reads:
    for d in K562denatured k562vitro K562vivo1_dms300
    do
    
    samtools view -F 0x4 -h "$d".sam.gz | gzip > "$d"_filtered.sam.gz
    
    done
5. Convert the SAM files to BAM:
    for d in K562denatured k562vitro K562vivo1_dms300
    do
    
    samtools sort -O bam -T temp -o "$d".bam "$d"_filtered.sam.gz
    samtools index "$d".bam &
    
    done
    wait
    
6. Display the BAM files in IGV
  a. Load the K562vivo1_dms300.bam file in IGV.

To see all the reads go to View -> Preferences -> Alignments and set Max Read Depth to 10000.

  b. Is this display informative? Does it include all the information we need?
7. Count the terminations at different positions of the mitochondrial genome
  a. Use the script described in "Reproducible Analysis of Sequencing-Based RNA Structure Probing Data with User-Friendly Tools":

    wget https://raw.githubusercontent.com/lkie/RNAprobBash/master/summarize_unique_barcodes.sh
  b. Change permissions to allow the script to execute:
    chmod u+x summarize_unique_barcodes.sh
    Note for Mac users: I had to replace all occurrences of "zcat" with "gzcat":
    sed -i -e 's/zcat/gzcat/g' summarize_unique_barcodes.sh

    and I had to update gawk to version 4+

  c. Run the script:
    for d in K562denatured k562vitro K562vivo1_dms300
    do
    
    ./summarize_unique_barcodes.sh -f "$d"_filtered.sam.gz -t -o "$d"_out
    
    done

Inspect the generated files. What kind of information do they contain?
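
For a quick look (a sketch; read_counts.txt is the file used in the next step, other files may accompany it):

    ls K562vivo1_dms300_out/
    head K562vivo1_dms300_out/read_counts.txt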

8. Analyze the probing data in RNAprobR
  a. Launch R and load the RNAprobR package:
    library("RNAprobR")
  b. Make sure that you are in the working directory with the getwd() function. If you are not, change it with:
        setwd("~/NGSchool2017/Data")

Read in the datasets (readsamples()), compile (comp()) and normalize with smooth Winsorization (swinsor()):

    euc_GR <- list()
    comp_GR <- list()
    norm_GR <- list()
    norm_GR_subseq <- list()
    
    for(treatment in c("K562denatured", "k562vitro", "K562vivo1_dms300")){
      euc_GR[[treatment]] <- readsamples(samples = paste(treatment, "_out/read_counts.txt", sep = ""))
      comp_GR[[treatment]] <- comp(euc_GR = euc_GR[[treatment]], fasta_file = "Template/chrM.fa")
      norm_GR[[treatment]] <- swinsor(comp_GR[[treatment]])
      norm_GR_subseq[[treatment]] <- norm_GR[[treatment]][644:1597]
    }
    
  c. Read in the annotated secondary structure. Upper case means paired. The structure is based on the paper "The structure of the human mitochondrial ribosome" (PDB: 3J9M). [This is not a standard way of annotating structure.]
    rnr1_str <- scan(file = "mt-rnr1_str.txt", what = character())
    t1 <- strsplit(rnr1_str, "")[[1]]
    isDs <- t1 == toupper(t1)
    rm(t1)
  d. Make histograms of the signal intensity for the different classes of nucleotides:
    par(mfrow = c(2,2))
    treatment <- "K562vivo1_dms300"
    
    for(nt in c("A", "C", "G", "T")){
      unpaired <- hist(norm_GR_subseq[[treatment]]$swinsor[norm_GR_subseq[[treatment]]$nt == nt & isDs == FALSE], breaks = 10, plot = F)
      paired <- hist(norm_GR_subseq[[treatment]]$swinsor[norm_GR_subseq[[treatment]]$nt == nt & isDs == TRUE], breaks = 10, plot = F)
      ymax <- max(paired$counts, unpaired$counts)
      plot(unpaired, main = paste(nt), xlim = c(0,1), col = "#FF000080", ylim = c(0, ymax), xlab = "Normalized reactivity")
      plot(paired, col = "#0000FF80", add = T)
    }
  e. Perform a similar analysis for the denatured sample. How does it compare to the in vivo probing?
  f. Make ROC curves of the DMS reactivities (e.g. with the pROC library); a sketch follows below.
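
A minimal sketch (assuming the objects from the steps above; DMS modifies mostly A and C, so we restrict to those bases):

    library(pROC)
    treatment <- "K562vivo1_dms300"
    sel <- norm_GR_subseq[[treatment]]$nt %in% c("A", "C")
    # classify unpaired (TRUE) vs paired (FALSE) positions by normalized reactivity
    roc_dms <- roc(response = !isDs[sel],
                   predictor = norm_GR_subseq[[treatment]]$swinsor[sel])
    plot(roc_dms)
    auc(roc_dms)
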
  g. Make bedgraph files for display in genome browsers.
    i. To create a bedgraph file to display in a genome browser, we need to provide the program with the mapping of the transcript to genomic coordinates. This can be done using a "bed" file. In the Template directory, create a "tx.bed" file with only one line containing:

    chrM	0	16569	chrM	0	+
    ii. Make bedgraphs with raw counts and normalized values:
    treatment <- "K562vivo1_dms300"
    
    norm2bedgraph(norm_GR = comp_GR[[treatment]], bed_file = "Template/tx.bed", bedgraph_out_file = paste(treatment, "raw", sep = ""), track_name = paste(treatment, "raw"), norm_method = "TC", genome_build = "hg38")
    
    norm2bedgraph(norm_GR = norm_GR[[treatment]], bed_file = "Template/tx.bed", bedgraph_out_file = paste(treatment, "swinsor", sep = ""), track_name = paste(treatment, "swinsor"), norm_method = "swinsor", genome_build = "hg38")
  h. Load and display the bedgraph files in IGV.
  i. Upload and display the bedgraph files in the UCSC Genome Browser.
  j. Prepare a graphical representation of the secondary structure using VARNA.
    i. Focus on positions 715–800 (see figure S8 in Amunts et al. 2015):
    treatment <- "K562vivo1_dms300"
    five_p_tip <- data.frame(1:86, norm_GR[[treatment]]$nt[715:800], norm_GR[[treatment]]$swinsor[715:800])
    write.table(five_p_tip, file = "five_p_tip.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)
    ii. Open VARNA and copy:
    As sequence:
    GTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCAC
    
    As structure:
    (((.......(((.(..(((.....))).)..)))...((((....((..((((....))).)))......)))).((...)))))

and load the values from the R-generated file "five_p_tip.txt" [Display -> Color map -> Load values]

lkielpinski Mon, 08/21/2017 - 11:12

Day 4

Day 4 lpryszcz Wed, 07/12/2017 - 17:22

ChIP-seq

ChIP-seq

Notes on the workshop.

The data is available on your pendrive and on the server, in the directory differential.ChIP-seq.data.

To get up-to-date scripts, clone them from GitHub:

git clone https://github.com/ophiothrix/NGS2017.differential.ChIP-seq.workshop

Ideally, place the data and the scripts in the same folder, but not one inside the other.

The workshop is best done in RStudio.

There are two main files:

differential.chipseq.workshop.Rmd - the RStudio notebook source file - can be run directly in RStudio.

differential.chipseq.workshop.nb.html - a pre-compiled notebook - easier to follow.

Note! The copies of these files on the pendrive are slightly outdated. Use the ones from the GitHub link above.

aholik Fri, 09/08/2017 - 11:56

Functional genome annotation

Functional genome annotation mmarcet Thu, 08/24/2017 - 14:55
slides

Exercises

Exercises

For exercises 1 and 2 we provide a piece of DNA from the prokaryotic species Actinomyces johnsonii, named contigA.fasta.

Exercise 1.- Annotate a prokaryotic genome with a homology-based method (blast)

Blast is probably one of the most useful and most widely used tools in genomics, used constantly by bioinformaticians and wet-lab scientists alike. It is normally implemented in websites such as NCBI or UniProt. The problem is that a newly sequenced genome will likely not be present in any website. While blast is not very useful as a genome annotation tool, we will still have a look at how to use it.

1.- The first step consists in converting our genome into a blast database. To do that you can use the following command:

formatdb -p F -i contigA.fasta

The -p F option indicates that our database contains nucleotide sequence. For a protein database either put -p T or don't put anything (protein is the default).

2.- To perform a blast search we need at least one gene or protein to use as a query. In this case we provide the proteome of a closely related species, Actinomyces oris. The proteome can be found in the file protein_list.fasta.

3.- Now you can run the blast search using the following command:

blastall -p tblastn -i protein_list.fasta -d contigA.fasta -m 8 -b 1 -e 0.01 -o contigA.blast

Where:

-i indicates the list of proteins

-d indicates the genome database

-p is the kind of blast program we're going to use

-m 8 indicates the output format (tabular)

-b 1 indicates we only want one result per protein

-e is the e-value threshold

-o is the output file

Open the file and have a look at the results. Why do you think this method is not useful for gene annotation? What do you think it can be useful for?
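
To make sense of the tabular output it helps to know the -m 8 column order; a quick way to look at the strongest hits (a sketch):

# -m 8 columns: query, subject, %identity, aln length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, e-value, bit score
sort -k12,12gr contigA.blast | head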


Exercise 2.- Annotate a prokaryotic genome using Glimmer3.

Glimmer3 is one of the most used gene annotation programs for prokaryotes. There are different versions for the annotation of eukaryotes, metagenomes and others; for more information about those you can check out this website: https://ccb.jhu.edu/software/glimmer/ In addition there is an on-line version that can be used: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

We are going to use glimmer to annotate the piece of DNA found in contigA.

Ab initio programs use a parameters file to make gene predictions. The web version of Glimmer only runs the program with standard prokaryotic parameters; ideally we want to give it a personalized set. To do that we first need a set of sequences to train the model. Glimmer allows us to make a first ORF prediction and use that as input for the training program.

1.- Obtain the long ORFs from contigA:

tigr-glimmer long-orfs contigA.fasta contigA.longOrfs

Open the file and have a look at the results.

2.- Now you need to extract the nucleotide sequences of the predicted ORFs:

tigr-glimmer extract contigA.fasta contigA.longOrfs >contigA.longOrfs.fasta

As you can see, the contigA.longOrfs.fasta file now contains the sequences. You can count how many sequences you found by using this command:

grep -c ">" contigA.longOrfs.fasta

3.- Use the predicted ORFs to create the parameter file that glimmer needs to work:

cat contigA.longOrfs.fasta | tigr-glimmer build-icm contigA.icm

4.- And finally make the glimmer prediction:

tigr-glimmer glimmer3 contigA.fasta contigA.icm contigA.results

This command will output two files: contigA.results.detail and contigA.results.predict. The final coordinates of the prediction can be found in the .predict file. The .detail file shows all the genes that were considered, whether they made it to the final annotation or not.

5.- Extract the ORFs obtained in the final prediction as shown above. How many genes have you predicted? Compare them to the long-ORF sequences.

6.- What would happen if we used a different species to build the parameters file? Use the fasta file containing 500 proteins of Streptococcus suis (seqs.Streptococcus.fasta) to obtain a second set of gene predictions (see the sketch below) and compare the two.
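
A possible way to redo the training and prediction (a sketch; build-icm expects nucleotide sequences, so this assumes seqs.Streptococcus.fasta contains coding sequences):

cat seqs.Streptococcus.fasta | tigr-glimmer build-icm strep.icm
tigr-glimmer glimmer3 contigA.fasta strep.icm contigA.strep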


Exercise 3.- Homology-based annotation

contigB.fasta contains a fragment of chromosome VII of the fungus Aspergillus nidulans (also called Emericella nidulans). We are going to use it to have a look at a homology-based annotation method, using exonerate (http://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate).

To run an annotation method based on homology we first need a set of proteins to search for. These are already provided in the file protein_list.fa.

1.- Execute exonerate to check which options are available:

exonerate --help

2.- Choose the options needed to run exonerate, considering that the file contigB.fasta contains a genome.

3.- Have a look at the results. How many proteins do you have?

Can you play a bit with the outputs in order to obtain a clearer picture of what is going on?

4.- Repeat the analysis asking that just the best match is annotated for each protein.

The most useful command will look like this:

exonerate -q protein_list.fa -t contigB.fasta -m p2g --showalignment False --showvulgar False --showtargetgff True -n 1

Where:

-q is the query sequence, in this case the protein list (protein_list.fa)

-t is the target (our contig)

-m is the model we're going to apply; in this case p2g is a shortcut for protein2genome

-n 1 shows only the best result for each protein

and the remaining options just modify the output: don't show the alignment (--showalignment False), don't show the vulgar output (--showvulgar False), show the gff annotation of the results (--showtargetgff True)

Exercise 4.- Annotate a eukaryotic genome with Augustus.

We are going to use the same piece of DNA to predict proteins with an ab initio program (Augustus).

1.- As a reminder, contigB.fasta contains a fragment of chromosome VII of the fungus Aspergillus nidulans (Emericella nidulans). First check whether this species is already available in Augustus.

To see the list of species for which trained parameters are available, you can use the --species=help option.

2.- Once you verify that it is, you can check the different options that Augustus provides. We can run Augustus by typing this:

augustus [parameters] --species=aspergillus_nidulans contigB.fasta

Where parameters are any of the choices the program offers.

3.- Run Augustus to obtain proteins with a complete gene model, with the output including gff3 (check the augustus help for the proper options).
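
One possible invocation (a sketch; --genemodel=complete and --gff3=on are the relevant options in the Augustus manual):

augustus --species=aspergillus_nidulans --genemodel=complete --gff3=on contigB.fasta > contigB.gff3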

4.- In order to extract the protein sequences we can use a script provided by augustus, called getAnnoFasta.pl, which you can find in your working folder. Notice that this script can also provide the CDS if augustus has been called with the --codingseq=on option.
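
For example (a sketch, assuming the Augustus output was saved as contigB.gff3 as above):

./getAnnoFasta.pl contigB.gff3   # writes the protein sequences to a .aa file (contigB.aa)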

5.- What happens if you use a different species to define your parameters? Repeat the augustus prediction using the parameters of Aspergillus fumigatus and of tomato. In both cases extract the protein sequences using getAnnoFasta.pl and save them in separate files.

6.- Compare the three results. What do you observe?

Exercise 5.- Search for BRHs between the predicted proteins and A. fumigatus.

We will search for best reciprocal hits (BRHs) between two closely related genomes, A. nidulans and A. fumigatus. The first step is to perform the blast searches.

1.- Format the two files provided: EMENI.fa and ASPFU.fa

2.- Now perform the two searches, EMENI.fa against ASPFU.fa and ASPFU.fa against EMENI.fa (use the -m 8 option so that the files are easy to parse). A possible set of commands is sketched below.
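
A sketch following the blastall syntax from exercise 1 (the e-value cutoff here is an arbitrary choice):

formatdb -p T -i EMENI.fa
formatdb -p T -i ASPFU.fa
blastall -p blastp -i EMENI.fa -d ASPFU.fa -m 8 -b 1 -e 0.001 -o EMENI2ASPFU.blast
blastall -p blastp -i ASPFU.fa -d EMENI.fa -m 8 -b 1 -e 0.001 -o ASPFU2EMENI.blast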

3.- Process the results with the script get_BRH.py:

python get_BRH.py -s1 EMENI.fa -s2 ASPFU.fa -h1 EMENI2ASPFU.blast -h2 ASPFU2EMENI.blast -o BRH.ASPFU.txt

4.- Look at the results. How many BRHs did you find?

5.- Now repeat the process with the yeast proteome. What did you find? Which species is better for annotating your genome?

Exercise 6.- Perform an MCL clustering between the proteins obtained in exercise 4 (EMENI.fa) and other Aspergillus proteins found in the file Aspergillus.fa.

1.- You need the results of an all-against-all blast search. You can either perform two blast searches as before or, alternatively, pool all the proteins into a single file and perform a blast search of this file against itself.

2.- Perform the mcl clustering:

mcl fileName --abc -o outfileName -I inflation_value

Where the inflation value affects the cluster granularity: basically, it makes the clusters larger or smaller. The smaller the value, the larger the groups will be. The value ranges from 1.2 to 5.0. A sketch of the whole procedure is shown below.
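
A sketch of the whole procedure; mcl's --abc mode expects three columns (label, label, weight), which we can derive from the -m 8 output using the bit score (column 12) as the similarity weight (file names here are just examples):

cat EMENI.fa Aspergillus.fa > all.fa
formatdb -p T -i all.fa
blastall -p blastp -i all.fa -d all.fa -m 8 -e 0.001 -o all2all.blast
cut -f 1,2,12 all2all.blast > all2all.abc   # query, target, bit score
mcl all2all.abc --abc -o all2all.mcl -I 2.0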

3.- Perform the clustering with different inflation values: 1.2, 2.0 and 5.0. Have a look at the results. Are there differences?

Exercise 7: Reconstruct phylogenetic trees for each of the predicted proteins.

The phylogenetic reconstruction process can be performed in three steps: homology search, alignment reconstruction and phylogenetic reconstruction. For this exercise we assume that the homology search has already been performed and that each protein found in ASPNI sits in a separate file containing its sequence and the sequences of its homologs.

1.- Perform a multiple sequence alignment using muscle:

muscle -in fileName -out fileName.alg

2.- Trim the alignment using trimal:

trimal -in fileName.alg -out fileName.trim -gt 0.1

Note that trimal is able to change the format of the alignment, which is very useful if we want to build the phylogenetic tree with other programs such as PhyML or RAxML.

3.- Reconstruct the phylogenetic tree using fasttree:

fasttree fileName.trim >fileName.tree
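
To run all three steps over every homolog file, a small loop helps (a sketch; assumes the per-protein files end in .fa):

# align, trim and build a tree for each homolog set
for f in *.fa; do
    muscle -in "$f" -out "$f.alg"
    trimal -in "$f.alg" -out "$f.trim" -gt 0.1
    fasttree "$f.trim" > "$f.tree"
done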

mmarcet Tue, 09/05/2017 - 15:53

Team-working: Moving forward and enjoying science together

Team-working: Moving forward and enjoying science together ptheodorakis Mon, 09/04/2017 - 13:08
slides

Day 5

Day 5 lpryszcz Wed, 07/12/2017 - 17:23

Day 6

Day 6 lpryszcz Wed, 07/12/2017 - 17:23

Day 7

Day 7

Closing remarks & departures. 

lpryszcz Wed, 07/12/2017 - 17:23

GUIDELINES FOR SPEAKERS

GUIDELINES FOR SPEAKERS

Preparing your workshop

1. Workshop slots are typically 4 hours; this should include a theoretical introduction (often 30 min will be more than enough) and practical exercises (here, the more the better ;) ). 
2. Each student will have a laptop with Ubuntu installed. Please prepare your exercises so they can run even on older laptops in a reasonable amount of time, e.g. for the de novo assembly workshop I'm using a 100 kb region of one chromosome. 
3. Please add the software you will need in your workshop to this list. We'll cover installation of all dependencies on Day 0. 
4. Let us know if you need some help during your workshop. 

Creating workshop materials

Adding new content

1. Navigate to the book page under which you want to create content, e.g. under Day 3.
2. Press `Add child page`.
3. Edit the content of the new page.
4. Optionally upload slides.
5. Press `Save and publish`.

Adding nicely formatted source code / snippets of code

1. Change `Text format` to `Full HTML`.
2. Press the `Insert code snippet` button.
3. Select the syntax language.
4. Paste your code & press OK.

Quick Edit: In-place editing of content

Drupal offers Quick Edit, meaning you can edit any content and see its effect in the current browser window. I strongly recommend using it, as it's very handy. You can access it by clicking the small pencil symbol in the top right corner of any content and selecting `Quick edit`.

Data upload

Please upload all data related to your workshops to the #NGSchool2017 GitHub repository: https://github.com/NGSchoolEU/NGSchool2017-materials

lpryszcz Wed, 07/12/2017 - 17:33