#NGSchool2016 Workshops

Materials in this book are reproduced as internal material for participants of the Summer School in Bioinformatics & NGS Data Analysis (#NGSchool2016). If you want to use any of the materials included here for other purposes, please ask the individual contributors for permission.

lpryszcz Tue, 06/21/2016 - 19:49

Preparation for the NGSchool

The course is open to people without any background in Computational Biology, but everyone should be familiar with the basics of working in the command line (Linux), programming and statistics. Therefore, please complete the mandatory courses before attending the NGSchool. In addition, if you are interested in other aspects, you are welcome to continue with some of the supplementary courses.

In case of any problems, feel free to post in Forum > Preparation.

Mandatory on-line courses

Supplementary courses

Prerequisites

Can I work remotely (SSH) or use VirtualBox? 

We will set up a server locally, so in principle working remotely will be possible, but installing Ubuntu on your laptop is strongly recommended. You will be able to connect to your remote machine (i.e. at work); just remember we'll be in a rural region of the High Tatras: the Internet connection there is stable, but rather slow, especially when ~40 people want to use it at the same time.

Alternatively, you can install Ubuntu in VirtualBox, or install it as a Windows program using Wubi (Windows XP, Vista & 7). Note that these alternatives come with certain limitations, so a standard installation is recommended.

Resources

lpryszcz Wed, 07/13/2016 - 12:03

Day 0: Welcome and get together session

We'll start with a short opening ceremony for NGSchool2016, after which we'll hold short talks (2-3 min) by all participants. This will be a great opportunity to get to know each other better.

lpryszcz Tue, 06/21/2016 - 20:07

Day 1: Introduction to Linux & Bioinformatics

The first day of the course will cover an introduction to Linux & Bioinformatics. During the day we'll also install all the programs that will be needed throughout the course.

    lpryszcz Wed, 06/22/2016 - 10:31

    Introduction to Linux

    rcheplyaka Sat, 07/30/2016 - 19:50
    slides: slides.pdf (50.91 KB)

    Perl one-liners and advanced command-line use (optional self-paced study and exercises)

    WORKING ON THE COMMAND LINE

    1. Useful UNIX commands (for command-line pipes)

    2. Regular expressions

    3. Perl one-liners

    4. Exercises

    5. List of useful Perl one-liners

    =================================

    1. Useful UNIX commands (for command-line pipes)

    Pipes are chains of UNIX shell commands connected with the symbol |. In
        command1 | command2
    output of command1 is directly fed to command2 as its input.

    Use "apropos topic" to see if a command exists for topic.
    Use "man command" for information on its use.

    echo    - display a line of text
    bc    - An arbitrary precision calculator language

        TRY:
            echo '2*3' | bc

    cat    - concatenate files and print on the standard output
    tr    - translate or delete characters

        TRY:
            echo 'I am not a bioinformatician. I am a biologist.' > test
            cat test
            cat test | tr -d '.,;:\/()[]{}'
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n'

    sort    - sort lines of text files
    uniq    - report or omit repeated lines

        TRY:
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort | uniq

        NOTE: uniq is most useful after sort

    head    - output the first part of files
    tail    - output the last part of files
    rev    - reverse lines of a file or files
    shuf    - generate random permutations

        TRY:
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort | uniq | head -1
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort | uniq | tail -3
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort | uniq | rev
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort | uniq | shuf

        NOTE: a useful option for tail is -f, which follows file growth in real time
        NOTE: a useful option for sort is -k, which allows sorting tabulated data by column (see the extra TRY below)
        NOTE: try shuf repeatedly to see the randomization effect in full

        TRY:
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort | uniq | shuf | head -2

        QUESTION: what have we done here? (answer at the end of section)
        NOTE: 'shuf -n 2' does the trick too
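
        TRY (a made-up three-line table, to see sorting by column with sort -k):
            printf 'chr2\t10\nchr1\t300\nchr1\t5\n' | sort -k1,1 -k2,2n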

    grep    - print lines matching a pattern
    egrep    - print lines matching a pattern (accepts extended regular expressions)

        TRY:
            cat test | tr -d '.,;:\/()[]{}' | tr ' ' '\n' | sort | uniq | grep bio

    awk    - pattern scanning and processing language

        TRY:
            awk 'NR%2==0' file
        NOTE: awk is beyond the scope of this workshop; one common use is to extract columns (the 2nd column in this example) from tabulated text files with: awk '{print $2}'

        ANSWERS:
            subsampling lines of a file

    2. Regular expressions

    Regular expressions are patterns (such as those used with grep above) with wildcards and
    other magic that can be used to describe a set of words (any regular language and maybe a bit more).

        TRY:
            apropos perlre
            man perlre
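
        TRY (a couple of toy patterns with egrep; the example strings are made up):
            printf 'cat\ncar\ncart\n' | egrep '^ca[rt]$'    # matches "cat" and "car", but not "cart"
            echo 'I am a biologist' | egrep 'bio(informatician|logist)'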


    3. Perl one-liners

    The command
        perl -p -e 'command'
    causes perl to read in lines from input, execute command on each of them and print the result.

    In Perl, two commands use regular expressions:
        m/pattern/
        s/pattern/replacement/

    They automagically operate on the special variable $_, which always contains the last line read in by Perl; this makes them great for constructing Perl one-liners.
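
        TRY (toy examples; echo stands in for a real input file):
            echo 'I am a biologist' | perl -pe 's/biologist/bioinformatician/'
            echo 'I am a biologist' | perl -ne 'print if m/biolog/'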

    NOTE: shell scripting, awk or Python one-liners can be used in similar ways, but they are not covered here. Perl is a terribly messy programming language, but it is great for one-liners.

        READ: https://www.researchgate.net/publication/286450993_Perl_one-liners_for_bioinformaticians
        NOTE: the PDF above contains some cosmetic errors (e.g. arguments not needed by the command)
        READ: http://www.catonmat.net/blog/introduction-to-perl-one-liners/

    See the list of various one-liners at the end of this document

    4. Exercises

    a. Number lines of a file with number, colon and space like so:
        Original: some text here
        Numbered: 1: some text here

    b. Get some email with headers in a text file (maybe part of your mailbox). Transform it into a file with two columns: the first with the From: e-mail address, the second with the To: address. Such a file defines a graph of the social network in your mailbox and can be readily visualized.
    HINT: perl -pe "s/(.+pattern.+)\n/\$1/" will fuse the line containing pattern with the next line (by deleting end-of-line symbol \n)

    c. Calculate the sum of all numbers contained in a file.

    d. Generate a BLAST file with the results of your favourite search, preferably not in one of the tabular formats. Use a command-line pipeline to transform it into a tab-separated file with the name of the Query in the first column and the names of all hits with E-value better than 0.001 in the second, separated by semicolons. HINT: heavy use of grep and perl -pe 's/abc(.+)xyz//' to get rid of unwanted parts of the file


    5. A list of useful Perl one-liners

    Compiled by Peteris Krumins (peter@catonmat.net, @pkrumins on Twitter)
    http://www.catonmat.net -- good coders code, great reuse

    Latest version of this file is always at:

        http://www.catonmat.net/download/perl1line.txt

    These one-liners work both on UNIX systems and Windows. Most likely your
    UNIX system already has Perl. For Windows get the Strawberry Perl at:

        http://www.strawberryperl.com/

    Table of contents:

        1. File Spacing
        2. Line Numbering
        3. Calculations
        4. String Creation and Array Creation
        5. Text Conversion and Substitution
        6. Selective Printing and Deleting of Certain Lines    
        7. Handy Regular Expressions
        8. Perl tricks


    FILE SPACING
    ------------

    # Double space a file
    perl -pe '$\="\n"'
    perl -pe 'BEGIN { $\="\n" }'
    perl -pe '$_ .= "\n"'
    perl -pe 's/$/\n/'
    perl -nE 'say'

    # Double space a file, except the blank lines
    perl -pe '$_ .= "\n" unless /^$/'
    perl -pe '$_ .= "\n" if /\S/'

    # Triple space a file
    perl -pe '$\="\n\n"'
    perl -pe '$_.="\n\n"'

    # N-space a file
    perl -pe '$_.="\n"x7'

    # Add a blank line before every line
    perl -pe 's//\n/'

    # Remove all blank lines
    perl -ne 'print unless /^$/'
    perl -lne 'print if length'
    perl -ne 'print if /\S/'

    # Remove all consecutive blank lines, leaving just one
    perl -00 -pe ''
    perl -00pe0

    # Compress/expand all blank lines into N consecutive ones
    perl -00 -pe '$_.="\n"x4'

    # Fold a file so that every set of 10 lines becomes one tab-separated line
    perl -lpe '$\ = $. % 10 ? "\t" : "\n"'


    LINE NUMBERING
    --------------

    # Number all lines in a file
    perl -pe '$_ = "$. $_"'

    # Number only non-empty lines in a file
    perl -pe '$_ = ++$a." $_" if /./'

    # Number and print only non-empty lines in a file (drop empty lines)
    perl -ne 'print ++$a." $_" if /./'

    # Number all lines but print line numbers only for non-empty lines
    perl -pe '$_ = "$. $_" if /./'

    # Number only lines that match a pattern, print others unmodified
    perl -pe '$_ = ++$a." $_" if /regex/'

    # Number and print only lines that match a pattern
    perl -ne 'print ++$a." $_" if /regex/'

    # Number all lines, but print line numbers only for lines that match a pattern
    perl -pe '$_ = "$. $_" if /regex/'

    # Number all lines in a file using a custom format (emulate cat -n)
    perl -ne 'printf "%-5d %s", $., $_'

    # Print the total number of lines in a file (emulate wc -l)
    perl -lne 'END { print $. }'
    perl -le 'print $n=()=<>'
    perl -le 'print scalar(()=<>)'
    perl -le 'print scalar(@foo=<>)'
    perl -ne '}{print $.'
    perl -nE '}{say $.'

    # Print the number of non-empty lines in a file
    perl -le 'print scalar(grep{/./}<>)'
    perl -le 'print ~~grep{/./}<>'
    perl -le 'print~~grep/./,<>'
    perl -E 'say~~grep/./,<>'

    # Print the number of empty lines in a file
    perl -lne '$a++ if /^$/; END {print $a+0}'
    perl -le 'print scalar(grep{/^$/}<>)'
    perl -le 'print ~~grep{/^$/}<>'
    perl -E 'say~~grep{/^$/}<>'

    # Print the number of lines in a file that match a pattern (emulate grep -c)
    perl -lne '$a++ if /regex/; END {print $a+0}'
    perl -nE '$a++ if /regex/; END {say $a+0}'


    CALCULATIONS
    ------------

    # Check if a number is a prime
    perl -lne '(1x$_) !~ /^1?$|^(11+?)\1+$/ && print "$_ is prime"'

    # Print the sum of all the fields on a line
    perl -MList::Util=sum -alne 'print sum @F'

    # Print the sum of all the fields on all lines
    perl -MList::Util=sum -alne 'push @S,@F; END { print sum @S }'
    perl -MList::Util=sum -alne '$s += sum @F; END { print $s }'

    # Shuffle all fields on a line
    perl -MList::Util=shuffle -alne 'print "@{[shuffle @F]}"'
    perl -MList::Util=shuffle -alne 'print join " ", shuffle @F'

    # Find the minimum element on a line
    perl -MList::Util=min -alne 'print min @F'

    # Find the minimum element over all the lines
    perl -MList::Util=min -alne '@M = (@M, @F); END { print min @M }'
    perl -MList::Util=min -alne '$min = min @F; $rmin = $min unless defined $rmin && $min > $rmin; END { print $rmin }'

    # Find the maximum element on a line
    perl -MList::Util=max -alne 'print max @F'

    # Find the maximum element over all the lines
    perl -MList::Util=max -alne '@M = (@M, @F); END { print max @M }'

    # Replace each field with its absolute value
    perl -alne 'print "@{[map { abs } @F]}"'

    # Find the total number of fields (words) on each line
    perl -alne 'print scalar @F'

    # Print the total number of fields (words) on each line followed by the line
    perl -alne 'print scalar @F, " $_"'

    # Find the total number of fields (words) on all lines
    perl -alne '$t += @F; END { print $t}'

    # Print the total number of fields that match a pattern
    perl -alne 'map { /regex/ && $t++ } @F; END { print $t }'
    perl -alne '$t += /regex/ for @F; END { print $t }'
    perl -alne '$t += grep /regex/, @F; END { print $t }'

    # Print the total number of lines that match a pattern
    perl -lne '/regex/ && $t++; END { print $t }'

    # Print the number PI to n decimal places
    perl -Mbignum=bpi -le 'print bpi(n)'

    # Print the number PI to 39 decimal places
    perl -Mbignum=PI -le 'print PI'

    # Print the number E to n decimal places
    perl -Mbignum=bexp -le 'print bexp(1,n+1)'

    # Print the number E to 39 decimal places
    perl -Mbignum=e -le 'print e'

    # Print UNIX time (seconds since Jan 1, 1970, 00:00:00 UTC)
    perl -le 'print time'

    # Print GMT (Greenwich Mean Time) and local computer time
    perl -le 'print scalar gmtime'
    perl -le 'print scalar localtime'

    # Print local computer time in H:M:S format
    perl -le 'print join ":", (localtime)[2,1,0]'

    # Print yesterday's date
    perl -MPOSIX -le '@now = localtime; $now[3] -= 1; print scalar localtime mktime @now'

    # Print date 14 months, 9 days and 7 seconds ago
    perl -MPOSIX -le '@now = localtime; $now[0] -= 7; $now[4] -= 14; $now[3] -= 9; print scalar localtime mktime @now'

    # Prepend timestamps to stdout (GMT, localtime)
    tail -f logfile | perl -ne 'print scalar gmtime," ",$_'
    tail -f logfile | perl -ne 'print scalar localtime," ",$_'

    # Calculate factorial of 5
    perl -MMath::BigInt -le 'print Math::BigInt->new(5)->bfac()'
    perl -le '$f = 1; $f *= $_ for 1..5; print $f'

    # Calculate greatest common divisor (GCD)
    perl -MMath::BigInt=bgcd -le 'print bgcd(@list_of_numbers)'

    # Calculate GCD of numbers 20 and 35 using Euclid's algorithm
    perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'

    # Calculate least common multiple (LCM) of numbers 35, 20 and 8
    perl -MMath::BigInt=blcm -le 'print blcm(35,20,8)'

    # Calculate LCM of 20 and 35 using Euclid's formula: n*m/gcd(n,m)
    perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'

    # Generate 10 random numbers between 5 and 15 (excluding 15)
    perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'

    # Find and print all permutations of a list
    perl -MAlgorithm::Permute -le '$l = [1,2,3,4,5]; $p = Algorithm::Permute->new($l); print @r while @r = $p->next'

    # Generate the power set
    perl -MList::PowerSet=powerset -le '@l = (1,2,3,4,5); for (@{powerset(@l)}) { print "@$_" }'

    # Convert an IP address to unsigned integer
    perl -le '$i=3; $u += ($_<<8*$i--) for "127.0.0.1" =~ /(\d+)/g; print $u'
    perl -le '$ip="127.0.0.1"; $ip =~ s/(\d+)\.?/sprintf("%02x", $1)/ge; print hex($ip)'
    perl -le 'print unpack("N", 127.0.0.1)'
    perl -MSocket -le 'print unpack("N", inet_aton("127.0.0.1"))'

    # Convert an unsigned integer to an IP address
    perl -MSocket -le 'print inet_ntoa(pack("N", 2130706433))'
    perl -le '$ip = 2130706433; print join ".", map { (($ip>>8*($_))&0xFF) } reverse 0..3'
    perl -le '$ip = 2130706433; $, = "."; print map { (($ip>>8*($_))&0xFF) } reverse 0..3'


    STRING CREATION AND ARRAY CREATION
    ----------------------------------

    # Generate and print the alphabet
    perl -le 'print a..z'
    perl -le 'print ("a".."z")'
    perl -le '$, = ","; print ("a".."z")'
    perl -le 'print join ",", ("a".."z")'

    # Generate and print all the strings from "a" to "zz"
    perl -le 'print ("a".."zz")'
    perl -le 'print "aa".."zz"'

    # Create a hex lookup table
    @hex = (0..9, "a".."f")

    # Convert a decimal number to hex using @hex lookup table
    perl -le '$num = 255; @hex = (0..9, "a".."f"); while ($num) { $s = $hex[($num%16)&15].$s; $num = int $num/16 } print $s'
    perl -le '$hex = sprintf("%x", 255); print $hex'
    perl -le '$num = "ff"; print hex $num'

    # Generate a random 8 character password
    perl -le 'print map { ("a".."z")[rand 26] } 1..8'
    perl -le 'print map { ("a".."z", 0..9)[rand 36] } 1..8'

    # Create a string of specific length
    perl -le 'print "a"x50'

    # Create a repeated list of elements
    perl -le '@list = (1,2)x20; print "@list"'

    # Create an array from a string
    @months = split ' ', "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
    @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/

    # Create a string from an array
    @stuff = ("hello", 0..9, "world"); $string = join '-', @stuff

    # Find the numeric values for characters in the string
    perl -le 'print join ", ", map { ord } split //, "hello world"'

    # Convert a list of numeric ASCII values into a string
    perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)'
    perl -le '@ascii = (99, 111, 100, 105, 110, 103); print map { chr } @ascii'

    # Generate an array with odd numbers from 1 to 100
    perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"'
    perl -le '@odd = grep { $_ & 1 } 1..100; print "@odd"'

    # Generate an array with even numbers from 1 to 100
    perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"'

    # Find the length of the string
    perl -le 'print length "one-liners are great"'

    # Find the number of elements in an array
    perl -le '@array = ("a".."z"); print scalar @array'
    perl -le '@array = ("a".."z"); print $#array + 1'


    TEXT CONVERSION AND SUBSTITUTION
    --------------------------------

    # ROT13 a string
    'y/A-Za-z/N-ZA-Mn-za-m/'

    # ROT 13 a file
    perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file

    # Base64 encode a string
    perl -MMIME::Base64 -e 'print encode_base64("string")'
    perl -MMIME::Base64 -0777 -ne 'print encode_base64($_)' file

    # Base64 decode a string
    perl -MMIME::Base64 -le 'print decode_base64("base64string")'
    perl -MMIME::Base64 -ne 'print decode_base64($_)' file

    # URL-escape a string
    perl -MURI::Escape -le 'print uri_escape($string)'

    # URL-unescape a string
    perl -MURI::Escape -le 'print uri_unescape($string)'

    # HTML-encode a string
    perl -MHTML::Entities -le 'print encode_entities($string)'

    # HTML-decode a string
    perl -MHTML::Entities -le 'print decode_entities($string)'

    # Convert all text to uppercase
    perl -nle 'print uc'
    perl -ple '$_=uc'
    perl -nle 'print "\U$_"'

    # Convert all text to lowercase
    perl -nle 'print lc'
    perl -ple '$_=lc'
    perl -nle 'print "\L$_"'

    # Uppercase only the first word of each line
    perl -nle 'print ucfirst lc'
    perl -nle 'print "\u\L$_"'

    # Invert the letter case
    perl -ple 'y/A-Za-z/a-zA-Z/'

    # Camel case each line
    perl -ple 's/(\w+)/\u$1/g'
    perl -ple 's/(?<![\x27])(\w+)/\u\1/g'

    # Strip leading whitespace (spaces, tabs) from the beginning of each line
    perl -ple 's/^[ \t]+//'
    perl -ple 's/^\s+//'

    # Strip trailing whitespace (space, tabs) from the end of each line
    perl -ple 's/[ \t]+$//'

    # Strip whitespace from the beginning and end of each line
    perl -ple 's/^[ \t]+|[ \t]+$//g'

    # Convert UNIX newlines to DOS/Windows newlines
    perl -pe 's|\n|\r\n|'

    # Convert DOS/Windows newlines to UNIX newlines
    perl -pe 's|\r\n|\n|'

    # Convert UNIX newlines to Mac newlines
    perl -pe 's|\n|\r|'

    # Substitute (find and replace) "foo" with "bar" on each line
    perl -pe 's/foo/bar/'

    # Substitute (find and replace) all "foo"s with "bar" on each line
    perl -pe 's/foo/bar/g'

    # Substitute (find and replace) "foo" with "bar" on lines that match "baz"
    perl -pe '/baz/ && s/foo/bar/'

    # Binary patch a file (find and replace a given array of bytes as hex numbers)
    perl -pi -e 's/\x89\xD8\x48\x8B/\x90\x90\x48\x8B/g' file


    SELECTIVE PRINTING AND DELETING OF CERTAIN LINES
    ------------------------------------------------

    # Print the first line of a file (emulate head -1)
    perl -ne 'print; exit'

    # Print the first 10 lines of a file (emulate head -10)
    perl -ne 'print if $. <= 10'
    perl -ne '$. <= 10 && print'
    perl -ne 'print if 1..10'

    # Print the last line of a file (emulate tail -1)
    perl -ne '$last = $_; END { print $last }'
    perl -ne 'print if eof'

    # Print the last 10 lines of a file (emulate tail -10)
    perl -ne 'push @a, $_; @a = @a[@a-10..$#a]; END { print @a }'

    # Print only lines that match a regular expression
    perl -ne '/regex/ && print'

    # Print only lines that do not match a regular expression
    perl -ne '!/regex/ && print'

    # Print the line before a line that matches a regular expression
    perl -ne '/regex/ && $last && print $last; $last = $_'

    # Print the line after a line that matches a regular expression
    perl -ne 'if ($p) { print; $p = 0 } $p++ if /regex/'

    # Print lines that match regex AAA and regex BBB in any order
    perl -ne '/AAA/ && /BBB/ && print'

    # Print lines that don't match regexes AAA and BBB
    perl -ne '!/AAA/ && !/BBB/ && print'

    # Print lines that match regex AAA followed by regex BBB followed by CCC
    perl -ne '/AAA.*BBB.*CCC/ && print'

    # Print lines that are 80 chars or longer
    perl -ne 'print if length >= 80'

    # Print lines that are less than 80 chars in length
    perl -ne 'print if length < 80'

    # Print only line 13
    perl -ne '$. == 13 && print && exit'

    # Print all lines except line 27
    perl -ne '$. != 27 && print'
    perl -ne 'print if $. != 27'

    # Print only lines 13, 19 and 67
    perl -ne 'print if $. == 13 || $. == 19 || $. == 67'
    perl -ne 'print if int($.) ~~ (13, 19, 67)'

    # Print all lines between two regexes (including lines that match regex)
    perl -ne 'print if /regex1/../regex2/'

    # Print all lines from line 17 to line 30
    perl -ne 'print if $. >= 17 && $. <= 30'
    perl -ne 'print if int($.) ~~ (17..30)'
    perl -ne 'print if grep { $_ == $. } 17..30'

    # Print the longest line
    perl -ne '$l = $_ if length($_) > length($l); END { print $l }'

    # Print the shortest line
    perl -ne '$s = $_ if $. == 1; $s = $_ if length($_) < length($s); END { print $s }'

    # Print all lines that contain a number
    perl -ne 'print if /\d/'

    # Find all lines that contain only a number
    perl -ne 'print if /^\d+$/'

    # Print all lines that contain only characters
    perl -ne 'print if /^[[:alpha:]]+$/'

    # Print every second line
    perl -ne 'print if $. % 2'

    # Print every second line, starting with the second line
    perl -ne 'print if $. % 2 == 0'

    # Print all lines that repeat
    perl -ne 'print if ++$a{$_} == 2'

    # Print all unique lines
    perl -ne 'print unless $a{$_}++'

    # Print the first field (word) of every line (emulate cut -f 1 -d ' ')
    perl -alne 'print $F[0]'

    OTHER
    -----

    # just lines 15 to 17
       perl -ne 'print if 15 .. 17' file

    # find palindromes
       perl -lne 'print if $_ eq reverse' /usr/dict/words

    # increment all numbers found in these files
       perl -i.tiny -pe 's/(\d+)/ 1 + $1 /ge' file1 file2

    # command-line that shows each line with its characters backwards
       perl -nle 'print scalar reverse $_' file1 file2 file3

    # binary edit (careful!) -- in-place, character-for-character editing of binary files
       perl -i.bak -pe 's/Mozilla/Slopoke/g' /usr/local/bin/netscape

     

    mlexa Sat, 08/13/2016 - 12:13

    Day 2: De novo genome assembly & annotation

    On the second day, we will cover de novo genome / transcriptome assembly and functional genome annotation.

    lpryszcz Wed, 06/22/2016 - 10:32

    De novo genome / transcriptome assembly

    De novo genome assembly

    # copy datasets
    rsync -avz ngschool.local:/ngschool/de_novo_assembly .
    
    # enter directory
    cd de_novo_assembly
    
    # !! IMPORT BASH VARIABLES IN EVERY NEW WINDOW !!
    source /ngschool/.bashrc

    Datasets

    # list directory content
    ls -lah . */
    1. fastq/                  FastQ files
      1. genomics
        1. 600_1/2.fq.gz       paired-end; 2x 100bp @ 600bp insert; 100X
        2. 5000_1/2.fq.gz      mate pairs; 2x 50bp @ 5000bp insert; 10X
      2. transcriptomics
        1. rnaseq.fq.gz        single-end; stranded reads from zebrafish
    2. ref/                    reference sequence from C. parapsilosis CDC317 (100 kb)
      1. ref.fa

    Quality filtering

    Check quality of your reads

    # FastQC can be executed with graphical interface
    fastqc 
    # and select FastQ reads                     File >> Open… 
    
    # or use bash mode to process individuals files
    fastqc fastq/600_1.fq.gz
    
    # or all files at once
    fastqc fastq/*.fq.gz
    ## *.fq.gz is a shell glob pattern: * matches any number of any characters,
    ## so passing *.fq.gz to a program will process all files that end with `.fq.gz`

    If you have a multi-core processor and numerous FastQ files to process, you can execute FastQC on multiple cores, e.g. `fastqc -t 4` will run 4 processes. To see all FastQC parameters, execute `fastqc -h`.
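
    For example, a quick sketch assuming four available cores:

    fastqc -t 4 fastq/*.fq.gz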

    Finally, you can see the FastQC report(s) using a web browser: 

    firefox fastq/600_1.fq_fastqc.html &
    
    # or to see all reports
    firefox fastq/*_fastqc.html &
    
    ###
    # older versions of FastQC may create subdirectory for each FastQ file, then do: 
    firefox fastq/600_1.fq_fastqc/fastqc_report.html &
    
    # or to see all reports
    firefox fastq/*_fastqc/fastqc_report.html &

    Explore quality statistics for sequenced reads

    Have a look at the good and bad data examples. 
    Have you noticed decreasing quality toward the read ends? 
    Do you know why this happens?

    Trim / remove reads with poor quality bases

    filterReads.py -pHv -q 20 -l 31 -o fastq/600 -i fastq/600_?.fq.gz
    ## you can check all parameters using `filterReads.py -h`

    De novo genome assembly

    Note: you can open a new terminal window and run htop in order to monitor the processor and memory consumption of each process. In addition, you can add `time ` in front of each command to see how much CPU time each assembler used. 
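
    For example (any of the commands in this tutorial can be timed the same way):

    # in a second terminal window
    htop
    # prefix a command with `time` to record its runtime
    time fastqc fastq/600_1.fq.gz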

    SPAdes

    # prepare & enter working directory
    mkdir spades
    cd spades
    
    # assemble using paired-end (PE) reads
    spades.py --only-assembler -o 600 -1 ../fastq/600_1.fq.gz -2 ../fastq/600_2.fq.gz
    ## you can check all parameters using `spades.py -h`
    ## SPAdes offers a read-correction algorithm but, besides being very resource-demanding, it usually brings little improvement to the final assembly, so we disable read correction with the --only-assembler parameter
    
    # assemble using PE & mate pairs (MP)
    spades.py --only-assembler -o 600_5000 --pe1-1 ../fastq/600_1.fq.gz --pe1-2 ../fastq/600_2.fq.gz --pe2-1 ../fastq/5000_1.fq.gz --pe2-2 ../fastq/5000_2.fq.gz

    Note: in our examples both PE and MP reads are in FR orientation, for simplicity. Normally MP reads are in RF orientation; in that case you need to pass them to SPAdes with `--mp1-1 ../fastq/5000_1.fq.gz --mp1-2 ../fastq/5000_2.fq.gz`, as sketched below.
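
    A sketch of such a run (the only change from the PE+MP command above is the use of the --mp flags; the output directory name is arbitrary):

    spades.py --only-assembler -o 600_5000_mp --pe1-1 ../fastq/600_1.fq.gz --pe1-2 ../fastq/600_2.fq.gz --mp1-1 ../fastq/5000_1.fq.gz --mp1-2 ../fastq/5000_2.fq.gz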

    Ray

    # prepare & enter working directory
    cd ..
    mkdir ray
    cd ray
    
    # assemble using PE reads
    Ray -o 600 -p ../fastq/600_?.fq.gz

    IDBA

    # prepare & enter working directory
    cd ..
    mkdir idba
    cd idba
    
    # convert FastQ to shuffled FastA
    fastq2shuffledFasta.py ../fastq/600_?.fq.gz > 600.fa
    
    # assemble using PE reads
    idba --num_threads 2 -o 600 -r 600.fa

    Platanus

    cd ..
    mkdir platanus
    cd platanus
    
    platanus assemble -m 1 -f <(zcat ../fastq/600_1.fq.gz) <(zcat ../fastq/600_2.fq.gz) -t 2 -o 600
    platanus scaffold -c 600_contig.fa -b 600_contigBubble.fa -IP <(zcat ../fastq/600_1.fq.gz) <(zcat ../fastq/600_2.fq.gz) -t 2 -o 600
    platanus gap_close -c 600_scaffold.fa -IP <(zcat ../fastq/600_1.fq.gz) <(zcat ../fastq/600_2.fq.gz) -t 2 -o 600

    Get basic assembly statistics

    cd ../
    fasta_stats.py -i spades/600*/scaffolds.fasta ray/600/Scaffolds.fasta idba/600/scaffold.fa platanus/600_scaffold.fa

    Do you understand the metrics, e.g. N50 and N90?

    Which program was the fastest and used the least memory? And which was the slowest / most memory-hungry? 

    Dealing with heterozygous genomes / hybrids

    # prepare & enter working directory
    mkdir hetero
    cd hetero

    dipSPAdes

    # assemble using paired-end (PE) reads
    dipspades.py --only-assembler -o 600 -1 ../fastq/600_1.fq.gz -2 ../fastq/600_2.fq.gz 
    
    # assemble using PE & mate pairs (MP)
    dipspades.py --only-assembler -o 600_5000 --pe1-1 ../fastq/600_1.fq.gz --pe1-2 ../fastq/600_2.fq.gz --pe2-1 ../fastq/5000_1.fq.gz --pe2-2 ../fastq/5000_2.fq.gz

    Redundans

    # improve SPAdes assembly using PE and MP libraries
    /ngschool/src/redundans/redundans.py -v -i ../fastq/600_?.fq.gz ../fastq/5000_?.fq.gz -f ../spades/600/contigs.fasta -o redundans --sspacebin $SSPACEBIN
    /ngschool/src/redundans/redundans.py -v -i ../fastq/600_?.fq.gz ../fastq/5000_?.fq.gz -f ../spades/600_5000/contigs.fasta -o redundans2 --sspacebin $SSPACEBIN
    

    Note: Redundans may be installed in a different path on your system (e.g. ~/src/redundans/redundans.py), so you may have to adapt the command accordingly. To see all options, either run `~/src/redundans/redundans.py -h` or visit https://github.com/lpryszcz/redundans. You can play with other assemblies (e.g. Ray or IDBA) or skip the PE or MP libraries to see the effect of such changes. Visit https://github.com/lpryszcz/redundans/tree/master/test#run-statistics for more info about the process.

    Can you see which assembly statistics are improved by each step of the heterozygous genome assembly pipeline? 

    # get statistics
    fasta_stats.py -i 600*/dipspades/consensus_contigs.fasta redundans*/scaffolds.filled.fa

    Can you see the differences between the programs?
     

    Compare assemblies using whole genome alignment

    You can assess the accuracy of the assembled contigs by aligning them back onto the reference sequence. This is only possible when you have a reference genome (or the genome of some closely related species) available. 

    cd ../
    
    # index reference genome (needs to be done only once per reference)
    lastdb ref/ref.fa ref/ref.fa
    
    # align & generate dotplot on the fly
    lastal -f TAB ref/ref.fa spades/600/scaffolds.fasta | last-dotplot - spades.png
    # open plot
    eog spades.png &
    
    # align & generate dotplot on the fly
    lastal -f TAB ref/ref.fa hetero/600_5000/dipspades/consensus_contigs.fasta | last-dotplot - dipspades.png
    # open plot
    eog dipspades.png &
    
    # align & generate dotplot on the fly
    lastal -f TAB ref/ref.fa hetero/redundans/scaffolds.filled.fa | last-dotplot - redundans.png
    # open plot
    eog redundans.png &

     

    Compare assemblies using Quast

    Quast evaluates genome assemblies and reports several metrics. 

    quast.py -R ref/ref.fa hetero/600/spades/scaffolds.fasta hetero/600/dipspades/consensus_contigs.fasta hetero/redundans/contigs.fa hetero/redundans/scaffolds.filled.fa
    
    # open the results
    firefox quast_results/latest/report.html &

    Explore the results; you can have a look at the contig browser (press `View in Icarus contig browser`). 

    De novo transcriptome assembly

    Quality filtering

    Check quality of your reads

    # FastQC can be executed with graphical interface
    fastqc 
    # and select FastQ reads                     File >> Open… 
    
    # or use bash mode 
    fastqc fastq/rnaseq.fq.gz


    Explore quality statistics for sequenced reads

    Have a look at the good and bad data examples. 
    Have you noticed decreasing quality toward the read ends? 
    Do you know why this happens?

    Trim / remove reads with poor quality bases

    filterReads.py -Hv -q 20 -l 31 -o fastq/rnaseq -i fastq/rnaseq.fq.gz
    ## you can check all parameters using `filterReads.py -h`

    Run Trinity

    # check all options with `Trinity -h`
    
    # run Trinity
    Trinity --seqType fq --output trinity --single fastq/rnaseq.fq.gz --max_memory 1G --SS_lib_type F
    ## important parameter: --genome_guided_bam, if you have a genome assembled and want to use it for guided transcript discovery (see the sketch after this block)
    
    # get statistics
    fasta_stats.py -i trinity/Trinity.fasta
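
    A minimal sketch of genome-guided mode (the BAM path is hypothetical; it must be a coordinate-sorted alignment of the RNA-seq reads to your genome assembly, and the output directory name is arbitrary):

    Trinity --genome_guided_bam aligned.sorted.bam --genome_guided_max_intron 10000 --max_memory 1G --CPU 1 --output trinity_gg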

    Predict Open Reading Frames (ORFs)

    cd trinity
    
    # run transdecoder
    TransDecoder.LongOrfs -t Trinity.fasta
    ## important parameters are -G (genetic code) and -S (stranded RNAseq)
    
    # get transcripts stats
    grep ">" Trinity.fasta.transdecoder_dir/longest_orfs.pep | cut -f2 -d" " | sort | uniq -c
         10 type:3prime_partial
         27 type:5prime_partial
         10 type:complete
         69 type:internal
    
    # get homologs or protein domains from PFAM (not covered by the tutorial)
    hmmscan --cpu 1 --domtblout pfam.domtblout /opt/interproscan-5.19-58.0/data/pfam/29.0/pfam_a.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.out
    # note, you will need to download PFAM db and set proper /path/to/Pfam-A.hmm
    
    # get final transcripts / peptides
    TransDecoder.Predict -t Trinity.fasta --retain_pfam_hits pfam.domtblout 
    
    ## optionally add blastP hits with --retain_blastp_hits blastp.outfmt6
    

     

    lpryszcz Wed, 06/22/2016 - 19:40

    Gene prediction and functional annotation

    Once we have a genome, we may want to predict the genes encoded in it. This session will deal with tools used to predict genes, followed by some tools used to predict the function of those genes.

    mmarcet Tue, 07/26/2016 - 16:54

    Exercises for gene annotation

    For exercises 1 and 2 we provide a piece of DNA from the prokaryotic species Actinomyces johnsonii, named contigA.fasta.

     

    Exercise 1.- Annotate a prokaryotic genome with a homology-based method (BLAST)

    BLAST is probably one of the most useful and widely used tools in genomics, used constantly by bioinformaticians and wet-lab scientists alike. It is normally available through websites such as NCBI or UniProt. The problem is that a newly sequenced genome will likely not be present on any website. While BLAST is not very useful as a genome annotation tool on its own, we will still have a look at how to use it.

    1.- The first step consists of converting our genome into a BLAST database. In order to do that you can use the following command:

    formatdb -p F -i contigA.fasta

    The -p F option indicates that our database contains nucleotide sequences. For a protein database either use -p T or omit the option (protein is the default).
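
    For instance, formatting a protein FASTA instead would look like this (shown purely as an illustration; the query file used below does not need to be formatted):

    formatdb -p T -i protein_list.fasta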

    2.- To perform a blast search we need at least one gene or protein to use as query. In this case we provide the proteome of a closely related species: Actinomyces oris. The proteome can be found in the file protein_list.fasta

    3.- Now you can run the blast search using the following command:

    blastall -p tblastn -i protein_list.fasta -d contigA.fasta -m 8 -b 1 -e 0.01 -o contigA.blast

    Where:

    -i indicates the list of proteins

    -d indicates the genome database

    -p is the kind of blast program we're going to use

    -m 8 indicates the output format

    -b 1 indicates we only want one result per protein

    -e is the e-value threshold

    -o is the output file

    Open the file and have a look at the results. Why do you think this method is not useful for gene annotation? What do you think it could be useful for?

     

    Exercise 2.- Annotate a prokaryotic genome using Glimmer3.

    Glimmer3 is one of the most widely used gene annotation programs for prokaryotes. There are different versions for the annotation of eukaryotes, metagenomes and others; for more information about those you can check out this website: https://ccb.jhu.edu/software/glimmer/. In addition, there is an on-line version that can be used: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

    We are going to use glimmer to annotate the piece of DNA found in contigA.

    Ab-initio programs use a parameters file to make gene predictions. The web version of Glimmer only runs the program with standard prokaryotic parameters; ideally we want to give it a personalized set. In order to do that we first need a set of sequences to train the model. Glimmer allows us to make a first ORF prediction and use that as input for the training program.

    1.- Obtain the long ORFs from contigA:

    tigr-glimmer long-orfs contigA.fasta contigA.longOrfs

    Open the file and have a look at the results.

    2.- Now you need to extract the nucleotide sequence from the predicted ORFs.

    tigr-glimmer extract contigA.fasta contigA.longOrfs >contigA.longOrfs.fasta

    As you can see, the contigA.longOrfs.fasta file now contains the sequences. You can count how many sequences were found using this command:

    grep -c ">" contigA.longOrfs.fasta

    3.- Use the predicted ORFs to create the parameter file that Glimmer needs to work:

    cat contigA.longOrfs.fasta | tigr-glimmer build-icm contigA.icm

    4.- And finally make the glimmer prediction

    tigr-glimmer glimmer3 contigA.fasta contigA.icm contigA.results

    This command will output two files: contigA.results.detail and contigA.results.predict. The final coordinates of the prediction can be found in the .predict file. The .detail file shows all the genes that were considered, whether or not they made it into the final annotation.

    5.- Extract the ORFs obtained in the final prediction as shown above. How many genes have you predicted? Compare them to the long ORFs obtained in step 1.

    6.- What would happen if we used a different species to build the parameters file? Use the fasta file containing 500 proteins of Streptococcus suis (seqs.Streptococcus.fasta) to obtain a second set of gene predictions. Compare the two of them.

     

    Exercise 3.- Homology based annotation

    contigB.fasta contains a fragment of chromosome VII of the fungus Aspergillus nidulans (also called Emericella nidulans). We are going to use it to have a look at a homology-based annotation method, using exonerate (http://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate).

    To run an annotation method based on homology we first need a set of proteins to search for. These are already provided in the file protein_list.fa

    1.- Execute exonerate to check which options are available.

    exonerate --help

    2.- Choose those options needed to run exonerate considering that the file contigB.fasta contains a genome.

    3.- Have a look at the results. How many proteins do you have?

    Can you play a bit with the outputs in order to obtain a clearer picture of what is going on?

    4.- Repeat the analysis asking that just the best protein is annotated.

     

    Exercise 4.- Annotate a eukaryotic genome with Augustus.

    We are going to use the same piece of DNA to predict proteins using an ab-initio program (Augustus).

    1.- As a reminder, contigB.fasta contains a fragment of chromosome VII of the fungus Aspergillus nidulans (Emericella nidulans). First check whether this species is already available in Augustus.

    To see the list of species for which trained parameters are available, you can use the --species=help option.
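
    For instance:

    augustus --species=help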

    2.- Once you have verified that it is, you can check the different options that Augustus provides. We can run Augustus like this:

    augustus [parameters] --species=aspergillus_nidulans contigB.fasta

    Where parameters are any of the choices the program offers.

    3.- Run Augustus to obtain proteins with a complete gene model, with the output including the gff3 (check the Augustus help for the proper options); one possible invocation is sketched below.
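
    One possible invocation (a sketch; the output file name is arbitrary):

    augustus --species=aspergillus_nidulans --genemodel=complete --gff3=on contigB.fasta > contigB.aug.gff3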

    4.- In order to extract the protein sequences we can use a script provided with Augustus called getAnnoFasta.pl, which you can find in your working folder. Notice that this script is also able to provide the CDS if Augustus has been called with the --codingseq=on option. See the sketch below.
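
    A sketch, assuming the Augustus output file from the previous step (you may need to call the script via perl or ./ depending on file permissions):

    getAnnoFasta.pl contigB.aug.gff3
    # the protein sequences are written to a .aa FASTA file next to the input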

    5.- What happens if you use a different species to define your parameters? Repeat the augustus prediction using the parameters of Aspergillus fumigatus and of tomato. In both cases extract the protein sequences using getAnnoFasta.pl and save them in separate files.

    6.- Compare the three results. What do you observe? You can perform a blast search of the three predictions against a small protein database, which is located in the main exercises folder and called proteins.uniprot.fa. What do you observe when you perform the blast search of the proteins predicted using the tomato parameters?

     

    Exercise 5.- Search for BRHs (best reciprocal hits) between the predicted proteins and A. fumigatus.

    We will search for BRHs between two closely related genomes, A. nidulans and A. fumigatus. The first step is to perform the BLAST searches.

    1.- Format the two files provided: EMENI.fa and ASPFU.fa

    2.- Now perform the two searches, EMENI.fa to ASPFU.fa and ASPFU.fa to EMENI.fa (see the sketch below).
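
    A sketch of steps 1 and 2, reusing the blastall options from exercise 1 (the e-value threshold and output file names are assumptions; adjust as needed):

    formatdb -p T -i EMENI.fa
    formatdb -p T -i ASPFU.fa
    blastall -p blastp -i EMENI.fa -d ASPFU.fa -m 8 -b 1 -e 0.001 -o EMENI2ASPFU.blast
    blastall -p blastp -i ASPFU.fa -d EMENI.fa -m 8 -b 1 -e 0.001 -o ASPFU2EMENI.blast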

    3.- Process the results with the script get_BRH.py.

     

    python get_BRH.py -s1 EMENI.fa -s2 ASPFU.fa -h1 EMENI2ASPFU.blast -h2 ASPFU2EMENI.blast -o BRH.ASPFU.txt

     

    4.- Look at the results. How many BRH did you find?

    5.- Now repeat the process with the yeast proteome. What did you find? Which species is better for annotating your genome?

     

    Exercise 6.- Perform an MCL clustering of the proteins obtained in exercise 4 (EMENI.fa) and other Aspergillus proteins found in the file Aspergillus.fa

    1.- You need the results of an all-against-all BLAST search; therefore you'll have to perform two BLAST searches as before, or alternatively you can pool all the proteins into a single file and perform a BLAST search of this file against itself (see the sketch below).
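
    A sketch of the pooled approach; the ABC input for mcl (label1, label2, weight) is built here from the tabular BLAST output, and the e-value threshold and the choice of the bit score (column 12) as weight are assumptions:

    cat EMENI.fa Aspergillus.fa > all.fa
    formatdb -p T -i all.fa
    blastall -p blastp -i all.fa -d all.fa -m 8 -e 0.001 -o all2all.blast
    # keep query, hit and bit score (columns 1, 2 and 12) as the ABC file for mcl
    cut -f1,2,12 all2all.blast > all2all.abc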

    2.- Perform the mcl clustering.

    mcl fileName --abc -o outfileName -I inflation_value

    The inflation value affects the cluster granularity; basically it will make the clusters larger or smaller. The smaller the value, the larger the groups will be. Typical values range from 1.2 to 5.0.

    3.- Perform the clustering with different inflation values: 1.2, 2.0 and 5.0. Have a look at the results, are there differences?

     

    Exercise 7: Reconstruct phylogenetic trees for each of the proteins predicted.

    The phylogenetic reconstruction process can be performed in three steps: homology search, alignment reconstruction and phylogenetic reconstruction. For this exercise we assume that the homology search has already been performed and that each of the proteins found in ASPNI is in a separate file containing its sequence and the sequences of its homologs.

    1.- Perform a multiple sequence alignment using muscle:

    muscle -in fileName -out fileName.alg

    2.- Trim the alignment using trimal

    trimal -in fileName.alg -out fileName.trim -gt 0.1

    Note that trimAl is able to change the format of the alignment, which is very useful if we want to build a phylogenetic tree with other programs such as PhyML or RAxML (see the example below).
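
    For example, to also write a PHYLIP copy of the trimmed alignment (the output name is arbitrary):

    trimal -in fileName.alg -out fileName.phy -gt 0.1 -phylip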

    3.- Reconstruct a phylogenetic tree using FastTree

    fasttree fileName.trim >fileName.tree

     

    Exercise 8.- Interproscan, how to assess function without comparing directly to another species

    If the previous methods are not effective in annotating your genome, or you want to obtain more information, one of the things you can do is run an InterProScan analysis. This program searches for conserved regions in your proteins and compares them to several databases such as Pfam or InterPro. It can also assign GO terms or associate proteins with pathways.

    In order to see which options are available you can either check this website: https://github.com/ebi-pf-team/interproscan/wiki/HowToRun or execute the program: interproscan.sh

    1.- Use interpro in order to annotate the A. nidulans proteins.

    interproscan.sh -i EMENI.fa -iprlookup -goterms -f tsv --pathways

    2.- Have a look at the results obtained

    mmarcet Tue, 07/26/2016 - 17:01

    Day 3: CNV detection & differential expression

    The third day will introduce the detection of polymorphisms and differentially expressed genes. 

    lpryszcz Wed, 06/22/2016 - 10:32

    CNV detection

    gdemidov Tue, 08/02/2016 - 16:46

    Hands-on sessions

    2 theoretical modules and 1 practical one.

    gdemidov Thu, 08/04/2016 - 10:53

    First Theoretical Exercise

    gdemidov Thu, 08/04/2016 - 10:54
    slides: 2.pdf (1.25 MB)

    Second Theoretical Exercise

    gdemidov Thu, 08/04/2016 - 10:54
    slides: 3.pdf (1.94 MB)

    Differential expression

    Dataset

    From a large RNA-Seq dataset (PRJNA177642, PMID:23284293), I have selected two conditions in three replicates:

    Accession   Tissue        Treatment  Replicate
    SRR1048058  Pineal gland  Light      1
    SRR1048059  Pineal gland  Light      2
    SRR1048060  Pineal gland  Light      3
    SRR1048061  Pineal gland  Dark       1
    SRR1048062  Pineal gland  Dark       2
    SRR1048063  Pineal gland  Dark       3

     

    Experiment: Fish were raised under 12-hr light:12-hr dark (LD) cycles. Fish were exposed to a 1-hr light pulse prior to sampling (light treatment) or kept under constant darkness as a control (dark treatment). Six time points at 4-hr intervals throughout one daily cycle, corresponding to CT2, 6, 10, 14, 18 and 22, were pooled together for RNA-Seq.

    Sequencing: Illumina HiSeq2000 (2x 100bp)

    Aim: Identify circadian genes

    Splice-aware mapping (FastQ → BAM)

    Enter rnaseq directory & decompress reference sequence

    cd rnaseq
    
    gunzip ref/*.gz
    
    fasta_stats.py -i ref/ref.fa

    To speed up the analysis, I have limited the sequencing reads and the reference sequence to three chromosomes (1, 3 and 12).

    Index reference genome for particular aligner (skip if already indexed)

    bowtie2-build ref/ref.fa ref/ref

    Run Tophat2 on all FastQ files

    # create tophat directory
    mkdir tophat
    
    # align FASTQ files one-by-one
    tophat2 -p1 --no-coverage-search --rg-id 1 --rg-sample "SRR1048059" -o tophat/SRR1048059 ref/ref fastq/SRR1048059_?.fastq.gz
    
    ## --no-coverage-search // disable coverage scanning for splice-sites detection; it's slow and important only for novel transcripts detection

    Explore output files

    tophat reports aligned and unaligned reads separately as accepted_hits.bam and unmapped.bam. You may want to look at:

    • align_summary.txt alignment statistics
    • junctions.bed detected splice junctions
    • insertions.bed / deletions.bed detected INDELs

    Link output files

    cd tophat
    
    # tophat reports a sorted BAM file
    ln -s SRR1048059/accepted_hits.bam SRR1048059.bam
    
    # OR use FOR-LOOP
    for f in */accepted_hits.bam; do
       s=`echo $f | cut -f1 -d'/'`; # get sample name
       ln -s $f $s.bam
    done

    Index alignments for fast random access (nearly all tools require that!)

    # one-by-one
    samtools index SRR1048059.bam
    
    # OR use FOR-LOOP
    for f in *.bam; do echo $f; samtools index $f; done

    Get alignment statistics and have a look at unmapped reads

    samtools idxstats SRR1048059.bam

    Note: most of the above steps can be done using a BASH FOR-LOOP: 

    # OR use BASH FOR-LOOP to process all at once
    for f in fastq/*_1.fastq.gz; do
     s=`echo $f | cut -d"/" -f2 | cut -d"_" -f1`
     if [ ! -s tophat/$s/accepted_hits.bam ]; then
       echo `date` $f $s
       tophat2 -p1 --no-coverage-search -o tophat/$s --rg-id 1 --rg-sample "$s" ref/ref fastq/${s}_?.fastq.gz
       ln -s $s/accepted_hits.bam tophat/$s.bam
       samtools index tophat/$s.bam
     fi
    done

    Running the above will likely take a few hours... so let's explore a more efficient alternative ;)

    Try STAR

    STAR is an ultrafast RNA-seq aligner (over 10x faster than tophat2). It's also highly accurate, but it requires lots of operating memory, typically about 10x the genome size, so over 30GB to align against the human genome! 

    # index genome
    mkdir ref/ref
    STAR --runMode genomeGenerate --genomeDir ref/ref --genomeFastaFiles ref/ref.fa --sjdbGTFfile ref/ref.gtf --runThreadN 1
    
    # STAR
    ref=ref/ref
    mkdir star
    for f in fastq/*_1.fastq.gz; do
      s=`echo $f | rev | cut -f1 -d"/" | rev | cut -f1 -d "_"`;
      if [ ! -s star/$s/Aligned.sortedByCoord.out.bam ]; then
        echo `date` $f $s;
        mkdir -p star/$s;  
        opts="--readFilesCommand zcat --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMtype BAM SortedByCoordinate --outSAMattrRGline ID:1 SM:$s LB:$s PU:unknown PL:illumina";  
        STAR --runThreadN 1 --genomeDir $ref --readFilesIn fastq/${s}_?.fastq.gz --outFileNamePrefix star/$s/ $opts;  
        ln -s $s/Aligned.sortedByCoord.out.bam star/$s.bam;
        samtools index star/$s.bam;
      fi; 
    done; date
    

    If you are unable to index the genome or run STAR due to insufficient memory, you can copy pre-aligned reads

    rsync -av YOUR_USERNAME@ngschool.local:/home/lpryszcz/teaching/rnaseq/star .

    Try Salmon

    Salmon is an alignment-free method for quantifying transcript abundance. I heard about it 2 years ago and have liked it a lot since then. 

    # index transcriptome
    salmon index -t ref/transcripts.fa -i ref/transcripts.index
    
    # quantify transcript abundances
    mkdir salmon
    for f in fastq/*_1.fastq.gz; do
      s=`echo $f | rev | cut -f1 -d"/" | rev | cut -f1 -d "_"`;
      if [ ! -s salmon/$s/quant.sf ]; then
        salmon quant -p 1 -l IU -i ref/transcripts.index -1 <(zcat fastq/${s}_1.fastq.gz) -2 <(zcat fastq/${s}_2.fastq.gz) -o salmon/$s
      fi; 
    done; date
    

    You have to specify the library type (-l) correctly. You can read more about LIBTYPE (-l) in the Salmon documentation.
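
    A few common LIBTYPE codes, for orientation (see the Salmon documentation for the full list):

    # IU  - paired-end, inward, unstranded (used above)
    # ISR - paired-end, inward, stranded, read 1 from the reverse strand (e.g. dUTP protocols)
    # ISF - paired-end, inward, stranded, read 1 from the forward strand
    # U / SR / SF - the single-end equivalents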

    You can explore the output of Salmon, e.g. `less salmon/SRR1048059/quant.sf`. 

    Call differentially expressed genes

    Run cuffdiff

    # enter star or tophat output directory
    cd star
    
    # create output dir for cuffdiff
    mkdir cuffdiff
    
    # run cuffdiff with 2 conditions, 3 replicates each
    cuffdiff -p1 -o cuffdiff/all -L light,dark ../ref/ref.gtf SRR1048058.bam,SRR1048059.bam,SRR1048060.bam SRR1048061.bam,SRR1048062.bam,SRR1048063.bam

    Run cuffdiff second time, ignoring one sample (SRR1048058)

    # run cuffdiff with 2 conditions, but now ignoring 1 replicate from the light samples
    cuffdiff -p1 -o cuffdiff/five -L light,dark ../ref/ref.gtf SRR1048059.bam,SRR1048060.bam SRR1048061.bam,SRR1048062.bam,SRR1048063.bam

    Explore cuffdiff output files

    • .count_tracking Number of fragments (reads) per gene / transcript / CDS etc. per sample
    • .fpkm_tracking Normalised expression (FPKM) per gene / transcript / CDS etc. per sample
    • .read_group_tracking expression (FPKM) per gene / transcript / CDS etc. per replicate per sample
    • .diff Differential expression tests between samples for genes / transcripts / CDS

    FPKM = fragments (reads) per kilobase of exon per million fragments mapped.
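
    For example, a gene with 2 kb of exon covered by 1,000 fragments in a library of 20 million mapped fragments has FPKM = 1000 / (2 × 20) = 25.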

    Data analysis and visualisation

    Start R and load cummeRbund

    cd cuffdiff
    R
    library(cummeRbund)

    Load cuffdiff results directory

    cuff_data <- readCufflinks("all")
    cuff_data
    genes(cuff_data) # how many genes
    isoforms(cuff_data) # how many transcripts
    CDS(cuff_data) # how many coding sequences
    TSS(cuff_data) # how many transcription start sites

    Generate density plot, scatter plot and clustering dendrogram

    csDensity(genes(cuff_data))
    csScatter(genes(cuff_data), "dark", "light", smooth=T)
    csDendro(genes(cuff_data), replicates=T)   # sample clustering

     

    Dispersion plot

    dispersionPlot(genes(cuff_data))
    csBoxplot(genes(cuff_data), replicates=T)
    csScatterMatrix(genes(cuff_data))

     

    Dimensionality reduction: PCA and MDS

    PCAplot(genes(cuff_data), "PC1", "PC2")
    MDSplot(genes(cuff_data), replicates=T)
    

    Have you spotted an outlier? Any idea why it appeared?
     

    Get expression bar plot for a selected gene, e.g. clock3 (ENSDARG00000003631)

    mygene <- getGene(cuff_data, "clock3") # select gene
    expressionBarplot(mygene) # get gene expression
    expressionBarplot(isoforms(mygene)) # get expression of its isoforms

     

    Get differentially expressed genes

    gene.diff.data <- diffData(genes(cuff_data))
    sig.genes <- subset(gene.diff.data, significant == "yes") # differentially expressed genes
    geneIdList <- sig.genes$gene_id # names of diff. expressed genes
    length(geneIdList) # how many diff. expressed genes

    Store transcripts differentially regulated at FDR < 0.05

    isoforms_diff = diffData(isoforms(cuff_data))
    isoforms_sig = isoforms_diff[ isoforms_diff$q_value < 0.05, ] # differentially expressed transcripts
    write.table(isoforms_sig$isoform_id, file="cuffdiff.fdr_05.tsv", sep='\t', quote = FALSE, col.names = FALSE, row.names = FALSE)

    Check if given gene is differentially expressed

    "ENSDARG00000003631" %in% geneIdList  
    

    Generate heatmap

    sig.genes.data <- getGenes(cuff_data, geneIdList)
    csHeatmap(sig.genes.data, clustering="both")

    Expression barplot for ten genes

    sig.genes10 <- getGenes(cuff_data, geneIdList[1:10])
    expressionBarplot(sig.genes10)

    Volcano plot

    csVolcano(genes(cuff_data), 'dark', 'light')

    If you wish to store any of the figures, just execute:

    dev.copy(png, 'image.png'); dev.off() # as PNG
    dev.copy(svg, 'image.svg'); dev.off() # as SVG

     

    Detecting novel transcripts

    We will use the reads previously mapped with STAR (or tophat2).

    Run guided transcripts detection using cufflinks

    cd star
    cufflinks -p 1 -o cufflinks --GTF-guide ../ref/ref.gtf SR*.bam

    Compare with reference annotation

    1. Run IGV
    2. Select appropriate reference
    3. Load Cufflinks transcripts // File > Load file >  cufflinks/transcripts.gtf
    4. Load BAM file(s) // File > Load file > *.bam
    5. Zoom in some region with genes
    6. Compare original annotation and cufflinks transcripts
      1. Can you spot any new transcripts?
    7. Save current view // File > Save image
    8. Save session // File > Save session

    You can explore two more tools from cufflinks:

    • cuffmerge      \\ merges multiple GTF annotations
    • cuffcompare   \\ compares multiple GTF annotations
    lpryszcz Tue, 08/02/2016 - 18:33
    slides: rnaseq.pdf (4.96 MB)

    NGS alignments visualisation aka working with SAM & BAM

    NGS alignments visualisation

    Text (samtools)

    NGS alignments are stored in SAM format (typically in compressed form, called BAM). Samtools is a suite of programs for interacting with NGS data stored in SAM/BAM. 

    ###
    # samtools
    cd samtools
    gunzip SRR1048061.sam.gz 
    
    # some tools report BAM (ie. STAR, tophat)
    # but often you will get SAM file, which you should convert to BAM
    head SRR1048061.sam
    # do you understand all columns? 
    # can you find the sequence? quality of each base?
    # how can you know if given read is aligned to forward or reverse strand? 
    ## hint: have a look at bitflag ie. use https://broadinstitute.github.io/picard/explain-flags.html
    # what does bitflag `339` mean? 
    
    # compress SAM to BAM
    samtools view -Sbu -T ../ref/ref.fa SRR1048061.sam > SRR1048061.unsorted.bam
    samtools sort SRR1048061.unsorted.bam SRR1048061
    
    # or do the same in one command using a pipe (this skips saving the temporary .unsorted.bam file)
    samtools view -Sbu -T ../ref/ref.fa SRR1048061.sam | samtools sort - SRR1048061
    
    # did you spot the difference in size between SAM and BAM?
    ls -lah
    
    # index BAM for random access
    samtools index SRR1048061.bam
    # try indexing unsorted.bam
    samtools index SRR1048061.unsorted.bam
    # why didn't it work?
    
    # view alignments
    samtools view SRR1048061.bam | head
    ## make sure you use head, otherwise samtools will report all alignments to STDOUT
    
    # get alignments from particular region / gene
    ## first select some region ie. clock3 gene
    grep clock3 ../ref/ref.gtf | grep -wP 'gene|transcript'
    # 1	ensembl_havana	gene	18174899	18208348	.	+	.	gene_id "ENSDARG00000003631"; gene_name "clock3";
    
    samtools view SRR1048061.bam "1:18174899-18208348" | head
    
    # count alignments in this region (clock3)
    samtools view SRR1048061.bam "1:18174899-18208348" | wc -l
    samtools view -c SRR1048061.bam "1:18174899-18208348"
    
    # compare with salmon quantification
    grep -P 'ENSDART00000133959|ENSDART00000144403|ENSDART00000012552|ENSDART00000103119|ENSDART00000133758|ENSDART00000136307' ../salmon/SRR1048061/quant.sf
    ## salmon reports expression for each transcript separately, not for genes, so you need to sum these up
    grep -P 'ENSDART00000133959|ENSDART00000144403|ENSDART00000012552|ENSDART00000103119|ENSDART00000133758|ENSDART00000136307' ../salmon/SRR1048061/quant.sf | awk '{ s+=$5 } END {print s}'
    ## it's much less than in samtools, right? why?
    
    # try to look only at reliable alignments (mapQ > 10)
    samtools view -q10 SRR1048061.bam "1:18174899-18208348" | wc -l
    # and count only one read from each pair
    samtools view -q10 SRR1048061.bam "1:18174899-18208348" | cut -f1 | sort | uniq | wc -l
    # or 
    samtools view -q10 -f64 SRR1048061.bam "1:18174899-18208348" | wc -l
    ## now the salmon and samtools estimates are pretty close, right? 
    
    # count all alignments
    samtools view -c SRR1048061.bam
    ## or better use
    samtools idxstats SRR1048061.bam
    samtools idxstats SRR1048061.bam | awk '{a+=$3; b+=$4} END {print a+b, a, b}'
    
    # report stats
    samtools flagstat SRR1048061.bam
    
    # you can even view the alignments in the terminal!
    samtools tview SRR1048061.bam ../ref/ref.fa -p "1:18181755-18181800"
    # can you spot a single SNP? 
    
    # genotype
    samtools mpileup SRR1048061.bam -r "1:18181795-18181800" 
    # add reference
    samtools mpileup SRR1048061.bam -f ../ref/ref.fa -r "1:18181795-18181800" 
    # add another sample
    samtools mpileup SRR1048061.bam ../star/SRR1048063.bam -f ../ref/ref.fa -r "1:18181795-18181800"

    Note that samtools offers much more. You can check it by just typing samtools ;)

     

    Graphical (IGV)

    igv &

    In case IGV hangs, you can try running it from the website. Navigate to http://www.broadinstitute.org/software/igv/download, select an appropriate version (2 GB is the safest) and press `Launch`.

    Create new genome // Genomes > Create .genome file...

    • fill: Unique identifier, Descriptive name, FASTA file (ref.fa) and Gene file (ref.gtf)
    • save .genome in ~/igv/genomes

    Generate .tdf for faster depth of coverage data loading

    for f in star/*.bam; do
     if [ ! -s $f.tdf ]; then
       echo `date` $f;
       igvtools count $f $f.tdf ~/igv/genomes/danRer7.genome &> /dev/null;
     fi;
    done; date

    Load alignments // File > Load from File…

    • navigate to the directory with the BAM files (star)
    • select BAM files and load

    Explore some differentially expressed genes ie. clock3

    • File > Save image   // Save current view
    • File > Save session // Save session
       

    lpryszcz Wed, 08/17/2016 - 11:25

    Day 4: Bisulphite sequencing, ChIP-seq & Hi-C

    Day 4: Bisulphite sequencing, ChIP-seq & Hi-C

    In the course of the fourth day, we will cover epigenomics: bisulphite sequencing, ChIP-Seq and Hi-C. 

    lpryszcz Wed, 06/22/2016 - 10:32

    ChIP-Seq data analysis

    ChIP-Seq data analysis

    ChIP-Seq data analysis workshop

    Change to the chip_seq directory and type:
     

    source ./.chipseqrc

    FastQC analysis

    To assess the quality of the sequencing experiment run:

    mkdir fastqc
    fastqc -t 2 -o "./fastqc" reads/*.fastq.gz

    Cutadapt adapter trimming

    To trim the adapters and low quality bases we can use Cutadapt:
    http://cutadapt.readthedocs.io/en/stable/guide.html

    The following options are specified:

      -f ; input file format
      -e ; error rate
      -q ; quality cutoff
      -O ; minimum adapter overlap
      -a ; 3' adapter sequence
      -m ; minimum length
      -o ; output file [STDOUT]

    To extract the name of our file we will use Bash parameter substitution (for further reading and other tricks refer to http://tldp.org/LDP/abs/html/parameter-substitution.html#PARAMSUBREF).
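
    For example, the two substitutions used in the loop below behave as follows (a quick illustration you can paste into a terminal; the file name is just an example):

    f=reads/input.fastq.gz
    echo ${f%.*.*}                      # strips the two extensions -> reads/input
    name=${f%.*.*}; echo ${name##*/}    # additionally strips the leading path -> input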

    for f in reads/*.fastq.gz; do
      name=${f%.*.*}; name=${name##*/};
      out_dir=cutadapt
      out_file=$out_dir/trimmed/$name.trimmed.fq.gz
      report_file=$out_dir/$name.report.txt
      if [ ! -d $out_dir/trimmed ]; then mkdir -v -p $out_dir/trimmed; fi
      if [ ! -d fastqc/trimmed ]; then mkdir -v -p fastqc/trimmed; fi
      date +'%D %T'; echo $name;
      cutadapt \
        -f fastq \
        -e 0.1 \
        -q 20 \
        -O 3 \
        -a AGATCGGAAGAGC \
        -m 20 \
        -o "$out_file" \
        $f | tee $report_file;
      fastqc -o "fastqc/trimmed" $out_file
    done

    Align the reads

    First, prepare the index

    mkdir -pv ./ref/danRer10.bwa.index/
    bwa index -p "./ref/danRer10.bwa.index/danRer10.bwa" ./ref/danRer10.dna_sm.toplevel.fa

    Do the proper alignment now:

    To obtain high quality reads for further processing we:

    1.  First align the reads to the reference genome with the bwa mem tool
    2.  Next we use a shell pipe to redirect the resulting SAM alignments to samtools, with which we filter out the low quality alignments
    3.  Subsequently we redirect the output of the previous command and sort the good quality alignments, storing them in a BAM file
    4.  In the end we index the sorted BAM file for fast access

    With samtools we will filter out reads that have the following SAM flag values:

    4 - read unmapped
    256 - read is not primary alignment

    To filter on both at once, we can add these flag values together and pass the sum to samtools view -F.
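
    Here 4 + 256 = 260, which is why the alignment command below passes -F260 to samtools view. If your samtools version provides the flags subcommand (samtools >= 1.0), you can decode any flag value as an optional sanity check:

    samtools flags 260               # prints the flag in hex and decimal plus its names (UNMAP,SECONDARY)
    samtools flags UNMAP,SECONDARY   # the reverse: prints the numeric value 260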

    for f in ./reads/*.fastq.gz; do
      outDir=./bwa;
      name=`basename $f`; name=${name%%.*};
      if [[ ! -d "$outDir" ]]; then mkdir -vp $outDir; fi
      if [[ ! -s "$outDir/$name.sorted.bam" ]]; then
        echo; date +'%D %T'; echo $name; echo;
        bwa mem \
          -t 2 \
          -M \
          ./ref/danRer10.bwa.index/danRer10.bwa \
          $f \
          | samtools view -b -u -q10 -F260 - \
          | samtools sort -o $outDir/$name.sorted.bam -O bam -T $outDir/$name.tmp -@ 2 -;
        samtools index $outDir/$name.sorted.bam;
      fi
    done
    # for the old version of samtools in the Ubuntu repository, substitute the last 2 commands with:
    #| samtools view -S -b -h -q10 -F260 - \
    #| samtools sort -f - $outDir/$name.sorted.bam;

    Check how many high quality ("unique") alignments we will work with:

    for f in ./bwa/*.bam; do echo $f; samtools view -c $f; done

    NRF Calculation

    We can simply calculate the Non-redundant Read Fraction using samtools and UNIX shell:

    # specify the file
    file=./bwa/input.sorted.bam
    # and calculate the read counts
    total=`samtools view -c $file`;
    # We need to extract the coordinates of the reads aligning to the Watson and Crick strand
    # separately (do you know why?). To do this we will check for the occurrence of the flag value 16.
    plusStrand=`samtools view -F16 $file | cut -f3-4 | sort -u | wc -l`;
    minusStrand=`samtools view -f16 $file | cut -f3-4 | sort -u | wc -l`;
    
    NRF=`echo "($plusStrand + $minusStrand)/$total" | bc -l`
    
    echo $NRF

    Be aware that the NRF value will be lower for higher sequencing depths.
    We could run this for all the files (see the sketch below) or do it in Python in a bit more sophisticated way, with random read sampling.
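
    If you want to run the same calculation for every BAM file, here is a minimal bash sketch (just the commands above wrapped in a loop):

    for file in ./bwa/*.sorted.bam; do
      total=`samtools view -c $file`;
      plusStrand=`samtools view -F16 $file | cut -f3-4 | sort -u | wc -l`;
      minusStrand=`samtools view -f16 $file | cut -f3-4 | sort -u | wc -l`;
      NRF=`echo "($plusStrand + $minusStrand)/$total" | bc -l`;
      echo $file $NRF;
    done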

    NSC and RSC calculation

    To check the cross-correlation of the reads aligned to both strands we will run SPP (phantompeakqualtools).

    mkdir -v ./spp;
    for f in ./bwa/*.bam; do
      Rscript ./.local/share/phantompeakqualtools/run_spp_modified.R \
        -c=$f \
        -s=-100:2:600 \
        -savp \
        -odir=./spp \
        -out=./spp/spp.result
    done

    Signal extraction scaling

    To perform the first two steps of the signal extraction scaling analysis we will use bedtools.
    It has great documentation at:
    http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html

    First we will split our reference genome into 1kb non-overlapping windows:

    mkdir bedtools;
    bedtools makewindows -g ./ref/danRer10.83.len -w 1000 >./bedtools/windows.bed;

    Next we can calculate the read coverage of each of these windows. We will use the following command:

    for f in ./bwa/*.sorted.bam; do
      name=`basename $f`; name=${name%%.*};
      bedtools coverage \
        -a ./bedtools/windows.bed \
        -b $f \
        -counts \
        -sorted \
        -g ./ref/danRer10.83.len \
        >./bedtools/${name}_coverage.bed
    done

    Now we can proceed with the next steps in Python (in the jupyter-notebook)!


    Calling the peaks with MACS2 peak caller

    You can read the documentation under:

    https://github.com/taoliu/MACS

    mkdir ./macs2;
    macs2 callpeak \
      --treatment "./bwa/sox2_chip.sorted.bam" \
      --control "./bwa/input.sorted.bam" \
      --name "sox2_chip" \
      --outdir "./macs2" \
      --format BAM \
      --gsize 1.01e8 \
      --qvalue 0.01 \
      --bdg \
      2>&1 \
      | tee ./macs2/sox2_chip.macs2.log

    Create the MACS2 diagnostic plots
     

    cd ./macs2;
    Rscript sox2_chip_model.r;
    cd ..

    The called peaks can be found in the ./macs2/sox2_chip_peaks.narrowPeak file.

    Definitions of some specific columns are:

    5th: integer score for display
    7th: fold-change
    8th: -log10pvalue
    9th: -log10qvalue
    10th: relative summit position to peak start
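
    For a quick look at these columns for the first few peaks you can, for instance, run (optional; column numbers as listed above):

    cut -f5,7-10 ./macs2/sox2_chip_peaks.narrowPeak | head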

    Visualization

    To see the signal enrichment in the genome browser we need to create files in the BigWig format. To do this we first need to compare the signal in the ChIP sample to the signal in the input sample. We will do it in MACS2, using fold enrichment as the output method.

    macs2 bdgcmp -t ./macs2/sox2_chip_treat_pileup.bdg -c ./macs2/sox2_chip_control_lambda.bdg -o ./macs2/sox2_chip_FE.bdg -m FE

    Next we will convert the bedGraph file to the BigWig format needed for viewing, using the following command:

    bdg2bw ./macs2/sox2_chip_FE.bdg ./ref/danRer10.83.len

    Now we can run IGV, load the provided igv.genome file (in the ./ref/ folder), our alignments and the predicted peaks to see the results of our analysis.

    Calculate FRiP

    To calculate the fraction of reads in peaks we can again use bedtools.

    bedtools coverage \
      -a ./macs2/sox2_chip_peaks.narrowPeak \
      -b ./bwa/sox2_chip.sorted.bam \
      -counts \
      -sorted \
      -g ./ref/danRer10.83.len \
      >./bedtools/sox2_chip_peak_coverage.bed

    We can sum all the read counts from the 11th column using awk; for simplicity, just run the provided script:

    calculateFRiP.sh ./bwa/sox2_chip.sorted.bam ./bedtools/sox2_chip_peak_coverage.bed
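
    If you prefer to compute the value by hand, here is a minimal sketch of the idea (assuming the peak read counts sit in column 11 of the coverage file and using the total alignment count from samtools; the provided calculateFRiP.sh may differ in detail):

    # sum of reads falling into peaks (11th column = count appended by bedtools coverage)
    inPeaks=`awk '{ s+=$11 } END { print s }' ./bedtools/sox2_chip_peak_coverage.bed`
    # total number of (filtered) alignments in the ChIP BAM
    total=`samtools view -c ./bwa/sox2_chip.sorted.bam`
    echo "$inPeaks/$total" | bc -l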

    Obtain the gene list

    In the end we can obtain the list of genes closest to our detected peaks:

    bedtools closest \
      -a ./macs2/sox2_chip_peaks.narrowPeak \
      -b ./ref/ensembl_genes.sorted.bed \
      >./macs2/sox2_chip.genes.closest;

     

    mlapinski Wed, 08/03/2016 - 22:55

    Day 5: Molecular data integration

    Day 5: Molecular data integration

    On the fifth day, you will learn about methods for integrating transcriptomics data from multiple studies.

     

    jmarzec Wed, 06/22/2016 - 10:32

    0. Preparation

    0. Preparation

    This workshop will involve integrating three publicly available prostate cancer expression datasets (GEO IDs: GSE17951, GSE55945, GSE3325) generated with the Affymetrix U133 Plus 2.0 array platform. Affymetrix raw data are stored in CEL files (one CEL file per sample).

    The data for this workshop are available to download from our local server in the following directory: /ngschool/data_integration/data/CEL_files (each folder contains the CEL files for the corresponding dataset).

    NOTE: the provided data were limited to a subset of samples representing normal and tumour phenotypes to reduce the analysis computation time.

     

    The overall pipeline consists of six main steps:

    1. Quality control

    2. Data combination and normalisation

    3. Study effects assessment

    4. Data adjustment for known study effects

    5. Differential expression analysis

    6. Results summary and visualisation

     

    Overall pipeline scheme

     

    Target files

    To perform the analysis you will need to create a so-called target file for each dataset. The target file is expected to be tab-delimited and contain three columns in the following order: (1) Name, (2) FileName, (3) Target, e.g.:

    Name	      FileName	                                  Target
    GSE17951_1    [directory with CEL files]/GSM449150.CEL    Tumour
    GSE17951_2    [directory with CEL files]/GSM449151.CEL    Tumour
    GSE17951_3    [directory with CEL files]/GSM449249.CEL    Normal
    
    ...
    GSE17951_22    [directory with CEL files]/GSM449258.CEL    Normal

    Name = sample name

    FileName = name of the CEL file along with path to the directory

    Target = sample phenotype

     

    You will also need to create one target file for the combination of all datasets, which is expected to contain four columns: (1) Name, (2) FileName, (3) Target, (4) Dataset, e.g.:

    Name	      FileName	                                  Target    Dataset
    GSE17951_1    [directory with CEL files]/GSM449150.CEL    Tumour    GSE17951
    GSE17951_2    [directory with CEL files]/GSM449151.CEL    Tumour    GSE17951
    ...
    
    GSE55945_1    [directory with CEL files]/GSM1348933.CEL    Tumour    GSE55945
    GSE55945_2    [directory with CEL files]/GSM1348934.CEL    Tumour    GSE55945
    
    ...
    
    GSE3325_1    [directory with CEL files]/GSM74875.CEL    Tumour    GSE3325
    GSE3325_2    [directory with CEL files]/GSM74876.CEL    Tumour    GSE3325

    Dataset = name of the dataset from which the corresponding sample is derived

    Example target files ('target_GSE17951.txt', 'target_GSE55945.txt', 'target_GSE3325.txt' and 'target_all.txt') are here: /ngschool/data_integration/data
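
    If you ever need to build such a target file yourself rather than use the provided ones, a minimal bash sketch could look like this (assuming the GSE17951 CEL files sit in a folder named after the dataset; the Target column still has to be filled in manually, e.g. from the GEO sample annotations):

    printf "Name\tFileName\tTarget\n" > target_GSE17951.txt
    i=1
    # adjust the path below if the CEL files are stored under a different folder name
    for f in /ngschool/data_integration/data/CEL_files/GSE17951/*.CEL; do
      printf "GSE17951_%s\t%s\tNA\n" $i $f >> target_GSE17951.txt
      i=$((i+1))
    done
    # now replace each NA placeholder with the sample phenotype (Tumour or Normal)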

     

    =========================================================================================================

     

     

     

    jmarzec Mon, 08/08/2016 - 23:56

    1. Quality control

    1. Quality control

    Load functions

    The following functions will be used to distinguish samples from different datasets in clustering and principal component analysis (PCA) plots

    ##### Assign colours to analysed datasets
    getDatasetsColours <- function(datasets) {
        
        ##### Predefine colours for datasets
        datasets.colours <- c("bisque","orange","firebrick","lightslategrey","darkseagreen","darkcyan","dodgerblue")
        
        f.datasets <- factor(datasets)
        vec.datasets <- datasets.colours[1:length(levels(f.datasets))]
        datasets.colour <- rep(0,length(f.datasets))
        for(i in 1:length(f.datasets))
        datasets.colour[i] <- vec.datasets[ f.datasets[i] == levels(f.datasets)]
        
        return( list(vec.datasets, datasets.colour) )
    }
    
    
    ##### Assign colours to analysed groups
    getTargetsColours <- function(targets) {
        
        ##### Predefine colours for groups
        targets.colours <- c("red","blue","green","darkgoldenrod","darkred","deepskyblue")
        
        f.targets <- factor(targets)
        vec.targets <- targets.colours[1:length(levels(f.targets))]
        targets.colour <- rep(0,length(f.targets))
        for(i in 1:length(f.targets))
        targets.colour[i] <- vec.targets[ f.targets[i] == levels(f.targets)]
        
        return( list(vec.targets, targets.colour) )
    }
    

     

    Load libraries

    Load the R libraries required for the analysis.

    library(affy)
    library(sva)
    library(arrayQualityMetrics)
    library(gcrma)
    library(limma)
    library(hgu133plus2.db)
    library(gplots)

    You will also need a piece of code to perform coloured clustering ('a2R_code.R')

    source("[directory]/a2R_code.R")

    where [directory] is the path to directory on your computer with the 'a2R_code.R' script. You can download it from our local server (/ngschool/data_integration/scripts/a2R_code.R)

     

    Start the analysis in the directory with target files

    setwd("[directory with target files]")

     

    Quality control

    Load data and perform QC (with arrayQualityMetrics package) for each dataset separately

    Start with GSE17951 dataset (22 samples in total, 13 from normal and 9 from tumour tissues)

    ##### Read annotation file
    pd <- read.AnnotatedDataFrame(filename = "target_GSE17951.txt" ,header = TRUE, row.name = "Name", sep = "\t")
    
    
    ##### Read CEL files into an Affybatch
    dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)
    
    
    ##### Data quality control using arrayQualityMetrics package
    arrayQuality <- arrayQualityMetrics(expressionset = dat, outdir = "[directory for GSE17951 QC report]", reporttitle = "arrayQualityMetrics report for GSE17951", force = TRUE, do.logtransform = TRUE, intgroup = "Target")
    

     

    The QC report is stored in the specified directory [directory for QC report] and can be viewed in a web browser by opening the 'index.html' file (see an example index.html). Various QC metrics are presented, with suggestive outliers indicated if detected.

     

    The QC report helps to identify poor quality samples (in the case of dataset GSE17951 there are no samples of evidently poor quality)

    ##### Report detected samples to be excluded from downstream analysis
    samples2exclude <- NULL

     

    Do the same for the other two datasets:

    GSE55945 with 16 samples in total, 7 from normal and 9 from tumour tissues

    GSE3325 with 12 samples in total, 5 from normal and 7 from tumour tissues

    ##### Read annotation file (GSE55945)
    pd <- read.AnnotatedDataFrame(filename = "target_GSE55945.txt", header = TRUE, row.name = "Name", sep = "\t")
    
    
    ##### Read CEL files into an Affybatch
    dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)
    
    
    ##### Data quality control using arrayQualityMetrics package
    arrayQuality <- arrayQualityMetrics(expressionset = dat, outdir = "[directory for GSE55945 QC report]", reporttitle = "arrayQualityMetrics report for GSE55945", force = TRUE, do.logtransform = TRUE, intgroup = "Target")
    
    

     

    Based on the QC report for dataset GSE55945 four samples seem to be of insufficient quality: GSE55945_3, GSE55945_4, GSE55945_6, GSE55945_15

    ##### Report detected samples to be excluded from downstream analysis
    samples2exclude <- c(samples2exclude, "GSE55945_3", "GSE55945_4", "GSE55945_6", "GSE55945_15")

     

    ##### Read annotation file (GSE3325)
    pd <- read.AnnotatedDataFrame(filename = "target_GSE3325.txt", header = TRUE, row.name = "Name", sep = "\t")
    
    
    ##### Read CEL files into an Affybatch
    dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)
    
    
    ##### Data quality control using arrayQualityMetrics package
    arrayQuality <- arrayQualityMetrics(expressionset = dat, outdir = "[directory for GSE3325 QC report]", reporttitle = "arrayQualityMetrics report for GSE3325", force = TRUE, do.logtransform = TRUE, intgroup = "Target")
    

     

    Based on the QC report for dataset GSE3325 there is one sample presenting poor quality: GSE3325_9

    ##### Report detected samples to be excluded from downstream analysis
    samples2exclude <- c(samples2exclude, "GSE3325_9")

     

    * Automated outlier detection (additional step)

    FYI, arrayQualityMetrics allows automated detection of sample outliers for some of the QC metrics

    ##### Compute useful summary statistics from a data object
    preparedData <- prepdata(expressionset = dat, intgroup = "Target", do.logtransform = TRUE)
    
    
    ##### Array intensity distributions
    QCboxplot <- aqm.boxplot(preparedData)
    QCboxplot@outliers
    
    
    ##### Between array comparison
    QCheatmap <- aqm.heatmap(preparedData)
    QCheatmap@outliers
    
    
    ##### MA plots
    QCmaplot <- aqm.maplot(preparedData)
    QCmaplot@outliers
    
    
    ##### For Affymetrix specific sections
    preparedAffy <- prepaffy(expressionset = dat, x = preparedData)
    
    
    ##### RLE
    QCrle <- aqm.rle(preparedAffy)
    QCrle@outliers
    
    
    ##### NUSE
    QCnuse <- aqm.nuse(preparedAffy)
    QCnuse@outliers
    
    
    ##### Customising arrayQualityMetrics report
    qm <- list("Boxplot" = QCboxplot, "Heatmap" = QCheatmap, "MAplot" = QCmaplot, "RLE" = QCrle, "NUSE" = QCnuse)
    
    aqm.writereport(modules = qm, reporttitle = "arrayQualityMetrics report for GSE3325", outdir = "[directory for the GSE3325 customised QC report]", arrayTable = pData(dat))
    

     

    =========================================================================================================

     

    jmarzec Mon, 08/08/2016 - 23:59

    2. Data combination and normalisation

    2. Data combination and normalisation

    Once the poor quality arrays have been detected, load all arrays across studies (ignoring the poor quality samples) and normalise them collectively to maximise data consistency.

    First, assign colours to datasets and groups using the aforementioned functions

    ##### Read file with information for all datasets
    datasetsInfo <- read.table(file = "target_all.txt", sep = "\t", as.is = TRUE, header = TRUE, row.names = 1)
    
    ##### Ignore samples identified as outliers
    datasetsInfo <- datasetsInfo[setdiff(rownames(datasetsInfo), as.vector(samples2exclude)),]
    
    
    ##### Assign different colours for samples from individual datasets
    datasets <- datasetsInfo$Dataset
    
    datasets.colour <- getDatasetsColours(datasets)
    
    
    ##### Assign different colours for samples representing individual groups
    targets <- datasetsInfo$Target
    
    targets.colour <- getTargetsColours(targets)
    
    
    ##### Record number of datasets
    datasets_No <- max(as.numeric(factor(datasets)))

     

    Load data

    Load the data (ignoring poor quality samples) across all datasets (using the target file for the combination of all datasets)

    ##### Read annotation files for all datasets
    pd <- read.AnnotatedDataFrame(filename = "target_all.txt", header = TRUE, row.name = "Name", sep = "\t")
    
    ##### Ignore samples identified as outliers
    pd <- pd[setdiff(sampleNames(pd), as.vector(samples2exclude))]
    
    
    ##### Read CEL files into an Affybatch
    dat <- ReadAffy(filenames = pd$FileName, sampleNames = sampleNames(pd), phenoData = pd)
    
    
    ##### Create folder for results
    system("mkdir [directory for results]")
    
    ##### Change working directory to results folder
    setwd("[directory for results]")

     

    Normalisation

    Perform normalisation collectively on all data using GC-RMA method

    ##### NOTE: Computationally intensive step, it may take a few minutes to complete
    gcrma = gcrma(dat)

     

    Generate plots demonstrating data distribution before and after normalisation

    ##### Generate signal intensity plots before and after GC-RMA normalisation
    pdf("Norm_density.pdf")
    
    #####  Before normalisation
    hist(dat, col = datasets.colour[[2]], lty = 1)
    
    #####  After normalisation
    hist(gcrma, col = datasets.colour[[2]], lty = 1)
    dev.off()
    
    
    ##### Generate box-and-whisker plot
    pdf("Norm_boxplot.pdf", pointsize = 8, width = 0.2*length(pd$FileName), height = 6)
    par(mar=c(13, 4, 3, 2))
    
    #####  Before normalisation
    boxplot(dat, col = datasets.colour[[2]], las = 2)
    
    #####  After normalisation
    boxplot(exprs(gcrma), col = datasets.colour[[2]], main = "GC-RMA normalised", las = 2)
    dev.off()

     

    You should get the following plots:

    file: 'Norm_boxplot.pdf'

    Box-and-whisker plot before normalisation (different data distributions for individual studies are evident)

    Norm_boxplot_1

     

    ... and after normalisation

    Norm_boxplot_2

     

    file: 'Norm_density.pdf':

    Signal intensity (density) plot before normalisation

    Norm_density_1

     

    ... and after normalisation

    Norm_density_2

     

    Write normalised expression data into a file ('Norm_data.exp')

    write.exprs(gcrma, file = "Norm_data.exp", sep = "\t")

     

    =========================================================================================================

     

    jmarzec Mon, 08/08/2016 - 23:59

    3. Study effects assessment

    3. Study effects assessment

    Clustering analysis

    Start with unsupervised clustering analysis to evaluate expression profile similarities among samples in the context of dataset and phenotypic annotation

    #####  Extract matrix of expression values
    data <- exprs(gcrma)
    
    
    #####  Compute distance matrix and hierarchical clustering
    d.usa <- dist(t(data), method = "euc")
    h.usa <- hclust(d.usa, method = "ward.D2")
    
    
    #####  Generate coloured dendrogram (indicate datasets)
    pdf("Study_effects_cluster_datasets.pdf", width = 0.3*ncol(data), height = 6)
    h.usa$labels = colnames(data)
    par(mar = c(2,2,2,6))
    A2Rplot(h.usa, fact.sup = datasets, box = FALSE, show.labels = TRUE, col.up = "black", main = "", k = length(levels(factor(datasets))) ) # k = changes the detail/number of subgroups shown
    dev.off()
    
    
    #####  Generate coloured dendrogram (indicate groups)
    pdf("Study_effects_cluster_targets.pdf", width = 0.3*ncol(data), height = 6)
    h.usa$labels = colnames(data)
    par(mar = c(2,2,2,6))
    A2Rplot(h.usa, fact.sup = targets, box = FALSE, show.labels=TRUE, col.up = "black", main="", k = length(levels(factor(targets))) ) # k = changes the detail/number of subgroups shown.
    dev.off()
    
    
    #####  Generate dendrogram
    pdf("Study_effects_dendrogram.pdf", width = 0.2*ncol(data)+2, height = 6, pointsize = 12)
    par(mar = c(2,5,2,0))
    plot(h.usa, xlab = "", labels = paste(colnames(data), targets, sep = "       "), hang = -1, main="")
    dev.off()
    

     

    You should get the following plots:

    file 'Study_effects_cluster_datasets.pdf':

    Dendrogram with samples coloured based on their dataset annotation (dataset-driven clustering is evident)

    Study_effects_cluster_datasets

     

     

    file 'Study_effects_cluster_targets.pdf':

    Dendrogram with samples coloured based on their phenotypic annotation (samples representing various biological groups are mixed)

    Study_effects_cluster_targets

     

     

    file 'Study_effects_dendrogram.pdf':

    ... and dendrogram with sample names

    Study_effects_dendrogram

     

    Principal components analysis

    Perform principal components analysis (PCA) to identify key components of variability in combined expression data

    #####  Keep only probes with variance > 0 across all samples
    rsd <- apply(data, 1, sd)
    data <- data[rsd > 0,]
    
    
    #####  Perform PCA
    data_pca <- prcomp(t(data), scale=TRUE)
    
    
    #####  Get variance importance for all principal components
    importance_pca <- summary(data_pca)$importance[2,]
    importance_pca <- paste(round(100*importance_pca, 2), "%", sep="")
    
    
    #####  Set point symbols so that each represents single dataset on the PCA plots
    pchs <- rep(list(c(16,1),c(15,0),c(17,2),c(16,10),c(15,7),c(16,13),c(16,1)),10)
    
    
    #####  Generate PCA plot
    pdf("Study_effects_PCA.pdf")
    plot(data_pca$x[,1], data_pca$x[,2], type = "n", xlab = paste("PC1 (",importance_pca[1],")", sep = ""), ylab = paste("PC2 (",importance_pca[2],")", sep = ""), ylim =c ( min(data_pca$x[,2]),max(data_pca$x[,2]) + (abs(min(data_pca$x[,2])) + abs(max(data_pca$x[,2])))/4 ))
    
    #####  Use different shape for each dataset
    for (i in 1:datasets_No) {
        points(data_pca$x[,1][as.numeric(factor(datasets)) == i], data_pca$x[,2][as.numeric(factor(datasets)) == i], pch = pchs[[i]][1], col = targets.colour[[2]][as.numeric(factor(datasets)) == i])
        points(data_pca$x[,1][as.numeric(factor(datasets)) == i], data_pca$x[,2][as.numeric(factor(datasets)) == i], pch = pchs[[i]][2], col = "black")
    }
    
    #####  Adding the legend
    legend("topleft", legend = levels(factor(targets)), pch = 16, col = targets.colour[[1]], box.col = "transparent")
    legend("topleft", legend = c(rep("", length(targets.colour[[1]]))), pch = 1, box.col = "transparent")
    dev.off()

     

    This should generate the following PCA plot   ('Study_effects_PCA.pdf')

     

    PCA plot

     

    where the first principal components can be clearly explained by the study-specific data variation.

     

    =========================================================================================================

     

    jmarzec Mon, 08/08/2016 - 23:59

    4. Data adjustment for study effects

    4. Data adjustment for study effects

    Combat

    Adjust data for batch effects using ComBat

    #####  Specify known batches
    batch <- as.vector(datasets)
    
    #####  Create model matrix for outcome of interest beside batch
    f.model <- factor(targets, levels = levels(factor(targets)))
    
    mod <- model.matrix(~f.model)
    
    
    #####  Adjust data for batch effects using ComBat
    data_combat <- ComBat(dat = data, batch = batch, mod = mod)

     

    Produce box-plot of combined datasets

    pdf("Study_effects_boxplot.pdf", pointsize = 8, width = 0.2*ncol(data), height = 6)
    par(mar = c(13, 4, 3, 2))
    
    #####  Before adjusting for study effects
    boxplot(data.frame(data), col = datasets.colour[[2]], las = 2)
    
    #####  After adjusting for study effects
    boxplot(data.frame(data_combat), col = datasets.colour[[2]], main = "Batch effect adjusted", las = 2)
    dev.off()

     

    The box-plots demonstrate that globally the expression profiles have not changed after applying ComBat   ('Study_effects_boxplot.pdf')

    Box-plot before...

    Study_effects_boxplot_1

     

    ... and after adjusting the data for study effects

    Study_effects_boxplot_2

     

     

    ... but how about unsupervised clustering and principal components analyses?

    #####  Compute distance matrix and hierarchical clustering
    d.usa <- dist(t(data_combat), method = "euc")
    h.usa <- hclust(d.usa, method = "ward.D2")
    
    
    ##### Generate coloured dendrogram (indicate datasets)
    pdf("Study_effects_cluster_datasets_ComBat.pdf", width = 0.3*ncol(data), height = 6)
    h.usa$labels = colnames(data_combat)
    par(mar = c(2,2,2,6))
    A2Rplot(h.usa, fact.sup = datasets, box = FALSE, show.labels = TRUE, col.up = "black", main="", k=length(levels(factor(datasets))) ) # k = changes the detail/number of subgroups shown.
    dev.off()
    
    
    ##### Generate coloured dendrogram (indicate groups)
    pdf("Study_effects_cluster_targets_ComBat.pdf", width = 0.3*ncol(data), height = 6)
    h.usa$labels = colnames(data_combat)
    par(mar = c(2,2,2,6))
    A2Rplot(h.usa, fact.sup = targets, box = FALSE, show.labels = TRUE, col.up = "black", main=" ", k = length(levels(factor(targets))) ) # k = changes the detail/number of subgroups shown.
    dev.off()
    
    
    ##### Generate dendrogram
    pdf(paste("Study_effects_dendrogram_ComBat.pdf", sep = ""), width = 0.2*ncol(data)+2, height = 6, pointsize = 12)
    par(mar = c(2,5,2,0))
    plot(h.usa, xlab = "", labels = paste(colnames(data_combat), targets, sep="       "), hang = -1, main = "")
    dev.off()
    

     

    These produce a dendrogram with samples coloured based on their dataset annotation (samples from different datasets are mixed)

    file 'Study_effects_cluster_datasets_ComBat.pdf':

    Study_effects_cluster_datasets_ComBat

     

     

    dendrogram with samples coloured based on their phenotypic annotation (phenotype-driven clustering is evident)

    file 'Study_effects_cluster_targets_ComBat.pdf':

    Study_effects_cluster_targets_ComBat

     

     

    ... and dendrogram with sample names

    file 'Study_effects_dendrogram_ComBat.pdf':

    Study_effects_dendrogram_ComBat

     

    The clustering results show that after applying ComBat the batch effects were reduced and biological variability is now maintained across datasets.

     

    Generate PCA plot for the study effects-adjusted data

    #####  Keep only probes with variance > 0 across all samples
    rsd <- apply(data_combat, 1, sd)
    data_combat <- data_combat[rsd > 0,]
    
    
    #####  Perform PCA
    data_combat_pca <- prcomp(t(data_combat), scale=TRUE)
    
    
    #####  Get variance importance for all principal components
    importance_pca <- summary(data_combat_pca)$importance[2,]
    importance_pca <- paste(round(100*importance_pca, 2), "%", sep="")
    
    
    #####  Generate PCA plot
    pdf("Study_effects_PCA_ComBat.pdf")
    plot(data_combat_pca$x[,1], data_combat_pca$x[,2], type = "n", xlab = paste("PC1 (",importance_pca[1],")", sep = ""), ylab = paste("PC2 (",importance_pca[2],")", sep = ""), ylim = c( min(data_combat_pca$x[,2]), max(data_combat_pca$x[,2]) + (abs(min(data_combat_pca$x[,2]))+ abs(max(data_combat_pca$x[,2])))/4 ))
    
    #####  Use different shape for each dataset
    for (i in 1:datasets_No) {
        points(data_combat_pca$x[,1][as.numeric(factor(datasets)) == i], data_combat_pca$x[,2][as.numeric(factor(datasets)) ==i ], pch = pchs[[i]][1], col = targets.colour[[2]][as.numeric(factor(datasets)) == i])
        points(data_combat_pca$x[,1][as.numeric(factor(datasets)) == i], data_combat_pca$x[,2][as.numeric(factor(datasets)) == i], pch = pchs[[i]][2], col = "black")
    }
    
    #####  Adding the legend
    legend("topleft", legend = levels(factor(targets)), pch = 16, col = targets.colour[[1]], box.col = "transparent")
    legend("topleft", legend = c(rep("", length(targets.colour[[1]]))), pch = 1, box.col = "transparent")
    dev.off()
    

     

    The PCA plot ('Study_effects_PCA_ComBat.pdf')

     

    Study_effects_PCA_ComBat

     

    ...demonstrates that the biological effects observed after batch correction are stronger than the study-specific effects.

     

    Such pre-processed data is now ready for differential expression analysis.

     

    =========================================================================================================

     

    jmarzec Mon, 08/08/2016 - 23:58

    5. Differential expression analysis

    5. Differential expression analysis

    Non-specific filtering

    First, perform non-specific filtering, by eliminating genes with low expression level variability across samples (they are unlikely to provide information about the phenotype of interest), to reduce the dimensionality of the data

    #####  Use 60% of all genes with the highest expression variance across samples in the non-specific filtering step
    filterThreshold <- round(0.6*nrow(data_combat), digits = 0)
    
    
    rsd <- apply(data_combat, 1, sd)
    sel <- order(rsd, decreasing=TRUE)[1:filterThreshold]
    data_combat <- data_combat[sel,]

     

    Model matrix

    Then define the study design and fit the linear model

    #####  Create a model matrix representing the study design, with rows corresponding to arrays and columns to coefficients to be estimated
    f.model <- factor(targets, levels = levels(factor(targets)))
    
    mod <- model.matrix(~0 + f.model)
    
    colnames(mod) <- levels(factor(targets))
    
    
    #####  Fit linear model for each gene given a series of arrays
    fit <- lmFit(data_combat, mod)

     

    Now comes a rather longer bit of code, which determines all possible comparisons between sample groups and defines the contrast names. It is useful for pipeline automation when there is more than one comparison, because it allows automated naming of the corresponding files produced for each contrast

    #####  Create matrix of possible comparisons
    comb <- combn(levels(factor(targets)), 2)
    
    #####  Get number of possible comparisons using the following formula:
    #
    # n!/((n-r)!(r!))
    #
    # n = the number of classes to compare
    # r = the number of elements for single comparison
    #
    ################################################################################
    
    #####   Record the number of groups and comparisons
    targetsNo <- length(levels(factor(targets)))
    combNo <- factorial(targetsNo)/(factorial(targetsNo-2)*(factorial(2))) # n!/((n-r)!(r!))
    
    contrasts <- NULL
    contrastNames <- NULL
    
    
    #####  Create string with possible contrasts
    for (i in 1:combNo) {
        
        contrasts <- c(contrasts, paste(paste(comb[1,i], comb[2,i], sep="vs"),paste(comb[1,i], comb[2,i], sep="-"), sep="="))
        contrastNames[i] <- paste(comb[1,i], comb[2,i], sep=" vs ")
    }
    
    contrasts <- paste(contrasts, collapse=", ")
    
    #####  Create contrasts of interest
    func = "makeContrasts"
    arguments = paste(contrasts, "levels=mod",sep = ", ")
    
    contrast.matrix <- eval(parse(text = paste(func, "(", arguments, ")", sep = "")))

     

    Empirical Bayes statistics

    Now fit individual contrasts to linear model and apply empirical Bayes statistics

    #####  Fit contrasts to linear model
    fitContrasts <- contrasts.fit(fit, contrast.matrix)
    
    #####  Apply empirical Bayes statistics
    eb <- eBayes(fitContrasts)

     

    =========================================================================================================

     

    jmarzec Mon, 08/08/2016 - 23:58

    6. Results summary and visualisation

    6. Results summary and visualisation

    Set preferred p-value (adjusted for multiple-hypothesis testing) and log2 fold-change threshold for calling differentially expressed genes

    pThreshold = 0.0001
    lfcThreshold = 2

     

    Results summary

    Annotate the probesets and write the results into files

    #####  ...do it for each comparison
    for (i in 1:ncol(eb$p.value) ) {
        
        topGenes <- topTable(eb, coef = colnames(eb)[i], adjust = "BH",  sort.by = "P", number = nrow(data_combat), genelist = rownames(data_combat))
        
        ##### Retrieve probesets annotation information
        probeid <- topGenes$ID
        
        SYMBOL <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2SYMBOL), function (symbol) { return(paste(symbol,collapse="; ")) } )))
        NAME <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2GENENAME), function (name) { return(paste(name,collapse="; ")) } )))
        CHR <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2CHR), function (Chr) { return(paste(Chr,collapse="; ")) } )))
        MAP <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2MAP), function (MAP) { return(paste(MAP,collapse="; ")) } )))
        
        ##### Merge probes annotation with statistics
        Annot <- data.frame(probeid, CHR, MAP, SYMBOL, NAME, row.names = NULL)
        Annot <- merge(Annot, topGenes, by.x = "probeid", by.y = "ID" )
        
        ##### Write annotated results into a file
        write.csv(Annot, file=paste("Integ_", colnames(eb$p.value)[i], "_topTable.csv", sep=""), row.names=FALSE)
    }

    This will produce one file (finishing with 'topTable.csv') per comparison. In this case we perform only one comparison so one file is generated: 'Integ_NormalvsTumour_topTable.csv'.

     

    Do the same again, but this time consider only differentially expressed genes, according to the pre-defined threshold values (one file finishing with 'DE.csv' is created per comparison) 

    #####  Record differentially expressed genes
    topGenes.index <- NULL
    
    for (i in 1:ncol(eb$p.value) ) {
        
        topGenes <- topTable(eb, coef = colnames(eb)[i], adjust = "BH",  sort.by = "P", p.value = pThreshold, lfc = lfcThreshold, number = nrow(data_combat), genelist = rownames(data_combat))
        
        ##### Retrieve probesets annotation information
        probeid <- topGenes$ID
        
        SYMBOL <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2SYMBOL), function (symbol) { return(paste(symbol,collapse="; ")) } )))
        NAME <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2GENENAME), function (name) { return(paste(name,collapse="; ")) } )))
        CHR <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2CHR), function (Chr) { return(paste(Chr,collapse="; ")) } )))
        MAP <- as.character(unlist(lapply(mget(probeid,env=hgu133plus2MAP), function (MAP) { return(paste(MAP,collapse="; ")) } )))
        
        ##### Merge probes annotation with statistics
        Annot <- data.frame(probeid, CHR, MAP, SYMBOL, NAME, row.names = NULL)
        Annot <- merge(Annot, topGenes, by.x = "probeid", by.y = "ID" )
        
        ##### Write annotated results into a file
        write.csv(Annot, file=paste("Integ_", colnames(eb$p.value)[i], "_DE.csv", sep=""), row.names=FALSE)
        
        #####  Record differentially expressed genes
        topGenes.index <- c(topGenes.index,rownames(topGenes))
    }
    

     

    P-value distribution

    Generate a histogram of p-values to get a general sense of how the test behaved across all genes. This should help to diagnose potential problems associated with multiple hypothesis testing.

    for (i in 1:ncol(eb$p.value) ) {
        
        pdf(file = paste("Integ_", colnames(eb$p.value)[i], "_P_hist.pdf", sep = ""))
        histogram <- hist(eb$p.value[,i], breaks = seq(0,1, by = 0.01), main = contrastNames[i], xlab = "p-value")
        abline(v = pThreshold, col = "red")
        dev.off()
    }

    This will produce one file (finishing with 'P_hist.pdf') per comparison. 

    file 'Integ_NormalvsTumour_P_hist.pdf':

     

    Integ_NormalvsTumour_P_hist

    The histogram displays a large proportion of differentially expressed genes (p-value < 0.0001, indicated by the vertical red line). This suggests that applying a multiple hypothesis testing adjustment procedure is appropriate for these data.

     

    Volcano plot

    Produce volcano plot for each contrast (files finishing with 'volcano_plot.pdf')

    #####  Generate volcano plots of log2 fold-changes versus significance (adjusted p-values, label top 10 genes)
    for (i in 1:ncol(eb$p.value) ) {
        
        topGenes <- topTable(eb, coef = colnames(eb)[i], adjust = "BH", sort.by = "none", number = nrow(data_combat))
        
        pdf(file = paste("Integ_", colnames(eb$p.value)[i], "_volcano_plot", ".pdf", sep = ""))
        plot(topGenes[,"logFC"], -log2(topGenes[,"adj.P.Val"]), pch = 16, cex = 0.5, xlab = "Log2 fold-change", ylab = "-log2(adjusted p-value)", main = contrastNames[i], col = "grey")
        
        #####  Highlight genes with logFC above specified threshold
        points(topGenes[abs(topGenes[, "logFC"]) > lfcThreshold, "logFC"], -log2(topGenes[abs(topGenes[, "logFC"]) > lfcThreshold, "adj.P.Val"]), cex = 0.5, pch = 16)
        abline(h = -log2(pThreshold), col = "red", lty = 2)
        
        #####  Label top 10 most significant genes
        ord <- order(-log2(topGenes[, "adj.P.Val"]), decreasing = TRUE)
        top10 <- ord[1:10]
        text(topGenes[top10, "logFC"], -log2(topGenes[top10, "adj.P.Val"]), labels = rownames(data_combat[top10,]), cex = 0.6, col = "blue")
        dev.off()
    }

     

    The volcano plot shows the number of genes with log2 fold-change > 2 or < -2 (black points) and adjusted p-value < 0.0001 (above the dashed red line)

    file 'Integ_NormalvsTumour_volcano_plot.pdf':

     

    Integ_NormalvsTumour_volcano_plot

     

     

    Clustering analysis and heatmap

    Finally, perform clustering analysis and produce a heatmap (actually two, each with exactly the same clustering results but a different colour scale, for your preference)

    #####  Select top differentially expressed genes
    topGenes <- data_combat[unique(topGenes.index),]
    
    
    #####  Compute distance matrix and hierarchical clustering
    hr <- hclust(as.dist(1-cor(t(topGenes), method = "pearson")), method = "ward.D2")
    hc <- hclust(as.dist(dist(t(topGenes), method = "euclidean")), method = "ward.D2")
    
    
    #####  Generate dendrogram
    pdf("Integ_DE_dendrogram.pdf", width = 0.2*ncol(data)+2, height = 6, pointsize = 12)
    par(mar = c(2,5,2,0))
    plot(hc, xlab="", labels=paste(colnames(data_combat), targets, sep="       "), hang = -1, main="")
    dev.off()
    
    
    #####  ...heatmap (blue-red colour scale)
    pdf("Integ_DE_heatmap_blue_red.pdf", width = 6, height = 10, pointsize = 12)
    heatmap.2(as.matrix(topGenes), Rowv = as.dendrogram(hr), Colv=as.dendrogram(hc), col = colorRampPalette(c("blue","white","red"))(100), scale = "row", ColSideColors = targets.colour[[2]], margins = c(2, 6), labRow = "", labCol = "", trace = "none", key = TRUE)
    
    #####  Add the legend
    legend("topright", legend = levels(factor(targets)), fill = targets.colour[[1]], box.col = "transparent", cex=1.2)
    dev.off()
    
    
    #####  ...heatmap (green-red colour scale)
    pdf("Integ_DE_heatmap_green_red.pdf", width = 6, height = 10, pointsize = 12)
    heatmap.2(as.matrix(topGenes), Rowv = as.dendrogram(hr), Colv=as.dendrogram(hc), col = greenred(75), scale = "row", ColSideColors = targets.colour[[2]], margins = c(2, 6), labRow = "", labCol = "", trace = "none", key = TRUE)
    
    #####  Add the legend
    legend("topright", legend = levels(factor(targets)), fill = targets.colour[[1]], box.col = "transparent", cex=1.2)
    dev.off()

     

    You should get the following clustering results:

    file 'Integ_DE_dendrogram.pdf':

    dendrogram with sample names

     

    Integ_DE_dendrogram

     

     

    ... and two heatmaps, one in a blue-red colour scale and the other in a green-red colour scale, with blue and green indicating down-regulation, respectively, and red indicating up-regulation of the corresponding genes (probesets).

    files 'Integ_DE_heatmap_blue_red.pdf' and 'Integ_DE_heatmap_green_red.pdf':

     

    Integ_DE_heatmap

     

    =========================================================================================================

     

    jmarzec Mon, 08/08/2016 - 23:57

    Day 6: NGS & Biomedicine

    Day 6: NGS & Biomedicine

    During the sixth day, we will introduce the applications of NGS in biomedicine, so-called personalised medicine.

    lpryszcz Wed, 06/22/2016 - 10:32

    0. Programme

    0. Programme

    9.00h-10.00h: Applications of NGS in Biomedicine (Presentation)

    10.00h-11.00h: .bam file processing and variant calling, 1st part (Hands-on)

    11.00h-11.15h: Coffee break

    11.15h-12.30h: .bam file processing and variant calling, 2nd part (Hands-on)

    12.30h-13.00h: Discussion and example of a web-based variant browser (Presentation)

    sderdak Wed, 07/20/2016 - 15:26

    1. Presentation: Applications of NGS in Biomedicine

    1. Presentation: Applications of NGS in Biomedicine

    Please find accompanying literature on the course server at:

    Day6_NGS_Biomedicine_Sophia/literature

    sderdak Wed, 07/20/2016 - 15:33

    2. Hands-on: .bam file processing and variant calling, pt1

    2. Hands-on: .bam file processing and variant calling, pt1

    Please find genomic resources, executables and input and intermediate files for the exercise on the course server at:

    Day6_NGS_Biomedicine_Sophia/resources

    Day6_NGS_Biomedicine_Sophia/tools

    Day6_NGS_Biomedicine_Sophia/hands-on/1stexample

    1st example

    For this example, we extracted a 1000-base region of a whole-genome paired-end sequencing experiment of a tumor and its matched control sample on the Illumina platform. This region contains a couple of variants, one of which appears to be a somatic variant only present in the tumor sample. We will use this example to go through some of the typical steps of a variant-calling analysis, visualize the intermediate output files in a genome browser and take a detailed look at the file formats.

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/1_full alignments:

    • Open the Integrative Genomics Viewer
    java -jar igv.jar
    • Load the two separate full bam files
    • Select the correct genome to be displayed: Human (b37)
    • Select the region covered by the bam file: chr9:131456000-131457000
    • Explore the zooming and navigating features
    • Inspect the variant position: chr9:131456174 ; in which gene does this position fall? Which bases are read in the control and tumor samples, respectively?
    • Leave the IGV open with the two bam files loaded
    • Inform yourself on the options of samtools view:
    samtools view
    • Inspect the bam files and their header section running samtools view:

    samtools view -h control.bam | less
    • According to the bam file header, what is the sort order of the alignments and which program was used for the alignment?

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/2_with_readgroups:

    • Load the two separate full bam files with readgroups in the IGV (in addition to the full ones from the previous step)
    • Which differences do you see when you mouse over the reads from the first step vs the read-group step? Which additional information is extracted from the read group-containing bam files?
    • Close the two bam files from the first step and their coverage tracks in the IGV; keep the two bam files with read groups open.
    • Inspect the bam files and their header section running samtools view, as in the previous step. Which lines have been added to the header? Which fields have been added to the read records?

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/3_after_rmdup:

    • Load the two separate bam files after rmdup in the IGV (in addition to the full ones with read groups from the previous step)
    • Can you see a difference?
    • Inspect the bam files and their header section. What information has been added to the header in the rmdup step?
    • Run samtools view on the full bam files from step 1 or 2 and count the lines, then run samtools view on the bam files after rmdup and count the lines. How many reads have been removed in the control and in the tumor bam file in the rmdup step, respectively?
    samtools view control.bam | wc -l
    • Close the two bam files with read groups and their coverage tracks in the IGV; keep the two bam files after rmdup open.
    • Inform yourself on the options of the GATK core engine and the realignment tests:
    java -jar GenomeAnalysisTK.jar -h
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -h
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -h
    • Use the bam files after rmdup as input to run the local realignment in two steps:
      • Run GATK RealignerTargetCreator:
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -I tumor_rmdup.bam -I control_rmdup.bam -R human37.fa --known 1000G_phase1.indels.hg19.vcf --known Mills_and_1000G_gold_standard.indels.hg19.vcf -o IndelRealigner.intervals -L chr9
      • Run GATK IndelRealigner:
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -I tumor_rmdup.bam -I control_rmdup.bam -R human37.fa -targetIntervals IndelRealigner.intervals -L chr9 -o samplepair.bam

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/4_realigned:

    • Use the realigned bam file you generated yourself or the one provided in the folder.
    • Inspect the bam file and its header section. What information has been added to the header in the realignment step?
    • Run samtools view on the samplepair bam file and count the lines; compare to the numbers you counted for the two bam files after rmdup.
    • Load the samplepair bam file in the IGV.
    • In the IGV, group alignments by read group. 
    • Inspect the variant position again: chr9:131456174

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/5_pileup:

    • The files in this folder were generated by running samtools mpileup on the whole genomic region covered by the bam files. Use them to check your own results. We will, however, use the control.RG and tumor.RG files for excluding specific read groups in one of the exercises.
    • Inform yourself on the options of samtools mpileup:
    samtools mpileup
    • For practise, use the samplepair bam file as input and run samtools mpileup for the genomic interval shown in the presentation:
    samtools mpileup -f human37.fa -r chr9:131456174-131456185 samplepair.bam
    • What is the total coverage at our variant position of interest at chr9:131456174 ?
    • Run samtools mpileup for the same interval, but exclude the tumor or the control sample, respectively:
    samtools mpileup -f human37.fa -r chr9:131456174-131456185 -G tumor.RG samplepair.bam
    samtools mpileup -f human37.fa -r chr9:131456174-131456185 -G control.RG samplepair.bam
    • What is the coverage at our variant position of interest at chr9:131456174 in the control and tumor sample, respectively, using these commands?
    • What is the coverage at our variant position of interest at chr9:131456174 in the control and tumor sample, respectively, setting the base quality cutoff to 20? And setting the base quality cutoff to 30?
    samtools mpileup -f human37.fa -r chr9:131456174-131456185 -Q 20 -G tumor.RG samplepair.bam
    samtools mpileup -f human37.fa -r chr9:131456174-131456185 -Q 20 -G control.RG samplepair.bam
    • Run samtools mpileup on the whole range of the bam file and browse through the output. Do you find any other possible variant positions, either in both samples or in one of them, apart from chr9:131456174 ?
    samtools mpileup -f human37.fa samplepair.bam
    samtools mpileup -f human37.fa -G tumor.RG samplepair.bam
    samtools mpileup -f human37.fa -G control.RG samplepair.bam

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/6_vcf_samtools:

    • Inform yourself on the options of bcftools view:
    bcftools view
    • Generate the vcf file in this folder yourself running samtools mpileup piped to bcftools view:
    samtools mpileup -DSug -f human37.fa samplepair.bam | bcftools view -vcg - > variants.vcf
    • Look inside the vcf file: How many variants does your vcf file contain? How many of them are SNPs, how many are INDELs? (a small counting sketch follows this list)
    • Which is the most trustworthy position considering per-sample genotype qualities?
    • Which is the most trustworthy position considering the per-sample genotype likelihoods?
    • Which is the most trustworthy position considering all-over call quality?
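
    A small counting sketch for the questions above (assuming the samtools/bcftools pipeline tags indel records with INDEL in the INFO column; the SNP count is then the total minus the indels):

    grep -vc '^#' variants.vcf                   # total number of variant records
    grep -v '^#' variants.vcf | grep -c INDEL    # records reported as indels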

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/7_vcf_GATK:

    • Inform yourself on the options of the GATK UnifiedGenotyper test:
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -h
    • Generate the vcf file in this folder yourself running GATK UnifiedGenotyper:
    java -jar GenomeAnalysisTK.jar -R human37.fa -T UnifiedGenotyper -I samplepair.bam -o variants_GATK.vcf -L chr9:131456000-131457000 -glm BOTH -dt NONE
    • Look inside the vcf file: How many variants does your vcf file contain? How many of them are SNPs, how many are INDELs?
    • Compare the positions in this vcf file with the ones in the vcf file generated by samtools. Which positions are in both files, which are only in one of them?
    • Look at the FORMAT field: Which values are output by samtools, and which by GATK? Look at those output by both programs, are they identical/similar/different?
    • Open the vcf file as another track in the IGV along with the samplepair bam file and inspect how the variant positions are displayed there.

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/8_vcf_annotated_snpEff:

    • Inform yourself on the options of the snpEff annotations command:
    java -jar SNPEFF/snpEff.jar eff
    • Generate the annotated vcf file in this folder yourself running snpEff with the samtools vcf file as input:
    java -jar SNPEFF/snpEff.jar eff -c SNPEFF/snpEff.config -noLog -o vcf GRCh37.65 variants.vcf > annotated_variants.vcf
    • Look inside the vcf file: How many different transcripts are annotated at each of the variant positions?
    • How many variants overlap coding regions? Do they have an effect on amino acid encoding?
    • What is the highest functional effect (HIGH, MODERATE, LOW or MODIFIER) annotated for each variant position? (See the grep sketch after this list.)
    • Have a look at the additional files that are created by the snpEff command: snpEff_summary.html and snpEff_genes.txt
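    • A rough grep sketch for the impact question above (assumption: depending on the snpEff version, the annotation is stored in an EFF or ANN tag of the INFO column, and the impact keywords HIGH/MODERATE/LOW/MODIFIER appear inside it). The first command shows position plus INFO for eyeballing; the second simply counts how often each keyword occurs anywhere in the file, so use it only as a starting point for the per-position answer:
    grep -v '^#' annotated_variants.vcf | cut -f1,2,8 | less -S
    grep -v '^#' annotated_variants.vcf | grep -o -E 'HIGH|MODERATE|LOW|MODIFIER' | sort | uniq -c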

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/9_vcf_annotated_dbSNP:

    • We are not generating this file during this tutorial; please use the prepared file. Note that it has been generated using a different version of samtools and that the control and tumor files are named N402 and N401, respectively.
    • Look inside the file: How many positions are annotated with dbSNP IDs (rs numbers)?
    • Look up the rs numbers at http://www.ncbi.nlm.nih.gov/SNP
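    • A quick way to count and list the annotated rs numbers (a sketch; the ID column is the third column of a vcf, and FILE.vcf is a placeholder for the prepared file in this folder):
    grep -v '^#' FILE.vcf | cut -f3 | grep -c '^rs'   # FILE.vcf is a placeholder
    grep -v '^#' FILE.vcf | cut -f3 | grep '^rs'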

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/1stexample/10_vcf_filtered:

    • We are not generating this file during this tutorial; please use the prepared file. Note that it has been generated using a different version of samtools and that the control and tumor files are named N402 and N401, respectively.
    • Look inside the file: Which tags have been added to each position? Read the description of each filter tag in the header section of the file and work out from the INFO and FORMAT fields of the positions why the tags were added. (A header grep sketch follows this list.)
    • The position that has been tagged as somatic overlaps a gene. Which gene is it? Look up more information on this gene in the following resources: http://omim.org/ , http://genome.ucsc.edu/cgi-bin/hgTracks , http://www.ncbi.nlm.nih.gov/gap/phegeni
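    • To read the filter tag descriptions without scrolling through the whole file, a simple sketch that prints only the FILTER and INFO definitions from the vcf header (FILE.vcf is again a placeholder for the prepared file in this folder):
    grep '^##' FILE.vcf | grep -E 'FILTER|INFO'   # FILE.vcf is a placeholder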
    sderdak Mon, 08/01/2016 - 11:15

    3. Hands-on: .bam file processing and variant calling, pt2

    3. Hands-on: .bam file processing and variant calling, pt2

    Please find genomic resources, executables and input and intermediate files for the exercise on the course server at:

    Day6_NGS_Biomedicine_Sophia/resources

    Day6_NGS_Biomedicine_Sophia/tools

    Day6_NGS_Biomedicine_Sophia/hands-on/2ndexample

    2nd example: low clonality somatic variant

    For this example, we extracted another 1000-base region of a whole-genome paired-end sequencing experiment from the same sample pair as in the first example, but with higher coverage. This region contains only one variant that is called by the variant calling program. However, we have evidence (say, from previous experiments or the literature) that there may be another somatic variant in the region, with a lower than normal "heterozygous" frequency. This may happen in heterogeneous tumors where only some cells harbour the mutation.

    There will be fewer tasks for this example than for the first one; we will focus on the pileup format of the region and the position of interest, and try a different approach to distinguishing the variant position from the rest.

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/2ndexample/1_realigned:

    • Open the Integrative Genomics Viewer
    java -jar igv.jar
    • Load the samplepair bam file
    • Select the correct genome to be displayed: Human (b37)
    • Select the region covered by the bam file: chr16:81077000-81078000
    • Inspect the variant position: chr16:81077733 ; in which gene does this position fall? Which bases are read in the control and tumor samples, respectively?
    • Leave the IGV open with the bam file loaded

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/2ndexample/2_pileup:

    • The files in this folder were generated by running samtools mpileup on the whole genomic region covered by the bam files. Use them to check your own results. We will, however, use the control.RG and tumor.RG files for excluding specific read groups in one of the exercises.
    • Inform yourself on the options of samtools mpileup:
    samtools mpileup
    • For practice, use the samplepair bam file as input and run samtools mpileup for the genomic position of interest:
    samtools mpileup -f human37.fa -r chr16:81077733-81077733 samplepair.bam
    • What is the total coverage at our variant position of interest?
    • Run samtools mpileup for the same position, but exclude the tumor or the control sample, respectively:
    samtools mpileup -f human37.fa -r chr16:81077733-81077733 -G tumor.RG samplepair.bam
    samtools mpileup -f human37.fa -r chr16:81077733-81077733 -G control.RG samplepair.bam
    • What is the coverage at our variant position of interest in the control and tumor sample, respectively, using these commands?
    • What is the coverage at our variant position of interest in the control and tumor sample, respectively, setting the base quality cutoff to 20? And setting the base quality cutoff to 30?
    samtools mpileup -f human37.fa -r chr16:81077733-81077733 -Q 20 -G tumor.RG samplepair.bam
    samtools mpileup -f human37.fa -r chr16:81077733-81077733 -Q 20 -G control.RG samplepair.bam
    • With each of these base quality cutoff values, what is the proportion of alternative bases of all bases covering this position?
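    • For the proportion question above, a minimal sketch: column 4 of the pileup gives the total depth (the denominator), and counting explicit base letters in column 5 approximates the number of alternative bases (indel strings and read-start markers can also contain letters, so treat the count as approximate and compare it with what you see by eye):
    samtools mpileup -f human37.fa -r chr16:81077733-81077733 -Q 20 samplepair.bam | cut -f4,5
    samtools mpileup -f human37.fa -r chr16:81077733-81077733 -Q 20 samplepair.bam | cut -f5 | grep -o '[ACGTNacgtn]' | wc -l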

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/2ndexample/3_vcf_samtools:

    • Generate the vcf file in this folder yourself by running samtools mpileup piped to bcftools view:
    samtools mpileup -DSug -f human37.fa samplepair.bam | bcftools view -vcg - > variants.vcf
    • Look inside the vcf file.
    • Take another look at the options of samtools mpileup and bcftools view. Can you think of any parameters to tune in order to be less strict, so that our variant of interest gets called or at least output in the vcf file?
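    • One knob worth trying (an assumption based on the bcftools bundled with samtools 0.1.x used in this tutorial, where bcftools view -p FLOAT calls a site variant if P(ref|D) < FLOAT, default 0.5) is to relax that threshold; lowering mpileup's base quality cutoff with -Q is another option. The output name is arbitrary, chosen so you can compare against variants.vcf:
    samtools mpileup -DSug -f human37.fa samplepair.bam | bcftools view -vcg -p 0.99 - > variants_lenient.vcf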

    Files from Day6_NGS_Biomedicine_Sophia/hands-on/2ndexample/4_vcf_GATK:

    • Generate the vcf file in this folder yourself by running GATK UnifiedGenotyper:
    java -jar GenomeAnalysisTK.jar -R human37.fa -T UnifiedGenotyper -I samplepair.bam -o variants_GATK.vcf -L chr16:81077733-81077733 -glm BOTH -dt NONE
    • Look inside the vcf file.
    • Take another look at the options of GATK UnifiedGenotyper. Hint: interesting options to try include -ploidy and -out_mode.
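    • A hedged sketch of what such a run could look like (assumptions: -out_mode EMIT_ALL_SITES forces a record at the position even without a confident call, and a higher -ploidy lets the genotyper model allele fractions below the usual 50% expected for a diploid heterozygote; the output name is arbitrary):
    java -jar GenomeAnalysisTK.jar -R human37.fa -T UnifiedGenotyper -I samplepair.bam -o variants_GATK_lenient.vcf -L chr16:81077733-81077733 -glm BOTH -dt NONE -out_mode EMIT_ALL_SITES -ploidy 4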
    sderdak Mon, 08/01/2016 - 11:18

    Day 7: Recap & Farewell

    Day 7: Recap & Farewell

    During the last day, we will recapitulate the course and address questions and needs raised by the participants during the previous days. 

    lpryszcz Wed, 06/22/2016 - 10:32

    GUIDELINES FOR SPEAKERS

    GUIDELINES FOR SPEAKERS

    Preparing your workshop

    1. Workshop slots are typically 4 hours; this should include a theoretical introduction (often 30 min will be more than enough) and practical exercises (here, the more the better ;) ). 
    2. Each student will have a laptop with Ubuntu installed. Please prepare your exercises so they run even on older laptops in a reasonable amount of time, e.g. for the de novo assembly workshop I'm using a 100 kb region of one chromosome. 
    3. Please add the software you will need for your workshop to this list. We'll cover installation of all dependencies on Day 1.

    Creating workshop materials

    Adding new content

    1. Navigate to the book page under which you want to create content, e.g. under Day 3.
    2. Press `Add child page`.
    3. Edit the content of new page.
    4. Optionally upload slides.
    5. Press `Save and publish`.

    Adding nicely formatted source code / snippets of code

    1. Change `Text format` to `Full HTML`.
    2. Press the `Insert code snippet` button.
    3. Select the syntax language.
    4. Paste your code & press OK.

    Quick Edit: In-place editing of content

    Drupal offers Quick Edit, meaning you can edit any content and see its effect in the current browser window. I strongly recommend using it, as it's very handy. You can access it by clicking the small pencil symbol in the top right corner of any content and selecting `Quick edit`.

    Dedicated server for data & computation

    We have a dedicated server for NGSchool (contact Tomas V. with your preferred username so he can create an account for you). You will be able to access the server through ssh at ngschool.biocenter.sk until approx. Aug 10. We will then pack the server and bring it to NGSchool. After the school, the server will again be available for downloading the data until approx. mid-September. We will also create accounts for all participants at the beginning of the school, so that they can access all the materials comfortably. 

    What needs to be done:

    • If you have any files that you want to be available (data sets, etc.), create a subdirectory in /ngschool and download them there. Please also create a readme.txt file describing what the files are.
    • If you need any packages, software etc. installed on the server, let me know (or, if you know what you are doing, I can also give you sudo access if necessary); it would be great if all the exercises you are planning were doable on the server without any further installation or downloading of large data sets.
    • If you need to put up any web material, create a subdirectory in /var/www/html and put it there (again, please keep a readme.txt file there).


    lpryszcz Wed, 06/22/2016 - 19:06