Child pages
  • Workshop - Genome Assembly Hands-on
Skip to end of metadata
Go to start of metadata

Introduction

The goal of this exercise is to get familiar with genome assemblers, pre-processing, reporting and validation. Exercise will be based on chromosomes of a mutant genotype of bakers’ yeast, as practice of de-novo genome assembly. Here we have the benefits of a reference genome to validate our assembly.

Actions and commands that you should perform or enter are in black code boxes (though, to save ink, they're probably white if you printed it). Sometimes we also include output, so be careful only to type the actual commands (the text on the same line as the dollar sign) like the following:

$ ./example_command
This is sample output that should not be typed into your computer.
It is here instead to illustrate approximately what you should expect to see
when you hit enter.

After each Unix command you type at the Unix prompt, press the return key.

Many of the commands and text we show in these instructions will have placeholder text like yourusername or jobidnumber which will need to be replaced by your relevant values or names. If you are doing this exercise as part of an instructor-led workshop, your username is probably something like tngs01 or tscnut09. For example, if told to type /scratch/yourusername you might actually type /scratch/tngs01, assuming tngs01 is the username you were assigned.

These instructions are located online at the TSC Portal at http://portal.tandysupercomputing.org/display/desktop/Workshop+-+Genome+Assembly+Hands-on. You can copy/paste these commands into PuTTY if you'd like. To paste in PuTTY, just right click anywhere in its window.

Data and directory structure

Log into Tandy

If you need help with this, see Logging In.

Copy the data to your scratch directory

You'll start in your home directory, which is for small, permanent files. We'll be working with slightly larger data files in this exercise, so most of it will take place in your scratch directory, which is located at /scratch/yourusername where yourusername should be replaced by your username on Tandy. Change into your scratch directory:

$ cd /scratch/yourusername

The data files we'll be using for this exercise are located at /opt/apps/common/tandy-ngs/ so you will need to copy them into your scratch directory. This may take a few minutes. Wait for your command prompt to return.

$ cp -r /opt/apps/common/tandy-ngs/ .
$ cd tandy-ngs
$ ls
abyss  data  results  velvet

Look at the layout of the data and inputs

Here's the layout of the files that we'll be working on, which you can generate using the tree command. We will dive into each directory for each task: fastqc, velvet, and abyss, in that order.

$ tree
.
|-- abyss
|   `-- abyss31
|       `-- abyssk31.bsub
|-- data
|   |-- group1
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   |-- group2
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   |-- group3
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   |-- group4
|   |   |-- PE-350.1.fastq
|   |   |-- PE-350.2.fastq
|   |   `-- ref.fasta
|   `-- group5
|       |-- PE-350.1.fastq
|       |-- PE-350.2.fastq
|       `-- ref.fasta
|-- results
|   `-- quast.bsub
`-- velvet
    `-- velvetk31.bsub

Important notes about using Tandy

Don't run jobs on the login nodes!

Since we are using the Tandy cluster supercomputer, only very small tasks can be done directly on the login nodes, which include the computer you're logged into using SSH (possibly using PuTTY) for this exercise. For any larger or longer activity, we will have to submit the task as a job to the scheduler, using a submission script.

Some jargon

  • Node. A computer that makes up part of a cluster supercomputer
  • Cluster. A type of supercomputer made up of many regular computers (nodes)
  • Login node. The computer that you connect to directly in order to interact with a cluster supercomputer
  • Compute node. One of the computers in the cluster that actually runs users' jobs.
  • Job. The basic unit of supercomputer work: a run of a program
  • Scheduler: The program that decides whose jobs run on which compute nodes at what times
  • Submission script. A file that defines a job - it contains a list of commands to the scheduler and the supercomputer

Most of your folders contain a generic submission script which includes the commands that we use for each task. We will modify these for your individual use. A complete description of how to submit jobs to the Tandy Supercomputer can be found at: http://portal.tandysupercomputing.org/display/desktop/Introduction+to+Batch+Computing. Although these submission scripts are not strictly necessary, it is always a good idea to use a script so you can easily modify parameters, and the script serves as a note to yourself later.

Sequencer Read processing

Look at your group's data

Go to your data directory.

$ cd data
$ ls
group1  group2  group3  group4  group5

Each group should be working on a different chromosome in yeast. These instructions will show group 1, but other groups should replace the group number with their own.

$ cd group1
$ ls
PE-350.1.fastq  PE-350.2.fastq  ref.fasta

Our yeast chromosome assembly will start using an Illumina library, PE-350, which is a paired end (fragment) DNA sequencing library, divided into two datasets: .1.fastq and .2.fastq. These are paired sequencing reads very close to 350 bp apart. Also, “ref.fasta” is the included reference sequence for comparison with your assembly. To see what a fastq file looks like, type in the following command to see the top ten lines.

$ head PE-350.1.fastq

Understand the FASTQ format

What information is included, and what does each line mean?

Four consecutive lines define ONE read in the fastq file:

@READ-IDThe unique name of the sequence AKA the “read name”
ATAGAGATAGAGAGAGSequence: in this example, 16 nucleotides
+READ-IDRepeat the ID in first line, optional, can be empty
;9;7;;.7;3933334Quality identifiers/characters for each base representing the PHRED score; same length as sequence, in this example: 16 nucleotides

Each base has quality identifier called PHRED score, typically between values 0-40 for Illumina.  The FASTQ file converts the numeric value to a character because they use less space (fewer bits). There are also two systems of such conversion, PHRED+33 and PHRED+64. PHRED+33 is used more often, and in new Illumina protocols. PHRED+64 is sometimes still used also so be aware!

How does this work?

“Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabilities P.” -Wikipedia.

This chart explains the basic idea:

Phred Quality Score

Probability of incorrect base call

Base call accuracy

10

1 in 10

90%

20

1 in 100

99%

30

1 in 1000

99.9%

40

1 in 10000

99.99%

50

1 in 100000

99.999%

If your base has PHRED quality of 20 (i.e. 1% error), PHRED(20)+33=53, which corresponds to the character 5 using the reference chart below. Under PHRED+64 system, 20+64=84, which is character T. When using the table below, can you guess if a sequence quality score is using PHRED+33, or PHRED+64 encoding? (hint: use the cheat sheet below.  NOTE: The symbol “@” is the minimum possible score for Illumina [I] PHRED quality, and the symbol “J” is the maximum possible score for a Sanger [S] PHRED quality score).

 ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
 SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS....................................................
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
 |                   |          |         |         |                  |                     |
33                  53         64         74       84                 104                   126
 S - Sanger/Illumina 1.8+        Phred+33
 I - Solexa/Illumina 1.3/1.5     Phred+64

Questions

Go back to our FASTQ files, and find out answers to these questions:

  • Library PE-350 contains a total of ______ reads

    Hint

    Count number of lines using the command line with the command wc -l PE-350.1.fastq and then divide by 4.

    Or - here's a command that will do the arithmetic for you: expr $(cat PE-350.1.fastq |wc -l) / 4

  • The base quality in my data is encoded in ________ (Phred+33 or Phred+64?) .

How good is your sequencing run? Use FASTQC

Fastqc runs very quickly, so we can safely run this directly on the login node. Make sure you're in the right directory for your group:

$ pwd
/scratch/<yourusername>/tandy-ngs/data/group<groupnumber>

Load the fastqc module so you can access it in your environment. For more information about modules, see Using modules.

$ module load gcc fastqc

Perform the fastqc quality control:

$ fastqc PE-350.1.fastq

Pro tip! Do this again to the PE-350.2.fastq file by using the up arrow to bring up the last command, then use arrow keys to move over to change 1 to a 2 and press enter.

For information on fastqc (read this later) you can run fastqc -h.

After you've run fastqc, you should clean up your environment by unloading all your current modules:

$ module purge

FastQC puts the result files in the same directory as the data - in this case, that's data/group<groupnumber>.

Time saver!

To save time, we have posted pre-prepared fastqc results on the web for you. Enter the URL, being sure to set the group number to your group (just change the group<GROUPNUMBER> to group1 group2, group3, group4 or group5), to see them:

http://www.tandysupercomputing.org/tandybio/group<GROUPNUMBER>/PE-350.1_fastqc/fastqc_report.html
http://www.tandysupercomputing.org/tandybio/group<GROUPNUMBER>/PE-350.2_fastqc/fastqc_report.html

Instructions to download files like your fastqc results are at Transferring Files. Do this later. FASTQC generates a HTML (website-like) report for each FASTQ file you run. You can use the WinSCP program to find the folder in /scratch/<username>/tandy-ngs/data/group<groupnumber> which holds the results of your analysis.  You then can use WinSCP  to transfer the PE-350.1_fastqc  folder to your desktop and double-click on  file fastqc_report.html (on your desktop) to open it in a browser. But in the interest of time, don't do that right now.

Questions

  • For Library PE-350.1
    • Read count: _________
    • Read length: _________
    • Any duplication level? _____, Any adapter sequences present? _____ (hint: look for “over-represented sequences” in report)
  • For Library PE-350.2
    • Read count: _________
    • Read length: _________
    • Any duplication level? _____, Any adapter sequences present? _____ (hint: look for “over-represented sequences” in report)

Assembly

Not all assemblers are the same or give the same results.

Which assembler should I use?

It is in general a good idea to run several assemblers and compare. We are using VELVET and ABYSS in this exercise. All of these assemblers are de-bruijn (K-mer) graph based assemblers.

We'll start with VELVET.

VELVET

Enter the velvet exercise directory

$ cd ../../velvet
$ ls
velvetk31.bsub

Read and understand the submission script, velvetk31.bsub

We will use the text-file editor “nano” (using the -w command line option to prevent unwanted line wrapping) to examine our submission script file, which submits our data to the assemblers. Descriptions of how to use nano are located on the Tandy Supercomputer website, under the section “Edit the submission script” at Introduction to Batch Computing.

$ nano -w velvetk31.bsub

Using nano:  To exit press Ctrl-x (hold down the ctrl key and press x.)  If you have made changes, it will ask if you want to save the changes you made.  Press 'y' to for yes to save. Then just press 'Enter' to accept the filename velvetk31.bsub. 

Be sure to change these lines:

export GROUPNUMBER=1  (change to your group number) 

On these two lines:
#BSUB -o /home/yourusername/velvet_%J_stdout.txt
#BSUB -e /home/yourusername/velvet_%J_stderr.txt


Change them to use “velvetk31” so your result files are easily recognized:
#BSUB -o /home/yourusername/velvetk31_%J_stdout.txt
#BSUB -e /home/yourusername/velvetk31_%J_stderr.txt

You will also need to make sure that all instances of yourusername are replaced with your actual username.

Change yourusername@example.com to your actual email address.

You will have to do this for every bsub file you work on.

The first section of our velvetk31.bsub script sets up the scheduler requests (the queueing system) and environment variables. The assembler commands are velveth and velvetg. What do these commands do?  Read the velvet website here.

If you understand the script and have changed the groupnumber, press ctrl-x, save the file and submit it to the queue with the bsub command as follows.

$ bsub < velvetk31.bsub
Job <jobidnumber> is submitted to queue <queuename>.

What, specifically, was printed on the screen after you pressed enter? 

See your pending, running, and recently completed jobs using the following command:

bjobs -a
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
...

(and you can exclude completed jobs by leaving out the -a)

Examine the job's log file

The log file in your home directory, velvetk31_<jobidnumbmer>_stdout.txt is very useful. Options to examine that file (Hint: for the first 2, try using TAB-completion to enter your commands):

$ tail ~/velvetk31_<jobidnumber>_stdout.txt
$ less ~/velvetk31_<jobidnumber>_stdout.txt
$ bpeek <jobidnumber>

Quit the less command using the q key.

In Linux, the tilde (~) is a shortcut that means "my home directory."

Questions

Find this information from the log file:

Paired end library insert size: _____________   Standard deviation ______________

contig stats: n50 ________  max _______ total __________ reads used ____________ (did it use all the reads?)

Save your results

$ cp velvet31/contigs.fa ../results/velvet31.fasta

Optional: change K-mer size?

Does the assembly get better if I use a different K-mer size? (you can try this later if we have time)

To try different kmers, first copy your bsub script to a new file with a new name:

$ cp velvetk31.bsub velvetk21.bsub
$ nano -w velvetk21.bsub

And change K to a different value (it's currently 31, so try 21, 25.)

then submit this new job:

$ bsub < velvetk21.bsub

See kmer coverages in a histogram format

To see kmer coverage differences in a histogram format, we can use the following Perl script on the velvet stats file in each of your results folders:

$ module load gcc velvet
$ velvet-estimate-exp_cov.pl velvet31/stats.txt
10 | 1 | **********
<snip>
Predicted expected coverage: 18
velvetg parameters: -exp_cov 18 -cov_cutoff 0

CONGRATULATIONS on your first assembly!

ABYSS

Go into your ABYSS exercise directory

Also clear out your currently loaded modules, if any.

$ module purge
$ cd ../abyss/abyss31

We change to a subdirectory because abyss puts all its output into the current working directory.

Understand the submission script

ABYSS is simple as it has just one command to run the entire pipeline. First edit the abyssk31.bsub to change some lines for your use.

$ nano -w abyssk31.bsub

Change yourusername@example.com to your actual email.

Place your real username in these lines, while also giving the output files friendly "abyss31" names:

abyssk31.bsub
#BSUB -o /home/yourusername/abyss31_%J_stdout.txt
#BSUB -e /home/yourusername/abyss31_%J_stderr.txt

Then change your groupnumber to your correct group number in the following line:

abyssk31.bsub
export GROUPNUMBER=<yourgroupnumber>

Submit the ABYSS job

$ bsub < abyssk31.bsub

Analysis & Questions

When ABYSS is finished there is one file of particular interest: `coverage.hist`. This is the K-mer histogram. It is a two-column file with 1st column (coverage), 2nd column (# of K-mers). Plot this K-mer histogram however you want (EXCEL, R, etc.). Only the first 50 lines are very useful, so we will help you shrink the file to include just the first 50 lines.

To look at the first 50 lines, enter:

$ head -n 50 coverage.hist

To create a new file that only includes these first 50 lines:

$ head -50 coverage.hist > short.hist

EMAIL the ‘short.hist’ file to yourself to look at it in a spreadsheet. (enter subject, press enter, type a short message, press enter, then press ctrl-d to send)

$ mail -a short.hist youremail.address@wherever.com

To determine the final assembled size of the genome, use the linux command “grep” at the command line to extract that information from the output file:

$ grep ^Assembled ~/abyss31_<jobidnumber>_stdout.txt
Assembled 563522 k-mer in 245 contigs.

For more information, see: https://groups.google.com/forum/#!topic/abyss-users/RdR6alqM7e8

To determine the highest K-mer coverage in the histogram use:

$ grep median ~/abyss31_<jobidnumber>_stdout.txt

and write down your answer: ___________

You can estimate genome size just based on this K-mer histogram (why do you want to do that?):

Genome_size = Total_Kmers / peak_Kmer_coverage

Hint

To calculate Total_Kmers, Create a third column by multiplying column 1 and column2, and then sum the numbers of the 3rd column

Use the formula to estimate the genome size base on K-mer histogram: ______________ bp

The ~/abyss31_<jobidnumber>_stdout.txt log file is very useful. Find the following information from the log file:

contig stats: n50 ________ max _______ total __________ (compare to the estimate based on K-mer)
scaffold stats: n50 ________ max _______ total __________

Save the results

Where are the results stored?

Now ABYSS is done. Save the results and rename it!

$ cp abyss31-scaffolds.fa ../../results/abyss31.fasta

Optional: Does the assembly get better if I use a different K-mer size?

Changing the K-mer option for assemblies.

Luckily assemblers run fast. Let's run two different additional K-mer options using Abyss. Why? Because the current value K=31 may not be the best! Try 21, and 25. Most assemblers need odd numbers, so 24 won’t work.

First create a new directory:

$ cd ..

(you should be in the ‘abyss’ directory now, to check use “print working directory”)

$ pwd 
/scratch/username/tandy-ngs/abyss
$ mkdir abyss21
$ cp abyss31/abyssk31.bsub abyss21/abyssk21.bsub
$ cd abyss21
$ nano -w abyssk21.bsub

 

Change the output filenames, your job's friendly name, and change K from 31 to a different value of 21:

abyssk21.bsub
#BSUB -o /home/yourusername/abyss21_%J_stdout.txt
#BSUB -e /home/yourusername/abyss21_%J_stderr.txt
 
...
 
#BSUB -J "abyss21"

...
 
export K=21

Save your new submission file, and submit it:

$ bsub < abyssk21.bsub

When it’s done, your output files will automatically have the new K-mer in their names. Then repeat the process for a Kmer value of 25 and submit. Copy the result to your results folder.

$ cp abyss21-scaffolds.fa ../../results/abyss21.fasta
$ cp abyss25-scaffolds.fa ../../results/abyss25.fasta

 

Reporting

A contig is a contiguous length of genomic sequence. A scaffold is composed of ordered contigs and gaps. By far the most widely used statistics for describing the quality of a genome assembly are its scaffold and contig N50s. A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list. The scaffold N50 is calculated in the same fashion but uses scaffolds rather than contigs. The longer the scaffold N50 is, the better the assembly is. However, it is important to keep in mind that a poor assembly that has forced unrelated reads and contigs into scaffolds can have an erroneously large N50.

N50 statistic is a metric of the length of a set of sequences. N50 is the contig length such that using equal or longer contigs produces half the bases (http://en.wikipedia.org/wiki/N50_statistic).

Look at your results.

$ cd ../../results
$ ls

This directory contains the results from all the programs. If you have a different assembly (e.g different K-mers) using the same assembler, you should have named them differently. For example, abyss25.fasta vs. abyss31.fasta.

Abyss has a command to get the scaffold statistic (N50) of an assembly: You can use the command `abyss-fac`, which scans the contigs lengths and outputs the N50 for your assembly. We will cover K-mer comparisons later but thought you might like to know:

$ module load icc/13.0.1 openmpi/1.6.5 abyss
$ abyss-fac velvet31.fasta

Validation using Quast

Run Quast

Make sure that you validate the results before releasing your assembly to the public. Some assemblies may appear to have large contigs and scaffolds, but are wrong. Compare and check the assemblies using Quast. So let's evaluate with quast.py (website).

Go into your results directory if you're not there already.

Edit the quast.bsub file in the usual way (nano -w quast.bsub), setting your username in the output log files' paths, and setting your group number near the bottom:

quast.bsub
#BSUB -o /home/yourusername/quast_%J_stdout.txt
#BSUB -e /home/yourusername/quast_%J_stderr.txt
 
...
 
export GROUPNUMBER=1

Submit the Quast job:

$ bsub < quast.bsub

When finished look at the output files to check for errors (protip: TAB autocomplete so you don’t have to type in the jobid):

$ less ~/quast_<jobidnumber>_stderr.txt
$ less ~/quast_<jobidnumber>_stdout.txt 

Press “q” to exit the less file viewer program.

Look at the Quast results

There are a few options to get the Quast results:

  1. Zip the quast directory and mail the whole thing to yourself (Remember to use Ctrl+D to send)

    $ zip -r quast.zip quastresults
    $ mail -a quast-zip <youremailaddress>
  2. Just mail the report.pdf file to yourself (Remember to use Ctrl+D to send)

    $ mail -a quastresults/report.pdf <youremailaddress>
  3. Grab the quast directory with WinSCP as described at Transferring Files

Which assembly has better length statistics? Check out the Quast manual to get answers.

Conclusions and background

This concludes the exercise.

The data used in this exercise comes from a mutant yeast, using a novel method to generate the mutant. Our data comes from an individual called MUTATOR4.

This exercise was originally presented at Oklahoma State University by Dr. Haibao Tang, JCVI.

The paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3364565/

The data (we partitioned the reads to chromosomes so that assembly ran faster in workshop): http://www.ncbi.nlm.nih.gov/sra/DRX001304

Further reading

http://kmergenie.bx.psu.edu/  “kmergenie”

See some slides here: http://ged.msu.edu/angus/tutorials-2013/files/2013-june-18-msu.pdf

Another de novo assembly tutorial: http://www.cbs.dtu.dk/courses/27626/Exercises/denovo_exercise.php (uses quake & jellyfish)

Good outline: http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/De_novo_assembly

Github repostiory with lots of slides etc: https://github.com/lexnederbragt/denovo-assembly-tutorial

  • No labels