The goal of this exercise is to get familiar with genome assemblers, pre-processing, reporting and validation. The exercise is based on chromosomes of a mutant genotype of baker's yeast, as practice in de novo genome assembly. Here we have the benefit of a reference genome to validate our assembly.
Actions and commands that you should perform or enter are in black code boxes (though, to save ink, they're probably white if you printed it). Sometimes we also include output, so be careful only to type the actual commands (the text on the same line as the dollar sign) like the following:
After each Unix command you type at the Unix prompt, press the return key.
Many of the commands and text we show in these instructions will have placeholder text like yourusername or jobidnumber which will need to be replaced by your relevant values or names. If you are doing this exercise as part of an instructor-led workshop, your username is probably something like tngs01 or tscnut09. For example, if told to type
/scratch/yourusername you might actually type
/scratch/tngs01, if tngs01 is the username you were assigned.
These instructions are located online at the TSC Portal at http://portal.tandysupercomputing.org/display/desktop/Workshop+-+Genome+Assembly+Hands-on. You can copy/paste these commands into PuTTY if you'd like. To paste in PuTTY, just right click anywhere in its window.
If you need help with this, see Logging In.
You'll start in your home directory, which is for small, permanent files. We'll be working with slightly larger data files in this exercise, so most of it will take place in your scratch directory, which is located at /scratch/yourusername where yourusername should be replaced by your username on Tandy. Change into your scratch directory:
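With yourusername replaced by your own username, the command would look something like this (the /scratch path is taken from the text above):

```shell
cd /scratch/yourusername
```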
The data files we'll be using for this exercise are located at
/opt/apps/common/tandy-ngs/ so you will need to copy them into your scratch directory. This may take a few minutes. Wait for your command prompt to return.
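A typical way to copy the whole directory recursively would be something like the following (the exact flags and destination are a suggestion; adapt as needed):

```shell
# -r copies the directory and everything inside it
cp -r /opt/apps/common/tandy-ngs /scratch/yourusername/
```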
Here's the layout of the files that we'll be working on, which you can generate using the
tree command. We will dive into each directory for each task: fastqc, velvet, and abyss, in that order.
Don't run jobs on the login nodes!
Since we are using the Tandy cluster supercomputer, only very small tasks can be done directly on the login nodes, which include the computer you're logged into using SSH (possibly using PuTTY) for this exercise. For any larger or longer activity, we will have to submit the task as a job to the scheduler, using a submission script.
Most of your folders contain a generic submission script which includes the commands that we use for each task. We will modify these for your individual use. A complete description of how to submit jobs to the Tandy Supercomputer can be found at: http://portal.tandysupercomputing.org/display/desktop/Introduction+to+Batch+Computing. Although these submission scripts are not strictly necessary, it is always a good idea to use a script so you can easily modify parameters, and the script serves as a note to yourself later.
Go to your data directory.
Each group should be working on a different chromosome in yeast. These instructions will show group 1, but other groups should replace the group number with their own.
Our yeast chromosome assembly will start using an Illumina library, PE-350, which is a paired end (fragment) DNA sequencing library, divided into two datasets: .1.fastq and .2.fastq. These are paired sequencing reads very close to 350 bp apart. Also, “ref.fasta” is the included reference sequence for comparison with your assembly. To see what a fastq file looks like, type in the following command to see the top ten lines.
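With no options, `head` prints the first ten lines of a file:

```shell
head PE-350.1.fastq    # shows the first 10 lines, i.e. the first 2.5 reads
```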
What information is included, and what does each line mean?
Four consecutive lines define ONE read in the fastq file:
@READ-ID            The unique name of the sequence, AKA the "read name"
ATAGAGATAGAGAGAG    Sequence: in this example, 16 nucleotides
+READ-ID            Repeats the ID from the first line; optional, may be left empty
;9;7;;.7;3933334    Quality characters for each base, encoding the PHRED score; same length as the sequence (in this example, 16 characters)
Each base has a quality identifier called the PHRED score, typically between 0 and 40 for Illumina. The FASTQ format stores the numeric value as a single character because characters take less space (fewer bits). There are two systems for this conversion, PHRED+33 and PHRED+64. PHRED+33 is used more often, including in newer Illumina protocols, but PHRED+64 still appears sometimes, so be aware!
“Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabilities P.” -Wikipedia.
This chart explains the basic idea:
Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10000                           99.99%
50                    1 in 100000                          99.999%
If your base has a PHRED quality of 20 (i.e. 1% error), PHRED(20)+33 = 53, which corresponds to the character 5 in the reference chart below. Under the PHRED+64 system, 20+64 = 84, which is the character T. When using the table below, can you guess whether a sequence quality score is using PHRED+33 or PHRED+64 encoding? (Hint: use the cheat sheet below. NOTE: the symbol "@" is the minimum possible score for Illumina [I] PHRED quality, and the symbol "J" is the maximum possible score for a Sanger [S] PHRED quality score.)
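You can check this arithmetic at the shell. The `printf '%d' "'c"` idiom prints the ASCII code of character c, so you can convert between quality characters and scores directly:

```shell
printf '%d\n' "'5"                     # prints 53, the ASCII code of '5'
q33=$(( $(printf '%d' "'5") - 33 ))    # 53 - 33 = 20 under PHRED+33
q64=$(( $(printf '%d' "'T") - 64 ))    # 84 - 64 = 20 under PHRED+64
echo "$q33 $q64"                       # both give quality 20
```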
Go back to our FASTQ files, and find out answers to these questions:
Library PE-350 contains a total of ______ reads
Count number of lines using the command line with the command
wc -l PE-350.1.fastq and then divide by 4.
Or - here's a command that will do the arithmetic for you:
expr $(cat PE-350.1.fastq | wc -l) / 4
The base quality in my data is encoded in ________ (Phred+33 or Phred+64?) .
Fastqc runs very quickly, so we can safely run this directly on the login node. Make sure you're in the right directory for your group:
Load the fastqc module so you can access it in your environment. For more information about modules, see Using modules.
Perform the fastqc quality control:
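Together, the two steps would look something like this (the `module load` line is specific to clusters with an environment-modules system, such as Tandy):

```shell
module load fastqc       # cluster-specific: makes fastqc available in your environment
fastqc PE-350.1.fastq    # writes its report files next to the input file
```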
Pro tip! Do this again to the
PE-350.2.fastq file by using the up arrow to bring up the last command, then use arrow keys to move over to change 1 to a 2 and press enter.
For information on fastqc (read this later) you can run
After you've run fastqc, you should clean up your environment by unloading all your current modules:
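On most environment-modules systems this is done with a single command (assuming the standard `module` tool used above):

```shell
module purge    # unload all currently loaded modules
```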
FastQC puts the result files in the same directory as the data - in this case, that's
To save time, we have posted pre-prepared fastqc results on the web for you. Enter the URL, being sure to set the group number to your group (just change the group<GROUPNUMBER> to group1, group2, group3, group4 or group5), to see them:
Instructions to download files like your fastqc results are at Transferring Files. Do this later. FASTQC generates a HTML (website-like) report for each FASTQ file you run. You can use the WinSCP program to find the folder in
/scratch/<username>/tandy-ngs/data/group<groupnumber> which holds the results of your analysis. You then can use WinSCP to transfer the PE-350.1_fastqc folder to your desktop and double-click on file fastqc_report.html (on your desktop) to open it in a browser. But in the interest of time, don't do that right now.
Not all assemblers are the same or give the same results.
Which assembler should I use?
It is in general a good idea to run several assemblers and compare. We are using VELVET and ABYSS in this exercise; both are de Bruijn (K-mer) graph based assemblers.
We'll start with VELVET.
We will use the text-file editor “nano” (using the -w command line option to prevent unwanted line wrapping) to examine our submission script file, which submits our data to the assemblers. Descriptions of how to use nano are located on the Tandy Supercomputer website, under the section “Edit the submission script” at Introduction to Batch Computing.
Using nano: To exit, press Ctrl-x (hold down the Ctrl key and press x). If you have made changes, it will ask if you want to save them. Press 'y' for yes to save. Then just press 'Enter' to accept the filename velvetk31.bsub.
Be sure to change these lines:
export GROUPNUMBER=1 (change to your group number)
On these two lines:
#BSUB -o /home/yourusername/velvet_%J_stdout.txt
#BSUB -e /home/yourusername/velvet_%J_stderr.txt
Change them to use “velvetk31” so your result files are easily recognized:
#BSUB -o /home/yourusername/velvetk31_%J_stdout.txt
#BSUB -e /home/yourusername/velvetk31_%J_stderr.txt
You will also need to make sure that all instances of
yourusername are replaced with your actual username.
Change firstname.lastname@example.org to your actual email address.
You will have to do this for every bsub file you work on.
The first section of our velvetk31.bsub script sets up the scheduler requests (the queueing system) and environment variables. The assembler commands are velveth and velvetg. What do these commands do? Read the velvet website here.
If you understand the script and have changed the groupnumber, press ctrl-x, save the file and submit it to the queue with the
bsub command as follows.
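With LSF (the scheduler that uses bsub), the usual pattern is to redirect the script into the command:

```shell
bsub < velvetk31.bsub    # submits the job described in the script to the queue
```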
What, specifically, was printed on the screen after you pressed enter?
See your pending, running, and recently completed jobs using the following command:
(and you can exclude completed jobs by leaving out the -a option)
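On LSF this is the bjobs command; the -a flag also lists recently finished jobs:

```shell
bjobs -a    # pending, running, and recently completed jobs
```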
The log file in your home directory,
velvetk31_<jobidnumber>_stdout.txt is very useful. Options to examine that file (Hint: for the first 2, try using TAB-completion to enter your commands):
the less command, using the ~ shortcut:
In Linux, the tilde (~) is a shortcut that means "my home directory."
Find this information from the log file:
Paired end library insert size: _____________ Standard deviation ______________
contig stats: n50 ________ max _______ total __________ reads used ____________ (did it use all the reads?)
Does the assembly get better if I use a different K-mer size? (you can try this later if we have time)
To try different kmers, first copy your bsub script to a new file with a new name:
And change K to a different value (it's currently 31, so try 21 or 25.)
then submit this new job:
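Putting those three steps together, it would look something like this (the new filename velvetk21.bsub is just a suggested name matching the K=21 run):

```shell
cp velvetk31.bsub velvetk21.bsub
nano -w velvetk21.bsub    # change K to 21, and rename the log files to match
bsub < velvetk21.bsub     # submit the new job
```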
To see kmer coverage differences in a histogram format, we can use the following Perl script on the velvet stats file in each of your results folders:
CONGRATULATIONS on your first assembly!
Also clear out your currently loaded modules, if any.
We change to a subdirectory because abyss puts all its output into the current working directory.
ABYSS is simple: it has just one command to run the entire pipeline. First, edit abyssk31.bsub to change some lines for your use.
Change email@example.com to your actual email.
Place your real username in these lines, while also giving the output files friendly "abyss31" names:
Then change your groupnumber to your correct group number in the following line:
When ABYSS is finished there is one file of particular interest: `coverage.hist`. This is the K-mer histogram, a two-column file where the first column is coverage and the second is the number of K-mers with that coverage. Plot this K-mer histogram however you want (Excel, R, etc.). Only the first 50 lines are very useful, so we will help you shrink the file to include just the first 50 lines.
To look at the first 50 lines, enter:
To create a new file that only includes these first 50 lines:
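Both steps use `head`, which prints the first N lines of a file; redirecting its output creates the smaller file:

```shell
head -50 coverage.hist                # look at the first 50 lines
head -50 coverage.hist > short.hist   # save just those lines to a new file
```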
EMAIL the ‘short.hist’ file to yourself to look at it in a spreadsheet. (enter subject, press enter, type a short message, press enter, then press ctrl-d to send)
To determine the final assembled size of the genome, use the linux command “grep” at the command line to extract that information from the output file:
For more information, see: https://groups.google.com/forum/#!topic/abyss-users/RdR6alqM7e8
To determine the highest K-mer coverage in the histogram use:
and write down your answer: ___________
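One way to find it with standard Unix tools, assuming the two-column coverage.hist described above (coverage, then K-mer count): sort numerically by the count column and take the last row.

```shell
# Coverage value (column 1) of the row with the most K-mers (column 2)
sort -k2,2n coverage.hist | tail -1 | awk '{print $1}'
```

Note that in real data the error peak at coverage 1-2 can have the largest counts, so you may need to ignore the first few rows.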
You can estimate genome size just based on this K-mer histogram (why do you want to do that?):
Genome_size = Total_Kmers / peak_Kmer_coverage
Use the formula to estimate the genome size based on the K-mer histogram: ______________ bp
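A small awk sketch of that formula: sum all K-mer observations (coverage times count) for Total_Kmers, find the coverage with the largest count for the peak, then divide. This simple version does not exclude the low-coverage error peak, which real data may require.

```shell
awk '{ total += $1 * $2                          # Total_Kmers
       if ($2 > best) { best = $2; peak = $1 } } # peak_Kmer_coverage
     END { printf "estimated genome size: %d bp\n", total / peak }' coverage.hist
```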
The ~/abyss31_<jobidnumber>_stdout.txt log file is very useful. Find the following information in the log file:
contig stats: n50 ________ max _______ total __________ (compare to the estimate based on K-mer)
scaffold stats: n50 ________ max _______ total __________
Where are the results stored?
Now ABYSS is done. Save the results and rename it!
Luckily, assemblers run fast. Let's run two additional K-mer options using ABYSS. Why? Because the current value K=31 may not be the best! Try 21 and 25. Most assemblers need odd numbers, so 24 won't work.
First create a new directory:
(you should be in the ‘abyss’ directory now, to check use “print working directory”)
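For example (the directory name k21 is just a suggestion to match the K=21 run):

```shell
pwd          # "print working directory": confirm you are in the abyss directory
mkdir k21    # a new directory for the K=21 run
cd k21
```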
Change the output filenames and your job's friendly name, and change K from 31 to 21:
Save your new submission file, and submit it:
When it’s done, your output files will automatically have the new K-mer in their names. Then repeat the process for a Kmer value of 25 and submit. Copy the result to your results folder.
A contig is a contiguous length of genomic sequence. A scaffold is composed of ordered contigs and gaps. By far the most widely used statistics for describing the quality of a genome assembly are its scaffold and contig N50s. A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list. The scaffold N50 is calculated in the same fashion but uses scaffolds rather than contigs. The longer the scaffold N50 is, the better the assembly is. However, it is important to keep in mind that a poor assembly that has forced unrelated reads and contigs into scaffolds can have an erroneously large N50.
N50 statistic is a metric of the length of a set of sequences. N50 is the contig length such that using equal or longer contigs produces half the bases (http://en.wikipedia.org/wiki/N50_statistic).
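The N50 definition above can be turned into a few lines of awk, given a file of contig lengths, one per line (lengths.txt is a hypothetical example file): sort the lengths longest first, then walk down them until the running sum reaches half the total.

```shell
sort -rn lengths.txt | awk '
  { sum += $1; len[NR] = $1 }                 # accumulate total and keep sorted lengths
  END { half = sum / 2; run = 0
        for (i = 1; i <= NR; i++) {
          run += len[i]
          if (run >= half) { print len[i]; exit } } }'   # first length reaching half the total
```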
Look at your results.
This directory contains the results from all the programs. If you have a different assembly (e.g. different K-mers) using the same assembler, you should have named them differently. For example,
Abyss has a command to get the scaffold statistic (N50) of an assembly: You can use the command `abyss-fac`, which scans the contigs lengths and outputs the N50 for your assembly. We will cover K-mer comparisons later but thought you might like to know:
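For example (the scaffolds filename below is a placeholder; use the actual scaffolds FASTA produced by your run):

```shell
abyss-fac genome-scaffolds.fa    # prints length statistics including N50
```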
Make sure that you validate the results before releasing your assembly to the public. Some assemblies may appear to have large contigs and scaffolds, but are wrong. Compare and check the assemblies using QUAST; let's evaluate with quast.py (website).
Go into your results directory if you're not there already.
Edit the quast.bsub file in the usual way (nano -w quast.bsub), setting your username in the output log files' paths, and setting your group number near the bottom:
Submit the Quast job:
When finished look at the output files to check for errors (protip: TAB autocomplete so you don’t have to type in the jobid):
Press “q” to exit the
less file viewer program.
There are a few options to get the Quast results:
Zip the quast directory and mail the whole thing to yourself (Remember to use Ctrl+D to send)
Just mail the report.pdf file to yourself (Remember to use Ctrl+D to send)
Which assembly has better length statistics? Check out the Quast manual to get answers.
This concludes the exercise.
The data used in this exercise comes from a mutant yeast, using a novel method to generate the mutant. Our data comes from an individual called MUTATOR4.
This exercise was originally presented at Oklahoma State University by Dr. Haibao Tang, JCVI.
The data (we partitioned the reads to chromosomes so that assembly ran faster in workshop): http://www.ncbi.nlm.nih.gov/sra/DRX001304
See some slides here: http://ged.msu.edu/angus/tutorials-2013/files/2013-june-18-msu.pdf
Another de novo assembly tutorial: http://www.cbs.dtu.dk/courses/27626/Exercises/denovo_exercise.php (uses quake & jellyfish)
Github repository with lots of slides etc: https://github.com/lexnederbragt/denovo-assembly-tutorial