In this exercise, you will do two projects: Illumina short-read assembly and PacBio long-read assembly. Both are small bacterial genomes. When you work with large eukaryotic genomes that require more memory and CPU cores to assemble, you will need to use more powerful machines on BioHPC.
Some steps could take very long time to run (marked as [DO IT AT HOME] in the instructions). Please skip these steps for now. Use the result files we provide you to continue with. You can finish the "DO IT AT HOME" steps later during the week.
1.1 Copy the data files to you working directory.
You should see 5 files.
SRR1982238_1.fastq.gz & SRR1982238_2.fastq.gz: Illumina short-read data;
SRR10405213.fastq.gz: PacBio (CLR) long-read data;
spades_results.tar.gz: assembly results from short-read data;
canu_results.tar.gz: assembly results from long-read data.
1.2 Getting familiar with "screen"
If you do not know about Linux "screen" or "tmux" commands, now it is time to get familiar with it.
Most of the tools you will be using in this exercise take long time to finish. You will need to use the "screen" persistent sessions to run the software, so that you can safely detach from the session, and the job will keep running in the background on the server.
2.1 Trim adapters and low quality sequences from the raw reads.
The Illumina adapter sequences (either Tru-seq or Nextera) and low quality sequences should be removed from the reads before assembly. Trimmomatic is a software for this purpose.
When you work with your own data, you need to adjust at least two of the parameters:
"MINLEN:150" needs to be adjusted based on your original read length. The data used here is 250x2 bp. "MINLEN:150" would only keep reads >150bp after trimming. If your input data is in the most common format "2x150 bp", you can use "MINLEN:100".
"ILLUMINACLIP:/programs/trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:10". The file "TruSeq3-PE-2.fa" is a fasta file with adapter sequences. Your sequencing service provider should be able to tell you what adapters were used. In the directory "/programs/trimmomatic/adapters/", you can find other adapter sequences. If you are not able to find the adapter information, one option is to use other trimming tools like "bbduk" (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/), which keep all adapter sequences in a single file.
Trimmomatic produces 4 new files: r1.fastq.gz, r2.fastq.gz, u1.fastq.gz, u2.fastq.gz. We will use the r1, r2 and u1 files for genome assembly.
The r1 and r2 files contain reads that are still paired after trimming. The u1 and u2 files contain single-end reads, due to the other read of the pair too short after trimming. You would find the u1.fastq.gz file much larger than the u2.fastq.gz file. These are reads from short DNA fragments (shorter then read length, so that the two reads of the pair completely overlap ). Trimmomatic only keep one end of the pair and put it in the u1 file.
2.2 Run SPAdes
[DO IT AT HOME] It could take an hour to finish this step. Skip it for now.
When you do this step at home, make sure to run it in "screen" session.
SPAdes automatically try multiple k-mer sizes based on the input reads length (up to k=127 for the installation on BioHPC computers). You can also define k-mer values by yourself using the "-k" option, e.g. -k 85. It would be useful if you need to assemble many genomes, a single fixed "-k" would make it a lot faster, but the quality might be slightly compromised.
The output files are in the directory "spades_run".
2.3 Evaluate genome contiguity with QUAST
As you skipped step 2.2, we have pre-run SPAdes results for you to finish the rest of the workshop.
De-compress the "spades_results.tar.gz" file in your working directory.
A few files of interest in the directory "spades_results":
Let's examine the spades.log file in the output directory.
Check the steps in the SPAdes pipeline:
You would see that "SPAdes" actually assembled the genome with different k-mer sizes from K21 to K127, and give you the results with the best k-mer.
What is the estimated genome size?
You would find genome sizes estimated based on different k-mers. I would use the medium value to represent the genome size.
Now we will generate some statistics of the assembly using the software Quast. You will run Quast on both the contig and scaffolds.
Many report files will be generated under the new directory "quast_results". The most important numbers are summarized in the last 10 lines of the report.txt file.
The numbers we are interested are:
Largest contig, Total length, N50.
In this particular run, you would find N50 is the same for contigs and scaffolds. But the largest scaffold is much larger than the largest contig.
2.4 Install BUSCO with Conda
BUSCO is the software to evaluate the genome completeness. You will install BUSCO version 4.1.4 using Conda.
Note (5/17/2017): Installing BUSCO with Conda could be very slow, and sometimes not working. You can skip this step and move one to 2.5
Install miniconda3 in your home directory if it has not been installed.
The “chmod u+x” command makes the file executable.
During the installation, you will be asked for multiple questions:
1) “Please, press ENTER to continue”: press “ENTER” key;
2) “More”: keep pressing “SPACE” key until you reach next question;
3) “Do you accept the license terms?”: enter “yes”
4) “Miniconda3 will now be installed into this location … ”: press “ENTER” key and accept the default “/home/xxxxx/miniconda3”.
5) “Do you wish the installer to initialize Miniconda3” : Press ENTER to accept the default “no”;
Install BUSCO with Conda
2.5 Evaluate genome completeness with BUSCO
When working with real data, BUSCO could take a long time to finish, it is a good idea to run BUSCO in "screen".
Detailed instructions how to run BUSCO can be found at https://busco.ezlab.org/busco_userguide.html#running-busco
Note (5/17/2017): If you skipped step 2.4. You can activate Busco pre-installed on BioHPC .
Retrieve available lineages and save to a file named "lineages_list". Examine the file content.
Press "q" to exit "less" command
Run busco command, set the lineage bacillales_odb10, and let BUSCO to determine the "augustus_species" for you.
The results are saved in the directory "busco_scaffold":
We would like to see the "C" score 90% or better for a good genome assembly. "C" score is the the percentage of BUSCO genes in full length.
The result shows that among the 450 core-genes in bacillales, 434 (96.4%) are in full length in this assembly, and only 15 (3.4%) genes are missing. This is a really good BUSCO sore.
After you are done with BUSCO, you would want to kill the "screen" session where BUSCO was running by pressing "ctr-d".
In this second project, we will assemble PacBio reads of a E. coli genome.
The data file is downloaded from https://www.ebi.ac.uk/ena/data/view/SRR10405213. The expected genome size is 4.6MB.
3.1 Run CANU
[DO IT AT HOME] It would take an hour to finish. Skip this step for now.
Run canu in "screen".
The assembled contigs will be in the directory ecoli_run.
3.2 Polish the genome assembly.
We have pre-run the software canu. The results are in the file canu_results.tar.gz.
You would see up to three fasta files:
*.contigs.fasta: Final filtered assembly result. If there are two alleles for diploids genome, only one allele is included in this file.
*.unitigs.fasta: This file includes all alternative paths, e.g. allelic haplotypes are represented as two sequences.
*.unassembled.fasta: A fasta file of contigs with poor support and not included in *.contigs.fasta.
There is also a *.report file which gives you the statistics of the raw data and final assembly, including N50 for the raw reads, error corrected reads and assembly, as well as total length of the assembly.
Here you will run Racon to polish the genome. First, you will run minimap2 to align the raw reads to the assembled genome:
Now run Racon ("-t 4": run on 4 threads):
The corrected fasta file is written into a new file "racon_corrected.fasta".