assembly_2019_exercise

In this exercise, we will do two projects: Illumina short reads assembly and PacBio long reads assembly. Both are bacterial genomes. First do assembly with paired-end Illumina data.

Part 1. Trim adapters from raw reads

mkdir /workdir/$USER 

cd /workdir/$USER

cp /shared_data/assembly_workshop_2019/*fastq.gz ./

The Illumina adapter sequences (either Tru-seq or Nextera) and low quality sequences should be removed from the reads before assembly. Trimmomatic is a software for this purpose.

java -jar /programs/trimmomatic/trimmomatic-0.39.jar PE -phred33 SRR1982238_1.fastq.gz SRR1982238_2.fastq.gz r1.fastq.gz u1.fastq.gz r2.fastq.gz u2.fastq.gz ILLUMINACLIP:/programs/trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:150

There will be 4 new files created after this step, r1.fastq.gz, r2.fastq.gz, u1.fastq.gz, u2.fastq.gz. We will use the r1, r2 and u1 files for next step.

The r1 and r2 files contain paired-end reads after trimming. The u1 and u2 files contain single-end reads after trimming, due to one read of the pair too short after trimming. You would find u1 file much larger than the u2 file. These are reads from short DNA fragments (shorter then read length, so that the read pair completely overlap ), Trimmomatic only keep one end of these sequences and put them in the u1 file.

Part 2. Do assembly with SPAdes

[DO IT AT HOME] It could take an hour to finish the run. Run the software in "screen".

SPAdes automatically select k-mer size for you based on reads length (up to k=127 for the installation on BioHPC computers). If you need to define custom k-mer values, you can use "-k" option, e.g. -k 85,95,105,115.

export OMP_NUM_THREADS=6
/programs/spades/bin/spades.py -t 6 --careful --s1 u1.fastq.gz --pe1-1 r1.fastq.gz  --pe1-2 r2.fastq.gz -o spades_run

Part 3. Assessment with QUAST

It would take an hour for the assembly to finish. While you are waiting for the run to finish, you can use the SPAdes results we have prepared for you to finish the rest of the workshop.

cp -r /shared_data/assembly_workshop_2019/spades_results ./ 
cd spades_results

grep "=====" spades.log

grep "genome size" spades.log

You would find several genome sizes estimated based on different kmer sizes. I would use the medium value.

Now we will generate some statistics of the assembly use the software Quast. We will run Quast on both the contig and scaffolds.

/programs/quast-5.0.2/quast.py contigs.fasta

/programs/quast-5.0.2/quast.py scaffolds.fasta

Many report files will be generated. The most important numbers are summarized in the last 10 lines of the report.txt file.

 tail quast_results/*/report.txt

In this particular run, you would find N50 is the same for contigs and scaffolds. But the largest scaffold is much larger than the largest contig.

Part 4. Evaluate with BUSCO

BUSCO is a tool to evaluate the completeness of gene space in your assembly, by checking the presence/absence of common genes of the species you are working on.

First, we need to download the BUSCO dataset that is closest to your genome. Since the genome we assemble here is in the genus of Listaria, and BUSCO has a dataset for Bacillales (the "order" above Listeria) , we will download the file bacillales_odb9.tar.gz.

wget https://busco.ezlab.org/datasets/bacillales_odb9.tar.gz

tar xvfz bacillales_odb9.tar.gz

Now we run BUSCO. Make sure you are in the directory: /workdir/$USER/spades_results

cp /programs/busco-3.1.0/config/config.ini /workdir/$USER/
cp -r /programs/Augustus-3.3.2/config /workdir/$USER/

export AUGUSTUS_CONFIG_PATH=/workdir/$USER/config

export BUSCO_CONFIG_FILE=/workdir/$USER/config.ini

 
export PYTHONPATH=/programs/busco-3.1.0/lib/python3.6/site-packages
export PATH=/programs/busco-3.1.0/scripts:/programs/Augustus-3.3.2/bin:/programs/Augustus-3.3.2/scripts:$PATH

run_BUSCO.py --in ./contigs.fasta --lineage_path ./bacillales_odb9 --mode genome --out busco_bacillales --cpu 8

The results you would want to see is in the file: run_busco_bacillales/short_summary_trinityBUSCO.txt.

cat run_busco_bacillales/short_summary_busco_bacillales.txt

We would like to see the "C" score close to 90% or better. "C" score is the the percentage of BUSCO genes in full length. In this file, you would a string:

The result shows that among the 526 core-genes in bacillales, 510 (97%) are in full length in this assembly, and only 14 (2.6%) genes are missing. This is a really good BUSCO sore.

Part 5. Assemble PacBio data with Canu

[DO IT AT HOME] In this second project, we will assemble PacBio reads of a E. coli genome. It would take an hour to finish. Do it in "screen".

cd /workdir/$USER
cp /shared_data/assembly_workshop_2019/SRR10405213.fastq.gz ./ 
/programs/canu-1.8/Linux-amd64/bin/canu  useGrid=false maxThreads=8 maxMemory=24g -p ecoli -d /workdir/$USER/ecoli genomeSize=4.6m  -pacbio-raw SRR10405213.fastq.gz

*.contigs.fasta: Final filtered assembly result. If there are two alleles for diploids genome, only one allele is included in this file.

*.unitigs.fasta: This file includes all alternative paths and not filtered, e.g. allelic haplotypes are represented as two sequences.

There is also a *.report file which gives you the statistics of the raw data and final assembly, including N50 for the raw reads, error corrected reads and assembly, as well as total length of the assembly.

Here we will use Pilon to polish the genome. Pilon is primarily developed for polishing genomes with Illumina reads. Latest version of Pilon provides limited support for polishing with long reads. To do this, you will first run minimap2 to align the raw reads to the assembled genome:

 /programs/minimap2-2.17/minimap2 -a ecoli/ecoli.contigs.fasta  SRR10405213.fastq.gz | samtools view -b - > ecoli.bam

samtools sort -@ 6  -m 5G -T ./ -o ecoli.sorted.bam ecoli.bam

samtools index ecoli.sorted.bam

java -jar /programs/pilon-1.23/pilon-1.23.jar --genome ecoli/ecoli.contigs.fasta  --pacbio  ecoli.sorted.bam --output ecoli.pilon

Whole genome assembly workshop

Part 1. Trim adapters from raw reads

Part 2. Do assembly with SPAdes

Part 3. Assessment with QUAST

Part 4. Evaluate with BUSCO

Part 5. Assemble PacBio data with Canu