In this exercise, you will do genome annotation using Braker2 (https://github.com/Gaius-Augustus/BRAKER) and PASA (https://github.com/PASApipeline/PASApipeline/wiki), and will integrate the results with EVM (https://evidencemodeler.github.io/).
Some steps could take very long time to run (marked as [DO IT AT HOME] in the instructions). Please skip these steps for now. Use the result files we provide you to continue with. You can finish the "DO IT AT HOME" steps later during the week.
1.1 Copy the data files to you working directory.
1.2 Examine the input files
You will use 4 of these files as input for this exercise. The rest are intermediate results from the pipeline.
1.3 Install GeneMark Ex.
GeneMark-EX, one of the dependencies of Braker2, requires an academic license. You will need to download the software and the license files by yourself.
Go to the GeneMark web site: http://exon.gatech.edu/GeneMark/license_download.cgi. Check the boxes for "GeneMark-ES/ET/EP ver 4.61_lic" and "LINUX 64" next to it, fill out the form, then click "I agree". In the next page, right click and copy the two link addresses: "download program" and "64 bit" license, and paste the two link addresses in the commands below. (To be compatible with braker2 version, use the gmes_linux_64.tar.gz file we provided, which is compatible with the braker2 in our docker container. The file is located in /shared_data/genomic_2020/project2/ )
1.4 Pull Docker images for Braker2 and PASA from the Dockerhub
A Docker image file is a template file that can be used to start a Docker container, an enclosed virtual Linux instance. All BioHPC computers run CentOS Linux 7.5. But within a Docker container, you can run a different type of Linux, for example, Ubuntu.
The Docker image of biohpc/braker2 is built by the BioHPC team, it includes all software required by the Braker2 pipeline, except the GeneMark-EX software and license files. When running the Braker2 pipeline, you need to put the GeneMark software directory and license file in the same directory as the input files. Detailed information about this image can be found at https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=56#c .
The PASA image is provided by the software developer.
For security reasons, on the BioHPC system you need to run Docker through a wrapper "docker1", which is developed by Cornell Bioinformatics Facility. Detailed information about docker1 can be found at https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=340#c . In the following commands, you will download two Docker images from the Dockerhub.
If someone else has already pulled the image on the same computer, you would see a message that this image exists.
The repetitive regions of the genome need to be masked before doing ab initio annotation. It is recommended to do "soft" masking (converting repeat regions to lower case), instead of "hard" masking (converting repeats to "N"). As "Repeat masking" is time consuming, we provide you with pre-run results.
2.1 [DO IT AT HOME] Build a custom repeat database with RepeatModeler.
-pa 4: use 4 processors
-LTRStruct: including the LTR discovery module in the run.
After this step, you should see a directory named "RM_xxxxx.date" and two new files.
mySpecies-families.fa: Consensus repeat sequences in FASTA format.
mySpecies-families.stk: Multiple sequence alignment of the repeats in STOCKHOLM format.
The file mySpecies-families.fa is what you want. It can be used as database for RepeatMasker.
For plant genomes, an alternative software for building custom database is EDTA (https://github.com/oushujun/EDTA)
2.2 [DO IT AT HOME] Mask the genome with RepeatMasker
The output file "genome_mask/genome.fa.masked" is the soft masked genome file.
As the annotation software Braker2 cannot use RNA-seq reads directly, you will need to map the reads to the genome assembly first. You will use the RNA-seq read mapping software STAR. Run STAR with "--twopassMode Basic" to improve alignment to the splicing junctions.
2.1 [DO IT AT HOME] Create a STAR database from the assembly FASTA file
The output from this step is a directory named STARgenome which contains the STAR database.
2.2 [DO IT AT HOME] Map RNA-seq reads to the assembly
After these steps, you should see a file RNA.sorted.bam. This file will be used as input for Braker2.
In this exercise, you will train the gene prediction model using Illumina RNA-seq data. Braker2 requires two input files: repeat masked genome file and RNA-seq BAM file.
3.1 Examine the two input files for Braker2
3.1.1 Repeatmasker output: genome_mask/genome.fa.masked
3.1.2 START output: RNA.sorted.bam
3.2 [DO IT AT HOME] Run Braker2.
You the run the command through Docker ("docker1" command on BioHPC). It will take hours to finish. Run it in screen session.
When you run a software in Docker, you run it as root. After the job is finished, you will need to get back the ownership of the output results. We have a special tool for this purpose:
As docker images takes storage space on the server. To remove a docker image file that is not needed anymore, use the command "docker1 rmi imageName". (use "docker1 images" to find image names if you do not know the name. )
3.3 Examine the Braker2 output
Detailed information about the Braker2 output can be found at https://github.com/Gaius-Augustus/BRAKER#output-of-braker.
The PacBio Iso-seq is good platform to generate full length transcript sequence, which can be assembled and cleaned through its proprietary SMRT Link Iso-Seq application. Alternatively, the transcript.fasta file could be output from reference genome guided RNA-seq reads assembly software, e.g. StringTie2, cufflinks and Trinity.
The transcript sequence file is expected to be free of adapters and poly-A tail. Other software, e.g. SMRT Link Iso-Seq application should give you cleaned transcript sequences. You can also use software like cutadapt to trim poly-A tail and adapter sequences.
4.1 [DO IT AT HOME] Run PASA
4.1.1 To customize the output file names (otherwise the default prefix is "sample_mydb_pasa.sqlite"), you need to edit the provided template file sqlite.confs/alignAssembly.config, and change the line "DATABASE=/tmp/sample_mydb_pasa.sqlite" to "DATABASE=/tmp/mySpecies".
4.1.2 Run PASA. As it is not aware of soft masking, you can use either the genome.fa or genome.fa.masked.
4.2 Examine PASA results
PASA produced many result files. The two most relevant annotation results are:
Read the EVM (https://evidencemodeler.github.io/) documentation for details.
5.1 Validate the gff3 files from PASA and Braker2.
EVM provides a validation tool gff3_gene_prediction_file_validator.pl. If there are errors, you will get error messages. Otherwise, not output.
* use the "ln -s" command to create symbolic links, so that we can use the files in the current directory.
5.2 Create a EVM weight file
EVM requires a weights file, which has three columns including the evidence class, type, and weight. The "class" parameter can be one of the following: ABINITIO_PREDICTION, PROTEIN, or TRANSCRIPT. The "type" corresponds to the second column (source) in the input gff3 file. The weights can be set intuitively (ie. weight(pasa) >> weight (protein) >= weight(prediction)). (See EVM manual for example weights for different data types).
A weights.txt file has be prepared for you. Examine its content.
5.3 [DO IT AT HOME] Run EVM
5.3.1 Partition the inputs
The genome sequences and gff3 files are partitioned based on individual contigs, and large contigs are segmented into smaller overlapping chunks. This would allow you to process the chunks in parallel.
5.3.2 Generating the EVM command set
EVM would create one command for each chunk you partitioned in the previous step. The commands are in the file commands.list.
5.2.3 Execute the commands:
Here you will use the GNU parallel to run the commands.list in parallel.
5.3.4 Combining the partitions and convert to gff3 output
After these steps, you get a final EVM.all.gff3 annotation file.
6.1 Get some basic statistics of the gff3 file
Using the Linux "uniq -c" command on 3rd column of the gff3 file.
6.2 Check the results using genome browser IGV
Download these files to you laptop using Filezilla:
IGV is a JAVA software that can be run on Windows, MAC or a Linux computer. To launch IGV on your laptop, go to IGV web site (https://software.broadinstitute.org/software/igv/ ), click “Download”, and download the Windows or Mac version for your laptop. Double click the IGV installation tool to install IGV. On Windows computer, the software is installed in the directory C:\Program Files\IGV_2.6.3. Double click “igv.bat” to start IGV. After double clicking, it might take a few seconds before you see the software starting.
You will need t create our own genome database. Click “Genomes”->”Create .genome” file. Fill out the following fields:
Then save the genome database on your computer.
From menu “File” -> “Load file”, open the “RNA.sorted.bam” and "augustus.hints.gff3"
Navigate around different regions of the genome, comparing the annotations from Augustus and EVM with RNA-seq evidence.