1.1 Install the software "figtree" on your laptop.
Go to the web site: https://github.com/rambaut/figtree/releases . Download the software file based on your system.
Windows users can download the file: FigTree.v1.4.4.zip, de-compress. To run the software, double click the file "figtree.jar" located inside the lib directory. (Double click "figtree.exe" might not work for some computers)
Mac users can download the file FigTree.v1.4.4.dmg. Double click to install.
1.2 Prepare the working directory on the Linux server.
The "cp -r" command would copy the whole directory of project3.
In this exercise, you will use RAxML-ng to build phylogenetic trees from two Multiple Sequence Alignment (MSA) files. "prim.phy" is a PHYLIP formatted file of a primate gene (DNA sequences). "leafy_align.fas" is a FASTA formatted file of a plant gene (protein sequences).
The raxml-ng is a tool with several different functions (Tutorial: https://www.preprints.org/manuscript/201905.0056/v1 ). We will use five of them in this exercise: "check", "search", "bootstrap", "evaluate" and "support". To run the "check" function, for example, you need to use the command "raxml-ng --check". Run "raxml-ng" without function name defaults to the "search" function.
2.1 Verify the format and data consistency of the input files
RAxML-NG can read MSA files in FASTA, PHYLIP and CATG formats. The "check" function automatically detect and verify the file formats.
If you see the message "Alignment can be successfully read by RAxML-NG", your input files are ok.
2.2 Inferring ML trees
Run "raxml-ng search" and find the best ML tree.
Out of the 20 trees, the best tree is saved to a file "A1.raxml.bestTree".
There is a log file named "A1.raxml.log", with likelihoods of all 20 trees. Use "grep" to extract the 20 "logLikelihood" values.
You should see output like:
[00:00:02] [worker #1] ML tree search #2, logLikelihood: -5708.924752 [00:00:02] [worker #0] ML tree search #1, logLikelihood: -5708.925657 [00:00:04] [worker #1] ML tree search #4, logLikelihood: -5708.933482 [00:00:05] [worker #0] ML tree search #3, logLikelihood: -5708.935278 [00:00:07] [worker #1] ML tree search #6, logLikelihood: -5708.939390 [00:00:07] [worker #0] ML tree search #5, logLikelihood: -5708.927252 [00:00:09] [worker #0] ML tree search #7, logLikelihood: -5708.948094 [00:00:09] [worker #1] ML tree search #8, logLikelihood: -5709.375515 [00:00:11] [worker #0] ML tree search #9, logLikelihood: -5709.364187 [00:00:12] [worker #1] ML tree search #10, logLikelihood: -5709.028093 [00:00:14] [worker #0] ML tree search #11, logLikelihood: -5709.017239 [00:00:14] [worker #1] ML tree search #12, logLikelihood: -5709.025309 [00:00:16] [worker #0] ML tree search #13, logLikelihood: -5709.020900 [00:00:16] [worker #1] ML tree search #14, logLikelihood: -5709.012512 [00:00:18] [worker #0] ML tree search #15, logLikelihood: -5709.019845 [00:00:18] [worker #1] ML tree search #16, logLikelihood: -5709.015344 [00:00:20] [worker #0] ML tree search #17, logLikelihood: -5709.034675 [00:00:20] [worker #1] ML tree search #18, logLikelihood: -5709.015198 [00:00:21] [worker #0] ML tree search #19, logLikelihood: -5709.015413 [00:00:22] [worker #1] ML tree search #20, logLikelihood: -5709.043286
Final LogLikelihood: -5708.924752
The final LogLikelihood might be slightly different between different runs. That is because the starting tree used by "search" function is randomly selected. If you fix the random number seed in the command, e.g. "--seed 2". You would get same score between runs.
Next you will build a tree for leafy gene using the protein MSA file "leafy_align.fas". The model used here is "LG+G4" (fixed empirical substitution matrix "LG", 4 discrete GAMMA categories).
2.3 Bootstrapping and branch support
In the previous step you generated ML trees from 20 distinct random starting trees, and output the tree with the best likelihood. Now we will get Bootstrapping support values for the trees.
After this step, you would see a tree file "C1.raxml.support", which is a tree file with bootstrapping support values.
2.4 Using "figtree" to visualize the trees
Use Filezilla to download the two tree files "C1.raxml.support" and "C2.raxml.support" to your laptop. Start "figtree". Open the file "C1.raxml.support". When prompted for names for branches and nodes, enter "BS".
After opening the file in "figtree", check the boxes for "Tip Label" and "Node Label". Expand "Tip Label" and "Node Label" to change "font" and increase size as needed. Expand "Node Label", select "BS" for "Display".
2.5 Combining search and bootstrapping in one command
In practice, we normally use the "--all" function to run raxml-ng, which combines the steps in 2.2 & 2.3 into a single command. The "--all" function is good for small to medium size data set. If you are working with a very large data set, it would be better to run the two steps separately, so that you can customize each step differently.
After this, you would get a tree file with bootstrapping support values: "D1.raxml.support".
2.6 Tree likelihood evaluation
"--evaluate" re-optimizes all branch lengths and model parameters without changing the tree topology. This step is optional, as in most cases, we are more interested in tree topology than the branch lengths.
Evaluate the likelihood of A1.raxml.bestTree under the following models: GTR+G, GTR+R4, GTR, JC and JC+G.
Each "--evaluate" command would create 5 output files, including *.raxml.log, and *.raxml.bestTree.
Use "grep" to extract the final likelihood scores from log files produced with the five different models:
Which model should be preferred? Compare likelihoods and AIC/AICc/BIC scores (lower=better).
2.7 Using modeltest-ng to find the best model
It is always each to determine best model to use. The "modeltest-ng" tool can be used to estimate the best model to use, it also provides you with the actual command to run "raxml-ng search".
When you work on real projects, please check the BioHPC software page (https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=794#c) to run the latest version. The newer version only works on AVX2 CPUs.
Based on the command provided by modeltest-ng output, now you can run raxml-ng as:
VCF (https://en.wikipedia.org/wiki/Variant_Call_Format) is a file format commonly used for SNP genotyping results. In this exercise, you will use Tassel to build a Neighbor-Joining tree (NJ) from a VCF file. Tassel ( https://www.maizegenetics.net/tassel) is a very popular genetics software, developed by Ed Buckler lab here at Cornell. You can run Tassel either as a GUI software on your laptop, or as a command-line tool on the server. In this exercise, you will run the command lines on the Linux server.
-Xmx5g: specify the maximum memory size for Java;
-treeSaveDistance false: because we do not supply pre-calculated distance matrix;
-vcf: input vcf file name;
-tree Neighbor: neighbor joining tree;
-exportType Text: output newick format
Download the tree file "tree.nj.nwk" to your laptop and open in "figtree". To get the best view, you will need to adjust "Expansion" and "Zoom".
In practice, when you work with VCF files, it is always a good idea to pre-filter the VCF file. (It is the same concept as running "Gblocks" on MSA files, to remove the un-reliable sites). You can use "Tassel", "VCFtools" or "BCFtools" to do the filtering. Commonly used filters including: missing data, allele frequency. You can also use more sophisticated filters like "segregation pattern", "LD", "read depth", et al.
Orthofinder is a tool to identify orthologous genes from multiple genomes. In this exercise, you will run the software with sequences from four different E. coli strains. Once you have the ortholog gene groups, you can run MSA software on each ortholog group, followed by Raxml-ng to infer phylogeny.
4.1 Examine and pre-process the input files.
At the beginning of this exercise, you have copied the four E. coli fasta files into your working directory. The files should be under the directory /workdir/$USER/project3/faa, named "*_translated_cds.faa.gz".
Use the command "zcat file_name.gz | head -n 40" to check the first 40 lines of a ".gz" file. Use the command "wc -l" to count the number of genes per strain.
The protein sequences in the input files have very long titles, like ">lcl|NC_000913.3_prot_NP_414548.1_7 [gene=yaaJ] [locus_tag=b0007] [db_xref=UniProtKB/Swiss-Prot:P30143] [protein=putative transporter YaaJ] [protein_id=NP_414548.1] [location=complement(6529..7959)] [gbkey=CDS]".
There are two potential problems with these sequence titles:
It would be a good idea to simplify the sequence titles in the fasta files before running Orthofinder. In Linux, the "sed" command can be used to replace text. In the following commands, you will de-compress the original file with "zcat", pipe through two rounds of "sed", and re-direct the output to new files named "strain*.faa".
I also like the sequence titles to start with strain names like "s1" "s2" "s3" "s4", so that in the MSA results, you can easily tell from which strain each sequence comes from. As each input files are different, you will need to use the "sed" command creatively to modify the sequence titles.
After these steps, you will have 4 fasta files with short sequence titles: strain1.faa, strain2.faa, strain3.faa and strain4.faa
Use "grep" to check the sequence titles in the cleaned fasta files:
4.2 Run Orthofinder
Unfortunately, you will not be able to run Orthofinder using the training computers. The software requires CPUs that support AVX2. In BioHPC system, it only works on medium or large memory gen2 computers.
The step-by-step instructions of running Orthofinder on BioHPC can be found on this page: https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=629#c . It is very straight forward. The software reads in a directory of protein fasta files, and the output is a directory with many files. You will examine the output files in the next step.
4.3 Examine the output files
The output of Orthofinder should be under the working directory /workdir/$USER/project3/. The file is "Results_Oct06.tar.gz".
De-compress the Results_Oct04.tar.gz file:
You would find several sub-directories in the Orthofinder results.
First, you will check the files under the directory "Orthogroups".
There is a total of 5411 ortholog groups in the four E. coli strains. Among these groups, 3557 of the groups have one gene per strain, which is ideal for building phylogenetic tree. The ID of these "one-gene-per-strain" ortholog groups are stored in the file "Orthogroups_SingleCopyOrthologues.txt".
Next let's check the directory Orthogroup_Sequences.
In this directory, you would find 6423 fasta files. Each file contains sequences for an ortholog group. These files can be used to run MSA software.
Last you can check the directory Comparative_Genomics_Statistics, which includes summary reports of the run.
4.4 Run MSA for all single copy genes
Orthofinder output a directory called "Single_Copy_Orthologue_Sequences" which contains fasta files for all single copy genes that are present in every species. These sequences are ideal for construction of a species tree. In this exercise, you will run MSA & Gblocks on each one of the FASTA files.
First you need to create a batch script to process all files in the directory. Linux command "xargs" can be used for this purpose.
Run the batch script in parallel to take advantage of multiple CPU cores. You will use the "GNU parallel command" to run the batch script, set "-j 2" so that 2 jobs would run simultaneously. This would take a while, run it in "screen".
After it is done, you should see a set of new files: aln_*.fa-gb. These are the cleaned MSA files.
4.5 Create a species tree from combined MSA of individual ortholog groups
Here you will concatenate all cleaned MSA files into one single fasta file, and create a species tree using raxml-ng .
concate_msa.py: a Python script to concatenate the MSA files of each ortholog groups, and output a single file merged.fasta. (If you will use this script concate_msa.py for your own data, you need to make sure that your files are: 1) In the same order of species, make sure to use " --inputorder" when running mafft or other MSA software to keep the order; 2) Only work with FASTA format.)
The final output is a species tree: merged.raxml.bestTree.