1.1 Create a working directory "/workdir/$USER".
Copy all data files for this exercise from "/shared_data/epigenomics/exercise1/" into the working directory.
1.2 Install the FileZilla client on your laptop
FileZilla is an SFTP client. If you do not have an SFTP client on your laptop, download the FileZilla client from this page and double-click the downloaded file to install it.
* The installer might offer to install additional software (e.g. an antivirus program); always click "no" to decline.
1.3 Install IGV on your laptop
Go to the IGV web site (https://software.broadinstitute.org/software/igv/) and click “Download”. Double-click the IGV installer to install IGV. On a Windows computer, the software is installed under C:\Program Files.
1.4 Getting familiar with "screen"
If you are not familiar with the Linux "screen" or "tmux" commands, now is the time to learn one of them.
Most of the tools you will be using in this exercise take a long time to finish. Run them inside a persistent "screen" session, so that you can safely detach from the session (Ctrl-a, then d) and the job keeps running in the background on the server. You can reattach later with "screen -r".
2.1 GFF3, GTF and BED
You are provided with a gff3-formatted file: ara.gff3. Inspect the content of the file.
Generate some basic statistics of the gff3 file based on the 3rd column, "Feature type". The counts tell you the number of gene features in this file, and whether this gff3 file contains non-protein-coding features (e.g. transposable_element, miRNA, etc.).
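One possible way to tabulate the feature types is a cut/sort/uniq pipeline. The sketch below builds a tiny, made-up file (toy.gff3, not part of the exercise data) so that the pipeline can be tried in isolation. The file names toy.gff3 and toy_feature_counts.txt are only illustrative.

```shell
# Build a tiny, made-up GFF3 file for illustration (tab-separated, 9 columns).
printf '##gff-version 3\n' > toy.gff3
printf '1\ttoy\tgene\t100\t500\t.\t+\t.\tID=g1\n' >> toy.gff3
printf '1\ttoy\tmRNA\t100\t500\t.\t+\t.\tID=g1.1;Parent=g1\n' >> toy.gff3
printf '1\ttoy\texon\t100\t200\t.\t+\t.\tParent=g1.1\n' >> toy.gff3
printf '1\ttoy\texon\t300\t500\t.\t+\t.\tParent=g1.1\n' >> toy.gff3
printf '2\ttoy\tgene\t50\t90\t.\t-\t.\tID=g2\n' >> toy.gff3

# Skip comment lines, keep column 3, and count each distinct feature type.
# The final sort puts the most frequent feature types first.
grep -v '^#' toy.gff3 | cut -f 3 | sort | uniq -c | sort -k1,1nr > toy_feature_counts.txt
cat toy_feature_counts.txt
```

To get the statistics for the exercise, run the same pipeline on ara.gff3 instead of toy.gff3.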
Convert the gff3 file to a gtf file, and then convert it back to a new gff3 file.
Compare the gff3 and gtf file formats, especially the last column of the two files.
By comparing the original ara.gff3 with the new ara_converted.gff3, you will find that some information is lost in ara_converted.gff3. For example, you do not see the "gene" features in the new file.
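The round trip above could be done with gffread, the tool used later in this exercise for sequence extraction. This is a sketch, not necessarily the command intended by the exercise. It assumes gffread is on your PATH and ara.gff3 is in the current directory, and the intermediate file name ara.gtf is an assumption.

```shell
# Runs only if gffread and ara.gff3 are available; otherwise prints a notice.
if command -v gffread >/dev/null 2>&1 && [ -f ara.gff3 ]; then
    # -T switches the output format to GTF; -o names the output file.
    gffread ara.gff3 -T -o ara.gtf
    # Without -T, gffread writes GFF3.
    gffread ara.gtf -o ara_converted.gff3

    # Compare the last column (attributes) of the first few records.
    grep -v '^#' ara.gff3 | head -n 3 | cut -f 9
    grep -v '^#' ara.gtf  | head -n 3 | cut -f 9

    # Compare the feature types present before and after the round trip.
    grep -v '^#' ara.gff3           | cut -f 3 | sort | uniq -c
    grep -v '^#' ara_converted.gff3 | cut -f 3 | sort | uniq -c
else
    echo "gffread or ara.gff3 not found; skipping the conversion" >&2
fi
```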
Convert the ara.gff3 file to a "bed" formatted file, using the "awk" command.
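A minimal sketch of such an awk conversion follows. It assumes that only "gene" features are wanted and that the BED name should come from the ID= attribute; the exercise does not specify either. The input toy_genes.gff3 is a made-up file created just for the demonstration.

```shell
# Made-up GFF3 records (toy_genes.gff3) used only to demonstrate the conversion.
printf '1\ttoy\tgene\t100\t500\t.\t+\t.\tID=g1;Name=A\n' > toy_genes.gff3
printf '1\ttoy\texon\t100\t200\t.\t+\t.\tParent=g1.1\n' >> toy_genes.gff3
printf '2\ttoy\tgene\t50\t90\t.\t-\t.\tID=g2\n' >> toy_genes.gff3

# GFF3 coordinates are 1-based and inclusive; BED starts are 0-based,
# so 1 is subtracted from the start and the end is kept unchanged.
awk 'BEGIN { FS = OFS = "\t" }
     !/^#/ && $3 == "gene" {
         id = "."                              # fallback name if no ID= is present
         if (match($9, /ID=[^;]+/))            # first ID=... attribute in column 9
             id = substr($9, RSTART + 3, RLENGTH - 3)
         print $1, $4 - 1, $5, id, ".", $7     # chrom, start, end, name, score, strand
     }' toy_genes.gff3 > toy_genes.bed
cat toy_genes.bed
```

For the exercise, replace toy_genes.gff3 with ara.gff3 and choose an output name such as ara.bed.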
2.2 Extract protein and transcript sequences from the genome.
Inspect the genome sequence file "ara.fasta" and the "ara.gff3" file. You would find that,
The Linux "sed" command can be used to fix this problem and write the result to a new file, "ara_2.gff3".
Now you can use gffread to extract protein and transcript sequences. The output is two new files: transcript.fasta and protein.fasta.
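The extraction step might look like the sketch below. It assumes gffread is on your PATH and that ara_2.gff3 and ara.fasta are in the current directory; if not, it only prints a notice.

```shell
if command -v gffread >/dev/null 2>&1 && [ -f ara_2.gff3 ] && [ -f ara.fasta ]; then
    # -g: genome FASTA to pull sequence from
    # -w: write spliced transcript (exon) sequences
    # -y: write protein translations of the CDS
    gffread ara_2.gff3 -g ara.fasta -w transcript.fasta -y protein.fasta

    # Count the sequences written to each file.
    grep -c '^>' transcript.fasta protein.fasta
else
    echo "gffread, ara_2.gff3 or ara.fasta not found; skipping the extraction" >&2
fi
```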
2.3 BEDgraph and BigWig files
Because the wig file format has largely been replaced by the BigWig format, we will work with BEDgraph and BigWig files here.
If the input file sample1.bedGraph is not sorted, use the Linux "sort" command to sort the file first (by columns 1 and 2).
Run "bedGraphToBigWig", which is a tool in the UCSC Kent Utilities package.
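A sketch of the sort step, shown on a small made-up file (toy.bedGraph) so that the effect is easy to see. LC_ALL=C is used because the UCSC tools expect chromosome names in case-sensitive byte order.

```shell
# Made-up, deliberately unsorted bedGraph (chrom, start, end, value).
printf '2\t0\t10\t1.5\n1\t100\t200\t3\n1\t20\t30\t2\n' > toy.bedGraph

# -k1,1  : sort by chromosome name (column 1)
# -k2,2n : then numerically by start position (column 2)
LC_ALL=C sort -k1,1 -k2,2n toy.bedGraph > toy.sorted.bedGraph
cat toy.sorted.bedGraph
```

For the exercise, apply the same sort command to sample1.bedGraph and write the result to a new file.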
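bedGraphToBigWig takes three arguments: the sorted bedGraph file, a two-column chromosome sizes file, and the output name. The sketch below assumes that the sizes can be derived from ara.fasta and that its sequence names match column 1 of sample1.bedGraph; neither is stated in the exercise. The helper fasta_chrom_sizes and the file name chrom.sizes are hypothetical.

```shell
# Hypothetical helper: print "name<TAB>length" for every sequence in a FASTA file.
fasta_chrom_sizes() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

# Made-up two-sequence FASTA to show what the helper produces.
printf '>1 toy\nACGT\nAC\n>2\nGGG\n' > toy.fasta
fasta_chrom_sizes toy.fasta > toy.chrom.sizes
cat toy.chrom.sizes

# The exercise step itself; runs only if the tool and input files are present.
# sample1.bedGraph is assumed to be sorted already.
if command -v bedGraphToBigWig >/dev/null 2>&1 && [ -f sample1.bedGraph ] && [ -f ara.fasta ]; then
    fasta_chrom_sizes ara.fasta > chrom.sizes
    bedGraphToBigWig sample1.bedGraph chrom.sizes sample1.bw
fi
```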
4.2 Download files to your laptop.
Use FileZilla to download the "sample1.bw" file.
4.3 Launch IGV on your laptop.
Double-click “igv.bat” in the directory "C:\Program Files\IGV_2.8.11" to start IGV. It might take a few seconds before the IGV window appears.
The most commonly used genomes are already available in IGV. In this exercise, you will use "A thaliana (TAIR 10)". In the pull-down menu at the upper-left corner, click "More" and select "A thaliana (TAIR 10)".
From menu “File” -> “Load file”, open the “sample1.bw”.
Select "1" from the "chromosome" pull-down menu.