Practical Linux examples: Exercises

  1. Login (ssh) to the machine that you are assigned for this workshop. Prepare working directory, and copy data files into the working directory
  mkdir /workdir/$USER
  cd /workdir/$USER
  cp /shared_data/Linux_workshop2/* ./

  1. Inspect the contents of a compressed gff3 file (human.gff3.gz), using these Linux functions: zcat, head, tail, cut, less
zcat human.gff3.gz | head -n 100
zcat human.gff3.gz | tail -n 100
zcat human.gff3.gz| head -n 2000 | tail -n 1000 > mynew.gtf
zcat human.gff3.gz| head -n 4000 | tail -n 100 | cut -f 2-5,8
zcat human.gff3.gz| head -n 4000 | tail -n 100 | less -S

  1. Count the number of genes listed in the file, using these Linux functions: awk, uniq
zcat human.gff3.gz | wc -l 
zcat human.gff3.gz | awk '{if ($3=="gene") print}' |wc -l
zcat human.gff3.gz | grep -v "^#" | cut -f 3 | sort | uniq -c

  1. Convert the GFF3 file to BED file, using only the lines which third column equals to "gene". Then add "chr" to the chromosome name, so that chromosome 1 would be "chr1" instead of "1". Use Linux functions awk and sed for this exercise.
#The following three lines are in one single command
zcat human.gff3.gz | \
awk 'BEGIN {OFS = "\t"};{if ($3=="gene") print  $1,$4-1,$5}' | \
sed "s/^/chr/"> mygenes.bed 

  1. Get the size distribution of all genes, using the Linux functions: awk, sort and uniq
  1. Count the number of genes and pseudogenes in sliding windows across the whole chromosome, using Bedtools
bedtools makewindows -g hg19.txt -w 1000000 -s 1000000 > win1mb.bed

zcat human.gff3.gz | \
awk 'BEGIN {OFS = "\t"}; {if ($3=="gene") print $1,$4-1,$5}' | \
bedtools coverage -a win1mb.bed -b stdin -counts | \
LC_ALL=C sort -k1,1V -k2,2n > gene.cover.bed

 
zcat human.gff3.gz | \
awk 'BEGIN {OFS = "\t"}; \
{if (($3=="processed_pseudogene") || ($3=="pseudogene")) print $1,$4-1,$5}' | \
bedtools coverage -a win1mb.bed -b stdin -counts | \
LC_ALL=C sort -k1,1V -k2,2n > pseudogene.cover.bed

 
paste gene.cover.bed pseudogene.cover.bed | \
cut -f 1,2,3,4,8 > genecounts.txt

 

  1. In this exercise, you will use the fastx software to trim sequence adapters from a fastq file, then get size distribution of the trimmed sequencing reads. You will use Linux functions grep, wc -l, awk. • Estimate the percentage of sequencing reads that contain the adapter sequence " AGATCGGAAGAGC". You can use the first 10,000 sequencing reads to estimate. Sometimes the first 10,000 reads are all low quality reads, then this estimation would not be accurate. In that case, you might want to use reads from the middle of the file by using "head -n xxxxx | tail -n xxxxx" command) ;

    • Remove the adapter sequences. Several software can do this, e.g. cutadapt and bbduk. In this exercise, you will use a pretty old tool called fastx_clipper, and write the output into a new fastq file "clean.fastq". • Calculate the read length distribution. In a fastq file, each sequence record has 4 lines and the second line of the record is the actual DNA sequence. The "awk" function has a variable "NR" that records the line number for each row. The expression (NR%4) gives you the remainder of NR divided by 4. The statement "if (NR%4 == 2) print length($0)" means "output the size of the second line in every sequence record". The output of awk can then be piped into "LC_ALL=C sort -n | uniq -c" to get the read size distribution.

zcat SRR836349.fastq.gz |head -n 40000 | grep AGATCGGAAGAGC | wc -l
	
zcat SRR836349.fastq.gz | fastx_clipper -a AGATCGGAAGAGC -Q33 > clean.fastq
	
awk '{if (NR%4 == 2) print length($0)}' clean.fastq | LC_ALL=C sort -n | uniq -c

  1. Create a shell script . In this exercise, you will create a shell script like this:

gzip reads_1.fastq gzip reads_2.fastq gzip reads_3.fastq gzip reads_4.fastq gzip reads_5.fastq

There are many different ways to create this shell script. Here you will use the “yes” and “paste” commands to create this script "myscript.sh".

yes "gzip" |head -n 5 > t1
ls -1 reads*fastq > t2
paste -d " " t1 t2 > myscript.sh

· yes "gzip" |head -n 5 > t1 : Create a text file with 5 lines of “gzip”

· ls -1 reads*fastq > t2: Create a text file "t2" with all file names ("1" is the number 1);

paste -d " " t1 t2: Join t1 and t2 files horizontally, with space character as delimiter.