Practical Linux examples: Exercises
mkdir /workdir/$USER
cd /workdir/$USER
cp /shared_data/Linux_workshop2/* ./
zcat human.gff3.gz | head -n 100
zcat human.gff3.gz | tail -n 100
zcat human.gff3.gz| head -n 2000 | tail -n 1000 > mynew.gtf
zcat human.gff3.gz| head -n 4000 | tail -n 100 | cut -f 2-5,8
zcat human.gff3.gz| head -n 4000 | tail -n 100 | less -S
zcat human.gff3.gz | wc -l
zcat human.gff3.gz | awk '{if ($3=="gene") print}' |wc -l
zcat human.gff3.gz | grep -v "^#" | cut -f 3 | sort | uniq -c
#The following three lines are in one single command
zcat human.gff3.gz | \
awk 'BEGIN {OFS = "\t"};{if ($3=="gene") print $1,$4-1,$5}' | \
sed "s/^/chr/"> mygenes.bed
Calculate the nucleotide sequence sizes for each of the genes. The size of a gene is calculated by subtracting "gene start position" (column4) from "gene end position" (column 5), then adding 1 because the GFF3 coordinate system is 1-based. (If a BED file is used to calculate the gene size, you do not need to add 1 because the start position is 0-based in the BED file).
To get the size distribution, you need to do two more things: 1) Use the expression " int(($5-$4+1)/1000) " to convert the sizes from "base-pair" to "kilo-base-pair"; 2) The expression " LC_ALL=C sort -n | uniq -c" is used to get the histogram. The parameter "-n" is to tell "sort" to do numerical sorting; “-S 2G” is to set the buffer size to 2 gb to speed up the sorting.
The output is in the file "gene_dist.txt".
zcat human.gff3.gz | awk '{if ($3=="gene") print $5-$4+1}'
#The following three lines are in one single command
zcat human.gff3.gz | \
awk '{if ($3=="gene") print int(($5-$4+1)/1000)'} | \
LC_ALL=C sort -S 2G -n | uniq -c > gene_dist.txt
bedtools makewindows -g hg19.txt -w 1000000 -s 1000000 > win1mb.bed
zcat human.gff3.gz | \
awk 'BEGIN {OFS = "\t"}; {if ($3=="gene") print $1,$4-1,$5}' | \
bedtools coverage -a win1mb.bed -b stdin -counts | \
LC_ALL=C sort -k1,1V -k2,2n > gene.cover.bed
zcat human.gff3.gz | \
awk 'BEGIN {OFS = "\t"}; \
{if (($3=="processed_pseudogene") || ($3=="pseudogene")) print $1,$4-1,$5}' | \
bedtools coverage -a win1mb.bed -b stdin -counts | \
LC_ALL=C sort -k1,1V -k2,2n > pseudogene.cover.bed
paste gene.cover.bed pseudogene.cover.bed | \
cut -f 1,2,3,4,8 > genecounts.txt
In this exercise, you will use the fastx software to trim sequence adapters from a fastq file, then get size distribution of the trimmed sequencing reads. You will use Linux functions grep, wc -l, awk. • Estimate the percentage of sequencing reads that contain the adapter sequence " AGATCGGAAGAGC". You can use the first 10,000 sequencing reads to estimate. Sometimes the first 10,000 reads are all low quality reads, then this estimation would not be accurate. In that case, you might want to use reads from the middle of the file by using "head -n xxxxx | tail -n xxxxx" command) ;
• Remove the adapter sequences. Several software can do this, e.g. cutadapt and bbduk. In this exercise, you will use a pretty old tool called fastx_clipper, and write the output into a new fastq file "clean.fastq". • Calculate the read length distribution. In a fastq file, each sequence record has 4 lines and the second line of the record is the actual DNA sequence. The "awk" function has a variable "NR" that records the line number for each row. The expression (NR%4) gives you the remainder of NR divided by 4. The statement "if (NR%4 == 2) print length($0)" means "output the size of the second line in every sequence record". The output of awk can then be piped into "LC_ALL=C sort -n | uniq -c" to get the read size distribution.
zcat SRR836349.fastq.gz |head -n 40000 | grep AGATCGGAAGAGC | wc -l
zcat SRR836349.fastq.gz | fastx_clipper -a AGATCGGAAGAGC -Q33 > clean.fastq
awk '{if (NR%4 == 2) print length($0)}' clean.fastq | LC_ALL=C sort -n | uniq -c
gzip reads_1.fastq gzip reads_2.fastq gzip reads_3.fastq gzip reads_4.fastq gzip reads_5.fastq
There are many different ways to create this shell script. Here you will use the “yes” and “paste” commands to create this script "myscript.sh".
yes "gzip" |head -n 5 > t1
ls -1 reads*fastq > t2
paste -d " " t1 t2 > myscript.sh
· yes "gzip" |head -n 5 > t1 : Create a text file with 5 lines of “gzip”
· ls -1 reads*fastq > t2: Create a text file "t2" with all file names ("1" is the number 1);
paste -d " " t1 t2: Join t1 and t2 files horizontally, with space character as delimiter.