This exercise will work with BEDTools
, deepTools
, and Homer
. We'll be mixing in some linux
(bash
), awk
, and a little R
along the way. The tools are most powerful when you can combine all of these (you could certainly swap python
for R
).
Documentation for most (but not all!) of the BEDTools command is available here: https://bedtools.readthedocs.io/en/latest/content/overview.html#summary-of-available-tools
Or you can get help for any BEDTools command directly from the command line. The help output from BEDTools writes to stderr
, so you can make it easier to scroll through using less
by redirecting stderr
like this:
# See all of the available BEDTools commands
bedtools -h 2>&1 | less
# Get help for the makewindows command (as
# https://bedtools.readthedocs.io/en/latest/content/tools/makewindows.html
# is an empty page)
bedtools makewindows -h 2>&1 | less
Within less, space bar
moves you forward by one window, and the b
key moves you back one window. When you are done, type q
to exit less.
# This is a comment and will not do anything if you copy/paste
# it into the command line terminal (provided the leading #
# is included)
## This is my convention for the BEDTools
## section for showing expected output
## (i.e., prepended by ## )
mkdir -p /workdir/$USER/exercise3
cd /workdir/$USER/exercise3
cp /shared_data/epigenomics/exercise3/* /workdir/$USER/exercise3/
ls -1
## ATAC_Rep1_peaks_filter.narrowPeak
## ATAC_Rep2_peaks_filter.narrowPeak
## CS_Rep1_peaks_filter.narrowPeak
## CS_Rep2_peaks_filter.narrowPeak
## sacCer3.chrom.sizes
# In general, don't forget to start screen (especially for commands that take a while)
screen
Although gff3 files can be used directly as BEDTools inputs (-a "query" or -b "database") the ouput can be pretty messy because the whole gff3 file line will be included. Also, you need to make sure the chromosome names and sorting match (e.g., "chrIV" and "IV" won't match) , so this can be fixed during conversion, if need be. Here, we'll sort the chromosomes, remove the mitochondrial genome, and extract the gene names.
cd /workdir/$USER/exercise3
# get the gff3 annotation for the yeast sacCer3 reference genome (copy/paste the full command)
wget "ftp://ftp.ensemblgenomes.org/pub/fungi/release-49/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.49.gff3.gz"
gunzip Saccharomyces_cerevisiae.R64-1-1.49.gff3.gz
Then convert the gff3 file to a sorted bed with autosomal gene names. Copy/paste the entire block of code below. If it doesn't run (it's very quick -- you'll see a command prompt when it's done), then press Enter
(or return
) to kick it off. Whether the command runs or not will depend on if you copy/paste the final line ending.
awk '
BEGIN{FS=OFS="\t"}
{
if($3=="gene" && $1!="Mito"){
split($9,anno,";")
split(anno[1],gene,":") # the gene name will be the 2nd element of the array "gene"
print $1,$4-1,$5,gene[2],".",$7
}
}
' Saccharomyces_cerevisiae.R64-1-1.49.gff3 | \
sort -k1,1 -k2,2n > sacCer3_autosomal_genes.bed
Now inspect the output.
less sacCer3_autosomal_genes.bed
Within less, space bar
moves you forward by one window, and the b
key moves you back one window. When you are done, type q
to exit less. Did I mention that already?
bedtools intersect
In Exercise 2 from last week, you used MACS2 to call some ChIP-seq and ATAC-seq peaks, producing *_peaks_filter.narrowPeak
files, which are in bed format. You've already copied these *_peaks_filter.narrowPeak
files to your working folder for today's exercise in step 1 above. Below, we'll work with the ChIP-seq replicate 1 file CS_Rep1_peaks_filter.narrowPeak
. Feel free to re-run the commands with the other three files. Also, feel free to inspect any intermediate output using less
, or piping into less:
<command1> | <command2> | less
.
# Fix the chr names in the peak file (remove chr)
sed 's/chr//' CS_Rep1_peaks_filter.narrowPeak > CS_Rep1.narrowPeak.bed
# What genes (if any) fall directly under the peaks, and how large is the overlap?
# -wo = Write the original A and B entries plus the number of base pairs of
# overlap between the two features.
bedtools intersect -wo -a CS_Rep1.narrowPeak.bed -b sacCer3_autosomal_genes.bed \
> CS_Rep1.narrowPeak.genes.bed
less CS_Rep1.narrowPeak.genes.bed # the overlap size is the last column
# How many total overlaps are there (71)
wc -l CS_Rep1.narrowPeak.genes.bed
# How many distinct genes are overlapped? (60)
cut -f 14 CS_Rep1.narrowPeak.genes.bed | sort -u | wc -l
# What genes are overlapped by more than one peak?
cut -f 14 CS_Rep1.narrowPeak.genes.bed | sort | uniq -c | sort -nr | awk '$1>1'
## 3 YLR162W
## 3 YHR056C
## 3 YHR054C
## 2 YLR467W
## 2 YLR162W-A
## 2 YLL066W-B
## 2 YIR019C
## 2 YCR108C
bedtools closest
Note that bedtools closest
requires presorted inputs.
What is the closest gene in either direction to each peak?
# -D b = report upstream or downstream distance relative to the strand of "database" b (gene)
bedtools closest -D b -a CS_Rep1.narrowPeak.bed -b sacCer3_autosomal_genes.bed > CS_Rep1.narrowPeak.closestGene.bed
# how many input query peaks? (328)
wc -l CS_Rep1.narrowPeak.bed
# confirm that the peak names are all distinct (I'm just paranoid...)
cut -f 4 CS_Rep1.narrowPeak.bed | sort -u | wc -l # yes, 328
# How many peak/gene combinations? (337) # Why is this larger than 328?
wc -l CS_Rep1.narrowPeak.closestGene.bed
# How many distinct genes? (289) # Why is this smaller than 328?
cut -f 14 CS_Rep1.narrowPeak.closestGene.bed | sort -u | wc -l
What is the closest gene downstream of each peak, including overlaps?
# -D b = report distance relative to the strand of "database" -b (in this case, the gene)
# -id = ignore queries downstream of a gene
# Note: documentation for bedtools closest -id and -D options is confusing (wrong???)
bedtools closest -id -D b -a CS_Rep1.narrowPeak.bed -b sacCer3_autosomal_genes.bed > CS_Rep1.narrowPeak.closestDownStreamGene.bed
# How many peak/gene combos? (336)
wc -l CS_Rep1.narrowPeak.closestDownStreamGene.bed
# How many distinct query peaks in the output? (328, as there should be)
cut -f 4 CS_Rep1.narrowPeak.closestDownStreamGene.bed | sort -u | wc -l
# Why are some peaks reported twice?
cut -f 4 CS_Rep1.narrowPeak.closestDownStreamGene.bed | sort | uniq -c | sort -nr | awk '$1>1'
## 3 CS_Rep1_peak_292
## 2 CS_Rep1_peak_36
## 2 CS_Rep1_peak_200b
## 2 CS_Rep1_peak_200a
## 2 CS_Rep1_peak_149
## 2 CS_Rep1_peak_115
## 2 CS_Rep1_peak_107
# Because many genes overlap in the yeast genome, and the twice reported peaks fall within an overlap:
cut -f 4 CS_Rep1.narrowPeak.closestDownStreamGene.bed | sort | uniq -c \
| sort -nr | awk '$1>1{print $2}' \
| xargs -I % grep % CS_Rep1.narrowPeak.closestDownStreamGene.bed \
| cut -f 1-4,11-14,16,17
## XV 720250 720472 CS_Rep1_peak_292 XV 720180 720510 YOR199W + 0
## XV 720250 720472 CS_Rep1_peak_292 XV 720416 720815 YOR200W + 0
## XV 720250 720472 CS_Rep1_peak_292 XV 720469 721708 YOR201C - 0
## IV 167510 167719 CS_Rep1_peak_36 IV 167358 167715 YDL162C - 0
## IV 167510 167719 CS_Rep1_peak_36 IV 167713 169078 YDL161W + 0
## XII 489913 490429 CS_Rep1_peak_200b XII 489572 489929 YLR162W + 0
## XII 489913 490429 CS_Rep1_peak_200b XII 490405 490594 YLR162W-A + 0
## XII 489913 490429 CS_Rep1_peak_200a XII 489572 489929 YLR162W + 0
## XII 489913 490429 CS_Rep1_peak_200a XII 490405 490594 YLR162W-A + 0
## VIII 214698 215016 CS_Rep1_peak_149 VIII 214507 214702 YHR054W-A + 0
## VIII 214698 215016 CS_Rep1_peak_149 VIII 214532 214718 YHR055C - 0
## VII 123384 123597 CS_Rep1_peak_115 VII 123571 124042 YGL199C - 0
## VII 123384 123597 CS_Rep1_peak_115 VII 123590 124298 YGL198W + 0
## VI 106367 106606 CS_Rep1_peak_107 VI 106414 106732 YFL015W-A + 0
## VI 106367 106606 CS_Rep1_peak_107 VI 106468 106963 YFL015C - 0
# Distribution of peak distance to closest downstream gene
cut -f 17 CS_Rep1.narrowPeak.closestDownStreamGene.bed | sort -n | uniq -c | less
What is the closest gene downstream of each peak, excluding overlaps?
# What is the closest, downstream, non-overlapping (DSNO) gene to each peak?
# -io = ignore overlaps
bedtools closest -io -id -D b -a CS_Rep1.narrowPeak.bed -b sacCer3_autosomal_genes.bed \
> CS_Rep1.narrowPeak.closestDSNOGene.bed
# Distribution of peak distance to closest DSNO gene
cut -f 17 CS_Rep1.narrowPeak.closestDSNOGene.bed | sort -n | uniq -c | less
Once you have a list of genes either underlying or closest to your peaks, you can do additional analyses such as tests for enrichment of functional categories. This was covered in our previous RNA-seq worshop so we won't cover it again here.
bedtools intersect
, merge
, and jaccard
How many genes in the yeast autosome overlap?
bedtools intersect -wo -a sacCer3_autosomal_genes.bed -b sacCer3_autosomal_genes.bed \
| awk '$4!=$10' > sacCer3_overlap_genes.bed
# Count distinct number of autosomal genes that overlap (1224)
cut -f 4 sacCer3_overlap_genes.bed | sort -u | wc -l
What proportion of the autosomal "gene space" consists of overlapping genes?
# Get the "autosomal gene space" (prop. of genome falling within a gene) by merging all overlaps
bedtools merge -c 4 -o collapse -i sacCer3_autosomal_genes.bed > sacCer3_geneSpace.bed
# Get the distinct overlaps
bedtools merge -c 4 -o distinct -i sacCer3_overlap_genes.bed > sacCer3_geneOverlapSpace.bed
# What proportion of the autosomal gene space consists of overlapping genes? (11.8%)
bedtools jaccard -a sacCer3_geneOverlapSpace.bed -b sacCer3_geneSpace.bed
## intersection union jaccard n_intersections
## 1015525 8636593 0.117584 574
bedtools jaccard
Below is kind of a "toy" example using our ChIP-seq and ATAC-seq peak calls. It is based on a more full-blown example using Dnase hypersensivity data, that you can read about here: http://quinlanlab.org/tutorials/bedtools/bedtools.html#a-jaccard-statistic-for-all-400-pairwise-comparisons
Calculate the Jaccard similarity coefficient (Jaccard index) for all pairwise combinations of *_peaks_filter.narrowPeak
files using gnu parallel:
parallel "bedtools jaccard -a {1}_peaks_filter.narrowPeak -b {2}_peaks_filter.narrowPeak \
| awk 'NR>1' \
| cut -f 3 \
> {1}.{2}.jaccard" \
::: $(ls *_peaks_filter.narrowPeak | sed 's/_peaks_filter.narrowPeak//') \
::: $(ls *_peaks_filter.narrowPeak | sed 's/_peaks_filter.narrowPeak//')
# see what was produced
# ls -ltr sorts the files by modification time, with the most recently modified at the bottom
ls -ltr
# or, more specifically
ls -l *.jaccard
Merge the pairwise similarities into one file (one similarity per line):
for jacc in *.jaccard; do
name1=$(echo $jacc | cut -d "." -f1)
name2=$(echo $jacc | cut -d "." -f2)
simil=$(cat $jacc)
echo $name1$'\t'$name2$'\t'$simil
done > pairwise.jacc.txt
Create a matrix that can be read into R:
awk '
{
name1 = $1
if (!(name1 in names)) {
numNames++
names[name1] = 1
orderedNames[numNames] = name1
}
name2 = $2
sims[name1":"name2] = ($3==1)? "1.0" : $3
}
END {
for (n=1; n<=numNames; n++) {
header = header "\t" orderedNames[n]
}
print header
for (n1=1; n1<=numNames; n1++) {
line = orderedNames[n1]
for (n2=1; n2<=numNames; n2++) {
line=line "\t" sims[orderedNames[n1]":"orderedNames[n2]]
}
print line
}
}
' pairwise.jacc.txt > jacc.matrix.txt
Make a list of group labels that can be used in R:
tail -n +2 jacc.matrix.txt | cut -f 1 | cut -f 1 -d "_" > labels.txt
# Then start R
R
Make a PCA plot and a hierarchically clustered similarity heat map in R:
library(ggplot2)
x <- read.table('jacc.matrix.txt')
labels <- read.table('labels.txt')
labels[,1]
df_pca <- prcomp(x)
df_out <- as.data.frame(df_pca$x)
df_out$group <- labels[,1]
df_out
percentage <- round(df_pca$sdev / sum(df_pca$sdev) * 100, 2)
percentage <- paste( colnames(df_out), "(", paste( as.character(percentage), "%", ")", sep="") )
p<-ggplot(df_out,aes(x=PC1,y=PC2,color=group))
p<-p+geom_point(size=3) + xlab(percentage[1]) + ylab(percentage[2])
pdf('CS_ATAC_pca.pdf')
print(p)
dev.off()
library(gplots)
library(RColorBrewer)
pdf('CS_ATAC_simClust.pdf')
heatmap.2(as.matrix(x),
col=brewer.pal(9,"Blues"),
margins = c(14, 14),
density.info = "none",
lhei = c(2, 8),
trace="none")
dev.off()
# It's done when you see this:
## null device
## 1
Quit R when it's done:
q()
## Save workspace image? [y/n/c]: n
Then download CS_ATAC_pca.pdf
and CS_ATAC_simClust.pdf
with FileZilla and take a look. Which assay appears to be more reproducible?
bedtools makewindows
, bedtools coverage
# There no online documentation for makewindows. Here is how you can read the help:
bedtools makewindows -h 2>&1 | less
# Make the windows
bedtools makewindows -g sacCer3.chrom.sizes -w 50000 \
| sed 's/^chr//' \
> sacCer3.chrom.50Kwin.bed
# Get the gene coverage for each window
bedtools coverage -a sacCer3.chrom.50Kwin.bed -b sacCer3_autosomal_genes.bed \
> sacCer3.numGenes.50Kwin.bed
# Just for (nerdy) fun, let's combine it into one command
bedtools makewindows -g sacCer3.chrom.sizes -w 50000 \
| sed 's/^chr//' \
| bedtools coverage -a - -b sacCer3_autosomal_genes.bed \
> sacCer3.numGenes.50KwinV2.bed
# and confirm that the output is the same (if you're paranoid, like me...)
diff -s sacCer3.numGenes.50Kwin.bed sacCer3.numGenes.50KwinV2.bed
## Files sacCer3.numGenes.50Kwin.bed and sacCer3.numGenes.50KwinV2.bed are identical
# Start R
R
Make a plot of the coverage in R:
geneNum=read.delim("sacCer3.numGenes.50Kwin.bed",header=FALSE)
names(geneNum)=c("chr","start","end","numGenes","bpGenic","bpInterval","propGenic")
geneNum$cumulBinNum=1:dim(geneNum)[1]
geneNum$color='black'
myColor='gray30'
for (myChr in unique(geneNum$chr)) {
geneNum[geneNum$chr==myChr,]$color = myColor
if (myColor=='gray30') {myColor='gray60'}
else {myColor='gray30'}
}
myPlotTitle='Yeast: number of genes per 50Kb bin'
png(file='yeastNumGenesHist.png', title=myPlotTitle,height=8.5,width=11,units="in",res=600)
plot(geneNum$cumulBinNum,geneNum$numGenes,type='h',col=geneNum$color,lend='square',lwd=3.5,
xaxs="i",xlab="Cumulative Genomic Position (x50Kb)",ylab="Number of Genes",main=myPlotTitle)
dev.off()
Quit R as before, then download the yeastNumGenesHist.png
file with FileZilla and have a look.
https://bedtools.readthedocs.io/en/latest/index.html
https://bedtools.readthedocs.io/en/latest/content/overview.html#summary-of-available-tools
http://quinlanlab.org/tutorials/bedtools/bedtools.html
https://bedtools.readthedocs.io/en/latest/content/advanced-usage.html
Quinlan AR (2014) Curr Protoc Bioinform https://doi.org/10.1002/0471250953.bi1112s47
https://github.com/arq5x/bedtools-protocols/blob/master/bedtools.md
https://gettinggeneticsdone.blogspot.com/2014/03/visualize-coverage-exome-targeted-ngs-bedtools.html
bedtools-discuss Google group: https://groups.google.com/g/bedtools-discuss
As it is much more efficient to run deepTools on bigWig instead of bam files, you will first convert the two supplied bam files to bigWig files.
DeepTools are Python3 code. set Linux environment on BioHPC to run deepTools.
export PATH=/programs/deeptools-3.5/bin:$PATH
export PYTHONPATH=/programs/deeptools-3.5/lib64/python3.6/site-packages:/programs/deeptools-3.5/lib/python3.6/site-packages
Run "bamCoverage" to convert bam files to bigWig files. Recommended setting for MNase-seq data:
--binsize 1: set bin size to 1 nucleotide, which is the highest resolution.
--MNase: set MNase mode, so that only central 3 nucleotide is used.
bamCoverage --numberOfProcessors 4 --bam SRR922443_s3.bam -o SRR922443_s3.bw \
--binSize 1 \
--MNase \
--normalizeUsing CPM \
--ignoreForNormalization Mito \
--minFragmentLength 100 \
--maxFragmentLength 300
bamCoverage --numberOfProcessors 4 --bam SRR922443_s4.bam -o SRR922443_s4.bw \
--binSize 1 \
--MNase \
--normalizeUsing CPM \
--ignoreForNormalization Mito \
--minFragmentLength 100 \
--maxFragmentLength 300
After this, you should see two bigWig files: SRR922443_s3.bw and SRR922443_s4.bw
The multiBigwigSummary function calculate average scores in each genomic regions for every samples. The genomic regions could be defined as sliding windows across the genome, or defined in a bed file. The output file from multiBigwigSummary can be used to plot PCA or sample correlation heatmap.
First, you will run "multiBigwigSummary bins" to calculate score in sliding windows (window size 10kb).
multiBigwigSummary bins --binSize 10000 --bwfiles SRR922443_s3.bw SRR922443_s4.bw --outFileName bin_score.npz --outRawCounts bin_score.txt --numberOfProcessors 4
less bin_score.txt
In the output file bin_score.txt, you can see average scores in each sliding windows for the two samples. The score is the normalized coverage depth from the bigWig file, averaged by all sites in the region.
Next, you will run calculate score in each gene regions as defined in a bed file.
You are provided with a gene annotation file in gtf format. Convert the gtf file to a transcript bed file. A 6-column bed file format is used, the 6th column is "strand".
awk -v OFS='\t' '{if (($3=="transcript") && ($1 !="Mito")) {print $1,($4-1),$5,".",".",$7} }' R64.gtf > transcript.bed
Run multiBigwigSummary:
multiBigwigSummary BED-file \
--bwfiles SRR922443_s3.bw SRR922443_s4.bw \
--BED transcript.bed \
-out transcript_score.npz --outRawCounts transcript_score.txt
less transcript_score.txt
Run computeMatrix on the bigWig file you created in the previous step, using the reference-point mode. The reference-points are defined in the transcript.bed file. The computeMatrix would automatically use either the start or end position of the transcript depending on the strand of the gene.
computeMatrix reference-point -S SRR922443_s3.bw -R transcript.bed -a 2000 -b 2000 -o tss_ref.gz --numberOfProcessors 4 --outFileNameMatrix tss_ref.txt --missingDataAsZero
plotProfile -m tss_ref.gz -out SRR922443_profile.png
plotHeatmap -m tss_ref.gz -out SRR922443_heatmap.png
Using FileZilla to download the two .png files to your laptop and open the two files.
You will do a small research project here. The hypothesis is that GC content is low around the transcription start and end sites of each gene. You will calculate the GC content of gene body to confirm this.
First, calculate GC% in 20bp sliding windows across the genome.
bedtools makewindows -g R64.genome.txt -w 20 -s 1 > R64_20bp.bed
bedtools nuc -fi R64.fa -bed R64_20bp.bed > R64.GC.txt
Create a bedGraph file from the GC% results. In this bedGraph file, you will use the mid-point of the 20bp window to represent the window.
awk -v OFS='\t' '{print $1,$2+10,$2+11,$5}' R64.GC.txt > R64.GC.bedgraph
Convert the bedGraph file to bigWig file. To make this to work, you will have to extend the chromosome length by 10 bp in the chromosomal size file, due to the reason that you use the mid-point of the 20bp window in the bedGraph file.
export PATH=/programs/kentUtils/bin:$PATH
awk -v OFS='\t' '{print $1,$2+10}' R64.genome.txt > R64.genome2.txt
bedGraphToBigWig R64.GC.bedgraph R64.genome2.txt R64.GC.bw
Run computMatrix in scale-regions mode. Use the TSS and TES as the anchor points for scaling. Also, you will include 100bp upstream and downstream of each gene.
computeMatrix scale-regions -S R64.GC.bw -R transcript.bed --regionBodyLength 1000 --upstream 100 --downstream 100 --startLabel TSS --endLabel TES -o gc.gz --numberOfProcessors 4
plotProfile -m gc.gz -out gc_profile.png
plotHeatmap -m gc.gz -out gc_heatmap.png
Download the two png files and open them in your laptop.