Homework 8
1 The Experiment
The species is Dropsophila melanogaster (fruit fly). The two conditions are WT (which we’ll also refer to as “W”) and a model of a particular disease caused by activation of a gene called Yki in adult midgut stem cells (we’ll refer to those flies as type “Y”). The “Y” flies present systemic organ wasting and metabolic abnormalities.
The goal is to understand the mechanism: Why does upregulating this gene in this tissue make the animals sick in this way? Looking for the differentially expressed genes between healthy and affected flies is a first piece in the puzzle.
The tissue of interest is the “malpighian tubule”. So we’ll be comparing W to Y in that tissue to find the differentially expressed genes.
This is done (naturally) with RNA-Seq. There are three replicates of each condition. This is paired-end data with 150 base reads. We know that RNA-Seq alignment is harder than DNA-Seq (or ChIP-Seq). That’s why they generated paired-end data with nice long 150-base reads. Last HW we had single-end 50-base reads. So, we’re going to have to juggle more fastq files this time (two per sample).
We are going to push the data through a differential expression analysis pipeline all the way to q-values. Then we’ll perform pathway enrichment analysis.
This represents one of the most common types of analyses performed in bioinformatics.
Fly only has few chromosomes, X, Y, 2, 3, 4 and also mitochondrial.
This analysis is going to use both UNIX and Python. You do not have to code, you just have to run the code that’s given below. It all uses standard packages. The Python will be done in a so-called Jupyter notebook. We do not assume you know what those are, it’s explained below.
But start at the UNIX command line. Make a directory called ‘hw8’ and cd into it. Case matters in what follows so make sure it’s lowercase hw8 and not HW8.
Copy/Paste the following code blocks to the UNIX command line and hit enter. You can copy each grey block in one fell swoop. But will have to wait for each block to finish before proceeding to the next. Some of them execute immediately but some take a few minutes.
2 Link FASTQ files
Declare how the samples correspond to conditions:
declare -A conditions=(
[W]="SRR25526915 SRR25526914 SRR25526913"
[Y]="SRR25526912 SRR25526910 SRR25526909"
)
Link the FASTQ files to a local directory, using clearer names:
cd ~/hw8
mkdir fastq
for condition in "${!conditions[@]}"
do
samples=( ${conditions[$condition]} )
for i in $( seq 1 ${#samples[@]} )
do
sample=${samples[$((i - 1))]}
ln --symbolic /project/rna-seq/fastq/${sample}_1.fastq.gz fastq/${condition}${i}_1.fastq.gz
ln --symbolic /project/rna-seq/fastq/${sample}_2.fastq.gz fastq/${condition}${i}_2.fastq.gz
done
done
2.1 From this point onwards use the linked sample names
declare -A conditions=(
[W]="W1 W2 W3"
[Y]="Y1 Y2 Y3"
)
cd into the fastq directory and do an “ls -l” Notice how the symbolic links look that you just created.
cd fastq
ls -l
Notice it does not show the file sizes, because these are sym links.
You can make it show the sizes with either of these commands
ll -L
ll -hL
(1) Look at the output and report what these two commands do that’s different.
The files with ’_1’ in the file name are the forward reads and the ones with “_2” are the reverse. We’re going to start by aligning one read pair in the UCSC genome browser. Get the first read-pair from the file “Y2”: the forward
zcat Y2_1.fastq.gz | head -4
and the reverse
zcat Y2_2.fastq.gz | head -4
Go to the Dropophila genome in the genome browser. If you select “Other” on the “Genomes” menu you’ll see a quick link to “fly”.
Turn off all tracks except “Ensembl Genes”
Go to BLAT (under the Tools menu) and enter both the forward and reverse reads in fastA format. Call the one from _1 “forward” and the other “reverse”.
You just want to grab the read sequence from the fastq records (2nd row), not all four rows.
(2) Find the gene they map to in the genome browser and supply a screen shot of the whole gene showing where the reads align.
(3) Based on the alignment of the forward and reverse read pair, how long is the fragment?
3 Download genome
3.1 Specify the genome and its location
cd ~/hw8
mkdir genome
species="Drosophila_melanogaster"
version="BDGP6.46"
release="110"
fasta_file="${species}.${version}.dna.primary_assembly.fa"
gtf_file="${species}.${version}.${release}.chr.gtf"
3.2 Download the chromosomes (arms)
for chromosome in 2L 2R 3L 3R 4 X Y mitochondrion_genome
do
wget -O genome/${species}.${version}.dna.primary_assembly.${chromosome}.fa.gz \
${release}/fasta/$(echo $species | awk '{print tolower($0)}')/dna/${species}.${version}.dna.primary_assembly.${chromosome}.fa.gz
http://ftp.ensembl.org/pub/release-done
3.3 Concatenate the chromosomes
zcat genome/${species}.${version}.dna.primary_assembly.*.fa.gz > genome/${fasta_file}
rm genome/${species}.${version}.dna.primary_assembly.*.fa.gz
3.4 Download GTF file
wget -P genome http://ftp.ensembl.org/pub/release-${release}/gtf/$(echo $species | awk '{print tolower($0)}')/${gtf_file}.gz
gunzip --to-stdout genome/${gtf_file}.gz > genome/${gtf_file}
The GTF file is another standard file format you should familiarize yourself with. This format is used to describe genes in the genome. They give genome coordinates of various features associated to genes such as exons, start and stop codons, etc. Take a look at the file:
cd genome
less -S Drosophila_melanogaster.BDGP6.46.110.chr.gtf
The 3rd column gives the type of feature described by that row. Use cut and sort to list all the various types.
cut -f 3 Drosophila_melanogaster.BDGP6.46.110.chr.gtf |grep -v ^# | sort -u
(4) One is ‘gene’ and the others are gene parts like exon or CDS. How many types are there? These should all be familiar except perhaps one mysterious one that starts with capital S. Use google to figure out and report what that is (a short definition is sufficient). How many different genes does that occur in the file? There are just a few, give the actual gene names from the GTF records, not the gene IDs.
Certain steps in tihs pipeline don’t depend on the others. The following chart shows which steps are dependent on which. Notice steps 4 and 5 can be run in parallel, so we will do that.
(5) Execute those steps together (they’re the next two steps below) and after you’ve done that use the following command to see the status. Supply a screen shot of the status table to show they were done together.
batch list-jobs --max-results 13
Notice this is how you see more than the last 10 jobs.
4 Build genome index
cd ~/hw8
mkdir index
read_length=150
batch submit-job --job-name INDEX --vcpu 8 --memory 15000 --command "
STAR --runMode genomeGenerate \
--runThreadN 8 \
--limitGenomeGenerateRAM $(( 12 * 1024**3 )) \
--genomeDir index \
--genomeFastaFiles genome/${fasta_file} \
--sjdbGTFfile genome/${gtf_file} \
--sjdbOverhang $((read_length - 1))"
5 fastQC (Quality Control)
cd ~/hw8
mkdir qc
for condition in "${!conditions[@]}"
do
samples=( ${conditions[$condition]} )
for sample in "${samples[@]}"
do
batch submit-job --job-name QC-${sample}_1 --vcpu 2 --memory 3000 --command "fastqc --threads \$(nproc) --outdir qc fastq/${sample}_1.fastq.gz"
batch submit-job --job-name QC-${sample}_2 --vcpu 2 --memory 3000 --command "fastqc --threads \$(nproc) --outdir qc fastq/${sample}_2.fastq.gz"
done
done
There are six samples, but since this is paired-end data there are 12 fastq files. So the step above submitted 12 jobs. If you submitted these while step 4’s jobs are also running then you’ll have at least 13, so use batch list-jobs --max-results 13
(the default is to show only 10 most recent jobs).
When the fastQC jobs is finished, there should be a bunch of files in the qc directory.
cd qc
ls -l
unzip '*zip'
(6) Click on W1_1_fastqc.html in the qc directory and look at the Per base sequence content. It failed but how bad is it? Should we be particularly concerned?
We’re going to look at just one stat over all 12 files. From the qc directory execute this command
grep "Total Deduplicated" *fastqc/*data.txt
(7) This is pulling out the Total Deduplicated Percentage from all 12 files. These give the percent of reads that are left after removing duplicates. What is the average in each of the two experimental conditions W and Y?
Take note of the huge number of dups in these samples. But unlike ChIP-Seq or DNA-Seq we’re not going to remove them, because a much higher percent of them will be real dups in RNA-Seq as opposed to PCR dups. If the number of dups is not terribly different between W and T, we hope it will not adversely affect our DE analysis. The alternative of removing all dups will remove a huge amount of real signal. Too much.
6 Align and quantify
cd ~/hw8
mkdir alignment
for condition in "${!conditions[@]}"
do
samples=( ${conditions[$condition]} )
for sample in "${samples[@]}"
do
batch submit-job --job-name ALIGN-${sample} --vcpu 6 --memory 20000 --command "
STAR --runMode alignReads \
--genomeDir index \
--runRNGseed 42 \
--runThreadN 6 \
--outSAMtype BAM Unsorted \
--outBAMcompression 10 \
--readFilesCommand zcat \
--readFilesIn fastq/${sample}_1.fastq.gz fastq/${sample}_2.fastq.gz \
--quantMode GeneCounts \
--outFileNamePrefix alignment/${sample}
"
done
done
(8) Six alignment jobs were executed. But there are 12 fastq files. Look at the command and try to figure out why STAR was only called six times. Explain.
Once finished, STAR will write the results to the alignment directory. There’s a file of summary stats for each sample - look at the file Y1Log.final.out to see the info reported.
The following will grep out the number of read (pairs) sequenced for each sample:
cd ~/hw8
cd alignment
grep "Number of input reads" *final.out
(9) People always try to sequence at equal depth for all samples, but it will inevitably vary a lot from sample to sample. How much smaller percentwise is the smallest compared to the largest?
6.0.1 Quantifications
The aligner STAR has done some extra work for us by also running a simple gene-level quantification script. This is only possible because when we built the STAR index we gave it the gene annotation (gtf) file. Otherwise STAR would have no idea where the genes were. The following commands will list the files with the gene quants.
ls -l | grep ReadsPerGene
The first column is the gene ID and the second column is the relevant count. Reads pairs are counted only once. So in effect we’re counting fragments, not reads.
head -20 W1ReadsPerGene.out.tab
The first four rows are summary statistics, the gene counts come after that.
These data were not sequenced “strand specific” so we’ve lost the information about which strand a read came from, it ends up being a coin flip for each read pair. That’s why we’re ignoring columns 3 and 4 of those RadsPerGene files, those are strand specific counts, which are meaningless here. The relevant count is column 2. You might expect column 2 to be equal to column 3 plus column 4 but it is not. There’s a good reason. But don’t worry about it here.
(10) In sample W1, find the gene that’s most highly expressed. You can sort on the 2nd column with the following options to the sort command. It’s piped to ‘grep FB’ to get just the genes.
sort -r -g -k 2 <filename> | grep FB | head
Report the highest expressed gene in sample W1 with the gene ID and count.
(11) Is it the same gene that’s the most highly expressed in all six samples? If not, which sample has a different one? Report the sample name, gene ID and count.
7 Sort and index BAM files
The following will sort the BAM files by chromosome/position. Reads that align next to each other in the genome are then consecutive in the BAM file. This step makes them easier to parse into coverage plots, for example.
cd ~/hw8
mkdir sorted
for condition in "${!conditions[@]}"
do
samples=( ${conditions[$condition]} )
for sample in "${samples[@]}"
do
batch submit-job --job-name SORT-INDEX-$sample --vcpu 2 --memory 16000 --command "
samtools sort --threads 1 -o sorted/${sample}.bam alignment/${sample}Aligned.out.bam \
&& cd sorted \
&& samtools index --threads \$(nproc) ${sample}.bam
"
done
done
8 DE analysis
You can run the DE analysis while the previous jobs are running (refer to the flow chart of dependencies above).
We are going to do the DE analysis in Python. But you do not have to write any code, you are going to run the commands given below interactively in a Jupyter notebook (see picture below).
Each person on the server has limited amount of RAM. Some of the following may fail if you have too many tabs open. So this is a good time to close some of them. If you run out of RAM you’ll get a popup warning that the “kernel has restarted”.
Hopefully that won’t happen but if it does, close the other tabs you’re not using and then in the notebook you probably need to backtrack a bit to rerun some earlier steps. You would not have to redo any compute that you’ve already done at the unix command line, just previous steps in the notebook you’re about to open. And once you do go back to Unix, you will have to redefine your variables because those will have been reset too.
The DE analysis steps are to be executed in order in a Jupyter notebook. You can copy the code from here to the cells in the notebook and execute the cell by pressing Shift+Enter
.
8.1 Load modules
Copy the entire following code block into one cell in the notebook and hit Shift-Enter to execute it.
Code
import jupyter_black
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from pydeseq2.ds import DeseqStats
from pydeseq2.dds import DeseqDataSet
jupyter_black.load()"font.family"] = "Liberation Sans" plt.rcParams[
The above should run quickly, like seconds. There is no output.
8.2 Combine the quantifications in one table
Don’t forget to hit Shift+Enter.
Code
= {"W": ["W1", "W2", "W3"], "Y": ["Y1", "Y2", "Y3"]}
study
= Path.home() / "hw8"
HOMEWORK_DIRECTORY
= HOMEWORK_DIRECTORY / "alignment"
alignment_directory
= [
counts
pd.read_csv(/ f"{sample}ReadsPerGene.out.tab",
alignment_directory ="\t",
sep=0,
index_col=4,
skiprows=None,
header=["unstranded", "forward_stranded", "reverse_stranded"],
names"unstranded"]].rename({"unstranded": sample}, axis="columns")
)[[for condition, samples in study.items()
for sample in samples
]
= pd.concat(counts, axis="columns")
counts counts
This should also run quickly, the output should look like this. This is the table of raw counts. Rows are genes, columns are samples, cells are the number of read (pairs) from that sample that align to that gene.
8.3 Filter low expressors
The number of rows is shown below the table.
(12) We’re going to remove all genes where the mean expression across all six samples is less or equal to 10. These are just very low expressed genes that are unlikely to be DE since they’re probably low everywhere. And they just make the multiple testing problem worse. Compare the before and after and report how many genes were removed exactly.
Code
= counts.loc[counts.mean(axis="columns") >= 10, :]
counts_filtered counts_filtered
8.4 Dendogram Clustering
Next we’re going to cluster the samples by making a tree. This is helpful for QC purposes. This will reveal, for example, if samples might have been switched between conditions by accident. This should only take a few seconds to run, it should produce a tree diagram.
(13) Supply a screenshot of the tree and interpret it for QC purposes.
Code
import plotly.figure_factory as ff
= ff.create_dendrogram(counts_filtered.T, labels=counts_filtered.columns)
fig fig.show()
8.5 Execute the DE procedure
8.5.1 Make the metadata table
Code
= pd.DataFrame(study).melt()
metadata = ("Condition", "Sample")
metadata.columns "Sample", inplace=True)
metadata.set_index(
metadata
8.5.2 Determine normalization factors and dispersions
Code
= DeseqDataSet(
dds =counts_filtered.T,
counts=metadata,
metadata="Condition",
design_factors=True,
refit_cooks=1,
n_cpus
)
dds.fit_size_factors()
dds.fit_genewise_dispersions()
dds.fit_dispersion_trend()
dds.fit_dispersion_prior() dds.fit_MAP_dispersions()
It will take a couple minutes, watch for this output to the screen:
8.5.3 Deal with outliers
Code
dds.fit_LFC()
dds.calculate_cooks() dds.refit()
Output should look like this. Wait for it.
8.5.4 Perform DE analysis
Code
= DeseqStats(
analysis =0.05, cooks_filter=True, independent_filter=True, n_cpus=1
dds, alpha
)
analysis.run_wald_test()
analysis._cooks_filtering()
analysis._independent_filtering()
analysis.lfc_shrink() analysis.summary()
Output should look like this, wait for it. After this we have our q-values.
padj is a synonym for q-value
8.5.5 Show MA plot
Next we’re going to visualize the data with an MvA plot to get global picture of differential expression between WT and Y. In an MvA the Y-axis represents fold-change between the two conditions (W and Y). The X-axis overall expression level of the gene. Every gene is one point. So higher expressed genes are to the right, lower to the left. And the non differentially expressed genes are the ones near the dashed line.
(14) Include a screen shot of the plot in your write-up.
analysis.plot_MA()
8.6 Sort by q-value (aka padj)
Code
= analysis.results_df
results
= results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 2)]
top_genes "padj")[:20] top_genes.sort_values(
(15) What is the most significant gene in this table and its q-value?
8.6.1 Store results
This will save the results table to your differential directory.
Code
= HOMEWORK_DIRECTORY / "differential"
differential_directory =True)
differential_directory.mkdir(exist_ok
"padj").to_csv(differential_directory / "results.csv") results.sort_values(
We’ll also need a list of the top DE genes for pathway analysis later. So execute the following, it will save that list in the differential directory.
with open(differential_directory / "top_genes.txt", "w") as file:
print("\n".join(top_genes.index), file=file)
(16) Find the file of all genes and their q-values in the hw8/differential directory, called results.csv. How many genes have q-value less or equal to 0.01?
Close the tab with the jupyter notebook where you did the DE analysis. If you don’t there might not be enough RAM to do this next step.
9 Visualize the alignment using IGV
Instead of the UCSC genome browser we’re going to visualize this data in another popular browser called IGV. IGV can show the location of every read which UCSC does not.
IGV has to be installed to run locally. But instead of installing it on your laptop, we’ve installed it here on the server. It’s pretty stable but can be tempermental if, for example, there are a million reads for some gene.
So we’re not going to turn the reads tracks on, just the coverage tracks. But the code to turn them on is there but commented out, if you want to play with it.
9.1 Index genome
Return to the UNIX command line hw8 directory and run the following:
cd ~/hw8
samtools faidx --output genome/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa.fai \
genome/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa
9.2 Create coverage files
We’re going to define the necessary variables again, just in case.
cd ~/hw8
mkdir graphs
declare -A conditions=(
[W]="W1 W2 W3"
[Y]="Y1 Y2 Y3"
)
for condition in "${!conditions[@]}"
do
samples=( ${conditions[$condition]} )
for sample in "${samples[@]}"
do
batch submit-job --job-name COVERAGE-${sample} --vcpu 2 --memory 15000 --command "
bamCoverage --binSize 1 \
--normalizeUsing CPM \
--effectiveGenomeSize 142573017 \
--numberOfProcessors \$(nproc) \
--bam sorted/${sample}.bam \
--outFileFormat bigwig \
--outFileName graphs/${sample}.bw
bigWigToWig graphs/${sample}.bw graphs/${sample}.wig
igvtools toTDF -z 10 graphs/${sample}.wig graphs/${sample}.tdf genome/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa.fai
"
done
done
Making coverage plots is always a big compute. It will take a few minutes to run, monitor with batch ‘list-jobs’.
9.3 Use IGV
Create a new notebook.
.
*Note we’re coloring WT blue and the other condition “Y” red. Those colors do not mean strand here like they did last week. We don’t actually know the strand in this data.
Copy the following code into the cell of the notebook and hit Shift-Enter.
The code below has the reads/alignments track commented out. As explained above, that’s intentional.
import os
import igv_notebook
from itertools import chain
from pathlib import Path
from urllib.parse import urlparse
igv_notebook.init()
= Path.home() / "hw8"
HOMEWORK_DIRECTORY
= os.readlink(f"/proc/{os.environ['JPY_PARENT_PID']}/cwd")
JUPYTER_SERVER_ROOT = HOMEWORK_DIRECTORY.relative_to(JUPYTER_SERVER_ROOT)
RELATIVE_HOMEWORK_DIRECTORY
= "/" / RELATIVE_HOMEWORK_DIRECTORY / "genome"
genome sorted = "/" / RELATIVE_HOMEWORK_DIRECTORY / "sorted"
= "/" / RELATIVE_HOMEWORK_DIRECTORY / "graphs"
graphs
= {
reference "id": "BDGP6.46",
"name": "Drosophila melanogaster (version BDGP6.46, release 110)",
"fastaURL": f"{genome}/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa",
"indexURL": f"{genome}/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa.fai",
"tracks": [
{"name": "Gencode genes",
"format": "gtf",
"url": f"{genome}/Drosophila_melanogaster.BDGP6.46.110.chr.gtf.gz",
}
],
}
= {"W": ["W1", "W2", "W3"], "Y": ["Y1", "Y2", "Y3"]}
study
= igv_notebook.Browser(
browser
{"reference": reference,
"locus": "3L:18,166,779-18,186,895",
"tracks": list(
chain(*[
(
{"name": f"{sample} coverage",
"url": f"{graphs}/{sample}.tdf",
"format": "tdf",
"autoscaleGroup": 1,
"color": "blue" if condition == "W" else "red",
},# {
# "name": f"{sample} reads",
# "url": f"{sorted}/{sample}.bam",
# "indexURL": f"{sorted}/{sample}.bam.bai",
# "format": "bam",
# "type": "alignment",
# "showCoverage": False,
# "color": "blue" if condition == "W" else "red",
# },
)for condition, samples in study.items()
for sample in samples
]
)
),
} )
(17) You’re going to surf to a DE gene and take a screen shot. Everybody’s going to do a different gene. Find the spreadsheet we saved in the differential directory “results.csv”. Let K=500+N+M where the first letter of your first name is the Nth letter of the alphabet and the first letter of your last name is the Mth letter of the alphabet. Get the geneID for the Kth gene on the list differential/results.csv (be careful not to count the header line). What is the q-value?
(18) To find it in IGV we need to lookup its name (it doesn’t work for gene ID’s for some reason). To do that go to the genome directory and grep the .gtf file for that gene ID (and pipe to “head -1” to get the “gene” record). You’ll find the gene name there. Copy the gene name and enter it in the “Locus search” box in IGV and hit enter. Take a screen shot showing all six tracks. Does the gene look differential visually?
10 Enrichment analysis
We’re going to end with Pathway analysis as is typical.
10.1 Load the data
- Go to STRING (right click to open in new tab)
- Click SEARCH
- Select Multple proteins from the menu on the left
- Get the gene IDs from the file you saved in differential.
- Select the approprate organism
- Click SEARCH
- Click CONTINUE (possibly click CONTINUE once more if the Large Network Warning appears)
- You will now see PPI (protein-protein interaction) network
10.2 Enrichment analysis
- Scroll down and click on the Analysis tab
- At the bottom is a standard pathway enrichment analysis. It’s divided into three natural categories of gene sets: Process, Function and Component.
- Sort them by False Discovery Rate
(19) What is the most siginificant gene set in each of those three categories.
(20) Select those three categories with the mouse to highlight those genes in the protein interaction network. Use the “Export” button to export the PPI and include it in your write-up.