Homework 8

Author

Gregory R. Grant

Published

November 4, 2023

1 The Experiment

The species is Dropsophila melanogaster (fruit fly). The two conditions are WT (which we’ll also refer to as “W”) and a model of a particular disease caused by activation of a gene called Yki in adult midgut stem cells (we’ll refer to those flies as type “Y”). The “Y” flies present systemic organ wasting and metabolic abnormalities.

The goal is to understand the mechanism: Why does upregulating this gene in this tissue make the animals sick in this way? Looking for the differentially expressed genes between healthy and affected flies is a first piece in the puzzle.

The tissue of interest is the “malpighian tubule”. So we’ll be comparing W to Y in that tissue to find the differentially expressed genes.

This is done (naturally) with RNA-Seq. There are three replicates of each condition. This is paired-end data with 150 base reads. We know that RNA-Seq alignment is harder than DNA-Seq (or ChIP-Seq). That’s why they generated paired-end data with nice long 150-base reads. Last HW we had single-end 50-base reads. So, we’re going to have to juggle more fastq files this time (two per sample).

We are going to push the data through a differential expression analysis pipeline all the way to q-values. Then we’ll perform pathway enrichment analysis.

This represents one of the most common types of analyses performed in bioinformatics.

Fly only has few chromosomes, X, Y, 2, 3, 4 and also mitochondrial.

This analysis is going to use both UNIX and Python. You do not have to code, you just have to run the code that’s given below. It all uses standard packages. The Python will be done in a so-called Jupyter notebook. We do not assume you know what those are, it’s explained below.

But start at the UNIX command line. Make a directory called ‘hw8’ and cd into it. Case matters in what follows so make sure it’s lowercase hw8 and not HW8.

Copy/Paste the following code blocks to the UNIX command line and hit enter. You can copy each grey block in one fell swoop. But will have to wait for each block to finish before proceeding to the next. Some of them execute immediately but some take a few minutes.

3 Download genome

3.1 Specify the genome and its location

cd ~/hw8
mkdir genome
species="Drosophila_melanogaster"
version="BDGP6.46"
release="110"

fasta_file="${species}.${version}.dna.primary_assembly.fa"
gtf_file="${species}.${version}.${release}.chr.gtf"

3.2 Download the chromosomes (arms)

for chromosome in 2L 2R 3L 3R 4 X Y mitochondrion_genome
do
    wget -O genome/${species}.${version}.dna.primary_assembly.${chromosome}.fa.gz \
            http://ftp.ensembl.org/pub/release-${release}/fasta/$(echo $species | awk '{print tolower($0)}')/dna/${species}.${version}.dna.primary_assembly.${chromosome}.fa.gz
done

3.3 Concatenate the chromosomes

zcat genome/${species}.${version}.dna.primary_assembly.*.fa.gz > genome/${fasta_file}
rm genome/${species}.${version}.dna.primary_assembly.*.fa.gz

3.4 Download GTF file

wget -P genome http://ftp.ensembl.org/pub/release-${release}/gtf/$(echo $species | awk '{print tolower($0)}')/${gtf_file}.gz
gunzip --to-stdout genome/${gtf_file}.gz > genome/${gtf_file}

The GTF file is another standard file format you should familiarize yourself with. This format is used to describe genes in the genome. They give genome coordinates of various features associated to genes such as exons, start and stop codons, etc. Take a look at the file:

cd genome
less -S Drosophila_melanogaster.BDGP6.46.110.chr.gtf

The 3rd column gives the type of feature described by that row. Use cut and sort to list all the various types.

cut -f 3 Drosophila_melanogaster.BDGP6.46.110.chr.gtf |grep -v ^# | sort -u

(4) One is ‘gene’ and the others are gene parts like exon or CDS. How many types are there? These should all be familiar except perhaps one mysterious one that starts with capital S. Use google to figure out and report what that is (a short definition is sufficient). How many different genes does that occur in the file? There are just a few, give the actual gene names from the GTF records, not the gene IDs.

Parallelization

Certain steps in tihs pipeline don’t depend on the others. The following chart shows which steps are dependent on which. Notice steps 4 and 5 can be run in parallel, so we will do that.

(5) Execute those steps together (they’re the next two steps below) and after you’ve done that use the following command to see the status. Supply a screen shot of the status table to show they were done together.

batch list-jobs --max-results 13

Notice this is how you see more than the last 10 jobs.

stateDiagram-v2
    state "2. Link FASTQ files" as link_fastq_files
    state "3. Download genome" as download_genome
    state "4. Build genome index" as build_genome_index
    state "5. Quality control (fastQC)" as read_qc
    state "6. Align and quantify" as align_and_quantify
    state "7. Sort and index BAM files" as sort_and_index_bam_files
    state "9. Visualize the alignment using IGV" as visualize_the_alignment_using_IGV
    state "8. DE analysis" as de_analysis
    state "10. Enrichment Analysis" as enr_analysis

  link_fastq_files --> download_genome

  download_genome --> read_qc
  download_genome --> build_genome_index

  build_genome_index --> align_and_quantify

  align_and_quantify --> sort_and_index_bam_files
  sort_and_index_bam_files --> visualize_the_alignment_using_IGV
  
  align_and_quantify --> de_analysis
  de_analysis --> enr_analysis

Figure 1: Pipeline state diagram

4 Build genome index

cd ~/hw8
mkdir index

read_length=150

batch submit-job --job-name INDEX --vcpu 8 --memory 15000 --command "
    STAR --runMode genomeGenerate \
         --runThreadN 8 \
         --limitGenomeGenerateRAM $(( 12 * 1024**3 )) \
         --genomeDir index \
         --genomeFastaFiles genome/${fasta_file} \
         --sjdbGTFfile genome/${gtf_file} \
         --sjdbOverhang $((read_length - 1))"

5 fastQC (Quality Control)

cd ~/hw8
mkdir qc

for condition in "${!conditions[@]}"
do
    samples=( ${conditions[$condition]} )
    for sample in "${samples[@]}"
    do
        batch submit-job --job-name QC-${sample}_1 --vcpu 2 --memory 3000 --command "fastqc --threads \$(nproc) --outdir qc fastq/${sample}_1.fastq.gz"
        batch submit-job --job-name QC-${sample}_2 --vcpu 2 --memory 3000 --command "fastqc --threads \$(nproc) --outdir qc fastq/${sample}_2.fastq.gz"
    done
done
Note

There are six samples, but since this is paired-end data there are 12 fastq files. So the step above submitted 12 jobs. If you submitted these while step 4’s jobs are also running then you’ll have at least 13, so use batch list-jobs --max-results 13 (the default is to show only 10 most recent jobs).

When the fastQC jobs is finished, there should be a bunch of files in the qc directory.

cd qc
ls -l
unzip '*zip'

(6) Click on W1_1_fastqc.html in the qc directory and look at the Per base sequence content. It failed but how bad is it? Should we be particularly concerned?

We’re going to look at just one stat over all 12 files. From the qc directory execute this command

grep "Total Deduplicated" *fastqc/*data.txt

(7) This is pulling out the Total Deduplicated Percentage from all 12 files. These give the percent of reads that are left after removing duplicates. What is the average in each of the two experimental conditions W and Y?

Note

Take note of the huge number of dups in these samples. But unlike ChIP-Seq or DNA-Seq we’re not going to remove them, because a much higher percent of them will be real dups in RNA-Seq as opposed to PCR dups. If the number of dups is not terribly different between W and T, we hope it will not adversely affect our DE analysis. The alternative of removing all dups will remove a huge amount of real signal. Too much.

6 Align and quantify

cd ~/hw8
mkdir alignment

for condition in "${!conditions[@]}"
do
    samples=( ${conditions[$condition]} )
    for sample in "${samples[@]}"
    do
        batch submit-job --job-name ALIGN-${sample} --vcpu 6 --memory 20000 --command "
            STAR --runMode alignReads \
                 --genomeDir index \
                 --runRNGseed 42 \
                 --runThreadN 6 \
                 --outSAMtype BAM Unsorted \
                 --outBAMcompression 10 \
                 --readFilesCommand zcat \
                 --readFilesIn fastq/${sample}_1.fastq.gz fastq/${sample}_2.fastq.gz \
                 --quantMode GeneCounts \
                 --outFileNamePrefix alignment/${sample}
            "
    done
done

(8) Six alignment jobs were executed. But there are 12 fastq files. Look at the command and try to figure out why STAR was only called six times. Explain.

Once finished, STAR will write the results to the alignment directory. There’s a file of summary stats for each sample - look at the file Y1Log.final.out to see the info reported.

The following will grep out the number of read (pairs) sequenced for each sample:

cd ~/hw8
cd alignment
grep "Number of input reads" *final.out

(9) People always try to sequence at equal depth for all samples, but it will inevitably vary a lot from sample to sample. How much smaller percentwise is the smallest compared to the largest?

6.0.1 Quantifications

The aligner STAR has done some extra work for us by also running a simple gene-level quantification script. This is only possible because when we built the STAR index we gave it the gene annotation (gtf) file. Otherwise STAR would have no idea where the genes were. The following commands will list the files with the gene quants.

ls -l | grep ReadsPerGene

The first column is the gene ID and the second column is the relevant count. Reads pairs are counted only once. So in effect we’re counting fragments, not reads.

head -20 W1ReadsPerGene.out.tab

The first four rows are summary statistics, the gene counts come after that.

These data were not sequenced “strand specific” so we’ve lost the information about which strand a read came from, it ends up being a coin flip for each read pair. That’s why we’re ignoring columns 3 and 4 of those RadsPerGene files, those are strand specific counts, which are meaningless here. The relevant count is column 2. You might expect column 2 to be equal to column 3 plus column 4 but it is not. There’s a good reason. But don’t worry about it here.

(10) In sample W1, find the gene that’s most highly expressed. You can sort on the 2nd column with the following options to the sort command. It’s piped to ‘grep FB’ to get just the genes.

sort -r -g -k 2 <filename> | grep FB | head

Report the highest expressed gene in sample W1 with the gene ID and count.

(11) Is it the same gene that’s the most highly expressed in all six samples? If not, which sample has a different one? Report the sample name, gene ID and count.

7 Sort and index BAM files

The following will sort the BAM files by chromosome/position. Reads that align next to each other in the genome are then consecutive in the BAM file. This step makes them easier to parse into coverage plots, for example.

cd ~/hw8
mkdir sorted

for condition in "${!conditions[@]}"
do
    samples=( ${conditions[$condition]} )
    for sample in "${samples[@]}"
    do
        batch submit-job --job-name SORT-INDEX-$sample --vcpu 2 --memory 16000 --command "
            samtools sort --threads 1 -o sorted/${sample}.bam alignment/${sample}Aligned.out.bam \
            && cd sorted \
            && samtools index --threads \$(nproc) ${sample}.bam
        "
    done
done

8 DE analysis

You can run the DE analysis while the previous jobs are running (refer to the flow chart of dependencies above).

We are going to do the DE analysis in Python. But you do not have to write any code, you are going to run the commands given below interactively in a Jupyter notebook (see picture below).

IMPORTANT

Each person on the server has limited amount of RAM. Some of the following may fail if you have too many tabs open. So this is a good time to close some of them. If you run out of RAM you’ll get a popup warning that the “kernel has restarted”.

Hopefully that won’t happen but if it does, close the other tabs you’re not using and then in the notebook you probably need to backtrack a bit to rerun some earlier steps. You would not have to redo any compute that you’ve already done at the unix command line, just previous steps in the notebook you’re about to open. And once you do go back to Unix, you will have to redefine your variables because those will have been reset too.

The DE analysis steps are to be executed in order in a Jupyter notebook. You can copy the code from here to the cells in the notebook and execute the cell by pressing Shift+Enter.

Notebook

Notebook Cell

Notebook Cell Type

8.1 Load modules

IMPORTANT

Copy the entire following code block into one cell in the notebook and hit Shift-Enter to execute it.

Code
import jupyter_black
import matplotlib.pyplot as plt
import pandas as pd

from pathlib import Path
from pydeseq2.ds import DeseqStats
from pydeseq2.dds import DeseqDataSet

jupyter_black.load()
plt.rcParams["font.family"] = "Liberation Sans"

The above should run quickly, like seconds. There is no output.

8.2 Combine the quantifications in one table

Don’t forget to hit Shift+Enter.

Code
study = {"W": ["W1", "W2", "W3"], "Y": ["Y1", "Y2", "Y3"]}

HOMEWORK_DIRECTORY = Path.home() / "hw8"

alignment_directory = HOMEWORK_DIRECTORY / "alignment"

counts = [
    pd.read_csv(
        alignment_directory / f"{sample}ReadsPerGene.out.tab",
        sep="\t",
        index_col=0,
        skiprows=4,
        header=None,
        names=["unstranded", "forward_stranded", "reverse_stranded"],
    )[["unstranded"]].rename({"unstranded": sample}, axis="columns")
    for condition, samples in study.items()
    for sample in samples
]

counts = pd.concat(counts, axis="columns")
counts

This should also run quickly, the output should look like this. This is the table of raw counts. Rows are genes, columns are samples, cells are the number of read (pairs) from that sample that align to that gene.

Expression Spreadsheet of Raw Counts

8.3 Filter low expressors

The number of rows is shown below the table.

(12) We’re going to remove all genes where the mean expression across all six samples is less or equal to 10. These are just very low expressed genes that are unlikely to be DE since they’re probably low everywhere. And they just make the multiple testing problem worse. Compare the before and after and report how many genes were removed exactly.

Code
counts_filtered = counts.loc[counts.mean(axis="columns") >= 10, :]
counts_filtered

8.4 Dendogram Clustering

Next we’re going to cluster the samples by making a tree. This is helpful for QC purposes. This will reveal, for example, if samples might have been switched between conditions by accident. This should only take a few seconds to run, it should produce a tree diagram.

(13) Supply a screenshot of the tree and interpret it for QC purposes.

Code
import plotly.figure_factory as ff

fig = ff.create_dendrogram(counts_filtered.T, labels=counts_filtered.columns)
fig.show()

8.5 Execute the DE procedure

8.5.1 Make the metadata table

Code
metadata = pd.DataFrame(study).melt()
metadata.columns = ("Condition", "Sample")
metadata.set_index("Sample", inplace=True)

metadata

8.5.2 Determine normalization factors and dispersions

Code
dds = DeseqDataSet(
    counts=counts_filtered.T,
    metadata=metadata,
    design_factors="Condition",
    refit_cooks=True,
    n_cpus=1,
)

dds.fit_size_factors()
dds.fit_genewise_dispersions()
dds.fit_dispersion_trend()
dds.fit_dispersion_prior()
dds.fit_MAP_dispersions()
NOW WAIT

It will take a couple minutes, watch for this output to the screen:

Expected output

8.5.3 Deal with outliers

Code
dds.fit_LFC()
dds.calculate_cooks()
dds.refit()

Output should look like this. Wait for it.

Expected output

8.5.4 Perform DE analysis

Code
analysis = DeseqStats(
    dds, alpha=0.05, cooks_filter=True, independent_filter=True, n_cpus=1
)

analysis.run_wald_test()
analysis._cooks_filtering()
analysis._independent_filtering()
analysis.lfc_shrink()
analysis.summary()

Output should look like this, wait for it. After this we have our q-values.

Expected output

Note

padj is a synonym for q-value

8.5.5 Show MA plot

Next we’re going to visualize the data with an MvA plot to get global picture of differential expression between WT and Y. In an MvA the Y-axis represents fold-change between the two conditions (W and Y). The X-axis overall expression level of the gene. Every gene is one point. So higher expressed genes are to the right, lower to the left. And the non differentially expressed genes are the ones near the dashed line.

(14) Include a screen shot of the plot in your write-up.

analysis.plot_MA()

8.6 Sort by q-value (aka padj)

Code
results = analysis.results_df

top_genes = results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 2)]
top_genes.sort_values("padj")[:20]

(15) What is the most significant gene in this table and its q-value?

8.6.1 Store results

This will save the results table to your differential directory.

Code
differential_directory = HOMEWORK_DIRECTORY / "differential"
differential_directory.mkdir(exist_ok=True)

results.sort_values("padj").to_csv(differential_directory / "results.csv")

We’ll also need a list of the top DE genes for pathway analysis later. So execute the following, it will save that list in the differential directory.

with open(differential_directory / "top_genes.txt", "w") as file:
    print("\n".join(top_genes.index), file=file)

(16) Find the file of all genes and their q-values in the hw8/differential directory, called results.csv. How many genes have q-value less or equal to 0.01?

NOW CLEAN UP

Close the tab with the jupyter notebook where you did the DE analysis. If you don’t there might not be enough RAM to do this next step.

9 Visualize the alignment using IGV

Instead of the UCSC genome browser we’re going to visualize this data in another popular browser called IGV. IGV can show the location of every read which UCSC does not.

IGV has to be installed to run locally. But instead of installing it on your laptop, we’ve installed it here on the server. It’s pretty stable but can be tempermental if, for example, there are a million reads for some gene.

So we’re not going to turn the reads tracks on, just the coverage tracks. But the code to turn them on is there but commented out, if you want to play with it.

9.1 Index genome

Return to the UNIX command line hw8 directory and run the following:

cd ~/hw8
samtools faidx --output genome/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa.fai \
                        genome/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa

9.2 Create coverage files

We’re going to define the necessary variables again, just in case.

cd ~/hw8
mkdir graphs

declare -A conditions=(
    [W]="W1 W2 W3"
    [Y]="Y1 Y2 Y3"
)

for condition in "${!conditions[@]}"
do
    samples=( ${conditions[$condition]} )
    for sample in "${samples[@]}"
    do
        batch submit-job --job-name COVERAGE-${sample} --vcpu 2 --memory 15000 --command "
            bamCoverage --binSize 1 \
                        --normalizeUsing CPM \
                        --effectiveGenomeSize 142573017 \
                        --numberOfProcessors \$(nproc) \
                        --bam sorted/${sample}.bam \
                        --outFileFormat bigwig \
                        --outFileName graphs/${sample}.bw
            bigWigToWig graphs/${sample}.bw graphs/${sample}.wig
            igvtools toTDF -z 10 graphs/${sample}.wig graphs/${sample}.tdf genome/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa.fai
        "
    done
done
NOW WAIT

Making coverage plots is always a big compute. It will take a few minutes to run, monitor with batch ‘list-jobs’.

9.3 Use IGV

Create a new notebook.

.

COLORING

*Note we’re coloring WT blue and the other condition “Y” red. Those colors do not mean strand here like they did last week. We don’t actually know the strand in this data.

Copy the following code into the cell of the notebook and hit Shift-Enter.

The code below has the reads/alignments track commented out. As explained above, that’s intentional.

import os
import igv_notebook

from itertools import chain
from pathlib import Path
from urllib.parse import urlparse

igv_notebook.init()

HOMEWORK_DIRECTORY = Path.home() / "hw8"

JUPYTER_SERVER_ROOT = os.readlink(f"/proc/{os.environ['JPY_PARENT_PID']}/cwd")
RELATIVE_HOMEWORK_DIRECTORY = HOMEWORK_DIRECTORY.relative_to(JUPYTER_SERVER_ROOT)

genome = "/" / RELATIVE_HOMEWORK_DIRECTORY / "genome"
sorted = "/" / RELATIVE_HOMEWORK_DIRECTORY / "sorted"
graphs = "/" / RELATIVE_HOMEWORK_DIRECTORY / "graphs"

reference = {
    "id": "BDGP6.46",
    "name": "Drosophila melanogaster (version BDGP6.46, release 110)",
    "fastaURL": f"{genome}/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa",
    "indexURL": f"{genome}/Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.fa.fai",
    "tracks": [
        {
            "name": "Gencode genes",
            "format": "gtf",
            "url": f"{genome}/Drosophila_melanogaster.BDGP6.46.110.chr.gtf.gz",
        }
    ],
}

study = {"W": ["W1", "W2", "W3"], "Y": ["Y1", "Y2", "Y3"]}

browser = igv_notebook.Browser(
    {
        "reference": reference,
        "locus": "3L:18,166,779-18,186,895",
        "tracks": list(
            chain(
                *[
                    (
                        {
                            "name": f"{sample} coverage",
                            "url": f"{graphs}/{sample}.tdf",
                            "format": "tdf",
                            "autoscaleGroup": 1,
                            "color": "blue" if condition == "W" else "red",
                        },
#                        {
#                            "name": f"{sample} reads",
#                            "url": f"{sorted}/{sample}.bam",
#                            "indexURL": f"{sorted}/{sample}.bam.bai",
#                            "format": "bam",
#                            "type": "alignment",
#                            "showCoverage": False,
#                            "color": "blue" if condition == "W" else "red",
#                        },
                    )
                    for condition, samples in study.items()
                    for sample in samples
                ]
            )
        ),
    }
)
EXERCISES 17 and 18

(17) You’re going to surf to a DE gene and take a screen shot. Everybody’s going to do a different gene. Find the spreadsheet we saved in the differential directory “results.csv”. Let K=500+N+M where the first letter of your first name is the Nth letter of the alphabet and the first letter of your last name is the Mth letter of the alphabet. Get the geneID for the Kth gene on the list differential/results.csv (be careful not to count the header line). What is the q-value?

(18) To find it in IGV we need to lookup its name (it doesn’t work for gene ID’s for some reason). To do that go to the genome directory and grep the .gtf file for that gene ID (and pipe to “head -1” to get the “gene” record). You’ll find the gene name there. Copy the gene name and enter it in the “Locus search” box in IGV and hit enter. Take a screen shot showing all six tracks. Does the gene look differential visually?

10 Enrichment analysis

We’re going to end with Pathway analysis as is typical.

10.1 Load the data

  1. Go to STRING (right click to open in new tab)
  2. Click SEARCH
  3. Select Multple proteins from the menu on the left
  4. Get the gene IDs from the file you saved in differential.
  5. Select the approprate organism
  6. Click SEARCH
  7. Click CONTINUE (possibly click CONTINUE once more if the Large Network Warning appears)
  8. You will now see PPI (protein-protein interaction) network

10.2 Enrichment analysis

  1. Scroll down and click on the Analysis tab
  2. At the bottom is a standard pathway enrichment analysis. It’s divided into three natural categories of gene sets: Process, Function and Component.
  3. Sort them by False Discovery Rate

(19) What is the most siginificant gene set in each of those three categories.

(20) Select those three categories with the mouse to highlight those genes in the protein interaction network. Use the “Export” button to export the PPI and include it in your write-up.