Datasets

Systematic dataset Test 1 (T1)

Our systematic dataset Test 1 (T1) has 13,000 non-overlapping genes, with the number of splice forms per gene varying from one to five. Each dataset has 50 million paired-end reads and uses mouse genome build mm9 (Mouse Genome Sequencing Consortium, et al. 2002). Dataset T1 was designed to determine upper bounds on the accuracies of the methods by providing data as ideal as possible.
Three different types of alternate splicing were included: (i) exon skipping, (ii) alternate 3’/5’ site (start/end of transcription) and (iii) alternate splice sites at the ends of internal exons. The vast majority of alternate splicing events can be explained by these three categories. The basic forms for dataset T1 were taken from real RefSeq genes (Pruitt KD, et al. 2014) as aligned to the reference genome by the UCSC Genome Browser (Kent WJ, et al. 2002).
The first 1,000 genes were chosen to have one splice form and at least 5 exons each. For each of the three categories of alternate splicing there are an additional 4,000 genes: 1,000 with two forms, 1,000 with three forms, 1,000 with four forms and 1,000 with five forms. We did not mix different types of alternate splicing in the same gene; in other words, for each gene all forms differ by splicing events from just one of the three categories. In total there are 13,000 genes comprising 43,000 splice forms.
The alternate splice forms were generated by randomly removing some exons (exon skipping), by randomly leaving exons off one or both ends of the transcript (truncation), or by randomly altering the start or end splice site of individual exons (mostly by multiples of three, as is typically observed in real data).
The reads for Test 1 were generated with the following parameters: read length=100, fragment length min=200, fragment length max=500, fragment length median=300, basewise error=0%, substitution frequency=0%, indel frequency=0%, intron frequency=0%.
All transcripts were expressed at the same high level, at approximately 40X coverage. We take fragmentation into account, yielding data that differ from strictly uniform coverage by the randomness of the fragmentation process. The results on this dataset should effectively give bounds on the accuracy of the various methods, since no method could do better in reality than it does on perfect data.
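
The gene and splice-form totals for T1 follow from the group sizes given in the description. The short Python sketch below only re-derives that arithmetic; the function and argument names are illustrative and are not part of the simulator.

```python
def t1_totals(single_form_genes=1_000,
              n_categories=3,
              forms_per_group=(2, 3, 4, 5),
              genes_per_group=1_000):
    """Return (total genes, total splice forms) for the T1 design.

    Defaults are the numbers stated in the dataset description:
    1,000 single-form genes, then for each of the three splicing
    categories 1,000 genes each with 2, 3, 4 and 5 forms.
    """
    # Genes: the single-form set plus 4 groups of 1,000 per category.
    genes = single_form_genes + n_categories * len(forms_per_group) * genes_per_group
    # Splice forms: one per single-form gene, plus forms * genes for each group.
    forms = single_form_genes + n_categories * sum(
        n_forms * genes_per_group for n_forms in forms_per_group
    )
    return genes, forms


genes, forms = t1_totals()
print(f"genes={genes:,} splice forms={forms:,}")
```

With the default arguments this reproduces the 13,000 genes and 43,000 splice forms quoted in the description.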

Download:
hayer1_T1.tar.bz2 (hayer1_T1.tar.bz2.md5sum)
File name File size (MB)
hayer1_T1.tar.bz2 18,106.76
simulated_reads_test1.cig 16,518.19
simulator_config_geneinfo_test1 9.42
simulator_config_featurequantifications_test1 35.74
simulated_reads_test1.forward.fa 5,520.71
simulated_reads_test1.reverse.fa 5,198.00
simulated_reads_junctions-crossed_test1.txt 1,234.61
fraglenhisto_test1.txt 0.00
simulated_reads2genes_test1.txt 1,188.07
simulated_reads_test1.log 0.00
true_alignment.bam 1,998.46
true_alignment.bam.bai 2.67
tophat_without_gene_models.bam.bai 3.49
tophat_with_gene_models.bam 1,941.81
tophat_without_gene_models.bam 1,893.04
tophat_with_gene_models.bam.bai 2.88
star_without_gene_models.bam 2,031.73
star_without_gene_models.bam.bai 3.05
star_with_gene_models.bam 2,030.98
star_with_gene_models.bam.bai 3.05
gene_models.gtf
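
Each archive is listed together with a .md5sum file, so a download can be checked before unpacking. The following is a minimal Python sketch of such a check. It assumes the .md5sum file uses the usual `md5sum` layout (hex digest first, then the file name), which is not confirmed by this page; the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(archive_path, md5sum_path):
    """Compare an archive against its .md5sum file.

    Assumption: the first whitespace-separated token of the
    .md5sum file is the expected hex digest.
    """
    expected = Path(md5sum_path).read_text().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

For example, `verify_md5("hayer1_T1.tar.bz2", "hayer1_T1.tar.bz2.md5sum")` should return `True` for an intact download, provided both files are in the current directory.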

ENSEMBL Perfect (EP) and ENSEMBL Realistic (ER)

Two additional simulated datasets were generated to assess the effect of polymorphisms and sequencing artifacts. Dataset EP (ENSEMBL Perfect) was generated with the same parameters as T1, except that the full set of 93,778 unmodified ENSEMBL transcript models was used. Dataset ER (ENSEMBL Realistic) was generated with the same parameters except for the following changes, chosen to mimic real data: basewise error=0.5%, substitution frequency=0.1%, indel frequency=0.05%, intron frequency=30%. These are fairly low polymorphism rates, such as would be expected when comparing human sequencing data to the human reference genome. In reality the polymorphism rates in other organisms will be much higher. We chose these parameters in order to attain reasonable lower bounds on algorithm accuracy in practice, given that human/human comparisons are currently among the least polymorphic of all vertebrates. An intron frequency of 30% may seem high at first, but it is typical. Although intronic reads make up nearly a third of the total, introns are so much larger than exons that the majority of intron signal results in very low coverage. Two-thirds of all transcripts are expressed above zero, with the levels of expressed transcripts given by an exponential distribution. Short genes will be underrepresented because of the fragment length distribution, so we removed all genes under 200 bases.
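
To make the differences between the three simulated datasets easy to see, the parameter values quoted in the text can be laid out side by side. The Python sketch below is only a restatement of those values. The dictionary keys are illustrative names and do not correspond to actual simulator options, and percentages are stored as plain numbers.

```python
# Parameters stated for T1 (percent values stored as floats).
# Key names are hypothetical labels, not simulator flags.
T1_PARAMS = {
    "read_length": 100,
    "fragment_length_min": 200,
    "fragment_length_max": 500,
    "fragment_length_median": 300,
    "basewise_error_pct": 0.0,
    "substitution_frequency_pct": 0.0,
    "indel_frequency_pct": 0.0,
    "intron_frequency_pct": 0.0,
}

# EP uses the same read parameters as T1; only the transcript set differs.
EP_PARAMS = dict(T1_PARAMS)

# ER changes four parameters to mimic real data.
ER_OVERRIDES = {
    "basewise_error_pct": 0.5,
    "substitution_frequency_pct": 0.1,
    "indel_frequency_pct": 0.05,
    "intron_frequency_pct": 30.0,
}
ER_PARAMS = {**T1_PARAMS, **ER_OVERRIDES}


def changed_parameters(base, other):
    """Return {name: (base value, other value)} for parameters that differ."""
    return {key: (base[key], other[key]) for key in base if base[key] != other[key]}


for name, (t1_value, er_value) in changed_parameters(T1_PARAMS, ER_PARAMS).items():
    print(f"{name}: {t1_value} -> {er_value}")
```

Read length and the fragment length settings are identical across T1, EP and ER, so only the four error and intron parameters appear in the comparison.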

Download:
hayer1_EP.tar.bz2 (hayer1_EP.tar.bz2.md5sum)
File name File size (MB)
hayer1_EP.tar.bz2 17,870.41
simulated_reads_EP.cig 16,487.73
simulated_reads_EP.forward.fa 5,520.71
simulated_reads_EP.reverse.fa 5,520.71
simulator_config_featurequantifications_ensembl-mm9_EP 53.52
simulated_reads_EP.log 0.00
true_alignment.bam 1,966.32
simulator_config_geneinfo_ensembl-mm9 14.96
fraglenhisto_EP.txt 0.00
simulated_reads_junctions-crossed_EP.bed 13.87
simulated_reads2genes_EP.txt 1,126.62
true_alignment.bam.bai 3.00
tophat_without_gene_models.bam 2,047.93
tophat_without_gene_models.bam.bai 4.15
tophat_with_gene_models.bam 1,926.33
tophat_with_gene_models.bam.bai 3.12
star_without_gene_models.bam 1,992.92
star_without_gene_models.bam.bai 3.21
star_with_gene_models.bam 1,992.92
star_with_gene_models.bam.bai 3.21
hayer1_ER.tar.bz2 (hayer1_ER.tar.bz2.md5sum)
File name File size (MB)
hayer1_ER.tar.bz2 23,674.97
simulated_reads_ER.cig 16,148.54
fraglenhisto_ER.txt 0.00
simulated_reads_ER.log 0.00
simulated_reads_indels_ER.txt 1.81
simulated_reads_substitutions_ER.txt 4.18
simulator_config_geneinfo_ensembl-mm9 14.96
simulated_reads2genes_ER.txt 1,125.66
simulated_reads_ER.forward.fa 5,520.71
simulated_reads_ER.reverse.fa 5,520.71
simulated_reads_junctions-crossed_ER.bed 13.78
simulator_config_featurequantifications_ensembl-mm9_ED 53.51
true_alignment.bam 2,754.96
true_alignment.bam.bai 4.00
tophat_without_gene_models.bam 3,561.19
tophat_without_gene_models.bam.bai 5.78
tophat_with_gene_models.bam 3,228.77
tophat_with_gene_models.bam.bai 5.45
star_without_gene_models.bam 3,138.41
star_without_gene_models.bam.bai 5.17
star_with_gene_models.bam 3,138.41
star_with_gene_models.bam.bai 5.17
gene_models.gtf

In vitro transcription dataset (IVT)

Finally, we used data from in vitro transcription of 1,062 human cDNAs (IVT), sequenced by both poly(A)-seq and total RNA-seq (Lahens et al. 2014). Because these are cDNAs, we know the exact nucleotide sequence, including all the exon-exon junctions. Of these genes, 50 have multiple splice forms. These data provide an evaluation of each algorithm on real data.

Download:
hayer1_IVT.tar.bz2 (hayer1_IVT.tar.bz2.md5sum)
File name File size (MB)
hayer1_IVT.tar.bz2 78,390.47
IVT_polyA_star_with_gene_models.bam 198.87
IVT_polyA_star_with_gene_models.bam.bai 3,312.47
IVT_polyA_star_without_gene_models.bam 1,095.67
IVT_polyA_star_without_gene_models.bam.bai 0.00
IVT_polyA_tophat_with_gene_models.bam 1,095.67
IVT_polyA_tophat_with_gene_models.bam.bai 3,792.72
IVT_polyA_tophat_without_gene_models.bam 0.69
IVT_polyA_tophat_without_gene_models.bam.bai 0.19
IVT_ribo_star_with_gene_models.bam 309.98
IVT_ribo_star_with_gene_models.bam.bai 1.48
IVT_ribo_star_without_gene_models.bam 337.19
IVT_ribo_star_without_gene_models.bam.bai 1.65
IVT_ribo_tophat_with_gene_models.bam 324.23
IVT_ribo_tophat_with_gene_models.bam.bai 339.59
IVT_ribo_tophat_without_gene_models.bam 1.51
IVT_ribo_tophat_without_gene_models.bam.bai 339.52
simulated_reads_spikeins.cig 1.51
simulated_reads_spikeins.forward.fa 10,154.08
simulated_reads_spikeins.log 1.66
simulated_reads_spikeins.reverse.fa 11,641.36
simulated_reads_spikeins.sam 2.39
simulated_reads2genes_spikeins.txt 7,053.00
simulator_config_featurequantifications_spikeins_OLD 1.68
simulator_config_geneinfo_spikeins 8,167.85
star_with_gene_models.bam 2.19
star_with_gene_models.bam.bai 10,241.58
star_without_gene_models.bam 1.89
star_without_gene_models.bam.bai 11,637.21
tophat_with_gene_models.bam 2.38
tophat_without_gene_models.bam 7,093.60
tophat_without_gene_models.bam.bai 1.85
true_alignment.bam 8,164.35
true_alignment.bam.bai 2.19
gene_models.gtf