BEERS2: RNA-Seq simulation through high fidelity in silico modeling
Software
BEERS2 is available for use under the GPLv3 open-source license at github.com/itmat/beers2. Please see the GitHub README for an installation guide and basic usage. The BEERS2 API is documented at ReadTheDocs.
Simulated Read Data
Simulated RNA-seq reads for example benchmarks were generated using CAMPAREE primed from real mouse liver RNA-seq data available at GSE98562. BEERS2 was run on the CAMPAREE samples under seven sequencing conditions that differ in their bias parameters.
Output is provided both as FASTQ files (gzipped tar files of all 8 samples) and as individual BAM files for each sample, with alignments giving the true origin of each simulated read in the reference genome (GRCm38 / mm10). Read IDs include the ENSEMBL transcript ID of the original transcript, preceded by the sample number and followed by an allele number (either 1 or 2). The configuration files used to generate these data are also provided.
Simulated Samples
BEERS2 was run on output from CAMPAREE simulating 8 mouse liver samples. The output files from CAMPAREE are available for all 8 samples here. These include a VCF file with the phased variants of all 8 samples, which appear as fixed variants in the simulated reads in the BEERS2 output. They also contain expression rates of all genes and transcripts, along with allelic balance files. Note that these are the 'true' expression rates of the samples rather than the true transcript or read counts of the sequenced reads from those samples. The expression rates differ from the true transcript counts in the sequenced aliquots by sampling noise.
Quantifications
There are two 'true' quantification values that are useful for benchmarking. The first is the set of input transcript counts from the CAMPAREE simulation. These give the actual abundances of each transcript in the simulated RNA sample. As such, they are the best benchmark for transcripts per million (TPM) values. Compared to the CAMPAREE expression rates described above, these counts include the random sampling noise of which specific transcripts happen to be present in the sequenced sample.
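As an illustration of how transcript counts relate to TPM, the following minimal sketch rescales per-transcript molecule counts so that they sum to one million. Because the inputs are counts of transcript molecules rather than reads, no length normalization is applied. The function name `counts_to_tpm`, the example transcript IDs, and the example counts are all hypothetical, and this is not necessarily how the provided TPM file was produced.

```python
def counts_to_tpm(counts):
    """Rescale per-transcript molecule counts so that they sum to one million.

    `counts` maps a transcript ID to the number of molecules of that
    transcript in one sample (a hypothetical input layout).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute TPM from an empty sample")
    return {tx: n * 1_000_000 / total for tx, n in counts.items()}


# Made-up counts for three mouse transcripts in a single sample.
example_counts = {
    "ENSMUST00000000001": 150,
    "ENSMUST00000000003": 50,
    "ENSMUST00000000028": 300,
}
example_tpm = counts_to_tpm(example_counts)

for transcript_id, tpm in example_tpm.items():
    print(transcript_id, tpm)
```

A quantifier's TPM estimates can then be compared against values computed this way, transcript by transcript and sample by sample.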
The second type of quantification value is the number of sequenced reads that originated from each transcript. This represents the ideal read count per transcript under perfect alignment or mapping and perfect disambiguation of multimapping reads. These counts could be derived from the simulated BAM files by inspecting the read IDs, which contain the transcript IDs. However, we provide them here for convenience. These tables aggregate the values from all samples and BEERS2 runs into a single Parquet file, which can be read using the Python package pandas or the R package arrow.
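For readers who prefer to derive the counts themselves, the sketch below tallies reads per sample and transcript from a list of read IDs. The exact read ID layout is not specified on this page, so the regular expression is an assumption: it only relies on a sample number appearing before an ENSEMBL mouse transcript ID (`ENSMUST...`) and an allele number (1 or 2) appearing after it, with arbitrary separators. The example IDs and the helper names `parse_read_id` and `count_reads_per_transcript` are hypothetical and should be adapted to the actual IDs in the BAM files.

```python
import re
from collections import Counter

# ASSUMED layout: <sample number><sep><ENSMUST transcript ID><sep><allele 1|2>...
# Check this against real read IDs before relying on it.
READ_ID_PATTERN = re.compile(
    r"(?P<sample>\d+)[^0-9A-Za-z]+"
    r"(?P<transcript>ENSMUST\d+)(?:\.\d+)?[^0-9A-Za-z]+"
    r"(?P<allele>[12])(?!\d)"
)


def parse_read_id(read_id):
    """Return (sample, transcript_id, allele), or None if the ID does not match."""
    match = READ_ID_PATTERN.search(read_id)
    if match is None:
        return None
    return int(match["sample"]), match["transcript"], int(match["allele"])


def count_reads_per_transcript(read_ids):
    """Count reads per (sample, transcript), summing over both alleles."""
    counts = Counter()
    for read_id in read_ids:
        parsed = parse_read_id(read_id)
        if parsed is not None:
            sample, transcript, _allele = parsed
            counts[(sample, transcript)] += 1
    return counts


# Made-up read IDs following the assumed layout.
example_ids = [
    "1_ENSMUST00000000001_1_frag1",
    "1_ENSMUST00000000001_2_frag2",
    "1_ENSMUST00000000003_1_frag3",
    "2_ENSMUST00000000001_1_frag4",
]
example_read_counts = count_reads_per_transcript(example_ids)
print(example_read_counts)
```

Keeping the allele in the key instead of discarding it would give a per-allele breakdown. If fragments rather than individual reads are to be counted, paired-end mates sharing an ID would need to be collapsed first.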
We also provide the full quantification files, which break down quantification by each sample's specific allele. Most users will prefer to use the simplified quantification files (top row).
Input transcript counts | Output sequenced fragments |
---|---|
TPM | Read Counts |
full transcript quants | full read counts |