RPKM (Reads Per Kilobase Per Million) is a normalization method widely used in RNA-seq analysis. The calculation of RPKM involves the input of transcript length, which often is the length of a 'virtual' transcript annotated in a gene model. With a chosen gene model (e.g. ENSEMBL, or UCSC), transcript lengths are considered fixed, i.e. independent of sample data. In reality, it is well known that, for a specific transcript, its length can vary between different samples due to events such as exon skipping, alternative transcription start sites and 3' alternative polyadenylation. For example, it has been reported that cancer cell lines often expressed mRNA isoforms with shorter 3′ untranslated regions (UTRs) [Mayr 2009]. In the latest Array Studio, Omicsoft has introduced a new option - automatically trimming UTRs, a data-driven method for dynamically redefining transcript lengths. Briefly, for each gene, we first calculate the coverages of the gene. We then identify the left and right boundaries, which are defined as the positions that have 8% of the maximum coverage of the left/right gene UTR regions. The two boundary positions are the same for ALL transcripts. Then for each transcript, we calculate the trimmed exon length for each end – those two trimming positions have to fall in the UTR region AND the first/last exon of that specific transcript. Here, we use examples to demonstrate that applying this UTR trimming option can improve the accuracy of RPKM calculation as well as the transcript level quantification.
We first examine the expression of the gene UBA52, using a TCGA sample TCGA-24-1846-01A-01R-1567-13 as an example. With the UCSC gene model, the RPKM values for the transcripts uc002njr and uc002njs are 119.6 and 194.5 respectively. After applying the new UTR trimming option, the RPKMs for the two transcripts become 610.5 and 1120, a five-time increase for both transcripts. Such dramatic changes in RPKMs are largely due to the fact that only a small proportion of annotated 3' long UTR regions are actually expressed, which can be clearly seen by read coverages (see Figure below). The stripping of both 5' and 3' unexpressed UTR regions leads to more accurate transcript lengths than the original, which significant improves the RPKM calculation.
Besides improving the RPKM calculation, UTR trimming can also benefit quantification by RSEM, a de facto algorithm utilized by both TCGA and Omicsoft Oncoland for transcript-level quantification. Here we take RPL28 as an example. This gene has 5 transcript isoforms, and the predominant form in the same TCGA sample above is listed as uc010yga. But if we examine the read coverage for this particular gene, clearly there is an issue - the transcript uc010yga has no or minimal coverage in the last exon (exon 5) (see Figure below). Furthermore, the exon junction data does not support uc010yga being the most predominant transcript (data not shown). In contrast, if we apply the UTR trimming option, and the most abundant transcript for this gene changes to uc002qkv, which matches well with the read coverages in the browser. Again, the true UTR region of this gene is significantly shorter than what has been annotated, which leads to bias in RSEM quantification. UTR trimming, however, can remedy the problem and allows the identification of true abundant transcripts.
Another example is from gene IGBP1. The real 5' end UTR region in this TCGA sample is much shorter than the region being annotated in the UCSC gene model. RSEM assigns a majority of reads to the transcript uc004dxw, which is clearly questionable due to the fact that no junction reads are observed between the first and second exon of uc004dxw (see Figure below, including junction reads). With UTR trimming, RSEM correctly identifies the true predominant transcript - uc004dxv, containing greater than 99% of the total reads mapped to this gene.