Overview

Omicsoft is the leading provider of Next Generation Sequencing, Cancer Genomics, Immunology, and Bioinformatics solutions for Next Generation Sequencing Data and Gene Expression Analysis.

Exciting Updates and Latest News

Keeping you up-to-date with the latest in NGS, Bioinformatics Analysis, and cancer genomics with blogs on Array Suite, OncoLand (TCGA and more), ImmunoLand, and more.


[Array Studio Analysis] Getting Started with RT-PCR Analysis

Vivian Zhang

Although RNA-seq has become an invaluable tool for studying gene expression, RT-PCR (reverse transcription-polymerase chain reaction) is still the most sensitive method and is widely used for small-scale mRNA expression studies and for validating RNA-seq results. This article introduces RT-PCR analysis in Array Studio. For more details, please check out our tutorial series, Getting Started with RT-PCR Analysis, which has step-by-step video clips to help you quickly become an RT-PCR analysis expert.

In this tutorial, we provide tutorial video clips on:

 

1. Importing RT-PCR data

Array Studio allows users to import Ct or abundance data from text files or Excel spreadsheets. The Import RT-PCR Wizard simplifies data import and normalization.

Array Studio can process data in either "Tall-skinny" or Matrix format.
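To make the two layouts concrete, here is a minimal Python sketch (not Array Studio code) that pivots tall-skinny records into a matrix. The function name, the record layout and the Ct values are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

def tall_to_matrix(rows):
    """Pivot (sample, assay, ct) records into {assay: {sample: ct}}."""
    matrix = defaultdict(dict)
    for sample, assay, ct in rows:
        matrix[assay][sample] = ct
    return {assay: dict(by_sample) for assay, by_sample in matrix.items()}

# Hypothetical tall-skinny input: one row per measurement.
tall = [
    ("S1", "GAPDH", 18.2),
    ("S1", "TP53", 24.9),
    ("S2", "GAPDH", 18.0),
    ("S2", "TP53", 26.1),
]

# Matrix form: one row per assay, one column per sample.
matrix = tall_to_matrix(tall)
for assay, by_sample in matrix.items():
    print(assay, by_sample)
```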

With your data ready to import, Import RT-PCR Wizard offers step-by-step instructions on:

  • Choosing the correct input format
  • Selecting the annotation and data columns 
  • Previewing raw data for missing values 
  • Attaching Annotation and Design metadata 
  • Combining or removing technical replicates
  • Specifying default values for missing data 
  • Transforming Ct data to delta-Ct
  • Normalizing data 
  • Previewing data 

After importing through the RT-PCR Wizard, three data tables are generated: a data table, annotation table and design table. They are standard data formats in Array Studio.
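The replicate-combining and delta-Ct steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: technical replicates are combined by averaging, delta-Ct is the target Ct minus the Ct of a single reference assay, and the assay names and Ct values are made up. The wizard's own options may differ.

```python
from statistics import mean

def delta_ct(ct_by_assay, reference):
    """Return target Ct minus reference Ct for every non-reference assay.

    ct_by_assay maps assay name -> list of technical-replicate Ct values
    for one sample. Replicates are averaged before the subtraction.
    """
    ref = mean(ct_by_assay[reference])
    return {
        assay: mean(values) - ref
        for assay, values in ct_by_assay.items()
        if assay != reference
    }

# Hypothetical Ct values for one sample, two technical replicates per assay.
sample = {"GAPDH": [18.0, 18.2], "TP53": [25.0, 25.2], "CDH1": [22.1, 22.1]}
result = delta_ct(sample, reference="GAPDH")
for assay, value in result.items():
    print(assay, round(value, 2))
```

A lower delta-Ct means the target is more abundant relative to the reference assay.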

 

2. Downstream Analysis

2.1 Visualizing RT-PCR Data- Adding Views to RT-PCR data

Once the data is in an Array Studio project, a variety of functions are available for downstream analysis. To start with, data visualization provides a good overview of the data. Array Studio has up to 40 different views available for your RT-PCR data. Here are a few commonly used views:

 

2.2 Data Processing-QC and Excluding/Subsetting Data

Sometimes a single assay or sample fails and should be removed from downstream analysis to allow more accurate detection of real differences among groups. These failed experiments can be detected and easily removed from an Array Studio data object. For example, we can use Principal Component Analysis (PCA) to detect and remove outlier samples. To further subset data, we can use hierarchical clustering.

3D PCA plot and hierarchical clustering heatmap. In the upper PCA plot, each dot represents a sample. In the bottom heatmap, data is clearly separated by source tissue but not so much by group.

 

 

2.3 Statistical Inference-Two-Way ANOVA of RT-PCR Data

Array Studio has several statistical inference modules for identifying statistically significant differences between groups, for example ANOVA and the general linear model. Here is an example of a Two-Way ANOVA.

Two-Way ANOVA analysis results using source tissue and group as factors. This analysis generates one volcano plot for each test, in addition to the report table. The volcano plot is interactive -- selecting a subset of samples in one plot automatically selects the corresponding samples in another plot. In this example, the three selected genes CDH1, PFN2 and NOTCH2 that are affected in Breast are not similarly affected in Lymphoid, with NOTCH2 affected in the opposite direction.

 

2.4 Omic Data Analysis-Integration of RT-PCR and RNA-seq/Microarray Data

RT-PCR data can be compared to other gene- or transcript-level data, such as RNA-Seq or microarray data, using Microarray-Microarray Integration. Careful matching of identifiers between the two datasets is important to ensure that the same genes or transcripts are being compared.

Variable view of RT-PCR and RNA-Seq data integration for gene TFPI as an example.

 

Please check out our tutorial series, Getting Started with RT-PCR Analysis, to learn how to perform the above RT-PCR analyses.

[OncoLand Case Study] Empower OncoLand with Array Studio Analysis: Visualize "mutation burden" in each tumor in TCGALand

Vivian Zhang

One of the common goals in cancer research is identification of genes or samples with mutations that occur during tumor development. The number of identified mutations in cancer samples can vary wildly, but some tumors tend to aggregate widespread alterations. This Nature paper about the mutation landscape and significance across 12 major cancer types (as part of the TCGA Pan-Cancer effort) is a good example. In the very first figure, the authors investigated the mutation frequencies of six transition (Ti) and transversion (Tv) categories for each cancer type:

Figure 1: Mutation frequencies, spectra and contexts across 12 cancer types. Kandoth, Cyriac, et al. "Mutational landscape and significance across 12 major cancer types." Nature 502.7471 (2013): 333-339.

In another recent paper in Nature Communications, Whole-genome mutational burden analysis of three pluripotency induction methods, the authors examined mutational subtypes in each sample:

Figure 2: Characterization of variants caused by reprogramming method. Bhutani, Kunal, et al. "Whole-genome mutational burden analysis of three pluripotency induction methods." Nature communications 7 (2016). 

Using OncoLand, you can easily calculate and visualize total mutation burden of every sample or tumor type. Check out this OncoLand case study: Visualize "mutation burden" of each tumor in TCGALand

1. Calculate total mutation burden of every sample in TCGALand.

To calculate the total number of mutations per tumor sample (mutation burden), use Summarize Sample Mutation Count under the Analytics tab in Land. When individual nucleotide changes are specified, for example "A->C", the result reports the number of mutations of each type (within the selected GeneSet) found in each sample (from the selected SampleSet).

You can further summarize the data by downloading this TotalMutationBurdenByNTchange table to Array Studio's local analysis. For example, adding a variable view to better visualize mutation burden across samples:

Mutation burden variable view. Y-axis represents mutation number. X-axis represents different samples.
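Conceptually, the per-sample counting behind this table can be sketched as follows. This is a minimal Python illustration, not the Land implementation; the input tuples and sample IDs are hypothetical.

```python
from collections import Counter, defaultdict

def mutation_burden(mutations):
    """Count mutations per sample, split by nucleotide change.

    mutations is an iterable of (sample_id, ref_base, alt_base) tuples.
    Returns {sample_id: Counter({"A->C": n, ...})}.
    """
    burden = defaultdict(Counter)
    for sample_id, ref, alt in mutations:
        burden[sample_id][f"{ref}->{alt}"] += 1
    return burden

# Hypothetical somatic mutation calls.
calls = [
    ("T1", "A", "C"), ("T1", "C", "T"), ("T1", "C", "T"),
    ("T2", "G", "A"),
]
burden = mutation_burden(calls)

# Total mutation burden per sample is the sum over all nucleotide changes.
total_per_sample = {s: sum(c.values()) for s, c in burden.items()}
print(total_per_sample)
```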

2. Calculate average mutation burden in each tumor in TCGALand

Using local analysis functions, you can further explore mutation burden in each tumor type in Land data. The Summarize function allows users to calculate the mean mutation number grouped by tumor type, or by other preferred grouping options.

After stacking the table, you can plot another variable view to visualize the distribution of each type of nucleotide change in each tumor type:

Stack table by row to generate variable view

In this way, we can easily tell which tumor type has the highest mutation burden.
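The grouping step can be pictured with a short Python sketch. It assumes a per-sample mutation count and a sample-to-tumor-type lookup; the tumor codes are used only as labels and the counts are invented, not TCGA results.

```python
from collections import defaultdict
from statistics import mean

def mean_burden_by_tumor(sample_burden, tumor_type):
    """Average the per-sample mutation counts within each tumor type."""
    groups = defaultdict(list)
    for sample_id, count in sample_burden.items():
        groups[tumor_type[sample_id]].append(count)
    return {tumor: mean(counts) for tumor, counts in groups.items()}

# Hypothetical per-sample burdens and tumor-type annotation.
sample_burden = {"T1": 120, "T2": 80, "T3": 15, "T4": 25}
tumor_type = {"T1": "LUAD", "T2": "LUAD", "T3": "THCA", "T4": "THCA"}

means = mean_burden_by_tumor(sample_burden, tumor_type)
highest = max(means, key=means.get)
print(means, highest)
```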

To learn exactly how to perform the above analysis, please watch our OncoLand case study: Visualize "mutation burden" of each tumor in TCGALand.

[OncoLand Case Study] Summarize per-sample and per-tumor mutations across multiple genes

Vivian Zhang

Summarizing mutation frequencies within a protein complex, among members of a pathway, or even across the genome can give insights into differences between tumors. Combining the power of OncoLand and Array Studio functions, you can explore mutation frequencies. As an example, let's look at the Swi/Snf complex, which regulates chromatin remodeling.

The Swi/Snf complex is a multi-subunit, ATP-dependent chromatin-remodeling complex. Early studies have suggested that the Swi/Snf complex plays a role in cancer development, with its members likely acting as tumor suppressors (Nature Reviews Cancer article: The SWI/SNF complex — chromatin and cancer). Mutations in the members of this complex have been linked to various cancers. You can leverage OncoLand to query samples containing those mutations. Please check out the detailed OncoLand case study video tutorials.

 

Identify samples with mutations in the Swi/Snf complex

To find out how often the genes from the Swi/Snf complex are mutated in tumors, you can use Summarize Sample Mutation Count to generate a SampleSet through Analytics tab and use this SampleSet for downstream analysis:

SampleSet results from Summarize Sample Mutation Count analysis, obtained by inputting all gene names from the Swi/Snf complex as the GeneSet and grouping by Tumor Type. The mutation count is sorted by the number of mutations in each sample.

 

Visualize differences in Swi/Snf complex mutations using TCGALand Views

There are multiple ways to visualize differences in mutation frequency in Swi/Snf. Even without Land views, we can achieve this goal in Array Studio. Array Studio lets users perform hundreds of different types of analysis with flexibility, and can save biologists from waiting weeks for a bioinformatician to return results. With OncoLand, however, we can visualize the mutation frequency in minutes. The following analysis pipeline demonstrates the difference between using Array Studio and using OncoLand.

OncoLand makes cancer genomics research easy. Again, please check out our case study video tutorials for more details.

[OncoLand Case Study] Find genes that are frequently co-mutated with your gene-of-interest: Co-mutation of TP53 and ATRX when IDH1-R132 is mutated

Vivian Zhang

The IDH1 gene encodes isocitrate dehydrogenase, which is involved in NADPH production, especially in the brain. Mutations in IDH1 are frequently found in low-grade (grade II) and anaplastic (grade III) gliomas and in glioblastoma (GBM, grade IV) (Research Article: IDH1 and IDH2 Mutations in Gliomas). These mutations play an important role in gliomagenesis and are therefore of clinical interest. We can query OncoLand to learn about IDH1 mutations and about other genes that are frequently co-mutated. For details, please refer to our OncoLand case study wiki:

Identify mutation hotspots in a gene of interest

In several cancers, IDH1 is frequently mutated at arginine 132, which alters the enzyme's active site. We can visualize the frequencies of mutations at different sites in each tumor. As we can see, our data confirms that IDH1 arginine 132 is frequently mutated in low grade gliomas (LGG) and glioblastoma (GBM):

TCGALand DNA-Seq Somatic Mutation Site Distribution View. 

The user can create a SampleSet, for example IDH1_mutation (shown below), from Analytics | Generate Sample Set | Generate Site Mutation Status SampleSet.

SampleSet: IDH1_mutation

Identify other genes that are co-mutated with your gene of interest

With the SampleSet, we can identify correlated gene mutations through Analytics | Integration Analysis | Sample Grouping to Mutation. The test may take a few minutes if all genes are queried, and the results will be available from the Analytics | Open Result Set menu. From the results table, we can rank genes by the PValue from the Fisher Exact Test to identify the correlated genes, for instance ATRX and TP53 in LGG and GBM:

Analytics | Integration Analysis | Sample Grouping to Mutation Test results. Rank by PValue, filter by only co-occurring gene in LGG and GBM.
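For readers who want to see what a Fisher exact test on co-mutation does, here is a minimal, standard-library Python sketch of the two-sided test on a 2x2 table. It is a generic textbook implementation, not OncoLand's code, and the counts are hypothetical. The table cells are: samples mutated in both genes (a), in the gene of interest only (b), in the other gene only (c), and in neither (d).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )

# Hypothetical counts: 8 samples with both mutations, 2 with only the gene of
# interest mutated, 1 with only the other gene mutated, 5 with neither.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 4))
```

A small p-value suggests the two mutations co-occur (or exclude each other) more often than expected by chance. When many genes are tested at once, the p-values are best used for ranking, as described above.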

Visualize Co-mutation patterns with the Alteration Omicprint

There are several ways to visualize co-mutation frequencies of multiple genes. While the "Alteration Distribution" displays the number of samples mutated in any gene of the GeneSet, "Somatic Co-mutation Frequencies" displays the distribution of samples with different mutation loads. The "Alteration Omicprint" efficiently displays per-sample mutation status of one, ten, or even hundreds of genes. You can also generate custom Omicprints based on custom queries of mutation status. Please check out our case study tutorial videos to learn how to perform the analysis.

Alteration Omicprint displays gene alteration status for multiple genes for corresponding samples. Custom queries for IDH1 and TP53 somatic mutation status and for BMP2 RNA-Seq FPKM are created. Next, check out the Custom Query Omicprint view. For each custom query, sample status is displayed. As we can see, samples with mutated IDH1 and TP53 frequently over-express BMP2 in GBM.

[Feature Review] Comprehensive Quality Control of Next Generation Sequencing Data

Vivian Zhang

Next-generation sequencing (NGS) technology is revolutionizing genomic research and has become one of the most commonly used methods in genomic and even clinical research. With increased data output capacity and dramatically lower costs, researchers are producing trillions of base pairs (terabases) of data every day. With such large amounts of data, quality control is critical to ensure the reliability of the results. Omicsoft's NGS analytics provides comprehensive functions for QC of both raw and aligned NGS data, for DNA-Seq (Exome-Seq, WGS, and targeted sequencing) as well as RNA-Seq.

[NGS RAW DATA QC]

In Array Studio, the NGS Raw Data QC Wizard is an easy-to-use choice to run multiple QC commands simultaneously. The Raw Data QC Wizard provides options including Basic statistics, Base Distribution, Quality BoxPlot, K-Mer Analysis and Sequence Duplication.  

The Basic Statistics module generates some simple composition statistics for the files analyzed, such as sequence length and GC content. The NGS Base Distribution module can be used to check for uniformity between the different bases, as one would expect to see about equal distribution of the four bases across the length of the read. The Quality BoxPlot module is used to look at the quality score for each base position in a file (aggregated over all reads from that file). It gives the user an idea of where the quality score starts to drop off for each file. The "K-Mer Analysis (K=5)" module counts the enrichment of every 5-mer within the sequence library. It calculates an expected count for each k-mer based on the base content of the library as a whole and compares it with the number of times the k-mer actually appears, giving an observed/expected ratio. This can help find over-represented sequences which are not aligned in the data.
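The observed/expected idea behind the k-mer module can be sketched in Python as follows. This is a simplified illustration, not the Array Studio algorithm: it assumes the expected count of a k-mer is the total number of k-mers multiplied by the product of the library-wide frequencies of its bases, and the reads are made up.

```python
from collections import Counter
from math import prod

def kmer_enrichment(reads, k=5):
    """Return {kmer: observed/expected} using library-wide base frequencies."""
    base_counts = Counter(base for read in reads for base in read)
    total_bases = sum(base_counts.values())
    freq = {b: n / total_bases for b, n in base_counts.items()}

    # Count every k-mer at every position of every read.
    observed = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total_kmers = sum(observed.values())

    # Expected count assumes bases are drawn independently.
    return {
        kmer: count / (total_kmers * prod(freq[b] for b in kmer))
        for kmer, count in observed.items()
    }

# Hypothetical reads; a real library would contain millions.
reads = ["ACGTACGTAC", "TTTTTACGTA", "GGGCCACGTA"]
ratios = kmer_enrichment(reads)
top = max(ratios, key=ratios.get)
print(top, round(ratios[top], 2))
```

Ratios well above 1 flag k-mers that occur more often than the base composition alone would predict.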

[SEQUENCING ALIGNMENT AND ALIGNED DATA QC]

After raw data QC, users can move on to the next step of their NGS analysis with more confidence in the results. They can use the Omicsoft Sequence Aligner (OSA) to align the data to the genome of choice. OSA is a fast and accurate alignment tool for NGS data. It is the base aligner for RNA-Seq, DNA-Seq, and miRNA-Seq data in FusionMap, Oshell, and Array Suite (ArrayStudio and ArrayServer).

Figure: Percentage of alignment reads that match to 10 million 100bp paired ends simulation data with 0%, 0.5% (default), 1% and 2% error rates. Gene model provided (left) and not provided (right).

Figure: Alignment job run time of 10 million 100bp paired ends simulation data with 0%, 0.5% (default), 1% and 2% error rates. Gene model provided

 

However, even with an accurate aligner like OSA, it is important to examine the quality of the aligned data. Omicsoft provides comprehensive DNA-Seq QC Metrics and RNA-Seq QC Metrics. These include alignment metrics, coverage metrics, duplication metrics, insert size metrics, flag metrics, profile metrics and more. More than 100 metrics in total ensure that the aligned data is fully examined and ready for downstream analysis. An example list of metrics for RNA-Seq data can be found here: Aligned data QC.

 

 

 

 

[IMMUNOLOGICAL RESEARCH] Research an Immunological Genome Study in ImmunoLand

Vivian Zhang

Traditionally immunology studies are focused on a particular protein or pathway. However, immunological activity is a system-level response, which is well suited for large-scale integrative approaches and requires an overall perspective on the immune system(s). With advanced technologies enabling large-scale, genome-level approaches, immunology studies are embracing the era of immunogenomics (Related Readings: Beyond the transcriptome: completion of act one of the Immunological Genome Project. ).

ImmunoLand is Omicsoft's most recently developed Land database. It is an immune-related genomics database and visualization software that helps users explore public and private immune-focused genomics datasets. In ImmunoLand, researchers can search a gene, multiple genes, a pathway, a project or multiple projects across more than 22,000 samples from public projects, including GEO (Gene Expression Omnibus), SRA (Sequence Read Archive), ArrayExpress, dbGAP (The Database of Genotypes and Phenotypes), and other large data repositories like BluePrint, GTEx, and ImmGen (The Immunological Genome Project). 

Here is how:

Immunological genomics studies are currently conducted on many different diseases, immune cells, activation responses, treatments, tissues, states of cell differentiation and so forth. In ImmunoLand, each study in the database is carefully reviewed by Omicsoft’s curators: metadata is cleaned up, outliers are removed, and statistically driven comparisons are generated for each study. ImmunoLand allows users to search across projects or to search directly for a project of interest. For example, let's search for the project GSE37448 from the Immunological Genome Project:

Figure: Gene Expression Intensity Heatmap categorized by disease category

By default, the view is displaying a heatmap of the expression intensity of samples, categorized by disease category. It is interesting to look at the heatmap of the genes with highest differential expression across cell types: 

Figure: Gene Expression Intensity Heatmap of genes with Gene Rank Expression Intensity <100

Figure: Expression Per-Gene View showing gene CD3G

Next, users can search for their gene(s) of interest across projects to compare results across different conditions (diseases, immune cells, activation responses, treatments, tissues, states of cell differentiation). The GSE37448 study was done in mouse. It might, for instance, be interesting to check the gene expression in human organs in ImmunoLand2015 instead of ImmunoMouse2015.

 

[ONCOLOGY RESEARCH] OncoLand helps to quickly confirm functional gene fusion events from Cell Reports article

Vivian Zhang

Typical molecular and cell biology research takes years and a significant amount of money, from design and grant applications through experiments, validation and reporting. A significant problem is that the results are often not what was expected or not applicable to real-world applications. The return on investment of research can be problematic and hard to justify (Science economics: What science is really worth).

Luckily, at least in cancer research, an increasing number of public datasets become available each year. With this large amount of public data, researchers can potentially save time and money in discovering a candidate gene target. If you are a biologist, a bioinformatician, a PI, or an R&D project leader or decision maker doing cancer research in a pharmaceutical or biotech company or an academic institute, OncoLand can help.

In five minutes, OncoLand can help you identify recurrent fusion genes in gastric cancer, which recently took some effort to identify in a Cell Reports paper (1).

Figure 1. Characteristics of Somatic SVs Identified by DNA-PET in GC. (A) SV filtering procedure for GC patient 125 is shown. Yao et al. 2015

In the article, Yao et al. identified recurrent fusions between CLDN18, a tight junction gene, and ARHGAP26, a gene encoding a RHOA inhibitor. According to the article, "CLDN18-ARHGAP26 negatively affects cell-cell and cell-matrix interactions and epithelial barrier function, thereby potentially contributing to gastritis, a known risk factor for GC." To verify the discovery of CLDN18-ARHGAP26 fusion, the user can use TCGA (The Cancer Genome Atlas) Land data.

First, the user can search gene CLDN18 and ARHGAP26 through multiple gene search in TCGA Land. By clicking on RNA-Seq Fusion Details, the user could check the list of fusions for the two genes. Interestingly, all the fusions identified are from Stomach Adenocarcinoma (STAD) samples, consistent with the article where the authors identified the fusion gene in gastric cancer. 

The user can check out fusion site frequency and RPKM under RNA-Seq Fusion tab. 

Furthermore, the user can also use our Genome Browser to visualize the reads supporting the fusion genes. The Land tab links to the browser through Browse Selected Samples. Our browser will automatically show a multi-panel view, splitting the browser into two panels for the two fusion genes:


REFERENCE:

(1) Yao, Fei, et al. "Recurrent Fusion Genes in Gastric Cancer: CLDN18-ARHGAP26 Induces Loss of Epithelial Integrity." Cell Reports (2015).

Choice of Gene Annotation on RNA-Seq Results

Matt Newman

Many users (and potential users) have asked us about our choice of gene annotation (or gene model, in Omicsoft lingo) in the OncoLand and ImmunoLand products. For the past three years, we've used an implementation of the UCSC gene model that we refer to as the "Omicsoft Gene Model". It consists of UCSC gene annotation + miRBase (for miRNAs) + the mitochondrial genes from Ensembl.

It was recently announced that UCSC will be moving to the GENCODE basic gene annotation for future incarnations of the gene annotation for their GRCh38 reference library. This is really good news for everyone, as it will hopefully simplify and standardize the reporting of transcript and gene IDs across publications, tools, etc.  It's something we are actively looking at for next year's releases of OncoLand and ImmunoLand as well (in addition to likely maintaining our current B37.3 and Omicsoft gene model results).

An interesting read on the effect of the gene annotation source on RNA-Seq can be found here: http://www.biomedcentral.com/1471-2164/16/97.  In it, the author found that the source of gene annotation does have a profound effect on RNA-Seq alignment, gene expression calculations, and differential expression results. 

[Oncology Research] Oncovirus: Tumor-Virus Association

Vivian Zhang

Tumor-virus associations are known in many types of cancer, including human papillomavirus (HPV) in head-and-neck squamous cell carcinoma, hepatitis B virus (HBV) in hepatocellular carcinoma and Epstein-Barr virus (EBV) in gastric carcinoma (1,2). Infection with the hepatitis B virus has been linked to the development of hepatocellular carcinoma. HBV-induced chronic active hepatitis (CAH) and cirrhosis are important risk factors in liver carcinogenesis (3). The detection capacity and sensitivity of RNA-Seq allow researchers to study this association across the human genome.

The Cancer Genome Atlas (TCGA) provides viral sequences generated by RNA-Seq. OncoLand, with its integration analysis capacity among mutation, gene expression and structural variation data, provides easy and fast approaches to examine the virus-gene expression association in tumor samples.

OncoLand provides virus sequence count data for more than 4000 GenBank IDs, generated by TCGA. In OncoLand, it’s easy to show the viral count (expression) per sample. It is clear that Hepatitis B virus is highly expressed in Liver Hepatocellular Carcinoma (LIHC):

To investigate the association between a single gene and Hepatitis B virus, OncoLand enables users to create cohorts with user-defined high and low virus counts through creation of SampleSet=>Group SampleSet from Selection:

With cohorts annotated by high and low virus count, Integration Analysis provides a Kruskal–Wallis test to compare the gene expression between high and low viral cohorts across the rest of the genome. Users could examine one individual gene, a list of molecular signatures of interest or all genes by specifying a Gene Set: 

The result provides statistical results, including gene expression levels in different cohorts:
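To illustrate the kind of comparison being made, here is a small Python sketch of a Kruskal-Wallis test for two cohorts. It is a textbook version under stated simplifications (no tie correction, p-value from the chi-square distribution with one degree of freedom), not OncoLand's implementation, and the expression values are invented.

```python
from math import erfc, sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def kruskal_wallis_two_groups(high, low):
    """Return (H, p) comparing two cohorts; no tie correction is applied."""
    pooled = list(high) + list(low)
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    for start, size in ((0, len(high)), (len(high), len(low))):
        total += sum(ranks[start:start + size]) ** 2 / size
    h = max(12 / (n * (n + 1)) * total - 3 * (n + 1), 0.0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = erfc(sqrt(h / 2))
    return h, p

# Hypothetical expression of one gene in high- and low-virus cohorts.
h_stat, p_value = kruskal_wallis_two_groups([5, 6, 7], [1, 2, 3])
print(round(h_stat, 3), round(p_value, 3))
```

With cohorts this small the chi-square approximation is rough, so the sketch is for intuition only.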

The user can export the results to Array Studio for further analysis, focus on a gene or genes of interest for further validation of the association, or export the results to Excel for reporting. Tumor-related genes could be good candidates for explaining the molecular mechanisms of viral oncology and for cancer drug discovery.

 

 

REFERENCE

1. Khoury, Joseph D., et al. The landscape of DNA virus associations across human malignant cancers using RNA-Seq: an analysis of 3775 cases. Journal of virology (2013): JVI-00340.

2. Di Bisceglie, Adrian M. Hepatitis B and hepatocellular carcinoma. Hepatology 49.S5 (2009): S56-S60.

3. Kremsdorf, D., et al. Hepatitis B virus-related hepatocellular carcinoma: paradigms for viral-related human carcinogenesis. Oncogene 25.27 (2006): 3823-3833.



UTR Trimming improves transcript-level quantification and RPKM accuracy in RNA-Seq analysis

Jason Lu

RPKM (Reads Per Kilobase of transcript per Million mapped reads) is a normalization method widely used in RNA-seq analysis. The calculation of RPKM requires a transcript length, which is usually the length of a 'virtual' transcript annotated in a gene model. With a chosen gene model (e.g. ENSEMBL or UCSC), transcript lengths are considered fixed, i.e. independent of sample data. In reality, it is well known that the length of a specific transcript can vary between samples due to events such as exon skipping, alternative transcription start sites and alternative 3' polyadenylation. For example, it has been reported that cancer cell lines often express mRNA isoforms with shorter 3' untranslated regions (UTRs) [Mayr 2009]. In the latest Array Studio, Omicsoft has introduced a new option, automatic UTR trimming, a data-driven method for dynamically redefining transcript lengths. Briefly, for each gene, we first calculate the read coverage of the gene. We then identify the left and right boundaries, which are defined as the positions that have 8% of the maximum coverage of the left/right gene UTR regions. The two boundary positions are the same for ALL transcripts of the gene. Then, for each transcript, we calculate the trimmed exon length at each end; the two trimming positions have to fall in the UTR region AND in the first/last exon of that specific transcript. Here, we use examples to demonstrate that applying this UTR trimming option can improve the accuracy of the RPKM calculation as well as transcript-level quantification.
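One possible reading of the gene-level boundary step is sketched below in Python. This is an interpretation, not Omicsoft's implementation: it assumes that each boundary is the outermost UTR position whose coverage reaches 8% of the peak coverage within that UTR, and that a UTR with no coverage is trimmed entirely. The per-transcript step (restricting the trim to the first/last exon) is not covered, and the coverage values are made up.

```python
def trim_boundaries(coverage, left_utr_len, right_utr_len, fraction=0.08):
    """Return (left, right) indices bounding the expressed part of a gene.

    coverage is the per-base read depth across the gene in genome order.
    The first left_utr_len and the last right_utr_len positions are the
    annotated UTRs.
    """
    n = len(coverage)

    def inward_offset(utr_depths):
        # utr_depths is ordered from the outer edge of the gene inward.
        peak = max(utr_depths, default=0)
        if peak == 0:
            # Assumption: a UTR without any coverage is trimmed completely.
            return len(utr_depths)
        threshold = fraction * peak
        for i, depth in enumerate(utr_depths):
            if depth >= threshold:
                return i
        return len(utr_depths)

    left = inward_offset(coverage[:left_utr_len])
    right = n - 1 - inward_offset(coverage[n - right_utr_len:][::-1])
    return left, right

# Hypothetical coverage: both annotated UTRs (4 bases each) are mostly unexpressed.
coverage = [0, 0, 1, 50, 100, 100, 100, 90, 40, 2, 0, 0]
left, right = trim_boundaries(coverage, left_utr_len=4, right_utr_len=4)
print(left, right)
```

Everything outside the returned boundaries would be treated as unexpressed UTR when transcript lengths are recomputed.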

We first examine the expression of the gene UBA52, using the TCGA sample TCGA-24-1846-01A-01R-1567-13 as an example. With the UCSC gene model, the RPKM values for the transcripts uc002njr and uc002njs are 119.6 and 194.5 respectively. After applying the new UTR trimming option, the RPKMs for the two transcripts become 610.5 and 1120, an increase of more than five-fold for both transcripts. Such dramatic changes in RPKM are largely due to the fact that only a small proportion of the long annotated 3' UTR regions is actually expressed, which can be clearly seen in the read coverage (see Figure below). Stripping the unexpressed 5' and 3' UTR regions leads to more accurate transcript lengths than the original annotation, which significantly improves the RPKM calculation.
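The effect of transcript length on RPKM follows directly from the standard formula, RPKM = reads × 10^9 / (transcript length in bp × total mapped reads). The Python sketch below uses made-up numbers (not the UBA52 data) and assumes the read count is unchanged by trimming, to show how shortening the effective length by a factor of five raises RPKM five-fold.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical transcript: 500 reads in a library of 20 million mapped reads.
annotated = rpkm(500, transcript_length_bp=2500, total_mapped_reads=20_000_000)
trimmed = rpkm(500, transcript_length_bp=500, total_mapped_reads=20_000_000)
print(annotated, trimmed, trimmed / annotated)
```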

 

Besides improving the RPKM calculation, UTR trimming can also benefit quantification by RSEM, a de facto standard algorithm used by both TCGA and Omicsoft OncoLand for transcript-level quantification. Here we take RPL28 as an example. This gene has 5 transcript isoforms, and the predominant form in the same TCGA sample as above is listed as uc010yga. But if we examine the read coverage for this gene, there is clearly an issue: the transcript uc010yga has no or minimal coverage in the last exon (exon 5) (see Figure below). Furthermore, the exon junction data does not support uc010yga being the predominant transcript (data not shown). In contrast, if we apply the UTR trimming option, the most abundant transcript for this gene changes to uc002qkv, which matches the read coverage in the browser well. Again, the true UTR region of this gene is significantly shorter than what has been annotated, which leads to bias in RSEM quantification. UTR trimming can remedy the problem and allows the identification of the truly abundant transcripts.

Another example is from gene IGBP1. The real 5' end UTR region in this TCGA sample is much shorter than the region being annotated in the UCSC gene model. RSEM assigns a majority of reads to the transcript uc004dxw, which is clearly questionable due to the fact that no junction reads are observed between the first and second exon of uc004dxw (see Figure below, including junction reads). With UTR trimming, RSEM correctly identifies the true predominant transcript - uc004dxv, containing greater than 99% of the total reads mapped to this gene.

    

 

 



Detect bi-directional fusion genes from RNA-Seq dataset

Gary Ge

In RNA-Seq datasets, fusion genes can be detected based on both paired-end connectivity and single-end read spanning. In a paired-end NGS dataset, a discordant read pair is one that is not aligned to the reference genome with the expected distance or orientation. If a set of discordant read pairs is mapped to two different genes, a fusion gene is suggested (Fusion PE detection). On the other hand, single-end reads that span the fusion junctions provide base-pair evidence for the fusion events (Fusion SE detection). ArrayStudio/ArrayServer software provides functions for both fusion detection algorithms, and the Land contains fusion views for both detection types.
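The paired-end idea can be sketched in Python as follows. This is a deliberately simplified illustration, not the FusionMap or Array Studio algorithm: the Mate record, the 1000 bp distance cutoff, the rule that mates on the same strand are discordant, the minimum of two supporting pairs, and the coordinates are all assumptions made for the example.

```python
from collections import Counter
from typing import NamedTuple, Optional

class Mate(NamedTuple):
    chrom: str
    pos: int
    strand: str
    gene: Optional[str]  # gene the mate aligns to, if any

def is_discordant(m1, m2, max_insert=1000):
    """True if the mates are not at the expected distance or orientation."""
    if m1.chrom != m2.chrom:
        return True
    if abs(m1.pos - m2.pos) > max_insert:
        return True
    # Simplification: properly paired mates align to opposite strands.
    return m1.strand == m2.strand

def fusion_pe_candidates(pairs, min_pairs=2, max_insert=1000):
    """Count discordant pairs whose mates fall in two different genes."""
    support = Counter()
    for m1, m2 in pairs:
        if m1.gene and m2.gene and m1.gene != m2.gene and is_discordant(m1, m2, max_insert):
            support[frozenset((m1.gene, m2.gene))] += 1
    return {tuple(sorted(genes)): n for genes, n in support.items() if n >= min_pairs}

# Hypothetical alignments (coordinates are illustrative only).
pairs = [
    (Mate("chr22", 23290500, "+", "BCR"), Mate("chr9", 133729500, "-", "ABL1")),
    (Mate("chr22", 23290620, "+", "BCR"), Mate("chr9", 133729410, "-", "ABL1")),
    (Mate("chr22", 23290100, "+", "BCR"), Mate("chr22", 23290350, "-", "BCR")),
]
candidates = fusion_pe_candidates(pairs)
print(candidates)
```

Paired-end evidence of this kind does not give the exact breakpoint; junction-spanning reads (Fusion SE detection) are what provide base-pair resolution.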

In our recent data analysis of the CCLE and TCGA datasets, we found a significant number of fusion genes for which both directions of the fusion are expressed. Taking the BCR-ABL1 fusion as one example, after stringent filtering, there are eight cell lines expressing the BCR-ABL1 fusion and five cell lines expressing ABL1-BCR among CCLE's 780 RNA-Seq samples in CGHub. Here are the fusion RPKM views for both fusions in the CCLE-Land:

Fusion RPKM views for BCR-ABL1 and ABL1-BCR. Samples are filtered by splice pattern, open reading frame, exon boundary based on fusion breakpoints information and # of minimum supporting fusion junction seed reads. There are a few more samples if we relax these filters. Charts are based on fusion junctions, not on gene level. There are multiple fusion junction or alternative fusion isoforms from the same fusion gene.

By checking the fusion junction spanning reads in Omicsoft Genome Browser, the fusion from ABL1 to BCR is very convincing. Taking the cell line NALM1 as one example, it is expressing both BCR-ABL1 and ABL1-BCR.
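Given a table of fusion calls, finding samples that express both directions of a fusion is a simple lookup. The Python sketch below assumes each call is a (sample, 5' gene, 3' gene) tuple. NALM1 expressing both directions is taken from the text above, while the second sample name is hypothetical.

```python
def bidirectional_fusions(fusion_calls):
    """Return {sample: [gene pairs seen in both 5'->3' orders]}.

    fusion_calls is an iterable of (sample, five_prime_gene, three_prime_gene).
    """
    by_sample = {}
    for sample, gene5, gene3 in fusion_calls:
        by_sample.setdefault(sample, set()).add((gene5, gene3))

    result = {}
    for sample, fusions in by_sample.items():
        # Keep a pair only if the reciprocal fusion is also called in this sample.
        pairs = sorted({
            tuple(sorted((g5, g3)))
            for g5, g3 in fusions
            if g5 != g3 and (g3, g5) in fusions
        })
        if pairs:
            result[sample] = pairs
    return result

calls = [
    ("NALM1", "BCR", "ABL1"),
    ("NALM1", "ABL1", "BCR"),
    ("CELL_LINE_X", "BCR", "ABL1"),  # hypothetical sample with one direction only
]
both_directions = bidirectional_fusions(calls)
print(both_directions)
```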

BCR-ABL1 fusion at read level in Omicsoft Genome Browser (two panes for fusion display).

ABL1-BCR fusion at read level in Omicsoft Genome Browser (two panes for fusion display).

There are papers reporting bidirectional fusion gene expression, such as the following one for BCR and ABL1:

http://bloodjournal.hematologylibrary.org/content/81/1/158.full.pdf

We also found expression of the ERG-TMPRSS2 fusion in TCGA samples, as well as the FGFR3-TACC3 and TACC3-FGFR3 fusions. In inter-chromosomal cases, fusions are formed in both directions due to translocation (such as BCR-ABL). Within the same chromosome, or between nearby genes, fusions in both directions are formed by tandem repeats (such as the FGFR3-TACC3 example). Here is a literature figure showing the FGFR3-TACC3 fusion:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3561838/figure/F2/
