Overview

Omicsoft is the leading provider of Next Generation Sequencing, Cancer Genomics, Immunology, and Bioinformatics solutions for Next Generation Sequencing Data and Gene Expression Analysis.

Exciting Updates and Latest News

Keeping you up-to-date with the latest in NGS, Bioinformatics Analysis, and cancer genomics with blogs on Array Suite, OncoLand (TCGA and more), ImmunoLand, and more.

[Array Studio Video Tutorial] RNA-Seq Advanced Analysis

Vivian Zhang

Finding genes or transcripts that are differentially expressed among different conditions is an important analysis step in understanding the functions of genetic variants. Array Studio contains a number of different modules for performing univariate analysis/differential expression, including One-Way ANOVA, Two-Way ANOVA, and the more advanced General Linear Model, as well as a few others. Statistical inference can be performed on your feature-level data, whether it was quantified in Array Studio or imported from external programs. In this article, we will introduce popular methods for Advanced Analysis of RNA-seq data.

 

 

1 ANOVA on RNA-seq Data

A One-Way ANOVA is used to study the effects of a single factor, while a Two-Way ANOVA can be used to study the effects of two factors on expression data. For example, if a user has an experiment with factors for time and treatment, this model can be quickly used to generate results (including fold changes, estimates, raw and adjusted p-values, LSMeans, and Estimate data). By selecting factor 1 and factor 2, and then the level to compare to, Array Studio will automatically create the comparisons and model for the user. This model generates an inference report, including an automatically generated Report View and VolcanoPlotView:
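To make the statistics behind such a report concrete, here is a minimal Python sketch of a one-way ANOVA F statistic and a log2 fold change for a single gene. It is an illustration only, not Array Studio's implementation; the function names and expression values are hypothetical, and the input is assumed to be log2-transformed already.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of factor levels
    n = len(all_values)      # total number of samples
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def log2_fold_change(treated, control):
    """Difference of group means; valid because the values are already log2."""
    return mean(treated) - mean(control)

# Hypothetical log2 expression values for one gene
control = [5.0, 5.2, 4.8]
treated = [7.1, 6.9, 7.0]

f_stat = one_way_anova_f([control, treated])
lfc = log2_fold_change(treated, control)
print(f"F = {f_stat:.1f}, log2 fold change = {lfc:.2f}")
```

A large F statistic means that between-group variation dominates within-group variation. The p-value would come from the F distribution with (k - 1, n - k) degrees of freedom, and it would then be adjusted for multiple testing across genes.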

 

2 DESeq on RNA-seq Data

For RNA-Seq, read count is a good estimate of the abundance of the target transcript, so it is of great interest to compare read counts between different conditions. The DESeq GLM test is a powerful tool for inferring differential expression of genes/transcripts from raw count data. It allows the user to model the data using a linear model and test for differential expression using the negative binomial distribution. The function should perform similarly to the DESeq R package. DESeq only works on raw counts of sequencing reads (with no additional background reads added to the dataset). After running the test, a report table is generated along with a scatter plot. A volcano plot is generated as well, as in the ANOVA analysis. For more details on how the DESeq method works and its other functions, check out the DESeq R manual.
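One reason DESeq needs raw counts is that it estimates its own per-sample scaling. The sketch below illustrates the median-of-ratios size factor idea described for the DESeq R package. It is a simplified illustration, not the package's code and not necessarily what Array Studio does internally; the function name and the toy counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict mapping gene -> list of raw counts (one per sample).

    Returns one size factor per sample using the median-of-ratios idea:
    each sample is compared to a per-gene geometric mean reference.
    """
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with any zero count are skipped
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[g][j]) - lgm
                      for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data: sample 2 was sequenced about twice as deeply as sample 1
counts = {"A": [10, 20], "B": [100, 200], "C": [30, 60], "D": [0, 5]}
factors = size_factors(counts)
normalized = {g: [c / f for c, f in zip(row, factors)]
              for g, row in counts.items()}
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale before the negative binomial model is fitted.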

 

3 Identifying Differential Usage of Isoforms

ArrayStudio uses a straightforward approach to identify genes with differential transcript usage between groups. This function allows the user to identify differentially expressed isoforms between comparisons. Based on transcript-level data (RPKM, FPKM, or Count data), the function converts the expression values to ratios by dividing the value of each transcript by the sum of all transcripts in the same gene. The top-ranked (smallest) p-values reflect the largest differences in relative transcript usage.
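The ratio conversion described above can be sketched in a few lines of Python. The function name, gene and transcript identifiers, and values are hypothetical; this only illustrates the transformation, not the statistical test applied afterwards.

```python
def transcript_usage(expression_by_gene):
    """expression_by_gene: dict gene -> dict transcript -> expression value.

    Returns the same structure with each value replaced by the transcript's
    share of the total expression of its gene.
    """
    usage = {}
    for gene, transcripts in expression_by_gene.items():
        total = sum(transcripts.values())
        if total == 0:
            continue  # gene not expressed, ratios are undefined
        usage[gene] = {tx: value / total for tx, value in transcripts.items()}
    return usage

# Hypothetical FPKM values for one gene in two tissues
lung = transcript_usage({"GENE1": {"tx_a": 30.0, "tx_b": 10.0}})
skin = transcript_usage({"GENE1": {"tx_a": 5.0, "tx_b": 15.0}})
print(lung["GENE1"]["tx_a"], skin["GENE1"]["tx_a"])
```

Because the ratios sum to one within each gene, a shift in usage can be detected even when the overall expression of the gene is unchanged.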

Differentially expressed isoforms report sorted by p-value. The user can directly visualize the difference in transcript usage in the genome browser.

Genome browser view can display exon junction reads. As shown, only 37 reads span a certain junction in lung, but 11000 reads span the same junction in skin.

 

For how to achieve the above results, please check out our video tutorial: Advanced Analysis of RNA-seq data

[Array Studio Video Tutorial] RNA-Seq Downstream Analysis: Normalization, Visualization and Data Integration

Vivian Zhang

After aligning data, there are a number of downstream analyses that can be done. For instance, the generated RPKM (or FPKM) dataset can be used like microarray data for clustering (log2 transformation may be necessary). Count data can be used to look for changes between groups of samples through DESeq analysis. A large number of visualization and QC functions are available to analyze feature-level RNA-seq data in Array Studio. In this article, we will introduce our video tutorials on RNA-Seq Downstream Analysis.

 

 

 

1 Normalizing and Transforming RNA-seq Data for MicroArray-type analysis

Array Studio has a large number of modules originally designed for Gene Expression MicroArray analysis, but these modules are also useful for analyzing feature-level (e.g. gene-level, exon-level) RNA-seq data. However, many of these modules expect normalized and log-transformed input data. Array Studio provides a number of methods for normalizing RNA-Seq data, including Log Geometric Mean, Mean, Median, Quantile, TMM (edgeR), TotalCount, RPKM to TPM, UpperQuartile, and LandNormalization. Array Studio also provides methods for normalizing and transforming -Omic data. 
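As an illustration of two of these steps, the sketch below converts RPKM values to TPM and applies a log2 transform. The function names are hypothetical, and the pseudocount of 1 is an assumption made here to avoid taking the log of zero; Array Studio's own defaults may differ.

```python
import math

def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm.values())
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}

def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount is an assumed convention."""
    return {gene: math.log2(v + pseudocount) for gene, v in values.items()}

# Hypothetical RPKM values for one sample
sample_rpkm = {"g1": 10.0, "g2": 30.0, "g3": 60.0}
sample_tpm = rpkm_to_tpm(sample_rpkm)
sample_log2 = log2_transform(sample_rpkm)
print(sample_tpm)
print(sample_log2)
```

TPM values always sum to the same total in every sample, which makes proportions comparable across samples, while the log2 transform compresses the wide dynamic range that microarray-style modules are not designed for.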

 

2 Attach new Views to Data

In Array Studio, data can be viewed directly in tables, but can also be displayed in up to 40 Views, depending on the contents of the underlying data. Among its most popular views, Array Studio features the very powerful Variable View:

The Variable View allows the user to visualize one chart for each variable in the dataset. The example variable view shows the Log 2 FPKM values for gene CLDM18, categorized by tissue and gender.

 

3 Principal Component Analysis on normalized expression data

Principal Component Analysis (PCA) is an effective tool to group data by the components that contribute the greatest variance in the dataset. In other words, PCA can group your data based on variance, which should reflect differences between samples. Problematic samples (such as failed samples) will often appear as outliers in the PCA plot.
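For readers who want to see the mechanics, the following dependency-free sketch finds the first principal component by power iteration on the covariance matrix and projects each sample onto it. It is a teaching example with hypothetical names and toy data, not how Array Studio computes PCA, and it assumes the data have non-zero variance. Real datasets with thousands of genes would normally use an SVD-based routine instead.

```python
import math

def first_principal_component(samples, iterations=200):
    """samples: list of rows (one row per sample, one column per variable).

    Returns (direction, scores): the unit vector of greatest variance and
    the coordinate of each sample along that vector.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Sample covariance matrix (p x p)
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0] * p  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores

# Toy data: four samples, two variables; the last sample is far from the rest
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [10.0, 20.0]]
direction, scores = first_principal_component(data)
outlier_index = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(direction, scores, outlier_index)
```

Plotting the scores of the first two or three components gives the familiar 2D or 3D PCA plot, in which a sample with an extreme score stands apart from the others.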

Both 2D and 3D PCA plots are commonly used to group data or identify outliers. 

 

4 Hierarchical Clustering of normalized expression data

Gene expression data can be grouped by Hierarchical Clustering by Variables (e.g. genes) and Observations (e.g. samples) to reveal associations in your data. Array Studio can easily handle Hierarchical Clustering of up to 20000 variables, far more than the capacity of many popular gene clustering programs.
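The idea behind agglomerative hierarchical clustering can be sketched as follows. This is a naive average-linkage version with Euclidean distance, written for clarity rather than speed; the names and toy profiles are hypothetical, and the distance and linkage options used by Array Studio may differ.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """profiles: dict name -> list of values.

    Repeatedly merges the two closest clusters and returns the merges as
    (cluster_a, cluster_b, distance) tuples, in the order they happened.
    """
    def dist(c1, c2):
        # Average pairwise distance between members of two clusters
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Toy expression profiles for four samples
profiles = {
    "s1": [1.0, 1.0],
    "s2": [1.0, 2.0],
    "s3": [8.0, 8.0],
    "s4": [9.5, 8.0],
}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

The sequence of merges and their distances is exactly what a dendrogram draws: early merges join similar samples low in the tree, and the last merge joins the two most distinct groups at the top.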

Classic dendrogram is an older version of dendrogram. The new version is more interactive and provides more gene annotation information for downstream analysis. 

 

5 RNAseq-MicroArray Integration

Feature-level (genes, transcripts, etc.) results from RNA-seq experiments can directly be compared to microarray data from the same samples, using the Microarray-Microarray Integration module. This module allows the user to create a duplex matrix (two values for each variable in the dataset) for two “microarray” data types. The resulting dataset can also contain correlation information for each variable, making it easy to figure out which variables correlate well between datasets.
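The per-variable correlation mentioned above can be illustrated with a plain Pearson correlation computed gene by gene. The sketch assumes both datasets list the same samples in the same order and that no gene is constant across samples; all names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def per_gene_correlation(rnaseq, microarray):
    """Correlate each gene present in both datasets across matched samples."""
    shared = rnaseq.keys() & microarray.keys()
    return {gene: pearson(rnaseq[gene], microarray[gene])
            for gene in sorted(shared)}

# Hypothetical log2 values for four matched samples
rnaseq = {"G1": [1.0, 2.0, 3.0, 4.0], "G2": [4.0, 3.0, 2.0, 1.0]}
microarray = {"G1": [2.0, 4.0, 6.0, 8.0], "G2": [1.0, 2.0, 3.0, 4.0],
              "G3": [5.0, 5.5, 6.0, 6.5]}
correlations = per_gene_correlation(rnaseq, microarray)
print(correlations)
```

Genes with a correlation near 1 behave consistently on both platforms, while low or negative values flag genes worth a closer look, for example because of probe design or annotation differences.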

Microarray-microarray integration module provides variable views on gene and sample level showing how well microarray and RNA-seq data correlate. 

 

To learn how to perform these downstream analyses on RNA-seq data, please check out our video tutorials on RNA-Seq Downstream Analysis.

[Array Studio Video Tutorial] RNA-Seq Analysis Basic functions: Reads Quantification, Exon Junction and Gene Fusion Detection

Vivian Zhang

RNA-Seq has become one of the most popular methods in gene- and transcript-level genomic research. It can help quantify gene and transcript expression, identify sequence variants, and detect gene-, transcript-, or exon-level genomic events. Array Studio provides a variety of functions powerful enough for small- and large-scale genomic research. In this article, we will introduce a few of the basic and most commonly used functions, including sequence quantification, gene annotation, exon junction detection, and gene fusion detection.

 

 

ArrayStudio provides a number of modules and options for RNA-Seq quantification at gene, transcript, exon and exon junction levels. Both FPKM and Count tables can be generated. 

Example RNA-seq gene count table and its corresponding design table.

 

Alternative splicing has been shown to play an important role in a number of human diseases, including cancer, cardiovascular and neurodegenerative diseases. In Omicsoft Array Studio and the Land products, we provide modules and visualization functions that make it easier for users to research splicing. In RNA-Seq analysis, besides gene and transcript counts, Array Studio can report exon junction counts as well. Results can be visualized in Omicsoft's Genome Browser.

Exon junction report and genome browser view.

Mutation data allows users to compare mutation frequencies and research individual variants. Users can run the Summarize Variant Data module to annotate variants. Variants can be annotated in Mutation Reports or VCF files, and visualized directly in the Genome Browser.

Mutation annotation report and example genome browser view of variant V600E. 

 

 

Fusion genes can play an important role in cancer mutations that have multiple effects on a target gene. At Omicsoft, we provide a powerful fusion detection algorithm in FusionMap. FusionMap identifies unmapped reads that span multiple genomic locations, indicating possible gene fusion events:

The Map Fusion Reads module detects fusion genes from fusion junction-spanning reads, which can characterize fusion genes at base-pair resolution. This works with single-end or paired-end data. Combined Fusion Analysis runs fusion junction-spanning read detection and inter-transcript fusion read pair detection at the same time. It detects fusion junction-spanning reads from unmapped reads in BAM files, and detects inter-transcript fusion read pairs from singletons among the BAM alignment entries. It returns a report showing potential fusion genes and counts for each fusion junction. Combined Fusion Analysis can only be run on paired-end data.

The fusion report shows fusion count data with fusion annotation information attached. The fusion genome browser can display sequence information at base-pair resolution.

 

 

[Array Studio Video Tutorial] RNA-Seq Analysis Basics: Getting Started with RNA-Seq Pipeline Analysis and Data QC

Vivian Zhang

Omicsoft's Next Generation Sequencing (NGS) analysis includes bioinformatics tools for the entire process, from QC to alignment to post-alignment summarization and analysis. RNA-Seq data analysis is a critical part of Omicsoft's NGS bioinformatics tools. In this article, we introduce our tutorial on how to get started with RNA-seq pipeline analysis and data QC.

Getting Started with RNA-seq pipeline functions

 

1 Running the RNA-seq pipeline for a new project

A typical RNA-seq analysis project consists of the following steps: raw data quality control, alignment, aligned data quality control, data quantification, visualization, and statistical inference. In Array Studio, users can either execute each step of the analysis one by one or use the RNA-seq pipeline function. It only takes a few clicks to create a new RNA-seq project and run the RNA-seq pipeline.

RNA-Seq Pipeline. Users can choose which analysis steps to perform, such as raw data QC, post-alignment data QC, exon junctions, sequence quantification, mutation and fusion detection.

 

 

2 Raw Data QC

If you choose to perform the analysis step by step, you should first perform quality control (QC) on the raw data before aligning your RNA-seq data, to spot common problems like adapter or barcode sequence contamination, degraded quality at the ends of reads, or problematic samples. The Array Studio Raw Data QC Wizard reports a number of useful measures of raw NGS quality, and its report can also be generated as part of the RNA-seq pipeline function.

Example QC report includes:

  • Base Distribution 
  • Basic Stats 
  • Duplication Level 
  • Kmer Analysis 
  • Overall/Per-sequence Quality Reports 
  • Quality Box plot 
  • Over-represented Sequences 
  • Per-sequence GC report 
  • Sequence Length Report 
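To give a feel for what a few of these measures are, the sketch below computes some basic statistics from reads held in memory. It assumes standard Phred+33 quality encoding, and the function names and reads are hypothetical; it is not the Raw Data QC Wizard's implementation.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a read."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def basic_stats(reads, offset=33):
    """reads: list of (sequence, quality_string) tuples in Phred+33 encoding."""
    n = len(reads)
    qualities = [ord(ch) - offset for _, qual in reads for ch in qual]
    return {
        "reads": n,
        "mean_length": sum(len(seq) for seq, _ in reads) / n,
        "mean_gc": sum(gc_fraction(seq) for seq, _ in reads) / n,
        "mean_quality": sum(qualities) / len(qualities),
    }

# Two toy reads: one GC-rich with high quality, one AT-rich with low quality
reads = [("GGCC", "IIII"), ("ATAT", "!!!!")]
stats = basic_stats(reads)
print(stats)
```

Comparing such numbers between samples is often enough to spot a problematic library before any alignment is run.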

 

3 Filtering and Trimming Raw Reads

Array Studio's NGS Filter function can trim low-quality bases from raw NGS data, filter out uniformly low-quality reads, and strip away adapter sequences. The RNA-seq pipeline assumes that input reads are pre-filtered and stripped, so only quality-based trimming and filtering will be performed in the pipeline (no adapter stripping). It is a good idea to run the Filter function on your reads, based on the raw data QC results, before running the RNA-seq pipeline.
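Quality-based trimming and filtering can be pictured with the short sketch below, which removes low-quality bases from the 3' end of a read and then discards reads that became too short. The thresholds (quality 20, minimum length 25), the Phred+33 offset, and the function names are assumptions chosen for illustration, not the NGS Filter defaults.

```python
def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Drop trailing bases whose Phred quality is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality_string[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]

def filter_reads(reads, min_length=25, **trim_options):
    """Trim every read, then keep only those still at least min_length long."""
    kept = []
    for seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, **trim_options)
        if len(seq) >= min_length:
            kept.append((seq, qual))
    return kept

# 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33
trimmed = trim_3prime("ACGTACGT", "IIIII###")
kept = filter_reads([("ACGTACGT", "IIIII###"), ("ACGT", "I###")], min_length=5)
print(trimmed, kept)
```

Adapter stripping is a separate step that matches known adapter sequences; it is not covered by this quality-only sketch.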

 

4 Aligned Data QC

Array Studio automatically generates an Alignment Report after aligning reads to the genome or transcriptome. Additional alignment statistics can be generated by running the Aligned Data QC and RNA-seq 5'->3' Trend modules.

Alignment report is automatically generated after alignment. 

Additional aligned data QC metrics include: 

1 Alignment Metrics
2 Flag Metrics
3 Profile Metrics
4 Source Metrics
5 Insert Size Metrics
6 Duplication Metrics
7 Coverage Metrics
8 Strand Metrics
9 Feature Metrics

 

The best way to quickly learn how to perform these analysis steps is to watch our short video tutorials Getting Started with RNA-seq pipeline functions. Please stay tuned for more blog articles on RNA-seq analysis.

Choice of Gene Annotation on RNA-Seq Results

Matt Newman

Many users (and potential users) have asked us about our choice of gene annotation (or gene model, in Omicsoft lingo) in the OncoLand and ImmunoLand products. For the past three years, we've used an implementation of the UCSC gene model that we refer to as the "Omicsoft Gene Model". It consists of the UCSC gene annotation + miRBase (for miRNAs) + the mitochondrial genes from Ensembl.

It was recently announced that UCSC will be moving to the GENCODE basic gene annotation for future incarnations of the gene annotation for their GRCh38 reference library. This is really good news for everyone, as it will hopefully simplify and standardize the reporting of transcript and gene IDs across publications, tools, etc.  It's something we are actively looking at for next year's releases of OncoLand and ImmunoLand as well (in addition to likely maintaining our current B37.3 and Omicsoft gene model results).

An interesting read on the effect of the gene annotation source on RNA-Seq can be found here: http://www.biomedcentral.com/1471-2164/16/97.  In it, the author found that the source of gene annotation does have a profound effect on RNA-Seq alignment, gene expression calculations, and differential expression results. 



UTR Trimming improves transcript-level quantification and RPKM accuracy in RNA-Seq analysis

Jason Lu

RPKM (Reads Per Kilobase Per Million) is a normalization method widely used in RNA-seq analysis. The calculation of RPKM involves the input of transcript length, which often is the length of a 'virtual' transcript annotated in a gene model. With a chosen gene model (e.g. ENSEMBL, or UCSC), transcript lengths are considered fixed, i.e. independent of sample data. In reality, it is well known that, for a specific transcript, its length can vary between different samples due to events such as exon skipping, alternative transcription start sites and 3' alternative polyadenylation. For example, it has been reported that cancer cell lines often expressed mRNA isoforms with shorter 3′ untranslated regions (UTRs) [Mayr 2009]. In the latest Array Studio, Omicsoft has introduced a new option - automatically trimming UTRs, a data-driven method for dynamically redefining transcript lengths. Briefly, for each gene, we first calculate the coverages of the gene. We then identify the left and right boundaries, which are defined as the positions that have 8% of the maximum coverage of the left/right gene UTR regions. The two boundary positions are the same for ALL transcripts. Then for each transcript, we calculate the trimmed exon length for each end – those two trimming positions have to fall in the UTR region AND the first/last exon of that specific transcript. Here, we use examples to demonstrate that applying this UTR trimming option can improve the accuracy of RPKM calculation as well as the transcript level quantification.
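The boundary rule can be read in more than one way, so the sketch below shows just one possible interpretation: within a UTR, keep everything up to the outermost base whose coverage is still at least 8% of the maximum coverage in that UTR. The function name, the input layout, and this reading of the rule are assumptions, not Omicsoft's actual algorithm.

```python
def trimmed_utr_length(utr_coverage, fraction=0.08):
    """utr_coverage: per-base read depth, ordered from the coding-region
    side of the UTR outward toward the annotated transcript end.

    Returns how many UTR bases are kept under the assumed rule.
    """
    if not utr_coverage:
        return 0
    threshold = fraction * max(utr_coverage)
    kept = 0
    for index, depth in enumerate(utr_coverage):
        # Remember the outermost covered base that still meets the threshold
        if depth > 0 and depth >= threshold:
            kept = index + 1
    return kept

# Toy 3' UTR: coverage fades out well before the annotated end
coverage = [100, 90, 80, 50, 10, 7, 0, 0, 0, 0]
kept_bases = trimmed_utr_length(coverage)
print(kept_bases, "of", len(coverage), "annotated UTR bases kept")
```

In this toy example only half of the annotated UTR is retained, which shortens the transcript length that enters the RPKM calculation.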

We first examine the expression of the gene UBA52, using the TCGA sample TCGA-24-1846-01A-01R-1567-13 as an example. With the UCSC gene model, the RPKM values for the transcripts uc002njr and uc002njs are 119.6 and 194.5 respectively. After applying the new UTR trimming option, the RPKMs for the two transcripts become 610.5 and 1120, a five- to six-fold increase for both transcripts. Such dramatic changes in RPKM are largely due to the fact that only a small proportion of the long annotated 3' UTR regions are actually expressed, which can be clearly seen in the read coverage (see Figure below). Stripping the unexpressed 5' and 3' UTR regions leads to more accurate transcript lengths than the original annotation, which significantly improves the RPKM calculation.
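The effect of transcript length on RPKM follows directly from the formula, as the small calculation below shows. The read counts and lengths are invented round numbers chosen to give a five-fold change; they are not the actual values for UBA52.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical transcript: same reads, annotated length vs. trimmed length
reads_on_transcript = 500
total_mapped_reads = 20_000_000

rpkm_annotated = rpkm(reads_on_transcript, 2500, total_mapped_reads)
rpkm_trimmed = rpkm(reads_on_transcript, 500, total_mapped_reads)
print(rpkm_annotated, rpkm_trimmed, rpkm_trimmed / rpkm_annotated)
```

Because the read count stays the same while the length in the denominator shrinks, RPKM rises by the same factor by which the effective transcript length is reduced.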

 

Besides improving the RPKM calculation, UTR trimming can also benefit quantification by RSEM, a de facto standard algorithm used by both TCGA and Omicsoft OncoLand for transcript-level quantification. Here we take RPL28 as an example. This gene has 5 transcript isoforms, and the predominant form in the same TCGA sample above is listed as uc010yga. But if we examine the read coverage for this particular gene, there is clearly an issue: the transcript uc010yga has no or minimal coverage in the last exon (exon 5) (see Figure below). Furthermore, the exon junction data does not support uc010yga being the most predominant transcript (data not shown). In contrast, if we apply the UTR trimming option, the most abundant transcript for this gene changes to uc002qkv, which matches well with the read coverage in the browser. Again, the true UTR region of this gene is significantly shorter than what has been annotated, which leads to bias in RSEM quantification. UTR trimming, however, can remedy the problem and allows the identification of the truly abundant transcripts.

Another example is from gene IGBP1. The real 5' end UTR region in this TCGA sample is much shorter than the region being annotated in the UCSC gene model. RSEM assigns a majority of reads to the transcript uc004dxw, which is clearly questionable due to the fact that no junction reads are observed between the first and second exon of uc004dxw (see Figure below, including junction reads). With UTR trimming, RSEM correctly identifies the true predominant transcript - uc004dxv, containing greater than 99% of the total reads mapped to this gene.

    

 

 



Detect bi-directional fusion genes from RNA-Seq dataset

Gary Ge

In RNA-Seq datasets, fusion genes can be detected based on both paired-end connectivity and single-end read spanning. In a paired-end NGS dataset, a discordant read pair is one that is not aligned to the reference genome with the expected distance or orientation. If a set of discordant read pairs maps to two different genes, a fusion gene is suggested (Fusion PE detection). On the other hand, single-end reads that span the fusion junctions provide base-pair evidence for the fusion events (Fusion SE detection). ArrayStudio/ArrayServer software provides functions for both fusion detection approaches, and the Land contains fusion views for both detection types.
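The paired-end idea can be reduced to a simple counting exercise, sketched below. It assumes each read pair has already been annotated with the gene that each mate aligns to, and it ignores distance, orientation, and strand, so it cannot tell BCR-ABL1 from ABL1-BCR. The function name, the input format, and the minimum of two supporting pairs are hypothetical; this is not the FusionMap algorithm.

```python
from collections import Counter

def fusion_candidates(read_pairs, min_pairs=2):
    """read_pairs: iterable of (gene_of_mate1, gene_of_mate2); None means
    the mate did not overlap an annotated gene.

    Returns gene pairs supported by at least min_pairs discordant pairs.
    """
    support = Counter()
    for gene1, gene2 in read_pairs:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # not informative for a gene-gene fusion
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_pairs}

# Toy annotated read pairs
pairs = [
    ("BCR", "ABL1"),
    ("ABL1", "BCR"),
    ("BCR", "BCR"),    # concordant, same gene
    ("TP53", "MDM2"),  # only one supporting pair
    (None, "BCR"),     # mate outside annotated genes
]
candidates = fusion_candidates(pairs)
print(candidates)
```

Resolving the direction of a fusion and its exact breakpoint requires junction-spanning reads, which is why the single-end spanning evidence complements the paired-end evidence.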

In our recent data analysis of the CCLE and TCGA datasets, we found a significant number of fusion genes where fusions in both directions are expressed. Taking the BCR-ABL1 fusion as one example, after stringent filtering, there are eight cell lines expressing the BCR-ABL1 fusion and five cell lines expressing ABL1-BCR among CCLE's 780 RNA-Seq samples in CGHub. Here are the fusion RPKM views for both fusions in the CCLE-Land:

Fusion RPKM views for BCR-ABL1 and ABL1-BCR. Samples are filtered by splice pattern, open reading frame, exon boundary based on fusion breakpoint information, and the minimum number of supporting fusion junction seed reads. There are a few more samples if we relax these filters. Charts are based on fusion junctions, not on gene level. There can be multiple fusion junctions or alternative fusion isoforms from the same fusion gene.

By checking the fusion junction spanning reads in Omicsoft Genome Browser, the fusion from ABL1 to BCR is very convincing. Taking the cell line NALM1 as one example, it is expressing both BCR-ABL1 and ABL1-BCR.

BCR-ABL1 fusion at read level in Omicsoft Genome Browser (two panes for fusion display).

ABL1-BCR fusion at read level in Omicsoft Genome Browser (two panes for fusion display).

There are papers reporting bidirectional fusion gene expression, such as the following one for BCR and ABL1:

http://bloodjournal.hematologylibrary.org/content/81/1/158.full.pdf

We also found expression of the ERG-TMPRSS2 fusion in TCGA samples, as well as the FGFR3-TACC3 and TACC3-FGFR3 fusions. In inter-chromosome cases, fusions are formed in both directions due to translocation (such as BCR-ABL). Within the same chromosome or between nearby genes, fusions in both directions are formed by tandem repeats (such as the FGFR3-TACC3 example). Here is a literature figure showing the FGFR3-TACC3 fusion:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3561838/figure/F2/
