Omicsoft is the leading provider of Next Generation Sequencing, Cancer Genomics, Immunology, and Bioinformatics solutions for Next Generation Sequencing Data and Gene Expression Analysis.

Exciting Updates and Latest News

Keeping you up-to-date with the latest in NGS, Bioinformatics Analysis, and cancer genomics with blogs on Array Suite, OncoLand (TCGA and more), ImmunoLand, and more.

Filtering by Tag: OncoLand

[Event] Learn|Network|Impact 2017 OmicSoft User Group Meeting

Vivian Zhang

OmicSoft, now a QIAGEN company, would like to invite you to our annual Omicsoft User Group Meeting being held in Cambridge, MA on September 19-20, 2017. 

Registration and attendance are FREE for a limited time only. For registration and more details, please go directly to our UGM page.

In the past ten years, OmicSoft has helped numerous users from major pharma and biotech companies (as well as research institutions) accelerate their bioinformatics and genomics research (who are our customers?). Last year, OmicSoft successfully held our kick-off OmicSoft User Group Meeting. More than 30 leading pharmaceutical and biotech companies and more than 100 experts and scientists in the field of bioinformatics/genomics/genetics attended the meeting.

Last year, our action-packed one-day meeting provided an open platform for our users and industry peers to learn, to network, and to impact the development of OmicSoft products. Click here for the 2016 OmicSoft UGM meeting agenda. This year OmicSoft has had several milestones and technology breakthroughs, including our acquisition by QIAGEN, the Array Suite 10.0 release, Cloud-Based Lands, Single Cell RNA-Seq support, upcoming integration with QIAGEN's bioinformatics products, Web-based solutions and more. We are expanding the 2017 OmicSoft User Group Meeting into a two-day event with:

  • More product training - Get the most out of Omicsoft products, and QIAGEN's bioinformatics products
  • More user talks and networking opportunities - Learn from others' experiences and industry best practices, and expand your professional network
  • More One-On-One meetings - Get problems solved, questions answered and get personalized training from our experienced staff


Learn, network, impact. Come join us and leading pharma, biotech companies and research institutions.

  • Learn to Use OmicSoft Products More Efficiently 
  • Impact Future Product Development
  • Network with Peers and Industry Experts
  • Get One-On-One Help from Experts
  • Explore more QIAGEN Bioinformatics products


Please contact us for potential presentation and collaboration opportunities. 

[Land Update] Omicsoft Quarterly Land Update Summary

Vivian Zhang

Omicsoft is excited to announce its latest Land updates, including OncoLand and DiseaseLand.

Highlights include:


  • Official release of the B38 Human Lands, including TCGA, CCLE, GTEx, Blueprint, and Sanger
  • Additional samples in the TARGET and Blueprint Lands
  • 5700 new Somatic Mutation samples in the TumorMutation Land
  • 350 new samples and 80 new comparisons in the OncoGeo Land
  • 1750 new expression samples in the ClinicalOutcome Land
  • Updated clinical data and CNVCall data in TCGA
  • New Comparison data (Tumor vs Normal) for 24 tumor types in TCGA


  • 3916 single cell samples from 5 projects to the Single Cell Human Land, including seven new cell types
  • 5466 single cell samples from 10 projects to the Single Cell Human Land, including seven new mouse cell types
  • 840 new RNA-Seq samples in Human DiseaseLand, along with additional comparisons
  • 2402 new RNA-Seq samples in Mouse DiseaseLand, along with additional comparisons
  • DiseaseLand now includes over 67,000 human samples, with 3,239 comparisons from 1000+ projects, and almost 21,000 mouse samples, with 2,248 comparisons from over 650 projects.

Integration with the recently introduced Gene Set Analysis module adds extra value to the release, as users can now query against all of the new Land data as well.


Matt Newman, VP of Business Development, will spend 45 minutes on December 12th, at 11:00 am EST, to give an overview of all the new datasets and visualizations that are included with this latest release. Please register here. We will contact users about this release update shortly after our webinar. Please stay tuned. 

[Land Tutorial] Getting Started with OncoLand

Vivian Zhang

OncoLand is an Oncology database and visualization software that helps users explore public and private cancer genomics datasets. It contains tens of thousands of carefully processed and curated oncology -Omic data samples. OmicSoft uses the Land framework to deliver an increasing number of large datasets, including data types such as RNA-Seq, DNA-Seq, miRNA-Seq, Copy Number Variation, Gene Expression Chip, Protein Expression, Methylation and hundreds of clinical measurements. 

OncoLand contains data from more than 10 large public datasets, including TCGA, CCLE, CGCI, ICGC, TARGET, Multiple Myeloma, GTEx, Blueprint and more. In this blog, we will introduce our data content based on our video tutorials: Getting Started With OncoLand.

For more details about Land content, please refer to our NEW wiki pages: Introduction to TCGA Land Content and Introduction to CCLE Land Content

A first look at OncoLand

Most of our OncoLand users are likely to be familiar with our Land interface. After selecting a Land, you will see a graphical interface similar to the following:

Example TCGA_B37 default view, displaying Sample Distribution view. 



TCGALand Introduction and Overview

TCGA, The Cancer Genome Atlas, is a comprehensive and coordinated effort to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies. TCGALand is OncoLand's signature Land; it contains RNA-Seq, Expression Array, DNA-Seq, CNV, Methylation, and Protein data from more than 30 tumor types.

TCGALand Sample Distribution across Tumor Type.


TCGALand provides table and figure views at the sample, gene and clinical data levels. We will introduce genomic data views in a following article, or you can refer to our video tutorials: Getting Started With OncoLand. Here, we would like to highlight the clinical data views, which are introduced in the TCGALand Introduction and Overview video clip.

Clinical Significance - Group Association is a dynamic view showing the association of all clinical variables with the selected grouping variable. It quickly provides insights on which clinical variables are potentially associated with the selected grouping variable. 

Clinical Association for TCGALand Tumor Type.


Another useful view is the Survival View. It plots survival rate over time for the selected grouping variables.

TCGALand Survival Plot by Tumor Type.



CCLELand Introduction and Overview

The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. CCLE provides public access to analysis and visualization of DNA copy number, mRNA expression, mutation data and more, for 1000 cancer cell lines. CCLELand groups data according to Primary Site (Tissue), with histology as the secondary grouping.

CCLELand Primary Grouping by Primary Site, instead of Tumor Type.



Stay tuned for more on OncoLand! 

[OncoLand Case Study] Empower OncoLand with Array Studio Analysis: Visualize "mutation burden" in each tumor in TCGALand

Vivian Zhang

One of the common goals in cancer research is identification of genes or samples with mutations that occur during tumor development. The number of identified mutations in cancer samples can vary wildly, but some tumors tend to aggregate widespread alterations. This Nature paper about the mutation landscape and significance across 12 major cancer types (as part of the TCGA Pan-Cancer effort) is a good example. In the very first figure, the authors investigated the mutation frequencies of six transition (Ti) and transversion (Tv) categories for each cancer type:

Figure 1: Mutation frequencies, spectra and contexts across 12 cancer types. Kandoth, Cyriac, et al. "Mutational landscape and significance across 12 major cancer types." Nature 502.7471 (2013): 333-339.

In another recent Nature Communications paper, Whole-genome mutational burden analysis of three pluripotency induction methods, the authors researched mutational subtypes in each sample:

Figure 2: Characterization of variants caused by reprogramming method. Bhutani, Kunal, et al. "Whole-genome mutational burden analysis of three pluripotency induction methods." Nature Communications 7 (2016).

Using OncoLand, you can easily calculate and visualize total mutation burden of every sample or tumor type. Check out this OncoLand case study: Visualize "mutation burden" of each tumor in TCGALand

1. Calculate total mutation burden of every sample in TCGALand.

To calculate the total number of mutations per tumor sample (mutation burden), you can simply use Summarize Sample Mutation Count under the Analytics tab in Land. If you specify individual nucleotide changes, for example "A->C", the result reports the number of mutations of each type (within a selected GeneSet) found in each sample (from a selected SampleSet).
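The counting step described above can be illustrated outside of Land with a minimal Python sketch. This is not how Summarize Sample Mutation Count is implemented; the record layout, sample IDs, and the `mutation_burden` helper are all hypothetical and exist only to show the idea of counting mutations per sample, by nucleotide change, restricted to a gene set and a sample set.

```python
from collections import Counter, defaultdict

# Hypothetical mutation records: (sample_id, gene, reference_base, alternate_base).
mutations = [
    ("S1", "TP53", "A", "C"),
    ("S1", "KRAS", "G", "T"),
    ("S1", "EGFR", "A", "C"),   # EGFR is outside the gene set below
    ("S2", "TP53", "C", "T"),
    ("S2", "BRAF", "A", "C"),
    ("S3", "TP53", "G", "A"),   # S3 is outside the sample set below
]

gene_set = {"TP53", "KRAS", "BRAF"}   # stands in for a selected GeneSet
sample_set = {"S1", "S2"}             # stands in for a selected SampleSet


def mutation_burden(records, genes, samples):
    """Count mutations per sample, broken down by nucleotide change."""
    burden = defaultdict(Counter)
    for sample, gene, ref, alt in records:
        if sample in samples and gene in genes:
            burden[sample][f"{ref}->{alt}"] += 1
    return burden


burden = mutation_burden(mutations, gene_set, sample_set)
for sample in sorted(burden):
    total = sum(burden[sample].values())
    print(sample, dict(burden[sample]), "total:", total)
```

Each sample ends up with one count per nucleotide change plus a total, which is the shape of table that the variable view below is drawn from.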

You can further summarize the data by downloading this TotalMutationBurdenByNTchange table to Array Studio's local analysis. For example, you can add a variable view to better visualize mutation burden across samples:

Mutation burden variable view. Y-axis represents mutation number. X-axis represents different samples.


2. Calculate average mutation burden in each tumor in TCGALand

Using local analysis functions, you can further research mutation burden in each tumor in Land data. The Summarize function allows users to calculate the mean mutation number grouped by tumor type, or by other preferred grouping options.

After stacking the table, you can plot another variable view to visualize the distribution of each type of nucleotide change in each tumor type:

Stack table by row to generate variable view.

In this way, we can easily tell which tumor type has the highest mutation burden.
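As a rough illustration of the two steps above (summarizing by tumor type, then stacking the table), here is a small Python sketch. It does not reproduce Array Studio's Summarize or Stack functions; the sample table, the counts, and the helper names `summarize_by_group` and `stack` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample table: tumor type plus a count for each nucleotide change.
samples = [
    {"sample": "S1", "tumor_type": "LUAD", "A->C": 4, "C->T": 10},
    {"sample": "S2", "tumor_type": "LUAD", "A->C": 6, "C->T": 14},
    {"sample": "S3", "tumor_type": "BRCA", "A->C": 1, "C->T": 3},
    {"sample": "S4", "tumor_type": "BRCA", "A->C": 3, "C->T": 5},
]
changes = ["A->C", "C->T"]


def summarize_by_group(rows, group_key, value_keys):
    """Mean of each value column within each group (here: tumor type)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)
    return {
        group: {key: mean(r[key] for r in members) for key in value_keys}
        for group, members in grouped.items()
    }


def stack(summary):
    """Turn the wide table into long (group, change, value) rows for plotting."""
    return [
        (group, change, value)
        for group, values in summary.items()
        for change, value in values.items()
    ]


summary = summarize_by_group(samples, "tumor_type", changes)
stacked = stack(summary)

# The tumor type with the largest summed mean count has the highest burden.
highest = max(summary, key=lambda group: sum(summary[group].values()))
print(stacked)
print("Highest mean mutation burden:", highest)
```

The stacked rows carry one value per tumor type and nucleotide change, which is the long format a grouped bar or variable view needs.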

To learn exactly how to perform the above analysis, please watch our OncoLand case study: Visualize "mutation burden" of each tumor in TCGALand.

[OncoLand Case Study] MTBP expression and copy number correlates with poor patient survival

Vivian Zhang

A recent paper used TCGA data to show that cancer patients with higher expression/amplified copy number of MTBP had reduced survival. Thus, their data revealed that "MTBP significantly contributes to breast cancer and is a potential novel therapeutic target in the treatment of TNBC". To evaluate MTBP expression, the paper researched mRNA data, copy number variation data and patient survival data from TCGA (The Cancer Genome Atlas) to validate the results. Omicsoft users can easily use OncoLand, and the TCGA Land to verify and explore these results in minutes. We archived a nice OncoLand case study in our wiki:


Visualize MTBP expression in tumor vs. normal breast cancer tissue

A typical case starts with visualizing the data. After selecting TCGA2015 Land, users can search for the MTBP gene in the search box. Go to Gene FPKM View to visualize gene expression. Next, filter Tumor Type to breast cancer (BRCA) and select Sample Type for grouping. This generates a plot that looks like this:

Categorize breast tumor samples by MTBP expression quartiles

The user can subset samples by MTBP expression level to test whether tumors with high expression correlate with other variables, such as survival. You can achieve this by generating an MTBP Gene FPKM custom query and labeling samples according to the query. Only tumor samples express high levels of MTBP.
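The quartile labeling can be sketched in a few lines of Python. This is only an illustration of the idea, not the Land custom query itself: the FPKM values and sample IDs are invented, and the choice of the "inclusive" quantile method is an assumption.

```python
from statistics import quantiles

# Hypothetical MTBP FPKM values for eight tumor samples (illustrative only).
mtbp_fpkm = {
    "T1": 2.0, "T2": 4.0, "T3": 6.0, "T4": 8.0,
    "T5": 10.0, "T6": 12.0, "T7": 14.0, "T8": 16.0,
}

# Three cut points split the samples into four expression quartiles.
q1, q2, q3 = quantiles(list(mtbp_fpkm.values()), n=4, method="inclusive")


def quartile_label(value):
    """Assign a quartile label based on the cut points computed above."""
    if value <= q1:
        return "Q1"
    if value <= q2:
        return "Q2"
    if value <= q3:
        return "Q3"
    return "Q4"


labels = {sample: quartile_label(value) for sample, value in mtbp_fpkm.items()}

# The lowest and highest quartiles are the cohorts typically compared for survival.
low = sorted(s for s, label in labels.items() if label == "Q1")
high = sorted(s for s, label in labels.items() if label == "Q4")
print("Low MTBP:", low)
print("High MTBP:", high)
```

The resulting low and high groups play the role of the labeled cohorts that are then compared in the survival view.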

Next, we can correlate MTBP expression with survival. As we can see, survival of high MTBP expressing samples is significantly worse than low expressing samples:

If you want to explore how amplification or deletion of a gene correlates with expression, mutation, or clinical metadata, you can perform a "custom query" on Copy Number Variation as well (please refer to our OncoLand case study wiki for details). Our results are consistent with the results from the paper, for example, MTBP was significantly elevated in breast cancer samples (Figure 1A) and those patients with elevated MTBP expression exhibited reduced survival (Figure 1B):


Create a SampleSet of Triple-Negative Breast Cancer samples

Breast cancers that are estrogen receptor-negative, progesterone receptor-negative and HER2-negative cannot be targeted by many common treatments for breast cancer, such as HER2 inhibitors like trastuzumab. However, other treatment plans are still available.

We can use TCGA clinical metadata to identify triple-negative samples. By filtering ER, PR and HER2 status from the Sample | Clinical Data | Procedure | Genetic Testing filter options, you can select the triple-negative samples and create a SampleSet using Group Sample Set From Selection:
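The filtering logic amounts to keeping only samples where all three receptor statuses are negative. A minimal Python sketch, assuming a hypothetical clinical table with "ER", "PR" and "HER2" fields and "Negative"/"Positive" values (the actual TCGA field names and value labels may differ):

```python
# Hypothetical clinical records; field names and status labels are assumptions.
clinical = [
    {"sample": "B1", "ER": "Negative", "PR": "Negative", "HER2": "Negative"},
    {"sample": "B2", "ER": "Positive", "PR": "Negative", "HER2": "Negative"},
    {"sample": "B3", "ER": "Negative", "PR": "Negative", "HER2": "Positive"},
    {"sample": "B4", "ER": "Negative", "PR": "Negative", "HER2": "Negative"},
    {"sample": "B5", "ER": "Negative", "PR": "Negative", "HER2": None},  # missing
]

RECEPTORS = ("ER", "PR", "HER2")


def is_triple_negative(record):
    """True only when every receptor status is explicitly negative."""
    return all(record.get(receptor) == "Negative" for receptor in RECEPTORS)


# Samples with a missing status are excluded rather than assumed negative.
triple_negative = sorted(r["sample"] for r in clinical if is_triple_negative(r))
print(triple_negative)
```

Treating missing values as "not triple-negative" is a conservative design choice for this sketch; a real analysis might handle them as a separate group.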

Subgroup sample expression data by multiple variables

You can quickly partition a gene's expression by each combination of multiple clinical variables, such as the receptor statuses that define triple-negative breast cancer in this case. As we can see, triple-negative samples indeed have higher MTBP expression than any other samples:

To learn how to perform the analyses, please visit our wiki page: MTBP expression and copy number correlates with poor patient survival.




Bridging Bioinformatics|Genomics|Genetics Research: 2016 Omicsoft User Group Meeting

Vivian Zhang


  • Who Attended:
    • More than 30 leading pharmaceutical and biotech companies. 
    • More than 100 attendees who are experts and scientists in the field of bioinformatics/genomics/genetics.
  • What Occurred:
    • Numerous discussions among attendees on the future of biomarker discovery, as well as best practices of data management, visualization and analysis.



Omicsoft Corporation successfully held our kick-off Omicsoft User Group Meeting in Cambridge, MA on Wednesday May 4, 2016.

We would like to thank all speakers and attendees, all of whom are extremely important in helping build out our platform successfully. We've received extremely positive feedback from the meeting, and hope to do it again in the future. Feedback on our software and services helps drive our business, and the direct interaction with our customers during the event proved invaluable to us.

Highlights from the meeting:

  • Introduction of GeneticsLand for management of genetics data
  • Introduction to the future SingleCell Land
  • Overview on curation processes
  • Updates on current data subscription Lands

For more details, please visit our 2016 User Group Meeting webpage.


Above is just a glimpse of some exciting moments at our meeting. If you missed the meeting, we have uploaded our speaker presentations and videos to our 2016 User Group Meeting webpage.

If you have any questions regarding the meeting, please contact us.


[Land Update] Brief Introduction of TumorMutation and OncoGEO in OncoLand

Vivian Zhang

In this blog, we would like to introduce two recently updated Lands in OncoLand: TumorMutation2015 and OncoGEO2015.

TumorMutationLand is a collection of mutation and copy number tumor data from more than 2400 samples. The data are from important publications that are not included in other Lands.

TumorMutation2015 Land Data Availability (Partial list).


TumorMutation2015 Land Sample Distribution.


OncoGEO currently has over 1200 RNA-Seq samples from GEO and the Sequence Read Archive (SRA). It will serve as the future home for “comparison” data from GEO (similar to data provided in ImmunoLand comparing Disease vs Normal, Treated vs Control, etc.).

OncoGEO2015 Land Data Availability.


OncoGEO2015 Land Sample Distribution.

[Feature Review] Analyze "Land" Genomic Data with the R API

Vivian Zhang

Omicsoft's current Lands, OncoLand and ImmunoLand, provide users with pre-configured data content covering a variety of genomic data types. The rich visualization capacity and customizable filters empower Lands to be comprehensive OMIC data hubs. Many of our customers process their own in-house data into our Land format, either themselves or by contracting us to run the datasets for them (we do everything from RNA-Seq to WGS data, so contact us if you would like to contract out some data processing).

That said, in the world of research, there are always talented and creative bioinformaticians who would like to explore the data in their own way. Our Land R API function provides a way to query Land data using R.

The R API function uses Oshell API functions to connect to ArrayServer and run the Land Text Dump function on a list of genes and/or a list of samples. Users can run additional analyses on the Land dump data in R and create more data visualizations, customized to their liking. Once the data is in R, the analyst has the world of Bioconductor and more with which to work. Enjoy!

Example 1: Scatter plot of expression value vs. CN log2ratios within specific genes and samples

An example to draw scatter plot of gene expression vs. CN log2ratios for genes MDM, BRAF, EGFR, and FGF12


Example 2: Scatter plot of expression value vs. CN log2ratios with full sample meta data

[GENOMIC RESEARCH] Mutation Analysis with improved mutation annotation system

Vivian Zhang

Identifying disease-associated gene mutations is an important part of genetic disease research and of the design of targeted drugs. To accelerate gene mutation analysis, Omicsoft's Land database provides rich mutation visualization views, including mutation and somatic mutation site distribution, mutation landscape, and mutation genome browser. Omicsoft's recent improvement to its mutation annotation system allows users to annotate and filter mutations based on hundreds of criteria, including mutation confidence, position, gene information, functional mutation, eQTL information, regulation information, protein information, and clinical information, using publicly available databases as the source for annotation.

Land mutation annotation directory

A simple example is to compare the mutation distributions with and without synonymous mutation included:

Figure: TP53 mutation distribution with synonymous mutation included


Figure: TP53 mutation distribution with synonymous mutation excluded

With the improved annotation system, the user can also filter mutations and identify mutations associated with clinical phenotypes, using database resources like ClinVar.

Additional databases include SIFT, PolyPhen, 1000 Genomes, ExAC, ESP6500, GTEx eQTL, RegulomeDB, and InterPro Domain, and the system is designed to support many other databases or customers' internal annotation systems as well.

Figure: TP53 mutation landscape in Li Fraumeni Syndrome samples


Creating your own Lands for integration with OncoLand or ImmunoLand

Matt Newman

Free Land Creation

While many of our users are aware of the OncoLand and ImmunoLand datasets, not everyone might be aware of how easy it is to create your own Lands, and further integrate these with the public Lands (for instance with TCGA).

Omicsoft provides easy-to-use command line tools that can be used to import your own mutation data, copy number data, and RNA-Seq data into a Land created specifically for you or your dataset. These can then easily be combined with "virtual" lands to create a Land that allows visualization and querying of your data side-by-side with the public data. In order to do so, curation is the key requirement, as you must choose two columns for integration. In most cases, for OncoLand-based Lands, this will be Tumor Type and Sample Type (Primary Tumor, Normal, etc.), and for ImmunoLand this might be DiseaseState and Tissue.

These tools are available for free with your subscription to either OncoLand or ImmunoLand. If you'd like to try building the Lands yourself, contact us with any questions on getting started.

Paid Land Creation

Many of our users prefer to have Omicsoft do the Land creation, including processing of their data through our pipelines (using either Omicsoft resources or the customer's resources via VPN access). This can be a way to get the benefit of internal Land creation without having to invest any time in gaining expertise on how the process works. If you're interested in seeing how we can help process your data, be it WGS, WXS, Targeted Sequencing, RNA-Seq, or more, contact us.

Finding Association to Clinical Variables

Matt Newman

Association of clinical variables in Cancer Genomics or Immunology

One question we've had come up many times, whether it's in the context of cancer genomics with the TCGA dataset in OncoLand, or in the context of ImmunoLand and project-specific clinical variables, or even for your own datasets where you have many clinical parameters, is how to quickly scan all clinical variables, based on some prescribed grouping, and find the variables that are most significantly associated with that grouping.

Imagine we have a population of samples (let's say patients with Colon Adenocarcinoma) and we'd like to know - what clinical variables in the TCGA dataset correlate with that status?  For instance, what clinical variables correlate with BRAF V600E mutation status in these samples?

I'm pleased to announce that we now have that ability, using the new Group Association view, available at the top level of every Land.

Microsatellite Instability (MSI) found to correlate with BRAF V600E mutation status in colon adenocarcinoma samples.


As you can see from the screenshot, MSI status (Microsatellite instability) correlates with BRAF V600E status in Colon Adenocarcinoma samples.  A quick search of the literature finds similar conclusions:

A search for mutations in BRAF confirms the larger population of MSI-H samples vs other MSI status types in colon adenocarcinoma.


With thousands of clinical variables available in TCGA and other datasets, this new functionality opens up the data mining possibilities for users interested in looking at clinical data side-by-side with OMIC data.
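To make the idea of "association with a grouping" concrete, here is a small Python sketch of one common way to test whether a categorical clinical variable is associated with a two-level grouping: a chi-square test of independence on a 2x2 table. The post does not state which statistic the Group Association view uses, so this test is an assumption chosen for illustration, and the counts are invented.

```python
import math
from collections import Counter

# Hypothetical (BRAF V600E status, MSI status) pairs, one per sample.
pairs = (
    [("mutant", "MSI-H")] * 8
    + [("mutant", "other")] * 2
    + [("wildtype", "MSI-H")] * 2
    + [("wildtype", "other")] * 8
)


def chi_square_2x2(observations):
    """Chi-square test of independence for two binary variables (1 degree of freedom)."""
    counts = Counter(observations)
    rows = sorted({row for row, _ in observations})
    cols = sorted({col for _, col in observations})
    if len(rows) != 2 or len(cols) != 2:
        raise ValueError("this sketch only handles 2x2 tables")
    n = len(observations)
    chi2 = 0.0
    for row in rows:
        for col in cols:
            row_total = sum(counts[(row, c)] for c in cols)
            col_total = sum(counts[(r, col)] for r in rows)
            expected = row_total * col_total / n
            chi2 += (counts[(row, col)] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_square_2x2(pairs)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Scanning all clinical variables would mean repeating a test like this for each variable against the same grouping and ranking the variables by p-value; numeric variables would need a different test.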

[ONCOLOGY RESEARCH] OncoLand helps to quickly confirm functional gene fusion events from Cell Reports article

Vivian Zhang

Typical molecular and cell biology research takes years and a significant amount of money, from design and grant applications to experiments, validation and reporting. A significant problem is that the results are often unexpected or not applicable to real-world applications. The return on investment of research can be problematic and hard to justify (Science economics: What science is really worth).

Luckily, at least in cancer research, there is an increasing number of public datasets available each year. With the large amount of public data, researchers can potentially save time and money in the discovery of a candidate gene target. If you are a biologist, a bioinformatician, a PI, or an R&D project leader or decision maker doing cancer research at a pharmaceutical company, biotech company or academic institute, OncoLand can help.

In five minutes, OncoLand can help you identify recurrent fusion genes in gastric cancer, which recently took some effort to identify in a Cell Reports paper (1).

Figure 1. Characteristics of Somatic SVs Identified by DNA-PET in GC. (A) SV filtering procedure for GC patient 125 is shown. Yao et al. 2015

In the article, Yao et al. identified recurrent fusions between CLDN18, a tight junction gene, and ARHGAP26, a gene encoding a RHOA inhibitor. According to the article, "CLDN18-ARHGAP26 negatively affects cell-cell and cell-matrix interactions and epithelial barrier function, thereby potentially contributing to gastritis, a known risk factor for GC." To verify the discovery of CLDN18-ARHGAP26 fusion, the user can use TCGA (The Cancer Genome Atlas) Land data.

First, the user can search for the genes CLDN18 and ARHGAP26 using the multiple gene search in TCGA Land. By clicking on RNA-Seq Fusion Details, the user can check the list of fusions for the two genes. Interestingly, all the fusions identified are from Stomach Adenocarcinoma (STAD) samples, consistent with the article, where the authors identified the fusion gene in gastric cancer.

The user can check out fusion site frequency and RPKM under the RNA-Seq Fusion tab.

Furthermore, the user can also use our Genome Browser to visualize the reads of the fusion genes. The Land tab links to the browser through Browse Selected Samples. Our browser will automatically show a multi-panel view, splitting the browser into two panels for the two fusion genes:


(1) Yao, Fei, et al. "Recurrent Fusion Genes in Gastric Cancer: CLDN18-ARHGAP26 Induces Loss of Epithelial Integrity." Cell Reports (2015).

New cancer genomics datasets (TCGA and more) with OncoLand's 2015 Q2 Release

Vivian Zhang

PR News Release

Omicsoft Corporation, an industry leader in cancer genomics, bioinformatics, and next generation sequencing storage and analysis, today announced their quarterly release of their OncoLand data service. Watch the Land 2015 Q2 Release Webinar

Omicsoft Corporation provides a data service and oncology database platform, OncoLand, that focuses on management of both public and customer cancer datasets, including clinical, next generation sequencing, gene expression, copy number, protein, and methylation data. 

In the 2015 Q2 release, the biggest Land update since its introduction in 2013, Omicsoft highlights the following updates:

  • Introduction of a new clinical subsystem
  • 10+ Patient centric views now available
  • Introduction of new lands, such as GenentechCellLine, and more than 10,000 new sample data (requires controlled access to Genentech Cell Line study published in Nature in 2014)
  • Feature updates:
    1. Dynamic correlation (among RNA-Seq, Mutation, CNV and protein expression data)
    2. Viral and bacterial data integration
    3. Sample centric views
    4. Multiple grouping
    5. Geneset improvement
    6. Land Audit Trail
    7. "Missing data" visualizations

For more details about the feature updates, please watch the Land 2015 Q2 Release Webinar.

Along with the OncoLand release, Omicsoft is pleased to introduce the next release of ImmunoLand. ImmunoLand incorporates public immunology data in disease areas including Arthritis, Asthma, COPD, IBD (Ulcerative Colitis, Crohn’s disease), Lupus, Psoriasis and other skin diseases, Infectious diseases and vaccine, and Neuroimmuno-diseases (multiple sclerosis and more).

New users, please contact us for a free trial and consultation. Existing users, please contact us for more details.




OncoLand and ImmunoLand Update Webinar (Updates to TCGA and GEO Datasets)

Matt Newman

Jack Liu, President of Omicsoft, will discuss and demonstrate live some of our biggest updates to the “Land” technology since its introduction in 2013. This includes framework updates, feature updates, as well as data updates to both OncoLand and ImmunoLand. 

Major framework updates: 

(1) deep clinical integration with the user interface and analytics with carefully curated TCGA clinical data 
(2) dynamic correlation framework providing instant integration capabilities across different data types and genes 
(3) much improved comparison support greatly enhancing our ImmunoLand subscription and future OncoLand subscriptions 
(4) integration with our breakthrough variant annotation system. 

Also included are the following features/data updates: 
(1) Built-in metagenomics integration (viral, bacterial) for TCGA, GTEx and a few other “Lands” 
(2) Sample specific views (expression, CNV and mutation) 
(3) Grouping/Profiling visualization support for multiple columns 
(4) Various “GeneSet” view improvements, including performance improvement 
(5) Cross-tissue normalized protein data: RPPA-RBN 
(6) New “Group Summary” views for powerful exporting of statistics behind the boxplots, heatmaps, etc. 
(7) Thousands of new NGS and MicroArray samples for ImmunoLand with additional disease coverage 
(8) Various data updates (TCGA, ICGC, GenentechCellLine, etc.) for OncoLand 

We will also present some of our near-future projects, including HLA typing and dynamic cohort support. The Land updates will be made available to our customers on July 17th.

We'll also update our blog in the coming days with direct links to the recorded webinar, and stay tuned for blog posts showing some of the newly released features and datasets.

[Oncology Research] Oncovirus: Tumor-Virus Association

Vivian Zhang

Tumor-virus associations are known in many types of cancer, including Human papillomavirus (HPV) in head-and-neck squamous cell carcinoma, Hepatitis B virus (HBV) in hepatocellular carcinoma and Epstein-Barr virus (EBV) in gastric carcinoma tumors (1,2). Infection with the hepatitis B virus has been linked to the development of hepatocellular carcinoma. HBV-induced chronic active hepatitis (CAH) and cirrhosis are important risk factors in liver carcinogenesis (3). The detection capacity and sensitivity of RNA-Seq allow researchers to study this association across the human genome.

The Cancer Genome Atlas (TCGA) provides viral sequences generated by RNA-Seq. OncoLand, with its integration analysis capacity among mutation, gene expression and structural variation data, provides easy and fast approaches to examine the virus-gene expression association in tumor samples.

OncoLand provides virus sequence count data from more than 4000 GenBank IDs that were generated by TCGA. In OncoLand, it’s easy to show viral count (expression) per sample. It is clear that Hepatitis B virus is highly expressed in Liver Hepatocellular Carcinoma (LIHC):

To investigate the association between a single gene and Hepatitis B virus, OncoLand enables users to create cohorts with user-defined high and low virus counts through SampleSet => Group SampleSet from Selection:

With cohorts annotated by high and low virus count, Integration Analysis provides a Kruskal–Wallis test to compare gene expression between the high and low viral cohorts across the rest of the genome. Users can examine one individual gene, a list of molecular signatures of interest, or all genes by specifying a Gene Set:
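For readers who want to see what such a comparison computes, below is a minimal Python sketch of the Kruskal–Wallis H test for the two-cohort case (high vs. low viral count), using only the standard library. It is not OncoLand's Integration Analysis code; the gene names and expression values are hypothetical, and the p-value formula used here is valid only for two groups (one degree of freedom).

```python
import math


def average_ranks(values):
    """Return 1-based ranks (ties get their average rank) and the size of each tie group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        tie_sizes.append(end - pos + 1)
        pos = end + 1
    return ranks, tie_sizes


def kruskal_wallis_two_groups(high, low):
    """Kruskal-Wallis H statistic and p-value for two cohorts (1 degree of freedom)."""
    pooled = list(high) + list(low)
    n = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    rank_sum_high = sum(ranks[: len(high)])
    rank_sum_low = sum(ranks[len(high):])
    h = 12 / (n * (n + 1)) * (
        rank_sum_high ** 2 / len(high) + rank_sum_low ** 2 / len(low)
    ) - 3 * (n + 1)
    # Tie correction; this divides by zero if every pooled value is identical.
    h /= 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    h = max(h, 0.0)  # guard against tiny negative values from rounding
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(h / 2)).
    return h, math.erfc(math.sqrt(h / 2))


# Hypothetical expression values per gene in the high and low viral-count cohorts.
expression = {
    "GENE_X": {"high": [8.1, 9.4, 7.7, 10.2], "low": [3.2, 4.1, 2.8, 5.0]},
    "GENE_Y": {"high": [5.0, 6.1, 4.8, 5.5], "low": [5.2, 4.9, 6.0, 5.4]},
}

results = {
    gene: kruskal_wallis_two_groups(groups["high"], groups["low"])
    for gene, groups in expression.items()
}
for gene, (h, p) in results.items():
    print(f"{gene}: H = {h:.3f}, p = {p:.4f}")
```

Running the test gene by gene over a Gene Set, as in the loop above, yields one statistic and p-value per gene; a genome-wide run would normally also need multiple-testing correction, which this sketch omits.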

The output includes the test statistics along with gene expression levels in the different cohorts:

The user can export the results to Array Studio for further analysis, focus on a gene or genes of interest for further validation of the association, or export the results to Excel for reporting. Tumor-related genes could be good candidates for explaining the molecular mechanisms of viral oncology and for cancer drug discovery. 
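For readers who want to see what such a comparison computes, here is a minimal Python sketch of a two-cohort Kruskal-Wallis test on a single gene. It is an illustration under stated assumptions, not OncoLand's implementation: the function names and the toy expression values are hypothetical, and the p-value uses the usual chi-square approximation with one degree of freedom, which is only rough for cohorts this small.

```python
import math

def rank_with_ties(values):
    """Return 1-based average ranks and the size of each group of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def kruskal_wallis_two_groups(high, low):
    """Kruskal-Wallis H statistic and approximate p-value for two cohorts."""
    pooled = list(high) + list(low)
    n = len(pooled)
    ranks, tie_sizes = rank_with_ties(pooled)
    rank_sum_high = sum(ranks[:len(high)])
    rank_sum_low = sum(ranks[len(high):])
    h = (12.0 / (n * (n + 1))
         * (rank_sum_high ** 2 / len(high) + rank_sum_low ** 2 / len(low))
         - 3 * (n + 1))
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:  # every value is identical, so there is nothing to test
        return 0.0, 1.0
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function with 1 degree of freedom: erfc(sqrt(h / 2)).
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value

# Hypothetical expression values of one gene in the two viral-count cohorts.
high_virus = [5.0, 6.0, 7.0, 8.0]
low_virus = [1.0, 2.0, 3.0, 4.0]
h_stat, p_val = kruskal_wallis_two_groups(high_virus, low_virus)
print(f"H = {h_stat:.3f}, p = {p_val:.4f}")
```

Running the same function once per gene in a Gene Set gives a genome-wide table of statistics; in that setting a multiple-testing correction would normally be applied to the p-values.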




1. Khoury, Joseph D., et al. The landscape of DNA virus associations across human malignant cancers using RNA-Seq: an analysis of 3775 cases. Journal of Virology (2013): JVI-00340.

2. Di Bisceglie, Adrian M. Hepatitis B and hepatocellular carcinoma. Hepatology 49.S5 (2009): S56-S60.

3. Kremsdorf, D., et al. Hepatitis B virus-related hepatocellular carcinoma: paradigms for viral-related human carcinogenesis. Oncogene 25.27 (2006): 3823-3833.

[Feature Review] Dynamic Correlation among Gene, Structural Variation and Protein

Vivian Zhang

Cancer is a complex disease, and like other complex diseases, changes in gene expression and structural variation correlate with each other and together play an integrated role in the development of cancer. Understanding the correlation among gene expression, structural variation and protein expression is indispensable in oncology research. 

Figure 1. Relationships between NRC signature genes and their driver-mutating genes in the protein interaction network. Jie Li et al. 2010

OncoLand provides dynamic correlation visualization for RNA-Seq, miRNA-Seq, somatic mutation, copy number variation and protein RPPA data.

Take ESR1 (estrogen receptor 1) as an example. The RNA-Seq Expression=> RNA-Seq Expression view provides the correlation and scatterplot view of ESR1 expression against all other genes:

In OncoLand, you can filter on any criteria of interest, such as tumor type, sample metadata, clinical subpopulation and more.

For instance, if you are interested in Estrogen Receptor positive samples in primary breast cancer, just apply that filter:

The correlation and scatterplot view will dynamically change with the filter criteria:
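As a rough illustration of what recomputing a correlation under a sample filter involves, the Python sketch below correlates a target gene with every other gene using only the samples that pass a filter. Everything here is an assumption made for illustration: the data layout, the `correlate_with_target` helper, the gene and sample names, the toy values, and the choice of Pearson correlation (the post does not say which correlation measure OncoLand uses).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n < 2:
        return float("nan")  # not enough samples to define a correlation
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)

def correlate_with_target(expression, samples, target, keep):
    """Correlate `target` with every other gene over samples where keep(sample) is true.

    expression: dict mapping gene name -> list of values, aligned with `samples`
    samples:    list of dicts holding per-sample metadata
    """
    kept = [i for i, sample in enumerate(samples) if keep(sample)]
    target_values = [expression[target][i] for i in kept]
    result = {}
    for gene, values in expression.items():
        if gene != target:
            result[gene] = pearson(target_values, [values[i] for i in kept])
    return result

# Hypothetical toy data: five samples, one of them ER negative.
samples = [
    {"id": "S1", "er_status": "Positive"},
    {"id": "S2", "er_status": "Positive"},
    {"id": "S3", "er_status": "Positive"},
    {"id": "S4", "er_status": "Positive"},
    {"id": "S5", "er_status": "Negative"},
]
expression = {
    "ESR1": [1.0, 2.0, 3.0, 4.0, 10.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0, 0.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0, 5.0],
}

all_samples = correlate_with_target(expression, samples, "ESR1", lambda s: True)
er_positive = correlate_with_target(
    expression, samples, "ESR1", lambda s: s["er_status"] == "Positive"
)
print(all_samples)
print(er_positive)
```

In this toy data the filter changes the picture substantially: GENE_A is negatively correlated with ESR1 across all samples but perfectly positively correlated within the ER positive subset, which is the kind of shift a filtered view is meant to expose.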


Li, Jie, et al. "Identification of high-quality cancer prognostic markers and metastasis network modules." Nature communications 1 (2010): 34.

[Feature Review] Add Virtual Genome File from HTTP to Genome Browser

Vivian Zhang

The increasing number of available public datasets, such as those from the National Cancer Institute (NCI), Gene Expression Omnibus (GEO), Catalogue of Somatic Mutations in Cancer (COSMIC) and The Encyclopedia of DNA Elements (ENCODE) Consortium, makes it much easier for researchers to conduct biomedical and clinical research.

Want to check a sample’s genome for mutations or gene expression? There could be VCF, BigWig files and BAM files available from a variety of sources.

The number of studies based on whole genome sequencing is skyrocketing, because Next Generation Sequencing technology has driven down sequencing costs. However, sequencing files are large, and downloading them, especially publicly available BAM files, can take hours or even days.

In Array Studio, Genome Browser allows users to Add Track from URL:

This allows users to add sorted BAM files, sorted VCF files, BAS files, BigBed and BigWig files and more directly from a URL. Once a track is added, the Genome Browser takes only a few seconds to load. 
For more details, please read our wiki page:

If you are an OncoLand user, you may want to check out your sample of interest in the Land. We have already downloaded and preprocessed thousands of samples, and we include genome browser-ready source files for most RNA-Seq data. If you are not a Land user, contact us for a free demo!

Array Studio and Land: integral in helping you save time and accelerate your research discovery. 

[Newly implemented features] Show Query Status: Audit Your Queries in Land

Vivian Zhang

Omicsoft has rapidly expanded its Land (OncoLand and ImmunoLand) datasets. With sample and patient information becoming increasingly important, it is crucial to be able to navigate and filter sample information in order to fine-tune the data visualizations. We aim to provide extremely detailed sample and clinical information.

With the powerful sample query ability and hundreds of clinical annotation items available, it can be hard to remember and trace back all of your customized filters and queries.

To help navigate through the sample and clinical information jungle, Omicsoft has added a new feature: Show Query Status. 

With this feature, you don’t need to worry about being interrupted while navigating the data and forgetting which filters you applied. The Show Query Status function records them all for you. You can use this information when creating presentations, taking notes for your research, or simply to confirm which filters are applied to your current search query.

Navigate through the jungle faster and replicate your success path! 

UTR Trimming improves transcript-level quantification and RPKM accuracy in RNA-Seq analysis

Jason Lu

RPKM (Reads Per Kilobase of transcript per Million mapped reads) is a normalization method widely used in RNA-Seq analysis. The calculation of RPKM takes transcript length as input, which is usually the length of a 'virtual' transcript annotated in a gene model. With a chosen gene model (e.g. ENSEMBL or UCSC), transcript lengths are considered fixed, i.e. independent of the sample data. In reality, it is well known that the length of a specific transcript can vary between samples due to events such as exon skipping, alternative transcription start sites and alternative 3' polyadenylation. For example, it has been reported that cancer cell lines often express mRNA isoforms with shorter 3′ untranslated regions (UTRs) [Mayr 2009].

In the latest Array Studio, Omicsoft has introduced a new option, automatic UTR trimming, which is a data-driven method for dynamically redefining transcript lengths. Briefly, for each gene, we first calculate the read coverage across the gene. We then identify the left and right boundaries, defined as the positions that have 8% of the maximum coverage of the left/right gene UTR regions. The two boundary positions are the same for ALL transcripts of the gene. Then, for each transcript, we calculate the trimmed exon length at each end; the two trimming positions have to fall in the UTR region AND in the first/last exon of that specific transcript. Here, we use examples to demonstrate that applying this UTR trimming option can improve the accuracy of RPKM calculation as well as transcript-level quantification.
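To make the description above concrete, here is a small Python sketch of the standard RPKM formula together with one possible reading of the gene-level boundary rule. It is not Array Studio's code. The function names, the per-base coverage layout and the toy numbers are hypothetical, and the interpretation of the 8% rule (keep the outermost UTR position whose coverage reaches 8% of that UTR's maximum coverage) is an assumption. The per-transcript step, which additionally requires the trimming positions to fall within the first/last exon, is not modeled.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)

def trim_boundaries(coverage, left_utr_end, right_utr_start, fraction=0.08):
    """Return (left, right) inclusive indices of the span kept after trimming.

    coverage:        per-base read depth along the gene, index 0 = leftmost base
    left_utr_end:    the left UTR is coverage[:left_utr_end]
    right_utr_start: the right UTR is coverage[right_utr_start:]
    fraction:        threshold as a fraction of each UTR's maximum coverage
    """
    left_utr = coverage[:left_utr_end]
    right_utr = coverage[right_utr_start:]

    left = 0  # assumption: keep the annotated boundary if the UTR has no coverage
    if left_utr and max(left_utr) > 0:
        threshold = fraction * max(left_utr)
        # first (outermost) left-UTR position that reaches the threshold
        left = next(i for i, depth in enumerate(left_utr) if depth >= threshold)

    right = len(coverage) - 1
    if right_utr and max(right_utr) > 0:
        threshold = fraction * max(right_utr)
        # last (outermost) right-UTR position that reaches the threshold
        right = right_utr_start + max(
            i for i, depth in enumerate(right_utr) if depth >= threshold
        )
    return left, right

# Hypothetical 12-base "gene": UTRs at both ends are mostly unexpressed.
coverage = [0, 0, 1, 50, 100, 100, 100, 60, 2, 0, 0, 0]
left, right = trim_boundaries(coverage, left_utr_end=4, right_utr_start=7)
annotated_length = len(coverage)
trimmed_length = right - left + 1

before = rpkm(read_count=100, length_bp=annotated_length, total_mapped_reads=10**6)
after = rpkm(read_count=100, length_bp=trimmed_length, total_mapped_reads=10**6)
print(left, right, trimmed_length, after / before)
```

Because the read count and library size are unchanged, the RPKM scales by the ratio of annotated to trimmed length, which is why removing unexpressed UTR sequence raises the reported value.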

We first examine the expression of the gene UBA52, using the TCGA sample TCGA-24-1846-01A-01R-1567-13 as an example. With the UCSC gene model, the RPKM values for the transcripts uc002njr and uc002njs are 119.6 and 194.5 respectively. After applying the new UTR trimming option, the RPKMs for the two transcripts become 610.5 and 1120, an increase of more than five-fold for both transcripts. Such dramatic changes in RPKM are largely due to the fact that only a small proportion of the long annotated 3' UTR region is actually expressed, which can be clearly seen in the read coverage (see Figure below). Stripping the unexpressed 5' and 3' UTR regions leads to more accurate transcript lengths than the original annotation, which significantly improves the RPKM calculation.
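The reported values can be turned into a quick back-of-the-envelope check. Since RPKM is inversely proportional to transcript length, the ratio of trimmed to untrimmed RPKM approximates how much the effective length shrank, under the simplifying assumption (ours, not stated in the post) that the reads assigned to each transcript and the total mapped reads stay the same. The helper name below is hypothetical.

```python
# RPKM values quoted above for UBA52: (before trimming, after trimming).
reported = {
    "uc002njr": (119.6, 610.5),
    "uc002njs": (194.5, 1120.0),
}

def implied_length_ratio(rpkm_before, rpkm_after):
    """Annotated length divided by trimmed length, assuming unchanged read counts."""
    return rpkm_after / rpkm_before

fold = {name: implied_length_ratio(b, a) for name, (b, a) in reported.items()}
for name, ratio in fold.items():
    print(f"{name}: RPKM rose {ratio:.2f}-fold")
```

Under that assumption, the effective transcript lengths after trimming would be roughly one fifth to one sixth of the annotated lengths.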


Besides improving the RPKM calculation, UTR trimming can also benefit quantification by RSEM, a de facto standard algorithm used by both TCGA and Omicsoft OncoLand for transcript-level quantification. Here we take RPL28 as an example. This gene has 5 transcript isoforms, and the predominant form reported for the same TCGA sample as above is uc010yga. But if we examine the read coverage for this particular gene, there is clearly an issue - the transcript uc010yga has no or minimal coverage in its last exon (exon 5) (see Figure below). Furthermore, the exon junction data do not support uc010yga being the predominant transcript (data not shown). In contrast, if we apply the UTR trimming option, the most abundant transcript for this gene changes to uc002qkv, which matches well with the read coverage in the browser. Again, the true UTR region of this gene is significantly shorter than what has been annotated, which leads to bias in RSEM quantification. UTR trimming can remedy the problem and allows the truly abundant transcripts to be identified.

Another example is the gene IGBP1. The real 5' UTR region in this TCGA sample is much shorter than the region annotated in the UCSC gene model. RSEM assigns the majority of reads to the transcript uc004dxw, which is clearly questionable because no junction reads are observed between the first and second exons of uc004dxw (see Figure below, including junction reads). With UTR trimming, RSEM correctly identifies the true predominant transcript, uc004dxv, which receives more than 99% of the total reads mapped to this gene.