Genomics – SciServer

Recent advances in sequencing technology have led to explosive growth in the amount of publicly available human sequence data. This represents an invaluable opportunity for scientific exploration and discovery, but the sheer quantity of data presents a major challenge for analysis.

For instance, the 1000 Genomes Project sequences a large number of human genomes to provide a comprehensive resource on human genetic variation. The data is publicly available, but because of its size it has been practically inaccessible to contemporary sequence search methods.

The following genomics projects have used SciServer to overcome obstacles in big-data analysis:

Recount2

Recount2 provides processed and summarized expression data for over 70,000 human RNA-seq samples from the Sequence Read Archive (SRA), The Cancer Genome Atlas (TCGA), and the Genotype-Tissue Expression (GTEx) project (https://doi.org/10.1038/nbt.3838).

The associated Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of metadata and phenotype data, the expression levels of genes and their underlying exons and splice junctions, and the corresponding genomic annotation. By handling several preprocessing steps and combining many datasets into one easily accessible website, Recount2 makes finding and analyzing RNA-seq data considerably more straightforward.
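To make the shape of a processed study more concrete, the following is a minimal Python sketch of how its parts (phenotype data, gene-level expression, and genomic annotation) might be held together and queried. It is an illustration only and is not the Bioconductor package's API: the class `ProcessedStudy`, its field and method names, the accession `STUDY_0001`, and all values are hypothetical and invented for this example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class ProcessedStudy:
    """Hypothetical in-memory layout for one processed study.

    Illustration only; this does not mirror the Bioconductor package's API.
    """
    accession: str                            # study identifier (made up here)
    phenotype: Dict[str, Dict[str, str]]      # sample ID -> phenotype fields
    gene_counts: Dict[str, Dict[str, float]]  # gene ID -> {sample ID: expression level}
    annotation: Dict[str, str] = field(default_factory=dict)  # gene ID -> genomic location

    def samples_where(self, key: str, value: str) -> List[str]:
        """Return the sorted IDs of samples whose phenotype field `key` equals `value`."""
        return sorted(s for s, p in self.phenotype.items() if p.get(key) == value)

    def mean_expression(self, gene_id: str, samples: List[str]) -> float:
        """Average the expression level of one gene over the given samples."""
        counts = self.gene_counts[gene_id]
        return mean(counts[s] for s in samples)


# Toy values, invented purely for demonstration.
study = ProcessedStudy(
    accession="STUDY_0001",
    phenotype={
        "S1": {"tissue": "liver"},
        "S2": {"tissue": "liver"},
        "S3": {"tissue": "brain"},
    },
    gene_counts={"GENE_A": {"S1": 10.0, "S2": 30.0, "S3": 5.0}},
    annotation={"GENE_A": "chr1:1000-2000"},
)

# Select samples by phenotype, then summarize one gene across them.
liver = study.samples_where("tissue", "liver")
liver_mean = study.mean_expression("GENE_A", liver)
print(liver, liver_mean)
```

Exon-level and splice-junction-level expression could be stored in the same way, as additional mappings keyed by feature ID. Keeping phenotype data next to the expression values is what lets a query select samples first and summarize expression second.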

Past Projects

Terabase Search Engine (TSE)

The Terabase Search Engine (TSE) project set out to develop novel software and databases that allow users to search, retrieve, and re-analyze the raw data underlying thousands of human genomes. SciServer was to provide the building blocks for the project's query and data management framework. New tools built on top of SciServer were intended to make this rich resource available to the research community for exploration and discovery.