Diagram showing the science domains and projects supported by SciServer.

SciServer comprises several scientific projects that have made their full datasets available in a common format and through a common set of interfaces. Many other projects will soon make their data available here as well.

The diagram shows SciServer’s projects, both those that currently make their datasets available through the SciServer environment and those that will in the future. Green ellipses mark specific projects, and red circles show those projects’ databases. Blue ellipses map out which science domains these projects address.

The list below gives the science domains addressed by SciServer’s current and future projects.

We also maintain a growing list of SciServer publications.

With huge collections of observational data offering insights into the structure and history of the Universe, astronomy has long been at the forefront of data-intensive science. SciServer will carry that tradition forward by opening up new ways of working with astronomy data.

The Science

The SciServer framework grew out of SkyServer, a website first created in 2001 for the Sloan Digital Sky Survey (SDSS). The still-ongoing SDSS uses a telescope in New Mexico to take images of nearly half a billion stars and galaxies, resulting in a high-resolution map of the Universe. This map is the largest and most detailed ever created, and has led to discoveries revolutionizing nearly all areas of astronomy.

To make the high-quality SDSS dataset available to all, we created the SkyServer website, which offers the entire SDSS dataset, free of charge, to researchers, educators, and the public. SkyServer features a set of easy-to-use tools for browsing and searching SDSS data, including a free-form SQL query interface that lets users ask an unlimited variety of questions about the data. Our goal was to train and guide users toward more powerful and flexible interfaces for working with the data.

Our work with SkyServer led us to create CasJobs, a batch system that allows researchers to perform highly complex searches that return millions of sky objects. We also helped to develop the Galaxy Zoo citizen science website, in which hundreds of thousands of online volunteers have contributed millions of data points to more than 40 published science papers. Our approach has been adapted by other astronomy projects such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) and the Galaxy Evolution Explorer (GALEX).

The Big Data Problem

A galaxy as seen by the Sloan Digital Sky Survey
The story of the SDSS clearly illustrates the challenges of modern astronomy. Before the survey began, astronomers had digital data for about 200,000 galaxies. Today, largely because of the SDSS, that number is more than 200 million. More data mean more opportunities for knowledge, but this "data avalanche" can overwhelm researchers. Exponential growth in data volumes means:
  • Challenges in moving data "off the mountain"
  • Challenges in storing data
  • Challenges in accessing data as data sizes make downloading impossible
All of these challenges require new solutions that can be applied across all scientific fields. Astronomy is an especially good test case for e-Science methods, for several reasons:
  • Astronomical measurements are highly complex, with many different data types and formats
  • Astronomers measure many parameters; SDSS photometry alone contains more than 200 measured quantities for each star or galaxy
  • The sky is free to everyone; astronomy data are free of any legal or contractual requirements for anonymity or confidentiality
  • Astronomy asks many fascinating questions about the history and nature of the universe, and answering them requires new big data techniques

SciServer Use Cases

SciServer will pick up where SkyServer and CasJobs left off, building a new set of tools to accomplish the following goals.
SDSS Unification
A map of the universe as seen by the SDSS. Each dot is a galaxy.
Although the existing SkyServer website provides easy access to all SDSS catalog data, there are other SDSS datasets that were previously unavailable there. In particular, SDSS raw data files in FITS format were hosted separately at Fermi National Accelerator Laboratory. Furthermore, the SDSS identity is highly fragmented, with four phases, eleven sub-surveys, and countless web portals and data access tools. The SciServer project has unified SDSS data in the following ways:
  • Brought SDSS raw imaging data (FITS files) to our servers at JHU
  • Took over several additional SDSS data access services
  • Created a new logo and design schemes
  • Unified helpdesks across all SDSS phases
  • Combined CasJobs MyDBs across all SDSS phases
Our efforts have led to a new SDSS web presence, hosted on machines at JHU at www.sdss.org.
Scratch data space
The heart of our CasJobs system is the MyDB, a personal database space where users can store query results and perform efficient operations to analyze and cross-reference those results. Although MyDBs are large enough to store most query results, some queries generate very large intermediate datasets. Even when users know that their final science data will fit inside their MyDB, these large intermediate datasets sometimes make the analysis impossible to complete. SciServer will solve this problem by offering shared scratch database space into which all users can write very large intermediate results. Although these scratch datasets will not be retained indefinitely, they will remove the last major barrier preventing astronomers from performing highly complex science analyses online.
Value-added catalogs
Once researchers perform these complex analyses on their datasets, they will frequently want to share their discoveries with others. The current CasJobs system allows users to share their data with colleagues using MyDB tables, but there is currently no way to mark these datasets as potentially helpful to other researchers. SciServer will give users the ability to identify "value-added datasets" that can be distributed through CasJobs with a variety of licensing options.
Cross-matching with other datasets
The SDSS is the premier survey dataset in visible-light astronomy, but many other astronomical surveys have generated data in other wavelengths of the electromagnetic spectrum. Researchers can gain much new knowledge by combining observations of the same object in multiple wavelengths, but it is not always straightforward to figure out which observations in different wavelengths correspond to the same object. SciServer will address this difficult problem by further developing the SkyQuery astronomical cross-match tool.
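To make the cross-matching problem concrete, here is a minimal sketch of positional matching between two small catalogs. It is not SkyQuery's algorithm: it is a brute-force nearest-neighbor search, and the function names, the `(id, ra, dec)` tuple layout, the sample coordinates, and the 2-arcsecond default radius are all assumptions made for illustration.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions.

    Inputs are right ascension and declination in degrees. Uses the
    haversine formula, which stays accurate for small separations.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(catalog_a, catalog_b, radius_arcsec=2.0):
    """Match each source in catalog_a to its nearest neighbor in catalog_b.

    Catalogs are lists of (id, ra_deg, dec_deg) tuples (a hypothetical
    layout). Returns (id_a, id_b, separation_arcsec) for every source in
    catalog_a that has a counterpart within radius_arcsec. This is a
    brute-force O(N*M) loop, suitable only for small catalogs.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        best = None  # (id_b, separation_deg) of the closest candidate so far
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0], best[1] * 3600.0))
    return matches


# Made-up positions: one optical source has an infrared counterpart nearby.
optical = [("opt1", 150.0, 2.0), ("opt2", 150.1, 2.1)]
infrared = [("ir1", 150.0002, 2.0001), ("ir2", 151.0, 3.0)]
matches = cross_match(optical, infrared)
```

Real cross-matching is harder than this sketch suggests: surveys differ in positional accuracy and resolution, and catalogs with hundreds of millions of rows need spatial indexing rather than a double loop.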


We will complete the unification and curation of SDSS data by mid-2015. Educational and other data access tools will be added then, and as more astronomy projects put their data online, much more data will be available through SciServer. Our timeline is:
  • January 2015: Launch new sdss.org website
  • January 2015: Unify MyDBs
  • March 2015: Unify helpdesk systems
  • April 2016: Deploy updated system, with MyScratch and value-added catalogs
  • May 2016: Complete transfer of all SDSS data services
  • July 2016: SDSS Data Release 14
Another important application of large numerical simulations is cosmological N-body simulation, which models the evolution of the Milky Way or the entire Universe. By 2016, SciServer will host several of these simulations, offering unified access and collaborative visualization tools. The unique advantage of accessing these simulations through SciServer will be the ease with which they can be compared with real astronomical observations.
Recent advances in sequencing technology have led to an explosive growth in the amount of publicly available human sequence data. This represents an invaluable opportunity for scientific exploration and discovery, but the sheer quantity of the data presents a major challenge for analysis. For instance, the 1000 Genomes project sequences a large number of human genomes to provide a comprehensive resource on human genetic variation. These data are publicly available, but because of their sheer size they have been practically inaccessible to contemporary sequence search methods. The Terabase Search Engine (TSE) project will develop novel software and databases that allow users to search, retrieve, and re-analyze the raw data underlying thousands of human genomes. SciServer will provide the building blocks for the query and data management framework for the TSE project. New tools built on top of SciServer will make this rich resource available to the research community for exploration and discovery.
One important context in which turbulent flow occurs is in Earth's oceans, especially on scales of a few kilometers. Like engineers, oceanographers have long relied on numerical simulations to model physical systems - but oceanography presents an additional challenge to simulations because data for comparison are sparse in space and time. Starting in late 2015, SciServer will host some of the largest computational models of global ocean circulation ever offered online. The new models will allow users to trace the physical and chemical evolution of ocean currents across orders of magnitude in space and time, and to quickly analyze important aspects of model events in conjunction with observational data. New SciServer tools will allow researchers to collaboratively track particles through circulation models, identifying hard-to-find features such as high-vorticity regions and isolated vertical mixing.
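As a rough illustration of what tracking a particle through a stored flow model involves, the sketch below advects a single particle through a velocity field with fourth-order Runge-Kutta steps. The velocity field here is a hypothetical analytic one (solid-body rotation about the origin) standing in for values that would be read from a circulation model. All of the function names are invented for this example and do not describe any SciServer interface.

```python
import math


def velocity(x, y, t):
    """Hypothetical steady velocity field: solid-body rotation about the
    origin with unit angular speed. A real tracker would look up or
    interpolate velocities from stored model output instead."""
    return -y, x


def rk4_step(x, y, t, dt, field):
    """Advance one particle position by dt using classical RK4."""
    k1x, k1y = field(x, y, t)
    k2x, k2y = field(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, t + 0.5 * dt)
    k3x, k3y = field(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, t + 0.5 * dt)
    k4x, k4y = field(x + dt * k3x, y + dt * k3y, t + dt)
    x_new = x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
    y_new = y + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0
    return x_new, y_new


def track_particle(x0, y0, t0, t1, n_steps, field=velocity):
    """Return the list of positions visited between t0 and t1,
    including the starting point."""
    dt = (t1 - t0) / n_steps
    x, y, t = x0, y0, t0
    path = [(x, y)]
    for _ in range(n_steps):
        x, y = rk4_step(x, y, t, dt, field)
        t += dt
        path.append((x, y))
    return path


# In this field a particle starting at (1, 0) should trace the unit
# circle and come back close to its starting point after time 2*pi.
path = track_particle(1.0, 0.0, 0.0, 2.0 * math.pi, 1000)
final_x, final_y = path[-1]
```

In a rotating field like this one the particle's distance from the origin should stay constant, which gives a simple check on the integrator before it is pointed at real model output.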
Like other field sciences, soil ecology research requires detailed and consistent monitoring of natural conditions. The more researchers can monitor natural environments, the more they can discover - but long-term widespread studies can be prohibitively expensive. Automated data collection programs can help reduce the burden, and can allow researchers to concentrate on the most scientifically interesting problems. SciServer grows out of such a program - the Life Under Your Feet wireless sensor network. With this network, researchers from the JHU Department of Earth and Planetary Sciences have deployed temperature and soil-moisture sensors for long-term environmental monitoring at field sites all over the world. By mid-2015, SciServer will host the entire Life Under Your Feet dataset, and will have developed some additional tools, including ways to integrate sensor network data with reference datasets of climate, land cover, and other topics from remote sensing. SciServer will also extend its framework to store and visualize datasets of local in situ observations from thousands of researchers and citizen scientists, which will enable consistent and detailed long-term monitoring on a global scale. The extended framework will also include social media tagging and sharing tools to encourage worldwide collaboration around these datasets.
Turbulent fluid flow affects a wide variety of engineering problems, but turbulence is mathematically complicated and poorly understood. One of the most productive research techniques is numerical simulation, but simulations powerful enough to be realistic are also computationally intensive enough to require the largest supercomputers to run at all. SciServer offers a solution: run the simulations in advance, then store the results online for easy distributed access. SciServer offers access to the ensemble of the Johns Hopkins Turbulence Databases, a growing collection of multi-terabyte computational models of turbulent fluid flow. These databases preserve the entire spacetime history of high-performance computing simulations, including forced isotropic turbulence, magnetohydrodynamic turbulence, and channel flow. Users can access the data using a variety of data retrieval and immersive analysis interfaces. We will continue to add simulations to these databases in the future, but more importantly, we will offer new ways of accessing turbulence simulation data. Starting in late 2015, SciServer will build on its existing MyDB framework to develop data sharing and collaborative visualization tools. SciServer will also develop GPU computing technology to offer full-field operations, making future simulations even more powerful for real-world research.