A graph showing the relationships between datasets and science domains
Diagram showing the science domains and projects supported by SciServer (click for a larger view).

SciServer comprises several scientific projects that have made their full datasets available in a common format and through a common set of interfaces. Many other projects will soon make their data available here as well.

The diagram shows SciServer’s projects, both those that currently make their datasets available through the SciServer environment and those that will in the future. Green ellipses mark specific projects, and red circles show those projects’ databases. Blue ellipses map out which science domains these projects address. Click on the diagram for a larger view.

The list below gives the science domains addressed by SciServer’s current and future projects. Click on any of the icons for more information about that domain.

We also maintain a growing list of SciServer publications.

SDSS logo: a purple spectroscopic plate with the Big Dipper and Southern Cross
With huge collections of observational data offering insights into the structure and history of the Universe, astronomy has long been at the forefront of data-intensive science. SciServer will carry that tradition forward by opening up new ways of working with astronomy data.

The Science

The SciServer framework grew out of SkyServer, a website first created in 2001 for the Sloan Digital Sky Survey (SDSS). The still-ongoing SDSS uses a telescope in New Mexico to take images of nearly half a billion stars and galaxies, resulting in a high-resolution map of the Universe. This map is the largest and most detailed ever created, and has led to discoveries revolutionizing nearly all areas of astronomy.

To make the high-quality SDSS dataset available to all, we created the SkyServer website, which makes the entire SDSS dataset available, free of charge, to researchers, educators, and the public. SkyServer features a set of easy-to-use tools for browsing and searching SDSS data, including a free-form SQL query interface that allows users to ask an unlimited variety of questions about the data. Our goal was to train and guide users toward the more powerful and flexible interfaces for working with the data.

Our work with SkyServer led us to create CasJobs, a batch query system that allows researchers to perform highly complex searches returning millions of sky objects. We also helped to develop the Galaxy Zoo citizen science website, in which hundreds of thousands of online volunteers have contributed millions of data points to more than 40 published science papers. Our approach has been adapted by other astronomy projects such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) and the Galaxy Evolution Explorer (GALEX).
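To make the free-form SQL interface concrete, here is a toy sketch of the kind of query SkyServer accepts. The table and column names (PhotoObj, objID, ra, dec, r) mirror the real SDSS schema, but the rows below are made-up mock data in an in-memory SQLite database, not the actual survey catalog:

```python
# Toy stand-in for a SkyServer-style free-form SQL query. The schema mirrors
# the SDSS PhotoObj table; the three rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PhotoObj (objID INTEGER, ra REAL, dec REAL, r REAL)")
conn.executemany(
    "INSERT INTO PhotoObj VALUES (?, ?, ?, ?)",
    [(1, 180.1, 0.5, 17.2), (2, 180.3, 0.6, 21.9), (3, 10.0, -5.0, 16.8)],
)

# A typical question a user might ask: bright objects in a small patch of sky.
query = """
    SELECT objID, ra, dec, r
    FROM PhotoObj
    WHERE ra BETWEEN 179.9 AND 180.5
      AND dec BETWEEN 0.0 AND 1.0
      AND r < 18.0
"""
rows = conn.execute(query).fetchall()
print(rows)  # only objID 1 passes both the sky-patch and the magnitude cut
```

Against the real catalog the same SQL would run through SkyServer's web form (or CasJobs for large results) rather than a local database.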

The Big Data Problem

a photo of an edge-on spiral galaxy with a thick dust lane
A galaxy as seen by the Sloan Digital Sky Survey
The story of the SDSS clearly illustrates the challenges of modern astronomy. Before the survey began, astronomers had digital data for about 200,000 galaxies. Today, largely because of the SDSS, that number is more than 200 million. More data mean more opportunities for knowledge, but this "data avalanche" can overwhelm researchers. Exponential growth in data volumes means:
  • Challenges in moving data "off the mountain"
  • Challenges in storing data
  • Challenges in accessing data, as growing data sizes make wholesale downloading impractical
All of these major challenges require new solutions - solutions that can be applied across all scientific fields. Astronomy is a perfect test case for these e-Science methods, for several reasons:
  • Astronomical measurements are highly complex, with many different data types and formats
  • Astronomers measure many parameters; SDSS photometry alone contains more than 200 measured quantities for each star or galaxy
  • The sky is free to everyone; astronomy data are free of any legal or contractual requirements for anonymity or confidentiality
  • Astronomy asks many fascinating questions about the history and nature of the universe; questions whose answers require new big data techniques

SciServer Use Cases

SciServer will pick up where SkyServer and CasJobs left off, building a new set of tools to accomplish the following goals.
SDSS Unification
A pie-shaped graph with colored dots. Each dot represents the location of a galaxy.
A map of the universe as seen by the SDSS. Each dot is a galaxy.
Although the existing SkyServer website provides easy access to all SDSS catalog data, other SDSS datasets were previously unavailable there. In particular, SDSS raw data files in FITS format were hosted separately at Fermi National Accelerator Laboratory. Furthermore, the SDSS identity was highly fragmented, with four phases, eleven sub-surveys, and countless web portals and data access tools. The SciServer project has unified SDSS data in the following ways:
  • Brought SDSS raw imaging data (FITS files) to our servers at JHU
  • Took over several additional SDSS data access services
  • Created a new logo and design schemes
  • Unified helpdesks across all SDSS phases
  • Combined CasJobs MyDBs across all SDSS phases
Our efforts have led to a new SDSS web presence, hosted on machines at JHU at www.sdss.org.
Scratch data space
The heart of our CasJobs system is the MyDB, a personal database space where users can store query results and perform efficient operations to analyze and cross-reference them. Although MyDBs are large enough to store most query results, some queries generate very large intermediate datasets. Even when users know that their final science data will fit inside their MyDB, these large intermediate datasets can make the analysis impossible to complete. SciServer will solve this problem by offering shared scratch database space into which all users can write very large intermediate results. Although these scratch datasets will not be retained indefinitely, they will remove the last major barrier preventing astronomers from performing highly complex science analyses online.
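The MyDB-plus-scratch pattern can be sketched with two SQLite databases standing in for the real CasJobs storage: a 'scratch' database holds a large intermediate result that will not be retained, while the main connection plays the role of the MyDB and keeps only the small final table. All names here are illustrative, not the actual CasJobs interface:

```python
# Minimal sketch of the MyDB + scratch-space workflow using SQLite stand-ins.
import sqlite3

mydb = sqlite3.connect(":memory:")                     # stand-in for the user's MyDB
mydb.execute("ATTACH DATABASE ':memory:' AS scratch")  # stand-in for shared scratch space

# Stage a large intermediate dataset in scratch space...
mydb.execute("CREATE TABLE scratch.intermediate (objID INTEGER, flux REAL)")
mydb.executemany("INSERT INTO scratch.intermediate VALUES (?, ?)",
                 [(i, i * 0.1) for i in range(10_000)])

# ...then keep only the small final result in the MyDB.
mydb.execute("""CREATE TABLE final AS
                SELECT objID, flux FROM scratch.intermediate
                WHERE objID >= 9990""")
n_final = mydb.execute("SELECT COUNT(*) FROM final").fetchone()[0]
print(n_final)  # 10 rows survive the final cut; the 10,000-row intermediate is disposable
```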
Value-added catalogs
Once researchers perform these complex analyses on their datasets, they will frequently want to share their discoveries with others. The current CasJobs system allows users to share their data with their colleagues using MyDB tables; but there is currently no way to mark these datasets as possibly helpful to other researchers. SciServer will give users the ability to identify "value-added datasets" that can be distributed through CasJobs with a variety of licensing options.
Cross-matching with other datasets
The SDSS is the premier survey dataset in visible-light astronomy, but many other astronomical surveys have generated data at other wavelengths of the electromagnetic spectrum. Researchers can gain much new knowledge by combining observations of the same object at multiple wavelengths, but it is not always straightforward to determine which observations from different surveys correspond to the same object. SciServer will address this difficult problem by further developing the SkyQuery astronomical cross-match tool.
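The core operation SkyQuery performs at scale can be sketched in a few lines: pair up objects from two catalogs whenever their angular separation on the sky falls below a matching radius. The two toy catalogs and the 1-arcsecond radius below are illustrative values, not real survey data:

```python
# Toy positional cross-match between two made-up catalogs.
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees via the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h)))

optical = [("sdss-1", 150.0000, 2.2000), ("sdss-2", 150.1000, 2.3000)]
radio   = [("vla-9", 150.00010, 2.20005), ("vla-7", 151.0, 3.0)]

radius = 1.0 / 3600.0  # 1 arcsecond matching radius, in degrees
matches = [(o[0], r[0]) for o in optical for r in radio
           if angular_sep_deg(o[1], o[2], r[1], r[2]) < radius]
print(matches)  # sdss-1 and vla-9 lie ~0.4 arcsec apart, so they match
```

The real tool replaces this brute-force double loop with indexed, probabilistic matching across billion-row catalogs, but the geometric test is the same.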
Another important class of large numerical simulations is cosmological N-body simulations, which model the evolution of systems ranging from the Milky Way to the entire Universe. By 2016, SciServer will host several of these simulations, offering unified access and collaborative visualization tools. The unique advantage of accessing these simulations through SciServer will be the ease with which they can be compared with real astronomical observations.
Recent advances in sequencing technology have led to explosive growth in the amount of publicly available human sequence data. This represents an invaluable opportunity for scientific exploration and discovery, but the sheer quantity of data presents a major challenge for analysis. For instance, the 1000 Genomes Project sequences a large number of human genomes to provide a comprehensive resource on human genetic variation. These data are publicly available, but their sheer size has made them practically inaccessible to contemporary sequence search methods. The Terabase Search Engine (TSE) project will develop novel software and databases that allow users to search, retrieve, and re-analyze the raw data underlying thousands of human genomes. SciServer will provide the building blocks for the TSE project's query and data management framework, and new tools built on top of SciServer will make this rich resource available to the research community for exploration and discovery.
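The indexing idea behind scalable sequence search can be illustrated with a toy k-mer index: pre-compute where every length-k substring occurs, so that a query can be located by seeding on its first k-mer instead of scanning every base. The TSE operates on raw read data at vastly larger scale with far more sophisticated methods; the sequence below is invented:

```python
# Toy k-mer index and seeded exact search over a made-up sequence.
from collections import defaultdict

def build_kmer_index(seq, k):
    """Map each k-mer to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def search(index, seq, query, k):
    """Find occurrences of query by seeding on its first k-mer, then verifying."""
    hits = []
    for pos in index.get(query[:k], []):
        if seq[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits

genome = "ACGTACGTGGTACGTTT"        # invented example sequence
idx = build_kmer_index(genome, k=4)
print(search(idx, genome, "ACGT", 4))    # every occurrence of the seed itself
print(search(idx, genome, "ACGTGG", 4))  # seeds at three spots, verifies one
```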

The MEDE Data Science Cloud: SciServer Based Data Science for Materials Scientists and Engineers

At Hopkins we’ve developed the Materials in Extreme Dynamic Environments Data Science Cloud (MEDE-DSC) to address the need for robust, sustainable data-science tools in the materials domain. The MEDE-DSC combines computing infrastructure with collaborative integration into the materials design loop. The focus of the project aligns with MGI strategic goals: to facilitate access to materials data, to build data science skills in the materials domain, and to create tools that help materials scientists link experiments, computation, and theory. This focus guides the project’s commitment to bring data science tools to materials domain researchers, where domain knowledge and expertise guide meaningful materials research.

MEDE-DSC infrastructure is built on the SciServer platform. SciServer, an NSF Data Infrastructure Building Block (DIBB) center, combines core components for Big Data storage and computation to bring the computation to the data. In our implementation we focus on delivering materials science tools in a simple, robust package. The computing environment uses preloaded Docker containers running on SciServer’s Linux virtual-machine architecture. Materials scientists and engineers access computing tools and data through a versatile, expandable Jupyter Notebook architecture. The combination of containers and notebooks brings power, consistency, and clarity while moving toward reproducible, narrated computation. Ultimately, our hope is that MEDE-DSC’s Big Data tools give materials scientists the opportunity to design a new class of research that fully exploits modern instrumentation and simulation capabilities.

SciServer Compute MEDE Notebook

For more information, please contact Tamas Budavari (budavari@jhu.edu) or David Elbert (elbert@jhu.edu), or visit the JHU CMEDE website (https://hemi.jhu.edu/cmede/).
SciServer hosts numerical model output of high-resolution Ocean General Circulation Models (GCMs) set up and run by the research group of Prof. Thomas W. N. Haine (Johns Hopkins University, Department of Earth and Planetary Sciences). These models allow users to trace the physical evolution of ocean currents across orders of magnitude in space and time, and to quickly analyze important aspects of modeled events in conjunction with observational data. The goal of the SciServer Ocean Modeling use case is to build a collaborative sharing environment where users can access and process high-resolution datasets. The analysis of these large datasets is often restricted by limited computational resources, so we have developed fast algorithms to facilitate extracting information from model output fields. SciServer users can either download subsets of data to their own machines, or run our tools online and store post-processing files on our servers.
Step by step instructions are available here.
Here is a list of notebooks associated with scientific publications:
  • Almansi, M., T. W. N. Haine, R. S. Pickart, M. G. Magaldi, R. Gelderloos, and D. Mastropole, 2017: High-Frequency Variability in the Circulation and Hydrography of the Denmark Strait Overflow from a High-resolution Numerical Model. Journal of Physical Oceanography. NOTEBOOK

Arctic-Subarctic Circulation and Dynamics

Our research goal is to better understand the circulation and dynamics of the Denmark Strait, East Greenland Shelf, and Irminger Sea. Diagnosing and monitoring the flow in this area is critical to estimating the state and variability of the meridional overturning circulation in the North Atlantic Ocean. Therefore, we have configured high-resolution (~2-4 km) realistic numerical models which are centered on Denmark Strait and include the entire Iceland Sea to the north as well as Cape Farewell to the southwest. The dynamics are simulated using the Massachusetts Institute of Technology general circulation model (MITgcm). These models have been compared with several observational data sources and show good agreement (e.g., Haine 2010; Magaldi et al. 2011; Koszalka et al. 2013; Gelderloos et al. 2017; Almansi et al. 2017).
Model domain
(a) Plan view of the numerical domain superimposed on sea-floor bathymetry. (b) Schematic of the currents flowing in the 2 km resolution area. (Almansi et al. 2017, JPO)

Eulerian Framework

The equations solved in the simulations are formulated in the Eulerian framework. The scripts provided in this use case enable rapid extraction of more than 50 diagnostics. For example, users can request figures and movies of time series of volume, heat, and salt transport through any vertical section, or of horizontal slices of model temperature, salinity, velocity, or pressure.
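One of the diagnostics above, volume transport through a vertical section, reduces to summing velocity times cell area over the section's grid cells. The 2x3 velocity section and grid spacings below are made-up numbers for illustration, not model output:

```python
# Toy volume-transport diagnostic: sum(v * dx * dz) over a vertical section.
dx = 2000.0          # cell width along the section (m)
dz = [10.0, 20.0]    # layer thicknesses (m), surface layer first

# Velocity normal to the section (m/s); one row per depth layer.
# Negative values follow the sign convention in the figure caption:
# equatorward flow is negative.
v = [[-0.4, -0.5, -0.3],
     [-0.2, -0.3, -0.1]]

transport = sum(v[k][j] * dx * dz[k]
                for k in range(len(dz)) for j in range(len(v[0])))
print(transport / 1e6, "Sv")  # 1 Sverdrup = 1e6 m^3/s
```

On the real model grid the same sum runs over thousands of cells with spatially varying dx and dz, but the arithmetic is identical.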
Temperature snapshot
Model snapshot of sea surface temperature (SST). March 15, 2007 at midnight.
Transport Denmark Strait
Upper panel: volume flux (transport) through Denmark Strait. Lower panel: vertical section of along-strait velocities in the Strait. Equatorward flow is negative.

Lagrangian Framework

Visualizing the model results, as well as studying the model kinematics, is facilitated by the implementation of a fast algorithm for Lagrangian particle tracking (Koszalka et al. 2013; Gelderloos et al. 2016). Virtual particles can be seeded anywhere in the model domain and tracked forward and backward in time. The position, temperature, and salinity of each particle are recorded for further off-line analysis.
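The basic mechanics of particle tracking can be sketched with a forward Euler integration: at each step, sample the velocity at the particle's position and step the position forward. An analytic solid-body-rotation field stands in here for interpolated model velocities, and the step size and duration are arbitrary toy choices:

```python
# Toy Lagrangian particle tracking with forward Euler time stepping.
import math

def velocity(x, y):
    """Made-up stand-in velocity field: solid-body rotation about the origin."""
    return -y, x

def track(x, y, dt, n_steps):
    """Advect one virtual particle, recording its path."""
    path = [(x, y)]
    for _ in range(n_steps):
        u, v = velocity(x, y)
        x, y = x + u * dt, y + v * dt
        path.append((x, y))
    return path

path = track(1.0, 0.0, dt=0.001, n_steps=1000)
x_end, y_end = path[-1]
# After t = 1 the exact solution is (cos 1, sin 1); forward Euler drifts
# slightly outward, which is why production codes use higher-order schemes.
print(x_end, y_end)
```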
Particle path
Particle-path and temperature of virtual water parcels ending up in Kangerdlugssuaq Fjord, Northeast Greenland, in November 2007 (Gelderloos et al. 2017, JPO).
A graphic showing a view of a magnifying glass with soil organisms inside, surrounded by buildings, forests, and the GLUSEEN acronym
Like other field sciences, soil ecology research requires detailed and consistent monitoring of natural conditions. The more researchers can monitor natural environments, the more they can discover - but long-term widespread studies can be prohibitively expensive. Automated data collection programs can help reduce the burden, allowing researchers to concentrate on the most scientifically interesting problems. This SciServer use case grows out of such a program - the Life Under Your Feet wireless sensor network. With this network, researchers from the JHU Department of Earth and Planetary Sciences have deployed temperature and soil-moisture sensors for long-term environmental monitoring at field sites all over the world. By mid-2015, SciServer will host the entire Life Under Your Feet dataset, and will have developed additional tools, including ways to integrate sensor network data with remote-sensing reference datasets of climate, land cover, and other topics. SciServer will also extend its framework to store and visualize datasets of local in situ observations from thousands of researchers and citizen scientists, enabling consistent and detailed long-term monitoring on a global scale. The extended framework will also include social media tagging and sharing tools to encourage worldwide collaboration around these datasets.
Turbulent fluid flow impacts a wide variety of engineering problems, but turbulence is mathematically complicated and poorly understood. One of the most productive research techniques is numerical simulation, but simulations powerful enough to be realistic are also computationally intensive enough to require the largest supercomputers. SciServer offers a solution: run the simulations in advance, then store the results online for easy distributed access. SciServer offers access to the ensemble of the Johns Hopkins Turbulence Databases, a growing collection of multi-terabyte computational models of turbulent fluid flow. These databases preserve the entire space-time history of high-performance computing simulations, including forced isotropic turbulence, magnetohydrodynamic turbulence, and channel flow. Users can access the data through a variety of data retrieval and immersive analysis interfaces. We will continue to add simulations to these databases in the future - but more importantly, we will offer new ways of accessing turbulence simulation data. Starting in late 2015, SciServer will build on its existing MyDB framework to develop data sharing and collaborative visualization tools. SciServer will also develop GPU computing technology to offer full-field operations, making future simulations even more powerful for real-world research.
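A typical point query against a stored simulation asks for the velocity at an arbitrary location between grid points, which requires interpolating the gridded field. The real turbulence databases use higher-order schemes over multi-terabyte fields; this is a plain trilinear-interpolation sketch on a tiny made-up grid with unit spacing:

```python
# Toy trilinear interpolation of a scalar field stored on a unit-spaced grid.
def trilinear(field, x, y, z):
    """Interpolate field (a nested list indexed [i][j][k]) at point (x, y, z)."""
    i, j, k = int(x), int(y), int(z)
    fx, fy, fz = x - i, y - j, z - k
    value = 0.0
    # Weighted sum over the 8 corners of the enclosing grid cell.
    for di, wi in ((0, 1 - fx), (1, fx)):
        for dj, wj in ((0, 1 - fy), (1, fy)):
            for dk, wk in ((0, 1 - fz), (1, fz)):
                value += wi * wj * wk * field[i + di][j + dj][k + dk]
    return value

# A 2x2x2 sample of the linear field f(x, y, z) = x + 2y + 3z; trilinear
# interpolation reproduces a linear field exactly, which makes it easy to check.
field = [[[x + 2 * y + 3 * z for z in range(2)] for y in range(2)]
         for x in range(2)]
print(trilinear(field, 0.5, 0.25, 0.75))  # exact value: 0.5 + 0.5 + 2.25 = 3.25
```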