SciServer comprises several scientific projects that have made their full datasets available in a common format and through a common set of interfaces. Many other projects will soon make their data available here as well.
The diagram shows SciServer’s projects, both those that currently make their datasets available through the SciServer environment and those that will in the future. Green ellipses mark specific projects, and red circles show those projects’ databases. Blue ellipses map out which science domains these projects address. Click on the diagram for a larger view.
The list below gives the science domains addressed by SciServer’s current and future projects. Click on any of the icons for more information about that domain.
We also maintain a growing list of SciServer publications.
The Science
The SciServer framework grew out of SkyServer, a website first created in 2001 for the Sloan Digital Sky Survey (SDSS). The still-ongoing SDSS uses a telescope in New Mexico to take images of nearly half a billion stars and galaxies, resulting in a high-resolution map of the Universe. This map is the largest and most detailed ever created, and has led to discoveries revolutionizing nearly all areas of astronomy.
To make the high-quality SDSS dataset available to all, we created the SkyServer website. SkyServer makes the entire SDSS dataset available, free of charge, to researchers, educators, and the public. It features a set of easy-to-use tools for browsing and searching SDSS data, including a free-form SQL query interface that lets users ask an unlimited variety of questions about the data. Our goal was to train and guide users toward using more powerful and flexible interfaces for working with the data.
Our work with SkyServer led us to create CasJobs, a batch system that allows researchers to perform highly complex searches that return millions of sky objects. We also helped to develop the Galaxy Zoo citizen science website, in which hundreds of thousands of online volunteers have contributed millions of data points to more than 40 published science papers. Our approach has been adapted by other astronomy projects such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) and the Galaxy Evolution Explorer (GALEX).
The Big Data Problem
The story of the SDSS clearly illustrates the challenges of modern astronomy. Before the survey began, astronomers had digital data for about 200,000 galaxies. Today, largely because of the SDSS, that number is more than 200 million. More data mean more opportunities for knowledge, but this "data avalanche" can overwhelm researchers. Exponential growth in data volumes means:
- Challenges in moving data "off the mountain"
- Challenges in storing data
- Challenges in accessing data as data sizes make downloading impossible
- Astronomical measurements are highly complex, with many different data types and formats
- Astronomers measure many parameters; SDSS photometry alone contains more than 200 measured quantities for each star or galaxy
- The sky is free to everyone; astronomy data are free of any legal or contractual requirements for anonymity or confidentiality
- Astronomy asks many fascinating questions about the history and nature of the universe; questions whose answers require new big data techniques
SciServer Use Cases
SciServer will pick up where SkyServer and CasJobs left off, building a new set of tools to accomplish the following goals.
SDSS Unification
Although the existing SkyServer website provides easy access to all SDSS catalog data, there are other SDSS datasets that were previously unavailable there. In particular, SDSS raw data files in FITS format were hosted separately at Fermi National Accelerator Laboratory. Furthermore, the SDSS identity is highly fragmented, with four phases, eleven sub-surveys, and countless web portals and data access tools. The SciServer project has unified SDSS data in the following ways:
- Brought SDSS raw imaging data (FITS files) to our servers at JHU
- Took over several additional SDSS data access services
- Created a new logo and design schemes
- Unified helpdesks across all SDSS phases
- Combined CasJobs MyDBs across all SDSS phases
Scratch data space
The heart of our CasJobs system is the MyDB, a personal database space where users can store query results and perform efficient operations to analyze and cross-reference those results. Although MyDBs are large enough to store most query results, some queries generate very large intermediate datasets. Even when users know that their final science data will fit inside their MyDB, these large intermediate datasets sometimes make the analysis impossible to complete. SciServer will solve this problem by offering shared scratch database space into which all users can write very large intermediate results. Although these scratch datasets will not be retained indefinitely, they will remove the last major barrier preventing astronomers from performing highly complex science analyses online.
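The scratch-space idea can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in, not CasJobs itself: the temporary table plays the role of scratch space, the permanent table plays the role of a MyDB table, and every table, column, and function name here is hypothetical.

```python
import sqlite3


def stage_and_summarize(rows, mag_limit):
    """Stage bulky intermediate rows in a temporary table, keep only a small summary.

    rows: iterable of (obj_id, field, r_mag) tuples (hypothetical schema).
    Returns a list of (field, n_bright) tuples ordered by field.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Small permanent table: stand-in for a quota-limited MyDB table.
    cur.execute(
        "CREATE TABLE mydb_bright_counts (field INTEGER PRIMARY KEY, n_bright INTEGER)"
    )

    # Temporary table: stand-in for shared scratch space holding the
    # large intermediate result.
    cur.execute(
        "CREATE TEMP TABLE scratch_objects (obj_id INTEGER, field INTEGER, r_mag REAL)"
    )
    cur.executemany("INSERT INTO scratch_objects VALUES (?, ?, ?)", rows)

    # Reduce the intermediate data to the compact final result.
    cur.execute(
        "INSERT INTO mydb_bright_counts "
        "SELECT field, COUNT(*) FROM scratch_objects "
        "WHERE r_mag < ? GROUP BY field",
        (mag_limit,),
    )

    # The scratch data is not kept once the summary exists.
    cur.execute("DROP TABLE scratch_objects")
    conn.commit()

    result = cur.execute(
        "SELECT field, n_bright FROM mydb_bright_counts ORDER BY field"
    ).fetchall()
    conn.close()
    return result
```

The design point is that only the reduced table has to fit in the user's personal space; the intermediate rows live somewhere larger and can be discarded afterwards.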
Value-added catalogs
Once researchers perform these complex analyses on their datasets, they will frequently want to share their discoveries with others. The current CasJobs system allows users to share their data with their colleagues using MyDB tables, but there is currently no way to mark these datasets as possibly helpful to other researchers. SciServer will give users the ability to identify "value-added datasets" that can be distributed through CasJobs with a variety of licensing options.
Cross-matching with other datasets
The SDSS is the premier survey dataset in visible-light astronomy, but many other astronomical surveys have generated data in other wavelengths of the electromagnetic spectrum. Researchers can gain much new knowledge by combining observations of the same object in multiple wavelengths. But it is not always straightforward to figure out which observations match among different wavelengths. SciServer will solve this difficult problem by further developing the SkyQuery astronomical cross-match tool.
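To make the cross-match problem concrete, here is a minimal sketch of the simplest possible approach: pair each object in one catalog with its nearest neighbor in another, provided the two lie within a chosen angular radius. This is a brute-force illustration only, not SkyQuery's method, which the text above does not describe; the function names, the catalog layout of (id, RA, Dec) tuples in degrees, and the matching radius are all assumptions made for the example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions.

    Inputs are in degrees. Uses the haversine formula, which stays
    numerically stable for the very small angles typical of cross-matching.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalog_a, catalog_b, radius_arcsec):
    """Nearest-neighbor match of catalog_a against catalog_b.

    Each catalog is a list of (id, ra_deg, dec_deg) tuples (hypothetical layout).
    Returns a list of (id_a, id_b, separation_arcsec) for every object in
    catalog_a that has at least one catalog_b object within radius_arcsec.
    Brute force: cost grows as len(catalog_a) * len(catalog_b).
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        best = None  # (id_b, separation_deg) of the closest candidate so far
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0], best[1] * 3600.0))
    return matches
```

Even this toy version shows why the problem is hard at survey scale: the pairwise comparison is quadratic in catalog size, and a fixed radius ignores the differing positional uncertainties of each survey.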
The MEDE Data Science Cloud: SciServer-Based Data Science for Materials Scientists and Engineers
At Hopkins we’ve developed the Materials in Extreme Dynamic Environments Data Science Cloud (MEDE-DSC) to address the need for robust, sustainable data-science tools in the materials domain. The MEDE-DSC combines computing infrastructure with collaborative integration into the materials design loop. The focus of the project aligns with the strategic goals of the Materials Genome Initiative (MGI): to facilitate access to materials data; to build data science skills in the materials domain; and to create tools that help materials scientists link experiments, computation, and theory. This focus guides the project's commitment to bring data science tools to materials researchers, whose domain knowledge and expertise guide meaningful materials research.
MEDE-DSC infrastructure is built on the SciServer platform. SciServer, an NSF Data Infrastructure Building Block (DIBB) center, combines core components for Big Data storage and computation to bring the computation to the data. In our implementation we focus on delivering materials science tools in a simple, robust package. The computing environment uses preloaded Docker containers built on the SciServer Linux virtual-machine architecture. Materials scientists and engineers access computing tools and data through a versatile, expandable Jupyter Notebook architecture. The combination of containers and notebooks brings power, consistency, and clarity while moving towards reproducible, narrated computation. Ultimately, our hope is that MEDE-DSC’s Big Data tools give materials scientists the opportunity to design a new class of research that fully uses modern instrumentation and simulation capabilities.
For more information, please contact Tamas Budavari (email@example.com) or David Elbert (firstname.lastname@example.org), or visit the JHU CMEDE website (https://hemi.jhu.edu/cmede/).
Step-by-step instructions are available here.
Here is a list of notebooks associated with scientific publications:
- Almansi, M., T. W. N. Haine, R. S. Pickart, M. G. Magaldi, R. Gelderloos, and D. Mastropole, 2017: High-Frequency Variability in the Circulation and Hydrography of the Denmark Strait Overflow from a High-resolution Numerical Model. Journal of Physical Oceanography. NOTEBOOK