With huge collections of observational data offering insights into the structure and history of the Universe, astronomy has long been at the forefront of data-intensive science. SciServer will carry that tradition forward by opening up new ways of working with astronomy data.
The SciServer framework grew out of SkyServer, a website first created in 2001 for the Sloan Digital Sky Survey (SDSS). The still-ongoing SDSS uses a telescope in New Mexico to take images of nearly half a billion stars and galaxies, resulting in a high-resolution map of the Universe. This map is the largest and most detailed ever created, and has led to discoveries revolutionizing nearly all areas of astronomy.
To make the high-quality SDSS dataset available to all, we created the SkyServer website. SkyServer makes the entire SDSS dataset available, free of charge, to researchers, educators, and the public. SkyServer featured a set of easy-to-use tools for browsing and searching SDSS data, including a free-form SQL query interface to allow users to ask an unlimited variety of questions about the data. Our goal was to train and guide users toward using more powerful and flexible interfaces for working with the data.
Our work with SkyServer led us to create CasJobs, a batch system that allows researchers to perform highly complex searches that return millions of sky objects. We also helped to develop the Galaxy Zoo citizen science website, in which hundreds of thousands of online volunteers have contributed millions of data points to more than 40 published science papers. Our approach has been adapted by other astronomy projects such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), and the Galaxy Evolution Explorer (GALEX).
The Big Data Problem
The story of the SDSS clearly illustrates the challenges of modern astronomy. Before the survey began, astronomers had digital data for about 200,000 galaxies. Today, largely because of the SDSS, that number is more than 200 million.
More data mean more opportunities for knowledge, but this “data avalanche” can overwhelm researchers. Exponential growth in data volumes means:
- Challenges in moving data “off the mountain”
- Challenges in storing data
- Challenges in accessing data as data sizes make downloading impossible
All these major challenges require new solutions – solutions that can be applied throughout all scientific fields. But astronomy is unique in that it is the perfect test case for e-Science methods, because of many factors:
- Astronomical measurements are highly complex, with many different data types and formats
- Astronomers measure many parameters; SDSS photometry alone contains more than 200 measured quantities for each star or galaxy
- The sky is free to everyone; astronomy data are free of any legal or contractual requirements for anonymity or confidentiality
- Astronomy asks many fascinating questions about the history and nature of the universe; questions whose answers require new big data techniques
SciServer Use Cases
SciServer will pick up where SkyServer and CasJobs left off, building a new set of tools to accomplish the following goals.
Although the existing SkyServer website provides easy access to all SDSS catalog data, there are other SDSS datasets that were previously unavailable there. In particular, SDSS raw data files in FITS format were hosted separately at Fermi National Accelerator Laboratory. Furthermore, the SDSS identity is highly fragmented, with four phases, eleven sub-surveys, and countless web portals and data access tools.
The SciServer project has unified SDSS data in the following ways:
- Brought SDSS raw imaging data (FITS files) to our servers at JHU
- Took over several additional SDSS data access services
- Created a new logo and design schemes
- Unified helpdesks across all SDSS phases
- Combined CasJobs MyDBs across all SDSS phases
Our efforts have led to a new SDSS web presence, hosted on machines at JHU at www.sdss.org.
Scratch data space
The heart of our CasJobs system is the MyDB, personal database space where users can store query results, and perform efficient operations to analyze and cross-reference those results. Although MyDBs are large enough to store most query results, some queries generate very large intermediate datasets. Even when a user knows that his or her final science data will fit inside their MyDB, sometimes these large intermediate datasets make their analysis impossible to complete.
SciServer will solve this problem by offering shared scratch database space into which all users can write very large intermediate dataset results. Although these scratch datasets will not be retained indefinitely, they will remove the last major barrier preventing astronomers from performing highly complex science analyses online.
Once researchers perform these complex analyses on their datasets, they will frequently want to share their discoveries with others. The current CasJobs system allows users to share their data with their colleagues using MyDB tables; but there is currently no way to mark these datasets as possibly helpful to other researchers. SciServer will give users the ability to identify “value-added datasets” that can be distributed through CasJobs with a variety of licensing options.
Cross-matching with other datasets
The SDSS is the premier survey dataset in the visible-light astronomy, but many other astronomical surveys have generated data in other wavelengths of the electromagnetic spectrum. Researchers can gain much new knowledge by combining observations of the same object in multiple wavelengths. But it is not always straightforward to figure out which observations match among different wavelengths. SciServer will solve this difficult problem by further developing the SkyQuery astronomical cross-match tool.