SciServer is a revolutionary new approach to doing science: it brings the analysis to the data. SciServer consists of data hosting services coupled with integrated tools that work together to create a full-featured system.
SciServer is a fully integrated cyberinfrastructure system encompassing related tools and services that enable researchers to cope with scientific big data. It enables a new approach that allows researchers to work with terabytes or petabytes of scientific data without needing to download any large datasets.
This SciServer approach offers tremendous advantages to scientists:
- By offering Big Data search and analysis capabilities online, SciServer will make it easy to compare datasets and discover new and surprising connections between them.
- By offering worldwide access to large simulation datasets along with innovative new processing techniques, SciServer will open up computational science resources to scientists everywhere.
- By providing a cloud-based scientific data storage system that automatically interoperates between flat files and databases through a drag-and-drop interface, SciServer allows scientists to synthesize disparate datasets and take full advantage of their contents.
- By adapting existing, working tools, SciServer builds on success, extending a system that has been functional and user-driven since its inception.
- By developing new Citizen Science Projects, SciServer extends its reach to worldwide distributed data. Our Soil Ecology project, for example, uses SciServer to gather data from around the world across a range of climatic conditions.
- By adding an extensive set of collaborative features, SciServer allows researchers to correlate their data sets with hosted data sets provided by external data providers.
SciServer addresses some of the most important challenges of modern science with a variety of innovative tools and approaches.
SciServer meets these daunting challenges by offering scalable database space to science data providers. SciServer’s databases are hosted on machines with large storage capacity and fast I/O, and are heavily indexed for better query performance. But SciServer’s greatest contribution is to individual researchers: a set of easy-to-use tools for performing complex searches of big datasets, and personalized database space to store and analyze results. These new tools will revolutionize the way scientists make discoveries from 21st century data.
Many modern research programs require detailed numerical simulations. These simulations are often so complex and time-consuming that they can only be run on the largest supercomputers, but with only a few supercomputers to go around, many researchers cannot run the software they desperately need for their science. Even when supercomputer time is available, researchers still need to search and analyze the results of their simulations efficiently to maximize the knowledge gained.
SciServer solves these problems by offering the ability to run analysis codes directly on our servers, keeping the computation close to the underlying data. This approach will democratize access to supercomputing resources, and will enable an incredible variety of new science.
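The idea of keeping computation close to the data can be illustrated with a minimal sketch. It does not use any SciServer API: the in-memory SQLite database, the `particles` table, and the function names are all hypothetical stand-ins for a remote data service. The sketch compares pulling every row to the client against asking the database for the aggregate directly.

```python
import sqlite3


def build_demo_db(n_rows=10_000):
    """Create a hypothetical in-memory table standing in for a hosted dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE particles (id INTEGER PRIMARY KEY, energy REAL)")
    conn.executemany(
        "INSERT INTO particles (id, energy) VALUES (?, ?)",
        ((i, float(i % 100)) for i in range(n_rows)),
    )
    conn.commit()
    return conn


def mean_by_download(conn):
    """Fetch every row, then compute the mean on the client side."""
    rows = conn.execute("SELECT energy FROM particles").fetchall()
    mean = sum(r[0] for r in rows) / len(rows)
    return mean, len(rows)  # (result, number of rows transferred)


def mean_near_data(conn):
    """Let the database compute the mean; only one row comes back."""
    rows = conn.execute("SELECT AVG(energy) FROM particles").fetchall()
    return rows[0][0], len(rows)  # (result, number of rows transferred)


if __name__ == "__main__":
    conn = build_demo_db()
    print("download all rows:", mean_by_download(conn))
    print("compute near data:", mean_near_data(conn))
```

Both functions return the same mean, but the second moves a single row instead of the whole table. At terabyte or petabyte scale, that difference in data movement is what makes server-side analysis attractive.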
SciServer will open up access to all these big data resources to researchers worldwide. But big data is not the only kind of data in science; the most important discoveries often take place in the “long tail” of small datasets collected by thousands of researchers around the world. Furthermore, datasets large and small are used every day in K-12 and college classes by the next generation of scientists and science-literate citizens. These “long tail” datasets come in very different file formats, and educational users have very different needs from practicing researchers.
SciServer will allow both types of users to access the same tools as researchers who work with Big Data. The result will be a robust, scalable system used by researchers and the public alike. SciServer tools will be a regular part of the toolbox for 21st century professional and citizen scientists, and will be at the forefront of an amazing new era of scientific discovery.
SciServer began in 2013, but it builds on more than a decade of research and development in data-intensive science. With funding from the National Science Foundation, our team at Johns Hopkins University’s Institute for Data-Intensive Engineering and Science (IDIES) and beyond has been continuously building tools to support research in all areas of science and technology.
In addition to the general problem of dealing with Scientific Big Data, scientists over the years have grappled with many problems that originate from a lack of data infrastructure. Below are a few of the most serious problems we hope to address with SciServer. Scientists perusing this list will undoubtedly find it familiar, yet incomplete.
A few of the data-related problems scientists face are:
- Data preservation
- Ad hoc data storage format and media
- Insufficient metadata and incomplete documentation
- Unequal access to data and data processing resources
A variety of projects have developed approaches to preserving and managing datasets, but providing easy access so that all researchers can compare, analyze, and share them remains a problem. The SciServer team has spent the last two decades addressing these problems, initially in astronomy with the Sloan Digital Sky Survey (SDSS), and later in other areas of science.
SDSS has increased the volume of astronomical data by three orders of magnitude (a factor of 1,000): where astronomers once knew of 200,000 galaxies, we now know of more than 200,000,000. These unprecedented sample sizes have revolutionized astronomy in unforeseen ways. SDSS has indeed provided an embarrassment of riches.
The database that houses SDSS data was built at JHU, as was the spectrograph used to collect galactic spectra. The Catalog Archive Server (CAS) and SkyServer were also developed at JHU, in 1997 and 2001, respectively.
Analysis of website traffic shows that usage of the system has continually increased since the start of the project. Over that time both the number of visitors (shown in gray) and number of queries executed (shown in red) have increased. The system itself, including both the supporting hardware infrastructure and the software services themselves, has successfully scaled up to meet the demand throughout this period, and will continue to do so in the future, as necessary.
In the figure below, consider the genealogy of existing SDSS applications. The foundational components of SDSS have already become the basis for numerous extensions and applications, including some built on other (non-astronomical) scientific data sets. The SDSS genealogy shows, for example, that the SkyServer framework has served as the parent template for applications in the fields of turbulence (Turbulence Simulations), environmental science (Life Under Your Feet), and radiation oncology (Onco Space).