Hosted Datasets
SciServer hosts more than two petabytes of scientific data in a variety of disciplines, including astronomy and fluid mechanics. Datasets come in two types:
Public Data Volumes can be accessed from SciServer Compute by mounting them onto a new container at the time you create the container.
Databases can be accessed from SciServer Compute by querying them with commands from the SciServer.CasJobs module. See the SciServer module API Documentation to learn how to use these commands.
The sections below describe each of the datasets available through SciServer.
SciServer makes a number of datasets directly available in SciServer Compute in the form of Public Data Volumes. To use these data volumes in your research or education activities, you will need to mount them to a virtual container at the time you create that container. See the instructions on How to create a new container to learn how to mount a Public Data Volume.
MaNGA Integral Field Unit (IFU) Spectra: The reduced products of MaNGA data, with the latest from Data Release 17 (DR17). The pipeline-processed spectroscopic data products available in this image consist of 3-d data cubes, row-stacked spectra from the Data Reduction Pipeline, and 2-d analysis maps and 3-d model cubes from the Data Analysis Pipeline. The complete volume of MaNGA data can be found on the SDSS Science Archive Server (SAS).
The new SDSS Associated Data data volume provides easy access to useful datasets from the Sloan Digital Sky Survey that are not part of the official SDSS data releases (the latest of which is now Data Release 16).
Currently this data volume includes the one dataset described below. We will continue to add new datasets, including future SDSS value-added catalogs.
HI-MaNGA: HI followup observations of MaNGA target galaxies
The HI-MaNGA dataset consists of followup observations of MaNGA galaxies in the HI (21 cm) wavelength, using the Green Bank Telescope. The observations were designed to address scientific questions related to stellar evolution and gas accretion in various types of galaxies. The final dataset will include most galaxies in the MaNGA catalog with z < 0.05.
For more information about the HI-MaNGA dataset, see its description page on the SDSS website.
The SDSS Spectra Data Volume contains spectra of millions of stars, galaxies, and quasars from the Sloan Digital Sky Survey (SDSS), an ongoing effort to make a three-dimensional map of the Universe.
If you have a SciServer account, you can see the contents of this data volume in your Dashboard by going to the Files tab, or with this direct link to the SDSS Spectra Data Volume (for logged-in users only). Catalog data from the same spectra are also available on SciServer; see Sloan Digital Sky Survey Catalog Archive Data under Databases below.
To use the SDSS spectroscopic data in your work in SciServer Compute, create a new container and check the box to mount the SDSS Spectra Data Volume onto your container.
About the Data
The observations come from the extended Baryon Oscillation Spectroscopic Survey (eBOSS), a component of the SDSS, which has measured optical spectra (3600-10400 Ångstroms) for millions of galaxies and quasars.
Individual spectra are available as FITS files. Each file follows the structure of the SDSS spec-lite file format, containing the coadded spectrum (HDU 1 COADD), spAll row (HDU 2 SPALL), and spZline row (HDU 3 SPZLINE) – but not the individual exposures (HDUs 4+), which are available only through the equivalent full spec files on the Science Archive Server.
For a full description of the file format, see the documentation of SDSS spec files from the SDSS data model.
The same FITS files can also be accessed through the SDSS Science Archive Server website, along with many related files describing various aspects of the SDSS spectroscopic data model.
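As a rough illustration of the file structure described above, the sketch below reads the coadded spectrum from HDU 1 of a spec-lite file. It assumes that the astropy package is available in your container and that the COADD table stores wavelengths in a loglam column (log10 of the wavelength in Ångstroms) next to a flux column, as described in the SDSS data model. The function names are hypothetical and not part of any SciServer library.

```python
def loglam_to_angstrom(loglam):
    """Convert log10(wavelength / Angstrom) values to wavelengths in Angstroms."""
    return [10.0 ** value for value in loglam]


def read_coadd(path):
    """Return (wavelength, flux) lists from the coadded spectrum in HDU 1.

    Assumption: astropy is installed, and the COADD table has 'loglam' and
    'flux' columns as described in the SDSS spec data model.
    """
    # Imported lazily so that loglam_to_angstrom works without astropy.
    from astropy.io import fits

    with fits.open(path) as hdul:
        coadd = hdul[1].data  # HDU 1 holds the coadded spectrum (COADD)
        wavelength = loglam_to_angstrom(coadd["loglam"].tolist())
        flux = coadd["flux"].tolist()
    return wavelength, flux
```

Calling read_coadd with the path to a spec-lite file on the mounted volume would return two lists of equal length, ready for plotting.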
Data Volume structure
The root level of the Data Volume contains the single directory spec-lite, indicating that the contents are SDSS spec-lite files. The next level down organizes the data by run2d, which identifies the version of the SDSS spectroscopic pipeline that processed the spectra in that subfolder. Because different pipeline versions were used by different surveys and programs, the run2d value indicates when and why the spectra were collected.
The most recent spectra come from SDSS Data Release 16 and have run2d value v5_13_0; most users will want this version. The list below shows which run2d values correspond to which datasets.
Within each run2d directory, spectra are organized by the SDSS plate used for the measurement; each plate-based directory contains either 640 or 1,000 FITS files, one for each spectrum collected by the plate.
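The layout described above can be turned into a file path with a small helper. This is a sketch only. The mount point passed as root is a placeholder that depends on your container, and the zero-padded plate directory and the spec-PLATE-MJD-FIBER.fits file name are assumptions based on the usual SDSS naming convention, so check them against the actual directory listing. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def spec_lite_path(root, run2d, plate, mjd, fiber):
    """Build the expected path of one spec-lite file.

    Assumed layout: <root>/spec-lite/<run2d>/<plate>/spec-<plate>-<mjd>-<fiber>.fits
    with plate and fiber zero-padded to four digits and mjd to five.
    """
    plate_dir = f"{int(plate):04d}"
    filename = f"spec-{int(plate):04d}-{int(mjd):05d}-{int(fiber):04d}.fits"
    return PurePosixPath(root) / "spec-lite" / str(run2d) / plate_dir / filename


# "/path/to/SDSS_Spectra" is a placeholder, not the real mount point.
example = spec_lite_path("/path/to/SDSS_Spectra", "v5_13_0", 3586, 55181, 1)
```

Keeping the root as a parameter means the same helper works wherever the Data Volume happens to be mounted.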
Guide to run2d numbers
- v5_13_0 contains all optical spectra released as part of Data Release 16
- v5_10_0 contains all optical spectra released as part of Data Release 14
- 104 contains all optical spectra collected by the SDSS SEGUE-2 survey in 2008-2009, and some other preliminary spectroscopic data collected in the same period, first released in DR7
- 103 contains all optical spectra collected by the SDSS SEGUE-1 survey’s cluster studies in 2004-2008 (part of DR7)
- 26 contains all other optical spectra from SEGUE-1 and the original SDSS Legacy Survey, observed 2000-2008 and released in DR7
The following query gives summary data about each of these runs, including the SDSS survey and program responsible for the data:
select cast(run2d as int) as run, survey, programname, count(*) as nPlates, min(dateObs) as startDate, max(dateObs) as endDate from platex where run2d in ('26','103','104') group by cast(run2d as int), survey, programname order by cast(run2d as int), survey, programname
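The query above can also be run from SciServer Compute. The following is a minimal sketch, assuming that the platex table is queried in the DR16 context and that the SciServer libraries are installed in your container. The query function is passed in as an argument, so the SciServer import is only needed when the query is actually sent. The names RUN_SUMMARY_SQL, run_summary, and run_summary_on_sciserver are hypothetical.

```python
# Same query as above, split over several lines for readability.
RUN_SUMMARY_SQL = """
select cast(run2d as int) as run, survey, programname,
       count(*) as nPlates,
       min(dateObs) as startDate, max(dateObs) as endDate
from platex
where run2d in ('26', '103', '104')
group by cast(run2d as int), survey, programname
order by cast(run2d as int), survey, programname
"""


def run_summary(execute_query, context="DR16"):
    """Send the summary query through the supplied query function.

    The default context "DR16" is an assumption; use any SDSS context
    that holds the platex table.
    """
    return execute_query(RUN_SUMMARY_SQL, context=context)


def run_summary_on_sciserver():
    """Run the query with the SciServer libraries (only works where they are installed)."""
    from SciServer import CasJobs

    return run_summary(CasJobs.executeQuery)
```

Inside a SciServer Compute container, run_summary_on_sciserver() should return one row per combination of run, survey, and program.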

The HEASARC data volume contains a copy of all of the public data hosted at the High-Energy Astrophysics Science Archive Research Center (HEASARC). For information about the various missions available and how to use specific datasets, please see the HEASARC website and/or contact our helpdesk from that site’s Feedback link at the bottom.
The HEASARC data volume also includes a software area for miscellaneous additional things such as interactive cookbooks that are under development. Some startup instructions can be found on the HEASARC SciServer documentation page. The software environment to analyze these data can be found in the Compute Image called HEASARCv6.28.

Indra is a suite of large-volume cosmological N-body simulations. Each of the 384 simulations is computed with the same cosmological parameters and different initial phases, providing excellent statistics of the large-scale features of the distribution of dark matter.
The independent volumes have 1024³ dark matter particles in a box of length 1 Gpc/h, and are all accessible through SciServer Compute containers to all users who join the Cosmology science domain. A full description of the Indra suite of simulations can be found in a paper by Falck et al. (2021).
The Indra data volumes contain, for each simulation:
- 64 snapshots of particle positions and velocities
- 64 snapshots of FOF and SUBFIND halo catalogs
- 505 time-steps of coarse-gridded Fourier-space density fields
Indra data are accessed with the indra-tools Python package, which is pre-installed on the Computational Simulations compute image. The indra-tools git repository contains example notebooks showing how to read the binary data, query the halo database tables, compute density fields, etc.
Use of the Indra dataset is open and available to anyone. We ask that scientific publications that make use of Indra cite the Falck et al. (2021) data release paper.
This volume contains all the raw and processed file-based data from Data Release 7 (DR7) of the Sloan Digital Sky Survey (SDSS). The raw and pipeline-processed imaging and spectroscopic data products are available here (mostly) in binary FITS format.
The data on the SDSS-DAS volume can be accessed via SciServer Compute using the standard file access python tools. A copy of this data is also accessible via the SDSS DAS website, and the catalog version of this data is available from the SDSS DR7 SkyServer.
Recount2 provides processed and summarized expression data for over 70,000 human RNA-seq samples from the Sequence Read Archive (SRA), The Cancer Genome Atlas (TCGA), and The Genotype-Tissue Expression (GTEx) project (https://doi.org/10.1038/nbt.3838).
The associated Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta- and phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. By taking care of several preprocessing steps and combining many datasets into one easily-accessible website, we make finding and analyzing RNA-seq data considerably more straightforward.
SciServer hosts numerical model output of high-resolution Ocean General Circulation Models (GCMs) set up and run by the research group of Prof. Thomas W. N. Haine (Johns Hopkins University – Department of Earth and Planetary Sciences). These models allow users to trace the physical evolution of ocean currents across orders of magnitude in space and time, and to quickly analyze important aspects of model events in conjunction with observational data.
The goal of the SciServer Ocean Modeling Use Case is to build a collaborative sharing environment where users can access and process high-resolution datasets. The analysis of these large datasets is often restricted by limited computational resources, so we have developed OceanSpy, a Python package that facilitates extracting information from model output fields. SciServer users can either download subsets of data to their own machines, or run our tools online and store post-processing files on our servers.
Available Datasets
- EGshelfIIseas2km_ERAI: High-resolution (~2km) numerical simulation covering the east Greenland shelf (EGshelf), and the Iceland and Irminger Seas (IIseas). Surface forcing based on the global atmospheric reanalysis ERA-Interim (ERAI). Citation: Almansi et al., 2017
- EGshelfIIseas2km_ASR: High-resolution (~2km) numerical simulation covering the east Greenland shelf (EGshelf), and the Iceland and Irminger Seas (IIseas). Surface forcing based on the regional atmospheric Arctic System Reanalysis (ASR). Citation: Almansi et al., 2017
- EGshelfSJsec500m: Very high-resolution (500m) numerical simulation covering the east Greenland shelf (EGshelf) and the Spill Jet section (SJsec). Both hydrostatic and non-hydrostatic setups are available. Citation: Magaldi and Haine, 2015
References
- Almansi, M., T.W. Haine, R.S. Pickart, M.G. Magaldi, R. Gelderloos, and D. Mastropole, 2017: High-Frequency Variability in the Circulation and Hydrography of the Denmark Strait Overflow from a High-Resolution Numerical Model. J. Phys. Oceanogr., 47, 2999–3013, https://doi.org/10.1175/JPO-D-17-0129.1.
- Magaldi, M.G., and T.W.N. Haine, 2015: Hydrostatic and non-hydrostatic simulations of dense waters cascading off a shelf: The East Greenland case. Deep Sea Research Part I: Oceanographic Research Papers, 96, 89–104, https://doi.org/10.1016/j.dsr.2014.10.008.
SciServer offers a number of catalog datasets in the form of online SQL databases. These databases can be queried directly through CasJobs. From the CasJobs Query page, select a dataset from the Context menu to choose what to query. Enter your query in the query box, then click Quick (to return results to your browser, limited to 90 seconds) or Submit (to write results into your MyDB).
All these databases can also be queried from within SciServer Compute by using the SciServer API libraries. You do not have to mount anything to query these databases, and the libraries will work from within any container.
Simply import these libraries into your scripts, then make use of them. Information about how to use the libraries, and what commands are available, can be found in the SciServer API libraries documentation. Examples of how to query databases are available in the SciServer Example Notebooks.
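As a minimal sketch of this workflow, the helper below builds a small "select top N" query and sends it to a chosen context through SciServer.CasJobs.executeQuery. The helper names top_rows_sql and preview are hypothetical. The table and context used in the usage note (photoobj in DR16) are taken from the SDSS section of this page, and the SciServer import only succeeds where the libraries are installed, such as inside a SciServer Compute container.

```python
def top_rows_sql(table, n=10, columns="*"):
    """Build a 'select top N' query string for a quick look at a table."""
    n = int(n)
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return f"select top {n} {columns} from {table}"


def preview(table, context, n=10, execute_query=None):
    """Return the first n rows of a table in the given CasJobs context.

    If no query function is supplied, SciServer.CasJobs.executeQuery is used,
    which requires the SciServer libraries to be installed.
    """
    if execute_query is None:
        from SciServer import CasJobs

        execute_query = CasJobs.executeQuery
    return execute_query(top_rows_sql(table, n), context=context)
```

For example, preview("photoobj", "DR16") would send the query "select top 10 * from photoobj" to the DR16 context and return whatever executeQuery returns.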

Indra is a suite of large-volume cosmological N-body simulations. Each of the 384 simulations is computed with the same cosmological parameters and different initial phases, providing excellent statistics of the large-scale features of the distribution of dark matter.
The independent volumes have 1024³ dark matter particles in a box of length 1 Gpc/h, and are all accessible through SciServer Compute containers to all users who join the Cosmology science domain.
A full description of the Indra suite of simulations can be found in a paper by Falck et al. (2021).
The Indra relational database contains:
- Halo catalog tables for every simulation and snapshot
- Spatial3d library to allow efficient selection of halos and particle data within 3D shapes
Indra data are accessed with the indra-tools Python package, which is pre-installed on the Computational Simulations compute image. The indra-tools git repository contains example notebooks showing how to read the binary data, query the halo database tables, compute density fields, etc.
Use of the Indra dataset is open and available to anyone. We ask that scientific publications that make use of Indra cite the Falck et al. (2021) data release paper.
The Sloan Digital Sky Survey (SDSS) is an ongoing project to make a map of the Universe. It has observed images of hundreds of millions of stars and galaxies, and spectra for more than four million. The entire history of data releases from the SDSS can be queried through SciServer Compute by importing and using the SciServer libraries.
The SDSS releases its data in discrete batches through sequentially-numbered Data Releases. The most recent is Data Release 16 (DR16); for new studies with SDSS data, begin with DR16. To enable replication and extension of prior studies, we make all previous data releases (DR1 through DR15) available as well. We also provide two additional datasets: Stripe82 contains all photometric data for the repeat observations of the SDSS supernova survey, while RunsDB contains all photometric data for all SDSS observations, including overlap areas.
To query one of the SDSS databases using CasJobs, select its name from the Context menu just above the query window. The contexts holding SDSS data are DR16, DR15, etc.
To access one of the SDSS databases using SciServer Compute, specify its name as the context in your SciServer.CasJobs.executeQuery(sql, context) or SciServer.CasJobs.submitQuery(sql, context) commands, for example:
SciServer.CasJobs.executeQuery("select top 10 * from photoobj", context="DR16")
Gaia is a European Space Agency mission to find distances and properties of more than one billion stars in our Milky Way Galaxy. SciServer hosts the complete catalog data for Gaia Data Release 2 (Gaia DR2).
To query the Gaia DR2 catalog using CasJobs, select GaiaDR2 from the Context menu just above the query window. To access Gaia DR2 data using SciServer Compute, specify its name as the context in your SciServer.CasJobs.executeQuery(sql, context) or SciServer.CasJobs.submitQuery(sql, context) commands, for example:
SciServer.CasJobs.executeQuery("select top 10 * from gaia_source", context="GaiaDR2")
The Galaxy Evolution Explorer (GALEX) is an ultraviolet space telescope that operated from 2003 to 2012. During that time, it observed hundreds of thousands of galaxies, helping to determine distances and star formation rates throughout the universe. SciServer offers access to all GALEX releases up to and including its final complete dataset, GALEX Release 6 (GR6). The GALEX data releases are referred to in SciServer as GALEXGR6, GALEXGR5, etc.
To query one of the GALEX databases using CasJobs, select its name from the Context menu just above the query window. To access one of the GALEX databases using SciServer Compute, specify its name as the context in your SciServer.CasJobs.executeQuery(sql, context) or SciServer.CasJobs.submitQuery(sql, context) commands, for example:
SciServer.CasJobs.executeQuery("select top 10 ra, dec from acsData", context="GalexGR6")
The Two Micron All-Sky Survey (2MASS) is an all-sky survey at infrared wavelengths. SciServer offers access to the Point Source Catalog of the 2MASS All-Sky Data Release, enabling studies of populations of stars and other resolved objects in the Milky Way.
To query this catalog using CasJobs, select 2MASS from the Context menu just above the query window. To access it using SciServer Compute, specify its name as the context in your SciServer.CasJobs.executeQuery(sql, context) or SciServer.CasJobs.submitQuery(sql, context) commands, for example:
SciServer.CasJobs.executeQuery("select top 10 * from PhotoObjAll", context="2MASS")
The Two-degree-Field (2dF) Galaxy Redshift Survey is a spectroscopic survey at visible wavelengths, with the goal of understanding the large-scale structure of galaxies.
To query this catalog using CasJobs, select 2DF from the Context menu just above the query window. To access it using SciServer Compute, specify its name as the context in your SciServer.CasJobs.executeQuery(sql, context) or SciServer.CasJobs.submitQuery(sql, context) commands, for example:
SciServer.CasJobs.executeQuery("select top 10 * from PhotoObjAll", context="2DF")
The Faint Images of the Radio Sky at Twenty cm (FIRST) is a survey of the sky at radio wavelengths.
To query this catalog using CasJobs, select FIRST from the Context menu just above the query window. To access it using SciServer Compute, specify its name as the context in your SciServer.CasJobs.executeQuery(sql, context) or SciServer.CasJobs.submitQuery(sql, context) commands, for example:
SciServer.CasJobs.executeQuery("select top 10 * from PhotoObjAll", context="FIRST")
The Minor Planet Center Orbit (MPCORB) database contains orbital parameters for more than 700,000 asteroids from the International Astronomical Union’s Minor Planet Center. The dataset available through SciServer is a snapshot of the database as of August 17, 2017.
To query this catalog using CasJobs, select MPCORB from the Context menu just above the query window. To access it using SciServer Compute, specify its name as the context in your SciServer.CasJobs.executeQuery(sql, context) or SciServer.CasJobs.submitQuery(sql, context) commands, for example:
SciServer.CasJobs.executeQuery("select top 10 * from mpcorb", context="MPCORB")
SciServer also offers astronomical cross-matching services through the SkyQuery web service. A list of all datasets you can cross-match among is on the SkyQuery website. You can use SkyQuery to match objects between these big datasets or against data that you upload. See the SkyQuery website for more information.