Notebook 1

Query SDSS Data Release 12, save thumbnails to SciDrive

This notebook shows you how to use SciServer compute to communicate with the other components of SciServer. You will learn how to:

  1. Single-sign-on authentication through Keystone tokens
  2. import special purpose libraries written for SciServer actions
  3. Querying relational databases registered in CasJobs (SciServer's database frontend and batch query engine)
  4. Manipulating query results in python code (visualization)
  5. Storing results on local scratch disk as an HDF5 file for later reuse
  6. Based on query result retrieve images from persistents store, available to notebook because the docker container was created with a link to the corresponding volume container and show the images on the screen.
  7. Write images to the sharable, dropbox-like SciDrive. There they can be found through the UI interface and shared with colleagues.
  8. Store public URLs to thimbnails together with other data to table in one's private database, MyDB

1. Sign in to SciServer

All SciServer tools (CasJobs, SciDrive, iPython Notebooks, etc.) use the same single-sign-on system, so you only need to remember one password.

When you open your Docker container from the SciServer Compute dashboard page, the current token will be written in the file /home/idies/kestone.token. You can find your token on your Compute dashboard, under your username.

The code block below reads the token and stores it in a local variable, then prints its value along with your login name.

Note: since your token may expire, you should sometimes refresh the token as you work. You can do this by refreshing the token on the Dashboard, and then rerunnning the next block of code.

In [7]:
# This code block defined your token and makes it available as a 
#   system variable for the length of your current session.
# 
# This will usually be the first code block in any script you write.
with open('/home/idies/keystone.token', 'r') as f:
    token = f.read().rstrip('\n')
# async queries require token to be in --ident system variable
import sys
sys.argv.append("--ident="+token)
print("Your current token is"+token)
Your current token ise1833d6cc07641958af85ac572ad6f53

2. Import SciServer libraries

The SciServer team has written a number of libraries, generally prefixed by SciServer, that assist in various functions. As with all Python libraries, they must be actively imported before being used.

The next code block imports those, together with some standard Python libraries helpful for scientific analysis. The code block below applies some settings you may find helpful.

In [3]:
# Step 2a: Import Python libraries to work with SciServer

import SciServer.CasJobs as CasJobs # query with CasJobs
import SciServer.SciDrive           # read/write to/from SciDrive
In [4]:
# step 2b import other libraries for use in this notebook.
# all of these are included in the default Docker image
# but others can often be downloaded through a terminal or a '!pip install ...' run from 
# within the notebook
import numpy as np                  # standard Python lib for math ops
import pandas                       # data manipulation package
import matplotlib.pyplot as plt     # another graphing package
import skimage.io                   # image processing library
import urllib                       # accessing resources thorugh remote URLs
import json                         # work with JSON files
In [5]:
# Step 2b: Apply some special settings to the imported libraries
# ensure columns get written completely in notebook
pandas.set_option('display.max_colwidth', -1)
# do *not* show python warnings 
import warnings
warnings.filterwarnings('ignore')

3. Query an astronomy database (SDSS DR12)

The next code block searches the SDSS Data Release 12 database via the CasJobs REST API. The query completes quickly, so it uses CasJobs quick mode.

CasJobs also has an asynchronous mode, which will submit job to a queue and will store the results in a table in your MyDB. If your results are very large, you may order it to store the results in MyScratchDB instead.

Run the code block below to query DR12. Try changing some of the query parameters in step 3a to see the effect on the results returned in step 3d.

Documentation on the SciServer Python libraries can be found at our documentation site at:
http://scitest02.pha.jhu.edu/python-docs/

The actual source code is accessible on GitHub at
https://github.com/sciserver/SciScript-Python/tree/master/SciServer

In [22]:
# Step 3a: Find objects in the Sloan Digital Sky Survey's Data Release 12.
# Queries the Sloan Digital Sky Serveys' Data Release 12.
# For the database schema and documentation see http://skyserver.sdss.org
#
# This query finds "a 4x4 grid of nice-looking galaxies": 
#   galaxies in the SDSS database that have a spectrum 
#   and have a size (petror90_r) larger than 10 arcsec.
# 
# First, store the query in an object called "query"
query="""
SELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r
  FROM galaxy AS p
   JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE p.u BETWEEN 0 AND 19.6
  AND p.g BETWEEN 0 AND 17
  AND p.petror90_r > 10
"""
In [23]:
# Step 3b: Send the query to CasJobs using the SciServer.CasJobs.executeQuery method.
# The method takes the query and the 'context' (= target database) as parameters, and possibly the token.
# This example uses DR12 as context - the code makes a connection 
#   to the DR12 database, then runs the query in quick mode.
#   When the query succeeds, an "OK" message prints below.
queryResponse = CasJobs.executeQuery(query, "dr12",token=token)
executeQuery POST response:  200 OK
In [24]:
# Step 3c: store results in a pandas.DataFrame for easy analysis.
#
# CasJobs returns the results as a CSV string, stored in the "queryResponse" variable.
# Now parse the results into a DataFrame object using the pandas library.
# We identify the first column asn an index column, which is for slightly technical reasons explained below.
#    pandas.read_csv documentation: 
#    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
gals = pandas.read_csv(queryResponse,index_col='objId')
In [25]:
# Step 3d: Show the table of results
gals
Out[25]:
ra dec petror90_r
objId
1237645941824356443 40.285690 -0.714957 14.72862
1237645943978393694 55.123625 0.872487 18.85110
1237645943979114622 56.847420 0.875488 15.66479
1237645943979311221 57.248385 0.925979 14.60275
1237646588246688297 80.135126 1.186679 11.14294
1237648672921485632 243.708876 -0.915653 15.01578
1237648672922337955 245.700633 -0.844301 10.51866
1237648672922468973 246.015172 -0.902869 10.80825
1237648673459077169 245.367353 -0.457074 11.18379
1237648673459667002 246.695626 -0.510111 10.27149
1237648673459667234 246.782081 -0.492432 11.98310
1237648673460585255 248.894520 -0.452329 11.48170
1237648673971437623 189.522249 -0.027031 21.94494
1237648673992671592 237.945144 -0.105170 13.36310
1237648673994965546 243.236782 -0.096251 26.27760
1237648673995686549 244.875590 -0.038526 12.17980

Exercise A: Querying CasJobs

Try changing the query in the following ways.

  1. Return 8 matching objects insetad of 16.
  2. Select only nearby galaxies (redshift < 0.05).
  3. Select only galaxies likely to be spirals (with u-r color >= 2.22)
  4. Search for galaxies in SDSS Data Release 10 instead of DR12.

Some helpful resources

SciServer Python modules documentation: http://scitest02.pha.jhu.edu/python-docs/

Schema of SDSS Data Release 12: http://skyserver.sdss.org/dr12/en/help/browser/browser.aspx

Schema of SDSS Data Release 10: http://skyserver.sdss.org/dr10/en/help/browser/browser.aspx

4. Make a simple Plot

Now that we have run the query and stored the results, we can start analyzing the results.

Start by making a simple plot of positions, using the default query from step 3 (select top 16... AND p.petror90_r > 10).

In [26]:
plt.scatter(gals['ra'], gals['dec'])
plt.show() 

Exercise B: Plotting data

Try changing the code block below in the following ways. What changes do you notice in the plots?

  1. Plot the first 100 matching objects instead of the first 16.
  2. Plot galaxies out to dec = +/-5 (change the display axes as well)
  3. Plot a color-magnitude diagram with g on the x-axis and g-r on the y-axis. (Needs extra columns to be returned in the SQL query).

Some helpful resources

SciServer Python modules documentation: http://scitest02.pha.jhu.edu/python-docs/

Schema of SDSS Data Release 12: http://skyserver.sdss.org/dr12/en/help/browser/browser.aspx

Documentation for matplotlib module: http://matplotlib.org/contents.html

5. Store results on server for later use

The next code block saves the data table "gals" as an HD5 file and as a CSV file.

To see these files, go back to the folder in your Jupyter dashboard from which you opened this notebook. You should see your files there. Click on the file names to preview.

In [27]:
# store result as HDF5 file 
h5store = pandas.HDFStore('GalaxyThumbSample.h5')
h5store['galaxies']=gals
h5store.close()

# store result as CSV file
gals.to_csv('GalaxyThumbSample.csv')

Exercise C: Saving files

Try some alternatives:

  1. Change the name of the files
  2. Save the file as tab-separated variable (.tsv) instead of the default comma-separated variable (.csv).

Documentation on the Pandas package's DataFrame.to_csv method:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

6. Retrieve thumbnail cutouts of galaxies and show them on screen

SkyServer, another SciServer component, has a service that will produce a color JPG image cutout of certain dimensions around a specified position, useful for creating thumbnails.

The service creates the thumbnail using a pre-defined image pyramid. For a single image, you can construct the URL of the service using your query results, then use the skimage package to call the URL. To get all thumbnails in your query result, you can iterate using a loop.

The code block below gives an example of how to retrieve JPG thumbnails of galaxies in DR12. We need to create a URL for accessing the serice and set the parameters appropriately to produce nice thumbnails.

Note, the SQL query aboove was designed to produce positions of some nice looking galaxies.

In [28]:
width=200
height=200
pixelsize=0.396
plt.figure(figsize=(15, 15))
subPlotNum = 1

for index,gal in gals.iterrows():
    # the 'scale' parameter is set so that the image will be about 2x the size of the galaxy
    scale=2*gal['petror90_r']/pixelsize/width
    url="http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx?ra="+str(gal['ra'])
    url+="&dec="+str(gal['dec'])+"&scale="""+str(scale)+"&width="+str(width)
    url+="&height="+str(height)
    img=skimage.io.imread(url)
    plt.subplot(4,4,subPlotNum)
    subPlotNum += 1
    plt.imshow(img)
    # show the object identifier (objId) above the image.
    plt.title(index)