This notebook shows you how to use SciServer Compute to communicate with the other components of SciServer: querying databases with CasJobs, retrieving image cutouts from SkyServer, and storing files in SciDrive.
All SciServer tools (CasJobs, SciDrive, iPython Notebooks, etc.) use the same single-sign-on system, so you only need to remember one password.
When you open your Docker container from the SciServer Compute dashboard page, your current login token is written to the file /home/idies/keystone.token. You can also find your token on your Compute dashboard, under your username.
The code block below reads the token and stores it in a local variable, then prints its value along with your login name.
Note: since your token may expire, you may need to refresh it as you work. To do so, refresh the token on the Dashboard and then rerun the next code block.
# This code block defines your token and appends it to sys.argv so that
# SciServer libraries can use it for the length of your current session.
#
# This will usually be the first code block in any script you write.
with open('/home/idies/keystone.token', 'r') as f:
token = f.read().rstrip('\n')
# async queries require the token to be passed via the --ident command-line argument
import sys
sys.argv.append("--ident="+token)
print("Your current token is: " + token)
The SciServer team has written a number of Python libraries, generally prefixed with SciServer, that help with common tasks. As with all Python libraries, they must be imported before they can be used.
The next code block imports those libraries, together with some standard Python libraries that are helpful for scientific analysis; the code block after it applies some settings you may find useful.
# Step 2a: Import Python libraries to work with SciServer
import SciServer.CasJobs as CasJobs # query with CasJobs
import SciServer.SciDrive # read/write to/from SciDrive
import SciServer.Config # SciServer configuration (service URLs; used in step 7a below)
# Step 2b: Import other libraries for use in this notebook.
# All of these are included in the default Docker image; others can often be
# installed from a terminal or with a '!pip install ...' run from within the notebook.
import numpy as np # standard Python lib for math ops
import pandas # data manipulation package
import matplotlib.pyplot as plt # plotting package
import skimage.io # image processing library
import urllib # accessing resources through remote URLs
import json # work with JSON files
# Step 2c: Apply some special settings to the imported libraries
# ensure columns get written completely in notebook
pandas.set_option('display.max_colwidth', -1)
# do *not* show python warnings
import warnings
warnings.filterwarnings('ignore')
The next code block searches the SDSS Data Release 12 database via the CasJobs REST API. The query completes quickly, so it uses CasJobs quick mode.
CasJobs also has an asynchronous mode, which submits a job to a queue and stores the results in a table in your MyDB. If your results are very large, you can instruct it to store them in MyScratchDB instead. An example of an asynchronous query appears at the end of this notebook.
Run the code block below to query DR12. Try changing some of the query parameters in step 3a to see the effect on the results returned in step 3d.
Documentation on the SciServer Python libraries can be found at our documentation site at:
http://scitest02.pha.jhu.edu/python-docs/
The actual source code is accessible on GitHub at
https://github.com/sciserver/SciScript-Python/tree/master/SciServer
# Step 3a: Find objects in the Sloan Digital Sky Survey's Data Release 12.
# Queries the Sloan Digital Sky Survey's Data Release 12.
# For the database schema and documentation see http://skyserver.sdss.org
#
# This query finds "a 4x4 grid of nice-looking galaxies":
# galaxies in the SDSS database that have a spectrum
# and have a size (petror90_r) larger than 10 arcsec.
#
# First, store the query in an object called "query"
query="""
SELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r
FROM galaxy AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE p.u BETWEEN 0 AND 19.6
AND p.g BETWEEN 0 AND 17
AND p.petror90_r > 10
"""
# Step 3b: Send the query to CasJobs using the SciServer.CasJobs.executeQuery method.
# The method takes the query and the 'context' (i.e. the target database) as parameters, and optionally the token.
# This example uses DR12 as context - the code makes a connection
# to the DR12 database, then runs the query in quick mode.
# When the query succeeds, an "OK" message prints below.
queryResponse = CasJobs.executeQuery(query, "dr12",token=token)
# Step 3c: store results in a pandas.DataFrame for easy analysis.
#
# CasJobs returns the results as a CSV string, stored in the "queryResponse" variable.
# Now parse the results into a DataFrame object using the pandas library.
# We set the first column (objId) as the index column, for slightly technical reasons explained below.
# pandas.read_csv documentation:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
gals = pandas.read_csv(queryResponse,index_col='objId')
# Step 3d: Show the table of results
gals
SciServer Python modules documentation: http://scitest02.pha.jhu.edu/python-docs/
Schema of SDSS Data Release 12: http://skyserver.sdss.org/dr12/en/help/browser/browser.aspx
Schema of SDSS Data Release 10: http://skyserver.sdss.org/dr10/en/help/browser/browser.aspx
Now that we have run the query and stored the results, we can start analyzing the results.
Start by making a simple plot of positions, using the default query from step 3 (select top 16... AND p.petror90_r > 10).
plt.scatter(gals['ra'], gals['dec'])
plt.show()
SciServer Python modules documentation: http://scitest02.pha.jhu.edu/python-docs/
Schema of SDSS Data Release 12: http://skyserver.sdss.org/dr12/en/help/browser/browser.aspx
Documentation for matplotlib module: http://matplotlib.org/contents.html
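As a small extension (plain matplotlib, nothing SciServer-specific), the same scatter plot can be given axis labels and a title:
# positions plot again, now with labeled axes (degrees) and a title
plt.scatter(gals['ra'], gals['dec'])
plt.xlabel('ra [deg]')
plt.ylabel('dec [deg]')
plt.title('Positions of the selected galaxies')
plt.show()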
The next code block saves the data table "gals" as an HDF5 file and as a CSV file.
To see these files, go back to the folder in your Jupyter dashboard from which you opened this notebook. You should see your files there. Click on the file names to preview.
# store result as HDF5 file
h5store = pandas.HDFStore('GalaxyThumbSample.h5')
h5store['galaxies']=gals
h5store.close()
# store result as CSV file
gals.to_csv('GalaxyThumbSample.csv')
Documentation on the Pandas package's DataFrame.to_csv method:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
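To verify the saved files, or to load them back in a later session, pandas can read both formats back into DataFrames. A minimal sketch, using the file names from the code block above:
# re-load the saved results; both DataFrames should match "gals"
gals_csv = pandas.read_csv('GalaxyThumbSample.csv', index_col='objId')
with pandas.HDFStore('GalaxyThumbSample.h5') as h5store:
    gals_h5 = h5store['galaxies']
print(gals_csv.shape, gals_h5.shape)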
SkyServer, another SciServer component, has a service that will produce a color JPG image cutout of certain dimensions around a specified position, useful for creating thumbnails.
The service creates the thumbnail using a pre-defined image pyramid. For a single image, you can construct the URL of the service using your query results, then use the skimage package to call the URL. To get all thumbnails in your query result, you can iterate using a loop.
The code blocks below give an example of how to retrieve JPG thumbnails of galaxies in DR12: first a minimal single-image example, then a loop over all galaxies in the query result. We need to create a URL for accessing the service and set its parameters appropriately to produce nice thumbnails.
Note that the SQL query above was designed to return positions of some nice-looking galaxies.
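As a warm-up, here is a minimal single-image sketch: it fetches and displays the cutout for just the first galaxy in "gals" (the parameter values mirror those used in the loop below).
# single-image example: cutout for the first galaxy in the query result
width, height, pixelsize = 200, 200, 0.396
gal = gals.iloc[0]
scale = 2*gal['petror90_r']/pixelsize/width   # image ~2x the size of the galaxy
url = "http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx?ra="+str(gal['ra'])
url += "&dec="+str(gal['dec'])+"&scale="+str(scale)+"&width="+str(width)+"&height="+str(height)
img = skimage.io.imread(url)                  # fetch the JPG directly from the URL
plt.imshow(img)
plt.title(gals.index[0])                      # objId of the galaxy
plt.show()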
width=200
height=200
pixelsize=0.396
plt.figure(figsize=(15, 15))
subPlotNum = 1

for index,gal in gals.iterrows():
    # the 'scale' parameter is set so that the image will be about 2x the size of the galaxy
    scale=2*gal['petror90_r']/pixelsize/width
    url="http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx?ra="+str(gal['ra'])
    url+="&dec="+str(gal['dec'])+"&scale="+str(scale)+"&width="+str(width)
    url+="&height="+str(height)
    img=skimage.io.imread(url)
    plt.subplot(4,4,subPlotNum)
    subPlotNum += 1
    plt.imshow(img)
    # show the object identifier (objId) above the image
    plt.title(index)

plt.show()
SciDrive is a new component of SciServer. It allows you to save query results as flat files in a Dropbox-like interface you can access anywhere.
The version of SciDrive this notebook connects to is not the same as the pre-production version you may have used before. Use the link below to access this test version of SciDrive. You should have no containers in this SciDrive yet.
Check your test SciDrive at:
http://scitest09.pha.jhu.edu/scidrive/scidrive.html
If the above link does not show a proper view of SciDrive (folders, etc.), please let us know, and do not run the rest of the code in this notebook until we have investigated.
The three code blocks below work together to write the thumbnails you generated in step 6 into your test SciDrive.
# Step 7a: a function for generating a public URL for resources stored in SciDrive
# TODO: this should eventually become part of the SciServer.SciDrive library
def scidrivePublicURL(path):
    req = urllib.request.Request(url=SciServer.Config.SciDriveHost+'/vospace-2.0/1/media/sandbox/'+path, method='GET')
    req.add_header('X-Auth-Token', token)
    req.add_header('Content-Type', 'application/xml')
    res = urllib.request.urlopen(req)
    jsonResponse = json.loads(res.read().decode())
    return jsonResponse['url']
# Step 7b: create a container (~folder) in your SciDrive to hold the thumbnail images
container = 'thumbnails'
# IMPORTANT: run the next line only if the container does not yet exist. If you have
# already created the container, comment out the next line.
# Note that the token must be provided: it lets the system connect you to the proper SciDrive root folder.
SciServer.SciDrive.createContainer(container,token=token)
# Step 7c: Write thumbnails to SciDrive. You will see a confirmation message below
# for each thumbnail.
width=200
height=200
pixelsize=0.396

# For later use, determine a publicly accessible URL for each thumbnail and store these in a separate list.
puburls=[]

for index,gal in gals.iterrows():
    scale=2*gal['petror90_r']/pixelsize/width
    url="http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx?ra="+str(gal['ra'])
    url+="&dec="+str(gal['dec'])+"&scale="+str(scale)+"&width="+str(width)
    url+="&height="+str(height)
    req = urllib.request.Request(url=url, method='GET')
    res = urllib.request.urlopen(req)
    data = res.read()
    scidrive_name = container+"/new_"+str(index)+".jpg"
    # upload the file to the container
    SciServer.SciDrive.upload(scidrive_name, data, token=token)
    puburls.append(scidrivePublicURL(scidrive_name))

# add the column of public URLs to the original pandas.DataFrame
gals['pubURL']=puburls
Check your test SciDrive folder again. You should see a container called "thumbnails".
Double-click on the name to open the container. You should see the thumbnails you just saved!
Your test SciDrive URL:
http://scitest09.pha.jhu.edu/scidrive/scidrive.html
We'll store the results of our work in a table in your CasJobs MyDB: the result of your original query, plus an extra column containing the public URL of each galaxy's thumbnail.
Check the state before: http://scitest02.pha.jhu.edu/CasJobs/MyDB.aspx
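You can also check from within the notebook which tables your MyDB already contains, which is helpful for the "skip this step if the table already exists" note below. A minimal sketch, assuming MyDB exposes the standard SQL Server INFORMATION_SCHEMA views and that executeQuery returns a CSV stream as in step 3:
# optional: list the tables currently in your MyDB
tables = pandas.read_csv(CasJobs.executeQuery(
    "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES", "MyDB", token=token))
tables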
# add column with public urls to the galaxies table ...
gals['pubURL']=puburls
# show the table
gals
# To write to your MyDB, first create the table.
# For technical reasons the column names must be exactly the same as the columns in the DataFrame.
# Note: skip this step if the table already exists.
ddl = 'CREATE TABLE GalaxyThumbs(objId bigint, ra real, dec real, petror90_r real, pubURL varchar(128))'
response = SciServer.CasJobs.executeQuery(ddl, "MyDB", token=token)
# if no "200 OK" is printed as result, something has gone wrong

# Now upload the data directly from the DataFrame into the table created above
response = SciServer.CasJobs.uploadPandasDataFrameToTable(gals, "GalaxyThumbs", token=token)
Check the state of your MyDB after: http://scitest02.pha.jhu.edu/CasJobs/MyDB.aspx
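You can also verify the upload from within the notebook by querying the new table, using the same pattern as step 3. This assumes the table name GalaxyThumbs used in the block above:
# read back a few rows of the uploaded table directly from MyDB
check = CasJobs.executeQuery("SELECT TOP 5 * FROM GalaxyThumbs", "MyDB", token=token)
pandas.read_csv(check, index_col='objId')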
Instead of executing a query directly, a query job can be submitted. This returns a jobId, which can be used to request the status of the job.
Important: for now the result of the query MUST be written explicitly to a table.
# async query example. Note the SELECT ... INTO ... pattern
query="""
SELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r
into MyDB.intro1query
FROM galaxy AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE p.u BETWEEN 0 AND 19.6
AND p.g BETWEEN 0 AND 17
AND p.petror90_r > 10
"""
jobId=CasJobs.submitJob(query, context = "DR12")
# retrieve the status of the job
# getJobStatus returns a JSON string;
# the job is complete if the Status attribute is in (3, 4, 5)
CasJobs.getJobStatus(jobId)
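Rather than checking the status by hand, you can poll getJobStatus until the job finishes and then read the result table back from MyDB. A minimal sketch, assuming (as noted above) that getJobStatus returns a JSON string and that a Status value of 3, 4 or 5 means the job has stopped running:
import json
import time

def job_finished(jobId):
    # getJobStatus returns a JSON string (see above); parse it and check
    # whether the Status attribute indicates the job is no longer running
    status = CasJobs.getJobStatus(jobId)
    if isinstance(status, str):
        status = json.loads(status)
    return int(status['Status']) in (3, 4, 5)

while not job_finished(jobId):
    time.sleep(5)   # wait a few seconds between polls

# once finished, read the table written by the SELECT ... INTO query
result = CasJobs.executeQuery("SELECT * FROM intro1query", "MyDB", token=token)
pandas.read_csv(result, index_col='objId')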