Educational Datasets
As a free online science platform that anyone can use, SciServer is an ideal resource for data science education. Students can jump right into data visualization and analysis with no need to download data, install software, or configure environments.
We have a number of student learning activities available, described on our Education page. In addition, we offer datasets tailored for education activities, with example notebooks that let students get up and running quickly.
Each dataset is listed in its own section below, which describes what the dataset contains and how to access it.
Baseball History
What it is: Play-by-play data for the history of Major League Baseball
How to access: Create a SciServer Compute container and mount the Getting Started Data Volume. The baseball dataset is in the datasets/baseball folder.
SciServer’s Getting Started public data volume contains example notebooks to help you learn how to use our science platform in your research and teaching. Now, we also feature datasets and accompanying notebooks designed to introduce the concepts and skills of data science. Our first example dataset comes from one of the most visibly data-intensive activities in American culture: baseball.
Our baseball dataset in the Getting Started data volume documents the history of Major League Baseball events. Every at-bat since 1974 is included, along with some from as far back as 1915. The dataset consists of CSV files, one per season. Each row represents one event: an at-bat, or a similar event such as a stolen base.
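Because the dataset is one CSV file per season with one row per event, a quick first look is to count the rows in each file. The following is a minimal sketch rather than part of the provided notebooks: the function name count_events_per_file is hypothetical, and the code assumes that each file begins with a single header row and that the folder passed in is wherever datasets/baseball appears inside your container.

```python
import csv
from pathlib import Path


def count_events_per_file(folder):
    """Return {file name: number of data rows} for every CSV file in `folder`.

    Assumption: each file starts with one header row. If the files have no
    header, drop the `next(reader, None)` call below.
    """
    counts = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        # errors="replace" keeps the count going if a file contains
        # characters that are not valid UTF-8 (an assumption about encoding).
        with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the assumed header row
            counts[csv_path.name] = sum(1 for _ in reader)
    return counts
```

Calling count_events_per_file with the path to the mounted datasets/baseball folder returns a dictionary keyed by file name, which gives a rough sense of how many events each season contains before loading any one season in full.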
The Getting Started public data volume also includes one example notebook, with more to follow. It demonstrates one of many possible use cases: making the plot shown here of which section the ball travels to on outs in the 2022 season. Example notebooks working with the baseball dataset can be found in getting_started/Example-Notebooks/baseball/.
Events are transcribed from official records by Retrosheet volunteers. For the structure of the files, see Retrosheet’s Description of Event Files.
Data Usage
Recipients of Retrosheet data are free to make any desired use of the information, including (but not limited to) selling it, giving it away, or producing a commercial product based upon the data. Retrosheet has one requirement for any such transfer of data or product development, which is that the following statement must appear prominently:
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711.