Data Warehouse Final Report

CSc 177 Term Project Cover Page
Student(s) Name: Chris Diel, Dan Baptista, Joseph Willet
Grade: ______
Title of the project: Data Warehouse: Baseball Team Analytics
1. objective statement of the term project
2. background information
3. design principle of your data mining system/ scope of study
4. implementation issues and solutions/ survey results/ diagrams/ tables
5. summary of learning experience such as experiments and readings
6. references (authors, title, publishing source data, date of publication, URL) and you should quote each reference in your report text.
7. appendix (optional) containing a set of supporting material such as examples, sample demo sessions, and any information that reflects your effort regarding the project.

Data Warehouse Objective: The objective of our project was to use data collected from the web to build a specialized data warehouse, or data mart, that lets a user query overall team statistics by year. Given the amount of baseball statistics available, we limited our scope to overall team data from the last ten years. We also aimed to organize the data so that the user can view the three main types of statistics: batting, pitching, and fielding. Additionally, we wanted to set up dimensions that allow the statistics to be filtered by team, league, or division.

Background Info: There is an abundance of sports data available, and more is produced every year. This makes sports data a good topic for data warehousing, since the scope is essentially endless. Furthermore, in recent years sports such as baseball, football, and basketball have become huge sources of income for teams and the other companies involved. The combination of large amounts of data and large investments has created a new job market for data analysts and computer scientists. With so much time and money invested in teams and stadiums, team owners now look to data analysts to help answer questions such as: Which players should we hire or fire? Why are our ticket sales lacking? Where is our team struggling or doing well? These are just a few of the questions that team owners and investors may ask. Data warehousing will continue to be a part of sports, and of anything else involving competition. With all this in mind, the sports industry as a whole offers a wide range of data warehousing applications. These ideas, together with Joe's history as a baseball player and fan, were our motivations for this project. However, with only part of a semester to work on it, our project reflects only an extremely small subset of this topic.
As for languages, our current senior project was also written on a LAMP stack, and since ECS offers Apache on a Linux server, it made sense to use PHP and MySQL to build a LAMP stack for this project as well.

Design Principle: With such a huge amount of data available, we had to be careful not to let our scope get too large. We did this by limiting our data to overall team statistics and looking only at the last ten years. Our system allows the user to query a database of team batting, pitching, and fielding statistics, with filters by team, league, and division. The target users are data analysts, or even baseball fans, who are interested in answering questions about overall team performance. Some example questions we aimed to answer:
What were a specific team's batting stats over a span of years?
What were the American League pitching stats last year?
What were the overall western division stats in a given year?

Implementation: Data collection/preprocessing: Our data source was baseball-reference.com, a website that houses baseball stats dating back to the 1870s. The site offers various views and breakdowns of stats and also has an option to export the data in csv format. This made the data collection fairly simple: we were able to search stats by team and year and export them. The csv files we collected were fairly clean, but the data preprocessing stage was still more difficult and tedious than expected. Each file had several lines of data that were useless for our purposes. We also had to add two columns of data in order to make the data fit our schema. Another unexpected issue was that multiple team names had changed over the last ten years. This was solved fairly simply by adding an extra row to our team dimension table.

Database Design: For our database design we chose a star schema. With so many columns of statistics, we found it difficult to visualize the structure of our schema.
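To make the star schema easier to picture, here is a minimal sketch of one fact table joined to a team dimension and a league dimension, with a roll-up from team level to league or division level. It uses Python's built-in sqlite3 module as a stand-in for the MySQL database used in the project. The table names, the 'aWest' code, and the two placeholder teams with their round numbers are assumptions made for illustration; only the ARI row is taken from the sample data in this report.

```python
import sqlite3

# Hypothetical star schema: one fact table plus team and league dimensions.
# Only two batting measures (R, HR) are kept so the sketch stays short.
SCHEMA = """
CREATE TABLE league_dim (
    divID    TEXT PRIMARY KEY,  -- e.g. 'nWest'
    league   TEXT NOT NULL,     -- 'NL' or 'AL'
    division TEXT NOT NULL      -- e.g. 'West'
);
CREATE TABLE team_dim (
    teamID   TEXT PRIMARY KEY,
    teamName TEXT NOT NULL
);
CREATE TABLE batting_fact (
    yearID INTEGER NOT NULL,
    teamID TEXT NOT NULL REFERENCES team_dim(teamID),
    divID  TEXT NOT NULL REFERENCES league_dim(divID),
    R      INTEGER NOT NULL,
    HR     INTEGER NOT NULL,
    PRIMARY KEY (yearID, teamID)  -- one row per team per season
);
"""


def build_demo_db():
    """Create an in-memory database with a few demonstration rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO league_dim VALUES (?, ?, ?)",
        [("nWest", "NL", "West"), ("aWest", "AL", "West")],  # 'aWest' is assumed
    )
    conn.executemany(
        "INSERT INTO team_dim VALUES (?, ?)",
        [("ARI", "Arizona Diamondbacks"),
         ("XXX", "Placeholder NL team"),
         ("YYY", "Placeholder AL team")],
    )
    conn.executemany(
        "INSERT INTO batting_fact VALUES (?, ?, ?, ?, ?)",
        [(2013, "ARI", "nWest", 685, 130),   # R and HR from the report's sample row
         (2013, "XXX", "nWest", 100, 10),    # invented placeholder values
         (2013, "YYY", "aWest", 200, 20)],   # invented placeholder values
    )
    return conn


def batting_totals(conn, year, level="league"):
    """Sum runs and home runs for one season, grouped by league or division."""
    # The grouping columns come from this fixed mapping, never from user input.
    group_cols = {"league": "l.league", "division": "l.league, l.division"}[level]
    sql = f"""
        SELECT {group_cols}, SUM(f.R), SUM(f.HR)
        FROM batting_fact AS f
        JOIN league_dim AS l ON l.divID = f.divID
        WHERE f.yearID = ?
        GROUP BY {group_cols}
        ORDER BY {group_cols}
    """
    return conn.execute(sql, (year,)).fetchall()
```

Calling batting_totals(conn, 2013) rolls the team rows up to league level, and passing level="division" drills down one step. Both go through the league dimension, which is the only join the fact table needs for this kind of question.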
We decided on three fact tables, one for each type of statistic: batting, fielding, and pitching. Since the granularity of our data was only at the yearly level, there was little purpose in having a time dimension in our schema. Instead we chose to use a team dimension and a league dimension. The league dimension proved useful in that it allowed us to drill down or roll up by division or league. The team dimension, however, was only used for filtering by team name, which made it somewhat unnecessary since we never used its extra fields.

With our data processed and our schema drawn out, the next step was to actually create our database. Having our csv files processed made this step rather trivial. First we had to find a server to house our database. For this we requested space on the ECS servers by filling out a group request form at http://www.ecs.csus.edu/. The following day our request was approved and we were ready to populate our database. Using the tools at http://www.convertcsv.com/csv-to-sql.htm we were able to take our processed csv files and generate the MySQL CREATE and INSERT commands for our database. A diagram of the schema and sample generated MySQL follow.

Star Schema Diagram: [figure not reproduced]

Example CREATE command:
    CREATE TABLE mytable(
       yearID INTEGER NOT NULL
      ,teamID VARCHAR(3) NOT NULL
      ,divID  VARCHAR(5) NOT NULL
      ,AB     INTEGER NOT NULL
      ,R      INTEGER NOT NULL
      ,H      INTEGER NOT NULL
      ,HR     INTEGER NOT NULL
      ,RBI    INTEGER NOT NULL
      ,SO     INTEGER NOT NULL
      ,BA     NUMERIC(5,3) NOT NULL
      ,OBP    NUMERIC(5,3) NOT NULL
      ,SLG    NUMERIC(5,3) NOT NULL
      ,PRIMARY KEY (yearID, teamID)  -- one row per team per season
    );

Example INSERT command:
    INSERT INTO mytable(yearID,teamID,divID,AB,R,H,HR,RBI,SO,BA,OBP,SLG)
    VALUES (2013,'ARI','nWest',5676,685,1468,130,647,1142,0.259,0.323,0.391);

User interface design: The interface for our data mart is a simple HTML form that lets the user select query fields. The field options include the three types of stats (batting, pitching, and fielding), which correspond to our three fact tables. Within each type the user can then filter by team, league, and division values, which are attributes from our team and league dimension tables.

Image of HTML interface: [figure not reproduced]

When the user clicks submit, the form passes the selected attributes to a PHP script. The script connects to our database and uses a MySQL SELECT command to pull rows from the database. Finally, it uses more HTML to format the output in a table as shown in the image above. Our data mart can be accessed at: http://athena.ecs.csus.edu/~willettj/dm.php

Learning Experience: Overall, this project was a good introduction to creating a data mart. It gave our team a chance to gain experience in three major components of data warehousing: data preprocessing, database design, and implementation. Throughout the process we also got to refresh our HTML skills and learn a little PHP.
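The step in which the PHP script turns the submitted form fields into a SELECT can be sketched as follows. The sketch is written in Python with the built-in sqlite3 module rather than PHP and MySQL, and every table and column name in it is hypothetical, since the report does not list the real ones. It illustrates one possible design, not the project's actual code: table and column names are looked up in fixed dictionaries, and only the value typed or selected by the user is passed as a bound parameter.

```python
import sqlite3

# Hypothetical names for the three fact tables and the filterable attributes.
FACT_TABLES = {
    "batting": "batting_fact",
    "pitching": "pitching_fact",
    "fielding": "fielding_fact",
}
FILTER_COLUMNS = {
    "team": "f.teamID",
    "league": "l.league",
    "division": "l.division",
}


def build_query(stat_type, filter_by=None, value=None):
    """Return (sql, params) for one form submission."""
    if stat_type not in FACT_TABLES:
        raise ValueError(f"unknown stat type: {stat_type!r}")
    sql = (
        f"SELECT f.* FROM {FACT_TABLES[stat_type]} AS f "
        "JOIN league_dim AS l ON l.divID = f.divID"
    )
    params = []
    if filter_by is not None:
        if filter_by not in FILTER_COLUMNS:
            raise ValueError(f"unknown filter: {filter_by!r}")
        # The column name comes from the whitelist; the value is bound, not pasted in.
        sql += f" WHERE {FILTER_COLUMNS[filter_by]} = ?"
        params.append(value)
    sql += " ORDER BY f.yearID, f.teamID"
    return sql, params


def run_query(conn, stat_type, filter_by=None, value=None):
    """Execute the query for one form submission and return all matching rows."""
    sql, params = build_query(stat_type, filter_by, value)
    return conn.execute(sql, params).fetchall()
```

For example, run_query(conn, "batting", "league", "NL") would return every row of the batting fact table whose division belongs to the National League, while an unrecognized stat type such as "hitting" is rejected before any SQL is built.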
We found many of the CSc 177 course resources useful, such as the data warehousing courseware available at http://athena.ecs.csus.edu/~enroll/enrollDW/Intro.php. Reading and doing the demos on this site helped us come up with the star schema for our project. We also often referred to the HTML and PHP demos at http://www.w3schools.com/ when looking for example code to guide us in implementing the user interface. Another of the CSc 177 course resources was the csv to MySQL conversion tool at http://www.convertcsv.com/csv-to-sql.htm. This tool made much of the database creation and population an easy task.

References:
http://www.baseball-reference.com/
http://www.convertcsv.com/csv-to-sql.htm
http://www.seanlahman.com/open-source-sports/
http://www.w3schools.com/
http://www.ecs.csus.edu/
http://athena.ecs.csus.edu/~enroll/enrollDW/Intro.php
Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Amsterdam; Boston: Elsevier/Morgan Kaufmann, 2012. Print.
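
Appendix: The preprocessing described under Data collection/preprocessing (dropping lines that are not team rows and adding columns so that each row fits the schema) can be sketched as follows. This is only an illustrative sketch in Python, not the steps actually used for the project. The report does not say which two columns were added, so the sketch assumes they were yearID and divID. The layout of the sample export, the "Tm" header name, and the DIVISIONS lookup are likewise assumptions; the ARI numbers are taken from the sample INSERT command in the report.

```python
import csv
import io

# Hypothetical lookup from team code to division code.
DIVISIONS = {"ARI": "nWest"}

# Assumed shape of an exported file: a title line, a header, team rows,
# and a summary row that is not a team. The summary values are placeholders.
SAMPLE_CSV = """Team batting export
Tm,AB,R,H

ARI,5676,685,1468
League Average,0,0,0
"""


def clean_team_csv(raw_text, year, stat_columns=("AB", "R", "H")):
    """Keep only real team rows and prepend yearID and divID to each one."""
    header = None
    cleaned = []
    for row in csv.reader(io.StringIO(raw_text)):
        if not any(cell.strip() for cell in row):
            continue  # blank line
        if header is None:
            if "Tm" in row:
                header = row  # first line that looks like the column header
            continue  # anything before the header (and the header itself) is dropped
        record = dict(zip(header, row))
        team = record.get("Tm", "").strip()
        if team not in DIVISIONS:
            continue  # summary rows, repeated headers, unknown teams
        stats = [int(record[col]) for col in stat_columns]
        cleaned.append([year, team, DIVISIONS[team]] + stats)
    return cleaned
```

With the sample above, clean_team_csv(SAMPLE_CSV, 2013) keeps the single ARI row and returns it with the year and division code in front, ready to be written back out as csv or turned into an INSERT statement.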