Data Mart and Data Mining on CA State Financial Data

Motivation
Dataset
Project Scope
Data Warehousing
Data Mining
Conclusion
Learning Experience
References

Research 1:
Public school funding is the largest program in the state budget, receiving more than 40% of the state's General Fund resources. The 2014–15 state budget includes more than $45 billion in General Fund resources.
- California Department of Education

Research 2:
Since 1980, higher education spending has decreased by 13 percent in inflation-adjusted dollars, whereas spending on California's prisons and associated correctional programs has skyrocketed by 436 percent.
- Huffington Post, California
CA State Government financial data reported by counties, cities, and districts, with billions of records in each file.
• The dataset details the expenditures, revenues, and state income of all departments, with income generated in the form of fees, penalties, and taxes.
• Source: https://bythenumbers.sco.ca.gov/

[Diagram: dataset attributes: County, City, District, Department, Sub-Dept., Financial Class, Year, Value]

To provide important financial information on government funding and income, broken down by region and department.

Target users:
• Citizens, taxpayers, students
• Businesses, non-profit organizations
Data Mart
• What is the state income by county, city, and district?
• Which business categories and sub-departments are responsible for the highest income collection?
• What are the expenditures of a particular department?
• How much has your county spent on public safety in the past 4 years?
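
A question such as the public-safety one above maps naturally onto a join between a fact table and its dimension tables. The following is a minimal sketch only: it uses an in-memory SQLite database as a stand-in for the project's MySQL data mart, and every table name, column name, and figure in it is hypothetical, since the slides do not show the real schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL data mart; all table and
# column names below are hypothetical, not the project's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_county (county_id INTEGER PRIMARY KEY, county_name TEXT);
CREATE TABLE dim_department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE fact_finance (
    county_id INTEGER REFERENCES dim_county(county_id),
    dept_id INTEGER REFERENCES dim_department(dept_id),
    year INTEGER,
    financial_class TEXT,   -- 'Expenditure' or 'Revenue'
    value REAL
);
""")
conn.executemany("INSERT INTO dim_county VALUES (?, ?)",
                 [(1, "Sacramento"), (2, "Los Angeles")])
conn.executemany("INSERT INTO dim_department VALUES (?, ?)",
                 [(1, "Public Safety"), (2, "Public Health")])
# Made-up figures for illustration only.
conn.executemany("INSERT INTO fact_finance VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 2011, "Expenditure", 100.0),
    (1, 1, 2012, "Expenditure", 120.0),
    (1, 1, 2013, "Expenditure", 130.0),
    (1, 1, 2014, "Expenditure", 150.0),
    (1, 2, 2014, "Expenditure", 999.0),
    (2, 1, 2014, "Expenditure", 500.0),
    (1, 1, 2014, "Revenue", 40.0),
])

def county_spending(conn, county, department, first_year, last_year):
    """Total expenditure of one county on one department over a year range."""
    row = conn.execute("""
        SELECT COALESCE(SUM(f.value), 0)
        FROM fact_finance f
        JOIN dim_county c ON c.county_id = f.county_id
        JOIN dim_department d ON d.dept_id = f.dept_id
        WHERE c.county_name = ? AND d.dept_name = ?
          AND f.financial_class = 'Expenditure'
          AND f.year BETWEEN ? AND ?
    """, (county, department, first_year, last_year)).fetchone()
    return row[0]

total = county_spending(conn, "Sacramento", "Public Safety", 2011, 2014)
print(total)
```

Each dimension row (one county, one department) is referenced by many fact rows, which is why the query filters on the dimensions and sums over the facts.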

Original data: three different files in .csv format
• Handled missing values and listed the attributes required for our project
• Integrated the data and reduced it to the relevant records
• Created extra attributes and identifiers to relate and associate the three datasets
• Loaded the tables into a MySQL database
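
The preparation steps above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual script: the sample rows, the column names, and the helper `clean_rows` are all hypothetical, and the final load into MySQL is left out.

```python
import csv
import io

# Hypothetical sample; the real files and column names are not given in the slides.
RAW = """county,department,year,value
Sacramento,Public Safety,2013,100
Sacramento,,2013,50
Los Angeles,Public Health,2013,
Los Angeles,Public Safety,2013,300
"""

REQUIRED = ["county", "department", "year", "value"]

def clean_rows(text, financial_class):
    """Drop rows with blank required fields, tag each row with its source
    dataset, and attach a surrogate identifier."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        if any(not (row.get(col) or "").strip() for col in REQUIRED):
            continue  # skip records with missing values
        cleaned.append({
            "record_id": len(cleaned) + 1,       # extra identifier
            "financial_class": financial_class,  # extra attribute linking the files
            "county": row["county"].strip(),
            "department": row["department"].strip(),
            "year": int(row["year"]),
            "value": float(row["value"]),
        })
    return cleaned

rows = clean_rows(RAW, "Expenditure")
print(rows)
```

Tagging every record with the file it came from is one simple way to let the three datasets share a single table layout.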


• Large dataset with approximately 20 billion records
• Departments with invalid or blank values were eliminated to keep the records consistent
• Removed the least-needed sub-categories; the primary focus was on income and expenditure
Demo
Link to our Data Mart: http://athena.ecs.csus.edu/~appanap/
Q. A star schema has what type of relationship between a dimension and fact table?
a) Many to many
b) One to one
c) One to many
d) All of the above
Answer: c) One to many
Data Mining
• A classification algorithm is used to classify the counties and departments into Loss and Profit classes
• Prediction is used to decide the value ranges for the year 2014
• Used the three combined datasets from our Data Mart application
• Kept the attributes required for mining in CSV format
• Converted the numeric column to nominal, i.e., values to ranges
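
Converting a numeric column to nominal ranges can be illustrated with a small binning helper. This is a sketch under assumptions: the fixed bin width of 1,000,000 and the helper name `to_range` are hypothetical, and the slides do not say which ranges or which tool setting the project actually used.

```python
def to_range(value, width=1_000_000):
    """Map a non-negative number to a nominal label for its fixed-width range."""
    low = int(value // width) * width
    return f"{low}-{low + width}"

# Made-up values for illustration.
labels = [to_range(v) for v in (250_000, 1_000_000, 3_499_999.5)]
print(labels)
```

Each label is a plain string, so a classifier treats it as a category rather than as a number.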

Challenges
• Selecting algorithms for the dataset
• Large dataset
• Creating classes for important categories

• Classification tree: J48
• Tools: Weka, Tableau
• Used data from 2010 to 2013 as training data
• Used data from 2014 as test data to check prediction accuracy
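
The year-based split and the accuracy check can be sketched as below. Note the assumptions: the records are made up, the Profit/Loss rule (revenue at least equal to expenditure) is a guess at how the classes were defined, and the model is a simple majority-class-per-county stand-in, not J48, which the project ran in Weka.

```python
from collections import Counter, defaultdict

# Made-up (county, year, revenue, expenditure) records for illustration.
RECORDS = [
    ("A", 2010, 10, 8), ("A", 2011, 12, 9), ("A", 2012, 9, 11), ("A", 2013, 14, 10),
    ("B", 2010, 5, 7), ("B", 2011, 6, 8), ("B", 2012, 9, 8), ("B", 2013, 5, 9),
    ("A", 2014, 15, 12), ("B", 2014, 9, 7),
]

def label(revenue, expenditure):
    """Assumed rule: Profit when revenue covers expenditure, else Loss."""
    return "Profit" if revenue >= expenditure else "Loss"

# Train on 2010-2013, hold out 2014 as the test set.
train = [(c, label(r, e)) for c, y, r, e in RECORDS if 2010 <= y <= 2013]
test = [(c, label(r, e)) for c, y, r, e in RECORDS if y == 2014]

# Stand-in model: predict each county's most frequent training class.
votes = defaultdict(Counter)
for county, cls in train:
    votes[county][cls] += 1
model = {county: counts.most_common(1)[0][0] for county, counts in votes.items()}

# Accuracy = share of 2014 records whose predicted class matches the actual class.
correct = sum(model[county] == actual for county, actual in test)
accuracy = correct / len(test)
print(model, accuracy)
```

Holding out the most recent year mirrors how the prediction would be used in practice: the model never sees 2014 while it is being built.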
As the results show, the prediction is nearly accurate: the predicted classes for 2014 have approximately the same values and ranges as the actual 2014 values.
Using Tableau, we were able to visualize and investigate the following:
• What are the top 5 revenue- and expenditure-generating counties?
• Comparison of financial data between Sacramento and Los Angeles
• Distribution of funds for flood control
• Government expenditures on public facilities such as roads and parking facilities
• Government expenses on public health
• Comparison of government expenses on prisons and education
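
The aggregation behind a question like the top-5 counties can be sketched in a few lines. This is only an illustration of the underlying computation with made-up rows and a hypothetical helper `top_counties`; the project produced these views in Tableau.

```python
from collections import defaultdict

# Made-up (county, financial_class, value) rows for illustration.
ROWS = [
    ("Sacramento", "Revenue", 40), ("Sacramento", "Expenditure", 55),
    ("Los Angeles", "Revenue", 90), ("Los Angeles", "Expenditure", 80),
    ("Fresno", "Revenue", 20), ("Fresno", "Revenue", 25),
]

def top_counties(rows, financial_class, n=5):
    """Sum values per county for one financial class and return the n largest."""
    totals = defaultdict(float)
    for county, cls, value in rows:
        if cls == financial_class:
            totals[county] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

top_revenue = top_counties(ROWS, "Revenue", n=2)
print(top_revenue)
```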
Q. Which of the following is not a data mining functionality?
a) Characterization and Discrimination
b) Classification and regression
c) Selection and interpretation
d) Clustering and Analysis
Answer: c) Selection and interpretation




Data Warehouse design: PHP, HTML/CSS, JavaScript
Database: MySQL
Data Mining tool: WEKA
Data Visualization: Tableau
• Learned to design a data mart application
• Learned different data mining and visualization tools such as Weka, RapidMiner, and Tableau
• Learned the practical usage of classification algorithms such as J48 and Naïve Bayes, as well as the correlation matrix
• Teamwork and brainstorming really helped us resolve issues in executing our project

• California State Controller's Office, Government Financial Reports: https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1
• California Department of Education: http://www.cde.ca.gov/fg/fr/eb/
• California Drought: http://drought.ca.gov/topstory/top-story-58.html
• California Spending More On Prisons Than Colleges, Report: http://www.huffingtonpost.com/2012/09/06/california-prisonscolleges_n_1863101.html
