CSC 177 Fall 2014
Team Project Final Report
Project Title:
Data Mining on Farmers Market Data
Instructor: Dr. Meiliu Lu
Team Members:
Yogesh Isawe
Kalindi Mehta
Aditi Kulkarni
CSc 177 DM Project Cover Page
Due 12-15-14 5pm
(Submit it to the CSC Department office before 5pm 12/15/14
Or to the instructor at 5:15pm in RVR 5029)
Student(s) Name: Aditi Kulkarni, Kalindi Mehta, Yogesh Isawe    Grade ______
Title of the project: Data Mining on Farmers Market
Hand-in check list:
● A hardcopy of the final report (without appendix) with cover page for the term project
● An electronic copy on a CD including all of the important writings of your term project
● Project oral presentation PowerPoint file with improvements made based on comments of the class and instructor during the oral presentation.
● Project final report (100%) containing the following parts, font >= 11:
1. Objective statement of the term project (1/3 - 1/2 page);
2. Background information (1 page);
3. Design principle of your data mining system / scope of study (1/3 - 1/2 page);
4. Implementation issues and solutions / survey results / diagrams / tables (3-5 pages);
5. Summary of learning experience such as experiments and readings (1/2 - 1 page);
6. References (authors, title, publishing source, date of publication, URL); you should cite each reference in your report text.
7. Appendix (optional) containing a set of supporting material such as examples, sample demo sessions, and any information that reflects your effort regarding the project.
TABLE OF CONTENTS
Chapter
1. OBJECTIVE
2. BACKGROUND INFORMATION
3. DESIGN PRINCIPLES
4. IMPLEMENTATION ISSUES AND SOLUTIONS
5. SUMMARY OF LEARNING EXPERIENCE
6. FUTURE SCOPE
7. REFERENCES
1. Abstract
The data set consists of the locations of U.S. farmers markets and the goods available at each market by season. We created a data mart that can provide this information and answer questions, and we designed the questions to address two types of users: consumers and government officials. For the data mining project, we work on the same data to find patterns.
2. Objective
Use the data mining tool WEKA to carry out a multi-step data mining exercise: interpret the data well, understand its structure using one or more data mining algorithms, and present the findings.
Mine the data to extract knowledge from the available records.
Explore alternative data mining tools such as RapidMiner.
3. Background Information
In this data mining project we mine U.S. Farmers Market data to extract knowledge, using the WEKA tool. The data source is http://catalog.data.gov/dataset/farmers-markets-geographic-data. The original dataset consists of 8,000 records with 41 attributes related to farmers markets. Our primary goal is to use different mining tools to apply classification and clustering algorithms.
4. Design Principles
The design principles of this project include data cleaning and preprocessing. The first phase of the project cleans the data and makes it compatible with the data mining tool; the next phase applies data mining algorithms to obtain classification and clustering results and studies those algorithms.
The data was cleaned and pre-processed manually by checking all attribute entries and making corrections in Microsoft Office Excel.
Using ‘WEKA’-Data Mining tool, based on the structure and type of DB, we applied
following algorithms:
1. Classification Algorithms:
a. Logistic Algorithm
b. J48 (Decision Tree)
2. Clustering Algorithms:
a. Expectation Maximization (EM) Algorithm
b. K-Means Algorithm
5. Implementation
To mine the data we followed the KDD process. These are the steps we followed:
1. Data preprocessing:
● As this is real-world data, it is noisy and needs preprocessing.
● To make it easier to handle, we trimmed the original data to 1,907 rows.
● We use 35 of the 41 attributes.
● The Season attribute was not consistent throughout the data; in some records it was given as a date or as a duration of months. To make it consistent we added two columns named Season Start and Season End.
● Some special characters used in the data are not accepted by WEKA, so we removed these characters or replaced them with appropriate ones.
2. Import the preprocessed data into WEKA.
3. Apply the classification and clustering algorithms listed below.
Based on the structure of the data set and the type of database, only specific algorithms can yield results that interpret the data well.
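As a minimal sketch of step 2, the preprocessed data can also be loaded programmatically through WEKA's Java API (we used the WEKA Explorer GUI; the file name below is hypothetical):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadFarmersMarket {
        public static void main(String[] args) throws Exception {
            // DataSource picks a loader from the file extension (.csv or .arff).
            // "farmers_market_clean.csv" is a placeholder name for the
            // 1907-row, 35-attribute preprocessed dataset described above.
            DataSource source = new DataSource("farmers_market_clean.csv");
            Instances data = source.getDataSet();
            System.out.println("Loaded " + data.numInstances() + " instances with "
                    + data.numAttributes() + " attributes.");
        }
    }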
6. Classification Algorithms
We used the same database for the data mining project and for the data warehousing project. The database is very large and widely distributed, with many independent and only a few dependent attributes. After analyzing the database, we concluded that applying different data mining algorithms to different sets of attributes from the database would interpret the data best. The two broad attribute sets formed for the data mining project are:
1. Goods Prediction and Clustering:
Location + Season Information + Goods Available
Basic Classification Histogram
In the above diagram we can select a good from the class list and visualize the distribution of that good across all states or seasons.
Red: the selected good is available
Blue: the selected good is not available
2. Nutrition Program Prediction and Clustering:
Location + Season Information + Nutrition Programs
For nutrition programs we determine which program is available at which market location and during which season.
Red: the nutrition program is available
Blue: the nutrition program is not available
All instances in the dataset are visualized under two conditions for each of the above attributes, i.e. whether the nutrition program is available (red) or not (blue).
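Both attribute subsets were produced by dropping the unused columns. A minimal sketch of doing the same with WEKA's unsupervised Remove filter (the index range below is a placeholder, not the real column positions in our dataset):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class AttributeSubset {
        // Keeps only the listed attribute columns, e.g. location + season + goods.
        public static Instances keepColumns(Instances data, String indices)
                throws Exception {
            Remove remove = new Remove();
            remove.setAttributeIndices(indices); // 1-based, e.g. "1-3,21-32"
            remove.setInvertSelection(true);     // keep the listed columns
            remove.setInputFormat(data);
            return Filter.useFilter(data, remove);
        }
    }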
6.1 Logistic Algorithm
Logistic regression is a highly regarded classical statistical technique for making predictions. The Logistic algorithm assigns a weight to each attribute in the data set and uses the logistic regression formula to predict how accurately a particular attribute value can be determined for future instances.
Using related (dependent) attributes therefore increases prediction capability, as opposed to using all of the available data, since unrelated (independent) attributes would distort the weight assignment used to compute the prediction accuracy. To apply logistic classification to the goods data, a set of relevant, i.e. dependent, attributes is used. The logistic algorithm then assigns weights to all attributes in the dataset.
These weights are then run through the logistic regression formula to predict the attribute under consideration, in this example 'Wine'.
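For reference, the logistic regression model behind this prediction estimates the probability of the positive class as P(y = 1 | x) = 1 / (1 + e^-(w·x + b)), where w holds the learned attribute weights and b is the intercept.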
Logistic Algorithm for class Wine
From the above diagram we see that the Logistic classification algorithm can predict the next/future instance of 'Wine' with 88.8% accuracy, given the dependent relations among all the attributes used in this example (location + season + all goods).
Similarly, for the nutrition programs we use the 'location + season + nutrition program' dataset and predict the accuracy for SFMNP in the following example: the algorithm can predict a future instance of SFMNP with 83.4% accuracy.
Logistic Algorithm for class SFMNP
Logistic Algorithm for class WICcash
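A minimal sketch of how such an accuracy figure can be reproduced with WEKA's Java API, using 10-fold cross-validation (the file name is hypothetical, and the class attribute name may differ from the actual column heading in the dataset):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LogisticDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("goods_subset.arff").getDataSet();
            // Predict the 'Wine' attribute, as in the example above.
            data.setClassIndex(data.attribute("Wine").index());

            Logistic logistic = new Logistic(); // WEKA's logistic regression
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(logistic, data, 10, new Random(1));
            // pctCorrect() is the accuracy figure reported by the Explorer.
            System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
        }
    }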
6.2 J48 Algorithm (Decision Tree)
The Logistic algorithm cannot predict numeric values, whereas the J48 algorithm can handle both nominal and numeric attribute values when building its tree (the class it predicts must still be nominal).
J48 uses the most relevant attributes from the dataset to determine the prediction values, so it is better to give it all of the attributes rather than only the relevant ones, as we did for the logistic algorithm. Using the whole data set with the J48 algorithm increases the prediction efficiency.
J48 visualizes its result in the form of a decision tree, in which the most relevant attributes are used to predict a particular attribute's future-instance value. Rules can be formed from this tree, as in the sketch and examples below.
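As a minimal sketch (assuming the same hypothetical ARFF file as before; the exact column name for baked goods may differ), building and printing a J48 tree in WEKA's Java API:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Demo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("goods_subset.arff").getDataSet();
            // Predict 'Bakedgoods' from all remaining attributes.
            data.setClassIndex(data.attribute("Bakedgoods").index());

            J48 tree = new J48(); // C4.5 decision tree learner
            tree.buildClassifier(data);
            // toString() prints the tree; rules like those below are read off it.
            System.out.println(tree);
        }
    }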
J48 Algorithm on Bake-goods
From the above diagram, 'Bake-goods' can be predicted with 94% accuracy using the attribute 'Vegetables', which J48 determined to be the most relevant.
Decision Tree for Bake goods
The attribute 'Vegetables' is not used alone to predict 'Bake-goods'; other relevant attributes such as 'Prepared' and 'Soap' are used as well.
Rules that can be formed from the above decision tree are:
1. If Vegetables = Yes then Bake-goods = Yes
2. If Vegetables = No and Prepared = Yes then Bake-goods = Yes
3. If Vegetables = No and Prepared = No and Soap = Yes then Bake-goods = Yes
4. If Vegetables = No and Prepared = No and Soap = No then Bake-goods = No
The next diagram shows the prediction of an instance of 'Herbs' with 90.8% accuracy.
J48 Algorithm for class Herbs
In the case of 'Herbs', J48 again chooses the most relevant attribute, 'Vegetables', but other attributes from the dataset are used to form the rules: 'Jams', 'Eggs', 'Seafood', and 'Prepared'.
Rules can be formed as in the previous case using the following decision tree.
Decision Tree for Herbs class
J48 Algorithm for class SNAP
Decision Tree SNAP
J48 Algorithm for class WIC
Decision Tree for WIC
7. Clustering Algorithms
Clustering algorithms are applied to sets of similar data in order to interpret the data well. We created two sets of attributes:
1. All Goods
2. Nutrition Programs
Each attribute has two distinct values, Yes/No (Y/N), so the number of clusters used for both the EM and the K-Means algorithm is two.
Basic clustering histogram for goods
Basic clustering histogram for Nutrition Program
7.1 EM Algorithm
The properties dialog for a clustering algorithm lets us specify various algorithm values so as to interpret the data well.
numClusters: the number of clusters to form. For the EM algorithm we do not need to specify this number; EM determines the number of clusters from the data. The value is therefore '-1', which means the algorithm will choose the number of clusters based on the dataset.
seed: provides the initialization used to choose the initial random center values around which the algorithm forms clusters. Given the size and distributed nature of the data, we keep this value at '100'.
Thus, from the above diagram, EM forms two clusters; the likely reason is the two distinct values (Y/N) in the dataset.
EM Algorithm Applied for Nutrition Program
7.2 Simple K-Means
The second clustering algorithm we used is Simple K-Means.
Properties for Simple K-Means:
numClusters: for the K-Means algorithm we do have to specify the number of clusters to form. We input two clusters here, so as to compare the results with the EM algorithm, which determined from the dataset that two clusters should be formed.
seed: to compare the EM results with the K-Means results, and for a better chance of forming good clusters, we set this value to '100'.
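The corresponding minimal sketch for Simple K-Means, mirroring the EM settings (file name again hypothetical):

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KMeansDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("nutrition_subset.arff").getDataSet();

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2); // K-Means needs the cluster count up front
            km.setSeed(100);      // same seed as the EM run, for comparison
            km.buildClusterer(data);
            System.out.println(km); // centroids and cluster sizes
        }
    }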
Simple K-Means Applied for Nutrition Program
By comparing both clustering results for the nutrition programs, we get nearly identical results, with ~70% of the instances in one cluster and ~30% in the other.
The following diagrams show the clustering algorithms applied to the 'Goods' data.
EM Algorithm Applied for Goods
1st cluster: 51% instances
2nd cluster: 49% instances
Simple K-Means applied on Goods
1st cluster: 57% instances
2nd cluster: 43% instances
Here we do not get clusterings as similar as those we saw for the nutrition programs. This may be an effect of the size and distributed nature of the 'Goods' dataset.
8. Summary of learning experience such as experiments and readings
● Learned a data mining tool, WEKA
● Gained a better understanding of classification algorithms such as J48 and the logistic regression algorithm
● Learned different clustering algorithms such as EM and Simple K-Means
● Learned real-world application of the algorithms and analysis of their results
● Experienced the advantages of teamwork
● Read many articles to get a clear idea of how to do data mining
9. References
● Data source: http://catalog.data.gov/dataset/farmers-markets-geographic-data
● WEKA tutorial: http://youtu.be/m7kpIBGEdkI
● RapidMiner tutorial: https://www.youtube.com/watch?v=EyygHzSVZpM&list=PLLYiNNLBO1EvVz2WJLWfbp_JWgg5It1O6