CSC 177 Fall 2014
Team Project Final Report
Project Title:
Data Mining on Farmers Market Data
Instructor: Dr. Meiliu Lu
Team Members:
Yogesh Isawe
Kalindi Mehta
Aditi Kulkarni
CSc 177 DM Project Cover Page
Due 12-15-14 5pm
(Submit it to the CSC Department office before 5pm 12/15/14
Or to the instructor at 5:15pm in RVR 5029)
Student(s) Name: Aditi Kulkarni, Kalindi Mehta, Yogesh Isawe    Grade ______
Title of the project: Data Mining on Farmers Market
Hand-in check list:
● A hardcopy of the final report (without appendix) with cover page for the term project
● An electronic copy on a CD including all of the important writings of your term project
● Project oral presentation PowerPoint file with improvements made based on comments of the class and instructor during the oral presentation.
● Project final report (100%) containing the following parts, font >= 11:
1. Objective statement of the term project (1/3 - 1/2 page);
2. Background information (1 page);
3. Design principle of your data mining system / scope of study (1/3 - 1/2 page);
4. Implementation issues and solutions / survey results / diagrams / tables (3-5 pages);
5. Summary of learning experience such as experiments and readings (1/2 - 1 page);
6. References (authors, title, publishing source, date of publication, URL); you should cite each reference in your report text.
7. Appendix (optional) containing a set of supporting material such as examples, sample demo sessions, and any information that reflects your effort regarding the project.
TABLE OF CONTENTS
Chapter
1. OBJECTIVE
2. BACKGROUND INFORMATION
3. DESIGN PRINCIPLES
4. IMPLEMENTATION ISSUES AND SOLUTIONS
5. SUMMARY OF LEARNING EXPERIENCE
6. FUTURE SCOPE
7. REFERENCES
1. Abstract
The data set consists of the locations of U.S. farmers markets and the goods available at each market by season. We created a data mart that can provide this information and answer questions, and we designed the questions to address two types of users: consumers and government officials. For the data mining project, we work on the same data to find patterns.
2. Objective
Use the data mining tool WEKA to carry out a multi-step data mining exercise: interpret the data well, understand its structure using one or more data mining algorithms, and present the findings.
Mine the data to extract knowledge from the available records.
Explore alternative data mining tools such as RapidMiner.
3. Background Information
In this data mining project we mine U.S. Farmers Market data to extract knowledge, using the WEKA tool. The data source is http://catalog.data.gov/dataset/farmers-markets-geographic-data. The original dataset consists of 8,000 records with 41 attributes related to farmers markets. Our primary goal is to use different mining tools to apply classification and clustering algorithms.
4. Design Principles
The design principles of this project include data cleaning and preprocessing. The first phase of the project cleans the data and makes it compatible with the data mining tool; the next phase applies data mining algorithms to obtain classification and clustering results and studies those algorithms.
The data was cleaned and pre-processed manually by checking all attribute entries and making corrections in Microsoft Office Excel.
Using ‘WEKA’-Data Mining tool, based on the structure and type of DB, we applied
following algorithms:
1. Classification Algorithms:
a. Logistic Algorithm
b. J48 (Decision Tree)
2. Clustering Algorithms:
a. Expectation Maximization (EM) Algorithm
b. K-Means Algorithm
5. Implementation
To mine the data we followed the KDD process. These are the steps we followed:
1. Data preprocessing:
● As this is real-world data, it is noisy and needs preprocessing.
● To make it easier to handle, we trimmed the original data to 1,907 rows.
● We use 35 of the 41 attributes.
● The Season attribute was not consistent throughout the data; in some records it was given as a date or as a duration of months. To make it consistent we added two columns named Season Start and Season End.
● Some special characters used in the data are not accepted by WEKA, so we removed these characters or replaced them with appropriate ones.
2. Import the preprocessed data into WEKA.
3. Apply the classification and clustering algorithms listed below.
Based on the structure of the data set and the type of database, only specific algorithms can yield results that interpret the data well.
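As a minimal sketch of step 2, the preprocessed data can also be loaded programmatically through WEKA's Java API (we used the WEKA Explorer GUI; the file name below is hypothetical):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadFarmersMarket {
        public static void main(String[] args) throws Exception {
            // DataSource picks a loader from the file extension (.csv or .arff).
            // "farmers_market_clean.csv" is a placeholder name for the
            // 1907-row, 35-attribute preprocessed dataset described above.
            DataSource source = new DataSource("farmers_market_clean.csv");
            Instances data = source.getDataSet();
            System.out.println("Loaded " + data.numInstances() + " instances with "
                    + data.numAttributes() + " attributes.");
        }
    }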
6. Classification Algorithms
We used the same database for the data mining project and for the data warehousing project. The database is very large and widely distributed, with many independent and only a few dependent attributes. After analyzing the database, we concluded that applying different data mining algorithms to different sets of attributes from the database would interpret the data best. The two broad attribute sets formed for the data mining project are:
1. Goods Prediction and Clustering:
Location + Season Information + Goods Available
Basic Classification Histogram
In the above diagram we can select a good from the class list and visualize the distribution of that good across all states or seasons.
Red: the selected good is available
Blue: the selected good is not available
2. Nutrition Program Prediction and Clustering:
Location + Season Information + Nutrition Programs
For nutrition programs we determine which program is available at which market location and during which season.
Red: the nutrition program is available
Blue: the nutrition program is not available
All instances in the dataset are visualized under two conditions for each of the above attributes, i.e. whether the nutrition program is available (red) or not (blue).
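Both attribute subsets were produced by dropping the unused columns. A minimal sketch of doing the same with WEKA's unsupervised Remove filter (the index range below is a placeholder, not the real column positions in our dataset):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class AttributeSubset {
        // Keeps only the listed attribute columns, e.g. location + season + goods.
        public static Instances keepColumns(Instances data, String indices)
                throws Exception {
            Remove remove = new Remove();
            remove.setAttributeIndices(indices); // 1-based, e.g. "1-3,21-32"
            remove.setInvertSelection(true);     // keep the listed columns
            remove.setInputFormat(data);
            return Filter.useFilter(data, remove);
        }
    }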
6.1 Logistic Algorithm
Logistic regression is a highly regarded classical statistical technique for making predictions. The Logistic algorithm assigns a weight to each attribute in the data set and uses the logistic regression formula to predict how accurately a particular attribute value can be determined for future instances.
Using related (dependent) attributes therefore increases prediction capability, as opposed to using all of the available data, since unrelated (independent) attributes would distort the weight assignment used to compute the prediction accuracy. To apply logistic classification to the goods data, a set of relevant, i.e. dependent, attributes is used. The logistic algorithm then assigns weights to all attributes in the dataset.
These weights are then run through the logistic regression formula to predict the attribute under consideration, in this example 'Wine'.
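For reference, the logistic regression model behind this prediction estimates the probability of the positive class as P(y = 1 | x) = 1 / (1 + e^-(w·x + b)), where w holds the learned attribute weights and b is the intercept.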
Logistic Algorithm for class Wine
From the above diagram we see that the Logistic classification algorithm can predict the next/future instance of 'Wine' with 88.8% accuracy, given the dependent relations among all the attributes used in this example (location + season + all goods).
Similarly, for the nutrition programs we use the 'location + season + nutrition program' dataset and predict the accuracy for SFMNP in the following example: the algorithm can predict a future instance of SFMNP with 83.4% accuracy.
Logistic Algorithm for class SFMNP
Logistic Algorithm for class WICcash
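A minimal sketch of how such an accuracy figure can be reproduced with WEKA's Java API, using 10-fold cross-validation (the file name is hypothetical, and the class attribute name may differ from the actual column heading in the dataset):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LogisticDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("goods_subset.arff").getDataSet();
            // Predict the 'Wine' attribute, as in the example above.
            data.setClassIndex(data.attribute("Wine").index());

            Logistic logistic = new Logistic(); // WEKA's logistic regression
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(logistic, data, 10, new Random(1));
            // pctCorrect() is the accuracy figure reported by the Explorer.
            System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
        }
    }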
6.2 J48 Algorithm (Decision Tree)
The Logistic algorithm cannot predict numeric values, whereas the J48 algorithm can handle both nominal and numeric attribute values when building its tree (the class it predicts must still be nominal).
J48 uses the most relevant attributes from the dataset to determine the prediction values, so it is better to give it all of the attributes rather than only the relevant ones, as we did for the logistic algorithm. Using the whole data set with the J48 algorithm increases the prediction efficiency.
J48 visualizes its result in the form of a decision tree, in which the most relevant attributes are used to predict a particular attribute's future-instance value. Rules can be formed from this tree, as in the sketch and examples below.
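As a minimal sketch (assuming the same hypothetical ARFF file as before; the exact column name for baked goods may differ), building and printing a J48 tree in WEKA's Java API:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Demo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("goods_subset.arff").getDataSet();
            // Predict 'Bakedgoods' from all remaining attributes.
            data.setClassIndex(data.attribute("Bakedgoods").index());

            J48 tree = new J48(); // C4.5 decision tree learner
            tree.buildClassifier(data);
            // toString() prints the tree; rules like those below are read off it.
            System.out.println(tree);
        }
    }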
J48 Algorithm on Bake-goods
From the above diagram, 'Bake-goods' can be predicted with 94% accuracy using the attribute 'Vegetables', which J48 determined to be the most relevant.
Decision Tree for Bake goods
The attribute 'Vegetables' is not used alone to predict 'Bake-goods'; other relevant attributes such as 'Prepared' and 'Soap' are used as well.
Rules that can be formed from the above decision tree are:
1. If Vegetables = Yes then Bake-goods = Yes
2. If Vegetables = No and Prepared = Yes then Bake-goods = Yes
3. If Vegetables = No and Prepared = No and Soap = Yes then Bake-goods = Yes
4. If Vegetables = No and Prepared = No and Soap = No then Bake-goods = No
The next diagram shows the prediction of an instance of 'Herbs' with 90.8% accuracy.
J48 Algorithm for class Herbs
In the case of 'Herbs', J48 again chooses the most relevant attribute, 'Vegetables', but other attributes from the dataset are used to form the rules: 'Jams', 'Eggs', 'Seafood', and 'Prepared'.
Rules can be formed as in the previous case using the following decision tree.
Decision Tree for Herbs class
J48 Algorithm for class SNAP
Decision Tree SNAP
J48 Algorithm for class WIC
Decision Tree for WIC
7. Clustering Algorithms
Clustering algorithms are applied to sets of similar data in order to interpret the data well. We created two sets of attributes:
1. All Goods
2. Nutrition Programs
Each attribute has two distinct values, Yes/No (Y/N), so the number of clusters used for both the EM and the K-Means algorithm is two.
Basic clustering histogram for goods
Basic clustering histogram for Nutrition Program
7.1 EM Algorithm
The properties dialog for a clustering algorithm lets us specify various algorithm values so as to interpret the data well.
numClusters: the number of clusters to form. For the EM algorithm we do not need to specify this number; EM determines the number of clusters from the data. The value is therefore '-1', which means the algorithm will choose the number of clusters based on the dataset.
seed: provides the initialization used to choose the initial random center values around which the algorithm forms clusters. Given the size and distributed nature of the data, we keep this value at '100'.
Thus, from the above diagram, EM forms two clusters; the likely reason is the two distinct values (Y/N) in the dataset.
EM Algorithm Applied for Nutrition Program
7.2 Simple K-Means
The second clustering algorithm we used is Simple K-Means.
Properties for Simple K-Means:
numClusters: for the K-Means algorithm we do have to specify the number of clusters to form. We input two clusters here, so as to compare the results with the EM algorithm, which determined from the dataset that two clusters should be formed.
seed: to compare the EM results with the K-Means results, and for a better chance of forming good clusters, we set this value to '100'.
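The corresponding minimal sketch for Simple K-Means, mirroring the EM settings (file name again hypothetical):

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KMeansDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("nutrition_subset.arff").getDataSet();

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2); // K-Means needs the cluster count up front
            km.setSeed(100);      // same seed as the EM run, for comparison
            km.buildClusterer(data);
            System.out.println(km); // centroids and cluster sizes
        }
    }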
Simple K-Means Applied for Nutrition Program
By comparing both clustering results for the nutrition programs, we get nearly identical results, with ~70% of the instances in one cluster and ~30% in the other.
The following diagrams show the clustering algorithms applied to the 'Goods' data.
EM Algorithm Applied for Goods
1st cluster: 51% instances
2nd cluster: 49% instances
Simple K-Means applied on Goods
1st cluster: 57% instances
2nd cluster: 43% instances
Here we do not get clusterings as similar as those we saw for the nutrition programs. This may be an effect of the size and distributed nature of the 'Goods' dataset.
8. Summary of learning experience such as experiments and readings
● Learned a data mining tool, WEKA
● Gained a better understanding of classification algorithms such as J48 and the logistic regression algorithm
● Learned different clustering algorithms such as EM and Simple K-Means
● Learned real-world application of the algorithms and analysis of their results
● Experienced the advantages of teamwork
● Read many articles to get a clear idea of how to do data mining
9. References
● Data source: http://catalog.data.gov/dataset/farmers-markets-geographic-data
● WEKA tutorial: http://youtu.be/m7kpIBGEdkI
● RapidMiner tutorial: https://www.youtube.com/watch?v=EyygHzSVZpM&list=PLLYiNNLBO1EvVz2WJLWfbp_JWgg5It1O6