Database Management System
Data Mining

Knowledge Discovery and Data Mining
  Knowledge Discovery
    The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al., 1996)
  Data Mining
    A step in the knowledge discovery process
    Application of algorithms to extract meaningful patterns
  Data Dredging
    Blind application of data mining techniques

Knowledge Discovery in Databases
  [Figure: knowledge discovery process diagram. Labels: Data, Cleaning, Integration, Data Warehouse, Selection, Transformation, Prepared data, Data Mining, Patterns, Evaluation, Visualization, Knowledge Base, Knowledge]

What Is Data Mining?
  Filtering large amounts of data
  Searching for hidden patterns and/or trends
  Predicting future results
  Creating a competitive advantage and improving decision making
  Data mining is a form of artificial intelligence, but it is very different from other BI tools
    Discovery versus verification

What Sparked Data Mining?
  “Motivated by business need, large amounts of available data, and humans’ limited cognitive processing abilities
  Enabled by data warehousing, parallel processing, and data mining algorithms”
  Source: Dr. Hugh Watson

Popular Data Mining Methods
  Neural networks – learning from data patterns and predicting new data
  Genetic algorithms – optimizing techniques
  Decision trees – rules for classifying data
  Regression analysis – statistical technique
  K-nearest neighbor – classifying and clustering technique based on weighting of selected variables
  Data visualization – visually showing patterns

Types of Data Mining
  Association – identifies relationships
  Sequential pattern – identifies sequencing
  Classifying – identifies potential outcomes for predetermined categories
  Clustering – identifies categories
  Prediction – estimates future values or forecasts

Types of Data Mining
  Classification
  Prediction
  Clustering
  Association Analysis
  Summarization
  …

Types of Data Mining: Classification
  From data with known labels, create a classifier that determines which label to apply to a new observation.
  E.g., label loan applications as low, medium, or high risk.

Types of Data Mining: Prediction
  Given a collection of data with known numeric outputs, create a function that outputs a predicted value from a new set of inputs.
  E.g., given historical consumption of milk in the U.S., predict what the consumption will be over the next five years.

Types of Data Mining: Clustering
  Identify “natural” groupings in data
  Unsupervised learning, no predefined groups
  E.g., a city planner grouping houses by value, location, and house type.

Types of Data Mining: Association Analysis
  Identify relationships in data from co-occurring terms or items.
  E.g., analyze grocery store purchases to identify items most commonly purchased together. This is often used to create coupons and sales: buy chips and get $0.50 off salsa.

Types of Data Mining: Summarization
  Given a data set, summarize the important characteristics of the data.
  E.g., calculate the mean and standard deviation, determine the statistical distribution, identify the most commonly appearing attribute values, etc.
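
Worked example: Association Analysis
  The grocery-store example above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from the cited presentations: the baskets are invented, and the function names (pair_counts, support, confidence) are hypothetical. It counts how often pairs of items are purchased together, then estimates how likely salsa is to be in a basket that already contains chips.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented purely for illustration.
BASKETS = [
    {"chips", "salsa", "soda"},
    {"chips", "salsa"},
    {"bread", "milk"},
    {"chips", "soda"},
    {"bread", "milk", "salsa"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    both = support(baskets, {antecedent, consequent})
    return both / support(baskets, {antecedent})


if __name__ == "__main__":
    for pair, n in pair_counts(BASKETS).most_common(3):
        print(pair, n)
    print("confidence(chips -> salsa):", confidence(BASKETS, "chips", "salsa"))
```

  A pair with high support and high confidence is the kind of relationship a "buy chips, get money off salsa" coupon would target. Real association-rule miners avoid enumerating every pair on large data; this sketch does not attempt that.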

Types of Data Mining: Sequence Analysis
  Given data collected over time, identify trends in the data that may be used to predict future events.
  E.g., analyzing stock data to identify stocks that will perform well vs. those that will perform poorly.

Data Mining Process
  “Requires personnel with domain, data warehousing, and data mining expertise
  Requires data selection, data extraction, data cleansing, and data transformation
  Most data mining tools work with highly granular flat files
  Is an iterative and interactive process”
  Source: Dr. Hugh Watson

Data Mining Process
  [Flowchart: Fit a Model -> Calculate Performance -> Meet Criteria? If No, return to Fit a Model; if Yes, Interpret Model]

Data Mining Algorithms
  Determine the preference criterion
    In the face of two models, which one is “better”?
    Examples: goodness of fit, prediction accuracy, size/complexity, etc.
  Search algorithm
    Good models are found by searching the space of all possible models
    How is this space organized and searched?

Data Mining Models: Mathematical Functions
  Mathematical combination of attribute values
  E.g., linear model, non-linear model, support vectors, etc.
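
Worked example: Fit, Calculate Performance, Meet Criteria?
  The fit-and-check loop from the Data Mining Process flowchart, together with the idea of a preference criterion, can be sketched as follows. This is a minimal sketch under stated assumptions: the data points are invented, the candidate models (a constant and a one-attribute linear model) and the R-squared threshold are arbitrary choices, and all function names are hypothetical. For brevity, performance is measured on the same data used for fitting, which a real project would avoid.

```python
from statistics import mean

# Hypothetical data, invented for illustration: one input attribute, one output.
XS = [1, 2, 3, 4, 5]
YS = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_constant(xs, ys):
    """Simplest candidate: ignore x and always predict the mean of y."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Least-squares line y = a + b * x for a single attribute."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def r_squared(model, xs, ys):
    """Goodness of fit: 1 - (residual sum of squares / total sum of squares).

    Assumes the ys are not all identical (otherwise the total is zero).
    """
    my = mean(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def search(xs, ys, threshold=0.9):
    """Try candidates from simplest to most complex.

    Returns (name, model, score) for the first candidate whose score
    meets the threshold, or None if no candidate does.
    """
    candidates = [("constant", fit_constant), ("linear", fit_linear)]
    for name, fit in candidates:
        model = fit(xs, ys)                 # Fit a Model
        score = r_squared(model, xs, ys)    # Calculate Performance
        if score >= threshold:              # Meet Criteria?
            return name, model, score       # Yes: hand over for interpretation
    return None                             # No candidate met the criterion


if __name__ == "__main__":
    print(search(XS, YS))
```

  Ordering the candidates from simple to complex and stopping at the first acceptable one is one way to encode a preference for smaller models; other criteria (prediction accuracy on held-out data, for instance) would change which model is called "better".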
CPU performance prediction (linear model)
  PRP = -55.9 + 0.489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

Data Mining Models: Decision Trees
  [Figure: decision tree that assigns a grade (A, B, C, or F) using three tests: Study (>= 10 hours / < 10 hours), Do Homework (Yes / No), and Test Well (Yes / No)]

Data Mining Models: Neural Networks
  [Figure: neural network diagram annotated with numeric values (0.8, 0.23, -0.48, 0.67, 1.5, 1.93, -0.81, -0.4, 0.18, 0.5, -0.88)]

Data Mining Models: Mixture Models

Data Mining Models: Bayesian Networks
  Nodes: Burglary (B) and Earthquake (E) are parents of Alarm (A); Alarm is the parent of John Calls (J) and Mary Calls (M).

  P(B) = .001
  P(E) = .002

  B  E  P(A)
  T  T  0.95
  T  F  0.95
  F  T  0.29
  F  F  0.001

  A  P(J)
  T  0.90
  F  0.05

  A  P(M)
  T  0.70
  F  0.01

Searching the Model Space
  Concept generalization is searching
  Almost all search algorithms are heuristic
    Optimal models are not guaranteed
  Enumerating the space involves bias
    Language bias – what the model can represent
    Search bias – which models are ignored
    Overfitting-avoidance bias – how models are simplified to handle outliers

Searching the Model Space
  [Figure: two candidate decision trees, Model 1 and Model 2, built from the tests Study (>= 10 hours / < 10 hours), Homework, Test Well, and Good Project, with grade leaves A, B, C, and F]

How Is Data Mining Used?
  CRM: Research, churn, and promotional management.
  Process Mgmt: Reduce operational delays.
  Analysis: Develop forecasting models and fraud prevention.
  Predictive Capabilities: Develop rules for queries or expert systems and oil exploration.
  Health Care: Medical research and trends.
  Banking: Identify bank locations.
  Sports: Guide movement of players.

Sources
  Davis, Jennifer, and others. 2002. Data Mining I: KnowledgeSEEKER. http://www.terry.uga.edu/~hwatson/Presentation_DMining_Final.ppt
  Struble, Craig A. 2004. Data Mining. http://www.mscs.mu.edu/~cstruble
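
Worked example: Bayesian Networks
  As an illustration of how the tables on the Bayesian Networks slide are used, the sketch below multiplies the conditional probabilities along the network to get the probability of a complete assignment, and sums over assignments to get the overall probability that the alarm sounds. The probability values are copied from the slide; the function names (prob, joint, p_alarm) are hypothetical, and the factorization assumes the usual structure in which Burglary and Earthquake are parents of Alarm, and Alarm is the parent of both calls.

```python
from itertools import product

# Probability tables as given on the Bayesian Networks slide.
P_B = 0.001                       # P(Burglary = T)
P_E = 0.002                       # P(Earthquake = T)
P_A = {                           # P(Alarm = T | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.95,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = T | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = T | Alarm)


def prob(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of the local tables."""
    return (
        prob(P_B, b)
        * prob(P_E, e)
        * prob(P_A[(b, e)], a)
        * prob(P_J[a], j)
        * prob(P_M[a], m)
    )


def p_alarm():
    """P(Alarm = T), obtained by summing the joint over all other variables."""
    return sum(
        joint(b, e, True, j, m)
        for b, e, j, m in product([True, False], repeat=4)
    )


if __name__ == "__main__":
    # Both neighbors call and the alarm sounds, with no burglary or earthquake.
    print(joint(False, False, True, True, True))
    print(p_alarm())
```

  Enumerating every assignment is only practical for a network this small; it is shown here because it makes the role of each table explicit.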