Sutee Sujitparapitaya, Ph.D.
Associate Vice President for Institutional Effectiveness and Analytics
San José State University
Email: [email protected]
Copyright © Sutee Sujitparapitaya, 2011-2015

Data mining techniques are widely used for data analysis. While data mining may be viewed as expensive, time-consuming, and too technical to understand and apply, it is an institutional research tool for efficiently managing and extracting data from large databases and for expediting reporting through the use of statistical algorithms. This workshop will introduce the basic foundations of data mining and identify the types of data typically found in large institutional databases, research questions to consider before mining data, and issues of data quality.
– It will also address how to mix traditional institutional research tools with data mining, and field additional questions typically posed by novices.
– The emphasis will be on a beginner's (novice) perspective and on institutional research data applications.

Workshop objectives:
• Describe the basic foundations of data mining from an institutional research (IR) perspective.
• Explain the principal components of IR data and research questions.
• Describe why the data mining process (the CRISP-DM methodology) and its primary techniques are valuable for IR.
• Describe how data quality and data selection work.
• Explain the primary features of data mining tools.
• Describe the relevant resources that are available to help with data mining projects.

Data mining is applied to strategic decision making, analyzing trends, wealth generation, and security.

Data mining is a process of finding hidden trends, patterns, and relationships in data that are not immediately apparent from summarizing the data. It examines data in large databases and infers rules in order to (a) obtain insight and (b) predict future behavior. For example: finding patterns in student data to study attrition, or to identify students who are at risk and may drop out of school.

Motivation for data mining:
1. There is an important need for turning data into useful information.
2. The fast-growing amount of data, collected and stored in large and numerous databases, has exceeded the human ability to comprehend it without powerful tools.
3. We are drowning in data, but starving for knowledge!

Data mining draws on several related disciplines:
• Traditional statistics (distributions, mathematics, etc.).
• Machine learning: the discipline concerned with the design and development of algorithms that give computers the ability to learn without being explicitly programmed (computer science, heuristics, and induction algorithms).
• Artificial intelligence: the study and design of intelligent agents that emulate human intelligence.
• Neural networks: a mathematical model in which an interconnected group of artificial neurons processes information between inputs and outputs or finds patterns in data. It is an adaptive model that changes its structure during a learning phase (biological models, psychology, and engineering).
The evolution of data mining:
• Data Collection (1960s). Business question: "What was the number of new applications for the last five years?" Enabling technologies: computers, tapes, disks. Characteristics: retrospective, static data delivery.
• Data Access (1980s). Business question: "What was the number of new applications for the College of Business last March?" Enabling technologies: relational databases, SQL, ODBC. Characteristics: retrospective, dynamic data delivery at the record level.
• Data Warehousing & Decision Support (1990s). Business question: "What was the number of new applications for the College of Business last March? Drill down to Accounting majors." Enabling technologies: OLAP, multidimensional databases, data warehouses. Characteristics: retrospective, dynamic data delivery at multiple levels.
• Data Mining (at present). Business question: "What is likely to happen to the number of new Accounting applications next month? Why?" Enabling technologies: advanced algorithms, multiprocessor computers, massive databases. Characteristics: prospective, proactive information delivery.

Statistics versus data mining:
• Statistics: conceptual model (hypothesis) + statistical reasoning = "proof" (validation of the hypothesis).
• Data mining: data + data mining algorithm based on interestingness = pattern discovery (model, rule).

Association rules describe a method for discovering interesting relations between variables in large databases. The method produces dependency rules that predict the occurrence of an item based on the occurrences of other items.
Example 1: Which products are frequently bought together by customers? (basket analysis)
• Data table = Receipts x Products
• Results could be used to change the placement of products.
Example 2: Which courses tend to be attended together?
• Data table = Students x Courses
• Results could be used to avoid scheduling conflicts.

Market basket analysis identifies customers' purchasing habits. It provides insight into the combinations of products within a customer's "basket". Ultimately, the purchasing insights provide the potential to create cross-sell propositions:
• which product combinations are bought;
• when they are purchased; and
• in what sequence.

Observation | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diapers, Milk
4 | Beer, Bread, Diapers, Milk
5 | Coke, Diapers, Milk

Rules discovered:
{Milk} => {Coke}
{Diapers, Milk} => {Beer}
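The support and confidence behind rules like these can be computed directly from the transaction list. Below is a minimal sketch in Python (an illustration added here, not part of the original workshop materials, which demonstrate SPSS Modeler) that scores the two discovered rules against the five transactions above.

```python
# Minimal sketch: score association rules against the five transactions above.
# Support    = fraction of transactions containing antecedent and consequent.
# Confidence = support(antecedent + consequent) / support(antecedent).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Coke", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diapers", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(
        f"{antecedent} -> {consequent}: "
        f"support={support(antecedent | consequent):.2f}, "
        f"confidence={confidence(antecedent, consequent):.2f}"
    )
# e.g. {'Milk'} -> {'Coke'}: support=0.60, confidence=0.75
#      {'Diapers', 'Milk'} -> {'Beer'}: support=0.40, confidence=0.67
```

Algorithms such as Apriori, GRI, and CARMA (discussed later) automate exactly this kind of counting over all candidate itemsets instead of two hand-picked rules.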
The government's data mining projects fall into two broad categories:
1. Subject-based data mining, which retrieves data that could help an analyst follow a lead; and
2. Pattern-based data mining, which looks for suspicious behaviors across a spread of activities.
Most data mining experts consider the former a version of traditional police work (chasing down leads), except that instead of a police officer examining a list of phone numbers from suspect calls, a computer does it. One subject-based data mining technique gaining traction among government practitioners and academics is called link analysis. Link analysis uses data to make connections between seemingly unconnected people or events.

Data visualization is the study of the visual representation of data, meaning "information that has been abstracted in some schematic form."
– It refers to techniques that communicate information clearly and effectively through graphical means (e.g., creating images, diagrams, or animations).
Source: Bradbury Science Museum, Los Alamos, NM

Data + interestingness (or criteria) = hidden patterns.

Sources of data about people include:
• Interaction data: offers, results, context, click streams, notes.
• Attitudinal data: opinions, preferences, needs, desires.
• Descriptive data: attributes, characteristics, self-declared info, (geo)demographics.
• Behavioral data: orders, transactions, payment history, usage history.
Source: SPSS BI

Why data mining?
• Too many records
• Too many variables
• Complex non-linear relationships
• Multi-variable combinations
• A proactive and prospective approach
Source: Abbot, Data Mining: Level II

Traditional IR work (historical): data file => descriptive/regression analysis => tabulations/reports.
Data-mining-driven IR work (historical and predictive): database => data mining (visualization, association, clustering, predictive modeling) => immediate actions.

Types of interestingness:
• Frequency
• Correlation
• Length of occurrence (for sequences)
• Consistency
• Repetition/periodicity
• Abnormal behaviors
• Other patterns of interestingness

Typical DBMS approach versus data mining approach:
• DBMS: What are the total applications during the last 3 years? / Data mining: Which inquiries are most likely to turn into actual applications?
• DBMS: What is the first-year retention of the fall 2006 first-time freshmen from under-represented minorities? / Data mining: What are the most important parameters for predicting first-year attrition for next year's entering freshmen?
• DBMS: How many freshmen attended the freshman orientation in November over the last 5 years? / Data mining: Who is likely to enroll in the freshman orientation during the month of November?
• DBMS: What was the total of pledges for California alumni donations last year? / Data mining: Who is likely to make pledges for alumni donations?
• DBMS: How many "agree" and "strongly agree" responses did we receive from the 2008 student/faculty satisfaction surveys? / Data mining: What are the main clusters found in the student/faculty satisfaction surveys?

What do we know about our students?
DBMS approach:
• A list of students who passed the English Proficiency Exam in the spring.
• A summary of the profiles of students who failed and dropped out last semester.
• How many students enrolled in the Business Policy course last fall semester?
Data mining approach:
• What factors contribute to learning?
• Who is likely to fail or drop out by the end of their 6th year?
• Which courses provide high FTES and better use of space?
• What are the course-taking patterns?

A retail analogy:
DBMS approach:
• List all items that were sold in the last month.
• List all the items purchased by Sandy Smith.
• What were the total sales of the last month, grouped by branch?
• How many sales transactions occurred in the month of December?
Data mining approach:
• Which items are sold together? Which items should we stock?
• How should we place items? What discounts should we offer?
• How can we best target customers to increase sales?
• Which clients are most likely to respond to my next promotional mailing, and why?

Supervised data mining refers to situations where prior knowledge exists about what outcomes are present in the data.
• Classification and prediction describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown.

Unsupervised data mining is used when the researcher has no idea what hidden patterns there are in the vast database.
• Clustering involves accurate identification of group membership based on maximizing the intraclass similarity and minimizing the interclass similarity.
• Associations and sequences identify relationships between events that occur at one time, determining which things go together, or sequential patterns in data.

IR applications of data mining (Source: Thulasi Kumar, 2004):
• Categorize your students (clustering): cafeteria meal planning; student housing planning.
• Predict student retention/alumni donations (neural nets/regression): identify high-risk students; estimate/predict alumni contributions; predict the new-student application rate.
• Group similar students (segmentation): course planning; academic scheduling; identify student preferences for clubs and social organizations.
• Identify courses that are taken together (association): faculty teaching load estimation; course planning; academic scheduling.
• Find patterns and trends over time (sequence): predict alumni donations; predict potential demand for library resources.

Primary techniques:
Classification and prediction
• Decision trees (C&RT, C5.0, CHAID, and QUEST)
• Neural networks
• Regressions (linear and logistic)
Clustering
• K-Means, TwoStep, and Kohonen SOM
Association rule/affinity analysis
• Generalized Rule Induction (GRI)
• CARMA (Continuous Association Rule Mining Algorithm)
• APRIORI

A decision tree is a tree-shaped structure that represents a set of decisions. These decisions generate rules for the classification of a dataset. The model predicts the value of a target variable based on several input variables.
Two primary types of decision trees:
1. Classification tree analysis is used when the predicted outcome is the class to which the data belong.
2. Regression tree analysis is used when the predicted outcome can be considered a real number (e.g., the price of a house, or a patient's length of stay in a hospital).
Advantages:
• Fast
• Simple to understand and interpret
• Validation using statistical tests
Disadvantages:
• Inherently unstable
• Can become large and complex

Decision tree example:
Dependent variable:
• The target classification is "should we play baseball?", which can be yes or no.
Input variables:
• The weather attributes are outlook, temperature, humidity, and wind speed. They can take the following values:
o outlook = {sunny, overcast, rain}
o temperature = {hot, mild, cool}
o humidity = {high, normal}
o wind = {weak, strong}

Day | Outlook | Temperature | Humidity | Wind | Play ball
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
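As an illustration of classification-tree fitting (not from the original slides, which demonstrate SPSS Modeler nodes), the sketch below trains an entropy-based tree on the 14-day play-ball table above using the scikit-learn library, prints the learned rules, and then scores a hypothetical new day.

```python
# Minimal sketch: fit a classification tree to the "play ball" table above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

records = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
df = pd.DataFrame(records, columns=["outlook", "temperature", "humidity", "wind", "play"])

# One-hot encode the categorical inputs; the target stays as Yes/No labels.
X = pd.get_dummies(df[["outlook", "temperature", "humidity", "wind"]]).astype(int)
y = df["play"]

# criterion="entropy" splits on information gain, in the spirit of the C5.0 family.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Predict for a new day: sunny, cool, high humidity, strong wind.
new_day = pd.DataFrame(
    [("Sunny", "Cool", "High", "Strong")],
    columns=["outlook", "temperature", "humidity", "wind"],
)
new_X = pd.get_dummies(new_day).reindex(columns=X.columns, fill_value=0).astype(int)
print(tree.predict(new_X))  # expected: ['No'] (sunny with high humidity)
```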
C5.0 (multiple splits, no continuous targets) uses the C5.0 algorithm to build either a decision tree or a rule set. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain.

The Classification and Regression (C&R) Tree node is a tree-based classification and prediction method. Similar to C5.0, this method uses recursive partitioning to split the training records into segments with similar output field values (binary splits, continuous targets).

QUEST, or Quick, Unbiased, Efficient Statistical Tree, is a binary classification method for building decision trees. A major motivation in its development was to reduce the processing time required for large C&RT analyses with either many variables or many cases.

CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID first examines the cross-tabulations between each of the predictor variables and the outcome and tests for significance using a chi-square independence test.

A neural network is a model that emulates the human biological neural system to solve prediction and classification problems.
– It provides solutions for linear and non-linear relationships between input and output variables.
– It does not assume any particular data distribution.
Advantages:
• Has a mathematical foundation
• Robust with noisy data
• Detects relationships and trends in data that traditional methods overlook
• Can fit complex non-linear models
• Can detect all possible interactions between predictor variables
Disadvantages:
• A "black box" nature that is not easy to analyze and interpret
• A greater computational burden
• Virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena

Linear regression is an approach to modeling the relationship between a scalar dependent variable (y) and one or more predictor variables (X).
• The case of one predictor variable is called simple regression.
• More than one predictor variable is multiple regression.
The regression equation represents a straight line or plane that minimizes the squared differences between predicted and actual output values. This is a very common statistical technique for summarizing data and making predictions: y = f(x).
Advantages:
• Available in most software
• A widely accepted statistical technique
Disadvantages:
• Not appropriate for many non-linear problems
• Must meet underlying assumptions

Logistic regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable based on one or more predictor variables that may be either continuous or categorical.
1. Binomial or binary logistic regression refers to the case in which the observed outcome can have only two possible types (e.g., "dead" vs. "alive", "success" vs. "failure", or "yes" vs. "no").
2. Multinomial logistic regression refers to cases where the outcome can have three or more possible types (e.g., "better" vs. "no change" vs. "worse").
For example, logistic regression might be used to predict whether a new student will graduate within 6 years, based on observed characteristics of the student (test score, age, gender, pre-school preparation, etc.).
Advantages:
• A well-established statistical procedure
• Simple and easy to interpret
• Very fast to train and build
• Can be used with small sample sizes
Disadvantages:
• Strong sensitivity to outliers
• Multicollinearity
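To make the graduation example concrete, here is a minimal binary logistic regression sketch in Python with scikit-learn (an assumed tool choice; the workshop itself uses SPSS Modeler). The ten student records and the three predictors (admission test score, high-school GPA, first-term units completed) are invented purely for illustration; a real IR model would be trained on historical, institution-specific data.

```python
# Minimal sketch: binary logistic regression for the 6-year graduation example.
# All records below are made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns: admission test score, high-school GPA, first-term units completed.
X = np.array([
    [480, 2.4,  6], [520, 2.7,  9], [600, 3.1, 12], [650, 3.4, 15],
    [700, 3.6, 15], [560, 2.9, 12], [610, 3.3, 12], [450, 2.2,  6],
    [680, 3.8, 15], [590, 3.0,  9],
])
# 1 = graduated within 6 years, 0 = did not.
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])

# Standardize the predictors, then fit the logistic model.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Estimated probability that a new student (score 630, GPA 3.2, 12 units)
# graduates within 6 years.
new_student = np.array([[630, 3.2, 12]])
print(model.predict_proba(new_student)[0, 1])
```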
Cluster analysis is an exploratory data analysis tool (unsupervised) for solving classification problems.
• Its objective is to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters.
• It is not an automatic task, but an iterative process of knowledge discovery (interactive multi-objective optimization) that involves trial and error until the result achieves the desired properties.
The result of a cluster analysis can be shown, for example, as a coloring of squares into three clusters.
Types of clustering:
• K-Means
• Two-Step
• Kohonen
Advantages: reveals the make-up of groups in attitudinal or behavioral tests.
Disadvantages: individual group members may still differ.

K-Means clustering is an algorithm that classifies or groups objects into K groups based on their attributes/features, where K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid.
• Thus the purpose is to classify the data by partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Two-step cluster analysis is a technique that groups cases into pre-clusters that are treated as single cases. Standard hierarchical clustering is then applied to the pre-clusters in the second step.
• It is appropriate for large datasets or datasets that have a mixture of continuous and categorical variables (not interval or dichotomous).
• It processes data with a one-pass-through-the-dataset method. Therefore, it does not require a proximity table (like hierarchical classification) or an iterative process (like K-means clustering).
http://www.clustan.com

Kohonen networks are a type of neural network that performs clustering, also known as a knet or a self-organizing map.
• A Kohonen network seeks to describe a dataset in terms of natural clusters of cases. This type of network can be used to cluster a data set into distinct groups when you don't know what those groups are at the beginning.
• You don't even need to know the number of groups to look for. Kohonen networks start with a large number of units, and as training progresses, the units gravitate toward the natural clusters in the data.
Source: SPSS BI
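A minimal K-means sketch with scikit-learn follows (added here for illustration; it is not part of the original slides). The eight student records and the two attributes (term GPA, units attempted) are invented. Standardizing the attributes first keeps one variable from dominating the distance calculation.

```python
# Minimal sketch: K-means clustering of students on two invented attributes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: term GPA, units attempted per term (hypothetical values).
students = np.array([
    [3.8, 15], [3.6, 14], [3.9, 16], [3.5, 15],   # full-time, higher GPA
    [2.1,  6], [2.4,  7], [1.9,  6], [2.6,  8],   # part-time, lower GPA
])

# Standardize so GPA and unit load contribute comparably to the distances.
scaled = StandardScaler().fit_transform(students)

# Partition into K = 2 clusters by minimizing within-cluster squared distances.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print("cluster labels:", kmeans.labels_)
print("centroids (standardized units):", kmeans.cluster_centers_)
```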
Association or affinity analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals or groups. These relationships are then expressed as a collection of association rules.
• Association rules are statements of the form "if antecedent(s), then consequent(s)".
• They are used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers.
Types of association:
• GRI
• Apriori
• CARMA

For example, a list of customer purchases can be recast as a table of true/false flags per customer:

Customer | Purchase
1 | jam
2 | milk
3 | jam
3 | bread
4 | jam
4 | bread
4 | milk

Customer | Jam | Bread | Milk
1 | T | F | F
2 | F | F | T
3 | T | T | F
4 | T | T | T

The CRISP-DM process model (Source: www.crisp-dm.org) organizes a data mining project into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

The generic tasks and outputs of each phase (Source: SPSS BI):
• Business understanding: determine business objectives (background, business objectives, business success criteria); situation assessment (inventory of resources; requirements, assumptions, and constraints; risks and contingencies; terminology; costs and benefits); determine data mining goals (data mining goals, data mining success criteria); produce project plan (project plan, initial assessment of tools and techniques).
• Data understanding: collect initial data (initial data collection report); describe data (data description report); explore data (data exploration report); verify data quality (data quality report).
• Data preparation (outputs: data set, data set description): select data (rationale for inclusion/exclusion); clean data (data cleaning report); construct data (derived attributes, generated records); integrate data (merged data); format data (reformatted data).
• Modeling: select modeling technique (modeling technique, modeling assumptions); generate test design (test design); build model (parameter settings, models, model description); assess model (model assessment, revised parameter settings).
• Evaluation: evaluate results (assessment of data mining results w.r.t. business success criteria, approved models); review process (review of process); determine next steps (list of possible actions, decision).
• Deployment: plan deployment (deployment plan); plan monitoring and maintenance (monitoring and maintenance plan); produce final report (final report, final presentation); review project (experience documentation).

Data quality matters:
• Good data = better decisions = more profit
• Bad data = risky decisions = potential disaster
• Bad data = errors = losses
– "We cannot offer enough courses" = angry students who drop out or transfer to another institution.
– "You're not admitted to your intended major" = angry students and parents, lost revenue.
– "We have more rooms in the dorm for new students" = bad decisions if the number of students is inflated by bad data.

Measurement levels:
Scalar data refer to a quantity consisting of a single real number used to measure magnitude (size).
• Interval = a scale with a fixed and defined interval, e.g., temperature or time.
• Ordinal = a scale for ordering observations from low to high, with any ties attributed to a lack of measurement sensitivity, e.g., a score from a questionnaire.
• Nominal with order = a scale for grouping into categories with order, e.g., mild, moderate, or severe. This can be difficult to separate from ordinal.
• Nominal without order = a scale for grouping into unique categories, e.g., eye color.
• Dichotomous = as for nominal, but with two categories only, e.g., male/female.
Non-scalar data contain more than one value (e.g., lists, arrays, records).

Options for handling missing values:
• Case-wise (listwise) deletion
• Pairwise deletion
• Single-value substitution (by the mean, median, or mode of the variable)
• Regression substitution (using the values of other variables in the same row, or taking the overall relationships among variables into account)
• Marking with a dummy variable
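The sketch below illustrates three of these options with pandas (an assumed tool choice; the slides describe the options independently of any particular software). The toy data frame and its field names (hs_gpa, test_score) are hypothetical.

```python
# Minimal sketch: common missing-value treatments with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "hs_gpa":     [3.2, np.nan, 2.8, 3.9, np.nan],
    "test_score": [550, 610, np.nan, 700, 580],
})

# 1. Case-wise (listwise) deletion: drop any row with a missing value.
listwise = df.dropna()

# 2. Single-value substitution: replace missing values with the column mean.
mean_filled = df.fillna(df.mean(numeric_only=True))

# 3. Mark missingness with a dummy variable before filling, so a model
#    can still "see" which values were imputed.
flagged = df.assign(hs_gpa_missing=df["hs_gpa"].isna().astype(int))
flagged["hs_gpa"] = flagged["hs_gpa"].fillna(flagged["hs_gpa"].median())

print(listwise, mean_filled, flagged, sep="\n\n")
```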
Data exploration and selection:
• Identify outliers (Anomaly Detection node)
• Verify distributions (Data Audit node)
• Examine relationships among variables
• Assess the predictive power of variables (Auto Data Prep node)
• Data reduction

Things to review when selecting data:
• Data audit / data distribution charts
• Number of variables
• Number of records
• Information content / predictive power

A successful data mining strategy involves:
1. Making data mining models comprehensible to business users.
2. Translating the user's questions into a data mining problem:
– well-defined goals, project objectives, and questions.
3. Ensuring that sufficient and relevant data are used.
4. Closing the loop: identifying causality, suggesting actions, and measuring their effect.
– This requires domain expertise in institutional research to build, test, validate, and deploy models.
5. Careful consideration and selection of software and analysts (technical and domain experts).
6. Support from senior administrators (VPs and the President).
7. Coping with privacy and security issues.
8. Guarding against the misuse of information and inaccurate information.

Free open-source data mining software and applications:
• R
• RapidMiner
• WEKA
Commercial data mining software and applications:
• PASW Modeler (IBM)
• STATISTICA Data Miner (StatSoft)
• Enterprise Miner (SAS)
• Oracle Data Mining
• CART/MARS (Salford Systems), low price
• XLMiner ($199)

Information:
www.kdnuggets.com/
www-01.ibm.com/software/analytics/spss/products/modeler
www.educationaldatamining.org/index.html
www.sigkdd.org/
www.thearling.com/
Training:
www.the-modeling-agency.com
http://web.ccsu.edu/datamining/
www.kdnuggets.com/education/usa-canada.html

Public data sources:
http://kdd.ics.uci.edu/
http://archive.ics.uci.edu/ml/
http://www.fedstats.gov/
http://www.census.gov/
http://nces.ed.gov/surveys/SurveyGroups.asp?group=2