Fundamental Data Mining in Institutional Research

Sutee Sujitparapitaya, Ph.D.
Associate Vice President for Institutional Effectiveness and Analytics
San José State University
Email: [email protected]
Copyright © Sutee Sujitparapitaya, 2011‐2015
Data mining techniques are widely used for data analysis. While data mining may be viewed as expensive, time‐consuming, and too technical to understand and apply, it is an institutional research tool for efficiently managing and extracting data from large databases and for expediting reporting through the use of statistical algorithms. This workshop will introduce the basic foundations of data mining and identify types of data typically found in large institutional databases, research questions to consider before mining data, and issues of data quality.
– It will also address how to mix traditional institutional research tools with data mining, and field additional questions typically posed by novices.
– Emphasis will be on a beginner's (novice) perspective, with a focus on institutional research data applications.
• Describe the basic foundations of data mining from an institutional research (IR) perspective.
• Explain the principal components of IR data and research questions.
• Describe why the data mining process (CRISP‐DM methodology) and primary techniques are valuable for IR.
• Describe how data quality and data selection work.
• Explain the primary features of data mining tools.
• Describe the relevant resources available to help with data mining projects.
Strategic Decision Making
Analyzing trends
Wealth Generation
Security
Data mining is the process of finding hidden trends, patterns, and relationships in data that are not immediately apparent from summarizing the data. It examines data in large databases and infers rules to (a) obtain insight and (b) predict future behavior. For example: finding patterns in student data to explain attrition or to identify students at risk of dropping out of school.

Motivation for data mining:
1. The important need for turning data into useful information.
2. The fast-growing amount of data, collected and stored in large and numerous databases, has exceeded the human ability to comprehend it without powerful tools.
3. We are drowning in data, but starving for knowledge!
Traditional Statistics (distributions, mathematics, etc.)

Machine Learning: the discipline concerned with the design and development of algorithms that give computers the ability to learn without being explicitly programmed (computer science, heuristics, and induction algorithms).

Artificial Intelligence: the study and design of intelligent agents that emulate human intelligence.

Neural Networks: a mathematical model that uses an interconnected group of artificial neurons to process information between inputs and outputs or to find patterns in data. It is an adaptive model that changes its structure during a learning phase (biological models, psychology, and engineering).
Evolutionary Step | Business Question | Enabling Technologies | Characteristics

Data Collection (1960s): "What was the number of new applications for the last five years?" | Computers, tapes, disks | Retrospective, static data delivery

Data Access (1980s): "What was the number of new applications for the College of Business last March?" | Relational databases, SQL, ODBC | Retrospective, dynamic data delivery at the record level

Data Warehousing & Decision Support (1990s): "What was the number of new applications for the College of Business last March? Drill down to Accounting majors." | OLAP, multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels

Data Mining (at present): "What's likely to happen to the number of new Accounting applications next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery
Statistics:
Conceptual Model (Hypothesis) + Statistical Reasoning = "Proof" (Validation of Hypothesis)

Data Mining:
Data + Data Mining Algorithm based on Interestingness = Pattern Discovery (Model, Rule)
Association Rules describe a method for discovering interesting relations between variables in large databases. The method produces dependency rules that predict the occurrence of an item based on the occurrences of other items.

Example 1: Which products are frequently bought together by customers? (Basket Analysis)
• DataTable = Receipts x Products
• Results could be used to change the placement of products

Example 2: Which courses tend to be attended together?
• DataTable = Students x Courses
• Results could be used to avoid scheduling conflicts
Market basket analysis identifies customers' purchasing habits. It provides insight into the combination of products within a customer's 'basket'. Ultimately, the purchasing insights provide the potential to create cross-sell propositions:
• Which product combinations are bought;
• When they are purchased; and
• In what sequence.

Observation | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diapers, Milk
4 | Beer, Bread, Diapers, Milk
5 | Coke, Diapers, Milk

Rules Discovered:
{Milk} → {Coke}
{Diapers, Milk} → {Beer}
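The two discovered rules can be checked by hand from the five receipts. A minimal Python sketch (transactions taken from the table above; the support/confidence functions are my own illustration, not a full Apriori implementation):

```python
# Support and confidence for association rules, computed directly on
# the five-receipt example from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Coke", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A ∪ C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"Milk"}, {"Coke"}))             # 0.75
print(confidence({"Diapers", "Milk"}, {"Beer"}))  # 2/3, about 0.667
```

Three of the four milk-buyers also bought Coke, so {Milk} → {Coke} holds with 75% confidence; a real mining run would enumerate and filter many such rules by minimum support and confidence.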
The government's data mining projects fall into two broad categories:
1. Subject‐based data mining retrieves data that could help an analyst follow a lead, and
2. Pattern‐based data mining looks for suspicious behaviors across a spread of activities.

Most data mining experts consider the former a version of traditional police work—chasing down leads—but instead of a police officer examining a list of phone numbers of suspect calls, a computer does it. One subject‐based data mining technique gaining traction among government practitioners and academics is called link analysis. Link analysis uses data to make connections between seemingly unconnected people or events.
Data Visualization is the study of the visual representation of data, meaning "information that has been abstracted in some schematic form."
– It refers to techniques for communicating information clearly and effectively through graphical means (e.g., creating images, diagrams, or animations).

Source: Bradbury Science Museum, Los Alamos, NM
Data + Interestingness or Criteria = Hidden Patterns
Interaction data
- Offers
- Results
- Context
- Click streams
- Notes

Attitudinal data
- Opinions
- Preferences
- Needs
- Desires

Descriptive data
- Attributes
- Characteristics
- Self-declared info
- (Geo)demographics

Behavioral data
- Orders
- Transactions
- Payment history
- Usage history

Source: SPSS BI
• Too many records
• Too many variables
• Complex non‐linear relationships
• Multi‐variable combinations
• Proactive and prospective approach

Source: Abbot, Data Mining: Level II
Traditional IR Work:
Data file => Descriptive/Regression Analysis => Tabulations/Reports
(Historical → Predictive)

Data Mining Driven IR Work:
Database => Data Mining (Visualization, Association, Clustering, Predictive Modeling) => Immediate Actions
(Historical → Predictive)
Types of Interestingness:
• Frequency
• Correlation
• Length of occurrence (for sequences)
• Consistency
• Repeating/Periodicity
• Abnormal behaviors
• Other patterns of interestingness
Typical DBMS Approach vs. Data Mining Approach

DBMS: What are the total applications during the last 3 years?
DM: Which inquiries are most likely to turn into actual applications?

DBMS: What is the first-year retention of the fall 2006 first‐time freshmen from under‐represented minorities?
DM: What are the most important parameters to predict first-year attrition for next year's entering freshmen?

DBMS: How many freshmen attended the freshman orientation in November for the last 5 years?
DM: Who is likely to enroll in the freshman orientation during the month of November?

DBMS: What were the total pledges from California alumni donations last year?
DM: Who is likely to make pledges for alumni donations?

DBMS: How many "agree" and "strongly agree" responses did we receive from the 2008 student/faculty satisfaction surveys?
DM: What are the main clusters found in the student/faculty satisfaction surveys?
What do we know about our students?

DBMS Approach:
• List of students who passed the English Proficiency Exam in the spring
• Summary of student profiles for those who failed and dropped out last semester
• How many students enrolled in the Business Policy course last fall semester?

Data Mining Approach:
• What factors contribute to learning?
• Who is likely to fail or drop out by the end of their 6th year?
• Which courses provide high FTES and better use of space?
• What are the course-taking patterns?
DBMS Approach:
• List all items that were sold in the last month
• List all the items purchased by Sandy Smith
• The total sales of the last month, grouped by branch
• How many sales transactions occurred in the month of December?

Data Mining Approach:
• Which items are sold together? What items should we stock?
• How should we place items? What discounts should we offer?
• How best to target customers to increase sales?
• Which clients are most likely to respond to my next promotional mailing, and why?
Supervised Data Mining refers to prior knowledge of what outcomes exist in the data.
• Classification and Prediction → describe and distinguish data classes or concepts, so the model can be used to predict the class of objects whose class label is unknown.

Unsupervised Data Mining is used when the researcher has no idea what hidden patterns exist in the vast database.
• Clustering → involves accurate identification of group membership based on maximizing the intraclass similarity and minimizing the interclass similarity.
• Associations and Sequences → identify relationships between events that occur at one time, determining which things go together or which sequential patterns exist in the data.
Categorize your students → Clustering
• Cafeteria meal planning
• Student housing planning

Predict student retention / alumni donations → Neural Nets/Regression
• Identify high-risk students
• Estimate/predict alumni contributions
• Predict new student application rate

Group similar students → Segmentation
• Course planning
• Academic scheduling
• Identify student preferences for clubs and social organizations

Identify courses that are taken together → Association
• Faculty teaching load estimation
• Course planning
• Academic scheduling

Find patterns and trends over time → Sequence
• Predict alumni donations
• Predict potential demand for library resources

Source: Thulasi Kumar, 2004
Classification and Prediction
• Decision Trees (C&RT, C5.0, CHAID, and QUEST)
• Neural Networks
• Regressions (Linear and Logistic)

Clustering
• K‐Means, TwoStep, and Kohonen SOM

Association Rule/Affinity Analysis
• Generalized Rule Induction (GRI)
• CARMA (Continuous Association Rule Mining Algorithm)
• APRIORI
Decision trees are tree‐shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. The model predicts the value of a target variable based on several input variables.

Two primary types of decision trees:
1. Classification tree analysis is used when the predicted outcome is the class to which the data belongs.
2. Regression tree analysis is used when the predicted outcome can be considered a real number (e.g., the price of a house, or a patient's length of stay in a hospital).

Advantages:
• Fast
• Simple to understand and interpret
• Validation using statistical tests

Disadvantages:
• Inherently unstable
• Can become large and complex
Dependent Variable:
• Target classification is "should we play baseball?", which can be yes or no.

Input Variables:
• Weather attributes are outlook, temperature, humidity, and wind speed. They can have the following values:
  o outlook = { sunny, overcast, rain }
  o temperature = { hot, mild, cool }
  o humidity = { high, normal }
  o wind = { weak, strong }
Day | Outlook | Temperature | Humidity | Wind | Play ball
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
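The splitting criterion a C5.0-style tree applies to this table can be computed by hand. A pure-Python sketch of information gain (entropy of the target minus the weighted entropy after a split; the rows are the 14 days above):

```python
# Information gain for the "play ball" table: the attribute with the
# largest gain is chosen as the first split of the decision tree.
from collections import Counter
from math import log2

rows = [  # (outlook, temperature, humidity, wind, play)
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_index):
    """Target entropy minus the weighted entropy after splitting."""
    gain = entropy([r[-1] for r in rows])
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

names = ["Outlook", "Temperature", "Humidity", "Wind"]
gains = {name: info_gain(i) for i, name in enumerate(names)}
# Outlook has the largest gain (about 0.247), so the tree splits on it first.
```

Outlook wins because the Overcast branch is pure (all Yes), which drives its weighted entropy down the most.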
C5.0 (multiple splits, no continuous targets) uses the C5.0 algorithm to build either a decision tree or a rule set. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain.

The Classification and Regression (C&R) Tree node is a tree-based classification and prediction method. Similar to C5.0, this method uses recursive partitioning to split the training records into segments with similar output field values. (Binary split, continuous target)

QUEST—or Quick, Unbiased, Efficient Statistical Tree—is a binary classification method for building decision trees. A major motivation in its development was to reduce the processing time required for large C&RT analyses with either many variables or many cases.

CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID first examines the cross-tabulations between each of the predictor variables and the outcome and tests for significance using a chi-square independence test.
A neural network is a model that emulates the human biological neural system to solve prediction and classification problems.
– Provides solutions for linear and non‐linear relationships between input and output variables.
– Does not assume any particular data distribution.
Advantages:
• Has a mathematical foundation
• Robust with noisy data
• Detects relationships and trends in data that traditional methods overlook
• Can fit complex non‐linear models
• Can detect all possible interactions between predictor variables

Disadvantages:
• "Black box" nature that is not easy to analyze and interpret
• Greater computational burden
• Virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena
Linear regression is an approach to modeling the relationship between a scalar dependent variable (y) and one or more predictor variables (X).
• The case of one predictor variable is called simple regression.
• More than one predictor variable is multiple regression.

The regression equation represents a straight line or plane that minimizes the squared differences between predicted and actual output values. This is a very common statistical technique for summarizing data and making predictions: y = f(x).

Advantages:
• Available in most software
• Widely accepted statistical technique

Disadvantages:
• Not appropriate for many non‐linear problems
• Must meet underlying assumptions
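Simple regression can be fitted with the closed-form least-squares formulas. A minimal sketch; the data are hypothetical (admission test score x vs. first-year GPA y), chosen only to illustrate the computation:

```python
# Simple (one-predictor) least-squares regression, y = a + b*x.
xs = [50, 60, 70, 80, 90]       # hypothetical admission test scores
ys = [2.0, 2.4, 2.8, 3.2, 3.6]  # hypothetical first-year GPAs

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²);  a = ȳ - b·x̄
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def predict(x):
    return a + b * x

print(predict(75))  # predicted GPA for a test score of 75 -> 3.0
```

Multiple regression generalizes the same minimization to a plane, which statistical packages solve via the normal equations or matrix decompositions.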
Logistic regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable based on one or more predictor variables that may be either continuous or categorical.
1. Binomial or binary logistic regression refers to the instance in which the observed outcome can have only two possible types (e.g., "dead" vs. "alive", "success" vs. "failure", or "yes" vs. "no").
2. Multinomial logistic regression refers to cases where the outcome can have three or more possible types (e.g., "better" vs. "no change" vs. "worse").

For example, logistic regression might be used to predict whether a new student will graduate within 6 years, based on observed characteristics of the student (test score, age, gender, pre‐school preparation, etc.).

Advantages:
• Well-established statistical procedure
• Simple and easy to interpret
• Very fast to train and build
• Can be used with small sample sizes

Disadvantages:
• Strong sensitivity to outliers
• Multicollinearity
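The graduation example can be sketched with a binary logistic model fitted by plain gradient descent. This is a toy illustration, not what a statistics package does internally (packages typically use maximum likelihood via iteratively reweighted least squares), and the data are invented:

```python
# Binary logistic regression fitted by gradient descent on the log-loss.
# Hypothetical data: first-year units completed vs. 6-year graduation.
from math import exp

xs = [6, 9, 12, 15, 24, 27, 30]  # units completed (hypothetical)
ys = [0, 0, 0, 0, 1, 1, 1]       # 1 = graduated within 6 years

def sigmoid(z):
    return 1 / (1 + exp(-z))

w, b = 0.0, 0.0                  # slope and intercept
lr = 0.01                        # learning rate

for _ in range(5000):            # many small corrective steps
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y   # gradient of the log-loss
        w -= lr * err * x
        b -= lr * err

p_low = sigmoid(w * 6 + b)       # predicted P(graduate) at 6 units
p_high = sigmoid(w * 30 + b)     # predicted P(graduate) at 30 units
```

After training, the predicted probability rises with units completed, which is exactly the kind of monotone risk estimate an IR office would read off the fitted coefficients.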
Cluster analysis is an exploratory data analysis tool (unsupervised) for solving classification problems.
• Its object is to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters.
• It is not an automatic task, but an iterative process of knowledge discovery (interactive multi‐objective optimization) that involves trial and failure until the result achieves the desired properties.

The result of a cluster analysis can be shown as the coloring of the squares into three clusters.

Types of Clustering:
• K‐Means
• Two‐Step
• Kohonen

Advantages: Reveals the make-up of groups in attitudinal or behavioral tests
Disadvantages: Individual group members may still differ
K‐Means clustering is an algorithm for classifying or grouping objects based on attributes/features into K groups, where K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroid.
• Thus the purpose is to classify the data by partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
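The two alternating steps of K-Means can be written out by hand. A one-dimensional sketch with k = 2; the data (hypothetical high-school GPAs) and starting centroids are invented for illustration:

```python
# One-dimensional K-Means with k = 2: assign each point to its nearest
# centroid, then move each centroid to the mean of its assigned points.
data = [2.0, 2.1, 2.3, 3.6, 3.8, 3.9]  # hypothetical high-school GPAs
centroids = [2.0, 3.9]                 # initial guesses

for _ in range(10):                    # usually converges in a few passes
    clusters = [[], []]
    for x in data:                     # assignment step
        nearest = min(range(2), key=lambda k: abs(x - centroids[k]))
        clusters[nearest].append(x)
    # update step (assumes no cluster ends up empty, true for this data)
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # roughly [2.13, 3.77]
```

Real implementations work in many dimensions with Euclidean distance and restart from several random initializations, since the result depends on the starting centroids.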
Two‐step cluster analysis is a technique that groups cases into pre‐clusters that are treated as single cases. Standard hierarchical clustering is then applied to the pre‐clusters in the second step.
• It is appropriate for large datasets or datasets that have a mixture of continuous and categorical variables (not just interval or dichotomous).
• It processes data with a one‐pass‐through‐the‐dataset method. Therefore, it does not require a proximity table (like hierarchical classification) or an iterative process (like K‐means clustering).
http://www.clustan.com
Kohonen networks are a type of neural network that performs clustering, also known as a knet or a self‐organizing map.
• It seeks to describe the dataset in terms of natural clusters of cases. This type of network can be used to cluster the data set into distinct groups when you don't know what those groups are at the beginning.
• You don't even need to know the number of groups to look for. Kohonen networks start with a large number of units, and as training progresses, the units gravitate toward the natural clusters in the data.

Source: SPSS BI
Association or affinity analysis is a data mining technique that discovers co‐occurrence relationships among activities performed by specific individuals or groups. These relationships are then expressed as a collection of association rules.
• Association rules are statements in the form: if antecedent(s), then consequent(s).
• Used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers.

Types of Association:
• GRI
• Apriori
• CARMA
Customer | Purchase
1 | jam
2 | milk
3 | jam
3 | bread
4 | jam
4 | bread
4 | milk

Customer | Jam | Bread | Milk
1 | T | F | F
2 | F | F | T
3 | T | T | F
4 | T | T | T
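The transformation shown above — from a purchase log to one True/False column per product — is the "basket" layout that association algorithms expect. A small Python sketch of the same pivot (data copied from the table):

```python
# Pivot a long-format purchase log into a boolean basket table,
# one True/False flag per (customer, product) pair.
purchases = [(1, "jam"), (2, "milk"), (3, "jam"), (3, "bread"),
             (4, "jam"), (4, "bread"), (4, "milk")]

customers = sorted({c for c, _ in purchases})
products = ["jam", "bread", "milk"]

basket = {
    c: {p: (c, p) in purchases for p in products}
    for c in customers
}

print(basket[1])  # {'jam': True, 'bread': False, 'milk': False}
print(basket[4])  # {'jam': True, 'bread': True, 'milk': True}
```

Tools such as SPSS Modeler or database GROUP BY queries do the same reshaping at scale before the rule-induction step runs.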
The six CRISP‐DM phases:
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment

Source: www.crisp‐dm.org
Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation
• Data Set; Data Set Description
• Select Data: Rationale for Inclusion/Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation

Source: SPSS BI
• Good data = better decisions = more profit
• Bad data = risky decisions = potential disaster
• Bad data = errors = losses
– "We cannot offer enough courses" = angry students who drop out or transfer out to another institution
– "You're not admitted to your intended major" = angry students and parents, lost revenue
– "We have more rooms in the dorm for new students" = bad decisions if the number of students is inflated by bad data
Scalar refers to a quantity consisting of a single real number used to measure magnitude (size).
• Interval = scale with a fixed and defined interval, e.g., temperature or time.
• Ordinal = scale for ordering observations from low to high, with any ties attributed to lack of measurement sensitivity, e.g., a score from a questionnaire.
• Nominal with order = scale for grouping into categories with order, e.g., mild, moderate, or severe. This can be difficult to separate from ordinal.
• Nominal without order = scale for grouping into unique categories, e.g., eye color.
• Dichotomous = as for nominal, but with two categories only, e.g., male/female.

Non‐scalar data contain more than one value (e.g., lists, arrays, records).
• Casewise (listwise) deletion
• Pairwise deletion
• Single-value substitution (by the mean, median, or mode of the variable)
• Regression substitution (using values of other variables in the same row, or taking the overall relationships among variables into account)
• Marking with a dummy variable
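Single-value substitution, the simplest of the options above, can be sketched in a few lines. The column of scores is hypothetical; real workflows would also flag which values were imputed (the dummy-variable option):

```python
# Mean substitution: fill each missing entry (None) with the mean
# of the observed values in the same column.
from statistics import mean

sat_scores = [1100, None, 980, 1210, None, 1050]  # hypothetical column

observed = [v for v in sat_scores if v is not None]
fill = mean(observed)  # 1085 for these values
imputed = [fill if v is None else v for v in sat_scores]

print(imputed)
```

Mean substitution preserves the column mean but shrinks its variance, which is why regression substitution or deletion is often preferred when many values are missing.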
• Identify outliers (Anomaly Detection node)
• Verify distributions (Data Audit node)
• Examine relationships among variables
• Assess the predictive power of variables (Auto Data Prep node)
• Data reduction
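Outside a modeling tool, the outlier check can be done with a simple z-score rule. A sketch on hypothetical ages, using a 2-standard-deviation cutoff (a common screening threshold; an extreme value inflates the standard deviation itself, so robust methods based on the median are often preferable):

```python
# Flag values more than 2 standard deviations from the mean.
from statistics import mean, stdev

ages = [18, 19, 19, 20, 21, 22, 18, 95]  # 95 looks like a data-entry error

mu, sigma = mean(ages), stdev(ages)
outliers = [a for a in ages if abs(a - mu) / sigma > 2]

print(outliers)  # [95]
```

Flagged records should be inspected rather than deleted automatically: 95 could be a keying error for 19, or a genuine (if rare) student age.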
• Data Audit/Data Distribution charts
• Number of variables
• Number of records
• Information content/Predictive power
A successful data mining strategy involves:
1. Making data mining models comprehensible to business users.
2. Translating users' questions into a data mining problem.
   – Well-defined goals, project objectives, and questions.
3. Ensuring the use of sufficient and relevant data.
4. Closing the loop: identify causality, suggest actions, and measure their effect.
   – Requires domain expertise in institutional research to build, test, validate, and deploy models.
5. Careful consideration and selection of software and analysts (technical and domain experts).
6. Support from senior administrators (VPs and the President).
7. Coping with privacy and security issues.
8. Guarding against misuse of information and inaccurate information.
Free Open‐source Data Mining Software and Applications:
• R
• RapidMiner
• WEKA

Commercial Data Mining Software and Applications:
• PASW Modeler (IBM)
• STATISTICA Data Miner (StatSoft)
• Enterprise Miner (SAS)
• Oracle Data Mining
• CART/MARS (Salford Systems) - low price
• XLMiner ($199)
Information:
• www.kdnuggets.com/
• www‐01.ibm.com/software/analytics/spss/products/modeler
• www.educationaldatamining.org/index.html
• www.sigkdd.org/
• www.thearling.com/

Training:
• www.the‐modeling‐agency.com
• http://web.ccsu.edu/datamining/
• www.kdnuggets.com/education/usa‐canada.html
• http://kdd.ics.uci.edu/
• http://archive.ics.uci.edu/ml/
• http://www.fedstats.gov/
• http://www.census.gov/
• http://nces.ed.gov/surveys/SurveyGroups.asp?group=2