Data Mining
Survey of applications and methodologies
- Akshat Singhal, Oberlin College, 2007
Presentation Summary
• What is Data mining?
• Evolution of Data mining
• Applications
• Process
• Models : Predictive vs Descriptive
• Decision Tree (Classification Rules) Example
• Association Rules Example
• Text Mining Example
• Software used
What Is Data Mining?
• Also called Knowledge-Discovery in Databases
(KDD)
• “the extraction of hidden predictive information from
large databases”
OR
the process of automatically searching large
volumes of data for patterns
• Answering questions such as
“What products are candy buyers most likely to buy this month?”
“What kind of credit card transaction is a likely fraud?”
“What colour of automobile is the most associated with accidents?”
Evolution of Data Mining
• Data Collection (1960s)
– Business question: “How many widgets were sold this year?”
– Enabling technology: computers, tapes, disks
• Data Access (1980s)
– Business question: “How many widgets were sold and for what cost this year?”
– Enabling technology: relational databases (RDBMS)
• Data Warehousing and Decision Support
– Business question: “How many widgets were sold without discount in the recently acquired Puerto Rico store of Giant Corp, Inc.?”
– Enabling technology: on-line analytical processing (OLAP), multidimensional databases, data warehouses
• Data Mining
– Business question: “How many widgets will be sold in Cleveland next year?”
– Enabling technology: machine learning; technologies for handling mass storage and computation, like RAID and SMP
Files → RDBMS → OLAP → Data Mining
What Is Data Mining NOT?
• Data Entry/Storage/Access or connectivity
among diverse Data Sources (Data
Warehousing)
• Presenting Data in a better format (Data
Presentation / Interfacing)
• Brute-Force algorithm application for
generating data about data (Statistics).
• Finding relations that don’t manifest
themselves in the given data (Business
Strategy).
Types of Data Mining:
1. Forecasting what may happen in the
future
2. Classifying and Clustering data items
into groups by recognizing patterns
3. Associating events (attribute values)
that are likely to occur together
4. Sequencing events that are likely to
lead to later events
Example Applications
• Fraud/Non-Compliance Anomaly detection (government)
• Customer Profiling
• Credit/Risk Scoring
• Intrusion detection
• Maximizing profitability (cross selling, identifying profitable customers)
• Parts failure prediction
• Web Mining
• Market Basket Analysis
• Weather Prediction
• “Fun” statistics
• Using patterns in medical test results for diagnosis
• Product Recommendations
Success Stories
• HSBC – used data mining to target customer mailings better (e.g. not sending car loan brochures to millionaires).
• DEA – analyzed suspects’ calls to catch drug peddlers (i.e. don’t say LSD on the phone).
• IRS – better scheduling, catching tax fraud.
• Daimler-Benz – used data mining to analyze testing data for F-Cell fuelled vehicles.
• Walmart – analyzing 7.5 TB of customer and supplier data.
Privacy Concerns
•Data mining extracts new insights from
old data.
•This data may have been collected with
a stated purpose of record-keeping only.
•Results of data mining can classify
people as high risk/potentially criminal
and hence hurt them
•Many believe data mining is the same as
The Man simply stealing information
(the mining metaphor is ambiguous)
Issues of Scale
• Common data sets are non-trivial in size, usually on the order of terabytes.
• Data is almost never consistent in quality.
• A top-down approach is needed to solve data mining problems.
• The Answer: Standard process for data
mining: CRISP-DM (CRoss Industry
Standard Process for Data Mining)
CRISP-DM
• Proposed by SPSS, Daimler-Benz, and
OHRA in 1996
• Follows uniform and well-documented
guidelines.
• Flexible as to the type of:
– Business/agency problems
– Data
– Application software (i.e. software tools
used for analysis)
• Very similar to the standard Software
Development Process (top-down model)
Phases of CRISP-DM
• Business Understanding
– Determine Business Objectives: Background; Business Objectives; Business Success Criteria
– Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
– Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
– Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
– Collect Initial Data: Initial Data Collection Report
– Describe Data: Data Description Report
– Explore Data: Data Exploration Report
– Verify Data Quality: Data Quality Report
• Data Preparation
– Phase outputs: Data Set; Data Set Description
– Select Data: Rationale for Inclusion / Exclusion
– Clean Data: Data Cleaning Report
– Construct Data: Derived Attributes; Generated Records
– Integrate Data: Merged Data
– Format Data: Reformatted Data
• Modeling
– Select Modeling Technique: Modeling Technique; Modeling Assumptions
– Generate Test Design: Test Design
– Build Model: Parameter Settings; Models; Model Description
– Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
– Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
– Review Process: Review of Process
– Determine Next Steps: List of Possible Actions; Decision
• Deployment
– Plan Deployment: Deployment Plan
– Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
– Produce Final Report: Final Report; Final Presentation
– Review Project: Experience Documentation
CRISP-DM: Stage 1
• Define business objective.
• Define data mining objective.
• Define set of data to be used, and
identify outliers in the data.
• Gauge reliability of analysis
• Reasons:
– Business Objectives are often unclear. (e.g.
cutting mailing costs vs. finding new areas
to campaign in)
– Data quality varies widely, even in large
well-structured organizations.
Stage 2-3: Data Preparation
• Evaluating quality of data
• Statistical outliers, incomplete data, and sparse data
must be accounted for.
• Data may need to be transformed (for instance, by
logarithm function) for useful statistics.
• Bad quality data:
– Sparse data: e.g. in Market Basket analysis, one customer
never buys the whole store, so the resulting matrix is very
sparse.
– Incomplete data: e.g.
• people do not answer every question in surveys.
• Data from a 10-year-old IBM mainframe needs conversion and standardization.
• Non-entries can manifest themselves as 0 or some default value.
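The two preparation steps mentioned above (treating default values as missing, and log-transforming skewed values) can be sketched as follows. This is a minimal illustration, not part of the original slides: the income figures, the choice of 0 as the "non-entry" value, and all function names are assumptions made up for the example.

```python
import math

# Hypothetical raw records: 0 is assumed to mean "not recorded", not a real zero.
raw_incomes = [42000, 0, 58000, 1250000, 0, 37000]
MISSING_SENTINEL = 0  # assumed default value that non-entries show up as


def clean(values, sentinel=MISSING_SENTINEL):
    """Replace sentinel values with None so they are not mistaken for real data."""
    return [None if v == sentinel else v for v in values]


def log_transform(values):
    """Apply log10 to present values; missing values stay None."""
    return [None if v is None else math.log10(v) for v in values]


def mean_ignoring_missing(values):
    """Average only the values that are actually present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


cleaned = clean(raw_incomes)
logged = log_transform(cleaned)

print(cleaned)
print([None if v is None else round(v, 2) for v in logged])
# Compare the mean with sentinels counted as real zeros vs. treated as missing.
print(mean_ignoring_missing(raw_incomes), mean_ignoring_missing(cleaned))
```

Counting the sentinel zeros as real values pulls the mean down, and the single very large income dominates the raw scale; after the log transform the values sit in a much narrower range.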
Stage 4: Modelling
• Predictive models:
– The output is a function or distribution that predicts values for individual objects.
– e.g. deciding whether or not to play, given that it’s sunny outside and humidity is high.
– Use Classification Rules
– Classification looks for associations with a single target attribute (say, Class = Ham or Spam)
• Descriptive models:
– The output is a set of interesting (local, marginal) properties of the distribution.
– e.g. if it’s sunny and we decide to play, the temperature must be cool.
– Use Association Rules
– Associations are more numerous because they can be between any number of attributes.
Algorithms
Predictive:
• Regression algorithms: neural networks, Rule Induction
• Classification algorithms: CHAID, C5.0, Naïve Bayesian Classifier
Descriptive:
• Clustering/Grouping algorithms: K-means, Kohonen maps
• Association algorithms: GRI
Decision Tree Induction Example (C4.5)
• From this data, the C4.5 algorithm infers classification rules such as:
– If Outlook = sunny and Humidity <= 75, then Play = yes
– If Outlook = rainy and Windy = true, then Play = no
• Rules can be represented as a decision tree. In this example, the rules help predict whether a game will be played, based on weather data.
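To decide which attribute to test at each node, C4.5 scores candidate splits with information gain, normalized into a gain ratio. The sketch below shows that scoring step only, not the full algorithm. The class counts are an assumption: they are taken from the classic 14-day weather data set that ships with Weka, which the slide's example appears to use; the function names are illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder


def gain_ratio(parent_counts, child_counts):
    """Information gain divided by the entropy of the split sizes."""
    split_info = entropy([sum(child) for child in child_counts])
    if split_info == 0:
        return 0.0  # a split with a single branch carries no information
    return information_gain(parent_counts, child_counts) / split_info


# Assumed counts of [Play = yes, Play = no] in the classic weather data set.
play = [9, 5]
by_outlook = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rainy
by_windy = [[3, 3], [6, 2]]            # windy = true, windy = false

print(round(information_gain(play, by_outlook), 3))
print(round(information_gain(play, by_windy), 3))
print(round(gain_ratio(play, by_outlook), 3))
```

With these counts Outlook scores higher than Windy, which is consistent with Outlook being tested first in the rules.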
Association Rules Example
• Given data about Contact Lenses use
and eye characteristics for a number of
people,
• Find such associations in the data:
– If tear production rate = reduced
(low), then contact-lenses=none
(i.e. finding the association that people with
dry eyes are not prescribed contact lenses)
– If contact-lenses=hard, then
astigmatism=true
(i.e. finding the association that people with
astigmatism are prescribed hard lenses)
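Association rules like these are usually judged by their support (how often the whole rule holds in the data) and confidence (how often the conclusion holds when the condition does). A minimal sketch follows; the six records are invented for illustration and are not the contact lens data from the slide, and the field and function names are assumptions.

```python
# Hypothetical toy records, invented for illustration only.
records = [
    {"tear_rate": "reduced", "astigmatism": False, "lenses": "none"},
    {"tear_rate": "reduced", "astigmatism": True, "lenses": "none"},
    {"tear_rate": "normal", "astigmatism": True, "lenses": "hard"},
    {"tear_rate": "normal", "astigmatism": False, "lenses": "soft"},
    {"tear_rate": "normal", "astigmatism": True, "lenses": "hard"},
    {"tear_rate": "normal", "astigmatism": False, "lenses": "none"},
]


def matches(record, conditions):
    """True if the record has every attribute value listed in conditions."""
    return all(record[key] == value for key, value in conditions.items())


def support(rows, conditions):
    """Fraction of all rows that satisfy the conditions."""
    return sum(matches(r, conditions) for r in rows) / len(rows)


def confidence(rows, antecedent, consequent):
    """Among rows matching the antecedent, the fraction matching the consequent."""
    covered = [r for r in rows if matches(r, antecedent)]
    if not covered:
        return 0.0
    return sum(matches(r, consequent) for r in covered) / len(covered)


rule_if = {"tear_rate": "reduced"}
rule_then = {"lenses": "none"}
print(support(records, {**rule_if, **rule_then}))
print(confidence(records, rule_if, rule_then))
# The reverse direction of a rule can have a different confidence.
print(confidence(records, {"lenses": "hard"}, {"astigmatism": True}))
print(confidence(records, {"astigmatism": True}, {"lenses": "hard"}))
```

Note that a rule and its reverse are scored separately: in this toy data every hard-lens wearer has astigmatism, but not everyone with astigmatism wears hard lenses.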
Text Mining Example
• Oberlinconfessional.com is a restricted (to
Oberlin) website for anonymous confessions.
• “Automatically Categorizing Written Texts
by Author Gender” by Moshe Koppel
describes an algorithm for predicting the
gender of a text’s writer based on word
occurrences.
Results:
[Chart: “Conditional Distribution Host vs. Gender Grade Sums (2)”. Y-axis: Percentage of Total Gender Score for Gender (0.00% to 14.00%); x-axis: Hour (1 to 24); series: Male, Female.]
• Posts are more male than female at 6:00 AM, 7:00 AM, and 5:00 PM (possible reason: women don’t stay up that late).
• Posts are more female than male throughout the rest of the day (possible reason: there are more women than men in the community).
Software
• Weka toolkit: Java-based open source data mining
workbench (with reusable code) –
http://www.cs.waikato.ac.nz/ml/weka/
• Pentaho – Open Source Business Intelligence suite.
http://www.pentaho.com/
• IBM DB2 Data Warehouse Edition – complete data
warehouse suite with mining and visualizing
capabilities. (easily googleable)
• SPSS – Back-end software as well as a range of
industry-specific data mining solutions.
http://www.spss.com/
• SAS – Commercial Text mining tools and Business
Intelligence server.
http://www.sas.com/
Presentation Summary
Slide was repeated because YOU are a hetero-associative learner.
• What is Data mining?
• Evolution of Data mining
• Applications
• Process
• Models : Predictive vs Descriptive
• Decision Tree (Classification Rules) Example
• Association Rules Example
• Text Mining Example
• Software used
Questions