Introduction - Mount Holyoke College

Course Information
Data Mining
CS 341, Spring 2007
Instructor: Xiaoyan Li
Lecture: Mon & Wed 2:40pm – 3:55pm
– Room: Kendade Hall 107
Prof. Xiaoyan Li
Visiting Assistant Professor of Computer Science
Mount Holyoke College
Office hour: Tu/Th 10:00am – 11:00am (or by appointment)
– Office: Clapp 227
– Email: [email protected]
© Prentice Hall

Course Information
Textbook
– Data Mining: Introductory and Advanced Topics
» by Margaret H. Dunham, ISBN 0-13-088892-3
Topics
– Related Concepts & Basic Techniques
– Core Topics
» Classification, Clustering and Association Rules
– Advanced Topics
» Web Mining, Spatial Mining & Temporal Mining

Course Structure
The course is divided into 3 parts
– Related concepts and basic techniques
– Core Topics
» Classification, clustering, association rules
– Perl programming language, final projects
The first 2/3 are lectures, the remaining 1/3 are seminars.

Tentative schedule:
CS-341 Data Mining

Grading
Class participation: 20%
Four homework assignments: 20%
One midterm: 20%
One final project: 40%

Some slides are adapted from:
DATA MINING
Introductory and Advanced Topics
Part I
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M.H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Introduction Outline
Goal: Provide an overview of data mining.
Define data mining
Basic data mining tasks
Data mining vs. database & KDD
Data mining development
Data mining issues

Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING

Data Mining Definition
Finding hidden information in a database
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning

Data Mining Algorithm
Purpose: Fit Data to a Model
Preference – Criteria to choose the best model
Search – Technique to search the data

Example 1.1
Credit card company must determine whether to authorize credit card purchases.
Four classes:
– 1) Authorize
– 2) Ask for further identification before authorization
– 3) Do not authorize
– 4) Do not authorize but contact police
How to classify a purchase?
– Examine historical data and determine how the data fit into the four classes.
– Apply the model to new purchases.
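
The two steps of Example 1.1 (examine historical purchases, then apply the model to a new purchase) can be sketched with a 1-nearest-neighbor rule. This is only an illustration under stated assumptions: the features (purchase amount and distance from home), the historical records, and the function name `classify_purchase` are hypothetical and do not come from the text, and nearest-neighbor is swapped in as one simple way to fit data to the four classes.

```python
import math

# Hypothetical historical purchases: (amount in dollars, miles from home) -> class.
# Both the features and the labels are invented for illustration.
HISTORY = [
    ((25.0, 2.0), "authorize"),
    ((60.0, 5.0), "authorize"),
    ((900.0, 30.0), "ask for further identification"),
    ((1500.0, 400.0), "do not authorize"),
    ((5000.0, 3000.0), "do not authorize but contact police"),
]


def classify_purchase(amount, miles_from_home, history=HISTORY):
    """Return the class of the most similar historical purchase (1-nearest neighbor)."""
    def distance(record):
        (past_amount, past_miles), _label = record
        return math.hypot(amount - past_amount, miles_from_home - past_miles)

    _features, label = min(history, key=distance)
    return label


# Apply the "model" (here, the stored history) to new purchases.
print(classify_purchase(40.0, 3.0))
print(classify_purchase(4800.0, 2900.0))
```

A real system would likely scale the features and use far more history; the point here is only the split between building a model from past data and applying it to a new item.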

Data Mining Models
Predictive:
– A predictive model makes a prediction about values of data using known results found from different data.
Descriptive:
– A descriptive model identifies patterns or relationships in data.

[Figure: Data Mining Models and Tasks]

Basic Data Mining Tasks
Classification maps data into predefined groups or classes
– Pattern recognition
Example 1.1 is a general classification problem
Example 1.2 is an example of pattern recognition
– Airport screening is used to determine whether passengers are potential terrorists or criminals
– Basic patterns: distance between eyes, size and shape of mouth, etc.

Basic Data Mining Tasks (cont'd)
Regression is used to map a data item to a real valued prediction variable.
– Assume some known type of function (e.g. linear) and select the best one.
Example 1.3
– A college professor wishes to reach a certain level of savings before her retirement.
– She predicts what her retirement savings will be based on its current value and several past values.
– She uses a linear regression formula to predict her retirement savings.
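
The linear regression in Example 1.3 can be sketched as an ordinary least-squares line fitted to past savings and then extended to a future year. The yearly figures and the name `fit_line` are hypothetical, chosen only to keep the arithmetic easy to follow; the slides do not give the professor's actual data or formula.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical savings (in thousands of dollars) at the end of each past year.
years = [2003, 2004, 2005, 2006, 2007]
savings = [50.0, 60.0, 70.0, 80.0, 90.0]

slope, intercept = fit_line(years, savings)

# Extrapolate the fitted line to a future year.
predicted_2017 = slope * 2017 + intercept
print(f"Predicted savings in 2017: {predicted_2017:.1f} thousand dollars")
```

"Assume some known type of function and select the best one" corresponds here to assuming a straight line and picking the slope and intercept that minimize the squared error.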

Basic Data Mining Tasks (cont'd)
Clustering groups similar data together into clusters. (The clusters are not predefined)
Example 1.6
– A department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, etc.

Basic Data Mining Tasks (cont'd)
Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization
Example 1.7
– The average SAT score is one of the criteria used to compare universities by the U.S. News & World Report.
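
Clustering and summarization can be sketched together. The toy k-means below groups hypothetical customers by income, in the spirit of Example 1.6 where the groups are not predefined, and then describes each cluster by its size and average, a simple summary like the average SAT score of Example 1.7. The incomes, the choice of three clusters, the starting centers, and the name `k_means_1d` are all assumptions made for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then re-average."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


# Hypothetical annual incomes (thousands of dollars) of catalog customers.
incomes = [22, 25, 27, 60, 64, 66, 150, 155]

clusters = k_means_1d(incomes, centers=[20, 70, 140])

# Summarization: describe each cluster by its size and average income.
summary = [(len(c), sum(c) / len(c)) for c in clusters if c]
for size, mean_income in summary:
    print(f"{size} customers, average income {mean_income:.1f}k")
```

Unlike classification, nothing in the input says which group a customer belongs to; the groups emerge from the data, and the per-cluster averages are the "simple descriptions" attached to each subset.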

Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior

Basic Data Mining Tasks (cont'd)
Link Analysis uncovers relationships among data.
– Affinity analysis
– Association rules – identify items that are frequently purchased together.
Example 1.8
– A grocery store retailer is trying to decide whether to put bread on sale.
– Using association rules, he finds that 60% of the time that bread is sold, pretzels are also sold, and 70% of the time jelly is also sold.
– Decisions?
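
The percentages in Example 1.8 correspond to what is usually called the confidence of the rules bread → pretzels and bread → jelly: among the baskets that contain bread, the fraction that also contain the other item. A minimal sketch follows; the baskets are hypothetical and were constructed so that the numbers match the example, and `confidence` is an illustrative helper name.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = [t for t in with_antecedent if consequent in t]
    return len(both) / len(with_antecedent)


# Hypothetical market baskets, constructed so the numbers match Example 1.8.
baskets = [
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels"},
    {"bread", "pretzels"},
    {"bread", "jelly"},
    {"bread", "jelly"},
    {"bread", "jelly"},
    {"bread"},
    {"milk", "pretzels"},
    {"milk"},
]

print(confidence(baskets, "bread", "pretzels"))
print(confidence(baskets, "bread", "jelly"))
```

One possible reading of "Decisions?" is that a bread sale may also lift pretzel and jelly sales, so the retailer might avoid discounting all three at once; the rule itself only reports co-occurrence, not cause.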

Database Processing vs. Data Mining Processing
Database
Query
– Well defined
– SQL
Data
– Operational data
Output
– Precise
– Subset of database
Data Mining
Query
– Poorly defined
– No precise query language
Data
– Not operational data
Output
– Fuzzy
– Not a subset of database

Data Mining vs. Database Processing -- Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)

Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
Another opinion:
– There is no difference.

KDD Process
Modified from [FPSS96C]
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
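
The five KDD steps can be sketched as a chain of small functions, each feeding the next. Everything below is a toy under stated assumptions: the two data sources, the record layout, and the function names are hypothetical, and the "data mining" step is a simple per-customer total standing in for a real algorithm.

```python
def selection(sources):
    """Selection: obtain data from various sources."""
    return [record for source in sources for record in source]


def preprocessing(records):
    """Preprocessing: cleanse data by dropping records with missing amounts."""
    return [r for r in records if r.get("amount") is not None]


def transformation(records):
    """Transformation: convert every amount to a common format (dollars as float)."""
    converted = []
    for r in records:
        amount = r["amount"]
        if isinstance(amount, str):  # e.g. "$12.50" from a hypothetical web log
            amount = float(amount.lstrip("$"))
        converted.append({"customer": r["customer"], "amount": float(amount)})
    return converted


def data_mining(records):
    """Data mining: total spending per customer (a stand-in for a real algorithm)."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def interpretation(totals):
    """Interpretation/evaluation: present results in a readable, ranked form."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [f"{customer}: ${total:.2f}" for customer, total in ranked]


# Hypothetical data sources with inconsistent formats and a missing value.
store_sales = [{"customer": "ann", "amount": 20}, {"customer": "bob", "amount": None}]
web_sales = [{"customer": "ann", "amount": "$12.50"}, {"customer": "bob", "amount": "$5.00"}]

selected = selection([store_sales, web_sales])
cleaned = preprocessing(selected)
transformed = transformation(cleaned)
report = interpretation(data_mining(transformed))
print("\n".join(report))
```

The sketch makes the slide's distinction concrete: data mining is one step inside the larger KDD process, and most of the code is spent getting the data into a usable form.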

Data Mining Development
•Similarity Measures
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines
•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques
•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis
•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures
•Neural Networks
•Decision Tree Algorithms

Data Mining Metrics
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time

Database Perspective on Data Mining (what is a good data mining tool?)
Scalability
Real World Data
Updates
Ease of Use

Social Issues
Privacy?

Announcements:
Next Lecture:
– Database, Decision Support System & Warehousing
Reading assignments:
– Chapter 2