CS/EngMt/CpEng 404
Data Mining & Knowledge Discovery
Dan St. Clair
Lect 1 – Intro. to Data Mining & Data Warehouses
Information Age Produces Large Amounts of Data
• Data collected on almost everything
• WWW rich data resource
• Data warehouses required to hold data
© 2002 by D. C. St. Clair
CS 404 Data Mining & Knowledge Discovery
2
The problem:
How do we turn information into useful knowledge?
Solution:
Data mining & knowledge discovery
Data Mining & Knowledge Discovery
This class provides
• Tools & techniques for producing useful knowledge from information
• Experience in using these tools
Data Mining & Knowledge Discovery in CS 404
• We will study
– Data warehouses
– Classification & Association rule miners (C4.5)
– Neural networks (BP, SOM)
– Classical tools
  • Correlation
  • Regression
  • Clustering
• We will do several projects requiring mining knowledge from “real” data
CS 404 Class Information
Prerequisites:
CS 347 (Artificial Intelligence) or CS 304 (Database Systems), and Stat 215
Texts:
• Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
• Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
CS 404 Class Information
Reference:
(This or a similar Matlab reference is recommended.)
Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall, 2001.
Software:
• C4.5 – provided to class w/o charge
• Matlab – Can purchase from Mathworks or can log in to UMR.
• Microsoft Excel (provided on UMR CLC computers)
CS 404 Class Information (Cont'd)
Instructor:
D.C. St. Clair, Ph.D.
325 Computer Science
Phone: (573) 341-6352
Fax: (573) 341-4501
e-mail: [email protected]
Class web page:
www.umr.edu/~stclair or
http://web.umr.edu/~stclair/class/classfiles/cs404_fs02/
Things you will find on the class web page:
• Syllabus
• Schedule
• Homework assignments
• Lecture notes
Who am I?
• Professor and Chair UMR Computer Science Dept.
• Research area -- Data mining, machine intelligence, neural networks
– diagnostics
– intelligent graphics
– data mining
– pattern recognition & analysis
– system monitoring & assessment
• “Applied” experience
– Union Pacific Technologies Intelligent Systems Advisor
– Visiting Principal Scientist McDonnell Douglas Research Laboratories
– NASA’s Johnson Space Center
– Defense: Navy, Army, and Air Force
– Co-founder & former Chief Scientist of intelligent software systems company
Even More CS 404 Class Information
Han, one of the authors of the data mining text, has a web page at:
www.cs.sfu.ca/~han/DM_Book.html
which contains several interesting things, including:
1. A list of errata for the data mining book
2. A set of slides he uses in the data mining course he teaches.
[I will be using some of these slides in my lectures.]
You may want to check these out.
Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery
• Intro. to CS 404 (We just finished this.)
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema
Data -- Information -- Knowledge
The set of values:
12345  67890  1000.00  2846.92  SA  CK
has no meaning. It is data but it is NOT information.
Information: Information is the result of organizing data into meaningful quantities.
The following relational table helps turn the data into information since it associates meaning with the data:

Account Number   Balance   Type
12345            1000.00   SA
67890            2846.92   CK

A database is a “structured” collection of data stored and operated on within a management environment known as a Database Management System (DBMS) or database system. The DBMS helps to transform data into information.
Knowledge can be created from information.
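A minimal Python sketch of the distinction above: the same six values held first as a bare list and then as records with attribute names. The attribute names follow the table; the `Account` type and the `total_balance` helper are assumptions made only for illustration.

```python
from collections import namedtuple

# Data: six values with no stated meaning.
raw_values = ["12345", "67890", "1000.00", "2846.92", "SA", "CK"]

# Information: the same values organized under attribute names.
Account = namedtuple("Account", ["account_number", "balance", "type"])
accounts = [
    Account("12345", 1000.00, "SA"),
    Account("67890", 2846.92, "CK"),
]

def total_balance(rows):
    """Sum the balance attribute over all rows."""
    return sum(row.balance for row in rows)

# Once the values carry meaning, questions about them can be answered.
print(total_balance(accounts))
```

The bare list cannot answer "what is the total balance?" because nothing says which values are balances; the records can.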
What Is Data Mining?
How Does It Differ From Existing Database Technologies?
Data Sources: Databases, data warehouses, Internet
Decision Support Systems
Tools for asking questions & doing analyses when you know what you want to ask and where you are going. (Ex. OLAP tools)
Data Mining
Process of discovering knowledge (meaningful new correlations, patterns, and trends) in data by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques.
Other Names Used in Conjunction With Data Mining
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• What is not data mining
– (Deductive) query processing
– Expert systems or small ML/statistical programs
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Data Mining Example

Potential-Customer*
Person        Age   Sex   Income      Customer
Ann Smith     32    F     10,000      yes
Joan Gray     53    F     1,000,000   yes
Mary Blythe   27    F     20,000      no
Jane Brown    55    F     20,000      yes
Bob Smith     30    M     100,000     yes
Jack Brown    50    M     200,000     yes

Married-To
Husband       Wife
Bob Smith     Ann Smith
Jack Brown    Jane Brown

Knowledge Within A Relation
IF Income(Person) ≥ 100,000 THEN Potential-Customer(Person)
IF Sex(Person) = F AND Age(Person) ≥ 32 THEN Potential-Customer(Person)
Knowledge From Multiple Relations
IF Married-To(Person,Spouse) AND Income(Person) ≥ 100,000 THEN Potential-Customer(Spouse)
IF Married-To(Person,Spouse) AND Potential-Customer(Person) THEN Potential-Customer(Spouse).
* Dzeroski, Saso, Inductive Logic Programming and Knowledge Discovery in Databases, Advances in Knowledge Discovery and Data Mining, Ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy, AAAI Press, 1996, pp. 117-152.
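The rules on this slide can be checked against the two relations with a short Python sketch. The rows are copied from the slide; the function names are assumptions for illustration, and the comparisons use "at least" (>=), which is the reading under which each rule covers the listed customers.

```python
# Rows of the Potential-Customer relation.
people = [
    {"person": "Ann Smith", "age": 32, "sex": "F", "income": 10_000, "customer": True},
    {"person": "Joan Gray", "age": 53, "sex": "F", "income": 1_000_000, "customer": True},
    {"person": "Mary Blythe", "age": 27, "sex": "F", "income": 20_000, "customer": False},
    {"person": "Jane Brown", "age": 55, "sex": "F", "income": 20_000, "customer": True},
    {"person": "Bob Smith", "age": 30, "sex": "M", "income": 100_000, "customer": True},
    {"person": "Jack Brown", "age": 50, "sex": "M", "income": 200_000, "customer": True},
]
# Married-To relation as (husband, wife) pairs.
married_to = [("Bob Smith", "Ann Smith"), ("Jack Brown", "Jane Brown")]
by_name = {p["person"]: p for p in people}

def rule_income(p):
    """IF Income(Person) >= 100,000 THEN Potential-Customer(Person)."""
    return p["income"] >= 100_000

def rule_sex_age(p):
    """IF Sex(Person) = F AND Age(Person) >= 32 THEN Potential-Customer(Person)."""
    return p["sex"] == "F" and p["age"] >= 32

def rule_spouse_income(p):
    """IF Married-To(Person, Spouse) AND Income(Person) >= 100,000
    THEN Potential-Customer(Spouse); here p plays the role of Spouse (the wife)."""
    return any(by_name[h]["income"] >= 100_000
               for h, w in married_to if w == p["person"])

def covered(rule):
    """Names of the people a rule predicts to be potential customers."""
    return sorted(p["person"] for p in people if rule(p))

for rule in (rule_income, rule_sex_age, rule_spouse_income):
    print(rule.__name__, covered(rule))
```

The third rule needs both relations at once, which is the point of the "Knowledge From Multiple Relations" heading: no single table contains enough to state it.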
Simple Concept Learning -- Example
“Routine”, “well-understood” chemistry experiment performed numerous times.
• Expected result occurred about half the time
• Unexpected result occurred remainder of the time
Numerous repetitions of experiment produced similar results
Careful analysis determined:
• One result produced when setup was in sunlight
• Second result produced when setup was in shade
Careful investigation showed:
Experiment sensitive to ultraviolet radiation
Result:
Patented method for determining presence of ultraviolet radiation
The Knowledge Discovery Process
[Figure: Data Sources -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns / Models -> (Interpretation/Evaluation) -> Knowledge]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
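One way to picture the five steps of the process is as a chain of functions, each consuming the previous step's output. The sketch below is only an illustration under stated assumptions: the toy records, the income threshold, and every function body are hypothetical, and real steps are far more involved.

```python
from collections import Counter

def select(source, fields):
    """Selection: keep only the attributes relevant to the task."""
    return [{f: row.get(f) for f in fields} for row in source]

def preprocess(rows):
    """Preprocessing: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Transformation: discretize income into two bands."""
    return [{"sex": r["sex"],
             "income_band": "high" if r["income"] >= 100_000 else "low"}
            for r in rows]

def mine(rows):
    """Data mining: count how often each (sex, income_band) pattern occurs."""
    return Counter((r["sex"], r["income_band"]) for r in rows)

def evaluate(patterns, min_count):
    """Interpretation/evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

# Hypothetical data source.
source = [
    {"name": "a", "sex": "F", "income": 10_000},
    {"name": "b", "sex": "F", "income": 20_000},
    {"name": "c", "sex": "M", "income": 200_000},
    {"name": "d", "sex": "M", "income": None},
]
target = select(source, ["sex", "income"])
knowledge = evaluate(mine(transform(preprocess(target))), min_count=2)
print(knowledge)
```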
Data Sources
• Relational Databases
• Data Warehouses
• WWW
• Audio
• Video
• Printed Materials
:
:

Relational Databases
Multidimensional Data Cube
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
Data Mining Tasks
• Predictive
– Perform inference on current data
• Descriptive (KDD)
– Characterize general properties of data
Notes:
– A measure of certainty or “belief” must be associated with each pattern
– “Interesting” patterns must be identified
Kinds of Data Patterns to Be “Mined”
• Concept/class description
• Association analyses
• Classification & prediction
• Cluster analysis
• Outlier analysis
Concept/class Descriptions
Example 1
Produce a description summarizing characteristics of customers who purchase diapers
• Objective: produce a description of those in the target class
• Characterizes class/concept
Example 2
What properties distinguish diaper buyers from other store customers?
• Discriminates class/concept
• Leads to other questions
– What else do they buy?
– When do they purchase these items?
Association Analysis
Assoc. Anal. -- discovery of association relationships between attribute-value conditions.
Such relationships may be expressed in many ways.
One common way is through association rules.
X => Y
A1 ^ ... ^ Am => B1 ^ ... ^ Bn
Association Rules
Example
age(X, “20..29”) ^ income(X, “20K..29K”) => buys(X, “CD changer”)
[support = 2%, confidence = 60%]
Support: % of data instances satisfying all three components of the rule
Confidence: % of the data instances satisfying the hypothesis for which the conclusion is also predicted correctly
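The two measures can be computed directly from their definitions. The sketch below is a minimal illustration: the five customer records are hypothetical, and the function name `support_and_confidence` is an assumption, not part of any tool used in the class.

```python
def support_and_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all rows satisfying antecedent AND consequent
    confidence = fraction of antecedent-satisfying rows that also
                 satisfy the consequent
    """
    n_total = len(rows)
    n_ante = sum(1 for r in rows if antecedent(r))
    n_both = sum(1 for r in rows if antecedent(r) and consequent(r))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Hypothetical toy data, not taken from the text.
customers = [
    {"age": 25, "income": 24_000, "buys": {"CD changer"}},
    {"age": 27, "income": 21_000, "buys": set()},
    {"age": 22, "income": 28_000, "buys": {"CD changer"}},
    {"age": 45, "income": 60_000, "buys": {"CD changer"}},
    {"age": 33, "income": 25_000, "buys": set()},
]

def in_age_and_income_band(r):
    """age(X, "20..29") ^ income(X, "20K..29K")"""
    return 20 <= r["age"] <= 29 and 20_000 <= r["income"] <= 29_000

def buys_cd_changer(r):
    """buys(X, "CD changer")"""
    return "CD changer" in r["buys"]

support, confidence = support_and_confidence(
    customers, in_age_and_income_band, buys_cd_changer)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

In this toy data three customers satisfy the hypothesis and two of them buy a CD changer, so support is 2 of 5 and confidence is 2 of 3.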
Classification & Prediction
[Figure: scatter plot of data points marked x and o; axes Debt and Income]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Classification (nonlinear)
[Figure: the scatter plot of x and o points divided into a No Loan region and a Loan region; axes Debt and Income]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
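As one illustration of a classifier whose decision boundary need not be a straight line, here is a minimal nearest-neighbor sketch in Python. It is not claimed to be the method behind the figure; the (income, debt) points, their units, and the labels are hypothetical.

```python
import math

def nearest_neighbor(train, point):
    """Return the label of the training example closest to `point`."""
    closest = min(train, key=lambda example: math.dist(example[0], point))
    return closest[1]

# Hypothetical (income, debt) examples in arbitrary units.
train = [
    ((1.0, 4.0), "no loan"),
    ((2.0, 5.0), "no loan"),
    ((2.5, 3.5), "no loan"),
    ((6.0, 1.0), "loan"),
    ((7.0, 2.0), "loan"),
    ((8.0, 4.0), "loan"),
]

print(nearest_neighbor(train, (6.5, 1.5)))
print(nearest_neighbor(train, (1.5, 4.5)))
```

Each new applicant receives the label of the most similar known case, so the boundary between the two regions follows the shape of the training data.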
Cluster Analysis
[Figure: scatter plot of data points marked +; axes Debt and Income]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Some Major Data Mining Issues
• Mining methodologies
• User interaction
• Performance (accuracy, robustness)
• Heterogeneous databases
• Interestingness
Chapter 2: Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• From data warehousing to data mining
What Is a Data Warehouse?
Data warehouses (DWs) provide architectures and tools to support the systematic
– organization,
– understanding, and
– use of data.
Note: DWs may consist of data from numerous sources including business, scientific, as well as engineering data.
Features of a Data Warehouse
• Subject-oriented -- organized around major subjects
• Integrated -- integrates multiple heterogeneous data sources
– Relational databases
– Flat files
– On-line transaction records
• Consistency is enforced
• Time-variant -- data stored to provide historical data
• Nonvolatile
– Physically separate from operational environment
– Operations on data: initial loading & retrieval
OLTP vs. OLAP
                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day to day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date,          historical, summarized,
                     detailed, flat relational,    multidimensional, integrated,
                     isolated                      consolidated
usage                repetitive                    ad-hoc
access               read/write,                   lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100MB-GB                      100GB-TB
metric               transaction throughput        query throughput, response
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
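The contrast in the "access" and "unit of work" rows can be seen in miniature with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the `sales` table, its columns, and its four rows are hypothetical, and a real OLAP workload would run against a separate warehouse rather than the operational table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_id INTEGER PRIMARY KEY, year INTEGER, item TEXT, dollars REAL)"
)
conn.executemany(
    "INSERT INTO sales (year, item, dollars) VALUES (?, ?, ?)",
    [(2001, "tv", 400.0), (2001, "phone", 100.0),
     (2002, "tv", 450.0), (2002, "phone", 150.0)],
)

# OLTP-style unit of work: a short read/write transaction that
# reaches one record through its primary key.
with conn:
    conn.execute("UPDATE sales SET dollars = 420.0 WHERE sale_id = 1")
one_row = conn.execute(
    "SELECT item, dollars FROM sales WHERE sale_id = 1").fetchone()

# OLAP-style unit of work: a read-only query that scans many
# records and returns a summary.
by_year = conn.execute(
    "SELECT year, SUM(dollars) FROM sales GROUP BY year ORDER BY year"
).fetchall()

print(one_row)
print(by_year)
```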
Multidimensional Data Models
Figure 2.1 3-D data cube: AllElectronics sales data
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
4-D Data Cube of AllElectronics Sales Data
Figure 2.2 4-D data cube: AllElectronics sales data
Fig. 2.3 A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
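The lattice is the set of all subsets of the four dimensions, so it can be enumerated mechanically. A minimal Python sketch follows (the function name `cuboids` is an assumption for illustration); it ignores concept hierarchies within a dimension, in which case n dimensions give 2^n cuboids.

```python
from itertools import combinations

def cuboids(dimensions):
    """Group every subset of the dimensions by its size,
    i.e. by the dimensionality of the corresponding cuboid."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

lattice = cuboids(["time", "item", "location", "supplier"])
for k, group in lattice.items():
    print(f"{k}-D: {len(group)} cuboid(s)")

total = sum(len(group) for group in lattice.values())
print("total:", total)
```

The empty subset at level 0 is the apex cuboid ("all"), and the single subset at level 4 is the base cuboid.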
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Fig. 2.4 Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
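A cut-down version of this star schema can be built and queried with Python's built-in sqlite3 module. The sketch keeps only two of the four dimensions and a few attributes; the table names `time_dim`, `item_dim`, and `sales_fact`, and all inserted rows, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per time period and per item.
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- Fact table: foreign keys into each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2002), (2, 'Q2', 2002);
INSERT INTO item_dim VALUES (10, 'tv', 'A'), (11, 'phone', 'B');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 11, 5, 500.0), (2, 10, 1, 400.0);
""")

# A typical star-schema query: join the fact table to its dimensions
# and summarize a measure by dimension attributes.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
for row in rows:
    print(row)
```

Every join passes through the fact table in the middle, which is what gives the schema its star shape.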
Fig. 2.5 Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Fig. 2.6 Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
A Data Mining Query Language, DMQL: Language Primitives
• Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
– First time as “cube definition”
– define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
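To make the three measures in the cube definition concrete, the sketch below computes them in plain Python for each combination of dimension values. It is an illustration of what the definition asks for, not an implementation of DMQL; the `build_cube` function and the three fact rows are hypothetical.

```python
from collections import defaultdict

def build_cube(facts, dims):
    """Aggregate the three measures for every combination of dimension values."""
    groups = defaultdict(list)
    for row in facts:
        groups[tuple(row[d] for d in dims)].append(row["sales_in_dollars"])
    return {
        key: {
            "dollars_sold": sum(vals),           # sum(sales_in_dollars)
            "avg_sales": sum(vals) / len(vals),  # avg(sales_in_dollars)
            "units_sold": len(vals),             # count(*)
        }
        for key, vals in groups.items()
    }

# Hypothetical fact rows, one per sale.
facts = [
    {"time": "Q1", "item": "tv", "branch": "b1", "location": "Rolla",
     "sales_in_dollars": 400.0},
    {"time": "Q1", "item": "tv", "branch": "b1", "location": "Rolla",
     "sales_in_dollars": 600.0},
    {"time": "Q2", "item": "phone", "branch": "b1", "location": "Rolla",
     "sales_in_dollars": 100.0},
]

cube = build_cube(facts, ["time", "item", "branch", "location"])
for cell, measures in cube.items():
    print(cell, measures)
```

Passing a shorter dimension list, such as `["time"]`, yields one of the lower-dimensional cuboids of the same data.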
University of Missouri-Rolla
Copyright 2001 Curators of University of Missouri