CSE 8392 SPRING 1999
DATA MINING:
PART I
Professor Margaret H. Dunham
Department of Computer Science and
Engineering
Southern Methodist University
Dallas, Texas 75275
(214) 768-3087
fax: (214) 768-3085
email: [email protected]
www: http://www.seas.smu.edu/~mhd
January 1999
OUTLINE
• Course Objective: To examine Data
Mining concepts. A database perspective
(rather than AI or statistics) is taken.
• I. Introduction and Related Topics
• II. Core Topics
• III. Advanced Topics
• IV. Case Studies
• V. Student Presentations
• VI. Summary and Future Trends
INTRODUCTION AND
RELATED TOPICS
• Section Objective: Provide an introduction
to data mining concepts. Briefly examine
related concepts and background topics.
• Historical Perspective
– Gleaning Knowledge from the Data
– User Expectations increase as
amount/sophistication of collected data
increases.
– Reality vs Extracted Data
[Figure: Reality vs. extracted data. Labels: Physical View, Database View, Reality, Data, Information Need, Query]
Related Topics (to be covered)
– Knowledge Discovery
– Information Retrieval
– Fuzzy Sets
– Data Warehousing and OLAP
– Dimensional Modeling
Data Mining Overview
• What is Data Mining?
– Definition: Fayyad, p. 9
– A.k.a.
• Exploratory data analysis
• Unsupervised pattern recognition
• Data driven discovery
• Inductive learning
• Data Mining determines patterns in the
data
– Non-trivial
– Valid
– Novel
– Potentially useful
– Interesting
– General and simple
– Understandable
DM Techniques (R[1])
• DM involves many different algorithms to
accomplish different things. All have the
following components in common.
– Model (Must fit a model to the data.)
• Function/Purpose
• Representation
– Preference Criteria (How to choose one
model over another?)
– Search Algorithm (How to search the
data)
• Example (Loan Data, fig 1.1 p6 in
Fayyad):
– Model: Classification, Linear
Function
– Preference: What best fits data? (Fig
1.2 or 1.4)
– Search Algorithm: Linear search of
database
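The three components above can be sketched on a toy version of the loan example. This is only an illustration under stated assumptions, not the method from Fayyad: the loan records, the single-attribute threshold model on income, and the function names are all hypothetical.

```python
# Hypothetical loan records: (income in $1000s, repaid_loan). Not from Fayyad.
LOANS = [(15, False), (22, False), (28, False), (35, True),
         (41, True), (30, False), (52, True), (60, True)]


def classify(income, threshold):
    """Model: a linear decision rule on one attribute (income >= threshold)."""
    return income >= threshold


def errors(threshold, data):
    """Preference criterion: number of misclassified records (lower is better)."""
    return sum(classify(income, threshold) != label for income, label in data)


def best_threshold(data):
    """Search: one linear pass over candidate thresholds taken from the data."""
    candidates = sorted({income for income, _ in data})
    return min(candidates, key=lambda t: errors(t, data))


threshold = best_threshold(LOANS)
print(threshold, errors(threshold, LOANS))
```

Swapping the preference criterion (for example, weighting bad loans more heavily than rejected good ones) changes which model is chosen without touching the model form or the search.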
DM Model Functions (R[1])
• Classification - Map data into predefined
groups
• Regression - Map data to real valued
prediction variable
• Clustering - Map data into groups defined
by data itself
• Summarization - Map subsets of data into
simple description
• Dependency Modeling - Identify
dependencies among data items
• Link Analysis - Identify other relationships
among data (association rules)
• Sequence Analysis - Identify sequential
patterns in data
DM Historical Perspective
• Late 70’s: Spreadsheet analysis
• 80’s: Transactional databases support data
storage and retrieval
• Early 90’s: Growing interest in end user
support (a.k.a. decision support)
– Issue: transactional databases are not
designed for decision support
• Mid 90’s: Dedicated data warehouses for
decision support and multidimensional
analysis
• Late 90’s: Proliferation; new concepts
(data marts)
• DM Tools: Neovista, Red Brick
Data Mining Metrics
• Berson, Tables 17-1,17-2,17-3, p 347
• Accuracy
• Clarity
• Dirty Data
• Dimensionality
• Raw Data (Preprocessing)
• RDBMS embedding
• Scalability
• Speed
• Validation
DM Issues
• Overfitting
• Outliers
• Closed World Assumption
• Database schemas and database models
• Algorithms for data mining
• Interpretation and visualization of results
• Size of databases
• Multimedia data, Spatio-Temporal Data
• Changing data
• Integration
• DM Applications
– Market basket analysis
– Stock analysis and selection
– Fraud detection and prevention
– Crisis prediction and prevention
KNOWLEDGE DISCOVERY IN
DATABASES (KDD)
• “Overall process of discovering useful
knowledge from data.” (p28 in R[1])
• Defn: R[1] p 30
• Steps Fig 1, p29 R[1] (Fig 1.3 in Fayyad)
• Data Mining is one step in KDD process
• The KDD objective is usually not clear or
exact. It may require time with the customer
to understand needs.
• Data usually has problems - needs cleaning
– Incorrect/missing data
– Extract from multiple sources and
compare
– Delete anomalous data and sources
– Different data types/metrics
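The cleaning problems listed above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the two sources, the field names, and the plausibility bounds are hypothetical, and real cleaning rules depend on the application.

```python
# Hypothetical records from two sources; heights use different units.
SOURCE_A = [{"id": 1, "height_cm": 180.0}, {"id": 2, "height_cm": None},
            {"id": 3, "height_cm": 1750.0}]          # 1750 cm is anomalous
SOURCE_B = [{"id": 4, "height_in": 70.0}, {"id": 5, "height_in": 65.0}]


def to_cm(record):
    """Different metrics: convert a record to a common unit (centimetres)."""
    if "height_in" in record:
        return {"id": record["id"], "height_cm": record["height_in"] * 2.54}
    return record


def clean(records, low=50.0, high=250.0):
    """Drop records with missing or implausible heights (assumed bounds)."""
    unified = [to_cm(r) for r in records]
    return [r for r in unified
            if r["height_cm"] is not None and low <= r["height_cm"] <= high]


cleaned = clean(SOURCE_A + SOURCE_B)
print([r["id"] for r in cleaned])
```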
FUZZY SETS and LOGIC
• Set membership described by a real valued
membership function with values in [0,1]
• Ex: Set of all tall people
• Set membership function: f(x)=x is tall iff
height(x)>6 ft.
• Note that this is a simple classification
problem. As in the loan example, the
results are not exact.
• Basis of many classification and clustering
approaches
• In a conventional DB how do you retrieve
all tall people?
– Three valued logic: True, False, Maybe
– Multi-valued logic: More than 2 values
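The difference between the crisp "tall" rule above and a fuzzy membership function can be sketched as follows. The linear ramp and its breakpoints (5.5 ft and 6.5 ft) are illustrative assumptions, not values from the slides.

```python
def tall_crisp(height_ft):
    """Crisp set: a person is either tall (1) or not (0)."""
    return 1.0 if height_ft > 6.0 else 0.0


def tall_fuzzy(height_ft, low=5.5, high=6.5):
    """Fuzzy set: membership rises linearly from 0 at `low` to 1 at `high`.

    The breakpoints 5.5 ft and 6.5 ft are illustrative assumptions.
    """
    if height_ft <= low:
        return 0.0
    if height_ft >= high:
        return 1.0
    return (height_ft - low) / (high - low)


for h in (5.0, 5.9, 6.0, 6.1, 7.0):
    print(h, tall_crisp(h), round(tall_fuzzy(h), 2))
```

With the crisp rule, heights of 6.0 ft and 6.1 ft fall on opposite sides of the boundary; the fuzzy function assigns them nearby membership values instead.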
Fuzzy Logic
• Reasoning with uncertainty
• Extends multivalued logic; allows user to
communicate using imprecise concepts, e.g.
– “good” and “bad”
– “close to” and “far away”
• Avoids brittleness of rule based reasoning
by introducing a degree of set
membership
– Allows for smoother transition between
classification sets in the domain
– Example
• Berson figure 16.2, page 325
INFORMATION RETRIEVAL
• Store and retrieve documents based on
fuzzy queries
• Predecessor of web based access
• Ex: Store information about all articles in
all IEEE Transactions journals and
retrieve all documents dealing with heaps.
• Overview
– Conventional IR Systems
– Query Structures (Keywords)
– Matching (Multivalued logic)
– Measures
– Text Analysis Techniques
– IR Related Topics
Conventional IR Systems
• Library card catalogs
• Documents (Library Science)
– Formatted
– Unformatted (Text)
– Mixed
• Document Surrogates
– Identifiers
– Titles, names, and dates
– Abstracts, extracts, reviews
– Summaries of Numerical Data
– Image Descriptions
IR Queries
• Query Structures
– Matching Criteria
– Boolean Queries
– Vector
– Fuzzy
– Natural Language
• Logical combination of keywords
• Weight associated with keywords
• Similarity measures
Similarity Measures
– Document Vector: $D_i = \langle d_{i1}, d_{i2}, \ldots, d_{in} \rangle$
– Different Measures:
$$\mathrm{Sim}(D_i, D_j) = \sum_{k=1}^{n} d_{ik} \times d_{jk}$$
– Salton and McGill, Introduction to
Modern Information Retrieval, 1984,
McGraw-Hill, pp201-204.
– Similarity uses:
• Document-Document
• Query-Query
• Document-Query
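The inner-product measure above can be computed directly from two term vectors. The sketch below also shows cosine similarity, a common normalized alternative added here for comparison; the example vectors and the term order are hypothetical.

```python
import math


def inner_product(d_i, d_j):
    """Sim(D_i, D_j) = sum over k of d_ik * d_jk."""
    return sum(a * b for a, b in zip(d_i, d_j))


def cosine(d_i, d_j):
    """Inner product normalized by vector lengths; lies in [0, 1] for
    non-negative term weights. Returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(inner_product(d_i, d_i) * inner_product(d_j, d_j))
    return inner_product(d_i, d_j) / norm if norm else 0.0


# Hypothetical term-weight vectors over the terms (heap, tree, sort, graph).
doc = [2, 1, 0, 0]
query = [1, 0, 0, 0]
print(inner_product(doc, query), round(cosine(doc, query), 3))
```

The same functions apply to document-document, query-query, and document-query comparisons, since all three are represented as vectors over the same terms.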
IR Document/Query Matching
• Matching Process
– Relevance and Similarity Measures
– Boolean based matching
• Logical match
– Vector based matching
• Threshold match
– Probabilistic Match
• $P(\text{relevant}) = \dfrac{n \text{ (documents relevant)}}{N \text{ (total documents)}}$
– Fuzzy Matching
– Proximity Matching
– Weighting
– Relative Importance of Items
IR Matching
• Scaling
– Impact of Sample Size
– Clustering
– Centroids
• Measures
– Precision
– Recall
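Precision and recall can be computed from the set of retrieved documents and the set of relevant documents. A minimal sketch using the standard definitions; the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / all retrieved.
    Recall    = relevant retrieved / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents retrieved, three truly relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, round(r, 3))
```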
IR Indexing
• Text Analysis
– Indexing is the assignment of keywords
or terms that represent document
content
• Originally a library science problem
that has grown with the advent of
web based searches
– Indexing types
• Automated vs. manual
• Controlled vs. uncontrolled
• Single term vs. terms in context
• Deep vs. shallow
IR Indexing
• General Steps
– 1. Assignment of terms or concepts
capable of representing content
– 2. Assignment of a weight or value to
each term
• Indexing
– Vector based
• Start with excerpts, remove high
frequency words
– Stop list
– Thesaurus
• Compute discrimination values of
terms
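The two general steps above (choose terms, then weight them) can be sketched with a stop list and raw term frequencies. This is a simplification under stated assumptions: the stop list is a tiny hypothetical one, no thesaurus is applied, and frequency is used as the weight rather than a discrimination value.

```python
from collections import Counter

STOP_LIST = {"a", "an", "and", "of", "the", "to", "in", "is"}  # assumed, tiny


def index_terms(text):
    """Step 1: choose terms (lowercased words not on the stop list).
    Step 2: weight each term by its frequency in the excerpt."""
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_LIST)


excerpt = "The heap is a tree. Insertion in the heap is logarithmic."
print(index_terms(excerpt))
```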
IR Retrieval
• Retrieval or Classification
– Vector based
• Same starting point as with indexing
• Compute weighting factors
• Assign to each document a weighted
term vector
– Similarity Measures
• Measure similarity between
document/query
• Results normalized to the range
0 to 1
IR Retrieval
– Inverse Document Frequency
• Assumes a term's importance is
proportional to its occurrence
frequency within a document, and
inversely proportional to the number
of documents in which it occurs.
• Also used for similarity
measurement
– Inverted Indexing of Document
– Concept Hierarchy
• DAG of concepts
• Follow nodes from general to more
specific
• Tag articles with low level concepts
so that each may be distinguished
from ancestors
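Inverse document frequency weighting and inverted indexing, both listed above, can be sketched together. The weight used here, term frequency times log(N / document frequency), is one common form and is an assumption; the slides do not fix a formula. The mini-collection is hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical mini-collection: document id -> list of index terms.
DOCS = {
    1: ["heap", "tree", "heap"],
    2: ["tree", "graph"],
    3: ["heap", "sort"],
    4: ["graph", "search"],
}


def inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def tf_idf(term, doc_id, docs, index):
    """Term frequency in the document times log(N / document frequency)."""
    tf = docs[doc_id].count(term)
    df = len(index.get(term, []))
    return tf * math.log(len(docs) / df) if df else 0.0


INDEX = inverted_index(DOCS)
print(INDEX["heap"], round(tf_idf("heap", 1, DOCS, INDEX), 3))
```

The inverted index answers "which documents contain this term?" without scanning every document, and its posting-list lengths supply the document frequencies needed for the weight.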
IR Related Topics
• Information Retrieval Related Topics
– Text Analysis
– Fuzzy Sets
– Extending Databases
– Hypertext
– Digital Libraries
– Data Mining
• Web based browsers
DATA WAREHOUSING AND
OLAP
– Preparations for Mining: Data
Warehousing
• Extracting the data (from RDBMS)
• Storing the data
– Data warehouse or data mart
• Cleansing the data
• Mining the data
– Often with multidimensional
queries
• Definition
– Blend of technologies
– Integration
– Enables Strategic Use of Data
• Architecture
– Figure 6.1, page 116
DW Migration
• Migration from Relational Database to
Data Warehouse
– Differences (Relational vs. Data
Warehouse)
– Procedure for Migration
• Extraction
• Cleanup
• Transformation
• Migration
• Issues
– Multiple sources
– Database Heterogeneity
– Data Heterogeneity
DW Design
• Data Warehouse Design Considerations - Nine Step Method:
– Subject Matter
– Fact Table contents
– Dimensioning
– Fact Selection
– Precalculations
– Rounding out dimension table
– Duration selection
– What about change?
– Query priorities
• Technical Considerations
– Hardware
– Communications Infrastructure
– Data Structures
More on DW
• Benefits
– Development of strategic information
and resources
– Hypothesis testing
– Knowledge discovery
• Data Marts
– Definition: a mini data warehouse for
data mining
– Directed at a partition of data
– Dedicated user group
– May be physically separate
– Drivers
• Urgent user requirements
• Small budget
• Absence of sponsor
• Decentralization
• Smaller project size
DIMENSIONAL MODELING
• Dimensional Modeling
– Describes relationships in the data that
will be mined
– Relatively new concept, still developing
– A technique for visualizing data models
– Schema (Star and Snowflake)
– Facts - A collection of related data
items, consisting of measures and
context data
– Dimensions - A collection of members
or units of the same type of view. Axis
for modeling. Sets the context for the
facts.
– Measures - Numeric attribute of fact
(What is stored about sales data)
• Focus - Tends to be on numeric data
• MD Analysis vs. DM - Figure 4, R[3]
Data Cube
• Way to visualize facts and dimensions
• Hypercube (more than 3 dimensions)
• May be nested
• Figure 13.1, p249, Berson
• Figure 15, R[3]
Star Schema
– Contains large fact table and a
surrounding set of dimension tables
– A.k.a. constellation or multistar model
– Figure 9.1, p171,Berson
– Following from Figure 18, R[3]
[Figure: Star schema. A central Sales Facts table surrounded by Time, Customer, Part No., Salesperson, and Product dimension tables]
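A star schema like the one described on this slide can be sketched with plain dictionaries: one fact table whose rows hold foreign keys into small dimension tables plus a numeric measure. The table contents, key values, and the `total_by` helper are hypothetical, and only three of the five dimensions are shown to keep the sketch short.

```python
# Dimension tables: surrogate key -> descriptive attributes (hypothetical data).
TIME = {1: {"month": "Jan"}, 2: {"month": "Feb"}}
CUSTOMER = {10: {"name": "Acme"}, 11: {"name": "Zenith"}}
PRODUCT = {100: {"category": "Bolts"}, 101: {"category": "Nuts"}}

# Fact table: one row per sale, with foreign keys into each dimension
# plus the numeric measure (amount).
SALES_FACTS = [
    {"time": 1, "customer": 10, "product": 100, "amount": 250.0},
    {"time": 1, "customer": 11, "product": 101, "amount": 100.0},
    {"time": 2, "customer": 10, "product": 100, "amount": 300.0},
]


def total_by(dimension, table, attribute, facts):
    """Sum the amount measure grouped by one attribute of one dimension."""
    totals = {}
    for row in facts:
        key = table[row[dimension]][attribute]
        totals[key] = totals.get(key, 0.0) + row["amount"]
    return totals


print(total_by("time", TIME, "month", SALES_FACTS))
print(total_by("customer", CUSTOMER, "name", SALES_FACTS))
```

Each query joins the fact table to exactly one dimension table, which is the access pattern the star layout is meant to keep simple.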
Snowflake Schema
• Sometimes dimensions have hierarchies
among themselves
• N:1 relationships among members of a
dimension may be subdivided
• Decomposition yields a snowflake like
schema
[Figure: Snowflake schema. A central Sales Facts table surrounded by Time, Customer, Part No., Salesperson, and Product dimension tables, with further Week, Month, Location, and Manager dimension tables attached to them]
OLAP (On Line Analytic Processing)
• Multidimensional database
• Allows user to analyze data using
elaborate, multidimensional, complex
views
• MOLAP - Multidimensional OLAP.
Supported by specialized DBMS/software
systems. (Data structures, temporal)
– May not be general enough for other
uses
– Access limited and optimized for OLAP
processing
– Fig 13.3 p 253, Berson
• ROLAP - Underlying data stored in
traditional (relational) DBMS and accessed
by traditional query language (SQL).
– Layer on top of DBMS. Middleware.
– May have poor performance for OLAP
applications
– Fig 13.4 p 254, Berson
OLAP Operations
• Move view of facts down/up dimensions
– Drill Down
– Roll Up
– Figure 3, R[3]
– Figure 16,R[3]
• Look at data by partitioning the cube
– Slice - Look at subcube to get more
specific data
– Dice - Rotate cube to look at another
dimension
– Figure 17,R[3]
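Roll up and slice can be sketched on a tiny cube stored as a dictionary of cells. The cube contents, the month-to-quarter hierarchy, and the function names are hypothetical; a real OLAP engine would precompute or index these aggregates.

```python
from collections import defaultdict

# Hypothetical cube cells: (quarter, month, product) -> sales amount.
CUBE = {
    ("Q1", "Jan", "Bolts"): 250.0,
    ("Q1", "Jan", "Nuts"): 100.0,
    ("Q1", "Feb", "Bolts"): 300.0,
    ("Q2", "Apr", "Nuts"): 80.0,
}


def roll_up(cube):
    """Roll up the time dimension from month to quarter."""
    totals = defaultdict(float)
    for (quarter, _month, product), amount in cube.items():
        totals[(quarter, product)] += amount
    return dict(totals)


def slice_cube(cube, product):
    """Slice: keep only the cells for one product."""
    return {key: amt for key, amt in cube.items() if key[2] == product}


print(roll_up(CUBE))
print(slice_cube(CUBE, "Nuts"))
```

Drill down is the reverse of `roll_up`: starting from the quarter-level totals, the user returns to the month-level cells in `CUBE`.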