From Information Management to Knowledge Management

Knowledge Mining
(data mining)
C. Papatheodorou
Laboratory on Digital Libraries &
Electronic Publishing
Department of Archives and Library Science,
Ionian University
1
Data Mining
Mining knowledge from very large collections of data
• Knowledge: rules, behaviour patterns, and associations between objects (non-obvious, latent, previously unknown, and useful)
• Object: consists of a set of features
• It is not:
    - (Deductive) query processing
    - Expert systems or small machine learning/statistical programs
2
Why Data Mining?
Potential Applications

• Database analysis and decision support
    - Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
    - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
    - Fraud detection and management
• Other applications
    - Text mining (news groups, email, documents) and Web analysis
    - Intelligent query answering
3
Market Analysis and Management (1)
• Where are the data sources for analysis?
    - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
    - Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
    - Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
    - Associations/co-relations between product sales
    - Prediction based on the association information
4
Market Analysis and Management (2)

• Customer profiling
    - Data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
    - Identifying the best products for different customers
    - Use prediction to find what factors will attract new customers
• Provides summary information
    - Various multidimensional summary reports
    - Statistical summary information (data central tendency and variation)
5
Corporate Analysis and Risk Management
• Finance planning and asset evaluation
    - cash flow analysis and prediction
    - contingent claim analysis to evaluate assets
    - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning:
    - summarize and compare the resources and spending
• Competition:
    - monitor competitors and market directions
    - group customers into classes and apply a class-based pricing procedure
    - set pricing strategy in a highly competitive market
6
Steps of a KDD Process

• Learning the application domain:
    - relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation:
    - Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
    - summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
    - visualization, transformation, removing redundant patterns, etc.
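
The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, field names, the income threshold, and the `min_support` value are all hypothetical, and the "mining" step is a plain frequency count rather than any particular algorithm from the slides.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "income": 18000, "segment": "student"},
    {"id": 2, "age": 41, "income": None, "segment": "family"},
    {"id": 3, "age": 38, "income": 52000, "segment": "family"},
    {"id": 4, "age": 29, "income": 47000, "segment": "family"},
]

def select(records, fields):
    """Create the target data set: keep only task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, threshold=30000):
    """Data transformation: discretize income into two bands."""
    return [
        {
            "segment": r["segment"],
            "income_band": "high" if r["income"] >= threshold else "low",
        }
        for r in records
    ]

def mine(records):
    """Data mining: count co-occurring (segment, income_band) pairs."""
    return Counter((r["segment"], r["income_band"]) for r in records)

def evaluate(patterns, total, min_support=0.5):
    """Pattern evaluation: keep pairs whose relative frequency reaches min_support."""
    return {p: n / total for p, n in patterns.items() if n / total >= min_support}

target = select(raw, ["income", "segment"])
prepared = transform(clean(target))
patterns = evaluate(mine(prepared), total=len(prepared))
print(patterns)
```

On this toy data only the pair ("family", "high") survives the evaluation step, because it appears in two of the three cleaned records.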
7
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: the KDD process as a pipeline: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data pre-processing

• Data preparation is a big issue for data mining
• Data preparation includes
    - Data cleaning and data integration
    - Data reduction and feature selection
    - Discretization
• Many methods have been developed, but this is still an active area of research
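
As one concrete instance of the discretization step listed above, the following sketch shows equal-width binning. It is a minimal example, not a method the slides prescribe: the function name `equal_width_bins`, the sample `ages`, and the choice of three bins are assumptions.

```python
def equal_width_bins(values, k):
    """Map each numeric value to a bin index in 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value lies on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [22, 25, 31, 38, 45, 52, 67]  # hypothetical attribute values
bins = equal_width_bins(ages, 3)
print(list(zip(ages, bins)))
```

With a range of 22 to 67 and three bins, each interval is 15 wide, so the ages map to bins 0, 0, 0, 1, 1, 2, 2.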
9
Data pre-processing
10
Clustering

• Partition the data set into clusters; one can then store only the cluster representation
• Clustering can be hierarchical, with the hierarchy stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms
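
To illustrate partitioning and "storing the cluster representation only", here is a small k-means sketch. K-means is just one of the many possible algorithms, chosen here for brevity; the sample points and the naive initialisation (the first k points) are assumptions made for the example.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; returns (centroids, assignment)."""
    centroids = list(points[:k])  # naive deterministic initialisation (assumption)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment

# Hypothetical data: two visually separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.5)]
centroids, assignment = kmeans(points, k=2)
print(centroids)   # the compact cluster representation
print(assignment)  # cluster index for each point
```

The two centroids summarise six points; keeping only `centroids` is the data-reduction idea the slide refers to.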
11
Cluster Analysis
12
Classification

• Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
• Classification is probably one of the most widely used data mining techniques, with many extensions
• Scalability is still an important issue for database applications; combining classification with database techniques is therefore a promising topic
• Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
13
Classification process

• Model construction: describing a set of predetermined classes
    - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    - The set of tuples used for model construction is the training set
    - The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
    - Estimate the accuracy of the model
        · The known label of each test sample is compared with the classified result from the model
        · The accuracy rate is the percentage of test set samples that are correctly classified by the model
        · The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
14
Classification Process (1): Model Construction

Training Data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Training Data -> Classification Algorithms -> Classifier (Model):

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'
15
Classification Process (2): Use the Model in Prediction

Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

The Classifier is applied to the Testing Data and then to Unseen Data:
(Jeff, Professor, 4) -> Tenured?
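
The two classification-process slides can be replayed in a few lines of Python. This is a sketch rather than the author's implementation: the function and variable names are assumptions, while the rows and the rule are copied from the slides (the rank comparison is made case-insensitive so that 'professor' matches "Professor").

```python
def predict_tenured(rank, years):
    """Rule from the model-construction slide: professor OR more than 6 years."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# (name, rank, years, tenured) rows from the training-data slide.
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# Rows from the testing-data slide.
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    """Fraction of rows whose known label matches the rule's prediction."""
    correct = sum(
        predict_tenured(rank, years) == label for _, rank, years, label in rows
    )
    return correct / len(rows)

train_acc = accuracy(training)
test_acc = accuracy(testing)
jeff = predict_tenured("Professor", 4)
print(train_acc, test_acc, jeff)
```

The rule fits all six training rows but misclassifies Merlisa in the test set, so the accuracy on independent test data is 75% rather than 100%. For the unseen tuple (Jeff, Professor, 4) the rule predicts tenured = 'yes'.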
16
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
    - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    - New data is classified based on the training set
• Unsupervised learning (clustering)
    - The class labels of the training data are unknown
    - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
17
Document category modelling
• Example: filtering spam email.
• Task: classify incoming email as spam or legitimate (2 document categories).
• Simple blacklist and keyword-based methods have failed.
• More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modeling).
18
Document category modelling
• Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization.
• Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3).
• Step 3 (feature selection): information gain evaluation.
• Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency.
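
The steps above can be sketched with a tiny naive Bayes spam filter in plain Python. It is a simplified illustration under several assumptions: the stopword list and the four training messages are invented, stemming (part of step 1) and information-gain feature selection (step 3) are left out, and the bag-of-words model uses single words only with add-one (Laplace) smoothing.

```python
import math
import re
from collections import Counter

# Toy stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"the", "a", "to", "and", "is", "for", "you", "at", "your", "now"}

def tokenize(text):
    """Step 1: lowercase, split into word tokens, drop stopwords (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def train(labelled_docs):
    """Steps 2 and 4: bag-of-words counts per category plus per-category document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the category with the highest log posterior (Laplace-smoothed)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                count = word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages.
docs = [
    ("Win a free prize now", "spam"),
    ("Free money, claim your prize", "spam"),
    ("Meeting agenda for Monday", "legitimate"),
    ("Lecture notes and reading list", "legitimate"),
]
model = train(docs)
print(classify("claim your free prize", model))
print(classify("notes for the Monday lecture", model))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word from forcing a category's probability to zero.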
19
What Is Association Mining?
• Association rule mining:
    - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
    - Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Example.
    - Rule form: "Body => Head [support, confidence]".
    - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
20
Association Rule: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
    - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
    - * => Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
    - Home Electronics => * (What other products should the store stock up on?)
21
Rule Measures: Support and Confidence
[Figure: Venn diagram of two overlapping sets, "Customer buys diaper" and "Customer buys beer"; the overlap is "Customer buys both"]
• Find all the rules X & Y => Z with minimum confidence and support
    - support, s: probability that a transaction contains {X & Y & Z}
    - confidence, c: conditional probability that a transaction having {X & Y} also contains Z
• Find the rules with support and confidence equal to or greater than a given threshold
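
The two measures can also be written as formulas. The notation is an assumption introduced here, not the slide's own: D is the transaction database, T ranges over its transactions, and |·| counts transactions.

```latex
\[
s(X \wedge Y \Rightarrow Z) \;=\; P(X \wedge Y \wedge Z)
  \;=\; \frac{\bigl|\{\, T \in D : X, Y, Z \in T \,\}\bigr|}{|D|}
\]
\[
c(X \wedge Y \Rightarrow Z) \;=\; P(Z \mid X \wedge Y)
  \;=\; \frac{P(X \wedge Y \wedge Z)}{P(X \wedge Y)}
\]
```

A rule is reported only if s and c both reach the chosen minimum thresholds.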
22
Mining Association Rules: An Example

Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A => C:
support = support({A,C}) = 50%
confidence = support({A,C})/support({A}) = 66.6%
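
The numbers in this example can be checked with a short script. It is a brute-force sketch that enumerates every 1- and 2-item candidate, which is fine for four transactions but is not an efficient mining algorithm such as Apriori; the helper names `support` and `confidence` are assumptions.

```python
from itertools import combinations

# Transactions from the example slide.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(itemset <= items for items in transactions.values())
    return hits / len(transactions)

def confidence(body, head):
    """support(body and head together) divided by support(body)."""
    return support(set(body) | set(head)) / support(body)

min_support = 0.5
items = sorted(set().union(*transactions.values()))

# Enumerate all 1- and 2-item candidates and keep the frequent ones.
frequent = {
    frozenset(candidate): support(candidate)
    for size in (1, 2)
    for candidate in combinations(items, size)
    if support(candidate) >= min_support
}

for itemset, s in frequent.items():
    print(sorted(itemset), f"{s:.0%}")
print(f"confidence(A => C) = {confidence({'A'}, {'C'}):.1%}")
```

The script reproduces the frequent itemsets {A}, {B}, {C}, and {A,C} and a confidence of about 66.7% for A => C. The reverse rule C => A has confidence 100% on the same data, which shows that confidence is not symmetric.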
23