Data Mining
Glen Shih
CS157B Section 1
Dr. Sin-Min Lee
April 4, 2006
Overview
Explanation of Data Mining
Benefits of Data Mining
Data Mining Background
Data Mining Models
Data Warehousing
Problems and Issues of Data Mining
Potential Applications of Data Mining

What Is Data Mining?

Data mining is the automated extraction of hidden
predictive information from databases.

It is an extension of statistics with a few artificial
intelligence and machine learning twists.
What Is Data Mining? (cont.)

Today the term data mining is often stretched beyond its
original meaning and applied to any form of data analysis.

It encompasses a number of different technical
approaches, such as clustering, data
summarization, learning classification rules, finding
dependency networks, analyzing changes, and
detecting anomalies.
Why Data Mining?

Data mining software allows users to analyze large
databases to solve business decision problems.

For example, data mining software can use the history of
past interactions between a business and its customers to
build a model of customer behavior that predicts how
customers will respond to new products.
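
The example above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method from the slides: the segment names, the toy history, the helper name build_response_model, and the 0.5 cut-off are all assumptions made for the example. The "model" is simply the past acceptance rate per customer segment.

```python
from collections import defaultdict

def build_response_model(history):
    """Estimate, per customer segment, the fraction of past offers accepted."""
    offers = defaultdict(int)
    accepted = defaultdict(int)
    for segment, responded in history:
        offers[segment] += 1
        accepted[segment] += int(responded)
    return {segment: accepted[segment] / offers[segment] for segment in offers}

# Hypothetical history: (customer segment, did the customer respond?)
history = [
    ("student", True), ("student", False), ("student", True),
    ("retiree", False), ("retiree", False), ("retiree", True), ("retiree", False),
]

model = build_response_model(history)

# Segments whose past response rate suggests they may respond to a new product.
likely_responders = [segment for segment, rate in model.items() if rate >= 0.5]
```

Real data mining software would use far richer features and models; the point here is only that the prediction is derived from historical interactions.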
Data Mining Background
Data mining research has drawn on a
number of other fields:
 Machine learning
 Statistics
 Inductive learning
Inductive Learning Strategies

Inductive learning, where the system infers
knowledge itself from observing its environment,
has two main strategies:
 Supervised learning
 Unsupervised learning
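
A minimal sketch of the difference between the two strategies, assuming toy one-dimensional spending data. The helper names, the data, and the choice of a nearest-centroid classifier and a two-group split are illustrative assumptions, not taken from the slides. Supervised learning is given labeled examples; unsupervised learning has to find the groups on its own.

```python
def train_supervised(examples):
    """Supervised: learn one centroid per class from labeled (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def cluster_unsupervised(values, rounds=10):
    """Unsupervised: split unlabeled values into two groups (1-D two-means)."""
    low, high = min(values), max(values)
    for _ in range(rounds):
        groups = ([], [])
        for v in values:
            groups[0 if abs(v - low) <= abs(v - high) else 1].append(v)
        # Move each group center to the mean of its current members.
        if groups[0]:
            low = sum(groups[0]) / len(groups[0])
        if groups[1]:
            high = sum(groups[1]) / len(groups[1])
    return groups

# Supervised: the labels are supplied with the training data.
labeled = [(1.0, "low spender"), (2.0, "low spender"),
           (9.0, "high spender"), (11.0, "high spender")]
centroids = train_supervised(labeled)
label_for_new_customer = predict(centroids, 8.0)

# Unsupervised: the same values without labels; the groups are discovered.
clusters = cluster_unsupervised([1.0, 2.0, 9.0, 11.0])
```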
Data Mining Models

IBM has identified two types of models or
modes of operation which may be used to
reveal information of interest to users:

Verification Model

Discovery Model
Data Warehousing

Data mining potential can be enhanced if the
appropriate data has been collected and stored in a
data warehouse.

The data warehousing market consists of tools,
technologies, and methodologies that allow for the
construction, usage, management, and maintenance
of the hardware and software used for a data
warehouse, as well as the actual data itself.
Data Warehouse

Bill Inmon coined the term "data warehouse" in 1990 and
defined it in the following way:

"A warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of
management's decision making process."
Data Warehouse (cont.)

Subject Oriented:
 Data that gives information about a particular
subject instead of about a company's ongoing
operations.

Integrated:
 Data that is gathered into the data warehouse from
a variety of sources and merged into a coherent
whole.
Data Warehouse (cont.)

Time-Variant:
 All data in the data warehouse is identified with a
particular time period.

Non-Volatile:
 Data is stable in a data warehouse. More data is
added but data is never removed. This enables
management to gain a consistent picture of the
business.
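
The four properties can be illustrated with a small sketch. The Warehouse class, its method names, the source names, and the sample rows are all hypothetical and chosen only for illustration; a real data warehouse is a database system, not an in-memory list.

```python
class Warehouse:
    """Toy store illustrating the four properties from Inmon's definition."""

    def __init__(self):
        self._rows = []  # non-volatile: rows are appended, never removed

    def load(self, source, period, subject, amount):
        # Integrated: rows from different sources share one schema.
        # Time-variant: every row is tagged with the period it describes.
        self._rows.append(
            {"source": source, "period": period, "subject": subject, "amount": amount}
        )

    def total(self, subject, period):
        # Subject-oriented: queries are organized around a subject.
        return sum(
            row["amount"]
            for row in self._rows
            if row["subject"] == subject and row["period"] == period
        )

warehouse = Warehouse()
warehouse.load("store_system", "2006-Q1", "customer_sales", 120.0)
warehouse.load("web_orders", "2006-Q1", "customer_sales", 80.0)
warehouse.load("store_system", "2006-Q2", "customer_sales", 150.0)

q1_sales = warehouse.total("customer_sales", "2006-Q1")
q2_sales = warehouse.total("customer_sales", "2006-Q2")
```

Because nothing is ever deleted, the total for a past period stays the same no matter how much newer data is loaded.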
Problems and Issues of Data Mining

Data mining systems rely on databases to supply the
raw data for input.

Problems arise because databases tend to be dynamic,
incomplete, noisy, and large.

Other problems relate to the adequacy and relevance of
the information stored.
Problems and Issues

Limited information

Uncertainty

Size, update, and irrelevant fields

Noise and missing values
Ways to Treat Missing Data by
Discovery Systems

Simply disregard missing values.

Omit the corresponding records.

Infer missing values from known values.

Treat missing data as a special value to be included
additionally in the attribute domain.

Average over the missing values using Bayesian
techniques.
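
The first four treatments listed above can be sketched as follows. This is an illustrative sketch under stated assumptions: the records, the field names, the helper names, the use of None as the missing marker, and the choice of the field mean as the inferred value are not from the slides. Bayesian averaging is not shown.

```python
MISSING = None  # assumption: missing values are represented as None

records = [
    {"age": 30, "region": "north"},
    {"age": MISSING, "region": "south"},
    {"age": 50, "region": MISSING},
]

def mean_ignoring_missing(rows, field):
    """Disregard missing values: compute the statistic over known values only."""
    known = [row[field] for row in rows if row[field] is not MISSING]
    return sum(known) / len(known)

def omit_incomplete(rows):
    """Omit every record that has at least one missing value."""
    return [row for row in rows if all(v is not MISSING for v in row.values())]

def infer_with_mean(rows, field):
    """Infer missing numeric values from known ones (here: the field mean)."""
    fill = mean_ignoring_missing(rows, field)
    return [{**row, field: fill if row[field] is MISSING else row[field]} for row in rows]

def as_special_value(rows, field, marker="unknown"):
    """Treat 'missing' as one more value in the attribute's domain."""
    return [{**row, field: marker if row[field] is MISSING else row[field]} for row in rows]

average_age = mean_ignoring_missing(records, "age")
complete_records = omit_incomplete(records)
ages_filled = [row["age"] for row in infer_with_mean(records, "age")]
regions_marked = [row["region"] for row in as_special_value(records, "region")]
```

Each treatment trades something off: omitting records shrinks the data set, while inferring values can hide how much of the data was actually observed.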
Potential Applications of Data Mining

Retail and Marketing

Identify buying patterns from customers

Find associations among customer demographic
characteristics

Predict response to mailing campaigns

Analyze market baskets
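
As a rough illustration of market basket analysis, the sketch below counts how often pairs of items are bought together and reports each pair's support (the fraction of baskets containing both items). The baskets and the helper name pair_support are hypothetical; production tools use algorithms that scale to far larger item sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase data: one set of items per customer transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "milk"},
]

def pair_support(baskets):
    """Return the fraction of baskets in which each item pair occurs together."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(baskets)
most_common_pair = max(support, key=support.get)
```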
Potential Applications of Data Mining

Banking

Detect patterns of fraudulent credit card use

Identify “loyal” customers

Predict customers likely to change their credit card
affiliation

Determine credit card spending by customer groups

Find hidden correlations between different financial
indicators

Identify stock trading rules from historical market data
Potential Applications of Data Mining

Insurance and Health Care

Claims analysis, i.e., which medical procedures are
claimed together

Predict which customers will buy new policies

Identify behavior patterns of risky customers

Identify fraudulent behavior
Potential Applications of Data Mining

Transportation

Determine the distribution schedules among outlets

Analyze loading patterns
Potential Applications of Data Mining

Medicine

Characterize patient behavior to predict office visits

Identify successful medical therapies for different
illnesses
References

Dilly, R. (n.d.). Retrieved March 30, 2006, from Data Mining
Web site:
http://www.ppc.qub.ac.uk/tec/courses/determining/stu_notes/dm_book_1.html

Reed, M. (n.d.). A definition of data warehousing. Retrieved
March 30, 2006, from Internet Journal Web site:
http://www.intranetjournal.com/features/datawarehousing.html.

Thearling, K. (n.d.). Retrieved March 30, 2006, from Information
about data mining and analytic technologies Web site:
http://www.thearling.com/.