Download Data Mining - Department of Computer Science

Data Mining
Ketaki Borkar
CS157A
November 29, 2007
Agenda
1. Definition
2. Overview
3. History
4. Evolution
5. Scope
6. Stages
7. Process
8. Relationships
9. Elements
10. Data Warehousing
11. Techniques
12. Examples
13. Advantages/Disadvantages
14. References
Definition
“Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
using pattern recognition technologies as well as
statistical and mathematical techniques.”
Overview

Data mining tools predict future trends and behaviors, allowing businesses to
make proactive, knowledge-driven decisions.

The prospective analyses offered by data mining move beyond the analyses of past
events provided by the retrospective tools typical of decision support systems.

Data mining tools can answer business questions that traditionally were too
time-consuming to resolve.

They scour databases for hidden patterns, finding predictive information that
experts may miss because it lies outside their expectations.

Data mining techniques can be implemented rapidly on existing software and
hardware platforms to enhance the value of existing information resources, and
can be integrated with new products and systems as they are brought on-line.
History

Data mining is the evolution of a field with a long history, but the term itself was
only introduced relatively recently, in the 1990s.

Statistics are the foundation of most technologies on which data mining is built.

Its roots can be traced back along three family lines:

Classical statistics

Artificial intelligence

Machine learning

It is finding increasing acceptance in science and business areas which need to
analyze large amounts of data to discover trends which they could not otherwise
find.
Classical Statistics

Classical statistics embraces concepts such as regression analysis, standard
distribution, standard deviation, standard variance, and cluster analysis, all of
which are used to study data and data relationships.

These are the building blocks that underpin more advanced statistical analyses.

Classical statistical analysis plays a significant role at the heart of today’s
data mining tools and techniques.
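
As a minimal sketch of two of the concepts named above, the following Python computes a standard deviation and fits a least-squares regression line using only the standard library. The monthly sales figures and the helper name fit_line are hypothetical and chosen purely for illustration.

```python
import statistics

# Hypothetical monthly unit sales (illustrative data, not from the slides).
months = [1, 2, 3, 4, 5, 6]
unit_sales = [100, 110, 120, 130, 140, 150]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(months, unit_sales)

# Population standard deviation: how widely sales spread around their mean.
spread = statistics.pstdev(unit_sales)

# Extrapolate the fitted line one month ahead.
forecast_month_7 = slope * 7 + intercept
print(slope, intercept, round(spread, 2), forecast_month_7)
```

The regression line summarizes the relationship between month and sales, while the standard deviation describes the variability that the line is trying to explain.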
Artificial Intelligence (AI)

AI is built upon heuristics (methods that often rapidly lead to a solution that is
usually close to the best possible answer) as opposed to statistics, and attempts
to apply human-thought-like processing to statistical problems.

Since this approach requires vast computer processing power, it was not
practical until the early 1980s, when computers began to offer useful power at
reasonable prices.

Certain AI concepts were adopted by some high-end commercial products, such
as query optimization modules for Relational Database Management Systems
(RDBMS).
Machine Learning

A union of statistics and artificial intelligence.

It is an evolution of artificial intelligence because it blends artificial
intelligence heuristics with advanced statistical analysis.

Machine learning attempts to let computer programs learn about the data they
study, such that programs make different decisions based on the qualities of the
studied data, using statistics for fundamental concepts, and adding more
advanced AI heuristics and algorithms to achieve its goals.
Evolution of Data Mining

Evolutionary step: Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: Computers, tapes, disks
  Product providers: IBM, CDC
  Purpose: Retrospective, static data delivery

Evolutionary step: Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Purpose: Retrospective, dynamic data delivery at record level

Evolutionary step: Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Purpose: Retrospective, dynamic data delivery at multiple levels

Evolutionary step: Data Mining (Emerging Today)
  Business question: "What’s likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Purpose: Prospective, proactive information delivery
Scope of Data Mining

Automated prediction of trends and behaviors. A typical example of
a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize
return on investment in future mailings.

EX:

Forecasting bankruptcy.

Identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining
tools sweep through databases and identify previously hidden patterns
in one step.

EX:

Analysis of retail sales data to identify seemingly unrelated products that are often
purchased together (e.g., beer and diapers).

Detecting fraudulent credit card transactions and identifying anomalous data that
could represent data entry keying errors.
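
The anomaly-detection example above can be sketched in a few lines. This is only one possible approach, assuming a median-based "modified z-score"; the constant 0.6745 and the threshold 3.5 are common conventions rather than anything stated in the slides, and the transaction amounts and the function name flag_anomalies are hypothetical.

```python
import statistics

def flag_anomalies(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by the outliers themselves than the mean and the
    standard deviation would be.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical purchase amounts; 4080.0 looks like a keying error for 40.80.
amounts = [42.0, 39.5, 41.2, 40.8, 4080.0, 38.9, 43.1]
suspicious = flag_anomalies(amounts)
print(suspicious)
```

A flagged value is only a candidate for review: it may be a keying error, a fraudulent transaction, or a legitimate but unusual purchase.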
Stages

Stage 1: Exploration
 Data preparation, cleaning and transformation.

Stage 2: Model building and validation
 Considering various models and choosing the best one based on
their performance.

Stage 3: Deployment
 Using the model selected as best in Stage 2 and applying it to new
data in order to generate predictions or estimates of the expected
outcome.
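
The three stages can be illustrated with a small end-to-end sketch. The data, the two candidate models (a constant-mean predictor and a straight-line fit), and the use of mean absolute error on a holdout set are all assumptions made for this example; real projects would typically compare richer models.

```python
# Stage 1: Exploration. Clean the raw records by dropping rows with
# missing values, then split them into training and holdout sets.
raw = [(1, 10.0), (2, 21.0), (3, None), (4, 39.0),
       (5, 50.0), (6, 61.0), (7, 69.0), (8, 80.0)]
clean = [(x, y) for x, y in raw if y is not None]
train, holdout = clean[:5], clean[5:]

# Stage 2: Model building and validation. Fit each candidate on the
# training rows and score it on the holdout rows.
def fit_mean(rows):
    avg = sum(y for _, y in rows) / len(rows)
    return lambda x: avg

def fit_linear(rows):
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def mean_abs_error(model, rows):
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

candidates = {"mean": fit_mean(train), "linear": fit_linear(train)}
errors = {name: mean_abs_error(m, holdout) for name, m in candidates.items()}
best_name = min(errors, key=errors.get)

# Stage 3: Deployment. Apply the winning model to new, unseen input.
best_model = candidates[best_name]
prediction = best_model(9)
print(best_name, round(prediction, 2))
```

Scoring on rows the models never saw during fitting is what makes the comparison in Stage 2 a validation rather than a measure of how well each model memorized its training data.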
Data Mining Process
Relationships

Classes: Stored data is used to locate data in predetermined groups.
For example, a restaurant chain could mine customer purchase data to
determine when customers visit and what they typically order. This
information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or
consumer preferences. For example, data can be mined to identify
market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper
example is an example of associative mining.

Sequential patterns: Data is mined to anticipate behavior patterns and
trends. For example, an outdoor equipment retailer could predict the
likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes.
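
As a hedged illustration of associative mining, the sketch below measures the beer-diaper relationship with two standard association-rule metrics, support and confidence. Neither metric is named in the slides, and the shopping baskets are invented for the example.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

pair_support = support({"beer", "diapers"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(pair_support, round(rule_confidence, 3))
```

Support tells how often the pair occurs at all, while confidence tells how reliably one item predicts the other; a rule is usually interesting only when both are reasonably high.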
Elements

Extract, transform, and load transaction data onto the data warehouse
system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology
professionals.

Analyze the data using application software.

Present the data in a useful format, such as a graph or table.
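
The elements above can be walked through in miniature. This sketch uses Python's built-in sqlite3 module with an in-memory relational database as a stand-in for the warehouse; it is not a multidimensional database system, and the transaction lines, table name, and column names are hypothetical.

```python
import sqlite3

# Extract: raw transaction lines as they might arrive from a source system.
raw_transactions = [
    "2007-11-01, boston ,widget,3,9.99",
    "2007-11-01,Boston,gadget,1,24.50",
    "2007-11-02,HARTFORD,widget,2,9.99",
]

# Transform: trim whitespace, normalize city names, convert types.
def transform(line):
    date, city, product, qty, price = (f.strip() for f in line.split(","))
    return date, city.title(), product, int(qty), float(price)

# Load and store: insert the cleaned rows into a single repository.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_date TEXT, city TEXT, product TEXT, "
    "quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [transform(line) for line in raw_transactions],
)

# Analyze: total revenue per city.
revenue_by_city = conn.execute(
    "SELECT city, ROUND(SUM(quantity * unit_price), 2) "
    "FROM sales GROUP BY city ORDER BY city"
).fetchall()

# Present: a simple text table.
for city, revenue in revenue_by_city:
    print(f"{city:<10} {revenue:>8.2f}")
```

Normalizing the city names during the transform step is what allows the two Boston rows to be grouped together in the analysis query.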
Data Warehousing vs. Data Mining

Data Warehouse: A data warehouse “is a repository (or archive) of information
gathered from multiple sources, stored under a unified schema, at a single site.”
(Silberschatz)

Collect data and store it in a single repository.

Allows for easier query development, as a single repository can
be queried.

Data Mining:

Analyzing databases or data warehouses to discover patterns
in the data and gain knowledge.
Knowledge is power
Data Mining Techniques

Clustering is the method by which like records are grouped
together. Usually this is done to give the end user a high-level view of
what is going on in the database. Clustering is sometimes used to
mean segmentation, which most marketing people will tell you is useful
for coming up with a bird's-eye view of the business.

EX: 1) Clustering people with similar movie preferences
2) Amazon.com displays “Customers who bought this book also
bought…”
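
To make the idea of grouping like records concrete, here is a minimal sketch of k-means, one common clustering method that the slides do not name. It works on one-dimensional data with fixed starting centroids so that the result is deterministic; the viewing-hours figures and the function name kmeans_1d are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around the nearest centroid, then move each
    centroid to the mean of its group, and repeat."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical hours of movies watched per month by six customers.
hours = [1, 2, 3, 20, 21, 22]
centers, groups = kmeans_1d(hours, centroids=[0, 10])
print(centers, groups)
```

The two resulting groups correspond to a "light viewer" segment and a "heavy viewer" segment, which is the kind of high-level view that clustering is meant to provide.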

The nearest neighbor algorithm is a refinement of clustering. It performs
prediction by finding the prediction value of records (near neighbors)
similar to the record to be predicted.
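
A nearest neighbor prediction can be sketched as follows, assuming Euclidean distance and a majority vote among the k closest records. Both choices, along with the customer records (age, rentals per month) and the function name knn_predict, are assumptions for illustration. The math.dist call requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(records, query, k=3):
    """Predict a label for query from its k nearest labeled records.

    records is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(records, key=lambda r: math.dist(r[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, rentals per month) and favorite genre.
customers = [
    ((25, 8), "action"),
    ((27, 7), "action"),
    ((30, 9), "action"),
    ((52, 2), "drama"),
    ((48, 1), "drama"),
    ((55, 3), "drama"),
]

young_fan = knn_predict(customers, (29, 6))
older_fan = knn_predict(customers, (50, 2))
print(young_fan, older_fan)
```

In practice the features would usually be scaled first, since an attribute with a large numeric range (such as age here) can otherwise dominate the distance.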
Techniques…continued

Decision Tree: A decision tree takes as input an object or
situation described by a set of properties, and outputs a yes/no
decision. Decision trees therefore represent Boolean
functions. Specifically each branch of the tree is a classification
question and the leaves of the tree are partitions of the dataset
with their classification.

CART: Classification and Regression Trees.

CHAID: Chi-Square Automatic Interaction Detector
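
The view of a decision tree as a Boolean function can be shown with a small hand-built tree for credit risk. This tree is not learned by CART or CHAID; its questions, the income threshold, and the field names are hypothetical and serve only to show how a record travels from the root to a leaf.

```python
# Each internal node asks a yes/no question about the applicant;
# each leaf holds the final Boolean decision (approve or not).
tree = {
    "question": lambda a: a["income"] >= 40000,
    "yes": {
        "question": lambda a: a["has_defaulted"],
        "yes": False,   # adequate income, but a prior default: reject
        "no": True,     # adequate income and clean history: approve
    },
    "no": False,        # income below the threshold: reject
}

def classify(node, applicant):
    """Follow the branch chosen by each question until a leaf is reached."""
    while isinstance(node, dict):
        branch = "yes" if node["question"](applicant) else "no"
        node = node[branch]
    return node

approved = classify(tree, {"income": 55000, "has_defaulted": False})
rejected_default = classify(tree, {"income": 55000, "has_defaulted": True})
rejected_income = classify(tree, {"income": 20000, "has_defaulted": False})
print(approved, rejected_default, rejected_income)
```

Algorithms such as CART and CHAID automate the part done by hand here: choosing which question to ask at each node and where to place the thresholds.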
Examples – Amazon.com
Credit Risk – Decision Tree
Advantages

Historical data can be used to predict future trends

Knowledge about new trends can be used to improve products and
services

Extracting knowledge hidden in large volumes of data

Data mining is used in developing models to predict outcomes of future
situations.
Disadvantages

Background checks

Spam

Privacy concerns

Birthdates, SSNs, personal information scrutinized for corporate gain.

Telemarketing

Surveillance and profiling
References





http://seattlepi.nwsource.com/business/154986_privacychallenge02.html
http://en.wikipedia.org/wiki/Data_mining
http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
http://www.dwreview.com/Data_mining/DM_models.html
http://www.cs.sjsu.edu/faculty/lee/cs157/cs157a.html