Data Mining: Introduction
Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
Evolution of Sciences

Before 1600, empirical science

1600-1950s, theoretical science
– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

1950s-1990s, computational science
– Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

1990-now, data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally accessible
– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes.
Evolution of Database Technology

1960s:
– Data collection, database creation, IMS and network DBMS

1970s:
– Relational data model, relational DBMS implementation

1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:
– Data mining, data warehousing, multimedia databases, and Web
databases

2000s:
– Stream data management and mining
– Data mining and its applications
– Web technology and global information systems
Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions

Computers have become cheaper and more powerful
Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data


Traditional techniques infeasible for raw data
Data mining may help scientists
– in classifying and segmenting data
Mining Large Data Sets - Motivation


There is often information “hidden” in the data that is
not readily evident
Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999; vertical axis from 0 to 4,000,000.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What is Data Mining?
Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
KDD Process
Input Data -> Data Preprocessing -> Data Mining -> Postprocessing
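The stages above can be sketched as three small functions chained together. This is only an illustration under stated assumptions: the records, the function names, and the `min_avg` threshold are hypothetical, and the "mining" step is a deliberately trivial aggregate standing in for a real algorithm.

```python
# Hypothetical input data: (item, price) records; None marks a missing value.
raw = [("bread", 2.5), ("milk", None), ("bread", 3.0), ("milk", 1.2), ("eggs", 2.0)]

def preprocess(records):
    """Data preprocessing: drop records with missing values."""
    return [(item, price) for item, price in records if price is not None]

def mine(records):
    """Data mining (toy stand-in): average price per item."""
    totals = {}
    for item, price in records:
        count, total = totals.get(item, (0, 0.0))
        totals[item] = (count + 1, total + price)
    return {item: total / count for item, (count, total) in totals.items()}

def postprocess(patterns, min_avg):
    """Postprocessing: keep only results that pass a threshold, sorted by name."""
    return sorted(item for item, avg in patterns.items() if avg >= min_avg)

result = postprocess(mine(preprocess(raw)), min_avg=2.0)
print(result)
```

Keeping each stage as its own function mirrors the diagram: any stage can be swapped out without touching the others.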
Data Mining in Business Intelligence
Layers, from top (greatest potential to support business decisions) to bottom:
– Decision Making
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: Statistical Summary, Querying, and Reporting
– Data Preprocessing/Integration, Data Warehouses
– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA
What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory

What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
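The contrast can be made concrete with a toy directory. The sketch below is an assumption-laden illustration, not a real mining algorithm: the entries, phone numbers, and function names are invented. The first function retrieves a stored fact; the second summarizes a regularity that no single record states.

```python
from collections import Counter

# Hypothetical directory entries: (surname, city, phone number).
directory = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("O'Rourke", "Boston", "555-0103"),
    ("Smith", "Boston", "555-0104"),
    ("Smith", "Dallas", "555-0105"),
    ("Garcia", "Dallas", "555-0106"),
    ("O'Brien", "Dallas", "555-0107"),
    ("Nguyen", "Dallas", "555-0108"),
]

def lookup(surname, city):
    """Not data mining: a direct query for a stored value."""
    return [phone for s, c, phone in directory if s == surname and c == city]

def share_with_prefix(prefix):
    """Closer to data mining: share of surnames per city starting with prefix."""
    totals = Counter(city for _, city, _ in directory)
    hits = Counter(city for s, city, _ in directory if s.startswith(prefix))
    return {city: hits[city] / totals[city] for city in totals}

print(lookup("Smith", "Dallas"))
print(share_with_prefix("O'"))
```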
Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
[Diagram: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Data Mining Tasks

Prediction Methods
– Use some variables to predict unknown or
future values of other variables.

Description Methods
– Find human-interpretable patterns that
describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
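A minimal sketch of the two kinds of method, assuming made-up monthly figures: a least-squares line stands in for a prediction method, and a plain summary stands in for a description method. The variable names and data are hypothetical.

```python
# Hypothetical data: advertising spend (x) and sales (y) for five months.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Prediction: use one variable to estimate an unseen value of another.
intercept, slope = fit_line(spend, sales)
predicted = intercept + slope * 6.0

# Description: a human-readable summary of the data already in hand.
summary = {"min": min(sales), "max": max(sales), "mean": sum(sales) / len(sales)}
print(predicted, summary)
```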
Why Data Mining?—Potential Applications

Data analysis and decision support
– Market analysis and management
Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
– Risk analysis and management
Forecasting, customer retention, quality control
– Fraud detection and detection of unusual patterns (outliers)

Other Applications
– Text mining and Web mining
– Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and Management

Where does the data come from?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies

Target marketing
– Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
– Determine customer purchasing patterns over time

Customer profiling
– What types of customers buy what products (clustering or
classification)

Customer requirement analysis
– Predict what factors will attract new customers
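Finding clusters of similar customers can be illustrated with a one-dimensional k-means sketch. Everything here is an assumption made for illustration: the spending figures are invented, a single attribute is used, and the starting centers are simply the smallest and largest values. Real customer segmentation would use many attributes and a library implementation.

```python
# Hypothetical annual spending per customer (in thousands).
spending = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]

def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one attribute: assign to nearest center, then re-average."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = kmeans_1d(spending, centers=[min(spending), max(spending)])
print(centers, groups)
```

Each resulting group is a candidate customer segment, and its center is the "model" customer for that segment.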
Ex. 2: Corporate Analysis & Risk Management

Finance planning and asset evaluation
– cash flow analysis and prediction
– cross-sectional and time series analysis
(financial-ratio, trend analysis, etc.)

Resource planning
– summarize and compare the resources and
spending
Ex. 3: Fraud Detection & Mining Unusual Patterns

Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: fraud detection
– Money laundering: suspicious monetary transactions
– Medical insurance
Professional patients, ring of doctors.
Unnecessary or correlated screening tests
– Anti-terrorism
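One simple way to flag unusual values is to measure distance from the median in units of the median absolute deviation (MAD). The sketch below assumes invented transaction amounts and an arbitrary cutoff `k`; it is one of many possible outlier rules and is not the method used by any particular fraud system.

```python
from statistics import median

# Hypothetical transaction amounts for one credit card.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 500.0, 23.0]

def flag_outliers(values, k=5.0):
    """Flag values more than k median absolute deviations from the median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(v - center) > k * mad]

suspicious = flag_outliers(amounts)
print(suspicious)
```

The median is used rather than the mean because a single extreme value shifts the mean and inflates the standard deviation, which can hide the very outlier being sought.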
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation/Anomaly/Outlier Detection [Predictive]
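As an illustration of one task from this list, association rule discovery is usually scored with support and confidence. The sketch below assumes five invented market baskets and scores the single rule {diapers} -> {beer}; it does not search for rules, which is what a full algorithm would do.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```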
