Computing the Future of Data Mining
An Introduction to Data Mining
Visit to Messiah College
September 4, 2006
William M. Pottenger, Ph.D.
Computer Science & Engineering Department
www.cse.lehigh.edu/~billp
 William M. Pottenger, Ph.D.
Knowledge Workers are Overwhelmed
• The users of software tools and computers are
domain experts, NOT computer science
professionals
– Too much data
– Too much technology
– Not enough useful information
Data Mining Roots:
A Confluence of Multiple Disciplines
• Database Systems, Data Warehouses, and OLAP
• Machine Learning
• Information Theory & Statistics
• Mathematical Programming
• Visualization
• High Performance Computing
• …
• Algorithms have been known for a while … Google™
Data Mining: On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
– Object-Relational
– Text
– Heterogeneous: Legacy, Distributed, …
– WWW
• … the Bible!
Why Do We Need Data Mining?
• Leverage organization’s data assets
– Only a small portion (typically 5%-10%) of the collected
data is ever analyzed
– Data that may never be analyzed continues to be
collected, at a great expense, out of concern that
something which may prove important in the future is
missed
– Growth rates of data preclude traditional “manual
intensive” approach: need automated data fusion
techniques based on data mining
Why Do We Need Data Mining?
• As databases and problems grow, supporting the
decision process with traditional query
languages becomes infeasible
– Many queries of interest are difficult to state in a query language
(Query formulation problem)
– “find all cases of fraud”
– “find all individuals likely to buy a FORD Expedition”
– “find all documents that are similar to this customer’s problem”
What (exactly) is Data Mining?
• Let’s take a few moments and consider this
question. Is it:
– Knowledge Discovery?
– Knowledge Management?
– Information Retrieval?
– On-line Analytic Processing (OLAP)?
– Machine Learning?
– Decision Support?
– Process Modeling/Control?
– …
Definitions
• Data mining is the application of computer technology and
machine learning algorithms to discover patterns,
anomalies, trends, and knowledge from data.
– SGI Mineset Product Description
• Data mining is the extraction of implicit, previously
unknown, and potentially useful information from data.
– Data Mining by Witten and Frank
• Data mining, also popularly referred to as knowledge
discovery in databases (KDD), is the automated or
convenient extraction of patterns representing knowledge
implicitly stored in large databases, data warehouses, and
other massive information repositories.
– Data Mining: Concepts and Techniques by Han and Kamber
What is Text Mining?
• Swanson (‘91) posed problem: Migraine headaches (M)
– stress associated with M
– stress leads to loss of magnesium
– calcium channel blockers prevent some M
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) implicated in M
– high levels of magnesium inhibit SCD
– M patients have high platelet aggregability
– magnesium can suppress platelet aggregability
• All extracted from medical journal titles
Slide reused with permission of Marti Hearst @ UCB
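The chain of findings listed above can be sketched in code. This is only a minimal illustration, not Swanson's actual procedure: it assumes each journal-title finding has already been reduced by hand to a pair of linked concepts, and the names `findings`, `build_graph`, and `bridging_concepts` are hypothetical. It looks for concepts that connect two terms which are never linked directly.

```python
from collections import defaultdict

# Assumption: each finding from the list above, reduced by hand to a
# (concept, concept) pair. Real text mining would extract these from titles.
findings = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]


def build_graph(pairs):
    """Build an undirected graph mapping each concept to its neighbours."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def bridging_concepts(graph, source, target):
    """Return concepts linked to both terms, if the terms are not linked directly."""
    neighbours_of_source = graph.get(source, set())
    if target in neighbours_of_source:
        return []  # a direct link is already known, so nothing new to suggest
    return sorted(neighbours_of_source & graph.get(target, set()))


graph = build_graph(findings)
links = bridging_concepts(graph, "migraine", "magnesium")
print(links)
```

No single pair mentions both migraine and magnesium, yet every intermediate concept points to the same candidate connection.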
Gathering Evidence
[Diagram: stress, CCB (calcium channel blockers), SCD, and PA (platelet aggregability) each link migraine to magnesium]
Slide reused with permission of Marti Hearst @ UCB
Novel Discovery: Magnesium & Migraines!
[Diagram: magnesium connected to migraine through stress, CCB, SCD, and PA]
No single author knew/wrote about this connection… this
distinguishes Text Mining from Information Retrieval.
Slide reused with permission of Marti Hearst @ UCB
Why Use Data Mining?
• Data mining will become much more important, and
companies will throw away nothing about their customers
because it will be so valuable. If you’re not doing this,
you’re out of business.
– Arno Penzias, Chief Scientist @ Bell Labs
• We are deluged by data – scientific data, medical data,
demographic data, financial data, and marketing data.
People have no time to look at this data. Human attention
has become a precious resource.
– Jim Gray, Microsoft Research in preface to Data Mining by
Han and Kamber
• Necessity is the mother of invention
– Unknown
How is Data Mining Used?
• Direct Marketing
• Customer Acquisition
• Customer Retention
• Cross-selling
• Trend Analysis
• Fraud Detection
• Forecasting in Financial Markets
• Process Modeling
• Process Control
• …
But What is Data Mining (Really)?
Copyright © 1997 Stiftelsen Østfoldforskning: Used with permission
Data Mining: A Process
An Example of Data Mining in
Process Modeling and Control at HP
• Quality Assurance troubleshooting
– KnowledgeSeeker Decision Tree Data
Mining Tool identified critical factors
impacting production of HP IIc Color Scanner
• Process control
– KnowledgeSeeker Decision Tree Data
Mining Tool derived rules necessary to
identify situations where process was about
to go out of control.
How Do Decision Trees Work?
Decision trees predict results, but they also
reveal the structure of the data.
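To make this concrete, here is a minimal sketch of a decision tree learner in plain Python. It is not how KnowledgeSeeker works internally; it assumes categorical features, uses Gini impurity to choose splits, and runs on made-up production-line records. All function names and data are hypothetical. The same tree is used both to predict a result and to print the rules that expose its structure.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the feature whose split gives the lowest weighted impurity."""
    def weighted_impurity(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(features, key=weighted_impurity)


def build_tree(rows, labels, features):
    """Build a tree: leaves are labels, inner nodes are (feature, branches)."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    feature = best_split(rows, labels, features)
    remaining = [f for f in features if f != feature]
    branches = {}
    for value in sorted({row[feature] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        branches[value] = build_tree(sub_rows, sub_labels, remaining)
    return (feature, branches)


def predict(tree, row):
    """Follow the branches matching the row until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[row[feature]]
    return tree


def describe(tree, indent=""):
    """Render the tree as readable if/then rules."""
    if not isinstance(tree, tuple):
        return [f"{indent}-> {tree}"]
    feature, branches = tree
    lines = []
    for value, subtree in branches.items():
        lines.append(f"{indent}if {feature} == {value}:")
        lines.extend(describe(subtree, indent + "  "))
    return lines


# Hypothetical quality-assurance records (invented for illustration).
rows = [
    {"temperature": "high", "shift": "day"},
    {"temperature": "high", "shift": "night"},
    {"temperature": "normal", "shift": "day"},
    {"temperature": "normal", "shift": "night"},
]
labels = ["defect", "defect", "ok", "ok"]

tree = build_tree(rows, labels, ["temperature", "shift"])
prediction = predict(tree, {"temperature": "high", "shift": "day"})
rules = describe(tree)
print(prediction)
print("\n".join(rules))
```

In this toy data the tree splits on temperature and ignores shift, which is the "structure" part: the rules show which factor matters, not just the predicted outcome.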
Be right back …
A Demonstration of
Data Mining
Featuring
KnowledgeSEEKER
by Angoss Knowledge Engineering
Examples of Commercial
Data Mining Systems
• IBM’s DB2 Intelligent Miner
– www.ibm.com/software/data/iminer
• SAS Institute’s Enterprise Miner
– www.sas.com/products/miner
• SPSS’s Clementine
– www.spss.com/clementine
• Angoss’ KnowledgeSeeker
– http://www.angoss.com/products/seeker.php
• Plus many more …
Asymptopia
We are always given finite amounts of data … and rarely do
we reach asymptopia. Asymptopia is the mythical land, the
data miner's 'utopia', where the amount of data is infinite
and all algorithms converge and all users are satisfied ...
Naturally, asymptopia can be reached only in the limit.
– Ron Kohavi, Nuggets 96:21 (www.kdnuggets.com)