Data mining
By Aung Oo
What is Data Mining?
• Different perspectives: CS, Business, IT
• As a field of research in CS:
– Science of extracting useful information from large
data sets or databases
• Also known as
– Knowledge Discovery in Databases (KDD)
– Knowledge Discovery and Data Mining (KDD)
• KDD can be said to lie at the intersection of statistics, machine
learning, databases, pattern recognition, information retrieval
and artificial intelligence.
Data Mining Definitions
1. Analysis of datasets to find unsuspected relationships
2. Summarize data in novel ways that are understandable and
useful to the data owner
3. Extraction of knowledge from data
– non-trivial extraction of implicit, previously unknown &
potentially useful knowledge from data
4. Process of discovering patterns:
– automatically or semi-automatically, in large quantities of
data
– Patterns discovered must be useful: meaningful in that they
lead to some advantage, usually economic
Why Data Mining?
1. Large datasets are common, due to advances in digital
data acquisition and storage technology.
– Business: supermarket transactions, credit card usage
records, telephone call details, government statistics
– Scientific: images of astronomical bodies, molecular
databases, medical records
2. Automatic data production leads to need for
automatic data consumption
3. Large databases mean vast amounts of information
4. Difficulty lies in accessing it
Why Data Mining?
• Data mining is ready for application in the
business community because it is supported by
three technologies that are now sufficiently
mature:
– Massive data collection
– Powerful multiprocessor computers
– Data mining algorithms
Example of Data Mining
• If a store tracks the purchases of a customer and
notices that a customer buys a lot of silk shirts, the
data mining system will make a correlation between
that customer and silk shirts.
– The store may begin direct mail marketing of silk shirts to that
customer, or it may instead attempt to get the customer
to buy a wider range of products.
• Another example: analysts found that beer and diapers
were often bought together.
– So place the high-profit diapers next to the high-profit beer.
• This technique is often referred to as "Market Basket Analysis".
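A minimal sketch of how Market Basket Analysis scores item pairs, assuming the three commonly used measures: support, confidence and lift. The transactions and the function name pair_stats are hypothetical, invented only for illustration; they are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

def pair_stats(transactions):
    """Return {(a, b): (support, confidence, lift)} for each rule a -> b."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]       # P(y in basket | x in basket)
            lift = confidence / (item_counts[y] / n)  # above 1: bought together more than chance
            stats[(x, y)] = (support, confidence, lift)
    return stats

stats = pair_stats(transactions)
support, confidence, lift = stats[("beer", "diapers")]
print(f"beer -> diapers: support={support:.2f}, "
      f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if bought independently. Real systems typically use algorithms such as Apriori to avoid counting every pair.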
Steps in the Evolution of Data Mining
Evolutionary Step | Business Question | Enabling Technologies
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses
Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases
The Scope of Data Mining
• Automated prediction of trends and behaviors.
– Data mining uses data on past promotional mailings to identify the
targets most likely to maximize return on investment in future
mailings.
• Automated discovery of previously unknown patterns.
– An example of pattern discovery is the analysis of retail sales data to
identify seemingly unrelated products that are often purchased
together.
• More columns.
– High performance data mining allows users to explore the full depth of
a database, without pre-selecting a subset of variables.
• More rows.
– Larger samples yield lower estimation errors and variance, and allow
users to make inferences about small but important segments of a
population.
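The "more rows" point can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are drawn from a normal distribution with mean 100 and standard deviation 15, both chosen arbitrarily, and the function name spread_of_sample_means is hypothetical. It measures how much the sample mean varies across repeated samples of a given size.

```python
import random
import statistics

def spread_of_sample_means(sample_size, repetitions=200, seed=0):
    """Empirical standard deviation of the sample mean at a given sample size."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    means = [
        statistics.fmean(rng.gauss(100.0, 15.0) for _ in range(sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

# The spread of the estimate shrinks as the number of rows grows.
for n in (10, 100, 1000):
    print(n, round(spread_of_sample_means(n), 3))
```

In theory the spread falls roughly with the square root of the sample size, so a hundredfold increase in rows gives about a tenfold reduction in estimation error.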
Data Mining vs. Statistics
• Objective of data mining exercise plays no role in data
collection strategy
• In this way it differs from much of statistics
• For this reason, data mining is referred to as secondary
data analysis
• KDD more complicated than initially thought
– 80% preparing data
– 20% mining data
Query: Database vs. Data Mining
• Database: When you know exactly what you
are looking for.
• Data Mining: When you only vaguely know what
you are looking for.
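The contrast can be sketched in code. The sales records, field names and the function rank_segments below are hypothetical, and the scoring rule (distance of a segment's average from the overall average) is only one simple assumption about what "interesting" might mean.

```python
from collections import defaultdict

# Hypothetical sales records, invented for illustration only.
sales = [
    {"region": "New England", "month": "March", "product": "shirts", "units": 120},
    {"region": "New England", "month": "April", "product": "shirts", "units": 80},
    {"region": "Midwest", "month": "March", "product": "shirts", "units": 60},
    {"region": "Midwest", "month": "April", "product": "ties", "units": 40},
]

# Database-style query: the question is fully specified in advance.
march_ne_units = sum(
    r["units"] for r in sales
    if r["region"] == "New England" and r["month"] == "March"
)

def rank_segments(records, target="units"):
    """Mining-style exploration: score every attribute/value segment by how
    far its average target value lies from the overall average."""
    overall = sum(r[target] for r in records) / len(records)
    groups = defaultdict(list)
    for r in records:
        for key, value in r.items():
            if key != target:
                groups[(key, value)].append(r[target])
    scored = [
        (abs(sum(values) / len(values) - overall), segment)
        for segment, values in groups.items()
    ]
    return [segment for _, segment in sorted(scored, reverse=True)]

print(march_ne_units)            # answer to the exact question
print(rank_segments(sales)[:3])  # segments that stand out, found without asking for them
```

The query returns exactly what was asked for. The exploration step surfaces segments that nobody named in advance, which is closer to the data mining case.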
Data Mining Tasks and Techniques
• Not so much a single technique
• Idea that there is more knowledge hidden in the data
than shows itself on the surface
• Any technique that helps to extract more out of data is
useful
• Five major task types:
1. Exploratory Data Analysis (Visualization)
2. Descriptive Modeling (Density estimation, Clustering)
3. Predictive Modeling (Classification and Regression)
4. Discovering Patterns and Rules (Association rules)
5. Retrieval by Content (Retrieve items similar to pattern of
interest)
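As one illustration of the task types, retrieval by content can be sketched as ranking stored items by their similarity to a query pattern. This sketch assumes items are represented as numeric feature vectors and uses cosine similarity as the measure; the item names, vectors and the function retrieve are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors, invented for illustration only.
items = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [2.0, 0.1, 1.9],
}

def retrieve(query, items, k=2):
    """Return the names of the k items most similar to the query pattern."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )
    return ranked[:k]

print(retrieve([1.0, 0.0, 1.0], items))
```

Other similarity measures, such as Euclidean distance, would fit the same structure. The choice depends on how the items are represented.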
Privacy concerns
• For example, if an employer has access to medical
records, they may screen out people who have
diabetes or have had a heart attack. Screening out
such employees will cut costs for insurance, but it
creates ethical and legal problems.
• Essentially, data mining gives information that would
not be available otherwise. It must be properly
interpreted to be useful. When the data collected
involves individual people, there are many questions
concerning privacy, legality, and ethics.
Notable Uses of Data Mining
• Data mining has been cited as the method by
which the U.S. Army intelligence unit, Able
Danger, supposedly had identified the 9/11
attack leader, Mohamed Atta, and three other
9/11 hijackers as possible members of an al
Qaeda cell operating in the U.S. more than a
year before the attack.
References
• http://www.cedar.buffalo.edu/~srihari/CSE626
• http://en.wikipedia.org/wiki/Data_Mining
• http://www.thearling.com/text/dmwhite/dmwhite.htm