DATA MINING
Data Mining deals with the discovery
of hidden knowledge, unexpected
patterns and new rules from large data
sets
4/30/2017
Data Mining -By Dr. S. C. Shirwaikar
Examples of Information extracted using query
language
• List customers who use a credit card to purchase
more than Rs 1000 worth of groceries
• List patients who had at least one heart attack
Examples of what data mining is used for
• Develop a general profile of credit card customers
• Determine patients whose lifestyle makes them prone to
a heart attack in the near future
• Differentiate poor credit risk customers from good
ones
Data Mining differs from usual query processing
in many ways
1. The query cannot be well formed or precisely stated,
as what you are looking for is usually hidden
2. Data in operational databases may not be
sufficient. Data from various sources needs to be
integrated and processed before quality mining can be
done
3. The output is not just a subset of the data but is analysed
and presented as a pattern
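The difference between a query and a mining task can be illustrated with a small sketch. The purchase records, names and the per-category average used as a "profile" are all hypothetical choices made for illustration; the slides do not prescribe any particular profiling method, and the Rs 1000 limit is assumed here to apply to a single purchase.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase records: (customer, payment method, category, amount in Rs)
purchases = [
    ("Asha", "credit card", "groceries", 1500),
    ("Asha", "credit card", "fuel", 700),
    ("Ravi", "cash", "groceries", 2000),
    ("Meena", "credit card", "groceries", 400),
    ("Meena", "credit card", "clothing", 3000),
]

def query_big_grocery_card_users(records, threshold=1000):
    """Query: the condition is stated precisely; the output is a subset of the data."""
    return sorted({customer for customer, pay, cat, amt in records
                   if pay == "credit card" and cat == "groceries" and amt > threshold})

def profile_card_customers(records):
    """Mining-style summary: average credit card spend per category (a crude profile)."""
    by_category = defaultdict(list)
    for _, pay, cat, amt in records:
        if pay == "credit card":
            by_category[cat].append(amt)
    return {cat: mean(amounts) for cat, amounts in by_category.items()}

query_result = query_big_grocery_card_users(purchases)
profile = profile_card_customers(purchases)
print(query_result)
print(profile)
```

The query returns rows that already exist in the data, while the profile is a derived description that no single row contains.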
Data explosion problem:
• The Explosive Growth of Data: from terabytes to
petabytes
• Progress in hardware technology, leading to
automated data collection tools, storage media and
affordable computers
• Progress in database and relational
technology, leading to powerful database systems
• Tremendous amounts of data stored in
databases, data warehouses and other
information repositories
• Quantity of data in the world roughly doubles
every year
• Distribution and sharing of data is possible
• Due to the Internet, hundreds of megabytes of data are
distributed around the world
• Heterogeneous data sources can be shared using
Open Database Connectivity (ODBC) tools
• Data exchange and integration through XML technology
• Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras,
• More data means less information
• We are drowning in data, but starving for knowledge!
Computers against computers
Automated data collection tools and the mechanical production
and reproduction of data force us to develop mechanical
methods for filtering, selecting and interpreting data
• Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
Data mining: a misnomer?
•Alternative names
Knowledge discovery (mining) in databases
(KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Knowledge discovery in databases (KDD) is a
multistep process of finding useful information and
patterns in data, while Data Mining is one of the
steps in KDD: the use of algorithms to extract
patterns
Steps of KDD
1. Selection
Data Extraction - Obtaining data from heterogeneous
data sources: databases, data warehouses, the World Wide Web or other
information repositories
2. Preprocessing
Data Cleaning - Incomplete, noisy or inconsistent data has to
be cleaned. Missing data may be ignored or predicted; erroneous
data may be deleted or corrected
3. Transformation
Data Integration - Combines data from multiple sources
into a coherent store. Data can be encoded in common
formats, normalized, reduced
4. Data mining
Apply algorithms to the transformed data and extract patterns
5. Pattern Interpretation/evaluation
Pattern Evaluation - Evaluate the interestingness of the resulting
patterns, or apply interestingness measures to filter out discovered
patterns
Knowledge presentation - Present the mined knowledge; visualization techniques can be used
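The five KDD steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the two data sources, the field names, the age bands, mean imputation for missing ages, co-occurrence counting as the "mining" step and the `min_count` interestingness threshold are all hypothetical and not taken from the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "bought": "bread"},
            {"id": 2, "age": None, "bought": "milk"}]
source_b = [{"id": 3, "age": 40, "bought": "bread"},
            {"id": 4, "age": -5, "bought": "bread"},   # -5 is an erroneous age
            {"id": 5, "age": 52, "bought": "bread"}]

def select(*sources):
    """1. Selection: gather (copies of) records from heterogeneous sources."""
    return [dict(rec) for src in sources for rec in src]

def preprocess(records):
    """2. Preprocessing: delete erroneous ages, predict missing ones with the mean."""
    valid = [r for r in records if r["age"] is None or r["age"] >= 0]
    known = [r["age"] for r in valid if r["age"] is not None]
    mean_age = sum(known) / len(known)
    for r in valid:
        if r["age"] is None:
            r["age"] = mean_age
    return valid

def transform(records):
    """3. Transformation: encode age in a common, reduced format (an age band)."""
    return [{"band": "young" if r["age"] < 30 else "older", "bought": r["bought"]}
            for r in records]

def mine(records):
    """4. Data mining: count (band, item) co-occurrences as candidate patterns."""
    return Counter((r["band"], r["bought"]) for r in records)

def evaluate(candidates, min_count=2):
    """5. Evaluation: keep only patterns that pass an interestingness threshold."""
    return {p: n for p, n in candidates.items() if n >= min_count}

patterns = evaluate(mine(transform(preprocess(select(source_a, source_b)))))
print(patterns)
```

Each stage consumes the output of the previous one, which mirrors the way the slides place data mining as a single step inside the larger KDD process.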
Visualization Techniques
Graphical - bar charts, pie charts, histograms
Geometric - boxplot, scatter plot
[Figure: example chart; only the axis ticks survive in this transcript]
Icon-based - using colors and figures as icons
Pixel-based - data as colored pixels
Hierarchical - hierarchically dividing the display area
Hybrid - combination of the above approaches
Knowledge discovery process
KDD is the nontrivial extraction of implicit, previously unknown
and potentially useful knowledge from data
[Figure: KDD process diagram. Labels, from top to bottom: Pattern Evaluation, Data Mining, Data Transformation, Data Preprocessing, Data Warehouses, Data Integration, Data Cleaning, Selection, Operational Databases]
Data Mining is the process of discovering
interesting knowledge from large amounts of data
stored in databases, data warehouses or other
information repositories
The architecture of a typical data mining system
may have the following major components
Database, Data Warehouse, World Wide Web
or other information repository - Data cleaning
and data integration techniques may be performed
on the data
Database or Data Warehouse Server - It is
responsible for fetching the relevant data based on
the user’s data mining request.
[Figure: architecture of a typical data mining system. Layers, from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine (both linked to the Knowledge Base); Database or Data Warehouse Server; data cleaning, integration, and selection; Database, Data Warehouse, World-Wide Web, Other Info Repositories]
Data Mining Engine - It consists of a set of
functional modules for tasks such as
characterization, association and correlation
analysis, classification, prediction, cluster analysis,
outlier analysis, etc.
Knowledge Base - It is the domain knowledge
used to guide the search or evaluate the
interestingness of the resulting patterns
Pattern evaluation module - It applies
interestingness measures to filter out discovered
patterns
Graphical User Interface - The user can specify a data
mining query
Why Data Mining?—Potential Applications
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (news group, email, documents) and Web
mining
– Stream data mining
– Bioinformatics and bio-data analysis
Data Mining algorithms - All algorithms attempt to fit
a model that is closest to the data being examined.
The model is based on the analysis of the attributes of a
training data set
The model is then evaluated using a test data set
A data model can be
Descriptive - characterizes and explores properties of the
current data
Predictive - performs inference on the current data to
make predictions on future data
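The fit-then-evaluate cycle described above can be sketched as follows. The income figures, the risk labels, the hold-out rule (every fourth record) and the one-parameter threshold model are all assumptions chosen to keep the example small; they are not the method of any particular algorithm named in the slides.

```python
# Hypothetical labelled data: (monthly income in Rs thousand, credit risk label)
data = [(10, "poor"), (12, "poor"), (15, "poor"), (18, "poor"),
        (40, "good"), (45, "good"), (50, "good"), (55, "good")]

# Hold out every fourth record as the test set; the rest is the training set.
test_set = data[3::4]
training_set = [row for i, row in enumerate(data) if i % 4 != 3]

def fit_threshold(rows):
    """Fit a one-parameter model: the midpoint between the two class means."""
    poor = [x for x, label in rows if label == "poor"]
    good = [x for x, label in rows if label == "good"]
    return (sum(poor) / len(poor) + sum(good) / len(good)) / 2

def predict(threshold, x):
    """Predictive use of the model: label an unseen income value."""
    return "good" if x > threshold else "poor"

def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(threshold, x) == label for x, label in rows)
    return hits / len(rows)

threshold = fit_threshold(training_set)        # model built from training data only
test_accuracy = accuracy(threshold, test_set)  # model judged on unseen test data
print(threshold, test_accuracy)
```

Keeping the test records out of the fitting step is what makes the measured accuracy a fair estimate of how the model behaves on future data.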
Data Mining
Predictive: Classification, Regression, Time Series Analysis, Prediction
Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
Classification - maps data into predefined groups or classes
It uses supervised learning.
The algorithm uses a learning phase to build a classifier from a
training data set containing data attributes and their associated
class labels
Regression - maps data into a real-valued prediction variable
The algorithm tries to find the best function (linear or non-linear) that fits the
training data
Time Series Analysis - the value of an attribute is examined as it
varies over time
It can be used to determine similarities, classify the behavior or
predict future values
Prediction - predicts future values using regression, time series
analysis or other approaches
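As an illustration of regression finding the best linear function for the training data, here is a minimal ordinary least squares fit for a single input variable. The training points are hypothetical and lie exactly on a line so that the result is easy to check by hand; least squares is one common definition of "best fit", not necessarily the one intended in the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept   # prediction for an unseen input x = 5
print(slope, intercept, predicted)
```

The same fitted function then serves the prediction task: a future value is obtained by evaluating it at a new input.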
Clustering - Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes
Interpretability and usability - results should be comprehensible
and usable; a domain expert is required
Summarization - maps data into subsets with simple
descriptions. It extracts or derives representative summary-type
information
Association rules - discover relationships among data; used in
market basket analysis to find items frequently purchased
together
Sequence Discovery - discovers sequential patterns in data, such as the order
in which items are purchased or data is accessed
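For the market basket case, "frequently purchased together" is usually quantified with support and confidence. These two measures are standard in the literature but are not named in the slides, and the transactions below are hypothetical; the sketch only shows how a single rule, bread => milk, would be scored.

```python
# Hypothetical market baskets, one set of items per transaction
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule-mining algorithm would compute these scores for many candidate rules and keep only those above chosen thresholds.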
Influence from many disciplines
[Figure: Data Mining at the centre, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization and Other Disciplines]
Depending on the data mining approach, techniques
from other disciplines may be applied, such as
•Information Retrieval
•Artificial Intelligence
•Neural networks
•Fuzzy set theory
•Knowledge representation
•Logic programming
•High performance computing
Data Mining issues
Human interaction - interfaces are required with both domain
and technical experts. The variety of databases and of users
leads to numerous data mining techniques. What is
required is not known in advance, hence the extraction process needs to be
interactive.
Interpretation of results - requires experts; there are
interpretability problems. Background knowledge or domain
expertise is essential to guide the discovery process
Visualization of results - visualization helps, but multidimensional data is problematic. The discovered knowledge
should be expressed in the form of trees, tables, graphs, charts,
curves, etc.
Data Mining issues continued
Large datasets - scalability is a problem: algorithms do not
scale well with massive real-world datasets. Sampling and
parallelization are effective tools
High dimensionality - A conventional database may contain
many different attributes, not all of which are relevant. This increases
complexity and reduces efficiency (the dimensionality curse). Remedies: data
reduction, dimensionality reduction
Multimedia data - as found in GIS databases, makes
conventional data mining algorithms ineffective
Missing data - It is not always possible to ignore missing
data, but during preprocessing data mining algorithms can be
used to replace missing data with estimates
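Replacing missing data with estimates can be sketched with the simplest possible estimator, the mean of the known values. This is a stand-in chosen for brevity, not the specific technique the slides have in mind; a real preprocessing step might instead predict each missing value from the other attributes. The `ages` list is hypothetical.

```python
def impute_mean(values):
    """Replace each None with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot estimate missing values: no known values")
    estimate = sum(known) / len(known)
    return [estimate if v is None else v for v in values]

# Hypothetical attribute column with two missing entries
ages = [20, None, 40, 30, None]
filled = impute_mean(ages)
print(filled)
```

The original list is left unchanged, so the cleaned column can be compared against the raw one if needed.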
Data Mining issues continued
Irrelevant data - Data is reduced by removing irrelevant data
Noisy data and outliers - Invalid or incorrect data will lead to
poor-quality data mining. Outliers are very different from the rest of the data and
do not fit nicely into the derived model
Changing data - Data warehouses contain non-volatile data. Dynamic data is uploaded and then the algorithms are reapplied
Integration - KDD requests are one-time needs; data mining
functions are now integrated into traditional database
systems
Applications - Effective use of the output of a mining algorithm is
a challenge, rather than the complexity of the mining
algorithm
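The remark on outliers above can be made concrete with a simple z-score check, which flags values that lie far from the mean in units of standard deviation. The z-score rule, the cutoff of 2.0 and the purchase amounts are assumptions for illustration; the slides do not name a specific outlier test.

```python
from statistics import mean, pstdev

def find_outliers(values, cutoff=2.0):
    """Return the values whose z-score magnitude exceeds the cutoff."""
    centre = mean(values)
    spread = pstdev(values)   # population standard deviation
    if spread == 0:
        return []             # all values identical, nothing stands out
    return [v for v in values if abs(v - centre) / spread > cutoff]

# Hypothetical purchase amounts with one value that does not fit the rest
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to be removed or a genuinely interesting case (for example in fraud detection) is a decision that needs domain knowledge.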
Data Mining Metrics
How to measure the effectiveness of the data mining process?
- The KDD process is expensive. The return on investment will be the
saving due to decisions made using the results
- Difficult to measure and quantify
- Measured as increase in sales, reduction in advertising cost
Social Implications of Data mining
Two sides of the coin
Data mining can be used to improve customer service and
satisfaction
Data mining can also be used to intrude on one’s right to privacy
Omnipresent, invisible data mining affects everyone - profiling is used to label typical characteristics
Data mining should follow certain Guidelines
The Organization for Economic Co-operation and
Development (OECD) established a set of
international guidelines referred to as fair information
practices
• Purpose specification and use limitation - usage of
collected data should not exceed the stated purpose
• Openness - individuals have the right to know the nature of the data
collected about them
• Security safeguards - data should be protected from loss,
unauthorized access, destruction, use, modification
or disclosure
Data mining should follow certain Guidelines
• Individual participation - The individual has the right to
have the data erased, completed or corrected
• Privacy-preserving data mining
- Secure multiparty computation - data values are
encoded so that no party can learn another’s data
values.
- Data obscuration - the actual data is distorted by
aggregation or by adding random noise. A reconstruction algorithm is essential for recovering the
distribution of the original data.
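Data obscuration by additive noise can be sketched as follows. This is a simplified illustration under stated assumptions: the ages are synthetic, the noise is uniform on a known interval, and the "reconstruction" recovers only the mean and variance of the original data rather than its full distribution, which would need a more elaborate algorithm than the slides describe.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)   # fixed seed so the sketch is repeatable

# Hypothetical sensitive values, e.g. the ages of 10,000 individuals
original = [rng.randint(20, 60) for _ in range(10_000)]

NOISE = 20   # assumed setting: noise is uniform on [-NOISE, +NOISE]
obscured = [x + rng.uniform(-NOISE, NOISE) for x in original]

# The noise has zero mean, so the mean of the obscured data estimates the original mean.
estimated_mean = mean(obscured)

# Independent noise adds its own variance (NOISE**2 / 3 for this uniform noise),
# so subtracting it estimates the variance of the original data.
estimated_variance = pvariance(obscured) - NOISE ** 2 / 3

print(mean(original), estimated_mean)
print(pvariance(original), estimated_variance)
```

Individual obscured values reveal little about the individual they came from, yet aggregate properties remain recoverable because the noise distribution is known.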
BOOKS
Data Mining: Introductory and Advanced Topics by Margaret H.
Dunham and Sridhar
Pearson Education
ISBN 81-7758-785-4
Data Mining Techniques by Arun K Pujari
Universities Press (India) Limited
ISBN 81-7371-380-4
Data Mining by Pieter Adriaans and Dolf Zantinge
Addison Wesley Longman (Singapore), Pearson Education Asia
ISBN 81-7808-425-2
Data Mining Techniques for Marketing, Sales and Customer
Relationship Management by Michael J. A. Berry and Gordon S.
Linoff
Wiley-dreamtech India Pvt. Ltd.
ISBN 81-265-0517-6
Data Mining: Concepts and Techniques by Jiawei Han and
Micheline Kamber
Morgan Kaufmann Publishers
ISBN 81-312-0535-5