Chapter 1: Introduction to Data Mining

Date: 26th February 2016
Special thanks: Han & Kamber
• Introduction
• Classification of Data Mining Systems
• Data Mining Architecture
• Data Mining Functionalities
• Major Issues in Data Mining
• Importance of Data Mining
• Applications of Data Mining
• Social Impacts of Data Mining
What is data?
What is information?
What is a database?
What is a DBMS?
Data
Structured: DBMS
Name             CA  IP  CS
Dhaval Gohel     40  50  60
Rishabh Chauhan  60  70  80
Mayur Padiya     70  60  80
Ankit Prajapati  30  40  50
Viral Prajapati  80  90  70
Unstructured: text
Dhaval Gohel,40,50,60
Rishabh Chauhan 60,70,80
Semi-structured: XML
<Name>Dhaval Gohel</Name>
<CA>40</CA>
<IP>50</IP>
<CS>60</CS>
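The step from semi-structured to structured data can be sketched in a few lines of Python. This is only an illustration: the slide shows an XML fragment without a root element, so the `<Student>` wrapper and the function name `parse_student` are assumptions added here so that the fragment parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# The <Student> root element is an assumption; the slide shows only the
# four child elements.
xml_record = """
<Student>
  <Name>Dhaval Gohel</Name>
  <CA>40</CA>
  <IP>50</IP>
  <CS>60</CS>
</Student>
"""

def parse_student(xml_text):
    """Turn one semi-structured XML record into a structured row (a dict)."""
    root = ET.fromstring(xml_text)
    row = {"Name": root.findtext("Name")}
    for subject in ("CA", "IP", "CS"):
        row[subject] = int(root.findtext(subject))
    return row

structured_row = parse_student(xml_record)
print(structured_row)
```

The resulting dictionary has one named, typed field per tag, which is the shape a DBMS table row would have.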
Information
• Dhaval Gohel has 50% in the current semester.
• Viral Prajapati has the highest marks in Research Skill.
• Ankit Prajapati has the lowest marks in CA.
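A minimal sketch of how such information is derived from the raw marks. It assumes that each subject is scored out of 100 (so the average mark equals the percentage) and that the three marks per student are CA, IP, and CS in that order, as in the XML example; the helper names are hypothetical.

```python
# Marks per student as (CA, IP, CS), taken from the structured table.
marks = {
    "Dhaval Gohel": (40, 50, 60),
    "Rishabh Chauhan": (60, 70, 80),
    "Mayur Padiya": (70, 60, 80),
    "Ankit Prajapati": (30, 40, 50),
    "Viral Prajapati": (80, 90, 70),
}

def percentage(name):
    """Average mark, assuming every subject is scored out of 100."""
    scores = marks[name]
    return sum(scores) / len(scores)

def lowest_in(subject_index):
    """Name of the student with the lowest mark in one subject column."""
    return min(marks, key=lambda name: marks[name][subject_index])

print(percentage("Dhaval Gohel"))  # 50.0
print(lowest_in(0))                # Ankit Prajapati (column 0 is CA)
```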
Database
120160107001  Dhaval Gohel
120160107002  Rishabh Chauhan
120160107004  Mayur Padiya
120160107007  Ankit Prajapati
120160107008  Viral Prajapati

Dhaval Gohel     Dakor
Rishabh Chauhan  Modasa
Mayur Padiya     Nadiyad
Ankit Prajapati  Dehgam
Viral Prajapati  Naroda

Name             CA  IP  CS
Dhaval Gohel     40  50  60
Rishabh Chauhan  60  70  80
Mayur Padiya     70  60  80
Ankit Prajapati  30  40  50
Viral Prajapati  80  90  70
DBMS
• Data: raw facts
• Information: processed data
• Database: a collection of organized, related data
• DBMS: a set of software and tools used to manipulate the database
• Data Mining: "Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories."

Alternative names for data mining:
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• Business intelligence, etc.


Database:
- Find all employees with a salary >= 50,000
- Find all the students who had 0% attendance last month
- Find all the students who have an Apple laptop
Data Mining:
- Find all employees who are contractual (Classification)
- Find all the students who are attending lectures (Clustering)
- Find all the students who have an Apple laptop and an Apple phone (Association Rule)
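A database request of the first kind has a fixed answer that the DBMS retrieves directly from the stored rows. A minimal sketch using Python's built-in `sqlite3` module follows; the `employee` table, its columns, and its rows are hypothetical and serve only to illustrate the salary query.

```python
import sqlite3

# In-memory database with a hypothetical employee table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("A", 45000), ("B", 50000), ("C", 72000)],
)

# "Find all employees with a salary >= 50,000": a plain lookup, no inference.
rows = conn.execute(
    "SELECT name FROM employee WHERE salary >= 50000 ORDER BY name"
).fetchall()
high_paid = [name for (name,) in rows]
print(high_paid)  # ['B', 'C']
conn.close()
```

The data mining requests, by contrast, require a model or pattern to be learned from the data before any answer can be given.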
Data mining draws on multiple disciplines:
• Database technology
• Information science
• Statistics
• Machine learning
• Visualization
• Other disciplines
[Figure: Data Mining at the center, surrounded by Information Science, Machine Learning, Database Technology, Statistics, Visualization, and Algorithms]
Classification of data mining systems is based on:
• Kind of database mined:
  - Data model, such as relational, transactional, object-relational, or data warehouse.
  - Special types of data handled, such as spatial, time-series, text, stream, or multimedia data, or a World Wide Web mining system.
• Kind of knowledge mined:
  - Data mining functionalities, such as characterization and discrimination, mining frequent patterns, classification and prediction, cluster analysis, outlier analysis, and evolution analysis.
  - Data regularities vs. data irregularities.
• Kinds of techniques utilized:
  - Degree of user interaction involved, e.g., autonomous systems, interactive exploratory systems, query-driven systems.
  - Method of data analysis employed, e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on.
• Applications adapted:
  - Finance, telecommunications, DNA, stock markets, e-mail, and so on.
[Figure: The KDD process. Databases -> Data Cleaning and Data Integration -> Preprocessed Data -> Selection and Transformation (data transformations) -> Task-relevant Data -> Data Mining -> Pattern -> Pattern Evaluation]



• Cleaning: remove noise and inconsistent data.
• Integration: multiple data sources may be combined.
• Selection: data relevant to the analysis task are retrieved from the database.
• Transformation: data are transformed into a form appropriate for mining, for example by summary or aggregation operations.
• Data Mining: various techniques, such as association rule mining, classification, and clustering, are applied to identify and count patterns.
• Pattern Evaluation: identify the truly interesting patterns representing knowledge, based on some interestingness measure.
  - For example, support and confidence for association rule mining.
• Knowledge Presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.







Example (web log mining):
• Cleaning: remove error entries from the logs
• Integration: multiple logs may be combined
• Selection: records with a valid status and media type are selected
• Transformation: data are aggregated day-wise or week-wise
• Data Mining: identify patterns and count frequent accesses
• Pattern Evaluation: report the frequently accessed sequences
• Knowledge Presentation: graph of user count per URL page; graph of number of pages visited per IP address
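The web log steps above can be sketched as a small Python pipeline. Everything concrete here is an assumption made for illustration: the record layout `(ip, url, status, media_type, day)`, the sample log entries, the rule that status codes of 400 and above count as errors, and the `min_count` threshold.

```python
from collections import Counter

# Hypothetical log records: (ip, url, status, media_type, day)
log_a = [
    ("10.0.0.1", "/home", 200, "text/html", "Mon"),
    ("10.0.0.1", "/logo.png", 200, "image/png", "Mon"),
    ("10.0.0.2", "/home", 500, "text/html", "Mon"),
]
log_b = [
    ("10.0.0.2", "/home", 200, "text/html", "Tue"),
    ("10.0.0.3", "/courses", 200, "text/html", "Tue"),
    ("10.0.0.3", "/missing", 404, "text/html", "Tue"),
]

def clean(log):
    """Cleaning: drop error entries (assumed here: status >= 400)."""
    return [record for record in log if record[2] < 400]

# Integration: combine the cleaned logs into one data set.
integrated = clean(log_a) + clean(log_b)

# Selection: keep only page requests (assumed media type: text/html).
selected = [record for record in integrated if record[3] == "text/html"]

# Transformation: day-wise view of the selected records.
by_day = {}
for ip, url, status, media_type, day in selected:
    by_day.setdefault(day, []).append(url)

# Data mining: count how often each page is accessed.
url_counts = Counter(record[1] for record in selected)

# Pattern evaluation: keep pages accessed at least min_count times.
min_count = 2
frequent_urls = {url: n for url, n in url_counts.items() if n >= min_count}

# Knowledge presentation: number of pages visited per IP address.
pages_per_ip = Counter(record[0] for record in selected)

print(frequent_urls)  # {'/home': 2}
```

A real system would plot `url_counts` and `pages_per_ip` as graphs rather than print them.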
• Components of a data mining system:
1. Databases, data warehouse, World Wide Web, or other information repository
2. Database or data warehouse server
3. Knowledge base
4. Data mining engine
5. Pattern evaluation module
6. User interface
• Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
• Tasks: descriptive and predictive
• Descriptive: characterize the general properties of the data in the database
• Predictive: perform inference (draw conclusions) on the current data in order to make predictions
1. Characterization and Discrimination
2. Mining Frequent Patterns
3. Classification and Prediction
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis
• Data characterization is a summarization of the general characteristics or features of a target class of data.
• For example: to analyze the improvement of the students who study in the 2nd semester of ME at GECM and whose marks increased by 5% in the current semester.
• Display forms: pie charts, bar charts, multidimensional data cubes, etc.



• Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
• For example: faculty may like to compare the results of students in the 2nd semester of ME at GECM whose marks increased by 5% with those whose marks decreased by 5% in the current semester.
• Display forms: pie charts, bar charts, multidimensional data cubes, etc.
• Frequent patterns are patterns that occur frequently in a data set.
• Forms: frequent itemsets, subsequences, and substructures.
• Frequent itemset: e.g., milk and bread.
• Subsequence: e.g., a PC purchase followed by software.
• Substructure: subgraphs, trees, or lattices.

Association rule mining is a method used to find interesting frequent patterns in a large set of data items.
• computer => antivirus [support = 2%, confidence = 60%]
A support of 2% means that computer and antivirus were purchased together in 2% of all transactions.
A confidence of 60% means that 60% of the customers who purchased a computer also purchased antivirus.
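The two measures can be computed directly from a list of transactions. The sketch below uses hypothetical transactions chosen so that the rule computer => antivirus has a confidence of 60%; the function names are illustrative, not taken from any library.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions (one set of purchased items each).
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer", "antivirus"},
    {"computer"},
    {"computer", "printer"},
    {"printer"},
    {"milk", "bread"},
    {"milk"},
    {"bread"},
    {"antivirus"},
]

s = support(transactions, {"computer", "antivirus"})
c = confidence(transactions, {"computer"}, {"antivirus"})
print(f"support = {s:.0%}, confidence = {c:.0%}")  # support = 30%, confidence = 60%
```

Here 3 of the 10 transactions contain both items (support 30%), and 3 of the 5 transactions containing a computer also contain antivirus (confidence 60%).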




Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts. The model
is derived based on the analysis of a set of training data and is
used to predict the class label of objects for which the class label
is unknown.
Classification is a two-phase process:
1) Learning: training data are analyzed by a classification algorithm.
2) Classification: the learned model is used to assign class labels to data.
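The two phases can be illustrated with a 1-nearest-neighbor classifier, chosen here only because it is short; the slides do not name a specific algorithm. The training rows and the function names `learn` and `classify` are assumptions for illustration.

```python
import math

def learn(training_rows):
    """Learning phase. For 1-nearest-neighbor the 'model' is simply the
    stored training data; other algorithms would fit parameters here."""
    return list(training_rows)

def classify(model, features):
    """Classification phase: return the label of the closest training row."""
    nearest = min(model, key=lambda row: math.dist(row[0], features))
    return nearest[1]

# Hypothetical training data: ((mark 1, mark 2, mark 3), class label)
training = [
    ((40, 50, 60), "Pass"),
    ((60, 70, 80), "Pass"),
    ((70, 30, 80), "Fail"),
    ((20, 35, 90), "Fail"),
]

model = learn(training)
label = classify(model, (65, 75, 70))
print(label)  # Pass
```

The record `(65, 75, 70)` has no known label; it receives the label of its closest training record, `(60, 70, 80)`.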
Prediction models continuous-valued functions, i.e., it is used to predict missing or unavailable numeric data values rather than class labels.
Regression analysis is a statistical method often used for numeric prediction.
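As a sketch of numeric prediction, the following fits a straight line by ordinary least squares and uses it to estimate a missing value. The data points are hypothetical (marks in one subject as x, marks in another as y), and `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical known marks: x = subject A, y = subject B
xs = [30, 40, 60, 70]
ys = [40, 50, 70, 80]

slope, intercept = fit_line(xs, ys)

# Predict the unavailable subject-B mark of a student who scored 50 in A.
predicted = slope * 50 + intercept
print(slope, intercept, predicted)  # 1.0 10.0 60.0
```

Unlike classification, the output is a number on a continuous scale rather than a class label.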
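Name             CA  IP  CS  Result
Dhaval Gohel     40  50  60  Pass
Rishabh Chauhan  60  70  80  Pass
Mayur Padiya     70  30  80  Fail
Ankit Prajapati  30  40  50  ?

Classification: assign the unknown class label (e.g., Pass).
Prediction: estimate a missing numeric mark (e.g., for Rishabh Chauhan, with the marks 70 and 80 known).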
Clustering analyzes data objects without
consulting class labels.
Clustering can be used to generate class
labels for a group of data for which no labels
existed at the beginning.
The objects are clustered or grouped based
on the principle of maximizing the intra-class
similarity and minimizing the inter-class
similarity.
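A minimal sketch of this grouping principle, using k-means on one-dimensional data. The choice of k-means, the attendance values, and the two starting centers are assumptions for illustration; the slides do not prescribe a clustering algorithm.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters

# Hypothetical attendance percentages; no class labels are given.
attendance = [5, 10, 15, 80, 85, 95]
centers, clusters = k_means_1d(attendance, centers=[0, 100])
print(clusters)  # [[5, 10, 15], [80, 85, 95]]
```

The two resulting groups could afterwards be given labels such as "rarely attends" and "regularly attends", which is how clustering produces class labels that did not exist beforehand.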
Outliers are data objects that do not comply with
the general behavior or model of the data.
Many data mining techniques discard outliers or
exceptions as noise.
However, in some applications these rare events are
more interesting than the regular ones. The analysis of
outlier data is referred to as outlier analysis or outlier mining.
Example: fraud detection.
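One simple way to flag outliers is to measure how many standard deviations a value lies from the mean. This is a sketch of that one technique under stated assumptions: the transaction amounts are hypothetical, and the threshold of 2.0 is an arbitrary choice that suits this tiny sample.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 25, 22, 30, 27, 24, 500]
outliers = find_outliers(amounts)
print(outliers)  # [500]
```

In a fraud detection setting, the flagged transaction is the interesting record rather than noise to be discarded.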
Data evolution analysis describes and models
regularities or trends for objects whose behavior
changes over time.
This may include characterization, discrimination,
association and correlation analysis, classification,
prediction, or clustering of time-related data.
Distinct features of such analysis include time-series
data analysis, sequence or periodicity pattern
matching, and similarity-based data analysis.
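As one small illustration of trend analysis on time-series data, a moving average smooths short-term fluctuation so that the underlying trend is easier to see. The monthly figures and the window size of 3 are hypothetical.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values in the series."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales figures.
monthly_sales = [10, 12, 11, 15, 17, 16, 20]
trend = moving_average(monthly_sales, window=3)

# The raw series dips in months 3 and 6, but the smoothed series
# rises at every step.
is_rising = all(a < b for a, b in zip(trend, trend[1:]))
print(is_rising)  # True
```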
Importance of Data Mining
• Data collected in large data repositories can become "data tombs".
• Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research.
• Data mining tools turn data tombs into "golden nuggets" of knowledge.
Applications of Data Mining
• Market analysis
• Fraud detection
• Customer retention
• Production control
• Science exploration
Major Issues in Data Mining
1. Mining different kinds of data
2. Handling multiple levels of abstraction
3. Incorporation of background knowledge
4. Visualization of mining results
5. Handling of incomplete or noisy data
6. Scalability of algorithms
Social Impacts of Data Mining
• Privacy
• Profiling
• Unauthorized use