Unit I – By Prof. Bhong
Elective-I
Examination Scheme:
In-semester Assessment: 30
End-semester Assessment: 70
Text Books:
Data Mining: Concepts and Techniques – Jiawei Han, Micheline Kamber
Introduction to Data Mining with Case Studies – G. K. Gupta
Reference Books:
Mining the Web: Discovering Knowledge from Hypertext Data – Soumen Chakrabarti
Reinforcement and Systemic Machine Learning for Decision Making – Parag Kulkarni
Data mining described
Need of data mining
Kinds of patterns and technologies
Issues in mining
KDD vs. Data Mining
Machine learning concepts
OLAP
Knowledge representation
Data preprocessing: cleaning, integration, reduction, transformation and discretization
Application with mining aspect (Weather Prediction)










Data: Data are any facts, numbers, or text that can be processed by a computer.
 Operational or transactional data, such as sales, cost, inventory, payroll, and accounting
 Non-operational data, such as industry sales, forecast data, and macroeconomic data
 Metadata – data about the data itself, such as logical database design or data dictionary definitions

Information: The patterns, associations, or relationships
among all this data can provide information.



Knowledge: Information can be converted into knowledge
about historical patterns and future trends. For example,
summary information on retail supermarket sales can be
analyzed in terms of promotional efforts to provide
knowledge of consumer buying behavior.
Thus, a manufacturer or retailer could determine which
items are most susceptible to promotional efforts.
Data Warehouses: Data warehousing is defined as a process
of centralized data management and retrieval.



The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 ▪ Automated data collection tools, database systems, the Web, a computerized society
 Major sources of abundant data
 ▪ Business: Web, e-commerce, transactions, stocks, …
 ▪ Science: remote sensing, bioinformatics, scientific simulation, …
 ▪ Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention" – hence data mining: the automated analysis of massive data sets.
Data mining is the process of sorting through large amounts of data and picking out relevant information.
In other words:
 Data mining (knowledge discovery from data) is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.

Other names:
 Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Data mining is searching through large amounts of data for correlations, sequences, and trends. Current "driving applications" are in sales (targeted marketing, inventory) and finance (stock picking).
A typical mining workflow: select the information to be mined (e.g., sales data); choose a mining tool based on the type of results wanted (cluster, sequence, classify, inference); then evaluate the results, e.g., "70% of customers who purchase comforters later purchase curtains."
Data Rich, Information Poor
Data Mining process
The KDD process includes:
 data cleaning (to remove noise and inconsistent data)
 data integration (where multiple data sources may be combined)
 data selection (where data relevant to the analysis task are retrieved from the database)
 data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
 data mining (an essential process where intelligent methods are applied in order to extract data patterns)
 pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
 knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
Data mining is the core of the knowledge discovery process.
Knowledge Discovery (KDD) Process

[Figure: the KDD process. Databases are cleaned and integrated into a data warehouse; selection produces the task-relevant data; data mining extracts patterns; pattern evaluation turns them into knowledge. Data mining is the core of the knowledge discovery process.]
1. Data cleaning – to remove noise and inconsistent data
2. Data integration – to combine multiple sources
3. Data selection – to retrieve relevant data for analysis
4. Data transformation – to transform data into a form appropriate for data mining
5. Data mining
6. Evaluation
7. Knowledge presentation

Steps 1 to 4 are different forms of data preprocessing.

Although data mining is only one step in the
entire process, it is an essential one since it
uncovers hidden patterns for evaluation
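
To make the flow concrete, here is a minimal Python sketch of the whole pipeline; every step function below is a hypothetical placeholder written for illustration, not an API from any real library.

```python
# A minimal sketch of the KDD pipeline as a chain of functions.
# All step functions are hypothetical placeholders for illustration.
from collections import Counter

def clean(records):
    """Step 1: drop records with missing fields."""
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    """Step 2: combine multiple data sources into one list."""
    combined = []
    for s in sources:
        combined.extend(s)
    return combined

def select(records, fields):
    """Step 3: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields if f in r} for r in records]

def transform(records):
    """Step 4: consolidate into a form appropriate for mining
    (pass-through placeholder here)."""
    return records

def mine(records):
    """Step 5: apply an intelligent method to extract patterns
    (placeholder: count attribute-value frequencies)."""
    return Counter(tuple(sorted(r.items())) for r in records)

def evaluate(patterns, min_count=2):
    """Step 6: keep only 'interesting' patterns by a simple measure."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Step 7: present the mined knowledge to the user."""
    for pattern, count in patterns.items():
        print(count, dict(pattern))

# Steps 1-4 are preprocessing; step 5 is the data mining core.
db_a = [{"item": "computer"}, {"item": "computer"}, {"item": None}]
db_b = [{"item": "software"}, {"item": "computer"}]
present(evaluate(mine(transform(select(clean(integrate(db_a, db_b)),
                                       ["item"])))))
```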

Based on this view, the architecture of a typical data mining system may have the following major components:
 Database, data warehouse, World Wide Web, or other information repository
 Database or data warehouse server
 Data mining engine
 Pattern evaluation module
 User interface

 Relational databases
 Data warehouses
 Transactional databases
 Advanced data and information systems
 ▪ Object-oriented databases
 ▪ Temporal DBs, sequence DBs and time-series DBs
 ▪ Spatial DBs
 ▪ Text DBs and multimedia DBs
 ▪ … and the WWW
Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines.]

In general, data mining tasks can be classified into two categories: descriptive and predictive.
 Descriptive mining tasks characterize the general properties of the data in the database.
 Predictive mining tasks perform inference on the current data in order to make predictions.






 Class description: characterization and discrimination
 Mining frequent patterns, associations and correlations
 Classification and prediction
 Cluster analysis
 Outlier analysis
 Evolution analysis


Data Characterization: a data mining system should be able to produce a description summarizing the characteristics of customers.
Example: the characteristics of customers who spend more than $1000 a year at (some store called) AllElectronics. The result can be a general profile covering age, employment status or credit rating.

Data Discrimination: a comparison of the general features of target-class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.

Example: the user may wish to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% in the same period.
Frequent Patterns: as the name suggests, patterns that occur frequently in data.
Association Analysis: from a marketing perspective, determining which items are frequently purchased together within the same transaction.
Example, mined from the AllElectronics transactional database:
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
 X represents a customer.
 Confidence = 50%: if a customer buys a computer, there is a 50% chance that he/she will buy software as well.
 Support = 1%: 1% of all the transactions under analysis showed that computer and software were purchased together.
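
A minimal Python sketch of how support and confidence are computed for this rule; the transaction list below is invented for illustration, not real AllElectronics data.

```python
# Sketch: computing support and confidence for the rule
#   buys(X, "computer") => buys(X, "software")
# Toy transactions, each a set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n            # fraction of ALL transactions containing both items
confidence = both / computer  # of computer buyers, the fraction who also buy software

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# -> support = 40%, confidence = 67%
```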



Another example, a multidimensional rule:
age(X, "20…29") ∧ income(X, "20K…29K") ⇒ buys(X, "CD player") [support = 2%, confidence = 60%]
For customers between 20 and 29 years of age with an income of $20,000–$29,000, there is a 60% chance they will purchase a CD player, and 2% of all the transactions under analysis showed customers of this age group with that income range buying a CD player.


Classification is the process of finding a model that describes and distinguishes data classes or concepts. This model is then used to predict the class of objects whose class label is unknown.
A classification model can be represented in various forms, such as (a sketch of the first follows this list):
 IF-THEN rules
 A decision tree
 A neural network
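
A minimal sketch of the IF-THEN form in Python; the rules and the attribute names (age, student, credit_rating) are assumed for illustration, not taken from any real model.

```python
# Sketch: a classification model expressed as IF-THEN rules.
# Rules and attributes are hypothetical, for illustration only.

def classify(customer):
    """Predict the class label of an object whose label is unknown."""
    if customer["age"] <= 30 and customer["student"]:
        return "buys_computer = yes"
    if customer["credit_rating"] == "excellent":
        return "buys_computer = yes"
    return "buys_computer = no"

print(classify({"age": 24, "student": True, "credit_rating": "fair"}))
# -> buys_computer = yes
```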


Clustering analyses data objects without consulting a known class label.
Example: cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. Picture a 2-D plot of customers with respect to customer locations in a city: each cluster is a geographic group of customers.
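
A minimal k-means sketch in Python clustering toy 2-D customer locations; the coordinates and the choice of k = 2 are assumptions made for illustration.

```python
# Sketch: k-means clustering of 2-D customer locations (toy coordinates).
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)          # initial centers: random points
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2
                                + (y - centers[c][1]) ** 2)
            clusters[i].append((x, y))
        # Move each center to the mean of its cluster.
        for j, c in enumerate(clusters):
            if c:
                centers[j] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

# Toy customer locations (x, y) in a city, illustrative only.
locations = [(1, 2), (1, 3), (2, 2), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(locations, k=2)
print(centers)   # one center per homogeneous subpopulation
```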


Outlier Analysis: a database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
Example: finding fraudulent usage of credit cards. Outlier analysis may uncover fraud by detecting purchases of extremely large amounts for a given account number in comparison to the regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase, or the purchase frequency.
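
A minimal sketch of this idea in Python, comparing a new charge to the account's regular charges with a z-score; the amounts and the 3-standard-deviation threshold are illustrative assumptions.

```python
# Sketch: flagging an unusually large purchase amount for one account by
# comparing it to the regular charges on the same account (toy numbers).
from statistics import mean, stdev

regular = [35, 20, 48, 55, 30, 42, 38]   # the account's usual charges
mu, sigma = mean(regular), stdev(regular)

def looks_fraudulent(amount, threshold=3.0):
    """Outlier if the charge is many standard deviations above normal."""
    return (amount - mu) / sigma > threshold

print(looks_fraudulent(45))     # False: within regular behaviour
print(looks_fraudulent(2500))   # True: extreme vs. regular charges
```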
Data mining includes many techniques from the domains below:
 Statistics
 Machine learning
 Database systems and data warehouses
 Information retrieval
 Visualization
 High-performance computing

Statistics: studies the collection, analysis, interpretation and presentation of data.
 Statistical research develops tools for prediction and forecasting using data.
 Statistical methods can also be used to verify data mining results.

Information Retrieval: the science of searching for documents, or for information within documents.
Database Systems and Data Warehouses: this research focuses on the creation, maintenance and use of databases for organizations and end users.


Machine Learning: It investigates how
computers can learn or improve their
performance based on data.


KDD (Knowledge Discovery in Databases) is a field of computer science which includes the tools and theories to help humans extract useful and previously unknown information (i.e., knowledge) from large collections of digitized data.
KDD consists of several steps, and Data Mining is one of them.


This process deals with the mapping of low-level data into other forms that are more compact, abstract and useful. This is achieved by creating short reports, modelling the process that generates the data, and developing predictive models that can predict future cases.
Data Mining is the application of a specific algorithm in order to extract patterns from data.

Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process which deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.

Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization
Summary

Data in the real world is dirty:
 incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data
 ▪ e.g., occupation=""
 noisy: containing errors or outliers
 ▪ e.g., Salary="-10"
 inconsistent: containing discrepancies in codes or names
 ▪ e.g., Age="42", Birthday="03/07/1997"
 ▪ e.g., the rating was "1, 2, 3" and is now "A, B, C"
 ▪ e.g., discrepancies between duplicate records

No quality data, no quality mining results!
 Quality decisions must be based on quality data
 ▪ e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data preparation, cleaning, and transformation comprise the majority of the work in a data mining application (around 90%).

A well-accepted multi-dimensional view of data quality:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Accessibility

Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies
Data integration
 Integration of multiple databases or files
Data transformation
 Normalization and aggregation
Data reduction
 Obtains a representation reduced in volume that produces the same or similar analytical results
 Data discretization (for numerical data)

Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization
Summary

Importance
 "Data cleaning is the number one problem in data warehousing"
Data cleaning tasks:
 Fill in missing values (see the sketch after this list)
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
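
As a minimal sketch of the first task, the following Python snippet fills a missing numeric value with the attribute mean; the records and the "income" attribute are hypothetical, chosen for illustration.

```python
# Sketch: one common cleaning step, filling missing numeric values
# with the attribute mean (toy records; 'income' is hypothetical).

records = [
    {"name": "A", "income": 30000},
    {"name": "B", "income": None},      # missing value to fill in
    {"name": "C", "income": 50000},
]

known = [r["income"] for r in records if r["income"] is not None]
mean_income = sum(known) / len(known)

for r in records:
    if r["income"] is None:
        r["income"] = mean_income       # simple imputation by the mean

print(records[1])   # -> {'name': 'B', 'income': 40000.0}
```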

Data is not always available
 e.g., many tuples have no recorded values for several attributes, such as customer income in sales data
Missing data may be due to
 equipment malfunction
 data being inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not being considered important at the time of entry
 failure to register the history or changes of the data


Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 etc.
Other data problems which require data cleaning:
 duplicate records, incomplete data, inconsistent data

Binning method:
 first sort the data and partition it into (equi-depth) bins
 then smooth by bin means, by bin medians, by bin boundaries, etc.
Clustering
 detect and remove outliers
Combined computer and human inspection
 detect suspicious values and have a human check them (e.g., to deal with possible outliers)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Partition into (equi-depth) bins:
 ▪ Bin 1: 4, 8, 9, 15
 ▪ Bin 2: 21, 21, 24, 25
 ▪ Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
 ▪ Bin 1: 9, 9, 9, 9
 ▪ Bin 2: 23, 23, 23, 23
 ▪ Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
 ▪ Bin 1: 4, 4, 4, 15
 ▪ Bin 2: 21, 21, 25, 25
 ▪ Bin 3: 26, 26, 26, 34
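
The following Python sketch reproduces the worked example above: equi-depth partitioning of the sorted prices, then smoothing by bin means and by bin boundaries.

```python
# Sketch reproducing the binning example: equi-depth bins of depth 4,
# then smoothing by bin means and by bin boundaries.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace each value by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the nearer boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```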


Outliers: data points inconsistent with the majority of the data
 e.g., noisy values such as an age of 200; widely deviating points
Removal methods
 Clustering
 Curve fitting

Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization

Data integration:
 combines data from multiple sources
Schema integration
 integrate metadata from different sources
 Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
 for the same real-world entity, attribute values from different sources differ, e.g., different scales, metric vs. British units
Removing duplicates and redundant data



Smoothing: remove noise from the data
Normalization: scale values to fall within a small, specified range, e.g., -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list)
Attribute/feature construction
 New attributes constructed from the given ones
Aggregation: summarization
Generalization: concept hierarchy climbing
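
A minimal sketch of min-max normalization in Python; the salary values are invented for illustration.

```python
# Sketch: min-max normalization scaling values into a target range.

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

salaries = [30000, 45000, 60000, 90000]   # toy values
print(min_max(salaries))
# -> [0.0, 0.25, 0.5, 1.0]
```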

Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization
Summary


Data is often too big to work with.
Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
 Dimensionality reduction – remove unimportant attributes
 Aggregation and clustering
 Sampling (sketched below)
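
As a minimal sketch of the sampling strategy, the following Python snippet draws a simple random sample without replacement; the data and the 1% fraction are illustrative assumptions.

```python
# Sketch: simple random sampling without replacement as a data reduction
# strategy: mine a much smaller sample in place of the full data set.
import random

def sample(records, fraction, seed=0):
    random.seed(seed)
    return random.sample(records, int(len(records) * fraction))

data = list(range(10000))          # stand-in for a large data set
reduced = sample(data, 0.01)       # keep 1% of the tuples
print(len(reduced))                # -> 100
```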

Feature selection (i.e., attribute subset selection):
 Select a minimum set of attributes (features) that is sufficient for the data mining task.

Partition the data set into clusters.





Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization

Three types of attributes:
 Nominal – values from an unordered set
 Ordinal – values from an ordered set
 Continuous – real numbers
Discretization:
 divide the range of a continuous attribute into intervals, because some data mining algorithms only accept categorical attributes.
Some techniques:
 Binning methods – equal-width, equal-frequency (an equal-width sketch follows this list)
 Entropy-based methods
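
A minimal Python sketch of equal-width binning as a discretization technique; the age values and the choice of k = 3 intervals are assumptions made for illustration.

```python
# Sketch: equal-width discretization of a continuous attribute into k
# intervals, replacing each value with a categorical interval label.

def equal_width(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        i = min(int((v - lo) / width), k - 1)   # clamp the maximum value
        labels.append(f"[{lo + i*width:.0f}, {lo + (i+1)*width:.0f})")
    return labels

ages = [13, 22, 25, 33, 45, 52, 70]
print(equal_width(ages, k=3))
# three equal-width intervals of width 19: [13,32), [32,51), [51,70)
```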

Discretization
 reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies
 reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior), as sketched below.
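
A minimal sketch of such a concept hierarchy for age in Python; the cut-off values (40 and 60) are assumed for illustration.

```python
# Sketch: a concept hierarchy for age, replacing numeric values with
# the higher-level concepts young / middle-aged / senior (cut-offs assumed).

def age_concept(age):
    if age < 40:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [13, 22, 25, 33, 45, 52, 70]
print([age_concept(a) for a in ages])
# -> ['young', 'young', 'young', 'young', 'middle-aged', 'middle-aged', 'senior']
```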


Data preparation is a big issue for data mining.
Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
Many methods have been proposed, but this is still an active area of research.