What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems
May 22, 2017
Data Mining: Concepts and Techniques
1
Multi-Dimensional View of Data Mining

- Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Major Issues in Data Mining

- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining & invisible data mining
  - Protection of data security, integrity, and privacy
Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary
Why Data Preprocessing?

- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=" "
  - noisy: containing errors or outliers
    - e.g., Salary="-10"
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age="42" Birthday="03/07/1997"
    - e.g., Was rating "1,2,3", now rating "A, B, C"
    - e.g., discrepancy between duplicate records
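The three kinds of dirtiness listed above can be checked mechanically. The following is a minimal sketch, not a method taken from the slides: it assumes each record is a plain dictionary, and the field names, the `find_issues` helper, the MM/DD/YYYY birthday format, and the reference date are all hypothetical choices made for illustration.

```python
from datetime import date, datetime


def find_issues(record, today):
    """Return a list of data-quality problems found in one record.

    `record` is a dict with hypothetical keys:
    occupation, salary, age, birthday.
    """
    issues = []

    # Incomplete: the attribute value is missing or blank.
    occupation = record.get("occupation")
    if occupation is None or not occupation.strip():
        issues.append("incomplete: occupation is missing")

    # Noisy: the value is outside the plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: salary is negative")

    # Inconsistent: the stated age disagrees with the birthday.
    # The MM/DD/YYYY format is an assumption.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday:
        born = datetime.strptime(birthday, "%m/%d/%Y").date()
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        implied_age = today.year - born.year - (0 if had_birthday else 1)
        if implied_age != age:
            issues.append("inconsistent: age does not match birthday")

    return issues


record = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/1997"}
for issue in find_issues(record, today=date(2017, 5, 22)):
    print(issue)
```

The sample record combines the slide's own examples, so all three checks fire. Real data would need domain-specific rules for what counts as a plausible range.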
Why Is Data Dirty?

- Incomplete data may come from
  - "Not applicable" data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
- Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
- Duplicate records also need data cleaning
Why Is Data Preprocessing Important?

- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
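A small worked example may make the point about misleading statistics concrete. This is only an illustration with made-up numbers: the salary values and the convention of storing a missing value as 0 are assumptions, not data from the slides.

```python
from statistics import mean

# Hypothetical salaries (in thousands).
clean = [30, 40, 50, 90]

# The same data, with the record for 90 accidentally entered three times.
with_duplicates = [30, 40, 50, 90, 90, 90]

# The same data plus two records whose missing salary was stored as 0.
with_missing_as_zero = [30, 40, 50, 90, 0, 0]

mean_clean = mean(clean)
mean_duplicates = mean(with_duplicates)
mean_missing = mean(with_missing_as_zero)

print("clean:", mean_clean)
print("with duplicates:", mean_duplicates)
print("missing stored as 0:", mean_missing)
```

Duplicates pull the mean toward the repeated value, while zero-filled missing entries pull it down, so neither figure describes the underlying population.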
Multi-Dimensional Measure of Data Quality


A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
Broad categories:
- Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing

- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains reduced representation in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data
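Two of the tasks above, filling in missing values and normalization, can be sketched in a few lines. The slides do not say which imputation or normalization method to use, so mean imputation and min-max scaling are assumptions here, as are the helper names and the sample data.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [20, None, 40, 60]  # hypothetical data with one missing value
filled = fill_missing_with_mean(incomes)
normalized = min_max_normalize(filled)
print(filled)
print(normalized)
```

Mean imputation is the simplest choice but shrinks the variance of the attribute, so other strategies may be preferable depending on the data.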
Forms of Data Preprocessing
Mining Data Descriptive Characteristics

- Motivation
  - To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
  - median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
- Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: chopping extreme values
- Median: A holistic measure
  - Middle value if odd number of values, or average of the middle two values otherwise
  - Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c$
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
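The measures above can be computed with Python's standard `statistics` module plus a few small helpers. This is a minimal sketch: the helper names and the sample data are hypothetical. In `grouped_median`, the arguments follow the interpolation formula: `lower` is $L_1$ (lower boundary of the median interval), `freq_below` is $(\sum f)_l$ (total frequency of all intervals below it), `freq_median` is $f_{\text{median}}$, and `width` is $c$ (the interval width).

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    k = int(len(values) * fraction)
    kept = sorted(values)[k:len(values) - k]
    if not kept:
        raise ValueError("fraction too large: nothing left to average")
    return mean(kept)


def grouped_median(lower, width, n, freq_below, freq_median):
    """Interpolated median: L1 + ((n/2 - (sum f)_l) / f_median) * c."""
    return lower + (n / 2 - freq_below) / freq_median * width


def empirical_mode(mean_value, median_value):
    """Mode estimate from mean - mode = 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # hypothetical

print("mean:", mean(data))
print("median:", median(data))
print("modes:", multimode(data))
print("10% trimmed mean:", trimmed_mean(data, 0.10))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
print("grouped median:", grouped_median(20, 10, 100, 40, 25))
```

`multimode` returns every most-frequent value, so a bimodal sample like this one yields two modes. The empirical formula is intended for moderately skewed unimodal data, which is why it is not applied to this sample.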
Measuring the Dispersion of Data

- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five number summary: min, Q1, M, Q3, max
  - Boxplot: ends of the box are the quartiles, median is marked, whiskers extend from the box, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
  - Variance: (algebraic, scalable computation)
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
  - Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
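The dispersion measures above can be sketched with the standard `statistics` module. The helper names and sample data are hypothetical. Quartile conventions differ between tools; the `"inclusive"` method used here is one assumption among several valid ones. The one-pass variance follows the sum and sum-of-squares form of the sample variance formula, which is what makes the computation scalable, though it can lose precision when the mean is large relative to the spread.

```python
from statistics import pvariance, quantiles, variance


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def sample_variance_one_pass(values):
    """s^2 from the sum and the sum of squares (a single scan of the data)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total * total / n) / (n - 1)


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # hypothetical

print("five-number summary:", five_number_summary(data))
print("outliers:", iqr_outliers(data))
print("sample variance (one pass):", sample_variance_one_pass(data))
print("sample variance (library):", variance(data))
print("population variance:", pvariance(data))
```

Because only the count, the sum, and the sum of squares are needed, the one-pass form can be accumulated incrementally or merged across data partitions.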