Data Mining
Lecture 4
Course Syllabus
• Course topics:
• Data Management and Data Collection Techniques for
Data Mining Applications (Week 3 - Week 4)
– Data Warehouses: gathering raw data from relational
databases and transforming it into information
– Information Extraction and Data Processing Techniques
– Data Marts: the need for building highly specialized data
stores for data mining applications
• Case Study 1: working with the properties of the Retail
Banking Data Mart (Week 4 - Assignment 1)
Data Pre-processing: Information
Extraction and Data Processing Techniques
• Why should we do pre-processing?
• Pre-processing takes about 80% of the time
• Real-world data is not perfect (dirty)
– missing values (no data entered)
• e.g., 35% of the Education field is incomplete
• e.g., 20% of the Birth Date field is incomplete
• e.g., 45% of the Work Title field is incomplete
• e.g., 60% of the Income field is incomplete
Data Pre-processing: Information
Extraction and Data Processing
Techniques
SECTION NAME (BÖLÜM ADI)          | NUMBER OF VARIABLES (DEĞİŞKEN ADEDİ)
ATM                               | 36
Consumer Loans                    | 108
Personal Insurance                | 26
Call Center                       | 30
Cheques                           | 69
Debit Cards                       | 52
Demographic Data                  | 54
Economic Data                     | 402
Bill Payments                     | 64
Non-Cash Loans                    | 48
Treasury Bills / Government Bonds | 64
Internet                          | 30
Credit Cards                      | 230
Overdraft Accounts                | 33
Salary Payments                   | 17
POS                               | 30
Repo                              | 28
Commercial Loans                  | 68
Commercial Insurance              | 26
Time Deposits                     | 77
Demand Deposits                   | 318
Mutual Funds                      | 106
Other Products                    | 21
TOTAL                             | 1,937
Data Pre-processing: Information
Extraction and Data Processing
Techniques
VARIABLE                 | SOURCE        | TYPE
IND_CAROWNERGROUP        | computed data | discrete
INDCOMM_COUNTRY_HOUSE    | raw data      | discrete
INDCOMM_COUNTRY_WORK     | raw data      | discrete
INDCOMM_COUNTY_HOUSE     | raw data      | discrete
INDCOMM_COUNTY_WORK      | raw data      | discrete
INDCOMM_EDUCATIONLEVEL   | raw data      | discrete
IND_EMPLOYEEFLAG         | computed data | discrete boolean
IND_GENDER               | raw data      | discrete
INDCOMM_HABITANT_HOUSE   | computed data | discrete
INDCOMM_HABITANT_WORK    | computed data | discrete
IND_HOUSEHOLDINCOMEGROUP | computed data | discrete
IND_HOUSEHOLDNUMBER      | computed data | continuous integer
IND_INCOMEGROUP          | computed data | discrete
IND_INTERNETFLAG         | computed data | discrete boolean
IND_MARITALSTATUS        | raw data      | discrete
IND_MOBILEPHONEUSAGEFLAG | computed data | discrete boolean
Data Pre-processing: Information
Extraction and Data Processing
Techniques
TABLE (TABLO)   | FIELD (SAHA)                                  | MISSING % (DOLULUK YUZDE) | DATA VALUE (VERİ DEĞERİ) | FILL STATUS (DOLULUK DURUMU)
MUSTERI_GERCEK  | Education Status (EGITIM DURUMU)              | 44% | VERY CRITICAL | VERY CRITICAL
MUSTERI_GERCEK  | Job Title (IS YERINDEKI UNVAN)                | 39% | VERY CRITICAL | VERY CRITICAL
MUSTERI_TUZEL   | Partnership Type (ORTAKLIK TIPI)              | 36% | VERY CRITICAL | VERY CRITICAL
MUSTERI_MUSTERI | Birth Date (DOGUMTARIHI)                      | 12% | VERY CRITICAL | CRITICAL
MUSTERI_GERCEK  | Occupation Code (MESLEK KODU)                 | 8%  | VERY CRITICAL | LESS CRITICAL
MUSTERI_GERCEK  | Gender (CINSIYET)                             | 4%  | VERY CRITICAL | LESS CRITICAL
MUSTERI_TUZEL   | Field of Activity (FAALIYET ALANI)            | 0%  | VERY CRITICAL | COMPLETE
MUSTERI_TUZEL   | Business Field (IS SAHASI)                    | 0%  | VERY CRITICAL | COMPLETE
MUSTERI_MUSTERI | Type (TIP)                                    | 0%  | VERY CRITICAL | COMPLETE
MUSTERI_GERCEK  | Employment Status (CALISMA DURUMU)            | 41% | CRITICAL      | VERY CRITICAL
MUSTERI_MUSTERI | Acquisition Channel (GIRIS KANALI)            | 36% | CRITICAL      | VERY CRITICAL
MUSTERI_GERCEK  | Marital Status (MEDENI DURUMU)                | 18% | CRITICAL      | CRITICAL
MUSTERI_MUSTERI | Birth Place (DOGUM YERI)                      | 5%  | CRITICAL      | LESS CRITICAL
MUSTERI_TUZEL   | Establishment Type Group (KURULUS TIPI GRUBU) | 0%  | CRITICAL      | COMPLETE
MUSTERI_TUZEL   | Establishment Type (KURULUS TIPI)             | 0%  | CRITICAL      | COMPLETE
MUSTERI_GERCEK  | Last School Name (SON OKUL ADI)               | 99% | LESS CRITICAL | VERY CRITICAL
MUSTERI_MUSTERI | Segment (SEGMENT)                             | 98% | LESS CRITICAL | VERY CRITICAL
MUSTERI_GERCEK  | Province of Registry (NUFUSA KAYITLI IL)      | 88% | LESS CRITICAL | VERY CRITICAL
Data Pre-processing: Information
Extraction and Data Processing Techniques
– erroneous (noisy) values
• e.g., Birth Date > current date, or Birth Date < 1850
(approx. 10% of the data)
• e.g., the permissible Education field values are (C: college,
U: university, H: high school, D: doctorate, M: master,
S: secondary school, P: primary school, I: illiterate), but values
such as X, Q, Y, T may be seen (approx. 10% of the data)
• e.g., the Income field is negative (approx. 15% of the data)
Data Pre-processing: Information
Extraction and Data Processing Techniques
– inconsistent: discrepancies in codes or names
• e.g., Birth Date = '01/01/1955' vs. an age of 54 (the same
information in different forms)
• e.g., the Education field is coded as
(C: college, U: university, H: high school, D: doctorate,
M: master, S: secondary school, P: primary school, I: illiterate)
in one source and as
(5: college, 3: university, 4: high school, 1: doctorate,
2: master, 6: secondary school, 7: primary school, 8: illiterate)
in another
• e.g., the Income field is continuous (3200 K) in one source and
interval-based (3000-4000 K) in another
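When two sources code the same field differently, a lookup mapping onto one canonical label set resolves the inconsistency. A minimal sketch using the two Education codings above (the dictionary and function names are illustrative, not from the lecture):

```python
# Map both observed Education coding schemes onto one canonical label set.
LETTER_CODES = {"C": "college", "U": "university", "H": "high school",
                "D": "doctorate", "M": "master", "S": "secondary school",
                "P": "primary school", "I": "illiterate"}
NUMERIC_CODES = {"5": "college", "3": "university", "4": "high school",
                 "1": "doctorate", "2": "master", "6": "secondary school",
                 "7": "primary school", "8": "illiterate"}

def canonical_education(code):
    """Return the canonical label, or None for invalid codes (X, Q, Y, T...)."""
    code = str(code).strip().upper()
    return LETTER_CODES.get(code) or NUMERIC_CODES.get(code)
```

Invalid codes map to None, so the same lookup also flags the erroneous values mentioned earlier.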
Data Pre-processing: Information
Extraction and Data Processing Techniques
– Where does dirtiness come from? Reasons for missing values:
• different conventions in coding and analysis (discrepancies over time)
• hardware/software problems
• different sources not aligned with the same data dictionary:
[Diagram: Source 1, Source 2, and Source 3 each carry Field 1, Field 2,
and Field 3, but the fields are not aligned to a common data dictionary]
Data Pre-processing: Information
Extraction and Data Processing Techniques
– Where does dirtiness come from? Reasons for erroneous values:
Humans enter incomplete, nearly-correct information, e.g., the same
customer recorded in two different forms:

FIRST NAME (AD) | SURNAME (SOYAD) | BIRTH DATE (DOĞUM TARİHİ) | BIRTH PLACE (DOĞUM YERİ) | ADDRESS (ADRESİ)                | JOB TITLE (ÇALIŞMA ÜNVANI) | WORKPLACE (ÇALIŞMA YERİ) | .........
M.Ulku          | SANER           | 10/04/1965                | G.ANTEB                  | Atatürk Cd. Kemaliye Sok. No.25 | Gen. Müdr.                 | G.Antep D.S.İ.           | .........
Metin Ü.        | SANRE           | 04.10.1965                | GAZİANTEP                | Atatrk Cad. Kemaliye Mah. 25/3  | Genel Müdür                | Devlet Su İşleri A.O     | .........
Data Pre-processing: Information
Extraction and Data Processing Techniques
– Where does dirtiness come from? Reasons for erroneous values:
Humans enter incomplete, nearly-correct information:
• "Esendere Sk. Aşagidere Cikmazi No:42 D: 14 Levent İst" vs.
"Asagidere Yokuşu D:14 Esendere Cd. 3.Levent ISTANBUL"
• "Büyükdere Sko. Ihlamur Cad. Ş.Nedim Mha." vs.
"İhlamur Sokağı Büyükdere Cd. Şair Nedim Sok."
Data Pre-processing: Information
Extraction and Data Processing Techniques
– Where does dirtiness come from? Reasons for erroneous values:
insufficient or incapable data collection instruments
• partial matching
• fuzzy understanding
• syntactic-semantic enrichment
continuous flow of data may cause data-entry faults
errors or disruptions in data transmission
Data Pre-processing: Information
Extraction and Data Processing Techniques
– Where does dirtiness come from? Reasons for inconsistent values:
• insufficient lookup mappings
• incapable transformation infrastructures
• different data sources
Hard to prevent: it needs a highly specialized
synchronization and automation infrastructure.
We should also care about duplicate data (redundancy).
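Duplicate detection across sources often relies on fuzzy string similarity. A minimal sketch using the standard library's difflib (the 0.8 threshold and function names are illustrative assumptions, not recommended values):

```python
# Near-duplicate detection via fuzzy string similarity (stdlib difflib).
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means identical strings (case-insensitive here).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_duplicates(a, b, threshold=0.8):
    # Flag pairs whose similarity reaches the (assumed) threshold.
    return similarity(a, b) >= threshold
```

For example, "SANER" and the mistyped "SANRE" from the duplicate-record slide score 0.8 and would be flagged as a likely duplicate pair.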
Data Pre-processing: Information
Extraction and Data Processing Techniques
– Why is pre-processing so important?
Data quality drives successful data mining:
it is the only way to extract information from data.
Major tasks in data pre-processing:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization
Data Pre-processing: Information
Extraction and Data Processing Techniques
Major tasks in Data Cleaning:
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
Data Pre-processing: Information
Extraction and Data Processing Techniques
How to handle missing data
– simply do not accept it (reject the record)
– fill it in manually
– fill it in automatically with:
» a global constant, e.g., "unknown" (effectively a new class?!)
» the attribute mean
» the attribute mean for all samples belonging to the
same class (smarter)
» the most probable value: inference-based, e.g., a
Bayesian formula or a decision tree
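The automatic fill strategies above can be sketched in plain Python (the toy income/education rows and the function names below are illustrative, not from the lecture):

```python
# Sketch of three automatic fill strategies for missing values:
# a global constant, the attribute mean, and the per-class attribute mean.
from statistics import mean

rows = [  # (education_class, income); None marks a missing income
    ("U", 3200), ("U", None), ("H", 1800), ("H", 2200), ("U", 4000),
]

def fill_constant(rows, constant=0):
    # Replace every missing value with one global constant.
    return [(c, constant if v is None else v) for c, v in rows]

def fill_global_mean(rows):
    # Replace missing values with the mean over all present values.
    m = mean(v for _, v in rows if v is not None)
    return [(c, m if v is None else v) for c, v in rows]

def fill_class_mean(rows):
    # Smarter: use the mean of the samples in the same class.
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    return [(c, mean(by_class[c]) if v is None else v) for c, v in rows]
```

On this toy data the missing "U" income becomes 2800 with the global mean but 3600 with the class mean, which respects the class structure.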
Data Pre-processing: Information
Extraction and Data Processing Techniques
How to handle noisy data
– Binning (discretization) method:
» first sort the data and partition it into (equi-depth) bins;
then smooth by bin means, bin medians, bin boundaries, etc.
» use the data distribution and domain knowledge
– Clustering
» detect and remove outliers
– Combined computer and human inspection
» detect suspicious values and have a human check them
(e.g., deal with possible outliers)
– Regression
» smooth by fitting the data to regression functions
– Model the data and infer the most probable values (difficult)
Data Pre-processing: Information
Extraction and Data Processing Techniques
Binning
• Equal-width (distance) partitioning:
– divides the range into N intervals of equal size: a uniform grid
– if A and B are the lowest and highest values of the attribute,
the width of each interval is W = (B - A) / N
– the most straightforward approach, but outliers may dominate
the presentation
– skewed data is not handled well
• Equal-depth (frequency) partitioning:
– divides the range into N intervals, each containing
approximately the same number of samples
– good data scaling
– managing categorical attributes can be tricky
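Equal-width partitioning follows directly from the formula W = (B - A) / N; a small sketch (the function name is my own) also shows how skewed data crowds into a few bins:

```python
# Equal-width partitioning: with A = min and B = max, each of the N bins
# spans W = (B - A) / N; value v lands in bin floor((v - A) / W).
def equal_width_bins(values, n):
    a, b = min(values), max(values)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        idx = min(int((v - a) / w), n - 1)  # clamp the max value into the last bin
        bins[idx].append(v)
    return bins
```

On the price list used in the equi-depth example that follows, the three equal-width bins hold 3, 3, and 6 values, so half the samples fall into one bin.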
Data Pre-processing: Information
Extraction and Data Processing Techniques
Binning
Sorted data (e.g., by price)
– 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
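The worked example above can be reproduced with a short script (the function names are my own; note the exact bin means are 9.0, 22.75, and 29.25, which the slide rounds to integers):

```python
# Equi-depth binning plus the two smoothing methods from the example.
def equi_depth_bins(sorted_values, n_bins):
    # Split already-sorted data into bins of equal size.
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin with the bin mean.
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with the nearer of the bin's min and max.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
```

Boundary smoothing reproduces the slide exactly: 4, 4, 4, 15 / 21, 21, 25, 25 / 26, 26, 26, 34.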
Data Pre-processing: Information
Extraction and Data Processing Techniques
Regression
[Figure: data points fitted with the regression line y = x + 1;
a noisy value Y1 at X1 is smoothed to the value Y1' on the line]
Data Pre-processing: Information
Extraction and Data Processing Techniques
Clustering
[Figure: data points grouped into clusters; values falling outside
every cluster are treated as outliers]
Data Pre-processing: Information
Extraction and Data Processing Techniques
How to handle inconsistent data
– systematic conversion ("transformation")
– dynamic and interactive control mechanisms
– redundancy detection and intelligent mapping
Data Pre-processing: Information
Extraction and Data Processing Techniques
Transformation
– Smoothing: remove noise from data
– Aggregation: summarization, data cube construction
– Generalization: concept hierarchy climbing
– Normalization: scale values to fall within a small, specified range
» min-max normalization
» z-score normalization
» normalization by decimal scaling
– Attribute/feature construction: new attributes
constructed from the given ones
Remember Stats Facts
• Min:
– What is the big-O cost of finding the min of an n-sized list?
• Max:
– What is the minimum number of comparisons needed to find
the max of an n-sized list?
• Range:
– What about simultaneously finding the min and max?
• Value Types:
– Cardinal value -> how many; counting numbers
– Nominal value -> names and identifies something
– Ordinal value -> order of things; rank, position
Transformation
• Min-max normalization: to [new_min_A, new_max_A]
v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A
– Ex. Let income range from $12,000 to $98,000, normalized to
[0.0, 1.0]. Then $73,600 is mapped to
\frac{73,600 - 12,000}{98,000 - 12,000} (1.0 - 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v' = \frac{v - \mu}{\sigma}
– Ex. Let μ = 54,000 and σ = 16,000. Then
\frac{73,600 - 54,000}{16,000} = 1.225
• Normalization by decimal scaling:
v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
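The three normalizations can be sketched as follows, reproducing the worked examples above (the function names are my own):

```python
# The three normalization methods from the slide.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # v' = (v - mu) / sigma
    return (v - mu) / sigma

def decimal_scaling(values):
    # v' = v / 10^j for the smallest j such that max(|v'|) < 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

min_max(73600, 12000, 98000) gives approximately 0.716 and z_score(73600, 54000, 16000) gives 1.225, matching the slide's examples.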
Remember Stats Facts
• Mean (algebraic measure), sample vs. population:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad vs. \quad \mu = \frac{1}{N} \sum_{i=1}^{N} x_i
– Weighted arithmetic mean:
\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
– Trimmed mean: chopping off extreme values before averaging
• Median: a holistic measure
– Middle value if there is an odd number of values; otherwise the
average of the two middle values
– Estimated by interpolation (for grouped data)
• Mode
– The value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: mean - mode ≈ 3 × (mean - median)
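As a quick refresher, the measures above can be computed with Python's standard statistics module; weighted_mean and trimmed_mean are small helper sketches (the names and sample data are my own, not from the slides):

```python
# Central-tendency measures from the slide, using the stdlib statistics module.
from statistics import mean, median, mode

def weighted_mean(values, weights):
    # Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    # Chop the k smallest and k largest values, then average the rest.
    s = sorted(values)
    return mean(s[k:len(s) - k])

data = [1, 2, 2, 2, 3, 4, 7, 9]
# mean(data) -> 3.75, median(data) -> 2.5, mode(data) -> 2
```

The trimmed mean of [1, 2, 3, 4, 100] with k = 1 is 3, showing how chopping extremes tames an outlier that would otherwise pull the mean to 22.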
Week 4 - End
• Read:
– Course Text Book, Chapter 2