Data Mining Lecture 4

Course Syllabus
• Course topics:
– Data Management and Data Collection Techniques for Data Mining Applications (Week 3 - Week 4)
» Data Warehouses: gathering raw data from relational databases and transforming it into information
» Information Extraction and Data Processing Techniques
» Data Marts: the need for building highly specialized data stores for data mining applications
• Case Study 1: working with and exploring the properties of the Retail Banking Data Mart (Week 4, Assignment 1)

Data Pre-processing: Information Extraction and Data Processing Techniques

• Why should we do pre-processing?
– Pre-processing takes about 80% of the total effort.
– Real-world data is not perfect (it is "dirty"):
» missing values (no data entered), e.g.:
• 35% of the Education field is incomplete
• 20% of the Birth Date field is incomplete
• 45% of the Work Title field is incomplete
• 60% of the Income field is incomplete

Variable counts per section of the Retail Banking Data Mart (section names translated from Turkish):

Section | Variable Count
ATM | 36
Consumer Loans | 108
Personal Insurance | 26
Call Center | 30
Cheques | 69
Debit Cards | 52
Demographic Data | 54
Economic Data | 402
Bill Payments | 64
Non-Cash Loans | 48
Treasury Bills / Government Bonds | 64
Internet | 30
Credit Cards | 230
Overdraft Accounts | 33
Salary Payments | 17
POS | 30
Repo | 28
Commercial Loans | 68
Commercial Insurance | 26
Time Deposits | 77
Demand Deposits | 318
Mutual Funds | 106
Other Products | 21
Total | 1,937

Example demographic variables in the data mart, with their source:

Variable | Source
IND_CAROWNERGROUP | computed data
INDCOMM_COUNTRY_HOUSE | raw data
INDCOMM_COUNTRY_WORK | raw data
INDCOMM_COUNTY_HOUSE | raw data
INDCOMM_COUNTY_WORK | raw data
INDCOMM_EDUCATIONLEVEL | raw data
IND_EMPLOYEEFLAG | computed data
IND_GENDER | raw data
INDCOMM_HABITANT_HOUSE | computed data
INDCOMM_HABITANT_WORK | computed data
IND_HOUSEHOLDINCOMEGROUP | computed data
IND_HOUSEHOLDNUMBER | computed data
IND_INCOMEGROUP | computed data
IND_INTERNETFLAG | computed data
IND_MARITALSTATUS | raw data
IND_MOBILEPHONEUSAGEFLAG | computed data

Attribute types used in the mart include discrete, boolean, continuous, and integer.

Field completeness and criticality in the customer tables (headers and criticality labels translated from Turkish; table and field names kept as-is):

Table | Field | Fill % | Data Criticality | Fill Status
MUSTERI_GERCEK | EGITIM DURUMU (education status) | 44% | Very critical | Very critical
MUSTERI_GERCEK | IS YERINDEKI UNVAN (job title) | 39% | Very critical | Very critical
MUSTERI_TUZEL | ORTAKLIK TIPI (partnership type) | 36% | Very critical | Very critical
MUSTERI_MUSTERI | DOGUMTARIHI (birth date) | 12% | Very critical | Critical
MUSTERI_GERCEK | MESLEK KODU (occupation code) | 8% | Very critical | Less critical
MUSTERI_GERCEK | CINSIYET (gender) | 4% | Very critical | Less critical
MUSTERI_TUZEL | FAALIYET ALANI (line of business) | 0% | Very critical | Full
MUSTERI_TUZEL | IS SAHASI (business field) | 0% | Very critical | Full
MUSTERI_MUSTERI | TIP (type) | 0% | Very critical | Full
MUSTERI_GERCEK | CALISMA DURUMU (employment status) | 41% | Critical | Very critical
MUSTERI_MUSTERI | GIRIS KANALI (entry channel) | 36% | Critical | Very critical
MUSTERI_GERCEK | MEDENI DURUMU (marital status) | 18% | Critical | Critical
MUSTERI_MUSTERI | DOGUM YERI (birth place) | 5% | Critical | Less critical
MUSTERI_TUZEL | KURULUS TIPI GRUBU (establishment type group) | 0% | Critical | Full
MUSTERI_TUZEL | KURULUS TIPI (establishment type) | 0% | Critical | Full
MUSTERI_GERCEK | SON OKUL ADI (last school attended) | 99% | Less critical | Very critical
MUSTERI_MUSTERI | SEGMENT (segment) | 98% | Less critical | Very critical
MUSTERI_GERCEK | NUFUSA KAYITLI IL (province of registration) | 88% | Less critical | Very critical

» erroneous (noisy) values, e.g.:
• Birth Date > current date, or Birth Date < 1850 (approx. 10% of the data)
• the permissible Education values are C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate, yet values such as X, Q, Y, T may be seen (approx. 10% of the data)
• the Income field is negative (approx. 15% of the data)

» inconsistent values: discrepancies in codes or names, e.g.:
• Birth Date = '01/01/1955' vs. age 54 (the same information in different forms)
• e.g.
the Education field coded as letters in one source (C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate) and as numbers in another (5: college, 3: university, 4: high school, 1: doctorate, 2: master, 6: secondary school, 7: primary school, 8: illiterate)
• e.g. the Income field stored as a continuous value (3,200 K) in one source and as an interval (3,000-4,000 K) in another

• Where may dirtiness come from? Reasons for missing values:
– different considerations in coding and analysis (discrepancies over time)
– hardware/software problems
– different sources not aligned with the same data dictionary (e.g. Source 1, Source 2, and Source 3 each defining Field 1, Field 2, and Field 3 differently)

• Where may dirtiness come from? Reasons for erroneous values:
– humans give incomplete, only approximately correct information; e.g. the same customer recorded twice:

First Name | Surname | Birth Date | Birth Place | Address | Job Title | Workplace
M.Ulku | SANER | 10/04/1965 | G.ANTEB | Atatürk Cd. Kemaliye Sok. No.25 | Gen. Müdr. | G.Antep D.S.İ.
Metin Ü. | SANRE | 04.10.1965 | GAZİANTEP | Atatrk Cad. Kemaliye Mah. 25/3 | Genel Müdür | Devlet Su İşleri A.O

– similarly conflicting address records:
» "Esendere Sk. Aşagidere Cikmazi No:42 D:14 Levent İst" vs. "Asagidere Yokuşu D:14 Esendere Cd. 3.Levent ISTANBUL"
» "Büyükdere Sko. Ihlamur Cad. Ş.Nedim Mha." vs. "İhlamur Sokağı Büyükdere Cd. Şair Nedim Sok."
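Conflicting records like these are typically reconciled with approximate string matching. A minimal sketch in Python, using the standard library's difflib.SequenceMatcher (the normalization rule and the 0.8 threshold are illustrative assumptions, not values from the lecture):

```python
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    # Crude canonicalization: case-fold, drop punctuation, collapse spaces.
    kept = "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means identical after normalization.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def likely_duplicates(a: str, b: str, threshold: float = 0.8) -> bool:
    # The 0.8 threshold is an arbitrary illustrative choice.
    return similarity(a, b) >= threshold

# The two spellings of the surname from the slide:
print(likely_duplicates("SANER", "SANRE"))   # True (ratio is 0.8 here)
print(likely_duplicates("SANER", "YILMAZ"))  # False
```

In practice, banks use far richer record-linkage pipelines (phonetic codes, token reordering, address parsers), but the idea is the same: score candidate pairs and flag those above a tuned threshold for review.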
• Where may dirtiness come from? Reasons for erroneous values (continued):
– insufficient or incapable data collection instruments; remedies include:
» partial matching
» fuzzy matching
» syntactic and semantic enrichment
– a continuous flow of data may cause data entry faults
– errors or disruptions in data transmission

• Where may dirtiness come from? Reasons for inconsistent values:
– insufficient lookup mappings
– incapable transformation infrastructures
– different data sources
These are hard to prevent and need a highly specialized synchronization and automation infrastructure. We should also watch for duplicate data (redundancy).

• Why is pre-processing so important?
– Data quality brings successful data mining; pre-processing is the only way to extract information from dirty data.

Major tasks in data pre-processing:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization

Major tasks in data cleaning:
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration

How to handle missing data:
– simply do not accept it
– fill it in manually
– fill it in automatically with:
» a global constant, e.g. "unknown" (effectively a new class!)
» the attribute mean
» the attribute mean over all samples belonging to the same class (smarter)
» the most probable value: inference-based, e.g. a Bayesian formula or a decision tree

How to handle noisy data:
– Binning (discretization):
» first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
» use the data distribution and domain knowledge
– Clustering:
» detect and remove outliers
– Combined computer and human inspection:
» detect suspicious values and have a human check them (e.g. to deal with possible outliers)
– Regression:
» smooth by fitting the data to regression functions
– Model the data and infer the most probable values (difficult)

Binning
• Equal-width (distance) partitioning:
– divides the range into N intervals of equal size (a uniform grid)
– if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A) / N
– the most straightforward approach, but outliers may dominate the result, and skewed data is not handled well
• Equal-depth (frequency) partitioning:
– divides the range into N intervals, each containing approximately the same number of samples
– good data scaling
– managing categorical attributes can be tricky
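Mean imputation and equal-depth binning with the smoothing variants above can be sketched in Python. This is a minimal illustration (function names, the rounding of bin means, and the tie-breaking toward the lower boundary are my own choices, not specified in the lecture):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def equal_depth_bins(sorted_values, n_bins):
    """Partition already-sorted data into bins of (approximately) equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_bin_means(bins):
    """Replace every value by its (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_bin_boundaries(bins):
    """Replace every value by the closer of the two bin boundaries."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)                            # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_bin_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_bin_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Run on the lecture's price data, this reproduces the worked binning example that follows.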
Binning example. Sorted data (e.g. by price): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34

Regression
[Figure: a data point (X1, Y1) and the fitted regression line y = x + 1, which yields the smoothed value Y1'.]

Clustering
[Figure: data points grouped into clusters; values falling outside every cluster are potential outliers.]

How to handle inconsistent data:
– systematic conversion ("transformation")
– dynamic and interactive control mechanisms
– redundancy detection and intelligent mapping

Transformation:
– Smoothing: remove noise from the data
– Aggregation: summarization, data cube construction
– Generalization: concept hierarchy climbing
– Normalization: scale values to fall within a small, specified range
» min-max normalization
» z-score normalization
» normalization by decimal scaling
– Attribute/feature construction: new attributes constructed from the given ones

Remember stats facts:
• Min: what is the big-O cost of finding the min of an n-sized list?
• Max: what is the minimum number of comparisons needed to find the max of an n-sized list?
• Range: what about finding min and max simultaneously?
• Value types:
– Cardinal value: how many; counting numbers
– Nominal value: names and identifies something
– Ordinal value: order of things; rank, position

Transformation formulas:
• Min-max normalization to [new_min_A, new_max_A]:
v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
– Example: let Income range over $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716.
• Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μ) / σ
– Example: let μ = 54,000 and σ = 16,000. Then 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225.
• Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Remember stats facts:
• Mean (an algebraic measure), sample vs. population:
x̄ = (1/n) Σ x_i   vs.   μ = (1/N) Σ x_i
– Weighted arithmetic mean: x̄ = (Σ w_i x_i) / (Σ w_i)
– Trimmed mean: chop off the extreme values first
• Median: a holistic measure
– the middle value if there is an odd number of values, or the average of the middle two values otherwise
– estimated by interpolation for grouped data
• Mode:
– the value that occurs most frequently in the data
– distributions may be unimodal, bimodal, trimodal, ...
– empirical formula: mean - mode ≈ 3 × (mean - median)

Week 4 - End
• Read: Course Text Book, Chapter 2.
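The normalization formulas from the Transformation section can be checked numerically. A small Python sketch (function names are my own; the formulas and example numbers are from the slides):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide all values by the smallest power of ten that brings max(|v'|) below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# The income examples from the slides:
print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(z_score(73_600, 54_000, 16_000))            # 1.225
print(decimal_scaling([986, -120]))               # [0.986, -0.12]
```

Both printed values match the worked examples on the slides, which is a quick sanity check when implementing these transforms yourself.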