Visual Data Mining and Document Collections Visualization

Fernando Vieira Paulovich
[email protected]
Universidade de São Paulo, São Carlos/SP, Brazil
Instituto de Ciências Matemáticas e de Computação (ICMC)
Departamento de Ciências da Computação

Introduction
• Visualization and Data Analysis
• InfoVis2
  – Visualization
  – Sonification
  – Mining
• Partners
  – M. Cristina F. Oliveira
  – Alneu de Andrade Lopes
  – Luis Gustavo Nonato
  – Guilherme P. Telles
  – Haim Levkowitz
  - Roberto Pinho
  - Lionis Watanabe
  - Pedro Vilela

Mining Large Data Sets - Motivation
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]

What is (not) Data Mining?
• What is not Data Mining?
  – Look up phone number in phone directory
  – Query a Web search engine for information about “Amazon”
• What is Data Mining?
  – Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
  – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data
[Figure labels: Statistics/AI; Machine Learning/Pattern Recognition; Database systems; Data Mining]

Data Mining Tasks
• Prediction Methods
  – Use some variables to predict unknown or future values of other variables
• Description Methods
  – Find human-interpretable patterns that describe the data
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]

Data Mining Example: Classification
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: Training Set -> Learn Classifier -> Model; Test Set -> Model]

Illustrating Clustering
• Euclidean distance based clustering in 3-D space
  – Intracluster distances are minimized
  – Intercluster distances are maximized

Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection
  – Produce dependency rules which will predict occurrence of an item based on occurrences of other items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications
  – Credit Card Fraud Detection
  – Network Intrusion Detection

Visualization
• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  – Humans have a well-developed ability to analyze large amounts of information that is presented visually
  – Can detect general patterns and trends
  – Can detect outliers and unusual patterns

Example: Sea Surface Temperature
• The following shows the Sea Surface Temperature (SST) for July 1982
  – Tens of thousands of data points are summarized in a single figure

Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
  – Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
  – Three flower types (classes):
    • Setosa
    • Virginica
    • Versicolour
  – Four (non-class) attributes
    • Sepal width and length
    • Petal width and length

Visualization of the Iris Data Matrix
• Standard deviation

Visualization of the Iris Correlation Matrix
• Correlation

Scatter Plot Array of Iris Attributes

Parallel Coordinates Plots for Iris Data

Visualizing Text Collections
• Large and high-dimensional data sets
• Dimension given by terms in the collection
• Multidimensional Projection Technique
  – Proximity by similarity (metrics)

Projection Explorer Tool

Process Overview

Text Pre-Processing
• The text pre-processing involves
  1. Stopword elimination
  2. Extraction of word stems (stemming)
  3. Creation of n-grams
  4. Frequency count and Luhn’s lower cut (n-grams appearing fewer than x times are ignored)
  5. Weighting process (term frequency-inverse document frequency, tfidf)

Example of Documents x Terms Matrix

      T1   T2   T3   T4   T5   T6   T7   T8   ...  Tm
Doc1  0.2  0.1  0.0  0.5  0.0  0.0  0.1  0.5  ...  0.1
Doc2  0.4  0.3  0.0  0.0  0.0  0.4  0.3  0.7  ...  0.5
Doc3  0.8  0.5  0.0  0.4  0.3  0.0  0.0  0.0  ...  0.0
...   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
Docn  0.4  0.0  0.0  0.0  0.3  0.7  0.0  0.5  ...  0.1

  $\mathrm{tfidf}(t_i, d_j) = \mathrm{freq}(t_i, d_j) \times \log \frac{n}{\mathrm{dfreq}(t_i)}$
  where n is the number of documents.

Projection Technique
  $X \subset \mathbb{R}^n \xrightarrow{\alpha} P \subset \mathbb{R}^2$
  $\alpha : X \to P, \quad |d(x_i, x_j) - d_2(\alpha(x_i), \alpha(x_j))| \approx 0, \quad \forall x_i, x_j \in X$
  $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}, \quad d_2 : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$

Projection Technique (Force-Based Placement)
• Data instances are treated as a system obeying Newton's laws of motion
    $f = m \times a$, and $a = p''$, so $p'' = f / m$
    $v' = a = f / m$
    $p' = v$
• Data instances are connected through springs
    $f = -k_s (|d| - s) \frac{d}{|d|}$

Projection Techniques
• Projection techniques for multidimensional data
  – Interactive Document Map (IDMAP)
  – Projection by Clustering (ProjClus)
  – Least-Square Projection (LSP)
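
The tfidf weighting from the Documents x Terms Matrix slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: freq is taken as the raw term count in a document, dfreq as the number of documents containing the term, n as the number of documents, and the logarithm as natural. The function name `tfidf_matrix` and the toy documents are hypothetical and are not taken from the slides or from any tool they mention.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a documents x terms matrix of tfidf weights.

    docs is a list of token lists (already pre-processed).
    Returns (terms, rows) where rows[j][i] = tfidf(terms[i], docs[j]).
    """
    n = len(docs)  # number of documents (assumed meaning of n)
    counts = [Counter(doc) for doc in docs]  # freq(t_i, d_j) per document
    terms = sorted(set().union(*docs))  # one matrix column per distinct term
    # dfreq(t_i): number of documents that contain the term at least once
    dfreq = {t: sum(1 for c in counts if t in c) for t in terms}
    rows = [[c[t] * math.log(n / dfreq[t]) for t in terms] for c in counts]
    return terms, rows


# Hypothetical toy collection, echoing the "Amazon" example in the slides.
docs = [
    ["amazon", "rainforest", "tree"],
    ["amazon", "book", "store"],
    ["rainforest", "tree", "tree"],
]
terms, rows = tfidf_matrix(docs)
print(terms)
for row in rows:
    print([round(w, 3) for w in row])
```

With this form of the weight, a term that occurs in every document gets weight 0, since log(n/n) = 0, so it contributes nothing to the similarity between documents.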
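
The force-based placement equations (spring force f = -k_s (|d| - s) d/|d|, then v' = f/m and p' = v) can be tried out with a small simulation. The sketch below makes several assumptions that the slides do not state: d is taken as p_i - p_j, the rest length s of each spring is set to the target distance between the two instances in the original space, every pair of instances is connected, the equations are integrated with a simple Euler step, and a damping factor is added so the system settles. The names `spring_force` and `force_placement` are hypothetical, and this is not the implementation used by the Projection Explorer tool.

```python
import math
import random


def spring_force(p_i, p_j, rest, k_s=1.0):
    """Force on p_i from a spring to p_j: f = -k_s (|d| - s) d/|d|, d = p_i - p_j."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return (0.0, 0.0)  # coincident points: direction undefined, apply no force
    scale = -k_s * (length - rest) / length
    return (scale * dx, scale * dy)


def force_placement(dist, init=None, steps=1000, dt=0.05, mass=1.0,
                    damping=0.9, seed=0):
    """Place points in 2-D so pairwise distances approach dist[i][j].

    init optionally gives starting positions; otherwise they are random.
    """
    n = len(dist)
    if init is not None:
        pos = [list(p) for p in init]
    else:
        rng = random.Random(seed)
        pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    vel = [[0.0, 0.0] for _ in range(n)]
    for _ in range(steps):
        # Sum the spring forces acting on every point.
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:
                    fx, fy = spring_force(pos[i], pos[j], dist[i][j])
                    forces[i][0] += fx
                    forces[i][1] += fy
        # Euler step: v' = f / m, p' = v (damping is an added assumption).
        for i in range(n):
            for k in range(2):
                vel[i][k] = damping * (vel[i][k] + dt * forces[i][k] / mass)
                pos[i][k] += dt * vel[i][k]
    return pos


# Hypothetical target distances: three instances forming a 3-4-5 triangle.
target = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
placed = force_placement(target)
for i in range(3):
    for j in range(i + 1, 3):
        d = math.hypot(placed[i][0] - placed[j][0], placed[i][1] - placed[j][1])
        print(i, j, round(d, 3), "target", target[i][j])
```

Evaluating every pair costs O(n²) force computations per step, which is one reason large collections motivate cheaper projection techniques.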