CIS4930 Introduction to Data Mining
Final Review
Peixiang Zhao
Tallahassee, Florida

Final Exam
• Time: Wednesday 5/3/2017, 5:30pm to 7:30pm
  – Plan your time well
• Venue: LOV 301, in-class exam
• Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
• No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
• Any form of cheating on the examination will result in a zero grade and will be reported to the university
• Bring your FSU ID to attend the final exam
• 40% of your final score
• Coverage
  – All materials taught in the class AND in the textbook, from Introduction through Clustering

Format
• One set of true/false questions with brief answers
  – e.g., k-Means can be used to cluster datasets with any arbitrary shape
  – Answer: False. Because ……
• Short-answer questions
  – e.g., What are the key differences between decision tree based classification and kNN classification?
• Several more questions
  – e.g., Compute frequent itemsets and strong association rules
• 100 points
• I believe you have enough time (120 minutes)

Final Exam
• How to do well in the exam?
  – Review the materials carefully and make sure you understand them
    • Both in slides and in the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your peer students
  – Discuss with the TA and me
    • TA office hour: Monday, May 1st, 3pm-5pm
  – Relax

What is Data Mining
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
• Typical procedure
  – Data → Knowledge → Action/Decision → Goal
• Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection

Data Mining Tasks
• Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
• Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

Data
• Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
• Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
• Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot

Similarity
• Proximity measure for binary attributes
  – Contingency table; symmetric, asymmetric measures; Jaccard coefficient
• Minkowski distance
  – Metric
  – Manhattan, Euclidean, supremum distance
  – Cosine similarity

Data Preprocessing
• Data quality
• Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
• Cleaning noisy data
  – Binning, regression, clustering, human inspection
• Handling redundancy in data integration
  – Correlation analysis
    • χ² (chi-square) test
    • Covariance analysis
• Data reduction
  – Dimensionality reduction
    • Curse of dimensionality
    • PCA vs. SVD
    • Feature selection
  – Numerosity reduction
    • Regression
    • Histogram, clustering, sampling
  – Data compression
• Data transformation
  – Normalization
  – Discretization

Frequent Pattern Mining
• Definition
  – Frequent itemsets
    • Closed itemsets
    • Maximal itemsets
  – Association rules
    • Support, confidence
• Complexity
  – The overall search space formulated as a lattice
• Methods
  – Apriori
  – FP-Growth
  – Eclat

Apriori
• The downward closure property
  – Or anti-monotone property of support
• Apriori algorithm
  – Candidate generation
    • Self-join
  – Frequency counting
    • Hash tree
• Further improvement

FP-Growth
• Major philosophy
  – Grow long patterns from short ones using local frequent items only
• FP-tree
  – Augmented prefix tree
  – Properties
    • Completeness and non-redundancy
• FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition

ECLAT
• Vertical representation of transactional DB
  – Tid-lists
• Algorithm
  – DFS-like

Association Rules
• The number of association rules can be exponentially large!
• Algorithm
• Pattern evaluation
  – Is confidence always an interesting measure for association analysis?

Classification
• Problem definition
  – Training & test
• Classification models
  – Decision tree: Gini index, information gain, error rate
  – Naïve Bayes
  – KNN
  – SVM
• Ensemble methods
  – Bagging
  – Boosting
• Model evaluation

Clustering
• Definition
• Types of clustering
• Methods
  – K-means
  – Hierarchical clustering
  – DBSCAN
  – Graph based clustering
• Cluster validity
• Semi-supervised clustering
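
Practice sketch: frequent itemsets and strong rules

The Format slide names "Compute frequent itemsets and strong association rules" as a sample question type. One way to check hand-worked answers is the minimal Apriori sketch below. It is an illustration under stated assumptions, not the course's reference solution: the toy transactions, the thresholds, and the function names (`support_count`, `apriori`, `strong_rules`) are all hypothetical, and support is counted by a plain scan rather than with a hash tree.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from the course materials).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    items = {item for t in transactions for item in t}
    # Level 1: frequent single items.
    level = {}
    for item in items:
        single = frozenset([item])
        count = support_count(single, transactions)
        if count >= min_count:
            level[single] = count

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {
            a | b for a, b in combinations(list(level), 2) if len(a | b) == k
        }
        # Downward closure: drop a candidate if any (k-1)-subset is infrequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        # Frequency counting by a full scan (a hash tree would speed this up).
        level = {}
        for c in candidates:
            count = support_count(c, transactions)
            if count >= min_count:
                level[c] = count
    return frequent


def strong_rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself in `frequent`.
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


if __name__ == "__main__":
    freq = apriori(TRANSACTIONS, min_count=3)
    for lhs, rhs, conf in strong_rules(freq, min_conf=0.8):
        print(sorted(lhs), "->", sorted(rhs), f"confidence={conf:.2f}")
```

With a minimum support count of 3 and a minimum confidence of 0.8, this toy data yields eight frequent itemsets and a single strong rule, {beer} → {diapers}, with confidence 1.0. Note that confidence ignores how common the consequent already is (diapers appears in 4 of the 5 transactions), which is the concern behind the pattern-evaluation question on the Association Rules slide.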