Knowledge Mining (data mining)
C. Papatheodorou
Laboratory on Digital Libraries and Electronic Publishing
Department of Archives and Library Science, Ionian University

Data Mining
- Extraction of knowledge from very large data collections
- Knowledge: rules, behavior patterns, and associations between objects (non-obvious, latent, previously unknown, and useful)
- Object: consists of a set of features
- It is not:
  - (Deductive) query processing
  - Expert systems, small machine learning/statistical programs

Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (newsgroups, email, documents) and Web analysis
  - Intelligent query answering

Market Analysis and Management (1)
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/co-relations between product sales
  - Prediction based on the association information

Market Analysis and Management (2)
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identifying the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provides summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and a class-based pricing procedure
  - Set pricing strategy in a highly competitive market

Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process
[Figure: KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data pre-processing
- Data preparation is a big issue for data mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research

[Figure: Data pre-processing]

Clustering
- Partition the data set into clusters; one can then store only the cluster representation
- Clustering can be hierarchical, with the clusters stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms

[Figure: Cluster Analysis]

Classification
- Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
- Classification is probably one of the most widely used data mining techniques, with a lot of extensions
- Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
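The classification workflow can be illustrated with a minimal Python sketch, assuming the faculty tenure records and the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes' from the classification process slides. The rule is hard-coded here rather than learned by an algorithm, and the names `Record`, `classify`, and `accuracy` are hypothetical helpers introduced only for illustration.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """One labeled sample: the class label attribute is `tenured`."""
    name: str
    rank: str
    years: int
    tenured: str  # "yes" or "no"


# Training set and test set as given on the classification process slides.
TRAINING: List[Record] = [
    Record("Mike", "Assistant Prof", 3, "no"),
    Record("Mary", "Assistant Prof", 7, "yes"),
    Record("Bill", "Professor", 2, "yes"),
    Record("Jim", "Associate Prof", 7, "yes"),
    Record("Dave", "Assistant Prof", 6, "no"),
    Record("Anne", "Associate Prof", 3, "no"),
]
TEST: List[Record] = [
    Record("Tom", "Assistant Prof", 2, "no"),
    Record("Merlisa", "Associate Prof", 7, "no"),
    Record("George", "Professor", 5, "yes"),
    Record("Joseph", "Assistant Prof", 7, "yes"),
]


def classify(rank: str, years: int) -> str:
    """Apply the slide's rule (assumed, not learned by this code)."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(records: List[Record]) -> float:
    """Fraction of records whose known label matches the predicted one."""
    correct = sum(classify(r.rank, r.years) == r.tenured for r in records)
    return correct / len(records)


if __name__ == "__main__":
    print("training accuracy:", accuracy(TRAINING))
    print("test accuracy:", accuracy(TEST))
    print("Jeff (Professor, 4) tenured?", classify("Professor", 4))
```

On these records the rule reproduces every training label but misclassifies one of the four test samples (Merlisa), which shows why accuracy should be estimated on a test set that is independent of the training set.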
Classification process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction: training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
  - Estimate accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise the accuracy estimate is too optimistic and over-fitting goes undetected

Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model)
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
The classifier is applied first to the testing data and then to unseen data.

Testing Data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data
(Jeff, Professor, 4) -> Tenured?

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Document category modelling
- Example: filtering spam email
- Task: classify incoming email as spam or legitimate (2 document categories)
- Simple blacklist and keyword-based methods have failed
- More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modeling)

Document category modelling
- Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization
- Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3)
- Step 3 (feature selection): information gain evaluation
- Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency

What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Example
  - Rule form: "Body => Head [support, confidence]"
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]

Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * => Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  - Home Electronics => * (What other products should the store stock up on?)

Rule Measures: Support and Confidence
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
- Find all the rules X & Y => Z with minimum confidence and support
  - support, s: probability that a transaction contains {X & Y & Z}
  - confidence, c: conditional probability that a transaction having {X & Y} also contains Z
- Find the rules with support and confidence equal to or greater than a given threshold

Mining Association Rules: An Example

Transaction ID  Items Bought
2000            A,B,C
1000            A,C
4000            A,D
5000            B,E,F

Min. support 50%
Min. confidence 50%

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

For rule A => C:
support = support({A,C}) = 50%
confidence = support({A,C}) / support({A}) = 66.6%
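The support and confidence values in the association rule example can be checked with a short Python sketch. It assumes the four transactions from the example and enumerates candidate itemsets by brute force, which is only practical for tiny data sets and is not Apriori or any other scalable mining algorithm. The function names `support`, `confidence`, and `frequent_itemsets` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set

# Transactions from the example: transaction ID -> items bought.
TRANSACTIONS: Dict[int, Set[str]] = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset: Iterable[str],
            transactions: Dict[int, Set[str]] = TRANSACTIONS) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for items in transactions.values() if wanted <= items)
    return hits / len(transactions)


def confidence(body: Iterable[str], head: Iterable[str],
               transactions: Dict[int, Set[str]] = TRANSACTIONS) -> float:
    """support(body and head together) divided by support(body)."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)


def frequent_itemsets(min_support: float,
                      transactions: Dict[int, Set[str]] = TRANSACTIONS
                      ) -> Dict[FrozenSet[str], float]:
    """Brute-force search over all itemsets (illustration only)."""
    items = sorted(set().union(*transactions.values()))
    found: Dict[FrozenSet[str], float] = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                found[frozenset(candidate)] = s
    return found


if __name__ == "__main__":
    for itemset, s in frequent_itemsets(0.5).items():
        print(sorted(itemset), f"{s:.0%}")
    print("confidence(A => C):", confidence({"A"}, {"C"}))
```

With a minimum support of 50%, the sketch yields {A}, {B}, {C}, and {A,C}, and the rule A => C has confidence 2/3, which the example reports as 66.6%.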