COMP313/513 DATA MINING

Unit organization
❙ Class times:
    Lectures: Tuesday and Wednesday 11-11:50 B251
    Tutorials: Thursday 10-11:50 MCL3 (starts in Week 2)
❙ Lecturer: Neil Dunstan MC207 [email protected]
❙ Textbook: Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann, 2nd Edition.
❙ Web site: http://mcs.une.edu.au/~comp513
❙ Note: This unit is being rewritten in 2008. Updated material will appear regularly on the web site.

Assessment
❙ Comp313: 3 Assignments, 10, 10 and 15%; Exam 65%
❙ Comp513: Literature Review 35%; Exam 65%
❙ Submissions will be routinely checked by the Turnitin plagiarism detection system.

Unit schedule in 2008
Week  Starts   Topic
----------------------------------------------
1     Feb 18   Introduction
2     Feb 25   Data Warehouses
3     Mar 3    Online Analytical Processes
4     Mar 10   Data Cubes
5     Mar 17   Associations and Correlations
6     Mar 24   Classification and Prediction
7     Mar 31   ..continued..
               mid semester break
8     Apr 28   Neural Networks
9     May 5    Clustering
10    May 12   ..continued..
11    May 19   Outlier Analysis
12    May 26   Text and Web Mining
13    Jun 2    Review
               Exam
Assessment ф, in schedule order: comp313 A1 set; comp313 A1 due; comp313 A2 set; comp513 Report Proposals due; comp313 A2 due; comp313 A3 set; comp513 Report due; comp313 A3 due
ф Assessment due dates are the Saturday at the end of the week.

Week 1 lecture slides:
❙ Topics:
    ❙ Data Mining Definition
    ❙ Enabling Technologies
    ❙ Evolution of Database and Data Analysis Technologies
    ❙ The Knowledge Discovery Process
    ❙ Types of Data
    ❙ Knowledge Discovery Methods
    ❙ Data Mining Systems
❙ Text Reference: Chapter 1.

What is data mining?
Knowledge Discovery in Large Databases
“Data Mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques”
Some other terms: Machine Learning, Data Analysis

Enabling technologies
❙ Accumulated Historical Transaction Processing Data
❙ Database Technology
❙ Tertiary Storage
❙ High-speed processing - Parallel Processing
❙ Commercial Data Mining Packages
That is,
❙ the availability of large volumes of data
❙ data organization methods
❙ data storage technologies
❙ high-speed data processing
❙ development of knowledge discovery algorithms

Evolution of database technology
❙ Primitive File Processing – Electronic Data Processing
❙ Relational Database Systems
❙ Query Languages – SQL
❙ Hierarchical and Networked Database Systems
❙ Indexing and Access Methods – B-trees, Hashing
❙ Data Modelling – Entity-Relationship Models
❙ User Interfaces – Forms and Reports
❙ Transactions, Concurrency Control
❙ Online Transaction Processing
❙ Object-oriented Databases
❙ Spatial, Temporal and Multimedia Data
❙ Heterogeneous Database Systems - Global Schemas
❙ Web-based Database Systems – XML, The Semantic Web

Evolution of data analysis
❙ Ad-hoc querying of file systems and databases
❙ Data Warehouse – Accumulation of data for analysis
❙ Online Analytical Processing – Interactive analysis
❙ Data Mining Algorithms – Correlations, Predictions, Clustering
❙ Multimedia, Stream, Time-series, Text and Web mining

Operational versus data mining systems
Online Transaction Processing    Data Mining
---------------------------------------------------------------------------
Reports on recent data           Analysis on historical data
Predictable and periodic         Unpredictable, depends on need
Limited data                     The more data the better (generally)
Focus on transaction entity      Focus on actionable entity, region, class
Response time in seconds         Response in days or weeks
System of records for data       Copy of data
Descriptive                      Creative

Steps in knowledge discovery
❙ Data Cleansing – Remove noise, inconsistencies, errors
❙ Data Integration – Combine data from heterogeneous sources
❙ Data Transformation – Data Selection, Reduction, Aggregation
❙ Data Mining – Apply Data Analysis Techniques
❙ Evaluation – Interestingness measures.
❙ Presentation – Visualization of results.

Data mining primitives
❙ Task-relevant data. What data is required?
❙ Data mining functions. What kinds of knowledge does the user want to discover?
❙ Background knowledge of the domain. In particular: Concept hierarchies. e.g. A 1 litre carton of full cream milk is a sub-category of full cream milk, which is a sub-category of milk. Such a concept hierarchy can be useful in summarization and association analysis at different levels of abstraction.
❙ Interestingness measures and evaluation methods. How can you assess the value of the data mining?
❙ Representation of discovered patterns and results. How can the results be presented to the user?

Types of data
Relational Databases, e.g.
    Customer(C_id#, Name, Address, Credit_Rating, ..)
    Supplier(S_id#, Name, Address, ..)
    Item(I_id#, Name, S_id, ..)
Data Warehouse, e.g. A combination of data from different databases
Transaction records, e.g. (T_id#, attribute details..)
Spatial Data, e.g. Maps, Geographic Information Systems. Raster or Vector representation.
Temporal and Time-Series Data, e.g. Mouse-click sequences, Transaction sequences, Stock Market Records
Text, e.g. Documents
Multimedia, e.g. Graphics, audio, video
Stream Data, e.g. Video surveillance, Continuous output from Sensors
World Wide Web, e.g. Web usage, Web logs, Linkage (Hypertext) Structures

Data terminology
In this unit, data will usually refer to data in relational databases, e.g.
Customer(C_id#, Name, Address, Credit_Rating, ..)
is a table containing records (or tuples) with values for each of the attributes C_id#, Name, Address, Credit_Rating, ..
For example:
C_id#   Name         Address                      Credit_Rating   ..
101     Joe Blogs    1 Alpha St. Armidale, NSW    Good
121     Bill Blick   22 Beta Rd. Armidale, NSW    Poor
...

Directed knowledge discovery
❙ Targets some specific attribute in the data set, e.g. What items sell well with bread?
❙ Tests hypotheses, e.g. Are women most likely to shop during the daytime?
❙ Seeks explanations for known patterns, e.g. Why are overseas students concentrated in Brisbane?

Undirected knowledge discovery
❙ Uses all available data
❙ Seeks patterns or structures in the data set
❙ Has unspecific goals, e.g. What items sell well together?
❙ May lead to hypotheses
❙ May precede more directed knowledge discovery

Examples of results of data mining
❙ Classification, e.g. Of loan applications into high, medium or low risk
❙ Prediction, e.g. Stock prices in 12 months' time
❙ Association, e.g. What items seem to sell well together
❙ Grouping, e.g. Customers into different market segments
❙ Explanation, e.g. Of some pattern, by visualization or generalization
❙ Summarization, e.g. Of items sold across all branches.

Descriptive patterns
Descriptive patterns characterize the data
❙ Class/concept descriptions summarize the attributes of a target class, e.g. A general profile of customers who spend more than $1000 per month in a store.
❙ Data discrimination compares the common attributes of a target class with those of other classes, e.g. Typical differences between different classes of customers by buying patterns.

Descriptive patterns
❙ Cluster Analysis attempts to find related groups of data by maximizing the similarity of data within groups and minimizing the similarity of data from different groups.
❙ Outlier Analysis finds data that doesn’t seem to comply with the rest of the data. Hence it may be noise or errors. In some applications it may indicate fraud or identity theft.
❙ Evolution Analysis is applied to time-series data in order to discover trends.

Descriptive patterns
❙ Frequent Itemsets are sets of items that commonly occur together in transactional data sets
❙ Association Rules are based on frequent itemsets, e.g. buys(X,computer) => buys(X,printer), that is, if a customer buys a computer they usually buy a printer as well.
❙ Association rules have interestingness measures
    ❙ Support (how often computer and printer occur together in the data)
    ❙ Confidence (how often printer occurs when computer occurs)

Predictive patterns
Predictive data mining attempts to develop models based on current data, in order to make predictions
❙ Classification models attempt to predict which of a given set of classes a new data object should belong to, e.g. A decision tree.
❙ Predictive models output a numerical estimate. That is, the prediction is a number rather than a class, e.g. A linear model based on weighted attribute values.

Evaluation of data mining
❙ It's easier to measure the results of projects with precise goals than those with vague goals
❙ In classification and prediction the data sets are divided into independent
    ❙ training set, for developing the model
    ❙ tuning set, for fine tuning and
    ❙ evaluation set, for final evaluation
❙ Association has Confidence and Support measures

Evaluation of prediction models
Predictive models provide a numerical estimate, so they can be evaluated by comparing predicted and actual figures, e.g.
Actual   Predicted   Difference
2020     2040        -20
1900     1880        +20
3000     3050        -50
Sum of Differences = -20 + 20 - 50 = -50 (+ve and -ve figures cancel out)
Average Difference (using absolute values) = (20 + 20 + 50)/3 = 30

Data mining systems
❙ Data mining systems may be classified according to:
    ❙ The type of data mined
    ❙ The kind of knowledge mined
    ❙ The techniques used
    ❙ The application domain
❙ The integration of the data mining system and the data can be classified as:
    ❙ No coupling. Data is sourced from an external database.
    ❙ Loose coupling. Uses features of the database to extract data.
    ❙ Semitight coupling. Some preprocessing of the data.
    ❙ Tight coupling. Total integration of the database/warehouse and data mining system.

Research issues in data mining
❙ Data mining query languages. Generic and for specific domains.
❙ Efficiency and scalability of knowledge discovery algorithms.
❙ Algorithms for streamed data.
❙ Algorithms applied to multimedia data.
❙ Parallel and distributed algorithms.
❙ Knowledge discovery across the internet. Semantic Web?
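
Worked example: evaluation measures
As an illustration of two evaluation measures from these slides, the Python sketch below computes support and confidence for the rule buys(X,computer) => buys(X,printer), and the sum and average absolute difference for the three actual/predicted pairs in the prediction example. The transaction list and all function names are assumptions made up for illustration; only the actual/predicted figures come from the slides.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    with_both = sum(1 for t in with_antecedent if consequent <= t)
    return with_both / len(with_antecedent)


def sum_of_differences(actual, predicted):
    """Signed errors added together; positive and negative errors cancel."""
    return sum(a - p for a, p in zip(actual, predicted))


def average_absolute_difference(actual, predicted):
    """Mean of the error magnitudes, so errors cannot cancel out."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical transactions (not from the slides), each a set of items.
transactions = [
    {"computer", "printer"},
    {"computer", "printer", "paper"},
    {"computer"},
    {"printer", "paper"},
    {"paper"},
]

rule_support = support(transactions, {"computer", "printer"})
rule_confidence = confidence(transactions, {"computer"}, {"printer"})
print(f"support    = {rule_support:.2f}")     # 2 of 5 transactions -> 0.40
print(f"confidence = {rule_confidence:.2f}")  # 2 of 3 computer buyers -> 0.67

# Actual and predicted figures from the prediction evaluation example.
actual = [2020, 1900, 3000]
predicted = [2040, 1880, 3050]
print(sum_of_differences(actual, predicted))           # -50
print(average_absolute_difference(actual, predicted))  # 30.0
```

Confidence is computed here directly from transaction counts rather than as a ratio of two support values; both give the same measure, but counting avoids small floating-point rounding differences.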