Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Transcript

Data Mining (and machine learning) DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module Many of these slides are highly derivative of Nick Taylor’s slides used for this module in previous years David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Overview of My Lectures All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html • • • • • • • • • • 25/9 02/10: 09/10 16/10 Overview of DM (and of these 8 lectures) Data Cleaning - usually a necessary first step for large amounts of data Basic Statistics for Data Miners - essential knowledge, and very useful Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry NO THURSDAY LECTURE OCTOBER 23rd 30/10 Cluster Analysis and Clustering - simple algs that tell you much about the data NO THURSDAY LECTURE November 6th 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data 20/11: Regression - the simplest algorithm for predicting data/class values 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Mining - Definition & Goal Definition • – Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules Goal • – To permit some other goal to be achieved or performance to be improved through a better understanding of the data David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Some examples of huge databases Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month Tesco has >500 stores. Per year, 100,000,000 baskets ? The Internet ~ >15,000,000,000 pages Lots of datasets: UCI Machine Learning repository How can we begin to understand and exploit such datasets? Especially the big ones? David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Like this … David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html and this … David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html and this … David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html or this … (see http://www.cs.umd.edu/hcil/treemap-history/ David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html or this … • see http://websom.hut.fi/websom/millio ndemo/html/root.html David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Mining - Basics • Data Mining is the process of discovering patterns and inferring associations in raw data • Data Mining is a collection of powerful techniques intended to analyse large amounts of data • There is no single Data Mining approach • Data Mining can employ a range of techniques, either individually or in combination with each other David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Mining – Why is it important? • • • • • Data are being generated in enormous quantities Data are being collected over long periods of time Data are being kept for long periods of time Computing power is formidable and cheap A variety of Data Mining software is available David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Mining – History • The approach has its roots over 40 years ago • In the early 1960s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS • By the late 1980s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc. David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Mining – Two Major Types • Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc. • Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes • Compare with Supervised and Unsupervised systems in machine learning David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Mining – Tasks Classification - Example: high risk for cancer or not Estimation - Example: household income Prediction - Example: credit card balance transfer average amount Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z Clustering - similar to classification but no predefined classes Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women” David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Warehousing • Note that Data Mining is very generic and can be used for detecting patterns in almost any data – Retail data – Genomes – Climate data – Etc. • Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Warehousing - Definitions “A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process” W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition. “A copy of transaction data, specifically structured for query and analysis” Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit” David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Warehouse – why? For organisational learning to take place data from many sources must be gathered together over time and organised in a consistent and useful way Data Warehousing allows an organisation to remember its data and what it has learned about its data Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Warehouse - Contents • A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting • The data will normally have been transformed when it was copied into the Data Warehouse • The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Marts • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse • A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s acitivities David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html From Data Warhousing to Machine Learning, via Data Marts David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html The Big Challenge for Data Mining • The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse • It is very important, then, that summary data also be available to get the analysis started • The sheer volume of data may mask the important relationships in which the Data Miner is interested • Being able to overcome the volume and interpret the data is essential to successful Data Mining David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html What happens in practice … Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Which brings us to “data cleaning”, next week … David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html