Data Mining (and machine learning)

DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module

Some of these slides are derivative of Nick Taylor’s slides used for this module in previous years.

David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Overview of My Lectures

All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
• Lecture 1: about data and data mining;
• Lectures 2 and 3: Basic and useful ways to process and understand data;
• Lectures 4, 5, 6, 7, 8: Details of useful algorithms for finding knowledge from data;
• Lecture 9: overview of what else there is.

Module assessment

100% by coursework
Two main items of coursework, 50% each
Four small items of coursework, worth nothing, but if you don’t do them adequately you fail the module.

This Semester

PDW lectures on Mondays (machine learning)
DWC lectures on Thursdays (data mining)
Friday slot usually unused – we may use it, and will let you know in advance
All coursework set by DWC

Coursework submission

ALL coursework must be submitted as follows
• as PDF
• by email to [email protected]
• the c/w is an attachment
• Subject line: DMML Coursework A – (… or B, C, D, 1, 2)
• Body of the email includes your Name and your Course (e.g. Joe Smith, BSc CS – Jill Brown, MSc AI)

DWC lectures and c/w, key dates

Thur Sep 17th    This lecture    Handout C/W A
Thur Sep 24th    Lecture         Handout C/W B
Thur Oct 1st     Lecture         Handout Main C/W 1 (50%)
Thur Oct 8th     Lecture
Thur Oct 15th    Lecture
Thur Oct 22nd    Lecture         Handout Main C/W 2 (50%)
Thur Oct 29th    NO LECTURE      (hand in C/W A, B and 1 on Fri 30th)
Thur Nov 5th     NO LECTURE
Thur Nov 12th    Lecture         Handout C/W C --- C/W 1 vivas on Fri 13th
Thur Nov 19th    Lecture         Handout C/W D
Thur Nov 26th    Lecture         (hand in C/W C, D and 2 on Fri 27th)
Thur Dec 3rd     C/W 2 vivas

At last, the lecture

What some people think can be done with data

Answer simple questions like:
• How many female clients do we have?
• How much paint did we sell in 2007?
• Which is the most profitable branch of our supermarket?
• Which postcodes suffered the most dropped calls in July?
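
Questions of this kind reduce to counting, summing, or taking a maximum over stored records. A minimal sketch in Python, assuming a made-up client list and sales log (every name and value below is hypothetical and invented for illustration; none of it comes from the module's data):

```python
from collections import defaultdict

# Hypothetical toy data, invented purely for illustration.
clients = [
    {"name": "A", "gender": "F"},
    {"name": "B", "gender": "M"},
    {"name": "C", "gender": "F"},
]
sales = [
    {"branch": "North", "product": "paint",   "year": 2007, "quantity": 40, "profit": 120.0},
    {"branch": "North", "product": "brushes", "year": 2007, "quantity": 10, "profit": 30.0},
    {"branch": "South", "product": "paint",   "year": 2007, "quantity": 25, "profit": 80.0},
    {"branch": "South", "product": "paint",   "year": 2006, "quantity": 60, "profit": 200.0},
]

# How many female clients do we have?
female_clients = sum(1 for c in clients if c["gender"] == "F")

# How much paint did we sell in 2007?
paint_2007 = sum(
    s["quantity"] for s in sales
    if s["product"] == "paint" and s["year"] == 2007
)

# Which is the most profitable branch?
profit_by_branch = defaultdict(float)
for s in sales:
    profit_by_branch[s["branch"]] += s["profit"]
best_branch = max(profit_by_branch, key=profit_by_branch.get)

print(female_clients, paint_2007, best_branch)
```

Each answer is a single pass over the data with a fixed question in mind; nothing is discovered that was not explicitly asked for.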

… that is so Boring

More interesting things that can be done with data

Answer difficult and valuable questions like:
• How can we predict ovarian cancer early enough to treat it successfully?
• How can I make significant profit on the stock market next month?
• Two different authors claim to have written this story – how can we resolve the dispute?
• How can we get our customers to spend more money in the store?
• Is this loan applicant a good credit risk?
• Is this sonar image a mine, or a rock?
• What other websites will this browser be interested in?

Data Mining - Definition & Goal

• Definition
  – Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules
• Goal
  – To permit some other goal to be achieved or performance to be improved through a better understanding of the data

Some examples of large databases

Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month. Tesco has >500 stores. If every store were like that one, a year would give 18,000 × 12 × 500 = 108,000,000, so roughly 100,000,000 baskets per year?
The Internet: ~15,000,000,000 pages or more
Lots of datasets: UCI Machine Learning repository
How can we begin to understand and exploit such datasets? Especially the big ones?

Like this … and this … and this … or this … [slide figures]
• see http://websom.hut.fi/websom/milliondemo/html/root.html

Data Mining - Basics

• Data Mining is the process of discovering patterns and inferring associations in raw data
• Data Mining is a collection of techniques intended to analyse small or large amounts of data
• There is no single Data Mining approach
• Data Mining can employ a range of techniques, either individually or in combination with each other

Data Mining – Why is it important?
• Data are being generated in enormous quantities
• Data are being collected over long periods of time
• Data are being kept for long periods of time
• Computing power is formidable and cheap
• A variety of Data Mining software is available
• All of these data contain ‘hidden knowledge’ – facts, rules, patterns, that can be usefully exploited if we can find them.

Data Mining – History

• The approach has its roots in work from over 40 years ago
• In the 1960s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS
• By the late 1980s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc.

Some basic terminology

Gender   weight   height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc …

Each row of the table is called a data instance or a record or just a line of data.
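
As a rough sketch, the table above can be held in Python as a list of dictionaries, one per data instance. The key names (`age_months`, `time_100m`, and so on) are illustrative choices, not part of the slides; the values are the ones shown in the table, with units dropped.

```python
# Each dictionary is one data instance (record) from the table.
# Units are dropped: weight in kg, height in m, age in months, time in s.
records = [
    {"gender": "Male",   "weight": 52, "height": 1.71, "age_months": 243, "time_100m": 13.7},
    {"gender": "Male",   "weight": 89, "height": 1.92, "age_months": 388, "time_100m": 22.3},
    {"gender": "Female", "weight": 48, "height": 1.67, "age_months": 219, "time_100m": 14.6},
    {"gender": "Male",   "weight": 86, "height": 1.96, "age_months": 274, "time_100m": 9.58},
    {"gender": "Male",   "weight": 80, "height": 1.88, "age_months": 260, "time_100m": 10.56},
]

# The 4th record (Python lists are 0-indexed, so index 3).
fourth = records[3]

# One column of the table, read across all records.
ages = [r["age_months"] for r in records]

print(fourth["age_months"])
print(ages)
```

A list of dictionaries is only one possible layout; it is used here because each record stays readable as a unit.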

Each column of the table is called a field or an attribute; the value of the Age field in the 4th record is 274.

Usually we are interested in predicting the value of a particular field, given the values of the other fields. What we want to predict is called the class field, or the target class.

Some data-mining related projects that I am currently working on (either myself, or with a PhD student or RA)

Predicting whether or not two textures will be considered similar by humans.
Predicting which of two or more writers is the author of a given piece of text (you will do some work on this)
Discovering which subsets of many thousands of genes play a role in specific diseases (cancer, diabetes, etc) (you will do a little work on this too)
Discovering technical trading rules for stock market trading (you will do a little work on this too)

Which pair of textures is most similar?
A line of data …
0.23 1.88 9.64 3.22 … 7.1 1086.9 2.23 … 0.76
The line holds 5,000 features for texture1, 5,000 features for texture2, and the %age of people who think they are similar.

Who wrote text chunk 4?

Word usage ‘Fingerprint’ of a 1,000 word chunk of text, for four chunks labelled AuthorA, AuthorA, AuthorB and ?:
0.4 0.3 0.2 0.2 0.2 0.15 0.2 0.15 0.001 0.002 0.6 … 0 0.1 0.5 … 0.001 0.002 0.5 … 0 0.002 0.6 …

Did the Dow Jones go up or down in the following week?

Down

Will the Dow Jones go up or down tomorrow?

Data Mining – Two Major Types

• Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc.
• Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes
• Compare with Supervised and Unsupervised systems in machine learning

Data Mining – Tasks

Classification - Example: high risk for cancer or not
Estimation - Example: household income
Prediction - Example: credit card balance transfer average amount
Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z
Clustering - similar to classification but no predefined classes
Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”

Data Warehousing

• Note that Data Mining is very generic and can be used for detecting patterns in almost any data
  – Retail data
  – Genomes
  – Climate data
  – Etc.
• Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector

What you should do this week

Browse the UCI Machine Learning repository datasets and associated information; get acquainted with data.
Browse the statlib datasets archive; get acquainted with that too.
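
The Affinity Grouping example from the “Data Mining – Tasks” slide (people who buy X often also buy Y with a probability of Z) can be made concrete with a small calculation. This is a minimal sketch on invented baskets; the function name `rule_probability` and the items are hypothetical, and Z is estimated simply as the fraction of baskets containing X that also contain Y.

```python
def rule_probability(baskets, x, y):
    """Estimate Z = P(Y in basket | X in basket) from a list of baskets (sets)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return None  # X was never bought, so the estimate is undefined
    return sum(1 for b in with_x if y in b) / len(with_x)


# Hypothetical baskets, invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "tea"},
]

z = rule_probability(baskets, "bread", "butter")
print(z)
```

In association-rule terminology this quantity is usually called the confidence of the rule X → Y. On real basket data it would normally be reported alongside how many baskets contain X at all, since a high Z computed from a handful of baskets means little.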

And then …

Coursework A (0 marks, but you fail if you don’t submit an adequate attempt)

Find three other dataset repositories as follows:
1. One that specialises in financial data
2. One that specialises in time series data
3. One that specialises in anything else.
For each of these three, tell me the URL, and write one paragraph, ~100 words, in your own words, describing the contents of this repository.
Submit on or before Friday October 30th

Au revoir

If time available …

Some slides about data warehousing; I don’t consider this an essential part of this module, but in case you want to know what data warehousing is …

Data Warehousing - Definitions

“A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process”
W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition.

“A copy of transaction data, specifically structured for query and analysis”
Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”

Data Warehouse – why?
For organisational learning to take place, data from many sources must be gathered together over time and organised in a consistent and useful way
Data Warehousing allows an organisation to remember its data and what it has learned about its data
Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it

Data Warehouse - Contents

• A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting
• The data will normally have been transformed when it was copied into the Data Warehouse
• The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course

Data Marts

• A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse
• A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s activities

From Data Warehousing to Machine Learning, via Data Marts

The Big Challenge for Data Mining

• The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse
• It is very important, then, that summary data also be available to get the analysis started
• The sheer volume of data may mask the important relationships in which the Data Miner is interested
• Being able to overcome the volume and interpret the data is essential to successful Data Mining

What happens in practice …

Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions
The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data
There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them
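
The point about summary data in “The Big Challenge for Data Mining” can be illustrated with a small aggregation. A minimal sketch, assuming a hypothetical set of transaction records (the field names, the `summarise` helper and all values are invented for illustration): rather than starting from every transaction, the analysis starts from per-branch, per-month counts and totals.

```python
from collections import defaultdict

# Hypothetical transaction-level rows, as might be copied into a warehouse.
transactions = [
    {"branch": "North", "month": "2009-09", "amount": 12.50},
    {"branch": "North", "month": "2009-09", "amount": 7.25},
    {"branch": "South", "month": "2009-09", "amount": 30.00},
    {"branch": "North", "month": "2009-10", "amount": 5.00},
]


def summarise(rows):
    """Return {(branch, month): (number of transactions, total amount)}."""
    summary = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = (row["branch"], row["month"])
        summary[key][0] += 1
        summary[key][1] += row["amount"]
    return {key: tuple(value) for key, value in summary.items()}


summary = summarise(transactions)
for key in sorted(summary):
    print(key, summary[key])
```

The summary has one line per branch and month regardless of how many transactions lie behind it, which is what makes it a manageable starting point; the detailed rows remain available when a pattern in the summary needs explaining.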