G54DMT – Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
[email protected]
Lecture 0: Introduction

Outline of the lecture
• What is Data Mining?
• Administrative bits
• Module structure
• Resources

We are buried in data…

And in business as well…
• Generating better movie recommendation methods from customer ratings
• Training set of 100M ratings from over 480K customers on 18K movies
• Data collected between October 1998 and December 2005
• $1M prize for a recommender system 10% better than the Netflix proprietary method
• It took 3 years to solve the challenge

What is Data Mining?
• “The extraction of knowledge from large amounts of data” (Han and Kamber, 2006)
• “Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities” (Witten and Frank, 2005)

So what is the data?
• At its origin, data can be heterogeneous: it can come from multiple sources and contain uncertainty (i.e. distorted or missing entries)
• In most cases we will assume that the data is structured as a table where the rows are instances and the columns are attributes
• In certain cases the records will also have one or more labels associated with them (a class)

Data can be… Piles of Records
• Datasets with a high number of records
  – This is probably the most visible dimension of large-scale data mining
  – GenBank (the genetic sequence database from the NIH) contained, as of February 2008, more than 82 million gene sequences and more than 85 billion nucleotides

Data can be… High Dimensionality
• High-dimensionality domains
  – Sometimes each record is characterized by hundreds, thousands or even more features
  – Microarray technology (like many other post-genomic data generation techniques) can routinely generate records with tens of thousands of variables
  – Creating each record is usually very costly, so datasets tend to have a very small number of records. This imbalance between the number of records and the number of variables is yet another challenge
(Reinke, 2006, Image licensed under Creative Commons)

Data can be… Rare
• Class imbalance
  – It is challenging to generate accurate classification models when not all classes are equally represented
  – Contact Map prediction datasets (briefly explained later in the tutorial) routinely contain millions of instances, of which less than 2% are positive examples
  – Tissue type identification is highly imbalanced (see figure) (Llora, Priya, Bhargava, 2009)

Data can be… Lots of Classes
• Yet another dimension of difficulty
• The Reuters-21578 dataset is a text categorization task with 672 categories
• Closely related to the class imbalance problem
• Machine learning methods need to make an extra effort to ensure that underrepresented data is properly taken into account

And what do we do with the data?
• The whole process of integrating, cleaning, selecting, mining and visualising the data is generally known as Knowledge Discovery in Databases (KDD) (Han and Kamber, 2006)

Fields related to Data Mining
• Machine Learning
  – “How to construct programs that learn from experience” (Mitchell, 1997)
  – ML generally concentrates on the central part of the KDD process: pattern extraction
  – ML is also generally seen as focusing on the algorithms, while DM focuses on the process
• Pattern recognition
  – A mathematical view of the pattern extraction process, as opposed to the computational view of ML
• Text mining
  – Focused on analysing human-written text; a very specialised form of DM

Educational aims
• To provide students with a strong knowledge of data mining and its application to real-world scenarios
• To understand the need for data mining to analyse large-scale real-world data
• To provide students with a sneak peek of the challenges and opportunities of data mining
• The objective of this module is to study the methods and applications of data mining techniques
• The focus of the module will be on the technology, but by illustrating its usage with challenging problems we aim to provide a clear understanding of how these methods can be applied in the real world
• Successful completion of the module will endow a student with:
  – A strong understanding of core data mining problems (e.g. classification, regression, clustering, feature and prototype selection, dimensionality reduction) and the state-of-the-art methods for solving them
  – A strong understanding of the application of data mining to important real-world problems
  – Familiarity with the operation of, and principles behind, publicly available data mining packages (e.g. Weka)

Lectures and labs
• Lectures: Thursdays, 15:00 – 17:00, JC-AMENB11+
• Labs: Mondays, 11:00 – 13:00, JC-COMPSCIB52 (labs start on 11/2)
  – The laboratory sessions will be used to develop the coursework. I will be present to answer questions
  – Sometimes there will be directed sessions, but these will be few and advertised in advance

Coursework
• Coursework 1 (50% of the mark)
  – Detailed study of one aspect of data preprocessing
  – How to perform a proper ML evaluation protocol
  – Deadline: 8/3/2013
• Project 1 (50% of the mark)
  – I will give you a challenging large-scale dataset, and you are free to mine it using any combination of the techniques described in the module
  – Deadline: 10/5/2013

How to contact me?
• At lectures and lab sessions
• My office is B81 in the Computer Science building. However, for many reasons, if you just drop by unannounced the chances are that I will not be able to see you
• Thus, the preferred contact method is email: [email protected]

Module structure
• Four topics (described in the next slides)
• Some topics will take several lectures to cover
• All lectures will be posted at http://www.cs.nott.ac.uk/~jqb/G54DMT
• Take notes
  – Not everything is in the slides
  – I will use the whiteboard often
• After each lecture I will provide a list of resources to complement the material
• Also, whenever necessary, I will introduce background material
• If you feel that you are missing some background material, tell me straight away!

Module structure
• Topic 1: Preliminaries
  – This topic deals with several concepts that will be used across the module
    • Data infrastructure: simple and advanced file formats
    • Experimental validation procedures
    • Statistical tests
    • Most popular data mining packages

Module structure
• Topic 2: Data Preparation
  – The steps we follow to transform the data in order to facilitate the pattern extraction process
  – Many methods fall into this category
    • Feature selection
    • Instance selection
    • Dimensionality reduction
    • Missing values handling
    • Discretisation

Module structure
• Topic 3: Data Mining
  – This topic deals with the central part of the KDD pipeline, the extraction of patterns from data
  – This process can be done in many different ways. The most common ones are
    • Classification
    • Regression
    • Clustering
    • Association Rules Mining

Module structure
• Topic 4: Applications
  – We will see a few examples of how the methods studied throughout the module are applied to challenging real-world problems

Resources
• Books
  – J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2006
  – I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2005
  – T. M. Mitchell, Machine Learning, McGraw-Hill, 1997
  – C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
  – T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2009
• Online resources
  – KDNuggets, a newsletter and website about data mining
• Software packages
  – WEKA
  – RapidMiner
  – Keel

Questions?