SATOMGI Data Mining and Matching
Lecture 1: Module overview and introduction to data mining

Module overview
• Four hours per day over three days
  • Day 1: Two hours of lectures (data mining process / data issues), two hours of practical sessions
  • Day 2: One hour lecture (clustering), one hour practical session, one hour lecture (association rules mining), one hour written assessment
  • Day 3: One hour lecture (classification and prediction), one hour practical session, one hour lecture (data integration and matching), one hour written assessment
• Module lecturer: Dr Peter Christen
  • Senior lecturer, ANU Department of Computer Science
  • E-mail: [email protected]
  • Phone: 6125 5690
• Module Web site: http://cs.anu.edu.au/people/Peter.Christen/SATOMGI
  • Lecture slides, practical sessions material, links to further resources

Lecture outline
• Module overview
• Very short introduction to data mining
• Example applications of data mining
• Definitions of data mining
• The data mining process
• Data mining is multi-disciplinary
• Data mining challenges
• Short history of data mining
• Some data mining resources
• Data mining books

Very short introduction to data mining (1)
• Many government agencies, businesses, and research projects collect massive amounts of data
  • Ten largest decision support databases range from 17 to 100 Terabytes (1 Terabyte = 1,024 Gigabytes = 1,048,576 Megabytes)
  • Ten largest transaction-processing databases range from 6 to 23 Terabytes
  • Sizes have tripled between 2003 and end of 2005!
  • Source: http://wintercorp.com/VLDB/2005_TopTen_Survey/TopTenProgram.html
• Questions arise:
  • Is there any new, unexpected and potentially useful information in such large data collections?
  • Can we use historical data to predict future outcomes (such as customer behaviour, or whether a transaction is fraudulent)?
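
As a quick check of the storage arithmetic quoted for these databases, the conversion can be sketched in a few lines of Python. This is only an illustration, assuming the binary convention implied by the slide (1 Terabyte = 1,024 Gigabytes and 1 Gigabyte = 1,024 Megabytes); the function name `terabytes_to_megabytes` is hypothetical and not part of the module material.

```python
# Binary storage units (assumption: 1 TB = 1,024 GB and 1 GB = 1,024 MB).
GB_PER_TB = 1024
MB_PER_GB = 1024


def terabytes_to_megabytes(terabytes):
    """Convert a size in Terabytes to Megabytes using binary units."""
    return terabytes * GB_PER_TB * MB_PER_GB


if __name__ == "__main__":
    print(terabytes_to_megabytes(1))  # 1048576
    # Sizes quoted for the smallest and largest of the ten largest
    # decision support databases.
    for size_tb in (17, 100):
        print(size_tb, "TB =", terabytes_to_megabytes(size_tb), "MB")
```

Under these assumptions one Terabyte corresponds to 1,024 × 1,024 Megabytes, so even the smallest database in the survey holds many millions of Megabytes.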

Very short introduction to data mining (2)
• Data mining involves:
  • Database and data warehouse technologies
  • Machine learning and artificial intelligence
  • Statistics
  • Numerical mathematics
  • Parallel and high-performance computing
  • Visualisation
• Data mining techniques:
  • Data cleaning and pre-processing (lecture 2)
  • Data integration and matching (lecture 6)
  • Cluster analysis (lecture 3)
  • Frequent patterns and associations (lecture 4)
  • Classification and prediction (lecture 5)
  • Outlier detection

Very short introduction to data mining (3)
• Data mining is applied in many areas:
  • Retail
  • Bioinformatics and health
  • Governments (statistics, census, taxation, social welfare)
  • Credit card and insurance companies
  • Terror, crime and fraud detection, national security
  • Networking and telecommunications
• Data mining applications:
  • Spatial and temporal data mining
  • Text and Web data mining
  • Data stream and time-series mining
  • Sequence mining (e.g. DNA, proteins)
  • Graph and network data mining
  • Multimedia data mining (audio, images, video)

Example application 1: Telecommunication
• Huge amounts of data are collected on a daily basis
  • Transactional data about each phone call (data on mobile phones, land-line phones, Internet, etc.)
  • Customer data (billing, personal information, etc.)
  • Additional data (network load, faults, etc.)
• Possible questions
  • Which customer group is highly profitable, which one is not?
  • To which customers should we advertise what kind of special offers?
  • What kind of call rates would increase profit without losing good customers?
  • How do customer profiles change over time?
  • Fraud detection (stolen mobile phones)
  • Network load predictions

Example application 2: Health
• Different aspects of the health system
  • Personal health records (at general practitioners and specialists)
  • Hospital data (e.g. admission data, midwives data, surgery data, etc.)
  • Nursing homes and death data (admissions, causes, medications, etc.)
  • Billing information (Medicare, Pharmaceutical Benefits Scheme)
  • Private health insurance and ambulance/emergency data
• Possible questions
  • Are doctors following the procedures (e.g. prescription of medication)?
  • Can we predict adverse drug reactions (analysis of multiple linked data collections to find correlations)?
  • Are people committing fraud (e.g. doctor shoppers)?
  • Are there correlations between social and environmental issues and people's health (temporal and spatial analysis of linked data collections)?

Example application 3: Astronomy
• Terabytes of images and other data from telescopes and satellites
  • Large-area sky surveys in optical, infrared, and radio wavelengths
  • Time-series data
• Possible questions
  • Classification of objects (stars, galaxies, pulsars, quasars, etc.)
  • Detect (large scale) structures in the data
  • Find rare, unusual, or even previously unknown types of astronomical objects and phenomena
• MACHO (MAssive Compact Halo Objects) (ANU and US)
  • Search for dark matter, objects like brown dwarfs or planets in the Milky Way

Definitions of data mining
• Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro and Smyth, 1996)
• An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis. (http://www.twocrows.com/glossary.htm)
• Try also: http://www.google.com, search term: "define: data mining"

Definitions of data mining (2)
• Essential in definitions is:
  • ... non-trivial extraction ...
  • ... previously unknown or novel ...
  • ... potentially useful information ...
  • ... understandable and interesting ...
  • ... large amounts of data ...
  • ... prediction and modelling ...
• Data mining is often also called Knowledge Discovery in Databases (KDD)
• Some say data mining is only one essential step in the KDD process

The data mining / KDD process
• Data mining is an interactive process
• Data mining = Build Model(s)
• Typically up to 90% of time and effort are spent in the first three steps (in CRISP-DM: business understanding, data understanding, and data preparation)!
(Follows: CRoss Industry Standard Process for Data Mining, http://www.crisp-dm.org/)

The data mining / KDD process (2)
(Figure: the KDD process. Source: Han and Kamber, DM Book, 2nd Ed., Copyright © 2006 Elsevier Inc.)

Data mining and business intelligence
(Figure: layers with increasing potential to support business decisions, listed from top to bottom.)
• Decision Making
• Data Presentation (Visualization Techniques)
• Data Mining (Information Discovery)
• Data Exploration (Statistical Summary, Querying, and Reporting)
• Data Preprocessing/Integration, Data Warehouses
• Data Sources (Paper, Files, Web documents, Scientific experiments, Database Systems)
Roles shown alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)

Data mining is multi-disciplinary
(Figure: data mining at the centre, drawing on the disciplines below.)
• Database Technology
• Statistics
• Machine Learning
• Visualisation
• Pattern Recognition
• Algorithms
• Other Disciplines
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)

Major challenges in data mining
• Data size
  • The size of data collections grows faster than linearly, doubling around every 18 months (similar to Moore's law of processor speed)
  • Scalable algorithms are needed
• Data complexity
  • Different types of data (database tables, free text, HTML, XML, multimedia)
  • Dimensionality of the data increases (more attributes)
  • The curse of dimensionality affects many algorithms (for example finding nearest neighbours in high dimensions)
• Privacy and confidentiality
  • Data mining can reveal details about people which are not available otherwise
  • Linking and matching data is especially critical / controversial

Ten grand challenges in data mining (U. Fayyad)
• Technical challenges
  • How does the data grow?
  • Scalability (of algorithms)
  • Complexity/understandability trade-off
  • Interestingness
  • A theory for what we do
• Pragmatic challenges
  • Where is the data?
  • Embedding algorithms and solutions within operational systems
  • Integrating domain knowledge
  • Managing and maintaining models
  • Effectiveness measurement
(Source: http://www.acm.org/sigs/sigkdd/explorations/, Editorial, vol 5, no 2, Dec. 2003)

Short history of data mining
• The term data mining was first mentioned by statisticians several decades ago, but with a different meaning compared to today: data dredging (the inappropriate, sometimes deliberately so, search for statistically significant relationships in large quantities of data; from Wikipedia)
• First workshops on knowledge discovery in databases in the late 1980s and early 1990s (part of IJCAI (Artificial Intelligence) and ACM SIGMOD (Management of Data) conferences)
• First data mining conferences in the mid 1990s
• Many more conferences since the early 2000s
• So data mining is now in its teen years (around 18 years old)

Data mining resources (1)
• Conferences
  • ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (since 1995)
  • European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) (since 1997)
  • Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) (since 1997)
  • SIAM (Society for Industrial and Applied Mathematics) International Conference on Data Mining (since 2001)
  • IEEE (Institute of Electrical and Electronics Engineers) International Conference on Data Mining (ICDM) (since 2001)
  • Australasian Data Mining Conference (AusDM) (workshop since 2002, conference since 2004)

Data mining resources (2)
• Journals
  • Springer Data Mining and Knowledge Discovery, http://www.springerlink.com/content/1573-756X
  • Springer Knowledge and Information Systems, http://www.springerlink.com/content/0219-3116
  • IEEE Transactions on Knowledge and Data Engineering, http://www.computer.org/tkde/
  • ACM SIGKDD Explorations, http://www.acm.org/sigs/sigkdd/explorations
  • ACM Transactions on Knowledge Discovery from Data, http://tkdd.cs.uiuc.edu/

Data mining resources (3)
• Web resources
  • http://www.kdnuggets.com/ (news, software, jobs, courses, conferences, data repositories, polls, and more)
  • http://www.kmining.com (news, definitions, people, conferences)
  • http://www.iapa.org.au (Institute of Analytics Professionals of Australia)
  • http://www.togaware.com/analytics/ (Canberra Analytics Group)
  • http://www.acm.org/sigs/sigkdd/ (ACM Special Interest Group on KDD)
  • http://www.dmg.org (Data Mining Group, PMML)
  • http://www.togaware.com/ (Graham Williams, ATO)
  • http://datamining.anu.edu.au/
  • http://kdd.ics.uci.edu/ (UCI Knowledge Discovery in Databases Archive)

Lecture summary
• Data mining is concerned with finding novel, valid and potentially useful information in large data collections
• It is a relatively new field that draws from many different disciplines
• Data mining is an iterative process
• Business and data understanding, as well as data preparation, are major components of data mining
• Major challenges in data mining are the growing size and complexity of data collections, privacy issues, interestingness and understandability
• Data mining is being applied in many areas

Data mining books
• There are many different books on data mining available, with different focus (statistics, science, business, etc.)