Lluis Belanche + Alfredo Vellido
Intelligent Data Analysis and Data Mining
or …
Data Analysis and Knowledge Discovery
a.k.a. Data Mining II
Office 319, Omega, BCN
EET, office 107, TR‐2, Terrassa
[email protected]
skype, gtalk: avellido
Tels.: 934137796, 937398090
www.lsi.upc.edu/~avellido/teaching/data_mining.html
raco.fib.upc.edu/home/assignatura?espai=270717
raco.fib.upc.edu/home/assignatura?espai=270650
IDADM
Contents of the course … (but who knows)
1. Introduction to DM and its methodologies
2. Visual DM: Exploratory DM through visualization
3. Pattern recognition 1
4. Pattern recognition 2
5. Feature extraction
6. Feature selection
7. Error estimation
8. Linear classifiers, kernels and SVMs
9. Probability in Data Mining
10. Nonlinear Dimensionality Reduction (NLDR)
11. Applications of NLDR: biomed & beyond
12. DM Case studies
2013/2014. Alfredo Vellido
An Introduction to Mining (1)
What is DATA MINING?
What is DATA MINING? (1)
“Data Mining is the process of discovering actionable and meaningful patterns, profiles, and trends by sifting through your data using pattern recognition technologies (…) is a hot new technology about one of the oldest processes of human endeavour: pattern recognition (…) It is an iterative process of extracting knowledge from business transactions (…) DM is the automatic discovery of usable knowledge from your stored data.”
Jesús Mena: Data Mining your Website
(Digital Press, 1999, available @ books.google)
What is DATA MINING? (2)
“Data Mining, by its simplest definition, automates the detection of relevant patterns in a database (…) For many years, statisticians have manually “mined” databases (…) DM uses well‐established statistical and machine learning techniques to build models that predict customer behaviour. Today, technology automates the mining process, integrates it with commercial data warehouses, and presents it in a relevant way for business users (…) the leading DM products address the broader business and technical issues, such as their integration into complex IT environments.”
Berson, Smith, & Thearling: Building Data Mining Applications for CRM (McGraw‐Hill, 2000)
What is DATA MINING? (3)
WIKIPEDIA 2005 DIXIT: “Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" (1) and "The science of extracting useful information from large data sets or databases" (2). Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.”
(1) W. Frawley and G. Piatetsky‐Shapiro and C. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, 1992, 213‐228.
(2) D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, 2001.
en.wikipedia.org/wiki/Data_mining
What is DATA MINING? (4)
WIKIPEDIA’06 DIXIT: “Data mining (DM), also called Knowledge‐Discovery in Databases (KDD) or Knowledge‐Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition.”
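As an illustration of what "patterns such as association rules" can mean in practice, here is a minimal Python sketch that scores one rule by its support and confidence. The transactions and the function names `support` and `confidence` are made up for this example; real association-rule miners search over many candidate rules rather than scoring a single one.

```python
# Toy market-basket transactions (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent): support(A and C) / support(A)."""
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / support(antecedent, baskets)


# Score the hypothetical rule "bread -> milk".
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"bread -> milk: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

Support measures how often the whole rule occurs in the data; confidence measures how often the consequent appears among the baskets that already contain the antecedent.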
DAKD, KDD, KDDM …
In 1996, in the proceedings of the 1st International Conference on KDD, Fayyad gave one of the best‐known definitions of Knowledge Discovery from Data:
“The non‐trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
KDD quickly gathered strength as an interdisciplinary research field in which a combination of advanced techniques from Statistics, Artificial Intelligence, Information Systems, and Visualization is used to tackle knowledge acquisition from large databases. The term Knowledge Discovery from Data appeared in 1989, referring to the:
“[...] overall process of finding and interpreting patterns from data, typically interactive and iterative, involving repeated application of specific data mining methods or algorithms and the interpretation of the patterns generated by these algorithms.”
What is DATA MINING? (6)
WIKIPEDIA’08 DIXIT: “Data mining is the process of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations, and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases." Data mining in relation to enterprise resource planning is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.”
What is DATA MINING? (7)
WIKIPEDIA’10 gave up:
What is DATA MINING? (8)
… but never lose your faith … W’13
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre‐processing, model and inference considerations, interestingness metrics, complexity considerations, post‐processing of discovered structures, visualization, and online updating.
The term is a buzzword, and is frequently misused to mean any form of large‐scale data or information processing (collection, extraction, warehousing, analysis, and statistics) but is also generalized to any kind of computer decision support system ...
A different (practical) approach to the definition of DM: What to expect from a DM conference…
15‐17 September’04: Wessex Institute of Technology (W.I.T.)
What to find in a DM conference…
Sessions 1 & 2: Text Mining
Session 3: Web Mining
Session 4: Clustering Techniques
Session 5: Data Preparation Techniques
Session 6 & 7: Applications in Business, Industry and Government
Session 8: Customer Relationship Management (CRM)
Session 9 & 10: Applications in Science and Engineering
What to find in a DM conference (three years later)… 2007
Session 1: Categorisation Methods
Session 2: Data Preparation
Session 3: Enterprise Information Systems
Session 4: Clustering Techniques
Session 5: National Security
Session 6: Data and Text Mining
Session 7: Mining Environmental and Geospatial Data
Session 8: Applications in Business, Industry and Government
What to find in the dark last few years …
Investigative Data Mining For Security And Criminal Detection
Jesús Mena
Butterworth‐Heinemann 2003
A different conference, a different take …
IEEE CIDM 2012, Brussels
2012 IEEE Symposium on Computational Intelligence and Data Mining
• Data mining foundations
• Novel data mining algorithms in traditional areas (such as classification, regression, clustering, probabilistic modeling, and association analysis)
• Algorithms for new, structured, data types (chemistry, biology, environment, and other scientific domains)
• Developing a unifying theory of data mining
• Mining sequences and sequential data
• Mining spatial and temporal datasets
• Mining textual and unstructured datasets
• High performance implementations of data mining algorithms
• Mining in targeted application contexts
• Mining high speed data streams
• Mining sensor data
• Distributed data mining and mining multi‐agent data
• Mining in networked settings: web, social and computer networks, and online communities
• Data mining in electronic commerce, such as recommendation, sponsored web search, advertising, and marketing tasks
• Methodological aspects and the KDD process
• Data pre‐processing, data reduction, feature selection, and feature transformation
• Quality assessment, interestingness analysis, and post‐processing
• Statistical foundations for robust and scalable data mining
• Handling imbalanced data
• Automating the mining process and other process related issues
• Dealing with cost sensitive data and loss models
• Human‐machine interaction and visual data mining
• Security, privacy, and data integrity
• Integrated KDD applications and systems
• Bioinformatics, computational chemistry, geoinformatics, and other science & engineering disciplines
• Computational finance, online trading, and analysis of markets
• Intrusion detection, fraud prevention, and surveillance
• Healthcare, epidemic modeling, and clinical research
• Customer relationship management
• Telecommunications, network and systems management
But let’s talk money ...
Starved for ca$h?: ask your TIA
The T.I.A.
The W Bush years
“The Total Information Awareness (TIA) program may have been killed by congressional decree, but key elements of the program have survived at other intelligence agencies, according to congressional, federal, and research officials. TIA's goal was to employ data‐mining to sift through public and private databases to track terrorists, which stirred up fears that the program would be used to spy on millions of innocent Americans.”
“Congressional officials have not disclosed which TIA programs were eliminated and which were retained, but insiders report that TIA's Evidence Extraction and Link Discovery projects, collectively encompassing 18 data‐mining initiatives, are among the surviving components.”
“Despite the death of TIA, Capitol Hill is still paying for the development of software designed to collect foreign intelligence on terrorists: a $64 million research program run by the Advanced Research and Development Activity (ARDA), which has employed some of the same researchers as TIA, was left untouched by Congress.”
www.darpa.mil
What’s DATA MINING?: A procedural viewpoint
What’s DATA MINING?: A historicist viewpoint
[Figure: diagram relating DM and KDD to Statistics, Pattern Recognition, Artificial Intelligence, Expert Systems, Machine Learning, and DB Management]
[Figure, second view: diagram relating KDD to Statistics, Artificial Intelligence, Machine Learning, and Advanced Probabilistic Models, with branches labelled Probabilistic Models, Algorithm Development, Bio-plausible Models, and others]
DATA MINING as a methodology
CRISP: a DM methodology
CRoss‐Industry Standard Process for Data Mining: neutral methodology from the point of view of industry, tool and application (free & non‐proprietary)
Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS), Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler)
CRISP‐DM was conceived in 1996
DaimlerChrysler: leaders in industrial application, SPSS: leaders in product development (Clementine, 1994), NCR: owners of large (huge!) databases (Teradata)
Financed by the EU. Version 1.0 released officially in 1999
CRISP: Hierarchic structure of the methodology
CRISP: Description of phases
Problem understanding: study of targets and requirements from the business/problem viewpoint, and their definition as a DM problem.
Data understanding: data collection; getting to know the data, trying to detect both quality problems and interesting features.
Data preparation: preparing the data set to be modelled, starting from raw data. This is an iterative and exploratory process: selection of files, tables, variables, and record samples, plus data cleaning.
Modelling: data analysis using modelling techniques suitable for the problem at hand. Includes fiddling with the models, tuning their parameters, etc.
Evaluation: all previous steps must be evaluated as a whole (as a unitary process), and we must decide whether the deliverables so far meet the DM challenge.
Implementation: all the knowledge acquired to this point must be organized and presented to the “client” in a usable form. We must define, together with this client, a protocol to reliably deploy the DM findings.
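The six phases above can be read as an ordered process with a feedback loop. The Python sketch below is only an illustration of that reading: the `Phase` enum, `next_phase`, and `run_project` are hypothetical names, and the rule that a failed evaluation sends the project back to problem understanding is an assumption made for the example, not something the CRISP documentation prescribes in this form.

```python
from enum import Enum


class Phase(Enum):
    """The six phases named on the slide, in their nominal order."""
    PROBLEM_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4
    EVALUATION = 5
    IMPLEMENTATION = 6


def next_phase(current, goals_met=True):
    """Return the phase that follows `current`.

    Assumed rule: a failed evaluation restarts the cycle at problem
    understanding; otherwise phases advance in order, and
    implementation is terminal (returns None).
    """
    if current is Phase.EVALUATION and not goals_met:
        return Phase.PROBLEM_UNDERSTANDING
    if current is Phase.IMPLEMENTATION:
        return None
    return Phase(current.value + 1)


def run_project(evaluation_outcomes):
    """Walk the phases, consuming one boolean per evaluation reached."""
    outcomes = iter(evaluation_outcomes)
    visited = []
    phase = Phase.PROBLEM_UNDERSTANDING
    while phase is not None:
        visited.append(phase)
        goals_met = True
        if phase is Phase.EVALUATION:
            # If the list runs out, assume the evaluation succeeds.
            goals_met = next(outcomes, True)
        phase = next_phase(phase, goals_met)
    return visited


# One failed evaluation, then a successful one.
trace = run_project([False, True])
print(" -> ".join(p.name for p in trace))
```

In this toy run the project passes through evaluation twice before reaching implementation, which mirrors the iterative character of the process described above.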
CRISP: The virtuous loop of methodology phases
Use of DM methodologies (2004)
Enterprise Miner™: SEMMA
The acronym SEMMA ‐‐ Sample, Explore, Modify, Model, Assess ‐‐ refers to the core process of conducting data mining. Beginning with a statistically representative sample of your data, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy.
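To make the five SEMMA steps concrete, here is a minimal sketch in plain Python (standard library only). It does not use Enterprise Miner or any SAS API; the synthetic data, the single-predictor least-squares model, and every variable name are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic population (an assumption): y depends linearly on x plus noise.
population = []
for _ in range(10_000):
    x = random.uniform(0, 100)
    population.append((x, 3.0 * x + 5.0 + random.gauss(0, 10)))

# Sample: draw a simple random sample and hold part of it out.
sample = random.sample(population, 500)
train, holdout = sample[:400], sample[400:]

# Explore: basic summary statistics of the predictor.
train_x = [x for x, _ in train]
x_mean = statistics.mean(train_x)
x_std = statistics.stdev(train_x)


# Modify: standardise the predictor using the training statistics.
def standardise(x):
    return (x - x_mean) / x_std


# Model: ordinary least squares with a single predictor.
zs = [standardise(x) for x in train_x]
ys = [y for _, y in train]
z_mean, y_mean = statistics.mean(zs), statistics.mean(ys)
slope = (sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys))
         / sum((z - z_mean) ** 2 for z in zs))
intercept = y_mean - slope * z_mean


# Assess: root mean squared error on the held-out records.
def predict(x):
    return intercept + slope * standardise(x)


rmse = (sum((predict(x) - y) ** 2 for x, y in holdout) / len(holdout)) ** 0.5
print(f"slope={slope:.2f} intercept={intercept:.2f} holdout RMSE={rmse:.2f}")
```

Holding out part of the sample for the Assess step is a design choice of this sketch: it keeps the accuracy estimate separate from the records used to fit the model.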
Use of DM methodologies (2004 → 2007)
2004
2007
CRISP: Phases: Problem understanding
[Figure: the six CRISP phases (problem understanding, data understanding, data preparation, modelling, evaluation, implementation), with the tasks and outputs of problem understanding expanded]
Determine problem goals: background; problem goals; problem success criteria.
Assess situation: inventory of resources; requirements, assumptions and limitations; risks and contingencies; terminology; costs & benefits.
Determine DM goals: DM goals; DM success criteria.
Produce project plan: project plan; initial selection of tools.
DM application areas (’06‐>’09)
DM application areas (’09‐>’10)
DM application areas (’10‐>’11)
CRISP: Phases: Data understanding
[Figure: the six CRISP phases, with the tasks and outputs of data understanding expanded]
Obtain initial data: initial data report.
Data description: data description report.
Data exploration: data exploration report.
Data quality verification: data quality report.
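The description, exploration, and quality reports of the data understanding phase can be sketched with a few lines of Python. The records, field names, and the valid range for `age` below are invented for illustration, and the three functions are hypothetical helpers rather than part of any CRISP tooling.

```python
import statistics

# Toy records standing in for the initial data (made up for illustration).
records = [
    {"age": 34, "income": 28000.0, "segment": "A"},
    {"age": 51, "income": None, "segment": "B"},
    {"age": 29, "income": 31500.0, "segment": "A"},
    {"age": -1, "income": 40250.0, "segment": None},
]


def describe(rows):
    """Data description: record count and value types seen per field."""
    fields = sorted({key for row in rows for key in row})
    types = {
        f: sorted({type(row[f]).__name__ for row in rows
                   if row.get(f) is not None})
        for f in fields
    }
    return {"n_records": len(rows), "field_types": types}


def explore(rows, field):
    """Data exploration: simple statistics for one numeric field."""
    values = [row[field] for row in rows if row.get(field) is not None]
    return {"min": min(values), "max": max(values),
            "mean": statistics.mean(values)}


def quality(rows, valid_ranges):
    """Data quality: missing and out-of-range values per field."""
    report = {}
    for field in sorted({key for row in rows for key in row}):
        values = [row.get(field) for row in rows]
        missing = sum(v is None for v in values)
        low, high = valid_ranges.get(field, (None, None))
        out_of_range = sum(
            1 for v in values
            if v is not None and low is not None and not (low <= v <= high)
        )
        report[field] = {"missing": missing, "out_of_range": out_of_range}
    return report


description = describe(records)
age_stats = explore(records, "age")
quality_report = quality(records, {"age": (0, 120)})  # assumed valid range
print(description)
print(age_stats)
print(quality_report)
```

Here the negative `age` is flagged as out of range and the empty `income` and `segment` values are counted as missing, which is the kind of finding a data quality report is meant to surface before data preparation starts.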