Data Mining
Instructor information – Data Mining
Who is your instructor?
Degree in Computer Science from the University of Salerno (1994)
Consulting contracts with OAC - Osservatorio Astronomico di Capodimonte (1995 – 1999)
Research astronomer at INAF – OAC (1999 – retirement (maybe??))
Adjunct lecturer in Computer Architecture at the Computer Science Department of the University
Federico II of Naples (2002 – 2007)
Associate lecturer in Astronomical Technologies at the Physics Department of Federico II (since 2008)
Design and construction of large telescopes and instruments (optics, electronics, software
engineering, quality control), project management, data mining and machine learning for
large astrophysical data archives
OAC office tel. 081.5575553 – mobile 338.5354945 - e-mail: [email protected]
http://www.na.astro.it/~brescia.html – http://dame.dsf.unina.it
Course presentation – Data Mining
The course is organized around the following topics:
12 lectures of 4 hours each = 50 hours in total (the last 2 lectures last 5 hours)
a) fundamentals of e-science and data warehousing;
b) fundamentals of data mining and Artificial Intelligence;
c) fundamentals of supervised machine learning;
d) fundamentals of unsupervised machine learning;
e) fundamentals of ICT for data mining;
f) practical examples of data mining and use cases;
Course presentation – Data Mining
The recent worldwide recognition of the concept of data-centric science has driven a rapid
spread and proliferation of new data mining methodologies. The key concept follows from the
fourth paradigm of modern science, namely "Knowledge Discovery in Databases" or KDD,
after theory, experimentation and simulations. One of the main causes has been the evolution
of technology and of all basic and applied sciences, which make the efficient exploration of
data the main route to new discoveries.
Data mining therefore aims to manage and analyze huge amounts of heterogeneous data,
by means of self-adaptive techniques and algorithms belonging to the Machine Learning
paradigm.
This course thus intends to provide the fundamental concepts underlying the theory of data
mining, data warehousing and Machine Learning (neural networks, fuzzy logic, genetic
algorithms, Soft Computing), together with practical techniques derived from the state of the
art of Information & Communication Technology (web 2.0 technologies, distributed computing
and an introduction to programming on parallel architectures).
The course will include examples of the development of data mining models, using
programming languages (C, C++, Java, CUDA C).
Course presentation – Data Mining
Course format:
1) Lectures (slides);
2) Group discussion;
3) (Exercises) and practical examples;
The lectures will be based on slides almost exclusively in English (the literature is in fact
mostly in English, and it is worth getting used to reading texts in a language other than
Italian!)
At the end of each lecture we will devote some time to an open discussion.
All course material, including slides, bibliography and useful web links, is available
through a web page maintained by the instructor:
http://dame.dsf.unina.it/master.html
Data Mining
Data mining is the core of the knowledge discovery process.
[Figure: the KDD process – Databases → Data Integration / Data Cleaning → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
Data, Data everywhere, yet ...
I can’t find the data I need
– data is scattered over the network
– many versions and formats
I can’t get the data I need
– need an expert to get the data
I can’t understand the data I found
– available data is poorly documented
I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
Most data will never be seen by humans…
Cascade of data
1 ZB = 1,000,000,000,000 GB = 10^9 TeraByte
Tsunami of data
Small, big, in a network, isolated … modern devices produce large
amounts of data, and every piece of data that is produced needs to be
reduced, analyzed, interpreted…
Increase in number and size of
devices or in efficiency or in
number of bands … all cause an
increase in pixels (worldwide)
Computing time and costs do not
scale linearly with number of
pixels
Moore’s law does not apply
anymore. The slopes have changed.
International Technology Roadmap for Semiconductors
Tsunami of data
For over two decades, before the advent of multi-core architectures,
general-purpose CPUs were characterized, at each generation, by an
almost linear increase in performance together with decreasing costs,
a trend known as Moore’s Law (Moore 1965).
Increase in number and size of
devices or in efficiency or in
number of bands … all cause an
increase in pixels (worldwide)
Computing time and costs do not
scale linearly with number of
pixels
So, in order to maintain the
cyclic hardware/software trend,
software applications had to
change perspective, moving
towards parallel computing.
Tsunami of data
The forerunner:
LHC
Computationally demanding, but still
a relatively simple (embarrassingly
parallel) KDD task:
each CPU gets one event at a time
and needs to perform simple tasks
Data Stream: 330 TB/week
ATLAS detector event
Tsunami of data
DATA INTENSIVE SCIENCE HAS BECOME A REALITY IN
ALMOST ALL FIELDS and poses worse problems
• Huge data sets (ca. PByte) – in astronomy as in many other sciences
• Thousands of different problems
• Many, many thousands of users
i.e. LHC is a “piece of cake” (simple computational model)
Tsunami of data
Jim Gray
“One of the greatest challenges for
21st-century science is how we
respond to this new era of data
intensive science …
… This is recognized as a new
paradigm beyond experimental and
theoretical research and computer
simulations of natural phenomena – one that requires new tools,
techniques, and ways of working.”
Tsunami of data
Real world physics is too
complex. Validation of models
requires accurate simulations,
tools to compare simulations
and data, and better ways to
deal with complex & massive
data sets
Cosmological simulation.
The total number of
particles is 2,097,152
Need to increase
computational and
algorithmic capabilities
beyond current and
expected technological
trends
A new Science concept
Virtualization of Science and
Scholarship
Summary
Overture
• The world transformed
• Climbing the S-Curve
Science in the exponential world
Virtual Observatory: a case study
• The modern scientific process
eScience and the new paradigms
The evolution of computing
• Scientific communication and collaboration
The rise of immersive virtual environments: Web 3.0?
• The growing synergies
Exploring and building in cyberspace
Definitions
Definition: By Virtualization, I mean a migration of the scholarly work, data,
tools, methods, etc., to cyber-environments, today effectively the Web
This process is of course not limited to science and scholarship;
essentially all aspects of the modern society are undergoing the same
transformation
Cyberspace (today the Web, with all information and tools it connects) is
increasingly becoming the principal arena where humans interact with each
other, with the world of information, where they work, learn, and play
ICT Revolution
The Information & Communication Technology
revolution is historically unprecedented in its impact: it is like the industrial
revolution and the invention of printing
combined.
Yet, most fields of science and scholarship have not
yet fully adopted the new ways of doing things, and
in most cases do not understand them well…
It is a matter of developing a new methodology of
science and scholarship for the 21st century
eScience
What Is This Beast Called e-Science?
It depends on whom you ask, but some
general properties include:
• Computationally enabled
• Data-intensive
• Geographically distributed resources (i.e., Web-based)
However:
• All science in the 21st century is becoming cyber-science (aka e-Science) –
so this is just a transitional phase
• There is a great emerging synergy of the computationally enabled science,
and the science-driven IT
Facing the Data Tsunami
Astronomy, all sciences, and every other modern field
of human endeavor (commerce, security, etc.) are
facing a dramatic increase in the volume and
complexity of data
• We are entering the second phase of the IT revolution: the rise of the
information/data driven computing
The challenges are universal, and growing:
– Management of large, complex,
distributed data sets
– Effective exploration of such data → new
knowledge
Data complexity and volume
Exponential Growth in Data Volumes and
Complexity
Understanding of complex phenomena requires complex data!
Multi-data fusion leads to a more complete, less
biased picture (also: multi-scale, multi-epoch, …)
Numerical simulations are also producing
many TB’s of very complex “data”
Data + Theory = Understanding
An example: Astronomy
Astronomy Has Become Very Data-Rich
• Typical digital sky survey now generates ~ 10 - 100 TB, plus a
comparable amount of derived data products
– PB-scale data sets are on the horizon
• Astronomy today has ~ 1 - 2 PB of archived data, and generates a few
TB/day
– Both data volumes and data rates grow exponentially, with a
doubling time ~ 1.5 years
– Even more important is the growth of data complexity
• For comparison:
Human memory ~ a few hundred MB
Human Genome < 1 GB
1 TB ~ 2 million books
Library of Congress (print only) ~ 30 TB
The reaction
The Response of the Scientific Community to the IT Revolution
• The rise of Virtual Scientific Organizations:
– Discipline-based, not institution based
– Inherently distributed, and web-centric
– Always based on deep collaborations between domain scientists and applied
CS/IT scientists and professionals
– Based on an exponentially growing technology and thus rapidly evolving
themselves
– Do not fit into the traditional organizational structures
– Great educational and public outreach potential
• However: Little or no coordination and interchange between different scientific
disciplines
• Sometimes, entire new fields are created, e.g., bioinformatics, computational biology
The Virtual Observatory
The Virtual Observatory Concept
• A complete, dynamical, distributed, open research environment for the new
astronomy with massive and complex data sets
– Provide and federate content (data, metadata)
services, standards, and analysis/compute
services
– Develop and provide data exploration and
discovery tools
– Harness the IT revolution in the service of
astronomy
– A part of the broader e-Science / CyberInfrastructure
http://ivoa.net
http://us-vo.org
http://www.euro-vo.org
The world is flat
Probably the most important aspect of the IT
revolution in science
Professional Empowerment: scientists and students anywhere with an internet
connection should be able to do first-rate science (access to data and tools)
– A broadening of the talent pool in astronomy, leading to a substantial
democratization of the field
• They can also be substantial contributors, not only consumers
– Riding the exponential growth of the IT is far more cost effective than building
expensive hardware facilities, e.g., big telescopes, large accelerators, etc…
– Especially useful for countries without major research facilities
VO Education and Public Outreach
The Web has a truly
transformative potential
for education at all levels
• Unprecedented opportunities in
terms of the content, broad
geographical and societal range, at
all levels
• Astronomy as a gateway to learning
about physical science in general,
as well as applied CS and IT
VO (also as Virtual Organization) Functionality Today
What we did so far:
• Lots of progress on interoperability, standards, etc.
• An incipient data grid of astronomy
• Some useful web services
• Community training, EPO
What we did not do (yet):
• Significant data exploration and mining tools. That is where the science will
come from!
Thus, little VO-enabled science so far and a slow community buy-in
Development of powerful knowledge discovery tools should be a key priority
Donald Rumsfeld’s Epistemology
There are known knowns,
There are known unknowns, and
There are unknown unknowns
Or, in other words (Data Mining):
1. Optimized detection algorithms
2. Supervised clustering
3. Unsupervised clustering
The Mixed Blessings of Data Richness
Modern digital sky surveys typically contain ~ 10 – 100 TB, detect Nobj ~ 10^8 – 10^9
sources, with D ~ 10^2 – 10^3 parameters measured for each one – and multi-PB data
sets are on the horizon.
Potential for discovery: ~ Nobj or data volume → big surveys; ~ Nsurveys^2 (connections) → data federation
Great! However … DM algorithms scale very badly:
– Clustering ~ N log N → N^2, ~ D^2
– Correlations ~ N log N → N^2, ~ D^k (k ≥ 1)
– Likelihood, Bayesian ~ N^m (m ≥ 3), ~ D^k (k ≥ 1)
Scalability and dimensionality reduction (without a significant loss of information) are
critical needs!
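To make the scaling problem concrete, here is a minimal Python sketch (illustrative only, not from the slides) that times a naive O(N^2 · D) pairwise-distance computation; doubling N roughly quadruples the run time, which is why such brute-force approaches become hopeless at N ~ 10^8 – 10^9 objects.

```python
# Illustrative sketch: naive O(N^2 * D) pairwise distances on random data.
import time
import numpy as np

def pairwise_distances_naive(X):
    """All pairwise Euclidean distances: fine for small N, hopeless for huge N."""
    n = X.shape[0]
    dists = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            dists[i, j] = dists[j, i] = d
    return dists

rng = np.random.default_rng(0)
for n in (100, 200, 400):            # doubling N roughly quadruples the work
    X = rng.normal(size=(n, 50))     # n objects, D = 50 measured parameters each
    t0 = time.perf_counter()
    pairwise_distances_naive(X)
    print(n, "objects:", round(time.perf_counter() - t0, 3), "s")
```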
The Curse of Hyperdimensionality
[Figure: the DM toolkit, the user and visualization as interacting components]
Not a matter of hardware or software, but of new ideas.
Visualization!
A fundamental limitation of human perception: DMAX = 3? 5? 10?
(We can understand mathematically much higher dimensionalities, but
cannot really visualize them; our own Neural Nets are powerful
pattern recognition tools)
Interactive visualization must be a key part of the data mining process.
Dimensionality reduction via machine discovery of patterns/substructures and correlations in the data?
Visualization
Effective visualization is the bridge between quantitative
information and human intuition.
Man cannot understand without images; the image is a
likeness of a corporeal thing, but understanding is of the
universal, abstracted from particulars.
Aristotle, De Memoria et Reminiscentia
Data analysis
The key role of data analysis is to replace the raw complexity
seen in the data with a reduced set of patterns, regularities,
and correlations, leading to their theoretical understanding
However, the complexity (e.g., dimensionality) of data sets and
interesting, meaningful constructs in them is starting to exceed
the cognitive capacity of the human brain
Data understanding
This is a Very Serious Problem!
Hyperdimensional structures (clusters, correlations, etc.) are likely present in
many complex data sets, whose dimensionality is commonly in the range of
D ~ 10^2 – 10^4, and will surely grow
It is not only the matter of data understanding, but also of choosing the
appropriate data mining algorithms, and interpreting their results
• Things are rarely Gaussian in reality
• The clustering topology can be complex
What good are the data if we cannot effectively extract knowledge from
them?
“A man has got to know his limitations”
Dirty Harry, an American philosopher
Knowledge Discovery in Databases
The new Science
Information Technology → New Science
• The information volume grows exponentially
Most data will never be seen by humans!
The need for data storage, network, database-related technologies, standards,
etc.
• Information complexity is also increasing greatly
Most data (and data constructs) cannot be comprehended by humans
directly!
The need for data mining, KDD, data understanding technologies,
hyperdimensional visualization, AI/Machine-assisted discovery …
• We need to create a new scientific methodology on the basis of applied CS
and IT
• Important for practical applications beyond science
Evolution of knowledge
The Evolving Paths to Knowledge
• The First Paradigm: Experiment/Measurement
• The Second Paradigm: Analytical Theory
• The Third Paradigm: Numerical Simulations
• The Fourth Paradigm: Data-Driven Science?
From numerical simulations…
Numerical Simulations:
A qualitatively new (and necessary) way
of doing theory, beyond analytical
approach
Simulation output: a data set, the theoretical
statement, not an equation
Formation
of a cluster of
galaxies
Turbulence in the Sun
…to the fourth paradigm
Is this really something qualitatively new, rather than the same old data
analysis, but with more data?
The information content of modern data sets is so high as to
enable discoveries which were not envisioned by the data
originators (data mining)
Data fusion reveals new knowledge which was implicitly
present, but not recognizable in the individual data sets
Complexity threshold for a human comprehension of complex
data constructs? Need new methods to make the data
understanding possible (machine learning)
Data Fusion + Data Mining + Machine Learning = The Fourth
Paradigm
The fourth paradigm
1. Experiment (ca. 3000 years)
2. Theory (a few hundred years)
mathematical description, theoretical
models, analytical laws (e.g. Newton,
Maxwell, etc.)
3. Simulations (a few tens of years)
complex phenomena
4. Data-Intensive science
(and it is happening now!!)
http://research.microsoft.com/fourthparadigm/
Machine Learning
The Roles for Machine Learning and Machine Intelligence in
CyberScience:
Data processing:
Object / event / pattern classification
Automated data quality control (fault detection and repair)
+
Data mining, analysis, and understanding:
Clustering, classification, outlier / anomaly detection
Pattern recognition, hidden correlation search
Assisted dimensionality reduction for hyperdimensional visualisation
Workflow control in Grid-based apps
Data farming and data discovery: semantic web, and beyond
Code design and implementation: from art to science?
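As a small illustration of machine-assisted dimensionality reduction for hyperdimensional visualization, the following Python sketch (it assumes scikit-learn is available; it is not course material, which uses C/C++/Java/CUDA C) compresses a 100-parameter data set into three components suitable for a pseudo-3D view.

```python
# Illustrative sketch: reduce a 100-dimensional parameter space to 3 dimensions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 100))      # 1000 instances, D = 100 measured parameters

pca = PCA(n_components=3)             # keep only 3 components for a pseudo-3D view
X3 = pca.fit_transform(X)

print("original dimensionality:", X.shape[1])
print("reduced dimensionality:", X3.shape[1])
print("variance explained by 3 components:", pca.explained_variance_ratio_.sum())
```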
The way to produce new science
The old and the new
The Book and
the Cathedral …
… and
the Web, and
the Computer
Technologies for information
storage and access are
evolving, and so does scholarly
publishing
Worlds of knowledge
K. Popper, Objective Knowledge:
An Evolutionary Approach, 1972
Cyberspace is now
effectively World 3,
plus the ways of
interacting with it
Science Commons, or Discovery Space
Data
Archives
Simulations
& Theory
Published
Literature
Communication
& Collaboration
Origins of discovery
A Lot of Science Originates in
Discussions and Constructive
Interactions
This creative process can be
enabled and enhanced using
virtual
interactive
spaces,
including the Web2.0 tools
Computing as a Communication Tool
With the advent of the Web, most computing usage is not
number crunching, but the search, manipulation, and display
of data and information, and increasingly also human
interactions (e.g., much of Web 2.0)
Information as communication
Information Technology as a Communication Medium:
Social Networking and Beyond
• Science originates on the interface between human minds, and between human minds and
data (measurements, structured information, output of simulations)
• Thus, any technology which facilitates these interactions is an enabling technology for
science, scholarship, and intellectual progress more generally
• Virtual Worlds (or immersive VR) are one such technology, and will likely revolutionize
the ways in which we interact with each other, and with the world of information we
create
• Thus, we started the Meta-Institute for Computational Astrophysics (MICA), the first
professional scientific organization based entirely in VWs (Second Life)
http://slurl.com/secondlife/StellaNova
Subjective experience quality much higher
than traditional videoconferencing (and it
can only get better as VR improves)
Effective worldwide telecommuting, at ~
zero cost
Professional conferences easily organized,
at ~ zero cost
Immersive data visualization
Encode up to a dozen dimensions for a
parameter space representation
Interactive data exploration in a pseudo-3D environment
Multicolor SDSS data set on
stars, galaxies and quasars
Immersive mathematical visualization
Pseudo-3D representation of high-dimensional mathematical
objects
Potential research and educational uses: geometry, topology, etc.
A pseudo-3D projection
of a 248-dimensional
mathematical object
Personalization of Cyberspace
We inhabit the Cyberspace as individuals
– and not just for work, but in very
personal ways, to express ourselves,
and to connect with others (“As we
may feel”?)
e-Science is unified by a common
methodology and tools
“We must all hang together, or assuredly we will all
hang separately”
Ben Franklin
The Truth About Social Networking
social networking as the intersection of narcissism, ADHD (Attention Deficit
Hyperactivity Disorder), and good old fashioned stalking
The Core business of Academia
To discover, preserve, and disseminate knowledge
To serve as a source of scientific and technological innovation
To educate the new generations, in terms of the knowledge, skills, and tools
But when it comes to the adoption of computational tools and methods, innovation,
and teaching them to our students, we are doing very poorly – and yet, the science and
the economy of the 21st century depend critically on these issues
Is the discrepancy of time scales
to blame for this slow uptake?
IT ~ 2 years
Education ~ 20 years
Career ~ 50 years
Universities ~ 200 years
(Are universities obsolete?)
Some Thoughts about e-Science
Computational science ≠ Computer science
Computational science includes both numerical modeling and data-driven science.
• Data-driven science is not about data, it is about knowledge
extraction (the data are incidental to our real mission)
• Information and data are (relatively) cheap, but the expertise is
expensive
o Just like the hardware/software situation
• Computer science as the “new mathematics”
o It plays the role in relation to other sciences which
mathematics did in ~ 17th - 20th century
o Computation as a glue/lubricant of interdisciplinarity
Some Transformative Technologies To Watch
Cloud (mobile, ubiquitous) computing
• Distributed data and services
• Also mobile / ubiquitous computing
Semantic Web
• Knowledge encoding and discovery infrastructure
for the next generation Web
Immersive & Augmentative Virtual Reality
• The human interface for the next generation
Web, beyond the Web 2.0 social networking
Machine Intelligence redux
• Intelligent agents as your assistants / proxies
• Human-machine intelligence interaction
A new set of disciplines: X-Informatics
[Diagram: the formation of a new generation of scientists draws on machine learning, data mining, data structures, advanced programming languages, computer networks, visualization, databases, numerical analysis, computational infrastructures, semantics, etc.]
Within any X-informatics discipline, information granules are unique to that discipline, e.g.,
gene sequences in bio, the sky object in astro, and the spatial object in geo (such as points
and polygons in the vector model, and pixels in the raster model). Nevertheless the goals are
similar: transparent data re-use across sub-disciplines and within education settings,
information and data integration and fusion, personalization of user interactions with data
collection, semantic search and retrieval, and knowledge discovery. The implementation of
an X-informatics framework enables these semantic e-science research goals
Some Speculations
We create technology, and it changes us, starting with the grasping of sticks
and rocks as primitive tools, and continuing ever since
When the technology touches our minds, that process can have profound
evolutionary impact in the long term; VWs are one such technology
Development of AI seems inevitable, and its uses in assisting us with the
information management and knowledge discovery are already starting
In the long run, immersive VR may facilitate the co-evolution of human and
machine intelligence
Scientific and Technological Progress
Mining of Warehouse Data
Data Mining + Data Warehouse = Mining of Warehouse Data
• For organizational learning to take place, data must be gathered together
and organized in a consistent and useful way – hence, Data Warehousing (DW);
• DW allows an organization to remember what it has noticed about its data;
• Data Mining techniques should be interoperable with data organized in a DW.
[Diagram: the enterprise “database” (transactions, VO registries, simulations, observations, etc.) is copied, organized and summarized into the Data Warehouse]
Data Miners:
• “Farmers” – they know what they are looking for
• “Explorers” – unpredictable
Data Mining
DM 4-rule virtuous cycle
• Finding patterns is not enough
• Science business must:
  – Respond to patterns by taking action
  – Turning:
    • Data into Information
    • Information into Action
    • Action into Value
• Hence, the Virtuous Cycle of DM:
  1. Identify the problem
  2. Mine data to transform it into actionable information
  3. Act on the information
  4. Measure the results
• Virtuous cycle implementation steps:
  – Transforming data into information via:
    • Hypothesis testing
    • Profiling
    • Predictive modeling
  – Taking action:
    • Model deployment
    • Scoring
  – Measurement:
    • Assessing a model’s stability & effectiveness before it is used
DM: 11-step Methodology
The four rules translate into an 11-step exploded strategy, at the base of the DAME (Data Analysis,
Mining and Exploration) concept:
1. Translate any opportunity (science case) into a DM opportunity (problem)
2. Select appropriate data
3. Get to know the data
4. Create a model set
5. Fix problems with the data
6. Transform data to bring out information
7. Build models
8. Assess models
9. Deploy models
10. Assess results
11. Begin again (GOTO 1)
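A minimal, hypothetical sketch of this cycle on a toy problem is shown below (Python with scikit-learn assumed available; the names and data are illustrative, not DAME code): select data, fix it, transform it, build and assess competing models, "deploy" the best one and measure the results.

```python
# Toy walk-through of the virtuous cycle / 11 steps (illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # step 2: select appropriate data
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # step 1: the science case as a DM problem
X[rng.random(X.shape) < 0.01] = np.nan           # realistic data problems appear...

X = np.nan_to_num(X)                             # step 5: fix problems with the data
X = StandardScaler().fit_transform(X)            # step 6: transform data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)  # step 4

models = [LogisticRegression(), DecisionTreeClassifier()]                        # step 7
scores = [accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te)) for m in models] # step 8
best = models[int(np.argmax(scores))]            # step 9: deploy the best model
print("deployed:", type(best).__name__, "accuracy:", max(scores))                # step 10
# step 11: begin again, using what was learned to refine the problem and the data
```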
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card
transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Terminology
• Components of the input:
  – Concepts: kinds of things that can be learned
    • Aim: an intelligible and operational concept description
  – Instances: the individual, independent examples of a concept
    • Note: more complicated forms of input are possible
  – Features/Attributes: measuring aspects of an instance
    • We will focus on nominal and numeric ones
  – Patterns: ensembles (groups/lists) of features
    • In the same dataset, a group of patterns is usually in a homogeneous format
      (same number, meaning and type of features)
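A tiny illustrative example of this terminology (not taken from the slides): a data set as a matrix in which each row is an instance/pattern, each column a feature, and the class labels form the concept to be learned.

```python
# Illustrative only: instances/patterns as rows, features as columns.
import numpy as np

feature_names = ["magnitude", "color_index", "ellipticity"]   # numeric attributes
patterns = np.array([
    [18.2, 0.31, 0.05],    # instance 1
    [21.7, 1.10, 0.42],    # instance 2
    [19.4, 0.55, 0.38],    # instance 3
])                         # homogeneous format: same number, meaning and type of features
labels = ["star", "galaxy", "galaxy"]   # the concept to be learned from these examples

print(patterns.shape)       # -> (3, 3): 3 instances, 3 features
```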
What’s a DM concept?
• Data Mining Tasks (styles of learning):
  – Classification learning: predicting a discrete class
  – Association learning: detecting associations between features
  – Clustering: grouping similar instances into clusters
  – Sequencing: what events are likely to lead to later events
  – Forecasting: what may happen in the future
  – Numeric prediction (Regression): predicting a numeric quantity
• Concept: the thing to be learned
• Concept description: the output of the learning scheme
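The following sketch (Python with scikit-learn assumed available; toy data, not course material) shows three of these learning styles side by side: classification of a discrete class, regression of a numeric quantity, and clustering of similar instances.

```python
# Illustrative comparison of three learning styles on toy data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # classification: discrete class
from sklearn.linear_model import LinearRegression    # numeric prediction (regression)
from sklearn.cluster import KMeans                   # clustering: grouping similar instances

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y_class = (X[:, 0] > 0).astype(int)                       # a discrete class to predict
y_value = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)   # a numeric quantity to predict

clf = KNeighborsClassifier().fit(X, y_class)
reg = LinearRegression().fit(X, y_value)
clu = KMeans(n_clusters=2, n_init=10).fit(X)

print("classification:", clf.predict([[1.0, 0.0]]))   # -> a class label
print("regression:", reg.predict([[1.0, 0.0]]))       # -> a numeric value
print("clustering:", clu.labels_[:10])                # -> cluster assignments
```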
Effective DM process break-down
Market Analysis and Management
• Where does the data come from? Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  – Find clusters of “model” customers who share the same characteristics: interests, income
    level, spending habits, etc.
  – Determine customer purchasing patterns over time
• Cross-market analysis: find associations/correlations between product sales, and
predict based on such associations
• Customer profiling: what types of customers buy what products (clustering or
classification)
• Customer requirement analysis
  – Identify the best products for different customers
  – Predict what factors will attract new customers
• Provision of summary information
  – Multidimensional summary reports
  – Statistical summary information (data central tendency and variation)
Data quality and integrity problems
• Legacy systems no longer documented
• Outside sources with questionable quality procedures
• Production systems with no built-in integrity checks and no integration
  – Operational systems are usually designed to solve a specific business problem
    and are rarely developed to a corporate plan
  – “And get it done quickly, we do not have time to worry about corporate
    standards...”
• Same person, different spellings
  – Agarwal, Agrawal, Aggarwal etc...
• Multiple ways to denote company name
  – Persistent Systems, PSPL, Persistent Pvt. LTD.
• Use of different names
  – mumbai, bombay
• Different account numbers generated by different applications for the same
customer
• Required fields left blank
• Invalid product codes collected at point of sale
  – manual entry leads to mistakes
  – “in case of a problem use 9999999”
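A minimal, hypothetical cleaning sketch for problems of this kind is given below; the canonical-name lookup tables and the records are invented for illustration only.

```python
# Illustrative cleaning pass: reconcile spellings/names, flag bad codes.
records = [
    {"name": "Agarwal",  "city": "bombay", "product_code": "1234"},
    {"name": "Aggarwal", "city": "mumbai", "product_code": "9999999"},   # "in case of a problem"
    {"name": "Agrawal",  "city": "Mumbai", "product_code": ""},          # required field left blank
]

NAME_CANONICAL = {"agarwal": "Agarwal", "agrawal": "Agarwal", "aggarwal": "Agarwal"}
CITY_CANONICAL = {"bombay": "Mumbai", "mumbai": "Mumbai"}
INVALID_CODES = {"", "9999999"}

def clean(record):
    """Return a reconciled copy of the record, with a validity flag for the code."""
    out = dict(record)
    out["name"] = NAME_CANONICAL.get(record["name"].lower(), record["name"])
    out["city"] = CITY_CANONICAL.get(record["city"].lower(), record["city"])
    out["code_ok"] = record["product_code"] not in INVALID_CODES
    return out

for r in records:
    print(clean(r))
```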
What is a Data Warehouse?
A single, complete and consistent store of data, obtained from a variety of
different sources, made available to end users in a way they can
understand and use in a business/research context.
• Data should be integrated across the enterprise
• Summary data has real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
DW is a process of transforming data into information and
making it available to users in a timely enough manner to
make a difference.
It is a technique for assembling and managing data from various
sources for the purpose of answering business questions, and
thus making decisions that were not previously possible.
The evolution of data analysis

Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups | Prospective, proactive information delivery
Definition of a Massive Data Set
• TeraBytes -- 10^12 bytes:
Astrophysical observation (per night)
• PetaBytes -- 10^15 bytes:
Geographic Information Systems or
Astrophysical Survey Archive
• ExaBytes -- 10^18 bytes:
National Medical Records
• ZettaBytes -- 10^21 bytes:
Weather images
• YottaBytes -- 10^24 bytes:
Intelligence Agency Videos
DM, Operational systems and DW
What makes data mining possible?
• Advances in the following areas are making
data mining deployable:
– data warehousing
– Operational systems
– the emergence of easily deployed data mining
tools and
– the advent of new data mining techniques
(Machine Learning)
OLTP vs OLAP
OLTP and OLAP are complementary technologies. You can't live without OLTP: it runs your
business day by day. Getting strategic information from OLTP is usually a first “quick
and dirty” approach, but it can become limiting later.
OLTP (On Line Transaction Processing) is a data modeling approach typically used to facilitate
and manage usual business applications. Most of the applications you see and use are OLTP
based.
OLAP (On Line Analytic Processing) is an approach to answer multi-dimensional queries.
OLAP was conceived for Management Information Systems and Decision Support Systems,
but is still widely underused: every day I see too many people making out business
intelligence from OLTP data!
With the constant growth
of data analysis and
intelligence applications,
understanding the OLAP
benefits is a must if you
want to provide valid and
useful analytics to the
management.
              | OLTP                                     | OLAP
Application   | Operational: ERP, CRM, legacy apps, ...  | Management Information System, Decision Support System
Typical users | Staff                                    | Managers, Executives
Horizon       | Weeks, Months                            | Years
Refresh       | Immediate                                | Periodic
Data model    | Entity-relationship                      | Multi-dimensional
Schema        | Normalized                               | Star
Emphasis      | Update                                   | Retrieval
Examples of OLTP data systems

Data | Industry | Usage | Technology | Volumes
Customer File | All | Track Customer Details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control Production | (ERP) Enterprise Resource Planning, relational databases | Medium
Why Separate Data Warehouse?
• Operational Systems are OLTP systems (DW is OLAP)
– Run mission critical applications
– Need to work with stringent performance requirements for routine
tasks
– Used to run a business!
– Optimized to handle large numbers of simple read/write transactions
– RDBMS have been used for OLTP systems
Function of DW for DM (outside data mining)
• Missing data: decision support requires historical data, which operational databases do not
typically maintain.
• Data consolidation: decision support requires consolidation (aggregation,
summarization) of data from many heterogeneous sources: operational databases, external
sources.
• Data quality: different sources typically use inconsistent data representations,
codes, and formats, which have to be reconciled.
So, what’s different?
Application-Orientation vs. Subject-Orientation
[Diagram: an operational database is organized by application (Loans, Credit Card, Trust, Savings); a data warehouse is organized by subject (Customer, Vendor, Product, Activity)]
OLTP vs Data Warehouse
• OLTP (run a business)
  – Application Oriented
  – Used to run business
  – Detailed data
  – Current, up to date
  – Isolated Data
  – Repetitive access
  – Office worker User
  – Performance Sensitive
  – Few Records accessed at a time (tens)
  – Read/Update Access
  – No data redundancy
  – Database Size 100 MB – 100 GB
• Warehouse (optimize a business)
  – Subject Oriented
  – Used to analyze business
  – Summarized and refined
  – Snapshot data
  – Integrated Data
  – Ad-hoc access
  – Knowledge User (Manager)
  – Performance relaxed
  – Large volumes accessed at a time (millions)
  – Mostly Read (Batch Update)
  – Redundancy present
  – Database Size 100 GB – few TB
OLAP and Data Marts
A data mart is the access layer of the data warehouse environment that is used to get
data out to the users. The data mart is a subset of the data warehouse that is usually
oriented to a specific business line or team. In some deployments, each department or
business unit is considered the owner of its data mart, including all
the hardware, software and data.
• Data marts and OLAP servers are departmental solutions supporting a handful of users
• Million-dollar massively parallel hardware is needed to deliver fast response times for complex queries
• OLAP servers require massive indices
• Data warehouses must be at least 100 GB to be effective
Components of the Warehouse
• Data Extraction and Loading
• The Warehouse
• Analyze and Query -- OLAP Tools
• Metadata
• Data Mart
• Data Mining
[Diagram: relational databases, ERP systems, purchased data and legacy data pass through an optimized loader / extraction & cleansing stage into the data warehouse engine, with a metadata repository and analyze/query tools on top]
True data warehouses
[Diagram: data sources feed a single central data warehouse, which in turn feeds the data marts]
With data-mart-centric DWs, if you end up creating multiple
warehouses, integrating them is a problem.
DW Query Processing - Indexing
Exploiting indexes to reduce scanning of data is of crucial importance:
• Bitmap Indexes
• Join Indexes
• Other Issues
  – Text indexing
  – Parallelizing and sequencing of index builds and incremental updates
• Bitmap indexing:
  – A collection of bitmaps -- one for each distinct value of the column
  – Each bitmap has N bits, where N is the number of rows in the table
  – A bit corresponding to a value v for a row r is set if and only if r has the value v
    for the indexed attribute
Base Table:
Cust | Region | Rating
C1   | N      | H
C2   | S      | M
C3   | W      | L
C4   | W      | H
C5   | S      | L
C6   | W      | L
C7   | N      | H

Region Index:
Row ID | N | S | E | W
1      | 1 | 0 | 0 | 0
2      | 0 | 1 | 0 | 0
3      | 0 | 0 | 0 | 1
4      | 0 | 0 | 0 | 1
5      | 0 | 1 | 0 | 0
6      | 0 | 0 | 0 | 1
7      | 1 | 0 | 0 | 0

Rating Index:
Row ID | H | M | L
1      | 1 | 0 | 0
2      | 0 | 1 | 0
3      | 0 | 0 | 1
4      | 1 | 0 | 0
5      | 0 | 0 | 1
6      | 0 | 0 | 1
7      | 1 | 0 | 0

Query: customers where Region = W AND Rating = M → AND the “W” bitmap of the Region index with the “M” bitmap of the Rating index.
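The same mechanism can be sketched in a few lines of Python (illustrative only), using integers as bit vectors: one bitmap per distinct column value, with bit i set if and only if row i holds that value; a conjunctive query is then a single bitwise AND.

```python
# Illustrative bitmap index over the toy base table above.
def build_bitmaps(column):
    """One integer bitmap per distinct value; bit i is set iff row i has that value."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

region = ["N", "S", "W", "W", "S", "W", "N"]   # rows C1..C7 of the base table
rating = ["H", "M", "L", "H", "L", "L", "H"]

region_idx = build_bitmaps(region)
rating_idx = build_bitmaps(rating)

def matches(bitmap, n_rows):
    return [f"C{i + 1}" for i in range(n_rows) if (bitmap >> i) & 1]

# Region = W AND Rating = M: just AND the two bitmaps (empty for this toy table).
print(matches(region_idx["W"] & rating_idx["M"], len(region)))   # -> []
# Region = W AND Rating = L: a non-empty example.
print(matches(region_idx["W"] & rating_idx["L"], len(region)))   # -> ['C3', 'C6']
```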
DW Query Processing - Indexing
• Join indexing
  – Pre-computed joins
  – A join index between a fact table and a dimension table correlates a
    dimension tuple with the fact tuples that have the same value on the
    common dimensional attribute
    • e.g., a join index on the city dimension of the Calls fact table
    • correlates, for each city, the calls (in the Calls table) from that city
[Diagram: join indexes on the Calls fact table – C+T (Calls–Time), C+T+L (Calls–Time–Location), C+T+L+P (Calls–Time–Location–Plan)]
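A minimal sketch of the join-index idea (hypothetical data, not from the slides): a pre-computed mapping from each city in the Location dimension to the row ids of its calls in the fact table, so the join is not recomputed at query time.

```python
# Illustrative join index: city -> row ids in the Calls fact table.
calls = [                       # fact table rows: (call_id, city, duration_min)
    (0, "Naples", 3), (1, "Rome", 12), (2, "Naples", 7), (3, "Milan", 1),
]

join_index = {}                 # pre-computed join: city -> list of fact-table row ids
for row_id, (_, city, _) in enumerate(calls):
    join_index.setdefault(city, []).append(row_id)

# "correlates, for each city, the calls from that city"
print(join_index["Naples"])                             # -> [0, 2]
print(sum(calls[i][2] for i in join_index["Naples"]))   # total minutes called from Naples
```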
DW Query Processing - Indexing
•
Parallel query processing:
– Three forms of parallelism
• Independent
• Pipelined
• Partitioned and “partition and replicate”
– Deterrents to parallelism
• startup
• Communication
– Partitioned Data
• Parallel scans
• Yields I/O parallelism
– Parallel algorithms for relational operators
• Joins, Aggregates, Sort
– Parallel Utilities
• Load, Archive, Update, Parse, Checkpoint, Recovery
– Parallel Query Optimization
OLAP Representation
OLAP Is FASMI:
• Fast
• Analysis
• Shared
• Multidimensional
• Information
• Online Analytical Processing – coined by EF Codd in a 1994 paper contracted by Arbor Software*
• Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
• OLAP = Multidimensional Database
• MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
• ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
[Diagram: a data cube with dimensions Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N) and Month (1–7)]
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
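As a small, made-up illustration of the multidimensional view (pandas assumed available; the sales figures are invented), the sketch below builds one face of such a product/region/month cube and then drills down along another dimension.

```python
# Illustrative OLAP-style roll-up and drill-down on a tiny sales table.
import pandas as pd

sales = pd.DataFrame({
    "product": ["Juice", "Cola", "Juice", "Milk", "Cola", "Milk"],
    "region":  ["W", "S", "N", "W", "W", "S"],
    "month":   [1, 1, 2, 2, 3, 3],
    "amount":  [10, 7, 12, 5, 9, 6],
})

# One face of the cube: Product x Month, summed over all regions.
cube_face = pd.pivot_table(sales, values="amount", index="product",
                           columns="month", aggfunc="sum", fill_value=0)
print(cube_face)

# Drill down: same measure, but by region within each product.
print(sales.groupby(["product", "region"])["amount"].sum())
```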
OLAP vs SQL
• Limitation of SQL: “A Freshman in Business needs a Ph.D. in SQL” – Ralph Kimball
• OLAP:
  – powerful visualization paradigm
  – fast, interactive response times
  – good for analyzing time series
  – it finds some clusters and outliers
  – many vendors offer OLAP tools
  – embedded SQL extensions
• Nature of OLAP Analysis:
  – Aggregation (total sales, percent-to-total)
  – Comparison – Budget vs. Expenses
  – Ranking – Top 10, quartile analysis
  – Detailed and aggregate data
  – Complex criteria specification
  – Visualization
Relational OLAP
• Database Layer (Data Warehouse): store atomic data in an industry-standard RDBMS.
• Application Logic Layer (Engine): generate SQL execution plans in the engine to obtain OLAP functionality.
• Presentation Layer (Decision Support Client): obtain multi-dimensional reports from the DS Client.
Multi-Dimensional OLAP
• Database Layer (MDDB Engine): store atomic data in a proprietary MD data structure (MDDB), pre-calculating as many outcomes as possible.
• Application Logic Layer (MDDB Engine): obtain OLAP functionality via proprietary algorithms running against this data.
• Presentation Layer (Decision Support Client): obtain multi-dimensional reports from the DS Client.
Number of Aggregations
OLAP Problem: too many data!
Data Explosion Syndrome
[Chart: number of aggregations vs. number of dimensions (4 levels in each dimension) – the count grows exponentially, from 16 at 2 dimensions up to 65,536 at 8 dimensions]
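A back-of-the-envelope sketch of the explosion (assuming, purely for illustration, that each of D dimensions has 4 hierarchy levels and that the number of pre-computable aggregates grows as 4^D, matching the 65,536 value at 8 dimensions in the chart):

```python
# Illustrative only: exponential growth of the number of aggregations.
LEVELS = 4
for dims in range(2, 9):
    print(dims, "dimensions ->", LEVELS ** dims, "aggregations")
# 8 dimensions already give 65,536 aggregations to pre-compute and store.
```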
OLAP Solution: Metadata
The primary rationale for data warehousing is
to provide businesses with analytics results
from data mining, OLAP and reporting. The
ability to obtain front-end analytics is
reduced if ensuring data quality is expensive
all along the pipeline from data source to
analytical reporting.
Data Flow after Company-wide
Metadata Implementation
With a unified metadata source and
definition, the business can go further
on the analysis journey. OLAP reporting
reaches more widely, with greater access for
all employees. Data mining models are now more
accurate, as the model sets can be scored and
trained on larger data sets.
Data Warehouse pitfalls
• You are going to spend much time extracting, cleaning, and loading data
• Despite best efforts at project management, data warehousing project scope
will increase
• You are going to find problems with the systems feeding the data warehouse
• You will find the need to store data not being captured by any existing system
• You will need to validate data not being validated by transaction processing
systems
• For interoperability among worldwide data centers, you need to move
massive data sets over the network: DISASTER!
Data → Applications? Moving programs, not data: that is the true bottleneck.
Data Mining + Data Warehouse =
Mining of Warehouse Data
• For organizational learning to take place, data must be gathered together and
organized in a consistent and useful way – hence, Data Warehousing (DW);
• DW allows an organization to remember what it has noticed about its data;
• Data Mining apps should be interoperable with data organized in and shared between DWs.
Interoperability scenarios
1) DA1 ↔ DA2 (data+apps exchange between Desktop Applications)
   – Full interoperability between DAs (Desktop Applications)
   – Local user desktop fully involved (requires computing power)
2) DA ↔ WA (data+apps exchange between a Desktop and a Web Application)
   – Full WA → DA interoperability
   – Partial DA → WA interoperability (such as remote file storing)
   – MDS (Massive Data Sets) must be moved between local and remote apps
   – User desktop partially involved (requires minor computing and storage power)
3) WA ↔ WA (data+apps exchange between Web Applications)
   – Except for URI exchange, no interoperability, and different accounting policies
   – MDS must be moved between remote apps (but larger bandwidth)
   – No local computing power required
Improving Aspects
DAs have to become WAs
[Diagram: WA1 and WA2 exchanging plugins]
• Unique accounting policy (Google/Microsoft like)
• To overcome the MDS flow, apps must be plug&play (e.g. any WAx
feature should be pluggable into WAy on demand)
• No local computing power required. Even smartphones can
run VO apps
Requirements
• Standard accounting system;
• No more MDS moving on the web, but just moving Apps, structured as plugin repositories and
execution environments;
• standard modeling of WA and components to obtain the maximum level of granularity;
• Evolution of SAMP architecture to extend web interoperability (in particular for the migration
of the plugins);
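Purely as a hypothetical sketch of the "move apps, not data" idea behind these requirements (all names and structures are invented), the code below lets one web app pull a plugin from a peer's repository and run it next to its own local data set, so only the plugin crosses the network.

```python
# Hypothetical sketch: move the plugin (code) to the data, not the data to the code.
import statistics

class WebApp:
    def __init__(self, name, plugins, local_data):
        self.name = name
        self.plugins = dict(plugins)     # plugin repository: name -> callable
        self.local_data = local_data     # the massive data set stays where it is

    def request_plugin(self, other, plugin_name):
        # plug & play: pull a feature from another WA on demand
        self.plugins[plugin_name] = other.plugins[plugin_name]

    def execute(self, plugin_name):
        return self.plugins[plugin_name](self.local_data)

wa_x = WebApp("WAx", {"median": statistics.median}, local_data=[])
wa_y = WebApp("WAy", {}, local_data=[3, 1, 4, 1, 5, 9, 2, 6])

wa_y.request_plugin(wa_x, "median")      # only the small plugin moves over the network
print(wa_y.execute("median"))            # -> 3.5, computed next to WAy's local data
```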
Plugin granularity flow
[Diagram: WAx hosts plugins Px-1 … Px-n and WAy hosts plugins Py-1 … Py-n; WAy requests plugin Px-3 from WAx and then executes Px-3 locally]
This scheme could be iterated and extended between more standardized web apps.
The Lernaean Hydra
After a certain number of such iterations, the scenario will become:
[Diagram: WAx and WAy each end up hosting the full set of plugins Px-1 … Px-n and Py-1 … Py-n]
• No different WSs, but simply one WS with several sites (eventually
with different GUIs and computing environments)
• All WS sites can become mirror sites of all the others
• The synchronization of plugin releases between WSs is
performed at request time
• Minimization of the data exchange flow (just a few plugins in case of
synchronization between mirrors)
Web 2.0
Web 2.0? It is a system that breaks with the old model of centralized Web sites and moves the
power of the Web/Internet to the desktop. [J. Robb]
The Web becomes a universal, standards-based integration platform. [S. Dietzen]
Conclusions
e-Science is a transitional phenomenon, and will become an overall research
environment of the data-rich, computationally enabled science of the 21st
century
Essentially all of the humanity’s activities are being virtualized in some way,
science and scholarship included
We see growing synergies and co-evolution between science, technology,
society, and individuals, with an increasing fusion of the real and the virtual
Cyberspace, now embodied through the Web and its participants,
is the arena in which these processes unfold
VR technologies may revolutionize the ways in which humans interact with
each other, and with the world of information
A synthesis of the semantic Web, immersive and augmentative VW, and
machine intelligence may shape our world profoundly
REFERENCES
Borne, K. D., 2009. X-Informatics: Practical Semantic Science. American Geophysical Union, Fall Meeting 2009, abstract #IN43E-01 (http://adsabs.harvard.edu/abs/2009AGUFMIN43E..01B)
The Fourth Paradigm, Microsoft Research, http://research.microsoft.com/fourthparadigm/
Thomsen, E., 1997. OLAP Solutions. John Wiley and Sons.
Inmon, W. H., Zachman, J. A., Geiger, J. G., 1997. Data Stores, Data Warehousing and the Zachman Framework. McGraw-Hill Series on Data Warehousing and Data Management.
Inmon, W. H., 1996. Building the Data Warehouse, Second Edition. John Wiley and Sons.
Inmon, W. H., Welch, J. D., Glassey, K. L., 1997. Managing the Data Warehouse. John Wiley and Sons.
Devlin, B., 1997. Data Warehouse: from Architecture to Implementation. Addison Wesley Longman, Inc.
Lin, S. C., Yen, E., 2011. Data Driven e-Science: Use Cases and Successful Applications of Distributed Computing Infrastructures (ISGC 2010). Springer.