Data Mining – About your lecturer

Who is your lecturer?
• Degree in Computer Science, Università di Salerno (1994)
• Consulting contracts with OAC – Osservatorio Astronomico di Capodimonte (1995–1999)
• Research astronomer at INAF – OAC (1999 – retirement, perhaps…)
• Contract lecturer in Computer Architecture, Dept. of Computer Science, Università Federico II di Napoli (2002–2007)
• Associate lecturer in Astronomical Technologies, Dept. of Physics, Federico II (since 2008)
• Design and construction of large telescopes and instruments (optics, electronics, software engineering, quality control), project management, data mining and machine learning for large astrophysical data archives
• OAC office tel. 081.5575553 – mobile 338.5354945 – e-mail: [email protected]
• http://www.na.astro.it/~brescia.html – http://dame.dsf.unina.it

Course outline – Data Mining

The course is organized into 12 lectures of 4 hours each (the last 2 lectures last 5 hours), for a total of 50 hours, covering:
a) fundamentals of e-science and data warehousing;
b) fundamentals of data mining and Artificial Intelligence;
c) fundamentals of supervised machine learning;
d) fundamentals of unsupervised machine learning;
e) fundamentals of ICT for data mining;
f) practical data mining examples and use cases.

The recent worldwide recognition of data-centric Science has driven a rapid spread and proliferation of new data mining methodologies. The key concept follows from the fourth paradigm of modern Science, "Knowledge Discovery in Databases" (KDD), coming after theory, experimentation and simulation. One of the main causes has been the evolution of technology and of all basic and applied sciences, which make efficient data exploration the primary means of new discovery. Data mining therefore aims at managing and analysing huge amounts of heterogeneous data, exploiting self-adaptive techniques and algorithms belonging to the Machine Learning paradigm. This course provides the fundamental concepts underlying data mining, data warehousing and Machine Learning theory (neural networks, fuzzy logic, genetic algorithms, Soft Computing), together with practical techniques drawn from state-of-the-art Information & Communication Technology (Web 2.0 technologies, distributed computing and an introduction to programming on parallel architectures). The course includes examples of data mining model development using programming languages (C, C++, Java, CUDA C).

Course format:
1) frontal lectures (slides);
2) group discussion;
3) (exercises) and practical examples.

Lectures are based on slides almost entirely in English (the literature is mostly in English, so it is worth getting used to reading material in a language other than Italian). At the end of each lecture, part of the time is devoted to open discussion. All course material, including slides, bibliography and useful web links, is available through a web page maintained by the lecturer: http://dame.dsf.unina.it/master.html
Data Mining

Data mining is the core of the knowledge discovery (KDD) process, whose main stages are:
Databases → Data Integration → Data Cleaning → Data Warehouse → Selection of task-relevant data → Data Mining → Pattern Evaluation
[figure: the KDD process chain, from raw databases to evaluated patterns]

Data, data everywhere, yet…
• I can't find the data I need: data is scattered over the network, in many versions and formats
• I can't get the data I need: an expert is needed just to retrieve the data
• I can't understand the data I found: the available data is poorly documented
• I can't use the data I found: results are unexpected, and data needs to be transformed from one form to another

Most data will never be seen by humans…

Cascade of data
1 ZB = 1,000,000,000,000 GB = 10^9 TB. Small or big, networked or isolated: modern devices produce large amounts of data, and every piece of data that is produced needs to be reduced, analysed, interpreted. Increases in the number and size of devices, in their efficiency, or in the number of observed bands all cause an increase in the number of pixels worldwide, and computing time and costs do not scale linearly with the number of pixels.

Tsunami of data
Moore's law no longer applies; the slopes have changed (International Technology Roadmap for Semiconductors). For over two decades, before the advent of multi-core architectures, general-purpose CPUs were characterized, at each generation, by an almost linear increase in performance together with a decrease in cost, also known as Moore's Law (Moore 1965). In order to maintain the cyclic hardware/software trend, software applications have had to change perspective, moving towards parallel computing.

The forerunner: LHC
Computationally demanding, but still a relatively simple (embarrassingly parallel) KDD task: each CPU gets one event at a time and needs to perform simple tasks on it. Data stream: 330 TB/week (ATLAS detector events).

Data-intensive science has become a reality in almost all fields, and it poses harder problems:
• huge data sets (ca. PB scale), in astronomy as in many other sciences
• thousands of different problems
• many, many thousands of users
In this sense LHC is a "piece of cake": its computational model is simple.

Jim Gray: "One of the greatest challenges for 21st-century science is how we respond to this new era of data intensive science… This is recognized as a new paradigm beyond experimental and theoretical research and computer simulations of natural phenomena, one that requires new tools, techniques, and ways of working."

Real-world physics is too complex. Validation of models requires accurate simulations, tools to compare simulations and data, and better ways to deal with complex and massive data sets (e.g. a cosmological simulation with a total of 2,097,152 particles). We need to increase computational and algorithmic capabilities beyond current and expected technological trends.
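As a toy illustration of the "embarrassingly parallel" LHC-style workload described above (each CPU works on one event at a time, independently of all the others), here is a minimal Java sketch, not part of the course material: the Event record, the energy cut and the selection step are hypothetical placeholders, not the actual ATLAS pipeline.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Minimal sketch of an embarrassingly parallel event-filtering task:
// every event is independent, so it can be handed to any free core.
public class EventFilter {

    // Hypothetical detector event: an id plus a measured energy (GeV).
    record Event(long id, double energyGeV) {}

    public static void main(String[] args) {
        // Synthetic events standing in for a real detector data stream.
        List<Event> events = IntStream.range(0, 1_000_000)
                .mapToObj(i -> new Event(i, Math.random() * 1000.0))
                .collect(Collectors.toList());

        double energyCut = 500.0; // illustrative selection threshold

        // parallelStream() distributes independent events across cores:
        // the defining property of an embarrassingly parallel task.
        long selected = events.parallelStream()
                .filter(e -> e.energyGeV() > energyCut)
                .count();

        System.out.println("Events passing the cut: " + selected);
    }
}
```

Because no event depends on any other, throughput scales almost linearly with the number of cores, which is why this computational model counts as "simple" compared with general KDD tasks.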
A new Science concept: the Virtualization of Science and Scholarship

Summary
• Overture: the world transformed; climbing the S-curve
• Science in the exponential world; the Virtual Observatory as a case study
• The modern scientific process: eScience and the new paradigms
• The evolution of computing
• Scientific communication and collaboration; the rise of immersive virtual environments (Web 3.0?)
• The growing synergies: exploring and building in cyberspace

Definitions
By Virtualization we mean the migration of scholarly work, data, tools, methods, etc., to cyber-environments, today effectively the Web. This process is of course not limited to science and scholarship; essentially all aspects of modern society are undergoing the same transformation. Cyberspace (today the Web, with all the information and tools it connects) is increasingly becoming the principal arena where humans interact with each other and with the world of information, and where they work, learn and play.

ICT Revolution
The Information & Communication Technology revolution is historically unprecedented in its impact: it is like the industrial revolution and the invention of printing combined. Yet most fields of science and scholarship have not yet fully adopted the new ways of doing things, and in most cases do not understand them well. It is a matter of developing a new methodology of science and scholarship for the 21st century.

eScience
What is this beast called e-Science? It depends on whom you ask, but some general properties include:
• computationally enabled
• data-intensive
• geographically distributed resources (i.e., Web-based)
However, all science in the 21st century is becoming cyber-science (aka e-Science), so this is just a transitional phase. There is a great emerging synergy between computationally enabled science and science-driven IT.

Facing the Data Tsunami
Astronomy, all sciences, and every other modern field of human endeavour (commerce, security, etc.) are facing a dramatic increase in the volume and complexity of data. We are entering the second phase of the IT revolution: the rise of information/data-driven computing. The challenges are universal and growing:
• management of large, complex, distributed data sets
• effective exploration of such data → new knowledge

Data complexity and volume
Exponential growth in data volumes and complexity: understanding of complex phenomena requires complex data!
Multi-data fusion leads to a more complete, less biased picture (also: multi-scale, multi-epoch, …). Numerical simulations are also producing many TBs of very complex "data". Data + Theory = Understanding.

An example: Astronomy has become very data-rich
• A typical digital sky survey now generates ~10–100 TB, plus a comparable amount of derived data products; PB-scale data sets are on the horizon
• Astronomy today holds ~1–2 PB of archived data and generates a few TB/day
• Both data volumes and data rates grow exponentially, with a doubling time of ~1.5 years; even more important is the growth of data complexity
• For comparison: human memory ~ a few hundred MB; the human genome < 1 GB; 1 TB ~ 2 million books; the Library of Congress (print only) ~ 30 TB

The reaction: the response of the scientific community to the IT revolution
• The rise of Virtual Scientific Organizations:
  – discipline-based, not institution-based
  – inherently distributed and web-centric
  – always based on deep collaborations between domain scientists and applied CS/IT scientists and professionals
  – based on an exponentially growing technology, and thus rapidly evolving themselves
  – they do not fit into the traditional organizational structures
  – great educational and public outreach potential
• However: little or no coordination and interchange between different scientific disciplines
• Sometimes entire new fields are created, e.g. bioinformatics, computational biology

The Virtual Observatory concept
A complete, dynamical, distributed, open research environment for the new astronomy with massive and complex data sets:
• provide and federate content (data, metadata), services, standards, and analysis/compute services
• develop and provide data exploration and discovery tools
• harness the IT revolution in the service of astronomy
• a part of the broader e-Science / CyberInfrastructure
http://ivoa.net – http://us-vo.org – http://www.euro-vo.org

The world is flat
Probably the most important aspect of the IT revolution in science is professional empowerment: scientists and students anywhere with an internet connection should be able to do first-rate science (access to data and tools).
• A broadening of the talent pool in astronomy, leading to a substantial democratization of the field; newcomers can also be substantial contributors, not only consumers
• Riding the exponential growth of IT is far more cost-effective than building expensive hardware facilities (big telescopes, large accelerators, etc.)
• Especially useful for countries without major research facilities

VO Education and Public Outreach
The Web has a truly transformative potential for education at all levels: unprecedented opportunities in terms of content and of broad geographical and societal range. Astronomy works as a gateway to learning about physical science in general, as well as applied CS and IT.

VO (also as Virtual Organization) functionality today
What we did so far: lots of progress on interoperability, standards, etc.; an incipient data grid of astronomy; some useful web services; community training and EPO.
What we did not do (yet): significant data exploration and mining tools, and that is where the science will come from! Thus there is little VO-enabled science so far and a slow community buy-in. Development of powerful knowledge discovery tools should be a key priority.

Donald Rumsfeld's epistemology
There are known knowns, there are known unknowns, and there are unknown unknowns. Or, in other words, in data mining terms:
1. optimized detection algorithms (the known knowns)
2. supervised clustering (the known unknowns)
3. unsupervised clustering (the unknown unknowns)

The mixed blessings of data richness
Modern digital sky surveys typically contain ~10–100 TB, detect Nobj ~ 10^8–10^9 sources, with D ~ 10^2–10^3 parameters measured for each one, and multi-PB data sets are on the horizon. The potential for discovery grows with Nobj (or data volume) and with the number of big surveys, roughly as Nsurveys^2 through data federation (the connections between surveys). Great! However, DM algorithms scale very badly:
• clustering: ~ N log N → N^2 in the number of objects, ~ D^2 in dimensionality
• correlations: ~ N log N → N^2, ~ D^k (k ≥ 1)
• likelihood and Bayesian methods: ~ N^m (m ≥ 3), ~ D^k (k ≥ 1)
Scalability and dimensionality reduction (without a significant loss of information) are therefore critical needs (see the sketch below).
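To make the scaling problem concrete, here is a minimal, self-contained Java sketch (not part of the course material) of the naive pairwise-distance computation that underlies many clustering and correlation estimators: the double loop over objects makes the cost grow as N^2, and each distance evaluation costs O(D), so both the catalogue size N and the dimensionality D drive the run time.

```java
import java.util.Random;

// Naive pairwise distances: O(N^2) object pairs, O(D) work per pair.
// Doubling N quadruples the run time; this is why scalability and
// dimensionality reduction are critical for survey-scale catalogues.
public class PairwiseScaling {
    public static void main(String[] args) {
        int n = 5_000;   // number of objects (tiny compared to 10^8-10^9)
        int d = 100;     // measured parameters per object
        double[][] x = new double[n][d];
        Random rng = new Random(42);
        for (double[] row : x)
            for (int j = 0; j < d; j++) row[j] = rng.nextDouble();

        double sum = 0.0;                 // accumulate distances just to use them
        for (int i = 0; i < n; i++) {     // N(N-1)/2 pairs in total
            for (int k = i + 1; k < n; k++) {
                double dist2 = 0.0;
                for (int j = 0; j < d; j++) {   // O(D) per pair
                    double diff = x[i][j] - x[k][j];
                    dist2 += diff * diff;
                }
                sum += Math.sqrt(dist2);
            }
        }
        System.out.printf("mean pairwise distance = %.4f%n", sum / (n * (n - 1) / 2.0));
    }
}
```

Even at this toy scale (N = 5,000, D = 100) the loop already performs roughly 10^9 floating-point operations; at N ~ 10^8 the same naive approach is hopeless, which is why sub-quadratic algorithms and dimensionality reduction are needed.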
The curse of hyperdimensionality
This is not a matter of hardware or software, but of new ideas: the DM toolkit, the user and visualization all meet here.

Visualization!
A fundamental limitation of human perception: DMAX = 3? 5? 10? We can understand much higher dimensionalities mathematically, but we cannot really visualize them; our own neural nets are powerful pattern recognition tools. Interactive visualization must be a key part of the data mining process. Can dimensionality reduction be driven by machine discovery of patterns, substructures and correlations in the data?

Visualization
Effective visualization is the bridge between quantitative information and human intuition.
"Man cannot understand without images; the image is a likeness of a corporeal thing, but understanding is of the universal, abstracted from the particulars." (Aristotle, De Memoria et Reminiscentia)

Data analysis
The key role of data analysis is to replace the raw complexity seen in the data with a reduced set of patterns, regularities and correlations, leading to their theoretical understanding. However, the complexity (e.g. dimensionality) of data sets, and of the interesting, meaningful constructs within them, is starting to exceed the cognitive capacity of the human brain.

Data understanding
This is a very serious problem. Hyperdimensional structures (clusters, correlations, etc.) are likely present in many complex data sets, whose dimensionality is commonly in the range D ~ 10^2–10^4 and will surely grow. It is not only a matter of data understanding, but also of choosing the appropriate data mining algorithms and interpreting their results:
• things are rarely Gaussian in reality
• the clustering topology can be complex
What good are the data if we cannot effectively extract knowledge from them? "A man has got to know his limitations" (Dirty Harry, an American philosopher).

Knowledge Discovery in Databases: the new Science (Information Technology → New Science)
• The information volume grows exponentially: most data will never be seen by humans! Hence the need for data storage, network and database-related technologies, standards, etc.
• Information complexity is also increasing greatly: most data (and data constructs) cannot be comprehended by humans directly! Hence the need for data mining, KDD, data understanding technologies, hyperdimensional visualization, AI/machine-assisted discovery, …
• We need to create a new scientific methodology on the basis of applied CS and IT
• This is important for practical applications beyond science

Evolution of knowledge: the evolving paths to knowledge
• The First Paradigm: experiment/measurement
• The Second Paradigm: analytical theory
• The Third Paradigm: numerical simulations
• The Fourth Paradigm: data-driven science?

From numerical simulations…
Numerical simulations are a qualitatively new (and necessary) way of doing theory, beyond the analytical approach. The simulation output, a data set, is the theoretical statement, not an equation (e.g. the formation of a cluster of galaxies, turbulence in the Sun).

…to the fourth paradigm
Is this really something qualitatively new, rather than the same old data analysis with more data? The information content of modern data sets is so high as to enable discoveries which were not envisioned by the data originators (data mining). Data fusion reveals new knowledge which was implicitly present, but not recognizable, in the individual data sets. There is a complexity threshold for human comprehension of complex data constructs, so we need new methods to make data understanding possible (machine learning).
Data Fusion + Data Mining + Machine Learning = The Fourth Paradigm

The fourth paradigm
1. Experiment (ca. 3000 years)
2. Theory (a few hundred years): mathematical description, theoretical models, analytical laws (e.g. Newton, Maxwell, etc.)
3. Simulations (a few tens of years): complex phenomena
4. Data-intensive science (and it is happening now!)
http://research.microsoft.com/fourthparadigm/

Machine Learning
The roles for machine learning and machine intelligence in cyber-science:
• Data processing: object/event/pattern classification; automated data quality control (fault detection and repair)
• Data mining, analysis and understanding: clustering, classification, outlier/anomaly detection; pattern recognition, hidden correlation search; assisted dimensionality reduction for hyperdimensional visualization; workflow control in Grid-based applications
• Data farming and data discovery: the semantic web, and beyond
• Code design and implementation: from art to science?
This is the way to produce new science.

The old and the new
The Book and the Cathedral… and the Web, and the Computer. Technologies for information storage and access are evolving, and so does scholarly publishing.

Worlds of knowledge
K. Popper, Objective Knowledge: An Evolutionary Approach (1972). Cyberspace is now effectively Popper's World 3, plus the ways of interacting with it.

Science Commons, or Discovery Space: data archives, simulations and theory, published literature, communication and collaboration.

Origins of discovery
A lot of science originates in discussions and constructive interactions. This creative process can be enabled and enhanced using virtual interactive spaces, including the Web 2.0 tools.

Computing as a communication tool
With the advent of the Web, most computing usage is not number crunching but the search, manipulation and display of data and information, and increasingly also human interaction (e.g. much of Web 2.0).

Information technology as a communication medium: social networking and beyond
• Science originates at the interface between human minds, and between human minds and data (measurements, structured information, simulation output)
• Thus, any technology which facilitates these interactions is an enabling technology for science, scholarship, and intellectual progress more generally
• Virtual Worlds (immersive VR) are one such technology, and will likely revolutionize the ways in which we interact with each other and with the world of information we create
• Thus the Meta-Institute for Computational Astrophysics (MICA) was started: the first professional scientific organization based entirely in virtual worlds (Second Life), http://slurl.com/secondlife/StellaNova
• The subjective experience quality is much higher than traditional videoconferencing (and it can only get better as VR improves); effective worldwide telecommuting at ~zero cost; professional conferences easily organized at ~zero cost

Immersive data visualization
Encode up to a dozen dimensions for a parameter space representation; interactive data exploration in a pseudo-3D environment (e.g. a multicolour SDSS data set of stars, galaxies and quasars).

Immersive mathematical visualization
Pseudo-3D representation of highly dimensional mathematical objects; potential research and educational uses in geometry, topology, etc. (e.g. a pseudo-3D projection of a 248-dimensional mathematical object).

Personalization of Cyberspace
We inhabit the Cyberspace as individuals, and not just for work, but in very personal ways, to express ourselves and to connect with others ("As we may feel"?). e-Science is unified by a common methodology and tools. "We must all hang together, or assuredly we will all hang separately" (Ben Franklin).

The truth about social networking
Social networking as the intersection of narcissism, ADHD (Attention Deficit Hyperactivity Disorder), and good old-fashioned stalking.

The core business of academia
To discover, preserve and disseminate knowledge; to serve as a source of scientific and technological innovation; to educate the new generations in terms of knowledge, skills and tools. But when it comes to the adoption of computational tools and methods, to innovation, and to teaching them to our students, we are doing very poorly, and yet the science and the economy of the 21st century depend critically on these issues. Is the discrepancy of time scales to blame for this slow uptake? IT ~ 2 years, education ~ 20 years, a career ~ 50 years, universities ~ 200 years. (Are universities obsolete?)
Some thoughts about e-Science
Computational science ≠ computer science. Computational science spans numerical modeling and data-driven science.
• Data-driven science is not about data, it is about knowledge extraction (the data are incidental to our real mission)
• Information and data are (relatively) cheap, but expertise is expensive, just like the hardware/software situation
• Computer science is the "new mathematics": it plays the role with respect to the other sciences that mathematics played in roughly the 17th to 20th centuries; computation acts as a glue/lubricant of interdisciplinarity

Some transformative technologies to watch
• Cloud (mobile, ubiquitous) computing: distributed data and services
• Semantic Web: knowledge encoding and discovery infrastructure for the next generation Web
• Immersive and augmentative Virtual Reality: the human interface for the next generation Web, beyond Web 2.0 social networking
• Machine Intelligence redux: intelligent agents as your assistants/proxies; human-machine intelligence interaction

A new set of disciplines: X-Informatics
Built on machine learning, data mining, data structures, advanced programming languages, computer networks, visualization, databases, numerical analysis, computational infrastructures, semantics, etc., and on the formation of a new generation of scientists. Within any X-informatics discipline, the information granules are unique to that discipline, e.g. gene sequences in bioinformatics, the sky object in astroinformatics, and the spatial object in geoinformatics (such as points and polygons in the vector model, and pixels in the raster model). Nevertheless the goals are similar: transparent data re-use across sub-disciplines and within education settings, information and data integration and fusion, personalization of user interactions with data collections, semantic search and retrieval, and knowledge discovery. The implementation of an X-informatics framework enables these semantic e-science research goals.

Some speculations
We create technology, and it changes us, starting with the grasping of sticks and rocks as primitive tools, and continuing ever since. When technology touches our minds, that process can have a profound evolutionary impact in the long term; virtual worlds are one such technology. The development of AI seems inevitable, and its use in assisting us with information management and knowledge discovery is already starting. In the long run, immersive VR may facilitate the co-evolution of human and machine intelligence, and with it scientific and technological progress.

Mining of Warehouse Data
Data Mining + Data Warehouse = Mining of Warehouse Data
• For organizational learning to take place, data must be gathered together and organized in a consistent and useful way: hence Data Warehousing (DW)
• DW allows an organization to remember what it has noticed about its data
• Data mining techniques should be interoperable with data organized in a DW
Enterprise "database"
Transactions, VO registries, simulations, observations, etc. are copied, organized and summarized into the Data Warehouse. Data miners come in two kinds: "farmers" (they know what they are looking for) and "explorers" (unpredictable).

Data Mining: the 4-rule virtuous cycle
Finding patterns is not enough. A science or business process must respond to patterns by taking action, turning data into information, information into action, and action into value. Hence the virtuous cycle of DM:
1. Identify the problem
2. Mine the data to transform it into actionable information
3. Act on the information
4. Measure the results
The virtuous cycle is implemented by:
• transforming data into information via hypothesis testing, profiling and predictive modeling
• taking action: model deployment and scoring
• measurement: assessing a model's stability and effectiveness before it is used

DM: an 11-step methodology
The four rules expand into an 11-step strategy, at the base of the DAME (Data Analysis, Mining and Exploration) concept:
1. Translate any opportunity (science case) into a DM opportunity (problem)
2. Select appropriate data
3. Get to know the data
4. Create a model set
5. Fix problems with the data
6. Transform data to bring out information
7. Build models
8. Assess models
9. Deploy models
10. Assess results
11. Begin again (GOTO 1)

Why mine data? Commercial viewpoint
• Lots of data is being collected and warehoused: web data, e-commerce, purchases at department stores, bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong: provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why mine data? Scientific viewpoint
• Data is collected and stored at enormous speeds (GB/hour): remote sensors on satellites, telescopes scanning the skies, microarrays generating gene expression data, scientific simulations generating terabytes of data
• Traditional techniques are infeasible for such raw data
• Data mining may help scientists in classifying and segmenting data, and in hypothesis formation

Terminology
Components of the input (see the sketch below):
• Concepts: kinds of things that can be learned; the aim is an intelligible and operational concept description
• Instances: the individual, independent examples of a concept (more complicated forms of input are possible)
• Features/attributes: measured aspects of an instance; we will focus on nominal and numeric ones
• Patterns: ensembles (groups/lists) of features; within a given dataset, patterns usually share a homogeneous format (same number, meaning and type of features)
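As a minimal, hypothetical illustration of this terminology (not taken from the course code; the feature names are invented), the Java sketch below models an instance as a fixed set of features, mixing numeric and nominal attributes; a dataset is then simply a homogeneous list of such patterns.

```java
import java.util.List;

// A toy representation of the input terminology: instances made of
// numeric and nominal features, all sharing the same pattern format.
public class Terminology {

    // Nominal (categorical) feature values for an illustrative concept.
    enum SourceType { STAR, GALAXY, QUASAR }

    // One instance = one independent example of the concept to be learned.
    // Here: two numeric features plus a nominal class label.
    record Instance(double magnitude, double colorIndex, SourceType label) {}

    public static void main(String[] args) {
        // A (very small) dataset: a homogeneous group of patterns.
        List<Instance> dataset = List.of(
                new Instance(18.2, 0.4, SourceType.STAR),
                new Instance(20.1, 1.3, SourceType.GALAXY),
                new Instance(19.7, 0.9, SourceType.QUASAR));

        // The "concept description" a learner should output would map the
        // numeric features to the nominal label; here we only inspect them.
        dataset.forEach(i -> System.out.printf("mag=%.1f color=%.1f -> %s%n",
                i.magnitude(), i.colorIndex(), i.label()));
    }
}
```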
What is a DM concept?
• Data mining tasks (styles of learning):
  – Classification learning: predicting a discrete class
  – Association learning: detecting associations between features
  – Clustering: grouping similar instances into clusters
  – Sequencing: which events are likely to lead to later events
  – Forecasting: what may happen in the future
  – Numeric prediction (regression): predicting a numeric quantity
• Concept: the thing to be learned
• Concept description: the output of the learning scheme

Effective DM process break-down: market analysis and management
• Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing: find clusters of "model" customers who share the same characteristics (interests, income level, spending habits, etc.); determine customer purchasing patterns over time
• Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
• Customer profiling: what types of customers buy what products (clustering or classification)
• Customer requirement analysis: identify the best products for different customers; predict what factors will attract new customers
• Provision of summary information: multidimensional summary reports; statistical summary information (data central tendency and variation)

Data quality and integrity problems
• Legacy systems that are no longer documented
• Outside sources with questionable quality procedures
• Production systems with no built-in integrity checks and no integration: operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan ("And get it done quickly, we do not have time to worry about corporate standards…")
• The same person under different spellings: Agarwal, Agrawal, Aggarwal, etc.
• Multiple ways to denote a company name: Persistent Systems, PSPL, Persistent Pvt. LTD.
• Use of different names: mumbai, bombay
• Different account numbers generated by different applications for the same customer
• Required fields left blank
• Invalid product codes collected at the point of sale (manual entry leads to mistakes; "in case of a problem use 9999999")
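The duplicate-spelling problem above ("Agarwal" vs "Agrawal", "mumbai" vs "bombay") is typically attacked during data cleaning with normalization plus approximate string matching. The following Java sketch is a hypothetical, minimal example using plain edit distance; real warehouses use far richer rules (phonetic codes, reference tables, per-field thresholds).

```java
import java.util.List;

// Toy data-cleaning step: flag likely duplicate name spellings using a
// normalization pass plus Levenshtein edit distance. Threshold and data
// are illustrative only.
public class DedupSketch {

    static String normalize(String s) {
        return s.trim().toLowerCase().replaceAll("[^a-z]", "");
    }

    // Classic dynamic-programming edit distance between two strings.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        List<String> names = List.of("Agarwal", "Agrawal", "Aggarwal", "Rossi");
        for (int i = 0; i < names.size(); i++)
            for (int j = i + 1; j < names.size(); j++) {
                int dist = editDistance(normalize(names.get(i)), normalize(names.get(j)));
                if (dist <= 2)   // illustrative threshold for "candidate duplicate"
                    System.out.println(names.get(i) + " ~ " + names.get(j)
                            + " (edit distance " + dist + ")");
            }
    }
}
```

In practice such checks run as part of the cleansing stage of the warehouse load, before the data reaches any mining model.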
What is a Data Warehouse?
A single, complete and consistent store of data, obtained from a variety of different sources and made available to end users in a way they can understand and use in a business/research context.
• Data should be integrated across the enterprise
• Summary data has real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
Data warehousing is the process of transforming data into information and making it available to users in a timely enough manner to make a difference: a technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible.

The evolution of data analysis
• Data Collection (1960s)
  – Business question: "What was my total revenue in the last five years?"
  – Enabling technologies: computers, tapes, disks
  – Product providers: IBM, CDC
  – Characteristics: retrospective, static data delivery
• Data Access (1980s)
  – Business question: "What were unit sales in New England last March?"
  – Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
  – Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  – Characteristics: retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s)
  – Business question: "What were unit sales in New England last March? Drill down to Boston."
  – Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
  – Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
  – Characteristics: retrospective, dynamic data delivery at multiple levels
• Data Mining (emerging today)
  – Business question: "What's likely to happen to Boston unit sales next month? Why?"
  – Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
  – Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
  – Characteristics: prospective, proactive information delivery

Definition of a massive data set
• TeraBytes (10^12 bytes): one night of astrophysical observations
• PetaBytes (10^15 bytes): Geographic Information Systems or an astrophysical survey archive
• ExaBytes (10^18 bytes): national medical records
• ZettaBytes (10^21 bytes): weather images
• YottaBytes (10^24 bytes): intelligence agency videos

DM, operational systems and DW
What makes data mining possible? Advances in the following areas are making data mining deployable:
• data warehousing
• operational systems
• the emergence of easily deployed data mining tools
• the advent of new data mining techniques (Machine Learning)

OLTP vs OLAP
OLTP and OLAP are complementary technologies. You cannot live without OLTP: it runs your business day by day. Getting strategic information directly from OLTP is usually a first, "quick and dirty" approach, but it can become limiting later. OLTP (On-Line Transaction Processing) is a data modeling approach typically used to facilitate and manage the usual business applications; most of the applications you see and use are OLTP-based. OLAP (On-Line Analytical Processing) is an approach to answering multi-dimensional queries. OLAP was conceived for Management Information Systems and Decision Support Systems, but it is still widely underused: every day too many people build business intelligence directly on OLTP data. With the constant growth of data analysis and intelligence applications, understanding the benefits of OLAP is a must if you want to provide valid and useful analytics to management.
• Application: OLTP = operational (ERP, Management Information Systems, CRM, legacy apps, …); OLAP = Decision Support System
• Typical users: OLTP = staff; OLAP = managers, executives
• Horizon: OLTP = weeks, months; OLAP = years
• Refresh: OLTP = immediate; OLAP = periodic
• Data model: OLTP = entity-relationship; OLAP = multi-dimensional
• Schema: OLTP = normalized; OLAP = star
• Emphasis: OLTP = update; OLAP = retrieval

Examples of OLTP data systems
• Customer file: all industries; used to track customer details; legacy applications, flat files, mainframes; small-medium volumes
• Account balance: finance; used to control account activities; legacy applications, hierarchical databases, mainframes; large volumes
• Point-of-sale data: retail; used to generate bills and manage stock; ERP, client/server, relational databases; very large volumes
• Call record: telecommunications; used for billing; legacy applications, hierarchical databases, mainframes; very large volumes
• Production record: manufacturing; used to control production; ERP (Enterprise Resource Planning), relational databases; medium volumes

Why a separate Data Warehouse?
Operational systems are OLTP systems (the DW is OLAP):
• they run mission-critical applications
• they need to meet stringent performance requirements for routine tasks
• they are used to run a business!
• they are optimized to handle large numbers of simple read/write transactions
• RDBMS have traditionally been used for OLTP systems
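To make the workload difference concrete, here is a small, hypothetical Java sketch (the sales data and field names are invented for illustration): the OLTP-style access touches one record by key, while the OLAP-style query scans the whole fact set and aggregates it along a dimension, which is exactly the kind of load one does not want on an operational system.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Contrast of access patterns: OLTP = point read/update of one record,
// OLAP = full scan + aggregation along a dimension.
public class OltpVsOlap {

    // A minimal fact record: one sale with two dimensions and one measure.
    record Sale(String customerId, String region, double amount) {}

    public static void main(String[] args) {
        List<Sale> facts = List.of(
                new Sale("C1", "North", 120.0),
                new Sale("C2", "South", 80.0),
                new Sale("C3", "West", 200.0),
                new Sale("C1", "West", 50.0));

        // OLTP-style: fetch the current detail record for a single key.
        Map<String, Sale> byCustomer = facts.stream()
                .collect(Collectors.toMap(Sale::customerId, s -> s, (a, b) -> b));
        System.out.println("C2 latest sale amount: " + byCustomer.get("C2").amount());

        // OLAP-style: aggregate the measure over all facts, grouped by region.
        Map<String, Double> totalByRegion = facts.stream()
                .collect(Collectors.groupingBy(Sale::region,
                         Collectors.summingDouble(Sale::amount)));
        System.out.println("Total sales by region: " + totalByRegion);
    }
}
```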
Function of the DW for DM
• Missing data: decision support requires historical data, which operational databases do not typically maintain
• Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources
• Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled

So, what's different? Application-orientation vs subject-orientation
An operational database is organized around applications (loans, credit card, trust, savings); a data warehouse is organized around subjects (customer, vendor, product, activity).

OLTP vs Data Warehouse
• OLTP (runs a business): application oriented; used to run the business; detailed data; current, up-to-date; isolated data; repetitive access; office-worker users; performance sensitive; few records accessed at a time (tens); read/update access; no data redundancy; database size 100 MB – 100 GB
• Warehouse (optimizes a business): subject oriented; used to analyze the business; summarized and refined data; snapshot data; integrated data; ad-hoc access; knowledge users (managers); performance relaxed; large volumes accessed at a time (millions); mostly read (batch update); redundancy present; database size 100 GB – a few TB

OLAP and Data Marts
A data mart is the access layer of the data warehouse environment, used to get data out to the users. The data mart is a subset of the data warehouse, usually oriented to a specific business line or team. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data.
• Data marts and OLAP servers are departmental solutions supporting a handful of users
• Million-dollar massively parallel hardware is needed to deliver fast response times for complex queries
• OLAP servers require massive indices
• Data warehouses must be at least 100 GB to be effective

Components of the Warehouse
• Data extraction and loading
• The warehouse itself
• Analysis and query: OLAP tools
• Metadata
• Data marts
• Data mining
[figure: relational databases, ERP systems, purchased data and legacy data feed an extraction/cleansing stage and an optimized loader; the data warehouse engine, with its metadata repository, serves the analysis and query tools]

True data warehouses
Data sources → data warehouse → data marts. With data-mart-centric DWs, if you end up creating multiple warehouses, integrating them becomes a problem.

DW query processing: indexing
Exploiting indexes to reduce the scanning of data is of crucial importance. The main techniques are bitmap indexes and join indexes; other issues include text indexing and the parallelization and sequencing of index builds and incremental updates.
• Bitmap indexing: a collection of bitmaps, one for each distinct value of the column. Each bitmap has N bits, where N is the number of rows in the table; the bit corresponding to value v for row r is set if and only if row r has that value for the indexed attribute.
Example base table (customers C1–C7), with Region ∈ {N, S, E, W} and Rating ∈ {H, M, L}:
C1: N, H; C2: S, M; C3: W, L; C4: W, H; C5: S, L; C6: W, L; C7: N, H
Region index: N = 1000001, S = 0100100, E = 0000000, W = 0011010. Rating index: H = 1001001, M = 0100000, L = 0010110. A query such as "customers where Region = W and Rating = M" is answered by ANDing the W and M bitmaps.
• Join indexing: pre-computed joins. A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute (e.g., a join index on the city dimension of a calls fact table correlates, for each city, the calls in the calls table placed from that city). Composite join indexes can combine several dimensions, e.g. Calls+Time, Calls+Time+Location, Calls+Time+Location+Plan.
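Below is a minimal Java sketch of the bitmap-index idea, using the seven-customer table from the example above; java.util.BitSet stands in for the per-value bitmaps, and a conjunctive predicate is answered with a bitwise AND. The specific query shown (Region = W and Rating = L) is only an illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy bitmap index: one BitSet per distinct column value; a conjunctive
// predicate is evaluated by ANDing the relevant bitmaps instead of
// scanning the base table.
public class BitmapIndexSketch {

    // Build an index: for each distinct value, a bitmap of the rows holding it.
    static Map<String, BitSet> buildIndex(String[] column) {
        Map<String, BitSet> index = new HashMap<>();
        for (int row = 0; row < column.length; row++)
            index.computeIfAbsent(column[row], v -> new BitSet(column.length)).set(row);
        return index;
    }

    public static void main(String[] args) {
        // Customers C1..C7 from the example base table.
        String[] region = {"N", "S", "W", "W", "S", "W", "N"};
        String[] rating = {"H", "M", "L", "H", "L", "L", "H"};

        Map<String, BitSet> regionIdx = buildIndex(region);
        Map<String, BitSet> ratingIdx = buildIndex(rating);

        // Query: customers where Region = W AND Rating = L (illustrative).
        BitSet result = (BitSet) regionIdx.get("W").clone();
        result.and(ratingIdx.get("L"));

        result.stream().forEach(row -> System.out.println("match: C" + (row + 1)));
    }
}
```

Because each bitmap is just N bits, the AND over millions of rows is cheap, which is why bitmap indexes suit the read-mostly, low-cardinality columns typical of warehouse dimensions.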
DW query processing: parallelism
• Parallel query processing comes in three forms: independent, pipelined, and partitioned (including "partition and replicate")
• Deterrents to parallelism: startup and communication overheads
• Partitioned data: parallel scans yield I/O parallelism
• Parallel algorithms for relational operators: joins, aggregates, sort
• Parallel utilities: load, archive, update, parse, checkpoint, recovery
• Parallel query optimization

OLAP representation
OLAP is FASMI: Fast Analysis of Shared Multidimensional Information. The term On-Line Analytical Processing was coined by E. F. Codd in a 1994 paper contracted by Arbor Software (reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html). It is generally synonymous with earlier terms such as Decision Support, Business Intelligence and Executive Information Systems. OLAP = multidimensional database.
• MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
• ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
[figure: a data cube with dimensions Product (juice, cola, milk, cream, toothpaste, soap), Month (1–7) and Region (N, S, W)]

OLAP vs SQL
The limitation of SQL: "A freshman in business needs a Ph.D. in SQL" (Ralph Kimball). OLAP instead offers:
• a powerful visualization paradigm
• fast, interactive response times
• good support for analyzing time series
• the ability to find some clusters and outliers
• many vendors offering OLAP tools, with embedded SQL extensions
The nature of OLAP analysis: aggregation (total sales, percent-to-total), comparison (budget vs. expenses), ranking (top 10, quartile analysis), detailed and aggregate data, complex criteria specification, visualization.

Relational OLAP
• Database layer: store atomic data in an industry-standard RDBMS
• Application logic layer: the engine generates SQL execution plans to obtain OLAP functionality
• Presentation layer: multi-dimensional reports are obtained from the Decision Support client

Multi-dimensional OLAP
• Database layer: store atomic data in a proprietary multidimensional data structure (MDDB), pre-calculating as many outcomes as possible
• Application logic layer: OLAP functionality is obtained via proprietary algorithms running against the MDDB
• Presentation layer: multi-dimensional reports are obtained from the Decision Support client

Number of aggregations: the OLAP problem is too much data
The data explosion syndrome (see the sketch below): [chart: with 4 levels in each dimension, the number of pre-computed aggregations grows exponentially with the number of dimensions, reaching 65,536 at 8 dimensions]
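The explosion is easy to quantify under a simple assumption (mine, not the slide's exact model): if each of d dimensions has L levels that can be pre-aggregated, a fully materialized cube holds on the order of L^d aggregates. The Java sketch below reproduces the shape of the chart with L = 4 levels per dimension.

```java
// Data explosion syndrome: number of pre-computed aggregates in a fully
// materialized cube, assuming L aggregation levels in each of d dimensions
// (a simplifying assumption; real cubes depend on hierarchy shapes).
public class DataExplosion {
    public static void main(String[] args) {
        int levels = 4;                       // levels per dimension
        for (int dims = 2; dims <= 8; dims++) {
            long aggregates = 1;
            for (int i = 0; i < dims; i++) aggregates *= levels; // levels^dims
            System.out.printf("%d dimensions -> %,d aggregates%n", dims, aggregates);
        }
    }
}
```

With 8 dimensions this already gives 65,536 aggregates, matching the top of the chart; each extra dimension or hierarchy level multiplies the storage and pre-computation cost, which motivates the metadata-based approach discussed next.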
The OLAP solution: metadata
The primary rationale for data warehousing is to provide businesses with analytics results from data mining, OLAP and reporting. The ability to obtain front-end analytics is undermined if data quality is poor, and expensive to enforce, all along the pipeline from data source to analytical reporting. After a company-wide metadata implementation, with a unified metadata source and definition, the business can push further along the analysis journey: OLAP reporting moves across the stream with greater access for all employees, and data mining models become more accurate, since the model sets can be scored and trained on larger data sets.

Data Warehouse pitfalls
• You are going to spend much time extracting, cleaning and loading data
• Despite best efforts at project management, the scope of the data warehousing project will increase
• You are going to find problems with the systems feeding the data warehouse
• You will find the need to store data not captured by any existing system
• You will need to validate data not validated by the transaction processing systems
• For interoperability among worldwide data centers, you would need to move massive data sets over the network: a disaster!

Data or applications? Moving programs, not data, addresses the true bottleneck.

Mining of Warehouse Data
Data Mining + Data Warehouse = Mining of Warehouse Data
• For organizational learning to take place, data must be gathered together and organized in a consistent and useful way: hence Data Warehousing (DW)
• DW allows an organization to remember what it has noticed about its data
• Data mining applications should be interoperable with data organized in, and shared between, DWs

Interoperability scenarios
• Data and apps exchanged between DAs (Desktop Applications): full interoperability between DAs, but the local user desktop is fully involved (it requires computing power)
• Data and apps exchanged between a DA and a WA (Web Application): full WA → DA interoperability, partial DA → WA interoperability (such as remote file storing); massive data sets (MDS) must be moved between local and remote apps; the user desktop is partially involved (it requires minor computing and storage power)
• Data and apps exchanged between WAs: apart from URI exchange, no interoperability and different accounting policies; MDS must be moved between remote apps (though with larger bandwidth); no local computing power is required
Improving aspects: DAs have to become WAs, exchanging plugins; a unique accounting policy (Google/Microsoft-like); to overcome the MDS flow, apps must become plug&play (e.g. any feature of WAx should be pluggable into WAy on demand); no local computing power is required, so even smartphones can run VO apps.

Requirements
• a standard accounting system
• no more MDS moving over the web: only apps move, structured as plugin repositories and execution environments
• standard modeling of WAs and their components, to obtain the maximum level of granularity
• evolution of the SAMP architecture to extend web interoperability (in particular for the migration of plugins)

Plugin granularity flow
[figure: WAx exposes plugins Px-1 … Px-n and WAy exposes plugins Py-1 … Py-n; WAy requests and executes, for example, Px-3]. This scheme can be iterated and extended among any number of standardized web apps.

The Lernaean Hydra
After a certain number of such iterations, the scenario becomes: no longer different web services, but effectively a single web service with several sites (possibly with different GUIs and computing environments). Every site can become a mirror of all the others; the synchronization of plugin releases between sites is performed at request time; the data exchange flow is minimized (just a few plugins, in case of synchronization between mirrors).

Web 2.0
What is Web 2.0? "It is a system that breaks with the old model of centralized Web sites and moves the power of the Web/Internet to the desktop." (J. Robb) "The Web becomes a universal, standards-based integration platform." (S. Dietzen)

Conclusions
• e-Science is a transitional phenomenon, and will become the overall research environment of the data-rich, computationally enabled science of the 21st century
• Essentially all of humanity's activities are being virtualized in some way, science and scholarship included
• We see growing synergies and co-evolution between science, technology, society and individuals, with an increasing fusion of the real and the virtual
• Cyberspace, now embodied through the Web and its participants, is the arena in which these processes unfold
• VR technologies may revolutionize the ways in which humans interact with each other and with the world of information
• A synthesis of the semantic Web, immersive and augmentative virtual worlds, and machine intelligence may shape our world profoundly
References
• Borne, K. D., 2009. X-Informatics: Practical Semantic Science. American Geophysical Union, Fall Meeting 2009, abstract #IN43E-01 (http://adsabs.harvard.edu/abs/2009AGUFMIN43E..01B)
• The Fourth Paradigm, Microsoft Research, http://research.microsoft.com/fourthparadigm/
• Thomsen, E., 1997. OLAP Solutions. John Wiley and Sons
• Inmon, W. H., Zachman, J. A., Geiger, J. G., 1997. Data Stores, Data Warehousing and the Zachman Framework. McGraw-Hill Series on Data Warehousing and Data Management
• Inmon, W. H., 1996. Building the Data Warehouse, Second Edition. John Wiley and Sons
• Inmon, W. H., Welch, J. D., Glassey, K. L., 1997. Managing the Data Warehouse. John Wiley and Sons
• Devlin, B., 1997. Data Warehouse: from Architecture to Implementation. Addison Wesley Longman, Inc.
• Lin, S. C., Yen, E., 2011. Data Driven e-Science: Use Cases and Successful Applications of Distributed Computing Infrastructures (ISGC 2010). Springer