MINING OF FINANCIAL DATABASES
INTRODUCTION
Jerzy KORCZAK
email: [email protected]
http://www.korczak-leliwa.pl
http://kti.ue.wroc.pl

MFD – Schedule and grades
Lectures: Thursday, 17:20-19:35, CKU 110, 15 hours
Computer Labs: 15 hours
  gr.1 Wednesday, 15:00-17:15, 002Z1
  gr.2 Thursday, 15:00-17:15, 003Z1
Final grade = AVG (Ass1+Ass2+Ass3) ± D activity

Outline
• Introduction – concept of data-driven knowledge discovery
• History of data mining
• Statistics vs. data mining
• Overview of data sets, databases
• CRISP - Data mining methodology
• Business Requirements
• Research progress

Data Mining
• Data mining (knowledge discovery in databases, KDD)
  – Extraction of interesting, non-trivial, implicit, previously unknown and potentially useful information (knowledge) or patterns from data in large databases or other information repositories
• Scientific point of view: data abstraction and KDD
• Commercial point of view: competitive pressure
• Necessity is the mother of invention
  – Data is everywhere, so data mining should be everywhere, too!
  – Understand and use data: an imminent task!

Origins of Data Mining
• Draws ideas from AI/machine learning, pattern recognition, statistics, and database systems

Statistics vs Data Mining
• Statistics: a discipline dedicated to data analysis
• What are the differences?
  – Huge amount of data: giga- to terabytes
  – Fast computer: quick response, interactive analysis
  – Multi-dimensional, powerful, thorough analysis
  – High-level, “declarative”: user’s ease and control
  – Automated or semi-automated: mining functions hidden or built-in in many systems

Origins of Data Mining (cont.)
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data
[Figure labels: Artificial Intelligence, Statistics, Machine Learning, Pattern Recognition, Data Mining, Database systems, Visualisation]

Data Sets, Database, Images
• Relational database: a commodity of every enterprise
• Huge data warehouses are under construction
• POS (Point of Sales): transactional DBs in terabytes
• Object-relational databases, distributed, heterogeneous, and legacy databases
• Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases
• Time-series data (e.g., stock trading) and temporal data
• Text (documents, emails) and multimedia databases
• WWW: a huge, hyper-linked, dynamic, global information system

[Figure. Source: Healy J., Why what happens in an internet minute really matters, M2M]

Types of Decision-Support Systems (DSS)
Model-driven DSS:
• Primarily stand-alone systems
• Use a strong theory or model to perform “what-if” analyses
Data-driven DSS:
• Integrated with large pools of data in major enterprise systems and Web sites
• Support decision making by enabling user to extract useful information
• Data mining: can obtain types of information such as associations, sequences, classifications, clusters, and forecasts

A Multi-Dimensional View of Data Mining
• Databases to be mined
  – Relational, transactional, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
  – Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
• Techniques utilized
  – Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, genetics, etc.
• Applications adapted
  – Decision Support Systems, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, image interpretation, Web mining, etc.

What is Data Mining?
Many definitions
  – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  – Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks...
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

A Brief History of Data Mining
• Most scientific discoveries involve “data mining”
  – Kepler’s Law, Newton’s Laws, periodic table of chemical elements, …, from “big bang” to DNA
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
• 1991-1994 Workshops on Knowledge Discovery in Databases
  – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-now International Conferences on Knowledge Discovery in Databases and Data Mining (KDD)
• 1998 ACM SIGKDD, SIGKDD’1999-2005 conferences, and SIGKDD Explorations
• More conferences on data mining
  – Journal of Data Mining and Knowledge Discovery (1997)
  – PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
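
Returning to the Tid / Refund / Marital Status / Taxable Income / Cheat records shown under Data Mining Tasks, the classification task can be made concrete by scoring a hand-written rule against the ten distinct records (Tid 11 to 15 repeat Tid 6 to 10). This is a minimal sketch, not the lecture's method: the rule, the 80K threshold, and the names `RECORDS`, `predict_cheat` and `accuracy` are assumptions made for illustration, and the 70K income for Tid 3 is an assumed reading of a value that appears displaced in the table.

```python
# Illustrative sketch only: the rule and the 80K threshold are assumptions,
# chosen by hand to fit these ten records; they are not taken from the lecture.

# (tid, refund, marital_status, taxable_income_in_K, cheat)
# Tid 3's income (70) is an assumed reading of a value displaced in the table.
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def predict_cheat(refund, marital_status, income_k):
    """Hand-written rule, i.e. a tiny decision tree spelled out as if/else."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Remaining cases: no refund, Single or Divorced.
    return "Yes" if income_k >= 80 else "No"


def accuracy(records):
    """Share of records whose predicted label equals the recorded Cheat label."""
    hits = sum(
        predict_cheat(refund, status, income) == cheat
        for _tid, refund, status, income, cheat in records
    )
    return hits / len(records)


if __name__ == "__main__":
    print(f"accuracy on the ten records: {accuracy(RECORDS):.2f}")
```

Because the rule was fitted by eye to the same records it is scored on, the accuracy says nothing about unseen data; a separate test set would be needed for that.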

Mining of Big Data
• Concepts of Big Data:
  – “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value
    • Requires new data architectures, analytic sandboxes
    • New tools
    • New analytical methods
    • Integrating multiple skills into new role of data scientist
  – Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real time capabilities
Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity

Types of Data Structures:
• Structured Data
• Semi-Structured Data: View Source
• Quasi-Structured Data: http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
• Unstructured Data: The Red Wheelbarrow, by William Carlos Williams

Business Requirements
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven
Driver | Examples
1. Desire to optimize business operations | Sales, pricing, profitability, efficiency
2. Desire to identify business risk | Customer churn, fraud, default
3. Predict new business opportunities | Upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements | Anti-Money Laundering, Fair Lending, Basel II

Data Structures (listed from more structured to less)
Structured
• Data containing a defined data type, format, structure
• Example: Transaction data and OLAP
Semi-Structured
• Textual data files with a discernible pattern, enabling parsing
• Example: XML data files that are self describing and defined by an XML schema
“Quasi” Structured
• Textual data with erratic data formats, can be formatted with effort, tools, and time
• Example: Web clickstream data that
may contain some inconsistencies in data values and formats
Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Example: Text documents, PDFs, images and video

Methodology
Cross Industry Standard Process for Data Mining, known by its acronym CRISP-DM [ESPRIT, 1996].

Data Analytics – Development
• Data Analytics Lifecycle
  1. Discovery: Do I have enough information to draft an analytic plan and share for peer review?
  2. Data Prep: Do I have enough good quality data to start building the model?
  3. Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
  4. Model Building: Is the model robust enough? Have we failed for sure?
  5. Communicate Results
  6. Operationalize

Data Analytics – Development (cont.)
• Phase 1: Discovery
  • Formulate Initial Hypotheses: IH, H1, H2, H3, … Hn
    – Gather and assess hypotheses from stakeholders and domain experts
    – Preliminary data exploration to inform discussions with stakeholders during the hypothesis forming stage
  • Identify Data Sources – Begin Learning the Data
    – Aggregate sources for previewing the data and provide high-level understanding
    – Review the raw data
    – Determine the structures and tools needed
    – Scope the kind of data needed for this kind of problem

Data Analytics – Development (cont.)
• Phase 2: Data Preparation
  • Prepare Analytic Sandbox
    – Work space for the analytic team
    – 10x+ vs.
EDW
  • Perform ELT
    – Determine needed transformations
    – Assess data quality and structuring
    – Derive statistically useful measures
    – Extract data and determine data connections for raw data, OLTP transactions, OLAP cubes or data feeds
    – Big ELT and Big ETL
  • Useful Tools for this phase:
    – For Data Transformation & Cleansing: SQL, Hadoop, MapReduce, Alpine Miner

Data Analytics – Development (cont.)
• Phase 3: Model Planning
  • Determine Methods
    – Select methods based on hypotheses, data structure and volume
    – Ensure techniques and approach will meet business objectives
  • Techniques & Workflow
    – Candidate tests and sequence
    – Identify and document modeling assumptions
  • Useful Tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC

Data Analytics – Development (cont.)
• Phase 3 - Model Planning
  • How do people generally solve this problem with the kind of data and resources I have?
  • Does that work well enough? Or do I have to come up with something new?
  • What are related or analogous problems? How are they solved? Can I do that?

Data Analytics – Development (cont.)
• Phase 3 - Model Planning
  • Data Exploration
  • Variable Selection
    – Inputs from stakeholders and domain experts
    – Capture essence of the predictors, leverage a technique for dimensionality reduction
    – Iterative testing to confirm the most significant variables
  • Model Selection
    – Conversion to SQL or database language for best performance
    – Choose technique based on the end goal

Data Analytics – Development
The Problem to Solve | The Category of Techniques | Algorithms
I want to group items by similarity. I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process. I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, Document representation (Bag of Words), TF-IDF

Data Analytics – Development (cont.)
• Phase 4: Model Building
  • Develop data sets for testing, training, and production purposes
    – Need to ensure that the model data is sufficiently robust for the model and analytical techniques
    – Smaller test sets for validating approach, training dataset for initial experiments
  • Get the best environment you can for building models and workflows: fast hardware, parallel processing
  • Useful Tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner

• Phase 5: Communicate Results
  Did we succeed? Did we fail?
  • Interpret the results
  • Compare to IH’s from Phase 1
  • Identify key findings
  • Quantify business value
  • Summarizing findings, depending on audience

Data Analytics – Core deliverables
Presentation for Project Sponsors
1. “Big picture” takeaways for executive level stakeholders
2. Determine key messages to aid their decision-making process
3. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp
Presentation for Analysts
1. Business process changes
2. Reporting changes
3. Fellow Data Scientists will want the details and are comfortable with technical graphs (such as ROC curves, density plots, histograms)

Data Analytics – Development (cont.)
• Phase 6: Operationalize
  • Run a pilot
  • Assess the benefits
  • Deliver final deliverables
  • Model execution in production environment
  • Define process to update and retrain the model, as needed

Data Analytics – Core deliverables (cont.)
Code for technical people
Technical specs of implementing the code

Data Analytics – Key roles
Role | Description
Business User | Someone who benefits from the end results and can consult and advise project team on value of end results and how these will be operationalized
Project Sponsor | Person responsible for the genesis of the project, providing the impetus for the project and core business problem; generally provides the funding and will gauge the degree of value from the final outputs of the working team
Project Manager | Ensure key milestones and objectives are met on time and at expected quality
Business Intelligence Analyst | Business domain expertise with deep understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
Data Engineer | Deep technical skills to assist with tuning SQL queries for data management, extraction and support data ingest to analytic sandbox
Database Administrator (DBA) | Database Administrator who provisions and configures database environment to support the analytical needs of the working team
Data Scientist | Provide subject matter expertise for analytical techniques, data modeling, applying valid analytical techniques to given business problems and ensuring overall analytical objectives are met

Business Requirements
• Objectives of the problem decomposition:
  – Focus your time
  – Ensure rigor and completeness
  – Enable better transition to members of the cross-functional analytic teams
    • Repeatable
    • Scale to additional analysts
    • Support validity of findings

Business Requirements
• How do you currently approach your analytics problems?
• Do you follow a methodology or some kind of framework?
• How do you plan for an analytic project?

• Analytical Approaches for Meeting Business Drivers
[Figure: BUSINESS VALUE (Low to High) against TIME (Past to Future), with Business Intelligence at the low, past end and Predictive Analytics & Data Mining (Data Science) at the high, future end]
  Predictive Analytics & Data Mining (Data Science)
    Typical Techniques & Data Types
    • Optimization, predictive modeling, forecasting, statistical analysis
    • Structured/unstructured data, many types of sources, very large data sets
    Common Questions
    • What if…?
    • What’s the optimal scenario for our business?
    • What will happen next? What if these trends continue? Why is this happening?
  Business Intelligence
    Typical Techniques & Data Types
    • Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
    • Structured data, traditional sources, manageable data sets
    Common Questions
    • What happened last quarter?
    • How many did we sell?
    • Where is the problem? In which situations?

Business Requirements
• A typical analytical architecture
[Figure: a typical analytical architecture, with numbered callouts 1 to 4. Labels: Data Sources; Departmental Warehouse; Enterprise Applications; “Spread Marts”; Errant data & marts; Non-Agile Models; Static schemas accrete over time; Reporting; Prioritized Operational Processes; Non-Prioritized Data Provisioning; Siloed Analytics]

Business Requirements
Implications of Typical Architecture for Data Science
  – High-value data is hard to reach and leverage
  – Predictive analytics & data mining activities are last in line for data
    • Queued after prioritized operational processes
  – Data is moving in batches from EDW to local analytical tools
    • In-memory analytics (such as R, SAS, SPSS, Excel)
    • Sampling can skew model accuracy
  – Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics
    • Non-standardized initiatives
    • Frequently, not aligned with corporate business goals
Slow “time-to-insight” & reduced business impact

Business Requirements
• Opportunities for a new approach to analytics
[Figure with four numbered groups: 1 Data Devices; 2 Data Collectors; 3 Data Aggregators; 4 Data Users/Buyers. Other labels: Individual; Analytic Services; Medical; Information Brokers; Advertising; Marketers; Employers; Law Enforcement; Internet; Government; Websites; Catalog Co-Ops; Phone/TV; Media; Media Archives; Retail; Credit Bureaus; Financial; Banks; List Brokers; Delivery Service; Private Investigators/Lawyers]

[Figure: Data Mining]

Conclusions and Perspectives
• Trend: after C-C, availability of BD in-memory technology, lower cost, real time, IoT
• According to McKinsey, a retailer using big data to the full could increase its operating margin by more than 60%
• Bad data or poor data quality costs US businesses $600 billion annually
• According to Gartner, Big Data will drive $232 billion in spending through 2016.
• By 2016, 5 million IT jobs globally were created to support big data, generating 2 million IT jobs in the US.

Research Progress
• Multi-dimensional data analysis: Data Warehouse and OLAP
• Association, correlation, and causality analysis
• Classification: scalability and new approaches
• Clustering and outlier analysis
• Sequential patterns and time-series analysis
• Similarity analysis: curves, trends, images, texts, etc.
• Text mining, Web mining and Weblog analysis
• Social networks, link analysis
• Spatial, multimedia, scientific data analysis
• Smart sensors: IoT
• Image classification and interpretation
• Data preprocessing and database compression
• Data visualization and visual data mining
• Many others, e.g., collaborative filtering
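
Association analysis appears both in the list above and in the Model Planning technique table ("discover relationships between actions or items", Association Rules, Apriori). The two measures such rules are usually ranked by, support and confidence, can be sketched in a few lines. This is an illustration under stated assumptions: the transactions are invented, the names `TRANSACTIONS`, `support`, `confidence` and `frequent_pairs` are hypothetical, and the pair search is brute force rather than the Apriori pruning strategy.

```python
from itertools import combinations

# Hypothetical market-basket transactions, invented for this example.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """All 2-item sets with support >= min_support.
    Brute force over every pair; Apriori would prune candidates instead."""
    items = sorted(set().union(*transactions))
    return [
        pair
        for pair in combinations(items, 2)
        if support(pair, transactions) >= min_support
    ]


if __name__ == "__main__":
    print("support({diapers, beer}) =", support({"diapers", "beer"}, TRANSACTIONS))
    print("confidence(diapers -> beer) =", confidence({"diapers"}, {"beer"}, TRANSACTIONS))
    print("pairs with support >= 0.6:", frequent_pairs(TRANSACTIONS, 0.6))
```

Support measures how often the items occur together at all, while confidence measures how often the consequent accompanies the antecedent; a rule is normally required to clear a threshold on both.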