Chapter 1: Why & What is Data Mining?
Based on Data Mining Techniques (2nd Ed.), Berry and Linoff, 2004, Wiley. Slides by Prof. Norman of the National University in La Jolla, CA. Adapted by Peter Auer.

What, Who

Data Mining – Definition & Goal
• Definition
  – DM is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
• Goal
  – To allow an “enterprise”* to IMPROVE its ______ through better understanding of its ______ .
  – Potential for competitive advantage.
* Synonyms include: corporation, firm, non-profit organization, government agency

Foundations of Data Mining
Data mining is the process of using “raw” data to infer important “business” relationships.
Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.
Data mining is a collection of powerful techniques intended for analyzing large amounts of data.
There is no single data mining approach, but rather a set of techniques that can be used stand-alone or in combination with each other.

How

Customer Relationship Management (CRM)
In order to form a learning relationship with its customers, an enterprise (firm) must be able to:
1. Notice – what its customers are doing
2. Remember – what it and its customers have done over time
3. Learn – from what it has remembered
4. Act On – what it has learned to make customers more profitable

How: Based on “Transaction” Data

Definitions of a Data Warehouse
1. “A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process” - W.H. Inmon
2. “A copy of transaction data, specifically structured for query and analysis” - Ralph Kimball

Data Warehouse
• For organizational learning to take place, data from many sources must be gathered together and organized in a consistent and useful way – hence, Data Warehousing (DW)
• DW allows an organization (enterprise) to remember what it has noticed about its data
• Data mining techniques make use of the data in a DW

Data Warehouse
[Figure: the enterprise “database” (customers, orders, transactions, vendors, etc.) is copied, organized, and summarized into the data warehouse, which in turn feeds data mining]

Data Warehouse
• Data, data, data…everywhere!
• Information…that’s another story!
• Especially, the right information @ the right time!
• Data warehousing’s goal is to make the right information available @ the right time
• Data warehousing is a data store (e.g., a database of some sort) and a process for bringing together disparate data from throughout an organization for decision-support purposes

Data Warehousing
• Data warehouses are natural allies for data mining (they work together well)
• Data mining can help fulfill some of the goals of data warehouses – right information @ the right time
• Relational database management systems (RDBMS), such as Oracle, DB2, Sybase, Informix, Focus, SQL Server, etc., are often used for data warehousing

Data of Different Kinds

Transaction (Operational) Data
• Operational (production) systems create massive numbers of transactions, such as sales, purchases, deposits, withdrawals, returns, refunds, phone calls, toll road passages, web site “hits”, etc…
• Transactions are the base level of data – the raw material for understanding customer behavior
• Unfortunately, operational systems change due to changing business needs
• Fortunately, operational systems can usually be changed to support changing business needs
• Data warehousing strategies need to be aware of operational system changes

Operational Summary Data
• Summaries are for a specific time period and utilize the transaction data for that time period
• Other Examples???

Database Schema
• A database schema defines the structure of the data, not the values of the data (e.g., first name, last name = structure; Ron Norman = values of the data)
• In an RDBMS:
  – Columns = fields = attributes (A, B, C)
  – Rows = records = tuples (1-7)

Metadata
• General definition: Data about data !!!
  – Examples:
    • A library’s card catalog (metadata) describes publications (data)
    • A file system maintains permissions (metadata) about files (data)
• A form of system documentation including:
  – Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.)
  – Description of the contents of each field (e.g., start date)
  – Date when data were loaded
  – Indication of currency of the data (last updated)
  – Mappings between systems (e.g., A.this = B.that)
• Invaluable: without it, you have to research to find this information

Business Rules
• Highest level of abstraction from operational (transaction) data
• Describe why relationships exist and how they are applied
• Examples:
  – Need to have 3 forms of ID for credit
  – Only allow a maximum daily withdrawal of $200
  – After the 3rd log-in attempt, lock the log-in screen
  – Accept no bills larger than $20
  – Others???
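
Example (a sketch, not part of the original slides): the Operational Summary Data slide above describes summaries as per-period roll-ups of transaction data. Assuming hypothetical transaction records of the form (customer, date, amount), a monthly summary might be built roughly like this. The record layout, the names, and the choice of "count and total per customer per month" are all illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (customer, transaction date, amount).
transactions = [
    ("alice", date(2004, 1, 5), 40.0),
    ("alice", date(2004, 1, 20), 60.0),
    ("bob", date(2004, 1, 7), 25.0),
    ("alice", date(2004, 2, 3), 10.0),
]


def monthly_summary(rows):
    """Roll transactions up to one (count, total) pair per customer and month."""
    summary = defaultdict(lambda: [0, 0.0])
    for customer, day, amount in rows:
        key = (customer, day.year, day.month)
        summary[key][0] += 1        # number of transactions in the period
        summary[key][1] += amount   # total amount in the period
    return {key: tuple(value) for key, value in summary.items()}


summary = monthly_summary(transactions)
for key in sorted(summary):
    print(key, summary[key])
```

The transactions stay the base level of data; the summary is derived from them for one time period at a time and can be rebuilt whenever the period or the grouping changes.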

General Architecture for Data Warehousing
• End users (business)
• Metadata repository
• Central repository
• Extraction, (Clean), Transformation, & Load (ETL)
• Source systems

OLAP – Online Analytical Processing
• A definition:
• Data representation is in the form of a CUBE
• OLAP goes beyond SQL with its analysis capabilities
• Key feature of OLAP: relevant multi-dimensional views such as products, time, geography

OLAP Overview
• Interactive, exploratory analysis of multidimensional data to discover patterns
[Figure: data cube with axes gender, age, and accidents]

Data Mining versus OLAP
• OLAP - Online Analytical Processing
  – Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening

Results of Data Mining Include:
• Forecasting what may happen in the future
• Classifying people or things into groups by recognizing patterns
• Clustering people or things into groups based on their attributes
• Associating what events are likely to occur together
• Sequencing what events are likely to lead to later events

Data Mining Flavors
• Directed – Attempts to explain or categorize some particular target field such as income or response.
• Undirected – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes.
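
Example (a sketch, not part of the original slides): the OLAP slides above describe data as a cube that can be viewed along dimensions such as products, time, and geography. One way to picture a multi-dimensional view is as a roll-up that keeps some dimensions and sums the measure over the rest. The fact records, the dimension names, and the `roll_up` helper below are hypothetical; real OLAP tools offer far more than this.

```python
from collections import defaultdict

# Hypothetical fact records: one sales amount per (product, month, region).
DIMENSIONS = ("product", "month", "region")
facts = [
    ("mint", "Jan", "West", 100.0),
    ("mint", "Feb", "West", 80.0),
    ("cherry", "Jan", "West", 50.0),
    ("cherry", "Jan", "East", 70.0),
]


def roll_up(rows, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last field is the measure (sales amount)
    return dict(totals)


by_product = roll_up(facts, ["product"])
by_month_region = roll_up(facts, ["month", "region"])
grand_total = roll_up(facts, [])
print(by_product)
print(by_month_region)
print(grand_total)
```

Each call gives a different view of the same cube. As the Data Mining versus OLAP slide notes, views like these show what is happening but do not by themselves predict or explain it.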

Data Mining Tasks
• Classification – example: Jr, Sr
• Estimation – example: household income
• Prediction – example: predict the average credit card balance transfer amount
• Affinity Grouping – example: people who buy X often buy Y also
• Clustering – similar to classification but with no predefined classes
• Description and Profiling – behavior begets an explanation such as “More guys prefer In-n-Out Burger than do gals.”

Automatic Cluster Detection

Automatic Cluster Detection
• DM techniques are used to find patterns in data
  – Not always easy to identify
    • No observable pattern
    • Too many patterns
• Automatic Cluster Detection is useful to find “better behaved” clusters of data within a larger dataset; seeing the forest without getting lost in the trees

Automatic Cluster Detection
• The K-means clustering algorithm depends on a geometric interpretation of the data
• Other automatic cluster detection (ACD) algorithms include:
  – Gaussian mixture models
  – Agglomerative clustering
  – Divisive clustering
  – Self-organizing maps (SOM) – Ch. 7 – Neural Nets
• ACD is a tool used primarily for undirected data mining
  – No preclassified training data set
  – No distinction between independent and dependent variables
• ACD is rarely used in isolation – other methods follow up

Clustering Examples
• “Star Power” ~ 1910 Hertzsprung-Russell
• Group of Teens
• 1990’s US Army – women’s uniforms:
  – 100 measurements for each of 3,000 women
  – Using the K-means algorithm, reduced to a handful

K-means Clustering
• “K” – circa 1967 – this algorithm looks for a fixed number of clusters, which are defined in terms of the proximity of data points to each other
• How K-means works (see next slide figures):
  – Algorithm selects K data points randomly
  – Assigns each of the remaining data points to one of K clusters (via perpendicular bisector)
  – Calculates the centroid of each cluster (uses averages in each cluster to do this)

K-means Clustering
[Figures illustrating the K-means steps]

K-means Clustering
• Resulting clusters describe underlying structure in the data; however, there is no one right description of that structure (Ex: Figure 11.6 – playing cards K=2, K=4)

Similarity & Difference
• Automatic Cluster Detection is quite simple for a software program to accomplish – data points, clusters mapped in space
• However, business data points are not about points in space but about purchases, phone calls, airplane trips, car registrations, etc., which have no obvious connection to the dots in a cluster diagram

Similarity & Difference
• Clustering business data requires some notion of natural association – records (data) in a given cluster are more similar to each other than to those in another cluster
• For DM software, this concept of association must be translated into some sort of numeric measure of the degree of similarity
• The most common translation is to convert data values (e.g., gender, age, product, etc.) into numeric values so they can be treated as points in space
• If two points are close in the geometric sense, then they represent similar data in the database

Similarity & Difference
• Business variable (field) types:
  – Categorical (e.g., mint, cherry, chocolate)
  – Ranks (e.g., freshman, soph, etc.)
  – Intervals (e.g., 56 degrees, 72 degrees, etc.)
  – True measures – interval variables that measure from a meaningful zero point
    • Fahrenheit, Celsius are not good examples
    • Age, weight, height, length, tenure are good examples
• From a geometric standpoint, the above variable types go from least effective to most effective (top to bottom)
• Finally, there are dozens/hundreds of published techniques for measuring the similarity of two data records

Evaluating Clusters
• What does it mean to say that a cluster is “good”?
  – Clusters should have members that have a high degree of similarity
  – The standard way to measure within-cluster similarity is variance* – the cluster with the lowest variance is considered best
  – Cluster size is also important, so an alternate approach is to use average variance**
* The sum of the squared differences of each element from the mean
** The total variance divided by the size of the cluster

Evaluating Clusters
• Finally, if detection identifies good clusters along with weak ones, it could be useful to set the good ones aside (for further study) and run the analysis again to see if improved clusters are revealed from only the weaker ones

But…
• Finding patterns is not enough
• Business (individuals) must:
  – Respond to the pattern(s) by taking action
  – Turn:
    • Data into Information
    • Information into Action
    • Action into Value

Data Mining’s Business Cycle
1. Identify the business opportunity*
2. Mining data to transform it into actionable information
3. Acting on the information
4. Measuring the results

1. Identify the Business Opportunity
• Many business processes are good candidates:
  – New product introduction
  – Direct marketing campaign
  – Evaluating the results of a test market
• Measurements from past DM efforts:
  – What types of customers responded to our last campaign?
  – Where do the best customers live?
  – What products should be promoted with our XYZ product?

2. Mining data to transform it into actionable information
• Success is making business sense of the data
• Numerous data “issues”:
  – Bad data formats (alpha vs numeric, missing, null, bogus data)
  – Confusing data fields (synonyms and differences)
  – Lack of functionality (“I wish I could…”)
  – Legal ramifications (privacy, etc.)
  – Organizational factors (unwilling to change “our ways”)
  – Lack of timeliness

3. Acting on the Information
• This is the purpose of Data Mining – with the hope of adding value
• What type of action?
  – Interactions with customers, prospects, suppliers
  – Modifying service procedures
  – Adjusting inventory levels
  – Consolidating
  – Expanding
  – Etc…

4. Measuring the Results
• Assesses the impact of the action taken
• Often overlooked, ignored, skipped
• Planning for the measurement should begin when analyzing the business opportunity, not after it is “all over”
• Assessment questions (examples):
  – Did this ____ campaign do what we hoped?
  – Did some offers work better than others?
  – Did these customers purchase additional products?
  – Tons of others…

What Does All of This Mean?
• On a regular basis, data miners utilize their data warehouses to give guidance for and/or answer a limitless variety of questions.
• Nothing is free, however, and the benefits do come with a cost.
• The value of a data warehouse and subsequent data mining is a result of the new and changed business processes it enables – competitive advantage also.
• There are limitations, though: a data warehouse cannot correct problems with its data, although it may help to identify them more clearly.
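
Example (a sketch, not part of the original slides): the K-means Clustering and Evaluating Clusters slides earlier in the deck describe the algorithm and the variance measures in words. The code below is one minimal way to write them down, assuming records have already been translated into numeric points. The function names, the `seed` and `max_iter` parameters, and the sample points are illustrative assumptions. The slides list three steps; repeating the assign and recompute steps until assignments stop changing is the usual way the algorithm is completed and is added here.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, clusters) for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick K data points at random
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers are final
        assignment = new_assignment
        # Step 3: move each center to the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = centroid(members)
    clusters = [
        [p for p, a in zip(points, assignment) if a == c] for c in range(k)
    ]
    return centers, clusters


def cluster_variance(members):
    """Sum of squared differences of each member from the cluster mean."""
    center = centroid(members)
    return sum(squared_distance(p, center) for p in members)


def average_variance(members):
    """Total variance divided by the size of the cluster."""
    return cluster_variance(members) / len(members)


# Hypothetical, already-numeric records forming two well-separated groups.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0),
          (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
centers, clusters = k_means(points, k=2)
for center, members in zip(centers, clusters):
    if members:
        print(center, members, average_variance(members))
```

On data this cleanly separated the result does not depend on which seed points are drawn. On real business data it often does, which is one reason the slides stress that there is no one right description of the structure.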