Data Mining and Data Visualization
SOM 485
Fall 2007

Getting Started
• What is Data Mining?
• Online Analytical Processing
• Data Mining Techniques
• Market Basket Analysis
• Limitations and Challenges to Data Mining
• Data Visualization
• Siftware Technologies

What is Data Mining (DM)?
• Group of activities used to find different patterns in data
• Information provided through a Data Warehouse
• Provides valuable information for different types of research

Applications of DM
• Customer Relationship Management (CRM) software is an application that can benefit from DM
• Activities of CRM
  - One-to-One Marketing
  - Sales Force Automation
  - Sales Campaign Management
  - Marketing Encyclopedia
  - Call Center Automation

Verification of DM
• Requires a lot of prior knowledge on the decision maker's part
• Used mainly in casinos, e.g. to determine whether a new customer is a high roller, a souvenir buyer, a ticket purchaser, etc.
• Uses Siftware to help discover new patterns of customer spending habits
• Allows effective targeting of a specific group of customers

Online Analytical Processing
• Online Analytical Processing (OLAP) was introduced by E. F. Codd in 1993
• OLAP: computer process that allows a user to extract data from different viewpoints
• Scientific and academic organizations store about 1 terabyte (1 trillion bytes) of new data each day

OLAP continued
Codd's 12 Rules for OLAP
1. Multidimensional View
2. Transparent to the User
3. Accessible
4. Consistent Reporting
5. Client-Server Architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-user Support
9. Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Infinite Levels of Dimension and Aggregation

OLAP: MOLAP & ROLAP
• OLAP data is stored in a Multidimensional Database (MDB)
• MOLAP: OLAP application that accesses data from a multidimensional database
• MDBs are frequently created using input from an existing relational database
• ROLAP: OLAP application that works against a relational database server through SQL, for portability and scalability

DATA MINING TECHNIQUES
FOUR MAJOR CATEGORIES
1. Classification
2. Association
3. Sequence
4. Cluster

CLASSIFICATION
- Mining processes intended to discover rules that define whether an item belongs to a particular class of data
- Two sub-processes:
  1) Building a Model
  2) Predicting Classifications

ASSOCIATION
- Techniques that employ association search all details from operational systems for patterns with a high probability of repetition
- Example: Market Basket Analysis

SEQUENCE
- Time series analysis methods relate events in time based on a series of preceding events
- Through analysis, various hidden trends, often highly predictive of future events, can be discovered
- Example: Mail Industry

CLUSTER
- Creates partitions so that all members of each set are similar according to some metric
- A cluster is simply a set of objects grouped together by virtue of their similarity or proximity to each other
- Example: Credit Card Transactions

DATA MINING TECHNOLOGIES
• Providing new answers to old questions
• Developing new knowledge and understanding through discovery
• Statistical Analysis: statistically evaluating products and making a decision based on logical reasoning
• Neural Networks: attempt to mirror the way the human brain works in recognizing patterns by developing mathematical structures with the ability to learn

DATA MINING TECHNOLOGIES (cont.)
• Genetic Algorithms and Fuzzy Logic: machine learning techniques that derive meaning from complicated and imprecise data and can extract patterns and detect trends that are far too complex to be noticed by humans
• Decision Trees: assist data mining applications by classifying items or events contained within the warehouse

NEW APPLICATIONS FOR DATA MINING
Two new categories of applications:
1) Text Mining: summarizes, navigates, and clusters documents contained in a database
2) Web Mining: integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer

Market Basket Analysis
• Market Basket Analysis (MBA) is an algorithm that examines a long list of transactions in order to determine which items are most frequently purchased together.
• It takes its name from the idea of a person in a supermarket throwing all of their items into a shopping cart (a "market basket").
• Market basket analysis is one of the most common and useful types of data analysis for marketing.
• With the data gathered from MBA, marketers can identify the products that customers like and group them together.
• Market basket analysis can improve the effectiveness of marketing and sales tactics.

Benefits of Market Basket Analysis:
• A good indication of consumer behavior
• Increase in sales
• Improves customer satisfaction
• Tracks which types of products interest the consumer and finds related alternatives to introduce to the consumer

ASSOCIATION RULES for MBA
• Support
• Confidence
• Lift
• Method

Association rules are a common undirected data mining technique and complement market basket analysis.
• These rules are unidirectional: the left-hand side IMPLIES the right-hand side.
• ex. Pasta IMPLIES Wine, but Wine IMPLIES Pasta may not hold.
• 40% of transactions that contain Pasta also contain Wine. 4% of transactions contain both of these items.

Support: percentage of baskets in which the association rule holds, i.e. baskets that contain both the left-hand side and the right-hand side.
• ex. 4% of transactions contain both Pasta and Wine

Confidence: probability that the right-hand side item is present given that the left-hand side item is present.
• ex. 40% of transactions that contain Pasta also contain Wine, so p = .40

Lift: compares the confidence of the rule with the likelihood of finding the right-hand side item in any random basket. It measures how well an association rule performs by comparing it with how well the item sells without the other item (improvement).

Method

                Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
Frozen Pizza         2           1      2         0            0
Milk                 1           3      1         1            1
Cola                 2           1      3         0            1
Potato Chips         0           1      0         1            0
Pretzels             0           1      1         0            2

Market basket analysis determines what products customers purchase together.

Limits to Market Basket Analysis
• A large amount of data is required to obtain meaningful results, but accuracy is compromised if the products do not all occur with similar frequency.
• ex. Milk sells in almost every transaction, but Elmer's glue sells sporadically; it is not effective to put them in the same basket analysis.
• Sometimes presents results that are actually due to the success of previous marketing campaigns.
• ex. Discounted price of cola with purchase of pizza.
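The support, confidence, and lift definitions above can be made concrete with a small Python sketch. The slides give only the co-occurrence counts, not the underlying transactions, so the five baskets below are an assumption: a hypothetical set chosen so that their counts reproduce the Method table. The function names `cooccurrence` and `rule_metrics` are likewise illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets (not given in the slides), chosen so that their
# counts reproduce the co-occurrence table on the Method slide.
BASKETS = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]


def cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket.

    The diagonal entry (item, item) is the number of baskets containing
    that item; pairs that never occur together read back as 0.
    """
    counts = Counter()
    for basket in baskets:
        for item in basket:
            counts[(item, item)] += 1
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def rule_metrics(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs IMPLIES rhs."""
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs in b)
    n_rhs = sum(1 for b in baskets if rhs in b)
    n_both = sum(1 for b in baskets if lhs in b and rhs in b)
    if n_lhs == 0 or n_rhs == 0:
        raise ValueError("both items must occur in at least one basket")
    support = n_both / n              # share of baskets with both items
    confidence = n_both / n_lhs       # P(rhs present | lhs present)
    lift = confidence / (n_rhs / n)   # confidence relative to P(rhs)
    return support, confidence, lift


counts = cooccurrence(BASKETS)
support, confidence, lift = rule_metrics(BASKETS, "Frozen Pizza", "Cola")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

With these assumed baskets, Frozen Pizza IMPLIES Cola has support 0.40, confidence 1.00, and lift of about 1.67, while the reverse rule Cola IMPLIES Frozen Pizza has a confidence of only 2/3, which illustrates why the rules are unidirectional.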
Using Data from MBA
• Once information has been gathered about different items and how they sell with respect to other items, a store may want to change its layout to improve profits.
• ex. Lunchboxes and School Supplies
• Businesses without an actual storefront may want to offer promotions for products that sell together, increasing sales.

MARKET BASKET ANALYSIS
In a Nutshell

Current Limitations and Challenges to Data Mining
• New and underdeveloped field
• Identification of missing information
• Most companies run legacy systems
  - Not DW (data warehouse) friendly
  - DW designers have to convert existing ODSs (operational data stores) to the homogeneous form of the DW

Current Limitations & Challenges to Data Mining
• Not all knowledge about an application domain is present in the data
  - Data in ODSs is normally limited to what is needed by the operational application associated with that DB
  - Data warehouse designers need to include mechanisms for "inventorying" data
• Data noise & missing values
  - Most operational databases contain errors in their data values and/or classification
  - Errors lead to misclassification
  - Future data mining systems must incorporate more sophisticated mechanisms for treating "noisy data"
  - Bayesian technique: a statistical technique
• Large databases & high dimensionality
  - Databases are large & dynamic
  - Contents are always changing
  - Data patterns must be constantly updated
  - New discovery applications have to partition problems into smaller chunks of manageable data without losing any essential attributes of the data

Data Visualization
• Process by which numerical data are converted into meaningful 3-D images
• Intended to analyze complex data
• Data from: satellite photos, sonar measurements, surveys, or computer simulations

History of Data Visualization
• Originated from statistics and science
• Example of 2-D
• Advancement credited to NCSA (National Center for Supercomputing Applications)
• Newest developments by Xerox PARC in virtual reality

Human Visual Perception
• Human visual cortex dominates our perception
• Accelerates the identification of hidden patterns in data
• "A picture is worth a thousand words"

Geographical Information Systems (GIS)
• A special-purpose DB in which a common spatial coordinate system is the primary means of reference
• Requires:
  1. Data input
  2. Data storage, retrieval, and query
  3. Data transformation, analysis, and modeling
  4. Data reporting
• Integrates information and aids in decision making

GIS continued
• Spatial Data: elements stored in map form
  Contain three basic components:
  1. Points
  2. Lines
  3. Polygons
• Attribute Data: describes spatial data

Example of GIS

Applications of Data Visualization Techniques
• Retail Banking
• Government
• Insurance
• Health Care and Medicine
• Telecommunications
• Transportation
• Capital Markets
• Asset Management

Siftware Technologies
• IBM
• Informix
• Red Brick
• DB2
• Oracle
• Silicon Graphics
• Sybase

• Offers several Data Mining solutions, depending on the user's needs
• IBM Information Warehouse Solutions
• IBM Visualizer

Red Brick
Informix
• Three-tier model
  - Tier 1: "Client" presentation layer
  - Tier 2: Hewlett-Packard hardware
  - Tier 3: Data layer
• INFORMIX-OnLine database

Sybase
• Warehouse WORKS
  - Assemble data from many sources
  - Transform data for a consistent and understandable view
  - Distribute data where needed
  - Provide high-speed access to the data

• Leading company for large-scale data mining
• Data spread across multiple databases
• Data spread across processors for faster queries
• Discover new patterns and trends that may not be realized using traditional SQL

• Three-dimensional Visualization
• Visual models can save days and even months from the review process

Review
• Data mining (DM)
• Techniques used to mine data
• Market Basket Analysis: The King of DM Algorithms

Review continued
• Current Limitations and Challenges to Data Mining
• Data Visualization
• Siftware Technologies