CS5344: Big Data Analytics Technology
http://www.comp.nus.edu.sg/~tankl/cs5344
TAN Kian-Lee
Professor, School of Computing
COM1, Room 03-23

“We could have gotten started a lot earlier. We simply weren’t stepping back and looking at how to use the data” – Brad Smith, Intuit

Big Data Analytics Technology

What is BIG Data
Gartner Definition
• Volume
• Velocity
• Variety
More than what you can handle
• Veracity
• Value
• …

The Social Layer in an Instrumented Interconnected World
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones world wide
• 100s of millions of GPS enabled devices sold annually
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014

Twitter Tweets per Second Record Breakers of 2011

Extract Intent, Life Events, Micro Segmentation Attributes
[Figure: examples annotated with the labels Name, Birthday, Family; Jobs; Location; Relocation; Wishful Thinking; Monetizable Intent; SPAMbots; Not Relevant - Noise]

Some Big Data Stats
If you like analogies… 35ZB = enough data to fill a stack of DVDs reaching halfway to Mars
[Chart: Amount of Stored Data By Sector (in Petabytes, 2009); bar values 966, 848, 715, 619, 434, 364, 269, 227]
Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity." US Bureau of Labor Statistics | McKinsey Global Institute Analysis
1 zettabyte? = 1 million petabytes = 1 billion terabytes = 1 trillion gigabytes

Is This Qualitatively Different?
• From the domain perspective – absolutely Yes
• From the technology perspective – could be Yes or No

Why BIG Data
• Can collect cheaply, due to automation
• Can store cheaply, due to falling media prices
• Realization that data was too valuable to delete!
Analytics
• Data mining
• The process of examining (large amounts of) data of a variety of types to uncover hidden patterns, unknown correlations and other useful and meaningful information
  – results in business benefits, such as more effective marketing and increased revenue
• Many “success” stories, where useful predictions were made with the data

Analytics: Small vs Big Data
Source: https://www.youtube.com/watch?v=jujE79yEu6Y&spfreload=1

The WRONG Picture!

The BIG Data Analytics Pipeline

Log Analytics
• Analyze the entire data center’s logs to identify global information and determine statistical correlations and advanced predictive analytics
• Improve availability and make effective long-term plans
[Figure (IBM Research – Almaden): log analytics pipeline with the stages Extraction, Import, Integration, Analysis and Interpretation; data artifacts include Raw Logs, Parsed Logs, Sessionized Logs, Lookup Table, Feature Vectors, Reduced Feature Vectors and Model]

Data Integration and Cleaning
Garbage in Garbage out
• The quality of results relates directly to quality of the data
• 50%-70% of analytics process effort is spent on data integration and cleaning
• Problems include: missing values, duplicate records, outliers, entity resolution

Clinical Dataset Example
0000988  C  4  F              08/30/1993  0  F…
0001521  C  4  M  01/01/1931  08/10/1993  0  F…
0001521  C  4  M  01/01/1931  08/10/1994  1  F…
0002027  C  4  M  01/19/1849  09/17/1993  1  F…
0233291     4  M  10/31/1951  08/27/1993  0  F…
0233983  C  4  M  05/10/1939  09/06/1995  0  F…
0233983  C  4  M  05/10/1939  09/06/1995  1  F…
0234044  C     F  05/10/1929  09/03/1993  0  F…
Issues highlighted in this table on the following slides: Missing Values, Outlier

Data Selection
• Generate a set of training examples
  – choose sampling method
  – consider sample complexity
• Reduce attribute dimensionality
  – remove redundant and/or correlated attributes
  – combine attributes (sum, multiply, difference)
• Reduce attribute value ranges
  – group symbolic discrete values
  – quantize continuous numeric values
• Transform data
  – de-correlate and normalize values
  – map time-series data to static representation

Cell Phone Dataset Example
• Time-series of calls for each of 3600 cellphone accounts

The BIG Data Analytics Pipeline

Interestingness of Patterns
• Interestingness criteria:
  – easily understood by humans
  – valid on new or test data with some degree of certainty
  – potentially useful
  – novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
  – Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  – Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, actionability, etc.

Visualization
One Picture is Worth 1000 Words!
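The "Reduce attribute value ranges" and "Transform data" steps listed under Data Selection can be illustrated with a small sketch. It assumes min-max scaling as the normalization and equal-width bins as the quantization; the slides name neither, so both are illustrative choices, and the function names and income figures are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values linearly into [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quantize(values: Sequence[float], bins: int) -> List[int]:
    """Map each continuous value to an equal-width bin index in 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


if __name__ == "__main__":
    incomes = [20, 35, 50, 80, 120]  # hypothetical incomes, in thousands
    print(min_max_normalize(incomes))
    print(quantize(incomes, bins=4))
```

Equal-width bins are easy to explain but sensitive to outliers; equal-frequency bins are a common alternative when the value distribution is skewed.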
Data Analytics: What Kind of Data?
• Relational databases
• Transactional databases
• XML databases
• Spatial databases
• Temporal databases
• Text databases and multimedia databases
• Graph databases

Technology
• Tools for data mining/analytics
• Technologies like MapReduce/Hadoop, and NoSQL databases
• Emphasizes
  – Scalability of number of features and instances
  – stress on algorithms and architectures
  – automation for handling large, heterogeneous data
[Figure: Data Analytics shown alongside Statistics, Database Technology and Machine Learning]

An Example

An Example – Rule 1

An Example – Rule 2

How reliable are these rules?
• For any given train, how confident are you that the answer is correct?
• Do we have enough data to construct a reliable rule? How many data points are enough?

How did you devise your rules?
• Did you…
  – Look for characteristics in one set but missing in the second set?
  – Examine several potential rules?
  – Consider simple rules first?
  – Reject potential rules that didn’t perform well?

This is data analytics …
The process of…
• Deciding how to describe the data and task (Task Specification and Data Representation)
• Identifying a rule (Search and Knowledge Representation)
• Estimating confidence (Evaluation Function)
• Applying the rule (Inference Technique)

Major Data Analytics
• Association Rule Mining
  – e.g. If a customer buys beer, he/she will most likely buy diapers
• Classification/Prediction
  – Is this a spam email?
  – Will this customer spend much with my company?
• Clustering
  – e.g. Help me to group the customers in my database into three groups according to their ages, incomes and expenses.

Association rule mining
Find frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Examples
Rule form: “Body → Head [support, confidence]”.
buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

Association: Application 1
• Marketing and Sales Promotion:
  – Let the rule discovered be {Bagels, … } --> {Potato Chips}
  – Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  – Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
  – Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

Association: Application 2
• Supermarket shelf management.
  – Goal: To identify items that are bought together by sufficiently many customers.
  – Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  – A classic rule --
    • If a customer buys diapers and milk, then he is very likely to buy beer.
    • So, don’t be surprised if you find six-packs stacked next to diapers!

Classification
Builds a model from the training set (model construction) and uses it to classify new data (model usage).
Examples
Rule form: “if Conditions then Class” [Confidence].
if (age > 20) and (loan = no) then risk = low (78%)
if (loan = yes) then risk = high (90%)

Classification: Application 1
• Direct Marketing
  • Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new product.
  • Approach:
    • Use the data for a similar product introduced before.
    • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      • Type of business, where they stay, how much they earn, etc.
    • Use this information as input attributes to learn a classifier model.
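The [support, confidence] pair attached to the association rules earlier in this part can be computed directly from transaction data. The sketch below uses the standard definitions: support is the fraction of transactions containing body and head together, and confidence is that count divided by the number of transactions containing the body. The function names and the five toy baskets are hypothetical, not taken from the slides.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               body: Iterable[str], head: Iterable[str]) -> float:
    """Estimate P(head | body): share of body-transactions that also hold head."""
    body_set, head_set = set(body), set(head)
    body_count = sum(1 for t in transactions if body_set <= t)
    if body_count == 0:
        return 0.0  # rule never fires; report 0 rather than divide by zero
    both_count = sum(1 for t in transactions if (body_set | head_set) <= t)
    return both_count / body_count


if __name__ == "__main__":
    # Hypothetical market baskets for illustration only.
    baskets = [
        {"diapers", "beer", "milk"},
        {"diapers", "beer"},
        {"diapers", "milk"},
        {"bread", "milk"},
        {"bread", "beer"},
    ]
    print(support(baskets, {"diapers", "beer"}))       # rule support
    print(confidence(baskets, {"diapers"}, {"beer"}))  # rule confidence
```

Scanning every transaction for every candidate rule is fine for a toy example; on large transaction databases, frequent-itemset algorithms such as Apriori prune the candidates first.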
Classification: Application 2
• Customer Attrition/Churn:
  • Goal: To predict whether a customer is likely to be lost to a competitor.
  • Approach:
    • Use detailed records of transactions with each of the past and present customers, to find attributes.
      • How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
    • Label the customers as loyal or disloyal.
    • Find a model for loyalty.

Classification: Training Dataset
This follows an example from Quinlan’s ID3
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Classification: Decision Tree Model
outlook?
  sunny → humidity?
    high → N
    normal → P
  overcast → P
  rain → wind?
    strong → N
    weak → P

Clustering
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
• Similarity Measures:
  – Euclidean Distance if attributes are continuous
  – Other problem-specific measures

Clustering: Application
• Market Segmentation:
  – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  – Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information.
    • Find clusters of similar customers.
    • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
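The clustering slides name Euclidean distance as a similarity measure but do not name an algorithm. As one possible illustration, the sketch below assumes k-means (assign each point to its nearest centroid, then recompute centroids) for the "find clusters of similar customers" step. The customer tuples (age, income in thousands) and the choice of starting centroids are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points: Sequence[Point],
           initial_centroids: Sequence[Point],
           max_iterations: int = 100) -> Tuple[List[int], List[Point]]:
    """Plain k-means: returns a cluster label per point and the final centroids."""
    centroids: List[Point] = [tuple(c) for c in initial_centroids]
    labels: List[int] = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical customers: (age, income in thousands).
    customers = [(25, 30), (27, 32), (24, 28), (52, 90), (55, 95), (50, 88)]
    labels, centroids = kmeans(customers, [customers[0], customers[3]])
    print(labels)
    print(centroids)
```

Because Euclidean distance mixes attribute scales, attributes such as age and income are usually normalized before clustering; otherwise the attribute with the largest range dominates the distance.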
Applications for BIG Data Analytics

Acquiring Better Customers
Source: https://www.youtube.com/watch?v=BfoJgoItd4M

Improving Customer Experience
Source: https://www.youtube.com/watch?v=BfoJgoItd4M

Analytics Can Help
• Credit card companies
  – Who should I offer my credit cards to?
  – How do I decide on the credit limit for each customer?
  – How do I fix the interest rate?
  – How do I identify fraud?
  – How do I predict bankruptcy?
• Retailers
  – How do I stock the products in order to maximize my profitability?
  – How do I market to my customers to maximize my appeal?
  – How do I price new and existing products in my store?
  – How do I design promotion strategies to maximize customer benefit?
• Telecom
  – How do I manage credit limits for my post paid connections?
  – How do I sell more value add services?
• Hotels
  – How do I manage room pricing and occupancy rates to maximize revenues?
  – How do I estimate the life time value of a customer?

When Analytics Does Not Work?

Other Aspects of BIG Data
• Bigger Data are not always Better Data
• “Big” will evolve/change
• Not all Data are equivalent
• Just because it is accessible doesn’t make it ethical
• Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap (a pattern that “fits”)

Summary

Is this a SPAM?

Clustering of S&P 500
• Observe stock movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
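The slide only says that two Stock-{UP/DOWN} events are more similar when they "frequently happen together on the same day". One way to make that concrete, assumed here for illustration, is the Jaccard coefficient over days: days on which both events occurred divided by days on which at least one occurred. The tickers AAA, BBB, CCC and the four trading days are hypothetical.

```python
from typing import List, Set


def cooccurrence_similarity(days: List[Set[str]], a: str, b: str) -> float:
    """Jaccard similarity of two events over the days on which they occur."""
    both = sum(1 for events in days if a in events and b in events)
    either = sum(1 for events in days if a in events or b in events)
    return both / either if either else 0.0


if __name__ == "__main__":
    # Each set holds the Stock-{UP/DOWN} events observed on one trading day.
    trading_days = [
        {"AAA-UP", "BBB-UP", "CCC-DOWN"},
        {"AAA-UP", "BBB-UP"},
        {"AAA-DOWN", "BBB-DOWN", "CCC-UP"},
        {"AAA-UP", "CCC-UP"},
    ]
    print(cooccurrence_similarity(trading_days, "AAA-UP", "BBB-UP"))
    print(cooccurrence_similarity(trading_days, "AAA-UP", "CCC-DOWN"))
```

Any clustering method that accepts a pairwise similarity (for example hierarchical clustering) could then group events that tend to move together.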