Big Data

World Cup Soccer
• German soccer team: 2014.07.05
• IoT + Big Data

What Is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Big Data Is Everywhere!
• Lots of data is being collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/credit card transactions
• Social networks

What Does Big Data Do?

Time of Big Data
• What is Big Data? http://www.youtube.com/watch?v=7D1CQ_LOizA
• The most popular big data application program is Hadoop. What is Hadoop? http://www.youtube.com/watch?v=9svSeWej1U

Evolution of Names
• Artificial Intelligence
• Machine Learning
• Business Intelligence
• Data Mining
• Big Data/Data Science

What Is Data Mining?
• Data mining (knowledge discovery in databases): a process of identifying hidden patterns and relationships within data (Groth)
• Data mining: extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases

DM and Business Decision Support
• Database marketing
  • Target marketing
  • Customer relationship management
• Credit scoring
• Clinical decision support
• Credit risk management
• Fraud detection
• Healthcare informatics

Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD process. Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Data Mining Software
• SAS Enterprise Miner (EM)
• Clementine for SPSS
• R
• Python

Government
• In 2012, the Obama administration announced the Big Data Research and Development Initiative, which explored how big data could be used to address important problems faced by the government. The initiative was composed of 84 different big data programs spread across six departments.
• Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign.
• The United States Federal Government owns six of the ten most powerful supercomputers in the world.
• The Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will be able to handle yottabytes of information collected by the NSA over the Internet.

Business
• Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 Amazon had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
• Facebook handles 50 billion photos from its user base.
• The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide.
• The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.
• Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.

Big Data in Google Trends

Big Data Case
• Movement of carts: product display

Wildfires in Korea (1991 – 2011)

Google Flu Service

Find a Location for Your Business

Crime Mapping in San Francisco: 71% accuracy

Evolution of Big Data
• Artificial Intelligence
• Data Mining
• Business Intelligence
• Big Data
• Business Analytics
• Data Science

Future Direction of Big Data
• Big data 2013
• Big data 2014

Google Glass
• Mashup, big data, visualisation -> analysis of commerce areas

IoT
• Key: smart & intelligence

3D Printer
• Healthy food, organ, face recommended?

A Case on Big Data (Association Rule Analysis)

Association Rule Analysis as an Example of a Data Mining Tool: Market Basket Analysis

What Is Association Mining?
• Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications: market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
  • Rule form: "Body → Head [support, confidence]"
  • buys(x, "cookie") → buys(x, "milk") [0.5%, 60%]

Support and Confidence
• Support
  • Percentage of samples that contain both A and B
  • support(A → B) = P(A ∩ B)
• Confidence
  • Percentage of samples containing A that also contain B
  • confidence(A → B) = P(B|A)
• Example
  • Sliced pork → lettuce [support = 2%, confidence = 60%]

A Store Selling Fruits and Vegetables
• Which items are frequently sold together?

An Example of Market Basket (1)
• There are 8 transactions on three items: A (apple), B (banana), and C (carrot).
• Check the associations for the two cases below.
(1) A (apple) → B (banana)

#   Basket
1   A
2   B
3   C
4   A, B
5   A, C
6   B, C
7   A, B, C
8   A, B, C

An Example of Market Basket (2)
• The basic probabilities are below:
(1) A → B
• Coverage: P(A) = 5/8 = 0.625
• Support: P(A∩B) = 3/8 = 0.375
• Confidence: P(B|A) = 3/5 = 0.6
• Lift: P(A∩B) / (P(A)*P(B)) = 0.375 / (0.625*0.625) = 0.375 / 0.391 ≈ 0.96
• Leverage: P(A∩B) - P(A)*P(B) = 0.375 - 0.391 ≈ -0.016

Lift
• What are good association rules? (How to interpret them?)
• If lift is close to 1, there is no association between the two items (sets).
• If lift is greater than 1, there is a positive association between the two items (sets).
• If lift is less than 1, there is a negative association between the two items (sets).

Leverage
• Leverage = P(A∩B) - P(A)*P(B); there are three cases:
① Leverage > 0: the two items (sets) are positively associated
② Leverage = 0: the two items (sets) are independent
③ Leverage < 0: the two items (sets) are negatively associated

Lab on Association Rules (1)
• SAS Enterprise Miner and SPSS Clementine both include association rule software.
• For this exercise, however, we use Magnum Opus.
• Download the Magnum Opus evaluation version.

• After you install the program, you will see the initial screen. From the menu, choose File – Import Data (Ctrl – O).

• Demo data sets are already there. Magnum Opus has two types of data sets available: transaction data (*.idi, *.itl) and attribute-value data (*.data, *.nam).
• Transaction data comes in two formats (*.idi, *.itl):

idi (identifier-item file)
001, apples
001, oranges
001, bananas
002, apples
002, carrots
002, lettuce
002, tomatoes

itl (item list file)
apples, oranges, bananas
apples, carrots, lettuce, tomatoes

• If you open tutorial.idi in Notepad, you can see the contents of the file, as shown on the left.
• The example on the left has 5 transactions (baskets).

• Choose File – Import Data, or click the corresponding toolbar button, then select Tutorial.idi.
• Check "Identifier – item file" and click Next >.

• Leave the other settings as they are.
• Search by: LIFT
• Minimum lift: 1
• Maximum no.
of rules: 10
• Click GO

• Results are saved in the tutorial.out file.
• Below is an example of a derived rule:
tomatoes -> lettuce [Coverage=0.263 (263); Support=0.111 (111); Strength=0.422; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]

Output from Association Rule Analysis
Only 55 rules satisfy the specified constraints.
tomatoes -> lettuce [Coverage=0.263 (263); Support=0.111 (111); Strength=0.422; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
lettuce -> tomatoes [Coverage=0.217 (217); Support=0.111 (111); Strength=0.512; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
tomatoes -> carrots [Coverage=0.263 (263); Support=0.085 (85); Strength=0.323; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
carrots -> tomatoes [Coverage=0.175 (175); Support=0.085 (85); Strength=0.486; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
onions -> potatoes [Coverage=0.189 (189); Support=0.082 (82); Strength=0.434; Lift=1.53; Leverage=0.0285 (28.5); p=5.30E-007]
potatoes -> onions [Coverage=0.283 (283); Support=0.082 (82); Strength=0.290; Lift=1.53; Leverage=0.0285 (28.5); p=5.30E-007]
lettuce & carrots -> tomatoes [Coverage=0.045 (45); Support=0.039 (39); Strength=0.867; Lift=3.30; Leverage=0.0272 (27.2); p=3.16E-008]
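
Appendix: checking the measures in Python
The coverage, support, confidence (called Strength in the output above), lift, and leverage values can be reproduced with a few lines of Python. This is only a minimal sketch and not how Magnum Opus computes its results: the function name rule_measures and the variable names are made up for illustration, and it assumes the rule body occurs in at least one basket. The first part recomputes the eight-basket apple/banana example exactly; the second part cross-checks the first rule of the output, taking P(lettuce) = 0.217 from the coverage of the "lettuce -> tomatoes" rule.

```python
from fractions import Fraction

# The eight baskets from the market basket example
# (A = apple, B = banana, C = carrot).
BASKETS = [
    {"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
    {"A", "B", "C"}, {"A", "B", "C"},
]


def rule_measures(baskets, body, head):
    """Return the basic measures for the rule body -> head.

    Hypothetical helper for illustration; assumes body occurs at least once.
    """
    n = len(baskets)
    body, head = set(body), set(head)
    # Fraction of baskets containing the body, the head, and both together.
    p_body = Fraction(sum(body <= b for b in baskets), n)
    p_head = Fraction(sum(head <= b for b in baskets), n)
    p_both = Fraction(sum((body | head) <= b for b in baskets), n)
    return {
        "coverage": p_body,                    # P(A)
        "support": p_both,                     # P(A∩B)
        "confidence": p_both / p_body,         # P(B|A)
        "lift": p_both / (p_body * p_head),    # P(A∩B) / (P(A)*P(B))
        "leverage": p_both - p_body * p_head,  # P(A∩B) - P(A)*P(B)
    }


measures = rule_measures(BASKETS, {"A"}, {"B"})
for name, value in measures.items():
    print(f"{name:<10} = {value} = {float(value):.4f}")

# Cross-check of "tomatoes -> lettuce" using the rounded figures in the output.
coverage, support, p_lettuce = 0.263, 0.111, 0.217
strength = support / coverage
lift = support / (coverage * p_lettuce)
leverage = support - coverage * p_lettuce
print(f"strength={strength:.3f} lift={lift:.2f} leverage={leverage:.4f}")
```

Exact fractions are used in the first part so that no rounding enters the result: lift comes out as 24/25 and leverage as -1/64, both indicating a slightly negative association between apples and bananas in this small example.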