Big Data and Its Technologies
CISC 6930 Data Mining
CISC 4631 Data Mining
Department of Computer and Information Science

What We Are Going to Learn
o What is Big Data?
o Characteristics of Big Data
o What To Do With The Data?
o What Technology Do We Have For Big Data?
o A Simple Big Data Mining Example
o Hadoop in the Wild
o Big Data in the Cloud

Imagine: You are working in a company. Tomorrow morning you go to your office and there’s a mail from your CEO regarding a new task:

  Dear <Your Name>,
  As you know, we are building a blogging platform, blogger2.com, and I need some statistics. I need to find out, across all blogs ever written on blogger.com, how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is'), and so on, up to how many times ten-character words occur.
  I know it's a really big job, so I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it's really important that I have this when I return.
  Good luck.
  Regards,
  The CEO
  P.S. One more thing: everything has to be done manually, except going to the blog and copy-pasting it into Notepad. I read somewhere that if you write programs, Google can find out about it.

Picture yourself in that position for a moment.
• You have 50,000 people to work for you for a week, and you need to find the number of one-character words, the number of two-character words, etc., covering the maximum number of blogs in BlogSpot.
• Finally, you need to give a report to your CEO with something like this:
  Occurrence of one-character words: around 937688399933
  Occurrence of two-character words: around 23388383830753434
  ... and so forth, up to ten-character words
• If homicide, suicide, or resigning from the job is not an option, how would you solve it?
• How would you avoid the chaos of so many people working?
• How will you coordinate so many people, since the output of one has to be merged with another's?

The Big Questions
o What is Big Data?
o What makes data “Big”?
o How do we manage very large amounts of data and extract value and knowledge from them?

What is Big Data?
o No single standard definition…
  “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

What is Big Data?
o Here is the definition from Wikipedia:
  Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”

Big Data Everywhere!
o Lots of data is being collected and warehoused
  • Web data, e-commerce
  • Purchases at department/grocery stores
  • Bank/credit card transactions
  • Social networks

How Much Data?
o Man on the moon with 32 KB (1969); my laptop had 8 GB RAM (2013)
o “640K ought to be enough for anybody.”
o Google collects 270 PB of data in a month (2007), 20 PB a day (2008)
o Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
o eBay has 6.5 PB of user data + 50 TB/day (5/2009)

How Much Data?
o 2.7 zettabytes of data exist in the digital universe today.
o 235 terabytes of data had been collected by the U.S. Library of Congress by April 2011.
o The Obama administration is investing $200 million in big data research projects.
o According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.
o By 2018, the U.S. will have 140,000 to 190,000 too few people with deep analytical skills to fill the demand of Big Data jobs.

We Are in a Knowledge Economy
o Data is an important asset to any organization
  • Discovery of knowledge
  • Enabling discovery
  • Annotation of data
o We are looking at newer
  • Programming models, and
  • Supporting algorithms and data structures
o NSF refers to it as “data-intensive computing” and industry calls it “big-data” and “cloud computing”

Characteristics of Big Data: 1-Scale (Volume/Scale)
o Data volume
  • 44x increase from 2009 to 2020
  • From 0.8 zettabytes to 35 zettabytes
o Data volume is increasing exponentially
  • Exponential increase in collected/generated data
o CERN’s Large Hadron Collider (LHC) generates 15 PB a year

The Earthscope
• The Earthscope is the world's largest science project.
• Designed to track North America's geological evolution.
• This observatory records data over 3.8 million square miles, amassing 67 terabytes of data.
• It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more.
  (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

o 12+ TBs of tweet data every day
o 25+ TBs of log data every day
o ? TBs of data every day
o 2+ billion people on the Web by end 2011
o 30 billion RFID tags today (1.3B in 2005)
o 4.6 billion camera phones worldwide
o 100s of millions of GPS-enabled devices sold annually
o 76 million smart meters in 2009… 200M by 2014

Characteristics of Big Data: 2-Complexity (Variety/Complexity)
o Various formats, types, and structures
o Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
o Static data vs. streaming data
o A single application can be generating/collecting many types of data
o To extract knowledge, all these types of data need to be linked together
o Types of data
  • Relational data (tables/transaction/legacy data)
  • Text data (Web)
  • Semi-structured data (XML)
  • Graph data: social network, Semantic Web (RDF), …
  • Streaming data: you can only scan the data once
  • Big public data (online, weather, finance, etc.)

A Single View to the Customer
[Figure: the customer at the center, linked to Banking, Finance, Social Media, Purchase, Entertain, Gaming, and Our Known History]

Real-Time Analytics/Decision Requirement
[Figure: ways to influence behavior, arranged around the customer]
o Product recommendations that are relevant & compelling
o Improving the marketing effectiveness of a promotion while it is still in play
o Learning why customers switch to competitors and their offers, in time to counter
o Preventing fraud as it is occurring & preventing more proactively
o Friend invitations to join a game or activity that expands business

Characteristics of Big Data: 3-Speed (Velocity)
o Data is being generated fast and needs to be processed fast
  • Online data analytics
  • Late decisions mean missed opportunities
o Examples
  • E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
  • Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction
o Real-time/fast data
  • Mobile devices (tracking all objects all the time)
  • Social media and networks (all of us are generating data)
  • Scientific instruments (collecting all sorts of data)
  • Sensor technology and networks (measuring all kinds of data)
o Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

3 Vs of Big Data
o The “BIG” in big data isn’t just about volume

Some Make It 4 V’s

Harnessing Big Data
o OLTP: Online Transaction Processing (DBMSs)
o OLAP: Online Analytical Processing (Data Warehousing)
o RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

What To Do With These Data?
o Aggregation and statistics
  • Data warehouse and OLAP
o Indexing, searching, and querying
  • Keyword-based search
  • Pattern matching (XML/RDF)
o Knowledge discovery
  • Data mining
  • Statistical modeling

Who’s Generating Big Data
o Mobile devices (tracking all objects all the time)
o Social media and networks (all of us are generating data)
o Scientific instruments (collecting all sorts of data)
o Sensor technology and networks (measuring all kinds of data)
o Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

The Model Has Changed…
o The model of generating/consuming data has changed
  • Old model: few companies are generating data, all others are consuming data
  • New model: all of us are generating data, and all of us are consuming data

The Evolution of Business Intelligence
[Figure: timeline across the 1990’s, 2000’s, and 2010’s, with Speed and Scale as the drivers]
o BI reporting: OLAP & data warehouse (Business Objects, SAS, Informatica, Cognos, other SQL reporting tools)
o Interactive Business Intelligence & in-memory RDBMS (QlikView, Tableau, HANA)
o Big Data: batch processing & distributed data store (Hadoop/Spark; HBase/Cassandra)
o Big Data: real time & single view (graph databases)

Value of Big Data Analytics
o Big data is more real-time in nature than traditional DW applications
o Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
o Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps

Challenges in Handling Big Data
o The bottleneck is in technology
  • New architecture, algorithms, and techniques are needed
o Also in technical skills
  • Experts in using the new technology and dealing with big data

Big Data Landscape
[Figure: landscape grouped into Apps, Data as a Service, Infrastructure, and Technology]
[Figure: Big Data Technology]

Hadoop/MapReduce Technology
o What is Hadoop and why does it matter?
  • Hadoop is the core platform for structuring Big Data. It also solves the problem of formatting it for analytic purposes.
  • Hadoop is an open-source software framework for structuring and storing data and running applications on clusters of commodity hardware.
  • Hadoop uses a distributed computing architecture consisting of many servers.
  • It has a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
o Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
  • Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
  • It was originally developed to support distribution for the Nutch search engine project.
  • The objective of the design is to answer a question: “How to process big data with reasonable cost and time?”
o Why is Hadoop important?
  • It is a flexible, scalable, and highly available architecture for distributed computation and data processing on a network of commodity hardware.

Let’s Have A Simple Big Data Mining Example
o Recall the CEO's mail from the opening: across all blogs ever written on blogger.com, count how many times one-character words occur, how many times two-character words occur, and so on up to ten-character words. All 50,000 employees will work with you for a week, and everything has to be done manually.
o Chapter 1: Picture yourself in that position for a moment.
  • You have 50,000 people for a week, and you owe the CEO a report with one count per word length, from one to ten.
  • How would you avoid the chaos of so many people working?
  • How will you coordinate so many people, since the output of one has to be merged with another's?

How to Mine the Data? Or How to Solve It

o Chapter 2: Proclamation: Let there be caste
  The next day, you stand with a mike on the dais before the 50,000 and proclaim: for a week, you will all be divided into groups:
  • The Mappers (tens of thousands of people will be in this group)
  • The Grouper (assume just one guy for now)
  • The Reducers (around 10 employees)
  • The Master (that’s you)
  Then you talk to each one of the groups.
o Chapter 3: Your talk with the Mappers
  • Each mapper will get a set of 50 blog URLs and a really big sheet of paper.
  • Each one of you needs to go to each of those URLs, and for each word in those blogs, write one line on the paper.
  • The format of that line should be the number of characters in the word, then a comma, and then the actual word. For example, if you find the word “a”, you write “1,a” on a new line of your paper, since the word “a” has only 1 character. If you find the word “hello”, you write “5,hello” on the next line.
  Each of you takes 4 days.
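The Mapper's instructions can be sketched in a few lines of Python. This is only an illustration, not part of the original exercise (the CEO insisted on manual work): the function name `map_words` is hypothetical, and treating a "word" as a run of ASCII letters is an assumption the slides do not specify.

```python
import re
from typing import Iterator


def map_words(text: str) -> Iterator[str]:
    """Yield one 'length,word' line per word, as a Mapper writes on the sheet."""
    # Assumption: a word is a run of ASCII letters; punctuation is ignored.
    for word in re.findall(r"[A-Za-z]+", text):
        yield f"{len(word)},{word}"


# A tiny stand-in for the text of one blog post.
sheet = list(map_words("a hello if"))
print(sheet)  # ['1,a', '5,hello', '2,if']
```

Each Mapper needs nothing but its own blogs to produce its sheet, which is why tens of thousands of them can work at the same time.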
After 4 days, your sheet might look like this:
  • “1,a”
  • “5,hello”
  • “2,if”
  • ... and a million more lines
At the end of the 4th day, each one of you will give your completely filled sheet to the Grouper.

o Chapter 4: Your talk with the Grouper
  • Someone gives you 10 papers. The first paper will be marked one, the second paper will be marked two, and so on, up to ten.
  • You collect the output from the mappers, and for each line in a mapper’s sheet: if it starts with “1,”, you write the word on sheet one; if it starts with “2,”, you write it on sheet two; and so on.
  • For example, if the first line of a mapper’s sheet says “1,a”, you write “a” on sheet 1. If it says “2,if”, you write “if” on sheet 2. If it says “5,hello”, you write “hello” on sheet 5.
  • So at the end of your work, the 10 sheets you have might look like this:
    Sheet 1: a, a, a, I, I, i, a, i, i, i, … millions more
    Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of, … millions more
    Sheet 3: the, the, and, for, met, bet, the, the, and, … millions more
    ...
    Sheet 10: ……
  • Once you are done, you distribute each sheet to one reducer. For example, sheet 1 goes to reducer 1, sheet 2 goes to reducer 2, and so on.
o Chapter 5: Your talk with the Reducers
  • Each one of you gets one sheet from the grouper. For each sheet, you count the number of words written on it and write the count in big bold letters on the back side of the paper.
  • For example, say you are reducer 2. You get sheet 2 from the grouper, which looks like this: “Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of …” You count the number of words on that sheet; say the number of words is 28838380044. You write it on the back side of the paper in big bold letters and give it to me (the master).
o Chapter 6: The controlled chaos and the climax
  • At the end of this process you have 10 sheets: sheet 1 has the count of words with 1 character on its back side, sheet 2 has the count of words with 2 characters on its back side, and so on.
  • It is done. Genius!
o Comments
  • You essentially did map reduce.
  • The greatest advantages of your approach were that the Mappers can work independently, the Reducers can work independently, and the Grouper can work really fast.
  • The process can be easily applied to other kinds of problems. In such a case:
    The work of the Master (dividing the work) and the Grouper (grouping the values by key, the value before the comma) remains the same. This is what any map-reduce library provides.
    The work of the Mappers and Reducers differs according to the problem. This is what you should write.

Hadoop/MapReduce Technology
[Figures: MapReduce]
[Figure: timeline marked 2003, 2004, 2006]
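The whole word-length job from the example can be sketched as a small single-machine Python program. This is a toy sketch under stated assumptions, not how Hadoop itself is implemented: the function names `mapper`, `grouper`, `reducer`, and `master` are hypothetical, words are assumed to be separated by whitespace with punctuation already removed, and the two short strings stand in for blog posts.

```python
from collections import defaultdict


def mapper(blog_text):
    """Mapper: emit a (word length, word) pair for every word in one blog."""
    # Assumption: words are whitespace-separated, punctuation already removed.
    return [(len(word), word) for word in blog_text.split()]


def grouper(pairs):
    """Grouper: copy each word onto the sheet for its key (the word length)."""
    sheets = defaultdict(list)
    for length, word in pairs:
        if 1 <= length <= 10:  # the CEO only asked about lengths one to ten
            sheets[length].append(word)
    return sheets


def reducer(words):
    """Reducer: count the words on a single sheet."""
    return len(words)


def master(blogs):
    """Master: give every blog to a mapper, group the output, reduce each sheet."""
    pairs = [pair for blog in blogs for pair in mapper(blog)]
    sheets = grouper(pairs)
    return {length: reducer(words) for length, words in sorted(sheets.items())}


blogs = ["a hello if", "I am of the view"]  # hypothetical stand-ins for blog posts
print(master(blogs))  # {1: 2, 2: 3, 3: 1, 4: 1, 5: 1}
```

To solve a different problem, only `mapper` and `reducer` would change; `master` and `grouper` stay the same, which matches the division of labor described in the Comments.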
Hadoop in the Wild
• Hadoop is in use at most organizations that handle big data:
  o Yahoo!
  o Facebook
  o Amazon
  o Netflix
  o etc…
• Some examples of scale:
  o Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search
• System requirements
  o High write throughput
  o Cheap, elastic storage
  o Low latency
  o High consistency (within a single data center is good enough)
  o Disk-efficient sequential and random read performance
• Classic alternatives
  o These requirements are typically met using a large MySQL cluster
  o Content on HDFS could be loaded into MySQL
• Problems with previous solutions
  o MySQL has low random write throughput… BIG problem for messaging!
  o Difficult to scale MySQL clusters rapidly while maintaining performance
  o MySQL clusters have high management overhead and require more expensive hardware

Typical Hadoop Cluster
[Figure: cluster diagram labeled with an aggregation switch and rack switches]
o 40 nodes/rack, 1000-4000 nodes in cluster
o 1 Gbps bandwidth within rack, 8 Gbps out of rack
o Node specs (Yahoo terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)

HBase
o HBase is an open-source, distributed, column-oriented database built on top of HDFS, based on BigTable!
o Designed to operate on top of the Hadoop Distributed File System (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.
o No real indexes
o Automatic partitioning
o Scales linearly and automatically with new nodes
o Commodity hardware
o Fault tolerance
o Batch processing

MongoDB
o MongoDB is a leading NoSQL solution
  • Free and open-source cross-platform document-oriented database program
  • Founded in 2007 by Dwight Merriman and Eliot Horowitz
  • Doubleclick, Oracle, Marklogic, HP
o It is:
  • General purpose: rich data model; full-featured indexes; sophisticated query language
  • Easy to use: easy mapping to object-oriented code; native language drivers in all popular languages; simple to set up and manage
  • Fast & scalable: operates at in-memory speed wherever possible; auto-sharding built in; dynamically add/remove capacity with no downtime

Big Data in the Cloud
o Why? The Web is replacing the desktop
o Paradigm shift in computing
o What is Cloud Computing?
  • Storing, processing, and accessing data and programs over the Internet instead of your computer's hard drive
  • IT resources provided as a service: compute, storage, databases, queues
  • Clouds leverage economies of scale of commodity hardware: cheap storage, high-bandwidth networks & multicore processors; geographically distributed data centers
  • Offerings from Microsoft, Amazon, Google, …
o Economics of cloud users
  • Pay by use instead of provisioning for peak
  [Figure: Resources vs. Time for a static data center and for a data center in the cloud, plotting Capacity and Demand, with unused resources marked. Slide credits: Berkeley RAD Lab]
o 2.7 ZB: global digital data
o 0.5 petabytes: two years of tweets
o 43% think that data analytics could be improved in their organization if data analytics cloud services was part of …
o 66% will or plan to use Big Data in the cloud
o Data Mining in the Cloud: 3 Reasons (Holger Kisker)
  • Skills: do you really need/want this all in-house?
  • Huge amounts of external data: does it make sense to move and manage all this data behind your firewall?
  • Focus on the value of your data instead of big data management.
o Data Mining in the Cloud: Another Reason
  Data Warehousing, Data Analytics & Decision Support Systems
  • Used to manage and control business
  • Transactional data: historical or point-in-time
  • Optimized for inquiry rather than update
  • Use of the system is loosely defined and can be ad hoc
  • Used by managers and analysts to understand the business and make judgments
o Data Analytics in the Cloud
  • Scalability to large data volumes:
    Scan 100 TB on 1 node @ 50 MB/sec = 23 days
    Scan on 1000-node cluster = 33 minutes
    Divide-and-conquer (i.e., data partitioning)
  • Cost-efficiency:
    Commodity nodes (cheap, but unreliable)
    Commodity network
    Automatic fault-tolerance (fewer administrators)
    Easy to use (fewer programmers)
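The scan-time figures quoted for Data Analytics in the Cloud (100 TB at 50 MB/sec per node) can be checked with a short calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), perfectly even data partitioning, and no coordination overhead; the helper name `scan_seconds` is hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte


def scan_seconds(total_bytes, bytes_per_sec_per_node, nodes=1):
    """Time to scan the data when it is split evenly across `nodes` machines."""
    return total_bytes / (bytes_per_sec_per_node * nodes)


one_node = scan_seconds(100 * TB, 50 * MB)
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)

print(f"1 node: {one_node / 86400:.1f} days")       # 1 node: 23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")    # 1000 nodes: 33.3 minutes
```

A real cluster would fall somewhat short of the ideal 1000x speedup because of stragglers, failures, and network traffic, which is why automatic fault tolerance appears on the same list.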