Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010

Big Data - What Is It?
Demetris Zeinalipour, Assistant Professor
Data Management Systems Laboratory, Department of Computer Science, University of Cyprus
http://dmsl.cs.ucy.ac.cy/
4th Architect Club Meeting, Tuesday, March 12, 2013, 8:45-14:00, Pralina, 31 Stasicratous Str., Nicosia, Cyprus

Objectives
• To provide an overview of the emerging field of Big Data Management from a wide range of perspectives:
  – Fundamentals / Trends, Industrial / Academic, Commercial / Open, Reality / Visionary, etc.
• I assume that the audience has a technical background (e.g., DBAs).
• Lots of examples and illustrations to keep this presentation entertaining and educational.

Demetris Zeinalipour, http://www.cs.ucy.ac.cy/~dzeina/

Talk Outline
• Big Data Definitions and Background
• Big Data Definition by 3V Examples
  – Velocity
    • Sensor Monitoring, Network Monitoring, Web2.0 Media, Smartphone Services
  – Volume
    • Text<Multimedia<Sciences, Web Data, Filesystems
  – Variety
    • The New Database Landscape
    • NoSQL (Document Stores, Replication, Consistency, MapReduce, Column Stores)
    • NewSQL Trends
• Big Data Education and Research
  – Courses @ UCY
  – Research Prototypes @ UCY

Big Data Definitions
• "Refers to data sets whose size and structure strains (stretches) the ability of commonly used relational DBMSs to capture, manage, and process the data within a tolerable elapsed time." – Hoffer, Ramesh, Topi: Modern Database Management, 11E, 2013.
• A similar definition from Wikipedia, Feb.
2013 – "big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."

Big Data Characteristics
• Size: from a few dozen terabytes to many petabytes in a single database.
• Data model: anything from structured (relational or tabular) to semi-structured (XML or JSON) or even unstructured (Web text and log files).
• Architectures: highly parallel and distributed in order to cope with the inherent I/O and CPU limitations.
• Hardware: mid-scale private clouds (datacenters), offering higher privacy, to large-scale public clouds.
• Functionality: operational (OLTP) and analytic (OLAP) functionality, stand-alone or as-a-Service.

Big Data Characteristics
2013 IEEE International Conference on Big Data (IEEE BigData 2013), October 6-9, 2013, Silicon Valley, CA, USA
Image: Wordle.net

Background: Public Clouds
Google's Datacenter in Oregon

Background: Public Clouds
Microsoft Azure in Chicago: 112 containers x 2000 servers = 224,000 servers

Background: *-as-a-Service
To Amazon RDS (Relational Database Service)
963 $ / year
27,165 $ / year

Background: Private Clouds
Our
Laboratory Private IaaS

Big Data: Velocity-Volume-Variety
• Velocity – how fast data is being produced and how fast the data must be processed to meet demand.
  • How to deal with torrents of data, in near-real time, streaming from RFID tags and smart metering systems?
  • How to identify fraud in 5 million trade events created each day?
  • Reacting quickly enough to deal with velocity is a challenge to most organizations.
Source: IDC. "Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO," September 2011.

Big Data: Velocity-Volume-Variety
• Volume
  – Past Challenge: Store data.
    • Transaction-based data stored through the years.
    • Sensor data being collected.
    • Integration with web applications & social media.
  – New Challenge: Create value from data.
    • Turn 12 TB of Tweets each day into a sentiment analysis (opinion mining) product.
      – e.g., People feel positive/negative/neutral about brand X.
    • Turn 350 billion annual smart meter readings into knowledge that helps predict power consumption.

Big Data: Velocity-Volume-Variety
• Variety:
  – By some estimates, 80 percent of an organization's data is not numeric!
  – Different data formats: unstructured, structured, semi-structured
    • text, sensor data, audio, video, click streams, log files, etc.
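The daily and annual totals quoted on the Velocity and Volume slides above can be turned into sustained per-second rates with simple arithmetic. The Python sketch below does only that. It assumes decimal units (1 TB = 10^12 bytes), a 365-day year and perfectly uniform arrival, none of which the slides state, and all names in it are illustrative.

```python
# Back-of-the-envelope rates for the figures quoted on the slides.
# Assumptions (not from the slides): decimal terabytes, 365-day year,
# uniform arrival over the whole period.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def per_second(total, period_seconds):
    """Average rate if `total` items arrive uniformly over the period."""
    return total / period_seconds


# 5 million trade events per day
trade_events_per_s = per_second(5_000_000, SECONDS_PER_DAY)
# 12 TB of Tweets per day, expressed in bytes per second
tweet_bytes_per_s = per_second(12 * 10**12, SECONDS_PER_DAY)
# 350 billion smart meter readings per year
meter_reads_per_s = per_second(350 * 10**9, SECONDS_PER_YEAR)

print(f"Trade events: {trade_events_per_s:,.0f} per second")
print(f"Tweets:       {tweet_bytes_per_s / 1e6:,.0f} MB per second")
print(f"Meter reads:  {meter_reads_per_s:,.0f} per second")
```

Real workloads are rarely uniform, so these averages are only a lower bound on the peak rate a system has to absorb.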

Velocity #1: Smart Meters
• Smart meter: records consumption of electric energy in intervals and communicates that information to the utility for monitoring and billing purposes.
Every 15m

Velocity #1: Smart Meters
• Ontario's Meter Data Management and Repository (MDM/R): storing, processing and managing all smart meter data in Ontario, Canada.
• Characteristics:
  – Provides hourly billing quantity and extensive reports.
  – 4.6 million smart meters.
    • Storage/Bandwidth: 4.6M meters x 0.5K message (typical HTTP) = 2.3 GB / round
  – 110 million meter reads per day
    • on an annual basis, exceeds the number of debit card transactions processed in the country (Canada!)
Source: Smart Metering Entity: http://www.smi-ieso.ca/mdmr

Velocity #2: Network Monitoring
• Akamai:
  – CDN serving 15-30% of all Web traffic (10TB/sec)
    • One out of every three Global 500® companies
    • All of the top Internet portals
  – Has a picture of the global traffic every 6 seconds
• How?
  – 119,000 servers in 80 countries within over 1,100 networks.
  – Servers report network health information (latency/loss) to a proprietary database every 6 seconds.
Figure labels: Proprietary DBMS, every 6 seconds, ping/traceroute

Velocity #2: Network Monitoring
Companies started seeking Big Data engineers.

Velocity #3: Web2.0 Media
• Analyze online conversations in Social Nets.
• Accelerated responses to marketplace shifts.
Continuously, over Web2.0 protocols.

Velocity #3: Web2.0 Media
Web1.0: The Unstructured Web
http://books.google.com/ (content in HTML, only apprehensible to the user)

Velocity #3: Web2.0 Media
Web2.0: The Semi-structured Web!
https://www.googleapis.com/books/v1/volumes?q=databases (content in XML/JSON, apprehensible to a computer)

Velocity #3: Web2.0 Media
Twitter API: https://twitter.com/users/dmslucy.json

Velocity #3: Web2.0 Media
In fact, Web2.0 Services are omnipresent! (Google, Twitter, Facebook, Youtube, Linkedin, …)
http://www.programmableweb.com/ - 7800 APIs!!! + 6800 Mashups!
https://code.google.com/apis

Velocity #4: Smartphone Services
Request Format (request.json)
    {
      "homeMobileCountryCode": 310,
      "homeMobileNetworkCode": 260,
      "radioType": "gsm",
      "carrier": "T-Mobile",
      "cellTowers": [
        {
          "cellId": 39627456,
          "locationAreaCode": 40495,
          "mobileCountryCode": 310,
          "mobileNetworkCode": 260,
          "age": 0,
          "signalStrength": -95
        }
      ],
      "wifiAccessPoints": [
        {
          "macAddress": "01:23:45:67:89:AB",
          "signalStrength": 8,
          "age": 0,
          "signalToNoiseRatio": -65,
          "channel": 8
        },
        {
          "macAddress": "01:23:45:67:89:AC",
          "signalStrength": 4,
          "age": 0
        }
      ]
    }
Response Format
The response format is also JSON.
    {
      "location": {
        "latitude": 51.0,
        "longitude": -0.1
      },
      "accuracy": 1200.4
    }
We will discuss some further in-house applications in a while.

Velocity #4: Smartphone Services
Wireless Data Transfer Rates
4G ITU peak rates:
• 100 Mbps (high mobility, such as trains and cars)
• 1 Gbps (low mobility, such as pedestrians and stationary users)
Plot courtesy of H. Kim, N. Agrawal, and C. Ungureanu, "Revisiting Storage for Smartphones", The 10th USENIX Conference on File and Storage Technologies (FAST'12), San Jose, CA, February 2012. *** Best Paper Award ***

Velocity #4: Smartphone Services
Mapping the road traffic by collecting WiFi signals, every 1 second.
Received Signal Strength (RSS): power present in a WiFi radio signal.
Graphics courtesy of: A. Thiagarajan et al., "VTrack: Accurate, Energy-Aware Road Traffic Delay Estimation using Mobile Phones", in SenSys'09, pages 85-98.
ACM (Best Paper). MIT's CarTel Group.

Volume #1: Text<Multimedia<Sciences
• From the TB-era to the PB-era.
Human Generated
  – The U.S. Library of Congress (April 2011): 235 TB
  – Ancestry.com: Genealogical data, 600 TB
Multimedia / Streaming
  – Games: World of Warcraft uses 1.3 PB of storage to maintain its game.
  – Internet Video: will account for 61% of total Internet data by 2015 (966 Exabytes or nearly 1 Zettabyte!)
Sciences / Sensors
  – Climate science: The German Climate Computing Centre (DKRZ) has a storage capacity of 60 PB of climate data.
  – Physics: The experiments in the Large Hadron Collider produce about 15 PB of data per year, which is distributed over the LHC Computing Grid (our department is part of EGEE, Enabling Grids for E-sciencE, now EGI, the European Grid Infrastructure).
Source: Petabyte, from Wikipedia: http://en.wikipedia.org/wiki/Petabyte

Volume #2: Web Data
Google Volume (in 2006)
IDC: The total amount of global data is expected to grow to 2.7 zettabytes during 2012. This is 48% up from 2011.
http://en.wikipedia.org/wiki/Zettabyte
Bigtable: A Distributed Storage System for Structured Data, OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006.

Volume #3: Big Data File Systems
• Big Data Filesystems: HDFS
  – Namespace lookups are fast (1 Master enough!) [ 1GB Metadata = 1PB Data ]
  – In NFS, metadata and transfers go through the same server => Not Scalable
  – HDFS is designed for unreliable hardware (2-3 failures / 1000 nodes / day)

Volume #3: Big Data File Systems
• Big Data Filesystems: How Big?
• Results from 2010: HDFS scalability: the limits to growth
  http://static.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

Variety #2: File Systems
NFS uses a Client/Server Architecture that is a single point of failure by default.
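The two HDFS rules of thumb quoted on the file system slides above (about 1 GB of metadata per 1 PB of data, and 2-3 node failures per 1000 nodes per day) lend themselves to a quick capacity estimate. The Python sketch below is only arithmetic on those two ratios. The cluster size and data volume in the example are hypothetical, and the function names are illustrative rather than part of any Hadoop API.

```python
# Rules of thumb taken from the slides.
METADATA_GB_PER_PB = 1.0                  # "1GB Metadata = 1PB Data"
FAILURES_PER_1000_NODES_PER_DAY = (2, 3)  # "2-3 failures / 1000 nodes / day"


def namenode_metadata_gb(data_pb):
    """Estimated master (namespace) metadata in GB for `data_pb` PB of data."""
    return data_pb * METADATA_GB_PER_PB


def expected_daily_failures(nodes):
    """(low, high) estimate of node failures per day for a cluster."""
    low, high = FAILURES_PER_1000_NODES_PER_DAY
    return (low * nodes / 1000, high * nodes / 1000)


# Hypothetical cluster, chosen only for illustration.
data_pb, nodes = 10, 4000
low, high = expected_daily_failures(nodes)
print(f"{data_pb} PB of data -> ~{namenode_metadata_gb(data_pb):.0f} GB of metadata")
print(f"{nodes} nodes -> {low:.0f} to {high:.0f} node failures per day")
```

At this ratio the namespace of a multi-petabyte cluster fits in the memory of one machine, which is the point behind "1 Master enough!", while a steady stream of daily failures explains why HDFS is designed for unreliable hardware.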

Variety Overview
451 Research, Matthew Aslett, http://goo.gl/GYcEx

Variety #1: NoSQL
• NoSQL ("not only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.
• NoSQL databases are NOT built primarily on tables, and generally DO NOT use SQL for data.
• NoSQL => Not Relational!
  – Key Value (e.g., BerkeleyDB (embedded), Oracle NoSQL (distributed))
  – Document Stores (e.g., JSON stores)
  – BigTables (i.e., Column-stores)
  – Graph Databases (e.g., FlockDB)
  – … potentially a much longer list, but I will only focus on a few trends

Variety #1: NoSQL / Document Stores
Document in CouchDB
Map Function
    // Emit one (document id, author) pair per author of the document.
    function(doc) {
      for (var i in doc.authors) {
        var author = doc.authors[i];
        emit(doc._id, author);
      }
    }
Results (through REST/HTTP or Futon)

Variety #1: NoSQL / Document Stores
For a real app we could envision much more complex queries.
http://rickosborne.org/download/SQL-to-MongoDB.pdf

Variety #1: NoSQL / Replication
Asynchronous Replication means Eventually Consistent

Variety #1: NoSQL / Consistency
SQL RDBMSs | (Most) NoSQL DBMSs

Variety #2: NoSQL / Map Reduce Analytics
• Map-Reduce: a programming model for processing large data sets (Not online like Warehouses).
• Invented by Google! "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
• Can be implemented in any language (Java example next)
• Hadoop: Apache's open-source software framework that supports data-intensive distributed applications
• Derived from Google's MapReduce + Google File System (GFS) papers.
• Enables applications to work with thousands of computation-independent computers and petabytes of data.
• Download: http://hadoop.apache.org/

Variety #2: NoSQL / Map Reduce Analytics
Count the distinct words in all documents
    cat *.txt | tr -s '[:space:]' '\n' | sort | uniq -c
1 TB on 1 PC = 2 hours!!!
1 TB on 100 PCs = 1 min!!!

Variety #2: NoSQL / Map Reduce Analytics
Example uses 1 mapper / 1 reducer only!
Map, Shuffle, Reduce

Variety #2: NoSQL / Map Reduce Analytics
Figure labels: HDFS blocks (64MB, containing documents), HDFS reading, hashing (of terms), local write, remote shuffling (e.g., socket), HDFS writing, standard output (e.g., socket)

Variety #3: NoSQL / Column Stores
• A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns rather than as rows, like most relational DBMSs.
Row-Store (OLTP workloads!)
    1,Smith,Joe,40000; 2,Jones,Mary,50000; 3,Johnson,Cathy,44000;
Column-Store (OLAP workloads!)
    1,2,3; Smith,Jones,Johnson; Joe,Mary,Cathy; 40000,50000,44000;
• Suggested for data warehouses, customer relationship management (CRM) systems and other ad-hoc inquiry systems where aggregates or scans are carried out over large numbers of similar data items.

Variety #3: NoSQL / Column Stores
All column family members are stored together on the big data filesystem.

Variety #4: NewSQL
• "NewSQL" is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees (i.e., offering transactions) of a traditional DBMS.
NewSQL = NoSQL + Transactions

Variety #4: NewSQL
Google's Trajectory
• (2003) Google GFS Paper (SOSP'03)
  – Objective: Create a Google-scale Filesystem
  – Apache HDFS is the open-source implementation of GFS.
• (2004) Google's Map-Reduce Paper (OSDI'04)
  – Objective: Enable big-data analytics over non-tabular data (e.g., XML or text) … with the assistance of GFS.
  – Apache's MapReduce: An open-source implementation of the paper
• (2006) Google BigTable Paper (OSDI'06)
  – Objective: Enable big-data analytics over tabular data (i.e., tables)
  – (2008) Apache's HBase: An open-source implementation of the paper
  – (2010): Facebook Messaging moves from Cassandra to HBase
• (2012) Google's F1 RDBMS (SIGMOD'12) & Spanner Storage Papers (OSDI'12)

Big Data Courses @ UCY
• NoSQL and NewSQL
  – Intro to Web2.0 & the JSON data interchange format, Key-Value data model & CouchDB.
  – Introduction & Fundamentals: I/O Performance, Replication Strategies, etc.
  – Big-data Filesystems: HDFS
  – "Big-Data" Analytics: Map-Reduce, Hadoop, PIG
  – Column Stores: BigTable, HBase
  – Intro to NewSQL (Spanner and F1)
Advanced Topics in Databases: http://www.cs.ucy.ac.cy/~dzeina/courses/epl646

Big Data Courses Elsewhere
• Data science incorporates varying elements and builds on techniques and theories from many fields, with the goal of extracting meaning from data and creating data products.
Data Science Combines the Following Fields:
• Math
• Statistics
• Data engineering
• Pattern recognition and learning
• Advanced computing
• Visualization
• Uncertainty modeling
• Data warehousing
• High performance computing

Big Data Courses Elsewhere
• Course Syllabus Example (Univ. of Washington):
  – Data modeling: relations, key-value, trees, graphs, images, text
  – Relational algebra and parallel query processing
  – NoSQL systems, key-value stores
  – Tradeoffs of SQL, NoSQL, and NewSQL systems
  – Algorithm design in Hadoop (and MapReduce in general)
  – Basic statistical analysis at scale: sampling, regression
  – Introduction to data mining: clustering, association rules, decision trees
  – Case studies in analytics: social networking, bioinformatics, text processing
• Free 10 week course: https://www.coursera.org/course/datasci/

Big Data Research @ UCY
• Crowdbeam: Build an innovative Windows Phone messaging platform for a Finnish alliance, backed by Microsoft & Nokia.
• Problem: Millions of users querying their K closest smartphones continuously.
  – Query executed every few seconds.
  – Currently state-less service
• Setup: A 14-node Couchbase cluster (i.e., a distributed, shared-nothing, document-oriented NoSQL database that is optimized for interactive applications).

Big Data Research @ UCY
CrowdBeam

Big Data Research @ UCY
Native JSON Store + JSON RESTful API

Big Data Research @ UCY
• Airplace: Build an innovative indoor localization & navigation platform for a Taiwanese company.
• Problem: Radiomaps of indoor environments are fairly large structures, especially as they become massively available.
• Setup: A 4-node Apache HBase cluster (i.e., a distributed, non-relational, shared-nothing database modeled after Google's BigTable and written in Java).
• Best Demo Award at IEEE MDM'12, covered on Euronews and local media.

Big Data Research @ UCY
SmartLab: Massive smartphone simulations with our first global open smartphone IaaS cloud – http://smartlab.cs.ucy.ac.cy/

Big Data - What Is It?
Thanks! Questions?
Demetris Zeinalipour, Assistant Professor
Data Management Systems Laboratory, Department of Computer Science, University of Cyprus
http://dmsl.cs.ucy.ac.cy/