Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010
Big Data - What Is It?
Demetris Zeinalipour
Assistant Professor
Data Management Systems Laboratory
Department of Computer Science
University of Cyprus
http://dmsl.cs.ucy.ac.cy/
EPL671: Research Methodologies in Computer Science,
Graduate Course, Tuesday, Mar 19th, 2013.
Objectives
• To provide an overview of the emerging
field of Big Data Management from a wide
range of perspectives:
– Fundamentals / Trends, Industrial / Academic,
Commercial / Open, Reality / Visionary, etc.
• I assume that the audience has a technical
background (e.g., DBAs)
• Lots of examples and illustrations to keep
this presentation entertaining and
educational.
Demetris Zeinalipour, http://www.cs.ucy.ac.cy/~dzeina/
Talk Outline
• Big Data Definitions and Background
• Big Data Definition by 3V Examples
  – Velocity
    • Sensor Monitoring, Network Monitoring, Web2.0 Media, Smartphone Services
  – Volume
    • Text<Multimedia<Sciences, Web Data, Filesystems
  – Variety
    • The New Database Landscape
    • NoSQL (Document Stores, Replication, Consistency, MapReduce, Column Stores)
    • NewSQL Trends
• Big Data Education and Research
  – Courses @ UCY
  – Research Prototypes @ UCY
Big Data Definitions
• "Refers to data sets whose size and
structure strains (stretches) the ability of
commonly used relational DBMSs to
capture, manage, and process the data
within a tolerable elapsed time."
– Hoffer, Ramesh, Topi: Modern Database
Management, 11E, 2013.
• Similar definition from Wikipedia, Feb. 2013:
– "big data is a collection of data sets so large
and complex that it becomes difficult to
process using on-hand database
management tools or traditional data
processing applications."
Big Data Characteristics
• Size: from a few dozen terabytes to many petabytes in a single database.
• Data model: anything from structured (relational or tabular) to semi-structured (XML or JSON) or even unstructured (Web text and log files).
• Architectures: highly parallel and distributed in order to cope with the inherent I/O and CPU limitations.
• Hardware: mid-scale private clouds (datacenters), offering higher privacy, to large-scale public clouds.
• Functionality: operational (OLTP) and analytic (OLAP) functionality, stand-alone or as-a-Service.
Big Data Characteristics
2013 IEEE International Conference on Big Data (IEEE BigData 2013),
October 6-9, 2013, Silicon Valley, CA, USA
Wordle.net
Background: Public Clouds
Google's Datacenter in Oregon
Background: Public Clouds
Microsoft Azure in Chicago
112 containers x 2000 servers = 224,000 servers
Background: *-as-a-Service
To Amazon RDS (Relational Database Service)
963$ / year
27,165 $ / year
Background: Private Clouds
Our Laboratory Private IaaS
Big Data: Velocity-Volume-Variety
• Velocity
– how fast data is being produced and how
fast the data must be processed to meet
demand.
  • How to deal with torrents of data, in near-real time, streaming from RFID tags and smart metering systems?
  • How to identify fraud in 5 million trade events created each day?
  • Reacting quickly enough to deal with velocity is a challenge to most organizations.
Source: IDC. "Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO," September 2011.
Big Data: Velocity-Volume-Variety
• Volume
– Past Challenge: Store data.
  • Transaction-based data stored through the years.
  • Sensor data being collected.
  • Integration with web applications & social media.
– New Challenge: Create value from data
  • Turn 12 TB of Tweets each day into a sentiment analysis (opinion mining) product.
    – e.g., People feel positive/negative/neutral about brand X.
  • Turn 350 billion annual smart meter readings into knowledge that helps predict power consumption.
Big Data: Velocity-Volume-Variety
• Variety:
  – By some estimates, 80 percent of an organization's data is not numeric!
  – Different data formats: unstructured, structured, semi-structured
    • text, sensor data, audio, video, click streams, log files, etc.
Talk Outline
• Big Data Definitions and Background
• Big Data Definition by 3V Examples
  – Velocity
    • Sensor Monitoring, Network Monitoring, Web2.0 Media, Smartphone Services
  – Volume
    • Text<Multimedia<Sciences, Web Data, Filesystems
  – Variety
    • The New Database Landscape
    • NoSQL (Document Stores, Replication, Consistency, File Systems, Map-Reduce, Column Stores)
    • NewSQL Trends
• Big Data Education and Research
  – Courses @ UCY
  – Research Prototypes @ UCY
Velocity #1: Smart Meters
• Smart meter: records consumption of electric energy in intervals and communicates that information to the utility for monitoring and billing purposes.
(Figure label: every 15 min)
Velocity #1: Smart Meters
• Ontario's Meter Data Management and Repository (MDM/R): storing, processing and managing all smart meter data in Ontario, Canada
• Characteristics:
  – Provides hourly billing quantity and extensive reports.
  – 4.6 million smart meters.
    • Storage/Bandwidth: 4.6M meters x 0.5K message (typical HTTP) = 2.3 GB / round
  – 110 million meter reads per day
    • On an annual basis, exceeds the number of debit card transactions processed in the country (Canada!)
Source: Smart Metering Entity: http://www.smi-ieso.ca/mdmr
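The storage and bandwidth figures above can be re-derived with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes "0.5K" means 500 bytes per message, and the function and variable names are illustrative.

```python
# Back-of-the-envelope check of the MDM/R figures quoted on the slide.
METERS = 4_600_000           # smart meters in Ontario (from the slide)
MESSAGE_BYTES = 500          # "0.5K" per message, read here as 500 bytes (assumption)
READS_PER_DAY = 110_000_000  # meter reads per day (from the slide)


def bytes_per_round(meters: int, message_bytes: int) -> int:
    """Data volume when every meter reports exactly once."""
    return meters * message_bytes


def reads_per_meter_per_day(reads: int, meters: int) -> float:
    """Average number of daily reads contributed by one meter."""
    return reads / meters


round_gb = bytes_per_round(METERS, MESSAGE_BYTES) / 1e9
daily_reads_per_meter = reads_per_meter_per_day(READS_PER_DAY, METERS)
annual_reads = READS_PER_DAY * 365

print(f"{round_gb:.1f} GB per round")
print(f"{daily_reads_per_meter:.1f} reads per meter per day")
print(f"{annual_reads / 1e9:.2f} billion reads per year")
```

Under these assumptions one reporting round is about 2.3 GB, and each meter contributes roughly 24 reads per day, which is consistent with the hourly billing quantity mentioned above.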
Velocity #2: Network Monitoring
• Akamai:
  – CDN serving 15-30% of all Web traffic (10TB/sec)
    • One out of every three Global 500® companies
    • All of the top Internet portals
  – Has a picture of the global traffic every 6 seconds
• How?
  – 119,000 servers in 80 countries within over 1,100 networks.
  – Servers report network health information (latency/loss) to a proprietary database every 6 seconds.
(Figure labels: Proprietary DBMS; every 6 seconds; ping/traceroute)
Velocity #2: Network Monitoring
Companies started seeking
Big data engineers.
Velocity #3: Web2.0 Media
• Analyze online conversations in Social Nets.
• Accelerated responses to marketplace shifts.
(Figure labels: continuously; over Web2.0 protocols)
Velocity #3: Web2.0 Media
Web1.0: The Unstructured Web
http://books.google.com/
(content in HTML, comprehensible only to the user)
Velocity #3: Web2.0 Media
Web2.0: The Semi-structured Web!
https://www.googleapis.com/books/v1/volumes?q=databases
(content in XML/JSON, comprehensible to a computer)
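To make the "comprehensible to a computer" point concrete, here is a minimal sketch of how a program might consume such a semi-structured response. No network call is made: SAMPLE_RESPONSE is a hypothetical, heavily trimmed payload, and its field names are an assumption about the response shape, not a verified copy of the service's output.

```python
import json
from typing import List
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/books/v1/volumes"  # URL from the slide


def build_query_url(term: str) -> str:
    """Build the query URL for a search term (no request is sent here)."""
    return f"{BASE_URL}?{urlencode({'q': term})}"


# Hypothetical, trimmed payload used instead of a live call (assumed shape).
SAMPLE_RESPONSE = """
{"totalItems": 2,
 "items": [
   {"volumeInfo": {"title": "Example Database Book A"}},
   {"volumeInfo": {"title": "Example Database Book B"}}
 ]}
"""


def extract_titles(payload: str) -> List[str]:
    """Parse the JSON text and pull out the title of every returned item."""
    data = json.loads(payload)
    return [item["volumeInfo"]["title"] for item in data.get("items", [])]


print(build_query_url("databases"))
print(extract_titles(SAMPLE_RESPONSE))
```

The same extraction against an HTML page would need screen scraping, which is the contrast the Web1.0 and Web2.0 slides draw.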
Velocity #3: Web2.0 Media
Twitter API
https://twitter.com/users/dmslucy.json
Velocity #3: Web2.0 Media
In fact, Web2.0 Services are omnipresent!
(Google, Twitter, Facebook, Youtube, Linkedin, …)
http://www.programmableweb.com/ - 7800 APIs!!! + 6800 Mashups!
https://code.google.com/apis
Velocity #4: Smartphone Services
Request Format (request.json)
{
  "homeMobileCountryCode": 310,
  "homeMobileNetworkCode": 260,
  "radioType": "gsm",
  "carrier": "T-Mobile",
  "cellTowers": [
    {
      "cellId": 39627456,
      "locationAreaCode": 40495,
      "mobileCountryCode": 310,
      "mobileNetworkCode": 260,
      "age": 0,
      "signalStrength": -95
    }
  ],
  "wifiAccessPoints": [
    {
      "macAddress": "01:23:45:67:89:AB",
      "signalStrength": 8,
      "age": 0,
      "signalToNoiseRatio": -65,
      "channel": 8
    },
    {
      "macAddress": "01:23:45:67:89:AC",
      "signalStrength": 4,
      "age": 0
    }
  ]
}
Response Format
The response format is also JSON.
{
  "location": {
    "latitude": 51.0,
    "longitude": -0.1
  },
  "accuracy": 1200.4
}
We will discuss some further in-house applications in a while.
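As a rough illustration of how a client could assemble such a request and read the answer, the sketch below builds the payload from observed radio measurements and parses a response string. The helper names are hypothetical, the field names are copied from the slide, and no real service is contacted.

```python
import json
from typing import Dict, List, Tuple


def build_request(cell_towers: List[Dict], wifi_access_points: List[Dict]) -> str:
    """Serialize observed cell towers and WiFi access points into a request body."""
    payload = {
        "homeMobileCountryCode": 310,
        "homeMobileNetworkCode": 260,
        "radioType": "gsm",
        "carrier": "T-Mobile",
        "cellTowers": cell_towers,
        "wifiAccessPoints": wifi_access_points,
    }
    return json.dumps(payload)


def parse_response(body: str) -> Tuple[float, float, float]:
    """Return (latitude, longitude, accuracy) from a response body."""
    data = json.loads(body)
    location = data["location"]
    return location["latitude"], location["longitude"], data["accuracy"]


request_body = build_request(
    cell_towers=[{"cellId": 39627456, "signalStrength": -95}],
    wifi_access_points=[{"macAddress": "01:23:45:67:89:AB", "signalStrength": 8}],
)
# Response text taken from the slide's example values.
response_body = '{"location": {"latitude": 51.0, "longitude": -0.1}, "accuracy": 1200.4}'
position = parse_response(response_body)
print(position)
```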
Velocity #4: Smartphone Services
Wireless Data Transfer Rates
4G ITU peak rates:
•100 Mbps (high mobility,
such as trains and cars)
•1Gbps (low mobility,
such as pedestrians and
stationary users)
Plot Courtesy of H. Kim, N. Agrawal, and C. Ungureanu, "Revisiting Storage for
Smartphones", The 10th USENIX Conference on File and Storage Technologies
(FAST'12), San Jose, CA, February 2012. *** Best Paper Award ***
Velocity #4: Smartphone Services
Mapping the Road traffic by collecting WiFi signals.
Every 1
second
Received Signal Strength (RSS):
power present in WiFi radio signal
Graphics courtesy of: A. Thiagarajan et al., "VTrack: Accurate, Energy-Aware Road Traffic Delay
Estimation using Mobile Phones", in SenSys'09, pages 85-98, ACM (Best Paper), MIT's CarTel Group
Talk Outline
• Big Data Definitions and Background
• Big Data Definition by 3V Examples
  – Velocity
    • Sensor Monitoring, Network Monitoring, Web2.0 Media, Smartphone Services
  – Volume
    • Text<Multimedia<Sciences, Web Data, Filesystems
  – Variety
    • The New Database Landscape
    • NoSQL (Document Stores, Replication, Consistency, File Systems, Map-Reduce, Column Stores)
    • NewSQL Trends
• Big Data Education and Research
  – Courses @ UCY
  – Research Prototypes @ UCY
Volume #1: Text<Multimedia<Sciences
(Figure labels: Human Generated; Multimedia/Streaming; Sciences/Sensors)
• From the TB-era to the PB-era.
  – The U.S. Library of Congress (April 2011): 235 TB
  – Ancestry.com: Genealogical data, 600 TB
  – Games: World of Warcraft uses 1.3 PB of storage to maintain its game.
  – Internet Video: will account for 61% of total Internet Data by 2015 (966 Exabytes or nearly 1 Zettabyte!)
  – Climate science: The German Climate Computing Centre (DKRZ) has a storage capacity of 60 PB of climate data.
  – Physics: The experiments in the Large Hadron Collider produce about 15 PB of data per year, which is distributed over the LHC Computing Grid (our department is part of EGEE, Enabling Grids for E-sciencE, now EGI, the European Grid Infrastructure).
Source: Petabyte, from Wikipedia: http://en.wikipedia.org/wiki/Petabyte
Volume #2: Web Data
Google Volume (in 2006)
IDC: The total amount of global data is expected to grow to 2.7 zettabytes during
2012. This is 48% up from 2011. http://en.wikipedia.org/wiki/Zettabyte
Bigtable: A Distributed Storage System for Structured Data,
OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle,
WA, November, 2006.
Volume #3: Big Data File Systems
• Big Data Filesystems: HDFS
  – Namespace lookups are fast (1 Master is enough!) [1 GB metadata = 1 PB data]
  – In NFS, metadata and transfers go through the same server => not scalable
  – HDFS is designed for unreliable hardware (2-3 failures / 1000 nodes / day)
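The two rules of thumb quoted above (about 1 GB of master metadata per 1 PB of data, and 2-3 failures per 1000 nodes per day) scale linearly, so a quick sizing sketch is straightforward. The cluster sizes used below are made-up inputs for illustration only.

```python
from typing import Tuple


def metadata_gb(data_pb: float, gb_per_pb: float = 1.0) -> float:
    """Master metadata needed for a given data volume, using the slide's 1 GB : 1 PB ratio."""
    return data_pb * gb_per_pb


def expected_failures_per_day(nodes: int, low: float = 2.0, high: float = 3.0) -> Tuple[float, float]:
    """Range of expected daily node failures, using the slide's 2-3 per 1000 nodes per day."""
    scale = nodes / 1000
    return scale * low, scale * high


# Hypothetical cluster: 4000 nodes holding 10 PB.
print(metadata_gb(10))                  # metadata in GB for 10 PB of data
print(expected_failures_per_day(4000))  # (low, high) failures per day
```

Even a modest cluster therefore sees several node failures every day, which is why the filesystem treats failure as the normal case.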
Volume #3: Big Data File Systems
• Big Data Filesystems: How Big?
• Results from 2010:
HDFS scalability: the limits to growth
http://static.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
Variety #2: File Systems
NFS uses a client/server architecture in which the server is a single point of failure by default.
Talk Outline
• Big Data Definitions and Background
• Big Data Definition by 3V Examples
  – Velocity
    • Sensor Monitoring, Network Monitoring, Web2.0 Media, Smartphone Services
  – Volume
    • Text<Multimedia<Sciences, Web Data, Filesystems
  – Variety
    • The New Database Landscape
    • NoSQL (Document Stores, Replication, Consistency, File Systems, Map-Reduce, Column Stores)
    • NewSQL Overview (ACID-compliant NoSQL stores)
• Big Data Teaching and Research
  – Courses @ UCY
  – Research Prototypes @ UCY
Variety Overview
451 Research, Matthew Aslett, http://goo.gl/GYcEx
Variety #1: NoSQL
• NoSQL ("not only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.
• NoSQL databases are NOT built primarily on tables, and generally DO NOT use SQL for data manipulation.
• NoSQL => Not Relational!
  – Key Value (e.g., BerkeleyDB (embedded), Oracle NoSQL (distributed))
  – Document Stores (e.g., JSON stores)
  – BigTables (i.e., Column-stores)
  – Graph Databases (e.g., FlockDB)
  – … potentially a much longer list, but I will only focus on a few trends
Variety #1: NoSQL / Document Stores
Document in CouchDB
Map Function
function(doc) {
  // Emit one (document id, author) row per author of the document.
  for (var i in doc.authors) {
    var author = doc.authors[i];
    emit(doc._id, author);
  }
}
Results (through REST/HTTP or Futon)
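To show what rows such a map function produces, the following sketch replays the same logic in plain Python over two made-up documents. It is a simulation only: it does not talk to CouchDB, the sample documents are hypothetical, and sorting the rows is a simplification of how a view orders its results by key.

```python
from typing import Dict, Iterator, List, Tuple


def map_authors(doc: Dict) -> Iterator[Tuple[str, str]]:
    """Python counterpart of the slide's map function: yield (doc id, author)."""
    for author in doc.get("authors", []):
        yield doc["_id"], author


def build_view(docs: List[Dict]) -> List[Tuple[str, str]]:
    """Apply the map function to every document and order the rows by key."""
    rows = [row for doc in docs for row in map_authors(doc)]
    return sorted(rows)


# Hypothetical documents, standing in for the "Document in CouchDB" figure.
documents = [
    {"_id": "book2", "title": "Second Book", "authors": ["Author C"]},
    {"_id": "book1", "title": "First Book", "authors": ["Author A", "Author B"]},
]

view_rows = build_view(documents)
for key, value in view_rows:
    print(key, value)
```

Each document contributes as many rows as it has authors, so the view acts as a flattened (id, author) index over nested documents.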
Variety #1: NoSQL / Document Stores
For a real app we could envision much more
complex queries.
http://rickosborne.org/download/SQL-to-MongoDB.pdf
Variety #1: NoSQL / Replication
Asynchronous replication means eventually consistent.
(Figure labels: Asynchronous; Asynchronous)
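A toy model helps to see why asynchronous replication yields eventual consistency: the primary acknowledges a write before the replicas have received it, so a replica can briefly serve stale data. The classes below are purely illustrative and do not model any particular NoSQL product.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class Replica:
    """A follower that only learns about writes when the primary ships them."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}


class Primary:
    """Accepts writes immediately and replicates them later (asynchronously)."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.data: Dict[str, Any] = {}
        self.replicas = replicas
        self.pending: Deque[Tuple[str, Any]] = deque()

    def write(self, key: str, value: Any) -> None:
        self.data[key] = value              # acknowledged right away
        self.pending.append((key, value))   # shipped to replicas later

    def replicate(self) -> None:
        """Drain the replication queue; afterwards all copies agree."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replicas = [Replica(), Replica()]
primary = Primary(replicas)

primary.write("x", 1)
stale_read = replicas[0].data.get("x")   # replica has not seen the write yet
primary.replicate()
fresh_read = replicas[0].data.get("x")   # replica has caught up

print(stale_read, fresh_read)
```

With synchronous replication the write would not be acknowledged until `replicate()` had finished, trading latency for consistency.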
Variety #1: NoSQL / Consistency 
SQL RDBMSs
(Most) NoSQL DBMSs
Variety #2: NoSQL / Map Reduce Analytics
• Map-Reduce: a programming model for processing large data sets (not online, similar to warehouses).
• Invented by Google! "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
• Can be implemented in any language (Java example next)
• Hadoop: Apache's open-source software framework that supports data-intensive distributed applications
• Derived from Google's MapReduce + Google File System (GFS) papers.
• Enables applications to work with thousands of computationally independent computers and petabytes of data.
• Download: http://hadoop.apache.org/
Variety #2: NoSQL / Map Reduce Analytics
Count the distinct words in all documents
cat *.txt | tr -s '[:space:]' '\n' | sort | uniq -c
1 TB on 1 PC = 2 hours!!!
1 TB on 100 PCs = 1 min!!!
Variety #2: NoSQL / Map Reduce Analytics
Example uses 1 mapper / 1 reducer only!
(Figure labels: Map; Shuffle; Reduce)
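The three phases named on this slide (map, shuffle, reduce) can be sketched for the word-count task in a single process. This is a didactic stand-in, not Hadoop code: all function names are illustrative, and `zlib.crc32` is merely one deterministic hash chosen here to assign words to reducers.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def partition(word: str, num_reducers: int) -> int:
    """Pick the reducer responsible for a word by hashing the word."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle(pairs: List[Tuple[str, int]], num_reducers: int) -> List[Dict[str, List[int]]]:
    """Shuffle: group all counts of the same word at the same reducer."""
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in pairs:
        buckets[partition(word, num_reducers)][word].append(count)
    return buckets


def reduce_phase(bucket: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the grouped counts of each word."""
    return {word: sum(counts) for word, counts in bucket.items()}


def word_count(documents: List[str], num_reducers: int = 2) -> Dict[str, int]:
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    result: Dict[str, int] = {}
    for bucket in shuffle(pairs, num_reducers):
        result.update(reduce_phase(bucket))
    return result


counts = word_count(["big data big clusters", "data stores"])
print(sorted(counts.items()))
```

Because every occurrence of a word hashes to the same reducer, each reducer can finish its sums without talking to the others, which is what lets the work spread across many machines.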
Variety #2: NoSQL / Map Reduce Analytics
(Figure: MapReduce data flow. Recoverable labels: HDFS blocks (64MB, containing documents); HDFS Reading; Hashing; Local; Remote; Write (e.g., Socket); Shuffling (of terms); HDFS Writing; Standard Output (e.g., socket))
Variety #3: NoSQL / Column Stores
• A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns rather than as rows, as most relational DBMSs do.
Row-Store (OLTP workloads!)
1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;
Column-Store (OLAP workloads!)
1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
• Suggested for data warehouses, customer
relationship management (CRM) systems and other
ad-hoc inquiry systems where aggregates or scans
are carried out over large numbers of similar data
items
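The row-store and column-store layouts shown above hold the same table; only the grouping differs. A small sketch, using the slide's three records and hypothetical column names, shows the conversion and why an aggregate touches a single column in the columnar layout.

```python
from typing import Dict, List, Sequence, Tuple

# The three records from the slide, stored row by row.
rows: List[Tuple[int, str, str, int]] = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Column names are not given on the slide; these are assumptions for illustration.
COLUMNS = ("id", "last_name", "first_name", "salary")


def to_column_store(records: Sequence[Tuple], names: Sequence[str]) -> Dict[str, List]:
    """Transpose row-oriented records into one list per column."""
    return {name: list(values) for name, values in zip(names, zip(*records))}


columns = to_column_store(rows, COLUMNS)

# OLAP-style aggregate: reads only the "salary" column.
total_salary = sum(columns["salary"])

# OLTP-style lookup of one full record: must touch every column.
second_record = tuple(columns[name][1] for name in COLUMNS)

print(columns["last_name"])
print(total_salary)
print(second_record)
```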
Variety #3: NoSQL / Column Stores
All column family members are stored together on the
big data filesystem.
Variety #4: NewSQL
• "NewSQL" is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees (i.e., offering transactions) of a traditional DBMS.
NewSQL = NoSQL + Transactions
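What "offering transactions" buys can be shown with a tiny atomicity example. SQLite (via Python's standard `sqlite3` module) is used here only because it is convenient; it is a single-node embedded database, not a NewSQL system, and the table and account names are made up.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money atomically: either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

ok = transfer(conn, "alice", "bob", 60)      # succeeds
failed = transfer(conn, "alice", "bob", 60)  # would overdraw, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer leaves no trace, not even the debit that ran before the check. Providing this guarantee across many machines at NoSQL-like scale is the problem NewSQL systems set out to solve.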
Variety #4: NewSQL
Google's Trajectory
• (2003) Google GFS Paper (SOSP'03)
– Objective: Create a Google-scale Filesystem
– Apache HDFS is an open-source implementation of GFS.
• (2004) Google's Map-Reduce Paper (OSDI'04)
– Objective: Enable big-data analytics over non-tabular data (e.g.,
XML or text) … with the assistance of GFS.
– Apache's MapReduce: An open source implementation of the paper
• (2006) Google BigTable Paper (OSDI'06)
– Objective: Enable big-data analytics over tabular data (i.e., tables)
– (2008) Apache's HBase: An open-source implementation of the paper
– (2010): Facebook Messaging moves from Cassandra to HBase
• (2012) Google's F1 RDBMS (SIGMOD'12) & Spanner Storage Papers
(OSDI'12)
Talk Outline
• Big Data Definitions and Background
• Big Data Definition by 3V Examples
  – Velocity
    • Sensor Monitoring, Network Monitoring, Web2.0 Media, Smartphone Services
  – Volume
    • Text<Multimedia<Sciences, Web Data, Filesystems
  – Variety
    • The New Database Landscape
    • NoSQL (Document Stores, Replication, Consistency, File Systems, Map-Reduce, Column Stores)
    • NewSQL Overview (ACID-compliant NoSQL stores)
• Big Data Education and Research
  – Courses @ UCY
  – Research Prototypes @ UCY
Big Data Courses @ UCY
• NoSQL and NewSQL
  – Intro to Web2.0 & the JSON data interchange format, Key-Value data model & CouchDB.
  – Introduction & Fundamentals: I/O Performance, Replication Strategies, etc.
  – Big-data Filesystems: HDFS
  – "Big-Data" Analytics: Map-Reduce, Hadoop, PIG
  – Column Stores: BigTable, HBase
  – Intro to NewSQL (Spanner and F1)
Advanced Topics in Databases
http://www.cs.ucy.ac.cy/~dzeina/courses/epl646
Big Data Courses Elsewhere
• Data science incorporates varying elements and builds on techniques and theories from many fields, with the goal of extracting meaning from data and creating data products.
Data Science Combines the Following Fields:
• Math
• Statistics
• Data engineering
• Pattern recognition and learning
• Advanced computing
• Visualization
• Uncertainty modeling
• Data warehousing
• High performance computing
Big Data Courses Elsewhere
• Course Syllabus Example (Univ. of Washington):
  – Data modeling: relations, key-value, trees, graphs, images, text
  – Relational algebra and parallel query processing
  – NoSQL systems, key-value stores
  – Tradeoffs of SQL, NoSQL, and NewSQL systems
  – Algorithm design in Hadoop (and MapReduce in general)
  – Basic statistical analysis at scale: sampling, regression
  – Introduction to data mining: clustering, association rules, decision trees
  – Case studies in analytics: social networking, bioinformatics, text processing
• Free 10 week course: https://www.coursera.org/course/datasci/
Big Data Research @ UCY
• Crowdbeam: Build an innovative Windows
Phone messaging platform for a Finnish
alliance, backed by Microsoft & Nokia.
• Problem: Millions of users querying their K
closest smartphones continuously.
– Query executed every few seconds.
– Currently state-less service
• Setup: A 14-node Couchbase cluster (i.e., a distributed, shared-nothing, NoSQL document-oriented database that is optimized for interactive applications).
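The query in the problem statement above ("K closest smartphones") can be stated precisely with a brute-force sketch. This is not the Crowdbeam implementation: it scans every device on each query, assumes planar (x, y) coordinates, and uses made-up device identifiers, so it only pins down what the answer should be.

```python
import heapq
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def k_closest(query: Point, devices: Dict[str, Point], k: int) -> List[str]:
    """Return the ids of the k devices nearest to the query point (linear scan)."""

    def distance(item: Tuple[str, Point]) -> float:
        _, (x, y) = item
        return math.hypot(x - query[0], y - query[1])

    return [device_id for device_id, _ in heapq.nsmallest(k, devices.items(), key=distance)]


# Hypothetical device positions.
devices = {
    "a": (0.0, 1.0),
    "b": (5.0, 5.0),
    "c": (1.0, 1.0),
    "d": (0.5, 0.0),
}

nearest = k_closest((0.0, 0.0), devices, 2)
print(nearest)
```

A full scan per query costs time proportional to the number of devices, so with millions of users re-issuing the query every few seconds the challenge is to answer it without touching every device each time.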
Big Data Research @ UCY
Native JSON
Store + JSON
RESTful API
Big Data Research @ UCY
• Airplace: Build an innovative indoor localization & navigation platform for a Taiwanese company.
• Problem: Radiomaps of indoor environments are fairly large structures, all the more so as they become massively available.
• Setup: A 4-node Apache HBase cluster (i.e., a distributed, non-relational, shared-nothing database modeled after Google's BigTable and written in Java).
• Best Demo Award at IEEE MDM'12, covered on
Euronews and local media.
Big Data Research @ UCY
SmartLab: Massive smartphone
simulations with our first global open
smartphone IaaS cloud –
http://smartlab.cs.ucy.ac.cy/
Big Data Research @ UCY
http://smartlab.cs.ucy.ac.cy/
Big Data - What Is It?
Thanks!
Questions?
Demetris Zeinalipour
Assistant Professor
Data Management Systems Laboratory
Department of Computer Science
University of Cyprus
http://dmsl.cs.ucy.ac.cy/