Olivier Caudron
Big Data and NoSQL
"Big" Data?
"Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications"
http://en.wikipedia.org/wiki/Big_data (retrieved Feb 28, 2014)
The 3 V's of Big Data (or more…)
Volume
Velocity
Veracity?
Variety
Value?
Why Big Data?
• "Monetizing data" is what the hype is all about:
some "big data" monetization stories that have
gone viral evidently make many people envious
• For many, Big Data is nothing more than finding
as many needles (preferably golden) as
possible in the huge haystack of Internet data
"Big Data is not about the
amounts of data. It's about
the cool stuff you can do
with Big Data"
(Peter Hinssen)
http://datascienceseries.com/assets/blog/GREENPLUM_Information_is_the_new_oil-LR.pdf
Taxonomy of Big Data
• There is a lot of debate on the exact domain of application of
"Big Data"
– First off: Big Data is NOT a conceptual revolution!!!
– The most practical definition of "Big Data" is a negative one: any
problem that is not tractable through "traditional" means because of
its size and/or complexity and/or velocity will be considered a "Big
Data" problem
– … However it's not all that simple…
• "Big Data" was popularized by some big players on the Internet,
however, the reality is much less clear cut:
– Facebook and Twitter use MySQL mostly (and some Cassandra)
– Wikipedia and YouTube use MySQL (and little or no "NoSQL")
– Amazon is on Oracle DB
– Google is an exception: uses BigTable (NoSQL solution) mostly
Taxonomy of Big Data
• "Big Data" solutions can be divided into 2 categories:
– Big Data "processing" solutions are mostly offline (batch, non-transactional)
solutions for processing data and can be seen as an evolution of OLAP
Example: Apache Hadoop (and its ecosystem)
– Big Data "database" solutions that come mostly under the "NoSQL"
terminology ("No" SQL or "Not Only" SQL) and can be seen as an
evolution of OLTP
Examples: MongoDB, CouchBase, Cassandra,
Big Table, Redis, Neo4J…
Apache Hadoop in a Nutshell
• Low-level set of libraries designed for parallel processing of
large data sets
• 2 main components:
– Hadoop Distributed File System (file system designed for horizontal
scaling and replication on a cluster of commodity servers)
– Hadoop Map/Reduce (utilities for analyzing data using the
Map/Reduce paradigm)
• Open-source, built by the community under the Apache Software
Foundation and distributed under the Apache License 2.0
• See http://hadoop.apache.org/
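To make the Map/Reduce paradigm concrete, here is a minimal single-process sketch in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only illustrate the three stages that a real cluster would distribute across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data and NoSQL"]))
```

Each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key. That independence is what would allow both stages to be spread over many servers.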
Apache Hadoop in a Nutshell
• HDFS is designed to handle immutable files (once written, they
don't change) and is not suitable for just any FS use
• Map/Reduce requires heavy programmer involvement
• Has generated a host of solutions (of diverse levels of maturity)
that are meant to simplify its use and/or build functionality on it
– Pig, Hive, Cascading: higher-level map/reduce frameworks
– Yarn: Hadoop resource management
– Elasticsearch, Kibana: search and analytics engine
– Lingual: SQL layer on Cascading
– And more…
• InterSystems is currently integrating Caché with Hadoop
– Real-time copy of Caché data to Hadoop for offline processing
– In development (alpha)
Types of NoSQL Databases
[Figure: chart with axes "Velocity vs Data Size" and "Data Complexity"]
Commonalities of Volume-Oriented NoSQL Databases
• There are too many different NoSQL solutions out there to
characterize them in general terms, but the following usually
applies to all paradigms except graph-oriented:
• Typically non-ACID transactions ("BASE": Basically Available,
Soft state, Eventually consistent)
• Always denormalized: no referential integrity means the same
data will probably be present in several entities and won't be
synchronized by the system
• Often built for horizontal scaling (e.g. sharding)
• Typically optimized for inserts and retrieval, not meant for full
CRUD
• Not typically meant for classical applications (client/server, multi-tier,
web applications)
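As an illustration of horizontal scaling by sharding, the sketch below routes each record to one of several in-memory shards using a stable hash of its key. The names shard_for and ShardedStore are hypothetical, and each shard is just a Python dict standing in for a separate server. Real products add replication and rebalancing, which are not shown.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy store: every shard is a dict standing in for one server."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


if __name__ == "__main__":
    store = ShardedStore(4)
    for i in range(10):
        store.put(f"user:{i}", {"n": i})
    print([len(shard) for shard in store.shards])
```

Because a key alone decides which shard holds the record, inserts and key lookups touch one server only. A query on anything other than the key has to visit every shard.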
Key/Value Databases
e.g. Redis, Membase, LevelDB, Aerospike, Tokyo
Cabinet, Project Voldemort, Hyperdex…
• The Key is the only retrieval parameter
– In some products, several data types can be supported for keys,
including collections (lists, maps, sorted sets…)
– Users often structure the key in a way that allows for multi-parameter
record search – quite a dirty trick, and this must be carefully planned
in advance
• The Value can be anything:
– The database doesn't have to understand the contents
– Contents can be completely different for each record
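The key-structuring trick mentioned above can be sketched as follows. This toy store assumes keys are kept in sorted order (as LevelDB does, for example), so that a composite key such as country:city:id supports prefix scans. KeyValueStore, make_key and the key layout are hypothetical names chosen for illustration.

```python
from bisect import bisect_left, insort


class KeyValueStore:
    """Toy ordered key/value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}
        self._keys = []  # sorted list of keys, which enables prefix scans

    def put(self, key, value):
        if key not in self._data:
            insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with the prefix."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._data[self._keys[i]]
            i += 1


def make_key(country, city, user_id):
    """Composite key: the field order fixes which searches are possible."""
    return f"{country}:{city}:{user_id}"


if __name__ == "__main__":
    store = KeyValueStore()
    store.put(make_key("BE", "Brussels", "0001"), "a")
    store.put(make_key("BE", "Ghent", "0002"), "b")
    store.put(make_key("FR", "Paris", "0003"), "c")
    print(list(store.scan_prefix("BE:")))
```

The layout works for "all records in BE" or "all records in BE:Brussels", but not for "all records in Brussels regardless of country". This is why the key structure must be planned in advance.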
Key/Value Databases
Pros & Cons
• Pros:
– Ultrafast on inserts and key-based retrieval in large volumes
– Horizontal scaling possible (?)
• Cons:
– Messy paradigm
– No standardization whatsoever, no SQL support (usually)
– Popular solutions (Redis) actually in-memory with clunky persistence
options
– Must use tricks for multi-parameter queries (typically, use special
structure for keys)
– Any non-key query is unrealistic (full table scan with document
interpretation for each record required)
– Key size often limited (but key contents essential for queries!)
Document-oriented Databases
e.g. MongoDB, CouchBase, RavenDB, OrientDB,…
• Similar to Key/Value stores except that the database understands
the data structure
– No need to tinker with keys to optimize searches on diverse items
• Typically based on some variant of JSON (e.g. BSON: "Binary"
JSON)
• Typically allows extra indexes to be defined (beyond the key) to
speed up non-key-based queries
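The difference from a key/value store can be sketched in a few lines: because the store understands document fields, it can maintain an extra index on any of them. DocumentStore and its methods are hypothetical names, not the API of MongoDB or any other product, and the sketch assumes each document id is inserted only once.

```python
from collections import defaultdict


class DocumentStore:
    """Toy document store with optional secondary indexes."""

    def __init__(self):
        self._docs = {}
        self._indexes = {}  # field name -> {field value -> set of doc ids}

    def create_index(self, field):
        """Build an index over documents that already contain the field."""
        index = defaultdict(set)
        for doc_id, doc in self._docs.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self._indexes[field] = index

    def insert(self, doc_id, doc):
        """Store a document and update every existing index."""
        self._docs[doc_id] = doc
        for field, index in self._indexes.items():
            if field in doc:
                index[doc[field]].add(doc_id)

    def find(self, field, value):
        """Use the index when there is one, otherwise scan all documents."""
        if field in self._indexes:
            ids = self._indexes[field].get(value, set())
            return [self._docs[i] for i in sorted(ids)]
        return [d for d in self._docs.values() if d.get(field) == value]


if __name__ == "__main__":
    store = DocumentStore()
    store.insert(1, {"name": "Olivier", "age": 47})
    store.insert(2, {"name": "Danny"})  # documents need not share fields
    store.create_index("age")
    print(store.find("age", 47))
```

The second document has no age field at all, which the store accepts. The schema is per document rather than per collection.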
Document-oriented Databases
Pros & Cons
• Pros:
– Very popular paradigm at the moment (MongoDB, CouchBase)
– Good match with JSON, quite popular at the moment
– Handles a reasonable level of complexity
– Handles reasonably large amounts of data
– Typically provides horizontal scaling out of the box
• Cons:
– (Typically) not optimized for updates and deletes
– No relationship between entities, no normalization, no referential
integrity
– Not really standardized, but is the most converging of all NoSQL DBs
– Typically relies on eventual consistency – no ACID transactions
Column-oriented Databases
e.g. Google BigTable, Apache Cassandra, HBase,
Accumulo…
Classical relational model
Id  Name     Age  WorksOn
1   Olivier  47   Caché, Ensemble
2   Danny         Caché, DeepSee, iKnow
3   Alain    53   Caché
4   Luc
Column-oriented model
Id  Name       Id  Age     Id  WorksOn
1   Olivier    1   47      1   Caché
2   Danny      3   53      1   Ensemble
3   Alain                  2   Caché
4   Luc                    2   DeepSee
                           2   iKnow
                           3   Caché
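The move from the relational layout to the column-oriented one can be sketched as a small transformation over the People example on this slide. The function name to_columns is hypothetical, and missing values are simply absent from a column rather than stored as nulls, which is an assumption about the model shown.

```python
rows = [
    {"Id": 1, "Name": "Olivier", "Age": 47, "WorksOn": ["Caché", "Ensemble"]},
    {"Id": 2, "Name": "Danny", "WorksOn": ["Caché", "DeepSee", "iKnow"]},
    {"Id": 3, "Name": "Alain", "Age": 53, "WorksOn": ["Caché"]},
    {"Id": 4, "Name": "Luc"},
]


def to_columns(rows, key="Id"):
    """Split row records into one list of (id, value) pairs per column."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            if field == key:
                continue
            # A multi-valued field contributes one pair per value.
            values = value if isinstance(value, list) else [value]
            for v in values:
                columns.setdefault(field, []).append((row[key], v))
    return columns


if __name__ == "__main__":
    for field, pairs in to_columns(rows).items():
        print(field, pairs)
```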
Column-oriented Databases
"Lockstep" BigQuery Algorithm
• Select count(*) from People where Age>50
• Select Name, WorksOn from People where Age<50
Id  Name       Id  Age     Id  WorksOn
1   Olivier    1   47      1   Caché
2   Danny      3   53      1   Ensemble
3   Alain                  2   Caché
4   Luc                    2   DeepSee
                           2   iKnow
                           3   Caché
See http://cdn.parleys.com/p/529c6b62e4b039ad2298ca1b/529c5678140df_1385976886785.pdf
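The two queries on this slide can be sketched against the column tables as follows. This is not the "lockstep" algorithm itself, only an illustration of the underlying idea that each query reads just the columns it names. The function names are hypothetical, and returning one result row per WorksOn value is an assumption.

```python
# Column tables from the slide: one list of (Id, value) pairs per column.
name_column = [(1, "Olivier"), (2, "Danny"), (3, "Alain"), (4, "Luc")]
age_column = [(1, 47), (3, 53)]
works_on_column = [
    (1, "Caché"), (1, "Ensemble"),
    (2, "Caché"), (2, "DeepSee"), (2, "iKnow"),
    (3, "Caché"),
]


def count_where_age_over(threshold):
    """Select count(*) from People where Age > threshold (reads Age only)."""
    return sum(1 for _, age in age_column if age > threshold)


def name_works_on_where_age_under(threshold):
    """Select Name, WorksOn from People where Age < threshold."""
    # Filter on the Age column first; rows without an Age never match.
    matching_ids = {row_id for row_id, age in age_column if age < threshold}
    names = dict(name_column)
    return [
        (names[row_id], product)
        for row_id, product in works_on_column
        if row_id in matching_ids
    ]


if __name__ == "__main__":
    print(count_where_age_over(50))
    print(name_works_on_where_age_under(50))
```

The count query never touches the Name or WorksOn columns, which is where the speed on wide tables would come from.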
Column-oriented Databases
Sharding
• Columns can be distributed on separate servers, distributing the
load automatically
Id  Name       Id  Age     Id  WorksOn
1   Olivier    1   47      1   Caché
2   Danny      3   53      1   Ensemble
3   Alain                  2   Caché
4   Luc                    2   DeepSee
                           2   iKnow
                           3   Caché
Separate Servers
Column-oriented Databases
Sharding and Big Data Aggregation
• Typically, result sets for big queries are "reconstructed" by higher-level
servers
Root Server
Intermediate Servers
Leaf Servers
Storage Layer (e.g. Google FS)
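The server hierarchy above can be sketched as a tree of partial aggregations: leaves count matches in their own shard, intermediate servers add up their leaves, and the root adds up the intermediates. All function names and the nested-list representation of the tree are hypothetical, and everything runs in one process here.

```python
def leaf_count(shard, predicate):
    """Leaf server: count matching values in its own shard of a column."""
    return sum(1 for value in shard if predicate(value))


def intermediate_count(leaf_shards, predicate):
    """Intermediate server: combine the partial counts of its leaves."""
    return sum(leaf_count(shard, predicate) for shard in leaf_shards)


def root_count(tree, predicate):
    """Root server: combine the partial counts of all intermediates."""
    return sum(intermediate_count(leaves, predicate) for leaves in tree)


if __name__ == "__main__":
    # Two intermediate servers, each with two leaf shards of an Age column.
    age_tree = [
        [[47, 53], [12]],
        [[88, 50], []],
    ]
    print(root_count(age_tree, lambda age: age > 50))
```

Only small partial results travel up the tree, never the raw column data. The approach works directly for aggregates such as counts and sums that can be combined from partial results.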
Column-oriented Databases
Pros & Cons
• Pros:
– Ultrafast queries on huge amounts of data
– No indexing required (each column is its own index)
• Cons:
– Actually less efficient (than relational) for small databases
– Requires a significant infrastructure in any relevant scenario
– No referential integrity – limited complexity in structure AND queries
– Not designed for updates (and deletes?)
– Transactions?
Graph-oriented Databases
e.g. Neo4J, OrientDB, Allegrograph, Dex…
[Figure: property graph of the Simpson family; the layout was lost in extraction. Nodes carry properties, and edges carry a relationship type (Rel), in one case with a property of its own.]
Nodes whose properties are still legible:
– Lastname: Bouvier, Firstname: Clancy
– Lastname: Simpson, Firstname: Marjorie, Maidenname: Bouvier, Nickname: Marge
– Lastname: Simpson, Firstname: Abraham
– Lastname: Simpson, Firstname: Homer, Middlename: Jay
– Lastname: Simpson, Firstname: Bartholomew, Midname: Jojo, AKA: Bart
– Lastname: Simpson, Firstname: Lisa, Gender: F
– Lastname: Simpson, Firstname: Margaret, Nickname: Maggie
– Lastname: Van Houten, Firstname: Milhouse, Middlename: Mussolini
– Lastname: Burns, Firstname: Montgomery, AKA: Monty
Other legible node fragments: Firstname: Mona; Maidenname: Gurney
Edge labels: Rel: Spouse (one with Since: 4/19/1987), Rel: Son, Rel: Daughter, Rel: Sister, Rel: Brother, Rel: Friend, Rel: Employee, Rel: Victim
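A property graph like the one on this slide can be sketched with two plain structures: a dict of nodes and a list of labelled edges. The Graph class, the node ids and the edge direction (parent to child for Son/Daughter) are hypothetical. Which nodes each edge connects is also an assumption, because the extracted figure does not preserve it.

```python
class Graph:
    """Toy property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}
        self.edges = []  # (source id, relationship, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, rel, dst, **props):
        self.edges.append((src, rel, dst, props))

    def neighbours(self, src, rel):
        """Follow every edge of one relationship type from a node."""
        return [dst for s, r, dst, _ in self.edges if s == src and r == rel]


def grandchildren(graph, person):
    """Two-hop traversal along Son and Daughter edges."""
    child_rels = ("Son", "Daughter")
    children = [c for rel in child_rels for c in graph.neighbours(person, rel)]
    return [
        gc
        for child in children
        for rel in child_rels
        for gc in graph.neighbours(child, rel)
    ]


g = Graph()
g.add_node("abraham", Lastname="Simpson", Firstname="Abraham")
g.add_node("homer", Lastname="Simpson", Firstname="Homer", Middlename="Jay")
g.add_node("marge", Lastname="Simpson", Firstname="Marjorie", Nickname="Marge")
g.add_node("bart", Lastname="Simpson", Firstname="Bartholomew", AKA="Bart")
g.add_node("lisa", Lastname="Simpson", Firstname="Lisa", Gender="F")
g.add_node("maggie", Lastname="Simpson", Firstname="Margaret", Nickname="Maggie")
# Assumed edges, for illustration only.
g.add_edge("abraham", "Son", "homer")
g.add_edge("homer", "Spouse", "marge", Since="4/19/1987")
g.add_edge("homer", "Son", "bart")
g.add_edge("homer", "Daughter", "lisa")
g.add_edge("homer", "Daughter", "maggie")

if __name__ == "__main__":
    print(grandchildren(g, "abraham"))
```

The grandchildren query is a two-hop traversal. In a relational model the same question would need a self-join per hop.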
Graph-oriented Databases
Pros & Cons
• Pros:
– According to their supporters, more "natural" way of handling
structured data
– Typically ACID transactions
– Capable of handling reasonable volumes, horizontal scaling typically
supported, indexing possible
– Support a high level of data complexity with good mining tools
– Contrary to other NoSQL solutions, can (possibly) be fit for general,
non-specific use
• Cons:
– Still unproven paradigm in all but specialized cases
– Complexity might be too high for simple problems
– Maintenance of the data model might be complicated
– Not yet popular, not yet standardized
What about Object-Oriented Databases?
e.g. Versant, Gemstone, ObjectStore, DB4O…
THE classical NoSQL database paradigm!
• Still a very valid paradigm but…
• Object-oriented databases have had their chance and missed it
– Poor overall performance
– Competition of ORM tools (Hibernate, EclipseLink, JPA…) with
equivalent ease of use and better performance of underlying
relational database
– Deserved to generate hype but failed to do it
• The only exception today is Caché – very powerful object-oriented
database system, the only OO DB to really pass the test
of real-life use with competitive performance
What about Caché?
Caché pushes ACID transactions to the extreme
• The original NoSQL database! (remember globals? = ultrafast
multidimensional key/value store!)
• Relational database that easily competes with the best of them
• The ONLY object-oriented database to pass the test of real-life
projects
• … All in one consistent package
• … With fully ACID transactions
• … With extensive enterprise tooling (monitoring, backup, task
scheduling, horizontal scaling, replication, etc.)
• … With outstanding support from InterSystems
• … And added value technologies (DeepSee, iKnow, Ensemble)
Research Projects
We need your feedback…
• InterSystems is working on several research projects related to
Big Data and NoSQL
• Apache Hadoop integration
• Document-based database implemented in Caché (Morpheus
project)
• Graph-oriented approach via the Globals client interface
(currently Node.js) – github project:
https://github.com/GlobalsDB/Contributions/tree/master/NodeJS/
GlobalsGraphDB
Your feedback is important to determine the future directions
of our technology
Conclusions
• There is no real universal "game changer" in new database
architectures, only scoped solutions to specific problems
• Only graph-oriented databases can plausibly aim at
universality – but they have yet to prove themselves in general
• When considering a NoSQL solution, one must consider the
whole picture including known limitations e.g. ACID
transactions, CRUD, in-memory, etc.
• Having the same data in different data stores (or offline copies
like for Hadoop Map/Reduce) to solve your problem(s) is no trivial
decision: doubling 100s of TB of data is hardly inconsequential
• Caché simplifies these issues – and it pushes the
boundaries of transactional processing of high volumes far
enough to be the right solution in most cases
Caché Big Data Success Stories
Alain Houf, Senior Sales Engineer