Transcript
>>
This is the developerWorks podcasts Java technical
series, hosted by Andy Glover. Here's Andy.
GLOVER:
This is Andy Glover with the Java technical
series. My guest today is James Phillips with Couchbase. He is
a co-founder and senior vice president of products. And
Couchbase is a new company, which is the combined forces of
CouchOne and Membase. So I thought we'd start, James, with
what happened here. Tell us about this merger.
PHILLIPS:
Sure. The merger happened in ... I can't even
remember when we announced it now. I think we announced it
mid-February, so it's been a couple of months.
GLOVER:
Okay.
PHILLIPS:
Mid-February 2011. And it was a combination that
was driven, surprisingly -- maybe not surprisingly, but
incredibly -- almost literally by customers or by users of our
technology.
GLOVER:
Yes?
PHILLIPS:
There was a tweet that happened when we were in
the middle of going through integration discussions, merger
discussions, prior to actually closing where someone tweeted,
could someone please combine Membase and Couch?
GLOVER:
Nice.
PHILLIPS:
It really was all about the technology. If you
look at what we were doing at Membase, we were providing this
very, very high-performance sort of caching and clustering
technology that allows you to smear your data across a lot of
servers and to use memory to cache your most recently used
data.
That high-performance, very elastic, scalable data-management
technology is what we were doing at Membase. We were storing
data also on disk, but we weren't giving you the query
ability, indexing, and the other things that people tend to
expect out of a primary database.
GLOVER:
Sure.
PHILLIPS:
We weren't providing those capabilities. And
that's what Couch was all about. Couch was a document database
technology that did allow you to query and index and do
database-like functionality. People loved our elasticity and
our high performance, but they really missed the ability to
query.
On the flipside, Couch didn't have a very high-performance
story. They didn't have an elastic scalability story, and
people really missed that or wanted that out of Couch
technology.
So marrying those two technologies and now being able to smear
your data across a lot of servers, get very high performance
when you're interacting with your data, and still have the
ability to do queries and indexing and all the other great
stuff that people expect out of database technology is what we
get out of this merger.
I read a blog post when we completed the merger talking about
that tweet and sort of in jest but very seriously saying,
please keep the good ideas coming.
[LAUGHTER]
GLOVER:
Yes. Yes.
PHILLIPS:
Coming from the user population is certainly where
you want to be drawing your ideas.
GLOVER:
Interestingly, in season one I had the opportunity
to talk to Aaron Miller, who was doing some interesting work
with Couch. At the time, at CouchOne, he had ported Couch to
mobile devices -- at the time, Android. And definitely had
some other conversations about other NoSQL-like datastores
such as App Engine and Mongo.
And so I thought it would be interesting to talk about, how
did we get here? Obviously, there's a lot of momentum here if
you've got your users saying hey, why don't you combine
forces? Someone's got to solve this problem. So this isn't
build it and they will come; this is, they want it, let's get
it in front of them as quick as possible. What happened? How
did we get here?
PHILLIPS:
Okay. And absolutely, right? We've all been
through the build it and they will come. Sometimes it works
out, but a lot of times it really sucks.
[LAUGHTER]
GLOVER:
Yes. Been there, done that.
PHILLIPS:
Exactly. It's nice when there is a real acute
problem in the marketplace where people are really looking for
a solution. And we feel fortunate that we are able to deliver
on what is turning out to be a real acute need.
And so, the way we got here, if you go back ... let's go on
the way-, wayback machine here. You go back to the early
1970s, mid-1970s, and that was the point in time at which
relational database technology was invented.
If you think about the interactive software systems of that
day and you think about the infrastructure atop which those
systems ran: very, very stark contrast to what we find
today.
GLOVER:
Yes.
PHILLIPS:
Monolithic computing was the norm. You had
minicomputers and mainframes. If you needed more capacity you
expanded your box, right? Networking was in its infancy --
data networking. The users of interactive software systems at
that point were measured in the tens to hundreds. And if you
were really, really, really bleeding-edge large, maybe you had
a couple of thousand users.
GLOVER:
Yes.
PHILLIPS:
Sabre Airlines' reservation network or Bank of
America's teller network maybe had a couple of thousand
concurrent users online at any one time.
GLOVER:
It was also quite expensive, at least hardware was
at that point. Right?
PHILLIPS:
Absolutely. And memory was very scarce and
expensive.
And so the environment in which relational technology was born
basically made a bunch of assumptions that were valid at that
point in time and led to a technology that was ideal for the
kinds of software systems that were being built and the kinds
of infrastructure atop which they were running.
If you fast-forward now 40 years and you look at the kinds of
interactive software systems that are being built today, most
new interactive software systems are interacted with via
browser over the web, and you've got the ability, as a result,
to have concurrent user populations that are literally many
orders of magnitude larger than even the very largest of
applications in the mid-1970s. So being able to support
concurrent utilization of your software systems today is a
very different challenge from what was being dealt with then.
Additionally, if you look at the mode of computing now, the
infrastructure, it's no longer about, let's go get a bigger
and bigger mainframe, but rather, let's spin up some more
virtual machines or let's launch some new cloud instances. And
the model of computing really is about scaling out versus
scaling up, and being able to match the capacity of the
application, the infrastructure, with the needs of the
concurrent utilization patterns in real time.
GLOVER:
Right.
PHILLIPS:
And so, application architecture was the first
thing to fall over. If you look 10 years back, just 10 years
ago, we saw this shift toward -- with web applications in
particular, which again are the predominant form of
interactive software at this point -- where you take stateless
web servers and you put them behind a load balancer. And if
you have more users now you spin up some new instances, update
your load balancer, get those new guys in the rotation, and
off you go.
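
The rotation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name RoundRobinBalancer, the server names, and the round-robin policy are hypothetical and are not taken from the conversation or from any particular load balancer.

```python
class RoundRobinBalancer:
    """Hypothetical load balancer that rotates requests across stateless servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # counter used to pick the server for the next request

    def add_server(self, name):
        # "Spin up a new instance and get it into the rotation."
        self.servers.append(name)

    def route(self):
        # The servers hold no per-user state, so any of them can take any request.
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


balancer = RoundRobinBalancer(["web-1", "web-2"])
first_four = [balancer.route() for _ in range(4)]

balancer.add_server("web-3")  # more users arrive, so capacity is added
next_six = [balancer.route() for _ in range(6)]
```

Adding a server is a one-line change to the rotation because nothing has to be moved: the web tier keeps no data of its own.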
GLOVER:
Right.
PHILLIPS:
And so you've got this very flexible, elastic,
very scalable, cost-effective mechanism for supporting large
sets of users. One more virtual machine or one more DL380 in
the rack. Another thousand users supported, boom. Another
thousand, another Dell box.
So it's got this very nice, linear cost curve associated with
it, and you can basically scale that out almost indefinitely
from a performance perspective while maintaining cost and
performance. So, great model. But if you look at the database
technology that is by and large relational technology, it's
very different.
Relational technology was built to be a shared-everything
technology. If you want to increase the size of your database
or the amount of concurrency that your database can support,
you've got to get a bigger box. It's very difficult to shard a
relational database, which means partitioning the data and
spreading it across multiple servers.
Now, having said that, there have been a lot of Band-Aids sort
of thrown at the product. The first Band-Aid certainly was
sharding. Okay? I can't smear my data out using the relational
model naturally across multiple servers, so I'm just going to
partition my data set.
I'm going to take all of my users west of the Mississippi and
put them on this box and take all of the users east of the
Mississippi and put them on that relational database. And I
need to change my application logic to know that if this guy
is east of the Mississippi, I need to go to this box to get
the data or the other one.
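
A minimal Python sketch of that kind of application-level sharding follows. Everything in it is an assumption made for illustration: two dicts stand in for the two relational servers, and the names WEST_DB, EAST_DB, WEST_STATES, shard_for, save_user, and load_user are hypothetical.

```python
# Two plain dicts stand in for two separate relational database servers.
WEST_DB = {}
EAST_DB = {}

# Hypothetical lookup of which states count as "west of the Mississippi".
WEST_STATES = {"CA", "OR", "WA", "TX", "CO"}


def shard_for(state):
    # The routing rule lives in application code, not in the database.
    return WEST_DB if state in WEST_STATES else EAST_DB


def save_user(user_id, state, name):
    shard_for(state)[user_id] = {"state": state, "name": name}


def load_user(user_id, state):
    # The caller must already know the state to find the right server.
    return shard_for(state).get(user_id)


save_user(1, "CA", "Ana")
save_user(2, "NY", "Ben")
ben = load_user(2, "NY")
wrong_shard = load_user(2, "CA")  # asking the wrong server finds nothing
```

Because the routing rule is baked into the application, changing it means both moving rows and changing code, and any question that spans both dicts has to be answered by the application rather than by a single query.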
Now, the challenge with sharding a relational database in that
manner is that it's very disruptive to the application. What
happens when you fill up your west-of-the-Mississippi box?
You've got to go reshard, change application logic. You lose
the ability to do any sort of complex cross-server relational
queries.
You can't join across multiple relational database management
systems when they're sharded like that. And so you're starting
to give away a lot of the richness of the query ability of the
relational model when you do that. And so sharding is a can of
worms.
Another approach -- in order to increase the concurrency
capabilities of relational technology -- has been to be less
normal form in the way that you're defining your schemas.
Relational technology done right is all about normalizing your
data and basically taking a record and splitting it up into
multiple tables where you've got foreign-key relationships et
cetera. So that when you're talking about a row of data, what
you're really talking about is little pieces from a bunch of
tables. And if you want to go update a row, naturally you've
got to lock down all of those tables in order to get an atomic
update, right?
GLOVER:
Yes. Yes.
PHILLIPS:
All of that locking that occurs in relational
technology makes it very difficult to get a high degree of
concurrency. And so what people have done is to denormalize a
lot of their data so you don't have it smeared across multiple
tables and so that when you go do an update you really are
just locking down one table and you're getting it over with
fairly quickly. And again, that means not using relational
technology the way it was designed in order to solve this
problem, which is increasing concurrency.
The final Band-Aid that has been thrown into the mix is
distributed caching -- something like memcached. Where, okay,
I'm using relational technology. It doesn't do a really good
job with high concurrency, and so I'm going to smear the data
across a bunch of cache machines in front of that big
monolithic relational database management system or even
sharded database management system in order to allow myself to
get some of those cost economic benefits of spreading my data
out.
So memcached can sit in front of, effectively, a relational
database management system and allow me to spread my data out
across lots and lots of servers. If I have more data or I want
more parallelism support, simply get another couple of
servers, throw them into the mix, and off I go, right -- just
like I'm doing at the application layer. The problem, of
course, is that if I'm throwing a completely new tier of
data-management infrastructure into the mix, I have more
complexity. Now I'm managing both a cache and a database
management system.
GLOVER:
Right.
PHILLIPS:
It only works for reads. You still have to, if you
want durability, write your data back into a database
management system. So if you've got any sort of write
contention, you're back in the same boat. There are
consistency issues that one needs to deal with.
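
The pattern described here, a cache in front of a database that helps reads but not writes, can be sketched as follows. This is an illustrative sketch only: plain dicts stand in for memcached and for the relational database, and the CacheAside class and its method names are hypothetical rather than any real client API.

```python
class CacheAside:
    """Dicts stand in for a memcached tier and a relational database."""

    def __init__(self):
        self.cache = {}
        self.database = {}
        self.db_reads = 0  # counts how often a read had to reach the database

    def read(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: the database is not touched
        self.db_reads += 1
        value = self.database.get(key)  # cache miss: fall back to the database
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # Durability still requires a database write on every update.
        self.database[key] = value
        # Drop the cached copy so later reads do not see stale data.
        self.cache.pop(key, None)


store = CacheAside()
store.write("user:1", "Ana")
first = store.read("user:1")    # miss: loaded from the database
second = store.read("user:1")   # hit: served from the cache
reads_before_update = store.db_reads

store.write("user:1", "Anna")   # goes to the database and invalidates the cache
third = store.read("user:1")    # miss again, because the cached copy was dropped
reads_after_update = store.db_reads
```

Every write still lands on the database, and the application has to remember to invalidate the cache itself, which is where the consistency issues come from.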
And so generally speaking, things like sharding and
denormalization and distributed caching, they are all attempts
to sort of bandage over the fact that relational database
technology has not kept pace with changes in application
architecture, the infrastructure atop which applications run,
and the kinds of user populations that are using these
systems.
There's a fundamental problem that needs to be addressed. And
because you're dealing with state and because distributed data
management is a very, very nontrivial problem, people have
been willing to try to paper over this and make work what they
have.
But at some point it's got to give. And I've always said that
the database management technology is the last domino to fall
in this march toward truly distributed application
architecture.
If you look at some of the leading organizations that have had
to grapple with the real pain of trying to make these
technologies work at scale, because at some point you can kind
of Band-Aid, Band-Aid, Band-Aid, but eventually if you get to
a sort of tipping point it's just not going to work anymore.
GLOVER:
Yes.
PHILLIPS:
Amazon invented Dynamo, their distributed database
technology. Google invented BigTable. And techniques like
MapReduce and Hadoop. And so necessity was the mother of
invention in a lot of these organizations where they tried and
tried and tried, and then finally got to the point where they
said, you know what? This is silly. We need to rethink the way
that we're doing data management. And that led to a lot of the
innovation that's now working its way back down to not just
the Googles and the Amazons, but anyone to do data management,
quote-unquote, correctly, which to me is code for more
cost-effectively, more efficiently, and more in line with the kinds
of architectural approaches that seem to be working at every
other point in the stack, now doing that at the data layer.
At Couchbase, ultimately that's what we're all about. We are
all about knocking over that last domino: allowing
data-management technology, database technology to finally catch up
with what's been happening over the last decade in a broad way
at the application-logic layer.
That in part is why we did the merger: to bring together what
we believe are all the critical components to make it truly
possible to store your data in that manner and to make it
accessible to everyone.
GLOVER:
That was excellent. I think you've articulated how
we got here just wonderfully, better than I could have.
And something that I really enjoyed was bringing the example
of Amazon and Google, how these companies had this problem and
dealt with it. They built their own, and all of a sudden
people started realizing okay, this is legit. These aren't two
guys in a garage peddling this object database, right?
And now there are myriad data stores out there, as we already
talked about. We've talked about Couch, we've talked about
Mongo on this series of podcasts. We've talked about BigTable.
Obviously you all are Couch and memcache. And memcache isn't
the only kind of caching technology out there, although
probably the 500-pound gorilla at least on that side.
What makes these two open source projects unique and viable?
What kind of problems do I have that these things solve as
opposed to foo and bar the other options out there?
PHILLIPS:
Sure. The truth of the matter is that we're
really, really early in commercialization -- maybe to use some
words that could be negatively construed, but I think
positively, right? Sort of in the mass marketing, if you will,
or I guess the word I'm looking for is: not everyone is Amazon
or Google and able to invent database technologies that you
use within your organization that you care and feed for that
aren't really packaged up necessarily for consumption by the
masses.
We have to get to that point, right? Until there are
technologies that are really, really easy for any developer
and/or operations team to pick up and begin using with the
kind of comfort and with the feeling of security that they get
from an Oracle or a SQL Server -- or even, I guess, maybe
perhaps at this point MySQL -- we have to get there before
this technology is going to be meaningful in any large way.
And we're early, right? We are still, I think, maturing these
technologies and hardening them and making them operationally
ready to the point that the masses can actually start to
consume and use them.
Having said that, I think that one of the things that sets
Couchbase apart -- and I certainly don't want to make this
just an advertisement for Couchbase, because I think that the
larger conversation about database technology and where it's
heading is certainly more interesting -- but one of the things
that does set us apart is that we are building from some of
the most mature technologies in what is an otherwise very
immature market. memcache was invented almost 10 years ago,
now.
GLOVER:
Yes. Yes.
PHILLIPS:
It is battle-proven. It is a core component behind
18 of the top 20 websites on this planet by traffic count. It
doesn't get much more hardened than those numbers, right?
GLOVER:
Yes. Exactly.
PHILLIPS:
Couch, likewise: it was the forerunner of all the
emergent document database technologies. Damien Katz invented
that back in 2005. Damien is our CTO here.
GLOVER:
And he comes with -- correct me -- the Lotus Notes
background.
PHILLIPS:
Yes. Absolutely.
GLOVER:
Which was one of the first popular document
stores.
PHILLIPS:
You bet.
So a lot of the replication technology was invented by Damien
back in the Lotus Notes days. Big heritage there, right? And a
lot of maturity in the software systems.
And so by marrying those two technologies in particular --
memcache and CouchDB -- we believe that we are far ahead of
many of the other emergent technologies. Not that they are
poor technologies, but they are just less proven. And I think
that one would be hard-pressed to identify a more,
quote-unquote, mission-critical or core component of an IT
architecture than your database. Right?
GLOVER:
Right. Exactly.
PHILLIPS:
And you don't monkey with that. So that I think is
one of the things that really sets us apart. There are various
flavors or approaches to, quote-unquote, NoSQL database
technology, which is kind of a buzzword that the industry's
thrown around for what is probably more appropriately called
nonrelational or elastic database technologies.
So document database technology is one approach. There are
columnar databases like Cassandra or HBase, or BigTable at
Google. There are pure key/value stores. And it is my belief
that probably the most widely applicable, certainly easiest to
understand and develop against -- I think what will ultimately
be the winning approach to having a scalable database
technology that can be distributed -- is this document
approach.
GLOVER:
Okay. And why is that?
PHILLIPS:
Well, it's really easy to wrap your head around.
If you think about what a document database is, actually first
and foremost, it's actually a confusing term. Most people when
they hear the word document they think Word document or Excel
document.
GLOVER:
That's right.
PHILLIPS:
And that's not at all what a document database is,
unfortunately, right? And therein lies the confusion. What a
document database is, basically: with a relational database
when you insert a record, you're inserting a set of
well-defined fields that have certain types, and it's very
rigorously defined in a database schema, right? You can't
start inserting data into a relational database until you have
defined that schema and what the columns are and what the
variable types are, et cetera.
GLOVER:
Right.
PHILLIPS:
In a document database, you simply shove in a row
of whatever you want. And the row is a document. The
definition is it's an XML document or it's a JSON document or
it's some other self-describing type of quote-unquote record,
where you've basically said okay, here's a name. And then
you've got the name and slash name. Here's an age, slash age.
And so it looks like a marked-up XMLish JSON kind-of
self-describing document. You've got metadata along with the
data.
You don't have to go about creating a schema before you start
inserting data. If you change your mind later about what kind
of data you want to capture, you don't have to update a schema
and renormalize and all the things that you have to do with a
relational database. So it's easy. I've got a record or an
object or a piece of information. I just want to stick it in
there.
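
To make the idea of self-describing documents concrete, here is a small Python sketch. It is an assumption-laden illustration: a list stands in for the document database, JSON is used as the document format, and the insert helper and the field names are hypothetical.

```python
import json

# A list stands in for a document database; no schema is declared anywhere.
documents = []


def insert(doc):
    # Round-trip through JSON to show that each document is stored as
    # self-describing text: the field names travel with the values.
    documents.append(json.loads(json.dumps(doc)))


insert({"name": "Ana", "age": 34})
# Later the application starts capturing more fields; earlier documents are untouched.
insert({"name": "Ben", "age": 41, "pets": ["cat"], "city": "Austin"})

field_sets = [sorted(doc.keys()) for doc in documents]
```

The two documents carry different fields, and nothing had to be altered or migrated between the first insert and the second.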
GLOVER:
Right.
PHILLIPS:
And it's self-contained, which makes it really,
really easy to distribute, because I don't have to worry about
picking up various pieces from all the tables that have
foreign-key relationships. I've got this blob and I'm going to
move it from here to this other server now in order to smear
it out if I'm changing the size of my cluster.
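
The point about self-contained records being easy to move can be sketched like this. The placement rule below (a hash of the document ID modulo the number of servers) is a deliberate simplification chosen for the example; it is not a description of how Couchbase or any other product places data, and the names server_for and place are hypothetical.

```python
import hashlib


def server_for(doc_id, server_count):
    # A stable hash of the document ID picks the server. Python's built-in
    # hash() is randomized per process for strings, so hashlib is used instead.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % server_count


def place(docs, server_count):
    # Each document is copied whole; there are no related rows to chase.
    cluster = [dict() for _ in range(server_count)]
    for doc_id, doc in docs.items():
        cluster[server_for(doc_id, server_count)][doc_id] = doc
    return cluster


docs = {f"user:{i}": {"name": f"user{i}", "visits": i} for i in range(100)}

three_nodes = place(docs, 3)
four_nodes = place(docs, 4)  # the cluster grows by one server

total_on_three = sum(len(node) for node in three_nodes)
total_on_four = sum(len(node) for node in four_nodes)
```

A plain modulo rule like this one relocates many documents whenever the server count changes, so real systems generally use more careful partitioning schemes; the sketch only shows that each document can be relocated as a single unit.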
So, no schema required. Just start shoving stuff in there.
Change your mind about what you're shoving in there at any
point in time. And then later, when you want to query your
information or get it out, you've got the ability to do that.
You basically infer the schema later, versus forcing the
schema before you can do the inserts.
This notion of hey, I have some data, I just want to store it,
and I want it to be safe and durable, and I want to get it
back later is something that is extremely intuitive for
people.
GLOVER:
Yes. It's easy, and you have a low barrier to
getting started.
PHILLIPS:
Exactly. And some of these other approaches, the
pure key/value approach -- which is what memcache historically
has been -- where you've got one primary key and then a big
blob of data that sits behind it. The problem with that is
your database can't do any work for you, because it doesn't
know how to look inside the blob. The only thing you can do is
put a blob in and get the blob out. If you want to go get a
list of all the blobs that have a cat inside of it, the
database can't do that for you, right?
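
The difference between an opaque blob and a document the database can inspect can be sketched as follows. This is an illustration under assumptions: dicts stand in for both kinds of store, the blobs are JSON-encoded bytes, and the function names are hypothetical.

```python
import json

records = {
    "a": {"owner": "Ana", "pets": ["cat", "dog"]},
    "b": {"owner": "Ben", "pets": ["fish"]},
}

# Pure key/value store: values are opaque bytes the store cannot interpret.
kv_store = {key: json.dumps(doc).encode("utf-8") for key, doc in records.items()}

# Document store: values are structured, so the store can look inside them.
doc_store = {key: dict(doc) for key, doc in records.items()}


def find_docs_with_pet(store, pet):
    # A query the database can run itself, because the fields are visible to it.
    return sorted(key for key, doc in store.items() if pet in doc.get("pets", []))


def find_blobs_with_pet(store, pet):
    # With opaque blobs, the application must fetch and decode every value itself.
    matches = []
    for key in store:
        decoded = json.loads(store[key].decode("utf-8"))
        if pet in decoded.get("pets", []):
            matches.append(key)
    return sorted(matches)


cat_docs = find_docs_with_pet(doc_store, "cat")
cat_blobs = find_blobs_with_pet(kv_store, "cat")
```

Both functions return the same answer, but in the key/value case all of the work, including pulling every blob out of the store, happens in the application.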
And so that approach is less full-featured from a database
perspective than developers have become accustomed to, and
rightly so. On the columnar-store side it's just really,
really confusing, and there are still schema-like things that
need to be done in order to use it. You've got these
column families and supercolumns. It's very foreign and alien
from a developer perspective, and a lot of people have had
trouble wrapping their head around the model.
I think that the document approach is the right sort of middle
ground where you've got enough expressive power in your
document-manipulation language or your query language. And
it's simple enough to start using it that anyone can do it.
That's why I say that I think ultimately it's going to be the
winning approach. And solutions like Mongo or
Couchbase, I think, are the two representative document
database technologies that I think have a good shot at being
the next relational technology, if you will.
GLOVER:
Right. Right. Going back to Mongo and Couch and
memcache combined, obviously you all have a stack that is
Couch and memcache. Is it optimized in some way that makes it
special or easier than me just rolling my own, so to speak,
and downloading, let's say, Couch from Apache and then go grab
memcache?
PHILLIPS:
Oh, absolutely. The thing is that one of the
things that we talked about earlier is the downside of trying
to manage two completely separate tiers -- database layer, et
cetera. And so Couchbase takes those ingredient technologies
-- memcache and CouchDB -- but melds them into a single-tier
solution. When you install Couchbase, you're not installing
memcache and Couch. You're installing a single tier that's
elastic, that's scalable, that automatically and transparently
caches your data but also stores it durably in the back end of
CouchDB.
And because we are tightly coupled between those two
technologies, and because they're co-deployed and we're in
control of both of them, we can make all sorts of optimization
decisions about what lives in cache and what doesn't and when
and why.
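
One way to picture a single tier that caches and stores durably behind one interface is the sketch below. It is purely illustrative: the SingleTierStore class, its least-recently-used eviction policy, and the dict standing in for disk are assumptions made for the example, not a description of Couchbase internals.

```python
from collections import OrderedDict


class SingleTierStore:
    """Hypothetical single-tier store: one API, with a bounded in-memory cache
    kept in front of a durable store (a dict stands in for the disk)."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.cache = OrderedDict()  # most recently used keys sit at the end
        self.disk = {}

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used key

    def set(self, key, value):
        self.disk[key] = value       # durable copy
        self._remember(key, value)   # cached copy, managed by the store itself

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.disk.get(key)
        if value is not None:
            self._remember(key, value)
        return value


store = SingleTierStore(cache_size=2)
for key in ("a", "b", "c"):
    store.set(key, key.upper())

cached_keys = list(store.cache)   # "a" has already been evicted from memory
value_of_a = store.get("a")       # still readable, served from the durable copy
cached_after_read = list(store.cache)
```

The caller only ever sees set and get; what stays in memory is decided inside the store, which is the kind of decision a separate cache tier cannot make on the database's behalf.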
GLOVER:
Right.
PHILLIPS:
And so Couchbase at some level, like Mongo, is a
complete solution, if you will. It includes the
high-performance caching capability, it includes the ability
to distribute your data across multiple nodes, it includes the
high expressive factor of the ability to do distributed
queries.
All of that stuff is provided in a single tier. Instead of
having a separate caching layer trying to fix up problems at
the database layer, it's just a good database.
GLOVER:
Exactly. And so back to Couchbase and then the
fact that there is open source. You can go to Apache and grab
Couch. So what is the business model with Couchbase? Can I get
ahold of the stack? How does it work?
PHILLIPS:
So Couchbase, if you go to couchbase.com or -- I'm
going to get the ads in here -- Couchbase is our Twitter feed,
as well.
GLOVER:
Excellent.
PHILLIPS:
So Couchbase everywhere, right? So couchbase.com.
If you go there you are able to download free of charge and
use freely the Couchbase binaries. It's a completely open
source project, as well.
We basically have taken those two ingredients -- CouchDB and
memcache -- and the clustering technology that we built when
we were Membase and combined all of that into a single
repository, which is Couchbase.
It's open source, freely available; take it, use it, do
whatever you want with it. We have two editions that we make
available. There's the community edition and the enterprise
edition.
GLOVER:
Okay.
PHILLIPS:
The difference between the two being that the
community edition sort of represents the current branch of
development where we're adding lots of capabilities and
features. The binary that we put out there is the most stable
build of that high-churn branch, if you will. But it is sort
of high-churn. We certainly don't claim that it's ready for
production use. We can't support it.
And so it's great for developers and for hobbyists and for
folks who want to be on the bleeding edge, from a
feature/functionality perspective, but the enterprise edition
represents a point-in-time snapshot of that open source
project that we sort of pulled off, branched, ran through
very, very rigorous quality-assurance processes, added some
supportability tools into the code. We're able to patch it;
we’re able to support that branch. And we do charge for
production deployments of the enterprise edition beyond
two-node clusters.
GLOVER:
I see.
PHILLIPS:
So even the enterprise edition is free for
development/test/preproduction and in small-scale production.
But if you want to go to large-scale production, the license
is on a per-node basis for the enterprise edition. So it's a
typical sort of open source model where ultimately what you're
paying for is the stability, supportability, maintainability
of the certified edition, if you will.
GLOVER:
Right. Exactly.
PHILLIPS:
But it's all open source. If you want to go build
the software and do with it whatever you please, you're
welcome to do that.
GLOVER:
You're combining memcache and Couch into your
stack. Is your Couch version a fork of Apache Couch? Or is it
bidirectional -- the improvements you guys or your community
make to Couch feed back into Apache? Or are there different
versions of Couch now?
PHILLIPS:
No. We're trying to be completely bidirectional.
The challenge of course is that CouchDB is an Apache community
or Foundation project. And we can't just go jamming stuff back
into there without going through the Foundation process.
We may be able to run slightly ahead of CouchDB, because we're
unencumbered by those processes, but our intent -- it's a
reinforcement to us because it's good for us, right -- is to
shove everything back into that CouchDB.
We're not just trying to be a better Couch. We're trying to be
something different, which is an elastic Couchbase with
memcache, all these other things. We want Couch to be the best
core document database technology period. And so we want to
push all the enhancements and fixes back up to the Apache
community. But there may be some lags simply due to process.
GLOVER:
Sure. Yes. That totally makes sense. You mentioned
couchbase.com. That's where we can go and start. We can
download these libraries and start evaluating them, playing
with them, like you said, hobbyists, et cetera. Are there
tutorials? Is couchbase.com the place to go to read more about
this? Or in addition to going to Couch's website and
memcache's?
PHILLIPS:
Ultimately we do want to be the place where it's a
one-stop shop for the best information.
GLOVER:
Okay.
PHILLIPS:
Part of the value that we want to bring is being a
curator, if you will, and a producer of very good content to
make it super easy for people to get going with these
technologies.
There's a link off of couchbase.com to something called Tech
Zone, or you can go directly to techzone.couchbase.com, where
there's a lot of information about using these technologies.
There's also a really good white paper on the couchbase.com
website called "Why NoSQL?" that basically encapsulates all
the stuff I talked about earlier in this call: Why are we
here? Why is database technology changing? So there's an awful
lot of good information up there, and I would encourage people
who are interested in this to go check it out.
GLOVER:
Awesome. And then you had already mentioned -- but
just in case someone missed it -- Couchbase on Twitter. So you
can follow, keep up to date with the crew at Couchbase.
PHILLIPS:
Yes, yes.
GLOVER:
And I know you all have a blog feed on Couchbase,
as well. And I saw that you have guest bloggers as well. To
feed back into the earlier podcast that we did with CouchOne,
which was the whole mobile stack, I saw there's a tutorial
there listed.
So, excellent. Well, James, I know I speak for everyone
listening when I say, thank you very much for taking some
time out of your day and sharing how we got here. I thought it
was very, very interesting. I look forward to reading that
white paper you just mentioned.
PHILLIPS:
Thank you. Super.
GLOVER:
Hopefully I can spread the good news because I
think you articulated it excellently. Far better than I could.
PHILLIPS:
I really appreciate it.
GLOVER:
So again, thank you very much, and definitely look
forward to seeing all the new and exciting things that come
out of Couchbase.
PHILLIPS:
Thank you, Andy. We'll talk to you soon.
>>
Andy Glover talking with James Phillips. Find more
episodes in Andy's Java technical series of the developerWorks
podcasts at ibm.com/developerworks/java.
[END OF SEGMENT]