>> This is the developerWorks podcasts Java technical series, hosted by Andy Glover. Here's Andy.
GLOVER: This is Andy Glover with the Java technical series. My guest today is James Phillips with Couchbase. He is a co-founder and senior vice president of products. And Couchbase is a new company, which is the combined forces of CouchOne and Membase. So I thought we'd start, James, with what happened here. Tell us about this merger.
PHILLIPS: Sure. The merger happened in ... I can't even remember when we announced it now. I think we announced it mid-February, so it's been a couple of months.
GLOVER: Okay.
PHILLIPS: Mid-February 2011. And it was a combination that was driven, surprisingly -- maybe not surprisingly, but incredibly -- almost literally by customers or by users of our technology.
GLOVER: Yes?
PHILLIPS: There was a tweet that happened when we were in the middle of going through integration discussions, merger discussions, prior to actually closing where someone tweeted, could someone please combine Membase and Couch?
GLOVER: Nice.
PHILLIPS: It really was all about the technology. If you look at what we were doing at Membase, we were providing this very, very high-performance sort of caching and clustering technology that allows you to smear your data across a lot of servers and to use memory to cache your most recently used data. That high-performance, very elastic, scalable data-management technology is what we were doing at Membase. We were storing data also on disk, but we weren't giving you the query ability, indexing, and the other things that people tend to expect out of a primary database.
GLOVER: Sure.
PHILLIPS: We weren't providing those capabilities. And that's what Couch was all about. Couch was a document database technology that did allow you to query and index and do database-like functionality. People loved our elasticity and our high performance, but they really missed the ability to query.
On the flipside, Couch didn't have a very high-performance story. They didn't have an elastic scalability story, and people really missed that or wanted that out of Couch technology. So marrying those two technologies and now being able to smear your data across a lot of servers, get very high performance when you're interacting with your data, and still have the ability to do queries and indexing and all the other great stuff that people expect out of database technology is what we get out of this merger. I read a blog post when we completed the merger talking about that tweet and sort of in jest but very seriously saying, please keep the good ideas coming. [LAUGHTER]
GLOVER: Yes. Yes.
PHILLIPS: Coming from the user population is certainly where you want to be drawing your ideas.
GLOVER: Interestingly, in season one I had the opportunity to talk to Aaron Miller, who was doing some interesting work with Couch. At the time, at CouchOne, he had ported Couch to mobile devices -- at the time, Android. And definitely had some other conversations about other NoSQL-like datastores such as App Engine and Mongo. And so I thought it would be interesting to talk about, how did we get here? Obviously, there's a lot of momentum here if you've got your users saying hey, why don't you combine forces? Someone's got to solve this problem. So this isn't build it and they will come; this is, they want it, let's get it in front of them as quick as possible. What happened? How did we get here?
PHILLIPS: Okay. And absolutely, right? We've all been through the build it and they will come. Sometimes it works out, but a lot of times it really sucks. [LAUGHTER]
GLOVER: Yes. Been there, done that.
PHILLIPS: Exactly. It's nice when there is a real acute problem in the marketplace where people are really looking for a solution. And we feel fortunate that we are able to deliver on what is turning out to be a real acute need. And so, the way we got here, if you go back ...
let's go on the way-, wayback machine here. You go back to the early 1970s, mid-1970s, and that was the point in time at which relational database technology was invented. If you think about the interactive software systems of that day and you think about the infrastructure atop which those systems ran: very, very stark contrast to what we find today.
GLOVER: Yes.
PHILLIPS: Monolithic computing was the norm. You had minicomputers and mainframes. If you needed more capacity you expanded your box, right? Networking was in its infancy -- data networking. The users of interactive software systems at that point were measured in the tens to hundreds. And if you were really, really, really bleeding-edge large, maybe you had a couple thousand users.
GLOVER: Yes.
PHILLIPS: Sabre Airlines' reservation network or Bank of America's teller network maybe had a couple of thousand concurrent users online at any one time.
GLOVER: It was also quite expensive, at least hardware was at that point. Right?
PHILLIPS: Absolutely. And memory was very scarce and expensive. And so the environment in which relational technology was born basically made a bunch of assumptions that were valid at that point in time and led to a technology that was ideal for the kinds of software systems that were being built and the kinds of infrastructure atop which they were running. If you fast-forward now 40 years and you look at the kinds of interactive software systems that are being built today, most new interactive software systems are interacted with via browser over the web, and you've got the ability, as a result, to have concurrent user populations that are literally many orders of magnitude larger than even the very largest of applications in the mid-1970s. So being able to support concurrent utilization of your software systems today is a challenge to what was being dealt with there.
Additionally, if you look at the mode of computing now, the infrastructure, it's no longer about, let's go get a bigger and bigger mainframe, but rather, let's spin up some more virtual machines or let's launch some new cloud instances. And the model of computing really is about scaling out versus scaling up, and being able to match the capacity of the application, the infrastructure, with the needs of the concurrent utilization patterns in real time.
GLOVER: Right.
PHILLIPS: And so, application architecture was the first thing to fall over. If you look 10 years back, just 10 years ago, we saw this shift toward -- with web applications in particular, which again are the predominant form of interactive software at this point -- where you take stateless web servers and you put them behind a load balancer. And if you have more users now you spin up some new instances, update your load balancer, get those new guys in the rotation, and off you go.
GLOVER: Right.
PHILLIPS: And so you've got this very flexible, elastic, very scalable, cost-effective mechanism for supporting large sets of users. One more virtual machine or one more DL380 in the rack. Another thousand users supported, boom. Another thousand, another Dell box. So it's got this very nice, linear cost curve associated with it, and you can basically scale that out almost indefinitely from a performance perspective while maintaining cost and performance. So, great model. But if you look at the database technology that is by and large relational technology, it's very different. Relational technology was built to be a shared-everything technology. If you want to increase the size of your database or the amount of concurrency that your database can support, you've got to get a bigger box. It's very difficult to shard a relational database, which means partitioning the data and spreading it across multiple servers. Now, having said that, there have been a lot of Band-Aids sort of thrown at the product.
The first Band-Aid certainly was sharding. Okay? I can't smear my data out using the relational model naturally across multiple servers, so I'm just going to partition my data set. I'm going to take all of my users west of the Mississippi and put them on this box and take all of the users east of the Mississippi and put them on that relational database. And I need to change my application logic to know that if this guy is east of the Mississippi, I need to go to this box to get the data or the other one. Now, the challenge with sharding a relational database in that manner is that it's very disruptive to the application. What happens when you fill up your west-of-the-Mississippi box? You've got to go reshard, change application logic. You lose the ability to do any sort of complex cross-server relational queries. You can't join across multiple relational database management systems when they're sharded like that. And so you're starting to give away a lot of the richness of the query ability of the relational model when you do that. And so sharding is a can of worms. Another approach -- in order to increase the concurrency capabilities of relational technology -- has been to be less normal form in the way that you're defining your schemas. Relational technology done right is all about normalizing your data and basically taking a record and splitting it up into multiple tables where you've got foreign-key relationships et cetera. So that when you're talking about a row of data, what you're really talking about is little pieces from a bunch of tables. And if you want to go update a row, naturally you've got to lock down all of those tables in order to get an atomic update, right?
GLOVER: Yes. Yes.
PHILLIPS: All of that locking that occurs in relational technology makes it very difficult to get a high degree of concurrency.
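The sharding scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything from the interview: two in-memory dictionaries stand in for two separate relational servers, and the names `WEST_STATES`, `shard_for`, `save_user`, and `find_users_named` are hypothetical. It shows the two costs mentioned: the application has to carry the routing rule, and any query that spans shards has to be stitched together by the application.

```python
# Illustrative only: two dicts stand in for two separate database servers.
# All names below are hypothetical and chosen for this sketch.
WEST_STATES = {"CA", "OR", "WA", "NV", "AZ", "TX"}  # simplified "west of the Mississippi"

shards = {"west": {}, "east": {}}


def shard_for(state):
    """Application-level routing: the app must know which box holds a user."""
    return "west" if state in WEST_STATES else "east"


def save_user(user_id, name, state):
    """Write the row to whichever shard the routing rule selects."""
    shards[shard_for(state)][user_id] = {"name": name, "state": state}


def find_users_named(name):
    """No single server can answer this, so the app asks every shard and merges."""
    matches = []
    for shard in shards.values():
        matches.extend(uid for uid, row in shard.items() if row["name"] == name)
    return sorted(matches)


save_user(1, "Ann", "CA")
save_user(2, "Bob", "NY")
save_user(3, "Ann", "FL")
print(find_users_named("Ann"))  # [1, 3]
```

Changing the routing rule later, for example when the west shard fills up, would mean moving existing rows and updating `shard_for`, which is the resharding disruption described above.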
And so what people have done is to denormalize a lot of their data so you don't have it smeared across multiple tables and so that when you go do an update you really are just locking down one table and you're getting it over with fairly quickly. And again, you're not using relational technology like relational technology in order to solve this problem, which is increasing concurrency. The final Band-Aid that has been thrown into the mix is distributed caching -- something like memcached. Where, okay, I'm using relational technology. It doesn't do a really good job with high concurrency, and so I'm going to smear the data across a bunch of cache machines in front of that big monolithic relational database management system or even sharded database management system in order to allow myself to get some of those cost economic benefits of spreading my data out. So memcached can sit in front of, effectively, a relational database management system and allow me to spread my data out across lots and lots of servers. If I have more data or I want more parallelism support, simply get another couple of servers, throw them into the mix, and off I go, right -- just like I'm doing at the application layer. The problem, of course, is that if I'm throwing a completely new tier of data-management infrastructure into the mix, I have more complexity. Now I'm managing both a cache and a database management system.
GLOVER: Right.
PHILLIPS: It only works for reads. You still have to, if you want durability, write your data back into a database management system. So if you've got any sort of write contention, you're back in the same boat. There are consistency issues that one needs to deal with.
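The caching arrangement described above is often called the cache-aside pattern, and a minimal sketch may help make the read/write asymmetry concrete. The assumptions are loud ones: plain Python dictionaries stand in for memcached and for the relational database, and the `read` and `write` helpers are hypothetical names, not part of any real client library.

```python
# Illustrative only: dicts stand in for memcached and for the database.
# Function names are hypothetical and chosen for this sketch.
database = {"user:1": "Ann"}
cache = {}
db_reads = 0


def read(key):
    """Cache-aside read: try the cache first, fall back to the database."""
    global db_reads
    if key in cache:
        return cache[key]
    db_reads += 1
    value = database.get(key)
    if value is not None:
        cache[key] = value  # populate the cache for later reads
    return value


def write(key, value):
    """Writes still hit the database for durability; the cached copy is dropped."""
    database[key] = value
    cache.pop(key, None)


read("user:1")
read("user:1")
print(db_reads)  # 1
write("user:1", "Anne")
print(read("user:1"), db_reads)  # Anne 2
```

Repeated reads are absorbed by the cache, but every write still lands on the database, so write-heavy workloads see little benefit. Forgetting the invalidation step in `write` would leave stale data in the cache, which is one form of the consistency issue mentioned above.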
And so generally speaking, things like sharding and denormalization and distributed caching, they are all attempts to sort of bandage over the fact that relational database technology has not kept pace with changes in application architecture, the infrastructure atop which applications run, and the kinds of user populations that are using these systems. There's a fundamental problem that needs to be addressed. And because you're dealing with state and because distributed data management is a very, very nontrivial problem, people have been willing to try to paper over this and make work what they have. But at some point it's got to give. And I've always said that the database management technology is the last domino to fall in this march toward truly distributed application architecture. If you look at some of the leading organizations that have had to grapple with the real pain of trying to make these technologies work at scale, because at some point you can kind of Band-Aid, Band-Aid, Band-Aid, but eventually if you get to a sort of tipping point it's just not going to work anymore.
GLOVER: Yes.
PHILLIPS: Amazon invented Dynamo, their distributed database technology. Google invented BigTable. And techniques like MapReduce and Hadoop. And so necessity was the mother of invention in a lot of these organizations where they tried and tried and tried, and then finally got to the point where they said, you know what? This is silly. We need to rethink the way that we're doing data management. And that led to a lot of the innovation that's now working its way back down to not just the Googles and the Amazons, but anyone to do data management, quote-unquote, correctly, which to me is code for more cost-effectively, more efficiently, and more in line with the kinds of architectural approaches that seem to be working at every other point in the stack, now doing that at the data layer. At Couchbase, ultimately that's what we're all about.
We are all about knocking over that last domino: allowing data-management technology, database technology to finally catch up with what's been happening over the last decade in a broad way at the application-logic layer. That in part is why we did the merger: to bring together what we believe are all the critical components to make it truly possible to store your data in that manner and to make it accessible to everyone.
GLOVER: That was excellent. I think you've articulated how we got here just wonderfully, better than I could have. And something that I really enjoyed was bringing the example of Amazon and Google, how these companies had this problem and dealt with it. They built their own, and all of a sudden people started realizing okay, this is legit. These aren't two guys in a garage peddling this object database, right? And now there are myriad data stores out there, as we already talked about. We've talked about Couch, we've talked about Mongo on this series of podcasts. We've talked about BigTable. Obviously you all are Couch and memcache. And memcache isn't the only kind of caching technology out there, although probably the 500-pound gorilla at least on that side. What makes these two open source projects unique and viable? What kind of problems do I have that these things solve as opposed to foo and bar, the other options out there?
PHILLIPS: Sure. The truth of the matter is that we're really, really early in commercialization -- maybe to use some words that could be negatively construed, but I think positively, right? Sort of in the mass marketing, if you will, or I guess the word I'm looking for is: not everyone is Amazon or Google and able to invent database technologies that you use within your organization that you care and feed for that aren't really packaged up necessarily for consumption by the masses. We have to get to that point, right?
Until there are technologies that are really, really easy for any developer and/or operations team to pick up and begin using with the kind of comfort and with the feeling of security that they get from an Oracle or a SQL Server -- or even, I guess, maybe perhaps at this point MySQL -- we have to get there before this technology is going to be meaningful in any large way. And we're early, right? We are still, I think, maturing these technologies and hardening them and making them operationally ready to the point that the masses can actually start to consume and use them. Having said that, I think that one of the things that sets Couchbase apart -- and I certainly don't want to make this just an advertisement for Couchbase, because I think that the larger conversation about database technology and where it's heading is certainly more interesting -- but one of the things that does set us apart is that we are building from some of the most mature technologies in what is an otherwise very immature market. memcache was invented almost 10 years ago, now.
GLOVER: Yes. Yes.
PHILLIPS: It is battle-proven. It is a core component behind 18 of the top 20 websites on this planet by traffic count. It doesn't get much more hardened than those numbers, right?
GLOVER: Yes. Exactly.
PHILLIPS: Couch, likewise: it was the forerunner of all the emergent document database technologies. Damien Katz invented that back in 2005. Damien is our CTO here.
GLOVER: And he comes with -- correct me -- the Lotus Notes background.
PHILLIPS: Yes. Absolutely.
GLOVER: Which was one of the first popular document stores.
PHILLIPS: You bet. So a lot of the replication technology was invented by Damien back in the Lotus Notes days. Big heritage there, right? And a lot of maturity in the software systems. And so by marrying those two technologies in particular -- memcache and CouchDB -- we believe that we are far ahead of many of the other emergent technologies.
Not that they are poor technologies, but they are just less proven. And I think that one would be hard-pressed to identify a more, quote-unquote, mission-critical or core component of an IT architecture than your database. Right?
GLOVER: Right. Exactly.
PHILLIPS: And you don't monkey with that. So that I think is one of the things that really sets us apart. There are various flavors or approaches to, quote-unquote, NoSQL database technology, which is kind of a buzzword that the industry's thrown around for what is probably more appropriately called nonrelational or elastic database technologies. So document database technology is one approach. There are columnar databases like Cassandra or HBase, or BigTable at Google. There are pure key/value stores. And it is my belief that probably the most widely applicable, certainly easiest to understand and develop against -- I think what will ultimately be the winning approach to having a scalable database technology that can be distributed -- is this document approach.
GLOVER: Okay. And why is that?
PHILLIPS: Well, it's really easy to wrap your head around. If you think about what a document database is, actually first and foremost, it's actually a confusing term. Most people when they hear the word document they think Word document or Excel document.
GLOVER: That's right.
PHILLIPS: And that's not at all what a document database is, unfortunately, right? And therein lies the confusion. What a document database is, basically: with a relational database when you insert a record, you're inserting a set of well-defined fields that have certain types, and it's very rigorously defined in a database schema, right? You can't start inserting data into a relational database until you have defined that schema and what the columns are and what the variable types are, et cetera.
GLOVER: Right.
PHILLIPS: In a document database, you simply shove in a row of whatever you want. And the row is a document.
The definition is it's an XML document or it's a JSON document or it's some other self-describing type of quote-unquote record, where you've basically said okay, here's a name. And then you've got the name and slash name. Here's an age, slash age. And so it looks like a marked-up XMLish JSON kind-of self-describing document. You've got metadata along with the data. You don't have to go about creating a schema before you start inserting data. If you change your mind later about what kind of data you want to capture, you don't have to update a schema and renormalize and all the things that you have to do with a relational database. So it's easy. I've got a record or an object or a piece of information. I just want to stick it in there.
GLOVER: Right.
PHILLIPS: And it's self-contained, which makes it really, really easy to distribute, because I don't have to worry about picking up various pieces from all the tables that have foreign-key relationships. I've got this blob and I'm going to move it from here to this other server now in order to smear it out if I'm changing the size of my cluster. So, no schema required. Just start shoving stuff in there. Change your mind about what you're shoving in there at any point in time. And then later, when you want to query your information or get it out, you've got the ability to do that. You basically infer the schema later, versus forcing the schema before you can do the inserts. This notion of hey, I have some data, I just want to store it, and I want it to be safe and durable, and I want to get it back later is something that is extremely intuitive for people.
GLOVER: Yes. It's easy, and you have a low barrier to getting started.
PHILLIPS: Exactly. And some of these other approaches, the pure key/value approach -- which is what memcache historically has been -- where you've got one primary key and then a big blob of data that sits behind it.
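The schema-free insert and query-later idea described above can be illustrated with a short Python sketch. It is a toy under stated assumptions, not how CouchDB or Couchbase work internally: a list of JSON strings stands in for the database, and `insert` and `find` are hypothetical helpers invented for this example.

```python
import json

# Illustrative only: a list of JSON strings stands in for a document database.
# The helper names are hypothetical and chosen for this sketch.
store = []


def insert(document):
    """Store any self-describing document; no schema is declared up front."""
    store.append(json.dumps(document))


def find(field, value):
    """Query later by looking inside each document for a matching field."""
    docs = (json.loads(raw) for raw in store)
    return [doc for doc in docs if doc.get(field) == value]


insert({"name": "Ann", "age": 34})
insert({"name": "Bob", "pets": ["cat"]})  # different fields, no schema change
print(find("name", "Bob"))  # [{'name': 'Bob', 'pets': ['cat']}]
```

Because each document carries its own field names, the second insert needs no migration, and the query simply skips documents that lack the requested field.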
The problem with that is your database can't do any work for you, because it doesn't know how to look inside the blob. The only thing you can do is put a blob in and get the blob out. If you want to go get a list of all the blobs that have a cat inside of it, the database can't do that for you, right? And so that approach is less full-featured from a database perspective than developers have become accustomed to, and rightly so. On the columnar-store side it's just really, really confusing, and there are still schema-like things that need to be done in order to use it. You've got these column families and supercolumns. It's very foreign and alien from a developer perspective, and a lot of people have had trouble wrapping their head around the model. I think that the document approach is the right sort of middle ground where you've got enough expressive power in your document-manipulation language or your query language. And it's simple enough to start using it that anyone can do it. That's why I say that I think ultimately it's going to be the winning approach. And solutions like Mongo or like Couchbase, I think, are the two representative document database technologies that I think have a good shot at being the next relational technology, if you will.
GLOVER: Right. Right. Going back to Mongo and Couch and memcache combined, obviously you all have a stack that is Couch and memcache. Is it optimized in some way that makes it special or easier than me just rolling my own, so to speak, and downloading, let's say, Couch from Apache and then go grab memcache?
PHILLIPS: Oh, absolutely. The thing is that one of the things that we talked about earlier is the downside of trying to manage two completely separate tiers -- database layer, et cetera. And so Couchbase takes those ingredient technologies -- memcache and CouchDB -- but melds them into a single-tier solution. When you install Couchbase, you're not installing memcache and Couch.
You're installing a single tier that's elastic, that's scalable, that automatically and transparently caches your data but also stores it durably in the back end of CouchDB. And because we are tightly coupled between those two technologies, and because they're co-deployed and we're in control of both of them, we can make all sorts of optimization decisions about what lives in cache and what doesn't and when and why.
GLOVER: Right.
PHILLIPS: And so Couchbase at some level, like Mongo, is a complete solution, if you will. It includes the high-performance caching capability, it includes the ability to distribute your data across multiple nodes, it includes the high expressive factor of the ability to do distributed queries. All of that stuff is provided in a single tier. Instead of having a separate caching layer trying to fix up problems at the database layer, it's just a good database.
GLOVER: Exactly. And so back to Couchbase and then the fact that there is open source. You can go to Apache and grab Couch. So what is the business model with Couchbase? Can I get ahold of the stack? How does it work?
PHILLIPS: So Couchbase, if you go to couchbase.com or -- I'm going to get the ads in here -- Couchbase is our Twitter feed, as well.
GLOVER: Excellent.
PHILLIPS: So Couchbase everywhere, right? So couchbase.com. If you go there you are able to download free of charge and use freely the Couchbase binaries. It's a completely open source project, as well. We basically have taken those two ingredients -- CouchDB and memcache -- and the clustering technology that we built when we were Membase and combined all of that into a single repository, which is Couchbase. It's open source, freely available; take it, use it, do whatever you want with it. We have two editions that we make available. There's the community edition and the enterprise edition.
GLOVER: Okay.
PHILLIPS: The difference between the two being that the community edition sort of represents the current branch of development where we're adding lots of capabilities and features. The binary that we put out there is the most stable build of that high-churn branch, if you will. But it is sort of high-churn. We certainly don't claim that it's ready for production use. We can't support it. And so it's great for developers and for hobbyists and for folks who want to be on the bleeding edge, from a feature/functionality perspective, but the enterprise edition represents a point-in-time snapshot of that open source project that we sort of pulled off, branched, ran through very, very rigorous quality-assurance processes, added some supportability tools into the code. We're able to patch it; we're able to support that branch. And we do charge for production deployments of the enterprise edition beyond two-node clusters.
GLOVER: I see.
PHILLIPS: So even the enterprise edition is free for development/test/preproduction and in small-scale production. But if you want to go to large-scale production, the license is on a per-node basis for the enterprise edition. So it's a typical sort of open source model where ultimately what you're paying for is the stability, supportability, maintainability of the certified edition, if you will.
GLOVER: Right. Exactly.
PHILLIPS: But it's all open source. If you want to go build the software and do with it whatever you please, you're welcome to do that.
GLOVER: You're combining memcache and Couch into your stack. Is your Couch version a fork of Apache Couch? Or is it bidirectional -- the improvements you guys or your community makes to Couch feeds back into Apache? Or are there different versions of Couch now?
PHILLIPS: No. We're trying to be completely bidirectional. The challenge of course is that CouchDB is an Apache community or Foundation project.
And we can't just go jamming stuff back into there without going through the Foundation process. We may be able to run slightly ahead of CouchDB, because we're unencumbered by those processes, but our intent -- it's a reinforcement to us because it's good for us, right -- is to shove everything back into that CouchDB. We're not just trying to be a better Couch. We're trying to be something different, which is an elastic Couchbase with memcache, all these other things. We want Couch to be the best core document database technology period. And so we want to push all the enhancements and fixes back up to the Apache community. But there may be some lags simply due to process.
GLOVER: Sure. Yes. That totally makes sense. You mentioned couchbase.com. That's where we can go and start. We can download these libraries and start evaluating them, playing with them, like you said, hobbyists, et cetera. Are there tutorials? Is couchbase.com the place to go to read more about this? Or in addition to going to Couch's website and memcache's?
PHILLIPS: Ultimately we do want to be the place where it's a one-stop shop for the best information.
GLOVER: Okay.
PHILLIPS: Part of the value that we want to bring is being a curator, if you will, and a producer of very good content to make it super easy for people to get going with these technologies. There's a link off of couchbase.com to something called Tech Zone, or you can go directly to techzone.couchbase.com, where there's a lot of information about using these technologies. There's also a really good white paper on the couchbase.com website called "Why NoSQL?" that basically encapsulates all the stuff I talked about earlier in this call: Why are we here? Why is database technology changing? So there's an awful lot of good information up there, and I would encourage people who are interested in this to go check it out.
GLOVER: Awesome.
And then you had already mentioned -- but just in case someone missed it -- Couchbase on Twitter. So you can follow, keep up to date with the crew at Couchbase.
PHILLIPS: Yes, yes.
GLOVER: And I know you all have a blog feed on Couchbase, as well. And I saw that you have guest bloggers as well. To feed back into the earlier podcast that we did with CouchOne, which was the whole mobile stack, I saw there's a tutorial there listed. So, excellent. Well, James, I know I speak for everyone listening when we say, thank you very much for taking some time out of your day and sharing how we got here. I thought it was very, very interesting. I look forward to reading that white paper you just mentioned.
PHILLIPS: Thank you. Super.
GLOVER: Hopefully I can spread the good news because I think you articulated it excellently. Far better than I could.
PHILLIPS: I really appreciate it.
GLOVER: So again, thank you very much, and definitely look forward to seeing all the new and exciting things that come out of Couchbase.
PHILLIPS: Thank you, Andy. We'll talk to you soon.
>> Andy Glover talking with James Phillips. Find more episodes in Andy's Java technical series of the developerWorks podcasts at ibm.com/developerworks/java.
[END OF SEGMENT]