Architecture of the Gracenote Service
Steve Scherf
February 22, 2006

Introduction

Gracenote, formerly known as CDDB, is the industry leader in media recognition technology and services
– Started in 1996 as a hobby CD recognition service
– Since that time, the Gracenote service has evolved into a truly flexible, reliable, high-performance service powering numerous digital media recognition products
– There have been three major architecture redesigns across its history. Three times a charm.
This will be a short journey through the workings of the Gracenote service as it exists now
Please ask questions at any time

Gracenote offers a wide suite of Internet-based media recognition products
– CD recognition
– DVD recognition
– Audio file recognition
  • Waveform analysis
  • Music metadata search engine
Other related products
– Playlist generation
– Recommendation engine (alpha)
– Third-party "link" data delivery
– Phonetic metadata delivery
– Content delivery (cover art, artist bios, reviews)
All of these products are built on top of a single software architecture

Gracenote also offers a number of embedded media recognition solutions
– Waveform audio file recognition
– CD recognition
– Soon to come: text metadata search
These typically run inside home audio components, as well as car stereos and navigation units
These products generally operate standalone, and contain scaled-down versions of the same technology used in the Gracenote online service

Gracenote has built a database of album metadata for over 4 million CDs, comprising information such as:
– Artist / album / track names
– Album credits (who played what instrument on which tracks, where the album was recorded, etc.)
– Genre information
– Other extended information, such as artist place of origin and era (e.g. '60s, '70s)
– Associated media recognition data, such as the all-important CD TOC (table of contents) and track audio waveform fingerprints
Similar data has been gathered for a large set of the most popular DVDs
We also have information linking our database to those of third parties, such as Muze, Loudeye, Yahoo, Apple and AMG

All of this data is used to power Gracenote online products
How is this large quantity of data replicated, stored, indexed and served to millions of users every day with a minimum of servers and no downtime?
Let's start at the beginning...

In The Beginning

[Photo: the birthplace of CDDB]

Gracenote had a humble beginning in 1996 as a hobby, which soon grew into a full-time job (with no pay!)
Then known as CDDB (aka the CD Database), now affectionately known as CDDB1
Open-source server code, which had to be portable and easy to run
The original requirements called for the server to interoperate with Xmcd (a Unix CD player app) and to share its database format
This (mis)informed much of the design of the original service

CDDB1

Key attributes of the service
– Built to be run by anyone with a system on the Internet
– Portability to any Unix-like platform
– Remotely maintainable
– Totally automated
– Stable (>2 year uptime on a single server!)
– Simple, open protocol
– User-contributed database
– Worldwide distribution of servers allowed 100% service uptime
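To make the "simple, open protocol" concrete: the heart of a CDDB1 lookup was a disc ID computed from the CD's TOC. The routine below is essentially the algorithm published with xmcd/CDDB1, and its tiny checksum is what produces the collision rate discussed next.

```c
/* The classic CDDB1 disc ID, essentially as published with xmcd.
   frames[] holds the absolute frame offset (1/75 second) of each track,
   with the lead-out offset as the final entry frames[ntracks]. */
static int digit_sum(int n)
{
    int s = 0;
    while (n > 0) {
        s += n % 10;
        n /= 10;
    }
    return s;
}

unsigned long cddb_discid(const int *frames, int ntracks)
{
    int i, cksum = 0;

    for (i = 0; i < ntracks; i++)
        cksum += digit_sum(frames[i] / 75);  /* track start, in seconds */

    /* Disc length in seconds, first track to lead-out. */
    int total = frames[ntracks] / 75 - frames[0] / 75;

    /* Only 8 bits of real hash (cksum % 0xFF): the root of the ~10%
       collision rate described below. */
    return ((unsigned long)(cksum % 0xFF) << 24) |
           ((unsigned long)total << 8) |
           (unsigned long)ntracks;
}
```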
Problems
– Bad DB structure (i.e. no real database - it used the filesystem)
– Faulty TOC identifier - the disc ID
  • Up to a 10% rate of collision!
– No client / user validation - fine for low volume only
– Limited data format
– Difficult for developers to implement the protocol
– Not good for anything except CD recognition
Solving these problems and extending to support other services required a total redesign
So we embarked upon building CDDB2

CDDB2

"A lot of times when you first start out on a project you think, this is never going to be finished. But then it is, and you think, wow, it wasn't even worth it." - Jack Handey

Problems solved
– User identification / validation
– Client identification / control
– Album ID space issues
  • No more disc ID
  • Used a good TOC hashing scheme instead (a hypothetical illustration follows this section)
– SDK for developers
  • No more protocol implementation issues
  • Complex interaction with apps now possible
– Encrypted transactions to protect data and users
– Compression to reduce network latency
– Capability beyond TOC lookup
  • Easy to add new functionality

Problems introduced
– Inflexible precalculated data
– Difficult or impossible to add new protocols
– Unscalable
  • New front-line servers placed a high additional load on the master Oracle server
  • Could not be expanded outside of a single colocation facility
– Slow
– Expensive
  • A new Oracle license was required for each new server
  • Sun hardware expensive - bad cost/performance
– High maintenance
– Mediocre implementation
  • Calculations done in Oracle
  • Single points of failure

The Gracenote Service Redesigned

Service Version 3

Key attributes
– Written in plain C
  • Best performance
  • Best control over behavior
  • Greater transparency
– Extremely portable to any Unix-like platform
  • Thoroughly tested on Mac OS X, Linux, Solaris, FreeBSD
  • Runs on any ILP32 or LP64 system
  • Tested on SPARC, Intel, AMD and PowerPC processors
– Near-100% original code
  • The asymptotic goal is 100%
  • Key custom components:
    – Server core operating environment (ds_core)
    – XML parser
    – XML object database interface
    – Database cache
    – Heap manager (i.e. malloc replacement)
    – Web server module
No closed-source software
– This keeps the operation of our own service transparent
– Oracle has been engineered out of the service completely
Effectively 100% uptime
– Massive server redundancy
– Multiple colocation sites
– Automatic server detection and failover
– Intra-server load balancing and routing
More key attributes
– Modularity
  • All server types run the same code
    – Simple configuration changes decide what type of server a box should be
    – New features or protocols are added by creating a plugin module
– Geographically distributable
  • Any number of separate server sites can be supported
  • Each server has an identical copy of the database
– Self-replicating
  • Servers are kept in sync in near-realtime
  • Only a thin pipe is required to keep all sites in sync
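As promised above, here is an illustration of the CDDB2 disc ID fix. The actual Gracenote TOC hashing scheme is proprietary, so this sketch is purely hypothetical: it hashes every byte of every frame offset through 64-bit FNV-1a instead of collapsing the TOC into an 8-bit digit checksum. Any strong hash over the full TOC removes the CDDB1 collision problem.

```c
#include <stdint.h>

/* Hypothetical stand-in for CDDB2's TOC hash (the real scheme is not
   public): FNV-1a over every byte of every frame offset, lead-out
   included, so the entire TOC contributes to the identifier. */
uint64_t toc_hash(const uint32_t *frames, int nentries)
{
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */

    for (int i = 0; i < nentries; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (frames[i] >> (8 * b)) & 0xff; /* one byte at a time */
            h *= 0x100000001b3ULL;              /* FNV-1a prime */
        }
    }
    return h;
}
```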
Still more key attributes
– High performance
  • Designed for performance from the start
  • Data is denormalized (XML) to remove the need for database joins in realtime
  • Much of the database and indices is kept in RAM
– Transformable data
  • The XML format allows data to be easily transformed into any protocol we wish to support
– Client / server interaction
  • The smart client SDK allows much of the work to be offloaded to the user's computer
  • The client decides what it needs and the service provides it
  • The client synthesizes results from what it is provided
– Uses POSIX threads
  • Threads are used instead of multiple processes
  • A single process per server removes the need for IPC
  • Affords excellent scalability
– Designed to function nominally on any single system
  • The entire service will even run on a laptop computer
  • Bigger hardware simply translates into faster operation
    – More RAM means more data and indexes in-core
    – More CPUs allow better concurrent execution
  • All data fits on a single server
    – All transaction servers have an entire copy of the database
    – Any server can take on any role required at any time

Server Architecture

Operating environment - ds_core
– ds_core is the foundation of the Gracenote service
– Handles all the work of "daemonizing"
  • Forking / disassociating from the terminal / signal handling
– Abstracts and extends POSIX pthreads
  • Hides much of the uglier parts of POSIX threads
  • Testable joins
  • Shared locks
  • Enforced lock hierarchy
  • Thread tracking and auto-cleanup
– Provides message logging facilities
– Provides the plug-in loading and execution environment (see the loader sketch below)
  • Manages plug-in state
  • Provides a facility for intra-module API calls
    – Manually loaded modules on Unix do not have visibility into other modules' symbol space
    – ds_core provides symbol resolution, and protects API calls against module unloading
– Memory allocation tracking
  • Logging of allocations and frees
  • Bounds checking upon free
  • Leak containment
– Monitoring and control interface
  • Simple text-based network control interface
  • Allows interaction with humans via telnet, as well as with monitoring software

Threading model
– POSIX threads are utilized
  • ds_core abstracts the API, so another thread library could be used without affecting plugin modules
– Threads are spawned on demand
  • OSes spawn new threads at least as quickly as pooled threads can be awakened and repurposed, sometimes faster
  • Simpler than pooling
– One thread, one query model (sketched below)
  • Each user query is handled by a single thread
  • The thread dies when the query is complete
  • Threads traverse vertically through the software layers during queries
    – Modules do not own user query threads; they simply take API calls from the top level where threads are created
  • Modules may create maintenance threads only

Server Hierarchy

[Diagram: the Oracle master feeds an OCI box, which feeds the master servers; masters feed sync nodes and XPs; APs front the XPs and face the Internet through a Cisco load balancer]
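A minimal sketch of the kind of plug-in loading ds_core is described as doing, using the standard Unix dlopen() interface. The entry-point name, its signature and the error handling here are assumptions for illustration, not the actual ds_core API.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Modules are opened RTLD_LOCAL, so they cannot see each other's symbols;
   the core looks up each module's entry points itself and brokers all
   intra-module API calls through the resolved pointers. */
typedef int (*plugin_query_fn)(const char *request, char *reply, int replylen);

void *plugin_load(const char *path, plugin_query_fn *query_out)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "plugin load failed: %s\n", dlerror());
        return NULL;
    }

    /* "plugin_query" is a hypothetical entry-point name. */
    *query_out = (plugin_query_fn)dlsym(handle, "plugin_query");
    if (*query_out == NULL) {
        fprintf(stderr, "missing entry point: %s\n", dlerror());
        dlclose(handle);
        return NULL;
    }

    /* A real loader would also track the handle, so that calls made
       through the resolved pointer are protected against unloading. */
    return handle;
}
```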
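The one-thread-one-query model can be sketched with plain POSIX calls. handle_query() is a hypothetical stand-in for the top-level dispatch that walks down through the module layers.

```c
#include <pthread.h>
#include <stdlib.h>

extern void handle_query(int conn_fd);   /* hypothetical top-level dispatch */

struct query_ctx {
    int conn_fd;
};

static void *query_thread(void *arg)
{
    struct query_ctx *ctx = arg;
    handle_query(ctx->conn_fd);  /* calls down through the module layers */
    free(ctx);
    return NULL;                 /* the thread dies with the query */
}

int dispatch_query(int conn_fd)
{
    struct query_ctx *ctx = malloc(sizeof *ctx);
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    if (ctx == NULL)
        return -1;
    ctx->conn_fd = conn_fd;

    pthread_attr_init(&attr);
    /* Detached: nobody joins a query thread; it cleans itself up. */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    rc = pthread_create(&tid, &attr, query_thread, ctx);
    pthread_attr_destroy(&attr);

    if (rc != 0)
        free(ctx);
    return rc == 0 ? 0 : -1;
}
```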
Server types
– XP - transaction processor
  • This is where all the work is done to answer a user request
  • Performs media recognition tasks
    – TOC (CD) lookup
    – Audio file waveform recognition
    – Text search
  • Performs other types of user queries as well
  • Synchronizes data from its upstream source (a master)
  • Generates lookup statistics
  • Has an entire copy of the database
  • Servers can "specialize" in one query type, or serve them all
    – Configuration options control this

XP Module Layout

[Diagram: the XP modules layered on ds_core - the network/COMM layer, dispatch, mastership, sync, CD lookup, DVD lookup, waveform, text search, cache, XML and the DBS (Berkeley DB) database layer]

– AP - application processor
  • These servers translate user queries into XP calls and back
    – They contain no database
  • Protocol plugins allow the AP to speak any client protocol desired
    – The list currently includes five protocols, including CDDB1, CDDB2, XML RPC, etc.
  • Plugin modules for network socket handling and HTTP requests allow the AP to look like a web server
    – It can serve up documents if desired
    – This allows protocols that are encapsulated in HTTP to be routed to the correct protocol module
  • APs are capable of load balancing across XPs (see the routing sketch at the end of this section)
    – They know what query types each XP supports
    – Queries are spread across the XPs
    – Failed XPs are detected automatically and taken out of rotation
– The combination of AP and XP servers forms the basis of the Gracenote service as users perceive it
– Master
  • Responsible for fetching new and updated data from the upstream Oracle master repository
  • Distributes data to downstream XPs and sync nodes
– Sync node
  • Just like a master, but its upstream parent is another master or sync node rather than an Oracle instance
  • These are used for replicating data at geographically remote locations
– OCI box
  • Serves simply as a conduit between Oracle and the master servers
– The distinction between server types is somewhat arbitrary
  • A single server could be configured to load all the modules necessary to perform as two or more types of server at once
  • In practice this is not desirable, since specialization increases performance and flexibility of control

Data
– All data is in XML format
– All textual data encapsulated in XML is UTF-8
– Data is segmented into several distinct tables to allow XPs to specialize
  • All XPs currently have the same data, for now
    – This may change if data and product growth outstrip hardware improvement (a race with Moore's law)
  • Replication can be directed so that only certain types of data reside on an XP
    – E.g., servers performing TOC recognition need only the album and TOC tables, not the waveform, DVD, etc. tables

Feature Details

Development Environment

Third-party libraries
– ICU
  • Character set translation
  • Used only for translating full-width ("fat") tilde characters in Shift-JIS
– Libtomcrypt
  • Cryptographic algorithms that form the basis of our secure protocols
– Berkeley DB
  • C API database engine
– Expat
  • The core of our XML parser
– Zlib
  • Compression library
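The AP-side balancing referenced above can be sketched as a round-robin over healthy XPs. The names are illustrative; a real implementation would guard the shared state with a lock and probe dead servers from a background thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of AP-side XP selection: round-robin across the back ends,
   skipping any XP marked dead until a background probe revives it. */
#define MAX_XP 64

struct xp_server {
    const char *host;
    bool        alive;
};

static struct xp_server xps[MAX_XP];
static int nxps;      /* number of configured XPs */
static int rr_next;   /* next round-robin slot */

struct xp_server *pick_xp(void)
{
    for (int tries = 0; tries < nxps; tries++) {
        struct xp_server *xp = &xps[rr_next];
        rr_next = (rr_next + 1) % nxps;
        if (xp->alive)
            return xp;            /* healthy server found */
    }
    return NULL;                  /* every XP is out of rotation */
}

void xp_mark_failed(struct xp_server *xp)
{
    xp->alive = false;            /* dropped, then retested until recovery */
}
```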
Compiler
– gcc / Sun Workshop on Solaris
– gcc on all other platforms
Build environment
– GNU autotools
Debug environment
– The dbx debugger
  • gdb is hopelessly broken for threaded apps
– Rational Purify / Quantify
  • Buggy and slow, but invaluable

Performance

The average query takes only a handful of milliseconds
– The user generally doesn't perceive the time it takes to get an answer - the network transit time is far longer
Disk I/O is the largest cost in a user query
– You can't completely remove it as a factor without storing everything in RAM
– We have plans to do just that
Our biggest CPU cost in a query is XML parsing
– Our custom parser is fast, but when you do something a lot, even one instruction less can make a difference

Future directions
– Store all data in RAM
  • Special database code required
– Separate indexable from servable data
  • Different compression characteristics
  • XML parse cost reduction
– Preparse XML
  • Some data is already servable, and can be pushed out on the wire without transformation
– Most of our query types will take 1 millisecond or less when this work is complete

Scalability

Each XP and AP can process millions of user transactions per day
– They operate totally independently of the other servers
Adding more XP/AP boxes scales serving capacity linearly
– No bottlenecks, except perhaps the hardware load balancer
– More of those can be added too
Each master can feed hundreds of XPs or sync nodes
– If a master's limit is reached, masters can cascade
– This means there is effectively no limit to the number of servers that can be supported

Servers can be located anywhere
– Less than a T1 line is required to keep a server in sync
– The bandwidth required is on the order of several hundred megabytes per day
– Individual sync nodes can keep remote server groups fully in sync

Availability

The service can survive the failure of more than half of the XPs, three quarters of the APs, all of the masters/sync nodes, and all of the OCI boxes
The loss of an entire colo facility can also be survived
– Both of these have happened!
– Users don't feel a thing
Downstream servers detect the failure of upstream servers and take them out of rotation
– All upstream servers are normally active
– Dead or failing ones are dropped and retested until they recover
The Cisco load balancer detects failed APs and takes them out of service in a similar fashion

Service Holiday Load

[Chart: service query load over the 12/2005-1/2006 holiday season]

Historical Query Load

[Chart: query load over the history of the service]

Metrics

All queries generate a stat packet containing
– The query type
– The query result (success, failure, error)
– The user's inputs (TOC, waveform, etc.)
– The response to the user (album ID given, etc.)
– Elapsed time
– Session information
Stat packets flow upstream to the master repository for tallying and aggregation
– Raw packets are deleted without backup
Server and query performance can be monitored
– Flaky servers can be spotted this way
– True query performance can be determined
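A stat packet along the lines described might look like the struct below. The field names and sizes are assumptions for illustration; the real wire format is not public.

```c
#include <stdint.h>

/* Hypothetical layout of a per-query stat packet.  One of these flows
   upstream to the master repository for every query served. */
struct stat_packet {
    uint32_t query_type;      /* TOC lookup, waveform, text search, ... */
    uint32_t result;          /* success / failure / error code */
    uint32_t elapsed_ms;      /* time spent answering the query */
    char     session[32];     /* opaque session information */
    char     user_input[64];  /* e.g. the submitted TOC or waveform ID */
    char     response_id[32]; /* e.g. the album ID handed back */
};
```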
Protocols

Our protocols are all HTTP-based, except the oldest form of CDDB1
– Newer protocols use HTTP MIME-like headers (HTTP headers technically are not MIME) for encapsulation details like the encryption and compression format
– Older protocols that support encryption and compression, like CDDB2 and eCDDB, use headers only for size and type information; encryption and compression are rolled into the protocol itself
XML RPC is our only "standard" protocol
– Used for true B2B interaction with the service
– All the others are custom protocols for specific needs, or legacy
Most of them are XML-based protocols
– Only CDDB1/2 are not

Implementation Details

XML

XML is used for performance
– Because it is denormalized, only one DB access is required to fetch all fields of a data object
– CPU load is thus traded for disk I/O load
– A fair trade, but care must be taken to handle the XML efficiently
We have implemented two custom XML parsers for CPU efficiency
– Watersprite
  • An XML "compiler" that generates C data structures for known XML objects
  • Slow to parse, but very fast for accessing fields of the data structure
  • Simplifies the C code, because XML is accessed through field dereferencing
– xmlstack (see the sketch below)
  • An event-driven parser
  • Implements a directed parsing algorithm
  • Ignores uninteresting parts of a document
  • Can scan a 1MB XML document with 50,000 nodes in under 5 milliseconds on an AMD Opteron

Security

Client / server communication
– All newer protocols support mandatory encryption
– Protects user privacy and trade secrets
– Algorithms were chosen for speed and tiny implementation size
– Diffie-Hellman key exchange is used for trading session keys
  • Theoretically not as strong as RSA or ECC, but still unbroken
  • Very fast
– TEA 128-bit encryption in OFB chaining mode is used for data encryption (see the sketch below)
  • Fast and tiny

ID validation
– Gracenote employs various IDs for identifying albums, tracks, DVDs, clients, users, etc.
– All IDs are cryptographically tagged
  • They cannot be faked
  • No database of issued IDs is required to validate them

Database Architecture

The core of our database architecture is Berkeley DB
– A fast and small C API key/data pair database
We have created an XML wrapper around BDB
– Indices can be created from specific fields in the XML
Future directions
– BDB is nice, but too slow for us
– New database technology is being developed for our purposes
  • A hybrid memory-mapped / in-core DB
    – Ephemeral indices
    – No transaction log, but still transactional
    – Compressed data
    – Plugin architecture for transformation/indexing modules

Caching

We have created a custom LRU caching module on top of BDB (a storage sketch follows below)
– Any XML object can be cached
– Raw objects can be cached too, though not totally transparently
Caching characteristics can vary by object type
– The amount of cache devoted to an object type
– The likelihood of an object being needed soon (popularity) is used to determine whether it should be cached
The cache can be preloaded from its backing table at startup
Cache images can be saved at shutdown for fast restart
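As an illustration of directed, event-driven parsing in the xmlstack style, here is a minimal sketch built on Expat (the library at the core of our parser). It skips everything in the document except a hypothetical TITLE element; the element name and the fixed-size buffer are assumptions, and the real xmlstack is certainly more general.

```c
#include <stdio.h>
#include <string.h>
#include <expat.h>

/* Directed parsing: only text inside <TITLE> is captured; all other
   elements and character data are ignored at near-zero cost. */
struct state {
    int    in_title;
    char   title[256];
    size_t len;
};

static void XMLCALL on_start(void *ud, const XML_Char *name,
                             const XML_Char **atts)
{
    struct state *st = ud;
    (void)atts;
    if (strcmp(name, "TITLE") == 0)
        st->in_title = 1;
}

static void XMLCALL on_end(void *ud, const XML_Char *name)
{
    struct state *st = ud;
    if (strcmp(name, "TITLE") == 0)
        st->in_title = 0;
}

static void XMLCALL on_text(void *ud, const XML_Char *s, int len)
{
    struct state *st = ud;
    if (!st->in_title)
        return;                         /* uninteresting part: skip */
    size_t room = sizeof st->title - 1 - st->len;
    size_t n = (size_t)len < room ? (size_t)len : room;
    memcpy(st->title + st->len, s, n);
    st->len += n;
    st->title[st->len] = '\0';
}

int parse_title(const char *doc, size_t doclen, char *out, size_t outlen)
{
    struct state st = { 0, "", 0 };
    XML_Parser p = XML_ParserCreate(NULL);

    XML_SetUserData(p, &st);
    XML_SetElementHandler(p, on_start, on_end);
    XML_SetCharacterDataHandler(p, on_text);

    int ok = XML_Parse(p, doc, (int)doclen, 1) == XML_STATUS_OK;
    XML_ParserFree(p);
    if (ok)
        snprintf(out, outlen, "%s", st.title);
    return ok;
}
```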
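TEA itself is small enough to show in full. Below is the standard 32-round TEA block cipher driven in OFB mode, where the cipher only ever encrypts a rolling IV to produce a keystream, so one routine serves for both encryption and decryption. The key exchange, IV transport and message framing used by the real protocols are omitted.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One 64-bit TEA block encryption with a 128-bit key (standard 32 rounds). */
static void tea_encrypt_block(uint32_t v[2], const uint32_t k[4])
{
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    const uint32_t delta = 0x9E3779B9u;

    for (int i = 0; i < 32; i++) {
        sum += delta;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* OFB mode: repeatedly encrypt the rolling IV to make a keystream and
   XOR it into the buffer.  Running the same call again decrypts. */
void tea_ofb_crypt(uint8_t *buf, size_t len,
                   const uint32_t key[4], const uint32_t iv[2])
{
    uint32_t stream[2] = { iv[0], iv[1] };
    size_t off = 0;

    while (off < len) {
        uint8_t ks[8];
        tea_encrypt_block(stream, key);  /* next 8 keystream bytes */
        memcpy(ks, stream, 8);
        for (size_t i = 0; i < 8 && off < len; i++, off++)
            buf[off] ^= ks[i];
    }
}
```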
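Tying the database and caching pieces together: a hedged sketch of storing a zlib-compressed XML object in Berkeley DB under an ID key. The database name, key format and error handling are illustrative only, not the actual XML wrapper's interface.

```c
#include <string.h>
#include <stdlib.h>
#include <db.h>      /* Berkeley DB C API */
#include <zlib.h>

/* Compress an XML object and store it under its ID.  Open the DB once at
   startup, e.g.:
     DB *dbp; db_create(&dbp, NULL, 0);
     dbp->open(dbp, NULL, "albums.db", NULL, DB_BTREE, DB_CREATE, 0644);  */
int store_xml(DB *dbp, const char *id, const char *xml)
{
    uLong  srclen = (uLong)strlen(xml) + 1;      /* keep the NUL */
    uLongf dstlen = compressBound(srclen);
    Bytef *buf = malloc(dstlen);
    DBT key, data;
    int ret;

    if (buf == NULL ||
        compress2(buf, &dstlen, (const Bytef *)xml, srclen,
                  6 /* medium compression level */) != Z_OK) {
        free(buf);
        return -1;
    }

    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data  = (void *)id;
    key.size  = (u_int32_t)strlen(id);
    data.data = buf;
    data.size = (u_int32_t)dstlen;

    ret = dbp->put(dbp, NULL, &key, &data, 0);   /* 0 on success */
    free(buf);
    return ret;
}
```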
Cached objects can optionally be compressed
– 5-20x compression is typical (XML is very compressible)
– No special compression is used - just standard zip-style (zlib) compression at a medium level, with Huffman encoding
– Specialized XML compressors gain only about 5% more compression for a much higher CPU cost

Heap Manager - Concur

Due to scaling issues with standard malloc, we have implemented a custom replacement - Concur (aka ccmalloc)
– Massively scalable
  • Designed to allow concurrent threaded access to the heap with no loss of concurrency (take that, Amdahl's law!)
  • Implements N heaps, generally 4P (four per processor). In theory one thread per heap, but in practice this isn't always true
    – One lock per size class to overcome this
  • Hundreds of size classes
– Based loosely on the Hoard heap manager
  • http://www.hoard.org
  • Improved scalability over Hoard, which has one lock per heap
  • Ours is portable
– Low memory-waste overhead
  • Less than 1% fixed memory waste
  • About 10% average waste, much less than most other allocators
– Virtually no fragmentation
  • This was a huge problem for us, because we have many slowly growing tables that can cause virtual memory holes
– Supports raiding of foreign heaps when physical memory is low
– Optimized for large memory utilization - 20-30 MB startup overhead
– Fast
  • 3-5x faster than GNU libc malloc, more when multithreaded
  • The same speed as Solaris 10 malloc with one thread, but scales better

Infrastructure

More than one colo facility
– Currently two, soon to be three
– Enough servers at each site for N+1 redundancy
  • We can lose any one colo during peak times without loss of service
Telco-quality infrastructure at the colos
Cisco CSS load balancers
– Balance user traffic across colos transparently
– Balance traffic across the APs within a colo

Enterprise-class servers
– Sun v40z: 4x AMD Opteron, 8x 72GB SCSI RAID 10, 32GB RAM
– Rackable: 2x AMD Opteron, 8x 72GB SCSI RAID 10, 16GB RAM
Service monitoring at several levels
– User-level external query testing
– Internal self-monitoring across servers
– Server log monitoring
– 24x7 staff to handle service events

Problems and Weaknesses

This is all wonderful, but not perfect
Some problems and weaknesses exist
– The addition of unanticipated query types may require data remodeling
  • Data objects are designed for specific, predicted use cases
  • If needs arise that conflict with the design, it's back to the drawing board
– XML handling has a high CPU cost
  • Though we are very efficient at XML processing, it is still the #1 CPU cost in the service
  • Still far faster than the equivalent disk access
– User-induced crashes can be devastating
  • Sometimes a user query can bring down a server
  • When the user retries, it brings down another... and so on
– Unanticipated database growth well past the size of RAM would require a different data model and significant redesign

Questions?