Architecture of the Gracenote Service
Steve Scherf
February 22, 2006
Introduction

Gracenote, formerly known as CDDB, is the industry leader in media recognition technology and services
Started in 1996 as a hobby CD recognition service
Since that time, the Gracenote service has evolved to become a truly flexible, reliable, high-performance service for numerous digital media recognition products
There have been three major architecture redesigns across its history. Third time's the charm.
This will be a short journey through the workings of the Gracenote service as it exists now
Please ask questions at any time
Introduction

Gracenote offers a wide suite of Internet-based media recognition products
– CD recognition
– DVD recognition
– Audio file recognition
• Waveform analysis
• Music metadata search engine
Other related products
– Playlist generation
– Recommendation engine (alpha)
– Third party “link” data delivery
– Phonetic metadata delivery
– Content delivery (cover art, artist bios, reviews)
All of these products are built on top of a single software architecture
Introduction

Gracenote also offers a number of embedded media recognition solutions
– Waveform audiofile recognition
– CD recognition
– Soon to come: text metadata search
These typically run inside home audio components, as well as car stereos and navigation units
These products generally operate standalone, and contain scaled-down versions of the same technology used in the Gracenote online service
Introduction

Gracenote has built a database of album metadata for over 4 million CDs, comprising information such as:
– Artist / album / track names
– Album credits (who played what instrument on which tracks, where it was recorded, etc.)
– Genre information
– Other extended information, such as artist place of origin and era (e.g., ’60s, ’70s)
– Associated media recognition data, such as the all-important CD TOC (table of contents) and track audio waveform fingerprints
Similar data has been gathered for a large set of the most popular DVDs
We also have information linking our database to those of third parties, such as Muze, Loudeye, Yahoo, Apple and AMG
Introduction

All of this data is used to power Gracenote online products
How is this large quantity of data replicated, stored, indexed and served to millions of users every day with a minimum of servers and no downtime?
Let’s start at the beginning…
In The Beginning
The Birthplace of CDDB
In The Beginning

Gracenote had a humble beginning in 1996 as a hobby, which soon grew into a full-time job (with no pay!)
Then known as CDDB (aka the CD Database), now affectionately known as CDDB1
Open source server code, which had to be portable and easy to run
The original requirements placed on the server demanded that it interoperate with Xmcd (a Unix CD player app) and share its database format
This (mis)informed much of the design of the original service
CDDB1

Key attributes of the service
– Built to be run by anyone with a system on the Internet
– Portability to any Unix-like platform
– Remotely maintainable
– Totally automated
– Stable (>2 year uptime on a single server!)
– Simple, open protocol
– User-contributed database
– Worldwide distribution of servers allowed 100% service uptime
CDDB1

Problems
– Bad DB structure (i.e., no real database - it used the filesystem)
– Faulty TOC identifier - the disc ID (see the sketch below)
• Up to a 10% rate of collision!
– No client / user validation - fine for low volume only
– Limited data format
– Difficult for developers to implement the protocol
– Not good for anything except CD recognition
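For context, the classic CDDB1 disc ID is publicly documented: it packs an 8-bit digit-sum checksum of the track start times, the playing length, and the track count into 32 bits. The tiny checksum is what drove the collision rate. A minimal C sketch:

```c
/* Classic CDDB1 disc ID: an 8-bit digit-sum checksum of track start
 * times (in seconds), the disc length, and the track count, packed
 * into 32 bits. The 8-bit checksum is why collisions ran so high. */
static int sum_of_digits(int n)
{
    int sum = 0;
    while (n > 0) {
        sum += n % 10;
        n /= 10;
    }
    return sum;
}

unsigned long cddb1_discid(int ntracks, const int offsets[], int leadout)
{
    /* offsets[] and leadout are absolute positions in CD frames
     * (75 frames per second), including the standard 2-second lead-in. */
    int i, checksum = 0;
    int disc_len = leadout / 75 - offsets[0] / 75;

    for (i = 0; i < ntracks; i++)
        checksum += sum_of_digits(offsets[i] / 75);

    return ((unsigned long)(checksum % 0xff) << 24) |
           ((unsigned long)disc_len << 8)           |
           (unsigned long)ntracks;
}
```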

Solving these problems and extending to support other services required a total redesign

So we embarked upon building CDDB2
CDDB2
“A lot of times when you first start out on a project you think, This is never going to be finished. But then it is, and you think, Wow, it wasn't even worth it.”
- Jack Handey
CDDB2

Problems solved
– User identification / validation
– Client identification / control
– Album ID space issues
• No more disc ID
• Used a good hash over the full TOC instead (see the sketch below)
– SDK for developers
• No more protocol implementation issues
• Complex interaction with apps now possible
– Encrypted transactions to protect data and users
– Compression to reduce network latency
– Capability beyond TOC lookup
• Easy to add new functionality
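Gracenote's actual CDDB2 hash isn't public, but the key idea - feed every TOC offset through a real hash function so the whole table of contents contributes to the ID - can be sketched. The FNV-1a hash below is purely illustrative, not the real algorithm:

```c
/* Illustrative only: hash every frame offset in the TOC, so unlike
 * the CDDB1 disc ID no information is thrown away. FNV-1a stands in
 * for whatever hash CDDB2 actually uses. */
#include <stdint.h>

uint64_t toc_hash(int ntracks, const int offsets[], int leadout)
{
    uint64_t h = 14695981039346656037ULL;     /* FNV-1a offset basis */
    int i, b;

    for (i = 0; i <= ntracks; i++) {
        /* the lead-out offset is hashed as the final value */
        uint32_t v = (uint32_t)(i < ntracks ? offsets[i] : leadout);

        for (b = 0; b < 4; b++) {
            h ^= (v >> (8 * b)) & 0xff;       /* mix in one byte */
            h *= 1099511628211ULL;            /* FNV-1a prime */
        }
    }
    return h;
}
```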
CDDB2

Problems introduced
– Inflexible precalculated data
– Difficult or impossible to add new protocols
– Unscalable
• New front-line servers placed a high additional load on the master Oracle server
• Could not be expanded outside of a single colocation facility
– Slow
– Expensive
• New Oracle license required for each new server
• Sun hardware expensive - bad cost/performance
– High maintenance
– Mediocre implementation
• Calculations done with Oracle
• Single points of failure
The Gracenote Service Redesigned
Version 3
Service Version 3

Key attributes
– Written in plain C
• Best performance
• Best control over behavior
• Greater transparency
– Extremely portable to any Unix-like platform
• Thoroughly tested on Mac OS X, Linux, Solaris, FreeBSD
• Runs on any ILP32 or LP64 system
• Tested on SPARC, Intel, AMD and PowerPC processors
Service Version 3

Key attributes
– Near 100% original code
• The asymptotic goal is 100%
• Key custom components:
– Server core operating environment (ds_core)
– XML parser
– XML object database interface
– Database cache
– Heap manager (i.e. malloc replacement)
– Web server module
Service Version 3

No closed source software
– This keeps the operation of our own service transparent
– Oracle engineered out of the service completely

Effective 100% uptime
– Massive server redundancy
– Multiple colocation sites
– Automatic server detection and failover
– Intra-server load balancing and routing
Service Version 3

Key attributes
– Modularity
• All server types run the same code
– Simple configuration changes decide what type of server a box should be
– New features or protocols are added by creating a plugin module
– Geographically distributable
• Any number of separate server sites can be supported
• Each server has an identical copy of the database
– Self-replicating
• Servers are kept in sync in near-realtime
• Only a thin pipe required to keep all sites in sync
Service Version 3

Key attributes
– High performance
• Designed for performance
• Data is denormalized (XML) to remove the need for database joins in realtime
• Much of the database and indices are kept in RAM
– Transformable data
• XML format allows data to be easily transformed to any protocol we wish to support
– Client / server interaction
• A smart client SDK allows much of the work to be offloaded to the user’s computer
• The client decides what it needs and the service provides it
• The client synthesizes results from what it’s provided
Service Version 3

Key attributes
– Uses POSIX threads
• Threads are used instead of multiple processes
• A single process per server removes the need for IPCs
• Affords excellent scalability
– Designed to function nominally on any single system
• The entire service will even run on a laptop computer
• Bigger hardware simply translates to faster operation
– More RAM means more data and indexes in-core
– More CPUs allow better concurrent execution
• All data fits on a single server
– All transaction servers have an entire copy of the database
– Any server can take on any role required at any time
Server Architecture
Server Architecture

Operating environment - ds_core
– ds_core is the foundation of the Gracenote service
– Handles all the work of “daemonizing”
• Forking / disassociating from the terminal / signal handling
– Abstracts and extends POSIX pthreads
• Hides much of the uglier parts of POSIX threads
• Testable joins
• Shared locks
• Enforces lock hierarchy (see the sketch below)
• Thread tracking and auto-cleanup
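As a flavor of what such an abstraction can do, here is a minimal sketch of lock-hierarchy enforcement; the ds_lock_t type and function names are invented for illustration, not the real ds_core API:

```c
/* Hypothetical lock-hierarchy enforcement in the spirit of ds_core:
 * every lock has a level, and a thread may only acquire locks of
 * increasing level, which rules out lock-order deadlocks. */
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    int             level;    /* position in the global hierarchy */
} ds_lock_t;

/* Highest lock level currently held by this thread (gcc/Sun cc TLS). */
static __thread int held_level = 0;

void ds_lock_init(ds_lock_t *l, int level)
{
    pthread_mutex_init(&l->mutex, NULL);
    l->level = level;
}

/* Returns the previous level; pass it back to ds_lock_release(). */
int ds_lock_acquire(ds_lock_t *l)
{
    int prev = held_level;

    /* Out-of-order acquisition aborts in testing instead of
     * deadlocking mysteriously in production. */
    assert(l->level > prev);
    pthread_mutex_lock(&l->mutex);
    held_level = l->level;
    return prev;
}

void ds_lock_release(ds_lock_t *l, int prev)
{
    held_level = prev;
    pthread_mutex_unlock(&l->mutex);
}
```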
Server Architecture

Operating environment - ds_core
– Provides message logging facilities
– Provides the plug-in loading and execution environment (see the sketch below)
• Manages plug-in state
• Provides facility for intra-module API calls
– Manually-loaded modules in Unix do not have visibility into other modules’ symbol space
– ds_core provides symbol resolution and API-call protection against unloading
– Memory allocation tracking
• Logging of allocations/frees
• Bounds checking upon free
• Leak containment
– Monitoring and control interface
• Simple text-based network control interface
• Allows interaction with humans via telnet, as well as monitoring software
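A bare-bones sketch of the kind of dlopen-based plug-in loading described above; the module_api struct and the "module_entry" symbol name are assumptions for illustration:

```c
/* Bare-bones plug-in loading: each module exports one known entry
 * point returning its API table, so cross-module calls go through the
 * table instead of relying on global symbol visibility. */
#include <dlfcn.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int  (*init)(void);
    void (*shutdown)(void);
} module_api;

module_api *load_module(const char *path)
{
    module_api *(*entry)(void);

    /* RTLD_LOCAL keeps this module's symbols private, matching the
     * isolation described above. */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return NULL;
    }

    /* dlsym finds the single agreed-upon entry symbol. */
    entry = (module_api *(*)(void))dlsym(handle, "module_entry");
    if (entry == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        dlclose(handle);
        return NULL;
    }
    return entry();
}
```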
Server Architecture

Threading model
– POSIX threads are utilized
• ds_core abstracts the API, so another thread library could be used without affecting plugin modules
– Threads are spawned on-demand
• OSes spawn new threads at least as quickly as pooled threads can be awakened and repurposed, sometimes faster
• Simpler than pooling
– One thread, one query model (see the sketch below)
• Each user query is handled by a single thread
• Thread dies when query is complete
• Threads traverse vertically through software layers during queries
– Modules do not own user query threads; they simply take API calls from the top level where threads are created
• Modules may create maintenance threads only
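A minimal sketch of that one-thread-one-query dispatch; the query struct and handler body are placeholders:

```c
/* One thread, one query: each accepted query gets its own detached
 * thread, which simply dies when the query completes. No pool, no IPC. */
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    int client_fd;             /* placeholder for real query state */
} query;

static void *query_thread(void *arg)
{
    query *q = arg;

    /* ... parse the request, call down through plugin modules
     *     via API calls, send the reply ... */

    free(q);
    return NULL;               /* thread dies: auto-cleanup */
}

void dispatch_query(int client_fd)
{
    pthread_t tid;
    query *q = malloc(sizeof *q);

    if (q == NULL)
        return;
    q->client_fd = client_fd;

    if (pthread_create(&tid, NULL, query_thread, q) == 0)
        pthread_detach(tid);   /* no join needed */
    else
        free(q);
}
```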
Server Hierarchy

[Diagram: the upstream Oracle master feeds an OCI box, which feeds the master servers. Masters replicate to XPs and to sync nodes serving remote sites, which feed their own XPs and APs. APs front the XPs, and a Cisco load balancer connects the APs to the Internet.]
Server Architecture

Server types
– XP - transaction processor
• This is where all the work is done to answer a user request
• Performs media recognition tasks
– TOC (CD) lookup
– Audiofile waveform recognition
– Text search
• Performs other types of user queries as well
• Synchronizes data from upstream source (master)
• Generates lookup statistics
• Has an entire copy of the database
• Servers can “specialize” in one query type, or serve them all
– Configuration options control this
XP Module Layout

[Diagram: an XP stacks its plugin modules - COMM (network), DBS XML, DBS, SYNC, CD LOOKUP, DVD LOOKUP, WAVEFORM, TEXT SEARCH, CACHE, DISPATCH and MASTERSHIP - on top of ds_core, sitting between the network and a Berkeley DB database.]
Server Architecture

Server types
– AP - application processor
• These servers translate user queries to XP calls and back
– They contain no database
• Protocol plugins allow the AP to speak any client protocol desired
– The list currently includes five protocols, among them CDDB1, CDDB2 and XML RPC
• Plugin modules for network socket handling and HTTP requests allow the AP to look like a web server
– Can serve up documents if desired
– Allows protocols that are encapsulated in HTTP to be routed to the correct protocol module
• APs are capable of load balancing across XPs (see the sketch below)
– They know what query types each XP supports
– Queries are spread across XPs
– Failed XPs are detected automatically and taken out of rotation
– The combination of AP and XP servers forms the basis of the Gracenote service as users perceive it
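A toy sketch of that round-robin selection with health awareness; the xp_server structure is invented for illustration:

```c
/* Hypothetical AP-side XP selection: round-robin across the XPs that
 * support a given query type, skipping ones marked failed by health
 * checks. (The benign race on the cursor would be an atomic in real
 * code.) */
#include <stddef.h>

typedef struct {
    const char *host;
    unsigned    query_types;   /* bitmask of supported query types */
    int         alive;         /* cleared when health checks fail  */
} xp_server;

const xp_server *pick_xp(const xp_server *xps, size_t n, unsigned qtype)
{
    static size_t cursor = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        const xp_server *xp = &xps[(cursor + i) % n];
        if (xp->alive && (xp->query_types & qtype)) {
            cursor = (cursor + i + 1) % n;
            return xp;
        }
    }
    return NULL;   /* no healthy XP serves this query type */
}
```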
Server Architecture

Server types
– Master
• Responsible for fetching new and updated data from the upstream Oracle master repository
• Distributes data to downstream XPs and sync nodes
– Sync node
• Just like a master, but its upstream parent is another master or sync node rather than an Oracle instance
• These are used for replicating data at geographically remote locations
– OCI Box
• Serves simply as a conduit between Oracle and the master servers
Server Architecture

Server types
– The distinction between server types is somewhat arbitrary
• A single server could be configured to load all the necessary modules to perform as two or more types of server at once
• In practice, this is not desirable, since specialization increases performance and flexibility of control
Server Architecture

Data
– All data is in XML format
– All textual data encapsulated in XML is UTF-8
– Data is segmented into several distinct tables to allow XPs to specialize
• All XPs currently have the same data
• This may change if data and product growth outstrip hardware improvement (a race with Moore’s Law)
• Replication can be directed to allow only certain types of data to reside on an XP
– E.g., servers performing TOC recognition only need album data and TOC tables, but not waveform, DVD, etc.
Feature Details
Development Environment

Third party libraries
– ICU
• Character set translation
• Currently used only for translating full-width tilde characters in Shift-JIS
– LibTomCrypt
• Cryptographic algorithms that form the basis of our secure protocols
– Berkeley DB
• C API database engine
– Expat
• The core of our XML parser
– Zlib
• Compression library
Development Environment

Compiler
– gcc / Sun Workshop on Solaris
– gcc on all other platforms
Build environment
– GNU autotools
Debug environment
– dbx debugger
• gdb is hopelessly broken for threaded apps
– Rational Purify / Quantify
• Buggy and slow, but invaluable
Performance

The average query takes only a handful of milliseconds
– The user generally doesn’t perceive the time it takes to get an answer - the network transit time is far longer
Disk I/O is the largest cost in a user query
– You can’t completely remove this as a factor without storing everything in RAM
– We have plans to do just that
Our biggest CPU cost in a query is XML parsing
– Our custom parser is fast, but when you do something this often, even one instruction less makes a difference
Performance

Future directions
– Store all data in RAM
• Special database code required
– Separate indexable versus servable data
• Different compression characteristics
• XML parse cost reduction
– Preparse XML
• Some data is already servable, and can be shoved out on the wire without transformation
– Most of our query types will take 1 millisecond or less when this work is complete
Scalability

Each XP and AP can process millions of user transactions per day
– They operate totally independently of other servers
Adding more XP/AP boxes scales serving capacity linearly
– No bottlenecks, except perhaps the hardware load balancer
– More of those can be added too
Each master can feed hundreds of XPs or sync nodes
– If a master’s limit is reached, masters can cascade
– This means there is effectively no limit to the number of servers that can be supported
Scalability

Servers can be located anywhere
– Less than a T1 line is required to keep a server in sync
– The bandwidth required is on the order of several hundred megabytes per day
– Individual sync nodes can keep remote server groups fully in sync
Availability

The service can survive the failure of more than half of the XPs, 3/4 of the APs, all of the masters/sync nodes and all OCI boxes
The loss of one entire colo facility can be survived
– Both of these have happened!
– Users don’t feel a thing
Downstream servers detect the failure of upstream servers and take them out of rotation
– All upstream servers are normally active
– Dead or failing ones are dropped and retested until recovered
The Cisco load balancer detects failed APs and takes them out of service in a similar fashion
Service Holiday Load 12/2005-1/2006

[Chart: service query load over the 2005-2006 holiday season.]

Historical Query Load

[Chart: query load over the full history of the service.]
Metrics

All queries generate a stat packet containing:
– The query type
– Query result (success, failure, error)
– User inputs (TOC, waveform, etc.)
– Response to user (album ID given, etc.)
– Elapsed time
– Session information
Stat packets flow upstream to the master repository for tallying and aggregation
– Raw packets are deleted without backup
Server and query performance can be monitored
– Flaky servers can be spotted this way
– True query performance can be determined
Protocols

Our protocols are all HTTP-based, except the oldest form of CDDB1
– Newer protocols utilize HTTP MIME-like headers (HTTP headers technically are not MIME) for encapsulation details like encryption and compression format
– Older protocols that support encryption and compression, like CDDB2 and eCDDB, use headers only for size and type info. Encryption and compression are rolled into the protocol itself.
XML RPC is our only “standard” protocol
– Used for true B2B interaction with the service
– All others are custom protocols for specific needs, or legacy
Most of them are XML-based protocols
– Only CDDB1/2 are not
Implementation Details
XML

XML is used for performance
– Because it is denormalized, only one DB access is required to fetch all fields of a data object
– CPU load is thus traded for disk I/O load
– A fair trade, but care must be taken to handle XML efficiently
We have implemented two custom XML parsers for CPU efficiency
– Watersprite (see the sketch below)
• An XML “compiler” that generates C data structures for known XML objects
• Slow to parse, but very fast to access fields of the data structure
• Simplifies C code because XML is accessed through field dereferencing
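Watersprite itself is not public, but an XML compiler in this style might emit something like the following for a known <album> object; the struct layout and names are invented for illustration:

```c
/* Hypothetical output of a Watersprite-style XML "compiler" for a
 * known <album> object: one up-front parse fills a typed struct, and
 * all later access is plain field dereferencing, with no tree walking. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char  *title;        /* <title> text */
    const char  *artist;       /* <artist> text */
    int          ntracks;      /* number of <track> children */
    const char **track_names;  /* <track> text, in document order */
} album_t;

/* Generated parser entry point (body omitted in this sketch). */
int album_from_xml(const char *xml, size_t len, album_t *out)
{
    (void)xml; (void)len; (void)out;
    return -1;   /* the real generated code would populate *out */
}

/* Server code then reads fields directly: */
void print_first_track(const album_t *a)
{
    if (a->ntracks > 0)
        printf("%s - %s\n", a->artist, a->track_names[0]);
}
```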
XML

Custom parsers
– xmlstack
• An event-driven parser
• Implements a directed parsing algorithm
• Ignores uninteresting parts of a document
• Can scan a 1MB XML document with 50,000 nodes in under 5 milliseconds on an AMD Opteron
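Since Expat was named earlier as the core of the XML parser, a directed, event-driven scan that skips uninteresting subtrees might look roughly like this; the <album> target element is an invented example:

```c
/* Event-driven, directed parsing sketch with Expat: handlers fire for
 * every element, but real work happens only inside the subtree we
 * care about; everything else is skipped almost for free. */
#include <expat.h>
#include <stddef.h>
#include <string.h>

struct ctx {
    int in_target;   /* nonzero while inside an <album> subtree */
};

static void XMLCALL on_start(void *ud, const XML_Char *name,
                             const XML_Char **atts)
{
    struct ctx *c = ud;
    (void)atts;

    if (strcmp(name, "album") == 0)
        c->in_target = 1;
    else if (!c->in_target)
        return;              /* uninteresting branch: no work at all */

    /* ... record fields of interest here ... */
}

static void XMLCALL on_end(void *ud, const XML_Char *name)
{
    struct ctx *c = ud;

    if (strcmp(name, "album") == 0)
        c->in_target = 0;
}

int scan(const char *buf, size_t len)
{
    struct ctx c = { 0 };
    XML_Parser p = XML_ParserCreate(NULL);
    int ok;

    XML_SetUserData(p, &c);
    XML_SetElementHandler(p, on_start, on_end);
    ok = XML_Parse(p, buf, (int)len, 1) == XML_STATUS_OK;
    XML_ParserFree(p);
    return ok;
}
```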
Security

Client / server communication
– All newer protocols support mandatory encryption
– Protects user privacy and trade secrets
– Algorithms chosen for speed and tiny implementation size
– Diffie-Hellman key exchange is used for trading session keys
• Theoretically not as strong as RSA or ECC, but still unbroken
• Very fast
– TEA 128-bit encryption in OFB chaining mode is used for data encryption (see the sketch below)
• Fast and tiny
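TEA is public domain and small enough to show whole; below is the standard block function plus a minimal OFB keystream loop. The framing is illustrative - it shows the algorithm named on the slide, not Gracenote's wire format:

```c
/* TEA block encryption (public domain, 128-bit key) plus a minimal
 * OFB mode: encrypt the IV repeatedly to produce a keystream and XOR
 * it with the data. In OFB, the same routine encrypts and decrypts. */
#include <stdint.h>
#include <stddef.h>

static void tea_encrypt(uint32_t v[2], const uint32_t k[4])
{
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    const uint32_t delta = 0x9e3779b9;
    int i;

    for (i = 0; i < 32; i++) {
        sum += delta;
        v0  += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1  += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

void tea_ofb(uint8_t *data, size_t len,
             const uint32_t key[4], uint32_t iv[2])
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (i % 8 == 0)
            tea_encrypt(iv, key);        /* next 8 keystream bytes */
        /* byte order of the keystream is host-dependent here; both
         * ends of a real protocol would fix it explicitly */
        data[i] ^= ((uint8_t *)iv)[i % 8];
    }
}
```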
Security

ID Validation
– Gracenote employs various IDs for identifying albums, tracks, DVDs, clients, users, etc.
– All IDs are cryptographically tagged (see the sketch below)
• Cannot be faked
• No database of issued IDs is required to validate them
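One standard way to get both properties is to append a MAC computed with a secret server-side key: a validator just recomputes the tag, so no database of issued IDs is needed. The CBC-MAC-over-TEA construction below is an assumption for illustration, not Gracenote's actual scheme:

```c
/* Hypothetical ID tagging: a CBC-MAC over the TEA block cipher from
 * the previous sketch (tea_encrypt made non-static here). Forged IDs
 * fail the recomputation check; no ID database is consulted. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

void tea_encrypt(uint32_t v[2], const uint32_t k[4]);   /* see above */

uint64_t id_tag(const uint8_t *body, size_t len, const uint32_t key[4])
{
    uint32_t mac[2] = { 0, 0 };
    size_t   i;

    for (i = 0; i < len; i += 8) {
        uint8_t  block[8] = { 0 };
        uint32_t w[2];

        memcpy(block, body + i, len - i < 8 ? len - i : 8);
        memcpy(w, block, 8);
        mac[0] ^= w[0];              /* XOR in the next block */
        mac[1] ^= w[1];
        tea_encrypt(mac, key);       /* chain it through the cipher */
    }
    return ((uint64_t)mac[0] << 32) | mac[1];
}

int id_valid(const uint8_t *body, size_t len, uint64_t tag,
             const uint32_t key[4])
{
    return id_tag(body, len, key) == tag;
}
```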
Database Architecture

The core of our database architecture is Berkeley DB
– A fast and small C API key/data pair database
We have created an XML wrapper around BDB (see the sketch below)
– Indices can be created from specific fields in the XML
Future directions
– BDB is nice, but too slow for us
– New database technology is being developed for our purposes
• Hybrid memory-mapped / in-core DB
– Ephemeral indices
– No transaction log, but still transactional
– Compressed data
– Plugin architecture for transformation/indexing modules
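For readers who haven't used the Berkeley DB C API, storing a whole XML object under its ID looks roughly like this (BDB 4.x-era calls, matching the 2006 timeframe; the index-extraction layer of the real wrapper is omitted):

```c
/* Minimal Berkeley DB usage: open a btree table and store an XML
 * object under its ID key. The XML wrapper and field-based indices
 * described on the slide would sit on top of calls like these. */
#include <db.h>
#include <string.h>

DB *open_table(const char *file)
{
    DB *dbp;

    if (db_create(&dbp, NULL, 0) != 0)
        return NULL;
    if (dbp->open(dbp, NULL, file, NULL, DB_BTREE, DB_CREATE, 0644) != 0) {
        dbp->close(dbp, 0);
        return NULL;
    }
    return dbp;
}

int store_xml(DB *dbp, const char *id, const char *xml)
{
    DBT key, data;

    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data  = (void *)id;
    key.size  = (u_int32_t)strlen(id);
    data.data = (void *)xml;
    data.size = (u_int32_t)(strlen(xml) + 1);

    return dbp->put(dbp, NULL, &key, &data, 0);   /* 0 on success */
}
```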
Caching

We have created a custom LRU caching module on top of BDB
– Any XML object can be cached
– Raw objects can be cached, but not totally transparently
Caching characteristics can vary by object type
– Amount of cache devoted to an object type
– The likelihood of an object being needed soon (popularity) is used to determine whether an object should be cached
The cache can be preloaded from its backing table at startup
Cache images can be saved at shutdown for fast restart
Caching

Cached objects can optionally be compressed (see the sketch below)
– 5-20x compression is standard (XML is very compressible)
– No special compression used - just standard zlib deflate with medium compression and Huffman encoding
– Specialized XML compressors only gain about 5% more compression for a much higher CPU cost
Heap Manager - Concur

Due to scaling issues with standard malloc, we have implemented a custom replacement - Concur (aka ccmalloc)
– Massively scalable (see the sketch below)
• Designed to allow concurrent threaded access to the heap with no loss of concurrency (take that, Amdahl’s Law!)
• Implements N heaps (generally 4P). Theoretically one thread per heap, but in practice this isn’t always true
• One lock per size class to overcome this
– Hundreds of size classes
– Based loosely on the Hoard heap manager
• http://www.hoard.org
• Improved scalability over Hoard, which has one lock per heap
• Ours is portable
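A toy sketch of the multi-heap, lock-per-size-class idea (structure invented; real allocators like Concur and Hoard are far more sophisticated):

```c
/* Toy sketch: each (heap, size class) pair has its own free list and
 * lock, so threads touching different classes never contend - unlike
 * a one-lock-per-heap design. */
#include <pthread.h>
#include <stdlib.h>

#define NHEAPS   8     /* "N heaps", scaled to the processor count */
#define NCLASSES 32    /* the real thing has hundreds of classes   */

typedef struct block { struct block *next; } block;

typedef struct {
    pthread_mutex_t lock;       /* one lock per size class */
    block          *free_list;
} size_class;

static size_class heaps[NHEAPS][NCLASSES];

void cc_init(void)
{
    int h, c;

    for (h = 0; h < NHEAPS; h++)
        for (c = 0; c < NCLASSES; c++)
            pthread_mutex_init(&heaps[h][c].lock, NULL);
}

void *cc_alloc(unsigned heap_id, unsigned cls, size_t cls_size)
{
    size_class *sc = &heaps[heap_id % NHEAPS][cls % NCLASSES];
    block *b;

    pthread_mutex_lock(&sc->lock);
    b = sc->free_list;
    if (b != NULL)
        sc->free_list = b->next;
    pthread_mutex_unlock(&sc->lock);

    /* Class list empty: get fresh memory (a real allocator would
     * carve it out of an OS mapping, not call malloc). */
    return b != NULL ? (void *)b : malloc(cls_size);
}
```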
Heap Manager - Concur

Low memory waste overhead
– Less than 1% fixed memory waste
– About 10% average waste, much less than most others
Virtually no fragmentation
– This was a huge problem for us, because we have many slowly growing tables that can cause virtual memory holes
Supports raiding of foreign heaps when physical memory is low
Optimized for large memory utilization - 20-30 MB startup overhead
Fast
– 3-5x faster than GNU libc malloc, more when multithreaded
– Same speed as Solaris 10 malloc with one thread, but scales better
Infrastructure

More than one colo facility
– Currently two, soon to be three
– Enough servers at each site for N+1 redundancy
• We can lose any one colo during peak times without loss of service
Telco-quality infrastructure at colos
Cisco CSS load balancers
– Balance user traffic across colos transparently
– Balance traffic across APs within a colo
Infrastructure

Enterprise-class servers
– Sun v40z: 4x AMD Opteron, 8x 72GB SCSI RAID 10, 32GB RAM
– Rackable: 2x AMD Opteron, 8x 72GB SCSI RAID 10, 16GB RAM
Service monitoring at several levels
– User-level external query testing
– Internal self-monitoring across servers
– Server log monitoring
– 24x7 staff to handle service events
Problems and Weaknesses
Problems and Weaknesses

This is all wonderful, but not perfect
Some problems and weaknesses exist
– Addition of unanticipated query types may require data remodeling
• Data objects are designed for specific, predicted use cases
• If needs arise that conflict with the design, it’s back to the drawing board
– XML handling has a high CPU cost
• Though we are very efficient at XML processing, it is still the #1 CPU cost in the service
• Still far faster than the equivalent disk access
– User-induced crashes can be devastating
• Sometimes user queries can bring down a server
• When the user retries, it brings down another... and so on
– Unanticipated database growth well past the size of RAM would require a different data model and significant redesign
Questions?