
Coda Server Internals
Peter J Braam

Contents
- Data structure overview
- Volumes
- Vnodes
- Inodes

Data Structure Overview
Object | Purpose | Resides where
Inodes | File contents | /vicep* partitions
Volumes, Vnodes, Directory cnts, ACL, Reslogs | Meta data & dir contents | RVM
Volinfo records | Volume location | VLDB, VRDB: RW db files
VSGDB, Pdb records, Tokens | Security | VSGDB, .pdb, .tk files: dynamic RO db files
Servers/SCM, Partitions, Startup flags, Skipvolumes, LOG & DATA & DB Locators | Configuration data | Static data

RVM layout (coda_globals.h)
- Already_initialized (int)
- struct VolHead[MAXVOLS]
- struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
- short SmallVnodeIndex
- …. Same for large …
- MaxVolId (unsigned long)
- Remainder is dynamically allocated

Volume zoo
- RVM structures: VolumeData, VolHead, VolumeHeader, VolumeDiskData (volume.h, camprivate.h)
- VM structures: Volume, VolumeInfo, ……..

A volume in RVM
- VolHead: holds a VolumeHeader and a VolumeData
- VolumeHeader: stamp, id, parentid, type
- VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists, -- same for big --
  - contains pointer to rvm malloced data

VolumeDiskData (rvm)
Lots of stuff:
- Identity & location: partition, name
- Runtime info: use, inService, blessed, salvaged
- Vnode related: next uniquefier
- Versionvector
- Resolution flags, pointer to recov_vol_log
- Quota
- Resource usage: filecount, diskused etc

Volumes in VM
- struct Volumes sit in VolHash with copies of RVM data structures
- Salvage before “attaching” to VolHash
- Model of operation (FS):
  - GetVolume copies out from RVM
  - Do your mods in VM
  - PutVolume does RVM transaction
- Model of operation (Volutil):
  - operate on RVM

Volumes in Venus RPC’s
- One RPC: GetVolInfo
  - used for mount point traversal
- Only relates to
  - volume location database
  - volume replication database
  - VSGDB
- Could sit in separate Volume Location Server

Vnodes (cvnode.h)
- Small & large: large for directories
  - difference is ACL at back of large vnodes
- Inode field:
  - small vnodes: points to diskfile inode number
  - large vnodes: is RVM address of dir inode
- Contain important small structure: vv_t
- Pointers to reslog entries
- VM: cvnode’s with hash table, freelists etc

Vnodes in RVM
- RVM: VnodeDiskinfo (rvm_malloced)
- vnodes sit on rec_smolists
  - each link points to a DiskVnode
  - lists link vnodes with identical vnodenumbers but different uniquefiers
  - new vnodes grabbed from FreeLists (index.cc, recov{a,b,c}.cc)
- volumes have arrays of rec_smolists which grow when they are full

Vnodes in action
- Model:
  - GetFSObj calls GetVnode
  - work is done
  - PutFS Objects calls
    - rvm_begin_transaction
    - ReplaceVnode - copies data from VM to RVM
    - rvm_end_transaction
- Getting a vnode takes 3 pointer derefs, possibly 3 page faults, vs. 1 for local file systems.
  - Is this necessary? Probably not.
  - Cure it: yes!

Directories (rvm)
- DirInode
  - page table and “copy on write” refcount
- DirPages
  - 2048 bytes each
  - build up the directory
  - divided into 64 blobs of 32 bytes
- Hash table for fast name lookups
- Blob Freelist
- Array of free blobs per page

Directories
- More than one vnode can point to a directory (copy on write)
- VM: hash table of DirHandles
  - point to VM contiguous copy of dir
  - point to DirInode
  - have a lock etc
- Model: as for volumes & vnodes
- Critique: too baroque

Files
- Vnode references file by InodeNumber
- Files are copy on write
- There are “FileInodes” like dir inodes, but they are held in external DB or in inode itself
- Server always reads/writes whole files (could be exploited)

Volinit and salvage
- Set up volume hash table, serverlist, DiskPartitionList
- Cycle through partitions, check each for list of inodes
  - every inode has a vnode
  - every vnode has a directory name
  - every directory name has a vnode
- Put volume in a VM hash table

Server connection info
- Array of HostEntry (a “venus”)
  - Contains a linked list of connections
  - Contains a callback connection id
- Connection setup
  - first binding creates a host & callback conn
  - new binding creates a new connection and verifies callback
  - in RPC2_NewBinding & ViceNewConnectFS

Callbacks
- Hashtable of FileEntries: each contains
  - Fid
  - number of users
  - linked list of callbacks
- Callbacks: point to HostEntry
- Ops:
  - RPC: BreakCallBack
  - Local: placing, delete, deleteVenus

Callbacks
- Connection is non-authenticated. Should be fixed.
- Session key for CB connection should not expire.
- Side effect: the callback connection is used for BackFetch bulk transfer of files during reintegration.

RPC processing
- Venus RPC’s:
  - srvproc.cc - standard file ops
  - srvproc2.cc - standard volume ops
  - codaproc.cc - repair stuff
  - codaproc2.cc - reintegration stuff
- Volutil RPC’s:
  - vol-your-rpc.cc (in coda-src/volutil)
- Resolution: below

RPC processing
- RPC structure:
  - ValidateParms: validate, hand off COP2, cid
  - GetObject: vm copy, lock objects
  - CheckSemantics: Concurrency, Integrity, Permissions
  - Perform operations: BulkTransfer, UpdateObjects, OutParms
  - PutObject: rvm transactions, inode deletions

vlists
- GetFSObjects: instantiate a vlist
- RPC needs list of objects copied from RVM
- Modification status is held there (did CopyOnWrite kick in etc)
- PutObjects
  - rvm_begin_transaction
  - walk through the list, copy, rvm_set_range, unlock
  - rvm_end_transaction

COP2 handling
- In COP2, Venus gives final VV to server
- COP2s are sent out by Venus (with some delay)
- often piggybacked in bulk
- server knows about pending COP2 entries in hash table (coppend.cc)
- Manager thread CopPendingManager
  - Runs every minute.
  - Removes entries more than 900 secs old

Cop2 to RVM
- Data can be
  - PiggyBacked on another rpc
  - sent in ViceCop2 rpc.
- Both cases call InternalCop2 (srvproc.cc)
- InternalCop2 (codaproc.cc)
  - notifies the manager to dequeue
  - gets the FS objects listed for the COP2
  - installs final VV’s into RVM (rvm transaction!)

COP2 Problems
- Easy cause of conflicts in replicated volumes when clients access objects in rapid succession. (Can be fixed easily during the writeback caching operation)
- Not optimized for singly replicated volume.
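The pending-COP2 bookkeeping described above (entries recorded on arrival, dequeued by InternalCop2, and aged out by the manager after 900 seconds) can be sketched roughly as follows. This is an illustrative sketch only, not the code in coppend.cc: the names `cop_entry`, `cop_table`, `cop_add`, `cop_remove` and `cop_purge` are hypothetical, the capacity is arbitrary, and a flat array stands in for the server's hash table to keep the example short. Only the 900-second limit comes from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define COP_MAX_AGE_SECS 900 /* from the slides: older entries are removed */
#define COP_TABLE_SIZE   64  /* hypothetical capacity, chosen for the sketch */

/* Hypothetical pending-COP2 record: a store id plus its arrival time. */
struct cop_entry {
    int in_use;
    unsigned long store_id;
    time_t created;
};

/* A flat array stands in for the real hash table (assumption). */
struct cop_table {
    struct cop_entry slots[COP_TABLE_SIZE];
};

/* Record a pending COP2. Returns 0 on success, -1 if the table is full. */
int cop_add(struct cop_table *t, unsigned long store_id, time_t now)
{
    assert(t != NULL);
    for (size_t i = 0; i < COP_TABLE_SIZE; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use = 1;
            t->slots[i].store_id = store_id;
            t->slots[i].created = now;
            return 0;
        }
    }
    return -1;
}

/* Dequeue the entry for store_id once its COP2 arrives. Returns 1 if found. */
int cop_remove(struct cop_table *t, unsigned long store_id)
{
    assert(t != NULL);
    for (size_t i = 0; i < COP_TABLE_SIZE; i++) {
        if (t->slots[i].in_use && t->slots[i].store_id == store_id) {
            t->slots[i].in_use = 0;
            return 1;
        }
    }
    return 0;
}

/* One manager pass: drop entries more than COP_MAX_AGE_SECS old.
 * Returns the number of entries removed. */
int cop_purge(struct cop_table *t, time_t now)
{
    assert(t != NULL);
    int removed = 0;
    for (size_t i = 0; i < COP_TABLE_SIZE; i++) {
        if (t->slots[i].in_use &&
            difftime(now, t->slots[i].created) > COP_MAX_AGE_SECS) {
            t->slots[i].in_use = 0;
            removed++;
        }
    }
    return removed;
}
```

In this sketch a manager thread would call `cop_purge` once a minute with the current time, while the COP2 path would call `cop_remove` for each store id it receives. Passing `now` explicitly keeps the ageing rule easy to test without a real clock.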
Resolution Initiated by client with RPC to coordinator ViceResolve (codaproc.cc) coordinator sets up connections in VSG (unauthenticated) LockAndFetch (res/reslock, resutil): lock volumes, collect “closure” Resolution - special cases RegResDirRequired (rvmres/rvmrescoord.cc) check for unresolved ancestors already inconsistent runts (missing objects) weak equality (identical storeid) RecovDirResolve Phase II: (rvmres/{rescoord,subphase?}.cc) coordinator request logs from other servers subordinates lock affected dirs,marshall logs coordinator merges logs Phase III: ship merged log to subordinates perform operations on VM copies Return results to coordinator Resolution Phase IV: (is old Phase 3 …) collect results, compute new VV’s ship to subordinates commit results Comments on resolution Old versions of resolution: OldDirResolve: resolve only runts and weak DirResolve: resolve only in VM Remove these resolve directory has nothing to do with resolution: should be called librepair. Srv uses merely one function in it - repair uses the rest Volume Log During FS operations, log entries are created for use during resolution Different format per operation (rvmres/recov_vollog.cc) Added to the vlist by SpoolVMLogRecord Put in RVM at commit time Repair Venus makes ViceRepair RPC. File and symlink repair: BulkTransfer the object Directory repair, BulkTransfer the repair file and replay operations Venus follows this with a COP2 multi rpc For directory repair Venus invokes asynchronous resolve Future Good: Design is simple and efficient There is little C++: should eliminate easy to multi-thread Bad: Scalability ~8GB in practice, ~40GB in theory Data handling is bad: tricky to fix Volume code was & is worst: rewrite
