Coda Server Internals
Peter J Braam

Contents
  Data structure overview
  Volumes
  Vnodes
  Inodes

Data Structure Overview
  Object | Purpose | Resides where
  Inodes | File contents | /vicep* partitions
  Volumes, Vnodes, Directory contents, ACL, Reslogs | Meta data & dir contents | RVM
  Volinfo records | Volume location | VLDB, VRDB: RW db files
  VSGDB, Pdb records, Tokens | Security | VSGDB, .pdb, .tk files: dynamic RO db files
  Servers/SCM, Partitions, Startup flags, Skipvolumes, LOG & DATA & DB Locators | Configuration data | Static data

RVM layout (coda_globals.h)
  Already_initialized (int)
  struct VolHead[MAXVOLS]
  struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
  short SmallVnodeIndex
  …. Same for large …
  MaxVolId (unsigned long)
  Remainder is dynamically allocated

Volume zoo
  RVM: structures (volume.h, camprivate.h)
    VolumeData
    VolHead
    VolumeHeader
    VolumeDiskData
  VM: structures
    Volume
    VolumeInfo
    ……..

A volume in RVM
  VolHead
    VolumeHeader
      stamp
      id
      parentid
      type
    VolumeData
      *volumeDiskData
      *smallVnodeLists
      nsmallVnodes
      nsmallLists
      -- same for big --
  VolumeData contains pointers to rvm malloced data

VolumeDiskData (rvm)
  Lots of stuff:
    Identity & location: partition, name
    Runtime info: use, inService, blessed, salvaged
    Vnode related: next uniquefier
    Versionvector
    Resolution flags, pointer to recov_vol_log
    Quota
    Resource usage: filecount, diskused etc

Volumes in VM
  struct Volumes sit in VolHash with copies of RVM data structures
  Salvage before “attaching” to VolHash
  Model of operation (FS):
    GetVolume copies out from RVM
    Do your mods in VM
    PutVolume does RVM transaction
  Model of operation (Volutil):
    operate on RVM

Volumes in Venus RPCs
  One RPC: GetVolInfo
    used for mount point traversal
  Only relates to
    volume location database
    volume replication database
    VSGDB
  Could sit in separate Volume Location Server

Vnodes (cvnode.h)
  Small & large: large for directories
    difference is ACL at back of large vnodes
  Inode field:
    small vnodes: points to diskfile inode number
    large vnodes: is RVM address of dir inode
  Contain important small structure: vv_t
  Pointers to reslog entries
  VM: cvnodes with hash table, freelists etc

Vnodes in RVM
  RVM: VnodeDiskinfo (rvm_malloced)
  vnodes sit on rec_smolists
    each link points to a DiskVnode
    lists link vnodes with identical vnodenumbers but different uniquefiers
    new vnodes grabbed from FreeLists (index.cc, recov{a,b,c}.cc)
  volumes have arrays of rec_smolists which grow when they are full

Vnodes in action
  Model:
    GetFSObj calls GetVnode
    work is done
    PutFS Objects calls
      rvm_begin_transaction
      ReplaceVnode - copies data from VM to RVM
      rvm_end_transaction
  Getting a vnode takes 3 pointer derefs, possibly 3 page faults, vs. 1 for local file systems.
  Is this necessary? Probably not. Cure it: yes!

Directories (rvm)
  DirInode
    page table and “copy on write” refcount
  DirPages
    2048 bytes each, build up the directory
    divided into 64 32-byte blobs
    Hash table for fast name lookups
    Blob Freelist
    Array of free blobs per page

Directories
  More than one vnode can point to a directory (copy on write)
  VM: hash table of DirHandles
    point to VM contiguous copy of dir
    point to DirInode
    have a lock etc
  Model: as for volumes & vnodes
  Critique: too baroque

Files
  Vnode references file by InodeNumber
  Files are copy on write
  There are “FileInodes” like dir inodes, but they are held in external DB or in inode itself
  Server always reads/writes whole files (could be exploited)

Volinit and salvage
  Set up volume hash table, serverlist, DiskPartitionList
  Cycle through partitions, check each for list of inodes
    every inode has a vnode
    every vnode has a directory name
    every directory name has a vnode
  Put volume in a VM hash table

Server connection info
  Array of HostEntry (a “venus”)
    Contains a linked list of connections
    Contains a callback connection id
  Connection setup
    first binding creates a host & callback conn
    new binding creates a new connection and verifies callback
    in RPC2_NewBinding & ViceNewConnectFS

Callbacks
  Hashtable of FileEntries: each contains
    Fid
    number of users
    linked list of callbacks
  Callbacks: point to HostEntry
  Ops:
    RPC: BreakCallBack
    Local: placing, delete, deleteVenus

Callbacks
  Connection is non-authenticated. Should be fixed.
  Session key for CB connection should not expire.
  Side effect of callback connection is used for BackFetch bulk transfer of files during reintegration.

RPC processing
  Venus RPCs:
    srvproc.cc - standard file ops
    srvproc2.cc - standard volume ops
    codaproc.cc - repair stuff
    codaproc2.cc - reintegration stuff
  Volutil RPCs:
    vol-your-rpc.cc (in coda-src/volutil)
  Resolution: below

RPC processing
  RPC structure:
    ValidateParms: validate, hand off COP2, cid
    GetObject: vm copy, lock objects
    CheckSemantics: Concurrency, Integrity, Permissions
    Perform operations: BulkTransfer, UpdateObjects, OutParms
    PutObject: rvm transactions, inode deletions

vlists
  GetFSObjects: instantiate a vlist
    RPC needs list of objects copied from RVM
    Modification status is held there (did CopyOnWrite kick in etc)
  PutObjects
    rvm_begin_transaction
    walk through the list, copy, rvm_set_range, unlock
    rvm_end_transaction

COP2 handling
  In COP2 Venus gives the final VV to the server
  COP2s are sent out by Venus (with some delay)
    often piggybacked in bulk
  Server knows about pending COP2 entries in hash table (coppend.cc)
  Manager thread CopPendingManager
    Runs every minute.
    Removes entries more than 900 secs old

Cop2 to RVM
  Data can be
    PiggyBacked on another rpc
    sent in ViceCop2 rpc
  Both cases call InternalCop2 (srvproc.cc)
  InternalCop2 (codaproc.cc)
    notifies the manager to dequeue
    gets the FS objects listed for the COP2
    installs final VVs into RVM (rvm transaction!)

COP2 Problems
  Easy cause of conflicts in replicated volumes when clients access objects in rapid succession.
    (Can be fixed easily during the writeback caching operation)
  Not optimized for singly replicated volume.
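
The pending-COP2 bookkeeping described on the "COP2 handling" slide can be pictured roughly as follows. This is only a minimal sketch, not the code in coppend.cc: the class name `PendingCop2Table`, the `Cop2Key` type, and the explicit `now` parameter are all assumptions made for illustration. The only facts taken from the slides are that pending entries live in a hash table, that an arriving COP2 dequeues its entry, and that a periodic sweep removes entries more than 900 seconds old.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical key for a pending COP2; the real key type is not shown in the slides.
using Cop2Key = std::uint64_t;

// Maximum age of a pending entry, from the slide ("more than 900 secs old").
constexpr long kMaxAgeSecs = 900;

class PendingCop2Table {
public:
    // Record that a COP2 is expected for `key`, as of time `now` (in seconds).
    void add(Cop2Key key, long now) { pending_[key] = now; }

    // Dequeue an entry when its COP2 arrives; returns true if it was pending.
    bool dequeue(Cop2Key key) { return pending_.erase(key) == 1; }

    // One pass of the manager: drop entries strictly older than kMaxAgeSecs.
    // Returns the number of entries removed.
    std::size_t sweep(long now) {
        std::size_t removed = 0;
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (now - it->second > kMaxAgeSecs) {
                it = pending_.erase(it);  // erase returns the next valid iterator
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::unordered_map<Cop2Key, long> pending_;  // key -> time the entry was added
};
```

In this sketch a manager thread would call `sweep` once a minute with the current time. Passing the time in explicitly, rather than reading a clock inside the class, is a choice made here so the expiry rule can be exercised without waiting.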

Resolution
  Initiated by client with RPC to coordinator
    ViceResolve (codaproc.cc)
  Coordinator sets up connections in VSG (unauthenticated)
  LockAndFetch (res/reslock, resutil):
    lock volumes, collect “closure”

Resolution - special cases
  RegResDirRequired (rvmres/rvmrescoord.cc)
  check for
    unresolved ancestors
    already inconsistent
    runts (missing objects)
    weak equality (identical storeid)

RecovDirResolve
  Phase II: (rvmres/{rescoord,subphase?}.cc)
    coordinator requests logs from other servers
    subordinates lock affected dirs, marshal logs
    coordinator merges logs
  Phase III:
    ship merged log to subordinates
    perform operations on VM copies
    return results to coordinator

Resolution
  Phase IV: (is old Phase 3 …)
    collect results, compute new VVs
    ship to subordinates
    commit results

Comments on resolution
  Old versions of resolution:
    OldDirResolve: resolve only runts and weak
    DirResolve: resolve only in VM
    Remove these
  The resolve directory has nothing to do with resolution: it should be called librepair.
    Srv uses merely one function in it - repair uses the rest

Volume Log
  During FS operations, log entries are created for use during resolution
  Different format per operation (rvmres/recov_vollog.cc)
  Added to the vlist by SpoolVMLogRecord
  Put in RVM at commit time

Repair
  Venus makes ViceRepair RPC.
  File and symlink repair: BulkTransfer the object
  Directory repair: BulkTransfer the repair file and replay operations
  Venus follows this with a COP2 multi rpc
  For directory repair Venus invokes asynchronous resolve

Future
  Good:
    Design is simple and efficient
    There is little C++: should eliminate
    easy to multi-thread
  Bad:
    Scalability: ~8GB in practice, ~40GB in theory
    Data handling is bad: tricky to fix
    Volume code was & is worst: rewrite
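
As an illustration of the "weak equality (identical storeid)" check listed under "Resolution - special cases", the comparison might look roughly like the sketch below. Everything here is hypothetical: `ReplicaState`, `Comparison`, and `compare_replicas` are names invented for the example, and the version vector is simplified to a plain vector of integers. It also assumes that "strong" equality means identical version vectors, which the slides do not spell out. It is not the logic in rvmres/rvmrescoord.cc.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified replica state; field names are illustrative only.
struct ReplicaState {
    std::vector<int> vv;    // version vector, one slot per server in the VSG
    std::uint64_t storeid;  // identifier of the last store applied to the replica
};

enum class Comparison { StronglyEqual, WeaklyEqual, Different };

// Compare every replica against the first one.
Comparison compare_replicas(const std::vector<ReplicaState>& replicas) {
    bool same_vv = true;
    bool same_storeid = true;
    for (const ReplicaState& r : replicas) {
        if (r.vv != replicas.front().vv) same_vv = false;
        if (r.storeid != replicas.front().storeid) same_storeid = false;
    }
    if (same_vv) return Comparison::StronglyEqual;     // assumption: identical VVs
    if (same_storeid) return Comparison::WeaklyEqual;  // identical storeid, VVs differ
    return Comparison::Different;                      // would need log-based resolution
}
```

Under these assumptions, replicas whose version vectors differ only because a final VV was never installed would be classified as weakly equal instead of being sent through the full log merge.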