What’s new in Condor? What’s coming up?
Condor Week 2008
Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor

Release Situation
› Stable Series
    Current: Condor v7.0.1 (Feb 27th 2008)
    Last Year: Condor v6.8.4 (Feb 5th 2007)
› Development Series
    Current: Condor v7.1.0 (April 1st 2008)
    Last Year: Condor v6.9.2 (April 10th 2007)
› v6.9 Series: ~14 months

Special Condor Week Edition

How many cores in one new UW Condor cluster rack?

New Ports
› RHEL 5 x86 & x86_64 with stduniv and glibc 2.5
› Playstation 3
› HPUX 11i
› Itanium (almost done)
› Cross testing on x86-like platforms
› Debian clipped port
Out with the old.
    Red Hat Linux 7.x systems on the x86 processor.
    Digital Unix systems on the Alpha processor.
    Yellow Dog Linux 3.0 systems on the PPC processor.
    MacOS 10.3 systems on the PPC processor.

Big v7.0 Goodies
› Scalability Improvements
› GCB Improvements
› Privilege Separation
› New Quill
› Virtual Machine Universe

Scalability

Condor’s Privilege Separation
› Apply principle of least privilege to Condor
› No more root / superuser privilege required
› Currently completed on execute side
› Use glexec or Condor’s own “sudo”
› Can still run the “old way” if you want

Quill Take Two in v7.x
› Shared databases
› More than just the JobAd, e.g.
    Startd: Machine ClassAds
    Negotiator: matches
    Run: Job User Log information
› More than just PostgreSQL DBMS
› All the details: http://www.cs.wisc.edu/condor/quill_overview_07-18-2007.pdf

[Diagram: StartD, SchedD, Negotiator, sql.log, Disk, QuillD, DBMS]

Virtual Machine Universe
› Submit a “Job” that consists of a virtual machine image
› Condor schedules, manages, and monitors VM job
› Works w/ VMware Server and Xen
› Matchmaking
› Checkpoint/Restart/Migration
› Data Movement
› Plug: BoF Session 1:30pm tomorrow

What else?

GCB Improvements!
› Improved Scalability: Only use the broker if required!
    Local Host Optimizations
      • Bypass GCB if two daemons are talking on the same host
    Local Network Optimizations
      • Two hosts on the same private net bypass the broker
      • Every network is assigned a unique network name
      • Daemons advertise (a) publicly accessible IP; (b) real IP; (c) network name.
      • Names match ? use real IP : use public IP.
› Improved Robustness
    Broker dies -> master finds another broker and restarts.
    When master starts up, it pings a list of brokers and randomly chooses from those that respond.
    Bug fixes
› Improved Logging: the logs are now helpful and sane.

Process Tracking Guarantee
    Iron-clad tracking of process groups
    Even if running as the job submitter
    Uses supplementary group ids
    Linux only
    Also as a standalone daemon for OSG

    USE_GID_PROCESS_TRACKING = True
    MIN_TRACKING_GID = 750
    MAX_TRACKING_GID = 757

Better Collector Authorization
› New authorization levels to allow different rules for submission vs. execution
    ADVERTISE_STARTD, ADVERTISE_SCHEDD
› New config setting COLLECTOR_REQUIREMENTS
    The expression must evaluate to true for the Collector to accept the ad.

# Well-known ports for the trusted daemons
# Use the below ports if launching the condor_master
# as root; else, pick 3 ports above 1024.
MASTER_PORT = 890
SCHEDD_PORT = 891
STARTD_PORT = 892
MASTER_ARGS = -p $(MASTER_PORT)
SCHEDD_ARGS = -p $(SCHEDD_PORT)
STARTD_ARGS = -p $(STARTD_PORT)
COLLECTOR_REQUIREMENTS = \
  ( MyType =?= "Machine" && \
    regexp( "<[0-9.]*:$(STARTD_PORT)>" , MyAddress ) ) || \
  ( MyType =?= "Scheduler" && \
    regexp( "<[0-9.]*:$(SCHEDD_PORT)>" , MyAddress ) ) || \
  ( MyType =?= "DaemonMaster" && \
    regexp( "<[0-9.]*:$(MASTER_PORT)>" , MyAddress ) ) || \
  ( MyType =!= "Machine" && MyType =!= "Scheduler" && \
    MyType =!= "DaemonMaster" )

Handy New Attributes
› In your machine ad
    TotalTimeBackfillBusy, TotalTimeBackfillIdle, TotalTimeBackfillKilling,
    TotalTimeClaimedBusy, TotalTimeClaimedIdle,
    TotalTimeClaimedRetiring, TotalTimeClaimedSuspended,
    TotalTimeMatchedIdle, TotalTimeOwnerIdle,
    TotalTimePreemptingKilling, TotalTimePreemptingVacating,
    TotalTimeUnclaimedBenchmarking, TotalTimeUnclaimedIdle
› In your job ad
    NumJobStarts
    NumJobReconnects
    NumShadowExceptions
    NumShadowStarts

And last but not least…
› Leases added to COD.
› Simple best-fit algorithm added to dedicated scheduler.
› Can reference resource usage and quota information in preemption policy.
› condor_config_val -dump [-v]
› Chirp improvements
    Jobs can write messages into the user log
    Can use proc 0 ClassAd as a “scratch pad”
› Condor shutdown via expressions
    External Awareness

… and finally …
› File Transfer I/O Throttling
    MAX_CONCURRENT_DOWNLOADS and MAX_CONCURRENT_UPLOADS
› More types of jobs can survive across a shutdown/crash of submit machine
    Such as jobs that stream stdout/err.
› User’s job log changes.
    Can have a centralized job log file.
    Get values of any job ad attribute in log.
› “Cron” like job scheduling (Crondor?)
› Job Router shipped (Dan’s talk)
› License Change
› Source code publicly released on web

… and finally …
… and before shipping the new stable release …
We squashed LOTS of bugs!

Shiny new “bug free” Condor v7.0.x stable series!

Enough already, Todd.
Tell me about what is cooking with v7.1.x and beyond.

Terms of License
Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

Generalizing the Startd/Starter Architecture
› Making the startd more generic with the underlying system.
› How about: running without a starter, running w/o a schedd+shadow, pulling jobs, running starter-less jobs that it does not fork/exec, …
› Lightweight Jobs
› Examples
      • “Work Fetch” (ref to Derek’s talk)
      • Blue Heron Project (ref to Tom, Amanda, and Greg’s talk)

Some Love for Windows
› Jobs can write to the registry
    Condor allocates HKEY_CURRENT_USER.
› Problems w/ the Batch Login approach sessions on Windows Server 2003 fixed (by not using them)
› Interoperability with Samba (as a PDC) has been improved
› Arch ClassAd attribute now reflects the wide range of architectures available to the Windows world; it no longer simply returns INTEL

Green Computing
› The startd has the ability to place a machine into a low power state. (Standby, Hibernate, Soft-Off, etc.)
    HIBERNATE, HIBERNATE_CHECK_INTERVAL
    If all slots return non-zero, then the machine is powered down; otherwise, it continues running.
› Machine ClassAd contains all information required for a client to wake it up
    Condor can wake it up; there is also a standalone tool.
    This was NOT as easy as it should be.
› Machines in “Offline State”
    Lots of other uses
› Wake-up on Matchmaking Pressure

Plugins
› Think “Firefox”…
› Callouts from Condor daemons on appropriate events
› Plugin could re-implement or modify action (different than a client API)
› Will only build “as needed” as refactoring happens to add features
    Miron: “I don’t want your plugs, I want new features!”
› Examples: Collector, Accountant, File Transfers, Scheduling Algorithms, …

Scheduling in Condor Today
[Diagram: CM, schedd, and startd nodes]
› Distributed Ownership
› Settings reflect 3 separate viewpoints: Pool Manager, Resource Owner, Job Submitter

But some sites want to use Condor like this:
[Diagram: one schedd, five startds]
› Just one submission point (schedd)
› All resources owned by one entity
› We can do better for these sites.
    Policy configurations are complicated.
    Some useful policies are not present because they are hard to do in a wide-area distributed system.
    Today the dedicated “scheduler” only supports FIFO and a naive Best Fit algorithm.

So what to do?
[Diagram: one schedd, five startds]
› Give the schedd more scheduling options.
    Examples: why can’t the schedd do priority preemption without the matchmaker’s help? Or move jobs from slow to fast claimed resources?
› Pluggable scheduler routines.
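
To make the idea of pluggable scheduler routines concrete, here is a minimal sketch in Python. It is not Condor code and does not reflect how the schedd or the dedicated scheduler is implemented. The names `Job`, `Slot`, `Policy`, `first_fit`, `best_fit`, and `schedule` are all hypothetical, and CPU count stands in for whatever resources a real policy would compare. The point is only that a scheduling policy can be a swappable function behind one interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Job:
    name: str
    cpus_needed: int


@dataclass(frozen=True)
class Slot:
    name: str
    cpus: int


# Hypothetical plugin interface: a routine picks one free slot for a job,
# or returns None when nothing fits.
Policy = Callable[[Job, List[Slot]], Optional[Slot]]


def first_fit(job: Job, free: List[Slot]) -> Optional[Slot]:
    """Take the first free slot that is large enough."""
    for slot in free:
        if slot.cpus >= job.cpus_needed:
            return slot
    return None


def best_fit(job: Job, free: List[Slot]) -> Optional[Slot]:
    """Take the large-enough slot that leaves the fewest CPUs unused."""
    candidates = [s for s in free if s.cpus >= job.cpus_needed]
    return min(candidates, key=lambda s: s.cpus - job.cpus_needed, default=None)


def schedule(jobs: List[Job], slots: List[Slot], policy: Policy) -> Dict[str, str]:
    """Walk the queue in FIFO order, asking the plugged-in policy for a slot."""
    free = list(slots)
    placement: Dict[str, str] = {}
    for job in jobs:
        slot = policy(job, free)
        if slot is not None:
            free.remove(slot)
            placement[job.name] = slot.name
    return placement
```

With jobs `[Job("a", 2), Job("b", 8)]` and slots `[Slot("big", 8), Slot("small", 2)]`, `first_fit` places only job `a` (on `big`), while `best_fit` places `a` on `small` and `b` on `big`. Swapping the policy changes the outcome without touching the queue-walking loop.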

DAGMan Improvements
› Automatic running of rescue DAGs (useful for nested DAGs)
› Significantly improved speed of DAG recovery mode
› Assignment of “node categories” and category throttles
› Added generic node priorities & Depth First Traversal algorithm

DAGMan Depth First Example

Category Example
[Diagram: Setup; Big jobs (Run <= 2); Small jobs (Run <= 5); Cleanup]

DAGMan Future Work
› DAG Splicing
› Allowing custom attributes in node ClassAds
› Fixing condor_hold semantics
› Configurable job start rate
› Node iteration

DAGMan Future Work
› Scalability
    Current potential about 1 million nodes
    Future up to 10 million nodes
› Submit files which generate more than one cluster

EC2 / VM Universe Next Steps:
› Impregnate Condor into the Image
    When? On Demand.
    How? Job Router, GlideIn Factory, …
› File Transfer To/From S3 (Plugin!)
› Options to handle Amazon’s looming threat: NAT only
    Overlay Network?
      • GCB
      • OpenVPN
    Communicate by way of S3?

Negotiation Performance
› v6.8 -> automatic “significant attributes”, match caching
› v7.1.0 -> “resource request” ads
    Simple explanation: Resource request ad == a count plus all significant attributes.
    Inserted into a schedd submitter ad.
    “Give me 400 resources like this, and 200 resources like that, etc.”
› Matchmaking algorithm remains the same; only how it “learns” about jobs changes.
› Disabled by default.
› Possibilities, possibilities…
    More robust against unresponsive schedds
    No startd Rank preemption?
    Others?

And…
› The End ™ of the NFS Locking issue
› Avoid redundant copies of the same executable in the Condor spool
    Maybe more?
› The “Stamping of a Passport”
› End-to-End Security
    Ref Ian’s Talk
› A web site design from this decade.

Thank you for being such an awesome audience and an awesome user community!!!

[Photo: Jason Stowe, enjoying free bacon at a local pub. Only in Wisconsin.]
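
Appendix sketch for the Negotiation Performance slide: the “resource request” ad is described as a count plus all significant attributes. The Python below is one way to picture that grouping step. It is an illustration only, not the schedd’s implementation. The function name `build_resource_requests`, the `Count` key, and the attribute names in the usage note are all assumptions made for the example.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def build_resource_requests(
    job_ads: List[Dict[str, Any]],
    significant_attrs: Sequence[str],
) -> List[Dict[str, Any]]:
    """Collapse job ads that agree on every significant attribute.

    Each returned dict holds the shared attribute values plus a
    hypothetical "Count" key: "give me N resources like this".
    Attribute values must be hashable for this sketch to work.
    """
    # Key each job by its values for the significant attributes only;
    # attributes that do not affect matching are ignored.
    counts = Counter(
        tuple(ad.get(attr) for attr in significant_attrs) for ad in job_ads
    )
    requests = []
    for values, count in counts.items():
        request = dict(zip(significant_attrs, values))
        request["Count"] = count
        requests.append(request)
    return requests
```

For example, three job ads that differ in `Owner` but share only two distinct `(Memory, Arch)` combinations collapse into two requests when `significant_attrs` is `["Memory", "Arch"]`, so the matchmaker would see two ads instead of three.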