High-Performance Computing Survival Guide

James R. Knight
Yale Center for Genome Analysis
Department of Genetics
Yale University
January 30, 2017

1950’s – The Beginning...

2017 – Looking very similar...

...but there are differences
• Not a single computer but thousands of them, called a cluster
  – Hundreds of physical “computers”, called nodes
  – Each with 4-64 CPU’s, called cores
• Nobody works in the server rooms anymore
  – IT is there to fix what breaks, not to run computations (or help you run computations)
  – Everything is done by remote connections
• Computation is performed by submitting jobs for running
  – This actually hasn’t changed...but how you run jobs has...

A Compute Cluster
farnam.hpc.yale.edu: 300+ users, 99 compute nodes for general use, 1000TB disk space.
[Diagram: “You are here!” marker, the login node Farnam1, and the compute nodes Compute-1-1, Compute-1-2, Compute-2-1, Compute-2-2, Compute-3-1 and Compute-3-2, all joined by a network.]

You Use a Compute Cluster! Surfing the Web
[Diagram: from “You are here!”, you click on a link; the networked compute nodes behind Blah.com construct the webpage contents and return the webpage.]

How you’ll be using Farnam
[Diagram: the same cluster picture, labeled: connect by ssh to farnam.hpc.yale.edu (Farnam1, login); connect by srun to a compute node; run commands on compute nodes (and submit sbatch jobs to the rest of the cluster).]

1970’s – Terminals, In the Beginning...

2017 – Pretty much the same...
• Terminal app on Mac
• Look in the “Other” folder in Launchpad

Your “New” User Interface – Hunt and Peck!
• Type a command at the prompt, hit the return key
    program arguments...
• This runs the program, which will read the arguments, read inputs, perform computations and produce outputs
• When it completes, the prompt is displayed, telling you it is ready for the next command
• Key commands to learn:
    ssh [email protected]
    srun --pty bash

Helpful Tips
• Take a Linux basics tutorial
• The faster you can type, the faster you will be done
• Select and learn a text editor – Vi or Emacs
• Select and learn a programming language – Python, R or Perl
• Ask these questions to keep you oriented
  – What computer am I on?
  – What directory am I in?
  – Where are the files for my analysis?
  – What program(s) do I have running?
  – What jobs do I have running?

Directories and Paths
• Linux directory structure same as Mac/Windows folder structure
  – Folders/directories containing files and other sub-folders/sub-dirs
  – “Easy-to-access” directories: HOME directory
• A path is a string naming a file or directory in the structure
  – The slash character (‘/’) is separator for directories
    /Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx

The Shell
• When you type commands and run programs, you are actually running a program called a shell
  – Designed to take user input, run programs and display output
  – Started automatically when Terminal app started or when you log into a computer
  – Linux runs the bash shell, by default
• Maintains useful environment variables
  – $PWD, which holds your current working directory path
  – $HOME or ~, which holds your home directory path
  – $PATH, which holds locations of programs
• Powerful tool for organizing and executing commands
  – Useful to combine programs or redirect inputs and outputs, without having to write a program to do that
  – Full-fledged programming language, used to write shell scripts to run sets of commands

The Program’s Viewpoint
• Programs start knowing nothing, and must figure out what to do
  – Lines of code are generalized instructions
  – Specifics come from reading the program’s environment
Command-line Arguments
(what you typed), Standard Input (keyboard) and Files to read go into The Program; Standard Output (screen), Standard Error (screen) and Files to write come out.

Shell Redirection, Piping and Multiple Commands
• The shell lets you redirect stdin, stdout and stderr to configure how your program communicates
• myprog < inFile > outFile 2> errFile
  – “< inFile” redirects stdin so that the program reads the contents of “inFile”
  – “> outFile” redirects stdout so that the program writes standard output to “outFile”
  – “2> errFile” redirects stderr so that the program writes standard error to “errFile”
• echo Hello | sed s/Hello/Goodbye/
  – The “|” (called a pipe) redirects the echo program’s standard output so that it writes to the standard input of the sed program
  – This command writes “Goodbye” to the screen
• echo Hello ; echo Goodbye
  – The semi-colon separates commands, allowing multiple programs to run from one command-line
  – This command writes “Hello” then “Goodbye” to the screen

Running Commands on Farnam
• Linux commands are “built-in” and usable when you log in
• Most bioinformatics tools are not
• Use the “module” program to set up bioinformatics tools for use
  – module avail – list installed tools/programs
  – module load toolname – load tool into shell environment
      module load Tools/SimpleQueue/3.0
• TIP: Add module load commands to ~/.bashrc
  – Automatically loaded into every shell
  – Lines of the file are commands run before showing the first prompt
  – Don’t change the lines already there, just add your lines

Writing Scripts
• Sometimes Linux’s built-in programs, and existing bioinformatics programs, are not enough
  – To combine programs together in a specific way
  – To run programs on many different files/datasets
  – To perform custom statistical analyses on data files
• Scripting languages make it easy to write your own programs
  – bash, python, R, perl
  – Write the lines of the script using a text editor
  – Use the language’s program to run the script
      python myscript arguments...
Then, test, debug and rewrite...
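As a minimal sketch of such a script (written for Python 3; the file name count_lines.py and the function names are hypothetical, chosen only for illustration), the following takes file names as command-line arguments and reports the number of lines in each one:

```python
import os
import sys


def count_lines(path):
    """Return the number of lines in the text file at `path`."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def main(paths):
    # Process each file named on the command line, one at a time.
    for path in paths:
        if not os.path.isfile(path):
            # Report problems on standard error, not standard output.
            print("skipping (not a file):", path, file=sys.stderr)
            continue
        print(path, count_lines(path))


if __name__ == "__main__":
    # sys.argv[0] is the script name; the remaining items are the arguments.
    main(sys.argv[1:])
```

It could be run as: python count_lines.py dataset1.txt dataset2.txt > counts.txt, which uses shell redirection to save the standard output while any warnings still appear on the screen.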
Writing Scripts
• A script is like a lab protocol
  – Instructions on how to perform a task
  – Executed in order, from beginning to end
  – Just as protocol steps can have sub-steps, repeated steps and sub-protocols, script statements can have sub-statements, loops and function calls
• Types of statements in a script
  – Computation (assignment, input/output), if-then-else, for and while loops, functions
• Each programming language has its own unique syntax that you must follow
REMEMBER: You are the protocol writer...
...writing for someone very, very, very stupid

Writing Scripts
• Instead of reagents, tubes and plates, scripts operate on values, variables, data structures and files
  – Values: numbers (1, 2, 87.5), strings (“I am a string!”)
  – Variables: holder for a value
  – Data structures: holder for collections of values
  – Files: Series of strings (text files) or numbers (binary files) stored on disk
• Important data structures:
  – List or Array – ordered collection of values [ 1, 2, 4, 3 ]
  – Hash or Dictionary – collection of “name, value” pairs, like a telephone book
  – Record or Struct – collection of named variables/data-structures
  – Matrix – two-dimensional collection of values

That’s fine, but how do you do this, really???
• My best recommendation: Think about it, and write it down, as a protocol, then translate it into the programming language
  – Make the step descriptions comments in the script
    • Comments are lines beginning with ‘#’, which are ignored when executing the script
  – Refine into sub-steps when translation is difficult
• Example: writing echo in Python
  – Echo takes the command-line arguments and writes them to standard output
    [jk2269@compute-7-2 ~]$ echo Hello from the cluster!
    Hello from the cluster!
    [jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???
• Attempt #1: Implement that description
  – Python has a sys.argv list with the command-line arguments
  – Python has a print statement to write to standard output (Python 2 syntax, used throughout these examples)
• Program:
    #
    # Print the command-line arguments to stdout.
    #
    import sys
    print sys.argv

    [jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
    ['myecho.py', 'Hello', 'from', 'the', 'cluster!']
    [jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???
• Attempt #2: Refine, write each argument separately, so that the output can be formatted better.
  – Python can loop over the values of the sys.argv list
  – The print statement can write string values like " " (a space)
• Program:
    #
    # 1. for each command-line argument, except the first,
    #    a. print the argument and a space
    #
    import sys
    for arg in sys.argv[1:]:
        print arg, " "

    [jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
    Hello
    from
    the
    cluster!
    [jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???
• Attempt #3: Fix the extra newline characters
  – Ending the print statement with a comma avoids the newline, but does print a space (so skip the explicit space)
  – Add an extra print statement to get it to print the newline
• Program:
    #
    # 1. for each command-line argument, except the first,
    #    a. print the argument, with no newline
    # 2. print a newline
    #
    import sys
    for arg in sys.argv[1:]:
        print arg,
    print

    [jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
    Hello from the cluster!
    [jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???
• Attempt #4: Try a different approach, construct the string to be output, then print it.
  – Python strings have a join method that combines a list of strings into one string, with a separator.
• Program:
    #
    # 1. Combine the command-line arguments into a
    #    string, separating them by spaces
    # 2. Print the string
    #
    import sys
    line = " ".join(sys.argv[1:])
    print line

    [jk2269@compute-7-2 ~]$ python myecho.py Hello from the cluster!
    Hello from the cluster!
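The print statement used in these attempts is Python 2 syntax. As a sketch only, assuming a Python 3 interpreter (where print is a function), Attempt #4 might look like the following. The helper name build_line is hypothetical and does not come from the slides:

```python
import sys


def build_line(args):
    """Join the arguments into one string, separated by single spaces."""
    return " ".join(args)


if __name__ == "__main__":
    # 1. Combine the command-line arguments (skipping sys.argv[0],
    #    the script name) into a string, separating them by spaces.
    # 2. Print the string.
    print(build_line(sys.argv[1:]))
```

In Python 3 the trailing-comma trick from Attempt #3 no longer works; the closest equivalent is print(arg, end=" "), which replaces the newline with a space.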
That’s fine, but how do you do this, really???
• Why scripting/programming is hard:
  – You must think of everything
    • Use testing, iteration and refinement to make sure that you have thought of everything
    • You can get to “good enough”
  – You have to write everything in a foreign language, with no allowance for error
• My best recommendation: Think about it, and write it down, as a protocol, then translate it into the programming language
  – Design what you want the program to do as you would a protocol, in English (or your favorite language)
  – Match program statements to the steps, refining the steps so that they can be translated

Running Jobs on the Cluster
• You must make reservations!
  – Cluster is a shared resource, so you must ask for exclusive use of nodes and cores
  – The job request goes into a queue (also called a partition), and is granted when resources are available
  – How to do this? srun and sbatch!
• Interactive jobs
  – “srun --pty bash” – request 1 core on 1 node
  – “srun --pty -c 20 --mem-per-cpu 6000 bash” – request 1 node, with 20 cores
• Batch jobs
  – “sbatch myjob.sh” – Request to run the bash script myjob.sh
• Farnam’s cluster runs Slurm to manage the queues

Running Jobs on the Cluster
Example myjob.sh file
    #!/bin/bash
    # Lines containing options for the job request
    #SBATCH -c 20 --mem-per-cpu=6000
    #SBATCH -t 168:00:00
    #SBATCH –[email protected]
    #SBATCH --mail-type=END,FAIL
    # Load your tools
    source ~/.bashrc
    # Set working directory
    cd ~/project
    # The lines of your script
    echo Hello
    echo Goodbye

Running Jobs on the Cluster
• What if I have to run a program on 100 datasets?
  – You could make 100 scripts, or you could use SimpleQueue!
• Write a text file, where each line is a one-line shell command
• Use the sqSlurm.py program to make a SBATCH script
• Submit the SBATCH script
• Python program that can write the text file (let’s call it “writeit.py”)
    import sys
    import os
    cwd = os.getcwd()
    for arg in sys.argv[1:]:
        print "source ~/.bashrc ; cd", cwd, "; python myscript", arg
• Commands to run
    python writeit.py dataset*.gz > runit.smplq
    sqSlurm.py -n 3 -t 2 -m 120 -w 24:00:00 runit.smplq > runit.sh
    sbatch runit.sh
• Set number of nodes and number of tasks per node
  – Divide number of cores a task will use by 20 (cores per node)

What do you need to know how to do to “survive”?
• How to get into the cluster, and back out again.
• How to run commands in the shell.
• How to navigate around the directories (and make and remove them).
• How to create, look at and edit text files.
• How to write scripts to do the computations you need to do.
• How to submit jobs, to run things on the compute nodes.

Helpful Tips
• Ask these questions to keep you oriented
  – What computer am I on?
    • Look at the prompt, ‘hostname’
  – What directory am I in?
    • Look at the prompt and window top
    • ‘pwd’, ‘cd’
  – Where are the files for my analysis?
    • ‘ls’
    • ‘mkdir’, ‘rm’, ‘rmdir’
    • ‘more’ or ‘less’, ‘head’, ‘tail’
  – What program(s) do I have running?
    • ‘ps’, ‘top’, ‘screen -r’
  – What jobs do I have running?
    • ‘squeue | grep netId’

Golden Rule for Bioinformatic Clusters
• Never, ever, ever read and write SAM files. Always pipe the output through samtools to convert from SAM to BAM, if the software doesn’t support native BAM files.
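One way to follow this rule from a Python 3 script is sketched below. It only builds the shell pipeline as a string and does not run it. The aligner command (bwa mem) and all file names are assumptions for illustration, and the function name bam_pipeline is hypothetical. The conversion step uses samtools view -b, which reads SAM from standard input when given "-" as the input file.

```python
import shlex


def bam_pipeline(sam_producer, out_bam):
    """Build a shell command that pipes SAM output straight into a BAM file.

    sam_producer: list of strings, a command that writes SAM to stdout
                  (an assumed example is shown below; substitute your own).
    out_bam:      path of the BAM file to create.
    """
    # Quote each part so that spaces in file names survive the shell.
    producer = " ".join(shlex.quote(part) for part in sam_producer)
    # "samtools view -b -o FILE -" reads SAM from stdin and writes BAM.
    return producer + " | samtools view -b -o " + shlex.quote(out_bam) + " -"


if __name__ == "__main__":
    # Assumed aligner invocation and file names, for illustration only.
    print(bam_pipeline(["bwa", "mem", "ref.fa", "reads.fq"], "reads.bam"))
```

The printed line could be pasted into a batch script such as myjob.sh, or written as one line of a SimpleQueue task file, so that no intermediate SAM file is ever stored on disk.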