Bioinformatics: Theory and Practice (生物信息学理论和实践)
Jijun Tang (唐继军)
[email protected]
13928761660
www.cse.sc.edu/~jtang/BJFU

Homework
• Sequence:
    GTTGCAGCAATGGTAGACTCAACGGTAGCAAT
    AACTGCAGGACCTAGAGGAAAAACAGTAGGG
    ATTAATAAGCCCTATGGAGCACCAGAAATTAC
    AAAAGATGGTTATAAGGTGATGAAGGGTATC
    AAGCCTGAA
• Why does BLAST with its default settings return no results for this sequence? Which settings should you choose instead?
• What are the most recent PubMed articles on the related species?

Working with Directories
• Directories are a means of organizing your files on a Linux computer.
• They are equivalent to folders on Windows and Macintosh computers.
• Directories contain files, executable programs, and sub-directories.
• Understanding how to use directories is crucial to manipulating your files on a Linux system.

File & Directory Commands
• This is a minimal list of Linux commands that you must know for file management:
    ls      list
    mkdir   make directory
    cd      change directory
    pwd     print working directory
    cp      copy
    rm      remove
    mv      move
    more    view by page
    cat     view entire file
    man     help
• All of these commands can be modified with many options. Use the Linux man pages to learn more.

Navigation
• pwd (print working directory) shows the name and location of the directory where you are currently working:
    > pwd
    /home/jtang
• This is a "pathname"; the slashes separate sub-directories.
• The initial slash is the "root" of the whole filesystem.
• ls (list) gives you a list of the files in the current directory:
    > ls
    assembin4.fasta  Misc  test2.txt
    bin  temp  testfile
• Use the ls -l (long) option to get more information about each file:
    > ls -l
    total 1768
    drwxr-x--- 2 browns02 users   8192 Aug 28 18:26 Opioid
    -rw-r----- 1 browns02 users   6205 May 30 2000 af124329.gb_in2
    -rw-r----- 1 browns02 users 131944 May 31 2000 af151074.fasta

Sub-directories
• cd (change directory) moves you to another directory:
    > cd Misc
    > pwd
    /u/browns02/Misc
• mkdir (make directory) creates a new sub-directory inside of the current directory:
    > ls
    assembler phrap space
    > mkdir subdir
    > ls
    assembler phrap space subdir
• rmdir (remove directory) deletes a sub-directory, but the sub-directory must be empty:
    > rmdir subdir
    > ls
    assembler phrap space

Create new files
• nano
• vi/vim
• emacs

Programming
• perl
• python
• c/c++
• R
• Java

more
• Use the command more to view the contents of a file one screen at a time:
    > more t27054_cel.pep
    !!AA_SEQUENCE 1.0
    P1;T27054 - hypothetical protein Y49E10.20 - Caenorhabditis elegans
    Length: 534  May 30, 2000 13:49  Type: P  Check: 1278  ..
      1 MLKKAPCLFG SAIILGLLLA AAGVLLLIGI PIDRIVNRQV IDQDFLGYTR
     51 DENGTEVPNA MTKSWLKPLY AMQLNIWMFN VTNVDGILKR HEKPNLHEIG
    101 PFVFDEVQEK VYHRFADNDT RVFYKNQKLY HFNKNASCPT CHLDMKVTIP
    t27054_cel.pep (87%)
• Hit the spacebar to page down through the file
• Ctrl-U moves back up a page
• At the bottom of the screen, more shows how much of the file has been displayed
• Similar command: less

Copy & Move
• cp lets you copy a file from any directory to any other directory, or create a copy of a file with a new name in one directory:
    cp filename.ext newfilename.ext
    cp filename.ext subdir/newname.ext
    cp /u/jdoe01/filename.ext ./subdir/newfilename.ext
• mv allows you to move files to other directories, but it is also used to rename files.
• Filename and directory syntax for mv is exactly the same as for the cp command:
    mv filename.ext subdir/newfilename.ext
• NOTE: When you use mv to move a file into another directory, the file no longer exists in its original location.

Delete
• Use the command rm (remove) to delete files
• There is no way to undo this command!
• We have set the server to ask if you really want to remove each file before it is deleted.
• You must answer "Y" or else the file is not deleted.
• The -f option skips this confirmation: rm -f
• rm -rf also removes directories and everything inside them without asking, so use it with care

View File Permissions
• Use the ls -l command to see the permissions for all files in a directory:
    $ ls -l
    total 2
    -rw-r--r-- 1 jtang None 56 Feb 29 11:21 data.txt
    -rwxr-xr-x 1 jtang None 33 Feb 29 11:21 test.pl
• The username of the owner is shown in the third column.
(The owner of the files listed above is jtang.)
• The owner belongs to the group "None".
• The access rights for these files are shown in the first column. This column consists of 10 characters known as the attributes of the file, made up of r, w, x, and -:
    r   indicates read permission
    w   indicates write (and delete) permission
    x   indicates execute (run) permission
    -   indicates no permission for that operation

Change Protections
• Only the owner of a file can change its protections
• To change the protections on a file use the chmod (change mode) command. [Beware, this is a confusing command.]
• Taken all together, it looks like this:
    > chmod 644 data.txt
  This gives the owner read and write permission, and gives the group and the world read permission only.
• Other common modes: 600, 755, 700

Commands for Files
• Files are used to store information, for example, data or the results of some analysis.
• You will mostly deal with text files
• Files on the RCR Alpha are automatically backed up to tape every night.
• cat dumps the entire contents of a file onto the screen.
• For a long file this can be annoying, but it can also be helpful if you want to copy and paste (use the buffer of your telnet program)

FTP/SCP is Simple
• File Transfer Protocol is standard for all computers on any network.
• The best way to move lots of data to and from remote machines:
  • put raw data onto the server for analysis
  • get results back to the desktop for use in papers and grants
• Graphical FTP applications for desktop PCs:
  • On a Mac, use Fetch, CyberDuck (!)
  • On a Windows PC, use WS_FTP, FileZilla
  • winscp

Some More Advanced Linux Commands
• grep: searches a file for a specific text pattern
• cut: copies one or more columns from a tab-delimited text file
• wc: word count
• | (the pipe): sends the output of one command as input to the next
• > : redirects output to a file

Perl

Why Write Programs?
• Automate computer work that you do by hand: save time & reduce errors
• Run the same analysis on lots of similar data files = scale-up
• Analyze data, make decisions
  • sort Blast results by e-value &/or species of best match
• Build a pipeline
• Create new analysis methods

Why Perl?
• Fairly easy to learn the basics
• Many powerful functions for working with text: search & extract, modify, combine
• Can control other programs
• Free and available for all operating systems
• Most popular language in bioinformatics
• Many pre-built "modules" are available that do useful things

Get Perl
• You can install Perl on any type of computer
• Download and install Perl on your own computer: www.perl.org

Programming Concepts
• Program = a text file that contains instructions for the computer to follow
• Programming Language = a set of commands that the computer understands (via a "command interpreter")
• Input = data that is given to the program
• Output = something that is produced by the program

Programming
• Write the program (with a text editor)
• Run the program
• Look at the output
• Correct the errors (debugging)
• Repeat
(computers are VERY dumb - they do exactly what you tell them to do, so be careful what you ask for…)

Basic Concepts
• Variables and Assignment
• Conditions
• Loop
• Input/Output (I/O)
• Procedures/functions

Strings
• Text is handled in Perl as a string
• This basically means that you have to put quotes around any piece of text that is not an actual Perl instruction.
• Perl has two kinds of quotes, single ' and double " (they behave differently: text in single quotes is printed exactly as written, while double quotes fill in variables and escapes such as \n)

Print
• Perl uses the term "print" to create output
• Without a print statement, you won't know what your program has done
• You need to tell Perl to put a carriage return at the end of a printed line
  • Use the "\n" (newline) command
  • Include the quotes
  • The "\" character is called an escape - Perl uses it a lot

Your First Perl Program
• Open a new text file:
    > nano prog1.pl
• Type:
    #!/usr/bin/perl
    #my first perl program
    print "Hello world\n";

Program details
• Perl programs always start with the line:
    #!/usr/bin/perl
  • this tells the computer that this is a Perl program and where to get the Perl interpreter
• All other lines that start with # are considered comments, and are ignored by Perl
• Lines that are Perl commands end with a ;

Run your Perl program
    > perl prog1.pl      [use the perl interpreter to run your script]
    > chmod 755 *.pl     [make the file executable]
    > ./prog1.pl         [run it]

    #!/usr/bin/perl
    $DNA = 'ACGT';
    # Next, we print the DNA onto the screen
    print $DNA, "\n";
    print '$DNA\n';     # single quotes: prints $DNA\n literally
    print "$DNA\n";     # double quotes: prints ACGT and a newline
    exit;

Numbers and Functions
• Perl handles numbers in most common formats:
    456
    5.6743
    6.3E-26
• Mathematical functions work pretty much as you would expect:
    4+7
    6*4
    43-27
    256/12
    2/(3-5)

Do the Math (your 2nd Perl program)
    #!/usr/bin/perl
    print "4+5\n";
    print 4+5 , "\n";
    print "4+5=" , 4+5 , "\n";
[Note: use commas to separate multiple items in a print statement, whitespace is ignored]

Variables
• To be useful at all, a program needs to be able to store information from one line to the next
• Perl stores information in variables
• A variable name starts with the "$" symbol, and it can store strings or numbers
• Variables are case sensitive
• Give them sensible names
• Use the "=" sign to assign values to variables:
    $one_hundred = 100;
    $my_sequence = "ttattagcc";

You can do Math with Variables
    #!/usr/bin/perl
    #put some values in variables
    $sequences_analyzed = 200 ;
    $new_sequences = 21 ;
    #now we will do the work
    $percent_new_sequences = ( $new_sequences / $sequences_analyzed ) * 100 ;
    print "% of new sequences = " , $percent_new_sequences;

Output:
    % of new sequences = 10.5

String Operations
• Strings (text) in variables can be used for some math-like operations
• Concatenate (join): use the dot . operator
    $seq1 = "ACTG";
    $seq2 = "GGCTA";
    $seq3 = $seq1 . $seq2;
    print $seq3;

    ACTGGGCTA

    #!/usr/bin/perl
    # Storing DNA in a variable, and printing it out
    # First we store the DNA in a variable called $DNA
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    # Next, we print the DNA onto the screen
    print $DNA;
    # Finally, we'll specifically tell the program to exit.
    exit;

    #!/usr/bin/perl -w
    $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';
    print "Here are the original two DNA fragments:\n\n";
    print $DNA1, "\n";
    print $DNA2, "\n\n";
    # Using "string interpolation"
    $DNA3 = "$DNA1$DNA2";
    print "Here is the concatenation of the first two fragments (version 1):\n\n";
    print "$DNA3\n\n";
    # An alternative way using the "dot operator":
    $DNA3 = $DNA1 . $DNA2;
    print "Here is the concatenation of the first two fragments (version 2):\n\n";
    print "$DNA3\n\n";
    exit;

    #!/usr/bin/perl -w
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    print "Here is the starting DNA:\n\n";
    print "$DNA\n\n";
    # Transcribe the DNA to RNA by substituting all T's with U's.
    $RNA = $DNA;
    $RNA =~ s/T/U/g;
    # Print the RNA onto the screen
    print "Here is the result of transcribing the DNA to RNA:\n\n";
    print "$RNA\n";
    # Exit the program.
    exit;

Exercises
• Create a dir named Exercises in your home dir
• Create a folder Class1 in your Exercises dir
• Create three perl programs
  • Prog2: Concatenate three DNAs
  • Prog3: Convert a DNA to one with lower cases
    • A->a, C->c, G->g, T->t
• Chmod, Test and Debug

    #!/usr/bin/perl -w
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    print "$DNA\n\n";
    $revcom = reverse $DNA;
    # NOTE: these substitutions run one after another, so every A that
    # becomes T is turned back into A by the next line (likewise G and C).
    # The result is not a true complement; the tr/// version that follows fixes this.
    $revcom =~ s/A/T/g;
    $revcom =~ s/T/A/g;
    $revcom =~ s/G/C/g;
    $revcom =~ s/C/G/g;
    # Print the reverse complement DNA onto the screen
    print "Here is the reverse complement DNA:\n\n";
    print "$revcom\n";

    #!/usr/bin/perl -w
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    print "$DNA\n\n";
    $revcom = reverse $DNA;
    # See the text for a discussion of tr///
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    # Print the reverse complement DNA onto the screen
    print "Here is the reverse complement DNA:\n\n";
    print "$revcom\n";
    exit;

Exercise
• Change your previous program so that it converts to lowercase more easily

More
• In Exercises, create a dir named Class2
• Using nano, create a file named NM_021964fragment.pep
• Put some amino acid sequence into it
• Save and quit

    #!/usr/bin/perl -w
    # The filename of the file containing the protein sequence data
    $proteinfilename = 'NM_021964fragment.pep';
    # First we have to "open" the file
    open(PROTEINFILE, $proteinfilename);
    $protein = <PROTEINFILE>;
    # Now that we've got our data, we can close the file.
    close PROTEINFILE;
    # Print the protein onto the screen
    print "Here is the protein:\n\n";
    print $protein;
    exit;

More
• Using nano, add two more lines to NM_021964fragment.pep
• Save and quit

    #!/usr/bin/perl -w
    $proteinfilename = 'NM_021964fragment.pep';
    open(PROTEINFILE, $proteinfilename);
    # First line
    $protein = <PROTEINFILE>;
    print "\nHere is the first line of the protein file:\n\n";
    print $protein;
    # Second line
    $protein = <PROTEINFILE>;
    print "\nHere is the second line of the protein file:\n\n";
    print $protein;
    # Third line
    $protein = <PROTEINFILE>;
    print "\nHere is the third line of the protein file:\n\n";
    print $protein;
    close PROTEINFILE;
    exit;

Exercise
• Create a file named dna.fasta
• Add two lines to this file:
    >DNA1
    ATGCGGGATGGAGCGCGC
• Write a program that opens the file and prints the DNA name and the sequence
• How can you avoid printing the ">"?

    #!/usr/bin/perl -w
    # The filename of the file containing the protein sequence data
    $proteinfilename = 'NM_021964fragment.pep';
    # First we have to "open" the file
    open(PROTEINFILE, $proteinfilename);
    # Read the protein sequence data from the file, and store it
    # into the array variable @protein
    @protein = <PROTEINFILE>;
    # Print the protein onto the screen
    print @protein;
    # Close the file.
    close PROTEINFILE;
    exit;

    #!/usr/bin/perl -w
    # "scalar context" and "list context"
    @bases = ('A', 'C', 'G', 'T');
    print "@bases\n";
    $a = @bases;        # scalar context: the number of elements (4)
    print $a, "\n";
    ($a) = @bases;      # list context: the first element (A)
    print $a, "\n";
    exit;

    #!/usr/bin/perl -w
    # array indexing
    @bases = ('A', 'C', 'G', 'T');
    print "@bases\n";
    print $bases[0], "\n";
    print $bases[1], "\n";
    print $bases[2], "\n";
    print $bases[3], "\n";
    exit;

    #!/usr/bin/perl -w
    # array indexing
    @coins = ("Quarter","Dime","Nickel");
    print $coins;       # $coins is an undefined scalar, not the array @coins; -w warns here
    print $coins[0], "\n";
    exit;

    #!/usr/bin/perl -w
    # qw() builds a list of words without quotes or commas
    @coins = qw(Quarter Dime Nickel);
    print $coins[0], "\n";
    exit;

    #!/usr/bin/perl -w
    # join
    @coins = qw(Quarter Dime Nickel);
    $x = join('', @coins);
    print $x;
    print join(' ', @coins);
    exit;

    #!/usr/bin/perl -w
    # split
    $coins = "Quarter Dime Nickel";
    @y = split(' ', $coins);
    print $y[0], "\n";
    @y = split(',', $coins);    # no comma in the string, so $y[0] is the whole string
    print $y[0];
    exit;

String functions
• Chomp
• Length of a string
• Substring

    #!/usr/bin/perl -w
    $proteinfilename = 'NM_021964fragment.pep';
    open(PROTEINFILE, $proteinfilename);
    $protein = <PROTEINFILE>;
    close PROTEINFILE;
    $len = length $protein;     # the count includes the trailing newline
    print $len, "\n";
    exit;

    #!/usr/bin/perl -w
    $proteinfilename = 'NM_021964fragment.pep';
    open(PROTEINFILE, $proteinfilename);
    $protein = <PROTEINFILE>;
    close PROTEINFILE;
    chomp $protein;             # remove the trailing newline first
    $len = length $protein;
    print $len, "\n";
    exit;

    #!/usr/bin/perl -w
    $proteinfilename = 'NM_021964fragment.pep';
    open(PROTEINFILE, $proteinfilename);
    $protein = <PROTEINFILE>;
    close PROTEINFILE;
    chomp $protein;
    $st1 = substr($protein, 0, 2);   #or substr $protein, 0, 2;
    print $st1, "\n";
    exit;

    #!/usr/bin/perl -w
    $proteinfilename = 'NM_021964fragment.pep';
    open(PROTEINFILE, $proteinfilename);
    $protein = <PROTEINFILE>;
    close PROTEINFILE;
    chomp $protein;
    $st1 = substr($protein, 3);      #or substr $protein, 3;
    print $st1, "\n";
    exit;

Exercise
• Create a DNA fasta file with one > line and three lines of sequence data
• Show those lines on the screen
• Show the number of characters in the sequence
• How can we show them on one line?
• Play with the substr function
• Can we tell how many A's are in the sequence?

    #!/usr/bin/perl -w
    $proteinfilename = 'NM_021964fragment.pep';
    unless ( open(PROTEINFILE, $proteinfilename) ) {
        print "Could not open file $proteinfilename!\n";
        exit;
    }
    while( $protein = <PROTEINFILE> ) {
        print " ###### Here is the next line of the file:\n";
        print $protein;
    }
    # Close the file.
    close PROTEINFILE;
    exit;

Bigger Exercise
• Create a DNA fasta file with one > line and several lines of sequence data
• Show those lines on the screen
• Show the number of characters in the sequence
• How can we show them on one line?

Comparison
• String comparison (are they the same, > or <)
  • eq (equal)
  • ne (not equal)
  • ge (greater or equal)
  • gt (greater than)
  • lt (less than)
  • le (less or equal)

Conditions
• if () {}
• elsif () {}
• else {}

    #!/usr/bin/perl -w
    $word = 'MNIDDKL';
    if($word eq 'QSTVSGE') {
        print "QSTVSGE\n";
    } elsif($word eq 'MRQQDMISHDEL') {
        print "MRQQDMISHDEL\n";
    } elsif ( $word eq 'MNIDDKL' ) {
        print "MNIDDKL-the magic word!\n";
    } else {
        print "Is \"$word\" a peptide?\n";
    }
    exit;
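
The string comparison operators listed under Comparison can be combined with an if / elsif / else chain, as in the rough sketch below. The subroutine name compare_words and the sample strings are made up for illustration and do not come from the slides; the sketch also uses "my" and "sub", which the slides have not introduced yet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): report how $word compares
# with $reference using Perl's string comparison operators.
sub compare_words {
    my ($word, $reference) = @_;
    if ($word eq $reference) {
        return "same";
    } elsif ($word lt $reference) {
        return "before";
    } else {
        return "after";
    }
}

print compare_words('MNIDDKL', 'MNIDDKL'), "\n";
print compare_words('MNIDDKL', 'QSTVSGE'), "\n";
print compare_words('QSTVSGE', 'MNIDDKL'), "\n";

# String operators compare character by character, so as strings
# '10' sorts before '9' (use ==, <, > to compare numbers instead).
print compare_words('10', '9'), "\n";
```

The operators eq, ne, lt, gt, le, and ge treat their operands as text. When the values are numbers, the numeric operators (==, !=, <, >, <=, >=) are usually what you want.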