Download 004 - cse.sc.edu

Document related concepts

DNA damage theory of aging wikipedia , lookup

Bisulfite sequencing wikipedia , lookup

Gel electrophoresis of nucleic acids wikipedia , lookup

Genealogical DNA test wikipedia , lookup

Primary transcript wikipedia , lookup

History of genetic engineering wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Cell-free fetal DNA wikipedia , lookup

Molecular cloning wikipedia , lookup

United Kingdom National DNA Database wikipedia , lookup

Epigenomics wikipedia , lookup

DNA supercoil wikipedia , lookup

Nucleic acid double helix wikipedia , lookup

Extrachromosomal DNA wikipedia , lookup

Non-coding DNA wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Cre-Lox recombination wikipedia , lookup

DNA vaccination wikipedia , lookup

Genomics wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Point mutation wikipedia , lookup

Helitron (biology) wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Transcript
Bioinformatics
生物信息学理论和实践
唐继军
[email protected]
13928761660
www.cse.sc.edu/~jtang/BJFU
作业
• GTTGCAGCAATGGTAGACTCAACGGTAGCAAT
AACTGCAGGACCTAGAGGAAAAACAGTAGGG
ATTAATAAGCCCTATGGAGCACCAGAAATTAC
AAAAGATGGTTATAAGGTGATGAAGGGTATC
AAGCCTGAA
• 为什么用缺省blast出不来结果?需要如
何选择?
• 相关物种的最新pubmed文章有哪些?
Working with Directories
• Directories are a means of organizing your
files on a Linux computer.
• They are equivalent to folders on Windows and
Macintosh computers
• Directories contain files, executable
programs, and sub-directories
• Understanding how to use directories is
crucial to manipulating your files on a Linux
system.
File & Directory Commands
• This is a minimal list of Linux commands that you
must know for file management:
ls (list)
mkdir (make directory)
cd (change directory)
pwd (present directory)
cp (copy)
rm (remove)
mv (move)
more (view by page)
cat (view entire)
man (help)
• All of these commands can be modified with many
options. Learn to use Linux ‘man’ pages for more
information.
Navigation
• pwd (present working directory) shows the name
and location of the directory where you are currently
working: > pwd
/home/jtang
• This is a “pathname,” the slashes indicate sub-directories
• The initial slash is the “root” of the whole filesytem
• ls (list) gives you a list of the files in the current
directory:
• > ls
assembin4.fasta Misc
test2.txt
bin
temp
testfile
• Use the ls -l (long) option to get more information about each file
> ls -l
total 1768
drwxr-x--- 2 browns02 users 8192 Aug 28 18:26 Opioid
-rw-r----- 1 browns02 users 6205 May 30 2000 af124329.gb_in2
-rw-r----- 1 browns02 users 131944 May 31 2000 af151074.fasta
Sub-directories
• cd (change directory) moves you to another
directory
>cd Misc
> pwd
/u/browns02/Misc
• mkdir (make directory) creates a new
sub-directory inside of the current directory
> ls
assembler phrap
> mkdir subdir
> ls
assembler phrap
space
space
subdir
• rmdir (remove directory) deletes a subdirectory, but the sub-directory must be
empty
> rmdir subdir
> ls
assembler phrap
space
Create new files
• nano
• vi/vim
• emacs
Programming
•
•
•
•
•
perl
python
c/c++
R
Java
more
• Use the command more to view at the contents of
a file one screen at a time:
> more t27054_cel.pep
!!AA_SEQUENCE 1.0
P1;T27054 - hypothetical protein Y49E10.20 - Caenorhabditis elegans
Length: 534 May 30, 2000 13:49 Type: P Check: 1278 ..
1 MLKKAPCLFG SAIILGLLLA AAGVLLLIGI PIDRIVNRQV IDQDFLGYTR
51 DENGTEVPNA MTKSWLKPLY AMQLNIWMFN VTNVDGILKR HEKPNLHEIG
101 PFVFDEVQEK VYHRFADNDT RVFYKNQKLY HFNKNASCPT CHLDMKVTIP
t27054_cel.pep (87%)
• Hit the spacebar to page down through the file
• Ctrl-U moves back up a page
• At the bottom of the screen, more shows how much of
the file has been displayed
• Similar command: less
Copy & Move
• cp lets you copy a file from any directory to
any other directory, or create a copy of a file
with a new name in one directory
• cp filename.ext newfilename.ext
• cp filename.ext subdir/newname.ext
• cp /u/jdoe01/filename.ext ./subdir/newfilename.ext
• mv allows you to move files to other
directories, but it is also used to rename files.
• Filename and directory syntax for mv is exactly the same as
for the cp command.
• mv filename.ext subdir/newfilename.ext
• NOTE: When you use mv to move a file into another
directory, the current file is deleted.
Delete
• Use the command rm (remove) to delete
files
• There is no way to undo this command!!!
• We have set the server to ask if you really want
to remove each file before it is deleted.
• You must answer “Y” or else the file is not
deleted.
• But can use –f
• rm –rf
View File Permissions
• Use the ls -l command to see the permissions for all files
in a directory:
$ ls -l
total 2
-rw-r--r-- 1 jtang None 56 Feb 29 11:21 data.txt
-rwxr-xr-x 1 jtang None 33 Feb 29 11:21 test.pl
• The username of the owner is shown in the third column. (The owner
of the files listed above is jtang)
• The owner belongs to the group “None”
• The access rights for these files is shown in the first
column. This column consists of 10 characters known as
the attributes of the file: r, w, x, and r
w
x
-
indicates read permission
indicates write (and delete) permission
indicates execute (run) permission
indicates no permission for that operation
Change Protections
• Only the owner of a file can change its protections
• To change the protections on a file use the chmod
(change mode) command.
[Beware, this is a confusing command.]
• Taken all together, it looks like this:
> chmod 644 data.txt
This will set the owner to have read, write; add the permission for the group and
the world to read
600, 755, 700,
Commands for Files
• Files are used to store information, for
example, data or the results of some analysis.
• You will mostly deal with text files
• Files on the RCR Alpha are automatically backed up to tape
every night.
• cat dumps the entire contents of a file onto
the screen.
• For a long file this can be annoying, but it can also be
helpful if you want to copy and paste (use the buffer of
your telnet program)
FTP/SCP is Simple
• File Transfer Protocol is standard for all
computers on any network.
• The best way to move lots of data to and
from remote machines:
• put raw data onto the server for analysis
• get results back to the desktop for use in papers
and grants
• Graphical FTP applications for desktop PCs
• On a Mac, use Fetch, CyberDuck (!)
• On a Windows PC, use WS_FTP, FileZilla
• winscp
Some More Advanced
Linux Commands
• grep: searches a file for a specific text
pattern
• cut: copies one or more columns from a
tab-delimited text file
• wc: word count
• | : the pipe — sends output of one
command as input to the next
• > : redirect output to a file
Perl
Why Write Programs?
• Automate computer work that you do by hand save time & reduce errors
• Run the same analysis on lots of similar data files
= scale-up
• Analyze data, make decisions
• sort Blast results by e-value &/or species of best mach
• Build a pipeline
• Create new analysis methods
Why Perl?
• Fairly easy to learn the basics
• Many powerful functions for working with
text: search & extract, modify, combine
• Can control other programs
• Free and available for all operating systems
• Most popular language in bioinformatics
• Many pre-built “modules” are available
that do useful things
Get Perl
• You can install Perl on any type of
computer
• Download and install Perl on your own
computer:
www.perl.org
Programming Concepts
• Program = a text file that contains
instructions for the computer to follow
• Programming Language = a set of
commands that the computer understands
(via a “command interpreter”)
• Input = data that is given to the program
• Output = something that is produced by
the program
Programming
• Write the program (with a text editor)
• Run the program
• Look at the output
• Correct the errors (debugging)
• Repeat
(computers are VERY dumb -they do exactly
what you tell them to do, so be careful
what you ask for…)
Basic Concepts
•
•
•
•
•
Variables and Assignment
Conditions
Loop
Input/Output (I/O)
Procedures/functions
Strings
• Text is handled in Perl as a string
• This basically means that you have to put
quotes around any piece of text that is not
an actual Perl instruction.
• Perl has two kinds of quotes - single ‘
and double “
(they are different- single quote will print as is)
Print
• Perl uses the term “print” to create output
• Without a print statement, you won’t
know what your program has done
• You need to tell Perl to put a carriage
return at the end of a printed line
• Use the “\n” (newline) command
• Include the quotes
• The “\” character is called an escape - Perl
uses it a lot
Your First Perl Program
• Open a new text file
>nano prog1.pl
• Type:
#!/usr/bin/perl
#my first perl program
print "Hello world\n";
Program details
• Perl programs always start with the line:
#!/usr/bin/perl
• this tells the computer that this is a Perl program and
where to get the Perl interpreter
• All other lines that start with # are considered
comments, and are ignored by Perl
• Lines that are Perl commands end with a
;
Run your Perl program
>perl prog1.pl
[#use the perl interpreter to run your script]
>chmod 755 *.pl
[#make the file executable]
>./prog1.pl
[run it]
#!/usr/bin/perl
$DNA = 'ACGT';
# Next, we print the DNA onto the screen
print $DNA, "\n";
print '$DNA\n';
print "$DNA\n";
exit;
Numbers and Functions
• Perl handles numbers in most common formats:
456
5.6743
6.3E-26
• Mathematical functions work pretty much as you
would expect:
4+7
6*4
43-27
256/12
2/(3-5)
Do the Math
(your 2nd Perl program)
#!/usr/bin/perl
print "4+5\n";
print 4+5 , "\n";
print "4+5=" , 4+5 , "\n";
[Note: use commas to separate multiple items
in a print statement, whitespace is ignored]
Variables
• To be useful at all, a program needs to be able to
store information from one line to the next
• Perl stores information in variables
• A variable name starts with the “$” symbol, and it
can store strings or numbers
• Variables are case sensitive
• Give them sensible names
• Use the “=”sign to assign values to variables
$one_hundred = 100;
$my_sequence = "ttattagcc";
You can do Math with Variables
#!/usr/bin/perl
#put some values in variables
$sequences_analyzed = 200 ;
$new_sequences = 21 ;
#now we will do the work
$percent_new_sequences =( $new_sequences /
$sequences_analyzed) *100 ;
print "% of new sequences = " , $percent_new_sequences;
% of new sequences = 952.381
String Operations
• Strings (text) in variables can be used for some
math-like operations
• Concatenate (join) use the dot . operator
$seq1= "ACTG";
$seq2= "GGCTA";
$seq3= $seq1 . $seq2;
print $seq3;
ACTGGGCTA
#!/usr/bin/perl
# Storing DNA in a variable, and printing it out
# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Next, we print the DNA onto the screen
print $DNA;
# Finally, we'll specifically tell the program to exit.
exit;
#!/usr/bin/perl -w
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';
print "Here are the original two DNA fragments:\n\n";
print $DNA1, "\n";
print $DNA2, "\n\n";
# Using "string interpolation"
$DNA3 = "$DNA1$DNA2";
print "Here is the concatenation of the first two fragments
(version 1):\n\n";
print "$DNA3\n\n";
# An alternative way using the "dot operator":
$DNA3 = $DNA1 . $DNA2;
print “Here is the concatenation of the first two fragments
(version 2):\n\n”;
print "$DNA3\n\n";
exit;
#!/usr/bin/perl –w
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;
# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";
print "$RNA\n";
# Exit the program.
exit;
Exercises
• Create a dir named Exercises in your home
dir
• Create a folder Class1 in your Exercises dir
• Create three perl programs
• Prog2: Cancatenate three DNAs
• Prog3: Convert a DNA to one with lower cases
• A->a, C->c, G->g, T->t
• Chmod, Test and Debug
#!/usr/bin/perl -w
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "$DNA\n\n";
$revcom = reverse $DNA;
$revcom
$revcom
$revcom
$revcom
=~
=~
=~
=~
s/A/T/g;
s/T/A/g;
s/G/C/g;
s/C/G/g;
# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
#!/usr/bin/perl -w
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "$DNA\n\n";
$revcom = reverse $DNA;
# See the text for a discussion of tr///
$revcom =~ tr/ACGTacgt/TGCAtgca/;
# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
exit;
Exercise
• Change your previous program so that it
can convert to lowercases easier
More
• In Exercise, create a dir named Class2
• Using nano, create a file named
NM_021964fragment.pep
• Put some amino acid sequence into it
• Save and quit
#!/usr/bin/perl -w
# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';
# First we have to "open" the file
open(PROTEINFILE, $proteinfilename);
$protein = <PROTEINFILE>;
# Now that we've got our data, we can close the file.
close PROTEINFILE;
# Print the protein onto the screen
print "Here is the protein:\n\n";
print $protein;
exit;
More
• Using nano, add two more lines to
NM_021964fragment.pep
• Save and quit
#!/usr/bin/perl -w
$proteinfilename = 'NM_021964fragment.pep';
open(PROTEINFILE, $proteinfilename);
# First line
$protein = <PROTEINFILE>;
print “\nHere is the first line of the protein file:\n\n”;
print $protein;
# Second line
$protein = <PROTEINFILE>;
print “\nHere is the second line of the protein file:\n\n”;
print $protein;
# Third line
$protein = <PROTEINFILE>;
print “\nHere is the third line of the protein file:\n\n”;
print $protein;
close PROTEINFILE;
exit;
Exercise
• Create a file named dna.fasta
• Add two lines to this file:
• >DNA1
• ATGCGGGATGGAGCGCGC
• Write a program, open it, print the DNA
name and the sequence
• How to avoid the print of “>”?
#!/usr/bin/perl -w
# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';
# First we have to "open" the file
open(PROTEINFILE, $proteinfilename);
# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;
# Print the protein onto the screen
print @protein;
# Close the file.
close PROTEINFILE;
exit;
#!/usr/bin/perl -w
# "scalar context" and "list context"
@bases = ('A', 'C', 'G', 'T');
print "@bases\n";
$a = @bases;
print $a, "\n";
($a) = @bases;
print $a, "\n";
exit;
#!/usr/bin/perl -w
# array indexing
@bases = ('A', 'C', 'G', 'T');
print "@bases\n";
print $bases[0], "\n";
print $bases[1], "\n";
print $bases[2], "\n";
print $bases[3], "\n";
exit;
#!/usr/bin/perl -w
# array indexing
@coins = ("Quarter","Dime","Nickel");
print $coins;
print $coins[0], "\n";
exit;
#!/usr/bin/perl -w
# array indexing
@coins = qw(Quarter Dime Nickel);
print $coins[0], "\n";
exit;
#!/usr/bin/perl -w
# array indexing
@coins = qw(Quarter Dime Nickel);
$x = join('‘, @coins);
print $x;
print join(' ', @coins);
exit;
#!/usr/bin/perl -w
# array indexing
$coins = "Quarter Dime Nicke";
@y = split(' ', $coins);
print $y[0], "\n";
@y = split(',', $coins);
print $y[0];
exit;
String functions
• Chomp
• Length of a string
• Substring
#!/usr/bin/perl -w
$proteinfilename = 'NM_021964fragment.pep';
open(PROTEINFILE, $proteinfilename);
$protein = <PROTEINFILE>;
close PROTEINFILE;
$len = length $protein;
print $len, "";
exit;
#!/usr/bin/perl -w
$proteinfilename = 'NM_021964fragment.pep';
open(PROTEINFILE, $proteinfilename);
$protein = <PROTEINFILE>;
close PROTEINFILE;
chomp $protein;
$len = length $protein;
print $len, "";
exit;
#!/usr/bin/perl -w
$proteinfilename = 'NM_021964fragment.pep';
open(PROTEINFILE, $proteinfilename);
$protein = <PROTEINFILE>;
close PROTEINFILE;
chomp $protein;
$st1 = substr($protein, 0, 2);
print $st1, "";
exit;
#or substr $protein, 0, 2;
#!/usr/bin/perl -w
$proteinfilename = 'NM_021964fragment.pep';
open(PROTEINFILE, $proteinfilename);
$protein = <PROTEINFILE>;
close PROTEINFILE;
chomp $protein;
$st1 = substr($protein, 3);
print $st1, "";
exit;
#or substr $protein, 0, 2;
Exercise
• Create a DNA fasta file with one > and
three lines of sequence data
• Show those lines onto the screen
• Show the number of characters in the
sequence
• How can we show them into one line?
• Play with substr method
• Can we tell how many A in the sequence?
#!/usr/bin/perl -w
$proteinfilename = 'NM_021964fragment.pep';
unless ( open(PROTEINFILE, $proteinfilename) ) {
print "Could not open file $proteinfilename!\n";
exit;
}
while( $protein = <PROTEINFILE> ) {
print "
######
print $protein;
}
# Close the file.
close PROTEINFILE;
exit;
Here is the next line of the file:\n";
Bigger Exercise
• Create a DNA fasta file with one > and
several lines of sequence data
• Show those lines onto the screen
• Show the number of characters in the
sequence
• How can we show them into one line?
Comparison
• String comparison (are they the same, > or <)
•
•
•
•
•
•
eq (equal )
ne (not equal )
ge (greater or equal )
gt (greater than )
lt (less than )
le (less or equal )
Conditions
• if () {}
• elsif() {}
• else {}
#!/usr/bin/perl –w
$word = 'MNIDDKL';
if($word eq 'QSTVSGE') {
print "QSTVSGE\n";
}
elsif($word eq 'MRQQDMISHDEL') {
print "MRQQDMISHDEL\n";
}
elsif ( $word eq 'MNIDDKL' ) {
print "MNIDDKL-the magic word!\n";
}
else {
print "Is \”$word\“ a peptide?\n";
}
exit;