Stand-alone tools
2. Practice: the ClustalX application.
   1. Download the zip file to the GMS6014 folder.
   2. Unzip the files to a folder named “clustalx”.
   3. Edit the MDM2_isoforms_5.fasta file with WordPad and save.
   4. Run the .exe file.
   5. Load the sequence file, select sequences, and perform the alignment.
   6. Write the alignment to a ps file.

Stand-alone tools
3. Command line applications: these account for a large number of high-quality, sophisticated programs.
   Practice: (install and) run standalone BLAST on your own computer.
   Pet project: searching for a potential ortholog of the oncogene MDM2 in the fruit fly genome.

Practice: Install the BLAST program (1)
1. Download the BLAST executable file and save it in a folder, such as c:\GMS6014\blast\
2. Run the installation program by double-clicking it. Inspect the folder following installation.
3. Add three more folders to your /blast directory: “/query”, “/dbs”, and “/out”.

Practice: Install the BLAST program (2)
5. Inspect the contents of the doc, data, and bin folders. Move the programs from blast\bin to the blast folder.
6. Open a command (cmd) window by typing “cmd” in the Start > Run box.
7. Go to the blast folder by typing “cd C:\GMS6014\blast”.
8. Try to run the program by typing “blastall”, and read the output.

Practice: BLAST search on your own computer
1. Download the data file from the course web page, or from Ensembl. Save it in the blast\dbs folder.
2. Start a CMD window and navigate to the C:\GMS6014\blast folder.
3. At the prompt “C:\GMS6014\blast >”, format the dataset for the program by typing:
       formatdb -i dbs\Dm.P -p T
4. Compose the query sequence and save it as “3TNF.txt” in the “blast\query\” folder.
5. Initiate the search by typing:
       blastall -p blastp -d dbs\Dm.P -i query\4_MMD2.fasta -o out\Mdm2_DmP.html -T T

What’s in a command?
       formatdb -i dbs\Dm.P -p T
   formatdb        The program: format a database for searching.
   -i dbs\Dm.P     “Feed me the input file name.”
   -p T            “Tell me: is it a protein sequence file?”
For more info, refer to the “user manual” file in the blast\doc folder.
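The two commands from the practice steps can also be assembled by a script instead of being typed by hand. The following is a minimal sketch in Python, assuming the folder layout used above. The helper names `formatdb_cmd` and `blastall_cmd` are hypothetical, and only the flags that appear in the commands above are used. The sketch builds and prints the command lines; actually running them requires the BLAST programs to be installed in the blast folder.

```python
import subprocess

def formatdb_cmd(db_file, protein=True):
    """Build the formatdb command line for a sequence file."""
    return ["formatdb", "-i", db_file, "-p", "T" if protein else "F"]

def blastall_cmd(db, query, out, program="blastp", html=True):
    """Build the blastall command line for one query file."""
    return ["blastall", "-p", program, "-d", db, "-i", query,
            "-o", out, "-T", "T" if html else "F"]

def run_in_blast_folder(cmd, folder=r"C:\GMS6014\blast"):
    """Run a command from the blast folder (assumes BLAST is installed there)."""
    return subprocess.run(cmd, cwd=folder, check=True)

# Reproduce the two command lines from the practice steps.
format_step = formatdb_cmd(r"dbs\Dm.P")
search_step = blastall_cmd(r"dbs\Dm.P", r"query\4_MMD2.fasta", r"out\Mdm2_DmP.html")
print(" ".join(format_step))
print(" ".join(search_step))
```

Keeping each command as a list of arguments, rather than one long string, avoids quoting problems when file names contain spaces.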
Advantages of Running BLAST on Your Own Machine
• Do it at any time, with no waiting in line.
• Search for multiple sequences at once.
• Search a defined data set.
• Automate BLAST analysis.
• Combine BLAST with other analyses.
• ...

BLAST is a program implemented in C/C++

    void BlastTickProc(Int4 sequence_number, BlastThrInfoPtr thr_info)
    {
        if (thr_info->tick_callback &&
            (sequence_number > (thr_info->last_db_seq + thr_info->db_incr))) {
            NlmMutexLockEx(&thr_info->callback_mutex);
            thr_info->last_db_seq += thr_info->db_incr;
            thr_info->tick_callback(sequence_number, thr_info->number_of_pos_hits);
            thr_info->last_tick = Nlm_GetSecs();
            NlmMutexUnlock(thr_info->callback_mutex);
        }
        return;
    }

    /* Sends out a message every PERIOD (i.e., 60 secs.) for the index.
       This function runs as a separate thread and only runs on a threaded platform. */

Should I care?

If you care:
1.) Data structure and algorithm

    SEQ
        char: name
        char: sequence
        int:  seq_length

    Identify the best alignment for two sequences (p69-73)
        Seq1: MA-DSV--WC..
        Seq2: MALD-IHWS..
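To make the "data structure and algorithm" point concrete, here is a small sketch in Python: a record that mirrors the SEQ fields (name, sequence, seq_length) and a textbook global alignment by dynamic programming (Needleman-Wunsch). This is not BLAST's own algorithm, which is a faster heuristic. The class name `SeqEntry`, the function name `global_alignment`, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class SeqEntry:
    """Mirrors the SEQ structure: a name and a sequence; the length is derived."""
    name: str
    sequence: str

    @property
    def seq_length(self) -> int:
        return len(self.sequence)

def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

seq1 = SeqEntry("Seq1", "MADSVWC")
seq2 = SeqEntry("Seq2", "MALDIHWS")
total, row1, row2 = global_alignment(seq1.sequence, seq2.sequence)
print(total)
print(row1)
print(row2)
```

With a different scoring scheme the best alignment can change, which is why the choice of scoring matrix matters for protein comparison.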
Programming language comparison

Translation: C/C++

    /* TRANSLATION: 3 or 6 frame translate cDNA sequences */
    //--------------------------------------------------------------------------
    #include "translation.hpp"

    int main(int argc, char **argv)
    {
        int num_seq = 0;
        char string[MAXLINE];
        DSEQ *dseq;

        infile.getline(string, MAXLINE);
        if (string[0] == '>') strncpy(dbname, string, MAXLINE);
        while (!infile.eof()) {
            dseq = Get_Lib_Seq();
            if (dseq->reverse == 0) Translation(&dseq->name[1], dseq->seq);
            else Translation(&dseq->name[1], dseq->r_seq);
            num_seq++;
            if (num_seq % 1000 == 0) {
                cout << num_seq << endl;
                cout << dseq->name << endl;
            }
            delete dseq;
        }
        infile.close();
        outfile.close();
        cout << num_seq << " translated" << endl;
        getch();
        return 0;
    }

    DSEQ* Get_Lib_Seq()
    {
        int i, n;
        char str[MAXLINE];
        DSEQ* dseq;
        n = 0;
        dseq = new DSEQ;
        strcpy(dseq->name, dbname);
        /* the excerpt on the slide ends here */

Translation: Python

    # Translation -- read from a FASTA DNA file and translate into three frames
    # Note: written against an early Biopython release; Bio.Fasta and
    # Bio.Alphabet are not part of current Biopython.
    import string
    from Bio import Fasta
    from Bio.Tools import Translate
    from Bio.Alphabet import IUPAC
    from Bio.Seq import Seq

    ifile = "S:\\Seq\\test.fasta"
    parser = Fasta.RecordParser()
    file = open(ifile)
    iterator = Fasta.Iterator(file, parser)
    cur_rec = iterator.next()
    cur_seq = Seq(cur_rec.sequence, IUPAC.unambiguous_dna)
    translator = Translate.unambiguous_dna_by_id[1]
    translator.translate(cur_seq)

Programming languages
Efficiency, Power  <->  Simplicity, Fast Dev.
• C/C++
• Perl: Bioperl
• Java: Biojava
• Python: Biopython

Observe: scripting is not that difficult
Example: Python and Biopython.
1. Simple Python scripts.
2. Batch BLAST with a Python script.

BLAST output

Questions after the BLAST search?
Questions:
• Is this an expressed gene in the fruit fly? -- Gene prediction & gene structure
• Is this the true ortholog of MDM2? -- Fundamentals of sequence comparison
• What can we learn from the comparison of sequences? -- Protein domains/motifs

BLAST output

How to measure the similarity between two sequences
Q: Which one is a better match to the query?
    Query: M A T W L
    Seq_A: M A T P P
    Seq_B: M P P W I

Judging the match using a “Scoring Matrix”
Q: Which one is a better match to the query?

    Query:  M  A  T  W  L        Query:  M  A  T  W  L
    Seq_A:  M  A  T  P  P        Seq_B:  M  P  P  W  I
    Score:  5  4  5 -4 -3        Score:  5 -1 -1 11  2
    Total:  7                    Total:  16

A “Scoring Matrix” assigns a score to each pair of amino acids

    BLOSUM-62
          A    S    T    L    I    V    K
    L    -1   -2   -2    4    3    1   -2   -4
    D   ...

BLOSUM - Blocks Substitution Matrices
Block: a very well conserved region of a protein family, performing the same (or a similar) function.

    ASLDEFL
    SALEDFL
    ASLDDYL
    ASIDEFY
    ASIDEFY
    ...

    Score(a1/a2) = 2 * log2( observed frequency of a1/a2 / predicted frequency of a1/a2 )

    AA: 6    AS: 3    SS: 0

BLOSUM - Blocks Substitution Matrices

    ASLDEFL
    ASLEDFL
    ASLDDYL
    SALEEFL
    ASLDDYL
    SALEEFL
    ...

    Score(a1/a2) > 0  when observed frequency of a1/a2 > predicted frequency of a1/a2
    Score(a1/a2) = 0  when observed frequency of a1/a2 = predicted frequency of a1/a2
    Score(a1/a2) < 0  when observed frequency of a1/a2 < predicted frequency of a1/a2

BLOSUM - Blocks Substitution Matrices

    observed frequency of L/I, e.g. 0.03
    predicted frequency of L/I, e.g. 0.1 * 0.1 = 0.01
    observed > predicted, so Score(L/I) > 0

Substitution of L / I is common in conserved sequences.

BLOSUM - Blocks Substitution Matrices

    observed frequency of L/K, e.g. 0.0002
    predicted frequency of L/K, e.g. 0.1 * 0.1 = 0.01
    observed < predicted, so Score(L/K) < 0

Substitution of L / K is rare in conserved sequences.

Scoring matrix: BLOSUM 62
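The worked example and the log-odds formula above can be checked with a few lines of Python. This is a sketch under stated assumptions: the dictionary holds only the nine pair scores that appear in the Seq_A / Seq_B example, not the full BLOSUM62 matrix, and the function names are hypothetical. The frequencies passed to `log_odds_score` are the illustrative values from the slides, not measured data.

```python
import math

# Only the pair scores used in the worked example (not the full matrix).
# Keys are stored in alphabetical order so that (a, b) and (b, a) match.
EXAMPLE_SCORES = {
    ("M", "M"): 5, ("A", "A"): 4, ("T", "T"): 5, ("W", "W"): 11,
    ("P", "W"): -4, ("L", "P"): -3, ("A", "P"): -1, ("P", "T"): -1,
    ("I", "L"): 2,
}

def pair_score(a, b, matrix=EXAMPLE_SCORES):
    """Look up the score for one aligned pair of amino acids."""
    return matrix[tuple(sorted((a, b)))]

def ungapped_score(query, subject, matrix=EXAMPLE_SCORES):
    """Sum the pair scores of two equal-length sequences, position by position."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(q, s, matrix) for q, s in zip(query, subject))

def log_odds_score(observed, predicted):
    """Score = 2 * log2(observed frequency / predicted frequency)."""
    return 2 * math.log2(observed / predicted)

print(ungapped_score("MATWL", "MATPP"))   # Seq_A total
print(ungapped_score("MATWL", "MPPWI"))   # Seq_B total
print(round(log_odds_score(0.03, 0.01)))      # L/I: observed > predicted
print(round(log_odds_score(0.0002, 0.01)))    # L/K: observed < predicted
```

Seq_B matches the query at fewer positions than Seq_A, yet it scores higher, because the matrix rewards the rare conserved W/W pair heavily and treats L/I as a favorable substitution.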