Transfer of information
The main topic of this course is transfer of information.
A month in the lab can easily save you an hour in front of the computer.
Nothing is impossible for a man who doesn’t have to do it himself.
But to err is human; to really screw things up, you need a computer.
©CMBI 2005

Transfer of information
The main topic of this course is transfer of information. In the protein world that leads to the questions:
1) From which protein can I transfer information?
2) How do I transfer what information from where to where?
Today’s answer is BLAST…

Equivalent structural positions
To know if positions in two different proteins are equivalent, we need to know both protein structures and compare them with protein structure comparison software. But by the time you have solved one or two protein structures, the four years of your PhD period are over... So we need a short-cut, and that, ladies and gentlemen, will be a sequence alignment (i.e. BLAST + ...).

Database Searching with BLAST
Database searching with BLAST involves a series of topics we will deal with today:
• Database Searching
• Sequence Alignment
• Scoring Matrices
• Significance of an alignment
and:
• BLAST, algorithm
• BLAST, parameters
• BLAST, output

Database Searching
Identify similarities between:
• your query sequence, likely with unknown structure and function
• database subject sequences with elucidated structures and functions

Database searching concept
The query sequence is compared/aligned with every subject sequence in the database. High-scoring database sequences are assumed to be evolutionarily related to the query sequence. If sequences are related by divergence from a common ancestor, they are said to be homologous. We can only transfer information between homologs. (And we will learn later that this is because structure is maintained longer during evolution than sequence.)

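The search concept above can be illustrated with a minimal Python sketch. It is not BLAST: it scores each database sequence by its best ungapped matching segment using a simple +1/-1 scheme (an assumption made for brevity; real searches use a scoring matrix and allow gaps) and then ranks the database by that score. The function names and the two toy database entries are hypothetical.

```python
def best_ungapped_score(query, subject, match=1, mismatch=-1):
    """Score of the best ungapped local segment shared by two sequences.

    Every diagonal (fixed offset between the sequences) is scanned, and a
    running score is reset to zero whenever it becomes negative.
    """
    best = 0
    for offset in range(-(len(subject) - 1), len(query)):
        running = 0
        start = max(0, offset)
        stop = min(len(query), len(subject) + offset)
        for i in range(start, stop):
            # query[i] is paired with subject[i - offset] on this diagonal.
            running += match if query[i] == subject[i - offset] else mismatch
            running = max(running, 0)
            best = max(best, running)
    return best


def search_database(query, database):
    """Compare the query with every subject and rank subjects by score."""
    scored = [(name, best_ungapped_score(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy data only: names and sequences are illustrative assumptions.
    toy_database = {
        "related": "STSATSYRTRSTHLSLMLMRI",
        "unrelated": "GGGGPPPPGGGGPPPPGGGG",
    }
    for name, score in search_database("TTSASDFRTRTTHISILLMRL", toy_database):
        print(name, score)
```

The highest-scoring entry is the candidate homolog; whether that score is actually meaningful is the question of alignment significance.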
Transfer of information
We want to be able to say things like “this serine is phosphorylated in the database protein, so in my homologous protein the corresponding serine is likely to be phosphorylated too”. That requires that the green serine and the purple serine both come from a common ancestor that was phosphorylated too. And that, in turn, requires that both serines are located at the same position in their respective structures.

Sequence alignment
TTSASDFRTRTTHISILLMRL
STSATSYRTRSTHLSLMLMRI
But this is the topic of another seminar. Today we discuss finding sequences…

Which Matrix to use?
Close relationships: low PAM, high BLOSUM
Distant relationships: high PAM, low BLOSUM
BLOSUM 80    PAM 20     More conserved
BLOSUM 62    PAM 120
BLOSUM 45    PAM 250    More variable
Often used defaults are: PAM250, BLOSUM62

Significance of alignment (1)
When is an alignment statistically significant? In other words: how different is the alignment score from the scores obtained by aligning arbitrary, unrelated sequences to the query sequence? Or: what is the probability that an alignment with this score could have arisen by chance?

Significance of alignment (2)
Database size = 20 x 10^6 amino acids
peptide    #hits
A          1 x 10^6
AP         50000
IAP        2500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015

BLAST
Question: What database sequences are most similar to (or contain the most similar regions to) my own sequence?
• BLAST finds the highest-scoring locally optimal alignments between a query sequence and all database sequences.
• Very fast algorithm
• Can be used to search extremely large databases
• Sufficiently sensitive and selective for most purposes
• Robust – the default parameters can usually be used

BLAST – Algorithm
Step 1: Read/understand user query sequence.
Step 2: Use hashing technology to select several thousand likely candidates.
Step 3: Do a real alignment between the query sequence and those likely candidates.
‘Real alignment’ is a main topic of this course.
Step 4: Present output to user.

BLAST Algorithm, Step 2
The program first looks for a series of short, highly similar fragments; it then extends these matching segments in both directions by adding residues. Residues are added until the incremental score drops below a threshold.

Basic BLAST Algorithms
Program    Query             Database
BLASTP     protein           protein
BLASTN     DNA               DNA
BLASTX     translated DNA    protein
TBLASTN    protein           translated DNA
TBLASTX    translated DNA    translated DNA

PSI-BLAST
Position-Specific Iterated BLAST
• Distant relationships are often best detected by motif or profile searches rather than pair-wise comparisons.
• PSI-BLAST first performs a BLAST search.
• PSI-BLAST uses the information from the significant BLAST alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching.
• PSI-BLAST may be iterated until no new significant alignments are found.

BLAST Input
Steps in running BLAST:
• Enter your query sequence (cut-and-paste)
• Select the database(s) you want to search
And, optionally:
• Choose output parameters
• Choose alignment parameters (scoring matrix, filters, …)
Example query:
>something
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN

BLAST Output
A low probability indicates that a match is unlikely to have arisen by chance.
A high score, or preferably ..., indicates a likely relationship.

BLAST Output
Low scores with high probabilities suggest that matches have arisen by chance.

Alignment Significance in BLAST
P-value (probability)
Relates the score for an alignment to the likelihood that it arose by chance.
The closer to zero, the greater the confidence that the hit is real.
E-value (expect value)
The number of alignments with a score at least this high that would be expected by chance in that database (e.g. if E=10, 10 matches with scores this high are expected to be found by chance). A match will be reported if its E is below the threshold. Lower E thresholds are more stringent and report fewer matches.

BLAST result: easy

BLAST result: less easy

BLAST result: very difficult

Low complexity filter
Many sequences contain repeats or stretches that consist predominantly of one type of amino acid. E.g. many nuclear proteins have a poly-asparagine tail, membrane proteins often consist mainly of hydrophobic amino acids, and many binding proteins have proline-rich stretches.
ASDFGTRGHPPPPPPPPPPP--------------NPPPPPPPPPLTSSDFRGT
are NOT homologs, but analogs.

Demo
Weather, CNCZ, and the internet permitting, a demo follows now…
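As a rough illustration of what "expected by chance" means, the Python sketch below recomputes the peptide table from the "Significance of alignment (2)" slide and then applies an E threshold to a list of hits. It assumes that all 20 amino acids are equally frequent and ignores sequence boundaries. The function names and the example hits are hypothetical, and BLAST's real E-values come from a statistical model of alignment scores, not from this simple count.

```python
def expected_chance_hits(peptide_length, database_size=20_000_000, alphabet_size=20):
    """Expected number of exact occurrences of one peptide in a random database.

    Assumes uniform residue frequencies, so each position matches with
    probability 1/alphabet_size.
    """
    return database_size / alphabet_size ** peptide_length


def report_hits(hits, e_threshold):
    """Keep only hits whose E-value is below the threshold, lowest E first."""
    kept = [(name, e) for name, e in hits if e < e_threshold]
    return sorted(kept, key=lambda hit: hit[1])


if __name__ == "__main__":
    # Peptides taken from the slide; the counts depend only on their length.
    for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
        print(f"{peptide:8s} {expected_chance_hits(len(peptide)):g}")

    # Hypothetical hits with made-up E-values, for illustration only.
    toy_hits = [("hit_a", 1e-30), ("hit_b", 0.5), ("hit_c", 12.0)]
    print(report_hits(toy_hits, e_threshold=10))
    print(report_hits(toy_hits, e_threshold=0.001))
```

Each extra residue divides the expected count by 20, which is why a seven-residue peptide is already expected far less than once in this database. Lowering the threshold from 10 to 0.001 drops hit_b as well, in line with the point that lower E thresholds report fewer matches.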