Recap

Don’t forget to
- pick a paper and
- email me
See the schedule to see what’s taken
- http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html

Agenda

Questions for you (10 minutes)
Overview (40 minutes)
- chromosomes
- sequence comparison
- string matching
- alignment
Quiz (25+ minutes)

Questions for you

List two different functions performed by genes.
What is the length of the human genome?
Why is the double-helix/base-pairing so important?

Questions for you

Protein sequences are composed of a chain of what?
How many different amino acids are found in proteins?
Proteins always form in a helix shape (True or False)?

Questions that would stump Dr. B.

What is the lower limit on the length of a functional protein?
- 10-20
- 40-50
- 60-70
- 100
What is the upper limit on the length of proteins found in cells?
- 100’s
- 1000’s
- 1000000’s

Questions that would stump Dr. B.

What is the average length of a human gene?
- 300
- 3000
- 30,000
Approximately how many genes are in the human genome?
- 400
- 4000
- 40,000
- 400,000
- 4,000,000

Remember this picture?

[Figure: DNA strand with a backbone of alternating sugars and acids, joined by the base pairs A-T, T-A, G-C and A-T]

Chromosomes

DNA molecule and associated proteins
The 3,000,000,000 nucleotide human genome is divided among
- 22 pairs of autosomes and
- 1 pair of sex chromosomes
Together the 23 pairs of chromosomes carry all the hereditary information of an organism.

Chromosomes

[Figure: chromosomes]

DNA Sequence Comparison Overview

There are 3 different types of comparisons that are important
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern discovery)

Whole Genome Comparison

Problem: Exactly how similar are two different genomes?
Given a set of genomes
- which two are most similar
- which two are least similar

Whole Genome Comparison

Ranking a set of genomes based on similarity gives us clues about heredity

[Figure: pairs of the genomes G1-G5 ranked by similarity, with scores 0.99, 0.97, 0.91, 0.90, 0.80 and 0.78, next to an evolution tree over G1-G5]

Whole Genome Comparison

Solution: Design a metric that quantifies similarity
- something you can measure or
- something you can compute
that accurately quantifies similarity

Whole Genome Comparison

But what does it really mean for two genomes to be similar?
Obviously, if two genomes exactly match then they are similar
But, what’s more important
- rough, overall similarity, or
- exact, local similarity
A picture will explain

Whole Genome Comparison

Exact matching genomes

GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA
GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA

Whole Genome Comparison

Rough overall similarity

GCTTACTTAGACAAGTCGCTGATCATGCTATGCA
GCCTGACTTAGACAGTCGCTGCTCGATGCTTGCA

2 mismatched pairs
4 unmatched nucleotides

Whole Genome Comparison

Exact local similarities

TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT
CTGACTTAGACAGCTGATCGATGCTATGCAAGCT

Whole Genome Comparison

The first metric: Edit Distance
The number of edit operations needed to make the two sequences equal
Edit Distance was previously used in
- Spell checkers
- Approximate database searching

Edit Distance

3 edit operations
1. delete a symbol
2. insert a symbol
3. modify a symbol
modify = delete + insert
modify counts as two edit operations

Edit Distance

What is the edit distance between these two sequences?
Note: edit distance implies the minimum number of basic edit operations needed to make the strings equal

ERICWASABIGNERD
ERICSTILLISANERD

ERICWASABIGNERD (5 deletions)
ERICSTILLISANERD (6 deletions)

Edit Distance

ERICWASABIGNERD (15 symbols)
ERICSTILLISANERD (16 symbols)

ERICWASABIGNERD (5 deletions)
ERICSTILLISANERD (6 deletions)

Metrics
- Matches 10 / Smaller Sequence 15 ≈ 67%
- (Symbols 31 - Edits 11) / Symbols 31 ≈ 65%

Edit Distance

There are problems with edit distance
It doesn’t properly reward exact local similarity
- which is often a true sign of biological similarity
Similar organisms often share a lot of similar genes
But may have a few genes that don’t match at all
Biologists need a metric that can reflect this type of situation

Edit Distance

Another problem
Two organisms might have almost identical DNA
Except one has extra segments

Metrics
- Matches 99 / Smaller Sequence 100 = 99%
- (Symbols 250 - Edits 50) / Symbols 250 = 80%

Edit Distance

How is it possible that two metrics based on the same principle (edit distance) could produce such different results?

Metrics
- Matches 99 / Smaller Sequence 100 = 99%
- (Symbols 250 - Edits 50) / Symbols 250 = 80%

Recall

There are 3 different types of comparisons that are important
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern discovery)

Gene Search

Problem: Biologists have sequenced a brand new segment of DNA from a previously un-sequenced organism.
They want to know: is this segment a gene?
Advantage: Genes are similar across different organisms. Two organisms that perform the exact same function are likely to have nearly identical genes for it.

Gene Search

Solution: Take your newly sequenced segment
And search all the previously sequenced genomes.
Find segments (in other genomes) that highly match your segment.
Advantage:
- Other genomes are marked-up
- Segments that are known to be genes are labeled
- If your segment matches a known gene then BAM!
- You’ve found a gene in a previously un-sequenced organism.
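The edit distance arithmetic for the ERICWASABIGNERD example can be checked with a short sketch. This is not from the original slides: it is a minimal Python illustration that assumes only insert and delete are counted (so a modify costs two, as stated above), and the name `indel_distance` and the variable names are made up for this example.

```python
def indel_distance(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions and deletions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] against the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])            # symbols match: no edit
            else:
                curr.append(1 + min(prev[j],        # delete ca
                                    curr[j - 1]))   # insert cb
        prev = curr
    return prev[-1]


a, b = "ERICWASABIGNERD", "ERICSTILLISANERD"
edits = indel_distance(a, b)                 # 11 (5 deletions + 6 deletions)
symbols = len(a) + len(b)                    # 31
matches = (symbols - edits) // 2             # 10 symbols are kept in both
match_ratio = matches / min(len(a), len(b))  # matches / smaller sequence
edit_ratio = (symbols - edits) / symbols     # share of symbols left unedited

print(edits, matches, f"{match_ratio:.1%}", f"{edit_ratio:.1%}")
```

Both metrics come from the same edit count but divide by different totals, which is why they can drift far apart when one sequence is much longer than the other.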
Gene Search

Obviously, you want to search for a segment that is highly similar to your target segment.
However, this type of comparison is completely different from whole genome comparison
What is the fundamental difference?

Gene Search vs. Whole Genome Comparison

Whole genome comparison considers sequences in their entirety
- Two sequences
- Beginning to End

Gene Search vs. Whole Genome Comparison

Gene search doesn’t consider the entire search sequence when evaluating similarity
Two sequences
- Target (the segment you sequenced)
- Search Sequence (possibly a genome)

Gene Search

You want to find a sub-segment of the search sequence that highly matches the target sequence.
The entire search sequence is analyzed
But in evaluating similarity, we don’t need to consider the search sequence in its entirety
Looking for localized similarity

Gene Search

How do you even know that your newly sequenced segment is a gene?
Perhaps only part of it is a gene and the rest is junk.

Gene Search

Now, you are trying to find a portion of your segment that highly matches a portion of the search sequence.
Writing an algorithm to find such matches is hard

Gene Search

Writing such algorithms requires coordination between
1. Biologists
- Who have some clues about true biological similarity
And
2. Computer Scientists
- Who have some clues about what problems can be solved efficiently and reliably.

Recall

There are 3 different types of comparisons that are important
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern discovery)

Next Class

Motif discovery (computer science perspective)
Alignment (the technique used to measure similarity)
- Global alignment
- Local alignment
- Scoring matrices

Homework

Pick a paper! Email me.
Read pages 159-172
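Worked example for the Gene Search slides: the idea of finding a sub-segment of the search sequence that highly matches the target can be sketched as a small change to the edit distance computation. This is an illustrative assumption, not the method the course goes on to teach (alignment is covered next class). The sketch counts only insertions and deletions, lets a match start and end anywhere in the search sequence at no cost, and uses the made-up name `best_local_match`.

```python
def best_local_match(target: str, search: str) -> tuple[int, int]:
    """Return (edits, end) for the sub-segment of search closest to target.

    edits counts single-symbol insertions and deletions; end is the index in
    search just past the best-matching sub-segment.
    """
    # prev[j]: fewest edits aligning the processed prefix of target with some
    # sub-segment of search ending at position j. The first row is all zeros
    # because a match may begin anywhere in search for free.
    prev = [0] * (len(search) + 1)
    for i, ct in enumerate(target, start=1):
        curr = [i]  # target[:i] against an empty sub-segment: i deletions
        for j, cs in enumerate(search, start=1):
            best = 1 + min(prev[j], curr[j - 1])   # delete ct or insert cs
            if ct == cs:
                best = min(best, prev[j - 1])      # symbols match: no edit
            curr.append(best)
        prev = curr
    # The match may also end anywhere, so take the cheapest end position.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


# Hypothetical data: the target appears exactly inside the search sequence.
print(best_local_match("GATTACA", "CCCGATTACACCC"))  # (0, 10)
```

Unlike whole genome comparison, the symbols of the search sequence outside the best sub-segment add nothing to the score, which is the "localized similarity" the slides describe.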