Week 4 - Computer Science

Recap

Don’t forget to
– pick a paper and
– Email me
See the schedule to see what’s taken
– http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html
Agenda


Questions for you (10 minutes)
Overview (40 minutes)
– chromosomes
– sequence comparison
– string matching
– alignment
Quiz (25+ minutes)
Questions for you

List two different functions performed by
genes.

What is the length of the human genome?

Why is the double-helix/base-pairing so
important?
Questions for you

Protein sequences are composed of a chain
of what?

How many different amino acids are found in
proteins?

Proteins always form in a helix shape (True
or False)?
Questions that would stump Dr. B.

What is the lower limit on the length of a functional
protein?
– 10-20
– 40-50
– 60-70
– 100
What is the upper limit on the length of proteins
found in cells?
– 100’s
– 1000’s
– 1000000’s
Questions that would stump Dr. B.

What is the average length of a human gene?
– 300
– 3000
– 30,000
Approximately, how many genes are in the human
genome?
– 400
– 4000
– 40,000
– 400,000
– 4,000,000
Remember this picture?
[Figure: two DNA strands with alternating Sugar and Acid backbone units, joined by the base pairs A-T, T-A, G-C, and A-T]
Chromosomes


DNA molecule and associated proteins
The 3,000,000,000 nucleotide human
genome is divided among
– 22 pairs of autosomes and
– 1 pair of sex chromosomes
Together the 23 pairs of chromosomes carry all the
hereditary information of an organism.
Chromosomes
DNA Sequence Comparison


Overview
There are 3 different types of comparisons
that are important
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern
discovery)
Whole Genome Comparison


Problem: Exactly how similar are two
different genomes?
Given a set of genomes
– which two are most similar
– which two are least similar
Whole Genome Comparison



Ranking a set of genomes based
on similarity gives us clues about
– heredity
– evolution
Similarity Rank
G2 G5 0.99
G3 G1 0.97
G4 G5 0.91
G4 G2 0.90
G4 G1 0.80
G4 G3 0.78
[Figure: diagram relating genomes G1 to G5]
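A ranking like the one on this slide could be produced roughly as follows. This is only a sketch: the names rank_pairs and fraction_identical, the toy sequences, and the position-by-position similarity measure are all assumptions made for illustration, not the metric used in the lecture.

```python
from itertools import combinations


def fraction_identical(a: str, b: str) -> float:
    """Toy similarity (an assumption): share of positions that match,
    measured against the longer of the two sequences."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def rank_pairs(genomes: dict[str, str], similarity=fraction_identical):
    """Score every pair of genomes and return them most similar first."""
    scored = [
        (n1, n2, similarity(genomes[n1], genomes[n2]))
        for n1, n2 in combinations(genomes, 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Made-up toy sequences, not real genome data.
toy = {"G1": "ACGTACGT", "G2": "ACGTACGA", "G3": "TTTTACGT"}
ranking = rank_pairs(toy)
for name1, name2, score in ranking:
    print(name1, name2, round(score, 2))
```

Any function that takes two sequences and returns a number can be passed in as similarity, so the ranking step stays the same while the metric changes.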
Whole Genome Comparison




Solution: Design a metric that quantifies
similarity
something you can measure or
something you can compute
that accurately quantifies similarity
Whole Genome Comparison



But what does it really mean for two
genomes to be similar?
Obviously, if two genomes exactly match
then they are similar
But, what’s more important
– rough, overall similarity, or
– exact, local similarity
A picture will explain
Whole Genome Comparison

Exact matching genomes
GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA
GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA
Whole Genome Comparison

Rough overall similarity
GCTTACTTAGACAAGTCGCTGATCATGCTATGCA
GCCTGACTTAGACAGTCGCTGCTCGATGCTTGCA


2 Mismatched pairs
4 unmatched nucleotides
Whole Genome Comparison

Exact local similarities
TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT
CTGACTTAGACAGCTGATCGATGCTATGCAAGCT
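One simple way to look for an exact local similarity is to find the longest run of symbols that appears unbroken in both sequences. The sketch below uses a standard longest-common-substring table; the function name is an assumption, and this is an illustration rather than the method used in the lecture.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of symbols found contiguously in both a and b."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the run by one symbol
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# The two sequences from the slide.
seq1 = "TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT"
seq2 = "CTGACTTAGACAGCTGATCGATGCTATGCAAGCT"
shared = longest_common_substring(seq1, seq2)
print(shared, len(shared))
```

The two sequences differ at both ends, yet they share a long exact run in the middle (CTTAGACAGCTGATCGATG appears in both).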
Whole Genome Comparison



The first metric: Edit Distance
The number of edit operations needed to
make the two sequences equal
Edit Distance was previously used in
– Spell checkers
– Approximate database searching
Edit Distance



3 edit operations
1. delete a symbol
2. insert a symbol
3. modify a symbol
modify = delete + insert
modify counts as two edit operations
Edit Distance





What is the edit distance between these two
sequences?
Note: edit distance implies the minimum number of
basic edit operations needed to make the strings
equal
ERICWASABIGNERD
ERICSTILLISANERD
ERICWASABIGNERD (5 deletions)
ERICSTILLISANERD (6 deletions)
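The minimum number of edits can be computed with a small dynamic-programming table. This is a sketch under the slides' rule that a modify counts as two operations (delete + insert); the function name and the modify_cost parameter are assumptions added for illustration.

```python
def edit_distance(a: str, b: str, modify_cost: int = 2) -> int:
    """Minimum number of edit operations needed to turn a into b.

    Delete and insert cost 1 each. A modify costs modify_cost, which
    defaults to 2 because the slides count it as delete + insert.
    """
    # prev[j] = cost of turning the first i-1 symbols of a into b[:j]
    prev = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, symbol_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,       # delete symbol_a
                curr[j - 1] + 1,   # insert symbol_b
                prev[j - 1] + (0 if symbol_a == symbol_b else modify_cost),
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ERICWASABIGNERD", "ERICSTILLISANERD"))  # 11
```

The result of 11 agrees with the slide: 5 deletions from the first sequence plus 6 from the second leave the same 10 symbols in both.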
Edit Distance



ERICWASABIGNERD (15 symbols)
ERICSTILLISANERD (16 symbols)
ERICWASABIGNERD (5 deletions)
ERICSTILLISANERD (6 deletions)
Metrics
– Matches 10 / Smaller Sequence 15 = 66%
– (Symbols 31 – Edits 11) / Symbols 31 = 64%
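The two metrics on this slide can be written as small functions. This is a sketch: the names match_ratio and edit_ratio are assumptions, and the second function reads the slide's formula as the share of all symbols left untouched by edits.

```python
def match_ratio(matches: int, len_a: int, len_b: int) -> float:
    """Matched symbols divided by the length of the smaller sequence."""
    return matches / min(len_a, len_b)


def edit_ratio(edits: int, len_a: int, len_b: int) -> float:
    """Share of all symbols (both sequences combined) not touched by an edit."""
    total = len_a + len_b
    return (total - edits) / total


# ERICWASABIGNERD (15 symbols) vs ERICSTILLISANERD (16 symbols):
# 10 matches and 11 edits, as on the slide.
print(f"{match_ratio(10, 15, 16):.3f}")  # 0.667
print(f"{edit_ratio(11, 15, 16):.3f}")   # 0.645
```

The slide's 66% and 64% are these values cut off after two digits.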
Edit Distance


There are problems with edit distance
It doesn’t properly reward exact local similarity
– which is often a true sign of biological similarity
Similar organisms often share a lot of similar genes
But may have a few genes that don’t match at all
Biologists need a metric that can reflect this type of
situation
Edit Distance



Another problem
Two organisms might have almost identical DNA
Except one has extra segments
Metrics
Matches 99 / Smaller Sequence 100 = 99%
(Symbols 250 – Edits 50) / Symbols 250 = 80%
Edit Distance

How is it possible that two metrics based on the
same principle (edit distance) could produce such
different results?
Metrics
Matches 99 / Smaller Sequence 100 = 99%
(Symbols 250 – Edits 50) / Symbols 250 = 80%
Recall

There are 3 different types of comparisons
that are important
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern
discovery)
Gene Search





Problem: Biologists have sequenced a brand new
segment of DNA from a previously un-sequenced
organism.
They want to know
Is this segment a gene?
Advantage: Genes are similar across different
organisms.
Two organisms that perform the same function are
likely to have nearly identical genes for it.
Gene Search





Solution:
Take your newly sequenced segment
And search all the previously sequenced genomes.
Find segments (in other genomes) that highly
match your segment.
Advantage:
– Other genomes are marked-up
– Segments that are known to be genes are labeled
– If your segment matches a known gene then BAM!
– You’ve found a gene in a previously un-sequenced
organism.
Gene Search



Obviously, you want to search for a segment that is
highly similar to your target segment.
However, this type of comparison is completely
different than whole genome comparison
What is the fundamental difference?
Gene Search vs. Whole Genome
Comparison

Whole genome comparison considers sequences in
their entirety
– Two sequences
– Beginning to End
Gene Search vs. Whole Genome
Comparison


Gene search doesn’t consider the entire search sequence
when evaluating similarity
Two sequences
– Target (the segment you sequenced)
– Search Sequence (possibly a genome)
Gene Search




You want to find a sub-segment of the search
sequence that highly matches the target
sequence.
The entire search sequence is analyzed
But in evaluating similarity, we don’t need to
consider the search sequence in its entirety
Looking for localized similarity
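The idea of scanning the entire search sequence while scoring only one sub-segment at a time can be sketched as a sliding window. This is a deliberately naive illustration: the function name and toy sequences are assumptions, and it counts position-by-position matches only, with no insertions or deletions.

```python
def best_window(target: str, search: str) -> tuple[int, int]:
    """Slide target along search and return (start, matches) for the
    sub-segment of search that agrees with target in the most positions."""
    if not target or len(target) > len(search):
        raise ValueError("target must be non-empty and no longer than search")
    best_start, best_matches = 0, -1
    for start in range(len(search) - len(target) + 1):
        window = search[start:start + len(target)]
        matches = sum(x == y for x, y in zip(target, window))
        if matches > best_matches:  # keep the first window with the top score
            best_start, best_matches = start, matches
    return best_start, best_matches


# Made-up toy data: the target occurs in the search sequence with one change.
start, matches = best_window("GATTACA", "CCCGATTTCACCC")
print(start, matches)  # 3 6
```

Every window is examined, but the score depends only on the window itself and not on the rest of the search sequence.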
Gene Search


How do you even know that your newly sequenced
segment is a gene?
Perhaps only part of it is a gene and the rest is
junk.
Gene Search


Now, you are trying to find a portion of your
segment that highly matches a portion of the
search sequence.
Writing an algorithm to find such matches is hard
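One well-known approach to this portion-against-portion problem is Smith-Waterman-style local alignment. The sketch below computes only the best local score and where it ends; the function name and the scoring values (match +2, mismatch -1, gap -1) are arbitrary assumptions for illustration, not values from the lecture.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1):
    """Return (best_score, (end_in_a, end_in_b)) over all pairs of
    sub-segments of a and b, using Smith-Waterman-style scoring."""
    best, best_end = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment start fresh anywhere, which is
            # what makes the comparison local rather than end to end.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_end = curr[j], (i, j)
        prev = curr
    return best, best_end


# Made-up toy data: only the middle ACGT is shared.
score, end = local_alignment_score("XXXACGTYYY", "ZZACGTZZ")
print(score, end)  # 8 (7, 6)
```

The non-matching ends of both sequences do not lower the score, so a portion of one sequence can match a portion of the other.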
Gene Search

Writing such algorithms requires coordination
between
1. Biologists
– Who have some clues about true biological similarity
2. And Computer Scientists
– Who have some clues about what problems can be solved
efficiently and reliably.
Recall

There are 3 different types of comparisons
that are important
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern
discovery)
Next Class


Motif discovery (computer science
perspective)
Alignment (the technique used to measure
similarity)
– Global alignment
– Local alignment
– Scoring matrices
Homework


Pick a paper! Email me.
Read pages 159-172