What is computational biology?
Mona Singh
Genome
• The entire hereditary information
content of an organism
DNA
• String over 4 letter alphabet A, T, G, C
• An organism’s genome is distributed over
chromosomes (e.g., 46 chromosomes in
human: 22 pairs plus X and Y)
• Genome size: number of base pairs in an
organism’s genome
Genome Sizes
Organism        Genome size
Human           3 billion bps
Mouse           3 billion bps
Fruit fly       165 million bps
Nematode worm   97 million bps
Yeast           15 million bps
E. coli         5 million bps
~ 400 genomes sequenced
How are genomes sequenced?
• Can only sequence a few hundred base
pairs at a time
• Make many copies of the DNA and cut into
smaller (overlapping) pieces
• Assemble pieces: certain substrings occur
in multiple fragments
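The assembly step can be illustrated with a minimal Python sketch. It assumes a greedy strategy (repeatedly merge the two fragments with the longest suffix/prefix overlap), which is only one simple way to do this and is not taken from the slides. The function names, the `min_len` threshold, and the toy fragments are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Greedily merge the pair of fragments with the largest overlap.

    Toy sketch: ignores sequencing errors, reverse complements, repeats,
    and fragments fully contained in other fragments.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; leave pieces unmerged
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Made-up overlapping fragments of a made-up 13-letter sequence.
pieces = ["ATGCCTT", "CCTTACG", "TACGGCA"]
print(greedy_assemble(pieces))
```

Real assemblers must also cope with read errors and repeated regions, where the greedy choice can be wrong.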
Genomes to Life
ATGCCTTAC
GTACCCTGC
GGCAGCACT
Genome
?
• Portions of DNA code for genes, which
carry the information for making proteins
• Proteins play key roles in most biological
processes (e.g., signaling, catalysis,
immune response, etc.)
Gene Finding
gucgcuaccauuaccaguuggucuggugucaaaaauaauaau
aaccgggcaggccaugucugcccguauuucgcguaaggaaau
ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg
guuuauucuuuuucuuuuacuuuuuuaucaugggagccuacu
ucccguuuuucccgauuuggcuacaugacaucaaccauauca
gcaaaagugauacggguauuauuuuugccgcuauuucucugu
ucucgcuauuauuccaaccgcuguuuggucugcuuucugaca
aacucgggcugcgcaaauaccugcuguggauuauuaccggca
uguuagugauguuugcgccguucuuuauuuuuaucuucgggc
cacuguuacaauacaacauuuuaguaggaucgauuguuggug
guauuuaucuaggcuuuuguuuuaacgccggugcgccagcag
uagaggcauuuauugagaaagucagccgucgcaguaauuucg
aauuuggucgcgcgcggauguuuggcuguguuggcugggcgc
ugugugccucgauugucggcaucauguucaccaucaauaauc
aguuuguuuucuggcugggcucuggcugugcacucauccucg
ccguuuuacucuuuuucgccaaaacggaugcgcccucuucug
ccacgguugccaaugcgguaggugccaaccauucggcauuua
gccuuaagcuggcacuggaacuguucagacagccaaaacugu
gguuuuugucacuguauguuauuggcguuuccugcaccuacg
auGuuuuugaccaacaguuugcuaauuucuuuacuucguucu
gucaggugaa...gcaaucaaugucggaugcggcgcgacgcu
MYYLKNTNFWMFGLFFFFYFFIMGAY
FPFFPIWLHDINHISKSDTGIIFAAI
SLFSLLFQPLFGLLSDKLGLRKYLLW
IITGMLVMFAPFFIFIFGPLLQYNIL
VGSIVGGIYLGFCFNAGAPAVEAFIE
KVSRRSNFEFGRARMFGCVGWALCAS
IVGIMFTINNQFVFWLGSGCALILAV
LLFFAKTDAPSSATVANAVGANHSAF
SLKLALELFRQPKLWFLSLYVIGVSC
TYDVFDQQFANFFTSFFATGEQGTRV
FGYVTTMGELLNASIMFFAPLIINRI
GGKNALLLAGTIMSVRIIGSSFATSA
LEVVILKTLHMFEVPFLLVGCFKYIT
The Genetic Code
AUG = methionine/start
UUA = Leucine
UUG = Leucine
UAA = Stop
UAG = Stop
UGA = Stop
...
(Stryer, Biochemistry)
Gene Finding
gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgg
gcaggccaugucugcccguauuucgcguaaggaaauccauuauguacu
auuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuu
acuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuac
augacaucaaccauaucagcaaaagugauacggguauuauuuuugccg
cuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuu
cugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggca
uguuagugauguuugcgccguucuuuauuuuuaucuucgggccacugu
uacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuag
gcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugaga
aagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuug
gcuguguuggcugggcgcugugugccucgauugucggcaucauguuca
ccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacuca
uccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucug
ccacgguugccaaugcgguaggugccaaccauucggcauuuagccuua
agcuggcacuggaacuguucagacagccaaaacugugguuuuugucac
uguauguuauuggcguuuccugcaccuacgauguuuuugaccaacagu
uugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugu
cggaugcggcgcgacgcu
Gene Finding
Reading off from 1st start triplet:
  aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ...
Translating (3 letter amino acid code):
  Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ...
(1 letter code):
  M S A R I S R K E I H Y V L F K ...
Actual protein sequence:
  M Y Y L K N T N F W M F G L F F ...
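The reading-off and translating steps can be sketched in Python as follows. The codon table is the standard genetic code, built from the usual UCAG ordering. The function names are hypothetical, and the input is the first three lines of the RNA sequence shown on the earlier Gene Finding slide.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; "*" = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def start_positions(rna):
    """Indices of every AUG triplet, in any reading frame."""
    rna = rna.upper()
    return [i for i in range(len(rna) - 2) if rna[i:i + 3] == "AUG"]


def translate_from(rna, start):
    """Translate codon by codon from `start` until a stop codon or the end."""
    rna = rna.upper()
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = ("gucgcuaccauuaccaguuggucuggugucaaaaauaauaau"
       "aaccgggcaggccaugucugcccguauuucgcguaaggaaau"
       "ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg")
for start in start_positions(rna)[:2]:
    print(start, translate_from(rna, start))
```

On this input the first AUG yields a peptide beginning MSARISRKEIHYVLFK, while the second AUG yields one beginning MYYLKNTNFW, which is the start of the actual protein sequence. The first start triplet is therefore not necessarily the right one.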
Computational Gene Finding
Methods
• Statistical bias: protein coding regions “look
different”
- compare coding vs. non-coding
regions
(Hidden Markov Models, Neural Nets)
• Sequence similarity
- similar to known protein?
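The idea that coding regions "look different" can be illustrated with a small Python sketch. It assumes a plain first-order Markov chain over nucleotides, which is deliberately simpler than the Hidden Markov Models and neural nets named above, and scores a sequence by a log-odds ratio between a "coding" and a "non-coding" model. The function names, the pseudocount, and the toy training strings are all made up for illustration.

```python
import math

ALPHABET = "acgu"


def train_markov(sequences, pseudocount=1.0):
    """Estimate P(next base | previous base) with add-pseudocount smoothing."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs


def log_odds(seq, coding_model, noncoding_model):
    """Sum of log(P_coding / P_noncoding) over adjacent base pairs.

    Positive scores mean the sequence looks more like the coding examples.
    """
    return sum(
        math.log(coding_model[prev][cur] / noncoding_model[prev][cur])
        for prev, cur in zip(seq, seq[1:])
    )


# Toy training data (hypothetical): real models are trained on annotated genomes.
coding = train_markov(["gcgcgcgcgc"])
noncoding = train_markov(["auauauauau"])
print(log_odds("gcgc", coding, noncoding))
print(log_odds("auau", coding, noncoding))
```

In practice such a score would be computed over sliding windows or candidate reading frames, and usually on codon or hexamer statistics rather than single-base transitions.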
Gene finding is hard
• In some genomes, only a small portion of
genome codes for protein (needle in
haystack)
• Some genes contain introns and exons;
exons are the parts that actually encode
the protein, and exons can be short
• Have to get the precise boundaries to get
correct protein
Number of genes
Organism        Number of genes
Human           ~30,000
Mouse           ~30,000
Fruit fly       ~13,500
Nematode worm   ~19,000
Yeast           ~6,000
E. coli         ~4,000
Predicting Protein Function
MYYLKNTNFWMFGLFFFFYFFIMGAY
FPFFPIWLHDINHISKSDTGIIFAAI
SLFSLLFQPLFGLLSDKLGLRKYLLW
IITGMLVMFAPFFIFIFGPLLQYNIL
VGSIVGGIYLGFCFNAGAPAVEAFIE
KVSRRSNFEFGRARMFGCVGWALCAS
IVGIMFTINNQFVFWLGSGCALILAV
LLFFAKTDAPSSATVANAVGANHSAF
SLKLALELFRQPKLWFLSLYVIGVSC
TYDVFDQQFANFFTSFFATGEQGTRV
FGYVTTMGELLNASIMFFAPLIINRI
GGKNALLLAGTIMSVRIIGSSFATSA
LEVVILKTLHMFEVPFLLVGCFKYIT
DNA binding protein
Functions of Human Proteins
Science, 2001
Sequence similarity
Ex: cystic fibrosis gene and bacterial nickel
transport gene

CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL--------

CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV
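One simple way to quantify similarity in an alignment like this is percent identity: the fraction of aligned columns in which both sequences carry the same residue. The Python sketch below assumes the two rows are already aligned, that "-" marks a gap, and that gap columns are skipped. The function name is hypothetical, and this measure is simpler than the scoring used by database search tools.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Fraction of gap-free aligned columns where the residues match."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


# Last block of the CF / NT alignment from the slide.
cf = "CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV"
nt = "FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV"
print(round(percent_identity(cf, nt), 2))
```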
Database Searches
http://www.ncbi.nlm.nih.gov
Database Searches
Sequences producing significant alignments:                            E-Value
gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase            4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase  1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission...         3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast     3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces         4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo     1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc...         1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase...         4e-41
Protein Structure
Sequence: KETAAAKFERQHMDSSTSAASSSN…
Structure:
Proteins
Primary: amino acids
Secondary: α-helix
Tertiary: polypeptide chain
Quaternary: assembled subunits
(Lehninger, Principles of Biochemistry)
Protein Structure Prediction
• Physics-based methods
• Statistics-based methods
Statistics & Protein Structure Prediction
Given a new sequence and a library of folds, figure out
which (if any) is a good fit to the sequence.
Secondary structure prediction
• Given a protein sequence, can you tell its
secondary structure?
– E.g., LKVVAKRELVQNNQ
aaaa
bbbb aaaaaaa
a=alpha, b=beta : ~70% accuracy
(neural nets or other learning techniques)
Genome annotation
• Many other important features of DNA
– E.g., proteins bind DNA regulatory elements:
determines which genes are “on” when
• Statistical & comparative approaches for
finding them
– Motif finding
Universal phylogenetic tree
[Figure: universal phylogenetic tree with prokaryote and eukaryote branches (Woese et al.)]
Building phylogenetic trees
Use DNA (or protein)
sequences from various
organisms
e.g.,
human ATCGAGGC
mouse ATCCAGCC
yeast ATTAAGTA
Building phylogenetic trees
E.g., Distance Matrix:
        Human  Mouse  Yeast
Human     0      2      4
Mouse     2      0      4
Yeast     4      4      0
Tree:
[Figure: tree over Human, Mouse, and Yeast with numeric branch labels; layout not recoverable from the transcript]
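The distance matrix can be reproduced from the three example sequences with a short Python sketch. It assumes the distance is the number of mismatched positions between equal-length sequences (Hamming distance), which agrees with the values 2, 4, and 4 in the matrix. The function names are hypothetical, and picking the closest pair is only the first step of a distance-based tree-building method, not a full algorithm.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances, as nested dicts."""
    names = list(seqs)
    dist = {n: {m: 0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = hamming(seqs[a], seqs[b])
        dist[a][b] = d
        dist[b][a] = d
    return dist


def closest_pair(dist):
    """The pair of organisms with the smallest distance (joined first)."""
    return min(combinations(list(dist), 2), key=lambda p: dist[p[0]][p[1]])


seqs = {"Human": "ATCGAGGC", "Mouse": "ATCCAGCC", "Yeast": "ATTAAGTA"}
dist = distance_matrix(seqs)
for name, row in dist.items():
    print(name, row)
print(closest_pair(dist))
```

Human and Mouse come out as the closest pair, so a distance-based method would group them first and attach Yeast afterwards.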
Intracellular networks
[Figure: stimulus, DNA, RNA, and protein within a cell]
Network of cells
[Figure: several cells, each labeled with DNA, RNA, protein, and "fn"]
Lecture Notes
• www.cs.princeton.edu/~mona/computational_biology_notes.html