Download BIOINFORMATICS AND GENE DISCOVERY

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Genomic library wikipedia , lookup

Transposable element wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Gene nomenclature wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Gene therapy wikipedia , lookup

Y chromosome wikipedia , lookup

Pathogenomics wikipedia , lookup

Minimal genome wikipedia , lookup

Gene desert wikipedia , lookup

Non-coding DNA wikipedia , lookup

Polycomb Group Proteins and Cancer wikipedia , lookup

Genetic engineering wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Frameshift mutation wikipedia , lookup

Neocentromere wikipedia , lookup

Public health genomics wikipedia , lookup

NEDD9 wikipedia , lookup

Genomics wikipedia , lookup

Gene expression profiling wikipedia , lookup

Human genome wikipedia , lookup

Primary transcript wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Epigenetics of human development wikipedia , lookup

History of genetic engineering wikipedia , lookup

X-inactivation wikipedia , lookup

Point mutation wikipedia , lookup

Gene wikipedia , lookup

Genome editing wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

RNA-Seq wikipedia , lookup

Gene expression programming wikipedia , lookup

Designer baby wikipedia , lookup

Genome evolution wikipedia , lookup

Microevolution wikipedia , lookup

Genome (book) wikipedia , lookup

Helitron (biology) wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Transcript
UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
Bioinformatics Tutorials
BIOINFORMATICS
AND
GENE DISCOVERY
Iosif Vaisman
1998
From genes to proteins
From genes to proteins
DNA
RNA
PROMOTER
ELEMENTS
TRANSCRIPTION
SPLICE
SITES
mRNA
SPLICING
START
CODON
STOP
CODON
TRANSLATION
PROTEI
From genes to proteins
Comparative Sequence Sizes
•
•
•
•
•
•
•
Yeast chromosome 3
Escherichia coli (bacterium) genome
Largest yeast chromosome now mapped
Entire yeast genome
Smallest human chromosome (Y)
Largest human chromosome (1)
Entire human genome
350,000
4,600,000
5,800,000
15,000,000
50,000,000
250,000,000
3,000,000,000
Low-resolution physical map
of chromosome 19
Chromosome 19 gene map
Computational Gene Prediction
•Where the genes are unlikely to be located?
•How do transcription factors know where to bind a region of DNA?
•Where are the transcription, splicing, and translation start and stop
signals?
•What does coding region do (and non-coding regions do not) ?
•Can we learn from examples?
•Does this sequence look familiar?
Artificial Intelligence in Biosciences
Neural Networks (NN)
Genetic Algorithms (GA)
Hidden Markov Models (HMM)
Stochastic context-free grammars (CFG)
Information Theory
0
1
1 bit
Information Theory
00
01
1 bit
11
10
1 bit
Information Theory
1 bit
1 bit
Scientific Models
Physical models -- Mathematical models
Mechanistic models
Stochastic models
Mechanism
Black box
Predictive power
Elegance
Consistency
Predictive power
Hidden Markov models
Stochastic mechanism
Neural Networks
•interconnected assembly of simple processing elements (units or nodes)
•nodes functionality is similar to that of the animal neuron
•processing ability is stored in the inter-unit connection strengths (weights)
•weights are obtained by a process of adaptation to, or learning from, a set
of training patterns
Genetic Algorithms
Search or optimization methods using simulated evolution.
Population of potential solutions is subjected to
natural selection, crossover, and mutation
choose initial population
evaluate each individual's fitness
repeat
select individuals to reproduce
mate pairs at random
apply crossover operator
apply mutation operator
evaluate each individual's fitness
until terminating condition
Crossover
Parent A
Parent B
crossover point
Child AB
Child BA
Mutation
Markov Model (or Markov Chain)
A
T
C
T
A
G
Probability for each character based only on several
preceding characters in the sequence
# of preceding characters = order of the Markov Model
Probability of a sequence
P(s) = P[A] P[A,T] P[A,T,C] P[T,C,T] P[C,T,A] P[T,A,G]
Hidden Markov Models
States -- well defined conditions
Edges -- transitions between the states
T
G
A
A
C
C
T
ATGAC
ATTAC
ACGAC
ACTAC
Each transition asigned a probability.
Probability of the sequence:
single path with the highest probability --- Viterbi path
sum of the probabilities over all paths -- Baum-Welch method
Hidden Markov Model of Biased Coin Tosses
• States (Si):
Two Biased Coins {C1, C2}
• Outputs (Oj):
Two Possible Outputs {H, T}
• p(OutputsOij):
p(C1, H), p(C1, T), p(C2, H) p(C2, T)
• Transitions:
From State X to Y {A11, A22, A12, A21}
• p(Initial Si):
p(I, C1), p(I, C2)
• p(End Si):
p(C1, E), p(C2, E)
Hidden Markov Model for Exon and Stop Codon
(VEIL Algorithm)
GRAIL gene identification program
POSSIBLE EXONS
REFINED EXON
POSITIONS
FINAL EXON
CANDIDATES
Suboptimal Solutions for the Human Growth
Hormone Gene (GeneParser)
Measures of Prediction Accuracy
Nucleotide Level
TN
FN
TP
FP
TN
FN
TP
FN TN
REALITY
PREDICTION
c
c
TP
nc
FP
Sensitivity
Sn = TP / (TP + FN)
Specificity
nc
PREDICTION
REALITY
FN
TN
Sp = TP / (TP + FP)
Measures of Prediction Accuracy
Exon Level
MISSING
EXON
WRONG
EXON
CORRECT
EXON
Sn =
number of correct exons
number of actual exons
Sp =
number of correct exons
number of predicted exons
REALITY
PREDICTION
Sensitivity
Specificity
GeneMark Accuracy Evaluation
Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html
and
http://www-hto.usc.edu/software/procrustes/fans_ref/
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene