Jaap Heringa
Bioinformatica
Gathering knowledge
• Anatomy, architecture (Rembrandt, 1632)
• Dynamics, mechanics (Newton, 1726)
• Informatics (Cybernetics – Wiener, 1948)
(Cybernetics has been defined as the science of control in machines and animals)
• Genomics, bioinformatics
Bioinformatics: “The best of many worlds”
Chemistry, Biology, Molecular biology, Mathematics, Statistics, Computer Science, Informatics, Medicine, Physics
Bioinformatics
We are good at recognising
anatomical/dynamical patterns,
but not at dealing with informational
patterns
Bioinformatics
“Studying informational processes in biological systems”
(Hogeweg Utrecht; early 1970s)
“Information technology
applied to the management and
analysis of biological data”
(Attwood and Parry-Smith)
Applying algorithms with mathematical formalisms in biology (genomics)
(started in the USA, but now everywhere)
2d-3d, crossing street, bumbles, eye dynamics/information
Taking care of the computational infrastructure and data management
(everywhere)
Bioinformatics
The Big Bang for Bioinformatics:
The Human Genome -- 26 June 2000
• Offers an ever more essential input to
– Molecular Biology
– Pharmacology (drug design)
– Agriculture
– Biotechnology
– Clinical medicine
– Anthropology
– Forensic science
– Chemical industries (detergent industries, etc.)
Dr Craig Venter, Celera Genomics -- Shotgun method
Sir John Sulston, Human Genome Project
The Human Genome
cctggacctc ctgtgcaaga acatgaaaca nctgtggttc ttccttctcc tggtggcagc 60
tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg ggcccaggac tggggaagcc 120
tccagagctc aaaaccccac ttggtgacac aactcacaca tgcccacggt gcccagagcc 180
caaatcttgt gacacacctc ccccgtgccc acggtgccca gagcccaaat cttgtgacac 240
acctccccca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc 300
nnngtgccca gcacctgaac tcttgggagg accgtcagtc ttcctcttcc ccccaaaacc 360
caaggatacc cttatgattt cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag 420
ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac ggcgtggagg tgcataatgc 480
caagacaaag ctgcgggagg agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac 540
cgtcctgcac caggactggc tgaacggcaa ggagtacaag tgcaaggtct ccaacaaagc 600
cctcccagcc cccatcgaga aaaccatctc caaagccaaa ggacagcccn nnnnnnnnnn 660
nnnnnnnnnn nnnnnnnnnn nnnnngagga gatgaccaag aaccaagtca gcctgacctg 720
cctggtcaaa ggcttctacc ccagcgacat cgccgtggag tgggagagca atgggcagcc 780
ggagaacaac tacaacacca cgcctcccat gctggactcc gacggctcct tcttcctcta 840
cagcaagctc accgtggaca agagcaggtg gcagcagggg aacatcttct catgctccgt 900
gatgcatgag gctctgcaca accgctacac gcagaagagc ctctccctgt ctccgggtaa 960
atgagtgcca tggccggcaa gcccccgctc cccgggctct cggggtcgcg cgaggatgct 1020
tggcacgtac cccgtgtaca tacttcccag gcacccagca tggaaataaa gcacccagcg 1080
ctgccctgg 1089
DNA compositional biases
Genomics
• Base composition of genomes:
• E. coli: 25% A, 25% C, 25% G, 25% T
• P. falciparum (Malaria parasite): 82%A+T
• Translation initiation:
• ATG is the near universal motif indicating the
start of translation in a DNA coding sequence.
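The two observations above (base composition and the ATG start motif) can be explored with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, unknown bases such as "n" are simply ignored, and a real gene finder needs far more than a search for ATG.

```python
from collections import Counter

def base_composition(seq):
    """Return the fraction of A, C, G and T in a DNA string (other symbols ignored)."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in "ACGT"}

def at_content(seq):
    """Return the combined A+T fraction of a DNA string."""
    comp = base_composition(seq)
    return comp.get("A", 0.0) + comp.get("T", 0.0)

def find_start_codons(seq):
    """Return the 0-based positions of every ATG in a DNA string (all reading frames)."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]

if __name__ == "__main__":
    example = "cctggacctcctgtgcaagaacatgaaaca"  # first 30 bases shown on the slide
    print(base_composition(example))
    print(at_content(example))
    print(find_start_codons(example))
```

An A+T fraction far from 0.5, as quoted for P. falciparum, is what the slide calls a compositional bias.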
A gene codes for a protein
“DNA makes RNA makes Protein”
DNA: Genome contains genes (genetic blueprint)
CCTGAGCCAACTATTGATGAA
↓ transcription: genes are expressed into mRNA
mRNA
CCUGAGCCAACUAUUGAUGAA
↓ translation: mRNA is translated into protein
Protein
PEPTIDE
Proteins perform cellular functions (doers in the cell)
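The DNA → mRNA → protein example above can be reproduced in a short Python sketch. It assumes the standard genetic code and treats transcription as a simple T → U substitution on the coding strand; the function names are illustrative only.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna):
    """Simplified transcription: copy the coding strand, replacing T by U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

if __name__ == "__main__":
    mrna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(mrna)
    print(translate(mrna))
```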
Human genome -- a few facts
• Human genome contains about 30K genes
• DNA in each cell comprises ~3 × 10^9 base pairs
• Human body contains ~3.5 × 10^12 cells
• DNA between different people varies by only 0.2% or less, so only 2 letters in 1000 are expected to differ. Over the whole genome, this means that about 5-6 million letters would differ between individuals.
• A large part of the DNA is not expressed (“junk/nonsense DNA”)
• Eukaryotes: expressed DNA stretches are called exons, which are interrupted by introns
Humans have spliced genes…
DNA makes RNA makes Protein
Some further facts about human genes
• Comprise about 3% of the genome
• Average gene length: ~8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% of genes have a single exon
• Some exons can be as small as 1 or 3 bp
• HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising <2% of a 67,000 bp gene
Genomic Data Sources
• DNA/protein sequence data (more than 80 genomes)
• Expression (microarray) data
• Proteome (X-ray, NMR, mass spectrometry)
• Metabolome
• Physiome (spatial, temporal)
• Protein interaction data
Integrative Bioinformatics
Genetic diseases
• Many diseases run in families and are a result of
genes which predispose such family members to
these illnesses
• Examples are Alzheimer’s disease, cystic fibrosis
(CF), breast or colon cancer, or heart diseases.
• Some of these diseases can be caused by a problem
within a single gene, such as with CF.
Structural/Functional Genomics
Genetic diseases (Cont.)
• For other illnesses, like heart disease, at least 20-30 genes are thought to play a part, and it is still unknown which combination of problems within which genes is responsible.
• A “problem” within a gene means that a single nucleotide, or a combination of nucleotides, within the gene causes the disease (or prevents the body from fighting the disease sufficiently).
• Persons with different combinations of these nucleotides could then be unaffected by these diseases.
Genetic diseases (Cont.)
Cystic Fibrosis
• Known since very early on (“Celtic gene”)
• Inherited autosomal recessive condition (Chr. 7)
• Symptoms:
– Clogging and infection of lungs (early death)
– Intestinal obstruction
– Reduced fertility and (male) anatomical anomalies
• The CF gene CFTR has a 3-bp deletion leading to Del508 (loss of Phe) in the 1480 aa protein (an epithelial Cl- channel); the protein is degraded in the ER instead of being inserted into the cell membrane
DNA makes RNA makes Protein:
Expression data
• More copies of mRNA for a gene lead to more protein
• mRNA can now be measured for all the genes in a cell at once through microarray technology
• Can have 60,000 spots (genes) on a single gene chip
• Colour change gives intensity of gene expression (over- or under-expression)
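Over- and under-expression is commonly summarised as a log ratio of two fluorescence intensities per spot, as in two-colour cDNA microarrays. The sketch below assumes one red and one green intensity per gene. The pseudocount and the two-fold threshold are illustrative choices, not values from the slides.

```python
import math

def log2_ratio(red, green, pseudocount=1.0):
    """Log2 ratio of two channel intensities; the pseudocount avoids division by zero."""
    return math.log2((red + pseudocount) / (green + pseudocount))

def classify(ratio, threshold=1.0):
    """Label a spot; threshold=1.0 means a two-fold change (an assumed cut-off)."""
    if ratio >= threshold:
        return "over-expressed"
    if ratio <= -threshold:
        return "under-expressed"
    return "unchanged"

if __name__ == "__main__":
    # Hypothetical (red, green) intensities for three spots.
    spots = {"gene_a": (4000.0, 1000.0), "gene_b": (500.0, 520.0), "gene_c": (150.0, 1200.0)}
    for name, (red, green) in spots.items():
        ratio = log2_ratio(red, green)
        print(name, round(ratio, 2), classify(ratio))
```

Working on a log scale makes a two-fold increase and a two-fold decrease symmetric around zero.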
cDNA microarrays
Compare the genetic expression in two samples of cells
PRINT: cDNA clones are printed robotically, with cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green with fluorescent dyes, e.g. treatment / control, normal / tumor tissue
HYBRIDIZE: Add equal amounts of labelled cDNA samples to microarray.
SCAN: Laser and detector; the detector measures the ratio of induced fluorescence of the two samples
Metabolic networks
Glycolysis and Gluconeogenesis
KEGG database (Japan)
Data explode, for example:
Protein Data Bank (PDB): 14500 protein 3D structures
(10900 X-ray crystallography, 1810 NMR, 278 theoretical models, others...)
Dickerson’s formula (equivalent to Moore’s law):
$n = e^{0.19(y-1960)}$, with y the year.
On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)!
Not only data explode:
computations can explode as well
• Many problems are NP-complete (NP: nondeterministic polynomial time): no polynomial-time algorithm is known, so the computer time of exact methods grows exponentially with data size
• We often need to reformulate the problem to make it tractable
• Or use heuristics (clever rules of thumb) to reduce computations
Bioinformatics grand challenges
• Understanding (multi)cellular functioning in terms
of genomic data:
• Protein folding problem (IBM)
• Complex diseases (cancer, heart disease)
• Integrating genomic data
• Predicting functions and interactions of all
proteins
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVD
EVGGEALGRLLVVYPWTQRF
FESFGDLSTPDAVMGNPKVKA
HGKKVLGAFSDGLAHLDNLKG
TFATLSELHCDKLHVDPENFR
LLGNVLVCVLAHHFGKEFTPP
VQAAYQKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)
Protein folding problem
sequence (MTSPQAVLFKTGGVLRKAID) → fold (N- and C-termini, active/binding site)
Best bet is homology modelling
With only 2 angles per amino acid: a protein of 100 amino acids has 2^100 possible folds!
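To give a feel for the fold count quoted for a 100-residue protein (2 possibilities per amino acid, i.e. 2^100), the sketch below counts the conformations and estimates how long a brute-force enumeration would take. The sampling rate of 10^12 conformations per second is a purely hypothetical figure chosen for illustration.

```python
# Numbers taken from the slide: 2 possibilities per residue, 100 residues.
STATES_PER_RESIDUE = 2
RESIDUES = 100

def count_conformations(states, residues):
    """Number of conformations if every residue independently takes one of `states` states."""
    return states ** residues

def years_to_enumerate(n_conformations, per_second):
    """Years needed to visit every conformation at a fixed sampling rate."""
    seconds_per_year = 365.25 * 24 * 3600
    return n_conformations / per_second / seconds_per_year

if __name__ == "__main__":
    total = count_conformations(STATES_PER_RESIDUE, RESIDUES)
    # Hypothetical rate: 10^12 conformations evaluated per second.
    print(total)
    print(years_to_enumerate(total, 1e12))
```

Even under this generous assumption, exhaustive search is out of reach.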
Structural domain organisation can be nasty…
Pyruvate kinase (phosphotransferase)
• β barrel regulatory domain
• α/β barrel catalytic substrate binding domain
• α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
The DEATH Domain
• Present in a variety of eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
Bioinformatics tool
Data → Algorithm (tool) → Biological Interpretation (model)
Tool components:
• Metric, objective function (model containing biology)
• Search function
Bioinformatics tool
• Scoring function (‘biology’, most important)
– Metric, objective function, model
• Search method
– Optimisation
• DP (dynamic programming)
• GA (genetic algorithm)
• HMM (hidden Markov model)
• MC (Monte Carlo)
• Simulated Annealing
• MCMC (Markov chain Monte Carlo)
• SVM (support vector machine)
Bioinformatics
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky)
“Nothing in bioinformatics makes sense except in the light of Biology”
Pattern recognition
Some are easy to describe, others not
• Visual patterns (colour in RGB mode)
• Audio patterns (musical scores)
• Knitting patterns
• Taste: cooking recipes
• Smell:
Biological patterns are often not easy to recognise
Multivariate statistics – Cluster analysis
Raw table (rows 1-5, columns C1 C2 C3 C4 C5 C6 ..)
→ Similarity criterion → Scores → Similarity matrix (5×5)
→ Cluster criterion → Phylogenetic tree
Example: Divergent evolution
Ancestral sequence: T D W V I K
Descendant 1: T D W V T A L K (I→L mutation and insertion)
Descendant 2: T D W L I K (V→L mutation)
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Sequence alignment
How to do it?
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
Solution: Pair-wise sequence alignment
(more than just string matching – guaranteed optimal alignment)
Global dynamic programming
Sequence 1 (horizontal): MDAGSTVILCFVG
Sequence 2 (vertical): MDAASTILCGS
Evolution: 20×20 Amino Acid Exchange Matrix
Search matrix
By the estimate $2^{2n}/\sqrt{\pi n}$ for two sequences of length n:
2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
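The size of these numbers can be checked directly, since Python integers have arbitrary precision. The sketch below compares the exact binomial coefficient (2n choose n) with the approximation 2^(2n)/√(πn), working in log10 so that nothing overflows. The function names are illustrative.

```python
import math

def exact_alignment_count(n):
    """Exact value of (2n choose n) as an arbitrary-precision integer."""
    return math.comb(2 * n, n)

def log10_estimate(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        exact = exact_alignment_count(n)
        # Print n, the order of magnitude of the exact count, and the estimate.
        print(n, len(str(exact)) - 1, round(log10_estimate(n), 1))
```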
Global dynamic programming
Alignment:
MDAGSTVILCFVG
MDAAST-ILC--GS
Parameters: gap penalties (open, extension)
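A minimal sketch of global alignment by dynamic programming follows. It is deliberately simplified relative to the slides: instead of a 20×20 amino acid exchange matrix it scores only match/mismatch, and instead of separate gap-open and gap-extension penalties it uses one linear gap penalty. The function name and all parameter values are assumptions made for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    total, top, bottom = global_align("MDAGSTVILCFVG", "MDAASTILCGS")
    print(total)
    print(top)
    print(bottom)
```

The table has (n+1)(m+1) cells, so the work grows with the product of the sequence lengths rather than with the number of possible alignments. Several alignments can share the optimal score. The traceback returns one of them.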
Integrative Bioinformatics
Institute VU (IBIVU)
• Integrating data sources, integrating methods
• Integrating data through methods
• Making new tools to analyse the genomic data
(integrative data mining) and predict cellular and
molecular features, including for example:
• Structure, function and interaction of proteins
• signalling and metabolic networks
• complex diseases
Bioinformatics @ VU
• New genomics data is being collected
(pharmacogenomics, VUMC microarray)
• Strong biology groups (neural biology,
metabolome, metabolic control)
• Great computational groups (HTC,
Visualisation, IC Video wall, Machine
learning, Computational intelligence)
• Very good mathematical groups (Statistics,
Stochastics)
• You!
Bioinformatics @ VU
• Combine many areas such as mathematics
(statistics), computer science (machine
learning, high-throughput computing),
molecular biology, medicine, etc.
• Analyse and predict molecular features
• Make advanced methods and websites
• Do you dare?
Bioinformatics teaching @ VU
• “Medische Natuurwetenschappen (MNW)”, 2nd year: Introduction to Bioinformatics
• New 2-year Master's course: mixture of courses and practical projects
• Developing a diverse set of courses
• Diverse palette of 3/6/9/12-month projects
• Each student gets a mentor for flexible guidance