Jaap Heringa
Bioinformatica

Gathering knowledge
• Anatomy, architecture: Rembrandt, 1632
• Dynamics, mechanics: Newton, 1726
• Informatics (Cybernetics – Wiener, 1948). Cybernetics has been defined as the science of control in machines and animals.
• Genomics, bioinformatics

Bioinformatics: “The best of many worlds”
Chemistry, Biology, Molecular biology, Mathematics, Statistics, Computer Science, Informatics, Medicine, Physics

Bioinformatics
We are good at recognising anatomical/dynamical patterns, but not at dealing with informational patterns.
2d-3d, crossing street, bumbles, eye dynamics/information

Bioinformatics
• “Studying informational processes in biological systems” (Hogeweg, Utrecht; early 1970s)
• “Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)
• Applying algorithms with mathematical formalisms in biology (genomics)
• Started in the USA, but now practised everywhere
• Taking care of the computational infrastructure and data management

Bioinformatics
• Offers an ever more essential input to
  – Molecular Biology
  – Pharmacology (drug design)
  – Agriculture
  – Biotechnology
  – Clinical medicine
  – Anthropology
  – Forensic science
  – Chemical industries (detergent industries, etc.)

The Big Bang for Bioinformatics: The Human Genome -- 26 June 2000
• Dr Craig Venter, Celera Genomics -- Shotgun method
• Sir John Sulston, Human Genome Project

The Human Genome
[Figure: excerpt of human genomic DNA sequence in lower-case letters, printed in blocks of ten bases with position numbers in the margin (60 to 1080, last base 1089); runs of n mark undetermined bases.]

Genomics
DNA compositional biases
• Base composition of genomes:
  – E. coli: 25% A, 25% C, 25% G, 25% T
  – P. falciparum (malaria parasite): 82% A+T
• Translation initiation:
  – ATG is the near-universal motif indicating the start of translation in a DNA coding sequence.
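As a rough illustration of the two points on this slide, the sketch below counts base frequencies, lists the positions of ATG, and translates codons with the standard genetic code. The function names are hypothetical, the input is the short DNA string from the “DNA makes RNA makes Protein” slide, and reading only the forward strand from position 0 is a simplifying assumption.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def base_composition(dna):
    """Return the fraction of A, C, G and T, ignoring other symbols such as 'n'."""
    counts = Counter(base for base in dna.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def find_start_codons(dna):
    """Return the 0-based positions of every ATG on the forward strand."""
    dna = dna.upper()
    return [i for i in range(len(dna) - 2) if dna[i:i + 3] == "ATG"]


def translate(dna):
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' = codon with unknown base
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    example = "CCTGAGCCAACTATTGATGAA"  # example string taken from the slides
    print(base_composition(example))
    print(find_start_codons(example))
    print(translate(example))
```

For this string the only ATG sits at 0-based position 16, which is out of frame relative to position 0, so finding the motif alone does not fix the reading frame; translating from position 0 gives PEPTIDE, as on the slide.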
A gene codes for a protein
“DNA makes RNA makes Protein”
• DNA: the genome contains genes (genetic blueprint)
    CCTGAGCCAACTATTGATGAA
  transcription
• mRNA: genes are expressed into mRNA
    CCUGAGCCAACUAUUGAUGAA
  translation
• Protein: mRNA is translated into protein
    PEPTIDE
  Proteins perform cellular functions (doers in the cell)

Human genome -- a few facts
• Human genome contains about 30K genes
• DNA in each cell comprises ~$3 \times 10^{9}$ base pairs
• Human body contains ~$3.5 \times 10^{12}$ cells
• DNA between different people varies by only 0.2% or less. So, only 2 letters in 1000 are expected to be different. Over the whole genome, this means that about 5-6 million letters would differ between individuals.
• Large part of DNA not expressed (“junk/nonsense DNA”)

Humans have spliced genes…
• Eukaryotes: expressed DNA stretches are called exons, which are interrupted by introns
[Figure: DNA makes RNA makes Protein]

Some further facts about human genes
• Comprise about 3% of the genome
• Average gene length: ~8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% of genes have a single exon
• Some exons can be as small as 1 or 3 bp.
• HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising <2% of a 67,000 bp gene

Genomic Data Sources
• DNA/protein sequence data (more than 80 genomes)
• Expression (microarray) data
• Proteome (xray, NMR, mass spectrometry)
• Metabolome
• Physiome (spatial, temporal)
• Protein interaction data
Structural/Functional Genomics
Integrative Bioinformatics

Genetic diseases
• Many diseases run in families and are a result of genes which predispose such family members to these illnesses
• Examples are Alzheimer’s disease, cystic fibrosis (CF), breast or colon cancer, or heart diseases.
• Some of these diseases can be caused by a problem within a single gene, such as with CF.

Genetic diseases (Cont.)
• For other illnesses, like heart disease, at least 20-30 genes are thought to play a part, and it is still unknown which combination of problems within which genes is responsible.
• A “problem” within a gene means that a single nucleotide, or a combination of nucleotides, within the gene causes the disease (or keeps the body from fighting the disease sufficiently).
• Persons with different combinations of these nucleotides could then be unaffected by these diseases.

Genetic diseases (Cont.)
Cystic Fibrosis
• Known since very early on (“Celtic gene”)
• Inherited autosomal recessive condition (Chr. 7)
• Symptoms:
  – Clogging and infection of lungs (early death)
  – Intestinal obstruction
  – Reduced fertility and (male) anatomical anomalies
• CF gene CFTR has a 3-bp deletion leading to Del508 (Phe) in the 1480 aa protein (epithelial Cl- channel)
  – protein degraded in ER instead of inserted into cell membrane

DNA makes RNA makes Protein: Expression data
• More copies of mRNA for a gene lead to more protein
• mRNA can now be measured for all the genes in a cell at once through microarray technology
• Can have 60,000 spots (genes) on a single gene chip
• Colour change gives intensity of gene expression (over- or under-expression)

cDNA microarrays
Compare the genetic expression in two samples of cells
• PRINT: cDNA clones; cDNA from one gene on each spot (robotic printing)
• SAMPLES: cDNA labelled red/green with fluorescent dyes, e.g. treatment / control, normal / tumor tissue
• HYBRIDIZE: add equal amounts of labelled cDNA samples to microarray
• SCAN: laser and detector; the detector measures the ratio of induced fluorescence of the two samples

Metabolic networks
Glycolysis and Gluconeogenesis
Kegg database (Japan)

Data explode, for example:
Protein Data Bank (PDB): 14500 protein 3D structures (10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...)
Dickerson’s formula: equivalent to Moore’s law
$n = e^{0.19(y-1960)}$, with y the year.
On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)!

Not only data explode: computations can explode as well
• Many problems are NP-complete (NP: nondeterministic polynomial time): no polynomial-time algorithm is known, so the computer time of exact methods grows exponentially with data size
• We often need to reformulate the problem to make it tractable
• Or use heuristics (clever rules of thumb) to reduce computations

Bioinformatics grand challenges
• Understanding (multi)cellular functioning in terms of genomic data:
  – Protein folding problem (IBM)
  – Complex diseases (cancer, heart disease)
  – Integrating genomic data
  – Predicting functions and interactions of all proteins

Protein folding problem
Protein structure hierarchical levels
• PRIMARY STRUCTURE (amino acid sequence)
    MTSPQAVLFKTGGVLRKAID
    VHLTPEEKSAVTALWGKVNVD
    EVGGEALGRLLVVYPWTQRF
    FESFGDLSTPDAVMGNPKVKA
    HGKKVLGAFSDGLAHLDNLKG
    TFATLSELHCDKLHVDPENFR
    LLGNVLVCVLAHHFGKEFTPP
    VQAAYQKVVAGVANALAHKYH
• SECONDARY STRUCTURE (helices, strands)
• TERTIARY STRUCTURE (fold)
• QUATERNARY STRUCTURE (oligomers)

Protein folding problem
[Figure: sequence folding into a fold with N and C termini and an active/binding site]
• With only 2 angles per amino acid: a protein of 100 amino acids has $2^{100}$ possible folds!
• Best bet is homology modelling

Structural domain organisation can be nasty…
The DEATH Domain
• Present in a variety of Eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
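The folding count quoted on the protein folding slide (two states per residue, 100 residues) can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function names are hypothetical, and the rate of $10^{12}$ conformations evaluated per second is an assumption chosen for illustration, not a figure from the slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def fold_count(residues, states_per_residue=2):
    """Number of conformations if every residue independently takes one of a few states."""
    return states_per_residue ** residues


def years_to_enumerate(count, rate_per_second):
    """Years needed to visit `count` conformations at a fixed evaluation rate."""
    return count / rate_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    folds = fold_count(100)
    print(f"{folds:.3e} candidate folds")
    # ASSUMPTION: a hypothetical machine evaluating 1e12 conformations per second.
    print(f"{years_to_enumerate(folds, 1e12):.1e} years to enumerate them all")
```

Even at that assumed rate, exhaustive enumeration is out of reach, which is why the slides point to reformulation, heuristics and homology modelling instead.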
Pyruvate kinase (Phosphotransferase)
• β barrel regulatory domain
• α/β barrel catalytic substrate binding domain
• α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
http://www.mshri.on.ca/pawson

Bioinformatics tool
Data → Algorithm (tool) → Biological Interpretation (model)
Tool components:
• Scoring function (‘biology’, most important)
  – Metric, objective function, model (containing the biology)
• Search method
  – Optimisation
Search methods:
• DP
• GA
• HMM
• MC
• Simulated Annealing
• MCMC
• SVM

Bioinformatics
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky)
“Nothing in bioinformatics makes sense except in the light of Biology”

Pattern recognition
Some are easy to describe, others not
• Visual patterns (colour in RGB mode)
• Audio patterns (musical scores)
• Knitting patterns
• Taste: cooking recipes
• Smell:
Biological patterns are often not easy to recognise

Multivariate statistics – Cluster analysis
• Raw table (objects 1 to 5 against variables C1, C2, C3, C4, C5, C6, ..)
• Similarity criterion → Scores: similarity matrix 5×5
• Cluster criterion → Phylogenetic tree

Example: Divergent evolution
Ancestral sequence: T D W V I K
• (I→L mutation and insertion): T D W V T A L K
• (V→L mutation): T D W L I K
Pair-wise alignment:
    T D W V T A L K
    T D W L - - I K

Sequence alignment: How to do it?
Pair-wise alignment:
    T D W V T A L K
    T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^{2}} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
2 sequences of 300 a.a.: ~$10^{88}$ alignments
2 sequences of 1000 a.a.: ~$10^{600}$ alignments!

Solution: Pair-wise sequence alignment
Global dynamic programming (more than just string matching – guaranteed optimal alignment)
• Sequences: MDAGSTVILCFVG and MDAASTILCGS
• Evolution: 20×20 Amino Acid Exchange Matrix
• Search matrix
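The global dynamic programming idea named in these slides can be sketched as follows. This is a minimal Needleman-Wunsch-style implementation under simplifying assumptions that are not taken from the slides: residues score +1 for identity and -1 otherwise instead of using a 20×20 exchange matrix, and each gap position costs a flat -2 instead of separate open and extension penalties. The function name and parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, row_a, row_b = needleman_wunsch("TDWVTALK", "TDWLIK")
    print(best)
    print(row_a)
    print(row_b)
```

The table has (n+1)(m+1) cells and each is filled in constant time, so the optimum is found without enumerating the astronomically many candidate alignments. Several alignments can share the optimal score, so which one the traceback prints depends on its tie-breaking order.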
Global dynamic programming
Alignment:
    MDAGSTVILCFVG
    MDAAST-ILC--GS
Parameters: gap penalties (open, extension)

Integrative Bioinformatics Institute VU (IBIVU)
• Integrating data sources, integrating methods
• Integrating data through methods
• Making new tools to analyse the genomic data (integrative data mining) and predict cellular and molecular features, including for example:
  – Structure, function and interaction of proteins
  – Signalling and metabolic networks
  – Complex diseases

Bioinformatics @ VU
• New genomics data is being collected (pharmacogenomics, VUMC microarray)
• Strong biology groups (neural biology, metabolome, metabolic control)
• Great computational groups (HTC, Visualisation, IC Video wall, Machine learning, Computational intelligence)
• Very good mathematical groups (Statistics, Stochastics)
• You!

Bioinformatics @ VU
• Combine many areas such as mathematics (statistics), computer science (machine learning, high-throughput computing), molecular biology, medicine, etc.
• Analyse and predict molecular features
• Make advanced methods and websites
• Do you dare?

Bioinformatics teaching @ VU
• “Medische Natuurwetenschappen (MNW)” 2nd year: Introduction to Bioinformatics
• New 2-Year Masters Course: mixture of courses and practical projects
• Developing diverse set of courses
• Diverse palette of 3/6/9/12-month projects
• Student gets mentor for flexible guidance