Introduction to Bioinformatics
Class 01, October 2010

Introduction
Rasmus Wernersson, Associate Professor
Anders Gorm Pedersen, Docent
Center for Biological Sequence Analysis, DTU

Overview
• Taxonomy
• DNA Data & Databases
• Protein
• Protein structure
• Alignment
• Pairwise + Multiple Methods
• BLAST (search)
• Phylogenetic trees
• PyMOL (3D visualization)
Summary exercise: Malaria vaccine
The exercises are the main focus.
Course plan on our wiki.

Background information
On evolution and sequences

Classification: Linnaeus
Carl Linnaeus 1707-1778

Classification: Linnaeus
• Hierarchical system
  – Kingdom
  – Phylum
  – Class
  – Order
  – Family
  – Genus
  – Species

Classification depicted as a tree
No “mixed” animals
Source: www.dr.dk/oline

Classification depicted as a tree
[Figure: tree with levels labelled Species, Genus, Family, Order, Class]

Comparison of limbs
Image source: http://evolution.berkeley.edu

Theory of evolution
Charles Darwin 1809-1882

Phylogenetic basis of systematics
• Linnaeus: Ordering principle is God.
• Darwin: Ordering principle is shared descent from common ancestors.
• Today, systematics is explicitly based on phylogeny.

Natural Selection: Darwin’s four postulates
• More young are produced each generation than can survive to reproduce.
• Individuals in a population vary in their characteristics.
• Some differences among individuals are based on genetic differences.
• Individuals with favorable characteristics have higher rates of survival and reproduction.

• Evolution by means of natural selection
• Presence of “design-like” features in organisms
• Quite often features are there “for a reason”

Evolution at the sequence level

About DNA
• DNA contains the recipes for making proteins / enzymes.
• Every time a cell divides, its DNA is duplicated, and each daughter cell gets a copy.

The DNA alphabet
• The information in the DNA is written in a four-letter code: A, T, G, C.
• The DNA can be “sequenced” and the result stored in a computer file.
• ATGGCCCTGTGGAT

DNA is always written 5’ → 3’
[Figure: ribose and deoxyribose rings with carbons numbered 1 to 5; paired strands 5’ AGCC 3’ and 3’ TCGG 5’; example sequence 5’ ATGGCCAGGTAA 3’]
DNA backbone: http://en.wikipedia.org/wiki/DNA
(Deoxy)ribose: http://en.wikipedia.org/

Can DNA be changed?
• ATGGCCCTGTGGATGCG
• ATGGCCCTATGGATGCG

A history of mutations
[Figure: tree of sequences diverging over time]
ATGGCAATGTGGATGCA
ATGGCCCCGTGGAACCG
ATGTCCCCGTGGATGCG
ATGGCCCCGTGGATGCG
ATGGCCCTGTGGATGCG
ATGGCCCTGTGTATGCG

“DNA alignment”
• Species1: ATGGCAATGTGGATGCA
• Species2: ATGGCCCCGTGGAACCG
• Species3: ATGTCCCCGTGGATGCG
Pairwise differences: 6 (Species1 vs Species2), 3 (Species2 vs Species3), 5 (Species1 vs Species3)

Real life example: Alignment
• Insulin from 7 different species
Homo:   ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAA
Pan:    ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGACCCAGCCTCGGCCTTTGTGAA
Sus:    ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCCCCGGCCCAGGCCTTCGTGAA
Ovis:   ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCCCCGGCCCACGCCTTCGTCAA
Canis:  ATGGCCCTCTGGATGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCGCCCACCCGAGCCTTCGTTAA
Mus:    ATGGCCCTGTTGGTGCACTTCCTACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAACCCACCCAGGCTTTTGTCAA
Gallus: ATGGCTCTCTGGATCCGATCACTGCCTCTTCTGGCTCTCCTTGTCTTTTCTGGCCCTGGAACCAGCTATGCAGCTGCCAA

Real life example: Tree

Interpretation of Multiple Alignments
Conserved features are assumed to be important for functionality.
For instance: conserved pairs of cysteines indicate a possible disulphide bridge.

Sequences are related
• Darwin: all organisms are related through descent with modification
• Prediction: similar molecules have similar functions in different organisms
Protein synthesis is carried out by very similar RNA-containing molecular complexes (ribosomes) that are present in all known organisms.

Sequences are related, II
Related oxygen-binding proteins in humans

DNA as Biological Information
Rasmus Wernersson

Overview
• Learning objectives
  – About Biological Information
  – A note about DNA sequencing techniques and DNA data
  – File formats used for biological data
  – Introduction to the GenBank database

Information flow in biological systems

DNA sequences = summary of information
[Figure: ribose and deoxyribose rings with carbons numbered 1 to 5; paired strands 5’ AGCC 3’ and 3’ TCGG 5’; example sequence 5’ ATGGCCAGGTAA 3’]
DNA backbone: http://en.wikipedia.org/wiki/DNA
(Deoxy)ribose: http://en.wikipedia.org/

PCR
35 cycles of:
• Melting: 96 °C, 30 sec
• Annealing: ~55 °C, 30 sec
• Extension: 72 °C, 30 sec
Animation: http://depts.washington.edu/~genetics/courses/genet371b-aut99/PCR_contents.html

PCR
Animation: http://www.people.virginia.edu/~rjh9u/pcranim.html
PCR graph: http://pathmicro.med.sc.edu/pcr/realtime-home.htm

Gel electrophoresis
• DNA fragments are separated using gel electrophoresis
  – Typically 1% agarose
  – Stained with EtBr or SYBR Green (fluoresces under UV light).
  – A DNA “ladder” of fragments with known lengths is used as a size reference.
[Figure: gel with negative (-) and positive (+) electrodes]
Gel picture: http://www.pharmaceutical-technology.com/projects/roche/images/roche3.jpg
PCR setup: http://arbl.cvmbs.colostate.edu/hbooks/genetics/biotech/gels/agardna.html

The Sanger method of DNA sequencing
[Figure labels: OH, Terminator, X-ray sequencing gel]
Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf

Automated sequencing
• The major breakthrough in sequencing came through automation.
• Fluorescent dyes.
• Laser-based scanning.
• Capillary electrophoresis.
• Computer-based base-calling and assembly.
Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf

Handout exercise: “base-calling”
• Handout: Chromatogram
• Groups of 2-3.
• Tasks:
  – Identify “difficult” regions.
  – Identify “difficult” sequence stretches.
  – Try to estimate the best interval to use.

Biological data on computers
• The GenBank database
• File formats
  – FASTA
  – GenBank

NCBI GenBank
• GenBank is one of the main international DNA databases.
• GenBank is hosted by NCBI: National Center for Biotechnology Information.
• GenBank has existed since 1982.
• The database is public - no restrictions on the use of the data within.

FASTA format
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC
CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA
AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG
CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC
CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG
CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA
GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC
GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC
GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG
GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG
TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC
CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC
ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG
TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC
TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC
TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG
AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG
CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA
GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC
ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT
CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG
CACCGTCCTTACTGCCAAGTACCGTTAA
(Handout)

GenBank format
• Originates from the GenBank database.
• Contains both a DNA sequence and annotation of features (e.g. location of genes).
(Handout)

GenBank format - HEADER
LOCUS       CMGLOAD                 1185 bp    DNA     linear   VRT 18-APR-2005
DEFINITION  Cairina moschata (duck) gene for alpha-D globin.
ACCESSION   X01831
VERSION     X01831.1  GI:62724
KEYWORDS    alpha-globin; globin.
SOURCE      Cairina moschata (Muscovy duck)
  ORGANISM  Cairina moschata
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Anseriformes; Anatidae; Cairina.
REFERENCE   1  (bases 1 to 1185)
  AUTHORS   Erbil,C. and Niessing,J.
  TITLE     The primary structure of the duck alpha D-globin gene: an unusual
            5' splice junction sequence
  JOURNAL   EMBO J. 2 (8), 1339-1343 (1983)
   PUBMED   10872328
COMMENT     Data kindly reviewed (13-NOV-1985) by J. Niessing.

GenBank format - ORIGIN section
ORIGIN
        1 ctgcgtggcc tcagcccctc cacccctcca cgctgataag ataaggccag ggcgggagcg
       61 cagggtgcta taagagctcg gccccgcggg tgtctccacc acagaaaccc gtcagttgcc
      121 agcctgccac gccgctgccg ccatgctgac cgccgaggac aagaagctca tcgtgcaggt
      181 gtgggagaag gtggctggcc accaggagga attcggaagt gaagctctgc agaggtgtgg
      241 gctgggccca gggggcactc acagggtggg cagcagggag caggagccct gcagcgggtg
      301 tgggctggga cccagagcgc cacggggtgc gggctgagat gggcaaagca gcagggcacc
      361 aaaactgact ggcctcgctc cggcaggatg ttcctcgcct acccccagac caagacctac
      421 ttcccccact tcgacctgca tcccggctct gaacaggtcc gtggccatgg caagaaagtg
      481 gcggctgccc tgggcaatgc cgtgaagagc ctggacaacc tcagccaggc cctgtctgag
      541 ctcagcaacc tgcatgccta caacctgcgt gttgaccctg tcaacttcaa ggcaagcggg
      601 gactagggtc cttgggtctg ggggtctgag ggtgtggggt gcagggtctg ggggtccagg
      661 ggtctgagtt tcctggggtc tggcagtcct gggggctgag ggccagggtc ctgtggtctt
      721 gggtaccagg gtcctggggg ccagcagcca gacagcaggg gctgggattg catctgggat
      781 gtgggccaga ggctgggatt gtgtttggaa tgggagctgg gcaggggcta gggccagggt
      841 gggggactca gggcctcagg gggactcggg gggggactga gggagactca gggccatctg
      901 tccggagcag gggtactaag ccctggtttg ccttgcagct gctggcacag tgcttccagg
      961 tggtgctggc cgcacacctg ggcaaagact acagccccga gatgcatgct gcctttgaca
     1021 agttcttgtc cgccgtggct gccgtgctgg ctgaaaagta cagatgagcc actgcctgca
     1081 cccttgcacc ttcaataaag acaccattac cacagctctg tgtctgtgtg tgctgggact
     1141 gggcatcggg ggtcccaggg agggctgggt tgcttccaca catcc
//

GenBank format - FEATURE section
FEATURES             Location/Qualifiers
     source          1..1185
                     /organism="Cairina moschata"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:8855"
     CAAT_signal     20..24
     TATA_signal     69..73
     precursor_RNA   101..1114
                     /note="primary transcript"
     exon            101..234
                     /number=1
     CDS             join(143..234,387..591,939..1067)
                     /codon_start=1
                     /product="alpha D-globin"
                     /protein_id="CAA25966.2"
                     /db_xref="GI:4455876"
                     /db_xref="GOA:P02003"
                     /db_xref="InterPro:IPR000971"
                     /db_xref="InterPro:IPR002338"
                     /db_xref="InterPro:IPR002340"
                     /db_xref="InterPro:IPR009050"
                     /db_xref="UniProt/Swiss-Prot:P02003"
                     /translation="MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFP
                     HFDLHPGSEQVRGHGKKVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLA
                     QCFQVVLAAHLGKDYSPEMHAAFDKFLSAVAAVLAEKYR"
     repeat_region   227..246
                     /note="direct repeat 1"
     intron          235..386
                     /number=1
     repeat_region   289..309
                     /note="direct repeat 1"
     exon            387..591
                     /number=2
     intron          592..939
                     /number=2
     exon            940..1114
                     /number=3
     polyA_signal    1095..1100
     polyA_signal    1114

Exercise: GenBank
• Work in groups of 2-3 people.
• The exercise guide is linked from the course programme.
• Read the guide carefully - it contains a lot of information about GenBank.
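
As a companion to the FASTA format and “DNA alignment” slides, here is a minimal Python sketch of how FASTA-formatted text could be read and how mismatches between aligned sequences could be counted. It is not part of the course material: the function names `parse_fasta` and `count_differences` and the `EXAMPLE` text are illustrative assumptions, and the sequences are the three species from the “DNA alignment” slide written out as FASTA records.

```python
from itertools import combinations


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name -> sequence.

    A line starting with '>' opens a new record; the following lines are
    joined to form its sequence. (Hypothetical helper for illustration.)
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def count_differences(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# The three sequences from the "DNA alignment" slide, as FASTA records.
EXAMPLE = """\
>Species1
ATGGCAATGTGGATGCA
>Species2
ATGGCCCCGTGGAACCG
>Species3
ATGTCCCCGTGGATGCG
"""

sequences = parse_fasta(EXAMPLE)

# Compare every pair of sequences once.
differences = {
    (a, b): count_differences(sequences[a], sequences[b])
    for a, b in combinations(sequences, 2)
}

for (a, b), n in differences.items():
    print(f"{a} vs {b}: {n} differences")
```

Run on these three sequences, the sketch reports 6 differences for Species1 vs Species2, 5 for Species1 vs Species3 and 3 for Species2 vs Species3, which matches the numbers 6, 3 and 5 shown on the slide. Counting position by position only makes sense for sequences that are already aligned and of equal length, which is why the helper raises an error otherwise.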