MBV3070 Bioinformatics

Reading list: MBV3070 - Bioinformatics

Arthur M. Lesk: Introduction to Bioinformatics. Oxford University Press, 2002. 270 pages.

In addition:
1. Tom Kristensen: Sekvenssammenstillinger (Sequence alignments). 7 pages.
2. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
3. Higgins, D.G., Thompson, J.D. and Gibson, T.J. (1996) Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383-402.
4. ??? (Gene finding)
5. ???? (Microarrays)

Course schedule

- Introduction. Sequencing. Databases. Entrez and SRS. Dot plots
- Pairwise sequence alignment
- FASTA and BLAST
- Multiple sequence alignment. ClustalW/ClustalX
- Motifs, profiles, PSI-BLAST
- Phylogeny
- Genomes. Analysis of genomic DNA. Gene finding
- Microarrays (Ola Myklebost/Ole Chr. Lindgjærde)
- Protein modelling (Vincent Eijsink)
- Protein modelling
- Protein modelling

Useful websites for MBV3070

Course home page: http://www.uio.no/studier/emner/matnat/molbio/MBV3070/v04/
Textbook home page: http://www.oup.com/uk/lesk/bioinf/

What is bioinformatics?

The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Other ways to define bioinformatics

"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information."
Fredj Tekaia, Institut Pasteur

"The use of computers to store, retrieve, analyze or predict the composition or the structure of biomolecules."
Damian Counsell, bioinformatics.org

"For the last three and a half billion years, evolution has been taking notes."
"It tries experiments. It wakes up every morning, does a little mutagenesis, changes a nucleotide here and there, and sees how it works. If it's a success, it keeps the notes. In this notebook, we have all of the information of the greatest experimental tinkerer ever."
Dr. Eric Lander, Director of the Whitehead Institute/MIT Center for Genome Research

What does this mean?

Base symbols

A   Adenine
C   Cytosine
G   Guanine
T   Thymine
U   Uracil
R   Guanine / Adenine (puRine)
Y   Cytosine / Thymine (pYrimidine)
K   Guanine / Thymine (Keto)
M   Adenine / Cytosine (aMino)
S   Guanine / Cytosine (Strong)
W   Adenine / Thymine (Weak)
B   Guanine / Thymine / Cytosine (not A)
D   Guanine / Adenine / Thymine (not C)
H   Adenine / Cytosine / Thymine (not G)
V   Guanine / Cytosine / Adenine (not T)
N   Adenine / Guanine / Cytosine / Thymine

Why ambiguous symbols?
- Sequencing instruments cannot always read the sequence unambiguously.
- Ambiguous symbols are useful in consensus sequences.

Sequence 1   aagcggtaccag
Sequence 2   aaacagcaccaa
Consensus    aarcrgyaccar

The genetic code

[Figure: the genetic code]

Amino acid symbols

A   Ala   alanine
B   Asx   aspartic acid or asparagine
C   Cys   cysteine
D   Asp   aspartic acid
E   Glu   glutamic acid
F   Phe   phenylalanine
G   Gly   glycine
H   His   histidine
I   Ile   isoleucine
K   Lys   lysine
L   Leu   leucine
M   Met   methionine
N   Asn   asparagine
P   Pro   proline
Q   Gln   glutamine
R   Arg   arginine
S   Ser   serine
T   Thr   threonine
U   Sec   selenocysteine
V   Val   valine
W   Trp   tryptophan
X   Xaa   unknown or 'other' amino acid
Y   Tyr   tyrosine
Z   Glx   glutamic acid or glutamine (or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides)

Two ways to sequence

- Shotgun sequencing: the strategy chosen by Celera for commercial sequencing of the human genome.
- Ordered sequencing (top down): the strategy used in the "public" sequencing of the genome, an international collaboration.

Top-down strategy for sequencing

[Figure: top-down sequencing strategy]

Two ways to sequence the genome

BAC to BAC Sequencing: The BAC to BAC approach first creates a crude physical map of the whole genome before sequencing the DNA. Constructing a map requires cutting the chromosomes into large pieces and figuring out the order of these big chunks of DNA before taking a closer look and sequencing all the fragments.

Whole Genome Shotgun Sequencing: The shotgun sequencing method goes straight to the job of decoding, bypassing the need for a physical map. Therefore, it is much faster.
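Returning to the ambiguity symbols: the consensus line shown earlier in this section can be reproduced mechanically from the base-symbol table. The following is a minimal sketch, assuming the sequences are already aligned, gap-free and of equal length. The names `IUPAC` and `consensus` are hypothetical helpers introduced for illustration and do not come from the course material.

```python
# Map each set of observed bases to its ambiguity symbol
# (taken from the base-symbol table; lowercase to match the slide example).
IUPAC = {
    frozenset("a"): "a", frozenset("c"): "c",
    frozenset("g"): "g", frozenset("t"): "t",
    frozenset("ag"): "r", frozenset("ct"): "y",
    frozenset("gt"): "k", frozenset("ac"): "m",
    frozenset("cg"): "s", frozenset("at"): "w",
    frozenset("cgt"): "b", frozenset("agt"): "d",
    frozenset("act"): "h", frozenset("acg"): "v",
    frozenset("acgt"): "n",
}


def consensus(*sequences):
    """Hypothetical helper: column-wise consensus of aligned, gap-free sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    lowered = (s.lower() for s in sequences)
    # zip(*...) walks the alignment one column at a time.
    return "".join(IUPAC[frozenset(column)] for column in zip(*lowered))


print(consensus("aagcggtaccag", "aaacagcaccaa"))  # aarcrgyaccar
```

Every column where the two sequences agree keeps its base, and every disagreement is replaced by the symbol covering exactly the bases observed, which is why g/a becomes r and t/c becomes y.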
The steps compared: BAC to BAC Sequencing and Whole Genome Shotgun Sequencing

1. Fragmenting the genome
2. Cloning the fragments
3. Placing the BAC clones on the map (this step is not needed in shotgun sequencing)
4. Subcloning from the BAC clones (this step is not needed in shotgun sequencing)
5. Sequencing the clones
6. Building contiguous sequences

[Figure: raw sequence from a sequencing instrument]
[Figure: assembling individual sequences into larger sequences]
[Figure: DNA sequencing 2001]

Biological databases

- Primary databases (archival): GenBank, EMBL, DDBJ, PDB
- Secondary databases (curated): PIR, SwissProt and everything else

Database Categories List
http://www3.oup.co.uk/nar/database/c/

- Genomics Databases (non-vertebrate)
- Human and other Vertebrate Genomes
- Human Genes and Diseases
- Metabolic and Signaling Pathways
- Microarray Data and other Gene Expression Databases
- Nucleotide Sequence Databases
- Other Molecular Biology Databases
- Protein sequence databases
- Proteomics Resources
- RNA sequence databases
- Structure Databases

548 databases in all, 162 more than a year earlier.

GenBank entry

    LOCUS       LISOD        756 bp    DNA             BCT       30-JUN-1993
    DEFINITION  L.ivanovii sod gene for superoxide dismutase.
    ACCESSION   X64011 S78972
    NID         g44010
    VERSION     X64011.1  GI:44010
    KEYWORDS    sod gene; superoxide dismutase.
    SOURCE      Listeria ivanovii.
      ORGANISM  Listeria ivanovii
                Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillaceae;
                Listeria.
    REFERENCE   1  (bases 1 to 756)
      AUTHORS   Haas,A. and Goebel,W.
      TITLE     Cloning of a superoxide dismutase gene from Listeria ivanovii
                by functional complementation in Escherichia coli and
                characterization of the gene product
      JOURNAL   Mol. Gen. Genet. 231 (2), 313-322 (1992)
      MEDLINE   92140371
    REFERENCE   2  (bases 1 to 756)
      AUTHORS   Kreft,J.
      TITLE     Direct Submission
      JOURNAL   Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
                Universitaet Wuerzburg, Biozentrum Am Hubland, 8700
                Wuerzburg, FRG
    FEATURES             Location/Qualifiers
         source          1..756
                         /organism="Listeria ivanovii"
                         /strain="ATCC 19119"
                         /db_xref="taxon:1638"
         RBS             95..100
                         /gene="sod"
         gene            95..746
                         /gene="sod"
         CDS             109..717
                         /gene="sod"
                         /EC_number="1.15.1.1"
                         /codon_start=1
                         /transl_table=11
                         /product="superoxide dismutase"
                         /protein_id="CAA45406.1"
                         /db_xref="SWISS-PROT:P28763"
                         /translation="MTYELPKLPYTYD…
         terminator      723..746
                         /gene="sod"
    BASE COUNT      247 a    136 c    151 g    222 t
    ORIGIN
            1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat

EMBL database entry

EMBL:TRBG361

    ID   TRBG361    standard; RNA; PLN; 1859 BP.
    XX
    AC   X56734; S46826;
    XX
    SV   X56734.1
    XX
    DT   12-SEP-1991 (Rel. 29, Created)
    DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
    XX
    DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
    XX
    KW   beta-glucosidase.
    XX
    OS   Trifolium repens (white clover)
    OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
    OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae;
    OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
    XX
    RN   [5]
    RP   1-1859
    RX   MEDLINE; 91322517.
    RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
    RT   "Nucleotide and derived amino acid sequence of the cyanogenic
    RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
    RL   Plant Mol. Biol. 17:209-219(1991).
    XX
    RN   [6]
    RP   1-1859
    RA   Hughes M.A.;
    RT   ;
    RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
    RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
    RL   UPON TYNE, NE2 4HH, UK
    XX
    DR   AGDR; X56734; X56734.
    DR   MENDEL; 11000; Trirp;1162;11000.
    DR   SWISS-PROT; P26204; BGLS_TRIRP.
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..1859
    FT                   /db_xref="taxon:3899"
    FT                   /organism="Trifolium repens"
    FT                   /tissue_type="leaves"
    FT                   /clone_lib="lambda gt10"
    FT                   /clone="TRE361"
    FT   CDS             14..1495
    FT                   /db_xref="SWISS-PROT:P26204"
    FT                   /note="non-cyanogenic"
    FT                   /EC_number="3.2.1.21"
    FT                   /product="beta-glucosidase"
    FT                   /protein_id="CAA40058.1"
    FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
    FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
    FT                   DQNMDSYRFSI….
    FT   mRNA            1..1859
    FT                   /evidence=EXPERIMENTAL
    XX
    SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
         aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
         cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
         tcggagcagt tttcctcgtg

EMBL database fields

Note that each line begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below:

    ID - identification             (begins each entry; 1 per entry)
    AC - accession number           (>=1 per entry)
    SV - new sequence identifier    (>=1 per entry)
    DT - date                       (2 per entry)
    DE - description                (>=1 per entry)
    KW - keyword                    (>=1 per entry)
    OS - organism species           (>=1 per entry)
    OC - organism classification    (>=1 per entry)
    OG - organelle                  (0 or 1 per entry)
    RN - reference number           (>=1 per entry)
    RC - reference comment          (>=0 per entry)
    RP - reference positions        (>=1 per entry)
    RX - reference cross-reference  (>=0 per entry)
    RA - reference author(s)        (>=1 per entry)
    RT - reference title            (>=1 per entry)
    RL - reference location         (>=1 per entry)
    DR - database cross-reference   (>=0 per entry)
    FH - feature table header       (0 or 2 per entry)
    FT - feature table data         (>=0 per entry)
    CC - comments or notes          (>=0 per entry)
    XX - spacer line                (many per entry)
    SQ - sequence header            (1 per entry)
    bb - (blanks) sequence data     (>=1 per entry)
    // - termination line           (ends each entry; 1 per entry)

The feature table

The overall goal of the feature table design is to provide an extensive vocabulary for describing features in a flexible framework for manipulating them. The Feature Table documentation represents the shared rules that allow the three databases to exchange data on a daily basis.

The range of features to be represented is diverse, including regions which:
- perform a biological function,
- affect or are the result of the expression of a biological function,
- interact with other molecules,
- affect replication of a sequence,
- affect or are the result of recombination of different sequences,
- are a recognizable repeated unit,
- have secondary or tertiary structure,
- exhibit variation, or
- have been revised or corrected.

Feature table terminology

The format and wording in the feature table use common biological research terminology whenever possible. For example, an item in the new feature table such as:

    Key             Location/Qualifiers
    CDS             23..400
                    /product="alcohol dehydrogenase"
                    /gene="adhI"

might be read as: The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called 'alcohol dehydrogenase' and corresponds to the gene called 'adhI'.
A more complex description:

    Key             Location/Qualifiers
    CDS             join(544..589,688..1032)
                    /product="T-cell receptor beta-chain"
                    /partial

which might be read as: This feature, which is a partial coding sequence, is formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.

Feature key examples

    Key             Description
    conflict        Separate determinations of the "same" sequence differ
    rep_origin      Origin of replication
    protein_bind    Protein binding site on DNA
    CDS             Protein-coding sequence
    misc_RNA        Generic label for an undefined RNA
    insertion_seq   Insertion element
    D-loop          Mitochondrial or other D-loop structure
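The two CDS locations used in the terminology examples, 23..400 and join(544..589,688..1032), can be read programmatically. The following is a minimal sketch, assuming only these two simple forms. The real feature table syntax also allows operators such as complement(...) and partial-end markers, which this sketch deliberately rejects. The function names `parse_location` and `total_length` are hypothetical and introduced here for illustration only.

```python
import re


def parse_location(location):
    """Hypothetical helper: return (start, end) pairs for 'a..b' or 'join(a..b,c..d)'."""
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    spans = []
    for part in text.split(","):
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            # Anything beyond plain ranges is out of scope for this sketch.
            raise ValueError(f"unsupported location: {part!r}")
        spans.append((int(match.group(1)), int(match.group(2))))
    return spans


def total_length(spans):
    """Number of bases covered; positions are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in spans)


print(parse_location("23..400"))                                 # [(23, 400)]
print(parse_location("join(544..589,688..1032)"))                # [(544, 589), (688, 1032)]
print(total_length(parse_location("join(544..589,688..1032)")))  # 391
```

Because both ends of a range are included, the joined T-cell receptor example covers 46 + 345 = 391 bases.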