MBV3070
Bioinformatics
Reading list for MBV3070 - Bioinformatics
Arthur M. Lesk: Introduction to Bioinformatics. Oxford University Press 2002. 270 pages.
In addition:
1. Tom Kristensen: Sekvenssammenstillinger (Sequence alignments). 7 pages.
2. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
3. Higgins, D.G., Thompson, J.D. and Gibson, T.J. (1996) Using CLUSTAL for multiple sequence alignments. Methods Enzymol., 266:383-402.
4. ??? (Gene finding)
5. ???? (Microarrays)
Schedule
• Introduction. Sequencing.
• Databases. Entrez and SRS. Dotplots
• Pairwise sequence alignment
• FASTA and BLAST
• Multiple sequence alignment. ClustalW/ClustalX
• Motifs, profiles, PSI-BLAST
• Phylogeny
• Genomes. Analysis of genomic DNA. Gene finding
• Microarrays (Ola Myklebost/Ole Chr. Lindgjærde)
• Protein modelling (Vincent Eijsink)
• Protein modelling
• Protein modelling
Useful websites for MBV3070
• Course home page: http://www.uio.no/studier/emner/matnat/molbio/MBV3070/v04/
• Textbook home page: http://www.oup.com/uk/lesk/bioinf/
What is bioinformatics?
The NIH Biomedical Information Science and Technology
Initiative Consortium agreed on the following definitions of
bioinformatics and computational biology recognizing that no
definition could completely eliminate overlap with other
activities or preclude variations in interpretation by different
individuals and organizations.
Bioinformatics: Research, development, or application of
computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize, archive, analyze, or visualize such
data.
Computational Biology: The development and application of
data-analytical and theoretical methods, mathematical
modeling and computational simulation techniques to the
study of biological, behavioral, and social systems.
Other ways of defining bioinformatics
"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." Fredj Tekaia, Institut Pasteur
"The use of computers to store, retrieve, analyze or predict the composition or the structure of biomolecules." Damian Counsell, bioinformatics.org
"For the last three and a half billion years, evolution has been taking notes."
"It tries experiments. It wakes up every morning, does a little mutagenesis, changes a nucleotide here and there, and sees how it works. If it's a success, it keeps the notes. In this notebook, we have all of the information of the greatest experimental tinkerer ever."
Dr. Eric Lander
Director of the Whitehead Institute/MIT Center for Genome Research
What does this mean?
Base symbols
Symbol  Meaning
A       Adenine
C       Cytosine
G       Guanine
T       Thymine
U       Uracil
R       Guanine / Adenine (puRine)
Y       Cytosine / Thymine (pYrimidine)
K       Guanine / Thymine (Keto)
M       Adenine / Cytosine (aMino)
S       Guanine / Cytosine (Strong)
W       Adenine / Thymine (Weak)
B       Guanine / Thymine / Cytosine (not A)
D       Guanine / Adenine / Thymine (not C)
H       Adenine / Cytosine / Thymine (not G)
V       Guanine / Cytosine / Adenine (not T)
N       Adenine / Guanine / Cytosine / Thymine
Why ambiguous symbols?
• Sequencing instruments cannot always read the sequence unambiguously
• Ambiguous symbols are useful in consensus sequences
Sequence 1   aagcggtaccag
Sequence 2   aaacagcaccaa
Consensus    aarcrgyaccar
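The consensus line above can be reproduced mechanically: at each aligned position, collect the set of bases observed and look up the symbol that stands for exactly that set. The following is a minimal sketch in Python, assuming the sequences are already aligned and contain only a, c, g and t (no gaps); the names `IUPAC` and `consensus` are illustrative, not taken from the course material.

```python
# Ambiguity symbols keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(sequences):
    """Return the ambiguity-code consensus of equal-length aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    symbols = []
    for column in zip(*sequences):
        observed = frozenset(base.upper() for base in column)
        symbols.append(IUPAC[observed].lower())
    return "".join(symbols)


print(consensus(["aagcggtaccag", "aaacagcaccaa"]))  # aarcrgyaccar
```

Positions where both sequences agree keep the plain base; positions with g/a become r and the position with t/c becomes y, as on the slide.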
The genetic code
Amino acid symbols
A  Ala  alanine
B  Asx  aspartic acid or asparagine
C  Cys  cysteine
D  Asp  aspartic acid
E  Glu  glutamic acid
F  Phe  phenylalanine
G  Gly  glycine
H  His  histidine
I  Ile  isoleucine
K  Lys  lysine
L  Leu  leucine
M  Met  methionine
N  Asn  asparagine
P  Pro  proline
Q  Gln  glutamine
R  Arg  arginine
S  Ser  serine
T  Thr  threonine
U  Sec  selenocysteine
V  Val  valine
W  Trp  tryptophan
X  Xaa  unknown or 'other' amino acid
Y  Tyr  tyrosine
Z  Glx  glutamic acid or glutamine (or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides)
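As a small illustration of how the two notations in the table relate, the sketch below converts a list of three-letter codes to the one-letter string used in sequence databases. The mapping is copied from the table; the names `THREE_TO_ONE` and `to_one_letter` are made up for this example.

```python
# Three-letter to one-letter amino acid codes, as listed in the table above.
THREE_TO_ONE = {
    "Ala": "A", "Asx": "B", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L", "Met": "M",
    "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R", "Ser": "S", "Thr": "T",
    "Sec": "U", "Val": "V", "Trp": "W", "Xaa": "X", "Tyr": "Y", "Glx": "Z",
}


def to_one_letter(residues):
    """Convert a sequence of three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE[residue] for residue in residues)


print(to_one_letter(["Met", "Thr", "Tyr"]))  # MTY
```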
Two ways of sequencing
• Shotgun sequencing: This is the strategy chosen by Celera for commercial sequencing of the human genome
• Ordered sequencing (top down): This strategy was used in the "public" sequencing of the genome, in an international collaboration
Top-down strategy for sequencing
Two ways of sequencing the genome
BAC to BAC Sequencing
The BAC to BAC approach first creates a crude physical map of the whole genome before sequencing the DNA. Constructing a map requires cutting the chromosomes into large pieces and figuring out the order of these big chunks of DNA before taking a closer look and sequencing all the fragments.
Whole Genome Shotgun Sequencing
The shotgun sequencing method goes straight to the job of decoding, bypassing the need for a physical map. Therefore, it is much faster.
Fragmenting the genome
[Figure panels on this and the following step slides: BAC to BAC Sequencing; Whole Genome Shotgun Sequencing]
Cloning the fragments
Placing the BAC clones on the map
This step is not needed in shotgun sequencing.
Subclones from the BAC clones
This step is not needed in shotgun sequencing.
Sequencing the clones
Raw sequence from a sequencing instrument
Building contiguous sequences
Assembling individual sequences into larger sequences
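The idea behind the last step, joining individual reads into longer contiguous sequences, can be illustrated with a toy greedy merger: repeatedly find the two sequences with the longest suffix/prefix overlap and join them. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); it is not the method used by Celera or the public project, and the function names and the `min_len` parameter are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by longest overlap until no overlaps remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


# Three overlapping reads cut from a 30-base stretch, given out of order.
reads = ["acatagttctatgg", "cgttatttaaggtg", "aaggtgttacatag"]
print(greedy_assemble(reads))  # ['cgttatttaaggtgttacatagttctatgg']
```

Real assemblers also have to cope with sequencing errors, reads from both strands and repeated regions, which is where a physical map helps.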
DNA sequencing 2001
Biological databases

Primary databases (archival)
– GenBank, EMBL, DDBJ, PDB

Secondary databases (curated)
– PIR, SwissProt and everything else
Database Categories List
http://www3.oup.co.uk/nar/database/c/
• Genomics Databases (non-vertebrate)
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Metabolic and Signaling Pathways
• Microarray Data and other Gene Expression Databases
• Nucleotide Sequence Databases
• Other Molecular Biology Databases
• Protein sequence databases
• Proteomics Resources
• RNA sequence databases
• Structure Databases
In all 548 databases, 162 more than one year ago
GenBank entry
LOCUS       LISOD        756 bp    DNA             BCT       30-JUN-1993
DEFINITION  L.ivanovii sod gene for superoxide dismutase.
ACCESSION   X64011 S78972
NID         g44010
VERSION     X64011.1  GI:44010
KEYWORDS    sod gene; superoxide dismutase.
SOURCE      Listeria ivanovii.
  ORGANISM  Listeria ivanovii
            Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillaceae;
            Listeria.
REFERENCE   1  (bases 1 to 756)
  AUTHORS   Haas,A. and Goebel,W.
  TITLE     Cloning of a superoxide dismutase gene from Listeria ivanovii by
            functional complementation in Escherichia coli and characterization
            of the gene product
  JOURNAL   Mol. Gen. Genet. 231 (2), 313-322 (1992)
  MEDLINE   92140371
REFERENCE   2  (bases 1 to 756)
  AUTHORS   Kreft,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
            Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG
GenBank entry (cont.)
FEATURES             Location/Qualifiers
     source          1..756
                     /organism="Listeria ivanovii"
                     /strain="ATCC 19119"
                     /db_xref="taxon:1638"
     RBS             95..100
                     /gene="sod"
     gene            95..746
                     /gene="sod"
     CDS             109..717
                     /gene="sod"
                     /EC_number="1.15.1.1"
                     /codon_start=1
                     /transl_table=11
                     /product="superoxide dismutase"
                     /protein_id="CAA45406.1"
                     /db_xref="SWISS-PROT:P28763"
                     /translation="MTYELPKLPYTYD…
     terminator      723..746
                     /gene="sod"
BASE COUNT      247 a    136 c    151 g    222 t
ORIGIN
        1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
EMBL database entry
EMBL:TRBG361
ID   TRBG361    standard; RNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
EMBL database entry (cont.)
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17:209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   AGDR; X56734; X56734.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
EMBL database entry (cont.)
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSI….
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg
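The SQ header of the entry carries its own consistency check: the per-base counts should add up to the stated length (609 + 314 + 355 + 581 + 0 = 1859). A minimal sketch of reading that header in Python follows; the function name `parse_sq_header` is hypothetical, and the regular expressions assume the header layout shown in this entry.

```python
import re


def parse_sq_header(line):
    """Parse an EMBL 'SQ' header into (length, {symbol: count})."""
    match = re.match(r"SQ\s+Sequence\s+(\d+)\s+BP;(.*)", line)
    if match is None:
        raise ValueError("not an SQ header line")
    length = int(match.group(1))
    # Each remaining item looks like '609 A;' or '0 other;'.
    counts = {symbol: int(number)
              for number, symbol in re.findall(r"(\d+)\s+(\w+);", match.group(2))}
    return length, counts


length, counts = parse_sq_header(
    "SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;")
print(length, counts)
print(sum(counts.values()) == length)  # True
```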
EMBL database fields
Note that each line begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below:
ID - identification             (begins each entry; 1 per entry)
AC - accession number           (>=1 per entry)
SV - new sequence identifier    (>=1 per entry)
DT - date                       (2 per entry)
DE - description                (>=1 per entry)
KW - keyword                    (>=1 per entry)
OS - organism species           (>=1 per entry)
OC - organism classification    (>=1 per entry)
OG - organelle                  (0 or 1 per entry)
RN - reference number           (>=1 per entry)
RC - reference comment          (>=0 per entry)
EMBL database fields (cont.)
RP - reference positions        (>=1 per entry)
RX - reference cross-reference  (>=0 per entry)
RA - reference author(s)        (>=1 per entry)
RT - reference title            (>=1 per entry)
RL - reference location         (>=1 per entry)
DR - database cross-reference   (>=0 per entry)
FH - feature table header       (0 or 2 per entry)
FT - feature table data         (>=0 per entry)
CC - comments or notes          (>=0 per entry)
XX - spacer line                (many per entry)
SQ - sequence header            (1 per entry)
bb - (blanks) sequence data     (>=1 per entry)
// - termination line           (ends each entry; 1 per entry)
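Because every line announces its type in the first two characters, an entry can be split into fields with very little code. The sketch below groups the lines of one entry by line code; it assumes the usual EMBL layout with the data starting at column 6, labels blank-coded sequence lines as "bb" as in the list above, and skips XX spacer lines. The function name `group_by_line_code` is made up for this example.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one EMBL entry by their two-character line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue
        code, content = line[:2], line[5:].rstrip()
        if code == "//":
            break  # termination line ends the entry
        if code == "XX":
            continue  # spacer lines carry no data
        fields[code if code.strip() else "bb"].append(content)
    return dict(fields)


entry = """\
ID   TRBG361    standard; RNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
//
"""
fields = group_by_line_code(entry)
print(fields["AC"])       # ['X56734; S46826;']
print(len(fields["DT"]))  # 2
```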
The feature table
The overall goal of the feature table design is to provide an extensive vocabulary for
describing features in a flexible framework for manipulating them. The Feature Table
documentation represents the shared rules that allow the three databases to exchange data
on a daily basis.
The range of features to be represented is diverse, including regions which:
perform a biological function,
affect or are the result of the expression of a biological function,
interact with other molecules,
affect replication of a sequence,
affect or are the result of recombination of different sequences,
are a recognizable repeated unit,
have secondary or tertiary structure,
exhibit variation, or
have been revised or corrected.
Feature table terminology
The format and wording in the feature table use common biological research
terminology whenever possible. For example, an item in the new feature table such
as:
Key             Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"
might be read as:
The feature CDS is a coding sequence beginning at base
23 and ending at base 400, has a product called 'alcohol
dehydrogenase' and corresponds to the gene called
'adhI'.
Feature table terminology (cont.)
A more complex description:
Key             Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell receptor beta-chain"
                /partial
which might be read as:
This feature, which is a partial coding sequence, is
formed by joining the indicated elements to form one
contiguous sequence encoding a product called T-cell
receptor beta-chain.
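The two location styles shown so far, a simple range and a join of ranges, can be turned into coordinates with a few lines of Python. This is a sketch that handles only those two forms; the full feature table also allows operators such as complement() and partial-end markers (< and >), which are deliberately not covered here. Function names are illustrative.

```python
import re


def parse_location(location):
    """Return a list of (start, end) pairs, 1-based and inclusive."""
    joined = re.fullmatch(r"join\((.*)\)", location)
    inner = joined.group(1) if joined else location
    spans = []
    for part in inner.split(","):
        span = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if span is None:
            raise ValueError(f"unsupported location: {part!r}")
        spans.append((int(span.group(1)), int(span.group(2))))
    return spans


def location_length(location):
    """Total number of bases covered by the location."""
    return sum(end - start + 1 for start, end in parse_location(location))


def extract(sequence, location):
    """Cut the located pieces out of a sequence and join them in order."""
    return "".join(sequence[start - 1:end]
                   for start, end in parse_location(location))


print(location_length("23..400"))                   # 378
print(location_length("join(544..589,688..1032)"))  # 391
print(extract("abcdefghij", "join(2..3,6..8)"))     # bcfgh
```

The first CDS covers 378 bases, that is 126 codons; the joined one covers 46 + 345 = 391 bases, which is not a multiple of three, consistent with its /partial qualifier.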
Feature key examples
Key             Description
conflict        Separate determinations of the "same" sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein-coding sequence
misc_RNA        Generic label for an undefined RNA
insertion_seq   Insertion element
D-loop          Mitochondrial or other D-loop structure