Lesson: Evolution and Sequences - CBS

Introduction to Bioinformatics
Class 01
October 2010
Introduction
Rasmus Wernersson, Associate Professor
Anders Gorm Pedersen, Docent
Center for Biological Sequence Analysis, DTU
Overview
Data & Databases
• Taxonomy
• DNA
• Protein
• Protein structure
Methods
• Alignment
– Pairwise + Multiple
• BLAST (search)
• Phylogenetic trees
• PyMOL (3D visualization)
Wrap-up exercise
• Malaria vaccine
The exercises are the main part of the course
Course plan on our wiki
Background information
On evolution and sequences
Classification: Linnaeus
Carl Linnaeus
1707-1778
Classification: Linnaeus
• Hierarchical system
– Kingdom
– Phylum
– Class
– Order
– Family
– Genus
– Species
Classification depicted as a tree
No “mixed” animals
Source: www.dr.dk/oline
Classification depicted as a tree
[Figure: tree diagram with the levels Species, Genus, Family, Order and Class marked]
Comparison of limbs
Image source: http://evolution.berkeley.edu
Theory of evolution
Charles Darwin
1809-1882
Phylogenetic basis of systematics
• Linnaeus: Ordering principle is God.
• Darwin: Ordering principle is shared descent from common ancestors.
• Today, systematics is explicitly based on phylogeny.
Natural Selection: Darwin’s four postulates
• More young are produced each generation than can survive to reproduce.
• Individuals in a population vary in their characteristics.
• Some differences among individuals are based on genetic differences.
• Individuals with favorable characteristics have higher rates of survival and reproduction.
Evolution by means of natural selection
Presence of “design-like” features in organisms:
Quite often features are there “for a reason”
Evolution at the sequence level
About DNA
• DNA contains the recipes for how to make proteins / enzymes.
• Every time a cell divides, its DNA is duplicated, and each daughter cell gets a copy.
The DNA alphabet
• The information in the DNA is written in a four-letter code: A, T, G, C.
• The DNA can be “sequenced” and the result stored in a computer file.
• ATGGCCCTGTGGAT
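As a minimal sketch of what handling a stored sequence on a computer can look like, the Python snippet below checks that a string only uses the four-letter alphabet and counts each base. The function name `base_counts` is an illustrative assumption, not part of the course material.

```python
from collections import Counter

DNA_ALPHABET = set("ATGC")

def base_counts(seq: str) -> dict:
    """Count each base in a DNA string, rejecting letters outside A, T, G, C."""
    seq = seq.upper()
    unexpected = set(seq) - DNA_ALPHABET
    if unexpected:
        raise ValueError(f"Not a plain DNA sequence, found: {sorted(unexpected)}")
    counts = Counter(seq)
    # Report the bases in a fixed order, including bases that occur zero times.
    return {base: counts[base] for base in "ATGC"}

# The example sequence from the slide.
print(base_counts("ATGGCCCTGTGGAT"))
```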
DNA is always written 5’ → 3’
[Figure: ribose and deoxyribose with the carbon atoms numbered 1 to 5; DNA backbone with the 5’ and 3’ ends marked]
5’ AGCC 3’
3’ TCGG 5’
5’ ATGGCCAGGTAA 3’
DNA backbone: http://en.wikipedia.org/wiki/DNA
(Deoxy)ribose: http://en.wikipedia.org/
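The two-strand example above (AGCC paired with TCGG) can be reproduced with a short Python sketch. It assumes plain upper- or lower-case A, T, G, C input; the name `reverse_complement` is chosen here for illustration.

```python
# Translation table that swaps each base for its pairing partner.
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in the 5' -> 3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]

top = "AGCC"
print("5'", top, "3'")
print("3'", top.translate(COMPLEMENT), "5'")  # partner strand, read 3' -> 5'
print("5'", reverse_complement(top), "3'")    # same partner strand, rewritten 5' -> 3'
```

Because sequences are always written 5’ → 3’, the partner strand of AGCC is stored as its reverse complement rather than as the plain complement.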
Can DNA be changed?
• ATGGCCCTGTGGATGCG
• ATGGCCCTATGGATGCG
A history of mutations
[Figure: sequences accumulating mutations along a time axis]
ATGGCAATGTGGATGCA
ATGGCCCCGTGGAACCG
ATGTCCCCGTGGATGCG
ATGGCCCCGTGGATGCG
ATGGCCCTGTGGATGCG
ATGGCCCTGTGTATGCG
“DNA alignment”
• Species1: ATGGCAATGTGGATGCA
• Species2: ATGGCCCCGTGGAACCG
• Species3: ATGTCCCCGTGGATGCG
Number of differing positions: Species1 vs Species2: 6, Species2 vs Species3: 3, Species1 vs Species3: 5
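The numbers 6, 3 and 5 on this slide match the number of differing positions between each pair of sequences. A minimal Python sketch of that count follows, assuming the sequences are already aligned and of equal length; the helper name `count_differences` is hypothetical.

```python
from itertools import combinations

species = {
    "Species1": "ATGGCAATGTGGATGCA",
    "Species2": "ATGGCCCCGTGGAACCG",
    "Species3": "ATGTCCCCGTGGATGCG",
}

def count_differences(a: str, b: str) -> int:
    """Count the alignment columns in which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Compare every pair of species once.
pairwise = {
    (name1, name2): count_differences(species[name1], species[name2])
    for name1, name2 in combinations(species, 2)
}
for (name1, name2), diffs in pairwise.items():
    print(name1, name2, diffs)
```

Fewer differences suggest a more recent common ancestor, which is the idea behind building a tree from an alignment.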
Real life example: Alignment
• Insulin from 7 different species
• Homo: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAA
• Pan: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGACCCAGCCTCGGCCTTTGTGAA
• Sus: ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCCCCGGCCCAGGCCTTCGTGAA
• Ovis: ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCCCCGGCCCACGCCTTCGTCAA
• Canis: ATGGCCCTCTGGATGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCGCCCACCCGAGCCTTCGTTAA
• Mus: ATGGCCCTGTTGGTGCACTTCCTACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAACCCACCCAGGCTTTTGTCAA
• Gallus: ATGGCTCTCTGGATCCGATCACTGCCTCTTCTGGCTCTCCTTGTCTTTTCTGGCCCTGGAACCAGCTATGCAGCTGCCAA
Real life example: Tree
Interpretation of Multiple Alignments
Conserved features are assumed to be important for functionality.
For instance, conserved pairs of cysteines indicate a possible disulphide bridge.
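One simple way to spot conserved features is to mark the alignment columns in which every sequence carries the same letter. The sketch below does this for the first 20 columns of the insulin alignment (shortened here only to keep the example small). The `*` and `.` markers and the name `conservation_line` are illustrative choices, not a standard taken from the slides.

```python
# First 20 columns of the insulin alignment from the slide.
alignment = {
    "Homo":   "ATGGCCCTGTGGATGCGCCT",
    "Pan":    "ATGGCCCTGTGGATGCGCCT",
    "Sus":    "ATGGCCCTGTGGACGCGCCT",
    "Ovis":   "ATGGCCCTGTGGACACGCCT",
    "Canis":  "ATGGCCCTCTGGATGCGCCT",
    "Mus":    "ATGGCCCTGTTGGTGCACTT",
    "Gallus": "ATGGCTCTCTGGATCCGATC",
}

def conservation_line(seqs) -> str:
    """Mark fully conserved alignment columns with '*' and all others with '.'."""
    # zip(*seqs) walks through the alignment one column at a time.
    return "".join("*" if len(set(column)) == 1 else "." for column in zip(*seqs))

marks = conservation_line(alignment.values())
for name, seq in alignment.items():
    print(f"{name:<7}{seq}")
print(f"{'':<7}{marks}")
```

The start codon ATG is conserved in all seven species, as one would expect for a feature that is essential for function.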
Sequences are related
• Darwin: all organisms are related through descent with modification
• Prediction: similar molecules have similar functions in different organisms
Protein synthesis is carried out by very similar RNA-containing molecular complexes (ribosomes) that are present in all known organisms
Sequences are related, II
Related oxygen-binding proteins in humans
DNA as Biological Information
Rasmus Wernersson
Overview
• Learning objectives
– About Biological Information
– A note about DNA sequencing techniques and DNA data
– File formats used for biological data
– Introduction to the GenBank database
Information flow in biological systems
DNA sequences = summary of information
[Figure: ribose and deoxyribose with the carbon atoms numbered 1 to 5; DNA backbone with the 5’ and 3’ ends marked]
5’ AGCC 3’
3’ TCGG 5’
5’ ATGGCCAGGTAA 3’
DNA backbone: http://en.wikipedia.org/wiki/DNA
(Deoxy)ribose: http://en.wikipedia.org/
PCR
• Melting: 96 °C, 30 sec
• Annealing: ~55 °C, 30 sec
• Extension: 72 °C, 30 sec
• 35 cycles
Animation: http://depts.washington.edu/~genetics/courses/genet371b-aut99/PCR_contents.html
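Since each PCR cycle can at most double the number of template copies, the effect of 35 cycles is easy to estimate. The sketch below is a simplified model under stated assumptions: the `efficiency` parameter (fraction of templates copied per cycle) is introduced here for illustration, and real reactions level off before reaching the ideal number.

```python
def pcr_copies(start_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate copies after PCR; efficiency 1.0 means perfect doubling each cycle."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return start_copies * (1.0 + efficiency) ** cycles

# Ideal case for the 35 cycles on the slide, starting from a single template molecule.
print(f"{pcr_copies(1, 35):.3g} copies with perfect doubling")
# A less optimistic, purely illustrative efficiency of 80 % per cycle.
print(f"{pcr_copies(1, 35, efficiency=0.8):.3g} copies at 80 % efficiency")
```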
PCR
Animation: http://www.people.virginia.edu/~rjh9u/pcranim.html
PCR graph: http://pathmicro.med.sc.edu/pcr/realtime-home.htm
Gel electrophoresis
• DNA fragments are separated using gel electrophoresis
– Typically 1% agarose
– Stained with EtBr or SYBR Green (glows in UV light).
– A DNA “ladder” is used for identification of known DNA lengths.
[Figure: gel with the negative (-) and positive (+) poles marked]
Gel picture: http://www.pharmaceutical-technology.com/projects/roche/images/roche3.jpg
PCR setup: http://arbl.cvmbs.colostate.edu/hbooks/genetics/biotech/gels/agardna.html
The Sanger method of DNA sequencing
[Figure: chain-terminating nucleotide, with the labels OH and Terminator]
X-ray sequencing gel
Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf
Automated sequencing
• The major breakthrough in sequencing has happened through automation.
• Fluorescent dyes.
• Laser-based scanning.
• Capillary electrophoresis.
• Computer-based base-calling and assembly.
Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf
Handout exercise: ”base-calling”
• Handout: Chromatogram
• Groups of 2-3.
• Tasks:
– Identify “difficult” regions
– Identify “difficult” sequence stretches.
– Try to estimate the best interval to use.
Biological data on computers
• The GenBank database
• File formats
– FASTA
– GenBank
NCBI GenBank
• GenBank is one of the main international DNA databases.
• GenBank is hosted by NCBI: National Center for Biotechnology Information.
• GenBank has existed since 1982.
• The database is public - no restrictions on the use of the data within.
FASTA format
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC
CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA
AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG
CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC
CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG
CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA
GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC
GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC
GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG
GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG
TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC
CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC
ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG
TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC
TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC
TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG
AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG
CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA
GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC
ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT
CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG
CACCGTCCTTACTGCCAAGTACCGTTAA
(Handout)
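As the example shows, a FASTA file is a series of records, each made of a header line starting with `>` followed by one or more sequence lines. A minimal Python reader could look like the sketch below. The function name `read_fasta` is an assumption, and the two sequences are shortened versions of the records above.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file or file-like object."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # finish the previous record
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence lines are joined without line breaks
    if name is not None:
        yield name, "".join(chunks)

# Shortened stand-in for the handout file.
example = io.StringIO(">alpha-D\nATGCTGACCGAC\nTCTGACAAG\n>alpha-A\nATGGTGCTGTCT\n")
records = dict(read_fasta(example))
for name, seq in records.items():
    print(name, len(seq), seq)
```

For a real file, `open("globins.fasta")` (a hypothetical file name) can be passed in place of the `io.StringIO` object.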
GenBank format
• Originates from the GenBank database.
• Contains both a DNA sequence and annotation of features (e.g. location of genes).
(handout)
GenBank format - HEADER
LOCUS       CMGLOAD                 1185 bp    DNA     linear   VRT 18-APR-2005
DEFINITION  Cairina moschata (duck) gene for alpha-D globin.
ACCESSION   X01831
VERSION     X01831.1  GI:62724
KEYWORDS    alpha-globin; globin.
SOURCE      Cairina moschata (Muscovy duck)
  ORGANISM  Cairina moschata
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Anseriformes; Anatidae; Cairina.
REFERENCE   1  (bases 1 to 1185)
  AUTHORS   Erbil,C. and Niessing,J.
  TITLE     The primary structure of the duck alpha D-globin gene: an unusual
            5' splice junction sequence
  JOURNAL   EMBO J. 2 (8), 1339-1343 (1983)
   PUBMED   10872328
COMMENT     Data kindly reviewed (13-NOV-1985) by J. Niessing.
GenBank format - ORIGIN section
ORIGIN
        1 ctgcgtggcc tcagcccctc cacccctcca cgctgataag ataaggccag ggcgggagcg
       61 cagggtgcta taagagctcg gccccgcggg tgtctccacc acagaaaccc gtcagttgcc
      121 agcctgccac gccgctgccg ccatgctgac cgccgaggac aagaagctca tcgtgcaggt
      181 gtgggagaag gtggctggcc accaggagga attcggaagt gaagctctgc agaggtgtgg
      241 gctgggccca gggggcactc acagggtggg cagcagggag caggagccct gcagcgggtg
      301 tgggctggga cccagagcgc cacggggtgc gggctgagat gggcaaagca gcagggcacc
      361 aaaactgact ggcctcgctc cggcaggatg ttcctcgcct acccccagac caagacctac
      421 ttcccccact tcgacctgca tcccggctct gaacaggtcc gtggccatgg caagaaagtg
      481 gcggctgccc tgggcaatgc cgtgaagagc ctggacaacc tcagccaggc cctgtctgag
      541 ctcagcaacc tgcatgccta caacctgcgt gttgaccctg tcaacttcaa ggcaagcggg
      601 gactagggtc cttgggtctg ggggtctgag ggtgtggggt gcagggtctg ggggtccagg
      661 ggtctgagtt tcctggggtc tggcagtcct gggggctgag ggccagggtc ctgtggtctt
      721 gggtaccagg gtcctggggg ccagcagcca gacagcaggg gctgggattg catctgggat
      781 gtgggccaga ggctgggatt gtgtttggaa tgggagctgg gcaggggcta gggccagggt
      841 gggggactca gggcctcagg gggactcggg gggggactga gggagactca gggccatctg
      901 tccggagcag gggtactaag ccctggtttg ccttgcagct gctggcacag tgcttccagg
      961 tggtgctggc cgcacacctg ggcaaagact acagccccga gatgcatgct gcctttgaca
     1021 agttcttgtc cgccgtggct gccgtgctgg ctgaaaagta cagatgagcc actgcctgca
     1081 cccttgcacc ttcaataaag acaccattac cacagctctg tgtctgtgtg tgctgggact
     1141 gggcatcggg ggtcccaggg agggctgggt tgcttccaca catcc
//
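In the ORIGIN section each line starts with the position of its first base, followed by blocks of 10 bases. A minimal sketch of turning such lines back into one continuous sequence is shown below. The function name `origin_to_sequence` is an assumption, and the example uses only the line starting at position 121, reassembled from the 10-base blocks of the record.

```python
def origin_to_sequence(lines) -> str:
    """Join ORIGIN lines into one string, dropping position numbers and spaces."""
    parts = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] in ("ORIGIN", "//"):
            continue  # skip the section header, the end marker and blank lines
        parts.extend(fields[1:])  # fields[0] is the position of the first base
    return "".join(parts)

line_121 = "121 agcctgccac gccgctgccg ccatgctgac cgccgaggac aagaagctca tcgtgcaggt"
chunk = origin_to_sequence(["ORIGIN", line_121, "//"])

# The CDS in the FEATURE section starts at base 143, i.e. offset 143 - 121 in this chunk.
offset = 143 - 121
start_codon = chunk[offset:offset + 3]
print(len(chunk), start_codon)
```

Note that GenBank positions are 1-based, while Python string indices are 0-based, which is why the offset is computed relative to the first base of the line.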
GenBank format - FEATURE section
FEATURES             Location/Qualifiers
     source          1..1185
                     /organism="Cairina moschata"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:8855"
     CAAT_signal     20..24
     TATA_signal     69..73
     precursor_RNA   101..1114
                     /note="primary transcript"
     exon            101..234
                     /number=1
     CDS             join(143..234,387..591,939..1067)
                     /codon_start=1
                     /product="alpha D-globin"
                     /protein_id="CAA25966.2"
                     /db_xref="GI:4455876"
                     /db_xref="GOA:P02003"
                     /db_xref="InterPro:IPR000971"
                     /db_xref="InterPro:IPR002338"
                     /db_xref="InterPro:IPR002340"
                     /db_xref="InterPro:IPR009050"
                     /db_xref="UniProt/Swiss-Prot:P02003"
                     /translation="MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFP
                     HFDLHPGSEQVRGHGKKVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLA
                     QCFQVVLAAHLGKDYSPEMHAAFDKFLSAVAAVLAEKYR"
     repeat_region   227..246
                     /note="direct repeat 1"
     intron          235..386
                     /number=1
     repeat_region   289..309
                     /note="direct repeat 1"
     exon            387..591
                     /number=2
     intron          592..939
                     /number=2
     exon            940..1114
                     /number=3
     polyA_signal    1095..1100
     polyA_signal    1114
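The CDS location `join(143..234,387..591,939..1067)` says that the coding sequence is pieced together from three stretches of the DNA. A minimal sketch of reading such a location string follows. It only handles simple `start..end` ranges (not single positions such as `1114`, and not `complement(...)`), and the name `parse_location` is hypothetical.

```python
import re

def parse_location(location: str):
    """Turn '101..234' or 'join(a..b,c..d)' into a list of (start, end) pairs."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]

cds = parse_location("join(143..234,387..591,939..1067)")
# Both ends are included, hence the + 1.
lengths = [end - start + 1 for start, end in cds]
total = sum(lengths)
print(cds)
print(lengths, total, total // 3)
```

The three pieces add up to 426 bases, i.e. 142 codons, which agrees with the 141 amino acids in the `/translation` qualifier plus one stop codon.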
Exercise: GenBank
• Work in groups of 2-3 people.
• The exercise guide is linked from the course programme.
• Read the guide carefully - it contains a lot of information about GenBank.