"Basics in Bioinformatics", Gábor Rákhely's lecture, 18/Feb/2010

Basics in bioinformatics
Gábor Rákhely PhD.
Institute of Biophysics BRC HAS
Department of Biotechnology
University of Szeged
(599)-726
This presentation can be found:
http://biotech.szbk.u-szeged.hu/bioinf/bioinfo_itc.html
Books are available in English
BIOINFORMATICS
INFORMATICS
BIOINFORMATICS
BIOLOGY
“More than 99% of all scientists who have ever lived are our contemporaries”
The same is true for data → revolution in informatics
INFORMATICS
- experiments → information → production of new information
- treatment, classification (grouping) and displaying of data
- harmonizing of data
Entering data, arrangement of data → databanks
Processing, displaying and evaluation of data
→ newer information → newer, other databanks
Databanks:
- fast exchange of data
- interactive link between databanks
and researchers
automation, use of special software
and knowledge
PREBIOINFORMATICS:
RESOLVING THE INFORMATION CARRIER
1866 Mendel: crossing experiments with peas → heredity in units
1869 Miescher: purification of salmon sperm DNA
DNA as the hereditary material
1903 W. S. Sutton
the pattern of inheritance is linked to the behavior
of chromosomes during cell division
cytochemistry:
chromosomes consist of DNA and protein
1925-1928 F. Griffith
mouse infections with Streptococcus pneumoniae
→ transforming principle
1944 Avery:
the transforming compound is DNA
1952 Hershey and Chase
DNA from the T2 phage enters the cells
THE ROAD TO THE DOUBLE HELIX
Chargaff E.: the base ratios (A = T, G = C) hold in both humans and E. coli
Biophysical data: e.g. water content of DNA
Rosalind Franklin and Maurice Wilkins X-ray diffraction data
Crick and Watson 1952-1953
The model of the double helix
The double helical DNA
The central dogma and the main areas of
bioinformatics in molecular biology
Genomics
Transcriptomics,
transcriptome
degradation
Proteomics,
proteome
degradation
biochemical activity
metabolic pathways
metabolomics
Genomics
Basically, to determine the nucleotide
sequence of a genome or of
extrachromosomal elements
→ In silico prediction of functional regions,
including coding and regulatory regions, splice
sites, etc.

The main three branches of the evolutionary tree
(by Woese and colleagues)
Genome sizes in nucleotide base pairs
[Figure: ranges of genome size on a logarithmic axis from 10^4 to 10^11 bp for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds and mammals]
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
COMPARISON OF THE CELL ORGANIZATION IN
PROKARYOTES AND EUKARYOTES
STRUCTURE OF GENES IN EUKARYOTES
exon
intron exon
Regulatory elements
upstream
Start of the biological
information (coding region)
downstream
End of biological information
(coding region)
alternative splicing
Genes within genes
neurofibromatosis type I gene
exons
introns
OGMP
EVI2B
EVI2A
THE ORGANIZATION OF THE PROKARYOTE
GENOME
The model of the E. coli
nucleoid
THE ORGANIZATION OF THE GENES IN
PROKARYOTES: polycistronic structure
DNA MANIPULATION
WITH COMPUTER
DNA sequencing according to SANGER
THE PRINCIPLE OF AUTOMATIC DNA
SEQUENCING
GENOME SEQUENCING STRATEGIES
Shotgun
Primer walking
ALTERNATIVE SHOTGUN STRATEGIES
PRODUCTION OF BACTERIAL
SHOTGUN LIBRARY
Preparation of shotgun library
E. coli
chromosomal DNA
transformation
electroporation
2-3.5 kb
fragments
blunting the
ends
broken DNA
fragments
dephosphorylation
Preparative gel electrophoresis
SEQUENCE PROCESSING
Sequence analysis
checking,
validation
Phrap
SeqMan/DNASTAR
STADEN programme
Removal of vector and other
contaminating sequences
Removal of the low quality
sequences
Vector_clipping
Phrap
Contig assembly from the overlapping fragments
Phred
Manual checking of the sequences
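The step "contig assembly from the overlapping fragments" can be illustrated with a toy sketch. It is only a greedy suffix/prefix merge under simplifying assumptions (error-free reads, one strand, no quality values); it is not the algorithm used by Phrap, SeqMan or the Staden package, and the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping "reads" cut from the first 20 nucleotides of the example sequence in this lecture.
reads = ["CTCGAGACGC", "GACGCTGTTT", "TGTTTCTGGG"]
print(greedy_assemble(reads))
```

If no overlap of at least `min_len` remains, the function returns several contigs, which corresponds to the gap-closure problem discussed later.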
ARRANGEMENT OF PRIMARY SEQUENCES INTO A CONTIG
an example
[Figure: contig map of PSC148 (7405 bps) with scale marks at 2000, 4000 and 6000]
Genes: orf1, pcaB, orf2, macA, orf-3, pcaH, pcaG
Subclones: pSC1/1, pSC1/2, pSC1/3, pSC1/4, pSC1/6, pSC1/8, pSC1/10
Reads from vector primers: S11T7, S11SK, S12T7, S12SK, S13T7, S13SK, S14SK, S16SK, S17T7, S17SK, S18SK, S19T7, S19SK, S148T7, S148SK, SC110T7, SC110SK
Reads from internal primers: S148O7, S148O8, S148O9, S148O10, S148O11, S148O12, S148O13, S148O14, S148O15, S148O17, S148O18, S148O19, S148O20, S148O21, S148O22
COSMID
LIBRARY
- Partial digestion of genomic DNA with
MboI (Sau3AI)
(ends compatible with BamHI ends)
- Size fractionation for 30-45 kb fragments
BamHI- XbaI
digestion
cos
cos
Ampr
A tool for connecting
non-overlapping
contigs
ori
ligation
30 – 45 kb fragments
cos
cos
in vitro packaging with
Gigapack λ extract
Selection for ampicillin
resistant clones
Cosmid library
PRIMER WALKING
In cosmid, BAC, YAC libraries
TEMPLATE GENERATING SYSTEMS
The location of the integration must
be known
High-throughput automatic Southern
hybridization
USEFUL TOOLS FOR ASSEMBLY:
MAPPING
- genetic: positioning of genes and properties
- physical: arrangement of sequences and genes
- EST: expressed sequence tag
- STS: sequence tagged site, a single 100-500 bp fragment
ASSEMBLY OF THE CONTIGS: gap closure
DIFFICULTIES IN THE ASSEMBLY:
A.
Abnormal genetic elements:
formation of pseudogenes
B.
The coding region is damaged
No regulatory region or elements
driving transcription
conventional pseudogene: loss-of-function mutation
Retroelements and retrotransposition
DNA transposons
retrotransposons are characteristic mainly
of eukaryotes
REPETITIVE SEQUENCES IN THE GENOMES
1 chromosome
microsatellites
(short tandem repeat, STR)
repeat unit up to 13 bp, array up to 150 bp long:
interspersed
repeats
e.g. CACACACACACA
2 chromosome
tandem repeated DNA
Long Interspersed Nuclear Elements: LINE
On average, it occurs every 2 kb
Minisatellites
repeat unit up to 25 bp, array up to 20 kbp long
Short Interspersed Nuclear Elements: SINE
For genetic profile analysis
“THE COMEDY OF ERRORS”
A SEGMENT OF THE HUMAN GENOME
IF EVERYTHING IS OK, WE HAVE
SEQUENCES
What does it contain, a gene or non-coding region?
How do we know we can find anything, e.g. a gene?
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGGCTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGCGGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAAGA
TCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGCTGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGGTTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTCCCTGG
TCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTCCCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTTTATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTGATTCCTT
CGGGATTTTTTGGGGTCCTGATTGGCTGGTTATTGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATTGTTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTGGCACGACGAG
GGGTGCCATCGGTGCCGCGTCAAGCCAACGTGCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTTACCAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAGATGTTCGTGCTGC
CGCAACGGCTGGACAAGACCATGTTCGCGGGCACATCAACGCTTACCTTTGCTGCCATAAACCTATTCAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCGACTTCCTCGGTCATGT
CCGCGCTAGTGTTGATTCCGGTGGCCGTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGCAGGCTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCGATGTTGCTGGTGGTCTCCA
TTCAGCTTCTGTGGAGGGGAATGTCGGATATCCTGAACTAGCTGGAGATCGCAATGTCAGAACGCTCAATCAATCAGAATGTAATCTTGACATAGAATACCGTTCCGATTTATTGCTTCG
AGTGAAGCTGCCCGTCCGCTGAGATGTCATGACATTTTCCCCGCTTGATTCCGCCCTGCTTGGACCGTTGTTCGCGACCGATGAAATGCGCACGGTCTTCTCCGAACGGCGTTTTTTGGC
GGGAATGCTTCGTGTTGAAGTGGCCCTGGCGCGCGCGCAGGCGGCAGAGGGCCTTGTCAGTTCGGAATTGGCCGACGCGATCGAGGTTGTTGGTACTGCCGGGTTGGACCCCGAGGCGAT
GGCGGCGACTACTCGCATGACAGGAGTGCCCGCAATATCGTTCGTCCGTGCGGTGCAATCGGCCCTGCCGCCCTCACTGGCGGGTGGATTTCATTTCGGCGCCACCAGTCAAGACATCGT
GGATACGGCCCACGCGCTCCAGCTGGCCGAGGCACTCGATATTATAGAAGTCGATTTACACGCCACTGTCAGCGCAATGATGAATCTGGCCGCTGCTCACTGCAATACACCCTGTATCGG
GCGCACGGCCTTGCAGCACGCAGCGCCAGTTACGTTCGGCTACAAGGCGTCCGGCTGGTGCGTTGCCCTGGCGGAGCATCTGGTGCAGCTTCCCGCGCTGCGAAAGCGGGTTCTGGTGGC
GTCGCTAGGGGGGCCGGTTGGTACCCTTGCCGCGATGGAGGAGCGGGCCGACGCTGTACTGGAGGGTTTCGCTGCGGACCTGGGGTTGGCCATTCCCGCCCTGGCCTGGCACACGCAGCG
GGCCCGGATCGTCGAGGTGGCCAGTTGGCTGGCCATATTGCTGGGAATTCTGGCAAAAATGGCCACCGATGTCGTTCACTTGTCCTCCACGGAAGTGCGCGAGCTTTCCGAACCTGTAGC
GCCGGGCAGGGGGGGCTCCTCGGCGATGCCTCACAAGCGGAACCCGATTTCCTCGATTACCATCCTGTCCCAGCATGCTGCGGCAGGGGCCCAGCTCTCCATTCTCGTGAACGGCATGGC
CAGTCTGCACGAACGTCCGGTGGGGGCGTGGCATTCGGAATGGTTGGCTCTGCCGACGCTGTTCGGCCTTGCCGGCGGTGCCGTGCGCGAGGGCAGGTTTCTGGCCGAGGGGCTGCTGGT
CGATGCCGACCAGATGGGTCGCAATCTACAATTGACCAATGGCCTGATTTTCAGCGACGCGGTAGCCGGCCAGTTGGCAAAGCACTTGGGTCGGGCCGAGGCTTATGCCGCTGTCGAGGA
TGCCGCCGCCGAGGTGTTGCGTTCAGGCGGCAGCTTTCAGGGTCAGCTGAACCAGCGCCTGCCCGATCACCGCGACGCTATCGCTATTGCTTTTGATACGACGCCGGCGATCCAGGCCGG
GGCCGCCCGCTGCCGTAGTGCGCTGGATCATGTGGCTCGTATTCTTGGACCCGCCTCTACCATCGGATTTCAAGGAGGCTAATGACGTGACGACACTGTTTGAGGCGACGACCATCCCGA
TTTGCGAGGGCCCGCGCGACCAGACCGCCGAGATCCTTTTCGAGATGCCGCCGGGTGCGTGGGATACCCATTTTCATGTTTTTGGCCCAGTTTCATCGTTTCCATACGCAGAACACAGGC
TCTATTCCCCACCGGAGTCGCCACTTGAGGATTATCTGGTGTTGATGGAGGCTTTGGGGATCGAGCGCGGCGTTTGTGTCCATCCGAATGTTCATGGTGCCGACAATTCGGTGACGCTCG
ACGCAGTTGCGCGGTCCGATGGTCGTCTGCTGGCGGTGATCAAGCCACATCACGAGATGACTTTTGTTCAGCTGCGGGACATGAAGGCGCAGGGGGTCTGCGGGGTACGTTTTGCCTTCA
ATCCGCAGCATGGCTCGGGCGAGTTGGATACTCGTTTGTTCGAGCGTATGTTGGACTGGTGCCGCGACCTAGGCTGGTGCGTAAAATTGCATTTCGCGCCCGCTGCGCTGGACGGTCTGG
CTGAACGTTTGGCGCGCGTCGATATTCCGATCATCATCGATCATTTCGGGCGGGTGGACACCGCGCAAGGTGTGGATCAGCCGCACTTCCTGCGTTTGCTCGATCTGGCCAAACTGGACC
Comparison to known sequences

The sequence obtained can be compared
to known sequences in the databanks

Question: what is similar?

What to compare: DNA or protein?
SIMILARITY
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGG
CTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGC
GGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAA
GATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGC
TGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGG
TTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTC
CCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTC
CCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTT
TATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTG
ATTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGGTTA
TTGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATT
GTTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTG
GCACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCCAAC
GTGCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTT
ACCAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAG
ATGTTCGTGCTGCCGCAACGGCTGGACAAGACCATGTTC
GCGGGCACATCAACGCTTACCTTTGCTGCCATAAACCTA
TTCAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCG
ACTTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCGGTG
GCCGTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGC
AGGCTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCG
ATGTTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGGGGA
ATGTCGGATATCCTGAACTAGCTGGAGATCGCAATGTC
AGAACGCTCAATCAATCAGAATGTAATCTTGACATAGA
ATACCGTTCCGATTTATTGCTTCGAGTGAAGCTGCCCGT
CCGCTGAGATGTCATGACATTTTCCCCGCTTGATTCCGCC
CTGCTTGGACCGTTGTTCGCGACCGATGAAATGCGCACG
GTCTTCTCCGAACGGCGTTTTTTGGC
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGG
CTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGC
GGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAA
GATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGC
TGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGG
TTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTC
CCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTC
CCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTT
TATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTG
ATTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGGTTA
TTGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATT
GTTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTG
GCACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCCAAC
GTGCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTT
ACCAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAG
ATGTTCGTGCTGCCGCAACGGCTGGACAAGACCATGTTC
GCGGGCACATCAACGCTTACCTTTGCTGCCATAAACCTA
TTCAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCG
ACTTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCGGTG
GCCGTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGC
AGGCTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCG
ATGTTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGGGGA
ATGTCGGATATCCTGAACTAGCTGGAGATCGCAATGTC
AGAACGCTCAATCAATCAGAATGTAATCTTGACATAGA
ATACCGTTCCGATTTATTGCTTCGAGTGAAGCTGCCCGT
CCGCTGAGATGTCATGACATTTTCCCCGCTTGATTCCGCC
CTGCTTGGACCGTTGTTCGCGACCGATGAAATGCGCACG
GTCTTCTCCGAACGGCGTTTTTTGGC
the two sequences are (and look) the same
SIMILARITY
Now the two sequences are almost the same,
but they seem to be dissimilar
CTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGCGGG
CTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGGCCGC
GGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGCAAAA
GATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGGTTGC
TGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGGGCGG
TTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGATGTC
CCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCTGCTC
CCGATCTATGTCGTTTCTGATGCATTCGGCGTCTGGCTT
TATCGGCACCGGTATTCTGCCTCCAATCTGCGCATCCTG
ATTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGGTTA
TTGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTCATT
GTTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTGCTG
GCACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCCAAC
GTGCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGCTTT
ACCAGCTTTTTGACTCATTCCGGTGCGGCGACCTTCCAG
ATGTTCGTGCTGCCGCAACGGCTGGACAAGACCATGTTC
GCGGGCACATCAACGCTTACCTTTGCTGCCATAAACCTA
TTCAAGATTCCGTCCTACTGGGCATTGGGACAGCTTTCG
ACTTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCGGTG
GCCGTGGCCGGGACGTTCGCAGGTGTTTTTGCGACGCGC
AGGCTATCGACATCCTGGTTCTTCATTCTGGTCCAGGCG
ATGTTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGGGGA
ATGTCGGATATCCTGAACTAGCTGGAGATCGCAATGTC
AGAACGCTCAATCAATCAGAATGTAATCTTGACATAGA
ATACCGTTCCGATTTATTGCTTCGAGTGAAGCTGCCCGT
CCGCTGAGATGTCATGACATTTTCCCCGCTTGATTCCGCC
CTGCTTGGACCGTTGTTCGCGACCGATGAAATGCGCACG
GTCTTCTCCGAACGGCGTTTTTTGGC
GLOBAL, LOCAL
AAACTCGAGACGCTGTTTCTGGGGTCATTCATTCTTGGC
GGGCTGCAACTGCTGGTGTGACCGACGCGACCTGGCAGG
CCGCGGTGCGCAACTGGCCGGGCGGACTAATGGTGGAGC
AAAAGATCGGCATGTCCAGCGCACCTGAAGCTTGGGTGG
TTGCTGCAATAGCAGCCTTCCTTATTGGCATGGCGAAGG
GCGGTTTGGCCAATGTGGGGGTTATCGCCGTTCCCTTGA
TGTCCCTGGTCAAGCCGCCGCTTACCGCTGCCGGATTGCT
GCTCCCGATCTATGTCGTTTCTGATGCATTCGGCGTCTG
GCTTTATCGGCACCGGTATTCTGCCTCCAATCTGCGCATC
CTGATTCCTTCGGGATTTTTTGGGGTCCTGATTGGCTGG
TTATTGGCCGGGCAGATCTCCGACGCGATTGCCAGTGTC
ATTGTTGGTTTCACCGGCTGCGGCTTCGTGGCTGTGCTG
CTGGCACGACGAGGGGTGCCATCGGTGCCGCGTCAAGCC
AACGTGCCCAAAGGATGGTTTCTGGGGGTGGCCACCGGC
TTTACCAGCTTTTTGACTCATTCCGGTGCGGCGACCTTC
CAGATGTTCGTGCTGCCGCAACGGCTGGACAAGACCATG
TTCGCGGGCACATCAACGCTTACCTTTGCTGCCATAAAC
CTATTCAAGATTCCGTCCTACTGGGCATTGGGACAGCTT
TCGACTTCCTCGGTCATGTCCGCGCTAGTGTTGATTCCG
GTGGCCGTGGCCGGGACGTTCGCAGGTGTTTTTGCGACG
CGCAGGCTATCGACATCCTGGTTCTTCATTCTGGTCCAG
GCGATGTTGCTGGTGGTCTCCATTCAGCTTCTGTGGAGG
GGAATGTCGGATATCCTGAACTAGCTGGAGATCGCAAT
GTCAGAACGCTCAATCAATCAGAATGTAATCTTGACAT
AGAATACCGTTCCGATTTATTGCTTCGAGTGAAGCTGCC
CGTCCGCTGAGATGTCATGACATTTTCCCCGCTTGATTC
CGCCCTGCTTGGACCGTTGTTCGCGACCGATGAAATGCG
CACGGTCTTCTCCGAACGGCGTTTTTTGGC
BLAST, FASTA
Problems with DNA comparison

Codon usage preference: various codons may
code for the same amino acid →
the DNA sequences are different, the protein
sequences are the same
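A small sketch of this point: two DNA fragments that differ at the nucleotide level but translate to the same peptide. The fragments are the first nine nucleotides of two sequences used in the scoring example of this lecture; the codon table is deliberately partial (only the four codons needed here), and the helper names are hypothetical.

```python
# Partial codon table: only the codons that occur in this example.
CODONS = {"CCT": "Pro", "CCC": "Pro", "TTG": "Leu", "TTA": "Leu"}


def translate(dna: str) -> list[str]:
    """Translate complete codons of an in-frame DNA fragment using the partial table."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two equally long sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "CCTCCTTTG"
seq_b = "CCTCCCTTA"
print(translate(seq_a), translate(seq_b))   # both give Pro, Pro, Leu
print(f"DNA identity: {identity(seq_a, seq_b):.2f}")
```

The peptides are identical although two of the nine nucleotides differ, which is why a protein-level comparison is often the more sensitive choice.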
… AND DOES IT CODE FOR ANY PROTEIN?
Open reading frames:
Usually they start with ATG, but in software the start codon is an option
Minimum length: default 100 amino acids, but it is an option
The result is hypothetical; it should be checked against
the existing data
Putative protein list
Finding orfs
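The ORF criteria above (a start codon, a minimum length, both adjustable) can be sketched as follows. This is a minimal illustration, not the method of any particular program: it scans only the three forward frames, and the names `find_orfs`, `min_aa` and `start` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_aa: int = 100, start: str = "ATG") -> list[tuple[int, int, int]]:
    """Return (frame, begin, end) for forward-strand ORFs.

    begin is the 0-based index of the start codon; end is exclusive and
    includes the stop codon. min_aa counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        begin = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if begin is None and codon == start:
                begin = i                      # first start codon opens the ORF
            elif begin is not None and codon in STOPS:
                if (i - begin) // 3 >= min_aa:
                    orfs.append((frame, begin, i + 3))
                begin = None                   # look for the next ORF in this frame
    return orfs


# Toy sequence: ATG + five GCT codons + TAA, shifted into the third frame.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_aa=5))
```

With the default of 100 amino acids the toy sequence yields nothing; every hit is only a putative protein and still has to be checked against existing data.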
similarity → BLASTP
Generation of information from information
Problems: frameshift mutations, failure of global similarity
Where does it start? What is the start?
FRAME SHIFT MUTATION – A
SOLUTION FOR IT
Translation in each open reading frame
Stop codons are not taken into account; they are treated as a missing amino acid
It compares everything to everything at the protein level
example
BLASTX
Six frame translation
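A six-frame translation can be sketched in a few lines. The sketch assumes the standard genetic code and an unambiguous A/C/G/T sequence; it only produces the six translations and does not perform the database comparison that BLASTX carries out afterwards. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translations of the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# First line of the frameshift example shown in this lecture.
segment = "GCCGCCCGCTGCCGTAGTGCGCTGGATCATGTGGCTCGTATTCTTGGACCCGCCTCTACC"
for name, protein in six_frames(segment).items():
    print(name, protein)
```

Frame +1 reproduces the first translation of the slide and frame +2 contains the second one, which begins at the ATG.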
FRAMESHIFT
WHERE DOES IT START FROM?
Who knows?
      2290      2300      2310      2320      2330      2340
GCCGCCCGCTGCCGTAGTGCGCTGGATCATGTGGCTCGTATTCTTGGACCCGCCTCTACC
A  A  R  C  R  S  A  L  D  H  V  A  R  I  L  G  P  A  S  T
                            M  W  L  V  F  L  D  P  P  L  P
      2350      2360      2370      2380      2390      2400
ATCGGATTTCAAGGAGGCTAATGACGTGACGACACTGTTTGAGGCGACGACCATCCCGAT
I  G  F  Q  G  G  *
 S  D  F  K  E  A  N  D  V  T  T  L  F  E  A  T  T  I  P  I
- Identification of other elements
- Genomic context
- Experimental control
GENOMIC CONTEXT
OH
NH3+
gene
length
(aa)
function
homology
(%)
orf1
259
hypothetical conserved membrane protein,
permease?
45
pcaB
~ 450
3-carboxy-cis,cis-muconate
cycloisomerase
Putative hydrolase
40-45
orf2
359
maleylacetate reductase
45-55
OH
SO3 4-sulfocatechol
P340 II dioxygenase
COOCOO
40
319
macA
O
2
SO3Sulfanilic acid
+
SO3sulfomuconate
sulfomuconate
cycloisomerase
orf3
395
pcaH
245
pcaG
195
istB
19
putative oxidase, dehydrogenase
NAD binding domain
protocatechuate 3,4-dioxygenase beta
subunit
protocatechuate 3,4-dioxygenase
alpha subunit
80, 67, <
60
IS21 transposase, C-terminal
100
40-45
COO -
O
O
SO 3
sulfolactone
sulfolactone hydrolase
64, 61,
HSO33
COO COO
O
maleylacetate
maleylacetate reductase
pSC1/48 (7404bp)
orf1
pcaB
orf2
In a genomic locus the neighboring genes
may be functionally/metabolically linked
macA
TCA cycle
orf3
pcaH pcaG istB
Identification with MS
CODON USAGE
Codon usage is characteristic of the organism (species)
Codon usage tables, databanks
APPLICATION OF CODON USAGE FOR
IDENTIFICATION OF CODING REGIONS
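How codon usage can point to a coding frame is sketched below. The scoring rule (mean reference frequency of the codons read in a frame) is a deliberately simple assumption for illustration; real gene finders use richer statistical models and published codon usage tables. The function names are hypothetical.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def frame_score(dna: str, usage: dict[str, float], frame: int) -> float:
    """Mean reference frequency of the codons read in the given frame (0, 1 or 2)."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    if not codons:
        return 0.0
    return sum(usage.get(c, 0.0) for c in codons) / len(codons)


# Toy reference "gene" standing in for a species-specific codon usage table.
usage = codon_usage("ATGGCTGCTGCTTAA")
query = "GCTGCTGCT"
for frame in range(3):
    print(frame, round(frame_score(query, usage, frame), 2))
```

The frame whose codons resemble the reference usage scores highest; the other two frames read codons that the reference gene never uses.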
ESTABLISHMENT OF THE FUNCTION OF THE
PREDICTED GENE PRODUCT– BY ANALOGY
Comparison to the known sequences available in the databanks
Similarity search can be made at the DNA or protein level
What is a database ?
• A collection of data that is
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Associated tools (software):
access, update, insertion, deletion…
Types of Databases
Primary Databases
- Original submissions by experimentalists
- Content controlled by the submitter
- Examples: GenBank, SNP, GEO
Derivative Databases
- Built from primary data
- Content controlled by third party (NCBI)
- Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
Sequence Databases
Main nucleic acid sequence databases
• EMBL
• GenBank
• DDBJ
Main protein sequence databases
• Swiss-Prot
• also TrEMBL, GenPept
Often integrated with other databases
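International Sequence Database Collaboration
[Figure: the three partner databases exchange data with one another; each accepts submissions and updates]
- GenBank: NCBI, NIH; searched through Entrez
- EMBL: EBI; searched through SRS
- DDBJ: CIB, NIG; searched through getentry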
Integrating Sequence and
Bibliographic Databases
Entrez
• Links nucleic acid sequences, protein
sequences and MEDLINE
• Powerful and easy to use
• US-based: can be slow from Africa
SRS
• Universal system for searching sequence
and other databases
• Available worldwide
The (ever expanding) Entrez System
[Figure: Entrez as a hub linking the following databases]
OMIM, PubMed, PubMed Central, 3D Domains, Journals, Structure, Books, CDD/CDART, Protein, Taxonomy, Genome, GEO/GDS, UniSTS, UniGene, Nucleotide, SNP, PopSet
GenBank: NCBI’s Primary Sequence Database
Release 139, December 2003
• 30,968,418 records
• 36,553,368,485 nucleotides
• >140,000 species
• 138 gigabytes, 570 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
The Growth of GenBank
[Figure: sequence records (millions) and total base pairs (billions) per year, 1982-2003]
Release 139: 31.0 million records, 36.6 billion nucleotides
Average doubling time ≈ 12 months
European Bioinformatics Institute (EBI)
SEARCHING IN THE DATABANKS:
SIMILARITY - ALIGNMENT
Comparison of primary DNA or protein sequences to other primary or
secondary sequences
Expecting that the function of the similar sequence is known
from experiments !!!
Thinking by analogy
Assuming that if the sequence is similar, the function is also similar
question: what is responsible for the function?
the whole protein or its part
How many functions (activities) does a protein have?
globality - locality
Alignment - background
“For many protein sequences, evolutionary
history can be traced back 1-2 billion
years”
-William Pearson
- When we align sequences, we assume that they share a common ancestor
- They are then homologous
- Protein fold is much more conserved than protein sequence
- DNA sequences tend to be less informative than protein sequences
Aligning Sequences….
• There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.
Alignment methods
Rigorous algorithms
- Needleman-Wunsch
- Smith-Waterman
Heuristic algorithms
- BLAST
- FASTA
Pairwise comparison
Local alignment
- Identify the most similar region shared between two sequences
- Smith-Waterman
Global alignment
- Align over the length of both sequences
- Needleman-Wunsch
Global – local alignment
TEGNAP VELED VOLTAM
TEGNAP VELED MAGOLTAM VELE DALOLTAM
::::::::::::
:
:::::
TEGNAP VELED----------V-------OLTAM
Global
TEGNAP VELED MAGOLTAM VELE DALOLTAM
::::::::::::
.:::::
TEGNAP-VELED---VOLTAM--------------
TEGNAP VELED MAGOLTAM VELE DALOLTAM
::::::::::::
.:::::
TEGNAP VELED ----------------VOLTAM
TEGNAP VELED MAGOLTAM VELE DALOLTAM
::::::
:::: : .:::::
TEGNAP----------------VELE-D-VOLTAM
Local
TEGNAP VELED MAGOLTAM
::::::::::::
.:::::
TEGNAP VELED---VOLTAM
TEGNAP VELED
:::::: :::::
TEGNAP VELED
VELE DALOLTAM
:::: : .:::::
VELE-D-VOLTAM
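The local alignments above can be reproduced in spirit with a compact Smith-Waterman sketch. The scores are assumptions chosen for illustration (match +5, mismatch -4, linear gap penalty -6); the slides do not state which parameters were used, and real programs normally use separate gap opening and extension penalties.

```python
def smith_waterman(a: str, b: str, match: int = 5, mismatch: int = -4,
                   gap: int = -6) -> tuple[int, str, str]:
    """Best local alignment score and the two aligned segments (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so no cell drops below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# A sequence with three extra leading bases, as in the "GLOBAL, LOCAL" slide.
score, top, bottom = smith_waterman("AAACTCGAGACG", "CTCGAGACG")
print(score)
print(top)
print(bottom)
```

The extra leading bases are simply left out of the local alignment, whereas a global alignment would have to pay for them with gaps.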
Parameters of Sequence Alignment
Scoring Systems:
• Each symbol pairing is assigned a numerical value,
based on a symbol comparison table.
Gap Penalties:
• Opening: the cost to introduce a gap
• Extension: the cost to elongate a gap
DNA Scoring Systems
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Negative scoring values to penalize mismatches:
     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5
Matches: 5
Mismatches: 19
Score: 5 x 5 + 19 x (-4) = -51
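The arithmetic of this scoring scheme can be checked with a short sketch. It assumes an ungapped, position-by-position comparison of two equally long sequences; the function name is hypothetical.

```python
def ungapped_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> tuple[int, int, int]:
    """Return (matches, mismatches, score) for a position-by-position comparison."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length for an ungapped comparison")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    mismatches = len(a) - matches
    return matches, mismatches, matches * match + mismatches * mismatch


# The slide's numbers: 5 matches and 19 mismatches.
print(5 * 5 + 19 * (-4))
# Two 10-nucleotide sequences from the lecture that differ at three positions.
print(ungapped_score("CCTCCTTTGT", "CCTCCCTTAG"))
```

Because a mismatch costs almost as much as a match earns, even a moderately diverged pair quickly ends up with a negative total.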
Dotplots
[Figure: position-by-position scoring of the 10-nucleotide sequences CCTCCTTTGT, CCTCCTTTGG and CCTCCCTTAG (Pro and Leu codons) with the +5/-4 matrix]
Identical sequences: 5 5 5 5 5 5 5 5 5 5, Point = 50
Two mismatches: 5 5 5 5 5 -4 5 5 -4 5, Point = 32
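A dot plot itself is easy to sketch: one sequence runs along the columns, the other along the rows, and a mark is placed wherever the two agree. The text rendering and the `window` parameter below are illustrative assumptions; plotting programs add filtering and graphics on top of the same idea.

```python
def dotplot(a: str, b: str, window: int = 1) -> list[str]:
    """Text dot plot: '*' where the window of a at column j equals the window of b at row i."""
    rows = []
    for i in range(len(b) - window + 1):
        row = "".join(
            "*" if a[j:j + window] == b[i:i + window] else "."
            for j in range(len(a) - window + 1)
        )
        rows.append(row)
    return rows


# Two of the 10-nucleotide example sequences; window=3 suppresses most chance matches.
for line in dotplot("CCTCCTTTGT", "CCTCCCTTAG", window=3):
    print(line)
```

Similar regions show up as diagonal runs of marks; a larger window removes isolated marks at the price of missing short matches.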
Protein Scoring Systems
• Amino acids have different biochemical and physical properties
that influence their relative replaceability in evolution.
[Figure: Venn diagram grouping the amino acids (one-letter codes) by property: tiny, small, aliphatic, hydrophobic, aromatic, polar, charged, positive]
• Scoring matrices reflect
• probabilities of mutual substitutions
• the probability of occurrence of each amino acid.
• Widely used scoring matrices:
• PAM
• BLOSUM
Blosum62 scoring matrix
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
Basic principles of dynamic programming
- Creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking (evaluation of the optimal path)
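The three steps listed above (path matrix, stepwise scoring, backtracking) can be sketched as a small Needleman-Wunsch global alignment. The parameters are assumptions for illustration (match +5, mismatch -4, linear gap penalty -6), and when several optimal paths exist this sketch returns just one of them.

```python
def needleman_wunsch(a: str, b: str, match: int = 5, mismatch: int = -4,
                     gap: int = -6) -> tuple[int, str, str]:
    """Global alignment score and one optimal alignment (linear gap penalty)."""
    def s(x: str, y: str) -> int:
        return match if x == y else mismatch

    # 1. Alignment path matrix, with the first row and column filled by gap penalties.
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    # 2. Stepwise calculation: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # 3. Backtracking from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

Replacing the simple match/mismatch function with a lookup in a matrix such as BLOSUM62 turns the same procedure into a protein alignment.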
FastA programs:
FastA: searches for similarity between a query sequence and any group of sequences (DNA and protein).
TFastA: compares a peptide sequence against a set of nucleotide sequences.
FastX: compares a nucleotide sequence against a protein database, taking frameshifts into account.
TFastX: compares a peptide sequence against a nucleotide sequence database, taking frameshifts into account.
BLAST programs
Program   Input     Database   Frame combinations
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
What program to use for searching?
1) BLAST is fastest and easily accessed on the Web
- limited sets of databases
- nice translation tools (BLASTX, TBLASTN)
2) FASTA works best in GCG
- integrated with GCG
- precise choice of databases
- more sensitive for DNA-DNA comparisons
- FASTX and TFASTX can find similarities in sequences with frameshifts
3) Smith-Waterman is slower, but more sensitive
- known as a “rigorous” or “exhaustive” search
- SSEARCH in GCG and standalone FASTA