Introduction to Bioinformatics
LECTURE 2: GENE FINDING
Chapter 2: All the sequence's men
2.1 Human genome sweepstake
In 2003 Lee Rowen (Institute for Systems Biology,
Seattle) won GeneSweep, the betting pool for the
number of human genes
Her prize: $1200 and a signed copy of Watson’s
The Double Helix
Her guess: 25,947 genes
Total bets in the GeneSweep:
2000: $1
2001: $5
2002: $20
2003: $1200
The human genome counts 3.3 billion bp …
… but how to estimate
the number of human genes?
• 1990: estimate human genes ~300,000
• 1995: estimate human genes ~100,000
• 2000: estimate human genes ~30,000
• 2004: estimate human genes ~25,000
• 2008: estimate human genes ~22,000
• 2009: known human genes: 18,308
The pie shows that we’re now down to just 18,308 genes.
That’s over 8,000 genes fewer than six years ago.
Many sequences that once looked like full-fledged genes,
capable of generating a protein, now don’t make the
grade. Some genes turned out to be pseudogenes,
vestiges of genes that once worked but have since been
wrecked by mutations.
In other cases, DNA segments that appeared to be parts
of separate genes have turned out to be part of the same
gene.
In this lecture we will try to estimate the
number of genes in a given DNA string
First however some biology …
The human genome is stored on 23 chromosome
pairs. 22 of these are autosomal
chromosome pairs, while the remaining
pair is sex-determining.
The haploid human genome occupies a
total of just over 3 billion DNA base pairs.
The Human Genome Project produced
a reference sequence of the euchromatic
human genome, which is used worldwide
in biomedical sciences.
The haploid human genome contains an
estimated 22,000 protein-coding genes,
far fewer than had been expected before
its sequencing. In fact, only about 1.5% of the
genome codes for proteins, while the rest consists
of RNA genes, regulatory sequences, introns and
(controversially) "junk" DNA.
2.2 Genes and Proteins
CENTRAL IDEA:
• Genes code for proteins
• There are fixed codes for START and STOP
• We can use those to look for DNA words:
[ START | n × <triplet> | STOP ]
• Such DNA words are s
DRAWBACKS:
• Only candidate-genes are found
• Most of the DNA is non-coding “junk DNA” (???... )
• Where to start reading … and in what direction?
• Looooooooooooong computation times
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that
contains the genetic instructions specifying the biological
development of all cellular forms of life (and most viruses).
DNA is a long polymer of nucleotides and encodes the
sequence of the amino acid residues in proteins using the
genetic code, a triplet code of nucleotides.
DNA under electron microscope
3D model of a section of the DNA molecule
Genetic code
The genetic code is a set of rules that maps DNA sequences
to proteins in the living cell, and is employed in the process of
protein synthesis.
Nearly all living things use the same genetic code, called the
standard genetic code, although a few organisms use minor
variations of the standard code.
Fundamental code in DNA: {x(i)|i=1..N,x(i) in {C,A,T,G}}
Human: N = 3.3 billion
Replication of DNA
Genetic code: TRANSCRIPTION
DNA → RNA
Transcription is the process through which a DNA sequence is enzymatically
copied by an RNA polymerase to produce a complementary RNA. Or, in other
words, the transfer of genetic information from DNA into RNA. In the case of
protein-encoding DNA, transcription is the beginning of the process that
ultimately leads to the translation of the genetic code (via the mRNA
intermediate) into a functional peptide or protein. Transcription has some
proofreading mechanisms, but they are fewer and less effective than the
controls for DNA; therefore, transcription has a lower copying fidelity than
DNA replication.
Like DNA replication, transcription proceeds in the 5' → 3' direction (i.e. the old
polymer is read in the 3' → 5' direction and the new, complementary
fragments are generated in the 5' → 3' direction).
In RNA, thymine (T) is replaced by uracil (U).
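The T → U substitution and the 5' → 3' convention can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes upper-case DNA strings written 5' to 3' that contain only A, C, G and T.

```python
# Complement lookup for DNA bases (assumes upper-case A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the complementary strand, also written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe_coding(coding_strand):
    """mRNA has the same sequence as the coding strand, with U instead of T."""
    return coding_strand.replace("T", "U")

def transcribe_template(template_strand):
    """The polymerase reads the template 3' to 5' and builds the
    complementary RNA 5' to 3'; equivalent to transcribing the
    reverse complement of the template."""
    return reverse_complement(template_strand).replace("T", "U")

print(transcribe_coding("ATGGCT"))    # coding strand in, mRNA out
print(transcribe_template("AGCCAT"))  # template strand of the same gene
```

Both calls describe the same double-stranded fragment, so they yield the same mRNA.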
Directionality: 5' to 3' direction
Directionality, in molecular biology, refers to
the end-to-end chemical orientation of a single
strand of nucleic acid. The chemical
convention of naming carbon atoms in the
nucleotide sugar-ring numerically gives rise to
a 5' end and a 3' end (usually pronounced "five
prime end" and "three prime end").
The relative positions of structures along a
strand of nucleic acid, including genes,
transcription factors, and polymerases are
usually noted as being either upstream
(towards the 5' end) or downstream (towards
the 3' end).
The importance of having this naming convention lies in the fact that nucleic
acids can only be synthesized in vivo in a 5' to 3' direction, as the polymerase
used to assemble new strands must attach a new nucleotide to the 3' hydroxyl
(-OH) group via a phosphodiester bond. By convention, single strands of DNA
and RNA sequences are written in 5' to 3' direction.
Genetic code: TRANSLATION
RNA → protein
Genetic code: exons/introns
Genetic code: TRANSLATION
DNA-triplet → RNA-triplet = codon → amino acid
RNA codon table
There are 20 standard amino acids used in proteins;
here are some of the RNA codons that code for each amino acid.

Amino acid   RNA codons
Ala (A)      GCU, GCC, GCA, GCG
Leu (L)      UUA, UUG, CUU, CUC, CUA, CUG
Arg (R)      CGU, CGC, CGA, CGG, AGA, AGG
Lys (K)      AAA, AAG
Asn (N)      AAU, AAC
Met (M)      AUG
Asp (D)      GAU, GAC
Phe (F)      UUU, UUC
Cys (C)      UGU, UGC
Pro (P)      CCU, CCC, CCA, CCG
...
Start        AUG, GUG
Stop         UAG, UGA, UAA
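As an illustration of how such a codon table is used, here is a minimal Python sketch of translation. It assumes the standard genetic code, an mRNA string that is already in frame, and that translation simply halts at the first stop codon. The names `CODON_TABLE` and `translate` are made up for this example.

```python
# Standard genetic code, with codons enumerated in U, C, A, G order
# for each of the three positions; '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna):
    """Translate an in-frame mRNA string into one-letter amino acid codes,
    stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCUUUUUAA"))  # Met-Ala-Phe, then stop
```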
Protein Structure: primary structure
Protein Structure: secondary structure
a: Alpha-helix, b: Beta-sheet
Protein Structure: super-secondary structure
Protein Structure = protein function:
Standard Genetic Code
note: RNA ‘U’ pairs with DNA ‘A’ (in RNA, U takes the place of T)
intron - exon
2.3 Gene annotation: gene finding
• Statistical analysis (e.g. GC-content) can
identify different regions on a DNA strand
• ab initio methods (=statistical analysis)
• Markov sequence model
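A simple way to see how GC-content can separate regions is to compute it in windows along the sequence. The sketch below is only illustrative: the window and step sizes and the function names are arbitrary choices, not part of the lecture.

```python
def gc_content(seq):
    """Fraction of G and C in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def sliding_gc(seq, window, step):
    """GC-content of each full window, as (start_position, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Toy sequence: an AT-only region followed by a GC-only region.
toy = "AT" * 10 + "GC" * 10
for start, fraction in sliding_gc(toy, window=10, step=10):
    print(start, fraction)
```

A sharp jump in the windowed values suggests a change point between two regions with different composition.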
Change points in lambda phage
• Ab initio methods suffice for finding
genes in prokaryotic DNA
• For more complex eukaryotic DNA we
need sequence alignment methods and
Markov sequence models.
READING FRAMES
The DNA is translated per codon = nucleotide-triplet.
The sequence: …ACGTACGTACGTACGTACGT…
Can thus be read as:
…-ACG-TAC-GTA-CGT-ACG-TAC-GT…
or:
…A-CGT-ACG-TAC-GTA-CGT-ACG-T…
or:
…AC-GTA-CGT-ACG-TAC-GTA-CGT-…
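The three readings above can be reproduced with a short Python sketch. The helper name `split_frame` is hypothetical; it cuts one strand into triplets starting at a given offset and keeps any leftover bases at either end.

```python
def split_frame(seq, offset):
    """Split seq into triplets starting at offset (0, 1 or 2),
    joined with '-' so the codon boundaries are visible."""
    head = seq[:offset]
    codons = [seq[i:i + 3] for i in range(offset, len(seq), 3)]
    parts = ([head] if head else []) + codons
    return "-".join(parts)

sequence = "ACGTACGTACGTACGTACGT"
for offset in range(3):
    print(split_frame(sequence, offset))
```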
OPEN READING FRAMES: ORF
An open reading frame or ORF is a portion of
an organism's genome which contains a
sequence of bases that could potentially
encode a protein
In a gene, ORFs are located between the
start-code sequence (initiation codon) and
the stop-code sequence (termination codon).
As we saw, we can distinguish 3 possible
reading frames on one strand (5’ to 3’).
On the complementary strand (5’ to 3’) we can
also look for 3 possibilities, but these can
be reconstructed from the first strand.
So, we can distinguish 6 possible reading frames for ORFs
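The reconstruction of the second strand from the first can be sketched in Python as follows. The function names are illustrative, and the sketch assumes an upper-case DNA string over A, C, G, T written 5' to 3'.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Map (strand, offset) to the list of complete codons in that frame.
    '+' is the given strand, '-' its reverse complement."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = [
                s[i:i + 3] for i in range(offset, len(s) - 2, 3)
            ]
    return frames

for key, codons in six_frames("ATGAAA").items():
    print(key, codons)
```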
Introns and Exons
Algorithm 2.1: ORF-finder
Given a DNA sequence s and a positive integer k, for each
possible reading frame decompose the sequence into triplets,
and find all stretches of triplets starting with a START-codon
and ending with a STOP-codon.
Repeat for the reverse complement of s.
The output consists of all ORFs whose length is greater than or
equal to the preset threshold k.
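One possible reading of Algorithm 2.1 in Python is sketched below. Several details are assumptions rather than part of the algorithm as stated: ATG is the only START codon, ORF length is counted in codons including START and STOP, an ORF begins at the first START after the previous STOP in that frame, and positions on the "-" strand refer to the reverse complement.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(s, k):
    """Return (strand, frame, start, end, orf) for every ORF of at least
    k codons, scanning all three frames on both strands."""
    found = []
    for strand, seq in (("+", s), ("-", reverse_complement(s))):
        for frame in range(3):
            start = None  # position of the currently open START, if any
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == START:
                    start = i
                elif start is not None and codon in STOPS:
                    end = i + 3
                    if (end - start) // 3 >= k:
                        found.append((strand, frame, start, end, seq[start:end]))
                    start = None  # look for the next ORF in this frame
    return found

for orf in find_orfs("CCATGAAATTTTAGCC", k=4):
    print(orf)
```

An ORF that is still open at the end of the sequence (START without a STOP) is not reported in this sketch.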
2.4 Detecting spurious signals
• a pattern in DNA can arise from pure chance
• hypothesis testing with null-hypothesis H0
• test statistics
• p-value = probability-value
• significance level α
• type I-error (FP = False Positive) of H0
• type II-error (FN = False Negative) of H0
Hypothesis testing with H0
Computing a p-value for ORFs
• Translation table : triplet → aminoacid (AA)
• 64 possible triplets (for 20 AAs)
• 1 start-codon ATG = M = Met = Methionine
• 3 stop-codons TAA, TAG, and TGA
• a priori probability non stop-codon = 61/64
• a priori probability non stop-codon = 61/64
• P(k non-stop codons) = $(61/64)^k$
• 95% significance: p = 0.05
• $(61/64)^k \approx 0.05 \Rightarrow k \approx 62$, i.e. about 64 codons
• 99% significance: p = 0.01
• $(61/64)^k \approx 0.01 \Rightarrow k \approx 96$, i.e. about 98 codons
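For reference, the threshold follows from solving for k under the assumption that all 64 codons are equally likely and independent. The numbers are rounded:

```latex
\begin{align*}
\left(\tfrac{61}{64}\right)^{k} = \alpha
  \quad&\Longrightarrow\quad
  k = \frac{\ln \alpha}{\ln(61/64)}, \qquad \ln(61/64) \approx -0.0480 \\
\alpha = 0.05:\quad k &= \frac{-2.996}{-0.0480} \approx 62.4 \\
\alpha = 0.01:\quad k &= \frac{-4.605}{-0.0480} \approx 95.9
\end{align*}
```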
Non-uniform codon distribution
• $P_{stop} = P(TAA) + P(TAG) + P(TGA)$
• P(k non-stop codons) = $(1 - P_{stop})^k$
• For a significance level α we need k* codons
with: $(1 - P_{stop})^{k^*} = \alpha$
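A minimal Python sketch of this calculation follows. It makes one extra assumption that the slide does not state: the probability of each stop codon is estimated as the product of single-nucleotide frequencies, i.e. nucleotides are treated as independent. Estimating the three codon frequencies directly would be an equally valid choice. The function names are hypothetical.

```python
import math
from collections import Counter

STOPS = ("TAA", "TAG", "TGA")

def stop_probability(seq):
    """Estimate P_stop from the base composition of a non-empty DNA string,
    assuming independent nucleotides."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    freq = {b: counts[b] / total for b in "ACGT"}
    return sum(freq[c[0]] * freq[c[1]] * freq[c[2]] for c in STOPS)

def critical_length(p_stop, alpha):
    """Solve (1 - p_stop)^k = alpha for k (number of codons)."""
    return math.log(alpha) / math.log(1.0 - p_stop)

uniform = stop_probability("ACGT" * 25)   # equal base frequencies
at_rich = stop_probability("AAAT" * 25)   # AT-rich toy composition
print(uniform, critical_length(uniform, 0.05))
print(at_rich, critical_length(at_rich, 0.05))
```

In AT-rich sequence, stop codons are more likely by chance, so a shorter ORF already reaches the same significance level.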
Randomization tests
• Generate a string with the same statistical
properties as the original data
• Per nucleotide? per triplet? per … ?
• p-value: find the rank of the observed test
statistic in the null distribution: if the fraction of null values
at least as extreme is less than α then it is significant
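A single-nucleotide permutation test can be sketched in Python as follows. The test statistic (longest stop-free run of codons in frame 0), the number of permutations and the "+1" correction in the empirical p-value are illustrative choices, not prescribed by the lecture.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}

def longest_stop_free_run(seq):
    """Longest run of consecutive non-stop codons in reading frame 0."""
    best = run = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best

def permutation_p_value(seq, n_perm=200, seed=0):
    """Empirical p-value: share of shuffled sequences whose statistic is
    at least as large as the observed one (with +1 correction)."""
    rng = random.Random(seed)
    observed = longest_stop_free_run(seq)
    letters = list(seq)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # keeps the nucleotide composition fixed
        if longest_stop_free_run("".join(letters)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy sequence: 100 stop-free codons followed by 100 stop codons.
toy = "GCA" * 100 + "TAA" * 100
print(longest_stop_free_run(toy), permutation_p_value(toy))
```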
• Another method is bootstrapping
• No permutation but sampling with replacement
• Again: per nucleotide? per triplet? per … ?
• p-value: find the rank of the observed test statistic in
the null distribution: if the fraction of null values at least
as extreme is less than α then it is significant
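The difference from a permutation is that units are drawn with replacement, so the composition of each resampled string varies. A minimal sketch, with a hypothetical `unit` parameter to switch between per-nucleotide (1) and per-triplet (3) resampling:

```python
import random

def bootstrap_sequence(seq, unit=1, seed=0):
    """Resample seq with replacement in blocks of `unit` characters.
    Trailing characters that do not fill a block are dropped."""
    rng = random.Random(seed)
    blocks = [seq[i:i + unit] for i in range(0, len(seq) - unit + 1, unit)]
    return "".join(rng.choices(blocks, k=len(blocks)))

original = "ATGGCATAA"
print(bootstrap_sequence(original, unit=1))  # per nucleotide
print(bootstrap_sequence(original, unit=3))  # per triplet
```

The resampled strings then play the same role as the permuted ones: compute the test statistic on each to build the null distribution.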
Example:
ORF length in Mycoplasma genitalium
• original DNA sequence: 11,922 ORFs
• single-nucleotide permutation test =
multinomial distribution
• permute, search ORFs, record their length
• randomized DNA sequence: 17,367 ORFs
• H0 = randomized DNA sequence
• This approach does not identify short genes
• Smaller threshold for ORF-length:
the upper 5% of randomized DNA
• In original DNA 1520 ORFs in this upper 5%
• Many FALSE POSITIVES but still better than
the original 11,922 in the DNA of M. genitalium
• H0 = randomized DNA sequence
• Keep ORFs in original DNA that are longer
than (most) ORFs in the randomized DNA
• max(ORF-length) in random seq. = 402 bp
• in original DNA 326 ORFs longer than 402 bp
• Good estimate: M. genitalium has 470 genes
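The thresholding step can be sketched in Python as below. The ORF lengths in the example are made-up toy numbers, not the M. genitalium data, and the function names and the integer `upper_percent` parameter are assumptions for illustration. With `upper_percent=0` the threshold is the longest ORF in the randomized sequence.

```python
def null_threshold(null_lengths, upper_percent=0):
    """Length that separates the top `upper_percent` percent of ORF lengths
    in the randomized sequence from the rest (0 gives the maximum)."""
    ranked = sorted(null_lengths)
    n_top = len(ranked) * upper_percent // 100
    index = max(len(ranked) - 1 - n_top, 0)
    return ranked[index]

def orfs_above(observed_lengths, threshold):
    """Keep the ORFs of the original sequence longer than the threshold."""
    return [length for length in observed_lengths if length > threshold]

# Toy data (hypothetical lengths in bp).
randomized = list(range(1, 21))   # ORF lengths found in shuffled DNA
original = [5, 19, 20, 25, 40]    # ORF lengths found in the real DNA

strict = null_threshold(randomized)                    # maximum
lenient = null_threshold(randomized, upper_percent=5)  # upper 5%
print(strict, orfs_above(original, strict))
print(lenient, orfs_above(original, lenient))
```

The lenient threshold keeps more candidates, including shorter genes, at the price of more false positives.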
Here follows the ORF-length distribution
in the original and the randomized DNA
* Note the long tail of the real DNA! *
Example 2:
ORF length in Haemophilus influenzae
• threshold = max(ORF-length) in random seq. = 537 bp
• in original DNA 1182 ORFs longer than 537 bp
• this is about the real number of genes: 1428
Problems with multiple testing
• the α-significance (e.g. 5%) represents the
false positive rate of one single test
• If we conduct – say – 100 tests this means
that 5 false positives are expected
• therefore, if 5 significant genes were found
out of 100 tests with α = 0.05 this does not
mean anything biologically!
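The arithmetic behind this warning can be checked with a few lines of Python. The sketch assumes that every null hypothesis is true and that the tests are independent; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when all null hypotheses hold."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of one or more false positives among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(expected_false_positives(100, 0.05))
print(prob_at_least_one_false_positive(100, 0.05))
```

With 100 tests at α = 0.05, at least one false positive is almost certain, which is why five "significant" results out of 100 carry no evidence by themselves.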
• The ORFs found in this way are only
candidate genes!
• it is not clear at this point whether these
ORFs are actually translated to proteins!
• Using sequence alignment (chapter 3) the
case for a candidate gene can be tightened
END of LECTURE 2