Why teach a course in bioinformatics?

How to find a gene?*
• One way is to search for an open reading frame
(ORF).
• An ORF is a sequence of codons in DNA that
starts with a Start codon, ends with a Stop
codon, and has no other Stop codons inside.
* = inexact science
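The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: the function name `find_orfs` and the `min_codons` parameter are assumptions made for this example, it scans only the strand it is given, and it uses the three standard stop codons (taa, tag, tga).

```python
STOP_CODONS = {"taa", "tag", "tga"}
START_CODON = "atg"

def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for each ORF on the given strand.

    An ORF begins at a start codon and runs to the first in-frame stop
    codon, so it contains no other stop codon. `end` is exclusive and
    includes the stop codon; `frame` is 1, 2 or 3.
    """
    dna = dna.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # look for the next start codon
    return orfs

seq = "atgcccaagctgaatagcgtagaggggttttcatcatttgagtaa"
print(find_orfs(seq))
```

Raising `min_codons` is one simple way to discard very short ORFs, which occur by chance in any sequence.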
Each strand has 3 possible reading frames.
5' atgcccaagctgaatagcgtagaggggttttcatcatttgagtaa 3'
1  atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag taa
    M   P   K   L   N   S   V   E   G   F   S   S   F   E   *
2   tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agt
     C   P   S   *   I   A   *   R   G   F   H   H   L   S
3    gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gta
      A   Q   A   E   *   R   R   G   V   F   I   I   *   V
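The three-frame translation shown on this slide can be reproduced with a short script. This is a sketch that assumes the standard genetic code; the names `CODON_TABLE` and `translate` are chosen for this example only, and unknown codons are written as "X".

```python
from itertools import product

# Standard genetic code, listed with bases in the order t, c, a, g.
BASES = "tcag"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, frame):
    """Translate one strand in reading frame 1, 2 or 3 ('*' marks a stop)."""
    dna = dna.lower()
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

seq = "atgcccaagctgaatagcgtagaggggttttcatcatttgagtaa"
for frame in (1, 2, 3):
    print(frame, translate(seq, frame))
```

Only frame 1 reads from a start codon to a stop codon without interruption; frames 2 and 3 hit stop codons early.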
Eukaryotic Genomes
• Finding a gene is much more
difficult in eukaryotic genomes
than in prokaryotic genomes.
WHY??
Prokaryotic (bacterial) genomes:
• Are much smaller than eukaryotic genomes
E. coli = 4,639,221 bp (4.6 Mb)
Human = ~3,300 Mb
Prokaryotic (bacterial) genomes:
• Contain fewer genes:
E. coli: 4285 protein-coding genes and 122 structural RNA genes
Human: ~32,000 genes
Prokaryotic (bacterial) genomes:
Contain a small amount of noncoding DNA-
E. coli= ~ 11% (average intergenic distance = 130 bp)
Human = > 95% (there are islands, hundreds of
thousands of bp, apparently without a gene.)
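A quick calculation makes the contrast concrete. The sketch below uses only the figures quoted on these slides (genome sizes and gene counts) to estimate the average amount of DNA per gene; the variable and function names are chosen for this example, and the human numbers are the slides' approximations.

```python
# Figures taken from the slides above.
ECOLI_BP = 4_639_221
ECOLI_GENES = 4285 + 122        # protein-coding + structural RNA genes
HUMAN_BP = 3_300_000_000        # ~3,300 Mb
HUMAN_GENES = 32_000            # approximate

def bp_per_gene(genome_bp, gene_count):
    """Average number of base pairs of genome per gene."""
    return genome_bp / gene_count

ecoli_density = bp_per_gene(ECOLI_BP, ECOLI_GENES)
human_density = bp_per_gene(HUMAN_BP, HUMAN_GENES)

print(f"E. coli: about {ecoli_density:,.0f} bp per gene")
print(f"Human:   about {human_density:,.0f} bp per gene")
print(f"Ratio:   about {human_density / ecoli_density:.0f}x")
```

With these numbers a human gene sits in roughly a hundred times more DNA than an E. coli gene, which is one reason a simple ORF scan works far better on bacterial genomes.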
Eukaryotic Genomes:
• Contain massive amounts of repetitive DNA
sequences (Define).
• Human: repeat sequences comprise over
50% of the genome.
• E. coli: DNA is almost entirely unique.
What are the human repetitive
DNA sequences?
1) Simple 'stutters'
(CAGCAGCAGCAGCAGCAG . . .)
2) Pseudogenes
3) Transposable elements (> 40% of the human genome)
4) Segmental duplications (~10-300 kb)
5) Gene families (maybe a reflection of
genomic duplications)
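The first category in this list, simple 'stutters', is easy to detect by pattern matching. The sketch below uses a regular expression with a backreference to find a short unit repeated in tandem; the function name `find_stutters` and the thresholds `unit_len` and `min_copies` are assumptions for this example, and the other repeat classes in the list need far more elaborate methods.

```python
import re

def find_stutters(dna, unit_len=3, min_copies=4):
    """Find tandem repeats of a unit of `unit_len` bases.

    Returns (start, unit, copies) for every run with at least
    `min_copies` consecutive copies of the same unit.
    """
    pattern = re.compile(
        r"([acgt]{%d})\1{%d,}" % (unit_len, min_copies - 1),
        re.IGNORECASE,
    )
    return [(m.start(), m.group(1), len(m.group(0)) // unit_len)
            for m in pattern.finditer(dna)]

print(find_stutters("ttCAGCAGCAGCAGCAGCAGtt"))
```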
Shocking discovery in the
mid-1970s:
Eukaryotic genes are interrupted by
noncoding DNA!
Almost all transcripts (mRNA) are
spliced before leaving the nucleus.
Exon = genetic code
Intron = non-essential DNA??
• The mechanism
of splicing is
not well
understood.
Variable mutation rate?
• Most mutations in introns and
intergenic DNA are (apparently)
harmless
• Consequently, intron and intergenic
DNA sequences diverge much more quickly
than exons.
Shocking discovery in the
late 1990s:
• Some eukaryotic genomes have
thousands of genes that are
alternatively spliced.
• In the human genome, it is now
estimated that 35% of the genes
undergo alternative splicing
Alternative splice sites generate various
protein isoforms
Bacterial cells are different:
• Prokaryotic cells: no splicing (i.e.,
no split genes)
• Eukaryotic cells: intronless genes
are rare (the average number of introns per
human gene is 3-7, the highest is 234); the
dystrophin gene is > 2.4 Mb.
Identifying all of the human
genes
a) Is tough
b) Is easy
c) Is really tough
Making it tough:
• Pseudogenes
• Large intergenic regions
• Prevalent and long introns
• Alternative splicing
Comparison of 4 plant genomes:
8 genes in C. elegans- 5 intronic
genes:
12 of the 64 genes duplicated
between human chr. #18 and #20
Is there a gene in there?
5’
CAGACTGTAGTCGTAGTCGTGTAGTCGTATGGCCGTAGTCGTAGTCGATCGTGATTCGTAGTCGTAGTCGTAGTCGTAGTCGTAG
TCGTAGTCGAGTCGTAGTCGTAGTCGTAGTCGTAGCTGTAGTCGTAGTCGTAGTCGAGTCGTAGTCGTGTACGTGTAGTCGTAGT
CGTAGCTGTACTAGTCGTATGCGTAGTCGTAGTCGTAGCGAGTCTGAGTGTACGTCGTAGTGCTAGTTGCGTAGTCGTAGTCGTA
GTGTCGTAGTCGTGTAGCTGTAGTCGTAGTCGTAGTCGTAGTCGTGTACGTAGTGTCGTATGCGAGGCTAGTCAGGTCGTATGGC
TAGTATGCGTAGTCGAGTCGTAGTCGTAGTGTACGTCGTAGTGTCAGTCGTCAGTTGACGTACGTAGTGTCGTAGTCGTAGTCGT
AGTCGTAGTCGTAGTGAGTGTACGTTGCGTATGGCTATGTATGTGCAGTGCTGTAGTCGTAGTGCTGTAGTCAGTTGCGTAGTGA
TGTACGTGTATGCGTATGCGTAGTCTGAGTTGCTGAGTGCTAGTCTGAGTGTCGTAGTCGTAGTGCGTAGTCGTATGCGTATGCG
TATCGGATTGCGTAGTGTAGCTGTAGTCGTAGTCGTAGTGTCGTAGTCGTGTAGTCAGTCGTGTAGTAGTCGTATGACCGCGGCG
CGAGTTGGTGCGGCGGGGGCTATTTTTCGGAGCGTGTAAGGTTATTAGGTTTTTCCTATTATATGCGCTTAGCGTAGCGCGATTA
GCGTATAGCGCATTATATATGCGCCTTCTCTCTTCGAGAGATCTCAGCGTCGTAGTGTACGTCGT
CGAGGCACTGTAGTCGTAGTCGTGTAGTCGTATGGCCGTAGTCGTAGTCGATCGTGATTCGTAGTGGTAGTCGTAGTCGTAGTCG
TAGTCGTAGTCGAGTCGTAGTCGTAGTCGTAGTCGTAGCTGTAGTCGTAGTCGTAGTCGAGTCGTAGTCGTGTACGTGTAGTCGT
AGTCGTAGCTGTACTAGTCGTATGCGTAGTCGTAGTCGTAGCGAGTCTGAGTGTACGTCGTAGTGCTAGTTGCGTAGTCGTAGTC
GTAGTGTCGTAGTCGTGTAGCTGTAGTCGTAGTCGTAGTCGTAGTCGTGTACGTAGTGTCGTATGCGAGGCTAGTCAGGTCGTAT
GGCTAGTATGCGTAGTCGAGTCGTAGTCGTAGTGTACGTCGTAGTGTCAGTCGTCAGTTGACGTACGTAGTGTCGTAGTCGTAG
TCGTAGTCGTAGTCGTAGTGAGTGTACGTTGCGTATGGCTATGTATGTGCAGTGCTGTAGTCGTAGTGCTGTAGTCAGTTGCGTA
GTGATGTACGTGTATGCGTATGCGTAGTCTGAGTTGCTGAGTGCTAGTCTGAGTGTCGTAGTCGTAGTGCGTAGTCGTATGCGTA
TGCGTATCGGATTGCGTAGTGTAGCTGTAGTCGTAGTCGTAGTGTCGTAGTCGTGTAGTCAGTCGTGTAGTAGTCGTATGACCGC
GGCGCGAGTTGGTGCGGCGGGGGCTATTTTTCGGAGCGTGTAAGGTTATTAGGTTTTTCCTATTATATGCGCTTAGCGTAGCGCG
ATTAGCGTATAGCGCATTATATATGCGCCTTCTCTCTTCGAGAGATCTCAGCGTCGTAGTGTACGT
CAGACTGTAGTCGTAGTCGTGTAGTCGTATGGCCGTAGTCGTAGTCGATCGTGATTCGTAGTCGTAGTCGTAGTCGTAGTCGGGC
TTGTAGTCGAGTCGTAGTCGTAGTCGTAGTCGTAGCTGTAGTCGTAGTCGTAGTCGAGTCGTAGTCGTGTACGTGTAGTCGTAGT
CGTAGCTGTACTAGTCGTATGCGTAGTCGTAGTCGTAGCGAGTCTGAGTGTACGTCGTAGTGCTAGTTGCGTAGTCGTAGTCGTA
GTGTCGTAGTCGTGTAGCTGTAGTCGTAGTCGTAGTCGTAGTCGTGTACGTAGTGTCGTATGCGAGGCTAGTCAGGTCGTATGGC
TAGTATGCGTAGTCGAGTCGTAGTCGTAGTGTACGTCGTAGTGTCAGTCGTCAGTTGACGTACGTAGTGTCGTAGTCGTAGTCGT
AGTCGTAGTCGTAGTGAGTGTACGTTGCGTATGGCTATGTATGTGCAGTGCTGTAGTCGTAGTGCTGTAGTCAGTTGCGTAGTGA
TGTACGTGTATGCGTATGCGTAGTCTGAGTTGCTGAGTGCTAGTCTGAGTGTCGTAGTCGTAGTGCGTAGTCGTATGCGTATGCG
TATCGGATTGCGTAGTGTAGCTGTAGTCGTAGTCGTAGTGTCGTAGTCGTGTAGTCAGTCGTGTAGTAGTCGTATGACCGCGGCG
CGAGTTGGTGCGGCGGGGGCTATTTTTCGGAGCGTGTAAGGTTATTAGGTTTTTCCTATTATATGCGCTTAGCGTAGCGCGATTA
GCGTATAGCGCATTATATATGCGCCTTCTCTCTTCGAGAGATCTCAGCGTCGTAGTGTACGTCGC
3’
How to confirm the identification
of a gene?
• Possible answer- Identify the gene
by identifying its promoter.
Promoters are DNA regions that
control when genes are activated.
[Figure: promoter (in brackets) next to the coding region]
• Exons encode the information
that determines what product
will be produced.
Promoters encode the
information that determines
when the protein will be
produced.
Nucleotides of a particular gene are often
numbered:
Demonstration of a consensus
sequence.
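A consensus sequence can be derived by lining up several versions of a site and taking the most common base in each column. The sketch below does exactly that; the function name `consensus` is chosen for this example, and the five six-base sites are made-up illustrative data, not sequences from the slides.

```python
from collections import Counter

def consensus(aligned_sites):
    """Return the most common base at each position of equal-length sites.

    Ties are broken in favour of the base seen first in that column.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    result = []
    for column in zip(*aligned_sites):
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)

# Hypothetical promoter-like sites, invented for illustration.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]
print(consensus(sites))
```

No single site has to match the consensus exactly; it only summarizes which base is preferred at each position.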
Three current bioinformatic
challenges:
• 1) Verification of the data (is it correct?)
• 2) Thorough annotation of the data (includes
developing appropriate means of annotating)
• 3) Handling ever-larger chunks of data
• A dot = a promoter. Dark purple = left to
right, light purple = right to left. Overlapping
genes = green.
• Inner circle = ccw direction, outer circle = cw direction
How to find a gene?
• Look for substantial ORFs and
associated 'features'.
ORFs = open reading frames
• Two nucleic acids that are
exact complements of each
other will hybridize.
• Two nucleic acids that are
mostly complementary
(some mismatches) will
hybridize under the right
conditions.
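Complementarity between two strands can be checked directly. The sketch below computes the reverse complement of one strand and counts the positions at which a second strand of the same length fails to pair with it; the names `reverse_complement` and `mismatches` are chosen for this example. It says nothing about the temperature or salt conditions that decide whether imperfect duplexes actually form.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the strand that pairs perfectly with `dna`, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def mismatches(strand_a, strand_b):
    """Count mismatched base pairs when strand_b pairs antiparallel to strand_a.

    Both strands are written 5' to 3' and must have the same length.
    0 means the strands are exact complements.
    """
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must have the same length")
    target = reverse_complement(strand_b).upper()
    return sum(x != y for x, y in zip(strand_a.upper(), target))

print(reverse_complement("ATGC"))
print(mismatches("ATGCCG", "CGGCAT"))   # exact complements
print(mismatches("ATGCCG", "CGGAAT"))   # mostly complementary
```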
Recombinant DNA techniques?
• Many popular tools of recDNA rely on the
principle of DNA hybridization.
• In large mixes of DNA molecules,
complementary sequences will pair.
Hybridization ‘in silico’
• Algorithms have been written that will
compare two nucleic acid sequences. Two
similar DNA sequences (they would
hybridize in solution) are said ‘to match’
when software determines that they are of
significant similarity.
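The slides do not name a particular algorithm. As one classic example of such a comparison, the sketch below computes a Needleman-Wunsch global alignment score by dynamic programming. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration, and deciding whether a score is "significant" takes additional statistics that are not shown here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (Needleman-Wunsch).

    Keeps only one row of the dynamic-programming table at a time.
    """
    prev = [j * gap for j in range(len(b) + 1)]       # row for empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                              # a[:i] against empty b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap                  # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap              # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "ACT"))
```

Higher scores mean more similar sequences under the chosen scoring scheme; the second call pays for one gap.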
8/10 = 80%
Mouse ATGCCGTGCTA
      : : : : : : : :
Human ATG--CGGGCAA
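A percent-identity figure like the 8/10 = 80% above is a count of matching columns divided by the number of columns compared. The sketch below shows that arithmetic on two made-up aligned rows (they are not the mouse and human sequences from the slide); the function name `percent_identity` is chosen for this example. Dividing by the full alignment length, gaps included, is one convention among several.

```python
def percent_identity(row_a, row_b):
    """Return (matches, columns, percent) for two aligned rows.

    Rows must be the same length; '-' marks a gap and never counts
    as a match.
    """
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    matches = sum(x == y and x != "-"
                  for x, y in zip(row_a.upper(), row_b.upper()))
    return matches, len(row_a), 100.0 * matches / len(row_a)

# Hypothetical aligned rows, invented for illustration.
print(percent_identity("ATGCCGTGCA",
                       "ATG-CGTCCA"))
```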
Protein- Protein similarity
searches?
• Many algorithms have been
designed to compare strings of
amino acids (single letter amino
acid code) and find those of a
defined degree of similarity.
        60        70        80        90
#1 TSIDQLRATTSYDELRQDGSTTISYDDYSR
   :::.:::::::::::::::::::: :::::
#2 TSIEQLRATTSYDELRQDGSTTISTDDYSR
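The two peptide fragments on this slide can be compared position by position. The sketch below lists where they differ and how many residues are identical; the function name `differences` is chosen for this example, and positions are counted from the start of the fragment shown, not from the numbering ruler on the slide.

```python
def differences(seq_a, seq_b):
    """List (position, residue_a, residue_b) wherever two equal-length
    sequences differ. Positions are 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
            if a != b]

seq1 = "TSIDQLRATTSYDELRQDGSTTISYDDYSR"
seq2 = "TSIEQLRATTSYDELRQDGSTTISTDDYSR"

diffs = differences(seq1, seq2)
identical = len(seq1) - len(diffs)
print(diffs)
print(f"{identical}/{len(seq1)} identical")
```

Real protein comparison tools go further and score each substitution with a matrix, so that replacing one residue with a chemically similar one costs less than an unrelated change.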
Significance of sequence
similarity
• DNA similarity suggests:
• Similar function
• Similar structure
• Evolutionary relationship
The End