Genome Sequences
Ka-Lok Ng
Asia University
History of genome sequencing
• 1995: Craig Venter's group at The Institute for Genomic Research (TIGR) in Maryland reported the complete DNA sequence of the bacterium Haemophilus influenzae
• The first viral genome sequence (phage phiX174) was produced by Fred Sanger's group in 1977
• Insulin (A and B chains): the first amino acid sequence to be determined, in 1951 by F. Sanger (Cambridge University)
• Sanger was awarded two Nobel Prizes, both in chemistry: the first in 1958 for the structure of insulin, and the second in 1980 for developing DNA sequencing techniques (shared with Paul Berg and Walter Gilbert)
Genome sequencing up to year 2001
http://www.biochem.arizona.edu/classes/bioc471/pages/Lecture7/Lecture7.html
Timeline of genome sequencing
http://www.biochem.arizona.edu/classes/bioc471/pages/Lecture7/Lecture7.html
First draft of human genome
F. Collins and C. Venter
Biological sequence space
• DNA sequence
– a sequence of symbols from the alphabet A, T, C, and G
– IUPAC notation
– R denotes A or G
– Y denotes C or T
– - denotes a gap
• RNA sequence
– a sequence of symbols from the alphabet A, U, C, and G
– IUPAC notation
– R denotes A or G
– Y denotes C or U
– - denotes a gap
• Protein sequence
– a sequence of symbols from an alphabet of 20 letters (all letters of the English alphabet except B, J, O, U, X, and Z)
RNA secondary structure
Biological sequence space
• It is convenient to model a biological sequence as a one-dimensional (1D) object
• Strictly speaking, this is also incorrect: it neglects all the information that might be contained in the 3D structure of the molecule
• We make this approximation in this course
Building blocks of DNA sequences
• Backbone
• Pyrimidines – single ring
–Thymine
–Cytosine
• Purines – double ring
–Adenine
–Guanine
• Complementary pairs: (A,T), (C,G)
Building blocks of protein sequences
N-terminus, C-terminus
(protein sequences are read from N to C)
peptide bond O=C–N–H, alpha carbon, the R group
Central dogma of molecular biology
More with coding DNA
DNA is double-stranded, so there are a total of 6 possible reading frames (3 on each strand) in which to look for an open reading frame (ORF)
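The six frames can be listed with a few lines of Python. This is only a minimal sketch: the function names `reverse_complement` and `six_frames` and the toy sequence are illustrative assumptions, not part of the course material, and the input is assumed to be an upper-case string over A, C, G, T.

```python
# Complement table for the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a DNA string into codons for each of the six reading frames.

    Frames +1..+3 are read from the given strand, frames -1..-3 from its
    reverse complement. Trailing bases that do not fill a codon are dropped.
    """
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    # Hypothetical toy sequence, chosen only for illustration.
    for name, codons in six_frames("ATGGCCTAA").items():
        print(name, codons)
```

Each frame could then be scanned for a start codon followed by an in-frame stop codon to locate candidate ORFs.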
Codon translation
Alternative splicing
Genome sequences
• Prokaryotic genomes
– Eubacteria and archaea are the two major groups of prokaryotes, organisms without nuclei
– Generally have a single, circular genome between 0.5 and 13 Mbp long
– Simple genes and genetic control seqs.
• Viral genomes
– Not free-living organisms
– Can be either single- or double-stranded, and either DNA or RNA, that is ssDNA, ssRNA, dsDNA or dsRNA
– HIV, SARS
• Eukaryotic genomes
– Ranging in size from 8 Mbp for some fungi to 670 Gbp
– Human genome is about 3 Gbp long
– Baker's yeast, worm, zebrafish, fruit fly, mosquito; mammals such as human and mouse; and plants such as rice
• Organellar genomes
– Mitochondrion (mtDNA) and chloroplast genomes
– Only tens or hundreds of thousands of bases long, circular, and contain a few essential genes
Working with whole Genomes
Below is a circular representation of the E. coli genome.
DNA and Protein Sequences Databases
NCBI
http://www.ncbi.nlm.nih.gov/
EMBL
http://www.ebi.ac.uk/services/
DDBJ
http://www.ddbj.nig.ac.jp/
Protein Sequence Databases
NCBI → Molecular databases → http://www.ncbi.nlm.nih.gov/Database/
RefSeq
UniProt http://www.pir.uniprot.org/
UniProt = Swiss-Prot + TrEMBL + PIR-PSD
UniProt = UniProt Archive (UniParc) + UniProt Knowledgebase (UniProtKB) + UniProt nonredundant reference database (UniRef)
ExPasy http://us.expasy.org/
PIR http://www-nbrf.georgetown.edu/
The Entrez system
• Redundancy in GenBank
• Many different GenBank entries are relevant to a specific gene, especially for human, E. coli, yeast, and fruit fly
• 4 entries encompass the same E. coli dUTPase gene

GenBank entry    Size
X01714           1609
V01578           2568
L10328           136254
AE000441         10562
Entrez Gene
• Example: MEN1 AND human[ORGN]
• where ORGN = organism
Entrez Gene
• Read the summary
• Official Symbol
• Gene type
• Gene name
• Gene description
• RefSeq status
• Organism
• Lineage
• Gene aliases
• Summary
• Reference
• Protein-protein interaction
FASTA format
Batch Entrez Gene
• NCBI → site map → Batch Entrez
Batch Entrez Gene
• Retrieve information on multiple sequences at one time
• Prepare a text file of UniProt sequence IDs and upload it (use database = protein), for example:
Q9XX00
Q8MQ56
Q9XWS4
Q9XU77
Q9XWH5
Q9N2K7
Eukaryotic entry
example: AF018430
Use CoreNucleotide to search for the seq.
Retrieving GenBank entries without accession number
• Entrez query: human[organism] AND dUTPase[protein name]
• AND must be in capital letters!
Whole Genome DB
• NCBI home page → Genome Biology → Entrez Genome → Viral genome DB, Microbial genome, etc.
Microbial genome – TIGR
• http://www.tigr.org/tdb/
• Comprehensive Microbial Resource (CMR)
Genome databases
• allow you to browse genomes from the chromosome level down to a single gene, an individual exon, or a nucleotide.
• Ensembl database
• http://www.ensembl.org
• UCSC database
• http://genome.ucsc.edu
Microbial Database : GOLD
• http://www.genomesonline.org
Statistical analysis of biological sequences
• Look for structure in biological sequences: DNA, RNA or protein seqs.
• Assume one starts from the 1D structure
• Taking DNA as an example, if the sequence were random one would expect the nucleotides A, T, C and G to appear with equal frequency: %A = %T = %C = %G = 25%
• In actual DNA seqs., this is not true!
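A quick way to see the departure from 25% is to tabulate the base composition directly. The sketch below is illustrative only: `base_composition` is a hypothetical helper name, the example string is a toy sequence rather than real genome data, and ambiguous symbols (anything other than A, C, G, T) are simply ignored.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T among the unambiguous bases."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


if __name__ == "__main__":
    # Toy sequence (assumption); a real analysis would read a FASTA file.
    for base, fraction in base_composition("AACGTTAATT").items():
        print(f"%{base} = {100 * fraction:.1f}%")
```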
Statistical analysis of DNA sequences
• Study the base composition
• GC content
• Frequent or rare words – words of length k
• Biological relevance of unusual words (motifs)
Counting words in DNA seqs.
http://www.genomatix.de/cgi-bin/tools/tools.pl create seq. statistics
Counting words in DNA seqs.
• NCBI → Genome (complete genome sequences) → microbial → Haemophilus influenzae Rd KW20, NC_000907.1 (TIGR, dated 1995) → Link: RefSeq FTP or GenBank FTP (L42023.fna)
Counting words in Haemophilus influenzae genome
Total number of bp
GC content agrees with the NCBI record
Counting words in Haemophilus influenzae genome
• (%A) on strand + = (%T) on strand −
• (%C) on strand + = (%G) on strand −
• ... and likewise for the remaining bases
• This follows from the complementary principle, i.e. A-T and C-G
Counting words in Haemophilus influenzae genome
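• A sequence of length L contains L − k + 1 overlapping words of length k; use L − k + 1 as the total when computing percentages
• Percentage of dinucleotides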
Counting words in Haemophilus influenzae genome
• Nucleotide words of length 2 (called dimers) or higher (trimers, k-mers)
• Words of length k are called k-grams or k-tuples in computer science, or k-mers in biological science
Frequency of 3-mers
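Word counting with the L − k + 1 rule can be sketched as follows. The helper names `count_kmers` and `kmer_percentages` and the toy input are assumptions made for illustration; words are counted with overlap, on one strand only.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count overlapping words of length k (there are L - k + 1 of them)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_percentages(seq, k):
    """Return each observed k-mer as a percentage of all L - k + 1 words."""
    counts = count_kmers(seq, k)
    total = sum(counts.values())  # equals L - k + 1 whenever L >= k
    return {word: 100.0 * n / total for word, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence (assumption), not the H. influenzae genome.
    seq = "ATATATGC"
    print(count_kmers(seq, 2))
    print(kmer_percentages(seq, 3))
```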
Finding unusual DNA words
• A simple statistical analysis can be used to find under- and over-represented motifs (i.e. k-mers)
• It helps us decide when an observed bias is significant
For the case of 2-mers
• Compare the observed frequency of a 2-mer with the one expected under a background model, typically a multinomial model. The ratio between the two quantities indicates how much a certain word deviates from the background model and is called the odds ratio:
$$ r_{xy} = \frac{N(xy)}{N(x)\,N(y)} $$
where N(xy) is the frequency of the dinucleotide xy, and N(x) and N(y) denote the frequencies of the nucleotides x and y respectively.
r_xy > 1 or r_xy < 1 → the dinucleotide xy is considered to have a high or low relative abundance compared with a random seq.
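The odds ratio for all 16 dinucleotides can be computed in one pass. This is a hedged sketch, not the course's own implementation: `dinucleotide_odds_ratios` is a hypothetical name, N(.) is taken to be a relative frequency, and dinucleotides containing a base that never occurs in the input are skipped because the ratio is undefined for them.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return r_xy = N(xy) / (N(x) * N(y)) for every dinucleotide xy.

    N(.) are relative frequencies. Dinucleotides containing a base that
    is absent from seq are left out (expected frequency is zero).
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two bases")
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    n_pairs = n - 1  # L - k + 1 with k = 2
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = base_freq[x] * base_freq[y]
            if expected > 0:
                ratios[x + y] = (pair_counts[x + y] / n_pairs) / expected
    return ratios


if __name__ == "__main__":
    # Toy sequence (assumption) used only to exercise the function.
    for word, r in sorted(dinucleotide_odds_ratios("AACCGGTTACGT").items()):
        print(word, round(r, 3))
```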
Finding unusual DNA words
AA and TA seem to be unusual
• Clearly, dimers whose odds ratio deviates from 1 are unusually represented, although the amount of deviation needed to consider this a significant pattern needs to be analyzed with the tools discussed later in this course.
• The dimer GG looks extremely infrequent in the frequency table, but this analysis reveals that this is not likely to be a significant bias, because the nucleotide G is low in frequency to begin with.
Finding unusual DNA words
• The odds ratio can be generalized to k-mers
• For k-mers there are 4^k (4 to the k-th power) possible different patterns
$$ r_{k\text{-mer}} = \frac{N(k\text{-mer})}{N(1)\,N(2)\cdots N(k)} $$
where N(i) is the frequency of the nucleotide at position i of the k-mer.
Frequent words in H. influenzae: the words AAAGTGCGGT and ACCGCACTTT both appear more than 500 times.
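The k-mer odds ratio can be sketched in the same way as the 2-mer case. The names `kmer_odds_ratio` and `reverse_complement` are illustrative assumptions, and frequencies are taken as relative frequencies on a single strand. The usage lines also check that the two frequent words quoted above are reverse complements of each other, i.e. the same site read from opposite strands.

```python
from collections import Counter
from math import prod

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_odds_ratio(seq, word):
    """Observed frequency of word divided by the product of the
    frequencies of its individual nucleotides (multinomial background)."""
    n, k = len(seq), len(word)
    if n < k:
        raise ValueError("sequence shorter than word")
    n_words = n - k + 1
    observed = sum(seq[i:i + k] == word for i in range(n_words)) / n_words
    base_count = Counter(seq)
    expected = prod(base_count[b] / n for b in word)
    if expected == 0:
        raise ValueError("word contains a base absent from the sequence")
    return observed / expected


if __name__ == "__main__":
    # Toy sequence (assumption): ACG repeated three times.
    print(kmer_odds_ratio("ACGACGACG", "ACG"))
    # The two frequent 10-mers are reverse complements of each other.
    print(reverse_complement("AAAGTGCGGT") == "ACCGCACTTT")
```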
Biological relevance of unusual motifs
• Frequent words may be due to repetitive elements
• Rare motifs include binding sites for transcription factors
• Rare motifs also include words such as CTAG that have undesirable structural properties, because they lead to "kinking" of the DNA
Virus vs. Bacteria
• Other rare words are those that are not compatible with the internal immune system of a bacterium. Bacterial cells can be infected by viruses, and in response they produce restriction enzymes, proteins that are capable of cutting DNA at specific nucleotide words, known as restriction sites. The nucleotide motifs recognized by restriction enzymes are under-represented in many viral genomes, so as to avoid the bacterial hosts' restriction enzymes.
Analyzing DNA seq.
http://bioweb.pasteur.fr/intro-uk.html#dna
Analyzing DNA seq. GC composition
• Calculate the fractional GC content of nucleic acid sequences
• C+G content: a G≡C base pair is held together by three hydrogen bonds (an A=T pair by two)
• GEECEE http://bioweb.pasteur.fr/seqanal/interfaces/geecee.html
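The fractional GC content is simple enough to compute by hand. The function below is a minimal sketch and is not the GEECEE program itself: `gc_content` is a hypothetical name, and symbols other than A, C, G, T are excluded from both numerator and denominator.

```python
def gc_content(seq):
    """Return (G + C) / (A + C + G + T) for a nucleic acid string."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / (gc + at)


if __name__ == "__main__":
    # Toy input (assumption), chosen only to exercise the function.
    print(f"GC fraction: {gc_content('GGCTTTTTCC'):.2f}")
```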
Counting long words in DNA seqs.
• http://bioweb.pasteur.fr/intro-uk.html
• Use AK003076
>gi|12833508|dbj|AK003076.1| Mus musculus adult male spleen cDNA, RIKEN full-length enriched library, clone:0910001I10 product:DUTPASE homolog [Mus musculus], full insert sequence
GGCTTTTTCCACGCCCGCCGCCATGCCCTGCTCGGAAGATGCCGCGGCCGTCT
CTGCCTCCAAGAGGGCT
CGAGCGGAGGATGGCGCTTCTCTGCGCTTCGTGCGGCTCTCGGAGCACGCCAC
GGCGCCCACCCGCGGGT
CCGCGCGCGCTGCCGGCTACGACCTATTCAGTGCCTATGATTATACAATATCAC
CCATGGAGAAAGCCAT
CGTGAAGACAGACATTCAGATAGCTGTCCCTTCTGGGTGCTATGGAAGAGTAGC
TCCACGTTCTGGCTTG
GCTGTAAAGCACTTCATAGATGTAGGAGCTGGTGTCATAGACGAGGATTACAGA
GGAAACGTTGGGGTCG
TGCTGTTTAACTTTGGGAAAGAGAAGTTTGAAGTGAAAAAAGGTGATCGGATTGC
GCAGCTCATCTGTGA
GCGGATTTCTTATCCAGACTTAGAGGAAGTGCAGACCCTGGATGACACCGAGAG
AGGCTCAGGAGGCTTC
GGCTCCACCGGGAAGAATTAGAACTTTGCTGGAAGTATCTCGCTGTTTCAACACT
GGAAACCAGAAGCTC
TAACTTCGGAAGCATTTGGTGTTCTAGGATGCAGGAAAGGAGACCTCGATCACAT
CACGTTGGAACGATT
CTGTTCCCTGGTTGAGGTCGCCTGTAAGTCTGCACTGTGAGCATGGCATTGACA
TGCAGACTTGGTAAAA
CCCAGGGTACAGTTAGATTTTTTGTTGTTGTTGTATTATTTAAATTATAGCCTTCCA
AAAACTGTTTTTG
ATCATAATTGCTGTATCATTTGTAATTTTTTTTAATCCAATAAAGTTGCTTTTAGC
Analyzing DNA seq. composition
Unusual words in different organisms or chromosomes
• The measure r_xy is suitable for a single seq.
• When comparing seqs. from different organisms or chromosomes → account for the complementary anti-parallel structure of DNA → modify r_xy
• Reference: Burge, Campbell and Karlin (1992), PNAS, 89, 1358
Double helix
Sa = 5’-ATCG....-3’
Sb = 5’-CAGT….-3’
SaI = 3’-TAGC….-5’
SbI = 3’-GTCA….-5’
• Let I = inverted complementary seq.,
• X = A, T, C, G
• a, b = species
• faX = freq. of X for species a
Observation
• Chargaff's rule → double strands → total number of A = total number of T, and total number of C = total number of G
Unusual words in different organisms or chromosomes
• Question: how do we compare f_aX and f_bX?
• We need to consider the union of S and S^I
• Why? Consider the case in which one seq. has lots of A and the other has lots of T → the second in fact has lots of A in its complementary seq.!
Sa = 5’-AAAACGT....-3’
Sb = 5’-TTTTCGA….-3’
SaI = 3’-TTTTGCA….-5’
SbI = 3’-AAAAGCT….-5’
• Need to symmetrize the nucleotide frequencies, taking the complementary seq. into account
• I(X) = inverted complement of X
• Define S* = S + S^I → f*_X = (f_X + f_I(X))/2
• * means the union, that is, count the freq. of X in both strands and take the average
$$ f_A^* = \frac{f_A + f_{I(A)}}{2} = \frac{f_A + f_T}{2} = \frac{f_T + f_A}{2} = \frac{f_T + f_{I(T)}}{2} = f_T^* $$
$$ f_A^* = f_T^* = f_{I(A)}^* $$
similarly
$$ f_C^* = f_G^* $$
• Work with a single DNA strand only; no need to find out the complementary seq.
• Compare the double-strand quantity f*, that is, compare f*_aX and f*_bX
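The symmetrized frequencies need only the single-strand counts, as this short sketch shows. `symmetrized_base_freq` is a hypothetical helper name; the two inputs are the visible prefixes of the A-rich and T-rich example strands Sa and Sb, which end up with identical f* values.

```python
COMPLEMENT_OF = {"A": "T", "C": "G", "G": "C", "T": "A"}


def symmetrized_base_freq(seq):
    """Return f*_X = (f_X + f_I(X)) / 2 for X in A, C, G, T.

    Only the given strand is counted; the frequency of X on the
    complementary strand equals the frequency of I(X) on this one.
    """
    seq = seq.upper()
    n = len(seq)
    f = {b: seq.count(b) / n for b in "ACGT"}
    return {b: (f[b] + f[COMPLEMENT_OF[b]]) / 2 for b in "ACGT"}


if __name__ == "__main__":
    s_a = "AAAACGT"  # A-rich strand (prefix shown in the example)
    s_b = "TTTTCGA"  # T-rich strand (prefix shown in the example)
    print(symmetrized_base_freq(s_a))
    print(symmetrized_base_freq(s_b))
```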
Unusual words in different organisms or chromosomes
How about counting the frequency of 2-mers?
$$ f_{GT}^* = \frac{f_{GT} + f_{I(GT)}}{2} = \frac{f_{GT} + f_{AC}}{2} = \frac{f_{AC} + f_{GT}}{2} = \frac{f_{AC} + f_{I(AC)}}{2} = f_{AC}^* $$
In general,
$$ f_{XY}^* = f_{I(XY)}^* = \frac{f_{XY} + f_{I(XY)}}{2} $$
where I(XY) = inverted complement of XY
Unusual words in different organisms or chromosomes
How about the odds ratio for 2-mers?
$$ r_{GT}^* = \frac{f_{GT}^*}{f_G^*\, f_T^*} = \frac{2\,(f_{GT} + f_{AC})}{(f_G + f_C)(f_T + f_A)} $$
and similarly for the other 2-mers. You can prove that r*_GT = r*_AC; in general,
$$ r_{XY}^* = r_{I(XY)}^* $$
A conservative estimate: odds ratios below 0.78 are considered low, and odds ratios above 1.22 are considered high.
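The symmetrized odds ratio and the 0.78 / 1.22 cut-offs can be put together as below. This is a sketch under stated assumptions: the names `symmetrized_odds_ratios` and `classify` are hypothetical, frequencies are relative frequencies from one strand, and dinucleotides whose denominator is zero are skipped.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return I(word), the inverted complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def symmetrized_odds_ratios(seq):
    """Return r*_XY = 2 (f_XY + f_I(XY)) / ((f_X + f_I(X)) (f_Y + f_I(Y)))."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two bases")
    f1 = {b: seq.count(b) / n for b in "ACGT"}
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    f2 = {x + y: pairs[x + y] / (n - 1) for x in "ACGT" for y in "ACGT"}
    ratios = {}
    for xy, f_xy in f2.items():
        x, y = xy
        denom = ((f1[x] + f1[reverse_complement(x)])
                 * (f1[y] + f1[reverse_complement(y)]))
        if denom > 0:
            ratios[xy] = 2 * (f_xy + f2[reverse_complement(xy)]) / denom
    return ratios


def classify(ratio, low=0.78, high=1.22):
    """Label a ratio using the conservative cut-offs quoted above."""
    if ratio < low:
        return "low"
    if ratio > high:
        return "high"
    return "normal"


if __name__ == "__main__":
    # Toy sequence (assumption) used only to exercise the functions.
    for word, r in sorted(symmetrized_odds_ratios("AACCGGTTACGTTGCA").items()):
        print(word, round(r, 3), classify(r))
```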
Unusual words in different organisms or chromosomes
How about the odds ratio for 3-mers?
$$ r_{XYZ}^* = \frac{f_{XYZ}^*\, f_X^*\, f_Y^*\, f_Z^*}{f_{XY}^*\, f_{YZ}^*\, f_{XNZ}^*} $$
where N is any nucleotide, and
$$ f_{XYZ}^* = \frac{f_{XYZ} + f_{I(XYZ)}}{2} $$
Compare statistical properties (1-mer and 2-mers) of human and chimp
complete mitochondrial DNA
NC_001807 and NC_001643
         Human     Chimp
A (%)    30.86%    31.13%
C (%)    31.33%    30.80%
G (%)    13.16%    12.89%
T (%)    24.66%    25.18%

Human
$$ f_A^* = \frac{f_A + f_{I(A)}}{2} = \frac{30.86 + 24.66}{2} = 27.76\% $$
$$ f_C^* = \frac{f_C + f_{I(C)}}{2} = \frac{31.33 + 13.16}{2} = 22.245\% $$
Chimp
$$ f_A^* = \frac{f_A + f_{I(A)}}{2} = \frac{31.13 + 25.18}{2} = 28.155\% $$
$$ f_C^* = \frac{f_C + f_{I(C)}}{2} = \frac{30.80 + 12.89}{2} = 21.845\% $$
Both species have similar f*_X
Compare statistical properties (1-mer and 2-mers) of human and chimp
complete mitochondrial DNA
Left table:
               second nucleotide
first nucl.    A         C         G         T
A              0.0962    0.0902    0.0483    0.0738
C              0.0927    0.1074    0.0265    0.0868
G              0.0371    0.0432    0.0258    0.0254
T              0.0826    0.0725    0.0309    0.0606

Right table:
               second nucleotide
first nucl.    A         C         G         T
A              1.0042    0.8812    1.1750    1.0293
C              0.9664    1.1537    0.5742    1.0861
G              0.9819    1.0328    1.5773    0.7321
T              1.0352    0.9857    0.9296    1.0019

$$ r_{XY}^* = r_{I(XY)}^* $$
(panel labels on the slide: Human, Chimp)
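4 × 4 = 16 entries, symmetric under XY → I(XY) → only 10 distinct numbers need to be computed, not 16! The 12 dimers that differ from their inverted complement form 6 pairs, while AT, TA, CG and GC are each their own inverted complement.
Symmetric table (equal numbers mark entries that share the same value):
               second nucleotide
first nucl.    A     C     G     T
A              1     2     3     4
C              5     6     7     3
G              8     9     6     2
T              10    8     5     1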
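One way to check how many distinct symmetrized values are needed is to group all k-mers with their inverted complements and count the groups. The sketch below does this by brute force; `rc_classes` is a hypothetical helper name, and each class is keyed by the alphabetically smaller of a word and its reverse complement.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return I(word), the inverted complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def rc_classes(k):
    """Group all 4**k words of length k with their reverse complements."""
    classes = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        key = min(word, reverse_complement(word))
        classes.setdefault(key, []).append(word)
    return classes


if __name__ == "__main__":
    for key, members in rc_classes(2).items():
        print(key, members)
    print("distinct classes for k = 2:", len(rc_classes(2)))
```

Words that equal their own inverted complement (AT, TA, CG, GC for k = 2) form classes of size one.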
Compare statistical properties (1-mer and 2-mers) of
human and chimp complete mitochondrial DNA
$$ r_{XY}^* = \frac{f_{XY}^*}{f_X^*\, f_Y^*} = \frac{2\,(f_{XY} + f_{I(XY)})}{(f_X + f_{I(X)})(f_Y + f_{I(Y)})} $$
Writing each frequency as f = p/N:
$$ r_{XY}^* = \frac{2\,(p_{XY} + p_{I(XY)})/N}{[(p_X + p_{I(X)})(p_Y + p_{I(Y)})]/N^2} = \frac{2N\,(p_{XY} + p_{I(XY)})}{(p_X + p_{I(X)})(p_Y + p_{I(Y)})} $$
where p denotes the percentage of the word (so N = 100 when p is given in percent)
See my human and chimp k-mers Excel file
Linguistic study of DNA sequences
• Do genomic sequences bear any resemblance to a natural language? → open question!
– Coding regions
• Bacteria: no introns
• Archaea: some introns, TATA boxes
• Eukarya: many introns and exons, TATA boxes
– Noncoding regions
• Pseudogenes
• Repetitive sequences
– Mini-satellites
– Micro-satellites
– Alphabets, words, sentences
– Coding regions → words
– Non-coding regions → ?
How to obtain inverted complementary seq. ?
• Prepare a FASTA format file
• Biological software web site → http://bioweb.pasteur.fr/intro-uk.html#dna → seq. tools → EMBOSS → program name: revseq → Advanced revseq form → output file: outseq.out
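The same result can be obtained locally without the web form. The sketch below is not the EMBOSS revseq program; the names `reverse_complement` and `revcomp_fasta` are assumptions, only a single-record FASTA string is handled, and besides A, C, G, T the IUPAC symbols R, Y and N are mapped (R ↔ Y, N → N).

```python
# Complement table: A<->T, C<->G, R<->Y, N->N, in both letter cases.
COMPLEMENT = str.maketrans("ACGTRYNacgtryn", "TGCAYRNtgcayrn")


def reverse_complement(seq):
    """Return the inverted complementary sequence, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_fasta(text, width=60):
    """Reverse-complement a single-record FASTA string.

    The header line is kept and marked; the sequence is re-wrapped to
    `width` characters per line.
    """
    lines = text.strip().splitlines()
    header = lines[0]
    seq = "".join(line.strip() for line in lines[1:])
    rc = reverse_complement(seq)
    wrapped = [rc[i:i + width] for i in range(0, len(rc), width)]
    return "\n".join([header + " (reverse complement)"] + wrapped)


if __name__ == "__main__":
    # Hypothetical two-line record, used only for illustration.
    print(revcomp_fasta(">example\nATCG\nCAGT\n"))
```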
GC content
Factors contributing to the variation of GC content
1. Environmental temperature
2. Levels of methylation
3. Recent transposon activity (DNA jumps around)
• Over stretches of hundreds of kb, GC content
should vary by <1% as a result of random
sampling
• But most genomes show a bias ranging over as
much as 30% !
GC content
Figure. Distribution of GC content along human chromosome 1. GC content varies between 20% and 65% at several different levels of resolution, including for the entire 220 Mb of chromosome 1 averaged over 1-Mb windows (top) and within just 1 Mb for 200-bp windows (bottom). A gap in the IHGSC seq. can be seen at the 400-kb mark on the 1-Mb scale.
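A windowed GC profile like the one in the figure can be sketched as follows. `gc_windows` is a hypothetical helper; windows are non-overlapping unless a smaller `step` is given, a trailing partial window is dropped, and every character in a window counts toward its length (gaps and ambiguous symbols are not treated specially here).

```python
def gc_windows(seq, window, step=None):
    """Return (start, GC fraction) for successive windows along seq."""
    seq = seq.upper()
    step = step or window
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        profile.append((start, gc / window))
    return profile


if __name__ == "__main__":
    # Toy sequence with 4-bp windows (assumption); the figure uses
    # 1-Mb and 200-bp windows on real chromosome data.
    for start, fraction in gc_windows("GGGGAAAACCCCATGC", 4):
        print(start, f"{100 * fraction:.0f}%")
```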
GC content
• Karyotypic bands revealed by nuclear dyes such as Giemsa tend to correlate with GC content (dark bands being more AT-rich), possibly reflecting their propensity to coil into superstructure, but clearly other features of the DNA contribute to chromatin assembly.
• A chromosome is about 2 to 3 cm long
• The 46 chromosomes (over 1 m long in total) are packed inside a nucleus with a size of 0.001 cm! Amazing!!
• CpG dinucleotides are underrepresented in mammalian genomes overall, but cluster as CpG islands between 0.5 and 2 kb in length that are significantly enriched just upstream of genes.
hsa-mir-639
UCSC database http://genome.ucsc.edu
Finding internal repeats in DNA seqs.
• Tandem repeats, inverted repeats
• Repeats are often involved in genome rearrangements or in regulatory mechanisms of gene expression
• The results of repeat-finding tools depend on the scoring system and ranking used
• Dot-plot approach
http://arbl.cvmbs.colostate.edu/molkit/
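A self-comparison dot plot can be sketched with exact word matches. This is a simplified illustration, not the method used by the molkit site: `dot_plot` and `render` are hypothetical names, matches are exact words of length `word` with no scoring, and the output is a plain text grid.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return sorted (i, j) pairs where seq_a[i:i+word] == seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    hits = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            hits.append((i, j))
    return sorted(hits)


def render(hits, n_rows, n_cols):
    """Draw the hits as a text grid: '*' for a match, '.' otherwise."""
    grid = [["."] * n_cols for _ in range(n_rows)]
    for i, j in hits:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    seq = "ACGACG"  # toy tandem repeat (assumption)
    hits = dot_plot(seq, seq, word=3)
    size = len(seq) - 3 + 1
    print(render(hits, size, size))
```

In a self-comparison the main diagonal is always filled; tandem repeats show up as additional dots or diagonals parallel to it. Comparing a sequence against its reverse complement in the same way would highlight inverted repeats.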
Finding internal repeats in DNA seqs.
TF sequence
Transcription factor, TFIIIA for X.laevis, K02938
>gi|214818|gb|K02938.1|XELTFIIIA X.laevis 5S RNA gene transcription factor (TFIIIA) mRNA, complete cds
GAATTCCGGAAGCCGAGGGCTGTTCAGTTGCTGAAGGAGAGATGGGAGAGAAGGCGCTGCCGGTGGTGTATAAGCGGTACATCTGCTCTT
TCGCCGACTGCGGCGCTGCTTATAACAAGAACTGGAAACTGCAGGCGCATCTGTGCAAACACACAGGAGAGAAACCATTTCCATGTAAGGAAG
AAGGATGTGAGAAAGGCTTTACCTCGCTTCATCACTTAACCCGCCACTCACTCACTCATACTGGCGAGAAAAACTTCACATGTGACTCGGATGG
ATGTGACTTGAGATTTACTACAAAGGCAAACATGAAGAAGCACTTTAACAGATTCCATAACATCAAGATCTGCGTCTATGTGTGCCATTTTGAGA
ACTGTGGCAAAGCATTCAAGAAACACAATCAATTAAAGGTTCATCAGTTCAGTCACACACAGCAGCTGCCATACGAATGTCCTCATGAAGGCTG
TGACAAGCGGTTTTCTTTGCCTTCCCGTTTAAAACGTCATGAAAAAGTCCATGCAGGCTATCCCTGCAAAAAGGATGATTCTTGCTCATTTGTGG
GAAAGACTTGGACATTATACTTGAAACACGTGGCAGAATGCCATCAGGACCTAGCAGTATGTGATGTGTGTAATCGAAAATTCAGGCACAAAGA
TTACTTGAGGGATCATCAGAAAACTCACGAAAAAGAGCGAACTGTGTATCTCTGCCCTCGAGATGGCTGTGACCGCTCCTATACCACTGCATTC
AATCTTAGAAGCCATATACAATCATTTCATGAGGAACAGAGACCTTTTGTTTGTGAGCATGCTGGCTGCGGGAAATGCTTTGCAATGAAAAAAAG
CCTAGAAAGACATTCAGTTGTACATGATCCAGAGAAGAGGAAGCTGAAGGAGAAATGCCCTCGCCCAAAGAGAAGCCTGGCCTCTCGCCTCAC
TGGATACATACCCCCCAAGAGCAAAGAAAAAAATGCATCCGTTTCGGGAACAGAAAAGACTGATTCACTTGTGAAAAATAAGCCCTCTGGCACT
GAAACAAATGGCTCATTGGTTCTAGATAAATTAACTATACAATAATATAAGAAAACATTTAAATTTATTTTTTTATTTGTTAAAATTGCCCTCAGGAT
GGTTAACCCATATTTAGTGTGGGTTTTTTCTTTTTTTACAGCTTTAATTCATTTTTTTTCGGCTATAACAAAAGGAATCTGTTCTAGACGCATGATT
TGTTTTATGAACTGCAGTATTGGCCATGCCTACAGGTAAAGGCACAGTGTTAATGGCTACATACCTCTTCTACCCCATGTTTGCTATTAAAAGTG
AGGTGCAGCAGCCACTGGTCTGTTTATTTACAATACATTCATTTAGTAAGACTCTGTATTCATTTTCAAAAGAATCACTAAGGGAATGTGCAAAAT
TGTTATCACTCTACTGTAAACACAAATGTACTGCTTGCACCCTGTTGGTGGGGCTTTTTTTGGGGAGGTTGACTGACCCTGTTTTTTTTTTAACG
GAATTC
Rosalind Franklin: The Dark Lady of DNA (1920-1958)
By Brenda Maddox
• Maddox tells her readers that in their Nobel acceptance speeches in 1962, Watson and Crick made no mention of Rosalind Franklin at all. It was only Wilkins who “uttered” Franklin’s name, mentioning her as one of two people (the other being Alex Stokes) who “made very valuable contributions to the X-ray analysis.”
• James Watson, Francis Crick, and Maurice Wilkins received a Nobel Prize for their discovery in 1962. Franklin was ignored.
• For more about the story read
http://www.humanistperspectives.org/issue151/books.html
Sodium deoxyribose nucleate from calf thymus, Structure B, Photo 51, taken by Rosalind Franklin and R. G. Gosling, 2 May 1952, with Linus Pauling’s holographic annotations to the right of the photo. This photo shows the double-helix structure of DNA with a separation of 20 Å.
Discovery of the double helix structure of DNA
The discovery is based on three pieces of work
1. Chargaff’s rule (discovered in 1949)
• Chargaff – an Austrian-American biochemist
• total number of A = total number of T, and total number of C = total number of G
2. Linus Pauling – discovered the alpha-helix structure of proteins
3. X-ray diffraction pattern of crystals
– Done by Rosalind Franklin
– Crystal X-ray diffraction – by William Henry Bragg and his son William Lawrence Bragg
http://www.virtualsciencefair.org/2004/mcgo4s0/public_html/t2/dna.html
http://post.queensu.ca/~forsdyke/bioinfo1.htm
Erwin Chargaff (1905 -2002)
Discovery of the double helix structure of DNA
Linus Pauling
• Nobel Prize in Chemistry in 1954
• Nobel Peace Prize for 1962 (awarded in 1963)
• championed the use of vitamin C
• lived to be 93 (he died in 1994)
• nature was a marvelous contrivance composed of molecules assembled by the Great Mechanic
• http://www.utoronto.ca/jpolanyi/public_affairs/public_affairs4i.html
• Watson and Crick conjectured that DNA is made up of two, three or four helices, but their model did not fit the X-ray data.
• They stopped their work for almost one year
• Crick continued his Ph.D. thesis; Watson worked on his tobacco virus research
• So did Pauling, who proposed a three-helix model of DNA.
• They were wrong, because they did not yet know the complementary principle.
Discovery of the double helix structure of DNA
• Watson and Crick also proposed that DNA is made up of two identical helices, with the same nucleotide on the opposite helix → wrong
• Jerry Donohue, a visiting chemist at Cambridge from Caltech, asserted that the DNA bases ought to be in the keto form and not the enol form shown in the textbooks of the day.
• Armed now with the memory of Franklin’s clear Photograph 51, this next-to-last step in the emergence of the final model was absolutely crucial.
Donohue (1920 – 1985)
Discovery of the double helix structure of DNA
• “‘The point was important,’ [Crick] said, ‘because if the unit cell is strictly C2, one must have the DNA chains in pairs, running in opposite directions.’”
• This scientific point was crucial for Watson and Crick. In separate papers published that same year, Franklin had said that “C2 is the only space group possible.” Why, Maddox wonders, had Watson or Crick failed to mention the importance of this in either of their Nature papers of 1953?
• Maurice Wilkins: a physicist, he worked with John Randall in the late 1930s on the development of radar, moving to the USA during World War II to work on the Manhattan project. After the War he joined Randall at King's College London and with Rosalind Franklin began an investigation into the structure of DNA.
Watson (1928-) and
Crick (1916-2004)
Maurice Wilkins
(1916-2004)
Diffraction of X ray by crystal
• Max von Laue was awarded the Nobel Prize in Physics in 1914 "for his discovery of the diffraction of X-rays by crystals". His collaborators Walter Friedrich and Paul Knipping took the picture on the right in 1912.
• http://cxpi.spme.monash.edu.au/xray_history.htm
Max von Laue (1879-1960)
A beam of X-rays is scattered into a
characteristic pattern by a crystal. In this case
it is copper sulphate.
Diffraction of X ray by crystal
• Sir William Lawrence Bragg, Australian-born British physicist, won the Nobel Prize (1915) with his father William Henry Bragg "for their services in the analysis of crystal structure by means of X-rays", when he was only 25 years old.
William Henry Bragg (1862-1942)
William Lawrence Bragg (1890-1971)
Bragg’s law of diffraction
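The slide's figure is not reproduced in this transcript. For reference, the standard textbook statement of Bragg's law (given here from general crystallography, not taken from the slide) is:

```latex
n\lambda = 2d\sin\theta
```

where n is a positive integer (the order of diffraction), λ is the X-ray wavelength, d is the spacing between adjacent lattice planes, and θ is the angle between the incident beam and those planes. Constructive interference, and hence a diffraction spot, occurs only when this condition is met.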