associations
QTL
MARKERS
mapping
barcoding
SNPs
metagenomics
genotype STRs
arrays
2D gels
ESTs
RNA
DNA PCRtranscriptomics
GMOs genomics
RT
RNAi
immunotechniques
Protein
proteomics
(gene expression)
MRM
mass spectrometry
glycobiology
shotgun seq
Sanger
BAC pools
next (next) gen
lipids
Metabolite
metabolomics
gas chromatography
#molecularhipsters
DNA SEQ
+ ENVIRONMENT = PHENOTYPE
DAVID NG BIOTEACH.UBC.CA
BIOINFORMATICS
MICHAEL SMITH LABS 2014
GENOMICS 101
AGENDA:
9am-10am
General Molecular
Biology background
10am-12pm
Historical context
12pm-1pm
Lunch
1pm-2:30pm
Glossary Review
2:30pm-2:45pm
Break
2:45pm-3:45pm
Case Study Examples
3:45pm-4pm
Q&A
DOCUMENT CONTAINS:
- SLIDES SECTION (INCLUDING GLOSSARY)
- GENOMEBC CASE STUDIES
- NOTES FOR REFERENCE
1
2
3
“genome”
“transcriptome” “proteome”
“metabolome”
4
5
6
Proc. Natl. Acad. Sci. USA
Vol. 74, No. 12, pp. 5463-5467, December 1977
Biochemistry
DNA sequencing with chain-terminating inhibitors
(DNA polymerase/nucleotide sequences/bacteriophage φX174)
F. SANGER, S. NICKLEN, AND A. R. COULSON
Medical Research Council Laboratory of Molecular Biology, Cambridge CB2 2QH, England
Contributed by F. Sanger, October 3, 1977
ABSTRACT A new method for determining nucleotide sequences in DNA is described. It is similar to the "plus and minus" method [Sanger, F. & Coulson, A. R. (1975) J. Mol. Biol. 94, 441-448] but makes use of the 2',3'-dideoxy and arabinonucleoside analogues of the normal deoxynucleoside triphosphates, which act as specific chain-terminating inhibitors of DNA polymerase. The technique has been applied to the DNA of bacteriophage φX174 and is more rapid and more accurate than either the plus or the minus method.
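The chain-termination idea in the abstract can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the example strand and both function names are hypothetical, and every possible termination position is assumed to be represented on the gel. Each of the four reactions yields chains that stop wherever its terminator replaces the normal nucleotide, and the sequence is read by ordering all bands from shortest to longest.

```python
def termination_lengths(new_strand, base):
    """Lengths of chains that end in the dideoxy analogue of `base`.

    Idealized: assumes every occurrence of `base` yields a visible band.
    """
    return [i + 1 for i, b in enumerate(new_strand) if b == base]


def read_gel(lanes):
    """Read the sequence from four lanes, shortest fragment first."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical newly synthesized strand, written 5'->3'.
new_strand = "GATTACAGGC"

# One reaction (lane) per terminator: ddA, ddC, ddG, ddT.
lanes = {base: termination_lengths(new_strand, base) for base in "ACGT"}
recovered = read_gel(lanes)

print(lanes["T"])  # [3, 4]
print(recovered)   # GATTACAGGC
```

In a real experiment the ratio of terminator to normal triphosphate determines how evenly the bands are populated; the sketch ignores that and simply assumes a complete ladder.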
The "plus and minus" method (1) is a relatively rapid and simple technique that has made possible the determination of the sequence of the genome of bacteriophage φX174 (2). It depends on the use of DNA polymerase to transcribe specific regions of the DNA under controlled conditions. Although the method is considerably more rapid and simple than other available techniques, neither the "plus" nor the "minus" method is completely accurate, and in order to establish a sequence both must be used together, and sometimes confirmatory data are necessary. W. M. Barnes (J. Mol. Biol., in press) has recently developed a third method, involving ribo-substitution, which has certain advantages over the plus and minus method, but this has not yet been extensively exploited.
Another rapid and simple method that depends on specific chemical degradation of the DNA has recently been described by Maxam and Gilbert (3), and this has also been used extensively for DNA sequencing. It has the advantage over the plus and minus method that it can be applied to double-stranded DNA, but it requires a strand separation or equivalent fractionation of each restriction enzyme fragment studied, which makes it somewhat more laborious.
This paper describes a further method using DNA polymerase, which makes use of inhibitors that terminate the newly synthesized chains at specific residues.
Principle of the Method. Atkinson et al. (4) showed that the inhibitory activity of 2',3'-dideoxythymidine triphosphate (ddTTP) on DNA polymerase I depends on its being incorporated into the growing oligonucleotide chain in the place of thymidylic acid (dT). Because the ddT contains no 3'-hydroxyl group, the chain cannot be extended further, so that termination occurs specifically at positions where dT should be incorporated. If a primer and template are incubated with DNA polymerase in the presence of a mixture of ddTTP and dTTP, as well as the other three deoxyribonucleoside triphosphates (one of which is labeled with 32P), a mixture of fragments all having the same 5' and with ddT residues at the 3' ends is obtained. When this mixture is fractionated by electrophoresis on denaturing acrylamide gels the pattern of bands shows the distribution of dTs in the newly synthesized DNA. By using analogous terminators for the other nucleotides in separate incubations and running the samples in parallel on the gel, a pattern of bands is obtained from which the sequence can be read off as in the other rapid techniques mentioned above.
Two types of terminating triphosphates have been used: the dideoxy derivatives and the arabinonucleosides. Arabinose is a stereoisomer of ribose in which the 3'-hydroxyl group is oriented in trans position with respect to the 2'-hydroxyl group. The arabinosyl (ara) nucleotides act as chain terminating inhibitors of Escherichia coli DNA polymerase I in a manner comparable to ddT (4), although synthesized chains ending in 3' araC can be further extended by some mammalian DNA polymerases (5). In order to obtain a suitable pattern of bands from which an extensive sequence can be read it is necessary to have a ratio of terminating triphosphate to normal triphosphate such that only partial incorporation of the terminator occurs. For the dideoxy derivatives this ratio is about 100, and for the arabinosyl derivatives about 5000.
METHODS
Preparation of the Triphosphate Analogues. The preparation of ddTTP has been described (6, 7), and the material is now commercially available. ddA has been prepared by McCarthy et al. (8). We essentially followed their procedure and used the methods of Tener (9) and of Hoard and Ott (10) to convert it to the triphosphate, which was then purified on DEAE-Sephadex, using a 0.1-1.0 M gradient of triethylamine carbonate at pH 8.4. The preparation of ddGTP and ddCTP has not been described previously; however we applied the same method as that used for ddATP and obtained solutions having the requisite terminating activities. The yields were very low and this can hardly be regarded as adequate chemical characterization. However, there can be little doubt that the activity was due to the dideoxy derivatives.
The starting material for the ddGTP was N-isobutyryl-5'-O-monomethoxytrityldeoxyguanosine prepared by F. E. Baralle (11). After tosylation of the 3'-OH group (12) the compound was converted to the 2',3'-didehydro derivative with sodium methoxide (8). The isobutyryl group was partly removed during this treatment and removal was completed by incubation in NH3 (specific gravity 0.88) overnight at 45°. The didehydro derivative was reduced to the dideoxy derivative (8) and converted to the triphosphate as for the ddATP. The monophosphate was purified by fractionation on a DEAE-Sephadex column using a triethylamine carbonate gradient (0.025-0.3 M) but the triphosphate was not purified.
ddCTP was prepared from N-anisoyl-5'-O-monomethoxytrityldeoxycytidine (Collaborative Research Inc., Waltham, MA) by the above method but the final purification on DEAE-Sephadex was omitted because the yield was very low and the solution contained the required activity. The solution was used directly in the experiments described in this paper.
An attempt was made to prepare the triphosphate of the intermediate didehydrodideoxycytidine because Atkinson et
Abbreviations: The symbols C, T, A, and G are used for the deoxyribonucleotides in DNA sequences; the prefix dd is used for the 2',3'-dideoxy derivatives (e.g., ddATP is 2',3'-dideoxyadenosine 5'-triphosphate); the prefix ara is used for the arabinose analogues.
If the K-T boundary isotopic spike is indeed the result of impact-related acid rain, the oceanic strontium isotope record may reveal other large impacts. The seawater strontium curve of Burke et al. (9), which spans the past 500 million years, shows at least two other prominent high spikes in the 87Sr/86Sr ratio, one in the mid-Cretaceous, at ~100 million years, and the other in the Pennsylvanian, at ~290 million years. The first appears to precede by a few million years the mass extinction event at the Cenomanian-Turonian boundary. There is also a large increase in 87Sr/86Sr across the Permian-Triassic boundary (9), the time of the most extreme mass extinction in the Phanerozoic record (17). However, the increase appears to be rather gradual, extending over 20 million to 25 million years, and is thus quite different in character from the K-T spike. Nevertheless, data are sparse for this interval, and more work will be required to determine the exact nature of the increase.
The occurrence of a spike toward higher values in the seawater 87Sr/86Sr record at the K-T boundary is tantalizing evidence for enhanced continental weathering, possibly due to impact-related acid rain. Detailed strontium isotopic studies through this and other intervals where such spikes appear are required to determine precisely the nature of the isotopic variations with respect to stratigraphy, and particularly with respect to mass extinctions.
REFERENCES AND NOTES
1. J. Hess, M. L. Bender, J.-G. Schilling, Science 231, 979 (1986).
2. L. W. Alvarez, W. Alvarez, F. Asaro, H. V. Michel, ibid. 208, 1095 (1980).
3. R. G. Prinn and B. Fegley, Jr., Earth Planet. Sci. Lett. 83, 1 (1987).
4. J. Lewis, H. Watkins, H. Hartman, R. Prinn, Geol. Soc. Am. Spec. Pap. 190 (1982), p. 215.
5. M. R. Palmer and H. Elderfield, Nature (London) 314, 526 (1985).
6. D. J. DePaolo and B. L. Ingram, Science 227, 938 (1985).
7. F. Surlyk and M. B. Johansen, ibid. 223, 1174 (1984).
8. Since strontium is removed from the oceans mainly by biogenic carbonate precipitation, removal rates may have been much less for some period after the K-T boundary event, leading to higher strontium concentrations in seawater and a temporary increase in "residence time."
9. W. H. Burke et al., Geology 10, 516 (1982).
10. M. A. Wadleigh, J. Veizer, C. Brooks, Geochim. Cosmochim. Acta 49, 1727 (1985).
11. D. W. Mittlefehldt and G. W. Wetherill, ibid. 43, 201 (1979).
12. D. W. Graham, M. L. Bender, D. F. Williams, L. D. Keigwin, Jr., ibid. 46, 1281 (1982).
13. A. A. Afifi, O. P. Bricker, J. C. Chemerys, Chem. Geol. 49, 87 (1985).
14. R. M. Garrels and F. T. Mackenzie, Evolution of Sedimentary Rocks (Norton, New York, 1971).
15. G. W. Brass, Geochim. Cosmochim. Acta 40, 721 (1976). This paper provides data for both strontium isotopic ratios and strontium-to-calcium ratios for limestone.
16. C. Officer, C. Drake, J. Devine, Eos 66, 813 (1985).
17. J. J. Sepkoski, Jr., Geol. Soc. Am. Spec. Pap. 190 (1982), p. 283.
18. I thank many colleagues at Scripps for comments on the ideas expressed in this report, in particular G. Arrhenius, S. Galer, J. Gieskes, M. Kastner, D. Lal, G. Lugmair, and H.-G. Stosch. Comments from two anonymous reviewers also improved the original manuscript. I thank P. Hey for preparation of the manuscript. This work was supported in part by grants from the National Science Foundation and the National Aeronautics and Space Administration.
28 September 1987; accepted 7 December 1987
Primer-Directed Enzymatic Amplification of DNA with a Thermostable DNA Polymerase
RANDALL K. SAIKI, DAVID H. GELFAND, SUSANNE STOFFEL, STEPHEN J. SCHARF, RUSSELL HIGUCHI, GLENN T. HORN, KARY B. MULLIS,* HENRY A. ERLICH
A thermostable DNA polymerase was used in an in vitro DNA amplification procedure, the polymerase chain reaction. The enzyme, isolated from Thermus aquaticus, greatly simplifies the procedure and, by enabling the amplification reaction to be performed at higher temperatures, significantly improves the specificity, yield, sensitivity, and length of products that can be amplified. Single-copy genomic sequences were amplified by a factor of more than 10 million with very high specificity, and DNA segments up to 2000 base pairs were readily amplified. In addition, the method was used to amplify and detect a target DNA molecule present only once in a sample of 10^5 cells.
The analysis of specific nucleotide sequences, like many analytic procedures, is often hampered by the presence of extraneous material or by the extremely small amounts available for examination. We have recently described a method, the polymerase chain reaction (PCR), that overcomes these limitations (1, 2). This technique is capable of producing a selective enrichment of a specific DNA sequence by a factor of 10^6, greatly facilitating a variety of subsequent analytical manipulations. PCR has been used in the examination of nucleotide sequence variations (3-5) and chromosomal rearrangements (6), for high-efficiency cloning of genomic sequences (7), for direct sequencing of mitochondrial (8) and genomic DNAs (9, 10), and for the detection of viral pathogens (11).
PCR amplification involves two oligonucleotide primers that flank the DNA segment to be amplified and repeated cycles of heat denaturation of the DNA, annealing of the primers to their complementary sequences, and extension of the annealed primers with DNA polymerase. These primers hybridize to opposite strands of the target sequence and are oriented so DNA synthesis by the polymerase proceeds across the region between the primers, effectively doubling the amount of that DNA segment. Moreover, since the extension products are also complementary to and capable of binding primers, each successive cycle essentially doubles the amount of DNA synthesized in the previous cycle. This results in the exponential accumulation of the specific target fragment, approximately 2^n, where n is the number of cycles.
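The 2^n relationship can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the `efficiency` parameter are assumptions introduced here (the paper states only the ideal doubling case, which corresponds to `efficiency=1.0`).

```python
import math


def fold_amplification(cycles, efficiency=1.0):
    """Copies per starting template after `cycles` rounds of PCR.

    efficiency=1.0 is ideal doubling (2^n); lower values are a
    hypothetical way to model imperfect cycles.
    """
    return (1.0 + efficiency) ** cycles


def cycles_needed(fold, efficiency=1.0):
    """Smallest whole number of cycles that reaches `fold` amplification."""
    return math.ceil(math.log(fold) / math.log(1.0 + efficiency))


print(fold_amplification(20))  # 1048576.0
print(cycles_needed(1e6))      # 20
print(cycles_needed(1e7))      # 24
```

Under ideal doubling, about 20 cycles give the 10^6-fold enrichment mentioned above, and 24 cycles are enough to exceed a 10-million-fold amplification.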
One of the drawbacks of the method, however, is the thermolability of the Klenow fragment of Escherichia coli DNA polymerase I used to catalyze the extension of the annealed primers. Because of the heat denaturation step required to separate the newly synthesized strands of DNA, fresh enzyme must be added during each cycle, a tedious and error-prone process if several samples are amplified simultaneously. We now describe the replacement of the E. coli DNA polymerase with a thermostable DNA polymerase purified from the thermophilic bacterium, Thermus aquaticus (Taq), that can survive extended incubation at 95°C (12). Since this heat-resistant polymerase is relatively unaffected by the denaturation step, it does not need to be replenished at each cycle. This modification not only simplifies the procedure, making it amenable to automation, it also substantially improves the overall performance of the reaction by increasing the specificity, yield, sensitivity, and length of targets that can be amplified.
Samples of human genomic DNA were
subjected to 20 to 35 cycles of PCR amplifi-
8
R. K. Saiki, S. J. Scharf, R. Higuchi, G. T. Horn, K. B.
Mullis, H. A. Erlich, Cetus Corporation, Department of
Human Genetics, 1400 Fifty-Third Street, Emeryville,
CA 94608.
D. H. Gelfand and S. Stoffel, Cetus Corporation, Department of Microbial Genetics, 1400 Fifty-Third Street,
Emeryville, CA 94608.
*Present address: Xytronyx, 6555 Nancy Ridge Drive,
Suite 200, San Diego, CA 92121.
REPORTS
29 JANUARY 1988
487
to the whiteboard
9
TRUE OR FALSE?
1. Evidence shows that mice are attracted to their mates based on genetic diversity, which they can somehow detect from the smell of their urine. There is also currently some weak evidence that humans indirectly do the same thing.
2. According to established data within the senescence field, people who take longer to “sh*t” arguably should live longer.
3. A transgenic mouse, affectionately known as the "Doogie" mouse, has been produced with superior intellect and mental prowess.
4. Research based on psychogenomic analysis has determined the genetic sequences involved in your love/hate relationship with Brussels sprouts.
10
PHENOTYPE
PROTEINS
GENES
+
DNA SEQ
ATCG
TAGC
+
dQ dQ +
ATGC
TACG
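The paired sequences on this slide (ATCG over TAGC, ATGC over TACG) follow the base-pairing rule A-T and C-G. A minimal sketch of that rule, with hypothetical helper names chosen here for illustration:

```python
# Watson-Crick pairing table: A<->T, C<->G.
PAIR = str.maketrans("ATCG", "TAGC")


def complement(seq):
    """Base-by-base partner strand, written in the same left-to-right order."""
    return seq.translate(PAIR)


def reverse_complement(seq):
    """Partner strand written 5'->3', as sequence databases usually report it."""
    return complement(seq)[::-1]


print(complement("ATCG"))          # TAGC
print(complement("ATGC"))          # TACG
print(reverse_complement("ATCG"))  # CGAT
```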
11
http://miserycat.deviantart.com/art/Free-Unicorn-Icon-180216460?q=&qo=
enhancer
silencer
chromatin
*
CpG
*
*
TATA
promoter
*
CAP
signal
kozak
*
ATG
*
*
STOP polyA
*
exon intron
*
* = “maybe”
*
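Of the features on this slide, the ATG start and the STOP codon are the easiest to find computationally. The sketch below is a deliberately naive illustration: the function name and the example sequence are hypothetical, and it ignores introns, promoters, the Kozak context and every other "maybe" feature marked with an asterisk.

```python
STOPS = {"TAA", "TAG", "TGA"}


def first_orf(dna):
    """Return the first ATG...STOP open reading frame, or None if absent.

    Scans codon by codon in the frame set by each ATG; no splicing is modelled.
    """
    start = dna.find("ATG")
    while start != -1:
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOPS:
                return dna[start:i + 3]
        start = dna.find("ATG", start + 1)
    return None


print(first_orf("CCATGGCTTGATAA"))  # ATGGCTTGA
```

Real gene finders combine many of the other signals on the slide precisely because a bare ATG-to-STOP scan produces far too many false positives in a genome with introns.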
12
13
MOLECULAR BIOLOGY
http://devisional.com/
14
15
16
16
17
GENOME: In modern molecular biology and genetics, the
genome is the entirety of an organism's hereditary
information. It is encoded either in DNA or, for many types of
virus, in RNA. The genome includes both the genes and the
non-coding sequences of the DNA/RNA.
GENOMICS: a discipline in genetics concerned with the study
of the genomes of organisms.
18
articles
Initial sequencing and analysis of the human genome
International Human Genome Sequencing Consortium*
* A partial list of authors appears on the opposite page. Affiliations are listed at the end of the paper.
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century (1-3) sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same.
The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.
Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.
The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.
Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear.
- The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters.
- There appear to be about 30,000-40,000 protein-coding genes in the human genome, only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.
- The full set of proteins (the 'proteome') encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.
- Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements.
- Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.
- The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm.
- Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these 'selfish' elements may benefit their human hosts.
- The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.
- Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with 'dark G-bands' in karyotypes.
- Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis.
- More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population.
In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other
© 2001 Macmillan Magazines Ltd
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
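GC content and CpG islands are among the genomic landscape features discussed in the draft-genome paper. The sketch below shows how the two underlying quantities are commonly computed for a stretch of sequence. It is an illustration under stated assumptions, not the consortium's pipeline: the function names and the example sequence are hypothetical, and the observed/expected CpG ratio used here is a widely used convention that the page above does not itself define.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G frequencies.

    Expected = (#C * #G) / length. Returns 0.0 when C or G is absent.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


window = "CGCGCGTATA"  # hypothetical 10-base window
print(gc_content(window))              # 0.6
print(round(cpg_obs_exp(window), 2))   # 3.33
```

In practice these values are computed over sliding windows of hundreds of bases, and a region is called a CpG island only when both the GC content and the observed/expected ratio exceed chosen thresholds.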
Genome Sequencing Centres (Listed in order of total genomic sequence contributed, with a partial list of personnel. A full list of contributors at each centre is available as Supplementary Information.)
Whitehead Institute for Biomedical Research, Center for Genome Research: Eric S. Lander1*, Lauren M. Linton1, Bruce Birren1*, Chad Nusbaum1*, Michael C. Zody1*, Jennifer Baldwin1, Keri Devon1, Ken Dewar1, Michael Doyle1, William FitzHugh1*, Roel Funke1, Diane Gage1, Katrina Harris1, Andrew Heaford1, John Howland1, Lisa Kann1, Jessica Lehoczky1, Rosie LeVine1, Paul McEwan1, Kevin McKernan1, James Meldrim1, Jill P. Mesirov1*, Cher Miranda1, William Morris1, Jerome Naylor1, Christina Raymond1, Mark Rosetti1, Ralph Santos1, Andrew Sheridan1, Carrie Sougnez1, Nicole Stange-Thomann1, Nikola Stojanovic1, Aravind Subramanian1 & Dudley Wyman1
The Sanger Centre: Jane Rogers2, John Sulston2*, Rachael Ainscough2, Stephan Beck2, David Bentley2, John Burton2, Christopher Clee2, Nigel Carter2, Alan Coulson2, Rebecca Deadman2, Panos Deloukas2, Andrew Dunham2, Ian Dunham2, Richard Durbin2*, Lisa French2, Darren Grafham2, Simon Gregory2, Tim Hubbard2*, Sean Humphray2, Adrienne Hunt2, Matthew Jones2, Christine Lloyd2, Amanda McMurray2, Lucy Matthews2, Simon Mercer2, Sarah Milne2, James C. Mullikin2*, Andrew Mungall2, Robert Plumb2, Mark Ross2, Ratna Shownkeen2 & Sarah Sims2
Washington University Genome Sequencing Center: Robert H. Waterston3*, Richard K. Wilson3, LaDeana W. Hillier3*, John D. McPherson3, Marco A. Marra3, Elaine R. Mardis3, Lucinda A. Fulton3, Asif T. Chinwalla3*, Kymberlie H. Pepin3, Warren R. Gish3, Stephanie L. Chissoe3, Michael C. Wendl3, Kim D. Delehaunty3, Tracie L. Miner3, Andrew Delehaunty3, Jason B. Kramer3, Lisa L. Cook3, Robert S. Fulton3, Douglas L. Johnson3, Patrick J. Minx3 & Sandra W. Clifton3
US DOE Joint Genome Institute: Trevor Hawkins4, Elbert Branscomb4, Paul Predki4, Paul Richardson4, Sarah Wenning4, Tom Slezak4, Norman Doggett4, Jan-Fang Cheng4, Anne Olsen4, Susan Lucas4, Christopher Elkin4, Edward Uberbacher4 & Marvin Frazier4
Baylor College of Medicine Human Genome Sequencing Center: Richard A. Gibbs5*, Donna M. Muzny5, Steven E. Scherer5, John B. Bouck5*, Erica J. Sodergren5, Kim C. Worley5*, Catherine M. Rives5, James H. Gorrell5, Michael L. Metzker5, Susan L. Naylor6, Raju S. Kucherlapati7, David L. Nelson, & George M. Weinstock8
RIKEN Genomic Sciences Center: Yoshiyuki Sakaki9, Asao Fujiyama9, Masahira Hattori9, Tetsushi Yada9, Atsushi Toyoda9, Takehiko Itoh9, Chiharu Kawagoe9, Hidemi Watanabe9, Yasushi Totoki9 & Todd Taylor9
Genoscope and CNRS UMR-8030: Jean Weissenbach10, Roland Heilig10, William Saurin10, Francois Artiguenave10, Philippe Brottier10, Thomas Bruls10, Eric Pelletier10, Catherine Robert10 & Patrick Wincker10
GTC Sequencing Center: Douglas R. Smith11, Lynn Doucette-Stamm11, Marc Rubenfield11, Keith Weinstock11, Hong Mei Lee11 & JoAnn Dubois11
Department of Genome Analysis, Institute of Molecular Biotechnology: André Rosenthal12, Matthias Platzer12, Gerald Nyakatura12, Stefan Taudien12 & Andreas Rump12
Beijing Genomics Institute/Human Genome Center: Huanming Yang13, Jun Yu13, Jian Wang13, Guyang Huang14 & Jun Gu15
Multimegabase Sequencing Center, The Institute for Systems Biology: Leroy Hood16, Lee Rowen16, Anup Madan16 & Shizen Qin16
Stanford Genome Technology Center: Ronald W. Davis17, Nancy A. Federspiel17, A. Pia Abola17 & Michael J. Proctor17
Stanford Human Genome Center: Richard M. Myers18, Jeremy Schmutz18, Mark Dickson18, Jane Grimwood18 & David R. Cox18
University of Washington Genome Center: Maynard V. Olson19, Rajinder Kaul19 & Christopher Raymond19
Department of Molecular Biology, Keio University School of Medicine: Nobuyoshi Shimizu20, Kazuhiko Kawasaki20 & Shinsei Minoshima20
University of Texas Southwestern Medical Center at Dallas: Glen A. Evans21, Maria Athanasiou21 & Roger Schultz21
University of Oklahoma's Advanced Center for Genome Technology: Bruce A. Roe22, Feng Chen22 & Huaqin Pan22
Max Planck Institute for Molecular Genetics: Juliane Ramser23, Hans Lehrach23 & Richard Reinhardt23
Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center: W. Richard McCombie24, Melissa de la Bastide24 & Neilay Dedhia24
GBF-German Research Centre for Biotechnology: Helmut Blöcker25, Klaus Hornischer25 & Gabriele Nordsiek25
* Genome Analysis Group (listed in alphabetical order, also includes individuals listed under other headings): Richa Agarwala26, L. Aravind26, Jeffrey A. Bailey27, Alex Bateman2, Serafim Batzoglou1, Ewan Birney28, Peer Bork29,30, Daniel G. Brown1, Christopher B. Burge31, Lorenzo Cerutti28, Hsiu-Chuan Chen26, Deanna Church26, Michele Clamp2, Richard R. Copley30, Tobias Doerks29,30, Sean R. Eddy32, Evan E. Eichler27, Terrence S. Furey33, James Galagan1, James G. R. Gilbert2, Cyrus Harmon34, Yoshihide Hayashizaki35, David Haussler36, Henning Hermjakob28, Karsten Hokamp37, Wonhee Jang26, L. Steven Johnson32, Thomas A. Jones32, Simon Kasif38, Arek Kaspryzk28, Scot Kennedy39, W. James Kent40, Paul Kitts26, Eugene V. Koonin26, Ian Korf3, David Kulp34, Doron Lancet41, Todd M. Lowe42, Aoife McLysaght37, Tarjei Mikkelsen38, John V. Moran43, Nicola Mulder28, Victor J. Pollara1, Chris P. Ponting44, Greg Schuler26, Jörg Schultz30, Guy Slater28, Arian F. A. Smit45, Elia Stupka28, Joseph Szustakowki38, Danielle Thierry-Mieg26, Jean Thierry-Mieg26, Lukas Wagner26, John Wallis3, Raymond Wheeler34, Alan Williams34, Yuri I. Wolf26, Kenneth H. Wolfe37, Shiaw-Pyng Yang3 & Ru-Fang Yeh31
Scientific management: National Human Genome Research Institute, US National Institutes of Health: Francis Collins46*, Mark S. Guyer46, Jane Peterson46, Adam Felsenfeld46* & Kris A. Wetterstrand46; Office of Science, US Department of Energy: Aristides Patrinos47; The Wellcome Trust: Michael J. Morgan48
19
keywords: public, BAC pools
THE HUMAN GENOME
J. Craig Venter,1* Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1
Granger G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert A. Holt,1
Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1 Daniel H. Huson,1
Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1 Xiangqun H. Zheng,1 Lin Chen,1
Marian Skupski,1 Gangadharan Subramanian,1 Paul D. Thomas,1 Jinghui Zhang,1
George L. Gabor Miklos,2 Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4 Joe Nadeau,5
Victor A. McKusick,6 Norton Zinder,7 Arnold J. Levine,7 Richard J. Roberts,8 Mel Simon,9
Carolyn Slayman,10 Michael Hunkapiller,11 Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1
Michael Flanigan,1 Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel Levy,1
Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1 Ellen Beasley,1 Kendra Biddick,1
Vivien Bonazzi,1 Rhonda Brandon,1 Michele Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1
Kabir Chaturvedi,1 Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1
Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng Gong,1 Zhiping Gu,1
Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1 Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1
Zhongwu Lai,1 Yiding Lei,1 Zhenya Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1
Gennady V. Merkulov,1 Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K Naik,1
Vaibhav A. Narayan,1 Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12
Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin Wang,1 Jian Wang,1
Ming-Hui Wei,1 Ron Wides,13 Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1
Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1 Wenyan Zhong,1
Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis Gilbert,1 Suzanna Baumhueter,1 Gene Spier,1
Christine Carter,1 Anibal Cravchik,1 Trevor Woodage,1 Feroze Ali,1 Huijin An,1 Aderonke Awe,1
Danita Baldwin,1 Holly Baden,1 Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1
Amy Carver,1 Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1 Lionel Davenport,1
Raymond Desilets,1 Susanne Dietz,1 Kristina Dodson,1 Lisa Doup,1 Steven Ferriera,1 Neha Garg,1
Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1 Charles Haynes,1 Cheryl Heiner,1 Suzanne Hladun,1
Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1 Chinyere Ibegwam,1 Jeffery Johnson,1
Francis Kalush,1 Lesley Kline,1 Shashi Koduru,1 Amy Love,1 Felecia Mann,1 David May,1
Steven McCawley,1 Tina McIntosh,1 Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1
Keith Nelson,1 Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1 Matthew Reardon,1
Robert Rodriguez,1 Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1 Cynthia Sitter,1
Michelle Smallwood,1 Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1 Ni Ni Tint,1
Sukyee Tse,1 Claire Vech,1 Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1
Sandra Windsor,1 Emily Winn-Deen,1 Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1
Josep F. Abril,14 Roderic Guigó,14 Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1
Anish Kejariwal,1 Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen Diemer,1
Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin Istrail,1 Ross Lippert,1
Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1 David Allen,1 Anand Basu,1 James Baxendale,1
Louis Blick,1 Marcelo Caminha,1 John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1
Carl Dahlke,1 Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva Esparham,1
Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1 Anna Glodek,1 Mark Gorokhov,1
Ken Graham,1 Barry Gropman,1 Michael Harris,1 Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1
Donald Jennings,1 Catherine Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1
Alexander Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1 William Majoros,1
Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1 Trung Nguyen,1 Ngoc Nguyen,1 Marc Nodell,1
Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William Rowe,1 Robert Sanders,1 John Scott,1
Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1 Russell Turner,1 Eli Venter,1
Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1 Ali Zandieh,1 Xiaohong Zhu1
Downloaded from www.sciencemag.org on February 26, 2011
The Sequence of the Human Genome
1Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA.
2GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia.
3Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA.
4Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA.
5Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA.
6Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287-4922, USA.
7Rockefeller University, 1230 York Avenue, New York, NY 10021-6399, USA.
8New England BioLabs, 32 Tozer Road, Beverly, MA 01915, USA.
9Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA.
10Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520-8000, USA.
11Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA.
12The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
13Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, 52900 Israel.
14Grup de Recerca en Informàtica Mèdica, Institut Municipal d’Investigació Mèdica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain.
*To whom correspondence should be addressed. Email: [email protected]
1304
16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org
Decoding of the DNA that constitutes the
human genome has been widely anticipated
for the contribution it will make toward understanding human evolution, the causation
of disease, and the interplay between the
environment and heredity in defining the human condition. A project with the goal of
determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent
years, the idea met with mixed reactions in
the scientific community (2). However, in
1990, the Human Genome Project (HGP) was
officially initiated in the United States under
the direction of the National Institutes of
Health and the U.S. Department of Energy
with a 15-year, $3 billion plan for completing
the genome sequence. In 1998 we announced
our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year
period. Here we report the penultimate milestone along the path toward that goal, a nearly
complete sequence of the euchromatic portion of the human genome. The sequencing
was performed by a whole-genome random
shotgun method with subsequent assembly of
the sequenced segments.
The modern history of DNA sequencing
began in 1977, when Sanger reported his method for determining the order of nucleotides of
DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene
was isolated and sequenced (4). In 1986, Hood
and co-workers (5) described an improvement
in the Sanger sequencing method that included
attaching fluorescent dyes to the nucleotides,
which permitted them to be sequentially read
by a computer. The first automated DNA sequencer, developed by Applied Biosystems in
California in 1987, was shown to be successful
when the sequences of two genes were obtained
with this new technology (6). From early sequencing of human genomic regions (7), it
became clear that cDNA sequences (which are
reverse-transcribed from RNA) would be essential to annotate and validate gene predictions
in the human genome. These studies were the
basis in part for the development of the expressed sequence tag (EST) method of gene
identification (8), which is a random selection,
very high throughput sequencing approach to
characterize cDNA libraries. The EST method
led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze
large amounts of sequence data, and in 1993 at
The Institute for Genomic Research (TIGR), an
algorithm was developed that permitted assembly and analysis of hundreds of thousands of
ESTs. This algorithm permitted characterization and annotation of human genes on the basis
of 30,000 EST assemblies (10).
The complete 49-kbp bacteriophage lambda genome sequence was determined by a
shotgun restriction digest method in 1982
(11). When considering methods for sequencing the smallpox virus genome in 1991 (12),
a whole-genome shotgun sequencing method
was discussed and subsequently rejected owing to the lack of appropriate software tools
for genome assembly. However, in 1994,
when a microbial genome-sequencing project
was contemplated at TIGR, a whole-genome
shotgun sequencing approach was considered
possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus
influenzae genome was completed by a
whole-genome shotgun sequencing method
(13). The experience with several subsequent
genome-sequencing efforts established the
broad applicability of this approach (14, 15).
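To make the shotgun idea concrete, here is a toy sketch (it is not the TIGR or Celera assembler). Overlapping fragments are repeatedly merged at their longest suffix/prefix overlap until no pair overlaps any more. The reads, the `min_overlap` threshold and the function names are all hypothetical; real assemblers also have to cope with sequencing errors, repeats and reads from both strands.

```python
def overlap(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair of reads until no pair overlaps."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to join; the remaining contigs stay separate
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# Hypothetical reads, deliberately given out of order.
READS = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
CONTIGS = greedy_assemble(READS)
print(CONTIGS)
```

Greedy merging is the simplest possible strategy and is easily fooled by repeated sequence, which is one reason paired-end information matters for larger genomes.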
A key feature of the sequencing approach
used for these megabase-size and larger genomes was the use of paired-end sequences
(also called mate pairs), derived from subclone libraries with distinct insert sizes and
cloning characteristics. Paired-end sequences
are sequences 500 to 600 bp in length from
both ends of double-stranded DNA clones of
prescribed lengths. The success of using end
sequences from long segments (18 to 20 kbp)
of DNA cloned into bacteriophage lambda in
assembly of the microbial genomes led to the
suggestion (16 ) of an approach to simulta-
THE HUMAN GENOME
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of
the human genome was generated by the whole-genome shotgun sequencing
method. The 14.8-billion bp DNA sequence was generated over 9 months from
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome)
from both ends of plasmid clones made from the DNA of five individuals. Two
assembly strategies—a whole-genome assembly and a regional chromosome
assembly—were used, each combining sequence data from Celera and the
publicly funded genome effort. The public data were shredded into 550-bp
segments to create a 2.9-fold coverage of those genome regions that had been
sequenced, without including biases inherent in the cloning and assembly
procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in
the final assembly over what would be obtained with 5.11-fold coverage. The
two assembly strategies yielded very similar results that largely agree with
independent mapping data. The assemblies effectively cover the euchromatic
regions of the human chromosomes. More than 90% of the genome is in
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed
26,588 protein-encoding transcripts for which there was strong corroborating
evidence and an additional ~12,000 computationally derived genes with mouse
matches or other weak supporting evidence. Although gene-dense clusters are
obvious, almost half the genes are dispersed in low G+C sequence separated
by large tracts of apparently noncoding sequence. Only 1.1% of the genome
is spanned by exons, whereas 24% is in introns, with 75% of the genome being
intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex
evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA
sequence comparisons between the consensus sequence and publicly funded
genome data provided locations of 2.1 million single-nucleotide polymorphisms
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per
1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in
proteins, but the task of determining which SNPs have functional consequences
remains an open challenge.
1305
20
keywords: private (Celera), shotgun sequencing
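The coverage figures in the abstract can be checked with simple arithmetic. The sketch below uses only numbers quoted in the abstract; the variable names are made up for illustration. Dividing by the 2.91-billion bp consensus gives roughly 5.1-fold, slightly under the quoted 5.11-fold, presumably because of rounding or a slightly different genome-size denominator.

```python
TOTAL_BASES = 14.8e9       # bp of sequence generated (from the abstract)
NUM_READS = 27_271_853     # high-quality sequence reads (from the abstract)
GENOME_SIZE = 2.91e9       # bp in the consensus sequence (from the abstract)
PUBLIC_COVERAGE = 2.9      # fold coverage from the shredded public data

mean_read_length = TOTAL_BASES / NUM_READS        # average bp per read
celera_coverage = TOTAL_BASES / GENOME_SIZE       # fold coverage from Celera reads
effective_coverage = celera_coverage + PUBLIC_COVERAGE
snp_rate_percent = 100 / 1250                     # 1 difference per 1250 bp, as a percentage

print(f"mean read length: about {mean_read_length:.0f} bp")
print(f"Celera coverage: about {celera_coverage:.1f}-fold")
print(f"effective coverage: about {effective_coverage:.1f}-fold")
print(f"difference between two haploid genomes: about {snp_rate_percent:.2f}%")
```

The mean read length lands inside the 500 to 600 bp range given for paired-end sequences, and the two coverage figures add up to the "eightfold" effective coverage mentioned in the abstract.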
SOME KEY FINDINGS
• The HGP has revealed that there are probably about 25,000 to 40,000 human genes
(since updated to a count of ~20,500)
• The human genome is remarkably similar to other genomes in terms of total gene
numbers and gene functions, although most genes are more complex.
(Comparative Genomics)
• Between 1.1% and 1.4% of the genome's sequence codes for proteins.
Nonfunctional regions appear to account for ~97%. 12% of human genomic
DNA is due to copy number variations - CNVs
• ~2 million single nucleotide polymorphisms - SNPs (~0.1 to 0.3% of total
genome)
21
• ~2 million single nucleotide polymorphisms - SNPs (~0.1 to 0.3% of total
genome)
A SNP is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G)
in the genome (or other shared sequence) differs between members of a species (here, humans).
AGCTTAGCGAGTGACCGGTCAGCTTACGCAGATCGAGGAGCTTACG
AGCTTAGCGAGTGCCCGGTCAGCTTACGCAGATCGAGGATCTTACG
AGCTTAGCGAGTGCCCGGTCAGCTTACGCAGATCGAGGATCTTACG
AGCTTAGCGAGTGACCGGTCAGCTTACGCAGATCGAGGAGCTTACG
AGCTTAGCGAGTGCCCGGTCAGCTTACGCAGATCGAGGATCTTACG
2 ALLELES A vs C in GENE X
2 ALLELES G vs T in GENE Y
22
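As a rough illustration of how the two SNPs on this slide could be found by computer, the sketch below compares the five aligned sequences column by column and reports every position where more than one nucleotide appears. The function name `find_snps` is hypothetical, positions are counted from 1, and the sequences are assumed to be already aligned and of equal length.

```python
SEQUENCES = [
    "AGCTTAGCGAGTGACCGGTCAGCTTACGCAGATCGAGGAGCTTACG",
    "AGCTTAGCGAGTGCCCGGTCAGCTTACGCAGATCGAGGATCTTACG",
    "AGCTTAGCGAGTGCCCGGTCAGCTTACGCAGATCGAGGATCTTACG",
    "AGCTTAGCGAGTGACCGGTCAGCTTACGCAGATCGAGGAGCTTACG",
    "AGCTTAGCGAGTGCCCGGTCAGCTTACGCAGATCGAGGATCTTACG",
]


def find_snps(seqs):
    """Map each variable position (1-based) to the sorted alleles seen there."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    for position, column in enumerate(zip(*seqs), start=1):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            snps[position] = alleles
    return snps


SNPS = find_snps(SEQUENCES)
for position, alleles in SNPS.items():
    print(f"position {position}: {' vs '.join(alleles)}")
```

The two variable positions correspond to the "A vs C" and "G vs T" alleles labelled on the slide.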
MICROARRAY
DNA CHIP
https://www.youtube.com/watch?v=l99aKKHcxC4
Image slides, source: http://www.illumina.com/
HYPOTHETICAL EXAMPLE
HORSE vs UNICORN (figure series: items 1 to 9 shown for each animal; the final slide shows item 4 for both)
Image source: http://miserycat.deviantart.com/art/Free-Unicorn-Icon-180216460?q=&qo=
http://vimeo.com/77246565
41
AT LAUNCH OF HUMAN GENOME PROJECT (1990)
Several machines to sequence the human genome. Est.
time and cost: 15 years and $3 billion
2 years ago (2012):
One machine can sequence an entire genome in about 8
days at a cost of about $20,000
1 year ago (2013):
One machine can sequence an entire genome in about 3
days at a cost of about $5,000
CURRENTLY (as in just announced in January 2014):
One machine (the Illumina HiSeq X Ten) can sequence an entire
genome in less than a day at a cost of about $1,000
42
TRUE OR FALSE?
1. Testis Determining Factor, found on the Y chromosome, is widely
believed to be the principal gene product responsible for directing
development of male gonads. Ironically, its protein code contains the
amino acid sequence proline - isoleucine - glycine or “PIG” for short.
2. The nucleotide code of certain genes, when translated into musical
notation, has been known to have similarities to Chopin compositions.
3. If you search for DNA sequences that match PRESIDENTBUSH,
GWBUSH, GWBLISH, or even just BUSH, you get a series of hits that
coincidentally lead to sequences or organisms critical of that
administration.
43
44
GENOMICS 101 JARGON MAP
BIOINFORMATICS
DNA (genomics): genotype, MARKERS, SNPs, STRs, associations, mapping, QTL, barcoding, metagenomics
DNA SEQ: Sanger, shotgun seq, BAC pools, next (next) gen
RNA (transcriptomics, gene expression): RT, arrays, ESTs, RNAi
Protein (proteomics): 2D gels, immunotechniques, mass spectrometry, MRM
Metabolite (metabolomics): lipids, glycobiology, gas chromatography
Other terms on the map: PCR, GMOs
#molecularhipsters
+ ENVIRONMENT = PHENOTYPE
45
(mostly DNA)
genotype: genetic (mostly DNA) characterization
(at any level, big or small) of an organism
marker: generally something where you say “let
me flag this” and keep it in mind. Could be a seq, a
molecule, even morphology.
SNPs, STRs: examples of commonly used genomic
markers. SNPs being short for single nucleotide
polymorphisms, and STRs being short for short
tandem repeats (also known as microsatellites). Note that other raw sequences
can also be used as markers.
46
associations: does the marker/seq/molecule
correlate nicely
with some sort of phenotype
or observation?
mapping: “this could be
useful...” Now let’s link it
physically to someplace
on my (usually) genome.
quantitative trait loci (QTL):
mapping/association
technique that scans
large genomic regions for loci linked to a measurable trait.
http://www.nature.com/scitable/topicpage/quantitative-trait-locus-qtl-analysis-53904
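A bare-bones way to picture an association: group individuals by the allele they carry at one marker and compare the average trait value of each group. The data below are invented purely for illustration, and `mean_trait_by_allele` is a hypothetical helper. A real association or QTL study would add a statistical test and corrections for things like population structure and multiple testing.

```python
from statistics import mean

# Hypothetical data: (allele at one marker, measured trait value) per individual.
SAMPLES = [
    ("A", 10.0), ("A", 12.0), ("A", 11.0),
    ("C", 15.0), ("C", 17.0), ("C", 16.0),
]


def mean_trait_by_allele(samples):
    """Average the trait values separately for each allele."""
    groups = {}
    for allele, value in samples:
        groups.setdefault(allele, []).append(value)
    return {allele: mean(values) for allele, values in groups.items()}


MEANS = mean_trait_by_allele(SAMPLES)
DIFFERENCE = MEANS["C"] - MEANS["A"]
print(MEANS, DIFFERENCE)
```

A large, consistent difference between the groups is the kind of signal that makes a marker worth mapping.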
47
barcoding: genetic method to identify species,
usually by seq ubiquitous sequences that vary
slightly from organism to organism.
metagenomics: let’s survey the DNA content of an
environmental sample. Sometimes, seq
EVERYTHING (find cool novel stuff); sometimes
seq specific things, like the seqs involved in
barcoding (find cool novel species).
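In spirit, barcoding is a lookup: take the barcode region from an unknown sample and find the reference sequence it differs from least. The sketch below does this with a simple mismatch count. The species names and ten-letter "barcodes" are invented, and real barcode regions are hundreds of bases long and are compared with proper alignment tools.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical reference barcodes, one per species.
REFERENCES = {
    "species_1": "ACGTACGTAC",
    "species_2": "ACGTTCGTAA",
    "species_3": "TCGTACGAAC",
}


def identify(query, references):
    """Return the name of the reference with the fewest mismatches to the query."""
    return min(references, key=lambda name: hamming(query, references[name]))


QUERY = "ACGTTCGTTA"  # hypothetical barcode read from an unknown sample
BEST = identify(QUERY, REFERENCES)
print(BEST, hamming(QUERY, REFERENCES[BEST]))
```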
48
(mostly RNA)
transcriptomics: looking at all the transcripts of a
given system - cell, tissue, organism,
environmental sample, etc (all the mRNAs which
will get translated into proteins).
gene expression: analysis of what genes (or DNA
sequences) are being turned on, with the
presumption that they will be translated into
proteins. This, by the way, is an incorrect
assumption but...
reverse transcriptase: enzyme that copies RNA into
DNA (cDNA). Used because RNA is a pain in
the a** to work with.
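What reverse transcriptase does can be mimicked in a few lines: each RNA base is paired with its DNA partner, and the new strand runs in the opposite direction. This is only a sketch of the base-pairing logic; the function name and the example sequence are made up.

```python
# Base-pairing rules for copying RNA into DNA (U in RNA pairs with A).
PAIRING = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5'->3') for an mRNA written 5'->3'."""
    return "".join(PAIRING[base] for base in reversed(mrna.upper()))


CDNA = first_strand_cdna("AUGGCC")
print(CDNA)
```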
49
arrays: chips, chips, chips. For gene expression,
see http://www.bio.davidson.edu/courses/genomics/chip/chip.html
expressed sequence tags: or ESTs. Basically, why
sequence an entire RNA sequence when a small
stretch is enough?
RNA interference: or RNAi. Tool for transient and
specific down regulation of gene expression.
Works best with plants and some bugs.
50
(mostly protein)
proteomics: looking at all the proteins of a given
system (cell, tissue, organism, environmental
sample, etc).
2D gels: proteins separated
first by charge (pI) and
then by size (kDa).
immunotechniques: using antibodies. Because
they’re kind of like specific probes but for proteins.
51
mass spectrometry: machine that works out the
mass of molecules (strictly, their mass-to-charge ratio) by the way they get hurled
around a magnet.
multiple reaction monitoring: doing successive
mass spec experiments, but with filters to select at
each stage.
http://physiologyonline.physiology.org/content/22/6/390
52
(mostly metabolites)
metabolomics: looking at all the
metabolites...(etc)...
lipids: fancy word for fats.
glyco: fancy prefix for things with sugars/
carbohydrates.
gas chromatography: separation procedure that
has particularly high precision. i.e. good for the
crazy diverse nature of all the metabolites in a
given sample.
53
(other)
GMO: genetically modified organisms.
Technically, refers to an organism whose genome has been
directly modified by genetic engineering
techniques.
How is this different from genomics?
54
CASE STUDIES FROM SELECTED
GENOME BC PROJECTS.
GENOMICS OF SUNFLOWER
Project Leader: Loren H. Rieseberg
This project will sequence the genome of the cultivated sunflower, Helianthus annuus, the first
genome sequence for the Compositae (Asteraceae), the largest family of flowering plants,
whose ~24,000 named species comprise roughly 10% of all flowering plants. Many of these
species are economically important crops, medicinal plants or weeds, and this diversity is only
beginning to be tapped by modern growers and breeders. The most economically important of
these is the sunflower, which is also the most widely studied and genetically well-developed
representative of the family. In addition to its importance as an oil seed crop, the sunflower has
tremendous potential for cellulosic biomass production, both as a primary source and as a
residue of oilseed production. Thus, the proposed large-scale sequencing effort will be
accompanied by the development of genomic resources and knowledge needed for
manipulating cellulosic biomass traits and other key agronomic traits in hybrid sunflower
breeding programs. Additionally, improved genetic resources would be useful for those working
on other valuable Compositae crops, medicinals and horticultural varieties, as well as in control
efforts for the many noxious Compositae weeds.
Our specific objectives are to:
1) Extend the genetic map for sunflower;
2) Construct a physical map of the sunflower genome;
3) Generate a reference sequence for sunflower and the Compositae;
4) Determine the genetic basis of agriculturally important traits; and
5) Develop a xylem EST database and determine the genetic basis of cellulosic biomass traits.
Because sunflower has a large genome (3,500 Mbp) with abundant repetitive sequences
(mainly LTR transposons), a shotgun sequencing approach is unlikely to reliably link, order, and
orient sequenced contigs. Therefore, we are proposing a hybrid approach, which incorporates
both whole genome shotgun (WGS) sequencing to ~40x with the Illumina GA platform and
sequencing of BAC pools to ~20x with Roche’s GS FLX platform. The sequencing of BAC pools
mitigates the assembly problems associated with WGS, while concomitantly reducing the
number of libraries required for the traditional BAC sequencing approach. Likewise, the
blending of Illumina and GS FLX reads is effective because the two sequencing platforms bring
different strengths: Illumina brings great depth, but cannot bridge regions of low complexity,
whereas GS FLX reads can span longer repeats, but their higher cost makes deep coverage
expensive. This strategy will provide sequence equivalent in accuracy to high coverage Sanger
sequence, but at a small fraction of the cost. Lastly, we will fill gaps in the assembly by targeted
BAC sequencing in a final, finishing stage. This approach thus takes advantage of next
generation sequencing platforms, using them to avoid costly and time-consuming pitfalls with
traditional genome sequencing approaches.
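To give a sense of scale for the depths quoted above, the sketch below converts "fold coverage" into raw sequence for a 3,500 Mbp genome. The helper name is hypothetical, and the BAC-pool figure assumes the pools together represent roughly one genome equivalent, which the proposal does not state.

```python
GENOME_SIZE_MBP = 3_500   # sunflower genome size quoted in the proposal


def bases_needed_gbp(genome_mbp, fold_coverage):
    """Raw sequence (in Gbp) needed to reach a given fold coverage."""
    return genome_mbp * fold_coverage / 1_000


WGS_GBP = bases_needed_gbp(GENOME_SIZE_MBP, 40)   # Illumina whole genome shotgun at ~40x
BAC_GBP = bases_needed_gbp(GENOME_SIZE_MBP, 20)   # BAC pools at ~20x (assumes ~1 genome equivalent)
print(f"WGS: ~{WGS_GBP:.0f} Gbp, BAC pools: ~{BAC_GBP:.0f} Gbp")
```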
The availability of a full genome sequence will have immediate benefits, allowing several high
value projects to move forward at a rapid pace. First, an association-mapping project will
characterize genetic and morphological variation among domesticated sunflower lines and the
relationship between genetic differences and important crop traits. We will grow and phenotype
several hundred inbred cultivated lines in two different field environments, measuring a suite of
important morphological traits throughout the growing season as well as seed traits after final
harvest. Each line will be genotyped at 96 candidate genes for important traits, enabling us to
detect statistical associations between alleles at each gene and traits of interest. The discovery
of sequence-based markers for favourable alleles will facilitate molecular breeding programs
underway in the private sector.
A second major focus will be the development of sunflower as a new biofuel source with unique
advantages as an annual woody plant. Biofuel development will exploit wood-producing
ecotypes of two extremely drought tolerant, wood-forming, desert-dwelling wild species:
silverleaf sunflower (H. argophyllus) and Algodones dune sunflower (H. tephrodes). Preliminary
characterization of the wood-producing ecotypes of silverleaf sunflower indicates that they are,
for all practical purposes, miniature annual trees. The woody ecotypes grow to 4 meters in
height and 10 cm in diameter in a single year. Silverleaf sunflower differs from switchgrass and
other herbaceous annual plants in a fundamentally important way: the sunflower wood is similar
in quality to quaking aspen and poplar, much denser than forage- and silage-type cellulosic
biomass feedstocks. Denser materials store more energy/volume and are less costly to transport
and store. Moreover, as a drought-tolerant annual dicot, sunflower occupies a unique ecological
and production niche, and is capable of growing in marginal land not suitable for most crop or
timber species. Finally, wood production and quality increase in northern latitudes, so sunflower
might be particularly valuable as a Canadian biofuel.
Thus, the proposed project responds to both 3.2.1 Bioproducts (particularly 3.2.1.1 Feedstock
Optimization) and 3.2.2 Crops (particularly 3.2.2.1 Basic Plant Genomics).
To develop sunflowers as a viable biofuel source, we will first need a better understanding of the
genetic basis of wood production in these species. Comparisons of EST sequences from woody
and non-woody lines will give us candidate genes to examine. A complementary approach will
involve QTL mapping for woody traits and chemical phenotypes in BC1’s and RILs derived from
crosses between woody and non-woody lines. QTL-NILs for wood formation will be advanced
from these populations and will provide a basis for the development of high-biomass woody
cultivars.
Complementing this genomic and bioinformatic work will be a parallel project led by Emily
Marden and Ed Levy, experts on Intellectual Property (IP) in animal and medical genomics.
They will apply their expertise to examine how public domain, patent pools and open source
approaches to IP would work in plant genomics, evaluating how well each approach would be
able to promote innovation in research and the development of new products and services. A
related goal will be to address regulatory questions related to manipulated plant genomes,
particularly second-generation genetically modified crops.
In summary, a fully-sequenced sunflower genome will be a fundamentally important resource,
enabling major advances for sunflower and the entire Compositae and providing the data
necessary for functional and comparative genomic analyses related to agricultural, biological,
and environmental research. Immediate applications of information from the project include the
development of second-generation expression and genotyping arrays for molecular breeding of
sunflower, whereas in the longer term our project will lead to the development of woody, high
biomass sunflower cultivars. Throughout the project we will exploit Canada’s strong genomics
infrastructure and leadership in Compositae genomics and use this infrastructure and expertise
to full advantage in collaboration with experts world-wide.
NEXT GENERATION INTEGRATED PEST MANAGEMENT TOOLS FOR
BEEKEEPING
Project Leader: Leonard Foster, Co-Project Leader: Stephen Pernal
The importance of honey bees (Apis mellifera) to human agriculture cannot be overstated:
many major fruit, nut and vegetable crops we eat rely on the pollination services of bees. The
contribution of bees to Canadian agriculture exceeds $2.3 billion and to U.S. agriculture it is
nearly $15 billion. Worryingly, for the past four years North American beekeepers have lost
approximately one third of their stock annually – roughly three times the historical average.
These losses are largely attributed to bee-specific infectious diseases and even though some
pathogens can be suppressed using pesticides, widespread resistance to many products has
already developed. At the same time, viruses’ impact on bee health continues to grow and no
conventional pesticides are effective against viruses.
Novel Integrated Pest Management (IPM) approaches, such as selection and maintenance of
honey bee stock resistant to diseases or novel targeted treatments, could reverse most of the
recent declines. This project strategically aligns a multidisciplinary team of researchers towards
the goal of developing an economically-justified, practical, integrated and targeted pest
management solution for apiculture, resulting in an estimated $200 million in annual benefits to
Canadian agriculture based on the expected decrease in colony losses, increased honey
production and availability of bees for pollination. Our biological research will focus on
developing two distinct tools: a) marker-assisted selection of disease resistant honey bees, and
b) double-stranded RNA reagents for simultaneous, RNA interference (RNAi)-based control of
the most damaging honey bee pathogens. On the GE3LS side, economists will work with
biologists and beekeepers to estimate the economic viability of the new tools mentioned above
and to develop a set of best-practices recommendations for beekeepers to integrate the use of
these new tools with optimal use of existing tools.
We will employ a highly multiplexed quantitative proteomics approach known as multiple reaction
monitoring (MRM) and genotyping to implement marker-assisted selection of disease resistant
honey bees, building on previous Canadian investments in biomarker discovery work amongst
the applicants. At the same time, we will use a combination of directed (MRM) and discovery
(liquid chromatography-tandem mass spectrometry on an Orbitrap Velos instrument) quantitative
proteomic approaches, together with measures of pathogen load and pathology, to develop and
evaluate RNAi approaches for controlling pathogens. In both approaches, the ultimate tests of
efficacy will occur at the whole colony level, even in commercial apiaries, where on-going,
parallel GE3LS work on the economic fundamentals of the beekeeping operations will provide a
baseline against which the performance of the developed tools can be evaluated. The specific
outcomes of the project will include:
1) demonstrated efficacy of and tools for marker-assisted selection (MAS) in bees by Q9;
2) disease-resistant honey bees with higher economic output than existing stocks by Q12; and
3) new RNAi tools and best-practices recommendations for the use of RNAi for controlling the
most damaging pathogens by Q12.
With our biologists and economists having direct input into each other's research questions, the
ultimate success of the tools developed here will be measured in net economic benefit to
beekeepers and agriculture, and will represent the first substantive, industry-wide step towards
reversing the steady decline in honey bees. Consumers, crop growers and beekeepers will
benefit from improved food security and healthier, more abundant, more effective pollinators.
POPCAN: Genetic improvement of poplar trees as a Canadian bioenergy
feedstock
Project Leader: Carl Douglas, Co-Project Leader: Shawn Mansfield
Society and the industrial milieu have widely adopted biotechnology and genomics in the
production of a variety of commonly used materials, including: pharmaceuticals, clinical
products, environmentally benign consumer goods, as well as complex chemicals and
polymers. POPCAN aims to employ genomics to add transportation fuel (ethanol) derived
from lignocellulosics to this growing list. Specifically, POPCAN will contribute to the
establishment of a sustainable, renewable raw material that can be reliably and consistently
produced on the Canadian landscape. The materials will be optimized to capture solar energy
and accrue biomass, and will inherently possess cell wall characteristics that make them amenable to
efficient transformation into ethanol through a variety of bioconversion unit operations. More
explicitly, the overall goals of this proposal are to thoroughly understand the genetic
underpinnings of tree growth (biomass accumulation) and biofuel trait variation (cell wall
biosynthesis), in two important poplar species, and to use this information to accelerate
Populus breeding for an economically viable lignocellulosic biofuels industry in Canada as a
chief Benefit for Canada. The integrated GE3LS research will provide a framework for land use
change, and make recommendations on economic and public policy to guide the development
and establishment of Poplar plantations in Canada. Goals of the proposed research are:
1. Genetic diversity, genotyping, genomics of adaptation/speciation. Assay the full genetic
diversity of 1400 P. trichocarpa (Phase II and northern BC) and 384 P. balsamifera genotypes
grown in common gardens, and outgroup species by acquiring both transcript and genome
sequences for each individual. This sequence data will also be used to investigate the
population genomics of adaptation and speciation. Outcomes: These efforts will reveal SNPs in
~40,000 genes for each genotype; this information will be used for genome-wide association
studies; identification of genes and allelic variants important for P. trichocarpa and P. balsamifera
local adaptation.
2. Bioenergy/biomass phenotypes. Generate an extensive and comprehensive phenotypic and
physiological database on secondary cell wall xylem and growth traits which can be related to
biofuels production, wood quality, biomass accumulation, pathogen resistance, adaptation, and
conversion recalcitrance in P. trichocarpa and P. balsamifera genotypes. Outcome: Phenotype
database for three association populations to reveal trait variation and support association
genetics linking phenotype and genotype.
3. Molecular phenotypes and comparative genomics. Investigate xylem-specific gene
expression variation and allele-specific gene expression/alternative splicing variation (molecular
phenotypes) in 200 P. trichocarpa Phase II genotypes; investigate allele-specific expression
in selected interspecific hybrids; compare xylem-specific transcriptomes of three key plantation
species currently targeted as bioenergy sources - Populus, Eucalyptus, and Salix (willow);
investigate the potential role of epigenetic modification in clonal material in affecting trait
variation. Outcomes: Comprehensive set of molecular phenotypes to complement
bioenergy/biomass phenotypes; contribution of allele specific expression to hybrid
vigour/heterosis; data on trait stability across different environments; conserved genes as new markers
for optimizing woody biomass production.
4. Association genetics. Use genetic polymorphism (SNPs) and phenotypic variation for
association mapping to validate previously identified associations and identify new associations.
Outcome: Identification of allelic markers for use in marker-assisted breeding (MAB) for
accelerated feedstock improvement.
5. Breeding. Integrate genotyping of SNPs identified by association genetics into poplar
breeding programs in Alberta, Saskatchewan, and Oregon. Outcome: Assess the ability of
MAB approaches to contribute to rapid selection of superior individuals for biofuel and biomass
traits within the 3-year time frame of the project.
6. GE3LS. Investigate land use change, and provide economic and public opinion variables on
the feasibility to establish viable, fast-rotation poplar plantations in Canada. Outcomes:
Identification of public policy and social challenges to plantation forestry; public policy
recommendations.
ADAPTREE: Assessing the adaptive portfolio of reforestation stocks for future climates
Project Leaders: Sally Aitken, Andreas Hamann
A changing climate is the largest threat to forest productivity in western Canada and to the
ability of forested landscapes to provide ecological, economic, and cultural services, both now
and in the future. We propose applying state-of-the-art technologies and analytical methods
from genomics, geospatial analysis, and climate modelling to the two most economically
important western Canadian tree species – lodgepole pine (Pinus contorta) and interior spruce
(Picea glauca, P. engelmannii, and their hybrids). Coupled with in-depth socially-based inquiry
into the potential socio-economic impacts on forest-dependent communities, this research will
provide crucial information to guide reforestation policies and practices for uncertain future
climates. As climate changes and genotypes become increasingly mismatched with local
conditions, the production of some 45 million m³ of timber for annual harvest in British Columbia
and Alberta from these two species – worth well in excess of $10 billion per year – is predicted
to diminish up to 35% this century. This problem can be reduced dramatically with interventions
that match reforestation stock to anticipated future environments. However, there is a pressing
need to inform such actions by carefully developing and contextualizing scientific information
and by applying it through appropriate provincial reforestation policies. Reforestation seedlots
from seed orchards that are selectively bred for increased growth rates and wood quality under
current climatic conditions are the genetic and economic basis of tomorrow’s harvests. These
select genotypes need to be spatially reallocated to climatically suitable habitat over vast areas
through operational tree planting.
To address this problem, we need to understand which genes are involved in adaptation to local
climatic conditions. We will first sequence the transcriptomes of seedlings of both species grown
under a range of temperature and moisture treatments to capture the sequences of genes
potentially involved in adaptation to climate. We will then determine the genetic architecture of
adaptation to climate for temperature- and moisture-related traits through targeted resequencing of 10,000 genes in an initial association test of ~750 individuals of each species
sampled from a wide geographic area. SNP genotyping and controlled climate association
experiments with an additional 4,000 seedlings of each species will then be used to assess the
genetic diversity and adaptive capacity of both natural populations and orchard seedlots. Within
the three-year timeframe, we will develop strategic climate-based seed zones and policies to
manage this adaptive portfolio for uncertain future climates.
This research raises a number of economic, social and value-based questions that we will
endeavour to answer by engaging with a range of stakeholders and end-users, including
forest-dependent communities, industry representatives and provincial staff responsible for
reforestation and seed transfer policy. Based on a series of predictive ecological models that
incorporate various intervention strategies and predictions of climate change, we will conduct an
assessment of the socio-economic outcomes and explore both the understanding and
perceptions of this range of plausible climate change adaptation strategies for BC and Alberta.
With a better understanding of the role of genetic resources in the ecological, economic and
social sustainability of forests and the communities that depend on them, we will – through a
thorough policy regime analysis – translate our technical findings into substantive changes in
policy and forest management that address climate change in a proactive and strategic manner.
Our project will provide the scientific basis and policy recommendations with which to ensure the
productivity of our forests of tomorrow and avoid losses of up to hundreds of millions of dollars in harvest
and wood product value per year over the next century.
WATERSHED DISCOVERY: Applied metagenomics of the watershed microbiome.
Project Leaders: Dr. Patrick Tang and Dr. Judith Isaac-Renton
Our goal is to change the way we monitor water quality in our watersheds. The new science of
metagenomics will be used to discover novel indicators of water pollution. These novel
indicators will be essential tools in our efforts to protect the health of our watersheds.
To use metagenomics to measure the impact of pollution on the communities of microorganisms
(the microbiome consisting of viruses, bacteria and protists) in different watersheds.
To create novel tests that monitor these changes in the microbiome in order to detect pollution
and pinpoint the specific source of the pollution.
SOME DETAILED REFERENCE NOTES.
1.1 CONTEXT
1.2 THE DNA (BASICS)
1.3 CODE = PHENOTYPE
1.4 MOLECULAR BIOLOGY
1.4.2 POLYMERASE CHAIN REACTION
1.4.3. SANGER SEQUENCING
1.5 THE (HUMAN) GENOME
1.6 LOOKING AT SNPs
1.7 BETTER WAYS TO JUST SEQUENCE THE HECK OUT OF A SINGLE SAMPLE
2.1 LEVEL UP: LOOKING AT GENE EXPRESSION
2.1.1. SOME RNA SPECIFICS
2.2 LEVELING UP AGAIN: SOME WORDS ABOUT PROTEINS AND METABOLITES.
2.3 CENTRAL DOGMA + ALL OF THIS DATA + PHENOTYPES = INTERESTING SCIENCE
2.3.1 MAKING SENSE OF THE CORRELATIONS
1.1 CONTEXT
At the end of the day, and whether we realize it or not, we are concerned about
the living. Whether this is in the context of improving our health or the health of
those around us, understanding the players in our ecosystems, or obtaining
resources for food and/or materials, many of the advantages that our society
has are a result of our ability to understand and harness the living. And
broadly speaking, this means that we want to know as much as we possibly
can: about our own human bodies, and also biodiversity at large. In fact, we
see value in not just understanding it, but in perhaps being able to predict it, or
even control it.
Genomics, and its sister terms (proteomics, metabolomics, etc) are one avenue
of such exploration. What makes them powerful, however, is that they operate
on a wealth of information: the biggest pile of data you can imagine, all
of which exists in molecular parts, if not in digital form.
It’s also very powerful, quite amazing, if not a little scary sometimes. But that’s
why it’s good to have a clear handle on the science involved.
--Let’s start with a small glossary...
GENOME: In modern molecular biology and genetics, the genome is the entirety
of an organism's hereditary information. It is encoded either in DNA or, for many
types of virus, in RNA. The genome includes both the genes and the non-coding
sequences of the DNA/RNA.
GENOMICS is a discipline in genetics concerning the study of the genomes of
organisms.
DEOXYRIBONUCLEIC ACID or DNA is a molecule that contains the genetic
instructions used in the development and functioning of (almost) all known living
organisms.
NUCLEOTIDES are molecules that, when joined together, make up the structural
units of DNA.
A GENE is a unit of heredity in a living organism. It normally resides on a
stretch of DNA (or RNA) that codes for a type of PROTEIN that has a FUNCTION
in the organism.
The FUNCTION of a GENE PRODUCT can (by itself or in tandem with other
GENE PRODUCTS) result in an observable PHENOTYPE or TRAIT.
An ALLELE is one of two or more forms of a gene or a genetic locus (generally a
group of genes). Sometimes, different alleles can result in different
observable phenotypic traits, such as different pigmentation. However, many
variations at the genetic level result in little or no observable variation.
1.2 THE DNA (BASICS)
From most perspectives, a lot of people would feel that the basic mode of data
that scientists have been using is DNA. This is sort of true, and for now, close
enough for us to dig a little deeper.
--So what exactly are the key features of an organism's DNA? Well, central to this
is the idea that the DNA contained within an organism (the genome) is
analogous to a blueprint for the construction and operation of that organism. In
other words, the DNA is very much like an instruction manual – a very detailed
and voluminous instruction manual.
No offense to all the wonderfully talented individuals in the world, but Mother
Nature has really outdone herself here with a rather superb job of getting this
genome business to work. It is nothing short of amazing.
This instruction manual is basically a code that is written in the language of a
molecule called deoxyribonucleic acid, (this here is our DNA). DNA is this
rather pretty looking molecule that is composed of four different building blocks.
Together, these building blocks are known as nucleotides, and individually they
each have a chemical name which is often abbreviated with a single letter -
these letters being A for adenine, T for thymine, C for cytosine, and G for
guanine. In effect, your DNA code is much like a language, a script of sorts,
with the principal difference being that it is composed of only four letters instead
of the full twenty six.
But what exactly does it code for? Let’s use Homo sapiens as a general
example.
As alluded to earlier, a classic example of what your DNA code is capable of
doing is the textbook case of natural eye colour. Your eyes are a certain colour
because of the instructions within your DNA. The same is also true for your
natural hair colour, and in other school examples such as whether you are able
to roll your tongue or not. However, it's important to realize that virtually every
physical attribute you have is determined by your genetic makeup. In other
words, this also includes subtle nuances like the fact that some of your
acquaintances are more prone to farting when they ingest dairy products or that
a few of your friends may get drunk more quickly than others. Taken together,
this means that your DNA is responsible for an awful lot of information, which at
first glimpse is difficult to fully appreciate.
To put this all in perspective, it’s important to try and visualize the enormity of
the task at hand. One good way of doing this is to concentrate and focus on
one of your thumbs. Ask yourself a few simple questions. How does your
thumb know that it’s a thumb? How do its cells distinguish themselves from
the cells of other fingers? How does it know to come out of a certain place on
your hand – next to your forefinger, not next to the pinkie finger? For that matter,
how does it even know that it should be protruding from your hand and not from
your foot? In truth, these somewhat bizarre thoughts centre round a field of
research known as developmental biology. These sorts of scientific questions
are constantly asked in this dynamic field, although not necessarily always for
the thumbs - rather for the architecture of the entire body, or even possibly some
other creature's body. In essence, these biologists continually think about the
following question. How do we go from a single cell entity, created from a
marriage of a sperm and an egg, to a being of very set features, full of many
different types of cells and many different types of tissues? If you look around
you, distinct though we are from each other, we are all basically the same. What
I mean here is that generally speaking (and I hope I don't offend anyone), we all
have heads, we all have torsos, and so on and so on. Furthermore, all of these
bits and pieces are usually in the right sorts of places.
You must remember that it is your genome that is providing and directing all of
this information. Imagine doing this yourself with pen and paper. Think of all the
countless notes and scribbles you would need, so that something as basic as
your body shape is done properly. For example, you may need to devote a few
pages to your eyebrows. You would need to ensure that your eyebrows are in
the right place. Not anywhere unsettling like your nipples, but somewhere on
your face. Over your eyes and not under your eyes. You would need to
describe their thickness, their length, and their colour. Hopefully, you get the
point - the details are endless. Simply put, the amount of physical information in
your DNA code is mind boggling.
And it doesn't even stop there. Although a bit more controversial, it is becoming
clear that an individual's general behaviour and personality are, in part,
predetermined by their DNA. The game we just played which essentially
covered the genetics of things like LOVE, LONGEVITY, INTELLIGENCE, and
LOVE OF BRUSSELS SPROUTS easily demonstrates this. Obviously in this case,
a person's environment and experience play a vastly more dominant role, but
there is nevertheless plenty of evidence to suggest the importance of genetic
factors in these types of traits.
SOME NUMBERS
A fairly good estimate of the size of your genome is a total of 3.0 billion letters of
code.
Other genomes: ~390,000,000 bp (rice), ~2,400,000,000 bp (dog), 4,639,221
bp (E. coli), 157,000,000 bp (Arabidopsis), 18,000,000,000+ bp (Pinus tree)
It's worth noting that these are actually really huge numbers, the scale of which I
find is often lost to the casual listener. You get habituated, I think, by references
to the country being in debt so many billion dollars, or by certain athletes
signing billion dollar contracts. This is, matter of factly, a very big number and
many other analogies abound that are much more eloquent than my shit
example. If we were, for instance, to take 3.0 billion nucleotides and translate
them into text, letter for letter, the genome in its entirety would be equivalent to
about 8000 copies of the first Harry Potter book. Another one is to take 3.0
billion grains of sugar and pile them up in one spot. Apart from wasting a lot of
sugar, you would discover that you would've formed a mound about the size of
three cars. And we could even say that piling 3.0 billion cars on top of each
other would likely resemble a mountain of Everest proportions.
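If you want to sanity check this sort of number yourself, here is a minimal Python sketch. The 3.0 billion letters and the 8000 copies come straight from the text above. The "2 bits per letter" figure is an assumption of this sketch (four possible letters fit into 2 bits, ignoring any real file format), and the variable names are made up for illustration.

```python
GENOME_LETTERS = 3_000_000_000  # ~3.0 billion letters of code (from the text)
BOOK_COPIES = 8000              # copies of the first Harry Potter book (from the text)

# How many letters each copy of the book would have to hold for the analogy to work.
letters_per_book = GENOME_LETTERS // BOOK_COPIES

# Assumption: with only four possible letters, each letter needs 2 bits (2**2 == 4).
BITS_PER_LETTER = 2
genome_megabytes = GENOME_LETTERS * BITS_PER_LETTER / 8 / 1_000_000

print(letters_per_book)  # 375000
print(genome_megabytes)  # 750.0
```

In other words, the analogy implies a book of a few hundred thousand letters, and the bare code (with no annotation at all) would come to roughly 750 megabytes.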
Regardless of the analogy you use, the sheer size of your genome does make
sense. It is, after all, responsible for so many things, and you would assume
that you would need that much information to get all the details and all the
nuances sorted out.
1.3 CODE = PHENOTYPE
How exactly does DNA code translate itself into an observable characteristic, a
phenotype?
Here, we need to go over the CENTRAL DOGMA. Essentially a mechanism of
going from “instruction booklet” to an actual “product.”
For now, ignore the bit about RNA
...Which is very much about proteins: in fact Martha Stewart would say, "proteins
are a good thing." In life, they are the true movers and shakers of any
organism, and are the molecules that actually go through the daily business of
living. In short, these are the building blocks that give your various tissues their
shape and their function. I bring this up now, because all of this talk about
genomes and DNA is only illuminating if you recognize the fact that your genetic
code is simply the instructional package for making all of the different types of
proteins needed for life.
And things you need for life include: proteins that regulate chemical reactions
(for instance, in the conversion of the food we eat into energy); proteins that
transport key molecules from one place to another (like the pump implicated in
cystic fibrosis); proteins that become the basis of cell structure (like how the
architecture of certain tissues is achieved); proteins that facilitate cellular
communication (how all the different bits and pieces of your body can work
together). In truth, the diversity in protein makeup is responsible for all the
diversity in life itself. In other words, bring on the bacon.
In itself, how proteins come about from your DNA code is quite clever. Proteins
are built by piecing together molecules that are collectively called amino acids.
It's a bit analogous to DNA in that if you recall, your DNA code consists of
specific combinations of nucleotides. However, whereas our DNA is composed
of an alphabet of only four different letters (A, T, C, and G), proteins are built
with a much larger alphabet of 20 different letters or 20 different amino acids.
Again, for any given protein, the determination of which of the 20 amino acids to
use and in what order they are to be pieced together is dependent on the
nucleotide code itself. This might sound a little confusing but in essence the
production of proteins is dependent on dealing with two types of code. More
specifically, each combination of three nucleotides (often referred to as a
codon) will signify a particular amino acid. For example, the nucleotide T,
followed by a G and another G (or TGG) codes for the amino acid Tryptophan
(abbreviated ‘W’), the sequence ATG codes for the amino acid Methionine
(abbreviated ‘M’), and so on. The sequence TGGATG would therefore code for
two adjacent amino acids, Tryptophan and Methionine. Altogether, there are
three letter codons for all 20 different amino acids. In this manner, a long
sequence of nucleotides can potentially and theoretically be translated into a
long chain of amino acids - i.e. a protein molecule.
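To make the codon idea concrete, here is a small Python sketch of that reading step. The table is deliberately incomplete: it holds only the two codons named above (a full table would have 64 entries). The function name translate and the '?' placeholder for codons missing from the table are just choices made for this illustration.

```python
# Partial codon table: only the two codons mentioned in the text.
CODON_TABLE = {
    "TGG": "W",  # Tryptophan
    "ATG": "M",  # Methionine
}


def translate(dna: str) -> str:
    """Read a DNA string three letters at a time and look up each codon.

    Codons that are not in the partial table come back as '?'.
    Any leftover one or two letters at the end are ignored.
    """
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3
    amino_acids = []
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3]
        amino_acids.append(CODON_TABLE.get(codon, "?"))
    return "".join(amino_acids)


print(translate("TGGATG"))  # WM
```

The output WM is the Tryptophan then Methionine pair described above, written with the one-letter abbreviations.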
To illustrate this two code system, the best example that comes to mind, is the
use of Morse code to send messages overseas. In this situation, you essentially
have a binary code (dot or dash, two options), which when arranged into short
units of up to four symbols, can translate into one of the 26 letters. For example, dot dot dot is the
same as the letter 'S', and dash dash dash is the same as the letter 'O'. This is a
two code system. Your first part being the Morse code element, and the second
part being the formation of words from letters. In our biological example, the
first code involves the use of nucleotides to provide information for which amino
acids to use, whereas the second code dictates the length and combination of
amino acids to form a functionally relevant protein.
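A quick count shows why a unit of three nucleotides is enough for the first code, and, for comparison, how much room a dot and dash alphabet has. This is only a counting sketch in Python. The choice to count Morse patterns of one to four symbols is an assumption of the example (that range is what the 26 letters use), and the variable names are made up.

```python
from itertools import product

NUCLEOTIDES = "ATCG"

# Every possible three-letter codon: 4 * 4 * 4 combinations.
codons = ["".join(letters) for letters in product(NUCLEOTIDES, repeat=3)]
print(len(codons))  # 64, comfortably more than the 20 amino acids

# Dot/dash patterns that are exactly three symbols long.
three_symbol_patterns = 2 ** 3
print(three_symbol_patterns)  # 8

# Dot/dash patterns that are one, two, three or four symbols long.
morse_patterns = sum(2 ** length for length in range(1, 5))
print(morse_patterns)  # 30, enough to cover 26 letters
```

Since 64 codons are available for only 20 amino acids, there is room for several different codons to signify the same amino acid.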
(Now let’s take back the bit about ignoring the RNA)
In truth, the relationship between proteins and DNA is a little bit more
complicated. First, it turns out that the overwhelming majority of the human
genome doesn't code for any protein, and has long been nicknamed garbage, junk,
filler or if you want to be particularly nasty, crap. This accounts for an
astounding 97% or so of your genome (although some of these regions do turn out to have regulatory jobs).
This introduces an interesting logistical problem in that humans are using what
is essentially a polluted genetic code. In other words, there has to be a system
that allows the deciphering of the good stuff from the bad. You don't want to
waste your time decoding your junk regions in that it could translate into some
random, useless or potentially harmful protein.
Even your freakin’ earlobe is doing this right now!
Secondly, the location of your DNA and the location of protein synthesis are
different. This of course, makes no sense because how can you translate your
DNA into proteins if the two molecules reside in geographically distinct places?
Here, we find that your DNA is found within a small physically enclosed area of
the cell called a nucleus, and proteins are awkwardly made outside the nucleus.
Although this nucleus could be viewed as simply a mechanism to "house" and
protect your genomic DNA, it does create a rather unfortunate conundrum in
that the all important DNA code is not accessible to the machinery necessary for
its translation into proteins.
In a rather crafty way, biology has managed to solve these problems through
the use of a middle-man known as the messenger RNA molecule or mRNA for
short. For the sake of clarity, mRNA is fundamentally similar in structure to DNA,
being built from nucleotides. There is a slight difference but it’s visually quite minor - it
could actually be the basis of a challenging ‘spot the difference’ comic.
However, thinking conceptually, mRNA is comparable to a no-nonsense piece
of genetic code that is constructed from only the useful parts of your DNA. This
is similar to having study notes for a particular subject where only the crucial
parts are highlighted and regurgitated. Consequently, problem one is solved.
Here we have a strategy that can weed out the good from the bad and hence no
crap.
Additionally, mRNA is special in that it is a string of nucleotides with the ability to
move and ultimately leave the nucleus. You have to remember that your
genomic DNA living inside the nucleus of a cell is akin to an elephant stuck in
the upstairs toilet. It is simply too big to pass through doors that might otherwise
be situated along the walls of the nucleus. mRNA molecules do not need to be
so big. They are much more manageable in size because for each molecule,
they contain only the sequence of one protein (not all of them), and more
importantly they contain only the necessary sequence of that one protein (no
junk). This means that problem two is also solved, as mRNA acts as a mobile
representative of the genetic code that can now get out and come into contact
with components required for protein production.
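For those who like to see the idea as code, here is a very simplified Python sketch of "keep only the useful parts". Everything specific in it is an assumption made for illustration: the toy gene sequence, the positions of the useful regions, and the function name make_mrna. It also skips the real copying mechanics and only swaps T for U, which is the letter RNA uses in place of T.

```python
def make_mrna(gene: str, useful_regions: list[tuple[int, int]]) -> str:
    """Join the useful stretches of a gene and write them in RNA letters.

    useful_regions is a list of (start, end) positions, with end excluded,
    marking the parts of the gene worth keeping.
    """
    kept = "".join(gene[start:end] for start, end in useful_regions)
    return kept.replace("T", "U")  # RNA uses U where DNA uses T


# Toy gene: useful bit + junk + useful bit (hypothetical sequence).
gene = "ATG" + "GTAAGT" + "TGG"
useful_regions = [(0, 3), (9, 12)]

mrna = make_mrna(gene, useful_regions)
print(mrna)  # AUGUGG
```

The 12 letter gene becomes a 6 letter message, which is the "study notes" idea in miniature: shorter, portable, and with no junk.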
--Confused? Don't worry, it's alright if it seems a little perplexing right now. I
know many people who have had nightmares over this stuff. If you do find
yourself waking up in the middle of the night screaming nonsense about RNA
and elephants, try thinking of the following analogy.
Because you are such a wonderful person, you wish to prepare a nice chocolate
cake for your friend, and to do this, you visit the library to look for a good cake
recipe. For some unexplained reason, you are also a huge Martha Stewart fan,
which is why you decide to look for a cake recipe in one of her many 'Martha
Stewart Living' magazines. After searching for several hours, lo and behold, you
find a promising recipe in her 'Weddings Issue', but notice that the magazine
itself has a sticker on it that says 'for reference only.' This is a bit of a bother
because it means that you won't be able to take the magazine out of the library,
and hence, into your kitchen where you had planned to spend most of your time
being a wonderful person. Furthermore, despite your best efforts, you can't
seem to find any semblance of a photocopy machine anywhere, since this is the
sort of library this is, and since the analogy wouldn't work otherwise.
Begrudgingly, this small nuisance forces you to look for a pen and a piece of
paper so that you can manually scribble down the recipe to take home. As you
do this, you quietly think to yourself that your friend had better appreciate all of
this effort.
No offence to Ms. Stewart, but I find her magazines are always full of extraneous
and in my opinion useless information. Do you really need to know the history of
the chocolate cake? Do you really need to know about the appropriate cutlery
used for serving cake? Do you really need to see and evaluate 15 different
colour schemes for acceptable presentation? I don't think so. All you really
need to concern yourself with is the ability to make the cake taste good. This is
why, when you go to the bother of copying down the recipe, you don't include
all of the nonsense - you just copy down what you need. In short, this turns out
to be just a few lines of ingredients and directions scrawled neatly on your piece
of paper. The crucial point is that you can now freely walk out of the library with
the recipe in hand.
Next, of course, is a trip to the local grocery store where you would get all the
necessary cake ingredients and maybe indulge yourself with the smutty
magazine about child actors gone bad. After which, you would head home and
bake a wonderful chocolate cake which is met with such praise, that you are
glad you didn't waste your time using table setting number six for the occasion.
A strange story indeed but here is how the analogy works. First, you need to
envision the entire library with all of its resources as the genome, and also
envision the building itself as the nucleus. The complete recipe found in the
magazine actually represents the genomic sequence for one particular protein.
As mentioned before, Martha publications tend to have a lot of useless
information, some of which is not even directly related to the production of the
cake (advertisements and historic footnotes). This is identical in premise to the
crap in your genomic material, and the concise notes you scribbled down
symbolize the messenger RNA molecule. This, as mentioned before, is twofold
important because, (1) it represents the minimal amount of information needed
and, (2) it represents the ability to leave the nucleus (in this case, the library)
and the ability to go to places where protein production can take place (in this
case, the rest of the world, but more importantly the grocery store and your
kitchen). Finally, the cake itself represents the protein. Remember I said that in
living systems, it's really the proteins that are the real movers and shakers?
They are certainly the most interesting parts of the big biological picture, and
wouldn't you say that the cake itself is the most interesting part of this process?
ANYWAY, this is some of the basics of genetics (replication, central dogma,
DNA, RNA, protein, genes, genome). This stuff is really powerful, and clocking
along at a phenomenal speed.
Anyway, consider the following:
ONE: That traits and phenotypes are what we pragmatically care about.
TWO: That these are reflected by a variety of different types of molecules.
THREE: this means that we technically have different places to “look.” i.e. DNA,
or RNA, or Protein.
Taken together, this is why we have the field of molecular biology. This is
essentially a catch phrase that encompasses the science and methodologies
that allow us to look at these molecular components, especially ones that
divulge some information about how the code (at various stages) becomes the
phenotype.
1.4 MOLECULAR BIOLOGY
Looking at the molecules involved in biological processes! i.e. How to study
DNA, RNA, and proteins!
Time to get a little more technical...
Let’s look a little more closely at DNA, by going over the act of DNA replication.
This, here, is the process that allows DNA to make copies of itself. Important
because as cells grow and divide (aka multiply), the code needs to be
maintained, and therefore new copies need to be produced for newly produced
cells.
See http://popperfont.net/2011/09/05/breakfast-of-champions-does-replication/
and then come back…
With replication now under our perfunctory belts, now we can look a little more
closely at a number of specific molecular biology techniques.
Keep in mind that the general idea is that researchers can take advantage of
various enzymes and use them to manipulate DNA molecules (or RNA, protein
molecules). As a point of reflection, it turns out that every enzyme described in
our replication lecture is something routinely used to mess with DNA. (to stick it,
to cut it, to copy it, etc)
1.4.2 POLYMERASE CHAIN REACTION
PCR is an excellent example of how it's often the most simple and
elegant ideas that really propel science. A bit of history: Kary Mullis is the
principal investigator behind the technique. An interesting (slightly eccentric)
fellow who won a Nobel Prize in 1993 for the technique (actually shared the
Nobel that year with our very own Michael Smith).
Basic premise.
Working with DNA. If you want to manipulate it, or even see it --> you need LOTS of
it.
PCR is an attempt to amplify DNA in a test tube environment.
EXCEPT that this amplification occurs via replication in the natural sense, and
replication is a little on the complicated side (in terms of getting it to work in an
in vitro setting). So, Dr Mullis essentially tried to MacGyver his way through the
experiment.
POINT ONE: replication requires open or single stranded DNA. Usually
accomplished by many enzymes that unwind DNA (such as helicases). Screw
the enzyme - Let's use HEAT to open up our DNA.
POINT TWO: replication also needs a primase enzyme to make primer for the
polymerase. BUT, you can make your own! Buy oligos.
POINT THREE: with the use of a DNA primer system, you don’t need DNA pol I.
POINT FOUR: with a forward and reverse primer system, don’t need LIGASE.
POINT FIVE: with heat globally opening your DNA, you don’t need
TOPOISOMERASES either.
POINT SIX: You can get this to work with only a workhorse DNA polymerase (i.e.
one enzyme).
HOWEVER, we do have one problem: this high temperature will basically
knacker out any protein structures, including our polymerase. To get around
this, let's use a polymerase from a bug that grows in high temperatures
(thermophile). i.e. this will be a heat stable polymerase.
HUGE ADVANTAGE #1: PCR is an elegantly simple, and extremely forgiving
procedure.
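The forward and reverse primer idea can be sketched in a few lines of Python. This is a toy "PCR on paper", not a model of the chemistry: the template sequence, both primers, and the function names are all invented for illustration, and it assumes the primers match the template perfectly.

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_pcr(template: str, forward: str, reverse: str):
    """Return the stretch of template bounded by the two primers, or None."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer sticks to the opposite strand, so on the strand we
    # are reading we look for its reverse complement instead.
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(site)]


template = "GGGG" + "ATGCCA" + "TTTTTT" + "GACTGA" + "CCCC"
forward_primer = "ATGCCA"
reverse_primer = "TCAGTC"  # reverse complement of GACTGA

amplicon = in_silico_pcr(template, forward_primer, reverse_primer)
print(amplicon)  # ATGCCATTTTTTGACTGA
```

Only the region between (and including) the two primer sites comes out, which is why choosing your oligos is the same thing as choosing your target.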
PCR cycle is usually composed of three steps:
denaturation,
annealing, and
elongation.
Essentially, each cycle is responsible for doubling the amount of target DNA. A
cycle can take anywhere from 1.5 minutes to 5 minutes long, meaning that after
30 cycles, you have the potential to produce ~1,000,000,000 molecules of
amplified product from one molecule of template.
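The doubling arithmetic is easy to check. Here is a minimal Python sketch, assuming perfect doubling on every cycle (real reactions fall short of this). The cycle count and the 1.5 to 5 minute range come from the text above; the function name copies_after is made up.

```python
def copies_after(cycles: int, starting_molecules: int = 1) -> int:
    """Number of target molecules if every cycle doubles the amount perfectly."""
    return starting_molecules * 2 ** cycles


CYCLES = 30
print(copies_after(CYCLES))  # 1073741824, i.e. roughly a billion

# Total run time for 30 cycles at the fast and slow ends of the range.
fastest_minutes = CYCLES * 1.5
slowest_minutes = CYCLES * 5
print(fastest_minutes, slowest_minutes)  # 45.0 150
```

So one molecule of template can, in principle, turn into about a billion copies in somewhere between 45 minutes and two and a half hours.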
HUGE ADVANTAGE #2: PCR gives you data.
ALSO, individuals over the last 15 years have been quite creative with the
technique to do some very cool things. For example:
*site-directed mutagenesis
*production of restriction endonuclease site for convenient cloning
*production of single and double strand product for sequencing protocols
*quantitation of rare DNA
*quantitation of mRNA expression
*differential display of mRNA by PCR
*random amplified polymorphic DNA (RAPD)
*cloning dinosaurs from blood found in fossilized mosquitos trapped in amber
HUGE ADVANTAGE #3: PCR is a very versatile procedure with many possible
uses.
1.4.3. SANGER SEQUENCING
Often known as the Chain Termination procedure.
AND sequencing has allowed us to figure out some pretty remarkable stuff,
such as the general structure of what a human gene might look like, but more
importantly, it has allowed us (from about the 1960s to the present day) to figure
out some very fine tuned details about specific genes/proteins. i.e. we know a
fair bit about how genes work, how the DNA code is organized, how genes are
turned on, how they are turned off, how a single gene can actually have multiple
roles.
BASICALLY, the research pipeline worked in this way for a while. Whereby, you
have:
an interesting organism,
with an intriguing phenotype which you try to functionally make sense of,
which is linked to a gene or two (or more)
which in turn is sequenced so that you gain info on how that gene might work exactly.
And in this way, a single organism (such as a human) gets “figured out” one
gene at a time, until you have a sense of how it all works together. i.e. you get a
sense of how all these genes might work in the context of the whole organism, in
the context of the whole DNA code. NOTE that this is information culled from a
variety of sources, not a single specimen. i.e. the data (from many samples) is
representative of a species, but the data is not necessarily derived from a single
sample of that species.
ALSO NOTE that the data has an added layer of complexity, because genes
don’t tend to be an ALL or NONE thing. You have genes that may exist in forms
that work better, work slower, only work when such and such is just right.
BOTTOM line, this mass of data gets pretty complicated to sort out.
WHICH IS WHY COMPUTERS ARE VERY VERY HANDY!
THEN, OF COURSE, THE NOTION OF SEQUENCING AN ENTIRE GENOME
PROVIDED THE NEXT STEP. To discuss the implications of this, the drama of
the human genome project is probably the best avenue of exploration.
1.5 THE (HUMAN) GENOME
First, a reminder of some definitions:
GENOME: In modern molecular biology and genetics, the genome is the entirety
of an organism's hereditary information. It is encoded either in DNA or, for many
types of virus, in RNA. The genome includes both the genes and the non-coding
sequences of the DNA/RNA
GENOMICS: is a discipline in genetics concerning the study of the genomes of
organisms.
HUMAN GENOME PROJECT (working draft announced June 26th, 2000; declared
complete in April 2003): I’ll avoid the dramatic bits (which I have a feeling
may be discussed by Allen). Briefly…
PUBLIC: In the U.S. most of the funds came from the National Institutes of
Health. This is referred to as the “public project”: about $3 billion set over
15 years.
Dr. Francis Collins, director of the National Human Genome Research
Institute. (Started in 1990.) “The Human Genome Initiative is a worldwide
research effort with the goal of analyzing the structure of human DNA and
determining the location of the estimated 100,000 human genes...” (from the
initial draft proposal for NIH funding of the human genome project).
~30 different human cell libraries.
FOR-PROFIT: most famous enterprise is Celera (Latin for “speed”) Genomics
Group.
SPEED MATTERS: J. Craig Venter, an NIH scientist frustrated at the slow pace
of sequencing the genome, leaves NIH in 1992 to form The Institute for Genomic
Research (TIGR) with wife Claire Fraser. Venter and others developed a
technique termed SHOTGUN SEQUENCING which relies heavily on automated
DNA sequencing machines, and in 1995 are the first group in the world to
sequence the complete genome of a free-living organism (Haemophilus
influenzae, a bacterium that, despite its name, does not cause the flu).
In 1998, Venter teamed up with PE Biosystems / Applied Biosystems (ABI) to form
Celera. Goal: sequence the human genome by 2001 (2 years before the planned
completion by the HGP), and for a mere $300 million (about a tenth of the
public project).
Sample derived from DNA samples taken from 5 individuals. Celera says it used
one man's DNA as the foundation for its work. (This turned out to be Dr. Venter
himself.)
In the end, both project streams agreed to share data to help complete a draft
of the human genome sequence by mid 2000. This draft worked out to
approximately 10 fold coverage (i.e. they actually tried to sequence the sucker
10 times!) for 90% of the total sequence.
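As a rough worked example of what “10 fold coverage” means: coverage is usually
estimated as (number of reads x read length) / genome size. The sketch below
is only arithmetic. The 3 billion bp genome size comes from these notes, but
the 1,000 bp read length and the function names are assumptions chosen for
illustration, not figures from either project.

```python
def fold_coverage(num_reads, read_length_bp, genome_size_bp):
    """Average number of times each base is sequenced."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Number of reads required to reach a target average coverage."""
    return target_coverage * genome_size_bp / read_length_bp


GENOME_SIZE_BP = 3_000_000_000   # roughly the size of the human genome
READ_LENGTH_BP = 1_000           # assumed length of one Sanger-style read

# How many reads would 10 fold coverage take under these assumptions?
print(reads_needed(10, READ_LENGTH_BP, GENOME_SIZE_BP))
# And the reverse calculation: 30 million reads gives what coverage?
print(fold_coverage(30_000_000, READ_LENGTH_BP, GENOME_SIZE_BP))
```

Under these assumed numbers, 10 fold coverage works out to about 30 million
separate reads, which is why automation mattered so much.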
SOME KEY FINDINGS
• The HGP has revealed that there are probably about 25,000 to 40,000 human
genes (since updated to a count of ~20,500).
This is really quite amazing: something as complicated as a living
organism such as a human can be derived from the actions and mechanisms of
this few working parts! A point of comparison is that the total number of
different lego pieces (as of May 2010) was about 21,000.
• The human genome is remarkably similar to other genomes in terms of total gene
numbers and gene functions, although most genes are more complex.
(Comparative Genomics)
This, folks, was already sort of known but the HGP really cemented this view:
that there is value in working with other simpler organisms to pose biological
questions that may be applicable to human biology. Use E. coli as an example,
in that E. coli is just way less maintenance to look after and study than, say, a
human.
• Between 1.1% and 1.4% of the genome's sequence codes for proteins.
Regions with no known function appear to account for ~97%. 12% of human genomic
DNA is due to copy number variations (CNVs).
The main thing here is to recognize that an awful lot (~97% of the DNA
sequence) does not code for protein and, at first glance, appears to be
completely useless (whether it really is useless is still debated).
• ~2 million single nucleotide polymorphisms - SNPs (~0.1 to 0.3% of total
genome)
SNPs are pretty important. They deserve their own section!
1.6 LOOKING AT SNPs
A single nucleotide polymorphism is a DNA sequence variation occurring when
a single nucleotide (A, T, C, or G) in the genome (or other shared
sequence) differs between members of a species (here, humans).
SNPs are extremely useful because they are a significant element that
differentiates one human genome from another. In other words, if we want to
sort out why genetic differences result in differences between people, there is
(conceptually) no longer a need to sequence the entire genome. JUST LOOK
AT SNPs, since they represent a key part of what’s different between genomes.
Plus, there’s a really cool way to look at SNPs, actually millions of them (at
once).
First, a quick rehash of the structure of DNA.
“The double helix structure of DNA is not unlike square dancing in an overly
homophobic community”
The double stranded nature of the DNA molecule is actually very useful,
because it implies that if you know the sequence of one strand, you can
predict the sequence of the other strand (due to the complementarity of
different nucleotides: As pair with Ts, Gs with Cs).
This is powerful because you can look for interactions between complementary
sequences.
FOR EXAMPLE: (in class we’ll reenact by using the square dancing situation as
our metaphor for SNPs.) Let’s say we are interested in finding out whether a
person’s genome has 3 specific SNPs (i.e. specific single nucleotide
polymorphisms).
If we have a single strand of code that denotes the SNP#1 and its surrounding
sequences, we can attach that to some sort of solid support. Then we can do
the same for the other SNPs (SNP#2 to SNP#3).
This might look a little like this:
|•ATC (SNP = I like science)
|•TTG (SNP = I like art)
|•CAG (SNP = I like Nickelback)
Now if we take a sample of DNA, we (1) break it up into smaller pieces, (2) make
it single stranded (you can do this by heating DNA), and (3) label it
somehow, such as by attaching a fluorescent or glow-in-the-dark tag. This might
look a little like this:
GGTCGAATGAACTTTC (initial sample)
GGT CGA ATG AAC TTTC (broken up)
GGT^ CGA^ ATG^ AAC^ TTTC^ (labeled with ^)
Notice that one of the labeled sequences (AAC) is complementary to one of
the sequences attached to the solid support (TTG). This means that these two
will bind specifically, with the ^ label ending up in the exact
same spatial place as where the TTG strand was placed.
i.e. because we see a signal at the middle sequence (due to the binding), we
can infer that “Hey, this sample we’ve got must have sequences that indicate
the presence of SNP#2.”
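The toy experiment above can be sketched in a few lines of code. This is only
an illustration of the matching logic, using the three made-up probes and the
made-up sample fragments from the example; the names `PROBES`, `complement` and
`detect_snps` are hypothetical, and strand orientation is ignored (as it is in
the example).

```python
# Base pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Probes attached to the solid support, one spot per SNP.
PROBES = {"SNP#1": "ATC", "SNP#2": "TTG", "SNP#3": "CAG"}


def complement(seq):
    """Return the base-by-base complement of a sequence."""
    return "".join(COMPLEMENT[base] for base in seq)


def detect_snps(labeled_fragments, probes=PROBES):
    """Return the probe spots that would light up for this sample."""
    hits = []
    for name, probe in probes.items():
        # A spot lights up if some labeled fragment is complementary to it.
        if complement(probe) in labeled_fragments:
            hits.append(name)
    return hits


fragments = ["GGT", "CGA", "ATG", "AAC", "TTTC"]  # the broken-up sample
print(detect_snps(fragments))  # ['SNP#2']
```

Only the TTG spot finds a partner (AAC), so only SNP#2 gives a signal.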
Of course, in this example we’re only looking at 3 SNPs (and in reality these
probes are not just 3 nucleotides in length, they tend to be around 20
nucleotides long). But what about 100 SNPs, or maybe a thousand? Or let’s say
even a million IN ONE EXPERIMENT. If you can look for binding with a
million different sequences, then you’ve got yourself a pretty powerful system of
looking at SNPs. And remember, SNPs represent a significant portion of what
makes one person’s genome different from another!
Anyway, all to say that such experiments – looking at millions of SNPs in one go
is indeed possible. We call these things DNA CHIPS or MICROARRAYS.
As well, this is a powerful form of GENOTYPING.
(definition) GENOTYPING: is the process of determining information about the
genes (genotype) of an individual by examining the individual's DNA sequence
by using biological experiments (such as looking for SNP pairing)
The ability to genotype in this way is actually very powerful (use horse vs
unicorn example again). For instance, if a SNP is well defined – i.e. if you have
this SNP, then that means you have this trait – then you can use this for
predictions.
However, it also allows you to more quickly and efficiently infer linkage between
a trait and DNA sequence, sometimes simply by looking at differences or
similarities in DNA sequences and attempting to correlate them with trait
differences and/or similarities. (USE UNICORN vs HORSE example). i.e. if all
unicorns have this SNP, and all horses are missing the SNP, then maybe the SNP
has something to do with unicornism?
NOTE that this is looking for a correlation trend, and correlation DOES NOT
equal CAUSATION (throw in a Bieber metaphor here), but if the correlation is
very striking, and spans over a massive sample number, then it’s probably
going to be interesting enough for you to want to check it out further.
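To make the unicorn vs horse idea concrete, here is a minimal sketch that
compares how often a SNP shows up in each group. The sample data is entirely
made up and `carrier_frequency` is a hypothetical helper name; a real study
would use far more samples and a proper statistical test.

```python
def carrier_frequency(samples, group):
    """Fraction of animals in `group` that carry the SNP."""
    carriers = [has_snp for g, has_snp in samples if g == group]
    return sum(carriers) / len(carriers)


# (group, carries the SNP?) for each genotyped animal: invented data.
samples = [
    ("unicorn", True), ("unicorn", True), ("unicorn", True), ("unicorn", True),
    ("horse", False), ("horse", True), ("horse", False), ("horse", False),
]

unicorn_freq = carrier_frequency(samples, "unicorn")
horse_freq = carrier_frequency(samples, "horse")
print(unicorn_freq, horse_freq)    # 1.0 0.25
print(unicorn_freq - horse_freq)   # 0.75
```

A large difference like this is a correlation worth following up, not proof
that the SNP causes unicornism.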
ALSO: Bring up the HAPMAP project (public project to characterize all possible
human SNPs – up to about 10 million so far), as well as services like “23andme”
– this is the $199 genotyping service (which essentially looks at a variety of
medically relevant SNPs).
1.7 BETTER WAYS TO JUST
SEQUENCE THE HECK OUT OF A
SINGLE SAMPLE
Although looking at SNPs is a powerful way of quickly characterizing a large
number of elements in a person’s genome, it stands to reason that if you could
just sequence the whole genome of many individuals, perhaps even all
individuals, then you would have an even stronger data set from which to
correlate (and therefore identify) DNA sequences that result in certain traits.
I.e. instead of each sample being represented by 2 million SNPs (or more), each
sample is instead represented by 3 billion nucleotides! Obviously, you need
some serious computer power to be able to look at this effectively – but guess
what? Computers are already there.
Anyway, it’s happening (and fast). The first full genomes of individual people
(Craig Venter and James Watson) were sequenced in 2007. There’s also currently a
1000 Genomes Project in progress, which is set to have 1000 human genomes
sequenced by 2012, all public data in the hopes of being able to correlate
phenotypes/traits of these 1000 individuals to their genetic code.
And sequencing technology is getting better all the time. In fact, the rate at
which sequencing is improving is often compared to Moore’s Law (the law of
transistors, whereby the quantity of transistors that can be placed
inexpensively on an integrated circuit has doubled approximately every two
years).
From the two graphs, you can see how sequencing technology is improving
from both an output point of view, as well as a cost per Mbp (1,000,000 bp)
point of view. It’s actually improving FASTER than Moore’s Law!
HOW IS THIS POSSIBLE? Various new technologies allow for this. Go back
to the Sanger example. With that experiment, you can get about 1000 letters in a
single experiment, which will take about a full day. However, that is one tube,
one experiment. You can imagine, it’s pretty straightforward to (say) work with
10 tubes. This means that in a single day, you can get 10 x 1000 letters of data =
10,000 bp. Conceptually, this means I can increase my sequencing output if I
just have the opportunity to do more reactions in a single go.
However, this also raises costs. i.e. instead of chemicals for one tube, I need
chemicals for 10 tubes (or however many reactions I do, since they are all in
separate tubes).
MEANING that the challenge one has (in increasing the amount of code
determined, as well as keeping costs down) is: Can I mimic the data of millions
of reactions, but all in one place (one tube)?
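The arithmetic behind that challenge can be sketched as follows. The 1000
letters per reaction comes from the Sanger example above; the cost per reaction
is a made-up placeholder (an assumption, not a real price), and the function
names are hypothetical.

```python
BP_PER_REACTION = 1_000     # about 1000 letters per Sanger reaction (from the text)
COST_PER_REACTION = 5.0     # hypothetical reagent cost per tube, arbitrary units


def daily_output_bp(num_tubes):
    """Letters of sequence obtained in one day from separate tubes."""
    return num_tubes * BP_PER_REACTION


def daily_cost(num_tubes):
    """Reagent cost when every reaction needs its own tube of chemicals."""
    return num_tubes * COST_PER_REACTION


def cost_per_bp(num_tubes):
    return daily_cost(num_tubes) / daily_output_bp(num_tubes)


for tubes in (1, 10, 1_000_000):
    print(tubes, daily_output_bp(tubes), daily_cost(tubes), cost_per_bp(tubes))
```

With separate tubes, output and cost climb together, so the cost per letter
never drops. Running many reactions in one place is what breaks that link.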
THESE are what the new technologies are all about! (I’ll highlight Illumina’s
Solexa platform).
THE NET RESULT, is that research is now at a place where getting LOTS of
sequencing data for relatively cheap costs is doable.
In any event, currently biological research is being propelled by these
technologies, because ultimately, they allow you to get massive amounts of raw
data (DNA, or RNA code). Much like Google algorithms, the trick is to correlate
this raw data with phenotypic observation, using computer tools, and use
statistics to assess potential validity.
2.1 LEVEL UP: LOOKING AT GENE
EXPRESSION
Although it’s nice to be able to look at genomic sequences, there’s a general
feeling that this is a little too broad in that it doesn’t necessarily tell you which
one (or more) of the many thousands of genes are actually contributing towards
an interesting trait, phenotype or observation. In other words, sometimes it’s
handy to look at what genes are being directly turned on or off under your
observed circumstances.
Anyway, gene expression is basically fancy talk for examining what DNA codes
are being “transcribed” into RNA molecules. If this seems a little hazy to you,
it’s probably best to look at this picture again:
There are two points to make here.
First, is that if you look at this simple diagram, you might come to the conclusion
that at the end of the day, we really should be looking at the proteins (as stated
before, they are the “doers”). So why aren’t we?
The simple answer is that (quite frankly), proteins are a proverbial pain in the a**
to work with. You can look at proteins, and you can even look at them in the
same overwhelming scale as some of the genomic talk we’ve been doing (this is
what clever science linguists have decided to call “proteomics”), but the reality
is that it’s more difficult than looking at nucleotide code.
Second: RNA makes a good compromise. i.e. there’s at least an assumption
that if RNA is being produced or not being produced, then perhaps there is a
reason for this. Furthermore, maybe this reason has something to do with which
proteins get turned on or off. This second point is mostly valid. It’s mostly valid,
because intuitively it makes sense: just look at the diagram, it’s what the arrows
are implying. It’s only mostly valid because it turns out that being intuitive
doesn’t actually mean it’s true all the time. In other words, there’s lots of
research to suggest that just because an RNA molecule goes up in amounts, it
doesn’t necessarily mean that a lot of corresponding protein is produced.
Still, RNA is much easier to work with than proteins, and this is especially true for
techniques that try to look at things in large scale (i.e. what RNAs are there
when my organism is doing this or that, or sitting in this or that). Furthermore,
when you work with RNA, you still ultimately identify specific RNA molecules
through the act of sequencing. Therefore, all of the previously mentioned
sequencing advances are also relevant for RNA work.
At the end of the day, all of this simply means that there are advantages to
working with RNA, but also some disadvantages. In the flow of information that
tells us about the organism, it reminds me a little of the type of information
hierarchies we can glean from how we interact with the internet.
For example: Google and the like, probably have a pretty good sense of our
background details – they probably know the sorts of things we have and own,
and what we like and what we don’t like. They’ve probably accrued data
regarding our age, where we live, likely some physical characteristics. In many
ways, this “profile” is a little like our DNA code. It provides the overall blueprint of
what each person is.
However, it doesn’t necessarily tell us what we might be doing at exact points of
time, or how we might react under specific circumstances. For instance, what
are our consumer tendencies around Christmas? Here, it can keep track of the
things we like to click on the internet. This provides a good overview of our
behaviour on what we might do under specific circumstances. However,
“browsing” is not quite the same, as an actual online purchase (whereupon a
product is actually delivered).
In our little analogy, browsing is a little like our RNA expression. It gives you
insight into potential things of interest. The purchased item still represents our
protein: the point being that not every click visited is relevant.
2.1.1. SOME RNA SPECIFICS
Structurally, RNA molecules are very similar to DNA molecules: they are
composed of 4 nucleotides, which are A, U, C, and G. U or uracil stands in for
thymine. Their backbone is also a little different from DNA, which physically
means that RNA molecules are very “bendy.” Important for us is to also realize
that RNA molecules are notoriously unstable due to the presence of lots of things
(known as RNases) that want to chew them up.
With that in mind, there are a few RNA centric terms worth spending a bit of
time on:
1. Messenger RNA: Also known as mRNA. When we’re talking about DNA to
RNA to PROTEIN, we tend to focus on this type of RNA. We don’t have time to
go into other types of RNA, but rest assured, there are lots of other types that
also do interesting things.
2. Reverse Transcriptase: an enzyme capable of taking an RNA molecule,
reading it, and producing a complementary DNA molecule. This is special,
because it technically negates the central dogma by going backwards!
3. Poly A Tail: nearly all mRNA molecules from eukaryotic organisms (like humans,
and bees, and trees) have these. This has turned out to be really really really
useful, since it’s like a special identifier that you can use to “look” at all
mRNA transcripts.
4. ESTs, Expressed Sequence Tags: Ultimately, when you identify an RNA
molecule, you still need to sequence it. ESTs essentially make things more
efficient by saying that you don’t need to sequence the entire RNA molecule.
You can just get away with sequencing a small part of it (a tag), and that
part just has to be statistically big enough to be unique for that RNA sequence
(this turns out to be about 13 to 14 nucleotides).
5. Microarrays/DNA chips: Although mentioned earlier with evaluating SNPs,
microarrays are perhaps more commonly used for looking at gene expression
patterns.
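For the EST point above, a quick back-of-envelope calculation shows why a short
tag can be enough: with 4 possible letters at each position, there are 4^n
different tags of length n. The sketch below is only counting; it assumes
sequences behave like random strings (real transcripts do not, which is one
reason practical tags are longer than the bare minimum). The function names
are hypothetical, and the ~20,500 gene count is the figure quoted earlier in
these notes.

```python
def possible_tags(length):
    """Number of distinct sequences of a given length (4 letters per position)."""
    return 4 ** length


def min_tag_length(num_transcripts):
    """Shortest tag length with at least one distinct tag per transcript."""
    length = 1
    while possible_tags(length) < num_transcripts:
        length += 1
    return length


for n in (10, 13, 14):
    print(n, possible_tags(n))
# 10 1048576
# 13 67108864
# 14 268435456

print(min_tag_length(20_500))  # 8
```

So a 13 to 14 nucleotide tag has tens to hundreds of millions of possible
values, a comfortable margin over the number of genes.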
2.2 LEVELING UP AGAIN: SOME
WORDS ABOUT PROTEINS AND
METABOLITES.
Still, despite the overall trickiness of working with proteins, there are procedures
that do allow a large scale assessment of protein expression (generally referred
to as proteomics). Proteomic work tends to rely on two main techniques: a way
to separate* proteins from each other (such as 2D gel electrophoresis) and a
way to identify the proteins you’ve separated (mass spectrometry).
*Note that broadly defined, a procedure aimed at separating bits in a mixture is
called “chromatography.”
2D gel electrophoresis is simply a method that attempts to visually resolve
(fancy word for “see”) as many proteins as possible in a single data read out.
Essentially it relies on the fact that if you treat your sample well, you can see a
multitude of different proteins run on a gel, whereby their position is noted due
to the size of the protein, and also the net charge of the protein.
By doing this, you create a “profile” as denoted by spots on a 2 dimensional
plane. Each spot, in theory, represents a protein, and it is in the reproducibility
of such profiles where you can try to tease out interesting information.
For instance, you can cut out each spot and attempt to identify (see below for
the mass spec part). Better yet, you can take your sample and apply two
different scenarios to them, and then do your 2D gels to look for differences in
the spots (did some appear, did some disappear, did some get bigger/smaller,
did some seem to move to a slightly different place).
In any event, these “spots” hold important information, but to identify them
specifically as protein A or B, mass spec tends to be the usual method to use.
Mass spec is a little more complex, but the basic idea is that you take your
sample (your spot, which represents your specific protein). You pass it through
the instrument so that your sample becomes charged and fires through a magnetic
field. Because your molecule now has charge and also mass, how it bends in this
field can lead you to directly calculate the exact mass of your molecule (kind
of like how momentum causes you to make a wider turn when you are going around a
tight corner).
Anyway, the power of mass spec, is that you can get very precise mass
measurements. Precise to the point (we’re talking impressive numbers of
decimal places in precision) where you can begin to identify the individual
proteins. In class, I’ll also quickly go over how one might do this, by focusing on
identification tricks that use something like a trypsin cleavage strategy.
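As a rough sketch of the trypsin idea: trypsin cuts a protein after the amino
acids lysine (K) and arginine (R), generally not when the next residue is
proline (P), so every protein gives a predictable set of fragments whose masses
can be compared with what the mass spec measures. The code below only performs
the cutting step; the function name and the example sequences are made up for
illustration.

```python
def trypsin_digest(protein):
    """Split a protein (one-letter amino acid codes) at trypsin cut sites.

    Rule used here: cut after K or R, unless the next residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # whatever is left at the end
    return peptides


print(trypsin_digest("AAKRGGR"))          # ['AAK', 'R', 'GGR']
print(trypsin_digest("MKWVTFISLLRPAGK"))  # ['MK', 'WVTFISLLRPAGK']
```

Two different proteins will almost always give different fragment sets, which
is what makes the resulting list of masses work like a fingerprint.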
Then, there is also the idea of looking at metabolites. The principal idea here is
that you’re seeing the result of all the activities of all the DNA, RNA and proteins
working in tandem with the environment to produce all of the other chemicals in
your living organism.
Metabolomics is when you want to see “the unique chemical fingerprints that
specific cellular processes leave behind.” There are lots of examples of this:
probably an obvious one is to look at terpenoids in plants.
Here the techniques involved simply require being able to “separate” all of these
metabolites from each other, and then using other techniques to try and identify
each metabolite. For example, you may use Gas Chromatography and then the
aforementioned Mass Spectrometry.
In any event, the primary point (up to this part) is the realization that we have
pretty good science if you want to catalog things like all the DNA, or all the RNA
expressed, or all the protein expressed, or all the metabolites expressed in your
particular living system. In science jargon terms, we call studying these
catalogs: genomics, transcriptomics (or gene expression), proteomics, and
metabolomics.
2.3 CENTRAL DOGMA + ALL OF THIS
DATA + PHENOTYPES =
INTERESTING SCIENCE
O.K. so now what?
We have ways to generate a LOT of data, but how does this translate to
interesting biological questions?
Basically, it’s a hunt for interesting and statistically relevant correlations, and
then some focused biochemistry to ensure that the correlation also makes
physiological sense. This can happen at many different levels. For instance:
At the level of what organisms to examine.
You can compare the same clone of an organism, but in different situations.
Perhaps, different gene expression, protein, or metabolite profiles occur when
the organism is under a particular scenario. i.e. the genome is identical, but
different things are turned on and off. This, in turn, could lead to some
interesting insight.
You can compare two (or more) organisms of the same species, but differing
genomes (due to general variation). Perhaps one is better adapted to certain
scenarios (climate, disease), or exhibits certain desirable or undesirable
phenotypes/traits.
You can compare two (or more) similar but different species of organisms. Both
of which exhibit a trait that you want to study. Perhaps, one of the organisms is
very well characterized, whereas the other is novel – here, you can use the prior
knowledge of the well studied species to better characterize the novel.
At the level of what data to extract.
Depending on what you are interested in looking for, sometimes it’s best to
focus on the DNA data (genome), whereas other times, you may be more
interested in looking at other readouts (i.e. RNA, protein, metabolites). Often, in
large “-omics” projects, the data is multilayered in that you collect at a variety
of levels, with a mind to making sure they are in agreement with each other.
At the level of statistical significance and also within current computational tools
(bioinformatics).
This is to say that with this glut of information, it’s actually not that difficult to find
correlations that are irrelevant (they show up by chance alone). As well, if you
think about it, if you’re looking for correlations in a data set that may include
3 billion letters of code, you’re probably going to want to figure out a way to do
this without relying solely on your eyeballs and a pen and paper.
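To see why chance correlations are so easy to find, consider what happens when
you test a million SNPs against a trait and call anything with p < 0.05
“significant.” The sketch below is plain arithmetic using the common Bonferroni
correction as one example of how to tighten the cutoff; the function names are
hypothetical and the million-SNP figure echoes the microarray discussion
earlier.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """How many tests pass the cutoff by chance alone if nothing is real."""
    return num_tests * alpha


def bonferroni_threshold(num_tests, alpha=0.05):
    """Stricter per-test cutoff that keeps the overall error rate near alpha."""
    return alpha / num_tests


NUM_SNPS = 1_000_000

print(f"{expected_false_positives(NUM_SNPS):.0f} chance hits expected")
print(f"per-SNP cutoff needed: {bonferroni_threshold(NUM_SNPS):.1e}")
```

Roughly 50,000 SNPs would look interesting purely by luck, which is why
the statistics have to be part of the experiment design.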
In this respect, the field of bioinformatics is key: meaning that the
computational tools that help scientists organize, examine, characterize and
make sense of the data are also part of the process.
2.3.1 MAKING SENSE OF THE
CORRELATIONS
(Surprise!) Some more terms...
Genotyping is the process of determining differences in the genetic make-up
(genotype) of an individual by examining the individual's DNA sequence
using biological assays and comparing it to another individual's sequence or a
reference sequence. It reveals the alleles an individual has inherited from their
parents. Traditionally genotyping is the use of DNA sequences to define
biological populations by use of molecular tools. It does not usually involve
defining the genes of an individual. (From Wiki)
If you have something particularly interesting in your genotype, you might refer
to that as a “marker” that you may use to identify or mark things of interest. This
also tends to segue into the idea of “Marker Assisted Breeding.” In other words,
here is a placeholder for something of interest, so I’m going to use it as
something I can look for and follow when I’m trying to characterize or breed
organisms.
Mapping has a variety of different meanings but generally refers to the idea of
associating a genetic sequence with a physical place in the genome (i.e. it’s on
this chromosome, near the end), or relating a genetic sequence with a particular
phenotype (use of this kind of lingo is especially prevalent in disease research –
i.e. cystic fibrosis maps to -).
In the case of directly assigning a phenotype to a genetic sequence, a better
term is probably to call it a genetic association/linkage.
To make things even more complicated, you also have to remember that
usually, phenotype is not governed by a single gene. Indeed, the norm is that
the genetics involved is multifactorial. This makes things messier to decipher,
but due to the amount of data, and the power of bioinformatic tools, it is doable.
Here are a few terms that relate to the idea that traits are usually polygenic
(requiring the action of many genes).
Quantitative traits refer to phenotypes (characteristics) that vary in degree and
can be attributed to polygenic effects, i.e., product of two or more genes, and
their environment. Quantitative trait loci (QTLs) are stretches of DNA containing
or linked to the genes that underlie a quantitative trait.
There are a variety of ways to genotype/map/associate/look for QTLs: at simpler
levels, these might include things like PCR based techniques (RAPD, AFLP,
microsatellite sequences such as STRs). If you pan out a bit more, you could
genotype by examining differences in SNPs in different organisms. Pan out
even wider, and you can simply cross compare between raw genome
sequences.
I’ll briefly talk about STRs as they give you a general sense of how you can use
a PCR based technique to create genetic bookmarks in the genome that can
help you map/associate/find markers/etc. Mentions of microsatellite DNA are
also analogous to how STRs work.
A short tandem repeat (STR) in DNA occurs in a non-coding region when a
pattern of two or more nucleotides is repeated and the repeated sequences
are directly adjacent to each other. The pattern can range in length from 2 to
5 base pairs (bp) (for example (CATG)n in a genomic region) and is typically in
the non-coding intron region. (from wiki)
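A minimal sketch of how an STR acts as a bookmark: two individuals can carry
different numbers of copies of the same repeat, so a PCR product spanning the
repeat comes out at a different length for each. The sequences, the flanking
length and the helper names below are all invented for illustration.

```python
import re


def longest_repeat_run(sequence, motif):
    """Largest number of back-to-back copies of `motif` found in `sequence`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


def pcr_product_length(flanking_bp, repeat_count, motif):
    """Length of a PCR product covering the flanks plus the repeat region."""
    return flanking_bp + repeat_count * len(motif)


# Made-up stretches of genome from two individuals, same (CATG)n locus.
person_a = "GGTA" + "CATG" * 5 + "TTAC"
person_b = "GGTA" + "CATG" * 8 + "TTAC"

for name, seq in (("A", person_a), ("B", person_b)):
    repeats = longest_repeat_run(seq, "CATG")
    # 8 bp of flanking sequence (4 on each side) in this toy example
    print(name, repeats, pcr_product_length(8, repeats, "CATG"))
# A 5 28
# B 8 40
```

On a gel, the 28 bp and 40 bp products would run as bands at different
positions, which is the readout used to tell the two genotypes apart.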
RAPDs and AFLPs are procedures where the readout is conceptually very
different from STRs and microsatellite DNA, but the general premise of just
trying to correlate your readout (a band on a gel) to something of interest is
the same.
Let’s repeat that last bit: What you’re basically doing is trying to find a genetic
sequence that correlates strongly to what you’re trying to look out for. The
important and non-intuitive thing here is that the sequence itself might have
more to do with proximity (to your gene of interest), as opposed to being explicitly
involved in the thing you’re looking for.