Sequencing a genome - Information Services and Technology

Sequencing a genome

Definition
• Determining the identity and order of the nucleotides in the genetic material (usually DNA, sometimes RNA) of an organism

Basic problem
• Genomes are large (typically millions or billions of base pairs)
• Current technology can reliably ‘read’ only a short stretch, typically hundreds of base pairs

Elements of a solution
• Automation: over the past decade, the amount of hand labor in the ‘reads’ has been steadily and dramatically reduced
• Assembly of the reads into sequences is an algorithmic and computational problem

A human drama
• There are competing methods of assembly
• The competing sequencing teams, public and private, used competing assembly methods

Assembly • Putting sequenced fragments of DNA into their correct chromosomal positions
BAC • Bacterial artificial chromosome: bacterial DNA spliced with a medium-sized fragment of a genome (100 to 300 kb) to be amplified in bacteria and sequenced
Contig • Contiguous sequence of DNA created by assembling overlapping sequenced fragments of a chromosome (whether natural or artificial, as in BACs)
Cosmid • DNA from a bacterial virus spliced with a small fragment of a genome (45 kb or less) to be amplified and sequenced
Directed sequencing • Successively sequencing DNA from adjacent stretches of chromosome
Draft sequence • Sequence with lower accuracy than a finished sequence; some segments are missing or in the wrong order or orientation
EST • Expressed sequence tag: a unique stretch of DNA within a coding region of a gene; useful for identifying full-length genes and as a landmark for mapping
Exon • Region of a gene’s DNA that encodes a portion of its protein; exons are interspersed with noncoding introns
Genome • The entire chromosomal genetic material of an organism
Intron • Region of a gene’s DNA that is not translated into a protein
Kilobase (kb) • Unit of DNA equal to 1000 bases
Locus • Chromosomal location of a gene or other piece of DNA
Megabase (Mb) • Unit of DNA equal to 1 million bases
PCR • Polymerase chain reaction: a technique for amplifying a piece of DNA quickly and cheaply
Physical map • A map of the locations of identifiable markers spaced along the chromosomes; a physical map may also be a set of overlapping clones
Plasmid • Loop of bacterial DNA that replicates independently of the chromosomes; artificial plasmids can be inserted into bacteria to amplify DNA for sequencing
Regulatory region • A segment of DNA that controls whether a gene will be expressed and to what degree
Repetitive DNA • Sequences of varying lengths that occur in multiple copies in the genome; they make up much of the genome
Restriction enzyme • An enzyme that cuts DNA at specific sequences of base pairs
RFLP • Restriction fragment length polymorphism: genetic variation in the length of DNA fragments produced by restriction enzymes; useful as markers on maps
Scaffold • A series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence
Shotgun sequencing • Breaking DNA into many small pieces, sequencing the pieces, and assembling the fragments
STS • Sequence tagged site: a unique stretch of DNA whose location is known; serves as a landmark for mapping and assembly
YAC • Yeast artificial chromosome: yeast DNA spliced with a large fragment of a genome (up to 1 Mb) to be amplified in yeast cells and sequenced

Readings
• Myers, “Whole Genome DNA Sequencing,” http://www.cs.arizona.edu/people/gene/PAPERS/whole.IEEE.pdf
• Venter et al., “The Sequence of the Human Genome,” Science, 16 Feb 2001, Vol. 291, No. 5507, 1304 (parts 1 & 2)
• Waterston, Lander, Sulston, “On the sequencing of the human genome,” PNAS, 19 March 2002, Vol. 99, No. 6, 3712-3716
• Myers et al., “On the sequencing and assembly of the human genome,” www.pnas.org/cgi/doi/10.1073/pnas.092136699

Hierarchical sequencing
• Create a high-level physical map, using ESTs and STSs
• Shred the genome into overlapping clones
• Multiply the clones in BACs
• ‘Shotgun’ each clone
• Read each ‘shotgunned’ fragment
• Assemble the fragments

Physical map

Whole genome sequencing (WGS)
• Make multiple copies of the target
• Randomly ‘shotgun’ each target, discarding very big and very small pieces
• Read each fragment
• Reassemble the ‘reads’

Hierarchical v. whole-genome

The fragment assembly problem
• Aim: infer the target from the reads
• Difficulties
  – Incomplete coverage. Leaves contigs separated by gaps of unknown size.
  – Sequencing errors. The rate increases with the length of the read but stays below some bound.
  – Unknown orientation. Don’t know whether to use a read or its Watson-Crick complement.

Scaling and computational complexity
• Increasing size of target G
  – 1990: 40 kb (one cosmid)
  – 1995: 1.8 Mb (H. influenzae)
  – 2001: 3,200 Mb (H. sapiens)

The repeat problem
• Repeats
  – Bigger G means more repeats
  – Complex organisms have more repetitive elements
  – Small repeats may appear multiple times in a read
  – Long repeats may be bigger than reads (no unique region)

Gaps
• Read length L_R hasn’t changed much
• The ratio L_R / G gets steadily smaller
• Gaps ~ R e^(-R L_R / G), with R the number of reads (Lander & Waterman)

How deep must coverage be?

Double-barreled shotgun sequencing
• Choose longer fragments (say, 2 × L_R)
• Read both ends
• Such fragments probably span gaps
• This gives an approximate size of the gap
• This links contigs into scaffolds

Genomic results

HGSC v Celera results

To do or not to do?
• “The idea is gathering momentum. I shiver at the thought.” – David Baltimore, 1986
• “If there is anything worth doing twice, it’s the human genome.” – David Haussler, 2000

Public or private?
• “This information is so important that it cannot be proprietary.” – C Thomas Caskey, 1987
• “If a company behaves in what scientists believe is a socially responsible manner, they can’t make a profit.” – Robert Cook-Deegan, 1987

HW for Feb 17
• Comment on these assertions (500-1000 words):
  – WLS: “Our analysis indicates that the Celera paper provides neither a meaningful test of the WGS approach nor an independent sequence of the human genome.”
  – Venter: “This conclusion is based on incorrect assumptions and flawed reasoning.”
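
Worked example for the Gaps slide
• The gap estimate attributed to Lander and Waterman can be explored numerically. The sketch below is only an illustration under stated assumptions: R is taken to be the number of reads, L_R the read length and G the target size, and the simplified form R e^(-R L_R / G) is used, which ignores the minimum overlap needed to join two reads. The function names are hypothetical, not from the readings.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of reads covering each base: R * L_R / G."""
    return num_reads * read_len / genome_len


def expected_gaps(num_reads: int, read_len: int, genome_len: int) -> float:
    """Simplified Lander-Waterman estimate: R * exp(-R * L_R / G).

    Assumption: reads land uniformly at random and any overlap,
    however small, is enough to join two reads.
    """
    return num_reads * math.exp(-coverage(num_reads, read_len, genome_len))


def reads_for_coverage(target_cov: float, read_len: int, genome_len: int) -> int:
    """Number of reads needed to reach a given average coverage."""
    return math.ceil(target_cov * genome_len / read_len)


if __name__ == "__main__":
    G = 1_800_000   # roughly the H. influenzae target size from the slides
    L_R = 500       # assumed read length ("hundreds of base pairs")
    for cov in (1, 2, 4, 6, 8, 10):
        R = reads_for_coverage(cov, L_R, G)
        print(f"coverage {cov:>2}x  reads {R:>6}  expected gaps {expected_gaps(R, L_R, G):9.1f}")
```

• Running the loop shows why deep coverage is needed: under these assumptions the expected number of gaps falls off exponentially with coverage, while the number of reads grows only linearly.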

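Worked example for the fragment assembly problem
• The ‘unknown orientation’ difficulty means a read may come from either strand, so an assembler has to consider each read and its Watson-Crick (reverse) complement. The sketch below is a toy illustration only: it assumes error-free reads, exact suffix/prefix matches, and no repeats, and it is not the method used by either sequencing team. All function names are hypothetical.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_merge(a: str, b: str, min_len: int = 3):
    """Merge read b onto read a, trying b in both orientations and on both sides.

    Returns the merged contig (in the orientation of a), or None if
    no orientation gives an overlap of at least min_len bases.
    """
    best_k, best_contig = 0, None
    for cand in (b, reverse_complement(b)):
        k = overlap_len(a, cand, min_len)      # cand extends a to the right
        if k > best_k:
            best_k, best_contig = k, a + cand[k:]
        k = overlap_len(cand, a, min_len)      # cand extends a to the left
        if k > best_k:
            best_k, best_contig = k, cand + a[k:]
    return best_contig


if __name__ == "__main__":
    target = "ATGCGTACGTTAGC"
    read1 = target[:9]                       # forward-strand read
    read2 = reverse_complement(target[5:])   # read from the opposite strand
    print(best_merge(read1, read2) == target)
```

• Real assemblers must also tolerate sequencing errors and resolve repeats, which is where the approaches compared in the readings differ.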