Griffiths • Wessler • Carroll • Doebley
Introduction to
Genetic Analysis
TENTH EDITION
CHAPTER 14
Genomes and Genomics
© 2012 W. H. Freeman and Company
CHAPTER OUTLINE
14.1 The genomics revolution
14.2 Obtaining the sequence of a genome
14.3 Bioinformatics: meaning from genomic sequence
14.4 The structure of the human genome
14.5 Comparative genomics
14.6 Functional genomics and reverse genetics
The human nuclear genome viewed as a set of labeled DNA
Neanderthal bone sample for DNA sequencing
Landmarks in genome sequencing
1995 1.8-Mb bacterium Haemophilus influenzae
(1st free-living organism sequenced)
1996 12-Mb Saccharomyces cerevisiae
(1st eukaryotic organism)
1998 100-Mb C. elegans
(1st multicellular organism)
2000 180-Mb Drosophila melanogaster
2001 3000-Mb human genome
2005 chimpanzee
12 Drosophila species genomes sequenced
List of sequenced eukaryotic genomes
http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes
The logic of obtaining a genome sequence
genomic DNA library
Sequencing
average read ~300 nt
error rate
10X coverage
human genome ~3 x 10^9 bp
~10^8 reads
assembly of sequences
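The read count on this slide follows from the other three numbers: reads needed = coverage x genome size / read length. The short Python sketch below reproduces that arithmetic. The function name and parameter names are illustrative, not from the slides, and the calculation ignores sequencing errors and uneven coverage.

```python
def reads_needed(genome_size_bp: float, coverage: float, read_length_nt: float) -> float:
    """Estimate how many reads give the requested average coverage.

    coverage = (number of reads * read length) / genome size,
    so number of reads = coverage * genome size / read length.
    """
    return coverage * genome_size_bp / read_length_nt


# Values from the slide: ~3 x 10^9 bp genome, 10X coverage, ~300-nt reads.
n_reads = reads_needed(genome_size_bp=3e9, coverage=10, read_length_nt=300)
print(f"{n_reads:.1e}")  # prints 1.0e+08
```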
Next-generation whole-genome shotgun sequencing (WGS)
Cell-free (no cloning)
End reads from multiple inserts may be overlapped to
produce a contig
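As a rough illustration of how overlapping reads can be joined into a contig, here is a minimal greedy sketch in Python. It assumes error-free reads from one strand and exact suffix/prefix matches. Real assemblers handle sequencing errors, both strands, and repeats, so this is a toy model rather than the method used by any particular assembler. The function names and the example reads are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_reads(reads, min_len: int = 3):
    """Greedily merge the pair of sequences with the longest overlap
    until no pair overlaps by at least `min_len` bases."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical reads that tile a 12-bp region.
reads = ["ATGCGT", "CGTACC", "ACCGGA"]
print(merge_reads(reads))  # prints ['ATGCGTACCGGA']
```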
Primers:
Sequences in vector
Present in all clones
Paired-end reads
Pyrosequencing reactions take place on beads in tiny wells
picoliter
centi- (10^-2)
milli- (10^-3)
micro- (10^-6)
nano- (10^-9)
pico- (10^-12)
Emulsion PCR
DNA fragments + adapters
Each bead coated with millions of oligos to capture adapter
Emulsify => Each droplet contains 1 bead and 1 bound DNA fragment
PCR reactor => PCR products (10^6 copies) captured by the bead
(parallel >10^6 PCR reactions)
454 picotiter plate (PTP)
Pyrosequencing is based on detecting synthesis reactions
short reads: <250 bp
massively parallel
high throughput
>10^6 reads simultaneously
Assembly of genomic sequences
Problem: repetitive sequences
Paired-end reads may be used to join two sequence contigs
Too long to sequence entirely
Find paired-end reads that span the gaps between
two sequence contigs
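One way to picture the gap-spanning step: if the two ends of a read pair map to different contigs, that pair is evidence that the contigs are neighbors. The Python sketch below counts such links. It assumes the reads have already been mapped to contigs, and the input format, the contig names, and the `min_support` threshold are all hypothetical choices made for this example.

```python
from collections import Counter


def scaffold_links(pair_hits, min_support: int = 2):
    """Count read pairs whose two ends land on different contigs.

    `pair_hits` is a list of (contig_of_left_end, contig_of_right_end).
    Returns the contig pairs supported by at least `min_support` read pairs.
    """
    links = Counter()
    for left, right in pair_hits:
        if left != right:  # both ends on the same contig: no gap spanned
            links[tuple(sorted((left, right)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_support}


# Hypothetical mapping results for four read pairs.
hits = [("c1", "c1"), ("c1", "c2"), ("c2", "c1"), ("c2", "c3")]
print(scaffold_links(hits))  # prints {('c1', 'c2'): 2}
```

Requiring more than one supporting pair is a common-sense guard against chimeric or mismapped pairs. The threshold of 2 is arbitrary here.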
Strategy for whole-genome shotgun sequencing assembly
Paired-end reads can be produced by circularization
blunt-ends
+ adapters
PLoS Biology 5:e254 (2007)
1000 USD / human genome
Presented here is a genome sequence of an individual human. It was produced
from ~32 million random DNA fragments, sequenced by Sanger dideoxy
technology and assembled into 4,528 scaffolds, comprising 2,810 million bases
(Mb) of contiguous sequence with approximately 7.5-fold coverage for any given
region. We developed a modified version of the Celera assembler to facilitate the
identification and comparison of alternate alleles within this individual diploid
genome. Comparison of this genome and the National Center for Biotechnology
Information human reference assembly revealed more than 4.1 million DNA
variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were
novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823
block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events
(indels)(1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as
well as numerous segmental duplications and copy number variation regions.
Non-SNP DNA variation accounts for 22% of all events identified in the donor,
however they involve 74% of all variant bases. This suggests an important role
for non-SNP genetic alterations in defining the diploid genome structure.
Moreover, 44% of genes were heterozygous for one or more variants. Using a
novel haplotype assembly strategy, we were able to span 1.5 Gb of genome
sequence in segments >200 kb, providing further precision to the diploid nature
of the genome. These data depict a definitive molecular portrait of a diploid
human genome that provides a starting point for future genome comparisons
and enables an era of individualized genomic information.
Do not need to culture the organisms
Massive amount of data
Bioinformatics
The information content of the genome includes binding sites
Genome annotation
Prediction of protein-coding genes
open reading frames (ORFs)
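A simple way to see what an ORF search does: scan each reading frame for a start codon (ATG) followed in the same frame by a stop codon (TAA, TAG, or TGA). The Python sketch below does this for the three forward frames only. It is a simplified illustration: real gene prediction also scans the reverse strand, deals with introns, and combines other evidence. The function name, the `min_codons` cutoff, and the test sequence are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    `start` is the index of the A in ATG, `end` is one past the stop codon,
    and `min_codons` counts codons from ATG up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs


# Made-up sequence: ATG AAA CCC GGG TAG sits in frame 2.
print(find_orfs("CCATGAAACCCGGGTAGTT"))  # prints [(2, 17, 2)]
```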
cDNAs and ESTs (expressed sequence tags) reveal exons or gene ends in genome
searches
Genome searches hunt for various binding sites
Binding sites are not precise
Binding site for the Pax6 transcription factor
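Because binding sites are not precise, a search usually has to tolerate variation at some positions. One common way to write that down is a consensus in IUPAC degenerate codes (for example Y = C or T, N = any base). The Python sketch below scans a sequence for matches to such a consensus. The consensus `TGAYNT` and the test sequence are made up for illustration and are not the real Pax6 motif. Scoring with a position weight matrix is a more flexible alternative and is not shown here.

```python
# Subset of the IUPAC nucleotide codes: each code maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC", "N": "ACGT",
}


def find_sites(seq: str, consensus: str):
    """Return start positions where `seq` matches the degenerate `consensus`."""
    seq = seq.upper()
    width = len(consensus)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if all(base in IUPAC[code] for base, code in zip(window, consensus)):
            hits.append(i)
    return hits


# Hypothetical consensus and sequence; two different 6-mers both match.
print(find_sites("AATGACGTTTGATATCC", "TGAYNT"))  # prints [2, 9]
```

The two hits, TGACGT and TGATAT, differ at two positions yet both satisfy the consensus, which is the sense in which a binding site is "not precise".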
Many forms of evidence are integrated to make gene
predictions
The sequence map of human chromosome 20
<3% of human genome is protein-coding
Pseudogenes
Processed pseudogenes (no introns, with polyA)
Protein families
Evolution: gene duplication => diverge
Cytogenetic map of human chromosome 7
Association with
human diseases
=> Identify disease genes
Phylogeny of living mammals and other amniotes
Comparative genomics
Homologous genes
The mouse and human genomes have large syntenic blocks of
genes in common
Evolutionary history
Primate genome comparison
1% difference
Testing the role of a conserved element in gene regulation
human ISL1 element
+ reporter
Mouse ISL protein
Two E. coli strains contain islands of genes specific to each
strain
Microarrays can detect differences in gene expression
DNA microarrays reveal profiles of gene expression
Studying protein interactions with the use of the yeast two-hybrid system
Protein interaction network
Steps in a chromatin immunoprecipitation assay (ChIP)
Disrupting gene function with the use of targeted
mutagenesis
Disrupting gene function with the use of RNA interference
Inserting transgenes into a nonmodel organism
Examples of nonmodel insects expressing a transgene