Genome Analysis
Lecture 14
Introduction
A major application of bioinformatics is the analysis of full
genomes of organisms that have been sequenced
Traditional genetics has focused on understanding the role
of a particular gene or protein in a biological process
Availability of genome sequences provides the sequences of
all the genes of an organism
Important genes influencing metabolism, cellular
differentiation and development, and disease processes in
animals can be identified and relevant genes manipulated
Challenge is to identify those genes that are predicted to
have a particular biological function
Genomic Sequences
Availability of genome sequences facilitates the discovery and
utilization of sequence polymorphisms used to trace genes among
individuals in a population
Some types of genetic variation are best understood at the
genome-wide level.
Availability of genome sequences provides opportunity to
explore genetic variability both between organisms and within the
individual organism
Web resources - ch_10_t_1.html
Prokaryotic Genomes
Genomes of 31 prokaryotic organisms have been sequenced
Organisms were selected on the basis of three criteria
They had been subjected to a good deal of biological analysis
and thus were model prokaryotic organisms
They were important human pathogens, for example Mycobacterium
tuberculosis and Mycoplasma pneumoniae
They were of phylogenetic interest
Sequences were annotated as they were sequenced
Gene Structure Varies
Organism                           Haploid genome size (Mb)   Predicted number of genes
Arabidopsis thaliana (plant)       130                        ~25,000
Caenorhabditis elegans (worm)      100                        18,424
Drosophila melanogaster            180                        13,601
Escherichia coli                   4.7                        4,288
Homo sapiens (human)               3000                       45,000 – 120,000
Saccharomyces cerevisiae (yeast)   13.5                       6,241
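The table can be turned into a rough gene density (predicted genes per Mb), which makes the contrast between compact and sparse genomes easier to see. This is only a sketch: the numbers are copied from the table above, the human range is kept as separate low and high entries, and the names GENOMES, genes_per_mb and density_table are illustrative, not from the lecture.

```python
# Values copied from the table above: (haploid genome size in Mb, predicted genes).
# The human entry is a range in the table, so both ends are listed separately.
GENOMES = {
    "Arabidopsis thaliana": (130.0, 25_000),
    "Caenorhabditis elegans": (100.0, 18_424),
    "Drosophila melanogaster": (180.0, 13_601),
    "Escherichia coli": (4.7, 4_288),
    "Homo sapiens (low estimate)": (3000.0, 45_000),
    "Homo sapiens (high estimate)": (3000.0, 120_000),
    "Saccharomyces cerevisiae": (13.5, 6_241),
}


def genes_per_mb(size_mb, n_genes):
    """Return predicted genes per megabase of haploid genome."""
    return n_genes / size_mb


def density_table(genomes=GENOMES):
    """Map each organism name to its gene density in genes per Mb."""
    return {name: genes_per_mb(size, genes) for name, (size, genes) in genomes.items()}


# Print organisms from the densest genome to the sparsest.
for name, density in sorted(density_table().items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {density:8.1f} genes/Mb")
```

On these figures the prokaryote is by far the densest genome and human the sparsest, whichever end of the human gene estimate is used.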
Steps of Genome Analysis
1) Genome sequence assembled
2) Identify repetitive sequences – mask out
3) Gene prediction – train a model for each genome
4) Look for EST and cDNA sequences
5) Genome annotation
6) Microarray analysis
7) Metabolic pathways and regulation
8) Protein 2D gel electrophoresis
9) Functional genomics
10) Gene location/gene map
11) Self-comparison of proteome
12) Comparative genomics
13) Identify clusters of functionally related genes
14) Evolutionary modeling
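As a small illustration of step 2, "masking out" repeats can be as simple as replacing the repeat positions with N so that later steps such as gene prediction ignore them. The sketch below assumes the repeat coordinates are already known (in practice they come from a repeat-finding program, which is not shown); hard_mask and the toy sequence are hypothetical.

```python
def hard_mask(sequence, repeat_intervals):
    """Replace bases inside half-open [start, end) intervals with 'N'.

    Coordinates are 0-based. Intervals that run past either end of the
    sequence are clipped rather than treated as errors.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


# Toy example: mask positions 2..4 of a ten-base sequence.
print(hard_mask("ACGTACGTAC", [(2, 5)]))
```

Hard masking discards the repeat sequence itself; keeping the bases but lower-casing them (soft masking) is a common alternative when the repeats are still of interest.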
Comparative Genomics
Includes a comparison of gene number, gene content and gene
location in both prokaryotic and eukaryotic groups of organisms
Availability of genome makes possible a comparison of all the
proteins (proteome) encoded by one organism with those of
another
Genes in two organisms that are so similar that they must have
the same function and evolutionary history are orthologs
Two or more proteins in the same proteome that share a high
degree of similarity because they share the same set of domains
are likely to be paralogs
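One common heuristic for proposing ortholog pairs between two proteomes is the reciprocal best hit: protein a in organism A and protein b in organism B are paired only if each is the other's top-scoring match. The lecture does not prescribe this method, so treat the sketch below as an assumption. The similarity scores are made up, and best_hits and reciprocal_best_hits are illustrative names.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein.

    `scores` maps (query, subject) pairs to a similarity score,
    for example an alignment bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a and b are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical scores for three fly proteins against two worm proteins.
fly_vs_worm = {
    ("flyA", "wormX"): 250, ("flyA", "wormY"): 90,
    ("flyB", "wormY"): 180, ("flyC", "wormX"): 120,
}
worm_vs_fly = {
    ("wormX", "flyA"): 245, ("wormX", "flyC"): 118,
    ("wormY", "flyB"): 175, ("wormY", "flyA"): 88,
}
print(reciprocal_best_hits(fly_vs_worm, worm_vs_fly))
```

In this toy data flyC's best hit is wormX, but wormX's best hit is flyA, so flyC is left unpaired. The same kind of comparison run within a single proteome would instead point to candidate paralogs.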
Comparative Genomics of Eukaryotes
Drosophila has core proteome only twice the size of that of yeast
Complexity apparent in metazoans is not achieved by sheer number of genes
Despite the large differences between fly and worm in terms of development
and morphology, they use a core proteome of similar size
Comparative analysis of the predicted proteins encoded by these genomes
suggests that nearly 30% of fly genes have putative orthologs in the worm
There are some signs that the Drosophila proteome is more similar to mammalian
proteomes than are those of worm or yeast
Some of the human disease genes absent in Drosophila reflect clear
differences in physiology between the two organisms, for example the hemoglobins
Population of multidomain proteins is larger and more diverse in the fly than
in the worm
Genome sequencing effort of the fly has revealed a number of previously
unknown counterparts to human genes involved in cancer and neurological
disorders
Functional Classifications of Genes
Classify annotated genes by function
Early classification scheme for E. coli genes included
categories for enzymes, transport elements, regulators,
membranes, structural elements, protein factors, leader peptides
and carriers – based on sequence similarity
Another classification scheme is based on biochemical activity
Can also classify proteins that physically interact in a structure
or biochemical pathway
Physical Mapping Databases
Access to maps produced by multiple groups is available at
NCBI which attempts to integrate several genetic and physical
maps with DNA and protein-sequencing information –
http://www.ncbi.nlm.nih.gov/Entrez/
Genome Data Base (GDB) – is limited to human data,
contains no sequence data. http://gdbwww.gdb.org
Whitehead Institute is the primary source of two genome-wide
physical maps – an STS content map of more than 10,000
markers assigned to YACs and a radiation hybrid map of
12,000 markers. http://www.genome.wi.mit.edu
Structural Genomics
Full understanding of the biological role of the proteins identified in
genomes will require knowledge of their structure and function
Structural genomics of single proteins combined with protein structure
prediction may contribute substantially to efficient structural characterization
of large macromolecular assemblies
The structure of most proteins will be modeled, not determined by
experiment
Will need to determine enough protein structures that most of the remaining
sequences are related to at least one known structure at higher than 30%
sequence identity
Focus on proteins will be moving from structural genomics to functional
genomics
Human Genome Project Facts
Since it began in 1990, the HGP is estimated to have cost $3 billion
A rough draft of the human genome was completed in June 2000. The final draft
is expected sometime in 2003
For the HGP, researchers collected blood (female) or sperm (male) samples from
a large number of donors. Only a few samples were processed as DNA resources.
Neither the donors nor scientists know whose DNA is being sequenced
Genome from Celera was based on DNA samples from 5 donors who identified
themselves only by race and sex
97% of the DNA in the human genome consists of non-coding sequences
Human DNA is 98 percent identical to chimpanzee DNA
Average amount of difference between any two humans is 0.2 percent
Humans have approximately 30,000 genes, roundworm has 19,098 genes and
fruit fly has 13,602 genes, yeast has 6,034 genes
More HGP Facts
Human genome is the largest genome to be extensively sequenced
The genomic landscape shows marked variation in the distribution of a
number of features, including genes, transposable elements, GC content, CpG
islands and recombination rate
Hundreds of human genes appear likely to have resulted from horizontal
transfer from bacteria at some point in the vertebrate lineage
Although about half of the human genome derives from transposable
elements, there has been a marked decline in the overall activity of such
elements in the hominoid lineage
Segmental duplication is much more frequent in humans than in yeast, fly
or worm
The mutation rate is about twice as high in male as in female meiosis,
showing that most mutation occurs in males
More than 1.4 million single nucleotide polymorphisms (SNPs) have been
identified
Background to the HGP
HGP arose from two insights that emerged in the 1980s
The ability to take a global view of genomes could greatly accelerate
biomedical research
Creation of a global view would require a communal effort
Sequencing of bacterial viruses and human mitochondrion between 1977
and 1982 proved the feasibility of assembling small sequence fragments into
complete genomes
The program to create a human genetic map to make it possible to locate
disease genes based solely on their inheritance patterns
The programs to create physical maps of clones covering the yeast and
worm genomes to allow isolation of genes and regions based solely on their
chromosomal position
The development of random shotgun sequencing of complementary DNA
fragments for high-throughput gene discovery (ESTs)
Figure: Timeline of Large-Scale Genomic Analysis
Technology for Large-Scale Sequencing
Laboratory innovations included four-color fluorescence-based sequence
detection, improved fluorescent dyes, dye-labeled terminators, polymerases
specifically designed for sequencing, cycle sequencing and capillary gel
electrophoresis
Important advances were made in the development of software packages for the
analysis of sequence data
PHRED makes it possible to monitor raw data quality and also assists in
determining whether two similar sequences truly overlap
PHRAP systematically assembles the sequence data using the base-quality
scores from PHRED.
Another key innovation for scaling up sequencing was the development by
several centers of automated methods for sample preparation. This typically
involved creating new biochemical protocols suitable for automation,
followed by construction of appropriate robotic systems.
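The base-quality scores mentioned above follow the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The slide does not state the formula, so the sketch below is added background, and the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q = -10*log10(P)."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read with the given Phred scores."""
    return sum(phred_to_error_prob(q) for q in qualities)


# A Q20 base has a 1 in 100 chance of being wrong, a Q30 base 1 in 1000.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Because the scale is logarithmic, each increase of 10 in Q means a tenfold drop in the error probability, which lets an assembler weight high-quality bases far more heavily than low-quality ones when deciding whether two reads truly overlap.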
Figure: Human Sequence in the High Throughput Sequence Division of GenBank
Genome Browser
http://genome.ucsc.edu/
Figure: Classes of Interspersed Repeats
Gene Content of Human Genome
Genes (or at least their coding regions) comprise only a tiny fraction of
human DNA, but they represent the major biological function of the genome
and the main focus of interest by biologists
Human genes tend to have small exons (encoding an average of only 50
codons) separated by long introns (some exceeding 10 kb)
This creates a signal-to-noise problem, with the result that computer
programs for direct gene prediction have only limited accuracy
Computational prediction of human genes must rely largely on the
availability of cDNA sequences or on sequence conservation with genes and
proteins from other organisms
This approach is adequate for strongly conserved genes (such as histones or
ubiquitin), but may be less sensitive to rapidly evolving genes (including
many crucial to speciation, sex determination and fertilization)
Characteristics of Human Genes
Feature                 Median              Mean
Internal exon           122 bp              145 bp
Exon number             7                   8.8
Introns                 1,023 bp            3,365 bp
3’ UTR                  400 bp              770 bp
5’ UTR                  240 bp              300 bp
Coding sequence (CDS)   1,100 bp (367 aa)   1,340 bp (447 aa)
Genomic extent          14 kb               27 kb
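The figures above give a feel for the signal-to-noise problem in human gene prediction: the coding sequence is a small fraction of the genomic span of a typical gene. The arithmetic below is a rough sketch only. It divides one summary statistic by another, which is not the same as averaging the per-gene ratio, and the names GENE_STATS, coding_fraction and codons are illustrative.

```python
# Values copied from the table above (CDS length and genomic extent, in bp).
GENE_STATS = {
    "median": {"cds_bp": 1_100, "genomic_extent_bp": 14_000},
    "mean": {"cds_bp": 1_340, "genomic_extent_bp": 27_000},
}


def coding_fraction(cds_bp, genomic_extent_bp):
    """Fraction of a gene's genomic span that is coding sequence."""
    return cds_bp / genomic_extent_bp


def codons(cds_bp):
    """Number of codons in a coding sequence of the given length."""
    return cds_bp / 3


for label, stats in GENE_STATS.items():
    fraction = coding_fraction(stats["cds_bp"], stats["genomic_extent_bp"])
    print(f"{label}: {codons(stats['cds_bp']):.0f} codons, "
          f"{fraction:.1%} of the genomic extent is coding")
```

On either set of figures well under a tenth of a typical gene's span is coding, and the codon counts agree with the 367 aa and 447 aa entries in the table.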
Figure: Functional Categories in Eukaryotic Proteomes
Applications to Medicine
A key application of human genome research has been the ability to find
disease genes of unknown biochemical function by positional cloning
This method involves mapping the chromosomal region containing the gene
by linkage analysis in affected families and then scouring the region to find the
gene itself
The human genomic sequence in public databases allows rapid identification
in silico of candidate genes, followed by mutation screening of relevant
candidates, aided by information on gene structure
For a mendelian disorder, a gene search can now often be carried out in a
matter of months with only a modestly sized team
Drug Targets
A recent compendium lists 483 drug targets as accounting for
virtually all drugs on the market
Only a minority of human genes may be drug targets. It has
been predicted that the number will exceed several thousand, and
this prospect has led to a massive expansion of genomic research
in pharmaceutical research and development
Serotonin receptors – mood disorders and schizophrenia
Leukotriene pathway – asthma
Amyloid precursor protein - Alzheimer's disease.
Next Steps
Finishing the human sequence
Developing the Integrated Gene Index (IGI) and Integrated
Protein Index (IPI)
Large-scale identification of regulatory regions
Sequencing of additional large genomes
Completing the catalogue of human variation
From sequence to function
Future Technology Development
Functional genomics - aims to understand how genes are
regulated and what they do, largely through massively parallel
studies of gene expression in a variety of tissues
Proteomics – promises to make the identity of each protein
known and elucidate protein-protein interactions
Bioinformatics – enhance the ability of researchers to
manipulate, collect and analyze data more quickly and in new
ways