Terry Brown
Genomes
Third Edition
Chapter 5:
Understanding a Genome Sequence
Copyright © Garland Science 2007
Understanding a Genome Sequence
• Obtaining a genome sequence is not the ultimate goal in itself.
• The major challenge lies in understanding which portions of the genome are expressed.
• Searching for genes is not an easy job; it needs both:
– bioinformatics
– and molecular biology techniques.
• To find a gene in genomic DNA means to establish:
– that the gene is expressed in that organism
– if the organism is eukaryotic, which exons and introns the gene contains
– which regulatory sequences control the gene
– where the gene starts and ends in the genomic sequence
Locating the Genes in a Genome Sequence
• Once the sequence of a genome, or of part of a genome, is available, genes can be found:
– by analyzing the sequence with computers (bioinformatics)
– or by locating the genes with experimental methods
• Genes can be located along a piece of DNA by inspecting the sequence for special features associated with genes, such as start and stop codons.
• Sequence inspection is a powerful tool and is usually the first method applied to a new sequence.
The Coding Regions of Genes are Open Reading Frames
• Genes that code for proteins contain open reading frames (ORFs).
• An ORF is a series of codons that specifies the amino acid sequence of the protein.
• The ORF:
– begins with a start codon (usually ATG)
– ends with a termination codon (TAA, TAG or TGA)
• The search for ORFs has to be done on both DNA strands.
• That means searching six reading frames, which is an easy task for a computer.
• The success of ORF scanning rests on the following reasoning:
– If the DNA sequence is random and the GC content is about 50%, each of the three termination codons should appear by chance once every 64 bp (4^3 = 64).
– If the GC content is higher than that, the AT-rich termination codons are rarer, but one should still appear every 100-200 bp.
– So a chance ORF in random DNA should rarely run longer than about 50 codons.
• The average gene length is:
– 317 codons for E. coli
– 483 codons for Saccharomyces cerevisiae
– about 450 codons for humans
• So programs can be set to select any ORF longer than, say, 100 codons.
• In practice this is very effective for the E. coli genome.
• It is much less effective for the genomes of higher eukaryotes because of the presence of introns.
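The six-frame scan described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names are hypothetical, ATG is the only start codon considered, the sequence contains only A, C, G and T, and no allowance is made for introns, so it suits bacterial-style sequences at best.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a production gene finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ORFs in one frame; end is exclusive and includes the stop codon."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            # The length counts the start codon but not the stop codon.
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames and return (strand, frame, start, end) tuples.

    Coordinates for the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                hits.append((strand, frame, start, end))
    return hits
```

The default cut-off of 100 codons follows the figure suggested in the text; a lower value can be passed for short test sequences.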
Simple ORF scans are less effective with DNA of higher eukaryotes
• ORF scanning is quite effective for bacterial genomes but less effective for the genomes of higher eukaryotes because of:
– introns
– long intergenic regions, in which chance ORFs can be mistaken for genes
• Intergenic DNA makes up only 11% of the E. coli genome
• but 62% of the human genome.
• Introns are the main challenge for bioinformaticians.
• The problem can be partially solved with three strategies:
– codon bias
– exon-intron boundaries
– upstream regulatory sequences
Figure 5.3 Genomes 3 (© Garland Science 2007)
Codon Bias
• This is based on the observation that not all codons are used equally frequently in the genes of a particular organism.
• For example, leucine is specified by six codons (TTA, TTG, CTT, CTC, CTA and CTG).
• In humans it is most frequently coded by CTG and only rarely by TTA or CTA.
• Similarly, human genes use GTG four times more frequently than GTA for valine.
• This bias can be written into a program to aid the search for exons in the genome.
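As a rough illustration of how codon bias could be measured, the sketch below counts the codons of a coding sequence and reports the relative usage of the synonymous codons for leucine and valine. The function names and the two-entry table are assumptions made for this example; a real program would cover all amino acids and compare the result with a reference usage table for the organism.

```python
from collections import Counter

# Synonymous codons for the two amino acids mentioned in the text
# (standard genetic code).
SYNONYMOUS = {
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Val": ["GTT", "GTC", "GTA", "GTG"],
}


def codon_counts(cds):
    """Count in-frame codons of a coding sequence (frame 0, incomplete tail ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_usage(counts, amino_acid):
    """Return the fraction of each synonymous codon among all codons for one amino acid."""
    codons = SYNONYMOUS[amino_acid]
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

An ORF whose leucine codons are dominated by CTG would, on this measure, look more like a human exon than one that uses TTA and CTA heavily.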
Exon-Intron Boundaries
• To address the problem of locating exons and introns, the distinctive features of their boundaries can be searched for.
• These features are not very distinctive, however:
– The sequence of the upstream exon-intron boundary is usually described as 5’-AG↓GTAAGT-3’.
– This is only a consensus sequence, so many variants are found.
– The downstream intron-exon boundary is even less well defined: 5’-PyPyPyPyPyPyNCAG↓-3’
– where Py means either pyrimidine (T or C) and N means any nucleotide (A, T, C or G).
• At present it is very difficult to locate most intron-exon boundaries in this way.
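A consensus search of this kind can be sketched with regular expressions. The patterns below assume the donor consensus AG↓GTAAGT and an acceptor consensus of six pyrimidines, any nucleotide, then CAG↓; the function names are hypothetical. Because only exact matches are reported, most real splice sites, which vary from the consensus, would be missed, which is the weakness the slide points out.

```python
import re

# Exact-match consensus patterns; real splice sites vary widely around these.
DONOR = re.compile(r"AG(?=GTAAGT)")                # exon ...AG | GTAAGT... intron
ACCEPTOR = re.compile(r"(?=([CT]{6}[ACGT]CAG))")   # intron ...(Py)6 N CAG | exon


def donor_sites(seq):
    """Return the index of the first intron base for each exact donor match."""
    return [m.end() for m in DONOR.finditer(seq.upper())]


def acceptor_sites(seq):
    """Return the index of the first exon base after each exact acceptor match."""
    # The lookahead lets overlapping pyrimidine tracts each be reported.
    return [m.start() + len(m.group(1)) for m in ACCEPTOR.finditer(seq.upper())]
```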
Upstream regulatory sequences
• Regulatory sequences also have distinctive features that enable them to bind particular transcription factors.
• These features can be searched for to find the genes downstream of them.
• The problem with this method is again that the regulatory regions of most genes are variable, which makes it difficult to rely on this technique.
Figure 5.5 Genomes 3 (© Garland Science 2007)
CpG islands
• CpG islands are found upstream of many human genes.
• They lie in a region of about 1 kb upstream of the gene.
• In that region the GC content is greater than the genome average.
• Some 40-50% of human genes are associated with a CpG island of this type.
• So if an ORF is found downstream of such a region, there is a good chance that it is a genuine gene.
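One simple way to look for such GC-rich regions is a sliding-window scan, sketched below. The 1 kb window follows the text, but the step size and the margin above the sequence average are arbitrary assumptions for this example, and the function names are hypothetical. Dedicated CpG island programs use additional criteria that are not modelled here.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_rich_windows(seq, window=1000, step=100, margin=0.10):
    """Return (start, gc) for windows whose GC fraction exceeds the sequence average by `margin`."""
    average = gc_content(seq)
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        gc = gc_content(seq[start:start + window])
        if gc > average + margin:
            hits.append((start, gc))
    return hits
```

An ORF lying just downstream of a reported window would then be a stronger gene candidate than one found by ORF scanning alone.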
Locating Genes for Functional RNA
• ORF scanning is appropriate for finding protein-coding genes.
• But genes for functional RNAs, e.g. rRNA and tRNA, are not composed of codons.
• Functional RNAs do have distinctive features of their own.
• They can fold into complex secondary structures, e.g. the tRNA cloverleaf.
• This folding is due to intramolecular base pairing.
• These folding patterns provide a wealth of information that can be used to search for their genes.
• This type of search has proven quite effective for locating functional RNA genes.
• Some RNAs, such as siRNA and miRNA, do not form complex secondary structures.
• Most of them contain simpler stem-loop structures (hairpins).
• These form by Watson-Crick base pairing.
• Programs can apply thermodynamic rules to check:
– the stability of such secondary structures
– the size of the loop
– the size of the stem
• A search can also be made for the regulatory regions of these functional RNA genes.
• Success increases if the regions left over after the protein-coding genes have been located are searched rigorously for functional RNA genes.
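The stem-loop search can be illustrated with a much simpler stand-in for the thermodynamic calculation: the sketch below reports only perfect hairpins, in which a stem of fixed length is followed by a short loop and then by the exact reverse complement of the stem. The function names, the stem length and the loop size limits are assumptions for this example; real programs score stability, tolerate mismatches and allow G-U pairs.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an upper-case RNA string (Watson-Crick pairs only)."""
    return rna.translate(COMPLEMENT)[::-1]


def find_hairpins(rna, stem=6, min_loop=3, max_loop=8):
    """Return (start, stem, loop) for perfect stem-loops in the sequence."""
    rna = rna.upper().replace("T", "U")
    hits = []
    for start in range(len(rna) - 2 * stem - min_loop + 1):
        left = rna[start:start + stem]
        for loop in range(min_loop, max_loop + 1):
            right_start = start + stem + loop
            right = rna[right_start:right_start + stem]
            # A hit needs a full-length right arm that pairs with the left arm.
            if len(right) == stem and right == reverse_complement(left):
                hits.append((start, stem, loop))
    return hits
```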
Homology searches and comparative genomics give an extra dimension to sequence inspection
• Most of the software designed so far can find about 95% of the ORFs in eukaryotic sequences.
• But the programs still make frequent mistakes in positioning exon-intron boundaries
• and also report spurious ORFs, which is a major problem.
• These problems can be overcome to a certain degree by homology searching.
• The DNA under study is compared with a database of known genes to look for a match.
• A match indicates that the gene under study is evolutionarily related to the known gene.
• So a homology search can help to assign a function to an entirely new sequence.
Figure 5.6b Genomes 3 (© Garland Science 2007)
Comparative Genomics
• A more precise form of homology searching is possible when genome sequences are available for two or more related species.
• Related species have genomes that share similarities inherited from their common ancestor
• and that have since diverged independently of one another.
• Selection pressure keeps coding sequences more conserved than intergenic regions.
• Therefore homologous genes can be identified easily by comparing these genomes.
• Any ORF that has no clear homolog in the related genome can be discounted as almost certainly a chance sequence rather than a real gene.
• The comparative genomics approach has been very successful for the Saccharomyces cerevisiae genome,
• as complete or partial sequences are available for about 16 related species.
• The comparative analysis has authenticated S. cerevisiae ORFs.
• About 500 putative ORFs have been removed from the S. cerevisiae gene catalogue by this analysis.
• The analysis can be made much more powerful by taking synteny into account.
• Synteny is the conserved gene order displayed by the genomes of related species.
Automatic annotation of genome sequences
• One great advantage of bioinformatics techniques for gene identification is that the analytical programs can be combined into one integrated system.
• So different approaches, such as:
– ORF scanning
– codon bias
– regulatory sequence analysis
– intron-exon boundary searches
– homology to known genes
– functional RNA analysis
– comparative genomics
• can be integrated to help annotate genomic sequences automatically.
Experimental Techniques for gene location
• Gene finding with bioinformatics tools is a good first approach.
• Where it is not sufficient, genes can be found by checking for their expression as mRNA.
• Hybridization techniques make it possible to find:
– transcription start and stop sites
– intron-exon boundaries
– termination sites
• Northern hybridization is based on transferring an RNA preparation, separated by agarose gel electrophoresis, onto a nitrocellulose membrane, which is then probed with the DNA under study.
• Northern hybridization allows the number of genes present in the DNA fragment to be determined.
• The technique has some limitations:
– multiple bands can be detected because of alternative splicing
– the RNA may not be representative of the whole organism, so a gene that is not expressed in the sampled tissue or at the sampled time will be missed
Zoo-blotting: Locating Genes
• A hybridization technique similar in principle to northern blotting.
• The DNA fragment is used to probe DNA preparations from a number of related species.
• Because genes are more conserved than intergenic DNA, a fragment that hybridizes to the DNA of other species probably contains a gene.
• The rationale is the same as for a homology search.
cDNA sequencing enables genes to be mapped within DNA fragments
• The mRNA expressed at a certain stage and under certain conditions in a cell can be converted into DNA by reverse transcription; the product is known as cDNA.
– These cDNAs are often incomplete because the enzyme sometimes leaves the template before the copy is finished.
• Complete ends can be obtained by Rapid Amplification of cDNA Ends (RACE).
• RACE allows the start and end points of a gene to be identified with precision.
• The full-length cDNA sequence also allows the intron-exon boundaries to be delineated.
Heteroduplex Analysis: Position of Exons and Introns
• If the DNA fragment is cloned into an M13 vector, it can be produced as single-stranded DNA.
• This DNA can be hybridized with an mRNA preparation.
• The regions where a DNA-RNA hybrid forms are exonic regions,
• while the introns present in the DNA do not hybridize, because mRNA does not contain introns.
• The single-stranded regions can be digested with S1 nuclease, which degrades any single-stranded DNA or RNA.
• This leaves only the heteroduplex, which can be sized on an agarose gel to determine the position of the boundary.
Determining the Function of Individual Genes
• After the coding regions/genes in a genome have been located, the next target is to assign functions to the genes.
• Function can be determined in a similar way to gene location, i.e. with:
– bioinformatics tools (homology searching)
– experimental techniques (biochemical analysis)
• Homology reflects evolutionary relationships.
• Homologous genes are of two types:
– Orthologous: homologous genes in different organisms
– Paralogous: homologous genes in the same organism
Homology analysis can provide information on the function of an entire gene or of segments within it
• For finding the function of a gene, a homology search with the DNA sequence is less informative than one with the protein sequence.
• DNA has four nucleotides whereas proteins have twenty amino acids.
• Therefore sequences that are not homologous appear more different when their amino acid sequences are compared than when their nucleotide sequences are compared.
• In a homology search each alignment is given a score, which can be calculated in two ways:
– By counting the number of positions at which the same amino acid is present in both sequences and converting this into a percentage: this is the identity.
– By also taking into account the relatedness of the non-identical amino acids, using a substitution matrix: this gives the degree of similarity between the two sequences.
Nucleotide Identity 76% (Homologous)
Amino acid Identity 28% (Not Homologous)
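The difference between the identity and similarity scores can be shown with a small sketch for two sequences that have already been aligned. The grouping of related amino acids below is a crude, hypothetical stand-in for a real substitution matrix such as BLOSUM62 or PAM250, gaps are not handled, and the function names are assumptions for this example.

```python
# Hypothetical groups of chemically related amino acids, for illustration only;
# real searches score substitutions with a matrix such as BLOSUM62 or PAM250.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def _check_aligned(a, b):
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")


def percent_identity(a, b):
    """Percentage of aligned positions that carry the same residue."""
    _check_aligned(a, b)
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def percent_similarity(a, b):
    """Percentage of positions that are identical or fall in the same group."""
    _check_aligned(a, b)

    def related(x, y):
        return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)

    return 100.0 * sum(related(x, y) for x, y in zip(a, b)) / len(a)
```

For the aligned peptides MKTL and MRTV, two of four positions are identical, but the K/R and L/V substitutions are between related residues, so the similarity score is higher than the identity score.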
• The search is usually made with standard BLAST (Basic Local Alignment Search Tool).
• BLAST can find sequences that are 30%-40% similar.
• PSI-BLAST (Position-Specific Iterated BLAST) can find more distantly related sequences that a standard BLAST search misses.
• Homology searching is of immense importance for understanding the function of a gene.
• The analysis has some limitations that should be kept in mind:
– some proteins in the databases have been assigned incorrect functions
– unrelated sequences may be similar in some parts, for example where they share a domain
– homologous genes may perform very different biological functions
Using Homology Searching to Assign the
Function to Human Disease Genes
Assigning Gene Function by
Experimental Analysis
• The homology search is not a panacea that can identify the function of
all new genes.
• Therefore, experimental methods are needed to complement and
extend the results of homology studies.
• Reverse genetics is an approach to discovering the function of a gene
by analyzing the phenotypic effects of specific gene sequences
obtained by DNA sequencing. This investigative process proceeds in
the opposite direction of so-called forward genetic screens of classical
genetics.
• Forward genetics seeks the genetic basis of a phenotype or trait, whereas reverse genetics seeks the phenotypes that arise from particular genes.
Functional Analysis by Gene
Inactivation
• Functional analysis of a gene can be performed by:
– inactivating the gene
– overexpressing the gene
– mutational analysis of the gene
• Gene inactivation is a powerful tool for elucidating the function of a gene and falls under the reverse genetics approach.
Individual Genes can be Inactivated by
Homologous Recombination
• A gene can be inactivated in a sequence-specific manner by homologous recombination.
• For S. cerevisiae this strategy has revealed the function of many genes.
• A vector is constructed that carries a suitable, expressible antibiotic resistance gene, such as the kanamycin resistance gene (kanR).
• The gene under study is replaced by the kanamycin resistance gene through homologous recombination.
• The resulting cells can be selected on kanamycin plates and the effect of the inactivation can be studied.
Homologous Recombination in mammalian systems
• In S. cerevisiae it is easy to study the effect of gene replacement by homologous recombination because it is a unicellular organism.
• For multicellular organisms such as humans and mice it is much more difficult, because the gene under study must be replaced in every cell of the organism before its function in any cell type can be elucidated.
• The mouse, which is a model organism for humans because of its genetic similarity to them, can be engineered so that all of its cells contain the inactive gene.
• Embryonic stem cells are engineered in a similar way as for S. cerevisiae and then mixed with an early embryo, so that a chimera is generated.
• Chimeras are then allowed to mate, so that two gametes carrying the inactive gene can combine and give rise to a homozygous animal in which both copies of the gene are inactive.
• Mice of this type are called knockout mice.
Gene Inactivation without
Homologous Recombination
• Transposon tagging is another way to inactivate a gene, without homologous recombination.
• Genetically engineered transposons can be made that change their position in response to certain external stimuli.
• This strategy has helped a great deal in understanding the function of many genes of Drosophila melanogaster.
• The weakness of this approach is that transposition is not predictable, so a large number of recombinants must be analyzed to find one in which the target gene is inactivated.
RNA interference (RNAi)
• RNAi is a very powerful technique that can be used to silence many genes without changing the genetic makeup of the organism.
• It targets mRNA in a sequence-specific manner.
• All 19,000 predicted genes of Caenorhabditis elegans have been analyzed for their function in this way.
• Similarly, it has the potential to be used in higher animals and has already been used with human cell lines and in mice.
• To work at the level of the whole organism, the RNAi trigger must be potent and expressed at a high level in all cells so that the gene is silenced in every cell.
Gene Overexpression can also be used to Assess Function
• Gene knockout and silencing are not the only ways to elucidate the function of a gene;
• gene overexpression can also reveal what a gene does.
• The gene dosage in the cell is increased, and the resulting effect helps in understanding the function.
• Vectors can be constructed that overexpress a target gene under a strong promoter.
• These are also multicopy vectors, so that the gene dosage is increased further.
• This approach has also helped to elucidate the function of many genes.
More Detailed Studies of the Activity of a Protein Coded by an unknown gene
• Gene inactivation and gene overexpression can determine the general function of a gene,
• but they cannot provide detailed information about that function,
• e.g. which part of the protein is involved in which activity or in regulation?
• Where is the protein expressed?
• When is the protein needed by the cell?
• To gain insight into these aspects a more detailed analysis is needed.
Directed Mutagenesis can be used to Probe Gene Function in Detail
• Site-directed mutagenesis is a technique in which a protein is mutated at a chosen position,
• for example in an active domain, to make the protein faster, more heat tolerant, etc.
• The technique is widely used in protein engineering,
• where the aim is to develop novel proteins with properties that are better suited for use in industrial or clinical settings.
Site Directed Mutagenesis
• There are three common ways to make site-specific mutations in genes:
– oligonucleotide-directed mutagenesis
– artificial gene synthesis
– PCR
Reporter Genes and Immunocytochemistry can be used to locate where and when genes are expressed
• We can determine experimentally where a protein is expressed in an organism
• and at what stage it is expressed, by examining its expression visually.
• This can be achieved with a reporter gene placed under the same regulation as the target gene.
• The location of the protein can be found by immunocytochemistry.
Case Study: Annotation of the Saccharomyces cerevisiae Genome Sequence
• We have studied different techniques for locating the genes of an organism and determining their expression.
• Now we will study how these techniques were applied to the S. cerevisiae genome sequence to elucidate the position and function of its genes.
• The choice of techniques depends on many considerations:
– type of genome
– availability of related sequences
– ease of experimentation for elucidating the function of genes
Figure 5.26 Genomes 3 (© Garland Science 2007)
Annotation of the yeast genome sequence
• The S. cerevisiae genome sequencing project was completed in 1996.
• The initial analysis, with a cut-off of 100 codons for potential genes, identified 6274 ORFs.
• About 30% of these were known to be genuine genes because they had already been identified by conventional genetic approaches before the genome was sequenced.
• The remaining 70% were studied by homology analysis once the genome had been completely sequenced.
• The results show:
• Almost 30% of the genes could be assigned a function after homology searching of the sequence databases.
– About half of these were clear homologs of genes whose function was already known.
– About half showed weaker similarity, and in many of them the similarity was confined to particular domains and so was of limited usefulness.
Figure 5.23 Genomes 3 (© Garland Science 2007)
Table 5.2 part 1 of 2 Genomes 3 (© Garland Science 2007)
• For some genes the homology search identified the function exactly, e.g. the different subunits of DNA polymerase.
• For some genes the function suggested by the homology search was puzzling, e.g. a homolog of a bacterial nitrogen fixation gene, which turned out to be involved in the synthesis of metal-containing proteins, a group that includes the nitrogen-fixing proteins of bacteria.
• About 10% of all S. cerevisiae genes had homologs in the databases, but the functions of these homologs were unknown, so assigning a function to these genes was not easy. Genes of this type are called orphan families.
• The remaining yeast genes, about 30% of the total, had no homologs in the databases.
• About 7% of the ORFs were questionable and might not be real genes.
• The remainder look like genes but are unique, and so are called single orphans.
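The breakdown above can be checked with a little arithmetic. The sketch below uses only the approximate percentages and the ORF total quoted in the text; the share for single orphans is not stated and is derived here as the remainder, and the counts are rounded estimates rather than published figures.

```python
TOTAL_ORFS = 6274  # ORFs of at least 100 codons, as quoted in the text

# Approximate shares quoted in the text, as percentages of all ORFs.
shares = {
    "known before sequencing": 30,
    "function assigned by homology": 30,
    "orphan families": 10,
    "questionable ORFs": 7,
}
# The single orphans are whatever is left over (derived, not quoted).
shares["single orphans"] = 100 - sum(shares.values())


def approximate_counts(total, shares):
    """Convert percentage shares into rounded ORF counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = approximate_counts(TOTAL_ORFS, shares)
```

On these figures the questionable ORFs and the single orphans together make up the roughly 30% of ORFs with no homolog in the databases.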
• After the initial annotation of the S. cerevisiae genome there were two questions:
– How many of the single orphans are genuine genes?
– Are there genuine genes shorter than 100 codons?
• Although there were just 6274 ORFs of more than 100 codons, there were 100,000 ORFs of 15 codons or less, and many of them showed the codon bias typical of S. cerevisiae.
• Therefore the potential for finding new genes was quite high.
• So experimental work was conducted to elucidate the function of the genes.
Experiments to find the function of genes in the S. cerevisiae genome
• Comparative genomics
– comparing genes in closely related genomes
• Sequencing cDNA libraries
– sequencing cDNA libraries shows which genes are transcribed
• Transposon tagging
– genes are inactivated so that their function can be found
– the strategy was made robust by using transposon tagging with molecular barcodes
• These experiments are continuing, but their results so far have reduced the yeast gene catalogue to about 6120 genes:
– by removing many long ORFs that were previously thought to be genes
– by adding some ORFs that are shorter than 100 codons