Chapter 20: Genomics and Global Screening

Chapter 20 Outline

• Definition
• History
• Open reading frame (ORF)
• First genome sequence
• Vectors for genome sequencing
  o YAC
  o BAC
• Sequence Tagged Sites
• Mapping with STSs
• Shotgun-sequencing method
• GeneEngine platform
• Differential Display-PCR
• Serial analysis of gene expression (SAGE)
• DNA microarrays
• Impact on Bioinformatics

Definition

The genome is the complete set of genetic material in an organism. It includes both the genes and the non-coding sequences of the DNA. Hans Winkler, Professor of Botany at the University of Hamburg, Germany, first adopted the term in 1920.

Genomics is the study of the genome and of the role of genes, alone and together, in directing life. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts. Genomics was established by Fred Sanger when he first sequenced the complete genomes of a virus and a mitochondrion. His group established techniques of sequencing, genome mapping, data storage, and bioinformatic analysis in the 1970s and 1980s. Microarrays are among the most important tools in genomics. The first DNA-based genome to be sequenced in its entirety was that of bacteriophage ΦX174 (5,386 bp), sequenced by Frederick Sanger in 1977. The first free-living organism to have its genome sequenced was Haemophilus influenzae (1.8 Mb) in 1995, and since then genomes have been sequenced at a rapid pace.

Functional genomics is a field of molecular biology that uses the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products.
It attempts to answer questions about processes such as gene transcription, translation and protein-protein interaction. The goal of functional genomics is to understand the relationship between an organism's genome and its phenotype. Functional genomics mostly uses multiplex techniques to measure the abundance of many or all gene products, such as mRNAs or proteins, within a biological sample, in order to quantitate the various biological processes and improve our understanding of gene and protein functions and interactions.

Global gene screening uses techniques such as differential display-PCR (DD-PCR), serial analysis of gene expression (SAGE) and DNA array technology to analyze DNA samples. The aim is to detect the presence of a gene or genes associated with an inherited disorder, or to look for a genetic alteration that may indicate an increased risk of developing a specific disease or disorder.

History

After Sanger sequenced phage ΦX174 in 1977, the next major step came in 1995, when Craig Venter and Hamilton Smith sequenced the first genomes of free-living organisms: Haemophilus influenzae and Mycoplasma genitalium. The Haemophilus influenzae genome contained 1,830,137 bp and was the first to be completely sequenced. In 1996, the genome of baker's yeast, containing 12 million bp, was sequenced. Also in 1996, the first genome of an organism from the third domain of life, the archaea, was sequenced. In 1997, the E. coli genome, containing 4.6 million bp, was sequenced. In 1998, the first animal genome, that of Caenorhabditis elegans, was sequenced.

Figure-1: DNA Sequence Trace

Milestones in Genomic Sequencing

The Human Genome Project was a 13-year mega project that was launched in 1990 and completed in 2003. The Human Genome Project international consortium announced the publication of a draft sequence and analysis of the human genome, the genetic blueprint for the human being.
An American company, Celera, led by Craig Venter, and the large international collaboration of distinguished scientists led by Francis Collins, director of the National Human Genome Research Institute, U.S., both published their findings. In 2000, the first rough draft of the human genome was completed. The genome has been completely sequenced according to the definition employed by the International Human Genome Project. A graphical history of the Human Genome Project shows that most of the human genome was complete by the end of 2003. The mouse genome was completely sequenced in 2002. By the end of 2006, 453 complete genomes had been sequenced. Reading the complete 3.2 billion base pairs of the human genome at 5 bp per second, for 8 hours per day, would take about 60 years (3.2 billion bp at 5 bp per second is 640 million seconds, or roughly 22,000 eight-hour days).

Open Reading Frame (ORF)

An open reading frame is a sequence of bases that, if translated in one frame, contains no stop codons for a relatively long distance, long enough to code for one of the phage proteins. The open reading frame usually starts with an ATG (or occasionally a GTG) triplet, corresponding to an AUG (GUG) translation initiation codon, and ends with a stop codon (UAG, UAA, or UGA). In genes without introns, an open reading frame is the same as the gene's coding region. In a gene, ORFs are located between the start-code sequence (initiation codon) and the stop-code sequence (termination codon). ORFs are usually encountered when sifting through pieces of DNA while trying to locate a gene. Because some organisms use an altered genetic code with different start codons, ORFs in those organisms are identified differently. In organisms with double-stranded DNA, the sequence can be read in six reading frames, three on each strand. The longest sequence without a stop codon usually determines the open reading frame. The base sequence of the phage DNA also tells us the amino acid sequence of all phage proteins.
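
The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: the function names (`find_orfs`, `longest_orf`) and the `min_codons` cutoff are assumptions made for this example, and the scan takes the first start codon in each frame and runs to the next in-frame stop codon.

```python
from typing import List, Optional, Tuple

# Codons are written as DNA (coding strand), so UAA/UAG/UGA appear as TAA/TAG/TGA.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG"}  # GTG is the occasional alternative start

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, str]]:
    """Scan all six reading frames and return (strand, frame, orf) tuples.

    An ORF here runs from the first start codon in a frame to the next
    in-frame stop codon, and must contain at least `min_codons` codons
    before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None
    return orfs


def longest_orf(seq: str, min_codons: int = 3) -> Optional[Tuple[str, int, str]]:
    """Return the longest ORF found in any of the six frames, or None."""
    orfs = find_orfs(seq, min_codons)
    return max(orfs, key=lambda o: len(o[2])) if orfs else None
```

For example, `longest_orf("CCATGAAATTTGGGTAACC")` reports an ORF on the forward strand in the third frame (frame index 2), starting at the ATG and ending at the TAA.
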
One uses the genetic code to translate the DNA base sequence of each reading frame into the corresponding amino acid sequence.

First Genome Sequence

The first genome to be sequenced was that of the E. coli phage ΦX174, by Sanger in 1977, who reported a sequence of 5,375 nt. Analysis of the open reading frames of ΦX174 revealed that some of the phage genes overlap. In the picture below, the coding region of gene B lies within the region of gene A, and the coding region of gene E lies within the region of gene D. Even though the genes occupy the same region, they code for different proteins because they are read in different reading frames and therefore encounter different codons.

Figure-2: Phage ΦX174: Fred Sanger, 1977, 5,375 nt (1st genome sequenced). (a) Each letter is a gene. (b) Overlapping reading frame: only the non-template sequence (coding or sense strand) is shown.

Genome Sequencing

Genome sequencing is the process that determines the complete DNA sequence of an organism's genome at a single time. Biological samples such as saliva, epithelial cells, bone marrow, hair or anything else that has DNA-containing cells can provide the genetic material necessary for full genome sequencing. Large-scale sequencing aims at sequencing very long DNA pieces, such as whole chromosomes. Common approaches consist of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA fragments, which are then cloned, individually sequenced, and put back together. The two approaches used to sequence the human genome are the clone-by-clone approach and the shotgun approach.

Figure-3: Genome Sequencing. Genomic DNA is fragmented into random pieces and cloned as a bacterial library. DNA from individual bacterial clones is sequenced and the sequence is assembled by using overlapping DNA regions.

Clone-by-Clone Approach: map then sequence

Figure-4: Clone-by-clone Approach

In this approach, the DNA is mapped first and then sequenced.
The chromosomes were mapped and then split up into sections. A rough map was drawn for each of these sections, and then the sections themselves were split into smaller bits, with plenty of overlap between the bits. Each of these smaller bits was sequenced, and the overlaps were used to put the genome jigsaw back together again. Since every DNA sequence is derived from a known region, it is relatively easy to keep track of the project and to determine where there are gaps in the sequence. Moreover, assembly of relatively short regions of DNA is an efficient step. However, mapping can be a time-consuming and costly process. This process uses yeast artificial chromosome or bacterial artificial chromosome vectors for cloning. This approach was used by the international consortium led by Francis Collins.

Shotgun Approach: sequence then map

Figure-5: Shotgun Approach

The alternative to the clone-by-clone approach is shotgun sequencing, developed by Fred Sanger in 1982. First, all the DNA is broken into fragments. The fragments are then sequenced at random and assembled by looking for overlaps. The advantage of the whole-genome shotgun is that it requires no prior mapping. Its disadvantage is that large genomes need vast amounts of computing power and sophisticated software to reassemble the genome from its fragments. Reassembling these sequenced fragments requires huge investments in IT and, unlike the clone-by-clone approach, assemblies can't be produced until the end of the project. This process uses bacterial artificial chromosome vectors for cloning. The whole-genome shotgun approach was applied to the human genome by Craig Venter's group.

Vectors for Genome Sequencing

Yeast artificial chromosome (YAC):

The yeast artificial chromosome (YAC) was useful in mapping the human genome because it can hold hundreds of kilobases of DNA.
It contains a left and a right yeast chromosomal telomere, which are both necessary for yeast chromosomal replication, and a yeast centromere, which is necessary for segregation of sister chromatids to opposite poles of the dividing yeast cell. The centromere is placed adjacent to the left telomere, and a huge piece of DNA can be placed between the centromere and the right telomere. The large pieces of DNA are prepared by digesting long pieces of DNA with a restriction enzyme. The YACs with their DNA inserts are then placed into yeast cells, where they replicate like normal yeast chromosomes.

Figure-6: Yeast artificial chromosome (YAC). Cloning in yeast artificial chromosomes (yellow, telomere; red, C, centromere; L, left arm; R, right arm with telomere; blue, large piece of foreign gene, several hundred kb).

A YAC can replicate in yeast and take an insert of up to a million bp. Even though it was useful for human genome mapping, it had several disadvantages: it is inefficient and unstable, its cloning efficiency is low, and it is hard to isolate from yeast. To solve these problems, scientists started to use the bacterial artificial chromosome.

Bacterial artificial chromosome (BAC):

The bacterial artificial chromosome (BAC) solved the problems that arise with YACs, and it was the vector of choice for the sequencing phase of the Human Genome Project. BACs are based on the F plasmid that inhabits E. coli cells. This plasmid allows conjugation between bacterial cells and can be transferred from an F+ cell, the donor, to an F− cell, the recipient, thus converting the recipient to an F+ cell. A small piece of host DNA can be transferred into the F plasmid, or the F plasmid can insert into the host chromosome and mobilize the host chromosome to pass from the donor to the recipient cell. The F plasmid can therefore accommodate large inserts of DNA. The BAC takes inserts of about 300,000 bp and became the vector of choice for the Human Genome Project. It was developed by Melvin Simon in 1992.
The par genes govern the distribution of plasmids into the daughter cells and keep the plasmid copy number at one to two per cell, so the BAC is stable.

Figure-7: Bacterial artificial chromosome. The figure shows the first BAC, developed by Melvin Simon and colleagues in 1992. It has the cloning sites HindIII and BamHI, at top; the chloramphenicol resistance gene (CmR), used as a selection tool; the origin of replication (oriS); and the genes governing partition of plasmids to daughter cells (ParA and ParB).

Sequence Tagged Sites

A sequence-tagged site (STS) is a relatively short sequence (200 to 500 bp) that can be specifically amplified by the polymerase chain reaction (PCR), detected in the presence of all other genomic sequences, and whose location in the genome is mapped. One needs to know enough of the DNA sequence in the region being mapped to design short primers that will hybridize a few hundred base pairs apart and cause amplification of a predictable length of DNA in between. STSs can then be easily detected by PCR using these two primers. If an amplified DNA fragment of the proper size appears, then the unknown DNA has the STS of interest. The primers must not only hybridize to the DNA but must do so a specific number of base pairs apart to give a PCR fragment of the right size, which provides a check on the specificity of hybridization. For this reason STSs are useful for constructing genetic and physical maps from sequence data reported from many different laboratories. They serve as landmarks on the developing physical map of a genome. They are also used in shotgun sequencing, specifically to aid sequence assembly.
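
The size check described above can be illustrated with a small in-silico PCR sketch in Python. It is a simplification under stated assumptions: primers must match the template exactly (real primers tolerate some mismatches), only the given strand is searched for the forward primer, and the function names `sts_product_size` and `has_sts` are hypothetical names chosen for this example.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sts_product_size(template: str, fwd_primer: str, rev_primer: str) -> Optional[int]:
    """Return the predicted PCR product length, or None if no product.

    The forward primer must match the template strand as given; the reverse
    primer anneals to the opposite strand, so its reverse complement must
    appear downstream of the forward primer on the given strand.
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer.upper())

    start = template.find(fwd)
    if start == -1:
        return None
    end = template.find(rev_site, start + len(fwd))
    if end == -1:
        return None
    # The product spans from the first base of the forward primer
    # to the last base of the reverse primer site.
    return end + len(rev_site) - start


def has_sts(template: str, fwd_primer: str, rev_primer: str,
            expected_size: int, tolerance: int = 0) -> bool:
    """True if a product is predicted and its size matches the STS record."""
    size = sts_product_size(template, fwd_primer, rev_primer)
    return size is not None and abs(size - expected_size) <= tolerance
```

A clone is scored positive for the STS only when both primers anneal and the product has the expected length, which mirrors the gel-based check on a real PCR product.
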
The advantage of STSs over other mapping landmarks is that the means of testing for the presence of a particular STS can be completely described as information in a database: anyone who wishes to make copies of the marker simply looks up the STS in the database, synthesizes the specified primers, and runs the PCR under the specified conditions to amplify the STS from genomic DNA.

Figure-8: Sequence-Tagged Site (STS). Here, we start with long pieces of DNA extending indefinitely in either direction. Once the sequences of small areas of the DNA are known, we design primers that will hybridize to these regions and allow PCR to produce double-stranded fragments of predictable length. Here, PCR primers 250 bp apart have been used. Several cycles of PCR generate many double-stranded PCR products that are precisely 250 bp long. Electrophoresis of this product allows one to measure its size exactly and confirm that it is the correct one.

Mapping With STSs

STSs are useful in physical mapping, that is, in locating specific sequences in a genome. Microsatellites are a class of STSs that are highly polymorphic. Microsatellites are repeating sequences of 2-6 base pairs of DNA. They are typically neutral and codominant. They are used as molecular markers in genetics, for kinship, population and other studies. They can also be used to study gene duplication or deletion. The most common way to detect microsatellites is to design PCR primers that are unique to one locus in the genome and that base-pair on either side of the repeated portion. A single pair of PCR primers will therefore work for every individual in the species and produce different-sized products for each of the different-length microsatellites.

In the picture below, several representative BACs are shown at the top left, with different symbols representing different STSs placed at specific intervals. In step (a), screen for two or more widely spaced STSs, here STS1 and STS4.
All the BACs that contain them are shown on the top right. The identified STSs are shown in color. In step (b), each of the positive BACs is screened for more STSs: STS2, STS3 and STS5. The colored symbols on the BACs at the bottom right denote the STSs detected in each BAC. In step (c), align the STSs in each BAC to form a contig, a set of overlapping DNAs spanning long distances. Measuring the lengths of the BACs by pulsed-field gel electrophoresis helps to pin down the spacing between pairs of BACs.

Figure-9: Mapping with STSs

Shotgun-sequencing Method

The shotgun-sequencing method, first proposed by Craig Venter, Hamilton Smith and Leroy Hood in 1996, sequences first and maps afterwards. It starts with a set of BAC clones containing very large inserts, averaging about 150 kb. The insert in each BAC is sequenced at both ends using an automated sequencer that can easily read about 500 bases at a time, so 500 bases at each end of the clone are determined. These 500-base sequences serve as an identity tag, called a sequence-tagged connector (STC), for each BAC clone. Next, each clone is fingerprinted by digesting it with a restriction enzyme, both to determine the insert size and to eliminate aberrant clones whose fragmentation patterns do not fit the consensus of the overlapping clones. A seed BAC is then selected and subdivided into smaller clones in a pUC vector, with inserts averaging only about 2 kb; sequencing these subclones and ordering them by their overlaps gives the sequence of the whole BAC. This whole BAC sequence allows the identification of the 30 or more BACs that overlap with the seed. Next, one selects the BACs with minimal overlap and proceeds to sequence them. This process is repeated with other BACs that have minimal overlap with the second set. This process, known as BAC walking, would in principle allow one laboratory to sequence the whole human genome. In summary, one assembles libraries of clones with different size inserts, then sequences the inserts at random. The method relies on a computer program to find areas of overlap among the sequences and piece them together.
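
The overlap-finding step can be sketched with a toy greedy assembler in Python. This is only an illustration of the idea, assuming error-free reads from a single strand and no repeats; real assemblers handle sequencing errors, both strands, and repeated regions. The names `overlap` and `greedy_assemble` and the `min_len` cutoff are assumptions of this example.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the remaining contigs; a single contig means every read
    could be joined into one sequence.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs
```

Given three overlapping reads cut from a 21-base sequence, the sketch rebuilds the original sequence regardless of the order in which the reads are supplied.
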
In the picture below:
(a) Chromosomes are cloned into a BAC vector, yielding a collection of 300,000 BAC clones. A 96-well microtiter plate is shown with 96 of the clones.
(b) A seed BAC is selected for sequencing.
(c) The seed is subcloned into a plasmid vector, yielding a plasmid library.
(d) Three thousand of the plasmid clones are sequenced, and the sequences are ordered by their overlaps, producing the sequence of the whole 150-kb BAC.
(e) The BACs with overlapping STCs are found and compared by fingerprinting, and those with minimal overlap are sequenced. This process, known as BAC walking, creates a contig of the whole chromosome.

Figure-10: Shotgun sequencing

GeneEngine™ Platform

The process of direct analysis by the GeneEngine™ platform begins with the isolation of target material (DNA, RNA or protein) from a biological source, followed by fluorescently tagging the sample material at specific sites of interest (e.g. a nucleotide sequence motif or protein epitope). The sample is then injected into the nano-fluidic system of the GeneEngine™ instrument, and the sample passes through an interrogation region consisting of several laser spots. Each molecule is detected by the laser excitation of the fluorescent tags on the molecule. Thousands of molecules pass through the system per minute; for DNA analysis, this represents a throughput of 10-30 million base pairs per second. The Trilogy™ technology combines advances in nanofluidics, optical engineering, and novel labeling strategies with life science applications in research, drug discovery and development, and diagnostics. The first applications of U.S. Genomics' platform include direct detection and analysis of RNA, small RNAs (siRNA, miRNA, etc.), and protein molecules, as well as analyses of the molecules' interactions.

Differential Display-PCR

Differential display, also known as DD-PCR, is a technique with which one can identify and analyze altered gene expression at the mRNA level. It can also be used to identify genes that are suppressed or induced. In this technique, one analyzes two or more samples to study their gene expression patterns. These samples can be obtained from any eukaryotic organism, including plants, fish, amphibians, reptiles, insects, yeast, fungi and mammals. The technique uses a limited number of primers to systematically amplify and visualize most of the mRNA in a cell. It is one of the most commonly used techniques for identifying differentially expressed genes at the mRNA level. It was first designed by Liang and Pardee (Science 257, 967, 1992) and Welsh et al. (NAR, 20, 4965, 1992), and the goal is to display all of the mRNAs of a cell. The method depends on PCR and PAGE. The advantages of this method are its simplicity, the ability to monitor the process at several stages, the fact that it works from total RNA, the fact that results can be evaluated by side-by-side comparison on polyacrylamide gels, and that it yields cDNA fragments that can be easily sequenced and identified.

An overview of the method is as follows:
1. Treat cells or tissue.
2. Collect RNA.
3. Treat RNA with DNase.
4. Split RNA into aliquots and perform a reverse transcription reaction on each using a different primer.
5. Perform PCR using the cDNA subsets as template, with the specific primer together with an arbitrary primer.
6. Load the PCR reactions on a sequencing gel.
7. Identify induced or inhibited genes.
8. Repeat experiments to confirm results.
9. Excise the band from the gel.
10. Reamplify the cDNA using the same PCR conditions.
11. Use the PCR products as a probe in a Northern blot.
12. Clone the cDNAs that are positive in the Northern blot.
13. Screen clones to identify unique species.
14. Identify clones that work in the Northern blot.
15. Sequence the clone to obtain the full-length cDNA.
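
The band-comparison step (identifying induced or inhibited genes from the gel) can be sketched as a simple set comparison in Python. This is a deliberately simplified illustration: it assumes each lane has already been reduced to a list of band sizes in base pairs and treats a band as either present or absent, whereas real gels also show intensity differences. The function name `compare_lanes` and the example band sizes are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_lanes(control_bands: Iterable[int],
                  treated_bands: Iterable[int]) -> Dict[str, List[int]]:
    """Classify bands from two lanes run with the same primer pair.

    Bands seen only in the treated lane are candidate induced transcripts,
    bands seen only in the control lane are candidate suppressed transcripts,
    and bands seen in both lanes are treated as unchanged.
    """
    control, treated = set(control_bands), set(treated_bands)
    return {
        "induced": sorted(treated - control),
        "suppressed": sorted(control - treated),
        "unchanged": sorted(control & treated),
    }


# Hypothetical band sizes (bp) read from two adjacent lanes.
result = compare_lanes([150, 220, 310], [150, 275, 310])
print(result)
```

Candidate bands flagged this way would still need the confirmation steps listed above (repeat experiments, reamplification, Northern blot) before being reported.
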
Figure-11: The Differential Display-PCR method of analyzing samples

Serial analysis of gene expression (SAGE)

SAGE (serial analysis of gene expression) is an alternative method of gene expression analysis based on sequencing rather than hybridization. SAGE relies on the sequencing of tags of 10-17 base pairs that are unique to each gene. These tags are produced from poly-A mRNA and ligated end-to-end before sequencing. It was originally developed by Dr. Victor Velculescu at the Oncology Center of Johns Hopkins University and published in 1995. SAGE is a powerful tool for the analysis of gene expression. It does not require a preexisting clone, can identify and quantify both known and new genes, and can pick up low-abundance transcripts. Tags produced by SAGE can be identified using high-throughput sequencing, and each carries enough information to uniquely identify an mRNA transcript. Data can be searched on the SAGEmap database (www.ncbi.nlm.nih.gov/SAGE), and the information can then be analyzed and stored for future analysis.

Briefly, SAGE experiments proceed as follows:
• Isolate the mRNA of an input sample.
• Extract a small chunk of sequence from a defined position of each mRNA molecule.
• Link these small pieces of sequence together to form a long chain.
• Clone these chains into a vector which can be taken up by bacteria.
• Sequence these chains using modern high-throughput DNA sequencers.
• Process this data with a computer to count the small sequence tags.

Figure-12: SAGE Technique

Although SAGE was originally conceived for use in cancer studies, it has been successfully used to describe the transcriptome of other diseases and in a wide variety of organisms. Genzyme Molecular Oncology (GMO) provided research support to KWK and has licensed the SAGE technology from The Johns Hopkins University for commercial purposes; the technology is freely available to academia for research purposes.
Invitrogen has subsequently sublicensed the SAGE technology from GMO for the purpose of providing a SAGE kit. The University and researchers (VEV, LZ, BV, KWK) have a financial interest in GMO, the arrangements for which are managed by the University in accordance with its conflict of interest policies.

DNA Microarray

The DNA microarray evolved from Southern blotting, a technique in which fragmented DNA is attached to a substrate and then probed with a known gene or fragment. A microarray is a device to which DNA is bound for analysis with homologous cDNA or RNA. It measures the amount of mRNA in a sample that corresponds to a given gene or probe DNA sequence. The probe sequences are immobilized on a solid surface and then hybridized with fluorescently labeled "target" mRNA. The intensity of the fluorescence of a spot is proportional to the amount of target sequence that has hybridized to that spot, so measuring this intensity gives the abundance of that mRNA sequence in the sample. Microarrays were first used in 1995 for gene expression profiling, and an array representing a complete genome was published in 1997.

There are two main types of array, the 'microarray' and the 'DNA chip', depending on how the nucleotide sequences are put onto the chip. Microarrays use presynthesized DNA (about 100 bases) for probing, whereas DNA chips use in situ synthesized oligonucleotide probes (25 bases). More recently, types of array have been distinguished by the number of genes that can be measured, since DNA chips allow for an increased number of probes. An oligonucleotide is a short nucleic acid polymer, typically a few tens of bases or fewer in length.

A microarray works by putting a large number (up to 100,000 or more) of cDNA sequences or synthetic DNA oligomers onto a glass slide (or other substrate) in known locations on a grid. An RNA sample is labeled and hybridized to the array, and the amount of RNA bound to each square in the grid is measured. Finally, comparisons are made, for example between cancerous and normal tissue, between treated and untreated samples, or across a time course.
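
The comparison step can be illustrated with a short Python sketch that turns spot intensities from two samples into log2 ratios, the same kind of scale used in the cluster figure later in this chapter. This is a minimal sketch under stated assumptions: the intensities are already background-corrected and normalized, the gene names and values are made up for illustration, and the two-fold threshold and the `floor` clamp are arbitrary choices, not part of any standard.

```python
import math
from typing import Dict


def log2_ratios(test: Dict[str, float], reference: Dict[str, float],
                floor: float = 1.0) -> Dict[str, float]:
    """Return log2(test / reference) for every gene measured in both samples.

    Intensities below `floor` are raised to `floor` to avoid dividing by
    zero or taking the log of zero.
    """
    ratios = {}
    for gene in test.keys() & reference.keys():
        t = max(test[gene], floor)
        r = max(reference[gene], floor)
        ratios[gene] = math.log2(t / r)
    return ratios


def classify(ratios: Dict[str, float], threshold: float = 1.0) -> Dict[str, str]:
    """Label each gene 'up', 'down' or 'unchanged'.

    A threshold of 1.0 on the log2 scale corresponds to a two-fold change.
    """
    calls = {}
    for gene, value in ratios.items():
        if value >= threshold:
            calls[gene] = "up"
        elif value <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical fluorescence intensities for three spots.
tumor = {"geneA": 800.0, "geneB": 100.0, "geneC": 250.0}
normal = {"geneA": 200.0, "geneB": 400.0, "geneC": 250.0}
print(classify(log2_ratios(tumor, normal)))
```

Working on the log2 scale makes a four-fold increase (+2) and a four-fold decrease (-2) symmetric around zero, which is why expression ratios are usually reported this way.
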
DNA microarrays can be used to measure changes in gene expression and for mutation detection (single base, such as in one type of diabetes), polymorphism analysis, mapping (locating genes within chromosomes), evolutionary studies (identifying common ancestors), and pharmacogenomics (the search for therapeutic responses to drugs given the genetic profiles of patients).

Figure-13: Chip making. This is a schematic diagram of a DNA microarray. The drawing represents a standard 1 inch by 3 inch glass microscope slide with an array of 5808 tiny spots of DNA. Each dot is 200 micrometer in diameter and the distance between the dot centers is 40 micrometer. It is possible to place more than 10,000 spots on a slide of this size.

Figure-14: Growing Oligonucleotides on a Glass Substrate. The glass is coated with a reactive group that is blocked with a photosensitive agent (red). The blocking agent can be removed with light, so the parts of the plate that are unmasked (blue) are the parts that light can reach. In the first cycle four spots are masked, so light can reach only the two unmasked spots. The unblocked spots are chemically coupled with a guanosine (G) nucleotide. During the second cycle, three spots are masked and protected from light, while the other three, including a spot from the first cycle, are unmasked and light reaches them. These spots are chemically coupled with an adenosine (A) nucleotide. Thus the spot that went through both cycles will carry the dinucleotide G-A. In this pattern, the cycles are repeated over and over again with different nucleotides.

Figure-15: Creating DNA microarray

Figure-16: An example of the results of DNA microarray

DNA microarrays are created by robotic machines that arrange minuscule amounts of hundreds or thousands of gene sequences on a single microscope slide. Researchers have a database of over 40,000 gene sequences that they can use for this purpose. When a gene is activated, cellular machinery begins to copy certain segments of that gene.
The resulting product is known as messenger RNA (mRNA), which is the body's template for creating proteins. The mRNA produced by the cell is complementary to, and therefore will bind to, the original portion of the DNA strand from which it was copied. To determine which genes are turned on and which are turned off in a given cell, a researcher must first collect the messenger RNA molecules present in that cell. The researcher then labels each mRNA molecule by attaching a fluorescent dye. Next, the researcher places the labeled mRNA onto a DNA microarray slide. The messenger RNA that was present in the cell will then hybridize, or bind, to its complementary DNA on the microarray, leaving its fluorescent tag. The researcher must then use a special scanner to measure the fluorescent areas on the microarray.

Figure-17: DNA microarray and gene expression. Hierarchical cluster analysis of normal tissue specimens. (a) Thumbnail overview of the two-way hierarchical cluster of 115 normal tissue specimens (columns) and 5,592 variably-expressed genes (rows). Mean-centered gene expression ratios are depicted by a log2 pseudocolor scale (ratio fold-change indicated); gray denotes poorly-measured data. Selected gene-expression clusters are annotated. The dataset represented here is available as Additional data file 2. (b) Enlarged view of the sample dendrogram. Terminal branches for samples are color-coded by tissue type. Shyamsundar et al. Genome Biology 2005 6:R22 doi:10.1186/gb-2005-6-3-r22

The two-way unsupervised analysis also identified clusters of coexpressed genes which represented both tissue-specific structures and systems and coordinately regulated cellular processes. For example, on the basis of the shared characteristics of well annotated genes in the clusters, we identified clusters representing cell proliferation, mitochondrial ATP production, mRNA processing, protein translation and endoplasmic reticulum-associated protein modification and secretion.
Interestingly, proliferation, mitochondrial ATP production and protein translation were each represented by two distinct clusters of genes, suggesting that subsets of these functions might be differentially regulated among different tissues. One gene cluster corresponded to sequences on the mitochondrial chromosome; we interpret this feature to reflect the relative abundance of mitochondria in each tissue sample.

Impact on Bioinformatics

Bioinformatics is the application of computer science to molecular biology. The term was first introduced by Paulien Hogeweg in 1979 for the study of informatic processes in biotic systems. Its main application has been in genomics and genetics, particularly in those areas of genomics involving large-scale DNA sequencing. Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets. It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.

References:
www.yourgenome.org/hgp/hgp2/hgp_5.shtml
www.ncbi.nlm.nih.gov/projects/genome/probe/doc/TechSTS.shtml
www.fass.org/fass01/pdfs/kemppainen.pdf
http://genomebiology.com/2005/6/3/R22
www.wikipedia.org
