Genome Analysis
Lecture 14

Introduction

- A major application of bioinformatics is the analysis of the full genomes of organisms that have been sequenced.
- Traditional genetics has focused on understanding the role of a particular gene or protein in a biological process.
- The availability of a genome sequence provides the sequences of all the genes of an organism.
- Important genes influencing metabolism, cellular differentiation and development, and disease processes in animals can be identified, and the relevant genes manipulated.
- The challenge is to identify those genes that are predicted to have a particular biological function.

Genomic Sequences

- The availability of genome sequences facilitates the discovery and use of sequence polymorphisms, which are used to trace genes among individuals in a population.
- Some types of genetic variation are best understood at the genome-wide level.
- Genome sequences provide the opportunity to explore genetic variability both between organisms and within an individual organism.
- Web resources - ch_10_t_1.html

Prokaryotic Genomes

- The genomes of 31 prokaryotic organisms have been sequenced.
- Organisms were selected on the basis of three criteria:
  - They had been subjected to a good deal of biological analysis and thus were model prokaryotic organisms.
  - They were important human pathogens, such as Mycobacterium tuberculosis and Mycoplasma pneumoniae.
  - They were of phylogenetic interest.
- Sequences were annotated as they were sequenced.

Gene Structure Varies

Organism                            Haploid genome size (Mb)   Predicted number of genes
Arabidopsis thaliana (plant)        130                        ~25,000
Caenorhabditis elegans (worm)       100                        18,424
Drosophila melanogaster             180                        13,601
Escherichia coli                    4.7                        4,288
Homo sapiens (human)                3000                       45,000 to 120,000
Saccharomyces cerevisiae (yeast)    13.5                       6,241

Steps of Genome Analysis

1) Genome sequence assembled
2) Identify repetitive sequences and mask them out
3) Gene prediction: train a model for each genome
4) Look for EST and cDNA sequences
5) Genome annotation
6) Microarray analysis
7) Metabolic pathways and regulation
8) Protein 2D gel electrophoresis
9) Functional genomics
10) Gene location/gene map
11) Self-comparison of proteome
12) Comparative genomics
13) Identify clusters of functionally related genes
14) Evolutionary modeling

Comparative Genomics

- Includes a comparison of gene number, gene content and gene location in both prokaryotic and eukaryotic groups of organisms.
- The availability of genomes makes possible a comparison of all the proteins (the proteome) encoded by one organism with those of another.
- Genes in two organisms that are so similar that they must have the same function and evolutionary history are orthologs.
- Two or more proteins in the same proteome that share a high degree of similarity because they share the same set of domains are likely to be paralogs.

Comparative Genomics of Eukaryotes

- Drosophila has a core proteome only twice the size of that of yeast.
- The complexity apparent in metazoans is not achieved by sheer number of genes.
- Despite the large differences between fly and worm in development and morphology, they use a core proteome of similar size.
- Comparative analysis of the predicted proteins encoded by these genomes suggests that nearly 30% of fly genes have putative orthologs in the worm.
- There are some signs that the Drosophila proteome is more similar to mammalian proteomes than are those of worm or yeast.
- Some of the human disease genes absent in Drosophila reflect clear differences in physiology between the two organisms, for example the hemoglobins.
- The population of multidomain proteins is larger and more diverse in the fly than in the worm.
- The fly genome sequencing effort has revealed a number of previously unknown counterparts to human genes involved in cancer and neurological disorders.

Functional Classifications of Genes

- Annotated genes can be classified by function.
- An early classification scheme for E. coli genes, based on sequence similarity, included categories for enzymes, transport elements, regulators, membranes, structural elements, protein factors, leader peptides and carriers.
- Another classification scheme is based on biochemical activity.
- Proteins that physically interact in a structure or biochemical pathway can also be classified together.

Physical Mapping Databases

- Access to maps produced by multiple groups is available at NCBI, which attempts to integrate several genetic and physical maps with DNA and protein-sequencing information. http://www.ncbi.nlm.nih.gov/Entrez/
- The Genome Data Base (GDB) is limited to human data and contains no sequence data. http://gdbwww.gdb.org
- The Whitehead Institute is the primary source of two genome-wide physical maps: an STS content map of more than 10,000 markers assigned to YACs and a radiation hybrid map of 12,000 markers. http://www.genome.wi.mit.edu

Structural Genomics

- A full understanding of the biological role of the proteins identified in genomes will require knowledge of their structure and function.
- Structural genomics of single proteins, combined with protein structure prediction, may contribute substantially to efficient structural characterization of large macromolecular assemblies.
- The structure of most proteins will be modeled, not determined by experiment.
- Enough protein structures will need to be determined that most of the remaining sequences are related to at least one known structure at higher than 30% sequence identity.
- The focus on proteins will be moving from structural genomics to functional genomics.

Human Genome Project Facts

- Since it began in 1990, the HGP is estimated to have cost $3 billion.
- A rough draft of the human genome was completed in June 2000. The final draft is expected sometime in 2003.
- For the HGP, researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few samples were processed as DNA resources.
- Neither the donors nor the scientists know whose DNA is being sequenced.
- The genome from Celera was based on DNA samples from 5 donors who identified themselves only by race and sex.
- 97% of the DNA in the human genome consists of non-coding sequences.
- Human DNA is 98 percent identical to chimpanzee DNA.
- The average amount of difference between any two humans is 0.2 percent.
- Humans have approximately 30,000 genes, the roundworm has 19,098 genes, the fruit fly has 13,602 genes and yeast has 6,034 genes.

More HGP Facts

- The human genome is the largest genome to be extensively sequenced.
- The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate.
- Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage.
- Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominoid lineage.
- Segmental duplication is much more frequent in humans than in yeast, fly or worm.
- The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.
- More than 1.4 million single nucleotide polymorphisms (SNPs) have been identified.

Background to the HGP

- The HGP arose from two insights that emerged in the 1980s:
  - The ability to take a global view of genomes could greatly accelerate biomedical research.
  - Creating such a global view would require a communal effort.
- It built on several earlier efforts:
  - The sequencing of bacterial viruses and the human mitochondrion between 1977 and 1982, which proved the feasibility of assembling small sequence fragments into complete genomes.
  - The program to create a human genetic map, to make it possible to locate disease genes based solely on their inheritance patterns.
  - The programs to create physical maps of clones covering the yeast and worm genomes, to allow isolation of genes and regions based solely on their chromosomal position.
  - The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery (ESTs).

Timeline of Large-Scale Genomic Analysis

Technology for Large-Scale Sequencing

- Laboratory innovations included four-color fluorescence-based sequence detection, improved fluorescent dyes, dye-labeled terminators, polymerases specifically designed for sequencing, cycle sequencing and capillary gel electrophoresis.
- There were also important advances in software packages for the analysis of sequence data:
  - PHRED makes it possible to monitor raw data quality and also assists in determining whether two similar sequences truly overlap.
  - PHRAP systematically assembles the sequence data using the base-quality scores from PHRED.
- Another key innovation for scaling up sequencing was the development by several centers of automated methods for sample preparation. This typically involved creating new biochemical protocols suitable for automation, followed by construction of appropriate robotic systems.
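The base-quality scores mentioned above can be illustrated with a short sketch. It assumes the standard Phred scale, in which a quality value Q corresponds to an estimated base-call error probability P through Q = -10 log10(P); the slide itself does not state this formula. The function names and the trimming threshold of 20 are hypothetical choices made for this example and are not part of the PHRED or PHRAP programs.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scale quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scale quality value."""
    return -10 * math.log10(p)


def expected_errors(quals):
    """Sum of per-base error probabilities across a read."""
    return sum(phred_to_error_prob(q) for q in quals)


def trim_low_quality_ends(quals, min_q=20):
    """Return (start, end) of the slice that remains after dropping
    leading and trailing bases whose quality is below min_q.
    min_q=20 is an illustrative threshold, not a PHRED default."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


if __name__ == "__main__":
    read_quals = [5, 10, 30, 40, 35, 8]  # made-up quality values
    start, end = trim_low_quality_ends(read_quals)
    print("kept bases:", read_quals[start:end])
    print("expected errors in kept region:",
          expected_errors(read_quals[start:end]))
```

On this scale a quality of 20 means roughly one error in 100 base calls and a quality of 30 roughly one in 1,000, which is why an assembler can give high-quality bases more weight when deciding whether two reads truly overlap.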

Human Sequence in the High Throughput Sequence Division of GenBank

Genome Browser
http://genome.ucsc.edu/

Classes of Interspersed Repeats

Gene Content of Human Genome

- Genes (or at least their coding regions) comprise only a tiny fraction of human DNA, but they represent the major biological function of the genome and the main focus of interest by biologists.
- Human genes tend to have small exons (encoding an average of only 50 codons) separated by long introns (some exceeding 10 kb).
- This creates a signal-to-noise problem, with the result that computer programs for direct gene prediction have only limited accuracy.
- Computational prediction of human genes must rely largely on the availability of cDNA sequences or on sequence conservation with genes and proteins from other organisms.
- This approach is adequate for strongly conserved genes (such as histones or ubiquitin), but may be less sensitive to rapidly evolving genes (including many crucial to speciation, sex determination and fertilization).

Characteristics of Human Genes

Feature                  Median      Mean
Internal exon            122 bp      145 bp
Exon number              7           8.8
Introns                  1,023 bp    3,365 bp
3' UTR                   400 bp      770 bp
5' UTR                   240 bp      300 bp
Coding sequence (CDS)    1,100 bp    1,340 bp
                         367 aa      447 aa
Genomic extent           14 kb       27 kb

Functional Categories in Eukaryotic Proteomes

Applications to Medicine

- A key application of human genome research has been the ability to find disease genes of unknown biochemical function by positional cloning.
- This method involves mapping the chromosomal region containing the gene by linkage analysis in affected families and then scouring the region to find the gene itself.
- The human genomic sequence in public databases allows rapid identification in silico of candidate genes, followed by mutation screening of relevant candidates, aided by information on gene structure.
- For a Mendelian disorder, a gene search can now often be carried out in a matter of months with only a modestly sized team.

Drug Targets

- A recent compendium lists 483 drug targets as accounting for virtually all drugs on the market.
- Only a minority of human genes may be drug targets. It has been predicted that the number will exceed several thousand, and this prospect has led to a massive expansion of genomic research in pharmaceutical research and development.
- Examples:
  - Serotonin receptors: mood disorders and schizophrenia
  - Leukotriene pathway: asthma
  - Amyloid precursor protein: Alzheimer's disease

Next Steps

- Finishing the human sequence
- Developing the Integrated Gene Index (IGI) and Integrated Protein Index (IPI)
- Large-scale identification of regulatory regions
- Sequencing of additional large genomes
- Completing the catalogue of human variation
- From sequence to function

Future Technology Development

- Functional genomics aims to understand how genes are regulated and what they do, largely through massively parallel studies of gene expression in a variety of tissues.
- Proteomics promises to make the identity of each protein known and to elucidate protein-protein interactions.
- Bioinformatics will enhance the ability of researchers to manipulate, collect and analyze data more quickly and in new ways.
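
As a small worked example of the kind of analysis this lecture describes, the genome sizes and predicted gene counts from the Gene Structure Varies table can be turned into gene densities (genes per megabase). This is only a sketch: the figures are copied from that table, the function and variable names are hypothetical, and the human entry is kept as a low and a high bound because the table gives a range.

```python
# (haploid genome size in Mb, predicted number of genes), from the lecture table
GENOMES = {
    "Arabidopsis thaliana": (130, 25000),
    "Caenorhabditis elegans": (100, 18424),
    "Drosophila melanogaster": (180, 13601),
    "Escherichia coli": (4.7, 4288),
    "Saccharomyces cerevisiae": (13.5, 6241),
}

# The table gives a range for human, so both bounds are kept.
HUMAN_SIZE_MB = 3000
HUMAN_GENE_RANGE = (45000, 120000)


def genes_per_mb(size_mb, n_genes):
    """Gene density: predicted genes per megabase of haploid genome."""
    return n_genes / size_mb


def rank_by_density(genomes):
    """Return (organism, density) pairs, densest genome first."""
    densities = [(name, genes_per_mb(size, genes))
                 for name, (size, genes) in genomes.items()]
    return sorted(densities, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, density in rank_by_density(GENOMES):
        print(f"{name:28s} {density:8.1f} genes/Mb")
    low, high = (genes_per_mb(HUMAN_SIZE_MB, n) for n in HUMAN_GENE_RANGE)
    print(f"{'Homo sapiens':28s} {low:8.1f} to {high:.1f} genes/Mb")
```

The compact E. coli genome comes out far denser than any of the eukaryotic genomes, and even the upper human estimate gives the lowest density in the table, which is consistent with the point that genes make up only a small fraction of human DNA.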