BioSci 145B Lecture #6 5/11/2004
• Bruce Blumberg
  – 2113E McGaugh Hall - office hours Wed 12-1 PM (or by appointment)
  – phone 824-8573
  – [email protected]
• TA – Curtis Daly [email protected]
  – 2113 McGaugh Hall, 924-6873, 3116
  – Office hours Tuesday 11-12
• lectures will be posted on web pages after lecture
  – http://eee.uci.edu/04s/05705/ - link only here
  – http://blumberg-serv.bio.uci.edu/bio145b-sp2004
  – http://blumberg.bio.uci.edu/bio145b-sp2004
BioSci 145B lecture 6 page 1 ©copyright Bruce Blumberg 2004. All rights reserved

Useful software for molecular biology (contd)
• NCBI – www.ncbi.nlm.nih.gov
  – main information and analysis resource
  – indispensable resource

Useful software for molecular biology (contd)
• NCBI – BLAST – how to find similar genes
  – www.ncbi.nlm.nih.gov/BLAST/

Useful software for molecular biology (contd)
• Why pay Celera?

Routes to gene identification
• Genome sequences are minimally useful without annotation
  – Annotation = description, biological information
    • Functional annotation – information on the function
    • Structural annotation – identification of genes, sequence elements
  – Much annotation is done automatically today
    • Via sequence comparisons with various databases
      – Gene sequences
      – ESTs
    • Algorithms predict promoters, splicing, polyadenylation sites and, most importantly, ORFs
    • ORFs – open reading frames are putative proteins
      – Algorithms miss in both directions
      – Source of much disagreement
• Field of bioinformatics has grown to encompass many types of analysis related to gene function
  – www.igb.uci.edu

How are genes identified?
• Functional cloning
  – Finding a gene by using a functional assay
• Positional cloning
  – Find a gene by where it is located, what it is near
• Bioinformatic analysis that relates back to functional or positional cloning

How are genes identified?
• Functional cloning (aka expression cloning) – identifying by a functional assay
  – What are functional assays?
    • Enzyme activity – kinases (add PO4 to proteins)
    • Ligand binding – peptide hormone (e.g., glucagon) receptors
    • Transport (ions, sugars, etc.) – e.g., intestinal glucose transporters
    • Mutant rescue – restore function to a cell or embryo
  – Introduce cDNA library pools (~10,000 cDNAs)
    • via transfection, microinjection, infection
    • Perform functional assays
      – A robust, sensitive, accurate assay is key
  – Positive pools are subdivided and retested to obtain pure cDNAs
    • cycle is repeated until single clones are obtained
  – Applications – many enzymes, transporters and growth factor receptors were cloned this way

How are genes identified?
• Positional cloning – identifying by where the gene is located
  – from genetic linkage map
  – by walking from nearby sequence tags (ESTs, STS, STC, etc.)
  – Gene trap techniques (week 8)
  – Interspecific backcrosses (Mus musculus vs M. spretus)
  – When possible – try to rescue phenotype with candidate region
• Positional cloning – OK, you have identified a region where your gene of interest may be located - now what?
  – How do we figure out what genes are in this region without knowing function?
  – General problem for annotation of sequences
    • Genome sequencing vis-à-vis positional cloning

How are genes identified?
(contd)
• Ways to identify genes in regions
  – Cross-species hybridization
    • Probe another species with this genomic region
    • coding sequences are conserved -> should see hybridization where genes are
    • What do you think are limitations to this approach?
      – Species must be sufficiently different to reduce “noise” from overall sequence conservation
        » mouse vs human probably not great
        » Human vs frog or fish probably good
      – Must be sufficiently similar for genes to be conserved
        » Human vs frog or fish probably good
        » Humans vs yeast only good for common genes
      – Target species region needs to be well characterized
    • Computer parallels – compare sequence to be annotated with annotated sequence from a different organism
      – e.g., human with Drosophila
      – Unknown bacterium with E. coli, etc.

How are genes identified? (contd)
• Ways to identify genes in regions (contd)
  – Hybridization to known genes or coding materials
    • What are some examples?
      – Hybridize to mRNA (Northern blots)
      – Hybridize to cDNA libraries (must be right tissue, cell or stage)
      – Capture cDNAs or mRNAs from solution
    • Computer based parallels
      – Compare with expressed sequences from other species
      – Compare with ESTs

How are genes identified? (contd)
• Ways to identify genes in regions (contd)
  – Identify features found in typical promoters
    • What are promoters? Regions 5’ to a gene that are required for expression
    • CpG islands – regions in eukaryotic genes that are hypomethylated
      – Undermethylated
      – Remember that methylation of DNA typically inhibits gene expression
      – Digest with enzymes that have CG in recognition site that would be inhibited if methylated, e.g., SacII CCGCGG, run gel to check
        » If nonmethylated (expressed) enzyme will cut, region will be hypersensitive, get chopped up.
        » If methylated (not expressed) enzyme will not cut and region will not get digested
    • DNase I hypersensitive sites
      – Similar principle – transcriptionally active DNA is “open”
      – If open, it is more sensitive to DNase I than non-active DNA
      – Test by digestion and gel electrophoresis

How are genes identified? (contd)
• Ways to identify genes in regions (contd)
  – Exon-trapping
    • Insert genomic clone into “intron” between two exons
    • Transfect into cells
    • Assay for size of transcript
      – Known size from two exons
      – If genomic clone has exon – size will increase
    • Very rarely used – much too painful

How are genes identified? (contd)
• Duchenne muscular dystrophy (DMD) first gene positionally cloned
  – Good example of how different things would be today
  – RFLP mapping identified a chromosomal region Xp21 where gene was
    • Many patients had translocations in the region
      – Chromosomes spliced to other chromosomes

How are genes identified?
(contd)
• Duchenne muscular dystrophy (DMD) first gene positionally cloned
  – One group did genomic subtraction cloning
    • Strategy enriched for regions lost in DMD patient
      – Hybridized enzyme-digested normal DNA with excess sheared DMD DNA
      – Only hybrids with restriction site ends could be cloned
      – Only hybrids with such ends would be from region absent in DMD DNA (since DMD DNA was in excess)
    • Made a library and tested clones by Southern blot to normal and DMD DNA
  – Today – for a sequenced organism – just go to the database, identify sequences in region of interest and verify by Southern or PCR as above
    • Or look in large insert libraries with breakpoints
    • Or do cDNA subtraction between tissues from normal individual and DMD individual
      – Presumes knowledge of source of mutation, i.e., the defect resides in the affected tissue
      – Would not detect a defect in inducing factor from other tissue

How are genes identified? (contd)
• Duchenne muscular dystrophy (contd)
  – 2nd group cloned breakpoints
    • Girl with translocation between X and 21
    • 21 was rich in rRNA genes so made a radiation hybrid panel from patient
    • Identified hybrid cell carrying the breakpoint – made a genomic library from it
    • Screened library for clones with both rRNA genes and X chromosome specific sequences
      – Long, tedious process with many more failures than successes
      – Finally found 1 such genomic clone
    • Mapped this genomic DNA to male patients with DMD and found deletions in many of them
  – DMD gene is largest known – 2.4 megabases
  – cDNA cloning followed – protein is dystrophin

How are genes identified? (contd)
• The problem with all of these methods is that experiments are required
  – What do we do when sequences are coming in at the rate of tens of gigabases/month?
  – Need large-scale, robust, computerized methods to identify genes and annotate sequences!
• All bioinformatics depends on databases
  – UCI bioinformatics groups are among the best at designing and constructing databases
    • http://www.igb.uci.edu/servers/databases.html
• Three major databases of sequences (automatically duplicated)
  – GENBANK http://www.ncbi.nlm.nih.gov/Genbank/index.html
  – DNA Databank of Japan http://www.ddbj.nig.ac.jp/
  – European Molecular Biology Laboratory (EMBL) http://www.ebi.ac.uk/embl/index.html

Dystrophin as an example
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=1756

Genome annotation – how to identify genes?
• Gene identification/prediction is important but difficult
  – Large variety of methods and algorithms to predict exons
  – To identify genes must first identify open reading frames (ORFs)
    • When dealing with cDNAs – look for regions that code for proteins
      – Do all genes code for proteins?
      – Correct reading frame for a sequence is assumed to be the longest one with no stop codons (TGA, TAA, TAG)
    • Lots of tricks can be employed
      – Codon frequency for an organism
        » Coding sequences follow codon usage
        » Noncoding sequences do not, often have lots of stop codons
      – Consensus sites
        » Kozak translational initiation CCRCCATGG
    • What is a very important consideration when searching sequences to predict ORFs?
      – Sequence must be accurate
        » Incorrect base calls are troublesome
        » But indels (insertions or deletions) are disastrous because they shift the reading frame

Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods
  – Two major methods are in use
    • Homology searches
      – Compare with other known sequences
    • Ab initio (from the beginning) prediction
      – Use algorithms to recognize common features and predict genes
        » Promoters
        » Splice sites
        » Polyadenylation sites
        » ORFs
  – Generally, microbial genomes are much easier to annotate – WHY?
    • Simply identify ORFs > 300 bp (100 amino acids)
      – Works very well
      – But can miss small coding sequences
      – Must run on both strands because there are shadow genes (overlap on two strands)
    • Using a variety of programs, can predict genes in bacterial genomes
      – this week -> environmental sequencing papers

Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods (contd)
  – Huge variety of programs available
    • Neural networks – attempt to model learning process
      – Build decision trees, use probabilistic reasoning
    • Rule-based systems
      – Rules often not clear
      – Have trouble with exceptions
    • Hidden Markov models
      – Break sequences down into small units based on statistical analysis of composition
        » Hexamers appear to be optimal size to search
      – Classify sequences into types or “states”
      – Identify transitions between states
      – Very useful for large number of purposes

Genome annotation – how to identify genes (contd)?
• Computer based gene prediction methods (contd)
  – Training sets are used to “teach” programs how to solve problems
    • Training set is actual data – genes with known features
  – Programs use training sets to classify new data
    • Neural networks use training data to build decision trees
    • Rule-based systems use training data to generate rules
    • HMMs build a table of probabilities for states and transitions
  – Pierre Baldi in IGB is UCI expert in machine learning
• How well do gene-predicting programs work?
  – Extremely well on bacterial genomes
  – Fairly well on simple eukaryotic genomes
  – Variable on complex genomes
  – Rule of thumb – use a group of programs and look for areas of agreement among them
  – The best current programs combine ab initio predictions with similarity data to define a probability model

Identification of gene function
• You have identified a gene – what is its function?
  – Always look for similarity to known sequences
    • Book suggests Swiss-Prot
    • GENBANK translated database is best
    • BLAST is tool to use
    • Amino acid searches more sensitive than nucleotide searches
      – Because identical amino acid sequences might only be 67% identical at nucleotide level
  – What might you find?
    • Match may predict biochemical and physiological function
      – e.g., a known enzyme from another organism
    • Match may predict biochemical function only
      – e.g., a kinase
    • Match a gene from another organism with no known function
      – May match ESTs or ORFs from other organisms
    • Match a known gene with partly characterized function
      – Search leads to clarification of function – NifS in book
    • Might not match anything at all
      – Expect this will happen less and less

[Pie chart: Up-regulated by TTNPB (> 1.5 fold, p < 0.01, n=334), by functional category: Hypothetical 26% (90); Unidentified 21% (73); Transcriptional 15% (53); Housekeeping 15% (52); Signal transduction 8% (26); Extracellular matrix 3% (10); Energy metabolism 3% (9); Tumor suppressor 1% (2). Also shown: Cytoskeleton, Miscellaneous, Retinoid metabolism, Neural.]
Identification of gene function (contd)
• You have identified a gene – what is its function? (contd)
  – Does the sequence contain an obvious functional motif?
    • Homeobox or other consensus DNA binding domain?
    • Kinase domain?
    • Serine protease, etc.
  – InterPro database allows one to compare a protein sequence with a whole family of structural databases
    • http://www.ebi.ac.uk/interpro/
    • HICAICGDRSSGKHYGVYSCEGCKGFFKRTVRKDLTYTCRDSKDCMIDKRQRNRCQYCRYQKCLAMGM
  – Other sorts of similarity searches
    • Identify protein secondary structure motifs
      – Alpha helix, beta sheets, hydrophobicity
      – Amphipathic helices
      – Overall polarity of sequences
    • Not used much

Identification of gene function (contd)
• You have identified a gene – what is its function? (contd)
  – Gene ontology – highly structured vocabulary for gene classification
    • Genes are classified using this vocabulary
    • Relates protein function with cellular or organismal functions
      – Nucleic acid binding
      – Cell division
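
The question "does the sequence contain an obvious functional motif?" can be illustrated with a small pattern scan over the protein fragment quoted on the InterPro slide. This is only a sketch: the pattern below is an assumption written in PROSITE-like syntax for illustration (cysteine pairs with fixed spacing, as in zinc-finger DNA-binding domains), it is not taken from the lecture, and the function names are hypothetical. A real analysis would submit the sequence to InterPro rather than rely on one hand-written pattern.

```python
import re

# Protein fragment quoted on the InterPro slide (whitespace removed).
QUERY = ("HICAICGDRSSGKHYGVYSCEGCKGFFKRTVRKDLTYTCRDSKDCMIDKRQRNRC"
         "QYCRYQKCLAMGM")

# ASSUMPTION: illustrative PROSITE-style pattern, not from the lecture.
# "x" is any residue, "(n)" a repeat count, "[..]" a set of allowed residues.
ZINC_FINGER_LIKE = "C-x(2)-C-x-[DE]-x(5)-[HN]-[FY]-x(4)-C-x(2)-C-x(2)-F-F-x-R"


def prosite_to_regex(pattern):
    """Translate the small PROSITE-like subset used above into a regex."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, count = m.groups()
        token = "." if residue == "x" else residue
        if count:
            token += "{%s}" % count
        parts.append(token)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    for start, hit in find_motif(QUERY, ZINC_FINGER_LIKE):
        print("motif-like match at residue", start, ":", hit)
```

A single hand-written pattern is brittle, which is one reason databases that integrate many signature sets, as the slide describes for InterPro, are preferred in practice.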
Genome annotation
• Extremely important as number of sequences increases
  – Goals are to identify
    • all of the sequences
    • all of the features of each sequence
    • all of the functions of the identified genes
  – Often annotation does not agree with known function
    • Human error
    • New and updated information not propagated to database
    • Inaccurate sequencing
    • Sometimes annotation is correct but protein lacks function under certain conditions (e.g., need cofactors)
  – Gold standard for functional analysis is loss-of-function analysis
    • Most accurate annotation
  – Common to have “annotation jamborees” where biologists and bioinformaticians come together to annotate new sequences
    • Xenopus tropicalis jamboree will be in Spring 2005

Comparative genomics
• Study of similarities and differences between genome structure and organization
  – How many genes? Chromosomes?
  – Genome duplications
  – Gene loss
• Driving forces
  – Understanding evolution in molecular terms
  – Sequence annotation and function identification
    • Sequences with important functions tend to be conserved across evolution
• Orthology vs paralogy
  – Homology – descended from a common ancestor
  – Orthologs are homologous genes in different organisms that encode proteins with the same function and which have evolved by direct vertical descent
  – Paralogs are homologous genes that encode proteins with related but non-identical functions
    • Derived by gene duplication

Midterm review
Mean – 29 +/- 5.5
Range 23-39

1. (10 points) It is 2006 and a NASA Mars scout mission has returned soil samples from an area of Mars formerly covered by a sea. Surprisingly, the sample contains viable microorganisms and even more remarkably, these organisms are apparently eukaryotes (have a nucleus).
One of your colleagues has figured out how to culture these organisms, which were named Mars burroughsii, in the laboratory. Unfortunately, in the process he accidentally discovered that they are quite pathogenic to mammals. Worse, a sample was mistakenly poured down the drain and is now contaminating Newport Beach. Your PI is a specialist at working with weird microorganisms and she has decided to take the lead in determining the genome sequence of M. burroughsii.
a) (3 points) A genomic library will be necessary to facilitate the mapping and sequencing of the genome. What type of library will you make, i.e., what type of vector will you use? Justify your choice. What sort of equipment will you require?
b) (3 points) Describe how you will make a physical map of the M. burroughsii genome prior to sequencing.
c) (4 points) Outline a method to quickly generate a high quality, finished genome sequence, which will be essential in understanding the pathogenicity of M. burroughsii.

Midterm review
a) It would be best to make a BAC or PAC library since these hold large inserts and are relatively stable compared with lambda, cosmid and YAC libraries. You will need standard laboratory equipment including an electroporator and, most critically, pulsed field gel electrophoresis (PFGE).
b) You would want to map the large insert clones from the library made in a) by hybridization, fingerprinting or map-as-you-go.
c) A high quality finished genome requires a long-range map (like the one you made in b) and shotgun sequencing. Most sequencing centers today would choose to combine whole genome shotgun sequencing with BAC end sequencing to facilitate finishing.

Midterm review
2. (5 points) You look around the lab to find an E. coli strain that will be suitable for propagating the library you made above.
You can find two strains that might be suitable. Their genotypes are the following (recall that the bacteria are mutated or deficient in the genes listed):
strain A - mcrA, Δ(mrr-hsdRMS-mcrBC), ΔlacX74, Φ80lacZΔM15, recA1, araD139, Δ(ara-leu)7697, galU, galK, endA1, nupG
strain B - mcrA, endA1, supE44, gyrA96, relA1, recA, recD, recJ, sbcC
Is either of these strains a good choice for your library? If so, which one? Or are both OK? Explain your reasoning (i.e., which features are good or missing).

Strain A is a good choice for PAC or BAC libraries because only restriction deficiency is required for maximum efficiency in library construction. There would be no real harm in the strain being recombination deficient, but this is only required for vectors that exist in multiple copies/cell (such as lambda or cosmids). The desirable features are mcrA, Δ(mrr-hsdRMS-mcrBC) and Φ80lacZΔM15 (although the last is not essential). Strain B is not a good choice because it is not restriction deficient.

Midterm review
3. (4 points) Your PI asks you to make a normalized M. burroughsii cDNA library for EST sequencing and has suggested that you ensure the library is well normalized by hybridizing the tester and driver to a large Cot½ value (e.g., 50). Is this the correct approach? Why or why not? If you plan to use this cDNA library for many purposes, would it be a better idea to subtract it? Why or why not?

To make a normalized library, one needs to hybridize to a LOW Cot½ value (e.g., 5) in order to reduce the frequency of abundant cDNAs. Hybridization to a high Cot½ value (e.g., 50) will deplete the library in rare sequences as well as abundant sequences. If the library will be used for multiple purposes, it would be better to normalize it since subtraction removes a significant number of sequences that are in common between the driver and tester.
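
The reasoning in this answer can be made concrete with the textbook ideal second-order reassociation equation, C/C0 = 1 / (1 + Cot/Cot½), which gives the fraction of a sequence class still single-stranded (and so still recoverable for the library) after hybridizing to a given Cot. The sketch below is an illustration under stated assumptions, not the course's calculation: the Cot½ values assigned to the "abundant" and "rare" classes are hypothetical, and the values 5 and 50 from the answer are treated as the Cot reached in the hybridization.

```python
def fraction_single_stranded(cot, cot_half):
    """Ideal second-order reassociation: C/C0 = 1 / (1 + Cot / Cot1/2)."""
    if cot < 0 or cot_half <= 0:
        raise ValueError("cot must be >= 0 and cot_half must be > 0")
    return 1.0 / (1.0 + cot / cot_half)


# ASSUMPTION: hypothetical Cot1/2 values for two abundance classes.
# Abundant cDNAs reassociate quickly (small Cot1/2); rare ones slowly.
CLASS_COT_HALF = {"abundant": 0.5, "rare": 50.0}


def remaining_by_class(cot):
    """Fraction of each class left single-stranded after reaching this Cot."""
    return {name: fraction_single_stranded(cot, cot_half)
            for name, cot_half in CLASS_COT_HALF.items()}


if __name__ == "__main__":
    for cot in (5.0, 50.0):
        left = remaining_by_class(cot)
        print("Cot = %4.1f  abundant left: %.3f  rare left: %.3f"
              % (cot, left["abundant"], left["rare"]))
```

Under these assumed values, stopping at the low Cot removes most of the abundant class while leaving most of the rare class single-stranded, whereas continuing to the high Cot also removes half of the rare class, which is the depletion the answer warns about.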

Midterm review
4. (3 points) What are three important goals that one should always have when constructing a genomic library?

• Faithful representation of the genome – no chimerism or deletions, all sequences represented, at least five fold coverage of the genome
• Easy to screen
• Easy to produce enough DNA for further analysis

Midterm review
5. (3 points) What are three factors that determine whether a sequence can be stably propagated in bacteria?

• Toxicity
• Restriction
• Recombination

Midterm review
6. (4 points) An international panel of experts has suggested that an EST project be implemented for M. burroughsii in order to speed up the identification of the pathogenic protein discussed in question 1. Describe generally how you would go about making a library of full-length cDNAs, including which type of vector you would choose and why. Assume that this library will be used for EST sequencing and functional analysis of proteins expressed from the library.

• You would begin by isolating RNA and selecting poly A+ RNA to enrich for mRNA.
• For full-length library construction, use the oligo-capping or cap trapping method to select those mRNAs that have 5’ cap structures.
• Synthesize first strand cDNA with reverse transcriptase and second strand cDNA with DNA polymerase I.
• Add linkers or adapters, restriction digest and ligate to the vector of interest.
• If the library is to be used for EST sequencing and functional analysis of proteins expressed from the library, it will probably be best to use a plasmid vector.

Midterm
7. (5 points) This graph depicts an experiment in which genomic DNA was hybridized with an RNA tracer.
What conclusions can be drawn about the nature of the RNA that is transcribed from this DNA? What implications do these conclusions have for large-scale genome sequencing projects such as Drosophila or human?

The graph shows that the RNA tracer only hybridizes with moderately repetitive and nonrepetitive sequences, suggesting that RNA is only transcribed from these regions. This is the justification for why sequencing projects do not worry about regions of highly repetitive DNA. These regions are not likely to be transcribed into RNA and therefore are not cost-effective to sequence.

Midterm
8. (3 points) You have identified a cDNA that encodes a protein which is essential for the pathogenesis of M. burroughsii. The cDNA is 7 kb long. How would you fully and completely determine the sequence of this cDNA so that it might be used to develop a vaccine against M. burroughsii infection? Remember, time is of the essence since people are getting very sick as a result of M. burroughsii.

Since time is of the essence, it is not acceptable to use methods that will take a long time to generate the finished sequence. This excludes shotgun sequencing and primer walking. You would want to use either restriction fragment cloning or progressive deletions, such as with Exonuclease III, combined with dideoxy sequencing to quickly get the sequence.

Midterm
9. (3 points) In studying the dispersion of M. burroughsii in the ocean, an epidemiologist notices that many more people in Newport Beach get sick from swimming in the ocean than do those in San Diego, although the number of organisms in the water at both locations is indistinguishable. Deductive logic suggests that there is something different in the human populations in Newport Beach and San Diego that mediates this apparent differential susceptibility to M. burroughsii.
You and your colleagues have discovered that M. burroughsii infects humans by binding to a protein expressed on the surface of intestinal cells called CadF. Since the human genome has been sequenced, you know the sequence of the gene encoding this protein. Describe generally how you might identify single nucleotide polymorphisms in the CadF gene and perform a relatively simple test to identify which people in the population might carry this polymorphism and be resistant to infection with M. burroughsii.

The method I was looking for was to use a SNP chip to quickly identify differences in the CadF gene among affected and unaffected people. Some very creative answers were given, and partial or full credit was given for ones that might work.
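
The comparison behind this answer, finding positions where individuals differ from the reference CadF sequence and checking which differences track with resistance, can be sketched in a few lines. This is not the SNP chip method the answer asks for: it is a plain comparison of sequences that are assumed to be already aligned and of equal length, and every sequence, position and function name below is hypothetical and made up for illustration.

```python
from collections import Counter


def find_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for mismatches.

    Assumes the two sequences are already aligned and the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]


def snp_counts(reference, samples):
    """Count how many samples in a group carry each SNP."""
    counts = Counter()
    for seq in samples:
        counts.update(find_snps(reference, seq))
    return counts


# Hypothetical toy data: a 12-base stretch standing in for part of CadF.
REFERENCE = "ATGGCTTACGGA"
AFFECTED = ["ATGGCTTACGGA", "ATGGCTTACGGA"]
UNAFFECTED = ["ATGGCTCACGGA", "ATGGCTCACGGA"]

if __name__ == "__main__":
    print("affected:  ", dict(snp_counts(REFERENCE, AFFECTED)))
    print("unaffected:", dict(snp_counts(REFERENCE, UNAFFECTED)))
```

In this toy data the only variant is a T to C change at position 7, carried by both unaffected individuals and neither affected one; a real study would need many more people and a statistical test before calling the variant protective.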