Using Vertebrate Genome Comparisons to Find Gene Regulatory Elements

Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Ross Hardison
University of California at Santa Cruz: David Haussler, Jim Kent
National Human Genome Research Institute: Laura Elnitski
Children’s Hospital of Philadelphia: Mitch Weiss
Lawrence Livermore National Laboratory: Ivan Ovcharenko

Comparative genomics to find functional sequences
• Find common sequences: blastZ, multiZ
• Identify functional sequences: ~145 Mbp
• All mammals: 1000 Mbp
• Also birds: 72 Mb
• Papers in Nature from the mouse, rat, and chicken genome consortia, 2002, 2004
[Figure: genome sizes in million base pairs (Mbp) for human, mouse, and rat; values shown: 2,900, 2,400, 2,500, 1,200]

Genome sequence assemblies and sources
Species | Assembly | Genome size | Assembly depth | Source
Human | hg17 | 2.851Gb | “finished” | International human genome sequencing consortium
Chimp | panTro1 | ca. 2.8Gb | 4x | Chimpanzee sequencing consortium
Mouse | mm5 | 2.6Gb 1.9Gb | “finished” | Mouse genome sequencing consortium
Rat | rn3 | 2.57Gb | | Baylor and collaborators
Dog | canFam1 | 2.5Gb | 7.6x | Broad Institute and Agencourt Bioscience
Cow | bosTau1 | ca. 3Gb | 3x | Baylor and collaborators
Opossum | monDom1 | 3.5Gb | 6.5x | Broad Institute
Platypus | ornAna0 | | | Washington University Genome Seq Center
Chicken | galGal2 | 1.2Gb | 6.63x | Washington University Genome Seq Center
Frog | xenTro1 | ca. 1.3Gb | 7.4x | DoE Joint Genome Institute
Zebrafish | Zv4 | 1.56Gb | 5.7x | Zebrafish Sequencing Group at the Sanger Institute
Tetraodon | tetNig1 | 0.385Gb | 7.9x | Genoscope and Broad Institute
Fugu | fr1 | 0.319Gb | 5.7x | DoE Joint Genome Institute and Singapore IMCB

Coverage of human by alignments with other vertebrates ranges from 1% to 91%
[Figure: divergence times from human, in millions of years: 5.4, 91, 92, 173, 220, 310, 360, 450; label: 5%]

Distinctive divergence rates for different types of functional DNA sequences
Figure y-axis: percent of human genome (or of regions) not in alignments, 0 to 100. Series: Genome, Coding exons, Ultraconserved (HM), Log.
(Genome). Figure x-axis: time of divergence from common ancestor to human, 0 to 500 Myr ago.

Large divergence in cis-regulatory modules from opossum to platypus

cis-Regulatory modules conserved from human to fish
[Timeline: 91, 173, 310, 450 million years]
• About 20% of CRMs
• Tend to regulate genes whose products control transcription and development
• Recent reports:
  – Sandelin, A. et al. (2004) BMC Genomics 5: 99.
  – Woolfe, A. et al. (2005) PLoS Biol 3: e7.
  – Plessy, C., Dickmeis, T., Chalmel, F., Strahle, U. (2005) Trends Genet. 21: 207-10.

cis-Regulatory modules conserved from human to chicken
[Timeline: 91, 173, 310, 450 million years]
• About 40% of CRMs
• Noncoding sequences conserved from human to chicken tend to cluster in gene-poor regions
  – Conservation jungles
  – Hillier et al. (2004) Nature
• Stable gene deserts are conserved from human to chicken
  – Ovcharenko et al. (2005) Genome Res. 15: 137-145.
• Conserved noncoding sequences in stable gene deserts tend to be long-range enhancers
  – Nobrega, M.A., Ovcharenko, I., Afzal, V., Rubin, E.M. (2003) Science 302: 413.
Posters 120 (Bob Harris), 121 (Laura Elnitski), 192 (Ivan Ovcharenko)

cis-Regulatory modules conserved in eutherian mammals (and marsupials?)
[Timeline: 91, 173, 310, 450 million years]
• About 80-90% of CRMs
• Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.

Score multi-species alignments for features associated with function
• Multiple alignment scores
  – Binomial, parsimony (Margulies et al., 2003)
• PhastCons
  – Siepel and Haussler, 2003; Siepel et al., 2005
  – Phylogenetic Hidden Markov Model
  – Posterior probability that a site is among the 10% most highly conserved sites
  – Allows for variation in rates and autocorrelation in rates
• Factor binding sites conserved in human, mouse and rat
  – Tffind (from M.
Weirauch, Schwartz et al., 2003)
• Score alignments by frequency of matches to patterns distinctive for CRMs
  – Regulatory potential (Elnitski et al., 2003; Kolbe et al., 2004)

Computing Regulatory Potential (RP)
Alignment columns (seq1 seq2 seq3) and their collapsed-alphabet symbols, in order:
G G A: 1; T T T: 2; A G G: 1; C T T: 3; C C C: 4; T G A: 5; A: 7; C: 7; T A A: 6; A G A: 8; C C T: 3; G C G: 6; C C T: 3; A A A: 9
• A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet with characters s (for example, 1-9).
• Train two order t Markov models for the probability that t alignment columns are followed by a particular column in training sets:
  – positive (alignments in known regulatory regions)
  – negative (alignments in ancestral repeats, a model for neutral DNA)
  – E.g. frequency that 3 4 is followed by 5: 0.001 in regulatory regions, 0.0001 in ancestral repeats
• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.

$$\mathrm{RP} = \sum_{a \,\in\, \text{segment}} \log \frac{p_{\mathrm{REG}}(s_a \mid s_{a-1} \ldots s_{a-t})}{p_{\mathrm{AR}}(s_a \mid s_{a-1} \ldots s_{a-t})}$$

More species and better models improve discriminatory power of RP scores
Poster 257: James Taylor
[Figure: ROC curves for different RP scores, tested on a set of known regulatory regions from the HBB gene complex]

Galaxy metaserver for integrative analysis of genomic data
• Use servers at primary data repositories (e.g. UCSC Table Browser) to gather initial data
• Results stored and analyzed at Galaxy
• Operations
  – Union, intersection, subtraction
  – Clustering, proximity
• Bioinformatic tools:
  – Retrieve alignments
  – Ka/Ks
• Giardine, Riemer … Nekrutenko, Poster 90

How well do these alignment-based scores work in finding cis-regulatory modules?

RP and phastCons can discriminate most known functional elements from neutral DNA

Genes co-expressed in late erythroid maturation
• G1E-ER cells: proerythroblast line from mice lacking the transcription factor GATA-1.
  – Can restore the activity of GATA-1 by expressing an estrogen-responsive form of GATA-1
  – Allows cells to mature further to erythroblasts
• Use microarray analysis of each to find genes that increase or decrease expression upon induction.
  – Walsh et al. (2004) Blood; image from k-means cluster, GEO:
[Figure: repressed genes and induced genes; x-axis: time after restoration of GATA-1]

Predicting cis-regulatory modules (preCRMs)
• Identify a genomic region with a regulated gene.
• Find all intervals whose RP score exceeds an empirical threshold.
• Subtract exons.
• Find all matches to GATA-1 binding sites that are conserved (cGATA-1_BS).
• Intervals with RP scores above the threshold and with a cGATA-1_BS within 50 bp are preCRMs.

Test predicted cis-regulatory modules (preCRMs)
• Enhancement in transient transfections of erythroid cells
[Diagram: test fragment, HBG prom, FF luciferase; tk prom, Ren luciferase; dual luciferase assay in K562 cells]
• Activation and induction of reporter genes after site-directed, stable integration in erythroid cells
• Chromatin immunoprecipitation (ChIP) for GATA-1

7 of 24 Zfpm1 preCRMs enhance transient expression

9 of 24 Zfpm1 preCRMs enhance after stable integration at RL5

About half of the preCRMs are validated as functional
Assay | Number tested | Number positive | % validated
GATA-1 ChIPs | 5 | 5 | 100
Transient transfections | 64 | 18 | 28
Site-directed integrants | 54 | 24 | 44
All assays | 64 | 34 | 53

Conclusions
• Particular types of functional DNA sequences are conserved over distinctive evolutionary distances.
• Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).
• Alignments can be used to predict certain functional regions, including some cis-regulatory elements.
• The predictions of cis-regulatory elements for erythroid genes are validated at a good rate.
• Databases such as the UCSC Table Browser, GALA and Galaxy provide access to these data.
• Expect improvements at all steps.
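The preCRM prediction steps listed under "Predicting cis-regulatory modules (preCRMs)" are interval operations of the kind Galaxy offers (subtraction, proximity). A minimal sketch in Python follows, assuming half-open (start, end) coordinates on a single chromosome, RP scores already computed per window, and adjacent above-threshold windows merged into one interval. The function names, the merging step, and all coordinates and scores are hypothetical and are not taken from the talk.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), single chromosome (assumption)

def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into maximal intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def subtract(intervals: List[Interval], exons: List[Interval]) -> List[Interval]:
    """Remove every exon base from the intervals, keeping the leftover pieces."""
    exons = merge(exons)
    result: List[Interval] = []
    for start, end in merge(intervals):
        cursor = start
        for ex_start, ex_end in exons:
            if ex_end <= cursor or ex_start >= end:
                continue  # this exon does not touch the remaining part
            if ex_start > cursor:
                result.append((cursor, ex_start))
            cursor = max(cursor, ex_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result

def gap(a: Interval, b: Interval) -> int:
    """Distance in bp between two intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))

def predict_precrms(scored_windows, exons, binding_sites, threshold, max_distance=50):
    """Intervals above the RP threshold, minus exons, with a conserved site nearby."""
    above = merge([(s, e) for s, e, rp in scored_windows if rp > threshold])
    noncoding = subtract(above, exons)
    return [iv for iv in noncoding
            if any(gap(iv, site) <= max_distance for site in binding_sites)]

# Hypothetical toy inputs: (start, end, RP score) windows, exons, conserved GATA-1 sites.
windows = [(100, 200, 0.02), (200, 300, 0.03), (500, 600, -0.01), (900, 1000, 0.05)]
exons = [(250, 400)]
cgata1_sites = [(280, 286)]
precrms = predict_precrms(windows, exons, cgata1_sites, threshold=0.0)
print(precrms)
```

With these toy inputs, the windows at 100-300 pass the threshold, the exon removes 250-300, and the remaining piece lies 30 bp from the conserved site, so it is kept. The window at 900-1000 passes the threshold but has no conserved site within 50 bp and is dropped.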
Many thanks …
Wet Lab: Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King
Alignments, chains, nets, browsers, ideas, …: Webb Miller, Jim Kent, David Haussler
PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU

Marsupial genome adds substantially to the conserved fraction of regulatory regions
[Figure: additive contribution of each 2nd species to conservation; y-axis: percent, 0 to 100; series: Primate, Eutherian, Marsupial, Monotreme, Avian, Amphibian, Fish; categories: Ultraconserved, Coding exons, miRNAs, CpG islands, Known regulatory regions, Functional promoters, cTFBS, Whole genome]

All preCRMs in Gata2 are functional in at least one assay
ChIP data are from publications from E. Bresnick’s lab.

The distal Major regulatory element of the human HBA gene complex is conserved in opossum but not beyond

Neutral DNA “cleared out” over 200 Myr
[Figure: percent of human not aligned (y-axis, 0 to 100) vs. divergence from common ancestor to human, Myr ago (x-axis, 0 to 500); species plotted: Chimp, Dog, Cow, Mouse, Rat, Opossum, Platypus, Chick, Frog, Fish]
Most human DNA is not alignable to species separated by more than 200 Myr.
Divergence dates from Kumar and Hedges (Nature 1998) and Hedges (Nature Rev Genet 2002)
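The regulatory potential computation described under "Computing Regulatory Potential (RP)" can be illustrated with a small sketch: two order-t Markov models are trained on collapsed-alphabet symbol strings, and a segment is scored by summing log likelihood ratios. This is only an illustration under stated assumptions. The add-one pseudocount smoothing, the natural logarithm, the function names, and the toy training strings are all hypothetical, and the published RP method may differ in each of these details.

```python
import math
from collections import defaultdict

def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Return prob(symbol, context) for an order-`order` Markov model.

    Counts how often each context of `order` symbols is followed by each
    symbol, then smooths with a pseudocount (an assumption of this sketch).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1.0

    def prob(symbol, context):
        followers = counts.get(tuple(context), {})
        total = sum(followers.values()) + pseudocount * len(alphabet)
        return (followers.get(symbol, 0.0) + pseudocount) / total

    return prob

def rp_score(seq, p_reg, p_ar, order):
    """Sum of log(p_REG / p_AR) over positions that have a full context."""
    score = 0.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        score += math.log(p_reg(seq[i], context) / p_ar(seq[i], context))
    return score

# Hypothetical toy data: symbols 1-9 stand for collapsed alignment columns.
alphabet = list(range(1, 10))
order = 2
regulatory_training = [[3, 4, 5, 3, 4, 5, 3, 4, 5]]      # stand-in for known regulatory regions
ancestral_repeat_training = [[3, 4, 1, 3, 4, 2, 3, 4, 9]]  # stand-in for neutral DNA

p_reg = train_markov(regulatory_training, order, alphabet)
p_ar = train_markov(ancestral_repeat_training, order, alphabet)

print(rp_score([3, 4, 5], p_reg, p_ar, order))  # positive: looks regulatory
print(rp_score([3, 4, 1], p_reg, p_ar, order))  # negative: looks like neutral DNA
```

For the example frequencies quoted on the slide, the pattern "3 4 followed by 5" would contribute log(0.001 / 0.0001) = log 10 to the score of any segment containing it.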