MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays

Thomas R. Ioerger (1), Ganesh Rajagopalan (1), Debby Siegele (2)
(1) Department of Computer Science
(2) Department of Biology
Texas A&M University

Analyzing Gene Expression Patterns
• DNA microarrays
• ~4000 genes for E. coli, ~6000 genes for yeast
• Compare expression levels between conditions
• Example: starvation response in E. coli
  – starve cells for nutrient sources
  – reintroduce => recovery => exponential growth
  – which genes show changes in response?
• types of response:
  – up-regulation
  – down-regulation
  – transient response (spike)
  – (arbitrary temporal patterns)
• Problem: can cluster genes based on response pattern, but then what?
  – not all genes in cluster are regulated the same way
• Couple with genomic analysis
  – search for common motifs in upstream regions
  – subsets of co-regulated genes within clusters
• Assumptions:
  1. regulation occurs by interaction of transcription factors with small motifs (~10-20bp) within several hundred bp of transcription start site
  2. among many motifs, the ones of interest will be common to some genes in a cluster, but not found in any genes outside (with different responses)
  3. the motif does not have to be shared by all genes in the cluster, only a subset

Related Work
• Many algorithms exist for motif finding
  – assume cluster (gene set) is already defined
  – word/string analysis models
  – probabilistic models
    • Gibbs sampling (AlignACE, MotifSampler)
    • Expectation Maximization (MEME)
    • HMMs
  – graph algorithms (e.g. clique)
    • Pevzner and Sze
  – what if motif only appears in a subset of genes?
    • count as parameter in MotifSampler, MEME

Overview: Our Approach
1. Definition of regulation patterns
2. Extraction of upstream sequences (for up-reg)
3. Define control set (genes with no change)
4. Make a list of all 12-mers in upstream regions
5. Find motifs that occur (more than once) in up-regulated set, but not at all in control set
6. Group the motifs using clustering, form consensus of patterns

Define Regulation Patterns
• measured at 0, 5, and 15min after recovery
• discrete representation of changes in expression levels
• relative to exponential growth phase conditions
  +1: >2-fold increase
  -1: >2-fold decrease
   0: otherwise (no significant change)
• up-regulation patterns: (0,1,1) (0,1,0) (0,0,1) (-1,1,1) (-1,1,0) (-1,0,1)
• define control set: (0,0,0) (1,1,1) (-1,-1,-1)

Extraction of Upstream Sequences
• nominally, 600bp upstream of translation start site (i.e. ORF; not transcription start)
• If gene is a member of an operon:
  – take 300bp upstream of gene
  – plus 300bp upstream of translation start of first gene in operon
• databases:
  – K12 sequence: GOLD
  – operon relationships: E. coli Linkage Map (Berlyn et al.)
• use reverse complement if transcribed in reverse

Pre-processing
• extract all 12-mers (overlapping) from upstream regions of up-regulated genes
• note: better than DFS
• remove those that appear in the control set
• remove those that are dissimilar to everything else ("de-noising")
  – score = mean distance to all motifs not in same upstream region or operon
  – remove if score > ~9/12 mismatches

Clustering
• compute similarity matrix among motifs
• repeatedly merge closest neighbors
  – minimum spanning tree
  – single-linkage clustering
• Stop merging when dist > 3/12 mismatches
• Form consensus: relax constraints on nucleotides at position by disjunction
  – ACCATGGTATC
  – ACGATGGTATT
  – ACTATAGTATC
  – AC(CTG)AT(AG)GTAT(TC)

Experiments
• Starvation of E. coli for glucose in medium
• 3 time-points: starved (0min), 5min, 15min
• Data collected in Siegele lab
• up-regulated: 22 genes
• control set: 1361 genes

Motifs Found

ID   Motif          Gene name
1    AAsAAwTTmAwA   CmtB, ygjR, cysD
2    CmwTTkTTyTTC   CysH, B3914, MetR
3    TTCTwHTgAwAT   B1587, MetF, FliY
4    wTVAACwThCAA   B1587, asnB, cysA,P,W
5    rAkTTTwTTCAT   B3914, MetR, MetF
6    CAArTwTTTwTr   CmtB, yhaV, cysD
7    ATwAATAATksw   B1587, yhaV, CmtB
8    ACsdTTTTTmTw   CmtB, asnB, b3914, ygjR
9    rAAwTTmATAAT   MetF, CmtB, ygjR
10   vwTTAATAATkC   CmtB, b1587, yhaV, MetF
11   ATwTTGAATTww   AsnB, metR, metF
12   yTTTkhGATATT   YfiA, cysD, fliY
13   AkTTTwTTCATy   B3914, metR, metF

Sequence Logos

Distance to Transcription Start

Other Forms of Validation
• Palindromicity: 11/13 motifs have index > 0.5
• TRANSFAC database:
  – e.g. motif 2 matches pattern for MetJ-MetF site
  – a number of other hits for known transcription factors
• biological verification awaits...
  – role in regulation pathway for starvation response?

Conclusions
• Augment cluster-analysis of expression patterns with motif analysis
• Efficient method for generating candidates
  – from 12-mers in upstream regions
• Efficient method for screening them
  – empirically, against a control set, rather than probabilistic background model
• Advantage: Pattern does not have to be in all the genes in a set
• Challenges: defining appropriate upstream regions and the right control set (as filter)
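
The core steps on the slides (discretizing expression changes, listing 12-mers, screening against the control set, clustering, and forming a consensus) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, input is assumed to be plain upstream-region strings and expression ratios relative to exponential growth, and the "de-noising" step and the "more than once" requirement are left out for brevity. It relies on the fact that single-linkage merging stopped at a distance threshold yields the connected components of the graph linking motifs within that threshold, so a union-find structure is used here in place of an explicit merge loop.

```python
from itertools import combinations

K = 12  # motif length used on the slides

# Pattern sets copied from the "Define Regulation Patterns" slide.
UP_PATTERNS = {(0, 1, 1), (0, 1, 0), (0, 0, 1), (-1, 1, 1), (-1, 1, 0), (-1, 0, 1)}
CONTROL_PATTERNS = {(0, 0, 0), (1, 1, 1), (-1, -1, -1)}


def discretize(ratios, fold=2.0):
    """Map expression ratios (assumed relative to exponential growth) to +1/-1/0."""
    return tuple(1 if r > fold else -1 if r < 1.0 / fold else 0 for r in ratios)


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_motifs(up_regions, control_regions, k=K):
    """k-mers seen in up-regulated regions but never in the control set."""
    control = set().union(*(kmers(s, k) for s in control_regions))
    found = set().union(*(kmers(s, k) for s in up_regions))
    return found - control


def single_linkage(motifs, max_dist=3):
    """Group motifs so that any two linked by a chain of <= max_dist steps share a cluster."""
    motifs = sorted(motifs)
    parent = list(range(len(motifs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(motifs)), 2):
        if hamming(motifs[i], motifs[j]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i, m in enumerate(motifs):
        groups.setdefault(find(i), []).append(m)
    return sorted(groups.values())


def consensus(cluster):
    """Per-position disjunction; the order inside parentheses is arbitrary (sorted here)."""
    out = []
    for column in zip(*cluster):
        bases = sorted(set(column))
        out.append(bases[0] if len(bases) == 1 else "(" + "".join(bases) + ")")
    return "".join(out)


if __name__ == "__main__":
    example = ["ACCATGGTATC", "ACGATGGTATT", "ACTATAGTATC"]
    print(consensus(example))  # AC(CGT)AT(AG)GTAT(CT)
```

The printed consensus matches the example on the Clustering slide up to the ordering of alternatives inside each parenthesized group. A gene would count as up-regulated in this sketch when `discretize(...)` of its three time-point ratios falls in `UP_PATTERNS`, and as a control gene when it falls in `CONTROL_PATTERNS`.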