Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
University of Nebraska at Omaha Innovative Database Models and Advanced Tools in Bioinformatics Hesham H. Ali UNO Bioinformatics Research Group Department of Computer Science College of Information Science and Technology Key Challenges Facing Bioinformatics Research • Significant gaps between tool developers and tool users – Different objectives – Different funding agencies – Different academic cultures • Significant problems with available Biological Data – Archival based – Lack of structure Source: ncbi.nih.gov Problems with Current Biological Data • The availability of large biological data and the increasing rate in producing new data, available in public data banks or via microarray data • The increasing pressure to maximize the use of the available data, particularly to impact key related industries (biotech companies, biotech drugs) • The large degree of heterogeneity of the available data in terms of quality UNO Bioinformatics Research Group • Group Triangle – Research motivated by real biological problems – Innovative Database Models – Advanced tools Biological Questions Addresses by our Group • Molecular diagnosis - Identification – Sequence based id – Enzyme (cutting order) based id – Instrumentation (Mass Spec, WAVE) based id • Basic Molecular Biology - Gene regulation – Microarray Analysis – Motif discovery/searching • Epidemiology and Clinical Research – Patient tracking system – Clinical expert system Bioinformatics Solutions to These Problems • Develop new inventive database models – Custom database for specific domains – Centralized Structured integrated data • Develop innovative Bioinformatics tools – Clustering algorithms – Advanced motif finding approaches Database Models • Customized (Private) Solution: Custom based Data Base Model High degree of quality and consistency • Centralized (Public) Solution: New Curated Integrated and Structured DataBase Model Model One: Custom Databases • • • • • Allowing researchers to create custom sets of genetic data suited to their specific needs. Allowing researchers to control the quality of genetic data in their custom data sets through fine-tuning parameters. Searching data using optimal alignment algorithms, rather than using heuristic methods. Giving researchers/clinicians the ability to formulate sequence identification concepts and test their ideas against a validated database Incorporating information from GenBank if needed The Sequence Identification Problem • Identification of organisms using obtained sequences is a very important problem • Relying on wet lab methods only is not enough • Employing identification algorithms using signature motifs to complement the experimental approaches • Currently, no robust software tool is available for aiding researchers and clinicians in the identification process • Such a tool would have to utilize biological knowledge and databases to identify sequences • Issues related to size of data and quality of data are suspect and would need to be dealt with Nebraska gets its very own organism While trying to pinpoint the cause of a lung infection in local cancer patients, they discovered a previously unknown micro-organism. And they've named it "mycobacterium nebraskense," after the Cornhusker state. It was discovered few weeks ago using Mycoalign: A Bioinformatics program developed at PKI Source: Omaha World Herald, March 21, 2005 Model Two: Centralized Database - the Integrated Model A new integrated model based on: – Organized and curated database – True non-redundancy by having one record for each polymorphic set with pointer to the rest of the set if needed – Allowing advanced queries – Being user-friendly and employing true automation – Employing various algorithms with different levels of accuracy and speed for conducting homology searches. The Clean Gene Package • A set of integrated database and alignment tools: – – – – – – – Edited and curated Web based Of manageable size Based on hierarchical database model Utilize various alignment algorithms Allows advanced automated queries Allows fast and accurate searches The Key Challenges • The New Structured relational database model • Identification of equivalence classes of records (polymorphic sets) • Identification of a good representative for each set • Curation and classification • Accurate annotation • Advanced data mining tools • A user-friendly interface that employs true automation for interfacing with the database Tool I: Clustering Biological Data • Clustering is a fundamental technique in finding a structure in a collection of unlabeled data. • Basically, clustering is the process of organizing objects into groups whose members are similar in some way. • A good Clustering tool is a key component in analyzing microarray data Message Passing Clustering (MPC) • Inspired by real-world situations: elements with similar attributes cluster together simultaneously • Advantages: – Easy to understand and use. – Taking the advantage of communication among data objects, MPC is able to balance the global and local structure and be performed in parallel. – “Message” has flexible structure which allows further development to fit to different research interests. – We have extended the basic MPC to • Weighted MPC • Stochastic MPC • Semi-supervised MPC Basic MPC • The phylogenetic trees of Mycobacterium (9 species, 34 strains), constructed by the Neighbor Joining and MPC method. 3M.intracellulareMac-D M.intracellulare ATCC 4M.intracellulareMac-J M.intracellulare S 348 M.intracellulare S 348 1M.intracellulareMin-A 3M.chelonaeMch-C M.chelonae DSM 35770 * * * 43276 35752 19536 1M.chelonaeMch-A M.chelonae ATCC M.chelonae M.chelonae ATCC 2M.chelonaeMch-B M.xenopi S 88 1M.xenopiI/Mxe-A M.xenopi ATCC 19250 3M.xenopiIII/Mxe M.xenopi S 91 2M.xenopiII/Mxe-B M.xenopi 1M.kansasiiMka-A M.kansasii ATCC 12478 # 14470 2M.gordonaeMgo-B M.gordonae ATCC 35756 4M.gordonaeMgo-D M.gordonae Bo 10681/99 M.gordonae Bo 11340/99 3M.gordonaeMgo-C M.gordonae Bo 9411/99 5M.gordonaeMgo-E 1M.gordonaeMgo-A M.gordonae ATCC M.gordonae 2M.kansasiiMka-B M.kansasii S 221 # # M.kansasii S 536 # 3M.kansasiiMka-C M.kansasii DSM 44431 # 5M.kansasiiMka-F M.kansasii S 233 4M.kansasiiMka-D 3M.peregrinumMpe M.peregrinum S 254 ^ ATCC 14467 ^ M.fortiutum ATCC 49403 $ 1M.fortiutumMfo M.peregrinum 1M.peregrinum 2M.peregrinumMpe M.peregrinum ATCC 700686 ^ 2M.flavescensMfla-B M.flavescens DSM 43531 1M.flavescensMfla-A ATCC14474 M.flavescens ATCC 14474 M.flavescens 3M.flavescensMfla-A M.flavescens ATCC 23033 3M.terraeIII M.terrae S 281 M.terrae ATCC 15755 M.terrae 1M.terraeI M.terrae DSM 43541 2M.terraeII 4M.fortuitumMfo M.fortiutum ATCC 6841 $ 49404 $ M.fortiutum ATCC 43266 $ 3M.fortuitumMfo 2M.fortuitumMfo M.fortiutum ATCC M.intracellulare ATCC 35847 * 2M.intracellulareMAC-E 5M.intracellulareMac-L M.intracellulare S 350 a. NJ * M.xenopi S 88 M.xenopi S 91 M.xenopi ATCC 19250 M.kansasii S 233 M.kansasii DSM 44431 M.kansasii S 536 M.kansasii S 221 M.kansasii ATCC 12478 M.intracellulare S 350 M.intracellulare ATCC 35847 M.intracellulare S 348 M.intracellulare ATCC 35770 M.intracellulare ATCC 13950 M.gordonae Bo 10681/99 M.gordonae Bo 9411/99 M.gordonae Bo 11340/99 M.gordonae ATCC 35756 M.gordonae ATCC 14470 M.terrae S 281 M.terrae DSM 43541 M.terrae ATCC 15755 M.peregrinum ATCC 700686 M.peregrinum S 254 M.peregrinum ATCC 14467 M.fortuitum ATCC 6841 M.fortuitum ATCC 43266 M.fortuitum ATCC 49404 M.fortiutum ATCC 49403 M.flavescens DSM 43531 M.flavescens ATCC 23033 M.flavescens ATCC 14474 M.chelonae DSM 43276 M.chelonae ATCC 19536 M.chelonae ATCC 35752 b. MPC M.xenopi M.kansasii M.intracellulare M.gordonae M.terrae M.peregrinum M.fortuitum M.flavescens M.chelonae Weighted MPC (WMPC)— with Adaptive Feature Scaling • Add weight associated with each clusterfeature pair. A single feature have multiple weights in different clusters and, in one cluster, all features may have different weights. • Update the weights during the clustering process. If on some dimension, the similarity between two going-to-merge clusters is high (/low), then we increase (/decrease) the weight on that dimension in the newly merged cluster. x1 x11 x x 21 x 2 x n xn1 w1 w11 w w 21 w 2 w k wk 1 x12 x22 xn 2 w12 w22 wk 2 x1 p x2 p xnp w1 p w2 p wkp • Test WMPC on Colon Cancer data (2000 genes in 40 tumor and 22 normal samples), giving higher classification rate. • Two benefits: – Strengthen the signal features while reducing the noise features, so making clustering results more accurate. – More importantly, reveal the contribution of the features (genes) to the clusters (samples), so that identify the set of genes responsible for certain diseases. Stochastic MPC (SMPC) Based on Kernel Functions Chance to merge? Tie ? Target Object a 0 b d e c Probability Density Estimates Based on Little Gaussian Kernel Functions Kick out ? f distance Kernel Density Estimates Using Gaussian Kernels Semi-supervised MPC • Clustering methods are considered unsupervised, meaning that the reduction is derived solely from the data rather than reflecting any previous knowledge. • Classification methods are considered supervised, because in the training phase, samples classes are already known, and we classify the objects into known groups. • Between clustering and classification: Unlabeled data with prior knowledge, such as constraints and hypotheses. • The goal of semi-supervised clustering is to guide the clustering, using the prior knowledge, to get better partitions. Semi-supervised MPC Instance-level Constraints • Colon Cancer data (2000 genes in 40 tumor and 22 normal samples). • We cluster samples with genes as features. Since the samples (instances) labels (constrains) are known, it is call instance-level constraints. 1 0.9 Accuracy (Rand index) 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 OP IP 0 0 5 10 15 20 25 30 35 40 45 50 Number of samples with an initial label 55 60 We want to see how well our method could separate the normal and tumor tissues based on different numbers of known labels for the samples as prior knowledge. OP: Output partition after clustering. IP: Input constraints presented before clustering. Combining the power of clustering with background information achieves better performance than either in isolation. Semi-supervised MPC Attribute-level Constraints Cluster t-statistic Cluster Size 1 -5.73831 22 2 -5.11799 19 T57079, T60860, R38758, R55828, H55759, X58401, H63361, R56207, U25435, R09138, H82741, H16096, H64576, L34059, U12140, H7834 6, H62177, M55422, H89481, U12134, H11460, R28608 R00254, H41147, R42765, T51139, D21205, H70635, H69819, H71122, H81802, R46731, M29550, H14607, R88749, J00146, H18451, M9685 9, X69115, X51435, H06189, 3 13.9581 17 L12723, M81637, D59253, X65873, M77698, T84049, M88108, R56401, H09719, X74330, D26018, D26067, R56630, X13482, H16991, T8674 9, H92195 4 9.81885 14 R39531, R21901, J02645, R34876, H72234, T92259, H47107, T65740, Z15115, D43951, L19437, X82103, U29175, T65438 5 -8.42722 42 M16029*, M16029*, D38549, U03100, L24038, J02906, X93499, M73481, T71207, R73129, X56253, M81758, R27017, U18934, X17651, H47 650, H89688, H23135, R39130, H82631, H45526*, H45526*, D90188, L06895, L34774, T58756, L11370, U34252, H78063, D21239, M96824, M24470, T77446, R53 612, R38513*, R38513*, R44677, M98045, T88712, R54317, M77477, H11054 6 14.9275 33 R42127, T90350, L25941, T65790, R22779, M90516, R11485, X68194, X63629, J05032, M34175, X74795, U34074, M58050, H20512, U1829 9, H80114, R67987, R56399, U10324, R56443, X17025, R40717*, R40717*, D13630, L24203, T96873, X53586, X73478, X14618, M55543, H42884, X54101 7 10.3472 12 D13315, T70062, U30825, T89115, D26600, T60437, D50063, R20804, R09479, R42837, D14658, H87473 8 -9.94085 14 T61333, M28882, M69066, M14539, T67406, M37721, T79831, M69135, M36634, U25138, R48303, U31525, T51539, H21042 9 8.59304 11 U22055, L41559, M88279, R37114, R96357, U29607, H82719, H04802, M22632, R27813, R60195 10 5.97321 29 H48027, M14200, T52642, R85464, T52343, T72503, T49732, T95048, T47584, R21547, L28010, M21339, T87527, T61338, T79813, L3684 4, M34192, M69238, L26405, H06970, M84326, T70063, X66363, H85878, X62153, T67905, T57872, Z48950, R62425 11 -7.47053 16 T93284, H80975, R73052, X05610, X79683, X55187, U30827, T64974, D31887, M92843, R67358, R46753, X68277, H50623, H15813, X5134 5 12 12.6072 12 T63591, T63370, T53412, T53396, T63133, R44884, R84411, X12671, R08183, M22382, D63874, D21261 13 5.07119 37 H46728, T57686, T68848, M11354, T69026, M14630, T72879, T47144, T61627, X61971, M94345, R42570, H13194, T65580, H15542, J0307 7, D30655 14 -5.35587 16 T61661, T62220, T96832, H80240, H86060, T63508, H24754, T63499, M11799, L28809, T63484, M57710, M33680, U12255, H88360, R7893 4 Gene Name (ORF) Gene clusters illustrating differentially expressed genes in tumor and normal samples a. Cluster 6 b. Cluster 8 Generalizations • WMPC extends the unweighted MPC to the weighted MPC. – If we initialize all entries in w to be 1 and never change the weights, MPC-AFS becomes a regular MPC. • SMPC extends the deterministic MPC to the stochastic MPC. – If we choose the particular kernel function (rectangular) and the particular bandwidth parameter (the minimum distance between the target cluster and all the others) to estimate the probability, SMPC can be reduced to a regular MPC . • Semi-supervised MPC extends unsupervised MPC to somehow supervised MPC. – Unsupervised MPC can be considered as a special case of semisupervised MPC with null background info and constraints. Tool II: Motif Finding/Data Mining Tool • Given a set of known binding sites, develop a representation of these binding sites that can be used to search for additional instances of those binding sites in the genome. • Given a set of sequences known to be co-regulated (i.e. by an expression array) determine the binding locations in the sequence and determine a representation for binding specificity. Motif Representations • • • • Static Sequences: tataat Regular Expressions (RegEx): tat[at].t Sequences with N errors: tataat:2 RegEx with N errors: tat[at].t:2 • Mononucleotide Scoring Matrices: • Dinucleotide Scoring Matrices (HMMs) • Multinucleotide a t [at] a 75% 25% 50% Scoring Matrices t 25% g 0% c 0% 1 . t t 25% 25% 100% 75% 50% 25% 75% 0% 0% 0% 25% 0% 0% 0% 0% 25% 0% 0% 2 3 4 5 6 Searching for Known Motifs 1. Obtain a multiple sequence alignment of known motifs (e.g. from gel shift assay) 2. Construct …atagtt… representation a …aattat… t …attatt… g …ttactt… c 3. Score all possible windows in the data Set based on: I seq (i) f b,i log 2 b f b ,i pb a t [at] . t t 75% 25% 50% 25% 25% 100% 25% 75% 50% 25% 75% 0% 0% 0% 0% 25% 0% 0% 0% 0% 0% 25% 0% 0% 4. Output results that exist over a specified threshold from data set Finding Unknown Motifs 1. Input a set of co-expressed sequences that are related by micro-array experiment 2. Input: motif length n 3. Score all possible windows by first Constructing a multiple sequence Alignment of the window to all other possible matches in the other sequences 4. Rank the set of all possible scoring matrices of length n based on information content relative to background. I seq (i) f b,i log 2 b f b ,i pb 5. Output an ordered list of motifs and corresponding scoring matrices. AGAST: Advanced Grammar Alignment Search Tool • Capitalize on the advantages of alignment. • Provide a formal and robust method for computing biorelationships • Provide optimum results based on the input. • Calculate relationships in the same time as alignment. • Allow for user knowledge and subsequence relationships. • Record attributes and sequence attributes can be considered simultaneously. • Dynamically construct requisite algorithms in a user friendly way, thus limiting development time and technical knowledge requirements. AGAST: Advanced Grammar Alignment Search Tool Advantages: • It can evaluate regular expressions important to biology as well as traditional RegEx tools. • It can evaluate traditional alignments. • It can do any combination of RegEx and traditional alignments. BioRegEx BioRegEx II Example of an Advanced Query Find a sequence that contains: tatatagcagcccatgagccggcccgcadtgctagttcag 5-10 bases Transcription Start Site Any Number of Bases Functional Unit Start Codon Example Query: tatata.*{5,10}atg.*[gc]ca[at]gct[atgc]g:2.* tatatagcaggggcccatgagccggcccccadagctcgttcag Score: 0 tatatagcagcccatgagccggcccgcadtgagttcag Score: 2 Conventional Problem • Motif Searching programs do not calculate based on combinatorial regulation modules (instead they calculate based on probability of a single motif). • We have developed and tested a program that considers an ordered set of motifs (or sequence attributes) and searches based on a context of adjacent elements (or grammar). Next Steps • Extract multiple motifs in the context of regulatory control networks. • Use phylogenetic footprinting and gene regulatory network information to compare and contrast gene regulation networks and extrapolate combinatorial control mechanisms and corresponding motifs upstream of genes. Other Current Research Projects Advanced tools for identifying splice sites Using ab initio Bayesian networks based approaches Using homology graph theory based approaches Fast Recognition of Microorganisms using enzyme cutting sequence, mass spectrometry or sequence based approaches Gene Prediction using Comparative Genomics Reconstructing Gene Regulatory Networks Clustering Techniques for Simplifying Protein Sequences Acknowledgment Kiran Bastola Alexander Churbanov Xutao Deng Huimin Geng Steven Hinrichs Xiaolu Huang Daniel Kuyper Mark Pauley Daniel Quest