BeeSpace Software
Plans, Design, and Development

Outline
  Goals
  Context
  Approach
  Software Process
  Functionality
  Design
  Implementation Details
  Future Prospects

Project Goals & Parameters
“This project will analyze social behavior… using Apis mellifera as the model organism”.
  Goal: support research and analysis of the Western honey bee.
Using “biology research (that) will generate a unique database of gene expressions…” and “microarray experiments (that will) utilize the recently sequenced genome, supported by state-of-the-art statistics.”
  Goal: support application of biological methods and techniques for exploratory analysis.
And using “informatics research (that) will develop an interactive environment to analyze all information sources relevant to bee social behavior.”
  Goal: support application of language processing methods for exploratory analysis.
“The BeeSpace environment will enable users to navigate a uniform space of diverse databases and literature sources for hypothesis development and testing.” (Ref: http://www.beespace.uiuc.edu/)
  Goal: support dual analysis methodologies via an integrated analysis environment.
Parameter: 5 years to complete the project, including research, development, deployment, outreach and documentation.
Parameter: annual milestones and workshops expected.

Context
There are voluminous amounts of biomedical and genomic literature containing valuable knowledge and research results.
  Implication: Too much for human processing, and not in a machine-ready format for reasoning-based systems.
There exist novel language processing techniques that have been primarily applied in niche applications.
  Implication: Emerging technologies (NLP, TM, etc.) can provide the backbone for a strategic solution, but their risks must be mitigated through controlled development cycles.
There exist numerous, but currently isolated, tools for bioinformatics data processing.
  Implication: Opportunities exist for interoperability with disparate systems, but success hinges on standardization.
The web is seeing an increase in smaller, highly focused communities-of-interest.
  Implication: Opportunities exist for supporting the creation and management of localized “knowledge-spaces”.

Context – Related Tools & Projects
3rd Millennium Inc. – “…development of an integration framework for genomic, gene expression, and interaction data (protein-protein as well as protein-DNA) from multiple sources and model organisms that can enable the display of the relationships between biochemical objects into the context of biological pathways and networks.”
iHOP – Information Hyperlinked Over Proteins: supports lookup and summarization of genes/proteins. “In general more than 90% of all active relations between proteins in the literature are expressed syntactically as ‘protein verb protein’”. Ref.
IntAct Database – “IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.”
Entrez eUtils – A web services (SOAP) interface for programmatically querying and interacting with NCBI databases.

Software Process
System Development Life Cycle (SDLC)
  Identify project goals and critical success factors.
  Investigate current methodologies and tools that have functional or domain overlap with project objectives.
  Research the applicability of novel analysis techniques for extracting deeply embedded and stratified knowledge structures.
  Build an integrated software suite that will allow for interactive analysis and augmentation of rich data sets.
  Test and deploy software to focused user groups.
  Document and publish research results.
  Iterate the above process for continuous quality improvement.

Functionality
Should be a web-based system supporting lightweight GUI components and having minimal end-user requirements.
Should accommodate user-directed query-by-navigation (QBN) of “concept space”.
Should extract and normalize concepts as “equivalence classes” of things with highly similar meaning.
Should recognize and denote entities.
Should allow the user to drill down, drill up and drill across concept space, e.g. text-to-concept, concept-to-concept, concept-to-theme, and the reverse directions as well.
Should allow the user to perform encyclopedia-style lookup of entities.
Should provide hooks for tie-in to 3rd party bioinformatics tools.

Design Principles
  Maintainability
  Portability
  Extensible
  Efficiency
  Organized
  Interoperability
  Configurability
  Ease-of-use
  Trusted
  “Quality without a Name”
References: “Code Complete”, 2nd ed.; “Pattern-Oriented Software Architecture”, volume 1.

Design – Use Case Diagram

Design – Component Diagram
BeeSpace Design
  Application Layer: BeeSpace Navigator
  Query & Data Access Layer: Fuzzy Query Engine, Data Access Component
  Annotated Data, Meta-Data and Indices: XML Schemas, XML Data, Indices
  Data Processing Layer: Entity Recognizer, POS Tagger, NP Chunker, Inverter
  Data Sources: Text Bases
  Concept Normalizer, Concept Generator

Design – Deployment Scenarios
BeeSpace Software Packaging
  Web App
  Standalone GUI App
  Core Library
  Data Processing Components
  Query/Access Components
  Agents/P2P Clients
  Extension Library
  Communication Components

Design – Class Diagram

Implementation Details
The current system is being constructed as follows:
The (v1.0) application is being developed as a web-based application.
The output of the data processing pipeline is a set of indices and annotated data files that the client application depends on.
  Design Decision: There is a clear separation of concerns between the server-side processing and the client-side interface. XML is used throughout as the data interchange format between software components.
The pipeline is composed of independent software components, but these components need to be inter-connected.
  Design Decision: The interface is built on top of lightweight technologies (e.g. HTML, DHTML & JavaScript). Typical web-app challenges, such as sessioning and security, need to be addressed.
  Design Decision: Components are called as executables with defined interfaces.
Some components need to be able to store their data aggregations persistently (and other components may need access to this data).
  Design Decision: Currently each component handles this problem independently. A better long-term solution is to extract this concern and address it globally, for example by using an ORDBMS.

Future Implementation Details
Support both a web interface (HTML, CSS, DHTML, JavaScript) and a full-blown GUI interface (Java Web Start app).
Consistent Java implementation for portability, maintainability, RAD, etc.
Incorporate a DBMS for consistent handling of “persistent storage”.
Library extensions for communication between distributed, heterogeneous applications (perhaps KIF).
Optimized data processing and communication.

Climbing the Pyramid
[Figure: Pyramid of Knowledge. Two parallel pyramids, levels listed top to bottom.]
  Text Mining: ?; Computer Automated Research (Success); Intelligent-driven Research (Profit); Hidden Relationships (Network); Semantics (Nodes); Raw Text (Lit.)
  Data Mining: ?; Computer Automated Business (Success); Intelligent-driven Business (Profit); Predictions (Trends); Aggregations (Reports); Raw Data (Txns)
  Side labels, bottom to top. Text Mining: Text, Concepts, Relations, Knowledge. Data Mining: Data, Information, Patterns, Knowledge.

Future Prospects
Generalize the system so that it is NOT domain-specific and can be readily applied to other domains.
Allow for persistent sessioning and sharing of knowledge-spaces amongst communities-of-interest.
Support a visual query system (VQS) interface and/or a query-by-example (QBE) interface.
Support all kinds of hypothesis generation: deduction, abduction & induction.
Support personalized annotations. (What constitutes a “good” KR structure: clarity, logic, expressiveness?)
Smooth the integration between the BeeSpace Navigator and the myriad web-based tools.
Support n-ary, semantically rich relations as opposed to just dyadic ones.

Visual Query in Text Mining Application
[Figure: visual query graph.]
  Org: bee – Found-In – Gene: ?x – HasProduct – Protein: ?y
  Org: fly – Found-In – Gene: Glued – HasProduct – Polypeptide: p150Glued
  Protein: ?y – SimilarTo (Threshold: 0.9) – Polypeptide: p150Glued

Future BeeSpace Components
Future BeeSpace Design
  Application Layer: BeeSpace Analyzer, BeeSpace Workflow Manager, BeeSpace Navigator
  Query & Data Access Layer: Q/A Component, Expert Shell Component, Fuzzy Query Engine, Data Access Component, Text Miner, Entity Mapper, Concept Generator
  Central Knowledge Base: ORDBMS
  Data Processing Layer: Entity Recognizer, POS Tagger, NP Chunker, Inverter, Concept Normalizer, Topic Detection, Relation Extractor, Rule Miner, Ontology Detector
  Data Sources: Data Bases, Text Bases, Web Bases

Snake Space?
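
The example on the “Visual Query in Text Mining Application” slide can be read as a graph pattern over relation triples: find a bee gene ?x whose product ?y is similar, at a threshold of 0.9, to p150Glued, the product of the fly gene Glued. The following is a minimal Python sketch of that reading, not the BeeSpace implementation. The bee gene and protein names, the similarity scores, and the helper names (`unify`, `match`, `visual_query`) are hypothetical placeholders chosen for illustration only.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object)
Binding = Dict[str, str]        # variable name -> bound value

# Hypothetical toy facts. Only Glued / p150Glued / fly / bee appear on the
# slide; the bee genes and proteins below are invented for illustration.
FACTS: List[Triple] = [
    ("Glued", "FoundIn", "fly"),
    ("Glued", "HasProduct", "p150Glued"),
    ("beeGeneA", "FoundIn", "bee"),
    ("beeGeneA", "HasProduct", "beeProteinA"),
    ("beeGeneB", "FoundIn", "bee"),
    ("beeGeneB", "HasProduct", "beeProteinB"),
]

# Hypothetical similarity scores between proteins (not real measurements).
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("beeProteinA", "p150Glued"): 0.93,
    ("beeProteinB", "p150Glued"): 0.41,
}


def is_var(term: str) -> bool:
    """Query variables are written with a leading '?', as on the slide."""
    return term.startswith("?")


def unify(pattern: Triple, fact: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `fact`, or return None."""
    extended = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            # Bind the variable, or check it against an earlier binding.
            if extended.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return extended


def match(patterns: List[Triple], facts: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Yield every binding that satisfies all patterns (simple backtracking)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for fact in facts:
        extended = unify(first, fact, binding)
        if extended is not None:
            yield from match(rest, facts, extended)


def visual_query(threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Return (gene, protein, score) for bee genes matching the slide's query."""
    patterns: List[Triple] = [
        ("?x", "FoundIn", "bee"),
        ("?x", "HasProduct", "?y"),
        ("Glued", "FoundIn", "fly"),
        ("Glued", "HasProduct", "p150Glued"),
    ]
    results = []
    for b in match(patterns, FACTS):
        # SimilarTo is treated as a scored relation filtered by the threshold.
        score = SIMILARITY.get((b["?y"], "p150Glued"), 0.0)
        if score >= threshold:
            results.append((b["?x"], b["?y"], score))
    return results


if __name__ == "__main__":
    for gene, protein, score in visual_query():
        print(gene, protein, score)
```

Treating SimilarTo as a scored filter rather than a stored triple keeps the threshold adjustable per query; a real system would presumably compute the score from sequence or text similarity rather than look it up in a table.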