Computations using pathways and networks
Nigam Shah
[email protected]

THE GOAL = MAKING SENSE OF HIGH THROUGHPUT DATA

High throughput data
• “High throughput” is one of those fuzzy terms that is never really defined anywhere.
• Genomics data is considered high throughput if:
  - You cannot “look” at your data to interpret it.
  - Generally speaking, it means ~1000 or more genes and 20 or more samples.
• There are about 40 different high throughput genomics data generation technologies.
• DNA, mRNA, proteins, metabolites … all can be measured.

How does ontology help?
• An ontology provides an organizing framework for creating “abstractions” of the high throughput data.
• The simplest ontologies (i.e. terminologies, controlled vocabularies) provide the most bang for the buck.
  - Gene Ontology (GO) is the prime example.
• More structured ontologies, such as those that represent pathways and higher-order biological concepts, still have to demonstrate real utility.

Gene Ontology to analyze microarray data

Using GO annotations

Descriptions built by connecting/linking ontology terms
Biologists interpret a list of genes and form a result statement such as:
The photosynthesis genes located in the chloroplast are repressed in response to ozone stress and have the ABRE binding site enriched in their promoters.

… more structure
[Figure: between-ontology structure. Labels: OBOL, Relations Ontology, ?<link>?, <Some MF> in <Some BP>]

… more structure [beyond GO]: PATO
The building blocks of phenotype descriptions: EQ
• Entity (bearer), such as spermatocyte, wing
• Quality (property, attribute)
  - a kind of dependent continuant
Formally, an EQ description defines:
  - a Quality which inheres_in a bearer entity
The building blocks are combined according to the Phenosyntax: www.fruitfly.org/~cjm/formats

Semantically structured annotations
Ontologies used:
1. Relationship ontology
2. Mouse Pathology ontology
3. Tissue/Organ
4. Gene ontology
Example annotation: Basal layer of organ shows membranous staining
Annotation template: mRNA of genes encoding proteins with mf in bp at cc is increased in sample-id which shows some pathology in some tissue in some organ
Queries enabled:
1. Identify all images with a specific pathology
2. Identify cases with pathology and some gene expression changes
3. Correlate changes in biological processes with changes in morphology
Discovery enabled:
1. Classify samples in expression space and “look” for histological changes that correlate with it.
[Figure labels: HOW, WHY]

Open Questions/Challenges
• Creation/acceptance of a systematic formalism for creating expressive annotations (e.g. associated_with, involves)
• A generic tool that uses ontologies and allows the user to compose terms and cross-ontology annotations
• Easy term/annotation composition
• Control the number of alternative [compositional] statements allowed

“Pathways” to analyze array data
• The notion of a cancer signaling pathway can serve as an organizing framework for interpreting microarray expression data.
• By examining a relatively small set of genes chosen from prior biological knowledge about a given pathway, the analysis becomes more specific.

Reactome’s sky painter

Operations on pathway resources
Approaches: Custom code | RDF + SPARQL | OWL + SWRL
• Verify a pathway resource: Proofreading Reactome [1] | In progress | In progress
• Perform integrated querying of multiple pathway resources: Hard (“wrapper” approaches) | PKB [2]
• Verify multiple pathway resources: Too hard (there are ~200)
• Merge and compare multiple pathway resources
• “Reason” over pathway resources
[1] A case study in pathway knowledgebase verification, BMC Bioinformatics 2006, 7:196
[2] Pathway Knowledge Base: An Integrated pathway resource using BioPAX, submitted to Applied Ontology

Merge and compare pathway resources
• Given a set of ‘nodes’ and some ‘links’ among them, query multiple pathway sources and fill in the most plausible interactions between the nodes.
• Plausible = not contradicted by existing data and knowledge
• Current pathway resources [in BioPAX] cannot support this, because the manner in which ‘nodes’ and ‘links’ are identified is arbitrary.
• Reactome has started to connect the pathway steps with GO biological processes.
• BioPAX lets pathway sources “export” their nodes and links.
• … but p53 in resource A is still different from P53 in resource B
• … and Activate in resource A is still different from activates in resource B

Problem
• I have no clue what a pathway is!
• A set or series of interactions, often forming a network, which biologists have found useful to group together for organizational, historic, biophysical or other reasons.
• The complexity and abstraction represented in a pathway is decided by its author attempting to represent the interactions between a set of genes, proteins, and small molecules.

“Networks” to analyze high throughput genomic data

Building networks
• Take a high throughput dataset
• Define a notion of ‘relatedness’ depending on the dataset
  - Co-expression for microarray data
  - Co-occurrence for literature networks
  - …
• Enlist [node]--<link>--[node] pairs
• Find a good graph drawing program!

Nice hairball but …
From Long et al, in Trends in Biochemical Sciences, vol 32, no 7.
From Srinivasan et al, in Briefings in Bioinformatics, August 2007.
Srinivasan B, Snow R, Shah N and Batzoglou S, in Interactome Networks conference @ CSHL

Hypotheses/Models to analyze high throughput genomic data

Events and Implicit claims
A hypothesis is a statement about relationships (among objects) within a biological system.
Example: Protein P induces transcription of gene X
An ‘event’ is a relationship between two biological entities.
[Diagram labels: P, promoter, gene X]
Implicit claims that can be tested:
1. P is a transcription factor.
2. P is a transcriptional activator.
3. P is localized to the nucleus.
4. P can bind to the promoter of gene X.

Representing Events Explicitly
A hypothesis consists of at least one event stream.
An event stream is a sequence of one or more events or event streams with logical joints (or operators) between them.
An event has exactly one agent_a, exactly one agent_b and exactly one operator (i.e. a relationship between the two agents). It also has a physical location that denotes ‘where’ the event happened, the genetic context of the organism and associated experimental perturbations when the event happened.
A logical joint is the conjunction between two event streams.

User interfaces
Hypothesis described in Natural Language
Biological process described in a formal language

Evaluating a hypothesis
A. Representation of a hypothesis in terms of events (ev = event)
B. Holding the mouse on a neighboring hypothesis (b1) shows what event was replaced to create it
C. Plot of the support versus conflicts for submitted and neighboring hypotheses (n1, b1). Clicking on n1 submits that hypothesis as ‘seed’

HyBrow: lessons learnt
• The minimum requirement for a formal representation:
  - Ability to represent data, information and knowledge
  - A language to unambiguously express your “thought experiment” (your model, hypothesis, theory, theorem etc.)
  - A reasoning framework to evaluate the outcome/validity/accuracy of your thought experiment
• Project home page: www.hybrow.org

Pathways as “models”?
• Pathways are assumed to be models representing biological processes, without actually knowing the modeling formalism in which the model is valid.
• The ‘language’ of writing out a pathway doesn’t really have a grammar and/or a logic.
• Most pathways end up being lists of heterogeneous sets of “steps” (in terms of the time of execution, the place of execution, the abstraction level, the kind of ‘thing’ passed along etc.)
• Lots of discussion on requirements of data providers; where are the users/consumers and their use cases?
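
The event structure described under “Representing Events Explicitly” can be sketched as a small data model. This is only an illustrative sketch, not HyBrow's actual implementation: agent_a, agent_b and operator come from the slide, while the class names, the remaining field names, their defaults and the "AND" joint value are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Event:
    """One event: exactly one agent_a, one operator and one agent_b."""
    agent_a: str
    operator: str                        # the relationship between the two agents
    agent_b: str
    location: str = "unspecified"        # 'where' the event happened
    genetic_context: str = "wild type"   # assumed default, not from the slides
    perturbations: tuple = ()            # associated experimental perturbations


@dataclass
class EventStream:
    """A sequence of events or nested event streams, with one logical joint
    between each neighbouring pair of parts."""
    parts: List[Union[Event, "EventStream"]]
    joints: List[str] = field(default_factory=list)  # e.g. "AND" (assumed value)

    def __post_init__(self):
        if not self.parts:
            raise ValueError("an event stream needs at least one event")
        if len(self.joints) != len(self.parts) - 1:
            raise ValueError("need exactly one joint between neighbouring parts")

    def events(self):
        """Flatten nested streams into a plain list of events."""
        flat = []
        for part in self.parts:
            if isinstance(part, EventStream):
                flat.extend(part.events())
            else:
                flat.append(part)
        return flat


@dataclass
class Hypothesis:
    """A hypothesis consists of at least one event stream."""
    streams: List[EventStream]

    def __post_init__(self):
        if not self.streams:
            raise ValueError("a hypothesis needs at least one event stream")


# "Protein P induces transcription of gene X", plus one of its implicit claims.
binds = Event("P", "binds", "promoter of gene X", location="nucleus")
induces = Event("P", "induces transcription of", "gene X", location="nucleus")
stream = EventStream(parts=[binds, induces], joints=["AND"])
hypothesis = Hypothesis(streams=[stream])
print(len(stream.events()))  # prints 2
```

Enforcing the "exactly one" and "at least one" constraints in the constructors means a malformed hypothesis is rejected when it is written down rather than when it is evaluated.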

Claims
• Pathways are useful only if they can serve as “models” [accurate representations] of a process.
• Hence whatever needs to be done to ensure that a pathway is a valid model in at least one formalism should be required of the pathway author.
• A pathway representation that doesn’t solve the problem of uniquely identifying entities doesn’t solve the problem of integrating pathways.
• We just end up with marked up, structured information from multiple providers, without actually integrating anything.

Success of projects in the Biomedical domain
[Figure: projects placed by KR complexity (minimal to high) and computational complexity (minimal to high). Projects shown: Virtual soldier, TMJ, HyBrow, TAMBIS, Riboweb, BioCyc, BioSigNet, BioLingua, Pathway logic, Mycin, PharmGKB, Reactome, Use of GO]
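
The claim about uniquely identifying entities can be made concrete with a toy example. The sketch below compares the links exported by two hypothetical resources, first using the names exactly as given and then after mapping them to shared identifiers. Everything here is invented for illustration: the resource contents, the CANONICAL_NODES and CANONICAL_LINKS tables and the "ENT:" identifiers are assumptions, and building such mapping tables for real resources is precisely the part that remains unsolved.

```python
# Links exported by two hypothetical resources as (source, relation, target).
resource_a = {("p53", "Activate", "p21"), ("MDM2", "Inhibit", "p53")}
resource_b = {("P53", "activates", "P21"), ("P53", "activates", "BAX")}

# Hypothetical mapping tables from resource-specific names to shared
# identifiers. The "ENT:" identifiers are made up for this example.
CANONICAL_NODES = {
    "p53": "ENT:1", "P53": "ENT:1",
    "p21": "ENT:2", "P21": "ENT:2",
    "MDM2": "ENT:3",
    "BAX": "ENT:4",
}
CANONICAL_LINKS = {
    "Activate": "activates", "activates": "activates",
    "Inhibit": "inhibits",
}


def canonicalize(links):
    """Rewrite every link using the shared node and relation identifiers."""
    return {
        (CANONICAL_NODES[source], CANONICAL_LINKS[relation], CANONICAL_NODES[target])
        for source, relation, target in links
    }


# With the names as exported, the two resources appear to share nothing.
shared_raw = resource_a & resource_b

# With shared identifiers, the common link becomes visible and the union
# is a genuinely merged set rather than two lists placed side by side.
shared_mapped = canonicalize(resource_a) & canonicalize(resource_b)
merged = canonicalize(resource_a) | canonicalize(resource_b)

print(len(shared_raw), len(shared_mapped), len(merged))  # prints 0 1 3
```

Without the mapping step, the "merged" set would hold four links with no recognized overlap, which is the marked up but unintegrated outcome the last claim describes.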