Maryann E. Martone, Ph.D.
Neuroscience Information Framework
University of California, San Diego

Themes
• Computers are now partners with humans in reading the literature
  ○ Search
  ○ Summarization
  ○ Linking
  ○ Discovery
• The scientific paper starts with the materials and methods
  ○ All observations, claims, etc. flow from experimental design and materials
  ○ If authors do not provide this information in the first place, then we can’t use it to improve all of the above
• Scientists produce articles for each other, not for computers
  ○ Not everything you need to interpret the paper is in the paper
  ○ More information may be there than is in the text

NIF is an initiative of the NIH Blueprint consortium of institutes
• What types of resources (data, tools, materials, services) are available to the neuroscience community?
• How many are there?
• What domains do they cover? What domains do they not cover?
• Where are they?
  ○ Web sites
  ○ Databases
  ○ Literature
  ○ Supplementary material
  ○ PDF files
  ○ Desk drawers
• Who uses them?
• Who creates them?
• How can we find them?
• How can we make them better in the future?
NIF provides a wealth of practical information on data and resource issues in neuroscience

The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience
http://neuinfo.org
UCSD, Yale, Cal Tech, George Mason, Washington Univ
• A portal for finding and using neuroscience resources
• A consistent framework for describing resources
• Provides simultaneous search of multiple types of information, organized by category
  ○ Literature: 22 mil
  ○ Data Federation: 350 mil
  ○ Resource Registry: 5000
• Supported by an expansive ontology for neuroscience
• Utilizes advanced technologies to search the “hidden web”
Supported by NIH Blueprint

In an ideal information system, we would be able to find…
• What is known
  ○ “What studies used my monoclonal mouse antibody against actin in humans?”
  ○ “What phenotypes are associated with each mouse model of Spinal Muscular Atrophy?”
  ○ “What upregulates SMN1?”
• What is not known
  ○ Connect information to infer plausible hypotheses
    ○ Genotype-phenotype
    ○ Possible drug targets
  ○ Information gaps

Whither biological information?
[Figure labels: What is potentially knowable; What is known: Literature, images, human knowledge; What is easily machine processable and accessible; ∞]

CA2: Ion, Brain Part or Gene?
BioGrid, Allen Brain Atlas, Brain Info
NIF queries across 170+ independent databases

Papers are the currency of science
• Despite the wealth of data out there (> 2500 databases on-line), the majority of data is still published in papers
• But...we write for other humans to consume, and information continues to be hard to find
• Even for humans, however, it is difficult to find and verify basic information about a paper that is critical for interpretation
  ○ What is the subject of the study
  ○ What reagents were used
  ○ What genes were studied
• A lot of information is missing from papers
  ○ Not all data is available
  ○ Data is published in papers in forms that are difficult to use

Mining the literature for resources
• Resources: Materials, services, tools, data
• Project 1: Find materials: antibodies and transgenic animals
• Project 2: Mine supplemental data in papers showing gene expression changes in drug abuse
• Purpose
  ○ Find new resources
  ○ Track usage of existing resources
  ○ Link resources to other useful information
• Linking resources: Link out broker

Use case: antibodies
• Pilot project to use text mining to identify antibodies used in studies
• Wanted to pick a project that would be immediately understandable by research scientists
• Antibodies are used routinely to identify proteins and other molecules in basic and translational studies
• Antibodies are a large source of experimental variability in results
  ○ Same antibody can give you very different results
  ○ Different antibodies to the same protein can give you very different results
• Neuroscientists spend a lot of time tracking down antibodies and troubleshooting experiments that use antibodies

Our reagents and methods are not perfect
“We note that many of the findings in the literature about neuronal NF-κB are based on data garnered with antibodies that are not selective for the NF-κB subunit proteins p65 and p50. The data urge caution in interpreting studies of neuronal NF-κB activity in the brain.”
--Herkenham et al., J Neuroinflammation. 2011; 8: 141.

Antibodies are complex entities
• Anti-ChAT antibody
  ○ Raised against a portion of choline acetyltransferase
  ○ Raised in a particular species
  ○ Is polyclonal or monoclonal
  ○ Is affinity purified or not
  ○ Recognizes the target in some species, e.g., human
• Reported in materials and methods:
  Tissue sections were blocked with 5% serum and incubated overnight at 4 °C with the following primary antibodies: anti-ChAT (1:100; Millipore, Billerica, MA), anti-Bax (1:50; Santa Cruz), anti-Bcl-xl (1:50; Cell Signaling), anti-neurofilament 200 kDa (1:200; Millipore) ...
• “Find studies that used a rabbit polyclonal antibody against GFAP that recognizes human in immunocytochemistry”
• NIF Antibody Registry: database of > 900,000 antibodies (AB_310775)
Paz et al, J Neurosci, 2010

Searching for resources in literature
• NIF recently implemented a section-specific search
• Semi-automated resource identification pipeline
Paul Sternberg, Yuling Li, Cal Tech

Annotation of antibodies
DOMEO annotation tool: Paolo Ciccarese; Tim Clark, MGH
• Allows annotation of entities and key relationships:
  ○ Protocol
  ○ Subject of protocol
• Links antibodies to a database of antibodies that contains their properties
  ○ NIF Antibody Registry
  ○ 900,000 antibodies
  ○ Unique ID
http://antibodyregistry.org
http://annotationframework.org/

What studies used my monoclonal mouse antibody against actin in humans?
Subject is Human:
  Midfrontal cortex tissue samples from neurologically unimpaired subjects (n = 9) and from subjects with AD (n = 11) were obtained from the Rapid Autopsy Program
mAb = monoclonal antibody:
  Immunoblot analysis and antibodies
  The following antibodies were used for immunoblotting: actin mAb (1:10,000 dilution, Sigma-Aldrich); -tubulin mAb (1:10,000, Abcam); T46 mAb (specific to tau 404–441, 1:1000, Invitrogen); Tau-5 mAb (human tau 218–225, 1:1000, BD Biosciences) (Porzig et al., 2007); AT8 mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500, Innogenetics); PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies); 12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert); NMDA receptors 2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa Cruz Biotechnology)…

Tracking down reagents
Feng et al., MATH5 controls the acquisition of multiple retinal cell fates, Mol Brain. 2010; 3: 36
• Space limitations
• Content gets separated in space and time
• Practices are designed to save space, improve readability and save authors typing
• But...electrons are cheap
• Cut and paste is cheap
  ○ Re-examining plagiarism in the age of cut and paste
• Autocomplete is cheap
  ○ Acronyms and abbreviations
  ○ Are there any unique 3 letter strings
• Formats are flexible
  ○ What the computer sees and what humans see don’t have to be the same thing

Try this, Watson!
• 95 antibodies were identified in 8 articles
• 52 did not contain enough information to determine the antibody used
• Some provided details in another paper
• And another paper, and another...
• Failed to give species, clonality, vendor, or catalog number
• But many provided the location of the vendor, because the instructions to authors said to do so

Subject of study
• Often not explicit: “patients with AD” = human
• Type III SMA mice (Smn−/−, SMN2+/−) were produced as previously described (Tsai et al., 2006a).
• Official strain nomenclature of animals not designed for search
  ○ SMN2Ahmb89tg/tg;SMNΔ7tg/tg:Smn1−/−; no unique identifier assigned
• Many lines of transgenics are generated and described within a single paper; it is difficult to relate individual findings to the correct animal line, yet the lines are not all equivalent
  ○ Three lines of transgenic mice, M1, M2, and M3, were produced (Fig. 1B). Transgene expression was found in all tissues studied, with widespread high expression in line M1, high expression in brain of line M3, and relatively low expression in brain of line M2 (Fig. 1C). (Ripps et al., PNAS, USA Vol. 92, pp. 689-693, January 1995)

Which mouse did you use?
“Transgenic mice expressing SOD1G93A (12) were purchased from Jackson Laboratory”
12 = Gurney ME; et al. 1994. Motor neuron degeneration in mice that express a human Cu,Zn superoxide dismutase mutation [see comments] [published erratum appears in Science 1995 Jul 14;269(5221):149] Science 264(5166):1772-5.
• Search NIF/Jackson lab for “Gurney SOD”
  ○ 7 entries for same producer
  ○ 3 track to the same reference
Gogliotti et al, Biochem Biophys Res Commun. 2010 January 1; 391(1): 517.
“Here we report our findings for the SMA mouse model that has been deposited by the Li group from Taiwan. These mice, JAX stock number TJL-005058, are homozygous for the SMN2 transgene, Tg(SMN2)2Hung, and a targeted Smn allele that lacks exon 7, Smn1tm1Hung.”

Minimal metadata standards (really) for publishing in the 21st century
1) Provide gene accession numbers for all genes referenced in the methods section of a paper, per http://www.ncbi.nlm.nih.gov/gene
2) Identify (i.e., give ID) the species for the subject of a study, and from which each gene product is derived, using the NCBI taxonomy and the strains from the model organism databases for mice, rats, worms, zebrafish and drosophila, employing any existing unique identifiers and correct species-specific nomenclature:
3) Provide catalog numbers and vendor information for all reagents and animals described in the methods section of a paper
Developed by the Link Animal Model to Human Disease Initiative (LAMHDI) consortium:

Journal of Comparative Neurology: Requires complete characterization of antibody as stated in instructions to authors
• 90% of antibodies had a catalog #; 20% had a lot number after these policies were instituted
• NIF could automatically identify 80% of these antibodies through matching with NIF Antibody Registry

Project 2: Extracting data from tables and supplementary material
Challenge: Extract data on gene expression in brain from studies relevant to drug abuse
Workflow:
• Find articles
• Extract results from tables
• Standardize results
• Load into NIF
Drug related gene database: 140 tables from 54 articles
Andrea Arnaud-Stagg, Anita Bandrowski

Extracting additional knowledge from supplementary material
Gene for tyrosine hydroxylase has increased expression in locus coeruleus of mouse compared to control when given chronic morphine
Translations:
• Upregulated p < 0.05 = increased expression
• LC = locus coeruleus
• Probe ID = gene name
J Neurosci. 2005 Jun 22;25(25):6005-15.

Challenges working with tables and supplemental data
• Difficult data arrangements
  ○ PDF, JPG, TXT, CSV, XLS
  ○ Difficult styles: colors, symbols, data arrangements (results combined into one column, multiple comparisons in one table, legends defining values), unclearly described data (e.g., unclear significance)
• Not clear what tables/values represent: nothing in paper about the supplementary data file, and table has no heading
• Probe IDs are given but not gene identifiers
• No link from supplemental material back to article; lose provenance
• Not all results are accounted for

Is SMN1 affected by drugs of abuse?
SMN1 is the gene that is mutated in Spinal Muscular Atrophy, a neurological disease of children

Open world vs closed world assumptions
Closed world assumption:
• Holds that any statement that is not known to be true is false
• Allows an agent to infer, from its lack of knowledge of a statement being true, anything that follows from that statement being false
• Typically applies when a system has complete control over information
Open world assumption:
• The assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true
• Limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true
• Applies when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information
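
The difference between the two assumptions can be sketched in a few lines of Python. This is a minimal illustration, not part of NIF: the `REPORTED` table, the placeholder gene `GENE_B`, and both function names are hypothetical, and the only "result" used is the tyrosine hydroxylase example quoted above.

```python
from typing import Optional

# Hypothetical table of reported results: gene symbol -> direction of change.
# "Th" mirrors the tyrosine hydroxylase example; "GENE_B" is a placeholder.
REPORTED = {"Th": "increased", "GENE_B": "unchanged"}


def is_increased_closed_world(gene: str) -> bool:
    """Closed world: anything not recorded as increased is taken to be false."""
    return REPORTED.get(gene) == "increased"


def is_increased_open_world(gene: str) -> Optional[bool]:
    """Open world: a gene with no record is unknown (None), not false."""
    if gene not in REPORTED:
        return None
    return REPORTED[gene] == "increased"


if __name__ == "__main__":
    for gene in ("Th", "GENE_B", "Smn1"):
        print(gene, is_increased_closed_world(gene), is_increased_open_world(gene))
```

For a gene that was never reported, such as `Smn1` here, the closed-world function answers "not increased" while the open-world function answers "unknown". The second answer is the honest one when a paper lists only its significant hits.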

Reporting data: Closing the open world
• We measured the expression of 9000 genes as a function of chronic cocaine (S1). The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2
  ○ What about the other 8950 genes?
  ○ Cannot assume that they were increased, decreased or remained the same (Open world)
• We measured the expression of 9000 genes as a function of chronic cocaine (S1). The fold change and p value are given for each gene. The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2 (Closed world)

Narrative vs Data publishing
Narrative (Author):
• Encourage use of minimal standards for key entities in the research paper
  ○ Subject, protocol, genes, reagents
  ○ Make it easy to find accession numbers
• Standard templates for reporting supplemental data?
  ○ Unlikely although desired
• Tools for linking inline references to fragments of papers rather than the entire paper
Data (Curators):
• Structuring data requires expertise
• Positive and negative results equally important
• If data are to be published in supplemental material or in the paper, they should be made machine interpretable
• Ideally, the entire data set should be deposited in a public repository, e.g., the Gene Expression Omnibus (GEO)

Conclusions
• Humans are storytellers; it’s fundamental to the way we communicate
• But these stories are directed to an audience with expertise
  ○ Scientists know each other’s work; personal networks very important
  ○ The computer isn’t part of this
• So...we need to adapt publishing practices to aid automated search and mining of content
• Partnership between authors, publishers, curators and computer scientists, informaticians...

Future of research communications and e-scholarship
http://force11.org
JOIN US!