and Tools for exploring the biomedical information landscape
Les Grivell
EMBO Electronic Information Programme
EAHIL 2004, Santander

Electronic information programme
Online research information environment for the life sciences
A next generation information service for the life sciences
Communities@embo
Life Sciences Mobility Portal

But first, let me take you back – not to Altamira, but to the … early days of scientific publishing (pre-impact factor)

When libraries were comfortable places that had everything you needed … and it was possible to keep track of the literature … (more or less) …

Where are we now?
– Publishing is big business
• STM publishing is a multi-billion EUR activity (in the UK alone, GBP 22 billion in 2000)
• An estimated 164,000 scientific periodicals worldwide; around 16% of these are online
– Core science; core journals
• PubMed lists some 4600 journals in biomedical disciplines
• As of 19 Sept 2004, 4429 of these are online
• The PubMed database provides access to circa 15 million abstracts (but if you can’t be found, you won’t be read …)
• The Science Citation Index lists 5876 journals with impact factors ranging from 54.45 to 0.00 (you’ve been found, but are you worth reading? …)

Another information explosion: genomics
[Figure: sequence entries in the EMBL DNA database. Vertical axis: base pairs (billions), 0 to 35; horizontal axis: year, 1980 to 2005. Chart annotation: Morowitz]

Raw sequences are not the only form of digital information

The nice thing about biological information resources is that there are so many …
• Hundreds of different databases, many in flatfile format
• A variety of user interfaces
• General lack of interoperability

Wouldn’t it be nice to … find all published literature references for a large set of gene symbols and explore their relationships?
[Diagram labels: Micro-array chip; Co-regulated genes; Find literature; Database lookup; Discover relationships]

This is not really such a novel idea …
Fritz Saxl (1890–1948)
‘Ich will nicht, dass in der Bibliothek ewig gesucht wird! Dieses Suchen kostet Nerven und die dürfen nicht verschwendet werden an solche Dummheiten...’
‘I don’t want there to be endless searching in the library! It is at the expense of nerves and these should not be wasted on such stupidities …’
Aby Warburg (1866–1929)
Saxl & Warburg: Mnemosyne Atlas

Some text search engines
Bibliographic databases
Biosis
Full text / web-pages
PubMed
• Text-based!
• No direct linkage to other datasets
• Search only title, authors, abstract
• Boolean keyword search (AND / OR)
• Search language is English
• All documents stored and indexed in one location
• No ranking on relevance to query!

Main features
• Ability to interconnect literature articles with different types of molecular data, including images
• Ability to search through and retrieve journal articles and other full text documents, even when in different physical locations
• Ability to support multi-lingual documents and queries
• Services free to the academic community
Features implemented via conceptual fingerprinting

A discovery tool
Conceptual fingerprints
Full text document
Index and link index terms to (multi-lingual) thesauri
• 1 conceptual fingerprint (CFP) = 400 bytes
• Abstraction: 250,000 pages/PC/day
• Matching: 500,000 CFPs in 40 ms

Fingerprint database prototypes
• Initial prototypes in September 2002 and July 2003
• Current prototype online since 1st March 2004
• Next launch due mid-October 2004

E-BioSci
Content selection: abstracts + full text
Choose search focus
Full text query in English, French or German; the query is fingerprinted for search

… and now a word about
• 8 partners (DE, ES, FR, UK) (Platform)
• 13 partners (ES, FR, IT, NL, UK) (Research project)

Oriel’s aims
Wouldn’t it be nice to be able to navigate from an image to literature and molecular databases?
www.bioimage.org (Dr David Shotton, Univ. Oxford)

Gene symbol identification in text
Text containing symbols
Improved literature – molecular dataset linkage
Twinkle, twinkle, little star,
How I wonder what you are.
Up above the world so high,
Like a diamond in the sky.
Twinkle, twinkle, little star,
How I wonder what you are
Symbols shown alongside the text: PEO1, GUCY2C, TYRO3, CD44

Problems in gene symbol recognition
• Many gene symbols are indistinguishable from everyday words or abbreviations
• Synonyms
• Homonyms
• Homonym synonyms (ELK1 = SAP1; CAR1 = SAP1; BD-2 = SAP1; RIP1_SAPOF = SAP1)

Word-“processing”
Natural language processing

Protein interaction networks
[Diagram: network with nodes ataxia, Yfh1, Ssc1, Isu1 and Oct1, and edge labels requires, regulates, interacts and activates]
Hoffmann & Valencia (Madrid)

Some web-addresses
http://www.e-biosci.org
http://www.oriel.org
http://www.bioimage.org
http://www.pdg.cnb.uam.es/UniPub/iHOP/
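
As a rough illustration of the gene symbol recognition problems listed above, the sketch below runs a naive dictionary lookup over free text. The SAP1 entries are taken from the slide; the pairings of the rhyme words with PEO1, GUCY2C, TYRO3 and CD44 are an assumed reading of the slide's illustration, and `build_index` and `tag_symbols` are hypothetical names, not part of any E-BioSci or ORIEL tool.

```python
import re
from collections import defaultdict

# (alias, official symbol) pairs.
# The SAP1 rows come from the slide; the last four rows are ASSUMED
# pairings for the nursery-rhyme words, used only for illustration.
ALIASES = [
    ("SAP1", "ELK1"),
    ("SAP1", "CAR1"),
    ("SAP1", "BD-2"),
    ("SAP1", "RIP1_SAPOF"),
    ("TWINKLE", "PEO1"),   # assumption
    ("STAR", "GUCY2C"),    # assumption
    ("SKY", "TYRO3"),      # assumption
    ("IN", "CD44"),        # assumption
]

# A token is a run of letters, digits, underscores or hyphens.
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def build_index(pairs):
    """Map every alias and official symbol (upper-cased) to its candidates."""
    index = defaultdict(set)
    for alias, symbol in pairs:
        index[alias.upper()].add(symbol)
        index[symbol.upper()].add(symbol)
    return index


def tag_symbols(text, index):
    """Return one record per token that matches a known alias or symbol."""
    hits = []
    for match in TOKEN.finditer(text):
        token = match.group(0)
        candidates = index.get(token.upper())
        if not candidates:
            continue
        hits.append({
            "token": token,
            "candidates": sorted(candidates),
            # One alias shared by several genes.
            "ambiguous": len(candidates) > 1,
            # Not written in all-caps symbol form, so possibly just a word.
            "everyday_form": token != token.upper(),
        })
    return hits


if __name__ == "__main__":
    index = build_index(ALIASES)
    for hit in tag_symbols("Like a diamond in the sky, SAP1 is ambiguous", index):
        print(hit)
```

A case-insensitive lookup like this over-matches on purpose: every "in" and "star" in ordinary prose is reported as a possible gene, and SAP1 comes back with four candidates. That is the behaviour the slide warns about, and it is why context (for example natural language processing) is needed on top of a plain dictionary.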
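
The conceptual fingerprinting idea from the earlier slides (index a document, link its index terms to multi-lingual thesauri, then match fingerprints) might be sketched as follows. This is not the E-BioSci implementation, which the slides do not describe; it swaps in a plain concept-count vector compared by cosine similarity, and the thesaurus, concept IDs and function names are all made up for the example.

```python
import math
import re
from collections import Counter

# HYPOTHETICAL multi-lingual thesaurus: surface term -> concept ID.
# English, French and German terms for one concept share an ID.
THESAURUS = {
    "gene": "C001", "gen": "C001",
    "heart": "C002", "coeur": "C002", "herz": "C002",
    "liver": "C003", "foie": "C003", "leber": "C003",
    "mouse": "C004", "souris": "C004", "maus": "C004",
}

WORD = re.compile(r"[a-z]+")


def fingerprint(text):
    """Reduce text to a unit-length vector of concept weights."""
    words = WORD.findall(text.lower())
    counts = Counter(THESAURUS[w] for w in words if w in THESAURUS)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    return {concept: v / norm for concept, v in counts.items()}


def similarity(a, b):
    """Cosine similarity of two unit-length fingerprints."""
    return sum(weight * b.get(concept, 0.0) for concept, weight in a.items())


def rank(query, documents):
    """Score every document against the query, best match first."""
    q = fingerprint(query)
    scored = [(similarity(q, fingerprint(text)), name)
              for name, text in documents.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = {
        "en": "The gene is expressed in the heart",
        "de": "Das Gen im Herz",
        "other": "Mouse liver",
    }
    # A French query retrieves English and German documents,
    # because matching happens on concept IDs rather than on words.
    for score, name in rank("coeur", docs):
        print(name, round(score, 3))
```

Because documents and queries are compared as concept vectors, the query language does not need to match the document language, and results come back ranked rather than as an unordered Boolean hit list. On the figures quoted in the slides, 500,000 fingerprints at 400 bytes each come to about 200 MB.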