Cis-Regulatory / Text Mining Interface

Discussion Questions
(1) What does ORegAnno want from text mining?
  – Curation queue
  – Document mark-up
  – Mapping to database IDs
(2) What does text mining need from ORegAnno?
(3) What can text mining provide?
  – What level of performance is needed?
(4) What is the right way to proceed?
  – Data sets for BioCreAtIvE?
  – Custom tools for individual “early adopters”?

Answers: (1) What does ORegAnno Want from Text Mining
• Management of curation queue
  – Ideally, user customized, so that user annotates those documents of immediate interest to her/him
• Document mark-up to highlight relevant passages
  – A workflow pipeline making either the html or pdf version of the document available, with the (potentially) relevant terms highlighted
  – Support for “cut and paste” transfer of relevant regions to the database comments fields
• Mapping to IDs, ontology codes
  – Gene, transcription factor (protein), organism, cell and tissue type, evidence types

Answers: (2) What does Text Mining Need From ORegAnno?
• Significant quantity of reliably annotated data to train text mining systems
  – Annotated at a level useful for natural language processing (e.g., marked for evidence at the phrase, sentence or passage level, depending on task)
• This requires that ORegAnno have:
  – A clear statement of the scope of the ORegAnno database and a stable set of annotation guidelines
  – Annotations with high inter-annotator agreement
  – Tracking of entries by annotator, including depth of annotation (different annotators will annotate to different levels of detail, depending on interests)

Answers: (3) What Can Text Mining Provide?
• Curation queue management:
  – Document classification approaches (from e.g., TREC Genomics or BioCreAtIvE) can be applied and evaluated, making use of new training data from pre-jamboree and jamboree annotation
  – We can experiment with “user defined” criteria, based on restrictions for gene, transcription factor, organism, tissue, etc.
• Document mark-up
  – Users could be provided with a list of genes/transcription factors in a paper, with hot links into the paper to find relevant passages
  – This would allow the annotator to drive the annotation process, selecting only those annotations that are correct and relevant. This in turn provides feedback, using ORegAnno annotations to validate and train the text mining
  – Such a tool should make it easy for the annotator to provide the underlying text passages as evidence for the annotation, to provide more training data
• Mapping to unique identifiers/controlled vocabulary/ontology
  – For each entity type (gene, transcription factor, organism, tissue type...), a tool can provide a mapping to the correct identifier; where there is possible ambiguity, the tool could provide a ranked list for the annotator to choose from
  – A tool can also flag different evidence types, with suggested code(s)

Answers: (4) How to Proceed?
• Stabilize guidelines and redo the inter-annotator agreement experiment (and write up)
• Prepare a Gold Standard data set of expert-annotated data for training new annotators
• Collect a sufficient amount of training data for the various tasks (queue management, document mark-up, automated mapping)
• Develop end-to-end pipeline (in the style of the FlySlip project) to capture whole documents in machine-readable form for mark-up

Recommendations: Training Materials & Tools
• Case studies and gold-standard annotated articles
• On-line training
  – Perhaps with a way for new annotators to test themselves against a set of gold standard annotations
  – This will require automated comparison of annotations for certain fields
• Best tools links
• Tools:
  – Copy mechanism for largely duplicated records
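
The automated comparison of annotations mentioned under on-line training could look roughly like the sketch below. It is only an illustration under stated assumptions: the field names, the dictionary record layout, the placeholder values, and the function names `normalize` and `compare_to_gold` are all hypothetical and do not reflect the actual ORegAnno schema. The comparison is a simple exact match after case and whitespace normalization; a real tool would more likely compare mapped database identifiers.

```python
# Hypothetical field names, loosely based on the entity types listed in the slides.
FIELDS = ("gene", "transcription_factor", "organism", "evidence_type")


def normalize(value):
    """Collapse runs of whitespace and ignore case so trivial differences do not count."""
    return " ".join(str(value).split()).casefold()


def compare_to_gold(trainee, gold, fields=FIELDS):
    """Compare one trainee annotation with the gold-standard annotation.

    Returns a dict of per-field match flags and the fraction of fields that match.
    A field missing from either record is treated as an empty string.
    """
    matches = {
        field: normalize(trainee.get(field, "")) == normalize(gold.get(field, ""))
        for field in fields
    }
    score = sum(matches.values()) / len(fields)
    return matches, score


# Placeholder records, invented purely for illustration.
gold = {
    "gene": "GENE_A",
    "transcription_factor": "TF_1",
    "organism": "Homo sapiens",
    "evidence_type": "EV_1",
}
trainee = {
    "gene": "gene_a",                # differs only in case
    "transcription_factor": "TF_2",  # a genuine disagreement
    "organism": "Homo  sapiens",     # differs only in spacing
    "evidence_type": "EV_1",
}

matches, score = compare_to_gold(trainee, gold)
for field, ok in matches.items():
    print(f"{field}: {'match' if ok else 'MISMATCH'}")
print(f"agreement: {score:.2f}")
```

Reporting per-field flags rather than a single score lets a new annotator see which field disagreed with the gold standard, which fits the self-test use described above.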