Survey
Question #1: Evaluation (ready for final review by others)
Readers: Brad, Stephanie

In your literature review, you discuss the difficulty of evaluating and validating the results of literature-based discovery (LBD). The problem is similar to that of evaluating information retrieval (IR) systems based on the relevance of retrieved items to the initial question. In IR, relevance can be viewed from the user's perspective, and thus evaluation must involve real users with real information needs. On the other hand, TREC-style evaluation provides the large collections and defined tasks (and results) that allow for more uniform evaluation and comparison of system performance. This is generally viewed as a reasonable compromise that has advanced IR technology. The ultimate validation of an instance of LBD is its confirmation by actual experiment and, more stringently, the demonstration that it is interesting and useful. Short of that, methods such as partitioning the literature by date and seeking confirmation, in the newer literature, of discoveries made from the older literature have been used.

1. For your planned method, and for what you consider to be the two best competing methods, identify and discuss two limitations and two advantages for the validation of LBD. In your answer, you might consider aspects such as availability of data, generalizability of results, and "power of persuasion", that is, the ability to convince skeptics that LBD is a legitimate means of discovery.

2. Many research communities have adopted the TREC model of evaluation: creating a large collection of data and setting specific tasks for research systems.
a) Discuss the viability of the TREC model for LBD. Include in your discussion the limitations and advantages from part 1.
b) What would establishing such an effort require on the part of LBD researchers?
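The date-partitioned ("time-sliced") validation mentioned in Question #1 can be illustrated with a minimal sketch. Everything below is hypothetical toy data, not a real corpus or a real LBD system; it only shows the mechanics of discovering A-C links through shared B terms in the pre-cutoff literature and checking for their direct co-occurrence afterward.

```python
# A minimal sketch of time-sliced LBD validation (Swanson's ABC model).
# Corpus, terms, and cutoff are all hypothetical.

# Each article is (year, set of terms mentioned together in it).
corpus = [
    (1984, {"fish oil", "blood viscosity"}),
    (1985, {"blood viscosity", "Raynaud's disease"}),
    (1990, {"fish oil", "Raynaud's disease"}),  # later direct co-occurrence
]

cutoff = 1986  # discover from earlier literature, validate on later

def candidate_pairs(articles):
    """Link A-C term pairs through shared B terms (the ABC model)."""
    pairs = set()
    for _, t1 in articles:
        for _, t2 in articles:
            shared = t1 & t2
            if shared and t1 != t2:
                for a in t1 - shared:
                    for c in t2 - shared:
                        pairs.add(frozenset((a, c)))
    return pairs

earlier = [a for a in corpus if a[0] < cutoff]
later = [a for a in corpus if a[0] >= cutoff]

candidates = candidate_pairs(earlier)
# A candidate is "confirmed" if its terms later co-occur directly.
confirmed = {p for p in candidates
             if any(p <= terms for _, terms in later)}
print(confirmed)
```

The sketch makes the evaluation trade-off concrete: confirmation here is only co-occurrence, a far weaker standard than experimental validation, which is part of what Question #1 asks the candidate to weigh.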
Question #2: Mining and Extraction (Javed, Brad polish, then others review)
Readers: Javed, Brad

a) (Javed to polish) What is the significance of token extraction as it relates to the detection of entities critical to biomedicine? What are some of the state-of-the-art approaches that have been developed for entity detection? Discuss their computational advantages and disadvantages. Why is it necessary to supplement the mining of textual content with other sources of evidence, such as entity associations generated by techniques like BLAST or microarray analysis? Provide some concrete examples. In what ways could an integrated mining approach be developed to improve upon techniques that rely on a single source of evidence?

b) (Brad to polish) Your proposed work uses MeSH terms (chemical names) assigned to articles by expert human Medline indexers. Compare and contrast the discovery of chemical names via MeSH terms with algorithmic discovery of chemical names from full text (or abstracts) via NLM's MetaMap program and Zimmerman's ProMiner system. In your comparison, discuss the assumptions (requirements) of the methods, their cost, performance, scalability (to hundreds of millions of articles), and the strengths and weaknesses of each of the three methods. Motivate why you believe using MeSH provides competitive advantages, or different kinds of results, than using MetaMap or ProMiner for named-entity extraction of chemicals.

My concern with this is knowing when she has answered it well. How would she (or the readers) know what a complete answer looks like? If combined with #8, perhaps it could be refined a bit to better define its scope.
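For the entity-detection contrast in Question #2, the simplest baseline is dictionary matching over tokenized text. The sketch below is a hypothetical toy, not the actual MetaMap or ProMiner pipeline; those systems add curated dictionaries, normalization, and disambiguation on top of this basic idea.

```python
# A minimal sketch of dictionary-based chemical-name tagging, the
# simplest of the entity-detection approaches this question contrasts.
# Lexicon and sentence are hypothetical.
import re

lexicon = {"aspirin", "acetylsalicylic acid", "fish oil"}  # toy dictionary
text = "Patients given acetylsalicylic acid (aspirin) improved."

def tag_entities(text, lexicon):
    """Match dictionary terms against lowercased text, longest first."""
    found = []
    lowered = text.lower()
    for term in sorted(lexicon, key=len, reverse=True):
        for m in re.finditer(re.escape(term), lowered):
            found.append((m.start(), m.end(), term))
    return sorted(found)

print(tag_entities(text, lexicon))
```

Even this toy shows why pure dictionary lookup struggles: it cannot find names absent from the lexicon, which motivates the question's point about supplementing text mining with other evidence sources.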
Question #3: Representation/Methods (Brad to polish, others review)
Readers: Alex, Javed

Coordinating representation.
Domain differences: different terms and representations across different domains (chemistry, biology, computer science, legal (patents)).
Text vs. non-text data: Discuss the potential of Swanson's literature-based discovery model (ABC) in drug research as applied to data sources beyond textual ones. Discuss the integration (in the context of both hypothesis generation and validation) of textual and non-textual data sources in drug discovery, including both primary and undesired (e.g., toxic) side effects.

Terminology: In natural language, ambiguity adds to our richness and creativity of expression, but it can also lead to confusion (intentional or not). We can consult a dictionary or thesaurus to choose a word appropriate to the context, from Felis domesticus (for scientific discourse) to kitty (for conversation with a child). These tools thus support some level of translation among terms, but as with much translation between natural languages, information may be lost in translation. What are the issues involved in translation (or conversion) among different representations for chemicals? In your discussion, you could consider affordances for use, information loss or gain, and policy or legal issues, but you need not limit yourself to these ideas.

Question #4: Data Quality, Curation (Diane to improve/polish, then review by others)
Readers: Diane, Alex

(a) What are the data quality issues with combining information from many sources, some of which may not be high quality? You point out in your review that the databases you are using are of questionable quality, and the quality of any result is heavily dependent on the quality of its inputs. It is not practical to assume that all data is properly curated, and even trying to select the best databases will not necessarily work over time, as their quality may change.
The question, therefore, is: as you use data from these databases, what techniques are available that are more tolerant of incorrect data? In statistics, this is referred to as the problem of mislabeled data; in experimental science, it is the question of dealing with outliers. You may assume that the information in the text sources is correct, as those are well reviewed; it is the massively collected databases that are in question.

(b) What data mining techniques could be helpful in improving data quality and performing curation? What are their strengths and weaknesses?

Extracting information from databases:
(a) As you put data into databases, there are a large number of data mining techniques that can be put to use. Would any of these techniques be helpful in your work?
(b) Also, are there fields within the already stored information, such as chemical compound structure, that would themselves provide useful information?

Question #5: Representation (Diane; Brad to polish, then review by others)
Readers: Stephanie, Diane

Characterize and describe the current structure of scientific literature, and describe how you envision it may change by 2050. Consider things already being discussed and tried, like XXXX, as well as your thoughts on what new forms scientific discourse may take. Be sure to address how textual and non-textual data may be intermingled (or not). Also discuss possible barriers to these new forms of scientific literature (social, technical, legal, global/cultural). Based on your projections for new structures representing scientific literature in 2050, describe what opportunities these new structures might afford LBD techniques in the future.
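One family of outlier-tolerant techniques relevant to Question #4 is robust statistics. The sketch below flags suspect database entries using the median absolute deviation (MAD); the measurements and the 3.5 threshold are hypothetical illustration, not a prescribed method.

```python
# A minimal sketch of outlier flagging with the median absolute
# deviation (MAD), robust to a few mislabeled or corrupted values.
# The data is hypothetical.
import statistics

values = [2.1, 2.0, 2.2, 1.9, 2.1, 9.7]  # one corrupted entry

median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)

# Flag values whose robust z-score exceeds a conventional cutoff;
# 1.4826 scales MAD to the standard deviation under normality.
outliers = [v for v in values if abs(v - median) / (1.4826 * mad) > 3.5]
print(outliers)
```

Unlike mean-and-standard-deviation screening, the median and MAD are barely moved by the corrupted entry, which is exactly the tolerance to bad inputs the question asks about.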