Baker Comps Questions

Question #1 Evaluation (ready for final review by others)
Readers (Brad, Stephanie)
In your literature review, you discuss the difficulty of evaluating and validating the results of literature-based discovery (LBD). The problem is similar to that of evaluating information retrieval systems based
on relevance of retrieved items to the initial question. In IR, relevance can be viewed from the user
perspective, and thus evaluation must involve real users with real information needs. On the other
hand, TREC-style evaluation provides the large collections and defined tasks (and results) that allow for
more uniform evaluation and comparison of system performance. In general, this is viewed as a
reasonable compromise that has advanced IR technology.
The ultimate validation of an instance of LBD is its confirmation by actual experiment, and even more
stringently, a demonstration that the discovery is interesting and useful. Short of that, methods such as
partitioning the literature by date and seeking confirmation, in the newer literature, of discoveries
generated from the older literature have been used.
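The time-slicing validation mentioned above can be made concrete with a short sketch. Everything here is illustrative: the corpus, the cutoff date, and the `extract_pairs` discovery procedure are hypothetical placeholders, not any particular candidate's method.

```python
def time_sliced_validation(documents, cutoff, extract_pairs):
    """Partition a corpus by date, then check whether discoveries
    proposed from the older slice are confirmed in the newer slice.

    documents:     iterable of (publication_date, text) tuples
    cutoff:        date separating the "older" and "newer" slices
    extract_pairs: function mapping a list of texts to a set of
                   candidate (term_a, term_c) discovery pairs
    """
    older = [text for d, text in documents if d < cutoff]
    newer = [text for d, text in documents if d >= cutoff]

    candidates = extract_pairs(older)   # proposed from old literature
    confirmed = extract_pairs(newer)    # pairs actually linked later

    hits = candidates & confirmed
    precision = len(hits) / len(candidates) if candidates else 0.0
    return hits, precision
```

The returned precision is only one possible figure of merit; recall against the newer slice, or expert review of the hit list, would be equally defensible.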
1. For your planned method, and what you consider to be the two strongest competing methods,
identify and discuss two limitations and two advantages of LBD validation. In your answer, you might
consider aspects such as availability of data, generalizability of results, and "power of persuasion", that
is, the ability to convince skeptics that LBD is a legitimate means of discovery.
2. Many research communities have adopted the TREC model of evaluation: creating a large collection
of data and setting specific tasks for research systems.
a) Discuss the viability of the TREC model for LBD. Include in your discussion consideration of the
limitations and advantages from part 1.
b) What does establishing such an effort require on the part of LBD researchers?
Question #2 Mining and Extraction (Javed, Brad polish, then others review)
Readers (Javed, Brad)
a) (Javed to polish)
What is the significance of token extraction as it relates to detection of entities critical to biomedicine?
What are some of the state-of-the-art approaches that have been developed for entity detection?
Discuss their computational advantages and disadvantages. Why is it necessary to supplement mining
of textual content with other sources of evidence, such as entity associations generated by
techniques like BLAST or microarray analysis? Provide some concrete examples. What are the
potential ways an integrated mining approach could be developed to improve upon techniques that rely
on a single source of evidence?
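One shape an integrated approach could take is late fusion: score entity associations separately per evidence source (text co-occurrence, sequence similarity, expression correlation) and combine them into a single ranking. The sketch below is purely illustrative; the source names, weights, and scores are invented for the example.

```python
def fuse_evidence(scores_by_source, weights):
    """Combine association scores for entity pairs from multiple
    evidence sources into one weighted ranking.

    scores_by_source: {source_name: {pair: score in [0, 1]}}
    weights:          {source_name: non-negative weight}
    """
    fused = {}
    total = sum(weights.values())
    for source, scores in scores_by_source.items():
        w = weights.get(source, 0.0) / total
        for pair, s in scores.items():
            fused[pair] = fused.get(pair, 0.0) + w * s
    # rank pairs by fused score, highest first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

The point of the illustration: a pair ranked low by text mining alone can be promoted when an independent source (here, a hypothetical BLAST-derived score) supports it.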
b) (Brad polish) Your proposed work utilizes MeSH-defined terms (chemical names) assigned to articles
by expert human Medline indexers. Compare and contrast discovery of chemical names via MeSH terms
with algorithmic discovery of chemical names from full text (or abstracts) via NLM’s MetaMap program
and Zimmerman’s ProMiner system. When you compare and contrast them, discuss assumptions
(requirements of the methods), cost, performance, scalability (to hundreds of millions of articles), and
strengths and weaknesses of each of the three methods. Motivate why you believe using MeSH provides
competitive advantages, or different types of results, than using MetaMap or ProMiner for named entity
extraction of chemicals.
My concern with this is knowing when she has answered it well. How would she (or the readers) know
what a complete answer would look like? If combined with #8, perhaps it could be refined a bit to better
define its scope.
Question #3 Representation/Methods (Brad to polish, others review)
Readers: Alex, Javed
Coordinating representation
 Domain differences: Different terms and representations across different domains (chemistry,
biology, computer science, legal (patents))
 Text vs non-text data: Discuss the potential of Swanson's literature-based discovery model
(ABC) in drug research as applied to data sources beyond textual ones. Discuss the integration
(in the context of both hypothesis generation and validation) of textual and non-textual data
sources in drug discovery, including both primary effects and undesired (e.g., toxic) side effects.
 Terminology: In natural language, ambiguity adds to our richness and creativity of expression,
but can also lead to confusion (intentional or not). We can consult a dictionary or thesaurus to
choose a word that is appropriate to the context, from felis domesticus (for scientific discourse)
to kitty (for conversation with a child). These tools thus support some level of translation among
terms, but as with much translation between natural languages, information may be lost in
translation. What are the issues involved in translation (or conversion) among different forms of
representations for chemicals? In your discussion, you could consider affordances for use,
information loss or gain, and policy or legal issues, but you don't need to limit yourself to these
ideas.
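For reference, the ABC pattern invoked in the second bullet can be stated compactly: if the literature links A to B and B to C but never A to C directly, then (A, C) is a candidate hypothesis. The sketch below uses a bag-of-terms co-occurrence representation purely for illustration; it is not a proposal for how the questions above should be answered.

```python
from collections import defaultdict
from itertools import combinations

def abc_candidates(documents, a_terms, c_terms):
    """Swanson-style ABC discovery over a bag-of-terms corpus:
    propose (a, c) pairs connected via a shared B-term but never
    co-mentioned directly in the same document."""
    cooccur = defaultdict(set)  # term -> set of co-occurring terms
    for doc in documents:
        for t1, t2 in combinations(set(doc), 2):
            cooccur[t1].add(t2)
            cooccur[t2].add(t1)

    candidates = set()
    for a in a_terms:
        for c in c_terms:
            if c in cooccur[a]:
                continue  # already directly linked: not a discovery
            bridges = cooccur[a] & cooccur[c]
            if bridges:
                candidates.add((a, c, frozenset(bridges)))
    return candidates
```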
Question #4 Data Quality, Curation (Diane to improve/polish, then review by others)
READER (Diane, Alex)
(a) What are the data quality issues with combining information from many sources, some of which
may not be high quality?
You point out in your review that the databases you are using are of questionable quality, and the quality of any
result is heavily dependent on the quality of its inputs. It is not practical to assume that all data is properly curated, and even
trying to select the best databases will not necessarily work over time, as the quality may change. The question therefore is: as
you use data from these databases, what techniques are available that will be more tolerant of incorrect data? In statistics, this
is referred to as the problem of mislabeled data; in experimental science, it is the question of dealing with outliers. You
should be able to assume that the information in the text sources is correct, as those are well reviewed. It is the massively
collected databases that are the question.
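As one concrete instance of the error-tolerant techniques the question asks about, a trimmed mean discards the extreme tails before averaging, so a few grossly wrong (mislabeled or outlying) values cannot dominate the estimate. A minimal sketch, not tied to any particular database:

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding the lowest and highest trim_fraction
    of values -- a simple statistic that tolerates a modest number
    of gross errors (outliers or mislabeled entries)."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)
```

With one corrupted entry of 1000 among values near 10, the plain mean is pulled to roughly 200 while the trimmed mean stays near 10.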
(b) What data mining techniques could be helpful in improving data quality and performing curation?
What are their strengths and weaknesses?
Extracting information from databases: (a) As you put data into databases, there are a large number of data mining techniques
that can be put to use. Are there any of these techniques that will be helpful in your work? (b) Also, are there fields within the
already stored information, such as chemical compound structure, that in themselves would provide useful information that
could be used?
Question #5 Representation (Diane; Brad to polish, then review by others)
Readers: Stephanie, Diane
Characterize and describe the current structure of scientific literature, and how you envision it
may change by 2050. Consider things already being discussed and tried, like XXXX, as well as your
thoughts on what new forms scientific discourse may take. Be sure to address how textual and
non-textual data may be intermingled (or not). Also, discuss possible barriers to these new forms of
scientific literature (social, technical, legal, global/cultural). Based on your projections for new structures for
representing scientific literature in 2050, describe what opportunities these new structures might afford
LBD techniques in the future.