I have put together a possible set of questions and readers, based on input from Alex, Stephanie and
myself. I put questions in DEFINITE if at least two of us said “Should” include. We combined #4 and #8.
I discarded my question on open databases, which nobody (including me) voted for. You are
welcome to advocate for including or excluding questions. In particular, I think the two main options
to consider would be
- Swapping in Q1 for Q2
- Swapping in Q7 for Q9
I’d like to hear from Javed and Diane. I’ll update this when I do. Again, we’ll meet Friday morning at
8:30am here at Manning Hall in room 214. Before then, it would be good for Alex and Diane to mesh
their questions #4 and #8 together into one question, and for Alex to flesh out Q3.
Thanks, Brad
DEFINITE
Question #2 Entity representation (notation): chemicals, genes, proteins, etc. (Stephanie)
Readers: Stephanie, Brad
Chemicals, genes, proteins and similar entities can be represented in a number of forms, ranging from
in-line text through graphical, hybrid, and even 3-D animations. In part, they have arisen because of the
variety of uses for chemical representations. Some are well suited for human interpretation or printing
in journal articles; others are useful for computer manipulation such as searching and matching. Some
are used for communication with non-specialists (e.g., aspirin); others are designed to hide information,
for example, in a patent application.
In natural language, ambiguity adds to our richness and creativity of expression, but can also lead to
confusion (intentional or not). We can consult a dictionary or thesaurus to choose a word that is
appropriate to the context, from felis domesticus (for scientific discourse) to kitty (for conversation with
a child). These tools thus support some level of translation among terms, but as with much translation
between natural languages, information may be lost in translation.
A. What are the issues involved in translation (or conversion) among different forms of representations
for chemicals? In your discussion, you could consider affordances for use, information loss or gain, and
policy or legal issues, but you don't need to limit yourself to these ideas.
B. Returning to natural language translation, one model of machine translation is the interlingua model.
In this theoretical model, there is a language-independent representation of meaning. All languages can
be translated into interlingua, and all can be translated from it, with no loss of meaning. <diagram, if it
would be useful> Would the interlingua model be useful in chemical representation? Why or why not?
What existing representation, in your opinion, comes closest to an interlingua? Describe the interlingual
features it has, and those it lacks.
I'd be willing to be first or second reader on this, if it's included.
Question #5 Named Entity Extraction (Brad)
Readers: Brad, Alex
Your proposed work utilizes MeSH-defined terms (chemical names) assigned to articles by expert human
Medline indexers. Compare and contrast discovery of chemical names via MeSH terms with algorithmic
discovery of chemical names from full text (or abstracts) via NLM’s MetaMap program and
Zimmerman’s ProMiner system. In your comparison, discuss the assumptions
(requirements of the methods), cost, performance, scalability (to hundreds of millions of articles),
and strengths and weaknesses of each of the three methods. Explain why you believe using MeSH provides
competitive advantages, or different types of results, compared with using MetaMap or ProMiner for named entity
extraction of chemicals. (I like this because it goes to the core of her argument: whether using human-annotated
information or automatic discovery from text is the best solution in the long term.)
Question #9 (Diane)
Readers: Diane, Stephanie
In many cases, scientific literature is more structured than other forms of writing. Are there ways to
take advantage of this reduced diversity to improve natural language processing techniques? Are there
techniques that have shown themselves to work better in such an environment? (I like this as a discussion
topic too, although I think it would have to be fleshed out a fair bit more.)
I think we could work with this. I'd be willing to be second reader on it.
COMBINATION OF #4 and #8
Readers: Javed, Diane
Question #4 (Javed)
What is the significance of token extraction as it relates to detection of entities critical to biomedicine?
What are some of the state-of-the-art approaches that have been developed for entity detection?
Discuss their computational advantages and disadvantages. Why is it necessary to supplement mining
of textual content with other sources of evidence such as entity associations generated based on
techniques such as BLAST or micro-array analysis? Provide some concrete examples. What are the
potential ways an integrated mining approach could be developed to improve upon techniques that rely
on a single source of evidence?
My concern with this is knowing when she's answered it well. How would she (or the readers) know
what a complete answer would look like? If combined with #8, perhaps it could be refined a bit to better
define its scope.
Question #8 (Diane)
Extracting information from databases: (a) As you put data into databases, there are a large number of
data mining techniques that can be put to use. Are there any of these techniques that will be helpful in
your work? (b) Also, are there fields within the already stored information, such as chemical compound
structure, that in themselves would provide useful information? (I think this overlaps with Question #4
and could potentially be combined with it.)
Yes, combining with 4 might work well.
Question #3 (Alex)
Readers: Alex, Javed
Discuss the potential of Swanson's literature-based discovery model (ABC) in drug research as applied to
data sources beyond textual ones. Discuss the integration (in the context of both hypothesis generation
and validation) of textual and non-textual data sources in drug discovery including both primary and
undesired (e.g., toxic) side effects. Do you want to prompt with some specific types of non-text data
sources (microarrays, etc.)?
I think this is a good question, perhaps with some refinement.
POSSIBLE
Question #1 Evaluating LBD (Stephanie)
In your literature review, you discuss the difficulty of evaluating and validating the results of
literature-based discovery (LBD). The problem is similar to that of evaluating information retrieval systems based
on relevance of retrieved items to the initial question. In IR, relevance can be viewed from the user
perspective, and thus evaluation must involve real users with real information needs. On the other
hand, TREC-style evaluation provides the large collections and defined tasks (and results) that allow for
more uniform evaluation and comparison of system performance. In general, this is viewed as a
reasonable compromise that has advanced IR technology.
The ultimate validation of an instance of LBD is its confirmation by actual experiment and, even more
stringently, a demonstration that it is interesting and useful. Short of that, methods such as partitioning the literature by
date, and seeking confirmation in the newer literature of discoveries made from the older literature, have been
used.
1. Identify and discuss two limitations and two advantages of current methods of validation of LBD. In
your answer, you might consider aspects such as availability of data, generalizability of results, and
"power of persuasion", that is, ability to convince skeptics that LBD is a legitimate means of discovery.
(My question was for her to detail the limitations and advantages of her proposed evaluation method and of
what she thought were the best three other methods described in papers in her lit review, and then
contrast the four methods.)
Your version of part 1 would be fine, Brad, but rather than working with 4 methods (including hers), what
about 3? I'd rather she work on depth rather than breadth, especially if the question includes part 2.
2. Many research communities have adopted the TREC model of evaluation: creating a large collection
of data and setting specific tasks for research systems.
a) Discuss the viability of the TREC model for LBD. Include in your discussion consideration of the
limitations and advantages from part 1.
b) What does establishing such an effort require on the part of LBD researchers?
I'd be willing to be first or second reader on this.
Question #7 (Diane)
You point out in your review that there is some questionable quality in the databases that you are using
and the quality of any result is heavily dependent on the quality of its inputs. It is not practical to
assume that all data is properly curated and even trying to select the best databases will not necessarily
work over time as the quality may change. The question therefore is: As you use data from these
databases, what techniques are available that will be more tolerant of incorrect data? In statistics, this
is referred to as the problem of mislabeled data; in experimental science, it is the question of dealing
with outliers. You should be able to assume that the information in the text sources is correct, as
those are well reviewed. It is the massively collected databases that are in question. (Good question,
but maybe not as closely tied to Nancy’s work, since she is mainly capitalizing on human-indexed MeSH
terms assigned to articles as opposed to the massively collected databases.)
But I like this question because the problem of data quality is central to any large collection, especially
when working with a variety of mining techniques.
DISCARD
Question #6 Chemical Databases and Open Science (Brad)
Discuss the status of PubChem. How widely is it currently used, and by what communities/groups? For
what purposes? Compare the functionality of PubChem with CAS. Be sure to include data, functions, or
services that CAS provides that PubChem does not currently (and vice versa). If scientists were to stop
using CAS tomorrow, and replace all usage of chemical names in their lab work and paper writing with
PubChem names, what effects would this have? Be thorough, and include not just technical and
workflow issues, but social, community, sharing, and other effects. (I like this as a discussion topic, but
maybe not central to her work).
I agree; I don't think it's as salient as the others.