Programmatic Reading and Compilation of Biomarker Research Data using Natural Language Processing and Machine Learning
¹Varun Tandon, ²Cameron Baab, ³Mythri Ambatipudi, ⁴Ronni Kurzion and ⁵Dr. Parag Mallick
¹Homestead High School, Cupertino, CA. ²Menlo School, Atherton, CA. ³Saint Francis High School, Mountain View, CA. ⁴Mountain View High School, Mountain View, CA. ⁵Stanford University, Stanford, CA.
2016 Internship Program supported by:
Introduction
Currently, the vast majority of biomarker research is reported by the scientific community in the form of research papers. While this format presents information in a manner that humans can read, it makes it difficult to collect important information across many papers. It would therefore be extremely beneficial to researchers if this information were presented in a structured format from which data could be easily culled.
The goal of this project is to create a computational approach that programmatically reads millions of published research papers, extracts relevant biomarker information (associated disease, medium, type, levels in patients, etc.) and presents it concisely in an easily searchable database accessible through the Markerville website.
Methods
Pipeline: Text → Candidate Extraction → Scoring → Evaluate Accuracy → Machine Learning → Database Output
Candidate Extraction
The tumor marker CA-125 is most highly associated with high-grade serous carcinomas, the most prevalent and
aggressive ovarian cancer subtype, with other subtypes exhibiting only limited expression [18-20].
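As a rough illustration of candidate extraction, the sketch below pairs every biomarker mention with every disease mention found in the same sentence. The term lists `BIOMARKERS` and `DISEASES` and the functions `find_mentions` and `extract_candidates` are hypothetical stand-ins chosen for this example; they are not the dictionaries or code used in the project.

```python
import re
from itertools import product

# Hypothetical term lists; a real run would need far larger dictionaries.
BIOMARKERS = ["CA-125", "AGR2"]
DISEASES = ["ovarian cancer", "prostate cancer", "high-grade serous carcinomas"]


def find_mentions(sentence, terms):
    """Return (term, start_offset) for each case-insensitive occurrence."""
    mentions = []
    for term in terms:
        for match in re.finditer(re.escape(term), sentence, flags=re.IGNORECASE):
            mentions.append((term, match.start()))
    # Sort by position so candidates come out in reading order.
    return sorted(mentions, key=lambda m: m[1])


def extract_candidates(sentence):
    """Pair each biomarker mention with each disease mention in the sentence."""
    biomarkers = find_mentions(sentence, BIOMARKERS)
    diseases = find_mentions(sentence, DISEASES)
    return [(b[0], d[0]) for b, d in product(biomarkers, diseases)]


sentence = (
    "The tumor marker CA-125 is most highly associated with high-grade serous "
    "carcinomas, the most prevalent and aggressive ovarian cancer subtype."
)
candidates = extract_candidates(sentence)
```

Each pair is only a candidate at this stage; whether the sentence actually asserts a relation between the two mentions is left to the scoring step.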
Scoring
AGR2 is linked with prostate cancer.
Distance: 4
Output: 0
Access to prostate cancer sample tissue for testing was limited, and testing for the biomarker AGR2 was difficult.
Distance: 15
Output: -1
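The two examples above can be read as a distance-based labeling function. The exact distance measure and cutoff are not given here, so the sketch below makes several assumptions: distance is the number of whitespace-separated tokens between the start of the biomarker mention and the start of the disease mention, the cutoff `MAX_DISTANCE` is a made-up value of 10, an output of 0 means "abstain", and an output of -1 means "no relation". The function names are hypothetical.

```python
MAX_DISTANCE = 10  # assumed cutoff; the value actually used is not stated


def token_distance(sentence, biomarker, disease):
    """Whitespace-token distance between the starts of the two mentions.

    Raises StopIteration if either mention is missing from the sentence.
    """
    tokens = sentence.split()
    disease_first = disease.split()[0]
    b_idx = next(i for i, t in enumerate(tokens) if t.startswith(biomarker))
    d_idx = next(i for i, t in enumerate(tokens) if t.startswith(disease_first))
    return abs(b_idx - d_idx)


def lf_distance(sentence, biomarker, disease):
    """Vote -1 when the mentions are far apart, otherwise abstain with 0."""
    if token_distance(sentence, biomarker, disease) > MAX_DISTANCE:
        return -1
    return 0


near = "AGR2 is linked with prostate cancer."
far = (
    "Access to prostate cancer sample tissue for testing was limited, "
    "and testing for the biomarker AGR2 was difficult."
)
near_label = lf_distance(near, "AGR2", "prostate cancer")
far_label = lf_distance(far, "AGR2", "prostate cancer")
```

Under these assumptions the first sentence has a distance of 4 and an output of 0, and the second has an output of -1, matching the outputs shown above. The second distance comes out as 13 with plain whitespace tokenization rather than the 15 shown, so the original tokenizer or counting rule evidently differed.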
Results
Figure: Sentence Statistics (pie chart of irrelevant sentences, sentences without a relation, and sentences with a relation)
Conclusions and Further Directions
Literature cited
1. Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther 69, 89–95 (2001).
2. Pavlou, M. P., Diamandis, E. P. & Blasutig, I. M. The long journey of cancer biomarkers from the bench to the clinic. Clinical Chemistry 59, 147–157 (2013).
3. Kani, K., Malihi, P. D., Jiang, Y., Wang, H., Wang, Y., Ruderman, D. L., Agus, D. B., Mallick, P., Gross, M. E. Anterior Gradient 2 (AGR2): Blood-Based Biomarker Elevated in Metastatic Prostate Cancer Associated With the Neuroendocrine Phenotype. The Prostate, 306–315 (2013).
4. "DeepDive." DeepDive. Stanford University, n.d. Web. 13 June 2016. <http://deepdive.stanford.edu/>.
5. HazyResearch. DeepDive Lite. GitHub, 2016. Web. 13 June 2016. <https://github.com/HazyResearch/ddlite>.
6. Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60.
Using a combination of sentence structure labeling functions and positive and negative keyword labeling functions, we extracted the following: biomarker names, disease associations, associated drugs, biomarker levels, medium and type.
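To make the idea of positive and negative keyword labeling functions concrete, here is a minimal sketch. The keyword sets, the function names, and the simple summed vote are all assumptions made for illustration. In DeepDive and Snorkel the labeling function outputs are combined by a learned model rather than by a plain sum.

```python
import re

# Hypothetical keyword sets; the lists actually used are not given here.
POSITIVE_KEYWORDS = {"associated", "linked", "elevated"}
NEGATIVE_KEYWORDS = {"limited", "difficult", "unrelated"}


def _words(sentence):
    """Lowercased set of word-like tokens in the sentence."""
    return set(re.findall(r"[a-z0-9-]+", sentence.lower()))


def lf_positive_keywords(sentence):
    """Vote 1 if any positive keyword appears, otherwise abstain with 0."""
    return 1 if _words(sentence) & POSITIVE_KEYWORDS else 0


def lf_negative_keywords(sentence):
    """Vote -1 if any negative keyword appears, otherwise abstain with 0."""
    return -1 if _words(sentence) & NEGATIVE_KEYWORDS else 0


def combined_vote(sentence):
    """Sum the votes; a stand-in for the learned combination step."""
    lfs = (lf_positive_keywords, lf_negative_keywords)
    return sum(lf(sentence) for lf in lfs)


related = "AGR2 is linked with prostate cancer."
unrelated = (
    "Access to prostate cancer sample tissue for testing was limited, "
    "and testing for the biomarker AGR2 was difficult."
)
related_vote = combined_vote(related)
unrelated_vote = combined_vote(unrelated)
```

With these assumed keyword sets, the first sentence receives a positive vote and the second a negative one.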
After manually selecting approximately one hundred papers relating to cancer biomarkers, we were able to process these papers and extract the key information with an average of 75% confidence.
The true benefit of our program lies in extracting information from papers that are not specifically about biomarkers. We hope to optimize the speed of our code so that it can process the entirety of PubMed Central (3.9 million research papers).
In addition, we plan to optimize our SQL code, which inputs data into a database that is accessible and searchable through the Markerville website.
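One common way to speed up database loading is to insert rows in batches inside a single transaction. The sketch below shows that pattern with Python's built-in `sqlite3` module. The table name `biomarker_relation`, its columns, and the example rows are placeholders invented for this illustration; the actual Markerville schema and database engine are not described here.

```python
import sqlite3


def create_table(conn):
    """Create a hypothetical table for extracted biomarker relations."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS biomarker_relation (
               biomarker  TEXT NOT NULL,
               disease    TEXT NOT NULL,
               medium     TEXT,
               confidence REAL,
               UNIQUE (biomarker, disease, medium)
           )"""
    )


def insert_relations(conn, rows):
    """Insert many rows in one transaction, skipping exact duplicates."""
    conn.executemany(
        "INSERT OR IGNORE INTO biomarker_relation VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def diseases_for(conn, biomarker):
    """Return the diseases recorded for a biomarker, in alphabetical order."""
    cur = conn.execute(
        "SELECT disease FROM biomarker_relation WHERE biomarker = ? "
        "ORDER BY disease",
        (biomarker,),
    )
    return [row[0] for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")
create_table(conn)
# Placeholder rows for illustration only, not project results.
insert_relations(
    conn,
    [
        ("AGR2", "prostate cancer", "blood", 0.75),
        ("CA-125", "ovarian cancer", None, 0.75),
        ("AGR2", "prostate cancer", "blood", 0.75),  # duplicate, ignored
    ],
)
agr2_diseases = diseases_for(conn, "AGR2")
```

Parameterized queries (the `?` placeholders) keep extracted text from being interpreted as SQL, which matters when the values come from arbitrary paper content.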
Although the international scientific community is beginning to establish a format for research papers that will allow computers to easily understand key findings, our program allows for the extraction of data from the millions of papers written before the establishment of this format.
Acknowledgements
We would like to thank Dr. Parag Mallick for guiding us throughout this project. We would also like to thank Theodoros Rekatsinas, Alex Ratner and Steven Bach for their help with DeepDive and Snorkel.