Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Programmatic Reading and Compilation of Biomarker Research Data using Natural Language Processing and Machine Learning 1Varun 2016 Internship Program supported by: 1Homestead 2Cameron Tandon, Baab, Ambatipudi, 5 Kurzion and Dr. Parag Mallick 4Ronni High School, Cupertino, CA. 2Menlo School, Atherton, CA. 3Saint Francis High School, Mountain View, CA. 4Mountain View High School, Mountain View, CA. 5Stanford University, Stanford, CA. Introduction Currently the vast majority of research pertaining to biomarkers is reported by the scientific community in the form of research papers. While this format presents information in a manner that can be read by humans, it makes collecting important information found from many research papers difficult. Thus, it would be extremely beneficial to researchers if this information was presented in a structured format through which data could be easily culled. The goal of this project is to create a computational approach to programmatically read millions of published research papers, extract relevant biomarker information (associated disease, medium, type, levels in patients, etc.) and present it concisely in the easily searchable database stored in the Markerville website. 3Mythri Methods Text Candidate Extraction Scoring Evaluate Accuracy Machine Learning Database Output Candidate Extraction The tumor marker CA-125 is most highly associated with high-grade serous carcinomas, the most prevalent and aggressive ovarian cancer subtype, with other subtypes exhibiting only limited expression [18-20]. Scoring AGR2 is linked with prostate cancer. Distance: 4 Output: 0 Access to prostate cancer sample tissue for testing was limited, and testing for the biomarker AGR2 was difficult. Distance: 15 Output: -1 Results Conclusions and Further Directions Sentence Statistics Irrelevant Sentences Sentences without a relation Sentences with a relation Literature cited 1. Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther 69, 89–95 (2001). 2. Pavlou, M. P., Diamandis, E. P. & Blasutig, I. M. The long journey of cancer biomarkers from the bench to the clinic. Clinical Chemistry 59, 147–157 (2013). 3. Kani, K., Malihi, P. D., Jiang, Y., Wang, H., Wang, Y., Ruderman, D. L., Agus, D. B., Mallick, P., Gross, M. E. Anterior Gradient 2 (AGR2): BloodBased Biomarker Elevated in Metastatic Prostate Cancer Associated With the Neuroendocrine Phenotype. The Prostate, 306-315 (2013). 4."DeepDive." DeepDive. Stanford University, n.d. Web. 13 June 2016. <http://deepdive.stanford.edu/>. 5. HazyResearch. DeepDive Lite. Github, 2016. Web. 13 June 2016. <https://github.com/HazyResearch/ddlite> 6. Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60. Using a combination of sentence structure labeling functions and positive and negative keyword labeling functions, we extracted the following: biomarker names, disease associations, associated drugs, biomarker levels, medium and type. After manually selecting approximately one hundred papers relating to cancer biomarkers we were able to process these papers and extract the key information with am average of 75% confidence. The true benefit of our program is extracting information from papers that are not specifically about biomarkers, and we hope to optimize the speed of our code so it is able to process the entirety of Pubmed Central (3.9 million research papers) In addition, we plan on optimizing our SQL code, which is responsible for inputting data into a database, which is accessible and searchable through the Markerville website. Although the international scientific community is beginning to establish a format for research papers that will allow computers to easily understand key findings, our program allows for the extraction of data from millions of papers written before the establishment of this format. Acknowledgements We would like to thank Dr. Parag Mallick for guiding us throughout this project. Would would also like to thank Theodoros Rekatsinas, Alex Ratner and Steven Bach for their help with DeepDive and Snorkel.