Medical Data Classifier
Undergraduate project
By: Avikam Agur and Maayan Zehavi
Advisors: Prof. Michael Elhadad and Mr. Tal Baumel

Motivation
• word2vec: an algorithm that associates closely related words.
• Combined with the outcome of our project, this algorithm will help create a medical text summarizer.

Project Goals
● Create a fast, scalable, highly accurate, machine-learning-based classifier that predicts whether a given document is medical or not.
● Run this classifier in a distributed fashion over a large amount of web content and extract the medical documents.

Building a labeled dataset
Problem: Manually collecting medical and non-medical documents is almost impossible.
Solution: Using Wikipedia's archive files, we tagged wiki pages based on their category and title.
Result: A decent amount of medical and non-medical data.

Training Phase
Data transformation flow: Documents → Boilerpipe → Tokenizing → TF-IDF → Feature selection

Training Phase
• Supervised learning using 2 classification algorithms:
  o Naive Bayes
  o SVM
• Cleaning the data using:
  o $\chi^2$ feature selection
  o Stemming

Classifier Evaluation - Measures
$F\text{-measure} = 2 \cdot \dfrac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

Evaluation Phase
Configuration parameters:
• Classification algorithm
• Number of features
• Stemming or not
Each configuration was trained and then tested on a random 5% of the tagged dataset.

Results - Graph
[Figure: average F-measure (y-axis, 0.77 to 0.87) against feature count (x-axis) for Naive Bayes and SVM]

Distributed Programming Phase
• Use the Apache Spark framework
• Iterate over the ClueWeb web archives (~14 TB) in a master-slave architecture
• Use the same training pipeline to convert each web document to a vector
• Tag each document and export the IDs of the documents tagged as medical.

Questions?
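
Appendix sketch: TF-IDF and F-measure
The TF-IDF step of the data transformation flow and the F-measure used in the evaluation phase can be illustrated with a minimal Python sketch. This is not the project's implementation. The function names `tf_idf` and `f_measure`, the label string "medical", and the toy documents are all assumptions made for illustration, and the slides do not state which TF-IDF variant was used, so a common one (relative term frequency times log inverse document frequency) is assumed here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: (term count / document length) * log(N / document
    frequency). The slides do not say which variant the project used.
    """
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def f_measure(true_labels, predicted_labels, positive="medical"):
    """F-measure = 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up data: three tokenized documents and four label pairs.
docs = [["heart", "disease"], ["football", "match"], ["heart", "match"]]
vectors = tf_idf(docs)

true_labels = ["medical", "medical", "medical", "other"]
predicted = ["medical", "medical", "other", "medical"]
score = f_measure(true_labels, predicted)
print(round(score, 3))  # 0.667 (precision = recall = 2/3)
```

In the toy data, "disease" appears in only one document while "heart" appears in two, so "disease" gets the higher weight in the first document. That weighting is what makes the later feature selection step meaningful.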