Medical Data Classifier
undergraduate project
By: Avikam Agur and Maayan Zehavi
Advisors: Prof. Michael Elhadad and Mr. Tal Baumel
Motivation
• word2vec: an algorithm that associates
closely related words.
• Combined with the outcome of our project, this
algorithm will help create a medical text
summarizer.
Project Goals
● Create a fast, scalable, highly accurate,
machine-learning based classifier which
predicts whether a given document is medical
or not.
● Distributively run this classifier over a large
amount of web content and extract medical
documents.
Building a labeled dataset
Problem: Manually collecting medical and non-medical documents is almost impossible.
Solution: Using Wikipedia’s archive files, we
tagged wiki pages based on their category and title.
Result: a decent amount of medical and non-medical
data.
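The tagging step can be sketched roughly as follows. This is only an illustration: the slides do not give the actual rules, so the keyword list `MEDICAL_HINTS` and the function `label_page` are hypothetical stand-ins for whatever category and title criteria the project used.

```python
# Hypothetical keyword list; the project's real tagging rules are not given in the slides.
MEDICAL_HINTS = ("medicine", "disease", "anatomy", "drug")


def label_page(title, categories, hints=MEDICAL_HINTS):
    """Label a wiki page 'medical' if its title or any category contains a hint."""
    text = " ".join([title] + list(categories)).lower()
    if any(hint in text for hint in hints):
        return "medical"
    return "non-medical"


# Toy pages as (title, categories) pairs, invented for this example.
pages = [
    ("Influenza", ["Infectious diseases", "Viral respiratory tract infections"]),
    ("Eiffel Tower", ["Towers in Paris", "Landmarks in France"]),
]
labels = [label_page(title, cats) for title, cats in pages]
```

Substring matching of this kind is cheap enough to run over a full Wikipedia dump, but it will mislabel some pages, so the resulting dataset should be treated as noisy.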
Training Phase
Data transformation flow:
Documents → Boilerpipe → Tokenizing → TF-IDF → Feature selection
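A minimal sketch of the tokenizing and TF-IDF steps, assuming the boilerplate has already been stripped from each document (the Boilerpipe step is not reproduced here). The function names and the exact weighting (term frequency times log of inverse document frequency) are assumptions; the slides do not say which TF-IDF variant was used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["The patient has a fever", "The tower has a view"]
vectors = tfidf_vectors(docs)
```

Terms that occur in every document (here "the", "has", "a") get weight zero, which is the property that makes TF-IDF useful as input to feature selection.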
Training Phase
• Supervised learning using 2 classification
algorithms:
o Naive Bayes
o SVM
• Cleaning the data using:
o χ² feature selection
o Stemming
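As an illustration of χ² feature selection, the sketch below scores each term with the standard 2×2 chi-square statistic (term present or absent versus medical or non-medical) and keeps the top `k` terms. The helper names and the toy data are assumptions, and stemming is not shown.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: medical docs containing the term      b: non-medical docs containing it
    c: medical docs without the term         d: non-medical docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_terms(docs, labels, k):
    """docs: list of token sets; labels: list of bools (True = medical)."""
    vocab = set().union(*docs)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y and term in doc)
        b = sum(1 for doc, y in zip(docs, labels) if not y and term in doc)
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


toy_docs = [{"fever", "patient"}, {"patient", "dose"}, {"tower", "view"}, {"view", "city"}]
toy_labels = [True, True, False, False]
top_terms = select_top_terms(toy_docs, toy_labels, 2)
```

Note that the statistic is symmetric: a term that appears only in non-medical documents scores as highly as one that appears only in medical documents, since both are informative for the classifier.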
Classifier Evaluation - Measures
$$F\text{-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
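The F-measure can be computed from raw counts as in the sketch below, which assumes the usual definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name and the zero-division handling are choices made for this example.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from confusion counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 8 medical documents found, 2 false alarms, 2 medical documents missed:
# precision and recall are both 0.8, so the F-measure is also about 0.8.
score = f_measure(8, 2, 2)
```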
Evaluation Phase
Configuration Parameters:
• Classification algorithm
• Number of features
• Whether stemming is applied
Each configuration was trained and then tested on
a random 5% of the tagged dataset.
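The evaluation setup can be sketched as a grid over the three parameters plus a random 5% hold-out. The feature counts listed here are placeholders, since the slides do not state the values tried, and `split_holdout` is a hypothetical helper.

```python
import itertools
import random


def split_holdout(items, test_fraction=0.05, seed=0):
    """Shuffle the items and hold out a random fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


ALGORITHMS = ("naive_bayes", "svm")
FEATURE_COUNTS = (1000, 5000, 10000)  # placeholder values, not from the slides
STEMMING = (True, False)

# Every combination of algorithm, feature count and stemming flag.
configurations = list(itertools.product(ALGORITHMS, FEATURE_COUNTS, STEMMING))

# Document IDs 0..199 stand in for the tagged dataset.
train_ids, test_ids = split_holdout(range(200))
```

Each configuration would then be trained on `train_ids` and scored on `test_ids`; repeating the split with different seeds and averaging is one way to arrive at an average F-measure per configuration.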
Results - Graph
[Figure: average F-measure (y-axis, 0.77 to 0.87) versus feature count (x-axis), for Naive Bayes and SVM]
Distributed Programming Phase
• Use the Apache Spark framework
• Iterate over the ClueWeb web archives (~14 TB)
in a master-slave architecture
• Use the same pipeline as in training to convert
each web document to a vector
• Tag each document and export the IDs of the medical-tagged documents.
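The per-partition tagging logic might look like the sketch below. It is written as a plain generator so it can run without a cluster; `tag_partition`, the toy vectorizer and the toy classifier are all hypothetical stand-ins for the trained pipeline, and the record format `(doc_id, raw_text)` is an assumption.

```python
def tag_partition(records, vectorize, is_medical):
    """Yield the IDs of records that the classifier tags as medical.

    records: iterable of (doc_id, raw_text) pairs
    vectorize: function turning raw text into a feature representation
    is_medical: function returning True for medical feature vectors
    """
    for doc_id, raw_text in records:
        if is_medical(vectorize(raw_text)):
            yield doc_id


# Toy stand-ins for the real feature pipeline and trained model.
def toy_vectorize(text):
    return set(text.lower().split())


def toy_is_medical(features):
    return "patient" in features


sample = [("doc-1", "The patient recovered"), ("doc-2", "The tower is tall")]
medical_ids = list(tag_partition(sample, toy_vectorize, toy_is_medical))
```

With PySpark, a function of this shape could be handed to `RDD.mapPartitions`, so that each worker streams through its share of the archive and only the matching IDs are collected.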
Questions?