CS240A Final Project II: Data Mining in SQL and Datalog

In this project, you will gain experience with, and an understanding of, the problems that DB query languages face in supporting predictive analytics, even when the task is as simple as 1R and NBC classifiers, which do not require recursion. In fact, you are expected to build generic classifiers, i.e., classifiers that operate on tables with an arbitrary number of columns, once these are placed in verticalized (column) form. Your specific tasks are as follows:

Task A
• Build 1R and NBC classifiers using DeAL and test them on the verticalized representation in the DeAL tutorial. However, you should not use the rules in the DeAL tutorial: instead, simplify the training rules by using a Laplace estimator where missing examples are counted as one. Also, in the decision rules do not rely on the user-defined aggregate given in the notes; use the standard aggregates instead.

Task B
• Using DB2, build an NBC classifier for a dataset used in Task A and for one or more datasets of your choice. See if you can find interesting datasets to classify (perhaps some that you have used in other projects). You can find interesting ones at the following sites:
• http://www.cs.toronto.edu/~delve/data/
• http://kdd.ics.uci.edu/summary.data.alphabetical.html
You are encouraged to try new datasets and applications, and if you have your own interesting application you should use it!

Task B consists of the following subtasks (you should try to implement them using clean and compact SQL):
1. Select your datasets and load them into DB2 as tables called, say, DataSet1, DataSet2, etc. Then, for each dataset, do the following:
2. Randomly partition your DataSet into a TrainSet and a TestSet (the first containing about three times as many tuples as the second).
3. If your data contain numerical attributes, represent them in the TrainSet by either (i) discretizing them, or (ii) approximating their probability by, e.g., assuming that they follow a simple Gaussian distribution. (Of course, different columns might be better treated by different methods.)
4. Devise a strategy for dealing with missing values.
5. Build a Naive Bayesian Classifier using DB2's SQL aggregates and (preferably) table functions, and store it in a table called NBC. Try to provide general-purpose solutions and code, expecting that they will be used on datasets with an arbitrary number of columns. (A hedged SQL sketch of steps 5 and 6 follows this task list.)
6. Write SQL queries that take the tuples in TestSet (without class labels) and predict their class labels using your NBC.
7. Build a 1R classifier (single column) first and compare its results with those of the NBC classifier. (A 1R sketch also follows this task list.)
8. [Boosting of a single classifier] Find the misclassified samples from TrainSet and increase their weights (e.g., by simply duplicating them) to implement the boosting step. Repeat steps 5 and 6 above, but stop as soon as the accuracy stops improving.
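As a purely illustrative starting point for steps 5 and 6, the SQL below sketches how an NBC model with a (standard-form) Laplace estimator could be built and applied over a verticalized table. The names TrainSet_v(id, class, attr, val), TestSet_v(id, attr, val), ClassCount, AttrCard, and Score are assumptions of this sketch, not part of the assignment; only the NBC table name comes from step 5, and you may well prefer table functions or a different treatment of the counts.

-- Per-class example counts (used for the priors and for smoothing).
CREATE VIEW ClassCount(class, n) AS
  SELECT class, COUNT(DISTINCT id) FROM TrainSet_v GROUP BY class;

-- Number of distinct values of each attribute (the denominator term of the
-- Laplace estimator, here in the common (k + 1) / (n + |values|) form).
CREATE VIEW AttrCard(attr, card) AS
  SELECT attr, COUNT(DISTINCT val) FROM TrainSet_v GROUP BY attr;

-- Step 5: the NBC model, a smoothed estimate of P(val | class) for every
-- (class, attr, val) combination over the values seen in training, so that
-- combinations missing from TrainSet_v still get a non-zero probability.
CREATE TABLE NBC (class VARCHAR(64), attr VARCHAR(64),
                  val VARCHAR(128), prob DOUBLE);
INSERT INTO NBC
  SELECT c.class, a.attr, av.val,
         (COALESCE(cnt.k, 0) + 1.0) / (c.n + a.card)
  FROM ClassCount c
       JOIN (SELECT DISTINCT attr, val FROM TrainSet_v) av
            ON 1 = 1                    -- every class paired with every seen (attr, val)
       JOIN AttrCard a ON a.attr = av.attr
       LEFT JOIN (SELECT class, attr, val, COUNT(*) AS k
                  FROM TrainSet_v GROUP BY class, attr, val) cnt
              ON cnt.class = c.class AND cnt.attr = av.attr AND cnt.val = av.val;

-- Step 6: for each test tuple, sum the logarithms of the smoothed likelihoods
-- of its attribute values under each class, add the (unnormalized) log prior,
-- and keep the highest-scoring class. Test values never seen in training drop
-- out of the join for every class alike.
CREATE VIEW Score(id, class, score) AS
  SELECT t.id, m.class, SUM(LN(m.prob)) + LN(c.n)
  FROM TestSet_v t
       JOIN NBC m ON m.attr = t.attr AND m.val = t.val
       JOIN ClassCount c ON c.class = m.class
  GROUP BY t.id, m.class, c.n;

SELECT id, class
FROM Score s
WHERE score = (SELECT MAX(score) FROM Score s2 WHERE s2.id = s.id);

If, for the boosting step (step 8), the misclassified training tuples are duplicated under fresh ids, the statements above can simply be rerun to retrain and re-evaluate the model.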
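A one-level 1R classifier (step 7) can likewise be sketched with ordinary aggregates over a verticalized table TrainSet_v(id, class, attr, val); again, every view name here is illustrative only, and the sketch assumes every tuple has a value for every attribute.

-- How often each class occurs with each attribute value.
CREATE VIEW ValClassCount(attr, val, class, k) AS
  SELECT attr, val, class, COUNT(*) FROM TrainSet_v GROUP BY attr, val, class;

-- Training tuples classified correctly by each attribute's one-level rule
-- (the rule predicts the majority class of each value).
CREATE VIEW AttrCorrect(attr, correct) AS
  SELECT attr, SUM(maxk)
  FROM (SELECT attr, val, MAX(k) AS maxk
        FROM ValClassCount GROUP BY attr, val) m
  GROUP BY attr;

-- The attribute that 1R selects (there may be ties).
SELECT attr FROM AttrCorrect
WHERE correct = (SELECT MAX(correct) FROM AttrCorrect);

-- The majority class of each (attr, val) pair; restricting this to the
-- winning attribute gives the 1R rule.
SELECT v.attr, v.val, v.class
FROM ValClassCount v
WHERE v.k = (SELECT MAX(k) FROM ValClassCount v2
             WHERE v2.attr = v.attr AND v2.val = v.val);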
Task C [Ensemble-based bagging, for extra credit]
See if you can get better decisions from your DB2 or Datalog classifiers by using voting ensembles of classifiers (possibly by assigning weights to each classifier on the basis of its accuracy). Alternatively, build a better voting ensemble by using boosting.

Task D [Back to Datalog, for extra credit]
Write a K-means classifier in DeAL using an XY-stratified program. Test and demonstrate it on a dataset of your choice (e.g., http://web.cs.ucla.edu/classes/fall16/cs240A/notes/decision/NYtaxiCalls.txt).

Task E
Write a nice report about your work.

The respective credit for Tasks A, B, and E is 30%, 40%, and 30%. Tasks C and D earn 10% extra credit each. The credit for each task depends on the complexity of the datasets and mining methods selected, and on the quality of the analysis and solutions proposed. Focus on those, and on writing an interesting report, before you work on Tasks C and D, which are meant for extra credit.

More on Data Sets: Good results were reported in the past with the led, mushrooms, splice, titanic, waveform, abalone, letter, and census datasets. But datasets are continuously being revised and upgraded, and you are encouraged to try new ones. However, make sure that your dataset is not too small; otherwise, your experiments with performance will not be interesting. It is important that you write generic classifiers, i.e., classifiers that will work for datasets having different numbers of columns and data types (e.g., discrete and continuous). In fact, you must test and demonstrate your classifiers on different datasets. The best way to achieve genericity is to work with tables in a verticalized format, as in Task A (e.g., by using DB2's table functions to transform your example tables into a vertical form); a small illustrative example is sketched below.

Verticalized Representation
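For concreteness, here is one way a small horizontal table could be transformed into the vertical (id, class, attr, val) form assumed by the sketches above. The table DataSet1(id, outlook, temp, humidity, windy, play), with class attribute play, is hypothetical; a generic solution would use DB2's table functions instead of a hand-written UNION ALL over a fixed schema.

-- Illustrative verticalized training table.
CREATE TABLE TrainSet_v (id INTEGER, class VARCHAR(64),
                         attr VARCHAR(64), val VARCHAR(128));

-- One output row per (tuple, non-class attribute) of the horizontal table.
INSERT INTO TrainSet_v(id, class, attr, val)
  SELECT id, play, 'outlook',  outlook                       FROM DataSet1
  UNION ALL
  SELECT id, play, 'temp',     CAST(temp AS VARCHAR(32))     FROM DataSet1
  UNION ALL
  SELECT id, play, 'humidity', CAST(humidity AS VARCHAR(32)) FROM DataSet1
  UNION ALL
  SELECT id, play, 'windy',    windy                         FROM DataSet1;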