WEEKLY STATUS REPORT
Juan Bernal
6/7/2008

ACTIVITIES: This week's work focused on Weka-grid, mainly running the SMO tasks on the CCCS dataset without the NUMFAULTS attribute in order to compare results against SMO on the regular dataset (a sketch of this setup appears after this report). Removing the number of faults eliminated that attribute's high weight, but it also increased error rates overall by about 50%. Weka-grid was not very stable: in many tests it would not finish the task, and when it did finish it took about 1 to 3 minutes to display the output. As before, the file output did not generate a valid file; instead the file appeared to contain corrupted text rather than the expected results.

After the SMO tests, the MultiLayerPerceptron tests were run. These were more time consuming because Weka-grid failed to return output more often. In addition, there was an issue with the flag used to set the random seed. The tests were initially run with -S, but under MLP, above a certain seed value this produced a Java out-of-bounds exception. The solution was to use -s instead: -s sets the random number seed for cross-validation, while -S is the value used to seed the classifier's own random number generator (the distinction is illustrated in the MLP sketch after this report).

After the MLP results were obtained, the RandomForest tests were attempted. Weka-grid would not run RF in a distributed manner: many attempts were made with different flags and options to obtain output from a distributed run, but Weka-grid simply hung indefinitely. The only way to obtain output was to run Weka-grid locally, and even then not every run produced it. The RF tests took the longest to return results, about 3 minutes for RF100 and about 5 minutes for RF500 (see the RandomForest sketch after this report). While running these tests another odd error appeared: the error rates would all be zero, but if the dataset was switched from the CCCS-fit dataset to the CCCS-test dataset the error rates were non-zero, and this behavior repeated for some seed values but not for others. Since RF100 and RF500 under Weka-grid do not yield stable results locally and do not work at all when distributed, the RF100 and RF500 tests were omitted. Work with Weka-parallel has also begun.

ACCOMPLISHMENTS: Obtained results for SMO with the CCCS fit dataset without the NumFaults attribute, for comparison against the tests with the normal CCCS dataset. Also completed the MLP tests and gathered their results.

ISSUES/PROBLEMS: Weka-grid does not output results every time it is executed. For RandomForest in distributed mode, Weka-grid gives no output and hangs every time; locally it runs RF on occasion, but the results are not consistent. There was an issue with the MLP tests and the flag that sets the random seed; -s was used rather than -S to obtain results. Weka-grid is in general very slow to produce results.

PLANS: Finish the tests with Weka-parallel, gather the output data, create tables and charts comparing all of the tools tested, and present the results obtained.
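For reference, the sketch below shows roughly how the SMO run without the NUMFAULTS attribute can be set up against the plain Weka Java API (it does not show Weka-grid's distributed wrapper). The ARFF file name, fold count, and seed are placeholders, and it assumes the class attribute is the last column and that the faults attribute is literally named NUMFAULTS.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class SmoWithoutNumFaults {
    public static void main(String[] args) throws Exception {
        // Load the dataset (file name is a placeholder for the CCCS fit data).
        Instances data = new DataSource("CCCS-fit.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // assumes the class is the last column

        // Drop the faults attribute by name (name/casing is an assumption);
        // Remove uses 1-based attribute indices.
        Remove remove = new Remove();
        remove.setAttributeIndices(
            String.valueOf(data.attribute("NUMFAULTS").index() + 1));
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1); // re-assert assumed class position

        // 10-fold cross-validation of SMO on the reduced data.
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new SMO(), reduced, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```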
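The MLP seed-flag issue can be illustrated with a minimal sketch, again using plain Weka rather than Weka-grid: the classifier's own -S option seeds the network's random number generator, while the command-line -s seed for cross-validation corresponds to the Random object passed to crossValidateModel in the API. File name, fold count, and seed values are placeholders.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpSeedExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("CCCS-fit.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        // -S is the classifier's own random seed (the uppercase flag from the report).
        mlp.setOptions(Utils.splitOptions("-S 0"));

        // The lowercase -s flag seeds the cross-validation fold shuffling;
        // in the API that role is played by the Random passed below.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(5));
        System.out.println(eval.toSummaryString());

        // Command-line equivalent with standard Weka flags:
        // java weka.classifiers.functions.MultilayerPerceptron -t CCCS-fit.arff -x 10 -s 5 -S 0
    }
}
```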
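For completeness, the RandomForest sketch below shows how the RF100 and RF500 runs would look as plain local Weka calls (the only mode in which Weka-grid returned any RF output). The -I option sets the number of trees and -S the forest's seed; the file name, folds, and seeds are placeholders.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestSizes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("CCCS-fit.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Run the forest at the two sizes mentioned in the report (100 and 500 trees).
        for (int trees : new int[] {100, 500}) {
            RandomForest rf = new RandomForest();
            rf.setOptions(Utils.splitOptions("-I " + trees + " -S 1"));
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, 10, new Random(1));
            System.out.println(trees + " trees:\n" + eval.toSummaryString());
        }
    }
}
```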
SUMMARIES/CRITIQUES OF PAPERS: In this part, please include a short review of the papers you have read during the last week. The review should include three short paragraphs for each paper.

Daniela Barbalace, Claudio Lucchese, Carlo Mastroianni, Salvatore Orlando, Domenico Talia, “Distributed Data Mining on Desktop Grids”, CoreGRID Technical Report TR-0141, June 17, 2008. Institute on Knowledge and Data Management, Institute on Architectural Issues: Scalability, Dependability, Adaptability. CoreGRID - Network of Excellence. URL: http://www.coregrid.net

This paper describes the use of volunteer computing systems for data mining. Desktop grid projects such as BOINC and XtremWeb already carry out complex tasks, but they are not well suited to the data-intensive workloads found in data mining. The main focus is the Closed Frequent Itemset Mining (CFIM) problem, which involves extracting significant patterns from a transactional dataset, namely the itemsets whose frequency exceeds a user-defined threshold. The paper also introduces a data-intensive computing network adapted to running mining tasks on a volunteer computing system. Parallel mining of closed frequent itemsets is defined, with mention of new algorithms such as MT-Closed; the idea of this research is to partition the whole data mining task into independent subtasks. The data-intensive computing network is then described: the CFIM problem is handled on a super-peer network for the assignment and execution of jobs, with caching used for efficient data distribution. In this super-peer structure, special nodes (the Data Source, the Job Manager, the Miners, the Data Cachers, and the Super-Peers) interact with one another to perform the data mining tasks. A performance evaluation was carried out on a volunteer network, and the results appear promising for P2P data mining networks distributing large amounts of data.

This paper introduces an emerging technology for distributed data mining over peer-to-peer networks, motivated by the need for a better system to distribute large amounts of data on a super-peer network in order to solve challenging data mining problems. The paper goes into detail on the algorithms of the data-intensive computing network as well as on the parallel mining of closed frequent itemsets. The algorithm description is a little hard to follow because of its theoretical presentation, but the data-intensive computing network is very well illustrated and detailed. I think this paper is looking toward the future of data mining, where any computer could take part in large-scale data mining tasks.

This paper relates to my project in an informative way. It shows me the direction I would like to take in developing a distributed, grid-enabled data mining utility. Although the work is very advanced, it demonstrates the possibilities of data mining on grids and in distributed environments.
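As a purely illustrative aside (this is not the paper's MT-Closed algorithm or its distributed super-peer scheme), the toy sketch below shows the basic notion the paper builds on: a user-defined support threshold over a transactional dataset. The transactions, item labels, and threshold are made up, and only single-item supports are counted; real CFIM algorithms extend this to full itemsets and add a closedness condition.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SupportThresholdToy {
    public static void main(String[] args) {
        // Toy transactional dataset: each transaction is a set of item labels.
        List<Set<String>> transactions = List.of(
            Set.of("A", "B", "C"),
            Set.of("A", "B"),
            Set.of("A", "C"),
            Set.of("B", "C"),
            Set.of("A", "B", "C"));

        int minSupport = 3; // user-defined frequency threshold

        // Count the support (number of transactions containing each item).
        Map<String, Integer> support = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t)
                support.merge(item, 1, Integer::sum);

        // Keep only the items whose support meets the threshold.
        support.forEach((item, count) -> {
            if (count >= minSupport)
                System.out.println(item + " is frequent (support " + count + ")");
        });
    }
}
```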