WEEKLY STATUS REPORT
Juan Bernal
6/7/2008
ACTIVITIES:
This week's work focused on Weka-grid, mainly running the SMO tasks on the CCCS dataset without
the NUMFAULTS attribute so the results could be compared against those of SMO on the regular
dataset. Removing the number of faults eliminated the high weight of that attribute, but it also
increased the error rates in general by about 50%. Weka-grid was not very stable: in many tests it
did not finish the task, and when it did finish it took about 1 to 3 minutes to display the output.
As before, the file output option did not generate a valid file; the file appeared to contain
corrupted text rather than the expected results. After the SMO tests were done, the
MultilayerPerceptron (MLP) tests were run. These tests were a little more time consuming because
Weka-grid failed to produce output more frequently. In addition, there was an issue with the flag
that defines the random seed number. Previously I ran the tests using -S, but with the MLP, past a
certain seed value, it would throw a Java out-of-bounds exception. To work around this I used -s
instead; that flag sets the random number seed for cross-validation, while -S is the value used to
seed the random number generator.
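The difference between the two seeds can be illustrated with a minimal sketch. This is plain
Python, not Weka code: the function names cv_folds and initial_weights, and the parameters cv_seed
and model_seed, are hypothetical and only stand in for the roles described above (one seed decides
how instances are split into folds, the other drives the learner's own random number generator).

```python
import random

def cv_folds(n_instances, n_folds, cv_seed):
    """Shuffle instance indices with cv_seed and deal them into folds."""
    indices = list(range(n_instances))
    random.Random(cv_seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def initial_weights(n_weights, model_seed):
    """Draw a model's starting weights from a generator seeded with model_seed."""
    rng = random.Random(model_seed)
    return [rng.uniform(-0.05, 0.05) for _ in range(n_weights)]

# Same cross-validation seed, so both runs evaluate on the same partition.
folds_a = cv_folds(10, 5, cv_seed=1)
folds_b = cv_folds(10, 5, cv_seed=1)

# Different model seeds, so the two learners start from different weights.
weights_a = initial_weights(4, model_seed=1)
weights_b = initial_weights(4, model_seed=2)

print(folds_a == folds_b)
print(weights_a == weights_b)
```

Under this reading, changing only the cross-validation seed changes which instances land in each
fold, while changing only the model seed changes the learner's starting point on the same folds, so
results from the two kinds of runs are not directly interchangeable.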
After all the MLP results were obtained, the RandomForest (RF) tests were run. In this case,
Weka-grid would not run RF in a distributed manner. Many attempts were made with different flags
and options to obtain output from a distributed run, but as before Weka-grid would just hang
indefinitely. The only way to get output was to run Weka-grid locally and, as before, not every run
produced it. The RF tests took the longest to show a result: about 3 minutes for RF100 and about 5
minutes for RF500. During these tests another odd error appeared: the error rates would all be
zero, but if the dataset was changed from CCCS-Fit to CCCS-test the error rates would be nonzero.
This behavior repeated for some seed numbers but not for others. Since RF100 and RF500 do not yield
stable results when run locally under Weka-grid and do not work at all when distributed, the RF100
and RF500 tests were omitted.
Work with Weka-parallel has started.
ACCOMPLISHMENTS:
Obtained results for SMO on the CCCS-Fit dataset without the NumFaults attribute, for comparison
with the tests on the normal CCCS dataset. Also completed the tests and gathered results for the
MLP task.
ISSUES/PROBLEMS:
Weka-grid does not output results every time it is executed. For RandomForest in distributed mode,
Weka-grid gave no output and hung every time. Locally it would run RF on occasion, but the results
were not consistent. There was an issue with the MLP tests and the flag that sets the random seed;
-s was used rather than -S to obtain results.
Weka-grid in general is very slow at processing the results.
PLANS:
Finish the tests with Weka-parallel, gather the output data, create tables and charts comparing
all of the tools tested, and present the results obtained.
SUMMARIES/CRITIQUES OF PAPERS:
In this part, please include a short review of the papers you have read during the last week. The
review should include three short paragraphs for each paper.
Daniela Barbalace, Claudio Lucchese, Carlo Mastroianni, Salvatore Orlando, Domenico Talia,
“Distributed Data Mining on Desktop Grids”, CoreGRID Technical Report Number TR-0141,
June 17, 2008. Institute on Knowledge and Data Management, Institute on Architectural Issues:
Scalability, Dependability, Adaptability. CoreGRID - Network of Excellence. URL:
http://www.coregrid.net
This paper describes the use of volunteer computing systems for data mining processes. Desktop
grid projects such as BOINC and XtremWeb already accomplish complex tasks, but they are not suited
to the data-intensive processes found in the data mining field. The main focus is on the Closed
Frequent Itemset Mining (CFIM) problem, which involves extracting significant patterns from a
transactional dataset, namely those that meet a user-defined threshold. The paper also introduces a
data-intensive computing network adapted to mining tasks on a volunteer computing system.
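To make the CFIM problem concrete, here is a minimal brute-force sketch. It assumes the standard
definitions: an itemset is frequent if its support (the number of transactions containing it) meets
the user threshold, and closed if no proper superset has the same support. The function names and
the toy transactions are hypothetical, and this exhaustive enumeration is only an illustration, not
the algorithm used in the paper.

```python
from itertools import combinations

def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def closed_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every closed frequent itemset (brute force)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count
    closed = {}
    for itemset, count in frequent.items():
        # A superset with equal support is itself frequent, so it is enough
        # to look for it among the frequent itemsets.
        absorbed = any(itemset < other and count == other_count
                       for other, other_count in frequent.items())
        if not absorbed:
            closed[itemset] = count
    return closed

# Toy data, made up for illustration only.
transactions = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("c")]
result = closed_frequent_itemsets(transactions, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

In this toy dataset {a} and {b} are frequent but not closed, because {a, b} occurs in exactly the
same transactions; reporting only the closed sets keeps the same information in fewer patterns.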
The paper defines the parallel mining of closed frequent itemsets and mentions new algorithms such
as MT-Closed. The idea of this research is to partition the whole data mining task into independent
subtasks.
The paper then describes the data-intensive computing network, in which the CFIM data mining
problem is handled on a super-peer network that assigns and executes jobs and uses caching for
efficient data distribution. In the super-peer structure, special nodes (the Data Source, the Job
Manager, the Miners, the Data-Cachers, and the Super-Peers) interact with each other to perform the
data mining tasks.
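The interaction between these roles can be sketched in a heavily simplified form. The sketch below
is an assumption-laden toy, not the protocol from the paper: all class and function names are
hypothetical, super-peer routing is left out entirely, and the "mining" step is just a count. It
only shows the idea that miners pull jobs from a job manager and read their data through a cacher,
so the data source is contacted once per partition rather than once per job.

```python
from collections import deque

class DataSource:
    """Holds the dataset partitions and counts how often it is asked for one."""
    def __init__(self, partitions):
        self.partitions = partitions
        self.fetches = 0

    def fetch(self, key):
        self.fetches += 1
        return self.partitions[key]

class DataCacher:
    """Serves partitions to miners, contacting the data source only on a cache miss."""
    def __init__(self, source):
        self.source = source
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.source.fetch(key)
        return self.cache[key]

class JobManager:
    """Hands out pending jobs and collects their results."""
    def __init__(self, jobs):
        self.pending = deque(jobs)
        self.results = {}

    def next_job(self):
        return self.pending.popleft() if self.pending else None

    def submit(self, job_id, result):
        self.results[job_id] = result

def run_miners(manager, cacher, miner_names):
    """Let the miners take turns pulling jobs until none are left."""
    assigned = {name: [] for name in miner_names}
    turn = 0
    while True:
        job = manager.next_job()
        if job is None:
            break
        job_id, partition_key, item = job
        miner = miner_names[turn % len(miner_names)]
        transactions = cacher.get(partition_key)
        # Toy "mining" step: count the transactions that contain the item.
        manager.submit(job_id, sum(1 for t in transactions if item in t))
        assigned[miner].append(job_id)
        turn += 1
    return assigned

# Toy partitions and jobs, made up for illustration only.
source = DataSource({"p1": [{"a", "b"}, {"a"}], "p2": [{"b"}, {"b", "c"}]})
manager = JobManager([(1, "p1", "a"), (2, "p1", "b"), (3, "p2", "b"), (4, "p2", "c")])
assigned = run_miners(manager, DataCacher(source), ["miner-1", "miner-2"])
print(manager.results)
print(source.fetches)
```

Four jobs run against two partitions here, yet the data source is asked only twice, which is the
kind of saving that caching is meant to provide when the same data feeds many jobs.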
A performance evaluation was carried out on a volunteer network, and the results appear promising
for P2P data mining networks that distribute large amounts of data.
This paper introduces an emerging technology for distributed data mining on peer-to-peer
networks, motivated by the need for a better system to distribute large amounts of data on a
super-peer network in order to solve challenging data mining problems. The paper goes into detail
on the algorithms of the data-intensive computing network as well as on the parallel mining of
closed frequent itemsets. The description of the mining algorithm is a little hard to follow
because it is very theoretical, but the data-intensive computing network is very well illustrated
and detailed.
I think this paper looks toward the future of data mining, where any computer could take part in
data mining tasks on a large scale.
This paper relates to my project in an informative way. It shows me the direction in which I would
like to go with the development of a distributed/grid-enabled data mining utility. Although it is
very advanced, it shows the possibilities of data mining on grids and in distributed environments.