
Response to “Data Mining Project – Outline of Way Forward” by Dr Glen Williams (W3 Insights)
Document date: 23rd April 2007
Note: A&C is an abbreviation for “Andy and Colin” pending company formation!
Table of Contents
Part 1: Preliminaries
Part 2: ClusterVis interface changes
Part 3: Data Prep and Mining
Section 3: Weka Vs ClusterVis Vs R
Part 4: Reporting
Part 5 Reporting Implementation
Part 6 Training Requirements
Actions
Part 1: Preliminaries
Sections 1 & 2 - Non-disclosure and Non-competitive Agreements.
Basic NDA and non-competition agreements have been signed. We plan to re-sign the non-competitive
agreement with a new paragraph 1.3 regarding competition between W3 and A&C.
ACTION: Andy to supply para 1.3 for non-competitive agreement
Intellectual property rights. I suggest that the best thing to do would be for W3 to have an exclusive
licence in the schools / education authorities market, but for A&C to retain the right to use IP for other
markets. We're happy to go for a "standard" timeframe for non-competition - whatever this is - 5 years?
IP rights would include ability for W3 to modify and extend code independently of A&C, but restrict
the application area to education. In the case of a third party modifying or extending the application, we
would require appropriate agreements with them to prevent their use of the technology outside of W3.
ACTION: Agree timeframe and more detailed IP / competition rights.
We're happy for you to name and market the product as best fits your needs. Cassandra sounds good to
me.
Section 3 Publicity
Andy is generally keen to develop case-studies and examples of successful applications to show to
prospective clients. W3 and A&C could develop and approve appropriate case-studies and texts which
emphasise the benefits rather than the in-depth technical details. Examples of data mining /
visualisation could use anonymised data, faked or tweaked data, or with permission of schools,
reference actual results.
ACTION: Colin - Find out about his new employer's requirements re: publicity.
ACTION: Andy, W3 - Flesh out in more detail what we would and wouldn't want to say.
Any suggestions from W3 as to the type of "professional guidelines" A&C could develop would be
very useful.
Section 4 Contact Information
Andy would be the first contact for contractual issues. Colin and Andy would also be available as
contacts for their respective technical expertise.
Andy Pryke - 07866 566 178, [email protected]
Colin Frayn - 0773 066 6190, [email protected]
Working pattern - Andy is available 5 to 7 days a week. As W3 have quite full diaries, we could set a
series of meeting dates in advance in order to discuss progress on the project.
ACTION: Colin - suggest best days for availability
Section 5 Timescales
Are there important points in the school year which influence W3's marketing for this work? e.g.
conferences, summer break etc.
Andy is unavailable during the following periods:
 19th-26th April - Vacation
 19th-26th June - Vacation
 24th-30th July - Arts Marketing Association Conference
Section 6 Financial Issues
For short term projects, A&C charge at £500 per day. For longer projects, we may be able to discount
this. All hours will be fully documented, and a schedule provided for the project timescales.
A&C will invoice with VAT via the new company.
Part 2: ClusterVis interface changes
Colin has given a response to this: "I can't see anything in this list that would pose a problem" but
raises the options of developing from scratch or licensing existing code from Birmingham University.
ACTION: A&C - Talk to University about licensing options
ACTION: Colin - Develop a rough timescale for the changes required
Part 3: Data Prep and Mining
Section 1: Data Preparation and cleaning
Many of these data preparation functions are required for both reporting and 3D visualisation.
It is possible to automate the examples given: dealing with missing scores; matching separate data
tables on date of birth; removing duplicates; discretisation of exam scores and curriculum levels. There
are a number of choices for implementing this, including Excel Macros, Weka, R and fully automatic
scripts. A pre-processing report could also be generated, indicating rows removed, any
discrepancies in the data etc.
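As a rough illustration of the kind of automatic script meant here, the sketch below joins two tables, removes duplicates, skips missing scores, discretises the remaining scores and counts what happened for a pre-processing report. It is written in Python purely for illustration (R or Weka could equally be used). The field names, the score bands and the decision to match on name as well as date of birth are all assumptions, not agreed choices.

```python
from collections import Counter

# Hypothetical score bands; the real band edges would need to be agreed with W3.
BANDS = [(0, 9, "low"), (10, 17, "middle"), (18, 25, "high")]


def discretise(score):
    """Map a raw score to a named band, or None if it falls outside every band."""
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    return None


def prepare(pupils, scores):
    """Join pupils to scores and return (cleaned_rows, report).

    pupils: list of dicts with 'name' and 'dob'
    scores: list of dicts with 'name', 'dob' and 'score' ('' when missing)
    """
    report = Counter()
    # Assumption: match on (name, dob) rather than dob alone, to reduce false matches.
    score_by_key = {(r["name"].strip().lower(), r["dob"]): r["score"] for r in scores}

    seen = set()
    cleaned = []
    for pupil in pupils:
        key = (pupil["name"].strip().lower(), pupil["dob"])
        if key in seen:
            report["duplicates_removed"] += 1
            continue
        seen.add(key)
        raw = score_by_key.get(key, "")
        if raw == "":
            report["missing_score"] += 1
            continue
        try:
            value = int(raw)
        except ValueError:
            report["not_a_number"] += 1
            continue
        band = discretise(value)
        if band is None:
            report["out_of_range"] += 1
            continue
        cleaned.append({"name": pupil["name"], "dob": pupil["dob"], "band": band})
    report["rows_kept"] = len(cleaned)
    return cleaned, dict(report)


# Invented example data, for illustration only.
pupils = [
    {"name": "A Smith", "dob": "1996-03-01"},
    {"name": "A Smith", "dob": "1996-03-01"},
    {"name": "B Jones", "dob": "1996-07-15"},
    {"name": "C Patel", "dob": "1995-11-30"},
]
scores = [
    {"name": "A Smith", "dob": "1996-03-01", "score": "23"},
    {"name": "B Jones", "dob": "1996-07-15", "score": ""},
]
rows, report = prepare(pupils, scores)
```

Here pupils with missing scores are simply dropped and counted. Whether to drop, impute or flag them is a decision still to be made.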
One question raised is how much of the data prep/cleaning we push onto the schools. We suggest
that a set of standard column names and a "data dictionary" be established. This would specify things
such as:
Column Name: SEN
Description: Special Educational Needs indicator
Values: X - extra special, S - slightly special, N - not special.
Example value: "X"
Column Name: Y5En-SAT
Description: Raw Standard Attainment Tests (?)
Values: A whole number between 0 and 25
Example value: 23
This data dictionary would be passed to schools, so they could correctly label their spreadsheet fields.
We could then accept any data which used the given column names, and automatically check that the
data was of the type specified and within a valid range.
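To make the automatic check concrete, here is a minimal sketch in Python of validating rows against a data dictionary. The two entries mirror the examples above. The dictionary layout, the function name and the wording of the messages are assumptions for illustration, and values are assumed to arrive as text from a spreadsheet export.

```python
# Entries mirror the two examples in the text; the layout itself is an assumption.
DATA_DICTIONARY = {
    "SEN": {
        "description": "Special Educational Needs indicator",
        "allowed": {"X", "S", "N"},
    },
    "Y5En-SAT": {
        "description": "Raw Standard Attainment Tests",
        "min": 0,
        "max": 25,
    },
}


def validate(rows):
    """Return a list of human-readable problems found in rows (a list of dicts)."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for column, value in row.items():
            spec = DATA_DICTIONARY.get(column)
            if spec is None:
                problems.append(f"row {number}: unknown column {column!r}")
            elif "allowed" in spec:
                # Categorical field: the value must be one of the listed codes.
                if value not in spec["allowed"]:
                    problems.append(f"row {number}: {column} has invalid value {value!r}")
            else:
                # Numeric field: must be a whole number within the stated range.
                try:
                    number_value = int(value)
                except (TypeError, ValueError):
                    problems.append(f"row {number}: {column} is not a whole number: {value!r}")
                    continue
                if not spec["min"] <= number_value <= spec["max"]:
                    problems.append(f"row {number}: {column} out of range: {number_value}")
    return problems


# Invented example rows, for illustration only.
problems = validate([
    {"SEN": "X", "Y5En-SAT": "23"},
    {"SEN": "Q", "Y5En-SAT": "31"},
])
```

The list of problems could feed directly into the pre-processing report sent back to a school.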
ACTION: W3 - Talk to headteacher colleagues and data owners to identify appropriate fields and
formats.
Section 3: Weka Vs ClusterVis Vs R
Questions raised:
Unique selling point - As well as the USP of 3D clustering, other USPs can be claimed when using
Weka or R. Both are highly flexible, and allow complex data processing to be "wrapped up", either as a
program in R or as a KnowledgeFlow in Weka. The USP then becomes the unique knowledge and
experience embodied in them, from W3 in the educational sector, and A&C in data mining /
visualisation.
Data Volumes - As Colin noted in his response, ClusterVis can handle a few thousand points, whereas
R and Weka can handle millions of records.
Strengths of different software packages
Use of ClusterVis for Marketing and Presentation - This sounds like a good idea to me. ClusterVis
is exciting and colourful, and allows immediate exploratory analysis of data. It's ideal for exploring new
and unfamiliar datasets, and for use in presentations and marketing.
Weka is best suited for formal analysis by a skilled user, for example for bespoke analysis. Weka has
many data mining and data processing algorithms built in, and these can be accessed through different
interfaces: "explorer", "experimenter" and "KnowledgeFlow". They can also be called from Java
programs or from R.
R can be used by a skilled user to perform analyses. Programs can also be written in the R language to
automatically process data or create reports. R programs can be linked into Excel, though this would
need further investigation by A&C. As well as providing its own statistical analysis functions, R can
also access Weka functions.
Clementine is a commercial (and very expensive) package similar to Weka's KnowledgeFlow
interface. As might be expected, it's generally much better than Weka in terms of usability and stability.
I believe it also features report writing capabilities, but I'm not familiar with these in detail.
Client Perceptions of Ease vs. Cost
Another point raised by W3 is whether clients will perceive the process as being too easy when they
see the slick interface of Cassandra / ClusterVis. One important point to bring out is the difference
between exploratory analysis, data mining and statistical analysis and reporting.
During exploratory analysis, we're interacting with the data, looking at it, trying out hypotheses
visually. With data mining and statistical analysis we're looking for patterns which we can claim are
definitely there and not just chance occurrences. Reporting takes this a stage further and packages these
raw discoveries into a form where they can be easily understood by the "data owners".
Additional functions for discussion
Points 1, 2 and 3: A&C can identify appropriate visualisation, classifier and rule derivation algorithms,
including tree-based algorithms. This will take into account how easy it is to explain the results (or
algorithm) to a lay audience.
Points 4 and 6: A&C and W3 can talk through the possible applications of clustering and association
rules to schools data and see what is relevant.
Point 5: As we'd recommend automatic data pre-processing as much as possible, pre-processing should
not need to be a factor in choosing the analysis methods.
Point 7: Attribute selection can be used on an individual school basis to identify which factors are most
relevant to something we want to predict (e.g. absence). This is useful information in itself. It can
also be used as a pre-processing step to select the most relevant factors to feed into classifiers and other
data mining algorithms, or to visualise. This could be done on an individual school basis or by
combining data from many schools. In general, we'll use statistically principled methods of attribute
selection and bear in mind how easy it is to explain them.
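As one example of what attribute selection could look like, the sketch below ranks attributes by information gain with respect to a target such as absence. Information gain is just one commonly used measure, picked here because it is fairly easy to explain. It is not a statement of which method A&C would finally choose, and the field names and data are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` once rows are split by `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def rank_attributes(rows, target):
    """Return (attribute, gain) pairs, most informative first."""
    attributes = [name for name in rows[0] if name != target]
    gains = {name: information_gain(rows, name, target) for name in attributes}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Invented toy data: in this example "sen" separates absence perfectly, "year" not at all.
rows = [
    {"sen": "N", "year": "7", "absence": "low"},
    {"sen": "N", "year": "8", "absence": "low"},
    {"sen": "X", "year": "7", "absence": "high"},
    {"sen": "X", "year": "8", "absence": "high"},
]
ranking = rank_attributes(rows, "absence")
```

The same ranking could be run per school or on pooled data from many schools, and the top few attributes passed on to a classifier or to the visualisation.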
Point 8: "Is there other open source data mining which could be of use?"
ACTION: Andy will investigate open source tools, particularly with reference to the "Pentaho" suite
of tools which has recently adopted Weka.
Point 9: Export of information from ClusterVis. Colin mentioned that cut and paste should be possible.
A&C can investigate OLE / COM interfacing. See also part 4, "reporting" for more on this.
Points 10 and 11: Can Weka KnowledgeFlow and Experimenter interfaces be used for automation?
Possibly. Andy will look into this. Scripts written in R are probably a better method for reporting
automation.
Part 4: Reporting
There is great potential for automation of reporting. In order to achieve the best balance between
automatic report generation and manual report compilation, we need to establish what the expected
volume of reporting is, and how long manual analysis would take.
For high levels of automation, it's likely that R is the best package. Andy has experience using R to
automatically generate MS-Word reports with analyses and graphs.
The outline report seems sensible, and a mock-up report should be produced at some stage to ensure we
have a clear target.
Miscellaneous questions:
In Part 2, under "accelerated learning implications", "Enrichment" is mentioned. I understand this to
mean an individual measurement of progress, i.e. a school could argue that although it is low in the
league tables, its pupils have gained a lot compared to their starting point and therefore it is a good
school. Is this correct? Would we have the data to measure this, particularly in terms of the starting
point and baseline information on progress expected?
"A general statement that counterpoints the differences ... between ... disparate groups". We can
certainly attempt to derive rules which capture the differences.
Under "Attitudes", you mention "analysis on these specific attitude variables with/without reference to
the other PASS factors". I'm not clear on what exactly this means.
Part 5 Reporting Implementation
The guiding principles seem sensible and unproblematic. Technically, it might be easier to output
HTML rather than RTF, as HTML can be loaded directly into MS-Word and saved as either a .doc or
.rtf file.
The conversion of results / titles into plain English could either be done by hand, or we can automate it.
Depending on the volume of reports, it might be easier to have an automatic system to avoid extra
editing. This would depend on the "data dictionary" containing plain English descriptions for each
variable.
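A minimal sketch of the automated route, in Python for illustration: column names are looked up in the data dictionary to obtain a plain English title, and a small HTML section is produced which MS-Word could then open. The function names, the dictionary layout and the counts are assumptions made up for this example.

```python
from html import escape

# Hypothetical plain-English descriptions, as they might appear in the data dictionary.
DESCRIPTIONS = {
    "SEN": "Special Educational Needs indicator",
    "Y5En-SAT": "Raw Standard Attainment Tests",
}


def plain_title(column):
    """Return the plain English description for a column, or the raw name if unknown."""
    return DESCRIPTIONS.get(column, column)


def html_section(column, counts):
    """Render counts (value -> number of pupils) as a titled HTML table."""
    lines = [f"<h2>{escape(plain_title(column))}</h2>", "<table>"]
    for value, count in counts.items():
        lines.append(f"<tr><td>{escape(str(value))}</td><td>{count}</td></tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Invented counts, for illustration only.
section = html_section("SEN", {"X": 4, "S": 11, "N": 85})
```

Falling back to the raw column name when no description exists keeps the report usable, while making gaps in the data dictionary easy to spot.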
Thought needs to be given as to the interface for selecting report sections and to how templates can be
made both simple and extensible.
Part 6 Training Requirements
Colin dealt with some of this in his response.
I'd add re: Analogies for ClusterVis - we often use the idea of the points being connected by "springs"
which pull them together. This is pretty much what is simulated on a mathematical level.
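For anyone who wants to see the "springs" analogy in action, here is a small generic spring-layout sketch in Python. It is not ClusterVis's actual code or algorithm, only an illustration of the idea: each pair of points is joined by a spring whose natural length is how dissimilar the two records are, and the points are nudged repeatedly until the springs settle. The function name, step count and step size are arbitrary assumptions.

```python
import math
import random


def spring_layout(distances, dims=3, steps=500, rate=0.05, seed=0):
    """Place points so that their pairwise distances approach `distances`.

    distances: square, symmetric list of lists of target (rest) lengths.
    Returns a list of coordinate lists, one per point.
    """
    rng = random.Random(seed)
    n = len(distances)
    pos = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            force = [0.0] * dims
            for j in range(n):
                if i == j:
                    continue
                delta = [pos[j][k] - pos[i][k] for k in range(dims)]
                length = math.sqrt(sum(d * d for d in delta)) or 1e-9
                # Hooke's law: a stretched spring pulls i toward j,
                # a compressed spring pushes i away from j.
                stretch = length - distances[i][j]
                for k in range(dims):
                    force[k] += stretch * delta[k] / length
            # Move point i a small step in the direction of the net force.
            for k in range(dims):
                pos[i][k] += rate * force[k]
    return pos


# Invented example: points 0 and 1 are similar, points 2 and 3 are similar,
# and the two pairs are dissimilar to each other.
targets = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
layout = spring_layout(targets)
within = math.dist(layout[0], layout[1])
between = math.dist(layout[0], layout[2])
```

After the springs settle, similar records end up close together and dissimilar ones far apart, which is what makes clusters visible in the 3D view.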
Data Cleaning / preparation
As mentioned before, it's likely that a lot of this process can be automated. The main limit is the need
for some standardisation in the files presented. This will be dealt with through use of a data dictionary
shared with the schools.
We can explain concepts such as a "conjunctive ruleset", outline the processes in data preparation, and
answer the more detailed questions under "The Analytical Process" in training or another document.
Actions
ACTION: Andy to supply para 1.3 for non-competitive agreement
ACTION: Agree timeframe and more detailed IP / competition rights.
ACTION: Colin - Find out about his new employer's requirements re: publicity.
ACTION: Andy, W3 - Flesh out in more detail what we would and wouldn't want to say.
ACTION: Colin - suggest best days for availability
ACTION: A&C - Talk to University about licensing options
ACTION: Colin - Develop a rough timescale for the changes required
ACTION: W3 - Talk to headteacher colleagues and data owners to identify appropriate fields and
formats.
ACTION: Andy will investigate open source tools, particularly with reference to the "Pentaho" suite of
tools which has recently adopted Weka.