Response to “Data Mining Project – Outline of Way Forward” by Dr Glen Williams (W3 Insights)

Document date: 23rd April 2007

Note: A&C is an abbreviation for “Andy and Colin” pending company formation!

Table of Contents
Part 1: Preliminaries
Part 2: ClusterVis interface changes
Part 3: Data Prep and Mining
Section 3: Weka Vs ClusterVis Vs R
Part 4: Reporting
Part 5 Reporting Implementation
Part 6 Training Requirements
Actions

Part 1: Preliminaries

Sections 1 & 2 - Non-disclosure and Non-competitive Agreements.

Basic NDA and non-competition agreements signed. We plan to re-sign the non-competitive agreement with new paragraph 1.3 regarding competition of W3 with A&C.

ACTION: Andy to supply para 1.3 for non-competitive agreement

Intellectual property rights. I suggest that the best thing to do would be for W3 to have an exclusive licence in the schools / education authorities market, but for A&C to retain the right to use IP for other markets. We're happy to go for a "standard" timeframe for non-competition - whatever this is - 5 years? IP rights would include ability for W3 to modify and extend code independently of A&C, but restrict the application area to education. In the case of a third-party modifying/extending the application, we would require appropriate agreements with them to prevent their use of the technology outside of W3.

ACTION: Agree timeframe and more detailed IP / competition rights.

We're happy for you to name and market the product as best fits your needs. Cassandra sounds good to me.
Section 3 Publicity

Andy is generally keen to develop case-studies and examples of successful applications to show to prospective clients. W3 and A&C could develop and approve appropriate case-studies and texts which emphasise the benefits rather than the in-depth technical details. Examples of data mining / visualisation could use anonymised data, faked or tweaked data, or, with permission of schools, reference actual results.

ACTION: Colin - Find out about his new employer's requirements re: publicity.
ACTION: Andy, W3 - Flesh out in more detail what we would and wouldn't want to say.

Any suggestions from W3 as to the type of "professional guidelines" A&C could develop would be very useful.

4. Contact Information

Andy would be the first contact for contractual issues. Colin and Andy would also be available as contacts for their respective technical expertise.

Andy Pryke - 07866 566 178, [email protected]
Colin Frayn - 0773 066 6190, [email protected]

Working pattern - Andy is available 5 to 7 days a week. As W3 have quite full diaries, we could set a series of meeting dates in advance in order to discuss progress on the project.

ACTION: Colin - suggest best days for availability

5. Timescales

Are there important points in the school year which influence W3's marketing for this work? e.g. conferences, summer break etc.

Andy is unavailable during the following periods:
19th-26th April - Vacation
19th-26th June - Vacation
24th-30th July - Arts Marketing Association Conference

6. Financial Issues

For short term projects, A&C charge at £500 per day. For longer projects, we may be able to discount this. All hours will be fully documented, and a schedule provided for the project timescales. A&C will invoice with VAT via the new company.

Part 2: ClusterVis interface changes

Colin has given a response to this: "I can't see anything in this list that would pose a problem" but raises the options of developing from scratch or licensing existing code from Birmingham University.
ACTION: A&C - Talk to University about licensing options
ACTION: Colin - Develop a rough timescale for the changes required

Part 3: Data Prep and Mining

Section 1: Data Preparation and cleaning

Many of these data preparation functions are required for both reporting and 3D visualisation. It is possible to automate the examples given: dealing with missing scores; matching separate data tables on date of birth; removing duplicates; discretisation of exam scores and curriculum levels. There are a number of choices for implementing this, including Excel Macros, Weka, R and fully automatic scripts. It's possible that a pre-processing report could be generated indicating rows removed, any discrepancies in data etc.

One question raised is "How much of the data prep/cleaning do we push onto the schools?". We suggest that a set of standard column names and a "data dictionary" be established. This would specify things such as:

Column Name: SEN
Description: Special Educational Needs indicator
Values: X - extra special, S - slightly special, N - not special.
Example value: "X"

Column Name: Y5En-SAT
Description: Raw Standard Attainment Tests (?)
Values: A whole number between 0 and 25
Example value: 23

This data dictionary would be passed to schools, so they could correctly label their spreadsheet fields. We could then accept any data which used the given column names, and automatically check that the data was of the type specified and within a valid range.

ACTION: W3 - Talk to headteacher colleagues and data owners to identify appropriate fields and formats.

Section 3: Weka Vs ClusterVis Vs R

Questions raised:

Unique selling point - As well as the USP of 3D clustering, other USPs can be claimed when using Weka or R. Both are highly flexible, and allow complex data processing to be "wrapped up", either as a program in R or as a KnowledgeFlow in Weka.
The USP then becomes the unique knowledge and experience embodied in them, from W3 in the educational sector, and A&C in data mining / visualisation.

Data Volumes - As Colin noted in his response, ClusterVis can handle a few thousand points, whereas R and Weka can handle millions of records.

Strengths of different software packages

Use of ClusterVis for Marketing and Presentation - This sounds like a good idea to me. ClusterVis is exciting and colourful, and allows immediate exploratory analysis of data. It's ideal for exploring new and unfamiliar datasets, and for use in presentations and marketing.

Weka is best suited for formal analysis by a skilled user, for example for bespoke analysis. Weka has many data mining and data processing algorithms built in, and these can be accessed through different interfaces: "explorer", "experimenter" and "KnowledgeFlow". They can also be called from Java programs or from R.

R can be used by a skilled user to perform analyses. Programs can also be written in the R language to automatically process data or create reports. R programs can be linked into Excel, though this would need further investigation by A&C. As well as providing its own statistical analysis functions, R can also access Weka functions.

Clementine is a commercial (and very expensive) package similar to Weka's KnowledgeFlow interface. As might be expected, it's generally much better than Weka in terms of usability and stability. I believe it also features report writing capabilities, but I'm not familiar with these in detail.

Client Perceptions of Ease vs. Cost

Another point raised by W3 is whether clients will perceive the process as being too easy when they see the slick interface of Cassandra / ClusterVis. One important point to bring out is the difference between exploratory analysis, data mining and statistical analysis, and reporting. During exploratory analysis, we're interacting with the data, looking at it, trying out hypotheses visually.
With data mining and statistical analysis we're looking for patterns which we can claim are definitely there and not just chance occurrences. Reporting takes this a stage further and packages these raw discoveries into a form where they can be easily understood by the "data owners".

Additional functions for discussion

Points 1, 2 and 3: A&C can identify appropriate visualisation, classifier and rule derivation algorithms, including tree-based algorithms. This will take into account how easy it is to explain the results (or algorithm) to a lay audience.

Points 4 and 6: A&C and W3 can talk through the possible applications of clustering and association rules to schools data and see what is relevant.

Point 5: As we'd recommend automatic data pre-processing as much as possible, pre-processing should not need to be a factor in choosing the analysis methods.

Point 7: Attribute selection can be used on an individual school basis to identify which factors are most relevant to something we want to predict (e.g. absence). This is useful information in itself. It can also be used as a pre-processing step to select the most relevant factors to feed into classifiers and other data mining algorithms, or to visualise. This could be done on an individual school basis or by combining data from many schools. In general, we'll use statistically principled methods of attribute selection and bear in mind how easy it is to explain them.

Point 8: "Is there other open source data mining which could be of use?"

ACTION: Andy will investigate open source tools, particularly with reference to the "Pentaho" suite of tools which has recently adopted Weka.

Point 9: Export of information from ClusterVis. Colin mentioned that cut and paste should be possible. A&C can investigate OLE / COM interfacing. See also Part 4, "Reporting" for more on this.

Points 10 and 11: Can Weka KnowledgeFlow and Experimenter interfaces be used for automation? Possibly. Andy will look into this.
Scripts written in R are probably a better method for reporting automation.

Part 4: Reporting

There is great potential for automation of reporting. In order to achieve the best balance between automatic report generation and manual report compilation, we need to establish what the expected volume of reporting is, and how long manual analysis would take. For high levels of automation, it's likely that R is the best package. Andy has experience using R to automatically generate MS-Word reports with analyses and graphs.

The outline report seems sensible, and a mock-up report should be produced at some stage to ensure we have a clear target.

Miscellaneous questions:

In Part 2, under "accelerated learning implications", "Enrichment" is mentioned. I understand this to mean an individual measurement of progress, i.e. a school could argue that although it is low in the league tables, its pupils have gained a lot compared to their starting point and therefore it is a good school. Is this correct? Would we have the data to measure this, particularly in terms of the starting point and baseline information on progress expected?

"A general statement that counterpoints the differences ... between ... disparate groups". We can certainly attempt to derive rules which capture the differences.

Under "Attitudes", you mention "analysis on these specific attitude variables with/without reference to the other PASS factors". I'm not clear on what exactly this means.

Part 5 Reporting Implementation

The guiding principles seem sensible and unproblematic. Technically, it might be easier to output HTML rather than RTF; HTML can be loaded directly into MS-Word and saved as either a .doc or .rtf format file.

The conversion of results / titles into plain English could either be done by hand, or we can automate it. Depending on the volume of reports, it might be easier to have an automatic system to avoid extra editing.
This would depend on the "data dictionary" containing plain English descriptions for each variable.

Thought needs to be given as to the interface for selecting report sections and to how templates can be made both simple and extensible.

Part 6 Training Requirements

Colin dealt with some of this in his response. I'd add re: Analogies for ClusterVis - we often use the idea of the points being connected by "springs" which pull them together. This is pretty much what is simulated on a mathematical level.

Data Cleaning / preparation

As mentioned before, it's likely that a lot of this process can be automated. The main limit is the need for some standardisation in the files presented. This will be dealt with through use of a data dictionary shared with the schools.

We can explain concepts such as a "conjunctive ruleset", outline the processes in data preparation, and answer the more detailed questions under "The Analytical Process" in training or another document.

Actions

ACTION: Andy to supply para 1.3 for non-competitive agreement
ACTION: Agree timeframe and more detailed IP / competition rights.
ACTION: Colin - Find out about his new employer's requirements re: publicity.
ACTION: Andy, W3 - Flesh out in more detail what we would and wouldn't want to say.
ACTION: Colin - suggest best days for availability
ACTION: A&C - Talk to University about licensing options
ACTION: Colin - Develop a rough timescale for the changes required
ACTION: W3 - Talk to headteacher colleagues and data owners to identify appropriate fields and formats.
ACTION: Andy will investigate open source tools, particularly with reference to the "Pentaho" suite of tools which has recently adopted Weka.
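
Illustrative sketch: data dictionary checking

As a rough illustration of the data dictionary idea from Part 3, Section 1, the automatic type and range check, together with a simple pre-processing report of discrepancies, might look something like the Python sketch below. The dictionary layout, the function names `check_row` and `preprocessing_report`, and the sample rows are assumptions made purely for illustration; only the two column definitions (SEN and Y5En-SAT) come from the examples given earlier, and the real implementation could equally be an R script or an Excel macro.

```python
# Hypothetical data dictionary: column name -> description plus validity rule.
# The two entries mirror the examples in Part 3; the structure is an assumption.
DATA_DICTIONARY = {
    "SEN": {
        "description": "Special Educational Needs indicator",
        "allowed": {"X", "S", "N"},
    },
    "Y5En-SAT": {
        "description": "Raw Standard Attainment Tests score",
        "min": 0,
        "max": 25,
    },
}


def check_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of plain-English problems found in one spreadsheet row."""
    problems = []
    for column, spec in dictionary.items():
        value = row.get(column)
        if value is None or value == "":
            problems.append(f"{column}: missing value")
            continue
        if "allowed" in spec:
            # Coded field: the value must be one of the listed codes.
            if value not in spec["allowed"]:
                problems.append(f"{column}: unexpected code {value!r}")
        else:
            # Numeric field: must be a whole number within the stated range.
            try:
                number = int(str(value).strip())
            except ValueError:
                problems.append(f"{column}: {value!r} is not a whole number")
                continue
            if not spec["min"] <= number <= spec["max"]:
                problems.append(
                    f"{column}: {number} outside {spec['min']}-{spec['max']}"
                )
    return problems


def preprocessing_report(rows):
    """Map row index -> list of problems, for rows with at least one problem."""
    report = {}
    for index, row in enumerate(rows):
        problems = check_row(row)
        if problems:
            report[index] = problems
    return report


if __name__ == "__main__":
    # Made-up sample rows, as if read from a school's spreadsheet export.
    sample = [
        {"SEN": "X", "Y5En-SAT": "23"},
        {"SEN": "Q", "Y5En-SAT": "30"},
        {"SEN": "N"},
    ]
    for index, problems in preprocessing_report(sample).items():
        print(f"Row {index}: " + "; ".join(problems))
```

Keeping the rules in a single dictionary means the same structure could later hold the plain English descriptions needed for report titles, so schools and the reporting code would share one definition of each column.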