Survey

Transcript

Post Analysis Project1
Christoph F. Eick

Disclaimer
- The main purpose of these slides is not to criticize groups but rather to learn how to do a better job when analyzing data and interpreting data mining results.
- Most of you do not have much experience in these tasks.
- Learning without making errors is impossible; therefore, students can benefit from discussing the errors of other students.

Visualization
- Use large, high-resolution displays; some students used displays that did not reveal much because their density was too high.
- The quality of the visualization affects what you are able to see.
- If you compare displays, put them next to each other!
- Use the same coordinate systems/scale in displays you compare.

Post Analysis Project1 Part2

Interpretation
- Scatterplot: the key question is whether the attribute or pair of attributes provides some evidence for the dominance of a particular class in a particular region of the attribute space, not whether the attribute pair clearly separates the classes.
- Vague interpretation of quantitative results; e.g. “Att1 seems to be more important than Att2” versus “the fact that the regression coefficient of Att1 is 12 times as large as the regression coefficient of Att2 suggests that attribute Att1 has a much stronger impact on class membership”.
- Overlooking patterns in displays; e.g. regions that are dominated by one class, or only looking for patterns in the E/W direction when there are also clear patterns in the N/S direction.
- Not giving summaries at all, or giving very “quick” summaries.

Some Displays
Discuss Scatter Plots generated by Group 8
Regression Results

Mean Value
  GlucoseConc  121.6867628
  BloodP        72.4051842
  BMI           32.4574637
  Pedigree       0.4718763

No Scaling
  R2: Multiple R-squared: 0.286, Adjusted R-squared: 0.282
  Coefficients:
    (Intercept)  V2         V3         V6         V7
    -0.9930791   0.0066490  0.0006933  0.0126270  0.1399540

With Scaling
  Coefficients:
    Intercept           0.343923
    scale(GlucoseConc)  0.204457
    scale(BloodP)       0.008583
    scale(BMI)          0.086987
    scale(Pedigree)     0.046509

The fact that the R2 is 0.28 means that the results are suggestive but do not indicate a strong finding about the importance of the attributes.

Box Plots
Thanks to Group 10!

Post Analysis Project1 Part3

Statistical Summaries
- If there were minor disagreements, I took away 1 point.
- If the results did not make any sense, I took away a lot of points (this happened only once).
- If it was not clear how the results were generated (no R code, incomplete R code, or lack of explanation), I also took away points.

Other
- You were also supposed to interpret the histograms, but the project specification failed to ask you to do that! Discuss another example in Review2.

Importance of Attributes
- GC is definitely very helpful for diagnosing diabetes (scatter plot, regression); e.g. if it is quite low, it is very unlikely that the person has diabetes (useful for a diabetes test).
- BMI (boxplot, scatterplot, regression coefficients) and, to a lesser extent, Pedigree have some usefulness in diagnosing diabetes.
- No evidence has been suggested by any group that DBP has any usefulness in diagnosing diabetes, although it has a weak positive correlation of 0.28 with BMI.

Post Analysis Project1 Part4

Linear Regression
- If you do not scale the data, interpretation of the observed coefficients is quite complicated (see previous slide).
- Lack of quantitative assessment of results.

Star Plots
- What is, in your opinion, the usefulness of this technique? I myself have difficulties making sense of those, but some of you do seem to like Star Plots much more...
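
Returning to the Regression Results slide: the reported coefficients can be compared quantitatively, which is the kind of statement asked for under Interpretation. The sketch below is only an illustration in Python (the project itself used R) and rests on two assumptions that the slides do not state explicitly: that V2, V3, V6, V7 are GlucoseConc, BloodP, BMI, Pedigree in that order, and that both models were fit to the same data with scale() standardizing each attribute to unit standard deviation. Under those assumptions a scaled coefficient equals the raw coefficient times the attribute's standard deviation. The helper names are hypothetical.

```python
# Coefficients copied from the Regression Results slide.
# ASSUMPTION: V2, V3, V6, V7 map to these attribute names in this order.
raw = {
    "GlucoseConc": 0.0066490,
    "BloodP": 0.0006933,
    "BMI": 0.0126270,
    "Pedigree": 0.1399540,
}
scaled = {
    "GlucoseConc": 0.204457,
    "BloodP": 0.008583,
    "BMI": 0.086987,
    "Pedigree": 0.046509,
}


def rank(coefs):
    """Attribute names ordered by absolute coefficient, largest first."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)


def implied_sd(raw_coefs, scaled_coefs):
    """Standard deviation implied by scaled = raw * sd.

    ASSUMPTION: same data in both fits, unit-variance scaling.
    """
    return {name: scaled_coefs[name] / raw_coefs[name] for name in raw_coefs}


def ratios_to_largest(coefs):
    """How many times larger the biggest coefficient is than each one."""
    largest = rank(coefs)[0]
    top = abs(coefs[largest])
    return largest, {name: top / abs(value) for name, value in coefs.items()}


if __name__ == "__main__":
    print("ranking without scaling:", rank(raw))
    print("ranking with scaling:   ", rank(scaled))
    for name, sd in implied_sd(raw, scaled).items():
        print(f"implied sd of {name}: {sd:.3f}")
    largest, ratios = ratios_to_largest(scaled)
    for name, ratio in ratios.items():
        print(f"{largest} coefficient is {ratio:.1f}x that of {name}")
```

Without scaling, Pedigree has the largest coefficient simply because its values are small (implied standard deviation of about 0.33). With scaling, the GlucoseConc coefficient is roughly 2.4 times that of BMI and roughly 24 times that of BloodP, which supports a quantitative statement about attribute importance rather than a vague one.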
Conclusion/Other Findings
- Half of the groups gave quite short conclusions, and most summaries are somewhat vague; e.g. they do not write about:
  - The importance/usefulness of the attributes
  - The usefulness of the employed techniques
  - Knowledge about diabetes generated in Project1
  - …

Project Weights Fall 2013
Project2>Project3??>Project4 Project1
