Exploratory Data Analysis
• A set of techniques rather than a fixed procedure
• The flexibility to respond to the patterns revealed by successive iterations in the discovery process is an important attribute
• The analyst is free to take many paths in revealing mysteries in the data
• Emphasizes visual representations and graphical techniques over summary statistics

EDA
• Summary statistics may obscure or conceal the underlying structure of the data
• When numerical summaries are used exclusively and accepted without visual inspection, the selection of confirmatory models may be based on flawed assumptions and may produce erroneous conclusions

Previously Discussed Techniques for Displaying Data
• Frequency Tables
• Bar Charts (Histograms)
• Pie Charts
• Stem-and-Leaf Displays
• Boxplots

Resistant Statistics
• Example: data set = [5, 6, 6, 7, 7, 7, 8, 8, 9]
• The mean is 7 and the standard deviation is 1.23
• Replace the 9 with 90 and the mean becomes 16 and the standard deviation 27.78
• Changing only one of the nine values has disturbed the location and spread summaries to the point where they no longer represent the other eight values; both the mean and the standard deviation are nonresistant statistics
• The median remained at 7, and the lower and upper quartiles stayed at 6 and 8, respectively

Visual Techniques of EDA
• Gain insight into the data
• More common ways of summarizing location, spread, and shape
• Use resistant statistics
• From these we can decide on test selection and whether the data should be transformed or re-expressed before further analysis

More Techniques
• The last section focused primarily on single-variable distributions
• Here we inspect relationships between and among variables

Crosstabulation
• A technique for comparing two classification variables
• Uses tables whose rows and columns correspond to the levels or values of each variable's categories

Example of a Crosstabulation

                            Overseas Assignment
                            Yes        No       Row Total
  Gender
  Male      Count            22        40          62
            Row %          35.5      64.5
            Col %          78.6      55.6
            Tot %          22.0      40.0        62.0
  Female    Count             6        32          38
            Row %          15.8      84.2
            Col %          21.4      44.4
            Tot %           6.0      32.0        38.0
  Column    Count            28        72         100
  Total     Tot %          28.0      72.0       100.0

The Use of Percentages
• Simplify the data by reducing all numbers to a range from 0 to 100
• Translate the data into standard form, with a base of 100, for relative comparisons
  – A raw count (say 28) has little value unless we know the size of the sample it comes from; out of a sample of 100 it is 28%
  – While this is useful, it is even more useful when the research calls for a comparison of several distributions of the data

Comparison of Crosstabulations

                            Overseas Assignment
                            Yes        No       Row Total
  Gender
  Male      Count           225       675         900
            Row %          25.0      75.0
            Col %          62.5      59.2
            Tot %          15.0      45.0        60.0
  Female    Count           135       465         600
            Row %          22.5      77.5
            Col %          37.5      40.8
            Tot %           9.0      31.0        40.0
  Column    Count           360      1140        1500
  Total     Tot %          24.0      76.0       100.0

Use of Percentages
• Comparing the present sample (n = 100) and the previous sample (n = 1,500), we can view the relative relationships and shifts in the data
• In comparing two-dimensional tables, the selection of either the row or the column percentage will accentuate a particular distribution or comparison (note that in our last tables both column and row percentages were presented)

Presenting Percentages
• When one variable is hypothesized to be the presumed cause (it is thought to affect or predict a response), label it the independent variable, and percentages should be computed in the direction of this variable
• In which direction should the last examples (gender by overseas assignment) run?

Independent Variable
• Row: the implication is that gender influences selection for overseas assignments
• If you said column, you are implying that assignment status has some effect on gender, which is implausible!
• Note that you can do the calculations, but they may not make sense!
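The direction-of-percentaging rule above is easy to check with software. The following minimal sketch (not part of the original slides; it assumes pandas and uses illustrative column names) rebuilds the first crosstab from its cell counts and computes percentages in both the row and the column direction:

```python
# A minimal sketch (not from the original slides) showing how the
# "Example of a Crosstabulation" counts and percentages could be
# reproduced with pandas; column names are illustrative only.
import pandas as pd

# Rebuild the 100-person sample from the cell counts in the table above.
data = (
    [("Male", "Yes")] * 22 + [("Male", "No")] * 40 +
    [("Female", "Yes")] * 6 + [("Female", "No")] * 32
)
df = pd.DataFrame(data, columns=["gender", "overseas_assignment"])

counts = pd.crosstab(df["gender"], df["overseas_assignment"], margins=True)
row_pct = pd.crosstab(df["gender"], df["overseas_assignment"],
                      normalize="index") * 100    # percentages in the row direction
col_pct = pd.crosstab(df["gender"], df["overseas_assignment"],
                      normalize="columns") * 100  # percentages in the column direction

print(counts)
print(row_pct.round(1))   # Male/Yes = 35.5, Female/Yes = 15.8, as in the table
print(col_pct.round(1))
```

Because gender is treated as the independent (row) variable, the row percentages (35.5% of males versus 15.8% of females receiving overseas assignments) are the ones that answer the research question.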
Other Guidelines for Percentages
• Averaging percentages: percentages cannot be averaged unless each is weighted by the size of the group from which it is derived (a weighted average)
• Using too-large percentages: a very large percentage is difficult to grasp; a 1,000 percent increase is better stated as a tenfold increase
• Using too small a base: percentages hide the base from which they have been computed
• A percentage decrease can never exceed 100 percent; the higher figure should always be used as the base

Other Table-Based Analysis
• Recognition of a meaningful relationship between variables generally signals a need for further investigation
• Even if one finds a statistically significant relationship, the questions of why and under what conditions remain
• Normally we introduce a control variable
• Statistical packages can handle complex tables

Control and Nested Variables
[Diagram: a table layout in which the control variable has two categories; the nested variable's categories (Cat 1, Cat 2, Cat 3) are repeated under each control category, and the data cells appear within each nested column.]

Data Mining
• Describes the concept of discovering knowledge from databases
• The idea behind it is the process of identifying valid, novel, useful, and ultimately understandable patterns in data
• Provides two unique capabilities to the researcher
  – pattern discovery
  – predicting trends and behavior

Data-Mining Process
[Flow diagram: Investigative Question; Sampling (yes/no); data visualization; clustering, factor, and correspondence analysis; variable selection and creation; data transformation; neural networks, tree-based models, classification models, other statistical models; Model Assessment]

Sampling Yes/No
• Use the entire data set or a sample of it
• If fast turnaround is more important than absolute accuracy, sampling may be appropriate
• Sample if the data set is very large (terabytes)

Modify
• Based on discoveries, the data may require modification
  – Clustering, factor, and correspondence analysis
  – Variable selection and creation
  – Data transformation

Factor Analysis
• A general term for several specific computational techniques
• All have the objective of reducing many variables that belong together and have overlapping measurement characteristics to a manageable number

Factor Analysis Method
• Begins with the construction of a new set of variables based on the relationships in the correlation matrix
• Can be done in a variety of ways
• The most popular is principal components analysis

Principal Components Analysis
• Transforms a set of variables into a new set that are not correlated with each other
• These linear combinations of variables, called factors, account for the variance in the data as a whole
• Each successive factor is the best linear combination of the variance not accounted for by the previous factors

Principal Components Analysis
• The process continues until all the variance is accounted for

  Extracted component    % of variance accounted for    Cumulative variance
  Component 1                       63%                        63%
  Component 2                       29%                        92%
  Component 3                        8%                       100%
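To make the variance-accounting idea concrete, here is a minimal sketch (not part of the original slides) using scikit-learn's PCA on an invented set of three overlapping measurements; the explained_variance_ratio_ output plays the role of the "% of variance accounted for" column in the table above:

```python
# A minimal sketch (not from the original slides) of principal components
# analysis with scikit-learn; the data set and variable count are invented
# purely for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 observations of 3 overlapping (correlated) measurements
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])

X_std = StandardScaler().fit_transform(X)   # PCA is usually run on standardized variables
pca = PCA().fit(X_std)

print(pca.explained_variance_ratio_)             # % of variance per component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative variance, as in the table above
scores = pca.transform(X_std)                    # the uncorrelated factor scores
```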
Cluster Analysis
• Unlike the techniques for analyzing relationships between variables, this is a set of techniques for grouping similar objects
• Cluster analysis starts with an undifferentiated group of objects
• Different from discriminant analysis, where you search for a set of variables to separate already defined groups

Cluster Analysis Method
• Select the sample (employees, buyers)
• Define the variables on which to measure the objects
• Compute similarities among entities through correlation, Euclidean distances, and other techniques
• Select mutually exclusive clusters (maximization of within-cluster similarity and between-cluster differences)
• Compare and validate the clusters

Clustering
• Different methods produce different solutions
• Cluster analysis methods are not clearly established. There are many options one may select when doing a cluster analysis using a statistical package. Cluster analysis is thus open to the criticism that a statistician may mine the data, trying different methods of computing the proximities matrix and linking groups until he or she "discovers" the structure that he or she originally believed was contained in the data. One wonders why anyone would bother to do a cluster analysis for such a purpose.

A Very Simple Cluster Analysis
• In cases of one or two measures, a visual inspection of the data using a frequency polygon or scatterplot often provides a clear picture of grouping possibilities. For example, "Example Assignment" is data from a cluster analysis homework assignment.
• It is fairly clear from this picture that two subgroups, the first including Julie, John, and Ryan and the second including everyone else except Dave, describe the data fairly well.
• When faced with complex multivariate data, such visualization procedures are not available, and computer programs assist in assigning objects to groups.

Dendrogram
• The clusters and their relative distances are displayed in a diagram called a dendrogram
• The following HTML page describes the logic involved in cluster analysis algorithms:
  http://www.cs.bsu.edu/homepages/dmz/cs689/ppt/entire_cluster_example.html
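As a concrete illustration of the method steps and the dendrogram display described above, here is a minimal sketch (not part of the original slides) using SciPy's hierarchical clustering; the object names and the two measures per object are invented for illustration:

```python
# A minimal sketch (not from the original slides) of hierarchical clustering
# and a dendrogram with SciPy; the five "objects" and their two measures
# are invented stand-in data.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

names = ["Julie", "John", "Ryan", "Ann", "Dave"]       # hypothetical objects
X = np.array([[1.0, 1.2],                              # two measures per object
              [1.1, 1.0],
              [0.9, 1.1],
              [4.0, 3.8],
              [8.0, 0.5]])

Z = linkage(X, method="ward", metric="euclidean")      # similarities via Euclidean distance
labels = fcluster(Z, t=2, criterion="maxclust")        # force two mutually exclusive clusters
print(dict(zip(names, labels)))

dendrogram(Z, labels=names)                            # clusters and their relative distances
plt.show()
```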
Correspondence Analysis
• A descriptive/exploratory technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns
• Provides information similar in nature to that produced by factor analysis techniques
• Allows one to explore the structure of the categorical variables included in the table
• The most common table of this type is the two-way frequency crosstabulation table
• See http://www.statsoft.com/textbook/stcoran.html

Variable Selection, Creation
• If important constructs were discovered, new factors would be introduced to categorize the data
• Some variables may be dropped

WinSTAT
http://www.winstat.com/
Welcome! (text from their home page)
WinSTAT is the statistics Add-In for Microsoft Excel, and this is the place to find out all about it. Tired of your hard-to-use, need-to-be-a-fulltime-expert statistics package? Find out why WinSTAT is the program for you. Wondering if WinSTAT covers the functions and graphics you need? Let the function reference page surprise you, complete with sample outputs of tables and graphics for all functions. Still not convinced? There's no way to be sure until you've tried WinSTAT for yourself. We've got the demo download right here.

Dmz Note
• WinSTAT also does clustering, factor analysis, and the usual EDA techniques

Model
• If a complex predictive model is needed, the researcher will move to the next step of the process: building a model
• Modeling techniques include neural networks, decision trees, sequence-based analysis, classification, and estimation

Neural Networks
• Also called artificial neural networks (ANNs)
• Collections of simple processing nodes that are connected
• Each node operates only on its local data and on the inputs it receives through connections
• The result is a nonlinear predictive model that resembles biological neural networks and learns through training
• The neural model has to train its network on a training data set

Tree Models
• Segregate the data by using a hierarchy of if-then statements based on the values of variables, creating a tree-shaped structure that represents the segregation decisions (a minimal sketch of a tree-based classifier appears at the end of this section)

Classification: Sky Survey Cataloging
• Goal: to predict the class (star or galaxy) of sky objects, especially faint ones, based on telescopic survey images (from the Palomar Observatory)
• 3,000 images with 23,040 x 23,040 pixels per image
• Approach:
  – Segment the image
  – Measure the image attributes (features), 40 of them per object
  – Model the class based on these features
• Success story: found 16 new red-shift quasars, some of the farthest objects, which are difficult to find

Estimation
• A variation of classification
• Instead of just a "yes" or "no" outcome, it generates a score

Other Mining Techniques
• Association
  – Finds patterns across transactions
  – Example: bundling of services
• Sequence-based analysis
  – Takes into account not only the combination of items but also the order of the items
  – In health care, it can be used to predict the course of a disease and order preventive care
• Fuzzy logic
  – An extension of Boolean logic; can have truth values between completely true and completely false
• Fractal-based transformation
  – Works on gigabytes of data, offering the possibility of identifying tiny subsets of data that have common characteristics

Other Statistical Products
• http://www.statsoftinc.com/ - also includes an online statistical textbook
• StatLib: a major site for statistical software of all sorts
  – Gopher to lib.stat.cmu.edu
  – Anonymous ftp to lib.stat.cmu.edu
  – URL: http://lib.stat.cmu.edu/
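Closing with the tree-based sketch referenced in the Tree Models slide above: the following minimal example (not part of the original slides, and using randomly generated stand-in data rather than the Palomar survey features) fits a small decision tree, prints its if-then hierarchy, and shows the distinction between classification (a class label) and estimation (a score):

```python
# A minimal sketch (not from the original slides) of a tree-based classifier
# on a star/galaxy-style task; the features and labeling rule are invented
# stand-in data, not the actual survey attributes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Two invented image attributes per object (e.g., brightness, ellipticity)
X = rng.normal(size=(n, 2))
# Invented rule generating the class labels: 0 = "star", 1 = "galaxy"
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# The fitted tree is literally a hierarchy of if-then statements:
print(export_text(tree, feature_names=["brightness", "ellipticity"]))

print("accuracy:", tree.score(X_test, y_test))   # classification: a class label
print(tree.predict_proba(X_test[:5]))            # estimation: a score, not just a label
```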