Introduction to unsupervised data mining

Data Mining Strategies

Scales of Measurement
• Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Four Scales
• Categorical (nominal)
• Ordinal (only order matters)
• Interval (the difference between two values is meaningful)
• Ratio (a value of 0 means none of the quantity is present; Kelvin is a ratio scale, but Celsius and Fahrenheit are not)

What to Know about the Scales
• The measurement principle involved for each scale
• Examples of the measurement scales
• Permissible arithmetic operations for each scale

Categorical Scale Data
• The values of the scale have no numeric meaning
• Examples
  • Gender
  • Ethnicity
  • Marital Status
  • Hair Color
• Operations
  • Counting (only)

Ordinal Scale Data
• The categories can be ordered
• But the intervals between adjacent scale values are indeterminate
• Examples
  • Movie ratings (0, 1 or 2 thumbs up)
  • U.S.D.A. beef (good, choice, prime)
  • The rank order of anything
• Operations
  • Counting
  • Greater than or less than operations

Interval Scale Data
• Intervals between adjacent scale values are equal
• Examples
  • Degrees Fahrenheit
  • Most personality measures
  • IQ intelligence score
• Operations
  • Counting
  • Greater than or less than operations
  • Addition and subtraction of scale values

Ratio Scale Data
• There is a rational zero point for the scale
  • An absolute zero
• Examples
  • Degrees Kelvin
  • Annual income in dollars
  • Length, distance, size (cm, kB, inches, km)
• Operations
  • All of the above, plus
  • Multiplication and division of scale values
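The difference between interval and ratio scales can be illustrated with temperature. The sketch below is only an illustration: the helper name celsius_to_kelvin and the two sample readings are assumptions, not part of the original slides. It shows that differences agree across Celsius and Kelvin, while a ratio is only meaningful on the Kelvin scale, which has an absolute zero.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale reading (Celsius) to the ratio-scale Kelvin."""
    return celsius + 273.15


# Two hypothetical temperature readings in degrees Celsius.
warm_c, cool_c = 20.0, 10.0

# Subtraction is permissible on an interval scale: the difference is the
# same size whether it is computed in Celsius or in Kelvin.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Division is only permissible on a ratio scale. The Celsius ratio is 2.0,
# but 20 C is not "twice as hot" as 10 C; the Kelvin ratio is close to 1.
naive_ratio = warm_c / cool_c
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print("difference (C):", diff_c)
print("difference (K):", round(diff_k, 6))
print("ratio in Celsius (not meaningful):", naive_ratio)
print("ratio in Kelvin (meaningful):", round(true_ratio, 3))
```

The same reasoning applies to Fahrenheit: its zero point is a convention, so addition and subtraction are valid but multiplication and division are not.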
Variables
• Independent
• Dependent
• Input
• Output
[Diagram: input x → f(x) = 3 + 2x² → output f(x)]

Data Mining Strategies
• Unsupervised (no dependent variables used)
  • Clustering
  • Market Basket Analysis
  • Information Visualization
• Supervised (at least one dependent variable used for training)
  • Classification
  • Estimation
  • Prediction

Clustering
• Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both
• Clusters capture the natural structure of the data
• Clustering allows us to think about the data at a new level of abstraction
• Cluster analysis is often the first step in a data mining project
[Images: Cluster of Stars; Water Clusters; Cellular Clusters]

Cluster Analysis
• Uses information found in the data that describes objects and their relationships
• Goal: objects within a group should be similar to one another and different from objects in other groups
• The greater the similarity within groups and the greater the difference between groups, the better the clustering

How Many Clusters?
[Figures: Three Clusters Identified; Six Clusters Identified]

Types of Clustering
• Partitional clustering
• Hierarchical clustering
• Exclusive clustering
• Overlapping clustering
• Fuzzy clustering
• Complete clustering
• Partial clustering

Partitional Clustering
• A division of a set of data into non-overlapping clusters
• Each data point is in exactly one cluster
[Figure: Example of Partitional Clustering]

Hierarchical Clustering
• Permits subclusters (nested clusters within clusters)
[Figure: Example of Hierarchical Clustering]

Exclusive Clustering
• Each object is assigned to a single cluster

Overlapping Clustering
• Non-exclusive
• A data point can belong to two or more clusters simultaneously

Fuzzy Clustering
• Every data point belongs to every cluster with a membership weight
• Membership ranges from 0 (absolutely does not belong) to 1 (absolutely belongs)
• The sum of the membership weights for each point is 1
[Figure: clusters C1 and C2 with example points labeled C1 75% / C2 25%, C1 40% / C2 60%, and C1 1% / C2 99%]

Complete Clustering
• Assigns every data point to a cluster
• No data point is left out of a cluster

Partial Clustering
• Does not assign every data point to a cluster
• Some data points may not belong to any cluster
  • Noise
  • Outliers
  • Uninteresting background
• Example: classifying newspaper stories
  • Many fall into
    • Global warming
    • Terrorism
  • Some stories are unique
    • Cable Tie just graduated from the CofC in CS

Chris Starr: A centroid is the center of a cluster.

K-Means
1. Select K points as initial centroids
2. Repeat
   1. Form K clusters by assigning each point to its closest centroid
   2. Recompute the centroid of each cluster
3. Until the centroids do not change
The centroids are repositioned until stable in the K-means algorithm.

Observe Your Environment
• Start looking for clusters around you
• Think about how the clusters are formed
  • Are they hierarchical?
  • Are they fuzzy clusters?
  • Are they complete clusters?
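The K-means steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name kmeans, the max_iter and seed parameters, the use of Euclidean distance, random selection of the initial centroids, and the handling of empty clusters are all assumptions that the slides do not specify.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: select K points as initial centroids (assumed: chosen at random).
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2.1: form K clusters by assigning each point to its closest
        # centroid (assumed: Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2.2: recompute the centroid of each cluster as the mean of
        # its points, coordinate by coordinate.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        # Step 3: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Because every point is assigned to exactly one cluster, this sketch produces a partitional, exclusive, and complete clustering in the terms used earlier. The result can depend on the initial centroids, so practical uses often run the algorithm several times with different seeds.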
