Philip Robbins 10 Apr 2011 IS6010, Case Study #1

1.
GENDER variable: What type of data does GENDER represent?
Nominal Data.
The GENDER variable describes data based on a label: male or female.
2.
GENDER variable: What does the mean gender of 1.40 tell us?
If Male is coded as 1 and Female is coded as 2, a mean of 1.40 tells us that there are
more males than females in our GENDER sample: a mean of 1.40 corresponds to 60% of the
cases being coded 1 and 40% being coded 2.
3.
GENDER variable: What would be the appropriate measure of central tendency for
gender?
For a nominal variable with only two categories, either the Mean or the Mode can be used
to summarize the sample, because the Mean then reflects the proportion of cases in each
category. For nominal variables with more than two categories the Mean has no meaning,
and the Median is not defined for nominal data because the categories have no order.
The Mode is the more appropriate measure.
4.
GENDER variable: What is the value for central tendency?
Mode = 1
5.
RANKING variable: What type of data does RANKING represent?
Ordinal Data.
The RANKING variable describes order based on the SCORE variable.
6.
RANKING variable: What is the appropriate measure of central tendency?
When using an ordinal scale, the central tendency of a group of items can be described by
using the group’s mode or its median, but the mean cannot be defined. In this case using
a median measurement is more appropriate.
7.
RANKING variable: What is the value for central tendency?
Median = 8.
8.
RANKING variable: Would it be appropriate to describe the average ranking? Why or
why not?
Ordinal data describes order. It does not describe relative size or degree of difference
between the data items; thus, a mean of ordinal data such as RANKING has no definition.
9.
SCORE variable: What type of data does SCORE represent?
Interval Data.
A SCORE variable does not have an absolute zero point.
10.
SCORE variable: What is the mean, median and mode of this data set?
Mean = 73.13, Median = 75, Mode = 55.
11.
SCORE variable: What does the difference between the mean, median and mode tell you?
The Mean represents the arithmetic average or balance point in a distribution and is the
sum of all the elements divided by the number of elements, which is 73.13. The Median,
75, represents the middle element/value when all the SCORE values are ordered from the
smallest to the largest value. The Mode represents the data element/value that occurs
most frequently. In this case the Mode is 55, which appears 3 times in the case example.
Because the three measures differ, the distribution is not perfectly symmetric; the Mean
lying slightly below the Median is consistent with a slight negative skew.
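The three values can be checked with a short MATLAB sketch. It assumes the SCORE data are the 15 values listed in the histogram code of this case study; the variable names are illustrative only.

```matlab
% SCORE values as listed in the MATLAB histogram code of this case study
score = [95 92 91 90 88 82 80 75 70 60 59 55 55 55 50];

m   = mean(score);        % arithmetic average
med = median(score);      % middle value of the sorted data
mo  = mode(score);        % most frequent value (the smallest one if there is a tie)
nMo = sum(score == mo);   % number of times the mode occurs

fprintf('Mean = %.2f, Median = %g, Mode = %g (appears %d times)\n', m, med, mo, nMo);
% prints: Mean = 73.13, Median = 75, Mode = 55 (appears 3 times)
```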
12.
SCORE variable: Is this data set skewed?
If so, in which direction?
Skewness characterizes the degree of asymmetry of a distribution around its mean. The
Skewness value for SCORE is -0.065, a negative value indicating a very slightly skewed
distribution with an asymmetric tail extending towards more negative values. Normal
distributions produce a Skewness statistic of about zero.
13.
SCORE variable: What is the range of the data set?
How is this determined?
Range = 45. Range is determined from the difference of the range bounds: by subtracting
the lowest SCORE value, 50 from the highest SCORE value, 95.
14.
SCORE variable: What does the kurtosis figure tell you?
Kurtosis is a measure used to describe the shape of the distribution of observed data
around the mean, in particular how heavy its tails are. A high kurtosis means more of the
variance is the result of infrequent extreme deviations, as opposed to frequent modestly
sized deviations, and is portrayed by a curve with a sharp peak and heavy tails. A low
kurtosis portrays a flatter curve with light tails. It is sometimes referred to as the
“volatility of volatility”.
SCORE has a Kurtosis of -1.753, a negative value indicating a flatter distribution with
lighter tails than a normal distribution.
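As a rough check, the reported skewness and kurtosis of SCORE can be reproduced with the sample-adjusted formulas used by common spreadsheet software. That these are the formulas behind the reported figures is an assumption; the sketch uses only base MATLAB functions and the SCORE values from the histogram code in this case study.

```matlab
% SCORE values as listed in the MATLAB histogram code of this case study
score = [95 92 91 90 88 82 80 75 70 60 59 55 55 55 50];

n = numel(score);
z = (score - mean(score)) / std(score);   % standardized deviations (std divides by n-1)

% Sample-adjusted skewness (assumed formula; about 0 for a symmetric distribution)
skew = n / ((n - 1) * (n - 2)) * sum(z .^ 3);

% Sample-adjusted excess kurtosis (assumed formula; about 0 for a normal distribution)
kurt = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z .^ 4) ...
       - 3 * (n - 1)^2 / ((n - 2) * (n - 3));

fprintf('Skewness = %.3f, Kurtosis = %.3f\n', skew, kurt);
% prints: Skewness = -0.065, Kurtosis = -1.753
```

Other definitions of skewness and kurtosis (for example, ones without the small-sample adjustment) give somewhat different numbers for the same data.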
15.
SCORE variable: Do you think this data is normally distributed? Why?
No. Although the Skewness value of -0.065 is close to zero, the Kurtosis of -1.753 is well
below the value of about zero expected for a normal distribution, which indicates a
non-normal distribution.
The histogram plotted shows the effect of negative skewness and negative kurtosis on the
SCORE distribution.
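MATLAB code:
>> score = [95 92 91 90 88 82 80 75 70 60 59 55 55 55 50];  % the 15 SCORE values
>> x = 1:1:100;        % bin centers, one per integer score
>> y = score;
>> hist(y, x); shg     % plot the histogram and bring the figure window forward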
[Figure: histogram of SCORE; x-axis from 30 to 110, frequency axis from 0 to 3]
16.
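SCORE variable: What does the standard error tell you?
The standard error of a method of measurement or estimation is the standard deviation of
the sampling distribution associated with the estimation method. For a sample mean it is
the sample standard deviation divided by the square root of the sample size. In the case
of the SCORE dataset the standard deviation is 16.24 around a mean of 73.13, so with 15
observations the standard error of the mean is about 4.19.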
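The distinction between the standard deviation and the standard error of the mean can be sketched in MATLAB. The sketch assumes the SCORE data are the 15 values from the histogram code in this case study and uses the usual formula SE = SD / sqrt(n).

```matlab
% SCORE values as listed in the MATLAB histogram code of this case study
score = [95 92 91 90 88 82 80 75 70 60 59 55 55 55 50];

n  = numel(score);      % sample size
sd = std(score);        % sample standard deviation (divides by n-1)
se = sd / sqrt(n);      % standard error of the mean

fprintf('n = %d, SD = %.2f, SE of the mean = %.2f\n', n, sd, se);
% prints: n = 15, SD = 16.24, SE of the mean = 4.19
```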
17.
SCORE variable: What is the relationship between the variance and the standard
deviation? What do these numbers tell you?
The Standard Deviation (SD) is the square root of the Variance. SD has an advantage in that
it is in the same units as the mean, which makes interpretation easy. Variance is the
average of the squared differences from the Mean. Variance is used as a measure of how
far a set of numbers are spread out from each other; in this case SCORE has a Variance
of 263.70. Assuming a normal distribution, the standard deviation tells us that about 68%
of the participants within this case study scored between 56.89 and 89.37 (the mean of
73.13 plus or minus one SD of 16.24).
18.
WEIGHT variable: What type of data does WEIGHT represent?
Ratio Data.
Ratio data is like Interval data but with a true zero point.
19.
WEIGHT variable: What is the mean, median, and mode of this data set?
Mean = 144.73, Median = 130, Mode = 108.
20.
WEIGHT variable: What does the difference between the mean, median and mode tell you?
The Mean represents the arithmetic average or balance point in the WEIGHT distribution
and is the sum of all the elements divided by the number of elements, which is 144.73.
The Median, 130, represents the middle element/value when all the WEIGHT values are
ordered from the smallest to the largest value. The Mode represents the data
element/value that occurs most frequently. In this case the Mode is 108, which
appears 3 times in the case example. The Mean lying above the Median is consistent with
a positive skew.
21.
WEIGHT variable: Is the data set skewed?
If so, in which direction?
Skewness characterizes the degree of asymmetry of a distribution around its mean. The
Skewness value for WEIGHT is 0.625, a positive value indicating a skewed distribution with
an asymmetric tail extending towards more positive values.
22.
WEIGHT variable: What is the range of the data set?
How is this determined?
Range = 135. Range is determined from the difference of the range bounds: by subtracting
the lowest WEIGHT value, 90 from the highest WEIGHT value, 225.
23.
WEIGHT variable: What does the kurtosis figure tell you?
Kurtosis is a measure used to describe the shape of the distribution of observed data
around the mean, in particular how heavy its tails are. A high kurtosis means more of the
variance is the result of infrequent extreme deviations, as opposed to frequent modestly
sized deviations, and is portrayed by a curve with a sharp peak and heavy tails. A low
kurtosis portrays a flatter curve with light tails. It is sometimes referred to as the
“volatility of volatility”.
WEIGHT has a Kurtosis of -1.037, a negative value indicating a flatter distribution with
lighter tails than a normal distribution.
24.
WEIGHT variable: Do you think this data is normally distributed? Why?
No. Skewness and Kurtosis indicate the distribution is not normal.
WEIGHT Mean = 144.73
WEIGHT Variance = 2168.07
WEIGHT SD = 46.56
WEIGHT Skewness = 0.625
The histogram plotted shows the effect of positive skewness and negative kurtosis on the
WEIGHT distribution.
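MATLAB code:
>> weight = [200 110 103 145 130 180 170 90 102 225 225 108 108 108 167];  % the 15 WEIGHT values
>> x = 60:1:260;       % bin centers, one per integer weight
>> y = weight;
>> hist(y, x); shg     % plot the histogram and bring the figure window forward
[Figure: histogram of WEIGHT; x-axis from 60 to 260, frequency axis from 0 to 3]
25.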
WEIGHT variable: What does the standard error tell you?
The standard error of a method of measurement or estimation is the standard deviation of
the sampling distribution associated with the estimation method. For a sample mean it is
the sample standard deviation divided by the square root of the sample size. In the case
of the WEIGHT dataset the standard deviation is 46.56 around a mean of 144.73, so with 15
observations the standard error of the mean is about 12.02.
26.
WEIGHT variable: What is the relationship between the variance and the standard
deviation? What do these numbers tell you?
The Standard Deviation (SD) is the square root of the Variance. SD has an advantage in that
it is in the same units as the Mean, which makes interpretation easy. Variance is the
average of the squared differences from the Mean. Variance is used as a measure of how
far a set of numbers are spread out from each other; in this case, WEIGHT has a Variance
of 2168.07. Assuming a normal distribution, the standard deviation tells us that about 68%
of the participants within this case study weigh between 98.17 and 191.29.
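The relationship between the variance and the standard deviation can be checked with a short MATLAB sketch. It assumes the WEIGHT data are the 15 values from the histogram code in this case study, and that the reported variance is the sample variance (divided by n-1), which is what `var` computes by default. The count of observations inside the one-SD interval is an added illustration, not part of the original answer.

```matlab
% WEIGHT values as listed in the MATLAB histogram code of this case study
weight = [200 110 103 145 130 180 170 90 102 225 225 108 108 108 167];

m  = mean(weight);
v  = var(weight);    % sample variance: sum of squared deviations / (n-1)
sd = sqrt(v);        % the SD is the square root of the variance

lo = m - sd;         % lower bound of the one-SD interval
hi = m + sd;         % upper bound of the one-SD interval
inside = sum(weight >= lo & weight <= hi);   % observations within one SD of the mean

fprintf('Variance = %.2f, SD = %.2f\n', v, sd);
fprintf('Mean +/- 1 SD: %.2f to %.2f (%d of %d observations)\n', ...
        lo, hi, inside, numel(weight));
```

With unrounded values the upper bound prints as 191.30 rather than the 191.29 obtained from the rounded mean and SD. In this small sample 11 of the 15 weights (about 73%) fall inside the interval, close to but not exactly the 68% expected under a normal distribution.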
6010 WEEK 1 NOTES
ROBBINS
==============================================================
SAMPLE SIZE
==============================================================
Watch out for Sampling Bias:
small, unbiased samples tend to yield more accurate results than biased samples,
even if the sizes of the biased samples are large
and the sizes of the unbiased samples are small.
Increasing sample size increases precision:
When you say you have precise results (or a reasonable degree of precision),
you are saying the results vary by only a small amount from sample to sample,
which will happen if each sample is large.
Watch out for Diminishing returns:
At some point the returns (in terms of an increase in precision) diminish
to the point that further increases in sample size
are of very little benefit.
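One way to see both effects is through the standard error of the mean, SD / sqrt(n). The MATLAB sketch below is an illustration under stated assumptions: the SD of 16.24 is borrowed from the SCORE data in the case study and treated as fixed, and the sample sizes are hypothetical.

```matlab
sd = 16.24;                % assumed SD (the SCORE value from the case study)
n  = [25 100 400 1600];    % hypothetical sample sizes, each four times the previous one
se = sd ./ sqrt(n);        % standard error of the mean for each sample size

for i = 1:numel(n)
    fprintf('n = %4d  ->  standard error = %.3f\n', n(i), se(i));
end
% prints 3.248, 1.624, 0.812 and 0.406
```

Each fourfold increase in sample size only halves the standard error, so the absolute gain in precision shrinks from 1.624 (25 to 100) to 0.406 (400 to 1600).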
==============================================================
DESCRIPTIVE STATISTICS
==============================================================
Sampling Methods:
Random Sampling: each person has an equal chance of being selected.
Stratified Sampling: a method of sampling from a population that has been divided into subgroups (strata). The strata should be
mutually exclusive and collectively exhaustive. This type of sampling reduces sampling error and produces a weighted mean that
has less variability than the arithmetic mean of a simple random sample.
Systematic Sampling: selects every kth element, where k = N/n, N = population size, and n = sample size.
Cluster Sampling: uses natural groupings (clusters) within a statistical population. Used in marketing research.
Convenience Sampling: uses a population that is readily available and convenient. Not a representative method; used only for
pilot testing.
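The systematic sampling rule (every kth element, k = N/n) can be sketched in MATLAB. The population, the sizes N and n, and the random starting point are all hypothetical choices made for illustration; the notes only specify the interval k.

```matlab
N = 100;                     % hypothetical population size
n = 10;                      % hypothetical sample size
population = 1:N;            % stand-in population: element IDs 1..N

k = floor(N / n);            % sampling interval
first = randi(k);            % assumption: random start between 1 and k
idx = first:k:N;             % every kth position from the starting point
idx = idx(1:n);              % keep exactly n positions
sampleIds = population(idx); % the selected elements

disp(sampleIds);
```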
Levels of Measurement:
Nominal: data that consists of names, labels, or categories only.
Ordinal: describes order, but not relative size or degree of difference between the items measured. Scale type or rank
order.
Interval: like the ordinal level, with the additional property that we can determine meaningful amounts of difference between
data values.
Ratio: like interval data but with a true zero point.
Measure of Central Tendency:
Average: a measure of central tendency; it can refer to a mean, median, or mode.
Be specific when talking about an average, esp. in scientific research, because the three can differ when the underlying
distribution is skewed.
Mean: the arithmetic average (balance point in a distribution), computed by adding up a collection of numbers and dividing
by their count. It is the value around which the deviations sum to zero.
A drawback is that means are drawn in the direction of the skew / extreme scores (outliers).
* Mean is used with Interval & Ratio data.
Median: the middle element / value of a set.
In situations where outliers dramatically impact the mean, the median can be much more representative of the central
tendency of the sample set.
for odd # = order smallest to largest; the middle value is the median
for even # = order smallest to largest, sum the two data elements in the middle, and divide by 2
* Median is used with Ordinal, Interval, Ratio data, and also used when a distribution is highly skewed.
Mode: the data element / value that occurs most frequently.
A distribution with two modes is called bimodal; one with more than two modes is called multimodal.
* Mode can be used with all data types.
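The odd and even rules for the median can be written out step by step in MATLAB. The two data sets are made-up examples, and the manual results can be compared with the built-in `median`.

```matlab
oddSet  = [7 3 9 1 5];        % hypothetical data, odd count
evenSet = [7 3 9 1 5 11];     % hypothetical data, even count

% Odd count: sort, then take the middle value
s = sort(oddSet);
n = numel(s);
medOdd = s((n + 1) / 2);      % 5

% Even count: sort, then average the two middle values
s = sort(evenSet);
n = numel(s);
medEven = (s(n / 2) + s(n / 2 + 1)) / 2;   % (5 + 7) / 2 = 6

fprintf('Odd set median = %g, even set median = %g\n', medOdd, medEven);
```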
Range: the difference between the extreme (highest and lowest) values in a data set.
Standard Deviation: (S, SD for a population / s, sd for a sample) is a measure of how much variability or dispersion there is
around the average (mean, or expected value).
The smaller the variability is, the smaller the standard deviation is.
Normality has been observed with great frequency in nature. The standard deviation is designed to describe the variability of
normal distributions.
Relationships {
+ if a distribution is normal, 68% of the participants in the distribution lie within one standard-deviation unit of the mean.
+ a "narrower curve" is attributed to a lower standard deviation
+ more than half of the observations are within 1 standard deviation of the mean
+ more than 90% of the observations are within 2 standard deviations of the mean
+ most observations fall within 3 standard deviations of the mean
}
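For an exactly normal distribution, the share of observations within k standard deviations of the mean equals erf(k / sqrt(2)). The MATLAB sketch below evaluates this for k = 1, 2, 3; it describes the theoretical normal curve, not any particular sample.

```matlab
k = [1 2 3];                  % number of standard deviations from the mean
p = erf(k ./ sqrt(2));        % proportion of a normal distribution within +/- k SD

for i = 1:numel(k)
    fprintf('Within %d SD of the mean: %.1f%%\n', k(i), 100 * p(i));
end
% prints 68.3%, 95.4% and 99.7%
```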
Shapes of Distributions:
Normal: When very large samples are used, the distribution often forms a smooth bell-shaped (normal) curve (e.g. weights of
grains of sand on a beach).
Positive Skew: distribution that is skewed to the right (trailing tail is on the right, e.g. income curve).
Negative Skew: distribution that is skewed to the left (trailing tail is on the left, e.g. math test results from PhDs).
"skewed to the left" indicates a "negative skew"
"skewed to the right" indicates a "positive skew"
The Median and Interquartile Range:
Whereas the standard deviation measures variability around the mean,
the Range or the Interquartile Range is used to measure variability around the median.
Range: the highest value minus the lowest value.
The more extreme a value is, the more unreliable it is. The range is based on only the two most extreme values and is thus
considered an unreliable statistic.
Interquartile Range (IQR): divides a distribution into quarters; the range of the middle 50% of the values is the IQR.
When the median is reported as the measure of central tendency, it is customary to report the IQR as the measure of
variability.
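A minimal MATLAB sketch of the IQR follows, using the SCORE values from the case study as sample data. It assumes one common convention, in which the quartiles are the medians of the lower and upper halves of the sorted data (excluding the middle value when the count is odd); other quartile definitions can give slightly different results.

```matlab
% SCORE values from the case study, sorted from smallest to largest
score = sort([95 92 91 90 88 82 80 75 70 60 59 55 55 55 50]);

n    = numel(score);
half = floor(n / 2);                    % size of each half (middle value left out if n is odd)

lowerHalf = score(1:half);              % values below the median position
upperHalf = score(end - half + 1:end);  % values above the median position

q1 = median(lowerHalf);                 % first quartile
q3 = median(upperHalf);                 % third quartile
iqrValue = q3 - q1;                     % range of the middle 50%

fprintf('Median = %g, Q1 = %g, Q3 = %g, IQR = %g\n', median(score), q1, q3, iqrValue);
% prints: Median = 75, Q1 = 55, Q3 = 90, IQR = 35
```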