Review - Week 1 - Columbia Statistics

Review - Week 1
Read: Part I: Chapters 1-6.
Review Week 1:
It is important to understand your data.
We want to know who was measured, what was measured, how and where the data was collected
and when and why the study was performed.
Individuals are the objects described by a set of data. A variable is any characteristic of an
individual.
A categorical variable places an individual into one of several groups or categories. A
quantitative variable takes numerical values for which arithmetic operations make sense.
The distribution of a variable tells us what values it takes and how often it takes these values.
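As a small illustration of what "what values it takes and how often" means in practice, the sketch below tabulates a categorical variable. The eye-colour data are made up for this example and do not come from the course material.

```python
from collections import Counter

# Hypothetical categorical data (made up for illustration only).
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue"]

# Count how often each value occurs.
counts = Counter(eye_colour)

# Convert the counts to proportions; together they sum to 1.
n = len(eye_colour)
distribution = {value: count / n for value, count in counts.items()}

# Print the distribution, most frequent value first.
for value, count in counts.most_common():
    print(f"{value:6s} count={count}  proportion={distribution[value]:.2f}")
```

The counts are what a bar chart would display, and the proportions are what a pie chart would display.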
Once you have collected your data you will often want to summarize and describe important
features of the data, using graphical methods and numerical summaries.
To graphically summarize the distribution of a categorical variable use bar or pie charts. For
quantitative variables use histograms or stem-and-leaf plots. When numerical data is collected
over time, use a time plot.
When looking at the resulting graphs, be sure to study:
• The shape of the distribution – is it symmetric or skewed?
• The center of the distribution
• The spread of the distribution
• Any unusual values that do not follow the pattern of the rest of the data
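One of the plots mentioned above, the stem-and-leaf plot, can be produced as plain text. The following is a minimal sketch that assumes non-negative whole-number data, with the last digit used as the leaf; the function name `stem_and_leaf` and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers.

    The stem is every digit except the last; the leaf is the last digit.
    """
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        leaves_by_stem[v // 10].append(v % 10)

    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Hypothetical data (made up for illustration only).
sample = [21, 23, 27, 27, 31, 33, 34, 38, 42, 56]
for line in stem_and_leaf(sample):
    print(line)
```

Reading the printed rows from top to bottom gives a rough picture of shape, center and spread, and an isolated row far from the others hints at an unusual value.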
A statistic is a numerical summary of data. We are interested in summaries that provide
information about the center and spread of the distribution.
The mean of a data set is defined as the sum of all the data points divided by the number of data
points in the set. The median is the midpoint of a data set.
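The two definitions above translate directly into a few lines of Python. This is only a sketch on made-up data, chosen so that one value is much larger than the rest.

```python
import statistics

# Hypothetical data (made up for illustration only).
data = [2, 3, 5, 6, 34]

# Mean: the sum of all data points divided by the number of data points.
mean = sum(data) / len(data)

# Median: the middle value once the data are sorted
# (for an even number of points, the average of the two middle values).
median = statistics.median(data)

print("mean:", mean)
print("median:", median)
```

Here the single large value pulls the mean well above the median, which is why the two summaries can disagree for skewed data.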
The interquartile range, IQR, measures the spread of the middle 50% of the data.
The variance, $s^2$, of a set of data is the average of the squared deviations of the observations
from the mean, where the sum of squared deviations is divided by $n - 1$ rather than $n$. The
standard deviation, $s$, is the square root of the variance.
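The measures of spread described above can be sketched as follows. The helper names and the data are hypothetical. The quartiles are computed as the medians of the lower and upper halves of the sorted data, which is one common textbook convention; other conventions exist, and software may report slightly different quartiles.

```python
import math
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1.

    Assumes at least two observations.
    """
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def iqr_median_of_halves(data):
    """IQR = Q3 - Q1, with Q1 and Q3 taken as the medians of the lower and
    upper halves of the sorted data. When n is odd, the overall median is
    left out of both halves. Assumes at least two observations.
    """
    s = sorted(data)
    half = len(s) // 2
    lower, upper = s[:half], s[-half:]
    return statistics.median(upper) - statistics.median(lower)

# Hypothetical data (made up for illustration only).
data = [2, 4, 4, 5, 7, 8, 12]

variance = sample_variance(data)
std_dev = math.sqrt(variance)
iqr = iqr_median_of_halves(data)

print("variance:", variance)
print("standard deviation:", round(std_dev, 3))
print("IQR:", iqr)
```

The hand-written variance can be checked against `statistics.variance`, which also divides by n - 1.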
Exercises:
Exercise 1: Looking at data
What are the individuals and variables in the data below? Which variables are categorical and
which are quantitative?
Name    Gender    Age    Number of siblings
Bob     Male      27     2
Sue     Female    33     1
Bill    Male      21     0
John    Male      56     4
Exercise 2: Thinking about histograms
(a) Suppose you measured the height of all male Columbia students. What do you think the
resulting histogram would look like? In particular think about the shape, center and
spread. Sketch what you think the histogram would look like.
(b) Suppose you instead measured the height of all Columbia students. How would the
histogram differ from (a)? In particular think about the shape, center and spread. Sketch
what you think the histogram would look like.
Exercise 3: Measures of center and spread.
(a) Suppose we have the following data set {6, 3, 2, 4, 21}. Calculate the mean and median.
Why do they appear to differ so much?
(b) Suppose we have the following data set {1,1,2,2,2,4,6}. Calculate the mean and median.
(c) Given the data set {1, 2, 3, 6, 8, 8, 10, 14, 15, 17}, find the median and the IQR.
(d) Given the data set {2, 4, 3, 7}, calculate the standard deviation.
Exercise 4: Measures of center and spread.
The five-number summary for the weights (in pounds) of fish caught in a bass tournament is
2.3 – 2.8 – 3.0 – 3.6 – 5.2.
(a) Would you describe this distribution as symmetric or skewed?
(b) Would you expect the mean weight of all fish caught to be higher or lower than the
median? Explain.
(c) You caught 3 bass weighing 2.3 pounds, 3.9 pounds and 4.9 pounds. Were any of the fish
outliers? Explain.
(d) Create a boxplot of these data assuming that the three biggest fish that were caught were
4.7, 4.9 and 5.2 lbs.
Histogram Drill
The histogram below represents the average amount (dollars per student) spent by public schools
in each state + the District of Columbia during the school year 1997-8.
[Histogram: horizontal axis "spending", ticks at 4000, 6000, 8000 and 10000; vertical axis "Frequency", ticks from 0 to 10]
Answer the following questions about the histogram:
• Describe the shape of the histogram.
• What is the total area under the histogram?
• Do you believe that the mean or the median is larger? Why?
• How many states spend more than $8,000 per student/year?
• How many states spend less than $6,000 per student/year?
• What proportion of states spend more than $9,000 per student/year?
• What proportion of states spend less than $5,000 per student/year?
Summation Notation:
Let us denote the measurements in a data set consisting of quantitative variables as $x_1, x_2, \ldots, x_n$,
where $x_1$ is the first measurement in the data set, $x_2$ is the second, etc.
If we want to calculate the sum of these n numbers, we can simply write $x_1 + x_2 + \cdots + x_n$.
Another way of writing this is $\sum_{i=1}^{n} x_i$.
Ex. Suppose we have a data set consisting of the values {1, 4, 5, 7}.
Here $x_1 = 1$, $x_2 = 4$, $x_3 = 5$ and $x_4 = 7$.
$\sum_{i=1}^{4} x_i = 1 + 4 + 5 + 7 = 17$
$\sum_{i=1}^{4} 2x_i = 2 \times 1 + 2 \times 4 + 2 \times 5 + 2 \times 7 = 34$
$\sum_{i=1}^{4} x_i^2 = 1^2 + 4^2 + 5^2 + 7^2 = 1 + 16 + 25 + 49 = 91$
It is extremely important to note that $\sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2$.
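The sums in the example can be checked with a short Python sketch. It uses the same data set {1, 4, 5, 7}; the variable names are chosen for this illustration only.

```python
# Data set from the example: x1 = 1, x2 = 4, x3 = 5, x4 = 7.
x = [1, 4, 5, 7]

# Sum of the values: x1 + x2 + x3 + x4.
total = sum(x)

# Sum of twice each value.
doubled_total = sum(2 * xi for xi in x)

# Sum of squares: square each value first, then add.
sum_of_squares = sum(xi ** 2 for xi in x)

# Square of the sum: add first, then square the result.
square_of_sum = sum(x) ** 2

print(total, doubled_total, sum_of_squares, square_of_sum)
```

For this data set the sum of squares is 91 while the square of the sum is 17 squared, or 289, so the order of squaring and summing matters.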
Exercises:
Exercise 1: A data set consists of the values {1, 6, 3, 2, 3}.
Calculate:
(a) $\sum_{i=1}^{5} x_i$
(b) $\sum_{i=1}^{5} x_i^2$
(c) $\left( \sum_{i=1}^{5} x_i \right)^2$
(d) $\sum_{i=1}^{5} (x_i - 3)$
(e) $\sum_{i=1}^{5} (x_i - 3)^2$