STATISTICS

Install the Analysis ToolPak (the "Data Analysis" add-in) in Excel before class.

Statistics is a way to describe and analyze a data set. The first thing we want to do is collect the data in a table that summarizes the information.

EX// We surveyed 50 homes and asked, "How many pets do you own?" Here is the data: HANDOUT P1

I summarized the data by writing down every value that occurred in the data set and how often it occurred, which is its frequency. Note that the frequencies must add up to the number of data points; if they don't, I did something wrong.

Sometimes the frequency alone doesn't tell us much. I can tell you that 14 houses had two pets, but that doesn't mean anything until I tell you that I surveyed 50 houses. So these "frequency distributions" are often accompanied by the relative frequency (the frequency divided by the number of data points).

Let's look at another set of data:

EX// Scores on a test for 35 students
- Notice that this data is different.
- If I wrote down all the values and their frequencies, the frequency column would be mostly ones. EXCEL OUTPUT 1
- So instead we group the data into intervals (or bins) and then build the frequency table.

Go through the second part of the first handout.

Now that we have the data in a table, we can start asking questions like:
- What percentage of students scored either an A or a B?
- What percentage of students didn't score an A?
- What percentage of people have at least 1 pet?
- What percentage of people have at most 2 pets?

HISTOGRAMS
Another way to summarize the data is with a histogram, which shows the same information as a (relative) frequency table in the form of a bar chart. Most of the time these charts are made with software; they can be done by hand, but that is impractical for anything beyond a small data set. I'm going to show you how to make a few in Excel. **NEED THE ANALYSIS TOOLPAK**

M120 F09
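As a rough sketch of how the relative frequencies behind a histogram (and the percentage questions about pets) could be checked outside Excel, the Python snippet below tabulates the pets survey. The counts are the ones that appear in the handout's mean calculation (5, 13, 14, 7, 7 and 4 homes with 0 to 5 pets); the names `pet_counts` and `relative_frequencies` are illustrative assumptions, not part of the course materials.

```python
# Frequency table for the pets survey: number of pets -> number of homes.
# Counts come from the handout; the variable and function names are illustrative.
pet_counts = {0: 5, 1: 13, 2: 14, 3: 7, 4: 7, 5: 4}

# The frequencies must add up to the number of data points (50 homes).
n = sum(pet_counts.values())

def relative_frequencies(counts):
    """Return each value's frequency divided by the total number of data points."""
    total = sum(counts.values())
    return {value: freq / total for value, freq in counts.items()}

rel = relative_frequencies(pet_counts)

# Percentage of homes with at least 1 pet, and with at most 2 pets.
at_least_one = 100 * sum(f for v, f in pet_counts.items() if v >= 1) / n
at_most_two = 100 * sum(f for v, f in pet_counts.items() if v <= 2) / n

for value in sorted(pet_counts):
    print(f"{value} pets: frequency {pet_counts[value]:2d}, "
          f"relative frequency {rel[value]:.2f}")
print(f"At least 1 pet: {at_least_one:.0f}%")
print(f"At most 2 pets: {at_most_two:.0f}%")
```

With these counts, 90% of the homes have at least 1 pet and 64% have at most 2 pets.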
FIRST DATA SET: Pets
- Put the bins in the sheet beforehand.
- Go to the Data tab.
- Click Data Analysis (in the Analysis group).
- Click Histogram.
- Choose the input range.
- Choose the bin range.
- Choose the output range and check the box saying that you want a chart.

EXCEL OUTPUT 2

This chart gives a graphical summary of the data → higher columns mean more data is located there.

SECOND DATA SET: Scores on test
Same steps, except play around with the bins. EXCEL OUTPUT 3

Changing the bins changes the shape of the chart, and this is often how people mislead with statistics. We will see further applications of histograms later, in 9.3.

CENTER OF DATA
The other questions we want to ask about the data have to do with the center of the data. There are three different ways to measure the center:
1. Mean = average ($\bar{x}$)
2. Median = middle value
3. Mode = most frequent value

Look at a small data set first; then we'll return to our two examples.

EX// 5, 2, 7, 3, 8, 4, 2, 2, 3

MEAN = average. Just add them all up and divide by how many there are:
(5+2+7+3+8+4+2+2+3)/9 = 36/9 = 4
With large data sets this becomes cumbersome to write, so we use new notation: $\Sigma$ = sum, so $\sum x$ means the sum of all the x's → $\bar{x} = \frac{\sum x}{n}$
** We'll see more notation like this with our large data sets.

MEDIAN = middle value of the sorted data. When the data set is too large to find the middle by eye, use $L_m = \frac{n+1}{2}$ = location of the median. The median is used for housing prices and income, because the average is not always a good measure of the center: a few extreme values can pull it away from the typical value.

MODE = most frequent value.
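The three measures of center for the small example can be checked with a short Python sketch. It uses the standard-library `statistics` module; the variable names are illustrative assumptions, and `median_location` simply applies the location formula (n + 1)/2.

```python
import statistics

# The small example data set from the notes.
data = [5, 2, 7, 3, 8, 4, 2, 2, 3]
n = len(data)

mean = sum(data) / n               # x-bar = (sum of x) / n
median_location = (n + 1) / 2      # L_m: position of the median in the sorted list
median = statistics.median(data)   # middle value of the sorted data
mode = statistics.mode(data)       # most frequent value

print("sorted:", sorted(data))
print("mean:", mean)
print("median location:", median_location, "-> median:", median)
print("mode:", mode)
```

Sorted, the data is 2, 2, 2, 3, 3, 4, 5, 7, 8, so the mean is 4, the median is the 5th value (3), and the mode is 2.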
HANDOUT P1

Now let's find all three with our large data sets.

First = Mode (most frequent value)

Second = Median (middle value)
$L_m = \frac{n+1}{2} = \frac{51}{2} = 25.5$ → between the 25th and 26th values

PPSLIDE#1
0,0,0,0,0, 1,1,1,...,1, 2,2,2,...
(5 zeros) (13 ones) (14 twos)
The 25th and 26th spots are both 2's, so the median is 2.

PPSLIDE#2
Third = Mean
$\bar{x} = \frac{(0+\cdots+0) + (1+\cdots+1) + (2+\cdots+2) + \cdots}{50} = \frac{5(0) + 13(1) + 14(2) + 7(3) + 7(4) + 4(5)}{50} = \frac{110}{50} = 2.2$
(5 zeros, 13 ones, 14 twos, and so on)

Notice that each term in the numerator is a frequency times a value, $f \cdot x$.

So $\bar{x} = \frac{\sum f x}{n}$
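The frequency-table versions of the mode, median and mean can be sketched in Python as follows. The counts are the pets data from the handout; the helper names (`freq_mean`, `freq_mode`, `value_at`, `freq_median`) are hypothetical and only illustrate the steps described above.

```python
# Pets frequency table: value -> frequency (counts from the handout).
pet_counts = {0: 5, 1: 13, 2: 14, 3: 7, 4: 7, 5: 4}

def freq_mean(counts):
    """Mean from a frequency table: (sum of f * x) / n."""
    n = sum(counts.values())
    return sum(f * x for x, f in counts.items()) / n

def freq_mode(counts):
    """Value with the largest frequency."""
    return max(counts, key=counts.get)

def value_at(counts, position):
    """Value sitting at a 1-based position in the sorted data."""
    running = 0
    for x in sorted(counts):
        running += counts[x]
        if position <= running:
            return x
    raise ValueError("position is beyond the end of the data")

def freq_median(counts):
    """Median located with L_m = (n + 1) / 2; averages two values when n is even."""
    n = sum(counts.values())
    if n % 2 == 1:
        return value_at(counts, (n + 1) // 2)
    return (value_at(counts, n // 2) + value_at(counts, n // 2 + 1)) / 2

print("mode:", freq_mode(pet_counts))
print("median:", freq_median(pet_counts))
print("mean:", freq_mean(pet_counts))
```

For n = 50 the median location is 25.5, so `freq_median` averages the 25th and 26th values, which are both 2.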
Find the mean, median, and mode for the second data set. *First create a midpoint column: you can't do any statistics directly with "bins".*
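A minimal sketch of the midpoint idea, assuming the scores have already been grouped into bins: each bin is replaced by its midpoint, and the frequency-table formula for the mean is applied to the midpoints. The bins and frequencies below are HYPOTHETICAL (chosen only so that they total 35 students); they are not the handout's data, and the function names are illustrative.

```python
# HYPOTHETICAL binned test scores: (low, high) -> frequency.
# These are NOT the handout's numbers; they only illustrate the method.
binned_scores = {
    (50, 59): 3,
    (60, 69): 5,
    (70, 79): 10,
    (80, 89): 11,
    (90, 99): 6,
}

def midpoint(interval):
    """Midpoint of a bin, used as the stand-in value for everything in the bin."""
    low, high = interval
    return (low + high) / 2

def grouped_mean(bins):
    """Approximate mean: (sum of f * midpoint) / n."""
    n = sum(bins.values())
    return sum(f * midpoint(interval) for interval, f in bins.items()) / n

def modal_bin(bins):
    """Bin with the largest frequency."""
    return max(bins, key=bins.get)

midpoints = [midpoint(interval) for interval in binned_scores]
print("midpoints:", midpoints)
print("approximate mean:", round(grouped_mean(binned_scores), 2))
print("modal bin:", modal_bin(binned_scores))
```

Because every score in a bin is treated as if it sat at the midpoint, the result is an approximation of the mean of the raw scores.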
MEASURES OF DISPERSION
The mean, median, and mode can be misleading and don't always give you the full picture.

PPSLIDE#3
EX// Bowling Scores

BERT   ERNIE
185    182
135    185
200    188
185    185
250    180
155    190

You wouldn't guess it at first, but these two guys have the same mean, median, and mode. Bert's mean, median, and mode = 185, AND Ernie's mean, median, and mode also = 185. But Ernie is a much more consistent bowler.

So we need other ways to analyze data than just the center. In this section we are going to look at the dispersion of the data: how far the data is spread out.

One easy thing to look at is the RANGE.
Range: the difference between the largest and smallest data points.
EX// Range for Bert
EX// Range for Ernie

Another measurement of dispersion is called deviation.
Deviation = difference between a single data point and the mean, $x - \bar{x}$. *It does have direction.*
EX// The deviation of 135 is -50.
The deviation of 200 is 15.

What we'd like to do is look at the deviation for all the data points, and maybe get an "average" deviation. Let's try to find the average deviation for Bert. His mean is $\bar{x} = 185$, so:

BERT   Deviation
185    185 - 185 = 0
135    135 - 185 = -50
200    200 - 185 = 15
185    185 - 185 = 0
250    250 - 185 = 65
155    155 - 185 = -30

However, if we try to find the average of the deviations we get:
(0 + (-50) + 15 + 0 + 65 + (-30))/6 = 0/6 = 0

This will ALWAYS happen. We will ALWAYS get zero, because the deviations add up to $\sum x - n\bar{x}$, and $n\bar{x} = \sum x$. So we can't use the average deviation. Instead, statisticians have developed a formula for what is known as the standard deviation, s.
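The range and the deviations for the two bowlers can be checked with a short Python sketch. The scores are the ones from the bowling example; the function names `data_range` and `deviations` are illustrative assumptions.

```python
# Bowling scores from the example.
bert = [185, 135, 200, 185, 250, 155]
ernie = [182, 185, 188, 185, 180, 190]

def data_range(scores):
    """Range: largest data point minus smallest data point."""
    return max(scores) - min(scores)

def deviations(scores):
    """Signed deviation of each data point from the mean (x - x-bar)."""
    mean = sum(scores) / len(scores)
    return [x - mean for x in scores]

for name, scores in (("Bert", bert), ("Ernie", ernie)):
    devs = deviations(scores)
    print(name, "range:", data_range(scores))
    print(name, "deviations:", devs)
    print(name, "average deviation:", sum(devs) / len(devs))
```

Both bowlers have an average deviation of zero, while the ranges (115 for Bert, 10 for Ernie) already show how much more spread out Bert's scores are.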
Now, technically the formula we use is for $s^2$, which is known as the variance. The variance is hard to interpret in real-world terms (its units are the square of the data's units), so we don't discuss it in this class. However, it is very useful in theoretical statistics. So the formula is for $s^2$, and then we take the square root to find s, the standard deviation.

STANDARD DEVIATION (VARIANCE):
$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} \implies s = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n-1}}$
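HANDOUT P2
EX// Find the standard deviation for Bert (39.6232)
GROUP WORK: Find the standard deviation for Ernie (3.6878)

BERT
x      x^2
185    34225
135    18225
200    40000
185    34225
250    62500
155    24025
Total  213200

$s = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n-1}} = \sqrt{\frac{213200 - 6(185)^2}{5}} = \sqrt{1570} \approx 39.6232$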
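As a hedged check on the hand calculation, the sketch below applies the same shortcut formula in Python and compares it with the standard library's `statistics.stdev`, which also computes the sample standard deviation (dividing by n - 1). The function name `shortcut_std` is an illustrative assumption.

```python
import math
import statistics

bert = [185, 135, 200, 185, 250, 155]
ernie = [182, 185, 188, 185, 180, 190]

def shortcut_std(scores):
    """Sample standard deviation via s^2 = (sum of x^2 - n * x-bar^2) / (n - 1)."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum(x * x for x in scores)
    variance = (sum_of_squares - n * mean ** 2) / (n - 1)
    return math.sqrt(variance)

for name, scores in (("Bert", bert), ("Ernie", ernie)):
    print(name, "sum of x^2:", sum(x * x for x in scores))
    print(name, "s (shortcut formula):", round(shortcut_std(scores), 4))
    print(name, "s (statistics.stdev):", round(statistics.stdev(scores), 4))
```

Bert's standard deviation comes out near 39.62 and Ernie's near 3.69, which matches the impression that Ernie is the more consistent bowler.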
ERNIE
x      x^2
182    33124
185    34225
188    35344
185    34225
180    32400
190    36100
Total  205418

For larger data sets given as a frequency table, the formula changes slightly to account for the frequency:

$s^2 = \frac{\sum f x^2 - n\bar{x}^2}{n-1} \implies s = \sqrt{\frac{\sum f x^2 - n\bar{x}^2}{n-1}}$
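The frequency-table form of the formula can be sketched in Python using the pets data from the earlier handout. The name `freq_std` is an illustrative assumption; the cross-check expands the table back into the 50 individual data points and uses `statistics.stdev`.

```python
import math
import statistics

# Pets frequency table: value -> frequency (counts from the handout).
pet_counts = {0: 5, 1: 13, 2: 14, 3: 7, 4: 7, 5: 4}

def freq_std(counts):
    """Sample standard deviation via s^2 = (sum of f*x^2 - n * x-bar^2) / (n - 1)."""
    n = sum(counts.values())
    mean = sum(f * x for x, f in counts.items()) / n
    sum_f_x_squared = sum(f * x * x for x, f in counts.items())
    return math.sqrt((sum_f_x_squared - n * mean ** 2) / (n - 1))

# Cross-check: rebuild the raw data (five 0's, thirteen 1's, ...) and compare.
raw_data = [x for x, f in pet_counts.items() for _ in range(f)]

print("s from frequency table:", round(freq_std(pet_counts), 4))
print("s from raw data:       ", round(statistics.stdev(raw_data), 4))
```

Both routes give the same value (about 1.44 pets), since multiplying by f is just a shorter way of adding the same x² term f times.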
Let's go back to the data sets from last time and find the standard deviations. HANDOUT P3

A few things about standard deviation:
- The standard deviation should never be negative.
- It should be reasonable. Look at your range: the standard deviation shouldn't be bigger than the range of your data.
- In fact, your homework will ask what percentage of the data lies within one standard deviation of the mean.

First example, with pets: What percentage of the data lies within 1 standard deviation of the mean?
$[\bar{x} - s, \bar{x} + s]$
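One way the "within one standard deviation" question could be worked for the pets data is sketched below in Python. It assumes the frequency-table formulas for the mean and standard deviation given earlier; the function names are hypothetical.

```python
import math

# Pets frequency table: value -> frequency (counts from the handout).
pet_counts = {0: 5, 1: 13, 2: 14, 3: 7, 4: 7, 5: 4}

def mean_and_std(counts):
    """Mean and sample standard deviation from a frequency table."""
    n = sum(counts.values())
    mean = sum(f * x for x, f in counts.items()) / n
    sum_f_x_squared = sum(f * x * x for x, f in counts.items())
    s = math.sqrt((sum_f_x_squared - n * mean ** 2) / (n - 1))
    return mean, s

def percent_within_one_std(counts):
    """Percentage of data points lying in [mean - s, mean + s]."""
    n = sum(counts.values())
    mean, s = mean_and_std(counts)
    low, high = mean - s, mean + s
    inside = sum(f for x, f in counts.items() if low <= x <= high)
    return 100 * inside / n

mean, s = mean_and_std(pet_counts)
print(f"interval: [{mean - s:.3f}, {mean + s:.3f}]")
print(f"within one standard deviation: {percent_within_one_std(pet_counts):.0f}%")
```

With these counts the interval runs from roughly 0.76 to 3.64, so it contains the homes with 1, 2, or 3 pets: 13 + 14 + 7 = 34 of the 50 homes, or 68%.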