Spread of Data
So far we have been looking at averages
(means, medians and modes), but these tell us
little without knowing how the data spreads
out around these central measures.
The quartiles mentioned in the last section
can be used to describe how data spreads
around the median. They give us the range of
the middle 50% of the data. This is also
known as the Interquartile Range and is often
represented in a Boxplot.
The variance and the standard deviation are
used in a similar way to describe how the
data spreads around the mean. These will be
used frequently in topics throughout S1 and
S2. The standard deviation is the square root
of the variance.
The following data is the ages of children in a
play school
1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,5
Find the mean average age.
Now calculate how much each child's age
differs from the mean.
How would you program a robot to calculate
this?
How would you differentiate between above
and below the mean?
If we want to know how their ages spread
around the mean, what is wrong with just
adding these differences together and dividing
by 16?
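One way to answer the "robot" question is to spell the steps out as a short program. The sketch below (Python, with variable names chosen here purely for illustration) finds the mean, subtracts it from every age, and adds up the signed differences. The sign shows whether an age is above (+) or below (-) the mean.

```python
ages = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5]

mean_age = sum(ages) / len(ages)

# Signed difference of each age from the mean:
# negative = below the mean, positive = above it.
differences = [age - mean_age for age in ages]

total_difference = sum(differences)

print("mean:", mean_age)
print("differences:", differences)
print("sum of differences:", total_difference)
print("average difference:", total_difference / len(ages))
```

The positive and negative differences cancel each other out, so the total (and therefore the average difference) is zero. This would happen for any data set, so it tells us nothing about spread.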
How the data varies about the mean average is
known as the variance.
We can calculate it for lists of data by
comparing each result to the mean.
Subtract the mean from each piece of data and
square the result (squaring makes every
difference positive, so they don't cancel each
other out).
Add them together and divide by how many
pieces of data there are.
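The recipe above can be sketched in a few lines of Python. The variable names are illustrative only, and the divisor is the number of pieces of data, as the steps describe.

```python
ages = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5]

n = len(ages)
mean_age = sum(ages) / n

# Subtract the mean from each piece of data and square the result.
squared_differences = [(age - mean_age) ** 2 for age in ages]

# Add them together and divide by how many pieces of data there are.
variance = sum(squared_differences) / n

print("variance:", variance)
```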
1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,5
The data above can be put into a frequency table
How might we find
the variance using
this table rather
than finding the
difference between
each piece of data?
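As a possible answer, each squared difference only needs to be worked out once per distinct age and then multiplied by its frequency. The sketch below assumes the frequency table is simply the tally of the sixteen ages listed above; the dictionary layout is an illustrative choice, not something taken from the slide.

```python
# Frequency table: age -> number of children of that age.
frequency_table = {1: 3, 2: 5, 3: 2, 4: 5, 5: 1}

n = sum(frequency_table.values())

# Mean = (sum of f * x) / (sum of f)
mean_age = sum(age * f for age, f in frequency_table.items()) / n

# Variance = (sum of f * (x - mean)^2) / (sum of f)
variance = sum(
    f * (age - mean_age) ** 2 for age, f in frequency_table.items()
) / n

print("n:", n, "mean:", mean_age, "variance:", variance)
```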
This isn't the quickest
method although it does
make the most sense
The formula can be
rearranged into this
form, which is a lot
easier to calculate
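The formulas on this slide did not survive in the transcript. The standard rearrangement, which is presumably the one meant, is sketched below. The symbols (sigma squared for the variance, x-bar for the mean, n for the number of pieces of data) are assumptions made for this sketch.

```latex
\begin{align*}
\sigma^2 &= \frac{\sum (x - \bar{x})^2}{n} \\
         &= \frac{\sum x^2 - 2\bar{x}\sum x + n\bar{x}^2}{n} \\
         &= \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2
            \qquad \text{since } \frac{\sum x}{n} = \bar{x} \\
         &= \frac{\sum x^2}{n} - \bar{x}^2
\end{align*}
```

For the play school ages, the sum of x squared is 146, n is 16 and the mean is 2.75, so the variance is 146/16 - 2.75 x 2.75 = 9.125 - 7.5625 = 1.5625.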
Because we have squared x, the units x is
measured in have also been squared. This
poses a problem when interpreting the
spread of data, so we take the square root of
the variance to give the standard deviation.
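As a quick check on hand calculations, Python's standard `statistics` module has `pvariance` and `pstdev`, which treat the list as a whole population (dividing by n). A minimal sketch using the play school ages:

```python
import statistics

ages = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5]

variance = statistics.pvariance(ages)          # units: years squared
standard_deviation = statistics.pstdev(ages)   # units: years

print("variance:", variance)
print("standard deviation:", standard_deviation)
```

The standard deviation is back in the same units as the data (years), which makes it easier to interpret than the variance.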
Once you've matched them up, decide what the
black and red sets of data are examples of.
Weights of the 10 babies born in Arrowe Park hospital on 3/11/06
Heights of year 11 students at St Anselms College
Size of houses in Bebington
Children who attend churches on the Wirral
Students in UK
Babies born in November 2006
Number of bedrooms in the houses in Oaklands Drive
Age of 12 children in St Andrews Sunday School
Finding the sample mean is exactly the
same as finding the population mean. However,
the sample variance must be calculated
differently.
Compare the differences and similarities
between this formula and the one for
populations.
Remember we use σ (sigma)
for populations and s
for samples
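The sample formula itself is missing from this transcript. The sketch below assumes the usual version, which divides by n - 1 instead of n; that is what `statistics.variance` in Python's standard library does, while `statistics.pvariance` divides by n. Treating the sixteen play school ages as a sample is purely for illustration.

```python
import statistics

ages = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5]

# Same mean either way.
mean_age = statistics.mean(ages)

# Population variance: sum of squared differences divided by n.
population_variance = statistics.pvariance(ages)

# Sample variance (assumed form): divided by n - 1 instead.
sample_variance = statistics.variance(ages)

print("mean:", mean_age)
print("population variance:", population_variance)
print("sample variance:", sample_variance)
```

The sample variance comes out slightly larger because the same total of squared differences is shared between 15 rather than 16.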
Page 40
Exercise 3C
Q3, 4 and 6
Page 43
Exercise 3D
Q5 and 6
Page 49
Mixed Exercise 3F
Q5 and 6
Which average would be the best way to
represent this continuous data?
1,1,1,1,1,2,2,3,3,3,4,4,6,11,22
Why?
Group this data into the classes 1-2, 3-4, 5-7,
8-12 and 13-22.
Now sketch a histogram for it.
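Because the classes have different widths, the bar heights need to be frequency densities (frequency divided by class width). The sketch below tabulates them. It assumes the class boundaries sit half a unit either side of the stated limits (so 1-2 runs from 0.5 to 2.5), which is a common convention for continuous data recorded to the nearest whole number but is not stated on the slide.

```python
data = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 6, 11, 22]
classes = [(1, 2), (3, 4), (5, 7), (8, 12), (13, 22)]

frequencies = []
widths = []
densities = []

for low, high in classes:
    # Count the values recorded in this class.
    frequency = sum(1 for x in data if low <= x <= high)
    # Assumed boundaries: low - 0.5 up to high + 0.5.
    width = (high + 0.5) - (low - 0.5)
    frequencies.append(frequency)
    widths.append(width)
    densities.append(frequency / width)

for (low, high), f, w, d in zip(classes, frequencies, widths, densities):
    print(f"{low}-{high}: frequency {f}, width {w}, density {d:.3f}")
```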
This data is an example of positive skew.
Although the skew can be seen clearly in this
example, other data sets may be only slightly
skewed and the skew less noticeable. Drawing a
histogram is also a long-winded way to decide on
skew. There are other methods which use the
values we have been learning to calculate on
this course.
Work out Q1, Q2 and Q3 for this data.
By calculating the differences between the
median and the two quartiles, decide which
side of the boxplot is bigger (left or right).
We can use this technique to measure
negative and positive skew.
Other measures of skewness include:
positive skew: mode < median < mean
negative skew: mean < median < mode
or
if we want to quantify the skewness as well as
identify its direction we may use:
3(mean - median) / standard deviation
The closer the value is to zero, the nearer the
distribution is to symmetrical (like the normal
distribution).
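Both checks can be tried on the skewed data set from earlier. This is a minimal sketch using Python's standard `statistics` module. It assumes the standard deviation in the formula is the population one (`pstdev`), and the variable names are illustrative.

```python
import statistics

data = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 6, 11, 22]

data_mean = statistics.mean(data)
data_median = statistics.median(data)
data_mode = statistics.mode(data)
data_sd = statistics.pstdev(data)  # assumption: population s.d.

# Direction of skew from the order of the three averages.
if data_mode < data_median < data_mean:
    direction = "positive"
elif data_mean < data_median < data_mode:
    direction = "negative"
else:
    direction = "unclear from the averages alone"

# Size and direction of skew: 3(mean - median) / standard deviation.
skewness = 3 * (data_mean - data_median) / data_sd

print("mode, median, mean:", data_mode, data_median, data_mean)
print("direction:", direction)
print("skewness:", round(skewness, 3))
```

A positive value of the coefficient agrees with the ordering mode < median < mean for this data.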
There are other ways to interpret measures of
location (averages) and spread (IQR and s.d.).
Often these values are of most use when
comparing two or more data sets rather than
when used on their own.
The Coefficient of Variation, V, is defined as:
V = 100 x standard deviation / mean
This gives a percentage of dispersion.
The Quartile coefficient of variation, QV, is:
QV = 100 x 0.5(Q3 - Q1) / Q2
This last one is more useful when you have
outliers you feel are distorting your data and you
wish to ignore them.
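A sketch of both coefficients for the skewed data set used earlier. It takes V as 100 x standard deviation / mean (the usual definition) with the population standard deviation, and finds the quartiles with `statistics.quantiles`, whose default method may not match the textbook's rule for locating quartiles. Both choices are assumptions made for this example.

```python
import statistics

data = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 6, 11, 22]

data_mean = statistics.mean(data)
data_sd = statistics.pstdev(data)  # assumption: population s.d.

# Quartiles using the library's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Coefficient of variation: spread as a percentage of the mean.
v = 100 * data_sd / data_mean

# Quartile coefficient of variation: half the IQR as a
# percentage of the median.
qv = 100 * 0.5 * (q3 - q1) / q2

print("V:", round(v, 1))
print("Q1, Q2, Q3:", q1, q2, q3)
print("QV:", qv)
```

V is far larger than QV here because the values 11 and 22 inflate the standard deviation but barely affect the quartiles.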
Lastly, we should add a way to quantify the
outliers which keep being mentioned.
Again there are many rules people use, but one
that Edexcel uses is:
if (x - mean) / standard deviation is greater than 2 or less than -2 then that
value of x is considered an outlier
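The expression being compared with 2 and -2 is missing from this transcript. The sketch below assumes it is the standardised value (x - mean) / standard deviation, so that an outlier is any value more than two standard deviations from the mean. That reading, the use of the population standard deviation, and the variable names are all assumptions made for this example.

```python
import statistics

data = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 6, 11, 22]

data_mean = statistics.mean(data)
data_sd = statistics.pstdev(data)  # assumption: population s.d.

# Standardised value of each x (assumed form of the rule).
standardised = [(x - data_mean) / data_sd for x in data]

# Flag values whose standardised value is above 2 or below -2.
outliers = [x for x, z in zip(data, standardised) if z > 2 or z < -2]

print("outliers:", outliers)
```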
Page 71
Exercise 4F
Q1 - 4
Page 72
Mixed Exercise 4G
Q1, 3, 4, 7 and 8