Data is/are…



Everyone gets a card; each group gets a corner
(hearts, diamonds, spades, clubs)
Below are two sets of data. Discuss in your
group what we can say about each data set. What
could each represent? What couldn’t each
represent?
Data Set #1:
97, 98, 94, 92, 31, 98, 93, 95, 97, 98, 98
Data Set #2:
4.0, 6.8, 7.1, 7.1, 7.2, 7.4, 7.7, 7.8, 12.1
Each group share out


Observations that you or someone else
records
But data is/must be more than just numbers;
it is numbers in context ... the story behind
the numbers...

That’s where statistics comes in... According
to textbook author Dr. Robert Gould, “The
goal of statistics is finding meaning in data.”



Think for 1 minute.
Share with the person next to you for 1
minute.
Share out to entire class (random selection;
number off).





Listen to the radio or watch TV news.
Listen/look for a recent study & findings
newscasters are reporting
Summarize the study/report
Why do you think they did the study/What
was the purpose? How do you think they got
their data?
Typed; printed out (do not submit via email);
at least 5-6 complete sentences.
Due: next class, at beginning of class.




...13 million adults, or 46 percent of the state’s population,
are believed to be living with the precursor of type 2 diabetes
or undiagnosed diabetes.
Researchers predict that 33 percent of young adults, aged 18
to 39, are pre-diabetic, which is rare since diabetes is
generally more common among older adults.
According to researchers, up to 30 percent of residents living
with pre-diabetes will develop type 2 diabetes within five
years. As many as 70 percent of adults will develop the
disease within their lifetime.
Why do you think UCLA is interested in this information? How
do you think UCLA got these numbers, percents, predictions?
Let’s think
of a topic...

Now, what
are some
questions about
that topic?

Write on board...
and/or share out


Dr. Gould, UCLA




See my website: articles, assignments,
activities & more, at the bottom
20 – 30 minutes. Then report out/share, then
revise based on comments/feedback
Debrief activity

Always, always comment, answer, compare,
contrast… whatever the case… in context! How
can we find meaning if we don’t have context?


Remember what Dr. Gould said: “The goal of
statistics is finding meaning in data.”
What are the objects? What was measured? What
are the units of measure? Think about how we
started our class tonight with the 2 data sets I
gave you...


Look at the eight questions we considered in
our activity we just did
Could we organize them into two different
types of data, two different types of
variables? How?

Can you think of categorical data that looks
like numerical data… but isn’t? It’s really
categorical. Discuss for a minute…


Always ask yourself, does finding the mean
(average) of this data make sense?
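
One way to try the “does the mean make sense?” check is sketched below. The ZIP codes are made-up illustrative values (an assumption, not class data); the point is only that software will compute a mean for anything that looks like a number, even when the result has no meaning.

```python
from statistics import mean, mode

# Hypothetical ZIP codes (illustrative values only, not from the class survey).
zip_codes = [91355, 91355, 91321, 91390, 91350]

# Python will happily average them, but the result does not describe any student.
average_zip = mean(zip_codes)

# Counting categories (here, the most common one) is a summary that does make sense.
most_common_zip = mode(zip_codes)

print("Mean of ZIP codes:", average_zip)
print("Most common ZIP code:", most_common_zip)
```

If the mean is not a value anyone could sensibly report, the data is probably categorical, even though it is written with digits.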



Come up to board and write the number of
different types of social media YOU have used
TODAY; write anywhere; no need to organize
in any special way.
If you are male, please use a blue marker
If you are female, please use a black marker


Number of different types of social media
YOU have used TODAY; blue: male; black:
female
One minute to talk to the person next to you
about one observation you can make about
our data; be prepared to share out your
observation



First, it’s always helpful to ...
Second, and probably more importantly, it’s
always helpful to ...


Talk to the person next to you for 2 minutes.
What type of graphical representation would
you choose to best represent this data and
why (your group doesn’t actually have to
create the graphical representation at this
time). Be prepared to explain/justify your
reasoning/your choice.
Share out.






Dot plots
Stem (and leaf) plots
Histograms
Box plots (later...)
(and much later) ... Density curves, scatter
plots, least-squares regression lines, Normal
probability plots, etc.
Why didn’t I list pie charts or bar graphs?
We always want to create a graphical
representation; visuals help us process
information and identify trends more easily
We always label & scale our graphical
representations
We always use technology when available (no
need to create graphical representations by
hand)


Dot plot... What’s good about dot plots?
What’s not so good?
Histogram... What’s good about histograms?
What’s not so good?


Stem (and leaf) plots... What’s good about stem
plots? What’s not so good?
Box Plots... In Stat Crunch, but will learn much
more about box plots later...
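
To make the stem plot idea concrete, here is a minimal sketch that builds one in plain text for Data Set #1 from the opening activity. It assumes two-digit whole numbers, with the tens digit as the stem and the ones digit as the leaf; Stat Crunch may lay out its stem plot differently.

```python
from collections import defaultdict

# Data Set #1 from the opening activity.
data = [97, 98, 94, 92, 31, 98, 93, 95, 97, 98, 98]

# Split each value into a stem (tens digit) and a leaf (ones digit).
leaves_by_stem = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    leaves_by_stem[stem].append(leaf)

# Print every stem from smallest to largest, including empty ones,
# so that gaps in the data stay visible.
stem_plot_lines = []
for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
    leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
    stem_plot_lines.append(f"{stem} | {leaves}".rstrip())

print("\n".join(stem_plot_lines))
```

The empty stems between 3 and 9 show the gap, and the lone leaf on stem 3 stands out as an unusual value.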




Go to my website, click on COC Math 140 Survey Data
spreadsheet. Find the column ‘How much do you weigh (in
pounds).’ Copy and paste into a column in Stat Crunch.
Create a histogram, a stem plot, a dot plot, or a box plot of
this data (your choice). Be sure to label your graphical
representation. Put both your names on it.
Looking at your graphical representation, what can you say
about the distribution/the data? Be prepared to share out one
thing you observe in the graph (we will display your graph up
on screen so we can all see it as you describe it)
We will print it later and turn it in... But we will do something
else with it in a few...

Frequency vs. Relative Frequency
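
A short sketch of the difference, using Data Set #1 from the opening activity: frequency is a count of how many times a value occurs, and relative frequency is that count divided by the total number of observations. The variable names are illustrative only.

```python
from collections import Counter

# Data Set #1 from the opening activity.
data = [97, 98, 94, 92, 31, 98, 93, 95, 97, 98, 98]

# Frequency: how many times each value occurs.
frequency = Counter(data)

# Relative frequency: each count as a share of all observations.
n = len(data)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value in sorted(frequency):
    print(f"{value:>3}  frequency = {frequency[value]}  "
          f"relative frequency = {relative_frequency[value]:.3f}")
```

The relative frequencies always add up to 1 (or 100%), which makes graphs of groups with different sizes easier to compare.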


No matter which graphical representation you
created with this data set, how did we
describe the graphical representations?
What types of characteristics did we consider
when trying to describe the graph of this
data?
S – Shape.
Symmetric? Skewed? Uni-Modal,
bi-modal, tri-modal, multi-modal? Gaps?
O – Outlier(s)
Is/are there unusually large or small
values that are “away” from the majority
of the rest of the data?
C – Center
What is the “typical*” value of the
distribution/data?
S – Spread
Typically/on average*, how far apart or
close together is the data/distribution?
* Different types of ‘averages’ and ‘typical’. Will discuss further and in detail soon.



Practice: Let’s look at our social media data with a
histogram, dot plot, stem plot, or box plot; &
describe the distribution using SOCS. What is likely?
Unlikely? What type of a statement could you make
(based on this data) about ALL COC students
regarding social media?
Now with the graphical representation you and your
partner created from the ‘weights’ data, describe
the distribution using SOCS. You have 10 minutes.
You will turn this in as an assignment.
Be prepared to do a 1-minute share out as I will
randomly call on a few pairs to share out

Type of first pet ... or favorite social media,
favorite app for cell phone, hair color, make
of car you drive, marital status, etc.

Bar (charts) graphs (caution; very different
from histograms; why?)


On left is bar graph; on right is histogram
Be sure you understand the difference between
the two graphical representations

Bar (Charts) Graphs

Pie Charts

BIG IDEA... the same... visualizing data can be
helpful in observing trends

Can we analyze pie charts or bar graphs with SOCS?
Why or why not?

Whether categorical or numerical, always good to
graph your data


Let’s go to the Math 140 data set, and choose
a set of categorical data; cut and paste into
Stat Crunch; create bar graph and pie chart;
make observations; ask questions
With a partner, choose a different categorical
data set, practice creating a bar graph AND a
pie chart using the data; make observations;
ask questions; we will share out in 10
minutes.








Form groups of 3 randomly (how would we like to do this?)
Each group will have a measuring tape. The first person stands (preferably in
front of a wall) and imagines that she or he is at an ATM getting cash. The second
student stands behind the first. The first student tells the second student how far
back he or she must stand for the first student to be just barely comfortable,
saying for example, “Move back a little, now move forward just a tiny bit,” and so
on. When that distance is set, the third student measures the distance between
the heel of the first person’s right shoe to the toe of the second person’s right
shoe. That will be called the ‘personal distance.’
First, answer the BEFORE THE ACTIVITY questions below (1 paper for the whole
group):
1. Do you think men and women will have different personal distances? Why?
Will the larger distances be specified by the men or the women?
2. Which group do you think will have distances that are more spread out?
3. What do you think the shape of each of the distributions will be?
For each student in your group, record the gender and personal distance. Write
each of these personal distances on the board. Use blue for male and black for
female.
Note: Be respectful of other people’s personal space. Do not make physical
contact with other students during this activity.

Input data into StatCrunch
1-2 paragraph write up which answers the
question, “Do men and women have different
personal distances?”
Include graphs (justify your group’s choice of
graph) & numerical analysis (SOCS) of
data/graphs (from Stat Crunch; cut and paste)
All members of group must contribute
Maximum points possible: 20 project points.

From Robert Gould, Introductory Statistics






Four Corners: Go to your corner based on if
your birthday falls in the Winter, Spring,
Summer, or Fall; 1 minute
In your group, come to a consensus about
the three most important topics we learned
and list them on the board. 5 minutes.

Appropriate graphical representations
(numerical & categorical data)

Always graph the data; always. Always
embed context. Always.

Describing numerical distributions/data sets
via SOCS (the basics; we will get more
sophisticated with our descriptions soon); do
we use SOCS to describe categorical data
distributions? Why or why not?

Shape, Outlier(s), Center, Spread

We loosely defined ‘center’ and ‘spread’

Now we will be much more specific & detailed

... And remember, always embed context

Here we go ...


When I say a word, you immediately write
down what you think it means; don’t think,
just write. Don’t talk; don’t say anything to
anyone.
Ready?

Average

The annual salaries of 7 patrons in a diner are listed below.
$45,000
$48,000
$52,000
$35,000
$46,000
$40,000
$58,000

Find the mean and the median using Stat Crunch

Are the mean and the median similar? Would they represent a
‘typical’ or ‘average’ customer’s salary?

Should we use the mean or the median in this case?

Graph the data (let’s practice a histogram; then a box plot) using
Stat Crunch. What shape is the distribution?
$45,000
$48,000
$52,000
$35,000
$46,000
$40,000
$58,000
$3,710,000,000

Find the mean and the median using Stat Crunch

Are the mean and the median similar? Would both or either
represent a ‘typical’ or ‘average’ customer’s salary?

Should we use the mean or the median in this case?

Graph the data (histogram; box plot) using Stat Crunch. What
shape is the distribution?

Means are excellent measures of central tendency if
the data is (fairly) symmetric

However, means are highly influenced by outlier(s)


So, if the data has an outlier(s), then a better measure
of central tendency is the median, which is hardly
influenced by outliers; such a measure is called ‘resistant’
So, consider the shape of data/distribution, then
wisely choose an appropriate measure of central
tendency
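
The diner example can be checked outside Stat Crunch with a minimal sketch like the one below. It uses the salaries from the two slides (with and without the $3,710,000,000 patron); the variable names are illustrative only.

```python
from statistics import mean, median

# Annual salaries of the 7 diner patrons.
diner_salaries = [45_000, 48_000, 52_000, 35_000, 46_000, 40_000, 58_000]

# The same diner after the eighth patron walks in.
with_outlier = diner_salaries + [3_710_000_000]

for label, salaries in [("7 patrons", diner_salaries), ("8 patrons", with_outlier)]:
    print(f"{label}: mean = ${mean(salaries):,.2f}, median = ${median(salaries):,.2f}")
```

One extra value moves the mean from about $46,000 to hundreds of millions of dollars, while the median only moves from $46,000 to $47,000. That is what ‘resistant’ looks like in practice.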



So, when we are analyzing a numerical
distribution (like looking at a histogram, stem
plot, box plot, etc.), we need to wisely choose
which ‘C’ to use... mean or median
Generally, if symmetric use mean (or median)
as a measure of central tendency; they will be
similar in value (or the same)
If skewed (left or right) use median as a
measure of central tendency; why?



What is the median of each of the following
data sets? What is the mean of each?
(4, 4, 5, 6, 6)
(5, 5, 5, 5, 5)
Are they the same distribution/data set?
Another characteristic that is helpful in
describing distributions/data sets is the
measure of spread (or the typical distance
from the center)


Two measures of spread that we will focus on
in this course are the standard deviation &
inter-quartile range
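
The two small data sets above make the case for measuring spread. A minimal sketch, assuming the sample standard deviation (dividing by n – 1), which is what most statistics software reports by default:

```python
from statistics import mean, median, stdev

set_a = [4, 4, 5, 6, 6]
set_b = [5, 5, 5, 5, 5]

for name, data in [("A", set_a), ("B", set_b)]:
    print(f"Set {name}: mean = {mean(data)}, median = {median(data)}, "
          f"sample standard deviation = {stdev(data)}")
```

Both sets have the same mean and the same median, so the center alone cannot tell them apart; only the spread can.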


Standard deviation: a typical distance of the
observations from their mean; it is a number that
measures how far away the typical observation is
from the center of the distribution




Your team’s task: Create a data set of four whole
numbers (from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) with
the lowest standard deviation value possible
Input your four numbers (again use numbers
from 0 to 10 only) into Stat Crunch, then
calculate the standard deviation
Change a value or values until you get the lowest
possible standard deviation you can. 3 minutes.
Go.
Now create a data set (again only from 0 to 10)
with the largest possible standard deviation.
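
After trying the activity by hand, your team’s answers can be checked with a brute-force sketch like this one. It assumes repeated values are allowed and that the standard deviation is the sample version (dividing by n – 1); it simply tries every possible set of four numbers from 0 to 10.

```python
from itertools import combinations_with_replacement
from statistics import stdev

# Every possible set of four whole numbers from 0 to 10
# (repeats allowed, order ignored).
candidates = list(combinations_with_replacement(range(11), 4))

# Pick the sets with the smallest and largest sample standard deviation.
smallest = min(candidates, key=stdev)
largest = max(candidates, key=stdev)

print("Smallest:", smallest, "standard deviation =", stdev(smallest))
print("Largest: ", largest, "standard deviation =", round(stdev(largest), 3))
```

Many sets tie for the smallest standard deviation (any four identical numbers), so the sketch reports only the first one it finds. What do the winning sets have in common, and why?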

Variance: not used very often; usually, if we use a mean
as a measure of central tendency, we use the
standard deviation as our measure of spread

Variance is related to standard deviation

variance = (standard deviation)²

standard deviation = √variance
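
The relationship between the two can be verified numerically. A minimal sketch using Data Set #2 from the opening activity, assuming the sample versions of both statistics:

```python
from math import isclose, sqrt
from statistics import stdev, variance

# Data Set #2 from the opening activity.
data = [4.0, 6.8, 7.1, 7.1, 7.2, 7.4, 7.7, 7.8, 12.1]

sd = stdev(data)       # sample standard deviation
var = variance(data)   # sample variance

print("standard deviation:", round(sd, 3))
print("variance:          ", round(var, 3))
print("sd squared equals variance:", isclose(sd ** 2, var))
print("sqrt(variance) equals sd:  ", isclose(sqrt(var), sd))
```

Note the units: if the data is in pounds, the standard deviation is in pounds but the variance is in pounds squared, which is one reason the standard deviation is easier to interpret in context.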



Write the number of siblings you have on the board
& enter into Stat Crunch
Numerical analysis (statistical summary in
Stat Crunch) and graphical representation
Describe the distribution

But we still need to describe the center and the
spread of the distribution

Use median and IQR (Inter-quartile Range)

Median & IQR are not affected by outlier(s)
(resistant)

IQR = Q3 – Q1

IQR is amount of space the middle 50% of the
data occupy

Another measure of variability (used with any
distribution) is range

Range = maximum value – minimum value

Range for our data =
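
The class siblings data is not reproduced here, so the sketch below uses Data Set #2 from the opening activity instead. It assumes Python’s default quartile method; software packages (including Stat Crunch) may use slightly different quartile conventions, so Q1 and Q3 can differ a little from one tool to another.

```python
from statistics import quantiles

# Data Set #2 from the opening activity.
data = [4.0, 6.8, 7.1, 7.1, 7.2, 7.4, 7.7, 7.8, 12.1]

# Quartiles using Python's default ("exclusive") method.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1                        # width of the middle 50% of the data
data_range = max(data) - min(data)   # maximum value minus minimum value

print(f"Q1 = {q1:.2f}, median = {q2:.2f}, Q3 = {q3:.2f}")
print(f"IQR = {iqr:.2f}")
print(f"Range = {data_range:.1f}")
```

The range is much larger than the IQR here because the range depends only on the two most extreme values, while the IQR ignores them.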




Boxplots are the only graphical representation
where we specifically define an outlier
Potential outliers are values that are more than
1.5 IQRs from Q1 or Q3
IQR x 1.5; add that product to Q3; any value(s)
beyond that point is a potential outlier to the right
IQR x 1.5; subtract that product from Q1; any value(s)
beyond that point is a potential outlier to the left
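
The 1.5 x IQR rule can be sketched in a few lines. This example uses Data Set #2 from the opening activity and assumes Python’s default quartile method; the names `lower_fence` and `upper_fence` are illustrative labels for the two cut-off points, not terms from the slides.

```python
from statistics import quantiles

# Data Set #2 from the opening activity.
data = [4.0, 6.8, 7.1, 7.1, 7.2, 7.4, 7.7, 7.8, 12.1]

q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Cut-off points: 1.5 IQRs below Q1 and 1.5 IQRs above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value beyond either cut-off is a potential outlier.
potential_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Cut-offs: {lower_fence:.2f} and {upper_fence:.2f}")
print("Potential outliers:", potential_outliers)
```

A flagged value is only a potential outlier; whether it is an error or a genuine observation still has to be investigated in context.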

Using Stat Crunch, calculate descriptive statistics

Let’s calculate (by hand) to see if we have any outliers

Q3 – Q1 = IQR

IQR x 1.5; add this product to Q3; are there any
values in our data set beyond this point to the right?

IQR x 1.5; subtract product from Q1; are there any
values in our data set beyond this point to the left?

Now use Stat Crunch to create a boxplot; are our
calculations confirmed with our boxplot?

Is it really an outlier?

Is your data correct? Was it input accurately?

COC’s recent 99-year-old graduate

Don’t automatically throw out an unusual
piece of data; investigate

In pairs, choose a set of data from the Math 140
spreadsheet that is skewed (to left or right); you probably
won’t know if the data is skewed until you copy and paste
into Stat Crunch and create a graph

Create a box plot; print out; put your names on it

Label (on the graph) the 5-number summary (with arrows
pointing to each value on the graph)


Analyze through SOCS (which measure of central tendency
should you use? Which measure of spread should you
use?); be sure you show your work to justify that a
point/points are outliers
Now, using the same data, create a histogram. What
characteristics of the data does the histogram show that
the box plot does not?
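
As a model for the work this activity asks you to show, here is a minimal sketch using Data Set #1 from the opening activity (a skewed set, standing in for whichever column you choose from the Math 140 spreadsheet). It assumes Python’s default quartile method, which may differ slightly from Stat Crunch.

```python
from statistics import quantiles

# Data Set #1 from the opening activity (skewed to the left).
data = [97, 98, 94, 92, 31, 98, 93, 95, 97, 98, 98]

# 5-number summary: minimum, Q1, median, Q3, maximum.
q1, med, q3 = quantiles(data, n=4)
five_number_summary = (min(data), q1, med, q3, max(data))

# Work to justify outliers: 1.5 IQRs beyond Q1 and Q3.
iqr = q3 - q1
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_cutoff or x > upper_cutoff]

print("5-number summary:", five_number_summary)
print("IQR:", iqr)
print("Cut-offs:", lower_cutoff, "and", upper_cutoff)
print("Outliers:", outliers)
```

Because the data is skewed, the median and IQR are the appropriate choices for center and spread here.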





1. For each of the following sample statistics, classify it as a measure of
spread (variability), a measure of center (average), or a measure of
position. Then write a sentence describing what the statistic tells us.
a) Mean
b) Standard Deviation
c) Minimum
d) Range
e) Median
f) Quartile 3 (Q3)
g) Interquartile Range (IQR)
h) Maximum
i) Quartile 1 (Q1)
j) Variance
k) Mode

2. Which measure of center is the most accurate for bell shaped (normal)
data sets? Which is the most accurate for skewed data sets?

3. Which measure of spread is the most accurate for bell shaped
(normal) data sets? Which is the most accurate for skewed data sets?

4. List all the measures of position.

5. Use StatCrunch and the Bear data to find all of the summary statistics
we discussed for the bears’ weights. You need to give the name of the
statistic, the number, and the units.

Will cover Module 1 through Module 10

Topic review sheet on my website