Lecture 1 - The University of Texas at Dallas

Statistics 5311
Applied Statistics for Management I
Textbook: See Current Syllabus
Prerequisites: See Current UT Dallas Catalog for Educational and Course
Prerequisites
Access to a Statistical Package (for example EXCEL with the Data
Analysis ToolPak Add-in, or MINITAB, or SAS, or SPSS, etc.)
I will be using Microsoft EXCEL.
Instructor: John J. Wiorkowski, Professor of Statistics
[email protected]
Telephone: 972-883-2274 (USA, on Central Standard Time)
What is Statistics?
Humorous:
The Science of drawing a precise line between an unwarranted assumption
and a foregone conclusion.
The Science of stating precisely what you don’t know.
Popular Conceptions:
Facts
Demographics
Census Counts
Product Sales
Touchdowns in American Football
Runs scored in British Cricket
Economic Projections
Sales Forecasts
Market Projections
Consumer Price Index
Probability
Odds
Gambling
“Lies, Damned Lies, and Statistics”
(Mark Twain)
Actually, “Statistics” encompasses all of the above
Popular Concept: You can prove anything with Statistics
If you think of the word “prove” in its mathematical sense, in which
things are either true or false, then in fact you can’t prove anything with
statistics.
Statistics uses mathematics, but it generalizes the concept of “true or false”.
EXAMPLE: YOU LAUNCH A NEW PRODUCT
Mathematics: It Will Fail (0) or It Will Succeed (1)
As a decision maker, you cannot expect to have absolute certainty.
Accordingly you must assess the risks (as measured by probability) and make your
decisions in a world of uncertainty.
Statistics:
It Will Fail 0________________x___1 It Will Succeed
(x marks a probability of success of .8)
This means that your product has an 80% chance of succeeding and a 20%
chance of failing. You as the decision maker must decide if you can live with this
risk.
The main reason that the myth of proving anything with statistics exists is
that most people do not think using statistical logic; rather, they think
“mathematically” and ignore variability. For example, recently (Fall 1999) a story
aired on television in the Dallas area indicating that fewer Hispanic (individuals of
Latin American Descent) college students received financial aid than Anglo
(individuals of European Descent) college students. This led to angry responses by
politicians in the State of Texas. The local newspaper in fact reported the actual
statistics:
Ethnic Group    Proportion of College Students Receiving Aid
Hispanic        57 percent
Anglo           59 percent
Black           77 percent
Anyone skilled in statistical thinking would have realized that the Hispanic-Anglo
difference is well within statistical fluctuation and the two groups are
essentially the same, while both the Hispanic and Anglo groups are much below the
Black Ethnic group, a fact which was not reported on TV.
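One way to make “within statistical fluctuation” concrete is a two-proportion z statistic. The sketch below is only an illustration: the news story gives percentages but no sample sizes, so the group size of 400 is a hypothetical assumption, and the function name is ours. A z value smaller than about 1.96 in absolute size is conventionally treated as compatible with chance variation.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two sample proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# HYPOTHETICAL sample size per group; the reported figures give only percentages.
n = 400

z_anglo_hispanic = two_proportion_z(0.59, n, 0.57, n)
z_black_anglo = two_proportion_z(0.77, n, 0.59, n)

print("Anglo vs Hispanic z:", round(z_anglo_hispanic, 2))
print("Black vs Anglo z:", round(z_black_anglo, 2))
```

Under this assumed sample size the Anglo-Hispanic z falls below 1.96 while the Black-Anglo z is far above it. With much larger samples even a two-point difference could stand out, so the conclusion depends on the sample sizes actually used.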
One Definition of “Statistics”
Statistics is a body of theory and techniques designed to:
a) convert data into “information” through the use of graphical
displays, summarization, and other techniques so that patterns in
the data are apparent; and
b) extrapolate any perceived pattern in the data to a broader area of
applicability.
Statistics Usually Begins With a Business Problem
Examples:
Accounting -- an auditor is interested in the costs of business
travel
Organizational Behavior -- a manager is interested in why turnover
of employees seems to have increased
Marketing -- your firm wishes to expand its product line and
determine which products are of interest to consumers and within
the scope of your company’s expertise
Finance -- the CEO wishes to understand what factors are
affecting the firm’s stock price
Economics -- your firm is interested in the status of the economy
over the next year
Operations Research -- store managers have been reporting
increased waiting lines at checkout counters
Management Information Systems -- you are contemplating
replacing your financial reporting system
International Management -- you are interested in expanding
your firm’s services to a non-US market
Having Defined Your Problem You Need to Collect Data
The set of all objects (could be persons, records, computer transactions, etc.)
which are relevant to your problem is called the Population.
For example, in the Marketing example above suppose you worked for a
beverage firm and were interested in marketing a new Cola drink, say Vanilla Cola.
The first question of interest is whether people would like it (a later question is
whether they would switch from their present preferences).
What is the Population?
Guess 1 -- All consumers. (Probably too broad.)
Guess 2 -- All consumers who drink soft drinks. (Probably too broad.)
Guess 3 -- All consumers who drink cola beverages.
This population is extremely large. We cannot possibly ask everyone, so we
will ask a smaller set of consumers called a sample.
How does one take a sample?
Perhaps the least understood aspect of statistical analysis is the importance
of taking the sample in an appropriate fashion.
All the formulae used in this course, and indeed in most simple statistical
analyses, require a random sample.
A random sample is a sample chosen using a randomization scheme. This
randomization scheme must have the following properties:
the probability of any individual object in the population being included in the
sample must be the same as for any other object; and,
the probability of any pair of objects in the population being included in the
sample must be the same as for any other pair; and,
the probability of any three objects in the population being included in the
sample must be the same as for any other set of three objects; and,
...
the probability of any n objects in the population being included in the sample
must be the same as for any other set of n objects.
We will use the symbol “n” for the sample size and “N” for the size of the
population.
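The “every set of n objects is equally likely” property can be checked by simulation. The sketch below assumes a toy population of N=5 and samples of size n=2 drawn without replacement (these sizes are chosen only for illustration). It counts how often each of the 10 possible pairs is drawn.

```python
import random
from collections import Counter
from itertools import combinations

random.seed(0)  # fixed seed so the run is repeatable

N, n = 5, 2  # toy sizes, assumed for illustration only
population = list(range(1, N + 1))
draws = 100_000

# random.sample draws without replacement; sorting each draw makes
# (2, 1) and (1, 2) count as the same set of objects.
counts = Counter(
    tuple(sorted(random.sample(population, n))) for _ in range(draws)
)

all_subsets = list(combinations(population, n))  # every possible sample of size n
for subset in all_subsets:
    print(subset, round(counts[subset] / draws, 3))
```

Each pair should turn up in roughly one tenth of the draws, which is what the randomization scheme requires.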
Also a sample can either be taken with replacement (meaning the same object
may be chosen more than once), or without replacement (meaning that once an
object is picked, it can never be picked again).
All the formulae in your textbook assume that sampling is done randomly
with replacement.
(If n/N is less than .02, the formulae in the text can be used even though
sampling is done without replacement).
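As a rough illustration of the two sampling modes, the sketch below uses Python's standard `random` module with N=20 and n=5 (the sizes used later in this lecture). The variable names are ours, and the final check simply restates the n/N rule of thumb quoted above.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

N, n = 20, 5
population = list(range(1, N + 1))

# With replacement: the same case may be chosen more than once.
with_replacement = random.choices(population, k=n)

# Without replacement: once a case is picked it cannot be picked again.
without_replacement = random.sample(population, n)

# Rule of thumb quoted in the lecture: the textbook formulae can still be
# used for sampling without replacement when n/N is less than .02.
sampling_fraction = n / N
fraction_is_small = sampling_fraction < 0.02

print(with_replacement, without_replacement, sampling_fraction, fraction_is_small)
```

Here n/N is .25, so the rule of thumb is not met for this small example.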
It is important to realize that you can’t tell if a sample is random by looking
at it.
In the Original Texas Lottery six numbers were picked at random from 50
without replacement (this is equivalent to N=50 and n=6). There are 15,890,700
different samples which could be picked.
Most of them look like the following: 11, 17, 26, 31, 48, 50.
But it could just as easily have come out: 1, 2, 3, 4, 5, 6.
Both are equally likely; however, most samples will look more like the
former than the latter.
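The count of possible samples is the number of ways to choose 6 objects from 50 when order does not matter. A minimal check using Python's standard library:

```python
import math

N, n = 50, 6

# Number of distinct samples of size n drawn without replacement from N.
number_of_samples = math.comb(N, n)

# Every particular set of six numbers has the same chance of being drawn.
probability_of_any_one_sample = 1 / number_of_samples

print(number_of_samples)  # 15890700
print(probability_of_any_one_sample)
```

So the sample 1, 2, 3, 4, 5, 6 has exactly the same probability as any other specific set of six numbers.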
How To Take a Random Sample
It is usually necessary to form a list of some kind to take the sample. In the
case of our “Vanilla Cola” we would need to get a list of Cola drinkers. One way
this could be obtained is by purchasing this information from food stores which can
often track which of their customers are buying Cola when they use various
discount cards. Alternatively, one might cull the list of individuals who have
responded to various promotions of your company which usually demonstrates use
of your Cola products. (If a list cannot be obtained, then other sampling techniques
need to be used; we will discuss these later.)
To illustrate how a random sample is taken, let us use N=20 and n=5, and use
EXCEL. Please open your EXCEL worksheet now.
Your screen should look like this. I have entered the list of the population.
If you press the button labeled fx on the tool bar at the top of the screen you will see
something that looks like this:
In the left panel click on “All” and then scroll down the right panel till you see the
function “RAND”. Your screen should look something like this:
The description indicates that this function will generate a random number between 0
and 1. In EXCEL two symbols indicate a function. You begin with an “=” sign,
type in the function name and enclose the arguments of the function in parentheses.
The function RAND has no arguments so one just types =rand() as shown below.
When you hit enter a random number will appear in the cell as shown below. (Note:
don’t be surprised if this number changes whenever you do something; it is
programmed to generate a new random number whenever any computation is made
or the F9 key is pressed. Press F9 a few times to see the number change.)
Now copy down the entry in this cell to the remaining 19 cases by grabbing the
lower corner of the cell and holding your left mouse button down as you drag down
the column as shown below.
The result should look something like the following:
Since the values of the random numbers will keep changing, we need to fix them so
they will remain the same. With the values still highlighted as above, click on the
word “Edit” on the top line of your spreadsheet. It should look something like this:
Now press copy on the displayed menu. You will see a shimmering line around the
twenty highlighted numbers. Hit “Edit” again, and then click on “Paste Special”.
You should see something like the following:
Now click on the word “Values” on the displayed panel and hit “OK”. This should
result in something that looks like the following. Now when you press the F9 key,
the values will no longer change.
To take our sample of size 5 with replacement I will first multiply the random
number by the population size, in this case 20. This is shown below:
In the next column, I will round the case number by using the EXCEL function
“ROUND”. This has two arguments, the number to be rounded and the number of
decimal places to round to (positive for rounding to the right of the decimal point,
negative for rounding to the left of the decimal point, and 0 for no decimals at all).
The entry would look like this:
Now copy down the two entries in Columns C and D for five rows, and that is my
sample. This is shown below:
This is sampling with replacement; notice that I included case 11 twice. If I wanted
to sample without replacement I could do one of two things. First, I could delete the
second pick of Case 11 and continue copying down till I got 5 distinct cases. This is
illustrated below:
Alternatively, I could sort the 20 cases in order of the random numbers and then
just take the first 5 values. Since each case appears only once in the sorted list, I
could get no duplications. To begin, highlight the first two columns of data and click
on the word “Data”. This is shown below:
Now click on “Sort” to get a menu that looks like this:
Replace the word “Case” in the menu with the word “Column C” and hit OK. This
will result in something that looks like the page below:
My random sample, without replacement, consists of 2, 6, 7, 10, and 20. Both
methods are equally valid, but the second can be faster if the sample size n is a
substantial fraction of the population size N.
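The two ways of sampling without replacement described above can be sketched outside of EXCEL as well. This is an illustrative Python version, not the spreadsheet procedure itself: `random.randint` stands in for the scaled and rounded random number, and the variable names are ours.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

N, n = 20, 5
cases = list(range(1, N + 1))

# Method 1: keep drawing case numbers, skipping any case already picked,
# until n distinct cases have been collected.
picked = []
while len(picked) < n:
    case = random.randint(1, N)  # each case 1..N is equally likely
    if case not in picked:
        picked.append(case)

# Method 2: attach a random number to every case, sort by that number,
# and take the first n cases. No case can appear twice.
keyed = [(random.random(), case) for case in cases]
keyed.sort()
first_n = sorted(case for _, case in keyed[:n])

print(sorted(picked))
print(first_n)
```

Method 1 wastes more draws on repeats as n approaches N, which is why the sorting method can be faster when the sample is a large fraction of the population.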
Technical Point (not required):
In the first method of taking a random sample described above, we used the
EXCEL “round” function. This actually creates a slightly biased sample. Values
between .5 and 1.5 round to “1”, values between 1.5 and 2.5 round to “2”, and so on
up to values between 18.5 and 19.5, which round to “19”. However, values between 0
and .5 round to “0”, which is not a valid case number, and only the values from 19.5
to 20 round to “20”. This means that the value “20” has only half the probability of
being generated as any of the values “1” through “19”. This can be fixed in EXCEL
by using the “rounddown” function rather than the “round” function (the function
arguments are exactly the same as for the “round” function). The function
“rounddown” would take any value between 0 and .99… and round it to “0”, any
value between 1.00 and 1.99… and round it to “1”, and so on up to any value between
19.00 and 19.99…, which it rounds to “19”. This would generate equally likely values
“0” through “19”. Since we want the values to range from 1 to 20 we would add one
to the generated values. The following EXCEL command would produce the correct
values:
=rounddown(20*rand(),0) + 1
The larger the size of the population (N) the less important is this correction.
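The size of this bias can be seen in a quick simulation. The sketch below is a Python stand-in for the two EXCEL formulas: `excel_round` is our own helper that rounds half up (as EXCEL's ROUND does for positive numbers), and `math.floor` plays the role of “rounddown”.

```python
import math
import random
from collections import Counter

random.seed(3)  # fixed seed so the run is repeatable

N = 20
draws = 200_000


def excel_round(x):
    """Round half up, mimicking ROUND(x, 0) in EXCEL for non-negative x."""
    return math.floor(x + 0.5)


# Biased method: round(20*rand(), 0) can give 0 and under-produces 20.
rounded = Counter(excel_round(N * random.random()) for _ in range(draws))

# Corrected method: rounddown(20*rand(), 0) + 1 gives 1..20 equally often.
rounded_down = Counter(math.floor(N * random.random()) + 1 for _ in range(draws))

print("round:     count of 0, 10, 20 =", rounded[0], rounded[10], rounded[20])
print("rounddown: count of 1, 10, 20 =", rounded_down[1], rounded_down[10], rounded_down[20])
```

With the rounding method, the values 0 and 20 each appear about half as often as a middle value such as 10. With the rounddown method, all twenty case numbers appear about equally often.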
Other Forms of Sampling
A random sample will usually automatically mirror different aspects of the
population. For example if half the population is male, then approximately half the
values in the sample will be male. Similarly if 20 percent of the population is over
65 years of age, then approximately 20 percent of the sample will be over 65 years in
age. Suppose however, that you wanted to guarantee that the sample exactly
reflected certain proportions by gender and/or by age. Then one could take a
stratified sample. In this case, one takes the initial population lists and divides them
into groups (say males and females). Such a division is called stratification and
males would be one stratum and females another stratum. You then take two
random samples, one from the male stratum and one from the female stratum.
Within each stratum you can use the results in this text, but in combining the results
you would need to use formulae not in your book.
Stratified Sampling is clearly more expensive than random sampling since
you have to divide the lists and take multiple random samples, one for each of the
possible stratum combinations (e.g. males over 65, males under 65, etc.). However,
under certain circumstances it can give better results than a random sample.
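A stratified sample can be sketched as follows. The population list here is entirely hypothetical (60 females and 40 males), and the sketch assumes proportional allocation, meaning each stratum's share of the sample matches its share of the population. Other allocation rules exist and are not covered here.

```python
import random

random.seed(4)  # fixed seed so the run is repeatable

# HYPOTHETICAL population list of (case id, gender): 60 females, 40 males.
population = [(i, "F") for i in range(1, 61)] + [(i, "M") for i in range(61, 101)]
n = 10  # total sample size

# Divide the list into strata.
strata = {}
for case_id, gender in population:
    strata.setdefault(gender, []).append(case_id)

# Take a separate random sample within each stratum.
stratified_sample = {}
for gender, members in strata.items():
    # Proportional allocation (an assumption of this sketch).
    stratum_n = round(n * len(members) / len(population))
    stratified_sample[gender] = random.sample(members, stratum_n)

print(stratified_sample)
```

The sample is guaranteed to contain exactly 6 females and 4 males, whereas a simple random sample of 10 would only match those proportions approximately.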
If no lists can be made, then another form of sampling, called Cluster
Sampling, can be used. In this case we make use of natural clusters in the
population. We then sample the clusters randomly, and then randomly sample
subjects within the clusters. For example in our “Vanilla Cola” case, we might not
be able to obtain a “list” of Cola drinkers. However, since much soda is purchased
in supermarkets, we could randomly pick supermarkets within a metropolitan area,
and then visit the stores and observe who is buying Cola. We could either ask
opinions of every buyer, or we could randomly pick say 10 percent of the Cola
purchasers for interviews. This method is usually cheaper than random sampling;
however, it is usually not as accurate since cluster-to-cluster variability can add
significant error to the results. Again, if you take a cluster sample, you cannot use
the formulae in your textbook.
Notice that at the basis of all sampling techniques is the random sample.
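The two-stage cluster sample described above for the “Vanilla Cola” case might look like the sketch below. The store names, the number of stores, and the buyers per store are all made up for illustration. Only the two random stages (pick stores, then pick 10 percent of buyers in each) come from the text.

```python
import random

random.seed(5)  # fixed seed so the run is repeatable

# HYPOTHETICAL clusters: 12 supermarkets, each with 50 observed Cola buyers.
stores = {
    f"store_{s}": [f"buyer_{s}_{b}" for b in range(1, 51)]
    for s in range(1, 13)
}

# Stage 1: randomly pick clusters (supermarkets).
chosen_stores = random.sample(sorted(stores), 3)

# Stage 2: randomly pick 10 percent of the Cola buyers within each chosen store.
interviews = {}
for store in chosen_stores:
    buyers = stores[store]
    k = max(1, round(0.10 * len(buyers)))
    interviews[store] = random.sample(buyers, k)

for store, buyers in interviews.items():
    print(store, buyers)
```

Both stages rely on an ordinary random sample, which is the sense in which the random sample underlies the other sampling techniques.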
Course Structure
[Diagram: Population and Random Sample, linked by Description (Module 1), Probability as the Basis for Inference (Module 2), and Inference (Module 3)]