MGT 3660
Review of Basic Statistical Concepts
• Descriptive Statistics
  o Mean, Median, Mode
  o Variance and standard deviation
  o Quartiles and Interquartile Range
• Graphical Presentations
  o Dot plot
  o Box plot
  o Histogram
  o Scatter Diagram
• Probability
  o Random variable
    ▪ Discrete
    ▪ Continuous
  o Probability distributions
    ▪ Binomial
    ▪ Normal
• Sampling distributions
  o Estimation
    ▪ Point estimation
    ▪ Interval estimation
  o Hypothesis testing
• Regression and Correlation
  o Scatter diagram
  o Crosstabs
  o Correlation coefficient
  o Least squares line of regression
Excel Commands and Functions
Statistical Commands in Excel
Data/Pivot Table
• Develop frequency distributions and histograms
Tools/Data Analysis
• Descriptive Statistics
• Correlation
• Regression
Statistical Functions in Excel
Descriptive Statistics    Probability Distributions    Hypothesis Testing    Regression and Correlation
AVERAGE                   BINOMDIST                    ZTEST                 CORREL
TRIMMEAN                  NORMDIST                     TDIST                 SLOPE
MEDIAN                    NORMINV                      TINV                  INTERCEPT
PERCENTILE                NORMSDIST                    TTEST                 FORECAST
QUARTILE                  NORMSINV                     CONFIDENCE            TREND
VAR
STDEV
Example: TRIMMEAN(Array, percent)
TRIMMEAN: This function calculates the trimmed mean for a list of numbers.
Parameters:
Array: Range of numerical values to trim and average
Percent: A value from 0 up to (but not including) 1; the fraction of data values to
exclude from the range of data. For example, if percent = 0.2 and the
Array contains 20 cell values, 20 × 0.2 = 4 data values will be trimmed: the two
smallest values and the two largest values.
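The trimming rule above can be sketched outside Excel. The following is a minimal Python illustration, not Excel's own implementation: the function name `trimmed_mean` and the sample data are hypothetical, and the rounding of the trimmed count down to an even number is an assumption about how TRIMMEAN treats odd counts.

```python
import math

def trimmed_mean(values, percent):
    """Mean after dropping an equal number of the smallest and largest values."""
    if not 0 <= percent < 1:
        raise ValueError("percent must satisfy 0 <= percent < 1")
    data = sorted(values)
    # Total number of values to drop, e.g. 20 * 0.2 = 4.
    n_trim = math.floor(len(data) * percent)
    # Assumption: an odd count is rounded down so both tails lose the same number.
    n_trim -= n_trim % 2
    per_tail = n_trim // 2
    kept = data[per_tail:len(data) - per_tail]
    return sum(kept) / len(kept)

# Hypothetical data: 18 ordinary values plus two extreme ones (20 values total).
scores = list(range(1, 19)) + [500, 900]
result = trimmed_mean(scores, 0.2)  # drops 1, 2 and 500, 900
```

With these values the two extreme observations are discarded, so the trimmed mean reflects only the 16 central values.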
Creating a Box Plot
1. Set up the following data for the plot:
   • Maximum – Q3
   • Q3 – Median
   • Median – Q1
   • Q1
   • Q1 – Minimum
   (Note: Q1 = first quartile, Q3 = third quartile)
2. Highlight the middle three values from above (Q3 – Median, Median – Q1, and Q1); do not highlight all 5 values.
3. Click Insert, and from the “Charts” section, select Column/2D Stacked Column
bar graph.
4. Click “Switch Row/Column”
5. Delete (i) the legend label and (ii) X-axis label
6. Select the Graph, click Select Data and rearrange series 1, series 2 and series 3 in
the proper order
7. Select the bottom stack. Click “Layout”, “Error bars”, “More Error Bar options”
8. Select “Minus”, “Custom”, and then click “Specify Value”
9. Select “Negative Error Value” and enter the cell address of Q1-Minimum value.
Then click OK and “Close”.
10. Select the bottom bar. Then right click and select “Format Data Series”
11. Select “Fill” and check “No Fill”. Then click “Close”
12. Select the top stack. Click “Layout”, “Error bars”, “More Error Bar options”
13. Select “Plus”, “Custom”, and then click “Specify Value”
14. Select “Positive Error Value” and enter the cell address of Maximum-Q3 value.
Then click OK and “Close”.
15. Add a chart title and resize it.
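The five values required in step 1 can be checked with a short script before they are typed into the worksheet. This is a sketch only: the helper name `box_plot_rows` and the data are hypothetical, and it assumes that the "inclusive" quartile method in Python's `statistics` module corresponds to Excel's QUARTILE function.

```python
import statistics

def box_plot_rows(values):
    """Return the five stacked-column inputs listed in step 1."""
    # Assumption: method="inclusive" mirrors Excel's QUARTILE interpolation.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lowest, highest = min(values), max(values)
    return {
        "Maximum - Q3": highest - q3,   # length of the upper whisker
        "Q3 - Median": q3 - median,     # upper part of the box
        "Median - Q1": median - q1,     # lower part of the box
        "Q1": q1,                       # invisible bottom stack
        "Q1 - Minimum": q1 - lowest,    # length of the lower whisker
    }

# Hypothetical data set.
data = [2, 4, 4, 5, 7, 8, 9, 11, 15]
rows = box_plot_rows(data)
```

The two whisker lengths are the values entered as custom error bars in steps 9 and 14, while the middle three values form the stacked columns.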
Probability Distributions
Random Variable (RV): A numerical description of the outcome of an experiment.
Discrete RV: A random variable that can take a countable set of values. For
instance, if an experiment consists of inspecting 10 laptops produced by a
manufacturer, then a random variable X can be defined as the number of defective
laptops in the lot. The possible values for X are the whole numbers from 0 to 10.
Continuous RV: A random variable that can take an uncountable range of values.
For instance, if an experiment consists of measuring the amount of toothpaste in a
6 oz. tube, then a random variable X can be defined as the amount of toothpaste in
a tube. The possible values for X could be any value between 5.8 oz. and 6.2 oz.
The values within this range are not countable.
Probability Distribution: A description of how the probabilities are distributed over the
values the random variable can assume.
Expected Value: The expected value of a RV is the long-run average value of the RV when the
experiment is repeated many times.
Expected Value of a Discrete Random Variable: $E(x) = \mu = \sum x \, f(x)$
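As an illustration of the expected-value formula, the sketch below applies it to the laptop inspection setting with a binomial distribution. The defect probability of 0.05 is a made-up figure for illustration, and the helper names are hypothetical; the probability function plays the role of BINOMDIST with the cumulative flag set to FALSE.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def expected_value(pmf):
    """E(x) = sum of x * f(x) over all possible values x."""
    return sum(x * fx for x, fx in pmf.items())

n_laptops = 10
p_defective = 0.05  # hypothetical defect rate, not taken from the notes

# f(x) for every possible number of defective laptops, 0 through 10.
pmf = {k: binomial_pmf(k, n_laptops, p_defective) for k in range(n_laptops + 1)}
mean_defects = expected_value(pmf)
```

For a binomial random variable the sum reduces to n × p, so the result here should agree with 10 × 0.05.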
Normal Probability Distribution: A continuous probability distribution. The normal
distribution is a symmetrical distribution with a mean, µ, and a standard deviation, σ.

Example
The ticket sales for events held at the new civic center are believed to be normally
distributed with a mean of 12,000 and a standard deviation of 1,000.
a. What is the probability of selling no more than 8,000 tickets?
b. What is the probability of selling more than 10,000 tickets?
c. What is the probability of selling between 9,500 and 11,000 tickets?
d. How many tickets have a 98% probability of being sold?
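One way to check answers to the ticket-sales example is with Python's `statistics.NormalDist`, which mirrors what NORMDIST and NORMINV do in Excel. This is a sketch, not an official solution. Part d is read in two ways because the wording is ambiguous: the sales level that is exceeded with 98% probability, and the level that sales stay below with 98% probability. The variable names are illustrative.

```python
from statistics import NormalDist

# Ticket sales: normal with mean 12,000 and standard deviation 1,000.
sales = NormalDist(mu=12_000, sigma=1_000)

# a. P(X <= 8,000), like NORMDIST(8000, 12000, 1000, TRUE); roughly 0.00003
p_at_most_8000 = sales.cdf(8_000)

# b. P(X > 10,000) = 1 - P(X <= 10,000); roughly 0.9772
p_more_than_10000 = 1 - sales.cdf(10_000)

# c. P(9,500 < X < 11,000) = P(X < 11,000) - P(X < 9,500); roughly 0.1524
p_between = sales.cdf(11_000) - sales.cdf(9_500)

# d. Two readings of the 98% question (an assumption about the intent):
#    sales exceed x_98_above with probability 0.98 (about 9,946 tickets),
#    sales stay below x_98_below with probability 0.98 (about 14,054 tickets).
x_98_above = sales.inv_cdf(0.02)
x_98_below = sales.inv_cdf(0.98)
```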
Confidence Interval and Hypothesis Testing


Simple Random Sample
Point Estimation:
                      Sample Statistic      Population Parameter
                      (Point Estimator)
Size                  n                     N
Mean                  x̄                     µ
Standard deviation    S                     σ

Sampling Error = | x̄ – µ |
Confidence Interval
$$\mu = \bar{x} \pm t_{\alpha/2} \cdot \frac{S}{\sqrt{n}}$$
(Use Z instead of t only if σ is known)
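The interval can be computed directly once the critical t value is known. In the sketch below the sample is hypothetical, and the critical value 2.262 (95% confidence, 9 degrees of freedom) is supplied by hand, as it would be read from a t table or from TINV(0.05, 9) in Excel. The function name `confidence_interval` is illustrative.

```python
import math
import statistics

def confidence_interval(sample, t_crit):
    """Return (lower, upper) for mean +/- t_crit * S / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)        # sample standard deviation (n - 1 divisor)
    margin = t_crit * s / math.sqrt(n)  # half-width of the interval
    return mean - margin, mean + margin

# Hypothetical sample of n = 10 observations.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

# 95% confidence, df = n - 1 = 9, so t_crit = 2.262 (looked up, not computed).
lower, upper = confidence_interval(sample, t_crit=2.262)
```

For this sample the mean is 13.5 and S/√n is 0.5, so the interval is 13.5 ± 1.131.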
Hypothesis Testing
1. Set up the null and the alternative hypotheses.
2. Compute the p-value for rejecting the null hypothesis using the t-distribution (use Z
   instead of t only if σ is known). Use
   $$t = \frac{\bar{x} - \mu_0}{S / \sqrt{n}}$$
   to determine the p-value.
3. If p-value <= α, then reject the null hypothesis; otherwise do not reject.
4. Interpret and report the results.
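The four steps can be traced in a short script. This sketch assumes a two-tailed test and a hypothetical sample, and it obtains the p-value by numerically integrating the t density with Simpson's rule, because the Python standard library has no t distribution. In Excel the same p-value would come from TDIST. All function names here are illustrative.

```python
import math
import statistics

def t_pdf(t, df):
    """Density of Student's t distribution (fine for small or moderate df)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) via Simpson's rule on [0, |t_stat|]; steps must be even."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3        # P(0 <= T <= |t_stat|)
    return max(0.0, 1 - 2 * central_area)

def one_sample_t_test(sample, mu0):
    """Return (t statistic, two-tailed p-value) for H0: mean = mu0."""
    n = len(sample)
    t_stat = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    return t_stat, two_tailed_p(t_stat, n - 1)

# Step 1: H0: mu = 12 versus Ha: mu != 12 (hypothetical values).
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
alpha = 0.05
# Step 2: compute t and the p-value.
t_stat, p_value = one_sample_t_test(sample, mu0=12)
# Step 3: decision rule.
reject_null = p_value <= alpha
```

Step 4, interpreting and reporting, remains a matter of judgment; the script only supplies the numbers.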
Correlation and Simple Regression
Coefficient of Correlation r:
Correlation coefficient between two sets of data (X and Y) is a number between -1 and 1.
It measures the strength and direction of linear association between the two sets of data of
equal size.
The sign indicates the direction of the association. Positive numbers indicate direct
association and negative numbers indicate inverse relationship.
The magnitude indicates the strength of the association between the two data sets. A number
close to 1 or -1 indicates a strong relationship. A number close to zero indicates a weak or
non-existent linear relationship.
Formula for determining correlation coefficient:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \, \sqrt{\sum (y_i - \bar{y})^2}}$$
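The correlation formula translates almost term by term into code. The following sketch uses hypothetical data and an illustrative function name; in Excel, CORREL returns the same quantity.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for two data sets of equal size."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must contain the same number of values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
```

Here r is positive and fairly close to 1, which indicates a direct and moderately strong linear association.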
Simple Linear Regression
The simple linear regression equation is a linear function relating two data sets of equal size,
of the form Y = b0 + b1X, where Y = dependent variable and X = independent variable.
The model: Y = b0 + b1X + e,
where, b0 = the y-intercept, b1 = the slope of the line, and e = error
The model may be written as Y = Ŷ + e, where Ŷ = estimated value of Y
Then estimation error = e = Y – Ŷ, and squared error = (Y – Ŷ)².
The following formulas give the estimates of b0 and b1 that minimize the sum of squared
estimation errors; these are called the least-squares estimates.
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
Excel functions for regression:
b1 = SLOPE(y-range,x-range)
b0 = INTERCEPT(y-range,x-range)
Ŷ = FORECAST(Given x-value,y-range,x-range)
Ŷ = TREND(y-range,x-range,Given x-value)
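The least-squares estimates and the resulting prediction can be cross-checked outside Excel. The sketch below uses hypothetical data. The helper `least_squares` plays the role of SLOPE and INTERCEPT, and `forecast` plays the role of FORECAST; both names are illustrative rather than part of any library.

```python
def least_squares(xs, ys):
    """Return (b0, b1) that minimize the sum of squared errors."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must contain the same number of values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # y-intercept
    return b0, b1

def forecast(x_new, ys, xs):
    """Estimated Y at x_new from the fitted line Y-hat = b0 + b1 * x."""
    b0, b1 = least_squares(xs, ys)
    return b0 + b1 * x_new

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)
y_hat = forecast(6, ys, xs)
```

For this data the fitted line is Ŷ = 2.2 + 0.6X, so the estimate at X = 6 is 5.8.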