Department of Civil, Structural, and Environmental Engineering
University at Buffalo, State University of New York
2011 Summer ERIE/REU Program
Descriptive Statistics
Igor Jankovic
Content
• Statistics terminology
   1. Population vs. Sample
   2. Descriptive statistics vs. Inferential statistics
   3. Data Types
• Presentation of qualitative data
   1. Graphical method
   2. Numerical method
• Presentation of quantitative data
   1. Graphical method
   2. Numerical method
• Outliers in a data set
Population vs. Sample
• Population: an entire data set that is the target of our interest
• Sample: a subset of data selected from a population
Example:
Electrical engineers recognize that high neutral current in computer power
systems is a potential problem. To determine the extent of the problem, a survey
of the computer power system load currents at 146 US sites was taken (IEEE
Transactions on Industry Applications, July/August 1990). The survey revealed
that less than 10% of the sites had high neutral to full-load current ratios.
• Identify the population of interest
(power load status at all US sites with computer power systems)
• Identify the sample
(power load status at 146 US sites with computer power systems)
• Use the sample information to make an inference about the population
(less than 10% of the sites had high neutral to full-load current ratios)
Descriptive statistics vs. Inferential statistics
• Two major applications of Statistics:
- Summarizing, describing, and exploring data
- Using sample data to infer the nature of the population data set
In other words,
• Descriptive statistics
- The branch of statistics devoted to the organization, summarization, and description of data sets
• Inferential statistics
- The branch of statistics concerned with using sample data to make an inference about populations
Data Types
Quantitative Data:
The data that represent the quantity or amount of something
Qualitative (categorical) Data:
The data that have no quantitative interpretation
Example:
• Length (in centimeters), weight (in grams), DDT concentration (in ppm): quantitative data
• Location and species: qualitative data
Qualitative Data
Graphical method for describing qualitative data
For qualitative data, we define the categories in such a way that
each observation can fall in one and only one category.
Example: Student distribution in terms of year at college in EAS 308
[Figures: Horizontal Bar Graph, Pie Chart, and Pareto diagram of the number of students in EAS 308 by year at college (Sophomore, Junior, Senior)]
Numerical method for describing qualitative data
For qualitative data, we define the categories in such a way that
each observation can fall in one and only one category.
Category frequency for a given category is the number of
observations that fall in that category
Category relative frequency for a given category is the proportion
of the total number of observations that fall in that category
Summary frequency table
Year at college   Frequency   Percent   Cumulative Frequency   Cumulative Percent
Sophomore         11          12.4      11                     12.4
Junior            35          39.3      46                     51.7
Senior            43          48.3      89                     100.0
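The summary frequency table can be reproduced with a short script. This is a minimal sketch: the function name `frequency_table` and the raw list `students` are assumptions made for illustration, with the list built only to match the counts in the table.

```python
from collections import Counter

def frequency_table(observations, order):
    """Return rows of (category, frequency, percent,
    cumulative frequency, cumulative percent)."""
    counts = Counter(observations)
    total = len(observations)
    rows, running = [], 0
    for category in order:
        freq = counts[category]
        running += freq
        rows.append((category, freq, round(100 * freq / total, 1),
                     running, round(100 * running / total, 1)))
    return rows

# Hypothetical raw data matching the counts in the table (89 students)
students = ["Sophomore"] * 11 + ["Junior"] * 35 + ["Senior"] * 43
table = frequency_table(students, ["Sophomore", "Junior", "Senior"])
for row in table:
    print(row)
```

Percent here is the category relative frequency multiplied by 100.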
Quantitative Data
Graphical method for describing quantitative data (1)
Dot plots
Steps:
1. Draw a horizontal scale that spans the range of the data
2. Place a dot over the appropriate value on the scale for each observation
3. If a data value repeats, the dots are placed on top of each other
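The three steps can be sketched as a text-only dot plot. This is an illustration under stated assumptions: the data are small non-negative integers, the function name `dot_plot` is hypothetical, and `*` stands in for a dot.

```python
from collections import Counter

def dot_plot(values):
    """Text dot plot for non-negative integers (assumed), repeats stacked."""
    counts = Counter(values)
    scale = list(range(min(values), max(values) + 1))  # step 1: scale spans the range
    width = max(len(str(v)) for v in scale) + 1        # column width per scale value
    lines = []
    # steps 2-3: one mark per observation, repeated values stacked upward
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[v] >= level else "").rjust(width)
                      for v in scale)
        lines.append(row.rstrip())
    lines.append("".join(str(v).rjust(width) for v in scale))
    return "\n".join(lines)

print(dot_plot([1, 2, 2, 4]))
```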
Graphical method for describing quantitative data (2)
Histograms (most popular and traditional method for describing quantitative data)
Steps:
1. Calculate the range of data
2. Divide the range into 5-20 classes of equal width
3. For each class, count the number (class frequency) of observations
that fall in the class
4. Calculate each class relative frequency = (class frequency) / (total
number of measurements)
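The histogram steps can be sketched in code as follows. The function name `class_frequencies` and the sample values are assumptions for illustration. The sketch also assumes the data are not all equal, and it places the largest measurement in the last class.

```python
def class_frequencies(data, num_classes):
    """Return (class width, class frequencies, class relative frequencies)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes             # steps 1-2: range split into equal classes
    counts = [0] * num_classes
    for y in data:
        k = min(int((y - lo) / width), num_classes - 1)  # largest value goes in last class
        counts[k] += 1                          # step 3: class frequency
    relative = [c / len(data) for c in counts]  # step 4: relative class frequency
    return width, counts, relative

# Illustrative measurements (not from a real survey)
sample = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]
width, counts, relative = class_frequencies(sample, 5)
print(width, counts, relative)
```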
Graphical method for describing quantitative data (3)
Stem-and-Leaf Display
Steps:
1. Divide each observation in the data set into two parts, the stem and the
leaf. For example, the stem and leaf of the CPU time 2.41 are 2, and
41, respectively.
Stem | Leaf
2    | 41
2. List the stems in order in a column, starting with the smallest stem and
ending with the largest.
3. Proceed through the data set, placing the leaf for each observation in
the appropriate stem row.
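A stem-and-leaf display along these lines can be sketched as below. Only the value 2.41 comes from the lecture. The other CPU times are invented for illustration, the function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative values recorded to two decimals.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Map each stem (integer part) to its leaves (two decimal digits)."""
    rows = defaultdict(list)
    for y in sorted(data):
        stem, leaf = f"{y:.2f}".split(".")   # step 1: split into stem and leaf
        rows[int(stem)].append(leaf)         # step 3: place leaf in its stem row
    # step 2: stems listed from smallest to largest
    return {stem: rows[stem] for stem in sorted(rows)}

# Hypothetical CPU times; only 2.41 appears in the lecture
cpu_times = [2.41, 1.17, 0.02, 1.23, 2.59, 0.15]
display = stem_and_leaf(cpu_times)
for stem, leaves in display.items():
    print(stem, "|", " ".join(leaves))
```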
Numerical method for describing quantitative data
Measures of central tendency
- help to locate the center of the relative frequency distribution
-Arithmetic mean (mean)
Suppose we have a set of n measurements, $y_1, y_2, y_3, \ldots, y_n$.
The arithmetic mean = $\frac{1}{n} \sum_{i=1}^{n} y_i$
Generally, we use $\bar{y}$ to represent the sample mean and $\mu$ to represent the population mean
-Median
Median is the middle number when the measurements are arranged in ascending
(descending) order
Median = $y_{((n+1)/2)}$, if n is odd
Median = $\{ y_{(n/2)} + y_{(n/2+1)} \} / 2$, if n is even
where $y_{(k)}$ is the k-th measurement in the ordered list
Generally, we use m to represent the sample median and $\tau$ to represent the population
median
Numerical method for describing quantitative data
Measures of central tendency
- help to locate the center of the relative frequency distribution
-Mode
The mode of a set of n measurements, y1,y2,y3,…,yn, is the value of y that
occurs with the greatest frequency
Numerical method for describing quantitative data
Measures of central tendency
Example:
We have 10 sample measurements: 4, 5, 8, 1, 11, 6, 2, 8, 3, 7
Compute the mean, median, and mode.
Solution:
Mean = 5.5
Median = (6+5)/ 2 = 5.5
Mode = 8
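The solution can be checked with a few lines of code. This is a minimal sketch that follows the definitions given earlier. The function names are chosen for illustration, and when several values tie for the highest frequency, `mode` returns only one of them.

```python
from collections import Counter

def mean(ys):
    return sum(ys) / len(ys)

def median(ys):
    s, n = sorted(ys), len(ys)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]           # y_((n+1)/2), shifted to 0-based indexing
    return (s[n // 2 - 1] + s[n // 2]) / 2   # average of y_(n/2) and y_(n/2+1)

def mode(ys):
    # Value with the greatest frequency; ties resolve to the first one seen
    return Counter(ys).most_common(1)[0][0]

measurements = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]
print(mean(measurements), median(measurements), mode(measurements))
```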
Measures of central tendency:
Geometric Mean (from Wikipedia)
Measures of central tendency:
Harmonic Mean (from Wikipedia)
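For positive measurements, the standard definitions are: geometric mean $= \left( \prod_{i=1}^{n} y_i \right)^{1/n}$ and harmonic mean $= n / \sum_{i=1}^{n} (1/y_i)$. A minimal sketch follows. The function names and the three sample values are illustrative assumptions.

```python
import math

def geometric_mean(ys):
    """n-th root of the product, computed via logs (requires all y > 0)."""
    return math.exp(sum(math.log(y) for y in ys) / len(ys))

def harmonic_mean(ys):
    """n divided by the sum of reciprocals (requires all y > 0)."""
    return len(ys) / sum(1 / y for y in ys)

values = [1, 2, 4]            # illustrative positive values
gm = geometric_mean(values)   # cube root of 1*2*4
hm = harmonic_mean(values)    # 3 / (1 + 1/2 + 1/4)
am = sum(values) / len(values)
print(hm, gm, am)
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which is smaller than the arithmetic mean.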
Numerical method for describing quantitative data
Measures of variation
- help to locate the spread of the distribution
-Range
Range = largest measurement – smallest measurement
-Variance (of n measurements, $y_1, y_2, y_3, \ldots, y_n$)
Sample variance = $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = \frac{\sum_{i=1}^{n} y_i^2 - \left[ \left( \sum_{i=1}^{n} y_i \right)^2 / n \right]}{n-1}$
Population variance = $\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}$
Numerical method for describing quantitative data
Measures of variation
- help to locate the spread of the distribution
-Standard Deviation
standard deviation of a sample = $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - \left[ \left( \sum_{i=1}^{n} y_i \right)^2 / n \right]}{n-1}}$
standard deviation of a population = $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}}$
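The two forms of the sample variance (the definition and the shortcut based on sums) give the same number. The sketch below checks this on ten illustrative measurements. The function names are assumptions.

```python
import math

def sample_variance(ys):
    """Definition: sum of squared deviations from the sample mean over n - 1."""
    n, ybar = len(ys), sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / (n - 1)

def sample_variance_shortcut(ys):
    """Shortcut: (sum of y^2 - (sum of y)^2 / n) over n - 1."""
    n = len(ys)
    return (sum(y * y for y in ys) - sum(ys) ** 2 / n) / (n - 1)

def population_variance(ys):
    """Treats ys as the whole population: squared deviations from mu over n."""
    n, mu = len(ys), sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys) / n

data = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]   # illustrative measurements
s2 = sample_variance(data)
s = math.sqrt(s2)                         # sample standard deviation
print(s2, sample_variance_shortcut(data), population_variance(data), s)
```

For these values the sum of squares is 389 and the sum is 55, so the numerator is 389 − 3025/10 = 86.5 and $s^2 = 86.5/9$.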
Skewness: measure of shape
Approximate formula
(accurate for large “n”)
Exact formula
where s is the sample standard deviation.
Kurtosis: measure of “peakedness”
Approximate formula
(accurate for large “n”)
Exact formula
where s is the sample standard deviation.
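As an illustration, one common large-n form standardizes each deviation by s and averages the third and fourth powers: skewness ≈ $\frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \bar{y}}{s} \right)^3$ and kurtosis ≈ $\frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \bar{y}}{s} \right)^4$. This convention is an assumption here and may differ from the formulas shown on the slide. Other texts use small-sample corrections or subtract 3 from kurtosis. The function names are hypothetical.

```python
import math

def _standardized(ys):
    """Deviations from the sample mean in units of the sample standard deviation s."""
    n, ybar = len(ys), sum(ys) / len(ys)
    s = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return [(y - ybar) / s for y in ys]

def skewness_approx(ys):
    # Assumed large-n form: average cubed standardized deviation
    return sum(z ** 3 for z in _standardized(ys)) / len(ys)

def kurtosis_approx(ys):
    # Assumed large-n form: average fourth power (no "minus 3" adjustment)
    return sum(z ** 4 for z in _standardized(ys)) / len(ys)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]
print(skewness_approx(symmetric), skewness_approx(right_skewed))
print(kurtosis_approx(symmetric))
```

A symmetric data set gives a skewness of zero, and a long right tail gives a positive value.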
Numerical method for describing quantitative data
Measures of relative standing
- describes the relative position of an observation within the data set
Two measures used to describe the relative standing of an observation are
percentiles and z-scores
Percentiles
- 100pth percentile
The 100pth percentile of a data set is a value of y located so that 100p% of the area
under the relative frequency distribution for the data lies to the left of the 100pth
percentile and 100(1-p)% of the area lies to its right [note: 0 ≤ p ≤ 1]
- Lower quartile, QL, corresponding to the 25th percentile
- Midquartile, m, corresponding to the 50th percentile (the median)
- Upper quartile, QU, corresponding to the 75th percentile
Numerical method for describing quantitative data
Measures of relative standing
- describes the relative position of an observation within the data set
Two measures used to describe the relative standing of an observation are
percentiles and z-scores
Z-scores
The z-score for a value y of a data set is the distance that y lies above or
below the mean, measured in units of the standard deviation.
Sample z-score: $z = \frac{y - \bar{y}}{s}$
Population z-score: $z = \frac{y - \mu}{\sigma}$
Detecting Outliers
Definition of an outlier:
An observation y that is unusually large or small relative to the other values in a
data set is called an outlier.
Reasons for outliers in a data set:
1. The measurement is observed, recorded, or
entered into the computer incorrectly
2. The measurement comes from a different population
3. The measurement is correct, but represents a rare
(chance) event.
Rule of Thumb for detecting outliers:
Observations with z-scores greater than 3 in absolute value are
considered outliers.
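The rule of thumb can be sketched with sample z-scores. The data below are invented for illustration, and the function names and `threshold` parameter are assumptions.

```python
import math

def z_scores(ys):
    """Sample z-scores: (y - ybar) / s, with s using n - 1 in the denominator."""
    n, ybar = len(ys), sum(ys) / len(ys)
    s = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return [(y - ybar) / s for y in ys]

def z_outliers(ys, threshold=3.0):
    """Observations whose z-score exceeds the threshold in absolute value."""
    return [y for y, z in zip(ys, z_scores(ys)) if abs(z) > threshold]

# Hypothetical data: 19 values near 10 and one unusually large value
readings = [9, 10, 11] * 6 + [10, 40]
flagged = z_outliers(readings)
print(flagged)
```

With the sample standard deviation, no z-score can exceed $(n-1)/\sqrt{n}$ in absolute value, so this rule cannot flag anything in samples of 10 or fewer observations.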
Detecting Outliers
Box Plot Method
Interquartile range, IQR
IQR = QU - QL
Steps to construct a Box Plot
1. Calculate the median m, lower and upper quartiles,
QL, and QU, and IQR, for the y values in a data set
2. Construct a box on the y-axis with QL and QU located at the lower corners. The base
width will be equal to IQR. Draw a vertical line inside the box to locate the median, m
3. Construct two sets of limits on the box plot. Inner fences are located a distance of 1.5
(IQR) below QL and above QU; outer fences are located a distance of 3(IQR) below QL and
above QU.
4. Observations that fall between the inner and outer fences are called suspect outliers.
Observations that fall outside the outer fences are called highly suspect outliers.
5. To further highlight extreme values, use Whiskers.
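The fence calculations in steps 1 to 4 can be sketched as below. Quartile conventions differ between textbooks and software. This sketch assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the convention used in the lecture. The data and function names are illustrative.

```python
import statistics

def box_plot_fences(ys):
    """Quartiles, IQR, and inner/outer fences for a data set."""
    ql, m, qu = statistics.quantiles(ys, n=4, method="inclusive")
    iqr = qu - ql
    return {
        "QL": ql, "m": m, "QU": qu, "IQR": iqr,
        "inner": (ql - 1.5 * iqr, qu + 1.5 * iqr),
        "outer": (ql - 3 * iqr, qu + 3 * iqr),
    }

def classify_outliers(ys):
    """Split observations into suspect and highly suspect outliers."""
    f = box_plot_fences(ys)
    (in_lo, in_hi), (out_lo, out_hi) = f["inner"], f["outer"]
    highly = [y for y in ys if y < out_lo or y > out_hi]
    suspect = [y for y in ys
               if (out_lo <= y < in_lo) or (in_hi < y <= out_hi)]
    return suspect, highly

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30]   # hypothetical data
fences = box_plot_fences(values)
suspect, highly = classify_outliers(values)
print(fences, suspect, highly)
```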
Empirical Rule
If a data set has an approximately mound-shaped distribution, then the following rules of
thumb may be used to describe the data set
Example:
Approximately 68% of the measurements will lie
within the interval $\bar{y} \pm s$ for samples
Approximately 95% of the measurements will lie
within the interval $\bar{y} \pm 2s$ for samples
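The rule can be compared with the observed fractions in a data set. This is a minimal sketch with ten illustrative measurements. A sample this small need not match the percentages closely, and the function name `fraction_within` is an assumption.

```python
import math

def fraction_within(ys, k):
    """Fraction of measurements inside the interval ybar ± k*s."""
    n, ybar = len(ys), sum(ys) / len(ys)
    s = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    inside = [y for y in ys if ybar - k * s <= y <= ybar + k * s]
    return len(inside) / n

obs = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]   # illustrative measurements
within_1s = fraction_within(obs, 1)
within_2s = fraction_within(obs, 2)
print(within_1s, within_2s)
```

Here the mean is 5.5 and s is about 3.10, so seven of the ten values fall within one standard deviation of the mean and all ten fall within two.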
Summary
In this lecture, we have learned:
• Some important statistics terminology
   1. Population vs. Sample
   2. Descriptive statistics vs. Inferential statistics
   3. Data Type
• How to deal with Qualitative data
   1. Graphical method (Bar graph, Pie chart, Pareto diagram)
   2. Numerical method
• How to deal with Quantitative data
   1. Graphical method (Dot plot, Histogram, Stem and Leaf plot)
   2. Numerical method
• How to detect outliers in a data set?
• Empirical Rule