Week 2 Vocabulary:
Section 2.1 Frequency Distributions
Frequency distribution (quantitative data): a table that shows classes or intervals of data. The frequency f of a class
is the number of data entries in the class.
Lower class limit = least number that can belong to the class
Upper class limit = greatest number that can belong to the class
Class width = distance between the lower (or upper) limits of consecutive classes. (Not the difference between the lower and upper limits
within a class)
Range – difference between the maximum and minimum data entries
Class boundaries – the numbers that separate classes without forming gaps between them
Constructing a frequency distribution
1. Decide on the number of classes (could be arbitrary)
2. Find the range= highest value – lowest value
3. Find the class width: divide the range by the number of classes (round up to the next whole
number if the result is a decimal)
4. Decide the class limits
5. Tally
6. Count tally to find frequency
7. Total frequency
Midpoint of a class = (lower class limit + upper class limit) / 2
Relative frequency = (class frequency) / (sample size) = f / n
Cumulative frequency = sum of the frequency for that class and all previous classes. (CF for the last class
= n)
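The seven construction steps and the three derived quantities above can be sketched in Python. The data values and the number of classes here are made-up examples, not from the text:

```python
import math

# Hypothetical sample data and a chosen number of classes (step 1)
data = [7, 12, 15, 18, 21, 25, 25, 28, 31, 33, 35, 38, 40, 44, 46]
num_classes = 5

# Steps 2-3: range, then class width (round up if the quotient is a decimal)
data_range = max(data) - min(data)
class_width = math.ceil(data_range / num_classes)

# Steps 4-7 plus midpoint, relative frequency, cumulative frequency
rows = []
total = len(data)
cumulative = 0
for i in range(num_classes):
    lower = min(data) + i * class_width          # lower class limit
    upper = lower + class_width - 1              # upper class limit
    f = sum(1 for x in data if lower <= x <= upper)  # tally -> frequency
    cumulative += f
    midpoint = (lower + upper) / 2
    rows.append((lower, upper, f, midpoint, f / total, cumulative))

for lower, upper, f, mid, rel_f, cf in rows:
    print(f"{lower}-{upper}: f={f}, midpoint={mid}, rel={rel_f:.2f}, CF={cf}")
```

Note that the relative frequencies sum to 1 and the last cumulative frequency equals n, as the definitions require.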
Frequency Histogram: is a bar graph that represents the frequency distribution of a data set.
Horizontal scale is quantitative and measures the data values
Vertical scale measures the frequencies of the classes
Consecutive bars must touch (mark the horizontal scale either at the midpoints or at the class
boundaries; use a broken axis to the left of the first bar, see page 44)
Frequency Polygon: a line graph that emphasizes the continuous change in frequencies.
Horizontal scale is quantitative and measures the data values (use the midpoint of each class)
Vertical scale measures the frequencies of the classes
Connect the points
Relative Frequency Histogram: same shape and same horizontal scale as the frequency histogram, but
the vertical scale measures the relative frequencies, not the frequencies (% of total)
Cumulative frequency graph, Ogive (pronounced Ō-jīve): is a line graph that displays the cumulative
frequency of each class at the upper class boundary.
Horizontal scale – the upper class boundaries
Vertical scale – cumulative frequencies
Section 2.2 Additional Graphs
Stem and Leaf Plot: an example of exploratory data analysis (EDA). Each number is separated into a stem (for
instance, the leftmost digits) and a leaf (the rightmost digit).
Should have as many leaves as there are entries in the original data
Similar to histogram, but still contains the original data values
Provides easy way to sort data
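A minimal stem-and-leaf sketch for two-digit values (the data are made up) shows the stem/leaf split and the sort:

```python
# Split each two-digit value into a stem (tens digit) and a leaf (ones digit)
data = [78, 81, 82, 85, 85, 90, 93, 97]

stems = {}
for value in sorted(data):          # sorting the data sorts the leaves too
    stem, leaf = divmod(value, 10)  # leftmost digit | rightmost digit
    stems.setdefault(stem, []).append(leaf)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
```

Every data entry contributes exactly one leaf, so the plot still contains the original values, unlike a histogram.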
Pie Chart: use for qualitative Data. Divides circle into sectors that represent categories.
Pareto Chart: use for qualitative data. Height of each bar represents frequency or relative frequency (%
of whole); bars are positioned in order of decreasing height.
Scatter Plot: a graph of paired data
Time Series Chart: a line graph where the x-axis is time
Section 2.3 Measures of Central Tendency
Mean: sum of the data divided by the number of entries.
Population mean: μ = (Σx) / N
Sample mean: x̄ = (Σx) / n
Affected by outliers
ROUND OFF RULE #1: One more decimal place than the original set of data
ROUND OFF RULE #2: Rounding should not be done until the final answer of the calculation
Median: Middle of data when the data set is ordered.
 If the data set has an odd number of entries median is the middle data entry.
 If the data set has an even number of entries, the median is the mean of the two middle entries.
Mode: is the data entry that occurs with the greatest frequency.
 If no entry is repeated, the data set has no mode.
 If two entries occur with the same greatest frequency, each entry is a mode and the data set is
called bimodal.
 The mode is the only measure of central tendency that can be used to describe data at the nominal
level; when working with quantitative data, it is rarely used.
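The three measures can be computed with Python's standard `statistics` module; the sample data here are made up:

```python
import statistics

# Made-up sample data (even number of entries, one repeated value)
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # (2+3+3+5+7+10) / 6 = 5
median = statistics.median(data)  # even count: mean of the two middle entries (3 and 5)
mode = statistics.mode(data)      # 3 occurs with the greatest frequency
print(mean, median, mode)
```

With an odd number of entries, `statistics.median` would instead return the single middle entry, matching the two cases above.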
Outliers: data entry that is far removed from the other entries in the data set.
Weighted Mean: the mean of a data set whose entries have varying weights, where w is the weight of
each entry: x̄ = Σ(x · w) / Σw
Mean of a frequency distribution: approximated by x̄ = Σ(x · f) / n, where x and f are the midpoint
and frequency of a class, respectively, and n = Σf
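Both formulas are one-line sums in Python. The scores, weights, midpoints, and frequencies below are made-up examples:

```python
# Weighted mean: e.g., a course grade with made-up scores and weights
scores  = [85, 90, 78]       # x values
weights = [0.5, 0.3, 0.2]    # w values

weighted_mean = sum(x * w for x, w in zip(scores, weights)) / sum(weights)
# (85*0.5 + 90*0.3 + 78*0.2) / 1.0, approximately 85.1

# Mean of a frequency distribution: class midpoints and frequencies (made up)
midpoints   = [10.5, 18.5, 26.5]
frequencies = [4, 7, 9]
n = sum(frequencies)                                               # n = Σf
approx_mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
print(weighted_mean, approx_mean)
```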
Symmetric distribution – Mean=Mode=Median
Uniform or rectangular distribution – Mean= Median
Skewed Left (negatively skewed) tail of graph elongates more to left, mean is to left of mode and
median.
Skewed Right (positively skewed) tail of graph elongates more to right, mean is to right of mode and
median.
Section 2.4 Measures of Variation
Range: is the difference between the maximum and minimum data entries in the set.
Deviation: of an entry x in a population data set is the difference between the entry and the mean μ of
the data set: deviation = x − μ
Population variance: σ² = Σ(x − μ)² / N
(Note: if you were to just sum up all the deviations from the mean, you would get 0 because the
deviations cancel, so the average of the deviations would not be a useful measure. Therefore we
compute a quantity called the “sum of squares.”)
Population standard deviation: square root of the population variance. σ = √( Σ(x − μ)² / N )
Sample variance: s² = Σ(x − x̄)² / (n − 1)
Sample standard deviation: s = √( Σ(x − x̄)² / (n − 1) )
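The two variance formulas differ only in the divisor, which a short sketch with made-up data makes concrete; the standard library's `statistics` functions implement the same definitions:

```python
import statistics

# Made-up sample data
data = [4, 8, 6, 5, 3]
n = len(data)
xbar = sum(data) / n                                      # 26/5 = 5.2

# The formulas written out
pop_var  = sum((x - xbar) ** 2 for x in data) / n         # σ²: divide by N
samp_var = sum((x - xbar) ** 2 for x in data) / (n - 1)   # s²: divide by n − 1

# pvariance/variance use the N and n−1 divisors respectively
assert abs(pop_var - statistics.pvariance(data)) < 1e-9
assert abs(samp_var - statistics.variance(data)) < 1e-9
print(pop_var ** 0.5, samp_var ** 0.5)   # σ and s are the square roots
```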
The size of the standard deviation tells us something about how spread out the data are from the
mean.
Technical reason why n − 1 is used instead of n: we obtain a somewhat larger value for the sample
variance than the one we would obtain by dividing by n. The need for the larger estimator reflects the
fact that ordinarily a sample has less diversity than its population, for the part is rarely more dispersed
than the whole. (s² may turn out to be larger than σ² if a disproportionately large number of extreme
values are selected for the sample; however, on average, s² values will tend to equal σ².)
Empirical Rule ( 68 – 95 – 99.7)
 ~68% of the data lie within 1 standard deviation of the mean
 ~95% of the data lie within 2 standard deviations of the mean
 ~99.7% of the data lie within 3 standard deviations of the mean
Chebychev’s Theorem: gives the minimum percent of data values that fall within a given number of
standard deviations of the mean. Depending upon the distribution, a higher percent of the data may
fall in the given range.
The portion of any data set lying within k standard deviations (k > 1) of the mean is at least
1 − 1/k²
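The bound can be checked numerically against a made-up data set; for k = 2 the theorem guarantees at least 75%, and the actual proportion is typically higher:

```python
# Checking Chebychev's bound against a made-up data set
data = [2, 4, 4, 4, 5, 5, 7, 9, 15, 1]
n = len(data)
mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5   # population SD

k = 2
bound = 1 - 1 / k**2                                   # at least 1 − 1/k² = 0.75
within = sum(1 for x in data if abs(x - mean) < k * sd) / n
print(bound, within)   # the actual proportion is always >= the bound
```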
Standard Deviation for Grouped Data:
s = √( Σ(x − x̄)² · f / (n − 1) ), where n = Σf and x̄ = Σ(x · f) / n
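The grouped-data formula weights each squared deviation by the class frequency; a sketch with made-up midpoints and frequencies:

```python
# Grouped-data standard deviation (midpoints and frequencies are made up)
midpoints   = [5, 15, 25, 35]   # x: midpoint of each class
frequencies = [2, 5, 8, 5]      # f: frequency of each class

n = sum(frequencies)                                           # n = Σf
xbar = sum(x * f for x, f in zip(midpoints, frequencies)) / n  # x̄ = Σxf / n
s = (sum((x - xbar) ** 2 * f
         for x, f in zip(midpoints, frequencies)) / (n - 1)) ** 0.5
print(n, xbar, s)
```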
Section 2.5 Measures of Position
Quartiles: the three quartiles divide the data set into four equal parts.
Deciles: divides the data set into 10 equal parts
Percentiles: divides the data set into 100 equal parts.
Interquartile range (IQR): is the difference between the third and first quartiles.
Standard Score (z-score): represents the number of standard deviations a given value x falls from the
mean μ.
z = (Value − Mean) / SD = (x − μ) / σ
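The z-score formula is a single expression; the population mean, standard deviation, and data value below are made up:

```python
# z-score: how many standard deviations a value lies from the mean
mu, sigma = 70, 8   # made-up population mean and standard deviation
x = 82              # a made-up data value

z = (x - mu) / sigma
print(z)   # 1.5: the value lies 1.5 standard deviations above the mean
```

A negative z-score would mean the value lies below the mean.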