Chapter 1: Statistics: The Art and Science of Learning from Data

Exploring Data
Descriptive Data
1
Content
• Types of Variables
• Describing data using graphical summaries
• Describing the Centre of Quantitative Data
• Describing the Spread of Quantitative Data
• How Measures of Position Describe Spread
2
Variable
• A variable is any characteristic that
is recorded for the subjects in a
study
• Examples: Marital status, Height,
Weight, IQ
• A variable can be classified as either
– Categorical (e.g. Male / Female)
– Quantitative (e.g. Age), which is either
• Discrete (number of children in a family) or
• Continuous (weight: 70.25 kg)
3
Categorical Variable
• A variable is categorical if each observation
belongs to one of a set of categories.
• Examples:
1. Gender (Male or Female)
2. Religion (Catholic, Jewish, …)
3. Type of residence (Apartment, House, …)
4. Belief in life after death (Yes or No)
4
Quantitative Variable
• A variable is called quantitative if observations
take numerical values for different magnitudes
of the variable.
• Examples:
1. Age
2. Number of brothers/sisters
3. Annual Income
5
Categorical vs. Quantitative
• Categorical variables
– percentage of observations in each category is
important
– E.g. % Male, % Female
• Quantitative variables
– center (a representative value) and spread
(variability) are important
– Average Age
– Variation around the average age
6
Discrete Quantitative Variable
• A quantitative variable is discrete if its possible values
form a set of separate numbers: 0, 1, 2, 3, …
• Examples:
1. Number of pets in a household
2. Number of children in a family
3. Number of foreign languages spoken by an
individual
7
Continuous Quantitative Variable
• A quantitative variable
is continuous if its
possible values form an
interval, so it has an
infinite continuum of
possible values
• Typically arises from measurements
• Examples:
1. Height/Weight
2. Age
3. Blood pressure
8
4 types of scale
• Nominal
• Ordinal
• Interval
• Ratio
9
Nominal Scale
• The nominal scale is the simplest scale.
• Numbers or letters are assigned to objects
– serve as labels for identification or classification
• e.g. names and gender are categorical variables;
– ‘M’ for Male and ‘F’ for Female,
– or ‘1’ for male and ‘2’ for female,
– or ‘1’ for female and ‘2’ for male.
• Other examples include
– marital status, religion, race, colour and employment
status, and so forth.
10
Ordinal Scale
• Like the nominal scale, it assigns labels
– but here the categories follow an order
• Ordinal scale creates an ordered (ranked)
relationship
• Typical ordinal scales
– (i) result of examination: first, second, third and fail;
– (ii) quality of products: ‘excellent’, ‘good’, ‘fair’ or ‘poor’
– (iii) social class: upper, middle, lower class
11
Interval Scale
• Indicates order and distance in units.
• The interval is a measuring tool
• But the zero point is arbitrary
• Example: a price index
– the index for the base year (say year 2010) is usually set to 100
– Price of bread is 40 kn (= 100) in year 2010
– Price of bread is 50 kn (= 125) in year 2015
– We then know the price of bread is higher in 2015 by 25%
• Another example of interval scale
– temperature where the initial point is always arbitrary
– 0 degrees is the freezing point of water in Celsius (used in Europe)
– 32 degrees is the freezing point of water in Fahrenheit (used in US)
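The price index arithmetic in the bread example can be sketched in a few lines. The function name `price_index` is a hypothetical helper introduced for illustration; the prices and base year come from the slide.

```python
def price_index(price, base_price):
    """Return the index of `price` relative to `base_price` (base year = 100)."""
    return price / base_price * 100


bread_2010 = 40  # kn, base year
bread_2015 = 50  # kn

index_2010 = price_index(bread_2010, bread_2010)
index_2015 = price_index(bread_2015, bread_2010)

# Because the base is 100, the difference between the two index values
# reads directly as a percentage change relative to the base year.
percent_change = index_2015 - index_2010

print(index_2010, index_2015, percent_change)
```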
12
Ratio Scale
• Ratio scales are absolute rather than relative
• If an interval scale has an absolute zero
– then it is really a ratio scale.
• Absolute zero
– a point on scale where the attribute is zero
• Examples
– age, money and weight are ratio scales
– because they possess an absolute zero and interval
properties
– A person can’t have a negative weight or negative age
13
Describing data using graphical
summaries
14
Frequency Table
• Frequency table
– a listing of possible values for a variable
– together with the number of observations
– or relative frequencies (%) for each value
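A frequency table of this kind could be built as follows. This is a minimal sketch; the `residences` observations are made up for illustration and are not from the slides.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (type of residence).
residences = ["House", "Apartment", "House", "House",
              "Apartment", "Other", "House", "Apartment"]

counts = Counter(residences)
n = len(residences)

# One row per category: count, proportion (0 to 1) and percentage (0 to 100).
frequency_table = {
    category: {
        "count": count,
        "proportion": count / n,
        "percent": 100 * count / n,
    }
    for category, count in counts.items()
}

for category, row in frequency_table.items():
    print(f"{category:<10} {row['count']:>3} {row['proportion']:.3f} {row['percent']:.1f}%")
```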
15
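Be careful to distinguish
Proportions & Percentages (Rel. Freq.)
• A proportion is the count in a category divided by the total count (between 0 and 1)
• A percentage is the proportion multiplied by 100
Proportions and percentages are also
called relative frequencies.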
16
Graphs for Categorical Variables
• Use pie charts and bar
graphs to summarize
categorical variables
1. Pie Chart: A circle
having a “slice of pie”
for each category
2. Bar Graph: A graph
that displays a vertical
bar for each category
17
Pie Charts
• Summarize categorical
variable
• Drawn as circle where
each category is a slice
• The size of each slice is
proportional to the
percentage in that
category
18
Bar Graphs
• Summarizes categorical
variable
• Vertical bars for each category
• Height of each bar represents
either counts or percentages
• Easier to compare categories
with bar graph than with pie
chart
• Called Pareto Charts when
ordered from tallest to
shortest
19
Histograms
• Graph that uses bars to
portray frequencies or
relative frequencies for a
quantitative variable
• Frequency is always on
vertical axis
• Intervals always on
horizontal axis
20
Constructing a Histogram
1. Divide into intervals of equal width
2. Count # of observations in each interval
Example: Sodium in Cereals
21
Constructing a Histogram
3. Label endpoints
of intervals on
horizontal axis
4. Draw a bar over
each value or
interval with
height equal to
its frequency (or
percentage)
5. Label and title
Sodium in Cereals
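The construction steps can be sketched without a plotting library. The sodium values below are hypothetical (the slide's data are not reproduced in this transcript), and `histogram_counts` is an illustrative helper name.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [left, left + width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the interval containing v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Hypothetical sodium contents (mg) for 14 cereals.
sodium = [0, 70, 125, 140, 150, 170, 180, 200, 200, 210, 220, 250, 290, 340]

start, width, n_bins = 0, 50, 7  # steps 1-2: equal-width intervals, then count
counts = histogram_counts(sodium, start, width, n_bins)

# Steps 3-5: label the interval endpoints and draw a bar per interval.
print("Sodium in Cereals (hypothetical data)")
for k, count in enumerate(counts):
    left = start + k * width
    print(f"[{left:>3}, {left + width:>3}) | {'#' * count}")
```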
22
Interpreting Histograms
• Assess where a
distribution is
centered by finding
the median
• Assess the spread of
a distribution
• Shape of a
distribution: roughly
symmetric (left and right
sides are mirror images),
skewed to the right, or
skewed to the left
23
Examples of Skewness
24
Shape: Type of Mound
• One mound (unimodal): height of 10 year olds
• Two mounds (bimodal): electricity demand or demand for
seats in a restaurant at different times of day
25
Outlier
An outlier falls far from the rest of the data
26
Time Plots
• Display a time series,
data collected over
time
• Plots each observation
on the vertical axis
against time on the
horizontal axis
• Points are usually
connected
Time Plot from 1995 – 2001 of
number of people globally
who use the Internet
27
Describing the Centre of
Quantitative Data
28
Mean
• The mean is the sum
of the observations
divided by the
number of
observations
• It is the center of
mass
29
Median
Odd number of observations (n = 9):
Order   1    2    3    4    5    6    7    8    9
Data   78   91   94   98   99  101  103  105  114

Even number of observations (n = 10):
Order   1    2    3    4    5    6    7    8    9   10
Data   78   91   94   98   99  101  103  105  114  121

• Midpoint of the observations
when ordered from least to
greatest
1. Order observations
2. If the number of observations
is:
a) Odd, the median is the
middle observation (99)
b) Even, the median is the
average of the two middle
observations ((99 + 101) / 2 = 100)
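The median rule above, together with the mean, can be sketched as follows using the two data sets from the slide. The function name `median` is written out by hand here for illustration; Python's standard `statistics.median` applies the same rule.

```python
def median(values):
    """Middle observation (odd n) or average of the two middle ones (even n)."""
    ordered = sorted(values)  # step 1: order the observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
even_data = odd_data + [121]

median_odd = median(odd_data)    # 99
median_even = median(even_data)  # 100.0

# The mean is the sum of the observations divided by their number.
mean_odd = sum(odd_data) / len(odd_data)

print(median_odd, median_even, round(mean_odd, 2))
```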
30
Comparing the Mean and Median
• Mean and median of a symmetric distribution are close
– Mean is often preferred because it uses all data
• But in a skewed distribution, the mean is farther out in
the skewed tail than is the median
– Median is preferred because it better represents a
typical observation
31
Mode
• Value that occurs most often
• Highest bar in the histogram
• Mode is most often used with categorical data
32
Resistant Measures
• A measure is resistant if extreme observations
(outliers) have little, if any, influence on its value
– Median is resistant to outliers
– Mean is not resistant to outliers
• Example: 75 people in class
– 72 people absent for 1 day each in the year
– 2 people absent for 50 days each
– 1 person absent for 100 days
– Median = 1 day
– Mean = (72 × 1 + 2 × 50 + 1 × 100) / 75 = 272 / 75 ≈ 3.63 days
– Mode = 1 day
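A short check of the absence example, computed directly from the counts listed on the slide, shows which measures move when the three extreme observations are set aside. The variable names are illustrative.

```python
import statistics

# Absences in days, built from the counts on the slide:
# 72 people with 1 day, 2 people with 50 days, 1 person with 100 days.
absences = [1] * 72 + [50] * 2 + [100]

mean_all = statistics.mean(absences)      # 272 / 75, about 3.63
median_all = statistics.median(absences)  # 1
mode_all = statistics.mode(absences)      # 1

# Drop the three extreme observations to see which measures change.
typical = [days for days in absences if days <= 1]
mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)

print(f"mean:   {mean_all:.2f} -> {mean_typical:.2f}")
print(f"median: {median_all} -> {median_typical}")
print(f"mode:   {mode_all}")
```

The median stays at 1 day with or without the extreme observations, while the mean changes, which is what resistance means here.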
33
Describing the Spread of
Quantitative Data
34
Range
Range = max – min
Two teams with same average (mean) height = 2.0 m
Team 1: 2.5 m, 2.1 m, 2.1 m, 1.8 m, 1.5 m (range = 1.0 m)
Team 2: 2.2 m, 2.1 m, 2.0 m, 1.9 m, 1.8 m (range = 0.4 m)
The range is strongly affected by outliers.
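The two-team comparison can be sketched as below. Grouping the ten heights into two teams of five, in the order they appear, is an assumption about the original slide layout; both groups do average 2.0 m.

```python
def value_range(values):
    """Range = max - min."""
    return max(values) - min(values)


# Heights in metres; the split into two teams is assumed from the slide order.
team_1 = [2.5, 2.1, 2.1, 1.8, 1.5]
team_2 = [2.2, 2.1, 2.0, 1.9, 1.8]

for name, heights in (("Team 1", team_1), ("Team 2", team_2)):
    mean_height = sum(heights) / len(heights)
    print(f"{name}: mean = {mean_height:.1f} m, range = {value_range(heights):.1f} m")
```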
35
Properties of Sample Standard Deviation
1. Measures spread of data
2. Only zero when all observations are same; otherwise, s > 0
3. As the spread increases, s gets larger
4. Same units as observations
5. Not resistant
6. Strong skewness or outliers greatly increase s
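These properties can be illustrated with a small sketch. The definition of s is not reproduced in this part of the transcript, so the code assumes the usual sample formula, the square root of the sum of squared deviations from the mean divided by n - 1. The data sets are hypothetical.

```python
import math


def sample_std(values):
    """Sample standard deviation s, using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq_dev / (n - 1))


constant = [5, 5, 5, 5]            # all observations the same
narrow = [4, 5, 5, 6]              # small spread
wide = [1, 5, 5, 9]                # larger spread
with_outlier = [4, 5, 5, 6, 50]    # narrow data plus one outlier

for name, data in (("constant", constant), ("narrow", narrow),
                   ("wide", wide), ("with outlier", with_outlier)):
    print(f"{name:<13} s = {sample_std(data):.2f}")
```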
38
How Measures of Position
Describe Spread
40
Percentile
The pth percentile is a value
such that p percent of the
observations fall below or at
that value
70th percentile
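The definition can be checked directly by counting observations at or below a value. This is only a sketch of the definition, not a general percentile algorithm (software packages use several interpolation conventions); `percent_at_or_below` is a hypothetical helper and the data are reused from the median example.

```python
def percent_at_or_below(values, x):
    """Percentage of observations that fall at or below x."""
    return 100 * sum(1 for v in values if v <= x) / len(values)


data = [78, 91, 94, 98, 99, 101, 103, 105, 114, 121]

# 7 of the 10 observations are at or below 103, so under this definition
# 103 is a 70th percentile of the data.
p = percent_at_or_below(data, 103)
print(p)
```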
41
Finding Quartiles
• Quartiles split the data into four
parts with the same number of
observations in each part
1. Arrange data in order
2. The median is the second
quartile, Q2
3. Q1 is the median of the lower
half of the observations
4. Q3 is the median of the upper
half of the observations
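The four steps can be sketched as follows. One detail is an assumption: when the number of observations is odd, the median itself is left out of both halves. Other conventions exist and can give slightly different quartiles. The data are reused from the median example.

```python
def median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)            # 1. arrange data in order
    n = len(ordered)
    q2 = median(ordered)                # 2. the median is Q2
    lower = ordered[: n // 2]           # lower half (median excluded if n is odd)
    upper = ordered[(n + 1) // 2:]      # upper half (median excluded if n is odd)
    return median(lower), q2, median(upper)  # 3. and 4.


q1, q2, q3 = quartiles([78, 91, 94, 98, 99, 101, 103, 105, 114])
print(q1, q2, q3)  # 92.5 99 104.0
```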
42
Measure of Spread: Quartiles
• Quartiles divide a ranked
data set into four equal parts:
1. 25% of the data at or below Q1 and 75% above (Q1 = first quartile = 2.2)
2. 50% of the obs are above the median and 50% are below (M = median = 3.4)
3. 75% of the data at or below Q3 and 25% above (Q3 = third quartile = 4.35)
43
Calculating Interquartile Range
• The interquartile range is the distance between
the third and first quartile, giving spread of middle
50% of the data: IQR = Q3 - Q1
44
Criteria for Identifying an Outlier
• An observation is a potential outlier if:
– it falls more than 1.5 x IQR below the first quartile or
– more than 1.5 x IQR above the third quartile.
Example: Q1 = 25, median = 50, Q3 = 75
IQR: (75-25) = 50
Outlier < 25 - 1.5 x 50 = -50
Outlier > 75 + 1.5 x 50 = 150
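The 1.5 x IQR criterion can be sketched as a pair of cut-off values computed from the quartiles. The names `outlier_fences` and `is_potential_outlier` are illustrative, and the quartiles 25 and 75 are taken from the example on the slide.

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) cut-offs: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_potential_outlier(x, q1, q3):
    """True if x falls more than 1.5 * IQR below Q1 or above Q3."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper


lower, upper = outlier_fences(25, 75)
print(lower, upper)  # -50.0 150.0
print(is_potential_outlier(160, 25, 75), is_potential_outlier(140, 25, 75))
```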
45
5 Number Summary
• The five-number summary
of a dataset consists of:
1. Minimum value
2. First Quartile
3. Median
4. Third Quartile
5. Maximum value
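A five-number summary can be sketched with Python's standard library. Note that `statistics.quantiles` uses an interpolation rule (its default "exclusive" method) that is only one of several quartile conventions, so results may differ slightly from a hand calculation. The data are reused from the median example.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = statistics.quantiles(values, n=4)  # default method: "exclusive"
    return min(values), q1, med, q3, max(values)


data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
summary = five_number_summary(data)
print(summary)  # (78, 92.5, 99.0, 104.0, 114)
```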
46
Boxplot
1. Box goes from the Q1 to Q3 (the IQR)
2. Line is drawn inside the box at the median
(the middle value)
3. Lines go from
– the lower end of box to smallest observation
that’s not a potential outlier
– the upper end of box to largest observation
that’s not a potential outlier
4. Potential outliers are shown separately,
often with * or +
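The pieces of a boxplot can be computed before anything is drawn. The sketch below is one possible reading of the four steps: it assumes the quartile rule of `statistics.quantiles` (default method) and the 1.5 x IQR criterion for potential outliers. The data set, with one unusually large value, is hypothetical.

```python
import statistics


def boxplot_parts(values):
    """Return the box, median line, whisker ends and potential outliers."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low_cut, high_cut = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_cut <= v <= high_cut]
    return {
        "box": (q1, q3),                      # 1. box from Q1 to Q3
        "median": med,                        # 2. line inside the box
        "whiskers": (inside[0], inside[-1]),  # 3. most extreme non-outliers
        "outliers": [v for v in ordered if v < low_cut or v > high_cut],  # 4.
    }


parts = boxplot_parts([78, 91, 94, 98, 99, 101, 103, 105, 114, 160])
print(parts)
```

Here the upper line stops at 114 and the value 160 is shown separately as a potential outlier.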
47
Comparing Distributions using Boxplots
• Boxplots do not display the shape of the distribution as
clearly as histograms
• but are useful for making graphical comparisons of two
or more distributions
48