CHAPTER 14:
INTRODUCTION TO
DATA ANALYSIS
14.1
INTRODUCTION
 There are many situations in business where
data is collected and analysed.
 The key ideas of data analysis are important
in the modern business environment.
 Data analysis means summarising and understanding the main
features of the variables contained within
the data, and investigating the nature of any
linkages that may exist between the variables.
14.2
WHAT IS DATA
 Example 1
 Population: the set of all people/objects of interest
in the study being undertaken.
– Very large
– Enumerated precisely
– Cannot be Enumerated physically
Population
member
 The information for each member of the
population:
– Age
– Gender
– Parish
– Will you vote in the by-election?
– Will you vote for me?
 Variables: one piece of information
– Five variables
 To investigate the connection between the
following pairs of variables:
– 'Will you vote for me' and 'Age'
– 'Will you vote for me' and 'Gender'
– 'Will you vote for me' and 'Parish'
 Population data is used → the outcomes of
the analysis are precise → 'perfect
information' results.
 Example 2
 Population: the set of all customers
 A sensible initial set of questions is:
– Do you understand exactly what each variable
is measuring/recording?
– Do you understand the problem under
investigation, and are the objectives of the
investigation clear?
14.3
DESCRIBING VARIABLES
 Classification of variable types
– Attribute variables
– Measured variables
 Attribute Variables:
– An attribute variable has its outcomes described in
terms of its characteristics or attributes.
– Example 1 'By-Election Data':
– Example 2 'Credit Data'
• Does the customer own their own house?
– 0 = Yes
– 1 = No
• The Region in which the customer is resident?
– 1 = South West
– 2 = South East
– 3 = London
– 4 = Midlands
– 5 = North
• Attribute data is handled by giving each outcome a numerical
code 0, 1, 2, …
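The coding of attribute data described above can be sketched in a few lines of Python. This is only an illustration: the helper name `decode` and the coded records are assumptions made up for the example, while the code tables follow the 'Credit Data' coding.

```python
# Numerical codes for two attribute variables from the 'Credit Data' example.
OWN_HOUSE_CODES = {0: "Yes", 1: "No"}
REGION_CODES = {
    1: "South West",
    2: "South East",
    3: "London",
    4: "Midlands",
    5: "North",
}


def decode(codes, mapping):
    """Translate a list of numerical codes back into attribute labels."""
    return [mapping[code] for code in codes]


# Hypothetical coded records, invented purely for illustration.
coded_regions = [3, 1, 5, 3]
labels = decode(coded_regions, REGION_CODES)
print(labels)
```

The codes are only labels: arithmetic on them (for example, an "average region") has no meaning.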
 Measured Variable
– A measured variable is a variable that has its
outcomes measured; the resulting outcome is
expressed in numerical terms.
– Two types of measured variables
• Continuous variable: measured on a continuous scale
(e.g. a person's weight)
• Discrete variable: takes only separate, countable values
(e.g. the number of passengers on a flight)
– Example 1 'By-Election Data':
• The measured variable in this data set is 'Age'
– Example 2 'Credit Data'
• Measured variables as follows
14.4 THE CONCEPT OF A STATISTICAL
DISTRIBUTION
 Attribute Variable
– Gender of constituents (Example 1)
DISTRIBUTION OF GENDER IN THE CONSTITUENCY
– REGION (Example 2)
DISTRIBUTION OF REGION IN WHICH CUSTOMER IS RESIDENT
 Measured Variable
– Customer's Age (Example 2)
DISTRIBUTION OF AGE OF CUSTOMER
– Household Income (Example 2)
DISTRIBUTION OF HOUSEHOLD INCOME
 What does the distribution show?
– The area under the curve from one income
value to another measures the relative
proportion of the population having household
incomes in that range.
– Lower than £10,000 is relatively rare
– Large proportion of the population have
Household incomes between £20,000 &
£50,000
 The Descriptive Statistics for Distribution
of a Measured Variable
– Distribution of the height of adults in Great
Britain.
– The height of children under 11 years of age
children's heights
adult's heights
– Heights in two different countries, country A
and country B
DISTRIBUTION OF HEIGHTS COUNTRY A & B
 A statistical distribution for a measured
variable can be summarised using three key
descriptions:
– Centre of the distribution
– Width of the distribution
– Symmetry of the distribution
 Measuring the Centre of a Distribution:
– The Mean
• Average value = $\sum X / n$
• Average Household Income
• Symbol for the population mean: $\mu$
• Formally the population mean of a variable is defined to be:
– $\mu = \sum X / n$
– The Median
• The median value of the variable is defined to be the particular
value of the variable such that half the data values are less than
the median value and half are greater.
• Sorting all data in ascending order, the median value is then the
middle value in this list
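As a small sketch of the two measures of centre, the following Python computes the mean as the sum of the values divided by n and the median as the middle value of the sorted data. The income figures are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean, median

# Hypothetical household incomes (£ per annum), invented for illustration.
incomes = [18000, 22000, 25000, 31000, 47000]

# Population mean: sum of all values divided by the number of values.
mu = sum(incomes) / len(incomes)

# Median: the middle value once the data are sorted in ascending order.
middle = median(incomes)

print(mu, middle)
```

Here the single large income pulls the mean above the median, which is a first hint of a right-skewed distribution.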
 Measuring the Width of a Distribution
– The Standard Deviation
• The Standard Deviation is the square root of the average
squared deviation from the mean.
• Symbol of Standard Deviation: $\sigma$
• $\sigma$ is usually defined in terms of the variance $\sigma^2$ as:
– $\sigma^2 = \sum (X - \mu)^2 / n$
• Standard deviation is the square root of the
variance
• Calculating the standard deviation for the variable
Household Income
• Standard deviation is a relative measure of spread
(width), the larger the standard deviation the wider
the distribution.
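The definition of the population standard deviation (square root of the average squared deviation from the mean) can be checked step by step. The data below are hypothetical; `pstdev` from Python's standard library uses the same divide-by-n definition and is shown only as a cross-check.

```python
import math
from statistics import pstdev

# Hypothetical population values, invented for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mu = sum(values) / n                               # population mean
variance = sum((x - mu) ** 2 for x in values) / n  # average squared deviation
sigma = math.sqrt(variance)                        # standard deviation

print(mu, variance, sigma)
```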
– Inter-quartile Range
• The inter-quartile range is the range over which the
middle 50% of the data values vary
• To define the quartiles:
– Q1: the value of the variable that divides the distribution 25% to
the left and 75% to the right.
– Q2: the value of the variable that divides the distribution 50% to
the left and 50% to the right.
– Q3: the value of the variable that divides the distribution 75% to
the left and 25% to the right.
• The inter-quartile range is the value Q3 - Q1
• Calculating the Q1, Q2, Q3 for the variable
'Household Income'
• Conventionally the mean and standard deviation are used as
one pair of measures of location and spread, and the
median and inter-quartile range as another pair.
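A minimal sketch of the quartile calculation, using hypothetical data. Note that several conventions exist for interpolating quartiles, so statistical packages can give slightly different values for the same data; the `method="inclusive"` choice below is an assumption made for this example, not the method used in the slides.

```python
from statistics import quantiles

# Hypothetical data values, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Inter-quartile range: the spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```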
 Measuring the Symmetry (skewness) of a
Distribution
– Pearson's coefficient of Skewness
• Pearson's coefficient of Skewness = 3(mean - median)/standard
deviation
– Quartile Measure of Skewness
• Quartile Measure of Skewness = [(Q3 - Q2) - (Q2 - Q1)]/(Q3 - Q1)
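Both skewness measures are simple arithmetic on statistics already introduced. The sketch below uses the standard form of the quartile measure, [(Q3 - Q2) - (Q2 - Q1)]/(Q3 - Q1); the function names and the input numbers are hypothetical.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3(mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


def quartile_skewness(q1, q2, q3):
    """Quartile measure: compares the upper and lower halves of the middle 50%."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)


# Hypothetical summary statistics, invented for illustration.
pearson = pearson_skewness(mean=50, median=47, std_dev=9)
quartile = quartile_skewness(q1=40, q2=47, q3=60)

print(pearson, quartile)
```

Both values are positive here, indicating a right-skewed distribution; a symmetric distribution gives values close to 0 on both measures.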
14.5
SUMMARY
 What is Data
 Variables
 Two types of variable:
– an attribute variable
– a measured variable
 The concept of a Statistical Distribution:
– As applied to an attribute variable
– As applied to a measured variable
 Descriptive Statistics for a measured
variable:
– Measures of Centre
• Mean
• Median
– Measures of Width
• Standard Deviation
• Inter-Quartile Range
– Measures of Symmetry (Skewness)
• Pearson's coefficient of Skewness
• Quartile Measure of Skewness
14.6 THE NATURE OF A SAMPLE:
 POPULATION:
– Perfect Information
– In practice it is often impossible to enumerate
the whole population.
– A sample is drawn from the population to make
judgements (inferences) about the population.
 SAMPLE
– Imperfect Information
– Random sample
• Each item in the population has an equal chance of
being included in the sample.
– The KEY PROBLEM is to use this sample data
to draw valid conclusions about the population
while taking into account
the 'error due to sampling'
– Unrepresentative sample
• How to Lie with Statistics
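To illustrate a random sample and the 'error due to sampling', the sketch below draws ten members from a hypothetical population of customer IDs so that every member has an equal chance of selection. The population, the sample size and the seed are all assumptions made for the example.

```python
import random

# Hypothetical population: customer IDs 1..1000 (invented for illustration).
population = list(range(1, 1001))

# A fixed seed makes the example repeatable; a real study would not fix it.
rng = random.Random(42)

# Simple random sample without replacement: each member is equally likely.
sample = rng.sample(population, k=10)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The gap between the two means is one instance of sampling error.
print(population_mean, sample_mean)
```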
 A Credit Scenario
– Population: the set of all customers who used the
credit facilities between 1st January 2000 and 31st
December 2001.
– Sample Size: 654 customers
– Data file: BDMCREDIT.MTW
14.8
DESCRIBING SAMPLE DATA
 Attribute variable: the number of
occurrences of each attribute is obtained
 Measured variable: Sample descriptive
statistics describing the centre, width and
symmetry of the distribution are calculated.
 Attribute Data
– C5 Does the customer own their own house?
Coded: 0 = Yes, 1 = No
– C6 The Region in which the customer is
resident?
Coded:
– 1 = South West
– 2 = South East
– 3 = London
– 4 = Midlands
– 5 = North
– Command STAT-TABLE-TALLY
– Summary Statistics for Discrete Variables
• Counts (OWN-OCC)
• Percent(OWN-OCC)
• Distribution graph(OWN-OCC)
Do you Own your own house?
– Summary Statistics for Discrete Variables
– Count(REGION )
– The information in the form:
• 74 or 11.31% of the respondents are from the South West
• 132 or 20.18% of the respondents are from the South East
• 165 or 25.23% of the respondents are from the London area
• 161 or 24.62% of the respondents are from the Midlands
• 122 or 18.65% of the respondents are from the North
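The percentages in this tally follow directly from the counts and the sample size of 654. A short Python sketch that reproduces them from the REGION counts (the variable names are illustrative only):

```python
# Counts of respondents per region, taken from the REGION tally.
counts = {
    "South West": 74,
    "South East": 132,
    "London": 165,
    "Midlands": 161,
    "North": 122,
}

total = sum(counts.values())  # should equal the sample size

# Percentage of the sample in each region, rounded to two decimals.
percents = {region: round(100 * n / total, 2) for region, n in counts.items()}

for region, pct in percents.items():
    print(f"{counts[region]} or {pct}% of the respondents are from {region}")
```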
 Measured Variables
– For the 'Credit Data':
• C2 Customer's Age (AGE)
• C3 Household Income (£ per annum) (SALARY)
• C4 Estimated monthly outgoing on mortgage/rent/rates/utilities/credit
card payments etc. (PAYOUT)
• C7 The Amount borrowed on credit (CREDIT)
– HISTOGRAM
– BOXPLOT
• The BOXPLOT will prove to be a more useful way
of representing the picture of a sample distribution
when the data analysis used to examine the
connection between two sample variables is
discussed in later chapters.
14.7
DATA ANALYSIS USING SAMPLE DATA
 Before attempting to analyse any data, the
analyst should ensure that:
– The problem under investigation is clearly
understood and the objectives of the
investigation have been clearly specified. Keep
asking questions until satisfactory answers have
been obtained.
– The individual variables making up the data
set are clearly understood.
– Descriptive Statistics
• Measures of Centre
– Mean
• Sample Mean: $\bar{X} = 945.2$
– Median
• Measures of Width
– Standard Deviation
• Sample Standard Deviation: S
• Sample Variance: S2
– Inter-Quartile Range IQR
• Symmetry
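A sketch of the sample versions of these statistics. It assumes the usual convention that the sample variance S² divides by n - 1 rather than n (the slides do not state the divisor), which is also what Python's `statistics.variance` and `statistics.stdev` do. The data are hypothetical.

```python
import math
from statistics import stdev

# Hypothetical sample values, invented for illustration.
sample = [10, 12, 14, 16, 18]

n = len(sample)
x_bar = sum(sample) / n  # sample mean

# Sample variance with the n - 1 divisor (assumed convention).
s_squared = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s_squared)  # sample standard deviation

print(x_bar, s_squared, s)
```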
 Symmetry (Skewness)
– A distribution is skewed if one tail extends
farther than the other.
– A value close to 0 indicates symmetric data.
– Negative values indicate negative/left skew.
– Positive values indicate positive/right skew.
– Example of a negative or left-skewed
distribution (skewness = -1.44096)
Summary for marks
Anderson-Darling Normality Test: A-Squared = 2.37, P-Value < 0.005
Mean = 73.540, StDev = 12.670, Variance = 160.534
Skewness = -1.44096, Kurtosis = 2.92033, N = 100
Minimum = 26.000, 1st Quartile = 67.000, Median = 76.000, 3rd Quartile = 83.000, Maximum = 92.000
95% Confidence Interval for Mean: 71.026 to 76.054
95% Confidence Interval for Median: 73.000 to 79.000
95% Confidence Interval for StDev: 11.125 to 14.719
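As a cross-check, the two skewness measures defined earlier in the chapter can be computed from the 'marks' summary statistics. This is only a sketch: these hand measures are not the same statistic as the software's reported skewness of -1.44096, so the magnitudes differ, but the sign should agree for a left-skewed sample.

```python
# Summary statistics for 'marks', taken from the output above.
mean, median, st_dev = 73.540, 76.000, 12.670
q1, q3 = 67.000, 83.000

# Inter-quartile range.
iqr = q3 - q1

# Pearson's coefficient of skewness: 3(mean - median) / standard deviation.
pearson = 3 * (mean - median) / st_dev

# Quartile measure of skewness: [(Q3 - Q2) - (Q2 - Q1)] / (Q3 - Q1).
quartile = ((q3 - median) - (median - q1)) / iqr

# Both values come out negative, consistent with a left skew.
print(iqr, round(pearson, 3), quartile)
```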
– The Relationship between the descriptive
statistics and the Boxplot
• The asterisks on the right hand side of the
median are indicating sample values that are in
some sense extreme
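One common convention (assumed here, not stated in the slides) is that a boxplot marks a value as extreme when it lies more than 1.5 × IQR below Q1 or above Q3. The sketch applies that rule to the 'marks' summary statistics, a different variable from the one in the boxplot discussed above.

```python
# Quartiles and extremes for 'marks', taken from the summary output.
q1, q3 = 67.0, 83.0
minimum, maximum = 26.0, 92.0

iqr = q3 - q1

# Fences under the assumed 1.5 x IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond a fence would be drawn as individual points (asterisks).
has_low_extremes = minimum < lower_fence
has_high_extremes = maximum > upper_fence

print(lower_fence, upper_fence, has_low_extremes, has_high_extremes)
```

For 'marks' the extreme values fall on the low side only, which matches the long left tail indicated by the negative skewness.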
14.9
INVESTIGATING RELATIONSHIPS
BETWEEN VARIABLES
 To investigate the relationship between
variables.
– Response variable
• a variable that measures either directly or
indirectly the objectives of the analysis
– Explanatory variable
• a variable that may influence the response
variable
 Example 1
– A university wishes to investigate the salary of
its graduates five years after graduating
– The questionnaire
• 'Current Salary'
• 'Starting Salary'
• 'Class of Degree' Coded: 1=First, 2=Upper
Second, 3=Lower Second, 4=Third, 5=Pass.
• 'Graduate's Gender' Coded: 1=Male, 2=Female.
– Response variable
• Current Salary (measured variable)
– Explanatory Variable
• Starting Salary (measured variable)
• Class of Degree (attribute variable)
• Graduate's Gender (attribute variable)
 Example 2: CREDIT scenario
– Objectives of the analysis
• To investigate the nature of credit transactions
• The variable 'The Amount borrowed on credit'
• The problem is to investigate the relationship between 'The
Amount borrowed on credit' and the other variables.
– Summary
 Combinations of Response Variable and
Explanatory Variable
EXPLANATORY VARIABLE
 The method for investigating the connection
between a response variable and an explanatory
variable depends on the type of each variable.
– Investigating the connection between a
measured response and a measured explanatory
variable
– Investigating the connection between a
measured response and an attribute explanatory
variable
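One simple first look at a measured response against an attribute explanatory variable is to summarise the response separately for each attribute code. The sketch below does this for 'Current Salary' by 'Class of Degree'; the records are hypothetical, and comparing group means is offered only as an illustration, not as the method developed in later chapters.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class of degree code, current salary in £).
records = [
    (1, 42000),
    (1, 46000),
    (2, 38000),
    (2, 40000),
    (3, 33000),
]

# Collect the measured response separately for each attribute code.
salaries_by_class = defaultdict(list)
for degree_code, salary in records:
    salaries_by_class[degree_code].append(salary)

# Mean response within each group.
group_means = {
    code: mean(values) for code, values in sorted(salaries_by_class.items())
}

print(group_means)
```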
Homework
 Find or collect some data in your life or
business practice, and answer the following
questions:
– Draw the statistical distribution of the data
– Calculate the Mean and Standard Deviation
– Calculate the Median and Inter-Quartile Range
– Calculate Pearson's Coefficient of
Skewness and the Quartile Measure of Skewness