PROBLEM SET 5
STATISTICS ANALYSIS
PART I: DESCRIPTIVE STATISTICS
The purpose of PART I is to use descriptive statistics to explore patterns of prevalence of
diabetes in a high risk population, the Pima Indians of Arizona.
Database:
The data file PIMA.XLS contains medical data derived from observations of
individuals in the Pima population. Each record of the file refers to an individual's
fasting blood glucose level and blood glucose level measured 2 hours after the ingestion
of 75 g of carbohydrate. All values in the file are non-zero and no values are missing.
Each record of the file contains the following fields:
X1, X2, X3, X4, X5, X6, X7
where X1 = NIH # for identification of the individual
X2 = fasting glucose, mg/100 ml of plasma
X3 = 2-hr glucose, mg/100 ml of plasma
X4 = sex; 1 = male, 2 = female
X5 = age, years
X6 = height, cm
X7 = weight, kg
PIMA.XLS contains 1211 records and can be found on the Biol 315 homepage.
Required Work:
In performing the required work for this exercise, you will use database functions
from an Excel spread-sheet. New spread-sheet functions will include advanced data
filtering and descriptive statistical functions (AVERAGE, STDEV, SKEW, KURT). You
will also need to set up a frequency analysis with which to generate histograms. The
required work will be in two parts:
PART I a. To illustrate how a transformation can be needed before a variable follows a
Gaussian (Normal) distribution, you will first transform the results of the glucose
tolerance test. The required transformation is the log to the base 10 of variable X3 for
males and females 15 to 24 years old. With an EXCEL spread-sheet, produce the following
histograms in order to compare their shapes:
a. X3 for males, 15-24 years old
b. log10(X3) for males, 15-24 years old
c. X3 for females, 15-24 years old
d. log10(X3) for females, 15-24 years old
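The selection, transformation, and frequency steps of Part I a can be sketched outside the spreadsheet as follows. This is only an illustrative sketch in Python: the helper names `select_x3` and `frequency` are hypothetical, the bin edges are arbitrary, and the records are made-up values, not data from PIMA.XLS. The binning rule mirrors the convention of Excel's FREQUENCY function (each bin counts values greater than the previous edge and less than or equal to its own edge, with one extra bin for values above the last edge).

```python
import math
from bisect import bisect_left

def frequency(values, bin_uppers):
    """Count values per bin using the FREQUENCY convention:
    bin i holds v with bin_uppers[i-1] < v <= bin_uppers[i];
    one extra final bin holds everything above the last edge."""
    counts = [0] * (len(bin_uppers) + 1)
    for v in values:
        counts[bisect_left(bin_uppers, v)] += 1
    return counts

def select_x3(records, sex, age_min, age_max):
    """Return X3 (2-hr glucose) for one sex and an inclusive age range."""
    return [r["X3"] for r in records
            if r["X4"] == sex and age_min <= r["X5"] <= age_max]

# Made-up records for illustration only (NOT values from PIMA.XLS).
records = [
    {"X3": 90, "X4": 1, "X5": 16},
    {"X3": 100, "X4": 1, "X5": 20},
    {"X3": 110, "X4": 1, "X5": 24},
    {"X3": 250, "X4": 1, "X5": 22},
    {"X3": 120, "X4": 2, "X5": 19},   # female, excluded below
    {"X3": 95, "X4": 1, "X5": 40},    # outside the age range
]

raw = select_x3(records, sex=1, age_min=15, age_max=24)
logged = [math.log10(v) for v in raw]

# Arbitrary bin edges chosen for the example.
raw_counts = frequency(raw, [100, 150, 200])
log_counts = frequency(logged, [2.0, 2.2, 2.4])
print(raw_counts, log_counts)
```

The two count lists correspond to the columns you would chart as histograms a and b; repeating the call with `sex=2` gives c and d.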
PART I b. Obtain the histograms and a set of descriptive statistics (mean, standard
deviation, skewness and kurtosis) of the log to the base 10 2-hr glucose values for the
following twelve groups:
a. Males, 0-14 years old
b. Males, 15-24
c. Males, 25-34
d. Males, 35-44
e. Males, 45-54
f. Males, 55 and older
g. Females, 0-14 years old
h. Females, 15-24
i. Females, 25-34
j. Females, 35-44
k. Females, 45-54
l. Females, 55 and older
For each sex group (males and females), plot the mean and standard deviation of log
two-hour blood glucose versus age group midpoint for each of the 6 age groups above.
Use 1 standard deviation as the Y error.
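As a cross-check on the spreadsheet results, the four descriptive statistics can be computed directly. The sketch below is an assumption-laden illustration, not the required procedure: `describe` and `age_group` are hypothetical helper names, and the skewness and kurtosis use the bias-corrected sample formulas that Excel documents for SKEW and KURT (KURT reports excess kurtosis, so a Normal sample gives a value near 0).

```python
import math

# The six age groups listed above; None marks the open-ended last group.
AGE_GROUPS = [(0, 14), (15, 24), (25, 34), (35, 44), (45, 54), (55, None)]

def age_group(age):
    """Return the (low, high) age group that contains `age`."""
    for low, high in AGE_GROUPS:
        if age >= low and (high is None or age <= high):
            return (low, high)
    raise ValueError("age must be non-negative")

def describe(values):
    """Mean, sample SD, skewness and excess kurtosis of a list.
    Needs at least 4 values and a nonzero standard deviation."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values for kurtosis")
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z3 = sum(((v - mean) / sd) ** 3 for v in values)
    z4 = sum(((v - mean) / sd) ** 4 for v in values)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {"n": n, "mean": mean, "sd": sd, "skew": skew, "kurt": kurt}

# Toy input, not PIMA data: a symmetric sample has zero skewness.
stats = describe([1, 2, 3, 4, 5])
print(stats)
```

Applying `describe` to the log10(X3) values of each of the twelve sex and age groups yields the mean and standard deviation needed for the error-bar plots.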
PART I Submit:
1. In an EXCEL file (15 points):
a. Two X-Y plots (males and females): Age vs mean of the log transformed 2-hr glucose plus and minus 1 standard deviation.
b. 14 Histograms (Parts I a and I b) and statistics for the 12 combinations of age and
sex groups (Part I b)
2. Your discussion in PART I must cover the following points (15 points)
a. Explain basis for the use of the Normal or Gaussian distribution as a reference
in these studies. Comment on the effectiveness of the logarithmic
transformation.
b. Describe the patterns of variations of mean and standard deviation of log 2-hr
glucose with age in both sexes.
c. Identify patterns of variation of the log transformed 2-hr plasma glucose with
age and sex in terms of skewness and kurtosis.
PART II: INFERENCE
The purpose of this PART II is to acquaint you with statistical inference by exploring
blood glucose measurements obtained from Pima Indians. In addition, you will use a
simple function to characterize changes in blood glucose levels with age and sex in this
population. For this exercise, you will need to refer to the distributions of Log 2-hr
glucose for various sex and age combinations. You may consult the distributions you
plotted in problem set 9.
PART IIa. A function describing two overlapping Gaussian distributions has been
proposed to explain the observed Log 2-hr blood glucose distributions in the Pima
Indians. This model is
$$f(x) = \alpha \, N(\mu_1, \sigma_1^2) + (1 - \alpha) \, N(\mu_2, \sigma_2^2) \qquad (1)$$

where $N(\mu_1, \sigma_1^2)$ is the Gaussian distribution for Log 2-hr glucose levels in "normal"
individuals with mean $\mu_1$ and standard deviation $\sigma_1$, and $N(\mu_2, \sigma_2^2)$ is the Gaussian
distribution for Log 2-hr glucose levels in hyperglycemic individuals with mean $\mu_2$ and
standard deviation $\sigma_2$. The quantities $\alpha$ and $(1 - \alpha)$ are the relative proportions of
"normal" and hyperglycemic individuals in the population.
Pool the log-2hr glucose values for all ages and both sexes to form a single
frequency distribution and a single histogram. Using this histogram, determine a cut off
point that you think best separates the two component distributions (i.e. pick an antimode
value C). Specify this antimode in the analyses that follow and use only this value. For
your own information, you might make a guess (based on visual inspection of the
distributions) of the components of the model in equation 1 for each age and sex group.
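To get a feel for the two-component model and for what an antimode looks like in binned data, the following sketch may help. It rests on several assumptions: `alpha` is used here as the name of the mixing proportion of "normal" individuals, `antimode_bin` is a hypothetical helper that simply picks the emptiest bin between the two tallest local peaks, and the counts are invented. It is a rough aid to visual inspection, not a substitute for it.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi)))

def mixture_pdf(x, alpha, mu1, sigma1, mu2, sigma2):
    """Two overlapping Gaussians weighted by alpha and (1 - alpha)."""
    return (alpha * normal_pdf(x, mu1, sigma1)
            + (1 - alpha) * normal_pdf(x, mu2, sigma2))

def antimode_bin(counts):
    """Index of the lowest bin between the two tallest local peaks,
    or None when the histogram has fewer than two peaks."""
    last = len(counts) - 1
    peaks = [i for i in range(len(counts))
             if (i == 0 or counts[i] > counts[i - 1])
             and (i == last or counts[i] >= counts[i + 1])]
    if len(peaks) < 2:
        return None
    tallest = sorted(peaks, key=lambda i: counts[i], reverse=True)[:2]
    left, right = sorted(tallest)
    return min(range(left + 1, right), key=lambda i: counts[i])

# Invented bin counts with two humps (NOT from PIMA.XLS).
counts = [2, 10, 25, 12, 4, 7, 9, 3]
c_index = antimode_bin(counts)
print(c_index)
```

The cut-off C would then be a value inside the bin at `c_index`; noisy histograms can produce spurious peaks, so the result should always be checked against the plotted histogram.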
PART IIb. Next, reanalyze each of the age and sex groups more formally. For each
age-sex group, you must separate "normal" and hyperglycemic individuals. If the Log 2-hr
glucose value is less than C, the individual is "normal"; otherwise, the individual is
hyperglycemic. In each of the age-sex groups, you must calculate the following statistics:
a. $\bar{X}$, the sample mean of Log 2-hr glucose, overall;
b. $s / \sqrt{n}$, the sample standard error of Log 2-hr glucose, overall;
c. $\bar{X}_1$, the sample mean of Log 2-hr glucose for normal individuals;
d. $s_1 / \sqrt{n_1}$, the sample standard error of Log 2-hr glucose for normal individuals;
e. $\bar{X}_2$, the sample mean of Log 2-hr glucose for hyperglycemic individuals;
f. $s_2 / \sqrt{n_2}$, the sample standard error of Log 2-hr glucose for hyperglycemic individuals;
g. $p_2 = n_2 / n$, the sample proportion of hyperglycemic individuals; and
h. $\sqrt{p_2 q_2 / n} = \sqrt{n_1 n_2 / n} \, / \, n$, the sample standard error of the proportion of hyperglycemic
individuals, where $q_2 = 1 - p_2 = n_1 / n$.
You may choose any procedure to calculate these statistics.
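One possible procedure, sketched in Python under stated assumptions: `group_statistics` and `mean_and_se` are hypothetical names, the cut-off of 2.3 in the example is arbitrary rather than a recommended value of C, and the five input values are invented. Subgroups with fewer than two members get `nan` for statistics that cannot be computed.

```python
import math

def mean_and_se(values):
    """Sample mean and standard error s / sqrt(n); nan where undefined."""
    n = len(values)
    if n == 0:
        return float("nan"), float("nan")
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, s / math.sqrt(n)

def group_statistics(log_values, cutoff):
    """Statistics a to h for one age-sex group of Log 2-hr glucose values."""
    normal = [v for v in log_values if v < cutoff]    # "normal": below C
    hyper = [v for v in log_values if v >= cutoff]    # hyperglycemic otherwise
    n, n1, n2 = len(log_values), len(normal), len(hyper)
    mean_all, se_all = mean_and_se(log_values)
    mean1, se1 = mean_and_se(normal)
    mean2, se2 = mean_and_se(hyper)
    p2 = n2 / n
    q2 = n1 / n
    return {
        "a_mean": mean_all, "b_se": se_all,
        "c_mean1": mean1, "d_se1": se1,
        "e_mean2": mean2, "f_se2": se2,
        "g_p2": p2, "h_se_p2": math.sqrt(p2 * q2 / n),
        "n": n, "n1": n1, "n2": n2,
    }

# Invented values and an arbitrary cut-off, for illustration only.
result = group_statistics([1.9, 2.0, 2.1, 2.4, 2.6], cutoff=2.3)
print(result)
```

Running the function once per age-sex group gives the rows of the table requested in Part II c.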
PART II c. Make a table of these (a to h) parameters for each age-sex group. The table
must have informative labels.
PART II d. Use an EXCEL spread-sheet to produce the following three XY-plots for
both males and females (i.e., six graphs):
a. Mean Log 2-hr glucose of normal individuals with 2 S.E. versus age class midpoint;
b. Mean Log 2-hr glucose of hyperglycemic individuals with 2 S.E. versus age class
midpoint; and
c. Proportion of hyperglycemic individuals with 2 S.E. versus age class midpoint.
PART II Submit:
1. In your EXCEL file (10 points)
a. A table of parameters for each age and gender group as outlined in PART II c.
b. 6 XY-plots: Changes in the mean Log 2-hr glucose and proportion of
hyperglycemic individuals with age for males and females (PART II d)
2. In your PART II discussion, address the following issues (15 Points):
a. Examine variations of the proportion of hyperglycemic individuals with age.
b. The overall mean Log 2-hr glucose value for any subgroup is related to the mean
levels for "normal" and hyperglycemic individuals by a simple relationship:
$$\mu_{\text{overall}} = \alpha \, \mu_1 + (1 - \alpha) \, \mu_2$$
Using this model, comment on causes of change with age (pooled sexes) of the
overall mean for Log 2-hr glucose.
c. Apply appropriate statistical tests to test the following:
i. The mean Log 2-hr glucose value for Pima males age 15-24 is equal to 2.00. Apply
the test to determine if females 15-24 have the same value.
ii. Whether the mean values for "normal" Pima age 45-54 are equal for males and
females.
iii. Whether the proportions of hyperglycemic individuals for Pima age 45-54 are the
same for males and females.
d. Comment on the pattern of variation of Log 2-hr glucose with age for each
gender.
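The relationship in item 2b and the comparisons in item 2c can be sketched numerically as below. The handout does not name the tests, so the choices here are assumptions: a one-sample t statistic for i, a Welch two-sample t statistic for ii, and a pooled two-proportion z test for iii. All function names are hypothetical, `alpha` stands for the proportion of "normal" individuals, and the numbers in the example are invented. The t statistics still have to be compared with a t table at the stated degrees of freedom.

```python
import math
from statistics import NormalDist

def overall_mean(alpha, mu1, mu2):
    """Item 2b: overall mean as a weighted average of the two components."""
    return alpha * mu1 + (1 - alpha) * mu2

def one_sample_t(mean, se, mu0):
    """t statistic for H0: population mean equals mu0 (df = n - 1)."""
    return (mean - mu0) / se

def welch_t(mean_a, se_a, n_a, mean_b, se_b, n_b):
    """Welch t statistic and approximate df from two means and their S.E."""
    var_sum = se_a ** 2 + se_b ** 2
    t = (mean_a - mean_b) / math.sqrt(var_sum)
    df = var_sum ** 2 / (se_a ** 4 / (n_a - 1) + se_b ** 4 / (n_b - 1))
    return t, df

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled z statistic and two-sided p-value for H0: equal proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented summary numbers, for illustration only.
t1 = one_sample_t(mean=2.05, se=0.025, mu0=2.00)
t2, df2 = welch_t(2.1, 0.05, 11, 2.0, 0.05, 11)
z3, p3 = two_proportion_z(50, 100, 30, 100)
print(t1, t2, df2, z3, p3)
```

Because `overall_mean` is a weighted average, the overall mean can rise with age either because a component mean shifts or because the hyperglycemic proportion grows; the plots from Part II d help separate the two effects.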
PART III: REGRESSION AND CORRELATION
The purpose of this PART III is to apply correlation analysis to the identification of risk
factors for diabetes in the Pima population. In the analysis that follows, you will need to
calculate a body mass index (BMI) for each individual in the sample data set, PIMA.
BMI is simply the weight of the individual (X7, in kg) divided by the square of the height
in metres; because X6 is recorded in cm, convert it to metres first.
Required Work: PART III a. Sort the PIMA file and generate four data sets. Each data
set should have the format:
V1, V2, V3, V4
where V1 is Log 2-hr glucose, V2 is age, V3 is Log fasting glucose, and V4 is Log BMI
(Log is log to the base 10).
Constraints on the data sets are:
Data 1: males, BMI <= 30 kg/m^2, V1 <= 2.3 (N, N)
Data 2: males, BMI <= 30 kg/m^2, V1 > 2.3 (N, H)
Data 3: males, BMI > 30 kg/m^2, V1 <= 2.3 (O, N)
Data 4: males, BMI > 30 kg/m^2, V1 > 2.3 (O, H)
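The BMI calculation and the four-way split can be sketched as follows. This is an illustration under assumptions: `bmi`, `build_row`, and `data_set_number` are hypothetical helper names, records are represented as dictionaries keyed by the field names X1 to X7, and the two example records are invented.

```python
import math

def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2; height is converted from cm to m."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

def build_row(rec):
    """V1..V4 for one record: Log 2-hr glucose, age, Log fasting glucose, Log BMI."""
    return {
        "V1": math.log10(rec["X3"]),
        "V2": rec["X5"],
        "V3": math.log10(rec["X2"]),
        "V4": math.log10(bmi(rec["X7"], rec["X6"])),
    }

def data_set_number(rec, bmi_limit=30.0, v1_limit=2.3):
    """Return 1..4 according to the BMI and V1 constraints listed above."""
    over_bmi = bmi(rec["X7"], rec["X6"]) > bmi_limit
    over_v1 = math.log10(rec["X3"]) > v1_limit
    return 1 + (2 if over_bmi else 0) + (1 if over_v1 else 0)

# Invented male records (X4 = 1), NOT values from PIMA.XLS.
rec_a = {"X2": 90, "X3": 100, "X4": 1, "X5": 30, "X6": 180, "X7": 81}
rec_b = {"X2": 140, "X3": 250, "X4": 1, "X5": 50, "X6": 170, "X7": 100}

males = [r for r in (rec_a, rec_b) if r["X4"] == 1]
assigned = [(data_set_number(r), build_row(r)) for r in males]
print(assigned)
```

Grouping the rows by the returned number reproduces Data 1 to Data 4.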
PART III b. Use EXCEL to find the correlations among V1, V2, V3, and V4 for all 4
data sets indicated above.
PART III c. Use EXCEL to produce scattergrams (x-y plots with points only) of V1
versus V2 for all four data sets.
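For checking the spreadsheet output, the correlation matrix and a linear trendline can be computed by hand. The sketch below assumes Pearson's product-moment correlation and an ordinary least-squares line with V2 (age) on the x-axis; `pearson_r`, `correlation_matrix`, and `trendline` are hypothetical names, and the columns are invented toy values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}

def trendline(xs, ys):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented columns for one data set (NOT values from PIMA.XLS).
columns = {
    "V1": [2.00, 2.05, 2.10, 2.20],
    "V2": [20, 30, 40, 50],
    "V3": [1.95, 1.97, 2.00, 1.98],
    "V4": [1.38, 1.40, 1.41, 1.45],
}
matrix = correlation_matrix(columns)
slope, intercept = trendline(columns["V2"], columns["V1"])
print(matrix["V1"]["V2"], slope, intercept)
```

The matrix is symmetric with ones on the diagonal, so only the six off-diagonal pairs carry information for each data set.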
PART III Submit:
1. In your EXCEL file (10 points):
a. A summary table with four 4x4 correlation matrices, one for each of the data sets;
b. Four scattergrams with trendlines of V1 versus V2.
2. A discussion of risk factors for diabetes that addresses at least the following issues (15
points):
a. Possible reasons for statistically significant correlation coefficients among the
four variables in the four data sets studied.
b. By developing a graphical representation of the relation between the BMI and
the proportion of the population that is hyperglycemic (i.e. presumptively
diabetic), determine the extent of risk imposed by obesity on incidence of
diabetes.