BST 140.752
Problem Set 1
Due on Thursday, 4 November 2010
With the following two computing exercises, familiarize yourself with R, the software that will be
used throughout the 750s. Start by reading through the Introduction to R materials on the course
website. An R toolbox containing commands that might be helpful can be found at the end of this
homework. Also make good use of the R help files when learning new commands. Document and
submit all R code that you used for this homework.
Problem 1 A study investigated the occurrence of natural (spontaneous) mutation in wild-type E.
coli in the presence of an antibiotic, novobiocin. The dataset contains three variables: (1) Colony,
the number of ampicillin-resistant mutant colonies; (2) Conc, the concentration of novobiocin
(μg/ml); (3) Media, the type of media used for bacterial growth. The experiment involved two
media preparations (LB and M9), 5 concentrations of novobiocin, and 100 replicates for each
media-concentration combination. TNTC (too numerous to count) was recorded when the
number of colonies exceeded 300.
a. Remove all TNTC entries, assuming they are most likely due to experimental errors.
b. Provide side-by-side boxplots (example below) of Colony of the 10 groups.
c. Provide stem-and-leaf plots of Colony with (1) < 200 (μg/ml) of novobiocin; (2) 200
and 300 (μg/ml) of novobiocin; (3) 200 (μg/ml) of novobiocin with LB media.
d. Calculate the mean, median, and sample variance for the 10 groups.
e. Plot mean versus variance and comment on the relation. Repeat by plotting mean
versus log(variance) and comment.
f. Assuming the number of colonies follows a Poisson distribution, transform Colony by
square-root (variance-stabilizing, Rice page 320) and re-plot b and e.
g. Calculate the % of TNTC in each group.
h. Calculate the trimmed mean for the 10 groups using the percentages from g as the upper
and lower fraction trimmed.
[Figure: example side-by-side boxplots of Mutant Colonies (y-axis) versus Novobiocin Concentration (x-axis, ticks at 0, 100, 200, 250, 300), with one panel for each medium, LB and M9.]
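The steps in Problem 1 can be sketched in R as follows. This is only an illustration on simulated stand-in data: the data frame `dat`, the Poisson rates, and the 30 injected TNTC entries are all hypothetical, and the concentration levels are an assumption read off the example figure's axis. The real dataset would be loaded with read.table instead.

```r
# Hypothetical stand-in data; replace with read.table(...) on the real file.
set.seed(1)
conc_levels <- c(0, 100, 200, 250, 300)   # assumed from the example figure
dat <- expand.grid(replicate = 1:100, Conc = conc_levels,
                   Media = c("LB", "M9"), stringsAsFactors = FALSE)
dat$Colony <- as.character(rpois(nrow(dat), lambda = 20 + dat$Conc / 5))
dat$Colony[sample(nrow(dat), 30)] <- "TNTC"   # inject some TNTC entries

# Part g: percentage of TNTC in each media-concentration group
group <- interaction(dat$Media, dat$Conc)
pct_tntc <- tapply(dat$Colony == "TNTC", group, mean) * 100

# Part a: drop TNTC rows, then convert the counts to numeric
clean <- dat[dat$Colony != "TNTC", ]
clean$Colony <- as.numeric(clean$Colony)
clean_group <- interaction(clean$Media, clean$Conc)

# Part d: mean, median, and sample variance per group
grp_mean <- tapply(clean$Colony, clean_group, mean)
grp_med  <- tapply(clean$Colony, clean_group, median)
grp_var  <- tapply(clean$Colony, clean_group, var)

# Parts b and e: side-by-side boxplots and mean versus variance
boxplot(Colony ~ Media + Conc, data = clean, las = 2,
        xlab = "", ylab = "Mutant Colonies")
plot(as.numeric(grp_mean), as.numeric(grp_var),
     xlab = "Group mean", ylab = "Group variance")
```

Computing the TNTC percentages before removing those rows keeps the denominators at the full 100 replicates per group.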
Problem 2 A gamma distribution Gamma(α, β) has density $f(x) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta}$ for x > 0. The
distribution has mean αβ and variance αβ².
a. Simulate 1000 realizations from Gamma (3, 5).
b. Plot the histogram and propose a transformation such that your sample is
approximately normal.
c. For the untransformed sample, plot the trimmed and Winsorized means versus
various percentiles (see below for an example).
[Figure: example plot of Means (y-axis) versus Lower Percent Trimmed/Winsorized (x-axis, 0.1 to 0.5), showing the Winsorized Mean and Trimmed Mean curves together with the Sample Median.]
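A plot like the example for part c could be produced along these lines. This is a minimal sketch, assuming β is a scale parameter (as the exponent −x/β in the density suggests) and using a quantile-based Winsorized mean; `winsor_mean` is a hypothetical helper, not a base R function, and other Winsorizing conventions exist.

```r
set.seed(140)
# Assumption: beta is the scale parameter, so Gamma(3, 5) has mean 15.
x <- rgamma(1000, shape = 3, scale = 5)

# Hypothetical helper: clamp values below the p quantile and above the
# (1 - p) quantile to those quantiles, then average.
winsor_mean <- function(x, p) {
  lims <- quantile(x, c(p, 1 - p))
  mean(pmin(pmax(x, lims[1]), lims[2]))
}

p <- seq(from = 0, to = 0.49, by = 0.01)
trimmed    <- sapply(p, function(q) mean(x, trim = q))
winsorized <- sapply(p, function(q) winsor_mean(x, q))

plot(p, trimmed, type = "l",
     ylim = range(c(trimmed, winsorized, median(x))),
     xlab = "Lower Percent Trimmed/Winsorized", ylab = "Means")
lines(p, winsorized, lty = 2)
abline(h = median(x), lty = 3)
legend("topright", lty = 1:3,
       legend = c("Trimmed Mean", "Winsorized Mean", "Sample Median"))
```

The trim argument of mean removes the same fraction from each end of the sorted sample, so at p = 0 both curves start at the ordinary sample mean.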
Problem 3 Assume that P1 and P2 are projection matrices and that P1 - P2 is positive semi-definite.
Show that
(a) P1P2 = P2P1 = P2.
(b) P1 - P2 is a projection matrix.
(For help, see Ingo’s notes or any linear models text.)
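Before attempting the proof, the claims can be checked numerically in one convenient special case. The sketch below assumes orthogonal (symmetric, idempotent) projections onto nested column spaces, which is one setting where P1 - P2 is positive semi-definite; `proj` is a hypothetical helper and the matrix X is arbitrary. A numerical check is not a proof.

```r
set.seed(2)
# Arbitrary 6 x 3 design matrix; full column rank with probability one.
X <- cbind(1, matrix(rnorm(12), nrow = 6))

# Hypothetical helper: orthogonal projection onto the column space of A.
proj <- function(A) A %*% solve(crossprod(A)) %*% t(A)

P1 <- proj(X)           # projects onto all three columns
P2 <- proj(X[, 1:2])    # projects onto a subspace of that column space
D  <- P1 - P2

all.equal(P1 %*% P2, P2)                 # claim (a), first product
all.equal(P2 %*% P1, P2)                 # claim (a), second product
all.equal(D %*% D, D)                    # claim (b): D is idempotent
min(eigen(D, symmetric = TRUE)$values)   # no eigenvalue below 0, up to rounding
```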
Problem 4 Assume that an act of intercourse between an HIV infected person and a non-infected
person has a 1/500 probability of spreading the infection.
a. How many acts of intercourse would an uninfected person have to have with an
infected person or persons to have a 50% probability of obtaining an infection? State the
assumptions of your calculations.
b. A related problem: How many people have to be in a room to have a 50% probability
of at least two people having the same birthday? State the assumptions of your
calculations.
c. Another related problem: I have 200 songs on my mp3 player. The documentation
claims that its shuffle option samples randomly with replacement from the list of songs.
How many songs do I have to listen to in order to have a 50% probability of hearing at least
one song twice?
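All three parts reduce to finding the smallest count at which a probability first reaches 50%. The sketch below shows one way to organize that search in R, illustrated with a fair six-sided die rather than the numbers in the problem. The function names are hypothetical, and the formulas assume independent trials and equally likely categories, which are exactly the kind of assumptions the problem asks you to state.

```r
# P(at least one success in n independent trials, each with probability p)
p_at_least_one <- function(n, p) 1 - (1 - p)^n

# P(at least one repeat among k independent, equally likely draws
# from m categories): one minus the probability that all k differ.
p_repeat <- function(k, m) 1 - prod((m - seq_len(k) + 1) / m)

# Smallest n for which f(n) reaches the target probability
smallest_n <- function(f, target = 0.5, max_n = 1e5) {
  n <- 1
  while (f(n) < target && n < max_n) n <- n + 1
  n
}

# Illustration with a die: rolls needed for a 50% chance of at least one six,
# and rolls needed for a 50% chance of seeing some face twice.
smallest_n(function(n) p_at_least_one(n, p = 1 / 6))
smallest_n(function(k) p_repeat(k, m = 6))
```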
Problem 5 Given below are the sexes of the children of 7,745 families, each with 4 children,
recorded in the archives of the Genealogical Society of the Church of Jesus Christ of Latter Day
Saints in Salt Lake City, Utah. M indicates a male child and F indicates a female child.
Sequence  Freq
MMMM      537
MMMF      549
MMFM      514
MFMM      523
FMMM      467
MMFF      497
MFMF      486
FMMF      473
MFFM      526
FMFM      498
FFMM      490
MFFF      429
FMFF      451
FFMF      456
FFFM      441
FFFF      408
a. Estimate the probability distribution of the number of male children, say X, in these
families using these data by calculating proportions.
b. Find the expected value of X.
c. Find the variance of X.
d. Find the probability distribution of p̂ , where p̂ is the proportion of children in each
family who are male. Find the expected value of p̂ and the variance of p̂ .
e. Find the conditional probability that the second child is Male given that the first child
is Male.
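One possible way to set up parts a through c in R is sketched below. The frequencies are copied from the table above; the object and column names (`fam`, `sequence`, `males`) are hypothetical choices, and the approach is only one of several.

```r
# Frequencies transcribed from the table in the problem statement
fam <- data.frame(
  sequence = c("MMMM", "MMMF", "MMFM", "MFMM", "FMMM", "MMFF", "MFMF", "FMMF",
               "MFFM", "FMFM", "FFMM", "MFFF", "FMFF", "FFMF", "FFFM", "FFFF"),
  freq = c(537, 549, 514, 523, 467, 497, 486, 473,
           526, 498, 490, 429, 451, 456, 441, 408),
  stringsAsFactors = FALSE
)

# Number of male children X: count the M's by deleting the F's
fam$males <- nchar(gsub("F", "", fam$sequence))

# Part a: estimated distribution of X as proportions of families
pmf <- tapply(fam$freq, fam$males, sum) / sum(fam$freq)

# Parts b and c: mean and variance computed from the estimated distribution
x_vals <- as.numeric(names(pmf))
e_x <- sum(x_vals * pmf)
v_x <- sum((x_vals - e_x)^2 * pmf)
```

Since p̂ = X/4, the quantities in part d follow from `e_x` and `v_x` by the usual rules for rescaling a random variable.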
Problem 6 A number of clinical characteristics were ascertained in a large group of individuals
with insulin-dependent diabetes mellitus (IDDM). Suppose that the percentage of ideal body
weight in this group of patients is normally distributed with mean 110% and standard
deviation 13%. Answer the questions below using both a normal table and R (or
the equivalent).
a. What proportion of individuals with IDDM are above their ideal body weight, (i.e.,
above 100% of the ideal body weight)?
b. What proportion of individuals with IDDM have a normal body weight (within 10%
of their ideal body weight)?
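The R side of this problem rests on pnorm, which returns normal cumulative probabilities. A brief sketch follows, reading "within 10%" as between 90% and 110% of ideal body weight; that reading is an assumption, and `p_between` is a hypothetical helper.

```r
mu <- 110     # mean, in percent of ideal body weight
sigma <- 13   # standard deviation

# Upper-tail probability computed directly on the original scale
p_above <- pnorm(100, mean = mu, sd = sigma, lower.tail = FALSE)

# The same probability via standardization, as with a normal table
z <- (100 - mu) / sigma
p_above_z <- 1 - pnorm(z)

# Probability of falling between two cutoffs a < b
p_between <- function(a, b) pnorm(b, mu, sigma) - pnorm(a, mu, sigma)
p_normal <- p_between(90, 110)   # assumed reading of "within 10%"
```

Standardizing first mirrors the table lookup, so agreement between `p_above` and `p_above_z` is a quick check on the hand calculation.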
Problem 7 The exponential distribution has density f(x) = c exp{−βx} for x ≥ 0 and some
constant c.
a. Calculate the constant c as a function of β.
b. Find the associated distribution function for the exponential density.
c. Find the expected value of an exponentially distributed random variable.
d. Find the variance of an exponentially distributed random variable. What is the
relation between the mean and the variance?
e. Analytically, find the probability that an exponential random variable is between
a and b where 0 ≤ a < b.
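Analytic answers to Problem 7 can be sanity-checked numerically with integrate. The sketch below fixes an arbitrary value of β and the arbitrary limits a = 0.5 and b = 1.5, all chosen only for illustration; it assumes that the rate argument of R's built-in exponential functions plays the role of β.

```r
beta <- 2   # arbitrary illustrative value

# Unnormalized density exp(-beta * x) on x >= 0
kernel <- function(x) exp(-beta * x)

# Part a: the constant c is whatever makes the density integrate to 1
area  <- integrate(kernel, lower = 0, upper = Inf)$value
c_num <- 1 / area
dens  <- function(x) c_num * kernel(x)

# Parts c and d: mean and variance by numerical integration
m <- integrate(function(x) x * dens(x), lower = 0, upper = Inf)$value
v <- integrate(function(x) (x - m)^2 * dens(x), lower = 0, upper = Inf)$value

# Part e: probability of falling between a and b, two ways
a <- 0.5
b <- 1.5
p_num   <- integrate(dens, lower = a, upper = b)$value
p_builtin <- pexp(b, rate = beta) - pexp(a, rate = beta)
```

Repeating the calculation for a few values of `beta` and comparing `m` with `v` gives a hint about the relation asked for in part d.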
R Toolbox: Commands you might find helpful
Working with data:
read.table ( …, sep = , header = )
as.factor ( )
as.numeric ( )
as.character ( )
seq ( from = , to = , by = )
rep ( …, each = , times = )
interaction ( )
tapply ( )
sort ( )
R graphics:
plot ( x = , y = , xlim = , ylim = , xlab = , ylab = , main = )
abline (v =, h = )
boxplot ( )
stem ( )
hist ( )
legend ( )
text ( )
Others:
rnorm ( )
rgamma ( )
for ( ) { }
function ( ) { }