BST 140.752 Problem Set 1
Due on Thursday, 4 November 2010

With the following two computing exercises, familiarize yourself with R, the software that will be used throughout the 750s. Start by reading through the Introduction to R materials on the course website. An R toolbox containing commands that might be helpful can be found at the end of this homework. Also make good use of the R help files when learning new commands. Document and submit ALL R code that you used for this homework.

Problem 1

A study investigated the occurrence of natural (spontaneous) mutation in wild-type E. coli in the presence of an antibiotic, novobiocin. The dataset contains three variables: (1) Colony, the number of ampicillin-resistant mutant colonies; (2) Conc, the concentration of novobiocin (μg/ml); (3) Media, the type of media used for bacterial growth. The experiment involved two media preparations (LB and M9), 5 concentrations of novobiocin, and 100 replicates for each media-concentration combination. TNTC (too numerous to count) was recorded when the number of colonies exceeded 300.

a. Remove all TNTC entries, assuming they are most likely due to experimental errors.
b. Provide side-by-side boxplots (example below) of Colony for the 10 groups.
c. Provide stem-and-leaf plots of Colony with (1) < 200 (μg/ml) of novobiocin; (2) 200 and 300 (μg/ml) of novobiocin; (3) 200 (μg/ml) of novobiocin with LB media.
d. Calculate the mean, median, and sample variance for the 10 groups.
e. Plot mean versus variance and comment on the relation. Repeat by plotting mean versus log(variance) and comment.
f. Assuming the number of colonies follows a Poisson distribution, transform Colony by square root (variance-stabilizing, Rice page 320) and re-plot b and e.
g. Calculate the % of TNTC in each group.
h. Calculate the trimmed mean for the 10 groups using percentiles from g as the upper and lower fraction trimmed.
[Figure: example side-by-side boxplots of Mutant Colonies versus Novobiocin Concentration, with one panel for LB and one for M9]

Problem 2

A gamma distribution Gamma(α, β) has density $f(x) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta}$ for x > 0. The distribution has mean αβ and variance αβ².

a. Simulate 1000 realizations from Gamma(3, 5).
b. Plot the histogram and propose a transformation such that your sample is approximately normal.
c. For the untransformed sample, plot the trimmed and Winsorized means versus various percentiles (see below for an example).

[Figure: example plot of Means versus Lower Percent Trimmed/Winsorized, showing the Trimmed Mean, the Winsorized Mean, and the Sample Median]

Problem 3

Assume that Pi is a projection matrix (i = 1, 2) and P1 − P2 is positive semi-definite. Show that
(a) P1P2 = P2P1 = P2.
(b) P1 − P2 is a projection matrix.
(For help, see Ingo’s notes or any linear models text.)

Problem 4

Assume that an act of intercourse between an HIV-infected person and a non-infected person has a 1/500 probability of spreading the infection.

a. How many acts of intercourse would an uninfected person have to have with an infected person or persons to have a 50% probability of obtaining an infection? State the assumptions of your calculations.
b. A related problem: How many people have to be in a room to have a 50% probability of at least two people having the same birthday? State the assumptions of your calculations.
c. Another related problem: I have 200 songs on my mp3 player. The documentation claims that its shuffle option samples randomly with replacement from the list of songs. How many songs do I have to listen to in order to have a 50% probability of hearing at least one song twice?

Problem 5

Given below are the sexes of the children of 7,745 families, each with 4 children, recorded in the archives of the Genealogical Society of the Church of Jesus Christ of Latter Day Saints in Salt Lake City, Utah. M indicates a male child and F indicates a female child.
Sequence  Freq    Sequence  Freq
MMMM      537     MFFM      526
MMMF      549     FMFM      498
MMFM      514     FFMM      490
MFMM      523     MFFF      429
FMMM      467     FMFF      451
MMFF      497     FFMF      456
MFMF      486     FFFM      441
FMMF      473     FFFF      408

a. Estimate the probability distribution of the number of male children, say X, in these families using these data by calculating proportions.
b. Find the expected value of X.
c. Find the variance of X.
d. Find the probability distribution of p̂, where p̂ is the proportion of children in each family who are male. Find the expected value of p̂ and the variance of p̂.
e. Find the conditional probability that the second child is male given that the first child is male.

Problem 6

A number of clinical characteristics were ascertained in a large group of individuals with insulin-dependent diabetes mellitus (IDDM). Suppose that the percentage of ideal body weight in this group of patients is normally distributed with mean 110% and standard deviation 13%. Answer the questions below using both a normal table and R (or the equivalent).

a. What proportion of individuals with IDDM are above their ideal body weight (i.e., above 100% of the ideal body weight)?
b. What proportion of individuals with IDDM have a normal body weight (within 10% of their ideal body weight)?

Problem 7

The exponential distribution has density f(x) = c exp{−βx} for x ≥ 0 and some constant c.

a. Calculate the constant c as a function of β.
b. Find the associated distribution function for the exponential density.
c. Find the expected value of an exponentially distributed random variable.
d. Find the variance of an exponentially distributed random variable. What is the relation between the mean and the variance?
e. Analytically, find the probability that an exponential random variable is between a and b, where 0 ≤ a < b.
R Toolbox: Commands you might find helpful

Working with data:
read.table( …, sep = , header = )
as.factor( )
as.numeric( )
as.character( )
seq( from = , to = , by = )
rep( …, each = , times = )
interaction( )
tapply( )
sort( )

R graphics:
plot( x = , y = , xlim = , ylim = , xlab = , ylab = , main = )
abline( v = , h = )
boxplot( )
stem( )
hist( )
legend( )
text( )

Others:
rnorm( )
rgamma( )
for ( ) { }
function ( ) { }
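
As a rough illustration of how several of the toolbox commands fit together, here is a minimal sketch on made-up data. The data frame toy, its columns, the group labels, and the Poisson rate are all hypothetical and are not the homework dataset. The sketch is meant only to show the calling conventions, not a solution to any problem above.

```r
# Toy data only: the labels, doses, and counts below are made up
# for illustration and are NOT the homework dataset.
set.seed(1)

# rep() and seq() build a balanced two-factor design
media <- rep(c("A", "B"), each = 6)                        # 2 hypothetical media
dose  <- rep(seq(from = 0, to = 100, by = 50), times = 4)  # 3 hypothetical doses
count <- rpois(n = length(media), lambda = 20)             # simulated counts

toy <- data.frame(media = as.factor(media), dose = dose, count = count)

# interaction() crosses two factors into a single grouping factor
toy$group <- interaction(toy$media, toy$dose)

# tapply() applies a summary function within each group
group_means <- tapply(toy$count, toy$group, mean)
group_vars  <- tapply(toy$count, toy$group, var)

# mean() has a trim argument: the fraction removed from EACH tail
trimmed <- mean(toy$count, trim = 0.1)

# Basic graphics: side-by-side boxplots and a mean-versus-variance plot
boxplot(count ~ group, data = toy, xlab = "Group", ylab = "Count")
plot(x = as.numeric(group_means), y = as.numeric(group_vars),
     xlab = "Mean", ylab = "Variance")
abline(h = mean(group_vars))
```

Typing ?tapply or ?boxplot at the R prompt opens the help file for a command, which lists all of its arguments.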