Integrating grouped and ungrouped data: the point process case
Irma Hernández Magallanes
Institute of Applied Mathematics and Computer Science (IAMCS)
Texas A&M University
Grouped data is a topic that goes back at least to the end of the nineteenth century.
Kulldorff (1968) refers to grouping as a special case of a more general procedure called partial
grouping. A partially grouped sample refers to the case where the available information is
associated with a collection of disjoint sets partitioning a domain: the sample space is divided
into non-overlapping sets, and in some of these sets only the counts of observations are recorded
(grouped data), while the individual values of the observations falling in the other sets are
recorded (ungrouped data).
This work is motivated by an interest in modeling the probability of wildfire ignition in
1990 in the continental United States. The data cover fires that occurred on federal and non-federal lands. The federal data consisted of each fire's point location (latitude and longitude),
while the non-federal fires were aggregated by county.
Wildfire occurrences can be considered a spatial point process. For example,
Brillinger, Preisler and Benoit (2003) approximate a point process by a binary process. We
propose integrating the two levels of aggregate data, points and counts, by modeling the fires as a
binary-valued process on space. The sample space is partitioned into small pixels arranged in a
regular two-dimensional grid. Each pixel either has a fire or not.
Under the assumption that the wildfire rate is a smoothly varying function of space, we
propose a spatial smoothing method for partially grouped data. This smoother is based on locally
weighted likelihood analysis, using a binary-valued process to approximate the partially grouped
data. Based on the binary-valued approximation, a logit model is fit with the National Fire
Danger Rating System fuel model as an explanatory variable. The estimated probabilities are
presented in a map with the associated uncertainty levels.
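As a rough illustration of the locally weighted likelihood idea described above, the sketch below maximizes, at a given target location, a kernel-weighted Bernoulli log-likelihood with a logit link over a grid of pixel fire indicators. It covers only the ungrouped (pixel-level) part of the problem; the treatment of county-aggregated counts is not reproduced, and all function and variable names are illustrative.

```python
# Minimal sketch of a locally weighted likelihood logit smoother on a pixel grid.
# Only the ungrouped (point-level) part of the method is illustrated here.
import numpy as np
from scipy.optimize import minimize

def local_logit_probability(target, centers, y, X, bandwidth):
    """Estimate P(fire) at `target` by maximizing a kernel-weighted
    Bernoulli log-likelihood with a logit link.

    target    : (2,) array, location at which to estimate the probability
    centers   : (n, 2) array of pixel-center coordinates
    y         : (n,) array of 0/1 fire indicators per pixel
    X         : (n, p) array of covariates (e.g. fuel-model indicators)
    bandwidth : Gaussian kernel bandwidth controlling the amount of smoothing
    """
    # Gaussian kernel weights based on distance to the target location
    d2 = np.sum((centers - target) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)

    Xd = np.column_stack([np.ones(len(y)), X])  # add an intercept column

    def neg_loglik(beta):
        eta = Xd @ beta
        # kernel-weighted Bernoulli log-likelihood with logit link
        return -np.sum(w * (y * eta - np.logaddexp(0.0, eta)))

    fit = minimize(neg_loglik, np.zeros(Xd.shape[1]), method="BFGS")

    # predicted probability at the target; as a stand-in for the target's own
    # covariates we use those of the nearest pixel
    x_target = Xd[np.argmin(d2)]
    return 1.0 / (1.0 + np.exp(-x_target @ fit.x))
```

Repeating this over all grid locations yields the smoothed probability surface; the bandwidth plays the role of the smoothing parameter.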
Coupling Multiple Hypothesis Testing with Proportion Estimation in
Heterogeneous Categorical Sensor Signal Networks
Christopher Calderon
Numerica Corporation
This is joint work with A. Jones, S. Lundberg, and R. Paffenroth
False alarms generated by sensors pose a substantial problem to a variety of fusion
applications. We focus on situations where the frequency of a genuine alarm is “rare” but the
false alarm rate is relatively high. The work is motivated by chemical and biological threat
detection applications. The goal is to mitigate the false alarms while retaining high power to
detect true events (missing a true signal is considered much more detrimental than declaring a
false alarm in applications of interest). Furthermore, we would like to “fuse information” by
utilizing a multiple testing framework. Problems facing our application include: 1) the frequency
of a genuine rare attack is not easy to quantify; 2) the misclassification rates are often unknown
(or are not accurately described by nominal false alarm rates); 3) the statistical properties differ
substantially from sensor to sensor.
We propose to utilize data streams contaminated by false alarms (generated in the field) to
compute statistics on sensor misclassification rates. The nominal misclassification rate of a
deployed sensor is often suspect because it is unlikely that these estimates were tuned to the
specific environmental conditions in which the sensor was deployed (i.e. sensor performance can
have nontrivial spatial and temporal effects). Recent categorical measurement error methods will
be applied to the collection of data streams to “train” the sensors and provide point estimates
along with confidence intervals for the misclassification rates and the estimated prevalence. Open
questions still remain as to how to best combine these estimated signals to make a decision about
the presence of a chemical or biological threat. There are also questions on how to efficiently
assess/detect changes in population parameters statistically. Directions explored to date include
false discovery rate methods that aim to roughly incorporate correlation effects into the computed
false discovery rate statistics via “empirical nulls”. We have also started preliminary work
investigating resampling based approaches applied to “dimension reduced” sensor output with the
hope that a more precise estimate of the correlation between the reduced dimensions can be
empirically obtained and utilized in testing/decision making.
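As a simple illustration of the empirical-null idea mentioned above (in the spirit of Efron's empirical null, not necessarily the authors' exact procedure), the sketch below estimates the null center and scale robustly from the bulk of the sensor z-scores and then applies the Benjamini-Hochberg step-up procedure to the resulting p-values. All names are illustrative.

```python
# Sketch: false discovery rate control under an "empirical null" estimated
# from the bulk of the sensor z-scores.
import numpy as np
from scipy.stats import norm

def empirical_null_bh(z, alpha=0.05):
    """Return a boolean array of discoveries among the z-scores `z`.

    The null mean and scale are estimated robustly (median and MAD), so that a
    small fraction of true signals does not distort the null. Two-sided
    p-values under this empirical null are then passed to Benjamini-Hochberg.
    """
    z = np.asarray(z, dtype=float)
    mu0 = np.median(z)
    sigma0 = 1.4826 * np.median(np.abs(z - mu0))   # MAD rescaled to a std dev

    p = 2.0 * norm.sf(np.abs(z - mu0) / sigma0)     # two-sided p-values

    # Benjamini-Hochberg step-up procedure
    n = len(p)
    order = np.argsort(p)
    passes = p[order] <= alpha * np.arange(1, n + 1) / n
    discoveries = np.zeros(n, dtype=bool)
    if passes.any():
        k = np.max(np.nonzero(passes)[0])           # largest passing index
        discoveries[order[: k + 1]] = True
    return discoveries
```

The robust null estimate is one crude way to absorb correlation and miscalibration effects; the resampling-based refinements mentioned above are not reproduced here.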
MIXED-EFFECTS MODELS FOR MODELING CARDIAC
FUNCTIONS AND TESTING TREATMENT EFFECTS
Maiying Kong and Hyejeong Jang
Department of Bioinformatics and Biostatistics, SPHIS,
University of Louisville, Louisville, KY, 40202
The mixed-effects model is an efficient tool for analyzing longitudinal data. The random
effects in mixed models can be used to capture the correlations between repeated measurements
within a subject. The time points need not be fixed, and all available data can be used in a
mixed-effects model, provided the data are missing at random. For this reason, we focus on applying
mixed-effects models to repeated measurements of different aspects of cardiac function, such as heart rate,
left ventricular developed blood pressure, and coronary blood flow, in glutathione S-transferase (GSTP) gene knockout and wild-type mice subjected to ischemia/reperfusion injury.
Each aspect of the cardiac function consists of measurements from three time periods:
preischemic, ischemic, and reperfusion periods. We develop piecewise nonlinear functions to
describe the different aspects of cardiac function. We apply nonlinear mixed-effects models
and a change-point model to examine cardiac function following ischemia/reperfusion
injury and to compare group differences.
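A heavily simplified sketch of this modeling structure is given below: a piecewise-linear mixed-effects model with knots at the ischemia and reperfusion onsets, a random intercept per mouse, and genotype interactions for the treatment comparison. The nonlinear mixed-effects and change-point models described above are richer than this; column names and knot times are hypothetical.

```python
# Simplified sketch: piecewise-linear mixed-effects model with a random
# intercept per mouse and genotype-by-period interactions. This linear-spline
# version only illustrates the structure; column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_piecewise_mixed(df, t_ischemia, t_reperfusion):
    """df must contain columns: 'response' (e.g. heart rate), 'time',
    'mouse' (subject id), and 'genotype' (GSTP knockout vs wild type)."""
    df = df.copy()
    # truncated-line (linear spline) basis with knots at the period changes
    df["time_isch"] = np.clip(df["time"] - t_ischemia, 0, None)
    df["time_reper"] = np.clip(df["time"] - t_reperfusion, 0, None)

    model = smf.mixedlm(
        "response ~ (time + time_isch + time_reper) * genotype",
        data=df,
        groups=df["mouse"],   # random intercept for each mouse
    )
    return model.fit()

# result = fit_piecewise_mixed(cardiac_df, t_ischemia=20.0, t_reperfusion=50.0)
# print(result.summary())  # genotype interactions estimate treatment effects
```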
Fast and Accurate Inference for the Smoothing Parameter in
Semiparametric Models
Alex Trindade
Mathematics and Statistics Department
Texas Tech University
A fast and accurate method of confidence interval construction for the smoothing
parameter in penalized spline and partially linear models is proposed. The method is akin to a
parametric percentile bootstrap where Monte Carlo simulation is replaced by a saddlepoint
approximation, and can therefore be viewed as an approximate bootstrap. It is applicable in a
quite general setting, requiring only that the underlying estimator be the root of an estimating
equation that is a quadratic form in normal random variables. This is the case under a variety of
optimality criteria such as those commonly denoted by ML, REML, GCV, and AIC. Simulation
studies reveal that under the ML and REML criteria, the method delivers near-exact
performance with computational speeds that are an order of magnitude faster than existing exact
methods, and two orders of magnitude faster than a classical bootstrap. Perhaps most importantly,
the proposed method also offers a computationally feasible alternative when no known exact or
asymptotic methods exist, e.g., GCV and AIC. The methodology is illustrated with an application
to the well-known fossil data. Giving a range of plausible smooths in this instance
can help answer questions about the statistical significance of apparent features in the data.
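The key computational ingredient is a saddlepoint approximation to the distribution of a quadratic form in normal random variables. The generic sketch below implements the Lugannani-Rice tail approximation for Q = Σ_i λ_i Z_i², assuming all λ_i > 0; it is not the authors' code, and the bracketing constants are illustrative.

```python
# Sketch: saddlepoint (Lugannani-Rice) approximation to the upper tail of a
# quadratic form Q = sum_i lam[i] * Z_i**2 in independent standard normals.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def saddlepoint_tail(q, lam):
    """Approximate P(Q > q) for Q = sum(lam[i] * Z_i**2), all lam[i] > 0."""
    lam = np.asarray(lam, dtype=float)

    K = lambda t: -0.5 * np.sum(np.log1p(-2.0 * t * lam))               # CGF
    K1 = lambda t: np.sum(lam / (1.0 - 2.0 * t * lam))                   # K'
    K2 = lambda t: np.sum(2.0 * lam ** 2 / (1.0 - 2.0 * t * lam) ** 2)   # K''

    mean = np.sum(lam)
    if abs(q - mean) < 1e-8 * mean:
        return 0.5  # saddlepoint at zero; tail probability is about one half

    # solve K'(t) = q inside the CGF's domain, t < 1/(2 * max(lam))
    upper = 1.0 / (2.0 * lam.max()) - 1e-8
    t_hat = brentq(lambda t: K1(t) - q, -1e6, upper)

    w = np.sign(t_hat) * np.sqrt(2.0 * (t_hat * q - K(t_hat)))
    v = t_hat * np.sqrt(K2(t_hat))
    return norm.sf(w) + norm.pdf(w) * (1.0 / v - 1.0 / w)

# Example check against Monte Carlo:
# lam = np.array([0.5, 1.0, 2.0]); q = 6.0
# z = np.random.randn(200000, 3)
# print(saddlepoint_tail(q, lam), np.mean((z**2 * lam).sum(axis=1) > q))
```

Inverting such tail probabilities in the estimating-equation criterion is what replaces the Monte Carlo step of a percentile bootstrap.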
Confidence Limits for Lognormal Percentiles and for Lognormal Mean
based on Samples with Multiple Detection Limits
K. Krishnamoorthy
Department of Mathematics, University of Louisiana at Lafayette
Lafayette, LA 70508-1010, USA
The problem of assessing occupational exposure using the mean or an upper percentile of
a lognormal distribution is addressed. Inferential methods for constructing an upper confidence
limit for an upper percentile of a lognormal distribution and for finding confidence intervals for a
lognormal mean based on samples with multiple detection limits are proposed. The proposed
methods are based on the maximum likelihood estimates. They perform well with respect to
coverage probabilities as well as power, and are satisfactory for small samples. The proposed
approaches are also applicable for finding confidence limits for the percentiles of a gamma
distribution. An advantage of the proposed approach is the ease of computation and
implementation. An illustrative example with real data sets is given.
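As an illustration of the maximum likelihood step underlying these methods, the sketch below maximizes a left-censored lognormal likelihood in which each non-detect contributes a Φ((log DL − μ)/σ) term, and then forms a point estimate of an upper percentile. The confidence-limit construction itself is not reproduced; function and variable names are illustrative.

```python
# Sketch: maximum likelihood estimation for lognormal data with multiple
# detection limits (left-censored observations), plus a point estimate of an
# upper percentile. Only the MLE step is illustrated here.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def lognormal_mle_censored(detects, detection_limits):
    """detects          : observed (uncensored) measurements
       detection_limits : detection limit for each non-detect observation"""
    y = np.log(np.asarray(detects, dtype=float))
    d = np.log(np.asarray(detection_limits, dtype=float))

    def neg_loglik(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)
        # uncensored part: lognormal log-density, written on the log scale
        ll = np.sum(norm.logpdf(y, mu, sigma) - y)
        # censored part: probability of falling below the detection limit
        ll += np.sum(norm.logcdf((d - mu) / sigma))
        return -ll

    start = np.array([np.mean(y), np.log(np.std(y) + 1e-6)])
    fit = minimize(neg_loglik, start, method="Nelder-Mead")
    return fit.x[0], np.exp(fit.x[1])   # mu_hat, sigma_hat

# Point estimate of the 95th percentile: exp(mu + z_0.95 * sigma)
# mu_hat, sigma_hat = lognormal_mle_censored(detects, dls)
# p95 = np.exp(mu_hat + norm.ppf(0.95) * sigma_hat)
```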
Finitely Inflated Poisson Distribution
Santanu Chakraborty
University of Texas - Pan American
Zero Inflated Poisson and Zero Inflated Negative Binomial distributions are well known
in the literature. They are used to model count data sets which have more zeros than usual
Poisson or Negative Binomial datasets. So, the corresponding probability distributions have
inflated mass at zero and deflated masses at the nonzero points. However, there are also count
data sets in the literature with not only more zeros than a usual Poisson or Negative
Binomial, but possibly more 1s, 2s, and 3s as well. In that case, it makes more sense to inflate the original
Poisson or Negative Binomial at 0, 1, 2 and 3. From this consideration, we introduce Finitely
Inflated Poisson and Finitely Inflated Negative Binomial distributions into the literature. In this
talk, we discuss the Finitely Inflated Poisson (FIP) distribution. An FIP distribution with inflations
at the points 0, 1, …, k is defined as follows:
\[
P(X = x) \;=\; \sum_{i=0}^{k} \pi_i \, I_{\{x = i\}} \;+\; \Bigl(1 - \sum_{i=0}^{k} \pi_i\Bigr)\, \frac{e^{-\lambda}\,\lambda^{x}}{x!},
\qquad x = 0, 1, 2, \ldots,
\]
where π_i is the inflator at i for i = 0, 1, …, k, and λ is the Poisson rate. We talk about moments,
moment generating function, convolutions of independent identically distributed FIP distributions
and some parametric and Bayesian inferential issues for this distribution.
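A direct transcription of the FIP probability mass function above into code, together with a simple mixture sampler, is sketched below; parameter names are illustrative, and the inflators are assumed nonnegative with sum at most one.

```python
# Sketch: probability mass function and sampler for the Finitely Inflated
# Poisson (FIP) distribution defined above, with inflators pi_0, ..., pi_k.
import numpy as np
from scipy.stats import poisson

def fip_pmf(x, pi, lam):
    """P(X = x) = pi_x * [x <= k] + (1 - sum(pi)) * Poisson(lam).pmf(x)."""
    pi = np.asarray(pi, dtype=float)          # pi[i] is the inflator at i
    inflation = pi[x] if x < len(pi) else 0.0
    return inflation + (1.0 - pi.sum()) * poisson.pmf(x, lam)

def fip_rvs(size, pi, lam, rng=None):
    """Draw samples: with probability pi_i return i, otherwise draw Poisson."""
    rng = np.random.default_rng(rng)
    pi = np.asarray(pi, dtype=float)
    k = len(pi) - 1
    # component -1 denotes the Poisson part of the mixture
    comps = rng.choice(np.arange(-1, k + 1),
                       p=np.concatenate(([1.0 - pi.sum()], pi)), size=size)
    return np.where(comps >= 0, comps, rng.poisson(lam, size=size))

# Example: inflate 0, 1, and 2, then check the pmf sums to (approximately) one
# pi = [0.2, 0.1, 0.05]; lam = 3.0
# print(sum(fip_pmf(x, pi, lam) for x in range(100)))
```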