M-Estimation with Imprecise Data
Marco Cattaneo
Department of Mathematics, University of Hull
ISIPTA ’15, Pescara, Italy
20 July 2015
precise case
data: X1 , ... , Xn ∈ X i.i.d. with (unknown) distribution PX
goal: estimating the value(s) θ0 of θ ∈ Θ that minimize(s) the loss/distance L(PX , θ)
e.g. the risk: L(PX , θ) = EPX [ρ(X , θ)]
e.g. the mean squared error: L(PX , θ) = EPX [(X − θ)²], with minimizer θ0 = EPX [X ]
ML estimate (nonparametric) of L(PX , · ): the function L(P̂X , · ) obtained by plugging
in the empirical distribution of the data P̂X
ML decision (Cattaneo, 2013): the estimate(s) θ̂0 that minimize(s)
L(P̂X , θ)  (θ̂0 : minimum distance estimator)
e.g. = (1/n) ∑ᵢ₌₁ⁿ ρ(Xi , θ)  (θ̂0 : M-estimator)
e.g. = (1/n) ∑ᵢ₌₁ⁿ (Xi − θ)²  (θ̂0 = (1/n) ∑ᵢ₌₁ⁿ Xi : least squares estimator)
asymptotic consistency: under some regularity conditions (Wolfowitz, 1957; Huber, 1964), θ̂0 → θ0 almost surely as n → ∞
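The precise case can be sketched numerically: for the squared loss ρ(x, θ) = (x − θ)², the ML decision minimizes the empirical risk, and the minimizer is the sample mean (the least squares estimator). The data values below are illustrative, not from the poster.

```python
def empirical_risk(data, theta):
    """L(P_hat_X, theta) = (1/n) * sum_i (X_i - theta)^2, the plug-in risk."""
    return sum((x - theta) ** 2 for x in data) / len(data)

def least_squares_estimate(data):
    """Closed-form minimizer of the empirical squared-error risk: the mean."""
    return sum(data) / len(data)

data = [1.2, 0.7, 2.1, 1.5, 0.9]  # illustrative precise data
theta_hat = least_squares_estimate(data)

# the sample mean dominates nearby candidate values of theta:
assert all(empirical_risk(data, theta_hat) <= empirical_risk(data, theta_hat + d)
           for d in (-0.5, -0.1, 0.1, 0.5))
```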
references
Cattaneo, M. (2013). Likelihood decision functions. Electron. J. Stat. 7, 2924–2946.
Cattaneo, M., and Wiencierz, A. (2012). Likelihood-based Imprecise Regression. Int. J. Approx. Reasoning 53, 1137–1154.
Cattaneo, M., and Wiencierz, A. (2014). On the implementation of LIR: the case of simple linear regression with interval
data. Comput. Stat. 29, 743–767.
Ferson, S., Kreinovich, V., Hajagos, J., Oberkampf, W., and Ginzburg, L. (2007). Experimental Uncertainty Estimation
and Statistics for Data Having Interval Uncertainty. Technical Report SAND2007-0939. Sandia National Laboratories.
Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35, 73–101.
Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer.
Schollmeyer, G., and Augustin, T. (2015). Statistical modeling under partial identification: Distinguishing three types of
identification regions in regression analysis with interval data. Int. J. Approx. Reasoning 56, 224–248.
Schuyler, Q., Hardesty, B. D., Wilcox, C., and Townsend, K. (2014). Global analysis of anthropogenic debris ingestion
by sea turtles. Conserv. Biol. 28, 129–139.
Utkin, L. V., and Coolen, F. P. A. (2011). Interval-valued regression and classification models in the framework of machine
learning. In ISIPTA ’11, eds. F. Coolen, G. de Cooman, T. Fetz, and M. Oberguggenberger. SIPTA, 371–380.
Wiencierz, A., and Cattaneo, M. (2015). On the validity of minimin and minimax methods for Support Vector Regression
with interval data. In ISIPTA ’15, eds. T. Augustin, S. Doria, E. Miranda, and E. Quaeghebeur. Aracne, 325–332.
Wolfowitz, J. (1957). The minimum distance method. Ann. Math. Stat. 28, 75–88.
imprecise case
data: S1 , ... , Sn ⊆ X i.i.d. with (unknown) distribution PS , such that Xi ∈ Si
• the distribution PS of the imprecise data only partially determines the distribution PX of the (unobservable) precise data: let [PS ] be the set of all distributions PX compatible with PS (in the sense
that Xi ∈ Si is possible)
• assumptions reducing [PS ] are also possible (e.g., all “beta distributions” on interval data) for the ML
decision, but not for the black-box approach to estimation (see definitions below)
black-box approach (e.g., Ferson et al., 2007): since X1 , ... , Xn are only known to lie
in S1 , ... , Sn , replace the estimate θ̂0 (X1 , ... , Xn ) with (the convex hull of) the set of
estimates
{θ̂0 (X1 , ... , Xn ) : Xi ∈ Si }
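For the least squares estimator with interval data Si = [ai , bi ], the black-box set is easy to compute: the sample mean is increasing in each Xi , so the set of estimates is the interval between the mean of the lower endpoints and the mean of the upper endpoints. A minimal sketch, with illustrative interval data:

```python
def black_box_mean(intervals):
    """Interval of least squares estimates over all selections X_i in S_i = [a_i, b_i]."""
    n = len(intervals)
    lower = sum(a for a, b in intervals) / n  # every X_i at its lower endpoint
    upper = sum(b for a, b in intervals) / n  # every X_i at its upper endpoint
    return lower, upper

intervals = [(0.5, 1.0), (1.5, 2.5), (0.0, 0.5)]  # illustrative imprecise data
lo, hi = black_box_mean(intervals)
assert lo <= hi  # a nonempty interval of estimates
```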
ML estimate (nonparametric) of L(PX , · ): usually not unique, it corresponds to the set {L(PX , · ) : PX ∈ [P̂S ]} of all functions obtained by plugging in the distributions PX compatible with the empirical distribution P̂S of the (imprecise) data
ML decision: the estimate(s) θ̂0 that minimize(s)
{L(PX , θ) : PX ∈ [P̂S ]}
e.g. = {EPX [ρ(X , θ)] : PX ∈ [P̂S ]} = co {(1/n) ∑ᵢ₌₁ⁿ ρ(Xi , θ) : Xi ∈ Si }
e.g. = {EPX [(X − θ)²] : PX ∈ [P̂S ]} = co {(1/n) ∑ᵢ₌₁ⁿ (Xi − θ)² : Xi ∈ Si }
asymptotic consistency, depending on the definition of minimum: under some regularity conditions (and possibly “smoothing corrections”),
pointwise dominance: θ̂0 → {arg minθ∈Θ L(PX , θ) : PX ∈ [PS ]} almost surely as n → ∞
• pointwise dominance (“maximality”) and black-box approach (“E-admissibility”) have the same
limit, called sharp collection region by Schollmeyer and Augustin (2015)
• e.g., set of undominated regression functions of LIR approach (Cattaneo and Wiencierz, 2012,
2014), which uses interval dominance for computational reasons
minimax: θ̂0 → arg minθ∈Θ maxPX ∈[PS ] L(PX , θ) almost surely as n → ∞
• estimate and limit are usually unique, which greatly simplifies computation, description, and interpretation of the results: see logistic regression example below
• e.g., minimax SVR estimate (Utkin and Coolen, 2011; Wiencierz and Cattaneo, 2015), or LRM
regression function of LIR approach (Cattaneo and Wiencierz, 2012, 2014)
minimin: θ̂0 → {θ ∈ Θ : L(PX , θ) = 0 for some PX ∈ [PS ]} almost surely as n → ∞
• in parametric models the limit is the identification region (Manski, 2003) of the parameter θ (when
L corresponds to a distance between distributions), called sharp marrow region by Schollmeyer and
Augustin (2015): see parametric model example below
• e.g., minimin SVR estimate (Utkin and Coolen, 2011; Wiencierz and Cattaneo, 2015)
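The minimax and minimin decisions can be sketched by grid search for the squared loss with interval data: over Xi ∈ [ai , bi ], the loss (Xi − θ)² is maximized at the endpoint farther from θ and minimized at the point of the interval nearest to θ (zero if θ lies inside). The intervals and the grid below are illustrative; the grid point returned for minimin is one element of a possibly set-valued solution.

```python
def worst_case_risk(intervals, theta):
    """Upper endpoint of the set of empirical risks: farthest point per interval."""
    return sum(max((a - theta) ** 2, (b - theta) ** 2)
               for a, b in intervals) / len(intervals)

def best_case_risk(intervals, theta):
    """Lower endpoint of the set of empirical risks: nearest point per interval."""
    return sum(0.0 if a <= theta <= b else min((a - theta) ** 2, (b - theta) ** 2)
               for a, b in intervals) / len(intervals)

intervals = [(0.5, 1.0), (1.5, 2.5), (0.0, 0.5)]  # illustrative imprecise data
grid = [i / 1000 for i in range(-1000, 3001)]     # coarse candidate values of theta
theta_minimax = min(grid, key=lambda t: worst_case_risk(intervals, t))
theta_minimin = min(grid, key=lambda t: best_case_risk(intervals, t))
```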
example: parametric model
precise data: X1 , ... , Xn ∈ X = {A, B, C } i.i.d. with (unknown) distribution PX = (pA , pB , pC )
parametric model (represented by the blue line): PX ,θ = (θ, (1−θ)/2, (1−θ)/2), i.e., pB = pC = (1−θ)/2, with θ = pA ∈ Θ = [0, 1]
loss L(PX , θ): Euclidean distance between PX and PX ,θ
empirical distribution of precise data: P̂X = (nA /n, nB /n, nC /n), where nA , nB , nC are the count data of A, B, C , respectively
ML decision with precise data: θ̂0 = nA /n
• asymptotic consistency: θ̂0 → θ almost surely as n → ∞
• θ̂0 is also the parametric ML estimator: i.e., the M-estimator with the Kullback–Leibler divergence from PX to PX ,θ as loss L(PX , θ)
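The coincidence with nA /n can be checked directly: minimizing the squared Euclidean distance between the empirical distribution and PX ,θ = (θ, (1−θ)/2, (1−θ)/2) gives θ̂0 = nA /n exactly, since the derivative of the squared distance is 3θ − 3 nA /n. A sketch with illustrative counts:

```python
def sq_distance(p_hat, theta):
    """Squared Euclidean distance between p_hat and the model point P_{X,theta}."""
    model = (theta, (1 - theta) / 2, (1 - theta) / 2)
    return sum((p - q) ** 2 for p, q in zip(p_hat, model))

n_a, n_b, n_c = 30, 50, 20          # illustrative count data
n = n_a + n_b + n_c
p_hat = (n_a / n, n_b / n, n_c / n)
theta_hat = n_a / n                  # the minimum distance estimate

# theta_hat dominates nearby candidate values of theta:
assert all(sq_distance(p_hat, theta_hat) <= sq_distance(p_hat, theta_hat + d)
           for d in (-0.1, -0.01, 0.01, 0.1))
```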
imprecise data: S1 , ... , Sn ∈ {{A}, {B}, {C }, X } i.i.d. with (unknown) distribution
PS = (qA , qB , qC , qna ) (i.e., data are either precisely observed, or missing), such that
Xi ∈ Si
• [PS ] = {PX : pj ≥ qj for all j ∈ X } is the set of all distributions PX compatible with PS =
(qA , qB , qC , qna )
• e.g., the gray area represents the set [PS ] of all distributions PX compatible with PS =
(0.1, 0.4, 0.2, 0.3)
empirical distribution of imprecise data: P̂S = (nA /n, nB /n, nC /n, nna /n), where nA , nB , nC , nna are the count data of A, B, C , and missing, respectively
ML decision with imprecise data:
pointwise dominance: θ̂0 = [nA /n, (nA + nna )/n]
• asymptotic consistency: θ̂0 → {pA : PX ∈ [PS ]} almost surely as n → ∞ (represented by the orange segment)
• θ̂0 is also the convex hull of the set of estimates {θ̂0 (X1 , ... , Xn ) : Xi ∈ Si } (black-box approach)
minimax: θ̂0 = (2/3)(nA /n) + (1/3)((1 − 2 (nB ∨ nC )/n) ∨ (nA /n))
• asymptotic consistency: θ̂0 → (2/3) qA + (1/3)(1 − 2 (qB ∨ qC )) almost surely as n → ∞ (represented by the black point)
• θ̂0 changes if the Euclidean distance L(PX , θ) between PX and PX ,θ is replaced by the Kullback–Leibler divergence from PX to PX ,θ (while this is not the case for the other two definitions of minimum)
minimin: θ̂0 = [nA /n, (1 − 2 (nB ∨ nC )/n) ∨ (nA /n)]
• asymptotic consistency: θ̂0 → {pA : PX ,θ ∈ [PS ]} almost surely as n → ∞ (represented by the red segment)
• θ̂0 estimates the set of all θ compatible with the distribution of the (imprecise) data: this is often
the goal when the parametric model is assumed to be true
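The three decisions in this example can be computed directly from the count data (nA , nB , nC , nna ). The formulas below are reconstructed from the poster (in particular the placement of the maximum ∨ in the minimax formula), and the counts mirror the illustrative PS = (0.1, 0.4, 0.2, 0.3):

```python
def ml_decisions(n_a, n_b, n_c, n_na):
    """Pointwise-dominance interval, minimin interval, and minimax point estimate."""
    n = n_a + n_b + n_c + n_na
    lower = n_a / n                    # every missing observation counted as not A
    upper = (n_a + n_na) / n           # every missing observation counted as A
    pointwise = (lower, upper)
    minimin_upper = max(1 - 2 * max(n_b, n_c) / n, lower)
    minimin = (lower, minimin_upper)   # empirical version of the identification region
    minimax = (2 / 3) * lower + (1 / 3) * minimin_upper  # unique point estimate
    return pointwise, minimin, minimax

pw, mm, mx = ml_decisions(10, 40, 20, 30)  # counts mirroring P_S = (0.1, 0.4, 0.2, 0.3)
```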
example: logistic regression
precise data: (X1 , Y1 ), ... , (X468 , Y468 ) ∈ R × {0, 1} i.i.d. with (unknown) distribution
P(X ,Y ) , describing the presence (Y = 1) or absence (Y = 0) of marine debris in the
gastrointestinal system of a green turtle that died at time X
logistic regression: estimates (α̂, β̂) of the regression parameters (α, β) = θ ∈ Θ = R2
are obtained by minimizing
L(P̂(X ,Y ) , (α, β)) = (1/n) ∑ᵢ₌₁ⁿ (Yi ln (1 + exp(−α − β Xi )) + (1 − Yi ) ln (1 + exp(α + β Xi )))
= −(1/n) ln ∏ᵢ₌₁ⁿ (1/(1 + exp(−α − β Xi )))^Yi (1 − 1/(1 + exp(−α − β Xi )))^(1−Yi)
• (α̂, β̂) are the parametric ML estimates when P(Y = 1 | X ) = 1/(1 + exp(−α − β X )) is assumed
• of particular interest is the question whether the probability of debris ingestion increased over time (β > 0) or not (β ≤ 0)
imprecise data: [X̲1 , X̄1 ] × {Y1 }, ... , [X̲468 , X̄468 ] × {Y468 } ⊂ R × {0, 1} i.i.d. with (unknown) distribution P[X̲,X̄]×{Y } (Schuyler et al., 2014)
minimax logistic regression: estimates (α̂m , β̂m ) are obtained by minimizing
maxP̂(X ,Y ) ∈[P̂[X̲,X̄]×{Y }] L(P̂(X ,Y ) , (α, β)) = L(P̂(Y X̄+(1−Y ) X̲, Y ) , (α, β)) if β ≤ 0, and = L(P̂(Y X̲+(1−Y ) X̄, Y ) , (α, β)) if β ≥ 0
• computing the minimax logistic regression corresponds to computing two (standard) logistic regressions, with the two extreme cases for the precise X data: Y X̲ + (1 − Y ) X̄ and Y X̄ + (1 − Y ) X̲
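This reduction can be sketched numerically: for β ≥ 0 the logistic loss is maximized by taking the lower endpoint for Y = 1 and the upper endpoint for Y = 0, and vice versa for β ≤ 0, so the worst-case loss is evaluable directly. The interval data, labels, and coarse grid below are illustrative, not the poster's turtle dataset:

```python
import math

def logistic_loss(xs, ys, alpha, beta):
    """Average negative log-likelihood of the logistic model."""
    return sum(math.log(1 + math.exp(-(alpha + beta * x))) if y == 1
               else math.log(1 + math.exp(alpha + beta * x))
               for x, y in zip(xs, ys)) / len(xs)

def worst_case_loss(xlo, xhi, ys, alpha, beta):
    """Max of the loss over all precise X data compatible with the intervals."""
    if beta >= 0:
        xs = [lo if y == 1 else hi for lo, hi, y in zip(xlo, xhi, ys)]
    else:
        xs = [hi if y == 1 else lo for lo, hi, y in zip(xlo, xhi, ys)]
    return logistic_loss(xs, ys, alpha, beta)

xlo = [0.0, 1.0, 2.0, 3.0, 4.0]  # illustrative interval lower endpoints
xhi = [0.5, 1.5, 2.5, 3.5, 4.5]  # illustrative interval upper endpoints
ys  = [0, 0, 1, 1, 1]

# coarse grid search for the minimax estimates (alpha_m, beta_m)
best = min(((a / 4, b / 4) for a in range(-40, 41) for b in range(-40, 41)),
           key=lambda p: worst_case_loss(xlo, xhi, ys, *p))
```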
• (α̂m , β̂m ) ≈ (−67, 0.033), and the significant positivity of β̂m (with p-value ≈ 0.001) in the logistic regression with the worst-case precise X data (i.e., Y X̲ + (1 − Y ) X̄ ) should also imply the significant positivity of β̂ in the logistic regression with the true precise X data: that is, the ingestion of marine debris by green turtles increased over time
[figure: minimax logistic regression fit — probability of debris ingestion vs. year, 1975–2005]