The Hidden Cost of Efficiency: Fairness and
Discrimination in Predictive Modeling
Julius Adebayo, Lalana Kagal, Alex "Sandy" Pentland
Computer Science and Artificial Intelligence Laboratory,
Electrical Engineering and Computer Science, MIT Media Lab
Massachusetts Institute of Technology.
We present a data transformation procedure that completely
eliminates all linear information regarding a sensitive attribute in a large scale individual level data set with several
correlated attributes. The algorithm presented here forms
a component of a larger fairness rating system being developed. The goal of the rating system is to elucidate black-box
models and bring interpretability to any predictive model
no matter how complex. As part of the system, we learn
lower dimensional interpretable versions of potentially complex models, attempt to learn a causal structure of the underlying data, and propose an augmentation to the underlying black-box algorithm as a way of reducing bias in
predictive modeling. This work is still ongoing, but here
we highlight one component of the system: the orthogonal
projection algorithm. The orthogonal projection algorithm
combines principal components analysis of the data set with
orthogonalization with respect to sensitive attribute(s). The
orthogonalization algorithm presented is motivated by applications where there is a need to drastically 'sanitize' a data
set of all information relating to sensitive attribute(s) in
the data, or that perhaps could be inferred from the data,
before analysis of the data using a data mining algorithm.
Our proposed methodology outperforms other privacy preserving methodologies by more than 20 percent in lowering
the ability to reconstruct the sensitive attribute from a sample
large scale individual level data set. In high stakes contexts such
as determination of access to credit, employment, and insurance where discrimination based on sensitive attributes such
as race, gender, and sexual orientation is prohibited by law,
our proposed algorithm provides a way to help reduce the
information content of such sensitive attributes in available
data, hence limiting bias.
Keywords: privacy, data mining, orthogonalization, PCA.
Bloomberg Data for Good Exchange Conference.
28-Sep-2015, New York City, NY, USA.
We live in the age of Big Data [1]. Over the past few years,
there has been an increase in the scale of available data both
about human behavior and the physical world [2]. Analysis of large scale data across various domains can lead to
improvement in decision making [3]. For example, analysis
of mobile phone data has led to the development of more
granular poverty maps in developing countries, a better understanding of disease spread, and better ability to augment
census information in areas where such information is lacking [4]. Analysis of large scale data presents tremendous
potential in helping to tackle social problems and perhaps
lead to the development of improved public policies through
a data-driven approach. On the other hand, the potential
harm to individual privacy has never been greater [5–8].
Beyond risks of re-identifying individuals, predictive privacy harms stemming from algorithmic decision making
pose particularly pressing threats to individual privacy [9].
To demonstrate what we term predictive privacy harms, we
present the now popular 'Target' example. In 2012, a New
York Times article noted that the retail chain Target was
able to predict which of its customers were pregnant, and
disclose this information to its marketing department for
advertisement purposes [9]. Target did not explicitly collect
pregnancy information from its customers, yet it was able
to easily infer their pregnancy status. This situation highlights a new kind of privacy concern associated with inferring
potentially sensitive information from seemingly unrelated
data. The potential privacy harm associated with data inferred from aggregated large scale data is what is termed
predictive privacy harm.
Several recent studies have shown that sensitive attributes
such as sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of
addictive substances, parental separation, age, and gender
can be inferred from social media data available on platforms like Facebook [10–12]. Aware of the potential ramifications of such a reality, governments and national agencies have started to raise alarms about the devastating privacy harms at play from predictive analysis of individual
level data [6, 13, 14]. Arguably, as predictive analytics becomes more prevalent, the principal challenge going forward
regards the development of methods and frameworks that
would ensure some level of privacy while allowing for useful
predictive analysis of data. At the center of the emerging
problem is the debate about sensitive information that can
be inferred given individual level data.
Privacy preserving data mining (see figure 1 for high level
overview of data mining) has emerged as a field that seeks to
address privacy problems that arise in applying data mining
techniques to large scale individual level data [8]. Several example methodologies include K-anonymity, L-diversity, and
geometric data perturbation techniques that transform the
input data to a data mining algorithm in order to increase
the difficulty of inferring potentially sensitive attributes as
part of the predictive analytics process [8,15]. Currently, the privacy preserving data mining literature is devoid of methodologies that seek to completely eliminate all information regarding a sensitive attribute in a large scale individual level
data set.
Figure 1: An overview of the data mining process. Data mining ultimately seeks to learn useful insights from
To address this gap we present a data transformation procedure that completely eliminates all linear information regarding a sensitive attribute in a large scale individual level data set with several highly correlated attributes.
Our procedure combines a principal components analysis of
the original data set with orthogonalization with respect to
the desired sensitive attribute to eliminate all linear
information relating to that attribute in a given data set.
The rest of the paper is organized as follows. Section 2 provides an
overview of current methodologies in privacy preserving data
mining as well as the motivation for the procedure presented
in this paper. Section 3 follows with a problem definition as
well as an overview of the methods used in the presented
procedure. In section 4 we present our procedure for complete elimination of sensitive attributes from large scale data.
Following Section 4, section 5 presents results of our procedure on a large scale consumer mortgage data set. We
then end with a conclusion and overview of the methodology presented and discussion about relevant applications of
our procedure.
Privacy preserving data mining concerns the development of
methods that seek to preserve ’privacy’ in some form given
the application of a data mining algorithm [8, 15]. Several
methodologies focus on transforming the data into one that
preserves privacy, or makes it difficult to compromise privacy, especially for individual level data. Typically these methods
seek to reduce the granularity of the data representation so
as to increase the difficulty of identifying sensitive attributes.
A subset of these privacy preserving algorithms are known
as perturbation techniques. Data perturbation techniques
involve a transformation of some or all attributes of a particular data set before applying a data mining algorithm [15].
Often, such transformations are applied in order to prevent re-identification or inference of potentially sensitive attributes. Example data perturbation techniques include random noise addition to data [16], rotation perturbation of sensitive attributes [?], and random projection perturbation [17,
18]. Other methodologies include K-anonymization [19] and
additional methodologies such as L-diversity and T-closeness
that address certain weaknesses of the K-anonymization methodology [20,21]. Recent studies have highlighted several weaknesses associated with all of the aforementioned perturbation techniques [15]. Random noise addition is susceptible
to techniques that leverage the spectral properties of the
randomized data in order to separate the noise from the actual data [15]. Further, random noise addition can only be
utilized with certain data mining algorithms. Rotation perturbation methods have also been shown to be susceptible
to distance-inference attacks [15]. In cases where all information regarding a sensitive attribute is to be eliminated,
the above methods are not adequate.
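To make two of the perturbation techniques discussed above concrete, the following is a minimal sketch of random noise addition and rotation perturbation using NumPy. It is an illustration under our own assumptions (Gaussian noise, a random orthogonal matrix obtained from a QR factorization, and the hypothetical function names `add_random_noise` and `rotate`), not the implementations evaluated in [15–18]. The last lines show that an orthogonal transformation leaves pairwise distances unchanged, which is the property that distance-inference attacks rely on.

```python
import numpy as np


def add_random_noise(X, scale, rng):
    """Additive perturbation: independent Gaussian noise on every entry."""
    return X + rng.normal(loc=0.0, scale=scale, size=X.shape)


def rotate(X, rng):
    """Multiply the rows of X by a random orthogonal matrix Q.

    Q comes from the QR factorization of a random Gaussian matrix, so it may
    include a reflection as well as a rotation (an assumption of this sketch).
    """
    d = X.shape[1]
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return X @ Q, Q


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic stand-in for a data set

X_noisy = add_random_noise(X, scale=0.5, rng=rng)
X_rot, Q = rotate(X, rng)

# Euclidean distance between two records before and after the transform.
d_before = np.linalg.norm(X[0] - X[1])
d_after = np.linalg.norm(X_rot[0] - X_rot[1])
print(d_before, d_after)
```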
3.1 Problem Definition
The critical privacy question going forward is not about attribute or identity disclosure, but related to the ’use’ of attributes learned or inferred as part of a machine learning or
data mining process. Our proposed procedure seeks to tackle
this problem, which is: how does one completely eliminate
the information content of a sensitive attribute in a large
scale individual level data set?
The focus of this paper is on the elimination of sensitive information in large scale individual level data. More precisely,
given a large scale individual level data set with multiple
correlated attributes along with other sensitive attribute(s),
we seek to develop a methodology to transform the data
into one where all information relating to the sensitive attribute(s) has been eliminated. Further, on applying a data
mining technique to the transformed data set, the ability to
reconstruct the sensitive attribute should be substantially
more difficult. Next we present a review of relevant concepts
before going into a detailed explanation of our proposed procedure.
Review of Required Concepts
Here we present an overview of Principal components analysis (PCA) and Orthogonalization, which are the two essential elements of our proposed algorithm.
Principal Components Analysis (PCA)
PCA helps identify patterns in data [22]. PCA is a widely used
method across different fields ranging from psychology to
computer vision. PCA helps to decorrelate data as well as to
perform dimensionality reduction [23]. Often, PCA can be
used to tease out the predictive value of the different factors
in a dataset, leading to the development of more accurate
predictive models [22, 24].
Given a random vector $X \in \mathbb{R}^n$ with zero mean and covariance matrix $\Sigma_x$, PCA performs a linear transformation of
$X$ to a lower dimension vector $Y \in \mathbb{R}^m$, $m < n$, such that
$Y = A_m^T X$
and $A_m^T A_m = I_m$. $I_m$ is the $m \times m$ identity matrix. Further,
$A_m$ is an $n \times m$ matrix whose columns are orthonormal. The
columns of $A_m$ correspond to the eigenvectors of the first $m$
largest eigenvalues of the matrix $\Sigma_x$ [22].
In addition to minimizing reconstruction error, PCA maximizes the variance of the projection along each component.
An added benefit of PCA is that it helps decorrelate a highly
co-linear data set. A more detailed development of PCA is presented in [22–24].
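The PCA transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the sample covariance matrix stands in for the covariance matrix of X and that rows are samples; the function name `pca_transform` and the synthetic data are ours. It builds the matrix of the top m eigenvectors, checks that its columns are orthonormal, and shows that the projected data are decorrelated.

```python
import numpy as np


def pca_transform(X, m):
    """Project the rows of X onto the top-m principal components."""
    Xc = X - X.mean(axis=0)                   # enforce the zero-mean assumption
    cov = np.cov(Xc, rowvar=False)            # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]     # indices of the m largest
    A_m = eigvecs[:, order]                   # n x m, orthonormal columns
    Y = Xc @ A_m                              # row-wise form of Y = A_m^T X
    return Y, A_m


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]       # make two attributes co-linear

Y, A_m = pca_transform(X, m=2)
cov_Y = np.cov(Y, rowvar=False)               # close to diagonal: decorrelated
print(np.round(A_m.T @ A_m, 6))
print(np.round(cov_Y, 6))
```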
Orthogonal Projection
Here we explore the general notion of orthogonal projection. A more concrete derivation and detailed overview is
presented in [25]. We demonstrate orthogonal projection
using a 2 dimensional example represented in figure 2.
In figure 2, the vector $A_{\perp S}$ shown is the component of the
vector A perpendicular to S. Given that $A_{\perp S}$ is perpendicular
to S, their inner product is zero. Further, the process
of obtaining the component of vector A that is perpendicular to S is known as orthogonal projection. As expected,
the concept of orthogonal projection extends beyond the two
dimensional setting to higher dimensions.
Orthogonal projection is a particular type of a larger class of
linear transformations. Intuitively, given two vectors whose
inner product is zero, one can conclude that no linear transformation of one vector can produce the other. This means
that linearly, one cannot learn any information about the
second vector given the other if their inner product is zero.
Ultimately, this intuition underlies our approach of eliminating information with respect to a sensitive attribute in a
particular data set.
Figure 2: Depiction of Orthogonal projection of A
onto S.
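As a small worked example of the two dimensional case, the sketch below computes the component of A that is perpendicular to S by subtracting from A its projection onto S. The function name `orthogonal_component` and the vectors chosen are illustrative assumptions, not values taken from figure 2.

```python
import numpy as np


def orthogonal_component(a, s):
    """Return the part of a that is perpendicular to s."""
    a = np.asarray(a, dtype=float)
    s = np.asarray(s, dtype=float)
    projection = (a @ s) / (s @ s) * s    # part of a lying along s
    return a - projection


A = np.array([3.0, 4.0])
S = np.array([2.0, 0.0])

A_perp = orthogonal_component(A, S)       # the component (0, 4)
inner = A_perp @ S                        # zero by construction
print(A_perp, inner)
```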
In this section we present an overview of our sensitive attribute elimination algorithm. Figure 3 gives a general schematic for the transformation of a dataset in order to eliminate the sensitive variable.
Algorithm 1 Orthogonal projection algorithm
INPUT: An n x p data matrix $X$ that can be decomposed
into $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_p$ feature (attribute) vectors, and $\vec{x}_s$, the
sensitive attribute vector.
OUTPUT: $X_{new}$ that can be decomposed into $\vec{x}^*_1, \vec{x}^*_2, \ldots, \vec{x}^*_p$
where each vector $\vec{x}^*_i \in X_{new}$ is orthogonal to $\vec{x}_s$.
Obtain the principal components of $X$
Transform $X$ given the principal components to obtain $X_{pre}$
for principal component $\vec{x}_{pre}$ in $X_{pre}$ do
    obtain $\vec{x}^*_i$, the component of $\vec{x}_{pre}$ that is orthogonal to $\vec{x}_s$
    join $\vec{x}^*_i$ column wise to $X_{new}$
end for
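A minimal NumPy sketch of Algorithm 1 follows. It is an illustration under several assumptions of ours rather than the implementation used for the results reported in this paper: the sensitive columns are dropped before PCA, every column is mean-centered (so that a zero inner product with a sensitive column means zero sample covariance with it), all principal components are retained, and the component orthogonal to the sensitive subspace is obtained through a least-squares fit. The function name `orthogonalize` and the synthetic data are hypothetical.

```python
import numpy as np


def orthogonalize(X, sensitive_cols):
    """Return principal components of the non-sensitive columns of X,
    each made orthogonal to every (centered) sensitive column."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center every column
    sensitive = set(sensitive_cols)
    keep = [j for j in range(X.shape[1]) if j not in sensitive]
    S = Xc[:, list(sensitive_cols)]               # n x s sensitive block
    X_pre = Xc[:, keep]                           # sensitive columns removed

    # PCA of X_pre: rotate onto the eigenvectors of its covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(X_pre, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    components = X_pre @ eigvecs[:, order]

    # Subtract from each component its projection onto the span of S.
    coef, *_ = np.linalg.lstsq(S, components, rcond=None)
    X_new = components - S @ coef
    return X_new, S


rng = np.random.default_rng(0)
n = 1000
s = rng.integers(0, 2, size=n).astype(float)      # synthetic binary attribute
x1 = s + rng.normal(scale=0.5, size=n)            # correlated with s
x2 = 2.0 * s + rng.normal(scale=0.5, size=n)      # correlated with s
x3 = rng.normal(size=n)                           # unrelated to s
X = np.column_stack([s, x1, x2, x3])

X_new, S = orthogonalize(X, sensitive_cols=[0])
print(np.round(S.T @ X_new, 8))                   # inner products, all ~0
```

Because every transformed column has zero inner product with the centered sensitive column, a least-squares fit of the sensitive attribute on the transformed columns returns coefficients of zero in this sketch.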
Given an n x m data matrix X with m attributes, n samples,
and s sensitive attributes, as a first step in our process, we remove all s sensitive columns of matrix X to produce matrix
Xpre . Because the columns of X are highly correlated, just
removing the s sensitive columns is not enough to completely
prevent these attributes from being reconstructed. On removing the sensitive attributes to obtain Xpre , we perform
PCA on Xpre in order to completely decorrelate the data.
Once Xpre has been decomposed into its principal components, the component of each principal component that is
orthogonal to the subspace of all the sensitive attributes is
then obtained. By definition, the inner product between
the projection obtained and each sensitive attribute vector
is guaranteed to be zero. Intuitively, this means we have
been able to transform the data set to one where the linear
relationship between the remaining attributes and the sensitive attribute vectors has been eliminated. The orthogonal
transformation of each principal component is important because it ensures that no linear transformation of the newly
obtained data set can reconstruct the original sensitive attributes. This is because we have deliberately
eliminated all linear dependence between the remaining feature vectors and the sensitive attributes.
Figure 3: An overview of our proposed orthogonalization process.
Available Data Attributes
Applicant Race
Applicant Income
Applicant Sex
Co-Applicant Race
Co-Applicant Sex
Loan Type
Loan Purpose
Property Neighborhood Minority Population
Population in Property Neighborhood
Neighborhood Census Information
Loan Preapproval Status
We demonstrate the proposed orthogonalization algorithm
on real world mortgage application data, the Home Mortgage
Disclosure Act data, obtained through the United States Consumer Financial Protection Bureau. The Home Mortgage
Disclosure Act (HMDA) of 1975 and the Dodd-Frank Act
of 2010 require financial entities to report and publicly disclose information about mortgage applications received and
the resulting loan decision made by bank loan underwriters [26]. As part of the legislative requirement, each financial entity operating in a large metropolitan area is required
to report details of mortgage applications received. Per application received, the reported information includes action
taken on the loan, applicant race and ethnicity, co-applicant
race and ethnicity, sex, median family income, loan purpose,
loan type, and home census tract information, amongst other information [26, 27]. For our analysis, we subsample the data
to a set consisting of 2.5 million of the 14.8 million mortgage
applications made in 2011. These 2.5 million applications
were made for properties categorized as One-to-Four family
dwellings and were conventional loans without any government assistance or subsidies. All attributes available in
the data set are listed in the table of available data attributes. In figure 4(A)
we show a bar chart of the racial breakdown in the sample
data examined. Figure 4(B) shows the overall volume of loan applications by purpose from 2011 to 2013.
Given the list of attributes, we consider applicant race to
be the sensitive attribute of choice for our analysis. The
goal in analyzing the above data is to remove all (linear) information relating to the applicant's race in the data set.
Under several enacted laws, race, gender, sexual
orientation, and several other protected characteristics cannot
be used as a basis for mortgage underwriting decisions.
Given such a requirement, a data transformation procedure
that eliminates all information relating to the applicant's race
in the above data set is valuable.
Figure 4: Figure (A) shows a bar chart of the racial breakdown in the sample data examined. Applicant
race is considered as the sensitive attribute for our analysis. Figure (B) shows the overall volume of loan
applications by purpose from 2011 to 2013.
Simple Feature Removal is not Enough
Figure 5(A) demonstrates that simply removing sensitive attributes in highly correlated data sets is not enough to prevent inferring the sensitive attribute from the other features.
We demonstrate this by predicting the sensitive attribute,
applicant race, from the rest of the features listed in the table of available data attributes. The task of
predicting applicant race can be approached using popular
classification algorithms. To quantify predictive performance, we use the area under the receiver operating characteristic curve (AUROC). Given a curve of true positive
rate versus false positive rate, the AUROC is the area under
this curve. The AUROC is a standard measure in the machine
learning literature that quantifies the predictability
of a particular classification task. As seen in Figure 5(A),
before orthogonalization, the average classifier performance
is 0.78, which drops by about 30 percent to 0.55 after applying our orthogonalization algorithm. Simply removing
the sensitive attribute does not affect the ability to infer it, because
of the presence of other highly correlated attributes. However, on applying our orthogonalization procedure we see a
30 percent decline in reconstruction ability for the sensitive
attribute.
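For readers unfamiliar with the metric, the sketch below computes the AUROC directly from its definition: it sweeps a decision threshold over the classifier scores, traces the curve of true positive rate versus false positive rate, and integrates it with the trapezoidal rule. The function name `auroc` and the toy labels and scores are our own, the sketch assumes there are no tied scores, and it is not the evaluation code behind Figure 5. The final line reproduces the relative drop implied by the reported averages of 0.78 and 0.55.

```python
import numpy as np


def auroc(y_true, scores):
    """Area under the ROC curve, assuming no tied scores."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")     # highest score first
    y_sorted = y_true[order]
    # Lowering the threshold one sample at a time traces the ROC curve.
    tpr = np.concatenate([[0.0], np.cumsum(y_sorted) / y_true.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~y_sorted) / (~y_true).sum()])
    # Trapezoidal rule over the (fpr, tpr) points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auroc(labels, scores)                          # 0.75 for this toy case

relative_drop = (0.78 - 0.55) / 0.78                  # about 0.29
print(area, round(relative_drop, 3))
```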
Reconstruction After Orthogonalization Algorithm is Applied
Going further, we compare our orthogonalization algorithm
to other privacy preserving methodologies in the literature.
Figure 5(B) shows the predictive performance of our orthogonalization procedure alongside random noise addition and
rotation perturbation. Again, we note that predictive performance on the reconstruction of the sensitive attribute is
more than 20 percent lower compared to the two other methods shown. Our orthogonalization procedure is guaranteed
to remove all linear information with respect to the desired
sensitive attributes given any large scale dataset. This guarantee ensures stability and consistency when our orthogonalization procedure is applied across a variety of datasets.
Our proposed algorithm has several limitations. First, the
orthogonalization procedure requires a priori access to the
entire data set or a significant portion of the data set to
be transformed. In cases where only a small fraction of the
overall data is available, the proposed procedure would have
limited performance. Second, our work has focused almost exclusively on removing all information with respect
to a sensitive attribute without any regard for the utility of
the data in a future data mining context. If the sensitive attribute is highly correlated with a variable that
is later to be predicted using a dataset that has undergone our
proposed transformation, then the predictive performance
of such a process would be highly limited. We note, however, that in high
stakes markets such as credit, employment, and insurance,
where discrimination on the basis of race, gender, and sexual
orientation is prohibited by law, a drastic reduction
in utility given removal of the aforementioned sensitive attributes might be the desired outcome.
The emergence of large scale datasets has necessitated the
development of new methodologies to temper privacy harms
that applying data mining techniques to such data might
bring about. Our orthogonalization algorithm is motivated
by such situations in which there is a need to drastically
’sanitize’ a data set of all information relating to a sensitive
attribute that the data might contain. In particular, given a
large scale individual level data set with multiple correlated
attributes, our methodology leverages principal components
analysis and orthogonal projection in order to completely
eliminate all linear information with respect to a sensitive
attribute in such a data set. Our proposed methodology outperforms other privacy preserving methodologies by more
than 20 percent in lowering the ability to reconstruct the sensitive attribute from a sample large scale individual level data set.
The utility of the presented technique is heightened
in situations where a large scale data set is to be analyzed using
data mining techniques that can easily reconstruct the sensitive attribute if it is simply removed from the data. While
complete elimination of a sensitive attribute from a data set
can be considered extreme, we note that several proposed
privacy laws and frameworks that seek to prevent predictive privacy harms adopt a view of eliminating all information relating to a particular sensitive attribute. In these
cases, our orthogonalization algorithm can serve as a useful
tool to accomplish sensitive attribute removal.
Figure 5: Figure (A) shows predictive performance before and after transforming the data using the orthogonalization procedure. Figure (B) shows a comparison of our proposed procedure with random noise addition
and rotation perturbation.
[1] S. Lohr, “The age of big data,” New York Times,
vol. 11, 2012.
[2] V. Mayer-Schönberger and K. Cukier, Big data: A
revolution that will transform how we live, work, and
think. Houghton Mifflin Harcourt, 2013.
[3] A. Kusiak, J. A. Kern, K. H. Kernstine, and B. T.
Tseng, “Autonomous decision-making: a data mining
approach,” Information Technology in Biomedicine,
IEEE Transactions on, vol. 4, no. 4, pp. 274–284,
[4] E. Letouzé, “Big data for development: Opportunities
& challenges,” 2011.
[5] J. Reno, “Big data, little privacy.”
[6] O. Tene and J. Polonetsky, “Big data for all: Privacy
and user control in the age of analytics,” Nw. J. Tech.
& Intell. Prop., vol. 11, p. xxvii, 2012.
[7] J. K. Laurila, D. Gatica-Perez, I. Aad, O. Bornet,
T.-M.-T. Do, O. Dousse, J. Eberle, M. Miettinen
et al., “The mobile data challenge: Big data for mobile
computing research,” in Pervasive Computing, no.
EPFL-CONF-192489, 2012.
[8] C. C. Aggarwal and S. Y. Philip, A general survey of
privacy-preserving data mining models and algorithms.
Springer, 2008.
[9] K. Crawford and J. Schultz, “Big data and due
process: Toward a framework to redress predictive
privacy harms,” BCL Rev., vol. 55, p. 93, 2014.
[10] C. Jernigan and B. F. Mistree, “Gaydar: Facebook
friendships expose sexual orientation,” First Monday,
vol. 14, no. 10, 2009.
[11] M. Kosinski, D. Stillwell, and T. Graepel, “Private
traits and attributes are predictable from digital
records of human behavior,” Proceedings of the
National Academy of Sciences, vol. 110, no. 15, pp.
5802–5805, 2013.
[12] S. Zhao, S. Grasmuck, and J. Martin, “Identity
construction on facebook: Digital empowerment in
anchored relationships,” Computers in human
behavior, vol. 24, no. 5, pp. 1816–1836, 2008.
[13] T. Craig and M. E. Ludloff, Privacy and big data.
O’Reilly Media, Inc., 2011.
[14] K. Davis, Ethics of Big Data: Balancing risk and
innovation. O’Reilly Media, Inc., 2012.
[15] K. Chen, G. Sun, and L. Liu, “Towards attack-resilient
geometric data perturbation.” in SDM. SIAM, 2007,
pp. 78–89.
[16] J. Domingo-Ferrer, F. Sebé, and J. Castella-Roca, “On
the security of noise addition for privacy in statistical
databases,” in Privacy in statistical databases.
Springer, 2004, pp. 149–161.
[17] K. Liu, H. Kargupta, and J. Ryan, “Random
projection-based multiplicative data perturbation for
privacy preserving distributed data mining,”
Knowledge and Data Engineering, IEEE Transactions
on, vol. 18, no. 1, pp. 92–106, 2006.
[18] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar,
“Random-data perturbation techniques and
privacy-preserving data mining,” Knowledge and
Information Systems, vol. 7, no. 4, pp. 387–414, 2005.
[19] L. Sweeney, “k-anonymity: A model for protecting
privacy,” International Journal of Uncertainty,
Fuzziness and Knowledge-Based Systems, vol. 10,
no. 05, pp. 557–570, 2002.
[20] A. Machanavajjhala, D. Kifer, J. Gehrke, and
M. Venkitasubramaniam, “l-diversity: Privacy beyond
k-anonymity,” ACM Transactions on Knowledge
Discovery from Data (TKDD), vol. 1, no. 1, p. 3, 2007.
[21] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness:
Privacy beyond k-anonymity and l-diversity,” in Data
Engineering, 2007. ICDE 2007. IEEE 23rd
International Conference on. IEEE, 2007, pp.
[22] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian, “Feature
selection using principal feature analysis,” in
Proceedings of the 15th international conference on
Multimedia. ACM, 2007, pp. 301–304.
[23] L. I. Smith, “A tutorial on principal components
analysis,” Cornell University, USA, vol. 51, p. 52,
[24] Y. Cui and J. G. Dy, “Orthogonal principal feature
selection,” 2008.
[25] C. D. Meyer, Matrix analysis and applied linear
algebra. SIAM, 2000.
[26] P. Bayer, F. Ferreira, and S. L. Ross, “Race, ethnicity
and high-cost mortgage lending,” National Bureau of
Economic Research, Tech. Rep., 2014.
[27] W. Li and L. Goodman, “A better measure of
mortgage application denial rates,” Washington:
Urban Institute, 2014.