The Hidden Cost of Efficiency: Fairness and Discrimination in Predictive Modeling

Julius Adebayo, Lalana Kagal, Alex "Sandy" Pentland
Computer Science and Artificial Intelligence Laboratory, Electrical Engineering and Computer Science, MIT Media Lab
Massachusetts Institute of Technology
Email: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org

ABSTRACT
We present a data transformation procedure that eliminates all linear information regarding a sensitive attribute in a large scale individual level data set with several correlated attributes. The algorithm presented here forms a component of a larger fairness rating system under development. The goal of the rating system is to elucidate black-box models and bring interpretability to any predictive model, no matter how complex. As part of the system, we learn lower dimensional, interpretable versions of potentially complex models, attempt to learn a causal structure of the underlying data, and propose an augmentation to the underlying black-box algorithm as a way of reducing bias in predictive modeling. This work is still ongoing; here we highlight one component of the system: the orthogonal projection algorithm. The orthogonal projection algorithm combines principal components analysis of the data set with orthogonalization with respect to the sensitive attribute(s). It is motivated by applications where there is a need to drastically 'sanitize' a data set of all information relating to sensitive attribute(s) that is present in the data, or that could be inferred from it, before the data is analyzed with a data mining algorithm. On a sample large scale individual level data set, our proposed methodology lowers the ability to reconstruct the sensitive attribute by more than 20 percent compared with other privacy preserving methodologies.
In high stakes contexts such as determination of access to credit, employment, and insurance, where discrimination based on sensitive attributes such as race, gender, and sexual orientation is prohibited by law, our proposed algorithm provides a way to reduce the information content of such sensitive attributes in available data, hence limiting bias.

Keywords
privacy, data mining, orthogonalization, PCA.

Bloomberg Data for Good Exchange Conference. 28-Sep-2015, New York City, NY, USA.

1. INTRODUCTION
We live in the age of Big Data [1]. Over the past few years, there has been an increase in the scale of available data about both human behavior and the physical world. Analysis of large scale data across various domains can lead to improvement in decision making. For example, analysis of mobile phone data has led to the development of more granular poverty maps in developing countries, a better understanding of disease spread, and a better ability to augment census information in areas where such information is lacking. Analysis of large scale data presents tremendous potential for tackling social problems and may lead to improved public policies through a data-driven approach.

On the other hand, the potential harms to individual privacy have never been greater [5–8]. Beyond the risk of re-identifying individuals, predictive privacy harms, which stem from algorithmic decision making, pose a particularly pressing threat to individual privacy [9]. To demonstrate what we term predictive privacy harms, we present the now popular 'Target' example. In 2012, a New York Times article noted that the retail chain Target was able to predict which of its customers were pregnant and to disclose this information to its marketing department for advertisement purposes. Target did not explicitly collect pregnancy information from its customers, yet it was able to easily infer their pregnancy status.
This situation highlights a new kind of privacy concern associated with inferring potentially sensitive information from seemingly unrelated data. The potential privacy harm associated with information inferred from aggregated large scale data is what is termed predictive privacy harm. Several recent studies have shown that sensitive attributes such as sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender can be inferred from social media data available on platforms like Facebook [10–12]. Aware of the potential ramifications of such a reality, governments and national agencies have started to raise alarms about the devastating privacy harms at play from predictive analysis of individual level data [6, 13, 14].

Arguably, as predictive analytics becomes more prevalent, the principal challenge going forward is the development of methods and frameworks that ensure some level of privacy while still allowing useful predictive analysis of data. At the center of the emerging problem is the debate about sensitive information that can be inferred from individual level data. Privacy preserving data mining (see figure 1 for a high level overview of data mining) has emerged as a field that seeks to address privacy problems that arise in applying data mining techniques to large scale individual level data. Example methodologies include K-anonymity, L-diversity, and geometric data perturbation techniques, which transform the input data to a data mining algorithm in order to make it more difficult to infer potentially sensitive attributes as part of the predictive analytics process [8, 15].

Figure 1: An overview of the data mining process. Data mining ultimately seeks to learn useful insights from data.
Currently, the privacy preserving data mining literature is devoid of methodologies that seek to completely eliminate all information regarding a sensitive attribute in a large scale individual level data set. To address this gap, we present a data transformation procedure that eliminates all linear information regarding a sensitive attribute in a large scale individual level data set with several highly correlated attributes. Our procedure combines a principal components analysis of the original data set with orthogonalization with respect to the desired sensitive attribute.

The rest of the paper is organized as follows. Section 2 provides an overview of current methodologies in privacy preserving data mining as well as the motivation for the procedure presented in this paper. Section 3 follows with a problem definition and an overview of the methods used in the presented procedure. In Section 4 we present our procedure for elimination of sensitive attributes from large scale data. Section 5 presents results of our procedure on a large scale consumer mortgage data set. We end with a conclusion, an overview of the methodology presented, and a discussion of relevant applications of our procedure.

2. LITERATURE REVIEW AND MOTIVATION
Privacy preserving data mining concerns the development of methods that seek to preserve 'privacy' in some form when a data mining algorithm is applied [8, 15]. Several methodologies focus on transforming the data into a form that preserves privacy, or makes it difficult to compromise, especially for individual level data. Typically these methods seek to reduce the granularity of the data representation so as to increase the difficulty of identifying sensitive attribute information. A subset of these privacy preserving algorithms are known as perturbation techniques.
Data perturbation techniques involve a transformation of some or all attributes of a particular data set before applying a data mining algorithm. Often, such transformations are applied in order to prevent re-identification or inference of potentially sensitive attributes. Example data perturbation techniques include random noise addition to data, rotation perturbation of sensitive attributes [?], and random projection perturbation [17, 18]. Other methodologies include K-anonymization [19] and additional methodologies such as L-diversity and T-closeness that address certain weaknesses of K-anonymization [20, 21].

Recent studies have highlighted several weaknesses associated with all of the aforementioned perturbation techniques. Random noise addition is susceptible to techniques that leverage the spectral properties of the randomized data in order to separate the noise from the actual data. Further, random noise addition can only be used with certain data mining algorithms. Rotation perturbation methods have also been shown to be susceptible to distance-inference attacks. In cases where all information regarding a sensitive attribute is to be eliminated, the above methods are not adequate.

3. PROBLEM DEFINITION AND OVERVIEW OF METHODS

3.1 Problem Definition
The critical privacy question going forward is not about attribute or identity disclosure, but about the 'use' of attributes learned or inferred as part of a machine learning or data mining process. Our proposed procedure seeks to tackle this problem: how does one completely eliminate the information content of a sensitive attribute in a large scale individual level data set? The focus of this paper is on the elimination of sensitive information in large scale individual level data.
More precisely, given a large scale individual level data set with multiple correlated attributes along with other sensitive attribute(s), we seek to develop a methodology to transform the data into one where all information relating to the sensitive attribute(s) has been eliminated. Further, once a data mining technique is applied to the transformed data set, reconstructing the sensitive attribute should be substantially more difficult. Next we review the relevant concepts before giving a detailed explanation of our proposed procedure.

3.2 Review of Required Concepts
Here we present an overview of principal components analysis (PCA) and orthogonalization, the two essential elements of our proposed algorithm.

3.2.1 Principal Components Analysis (PCA)
PCA helps identify patterns in data. It is a widely used method across different fields, ranging from psychology to computer vision. PCA helps to decorrelate data as well as to perform dimensionality reduction. Often, PCA can be used to tease out the predictive value of the different factors in a data set, leading to the development of more accurate predictive models [22, 24].

Given a random vector $X \in \mathbb{R}^n$ with zero mean and covariance matrix $\Sigma_x$, PCA performs a linear transformation of $X$ to a lower dimensional vector $Y \in \mathbb{R}^m$, $m < n$, such that

$$Y = A_m^T X \qquad (1)$$

and $A_m^T A_m = I_m$, where $I_m$ is the $m \times m$ identity matrix. Further, $A_m$ is an $n \times m$ matrix whose columns are orthonormal. The columns of $A_m$ are the eigenvectors of $\Sigma_x$ associated with its $m$ largest eigenvalues. In addition to minimizing reconstruction error, PCA maximizes the variance of the projection along each component. An added benefit of PCA is that it helps decorrelate a highly collinear data set. A more detailed development of PCA is presented in [22–24].

3.2.2 Orthogonal Projection
Here we explore the general notion of orthogonal projection. A more concrete derivation and detailed overview is presented in [25].
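As a compact statement of the operation, the component of a vector $A$ perpendicular to a vector $S$, written $A_{\perp S}$, can be expressed with the standard linear algebra formulas below. These expressions are textbook results rather than equations reproduced from this paper. The first assumes $S \neq 0$. The second is an assumed generalization to several sensitive attributes, collected as the columns of a matrix $\mathbf{S}$ (a symbol introduced here for illustration) with linearly independent columns.

```latex
% Single vector S: subtract from A its projection onto S.
A_{\perp S} \;=\; A \;-\; \frac{\langle A, S\rangle}{\langle S, S\rangle}\, S,
\qquad
\langle A_{\perp S}, S\rangle
  \;=\; \langle A, S\rangle - \frac{\langle A, S\rangle}{\langle S, S\rangle}\,\langle S, S\rangle
  \;=\; 0 .

% Several vectors: project onto the orthogonal complement of the column space of \mathbf{S}.
A_{\perp \mathbf{S}} \;=\; \bigl(I - \mathbf{S}\,(\mathbf{S}^{T}\mathbf{S})^{-1}\mathbf{S}^{T}\bigr)\, A,
\qquad
\mathbf{S}^{T} A_{\perp \mathbf{S}} \;=\; 0 .
```

The second line of the first formula makes explicit why the inner product with $S$ vanishes after the projection is removed.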
We demonstrate orthogonal projection using a two dimensional example, represented in figure 2. In figure 2, the vector $A_{\perp S}$ is the component of the vector $A$ perpendicular to $S$. Because $A_{\perp S}$ is perpendicular to $S$, their inner product is zero. The process of obtaining the component of $A$ that is perpendicular to $S$ is known as orthogonal projection. The concept extends beyond the two dimensional setting to higher dimensions. Orthogonal projection is a particular type of a larger class of linear transformations.

Figure 2: Depiction of orthogonal projection of A onto S.

Intuitively, given two vectors whose inner product is zero, no rescaling of one vector can approximate the other: the least-squares coefficient of one on the other is zero. This means that, linearly, one cannot learn any information about the second vector from the first if their inner product is zero. This intuition underlies our approach to eliminating information about a sensitive attribute in a particular data set.

4. ALGORITHM FOR SENSITIVE ATTRIBUTE REMOVAL
In this section we present an overview of our sensitive attribute elimination algorithm. Figure 3 gives a general schematic for the transformation of a data set in order to eliminate the sensitive variable.

Algorithm 1 Orthogonalization with respect to a sensitive attribute
INPUT: An $n \times p$ data matrix $X$ that can be decomposed into feature (attribute) vectors $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_p$, and a sensitive attribute vector $\vec{x}_s$.
OUTPUT: An $n \times p$ transformed matrix $X_{new}$ that can be decomposed into $\vec{x}^*_1, \vec{x}^*_2, \ldots, \vec{x}^*_p$, where each vector $\vec{x}^*_i \in X_{new}$ is orthogonal to $\vec{x}_s$.
  Obtain the principal components of $X$
  Transform $X$ given the principal components to obtain $X_{pre}$
  for each principal component $\vec{x}_{pre}$ in $X_{pre}$ do
    obtain $\vec{x}^*_i$, the component of $\vec{x}_{pre}$ that is orthogonal to $\vec{x}_s$
    join $\vec{x}^*_i$ column wise to $X_{new}$
  end for

Given an $n \times m$ data matrix $X$ with $m$ attributes, $n$ samples, and $s$ sensitive attributes, as a first step we remove all $s$ sensitive columns of $X$ to produce the matrix $X_{pre}$. Because the columns of $X$ are highly correlated, removing the $s$ sensitive columns is not enough to prevent these attributes from being reconstructed. After removing the sensitive attributes to obtain $X_{pre}$, we perform PCA on $X_{pre}$ in order to decorrelate the data. Once $X_{pre}$ has been decomposed into its principal components, we obtain the component of each principal component that is orthogonal to the subspace spanned by the sensitive attributes. By construction, the inner product between each projection obtained and each sensitive attribute vector is zero. Intuitively, this means we have transformed the data set into one where the linear relationship between the remaining attributes and the sensitive attribute vectors has been eliminated.

Figure 3: An overview of our proposed orthogonalization process.

The orthogonal transformation of each principal component is important because it ensures that no linear transformation of the newly obtained data set can reconstruct the original sensitive attributes. This is because we have deliberately eliminated all linear dependence between the remaining feature vectors and the sensitive attributes.

5. EXPERIMENTAL RESULTS
We demonstrate the proposed orthogonalization algorithm on real world mortgage application data, the Home Mortgage Disclosure data, obtained through the United States Consumer Financial Protection Bureau. The Home Mortgage Disclosure Act (HMDA) of 1975 and the Dodd-Frank Act of 2010 require financial entities to report and publicly disclose information about mortgage applications received and the resulting loan decisions made by bank loan underwriters. As part of the legislative requirement, each financial entity operating in a large metropolitan area is required to report details of mortgage applications received. For each application, the reported information includes action taken on the loan, applicant race and ethnicity, co-applicant race and ethnicity, sex, median family income, loan purpose, loan type, and home census tract information, amongst other information [26, 27].

For our analysis, we subsample the data to a set consisting of 2.5 million of the 14.8 million mortgage applications made in 2011. These 2.5 million applications were made for properties categorized as One-to-Four family dwellings and were conventional loans without any government assistance or subsidies. All available attributes in the data set are shown in the table below.

Available Data Attributes
  Applicant Race
  Applicant Income
  Applicant Sex
  Co-Applicant Race
  Co-Applicant Sex
  Loan Type
  Loan Purpose
  Property Neighborhood Minority Population
  Population in Property Neighborhood
  Neighborhood Census Information
  Loan Preapproval Status
  State

Figure 4(A) shows a bar chart of the racial breakdown in the sample data examined. Figure 4(B) shows the overall volume of loan applications by purpose from 2011 to 2013. Given the list of attributes, we consider applicant race to be the sensitive attribute for our analysis. The goal in analyzing the above data is to remove all linear information relating to the applicant's race from the data set. Under several enacted laws, race, gender, sexual orientation, and several other protected characteristics cannot be used as a basis for mortgage underwriting decisions.
Given such a requirement, a data transformation procedure that eliminates all information relating to applicant race in the above data set is valuable.

Figure 4: (A) A bar chart of the racial breakdown in the sample data examined. Applicant race is considered as the sensitive attribute for our analysis. (B) The overall volume of loan applications by purpose from 2011 to 2013.

5.1 Simple Feature Removal is not Enough
Figure 5(A) demonstrates that simply removing sensitive attributes from highly correlated data sets is not enough to prevent the sensitive attribute from being inferred from the other features. We demonstrate this by predicting the sensitive attribute, applicant race, from the rest of the features shown in the table above. The task of predicting applicant race can be approached using popular classification algorithms. To quantify predictive performance, we use the area under the receiver operating characteristic curve (AUROC): given a curve of true positive rate versus false positive rate, the AUROC is the area under this curve. The AUROC is a standard measure in the machine learning literature for quantifying the predictability of a particular classification task, with 0.5 corresponding to chance. As seen in Figure 5(A), before orthogonalization the average classifier performance is 0.78; after applying our orthogonalization algorithm it drops by about 30 percent, to 0.55. Simply removing the sensitive attribute does not affect the ability to infer applicant race, because of the presence of other highly correlated attributes, whereas our orthogonalization procedure produces a 30 percent decline in the ability to reconstruct the sensitive attribute.

5.2 Reconstruction After Orthogonalization Algorithm is Applied
Going further, we compare our orthogonalization algorithm to other privacy preserving methodologies in the literature.
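The transformation of Section 4 and the reconstruction test of Section 5.1 can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the data are simulated rather than HMDA records, a least-squares linear scorer stands in for the classifiers, the sensitive column is mean-centered before projection (an assumption made here so that zero inner product also means zero linear correlation), and the function names `pca_scores`, `orthogonalize`, `auroc`, and `heldout_auroc` are hypothetical.

```python
import numpy as np


def pca_scores(X):
    """Center X and rotate it onto its principal axes (all components kept)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the eigenvectors of the sample covariance matrix of X.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T


def orthogonalize(X_pre, S):
    """Remove from each principal component its projection onto span(S)."""
    Z = pca_scores(X_pre)
    Sc = S - S.mean(axis=0)  # assumption: sensitive columns are mean-centered
    # Least-squares coefficients of every principal component on the sensitive columns.
    coef, *_ = np.linalg.lstsq(Sc, Z, rcond=None)
    return Z - Sc @ coef


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney statistic); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def heldout_auroc(X, s, n_train):
    """Fit a linear scorer for s on the first n_train rows, score the remaining rows."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A[:n_train], s[:n_train], rcond=None)
    return auroc(A[n_train:] @ w, s[n_train:])


rng = np.random.default_rng(0)
n = 5000
s = rng.integers(0, 2, size=n).astype(float)  # synthetic binary sensitive attribute
# Remaining attributes: noise plus a shift that depends on s, so they are correlated with s.
X_pre = rng.normal(size=(n, 4)) + np.outer(s, [1.0, 0.8, -0.6, 0.4])

X_new = orthogonalize(X_pre, s[:, None])

max_inner = np.abs(X_new.T @ s).max()          # inner products with the sensitive vector
auc_before = heldout_auroc(X_pre, s, n // 2)   # sensitive column merely dropped
auc_after = heldout_auroc(X_new, s, n // 2)    # after orthogonalization

print(f"max |<x*_i, x_s>| = {max_inner:.2e}")
print(f"AUROC before = {auc_before:.3f}, after = {auc_after:.3f}")
```

Because the projection is computed on the full sample, as the procedure requires, the held-out AUROC after transformation should sit near the chance level of 0.5 for a linear scorer. A nonlinear classifier could still recover some signal, since only linear information is removed.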
Figure 5(B) shows the predictive performance of our orthogonalization procedure alongside random noise addition and rotation perturbation. Again, we note that predictive performance on the reconstruction of the sensitive attribute is more than 20 percent lower than for the two other methods shown. Our orthogonalization procedure is guaranteed to remove all linear information with respect to the desired sensitive attributes in any large scale data set. This guarantee ensures stability and consistency when the procedure is applied across a variety of data sets.

Our proposed algorithm has several limitations. First, the orthogonalization procedure requires a priori access to the entire data set, or a significant portion of the data set, to be transformed. In cases where only a small fraction of the overall data is available, the proposed procedure would have limited performance. Second, our work has focused almost exclusively on removing all information with respect to a sensitive attribute, without regard for the utility of the data in a future data mining context. If the sensitive attribute is highly correlated with a variable that is later to be predicted from a data set that has undergone our proposed transformation, then the predictive performance of that process would be highly limited. We note, however, that in high stakes markets such as credit, employment, and insurance, where discrimination on the basis of race, gender, and sexual orientation is prohibited by law, a drastic reduction in utility following removal of these sensitive attributes might be the desired outcome.

6. CONCLUSION
The emergence of large scale data sets has necessitated the development of new methodologies to temper the privacy harms that applying data mining techniques to such data might bring about.
Our orthogonalization algorithm is motivated by situations in which there is a need to drastically 'sanitize' a data set of all information relating to a sensitive attribute that the data might contain. In particular, given a large scale individual level data set with multiple correlated attributes, our methodology leverages principal components analysis and orthogonal projection in order to eliminate all linear information with respect to a sensitive attribute in that data set. On a sample large scale individual level data set, our proposed methodology lowers the ability to reconstruct the sensitive attribute by more than 20 percent compared with other privacy preserving methodologies. The technique is most useful where large scale data is to be analyzed using data mining techniques that can easily reconstruct the sensitive attribute if it is simply removed from the data. While complete elimination of a sensitive attribute from a data set can be considered extreme, we note that several proposed privacy laws and frameworks that seek to prevent predictive privacy harms adopt a view of eliminating all information relating to a particular sensitive attribute. In these cases, our orthogonalization algorithm can serve as a useful tool for sensitive attribute removal.

Figure 5: (A) Predictive performance before and after transforming the data using the orthogonalization procedure. (B) A comparison of our proposed procedure with random noise addition and rotation perturbation.

7. REFERENCES
[1] S. Lohr, “The age of big data,” New York Times, vol. 11, 2012.
[2] V. Mayer-Schönberger and K. Cukier, Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, 2013.
[3] A. Kusiak, J. A. Kern, K. H. Kernstine, and B. T. Tseng, “Autonomous decision-making: a data mining approach,” Information Technology in Biomedicine, IEEE Transactions on, vol. 4, no. 4, pp. 274–284, 2000.
[4] E. Letouzé, “Big data for development: Opportunities & challenges,” 2011.
[5] J. Reno, “Big data, little privacy.”
[6] O. Tene and J. Polonetsky, “Big data for all: Privacy and user control in the age of analytics,” Nw. J. Tech. & Intell. Prop., vol. 11, p. xxvii, 2012.
[7] J. K. Laurila, D. Gatica-Perez, I. Aad, O. Bornet, T.-M.-T. Do, O. Dousse, J. Eberle, M. Miettinen et al., “The mobile data challenge: Big data for mobile computing research,” in Pervasive Computing, no. EPFL-CONF-192489, 2012.
[8] C. C. Aggarwal and P. S. Yu, A general survey of privacy-preserving data mining models and algorithms. Springer, 2008.
[9] K. Crawford and J. Schultz, “Big data and due process: Toward a framework to redress predictive privacy harms,” BCL Rev., vol. 55, p. 93, 2014.
[10] C. Jernigan and B. F. Mistree, “Gaydar: Facebook friendships expose sexual orientation,” First Monday, vol. 14, no. 10, 2009.
[11] M. Kosinski, D. Stillwell, and T. Graepel, “Private traits and attributes are predictable from digital records of human behavior,” Proceedings of the National Academy of Sciences, vol. 110, no. 15, pp. 5802–5805, 2013.
[12] S. Zhao, S. Grasmuck, and J. Martin, “Identity construction on Facebook: Digital empowerment in anchored relationships,” Computers in Human Behavior, vol. 24, no. 5, pp. 1816–1836, 2008.
[13] T. Craig and M. E. Ludloff, Privacy and big data. O'Reilly Media, Inc., 2011.
[14] K. Davis, Ethics of Big Data: Balancing risk and innovation. O'Reilly Media, Inc., 2012.
[15] K. Chen, G. Sun, and L. Liu, “Towards attack-resilient geometric data perturbation,” in SDM. SIAM, 2007, pp. 78–89.
[16] J. Domingo-Ferrer, F. Sebé, and J. Castella-Roca, “On the security of noise addition for privacy in statistical databases,” in Privacy in statistical databases. Springer, 2004, pp. 149–161.
[17] K. Liu, H. Kargupta, and J. Ryan, “Random projection-based multiplicative data perturbation for privacy preserving distributed data mining,” Knowledge and Data Engineering, IEEE Transactions on, vol. 18, no. 1, pp. 92–106, 2006.
[18] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “Random-data perturbation techniques and privacy-preserving data mining,” Knowledge and Information Systems, vol. 7, no. 4, pp. 387–414, 2005.
[19] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[20] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 3, 2007.
[21] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k-anonymity and l-diversity,” in Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on. IEEE, 2007, pp. 106–115.
[22] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian, “Feature selection using principal feature analysis,” in Proceedings of the 15th international conference on Multimedia. ACM, 2007, pp. 301–304.
[23] L. I. Smith, “A tutorial on principal components analysis,” Cornell University, USA, vol. 51, p. 52, 2002.
[24] Y. Cui and J. G. Dy, “Orthogonal principal feature selection,” 2008.
[25] C. D. Meyer, Matrix analysis and applied linear algebra. SIAM, 2000.
[26] P. Bayer, F. Ferreira, and S. L. Ross, “Race, ethnicity and high-cost mortgage lending,” National Bureau of Economic Research, Tech. Rep., 2014.
[27] W. Li and L. Goodman, “A better measure of mortgage application denial rates,” Washington: Urban Institute, 2014.