CN6121 Group Coursework: A Comparative Study of Various Techniques in AI
Carol Chachati & Myriam Marlin
U1312503 & U1630248
School of Architecture, Computing and Engineering, University of East London
[email protected], [email protected]

Abstract: Supervised learning and unsupervised learning are two different approaches a machine can use to build a model that makes predictions for new input data. Supervised learning uses labelled features, while unsupervised learning uses unlabelled features. With supervised learning you know what outcome you want, and output values are discrete, either one class or another (1 or 0). With unsupervised learning there are no labels to supervise against, so you rely on the algorithm to produce an outcome based only on the feature data you give it. Which approach is better usually depends on a number of factors: the input data, how much of it there is, how well it performs, and how the model is tailored by including or excluding features. (Brownlee, 2016) (Stanford University, no date)

1. Introduction

1.1 Objectives
The coursework requirements are:
•To compare supervised and unsupervised learning
•To do the coursework in pairs
•Each person in the group must cover one type of learning: one person supervised, the other unsupervised
•Each person must use the same dataset to implement their chosen algorithm. The dataset should consist of at least 100 instances, or more than 200 for a dataset with only a few attributes
•Each person must discuss why they chose their particular AI technique, and why the two techniques have been chosen for comparison
•Explain why a particular software was used to implement the algorithm
•Write about how the implementation of the algorithm was done and how the data was represented
•Present results obtained from the simulations of both types of learning, with appropriate screenshots
•Give a detailed analysis of the results, comparing and contrasting the results obtained from the simulations of the two AI techniques
•Conclude the comparison

1.2 Overview

Supervised learning:
Supervised learning is when the training data has labelled features, the output is discrete, and you know what outcome you would like. It is used for classification problems and regression problems. An algorithm that uses supervised learning creates a model from the training data, and predictions for new data can then be made from this model. (Cord and Cunningham, 2008) (Stanford University, no date) (ur Réhman, 2016)

Regression and classification:
Supervised learning techniques usually address either regression or classification problems. Classification problems involve categories, so the output is either 0 or 1, while regression produces an output value from some input features; rather than a category, the output can be any value in the range of the output data type. For example, given input features such as the number of rooms, the output is the price of a house, which is continuous. (Cord and Cunningham, 2008) (Ng, no date)

The supervised learning algorithm that will be used is Logistic Regression, implemented by Carol.

Coefficients: The coefficients of the variables are the basis of how the prediction is made, as in Linear Regression, on which Logistic Regression is based. The coefficients are worked out by the algorithm, and they show how much impact each variable has on the final output prediction. (Altman, Gill and McDonald, 2004)

Cost function: The estimated error loss between the predicted output and the prediction you actually wanted for that data. (Ng, no date)

The unsupervised learning algorithm that will be used is K-means clustering, implemented by Myriam.
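To make the cost-function idea above concrete, the sketch below computes the binary cross-entropy (log) loss commonly minimised by logistic regression; a minimal Python illustration, not the code used in this coursework, and the sample labels and predictions are made up.

```python
import math

def log_loss(y_true, y_pred):
    """Average binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Confident wrong predictions are penalised heavily, confident right ones lightly.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up example: two fairly confident correct predictions and one poor one.
loss = log_loss([1, 0, 1], [0.9, 0.2, 0.4])
```

The loss falls as predicted probabilities move toward the true labels, which is exactly the "estimated error" the training procedure drives down.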
Unsupervised learning:
Unlike supervised learning, unsupervised learning refers to machine learning where only the input data is known and there are no defined output variables. The purpose of unsupervised learning is to model the distribution of the data or to replicate its underlying structure. This type of learning enables algorithms to find and describe structure in the data, to learn more about the data itself (Brownlee, 2016). Unsupervised learning problems can be divided into two main groups, association and clustering, with dimensionality reduction as a further related family of problems.

Association problems:
These refer to discovering rules which describe large parts of the data. For example, an algorithm designed to find rules in data collected from customers can predict how people shop: customers who purchased product X are likely to buy product Y as well (Lee & Verleysen, 2010).

Clustering problems:
These refer to discovering the inherent groups in the data. For example, an algorithm designed to group customers based on their previous orders (Hartigan & Wong, 1979).

Popular types of unsupervised learning algorithms:
• Apriori and Eclat are used to discover rules in association rule learning problems
• K-means, Expectation Maximisation (EM), k-medians and hierarchical clustering are used for clustering problems
• Quadratic Discriminant Analysis (QDA), Principal Component Analysis (PCA) and Sammon mapping are some examples of algorithms used for dimensionality reduction problems. Dimensionality reduction algorithms can also be adapted to supervised learning methods (Roweis & Saul, 2000) (Agrawal & Ramakrishnan, n.d.).

3. Methodology:

3.1 Chosen Techniques:

Carol Chachati C.C: Logistic Regression
Logistic Regression finds the probability of one category being more likely than another to be the outcome, based on the input independent variables. The number of categories is limited to 2, so a classification is either one category or the other.
(BioMedware, 2014) The algorithm is based on linear regression, adapted for classification problems, as shown in Figure 1.

Figure 1 Logistic Regression based on Linear Regression (Sammut and Webb, 2011)

In Figure 2, B are the coefficients of the x values, which are the independent variable values. (Sammut and Webb, 2011)

Figure 2 Logistic Regression based on Linear Regression (Sammut and Webb, 2011)

The logistic regression model differs from linear regression in that it is used not for regression problems but for classification problems. It uses the log odds of the input data, creating a linear relationship from them. The letter P at the front stands for the probability of 1 or 0, such as heart disease or not heart disease. This differs from the function used in linear regression (Figure 1), which gives a probable value based on the data: for example, if a house has a certain number of rooms, the prediction is how much each individual house costs, and there is more variance in the predicted values than a binary output. (Stanford University, no date) (Sammut and Webb, 2011)

The equation used for the logistic regression model is shown in Figure 3.

Figure 3 Mathematical Logistic Regression equation (BioMedware, 2014)

In this equation P, the probability, is the logit of the odds of an event, i.e. whether one category or the other is what will happen, as shown in Figure 4 (faculty.cas.usf.edu, 2016). The regression coefficients give the x variable values their weight in the probability of the event outcome; the x values represent all the independent variables, up to the last x variable, used in calculating the probability. All of this is summed, as indicated by the summation sign, with M signifying the set of data inputted. (BioMedware, 2014) (Statistics Solutions, 2016)

Figure 4 Logistic Regression equation in words (Statistics Solutions, 2016)

Maximum likelihood:
Logistic Regression works on the idea of maximum likelihood: a predicted value can vary between 0 and 1, and maximum likelihood calculates the best way of producing it, as shown by the logit form of the logistic regression model in Figure 5. (Altman, Gill and McDonald, 2004)

Figure 5 Logit form of Logistic Regression (Altman, Gill and McDonald, 2004)

The first derivation done by maximum likelihood is shown in Figure 6, in which maximum likelihood tries to find the best B coefficients. The better the coefficients are fitted, the better the output prediction; if the coefficients are not worked out properly, the model will not accurately predict whether the output is 0 or 1. (Altman, Gill and McDonald, 2004)

Figure 6 Derivation by maximum likelihood (Altman, Gill and McDonald, 2004)

Decision boundary:
Since predictions vary between 0 and 1, a decision boundary has to be decided for the data. For example, with a decision boundary of 0.5, anything with a value of 0.5 or below is classed as 0 and anything with a value higher than 0.5 is classed as 1. Since datasets and problems vary, it is important to choose a suitable decision boundary: with a well-chosen boundary, test data that has been classified well is given a better accuracy than it would be under a different boundary, and vice versa. (Stanford University, no date)

Convergence and the data:
In order for predictions to be good, Logistic Regression must converge. Convergence is when the algorithm reaches a global maximum and predicted values are produced around this only, which is a good sign that the algorithm was able to find the best way of predicting the odds of the classifier in the model.
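The model and decision boundary described above can be sketched as follows; a minimal Python illustration in which the coefficient values and features are made up for the example, not taken from the coursework model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(coefficients, intercept, features, boundary=0.5):
    """Linear combination of features passed through the sigmoid,
    then thresholded at the decision boundary."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    p = sigmoid(z)
    return (1 if p > boundary else 0), p

# Hypothetical coefficients for two features (e.g. age, resting blood pressure).
label, prob = predict([0.04, 0.02], -4.0, [55, 140])
```

Moving the `boundary` parameter up or down trades false positives for false negatives, which is why the text stresses choosing it to suit the dataset.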
If a global maximum is not found and the algorithm instead finds local maximum points (although this should not happen, since the log-likelihood should be concave, i.e. a bell-shaped graph), values may be predicted around a local maximum and therefore more ambiguously. Convergence can also fail because of the way the data is split into training and testing sets. If the data is not randomised, or data splitting or cross-validation are not used, the model will not be able to support a variety of different data values. (Altman, Gill and McDonald, 2004)
C.C – end

Myriam Marlin M.M: K-means clustering
The K-means clustering approach is the most commonly used method for cluster analysis. It is a simple algorithm designed to solve clustering problems by classifying a given dataset into k clusters fixed a priori. The principle of the algorithm is to assign each data point to the closest of the k centroids. Once all data points are assigned, the algorithm recalculates the new centroids as the average of the data points in each cluster; for example, centroids can be p-length mean vectors, where p represents the number of variables. The process is repeated until no observation changes cluster.

Advantages of K-means clustering:
K-means clustering has many advantages:
• It is easy to understand as well as robust
• It is fast and efficient
• It provides the best results when the groups in the dataset are distinct and well separated from one another

K-means clustering is designed to minimise a squared-error function, also known as the objective function:

J = sum_{j=1..k} sum_{i=1..n} || x_i^(j) - c_j ||^2

where || x_i^(j) - c_j || represents the Euclidean distance, the chosen distance between a data point and its cluster centre, and n represents the number of data points assigned to each of the k cluster centres.
M.M – end

3.2 Reasons for comparing the algorithms:
The coursework requires choosing two AI techniques for comparison, one supervised and one unsupervised.
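One assign-and-update pass of the K-means procedure described above can be sketched as below; a minimal Python illustration on made-up 2-D points, not the coursework implementation (which uses R's kmeans).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_step(points, centroids):
    """One K-means iteration: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        clusters[nearest].append(p)
    new_centroids = []
    for j, cluster in enumerate(clusters):
        if cluster:
            dims = zip(*cluster)
            new_centroids.append(tuple(sum(d) / len(cluster) for d in dims))
        else:
            # Keep an empty cluster's centroid where it was.
            new_centroids.append(centroids[j])
    return new_centroids, clusters

# Two well-separated made-up groups, k = 2.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans_step(pts, [(0, 0), (10, 10)])
```

Repeating `kmeans_step` until the assignments stop changing is the full algorithm; each pass can only decrease the objective J above.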
Choosing logistic regression as the supervised technique was based on the algorithms available in Weka for the dataset; it had a higher prediction accuracy in Weka than some of the other algorithms available to try. The technique is also suitable for binary classification problems with several independent variables.

The K-means clustering algorithm was selected for the comparison because it too is a technique suitable for binary classification, and because of the tooling available to apply it. The algorithm can be run in RStudio. Unlike Weka, the R programming language, like Python and MATLAB, can import libraries and machine learning packages. Weka has its own ML package and is less flexible than other tools and software used for data exploration and statistical analysis. Like Python, R provides the freedom to transform, clean and explore datasets, as well as offering ways to tweak and tune the algorithms. Even though Weka is an education-oriented tool, it is less capable for data science and offers less room to improve coding skills. Moreover, R offers comprehensive facilities for the partitioning-clustering methodology.

4. Simulations:

4.1 Introduction to the dataset
The dataset selected to test the two algorithms used to compare supervised and unsupervised machine learning was found on the University of California, Irvine (UCI) repository, a website that hosts a large collection of datasets. The dataset used for both simulations is available at: http://archive.ics.uci.edu/ml/datasets/Statlog+(Heart). It is a heart disease dataset which contains 270 instances and 14 variables. The class is the presence (2) or absence (1) of heart disease. The data types are numerical values and the attributes are a mix of real, ordered, binary and nominal types. The attribute information is as follows:
1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholestoral in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) coloured by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The class attribute is: @attribute class {1 2}, i.e. absent or present.

4.2 Reasons why the particular software or tool was chosen:
RStudio and R were used instead of Weka and MATLAB. This is because R is easy to understand and is used in industry; it provides an easy-to-use interface and is much quicker to figure out. MATLAB, while it has good classification facilities, has an overly flexible encoding whose possibilities are harder to put together. As for Weka, it was hard to understand the classification output, and while there is a command-line interface for Weka, ready information on how to encode algorithms using it was scarcer than for R and MATLAB. There were also many terminologies to understand in order to interpret the output. Furthermore, among the standard algorithms available to try on the data, multinomial logistic regression was offered alongside logistic regression, but if the encoding did not produce the same values as the classification of these standard algorithms, it would come down to which classification calculations were used, and this was difficult to find out. Algorithms could fit the dataset without necessarily making sense for it: multinomial logistic regression, for example, was available, but there are only 2 categories in the heart disease dataset, so it did not seem right to use an algorithm that normally takes more than 2.

4.3 Input encoding / input representation:

C.C: Logistic Regression
First the data from the UCI repository is converted to CSV format in Excel so it can be read into R. The class is made into a factor so that R can split up class1, the response variable, into categories it will understand: for the program, 1 is for present and 0 for absent. 'contrasts' is used to confirm this. (Alice, 2015)

The contrast output in Figure A shows Present coded as 1. So Present is represented as 1 in R, while absence of heart disease is represented as 0.

Figure A contrast output (Chachati, 2017)

Code reference (Alice, 2015) (Rai, 2015)

The data is randomised; seeding produces the same training and testing sets every time the script is run, with a split of 70% for building the model and 30% for testing. (Alice, 2015) (Rai, 2015)

Using themodel <- glm(class ~ ., data = tdata), all 13 independent variables are included. 'glm' is the function used to do logistic regression, choosing the binomial family for binary logistic regression. A prediction model is created.

Coefficient analysis: any variable with stars next to it has a significant effect on predictability, so any variable with no star can be taken out of the model later on. (Alice, 2015) (Rai, 2015)

summary(themodel) output. Code reference (Alice, 2015) (Rai, 2015)

You can see that in the 'glm' function I then took out some of the 13 independent variables, leaving in 8, since in the earlier model summary the coefficients showed that the left-out variables did not have much significance in predicting the response outcome. With just 8 variables specified, you can see that most of the remaining variables have a significant impact on the final prediction response. (Alice, 2015) (Rai, 2015)
C.C – end

M.M: K-means clustering
Once the dataset is obtained, the first thing to do is to prepare it so it can be used in R. To do so the dataset is opened in Notepad++ and saved as a text file, which is then opened in Excel to add the column labels before saving it in Comma Separated Values (CSV) format.
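The class encoding and the seeded 70/30 split described above can be sketched as follows; a minimal Python illustration (the coursework itself uses R), with made-up instance values, mapping class labels 1/2 to 0 (absent) / 1 (present).

```python
import random

def encode_class(label):
    """Map the dataset's class labels (1 = absent, 2 = present) to 0/1."""
    return 0 if label == 1 else 1

def seeded_split(rows, train_frac=0.7, seed=123):
    """Randomise the rows with a fixed seed, then split 70% train / 30% test.
    The fixed seed makes the same split reproducible on every run."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# Ten made-up (features, class) instances.
data = [([60, 130], 2), ([45, 120], 1)] * 5
labelled = [(x, encode_class(c)) for x, c in data]
train, test = seeded_split(labelled)
```

Running the script twice yields identical train/test sets, which is the role set.seed plays in the R version.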
The dataset can then be imported into RStudio using the import dataset button in the environment panel.

Fig 1. Importing dataset to Rstudio

In order to obtain the same result every time a function is applied to the dataset, we use set.seed(1234), which fixes the seed of R's random number generator so results can be reproduced. The data's variables also need to be rescaled for better comparability. To do so, the commands below are used in R:

#Prepare Data
heart1 <- na.omit(heart1) # delete missing data
heart1 <- scale(heart1)   # standardise variables

As the K-means algorithm is designed to group the observations into clusters, we need to remove the class column from the original dataset. To do so we create a copy of the dataset with the class column set to NULL; when the head of the dataset is displayed on the console, the class column is removed.

Fig 2. Remove the class column from dataset in Rstudio

As the use of the k-means algorithm in R requires us to specify how many clusters should be extracted, we can use a method called the "elbow" method, which consists of looking for a significant elbow (bend) in the plot of the sum of squared errors (SSE). The result obtained when the function was run shows that the number of clusters for the dataset is 2, as shown below.

Fig 3. Result of plot to suggest the number of cluster for kmeans in Rstudio

Once the algorithm has run, each observation present in the dataset is assigned to a cluster, e.g. observation number 5 is placed in cluster 2 based on most of its data being close to the cluster means calculated for each variable of cluster 2. The results also display the within-cluster sum of squares by cluster, which represents the squared distances of the data points to their cluster mean, summed over the points the cluster contains. The available components are also displayed in the results and can be used to carry out further analysis, such as viewing the results by size or by cluster, using the commands illustrated in Fig 5.
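The effect of R's scale() on each variable (centre to mean 0, divide by the standard deviation) can be sketched as below; a minimal Python illustration using the sample standard deviation (n − 1 divisor) as R does, on made-up values.

```python
import math

def zscore(values):
    """Standardise one variable: subtract the mean, divide by the sample sd."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Made-up resting blood pressure readings.
scaled = zscore([120, 130, 140, 150, 160])
```

Rescaling matters for K-means because the Euclidean distance would otherwise be dominated by the variables with the largest raw ranges.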
We can also use the two values of the class, 1 (absent) and 2 (present), to confirm setting the number of clusters to two. We can now apply the kmeans function to the dataset as shown below.

Fig 4. K-means results in Rstudio

The returned results display the cluster sizes (230, 40) as well as the means calculated for each variable of the dataset. The clustering vector shows the cluster of each observation.

Fig 5. Results based on size and cluster components in Rstudio

5. Results obtained:

C.C 5.1 Logistic Regression results
For the logistic regression these are the results obtained.

Prediction output: the confusion matrix and the misclassification error. Code reference (Alice, 2015) (Rai, 2015)

The class output has 2, i.e. Present, coded as 1. So Present is represented as 1 in R, while absence of heart disease is represented as 0. For the matrix, 34 instances were correctly classed absent and 5 were incorrectly classed present, while 6 were incorrectly classed absent and 29 were correctly classed present. (DBD, 2017)

The misclassification error is then calculated. It shows an output of 0.1486486, the proportion incorrectly classified, which means the accuracy was 0.8513514, i.e. approximately 85%, so the logistic regression worked well.
C.C – end

M.M 5.2 K-means clustering results
To provide a deeper analysis of the result, we produced a table comparing the clusters to the class variable defined in the original dataset. This table, known as a confusion matrix, allows us to analyse the performance of the classifier (classification model) against the true values.

Fig 6. Confusion matrix table from Data School

The confusion matrix uses the following terminology:
true positives (TP) represent the patients we predicted as having heart disease where the actual data shows they do have the disease.
true negatives (TN) represent the patients we predicted as disease free who are indeed heart disease free.
false positives (FP) represent the patients we predicted as having heart disease who are disease free. This is also referred to as a Type I error.
false negatives (FN) represent the patients we predicted as disease free who have been diagnosed with heart disease (Data School, 2017).

Based on the confusion matrix table in Fig 6, we have two predicted classes in the dataset: "0" for the patients that do not have heart disease and "1" for patients diagnosed with heart disease. The classifier covers a total of 270 observations of patients tested for heart disease. Now let us create the confusion matrix:

Fig 7. Confusion matrix in Rstudio

The confusion matrix shows that 23 patients were placed in cluster 2 (present) when they did not have the disease, and 103 patients with the disease were assigned to cluster 1 (absent). From the confusion matrix, the usual rates can be computed, such as accuracy and the misclassification rate:

Accuracy, the proportion of the time the classifier was correct: (TP+TN)/Total = (17+127)/270 ≈ 0.53
Misclassification rate, how often the classification was wrong: (FP+FN)/Total = (23+103)/270 ≈ 0.47

Based on the calculated rates, the misclassification rate is far too high for the size of the dataset. Overall, the performance of the classifier is poor.

Next, we can also produce graphs using the plot command to allow a visual comparison of the results:

> plot(heart[c("resting_BP","serum_cholestoral")], col = heart$class)
> plot(heart[c("resting_BP","serum_cholestoral")], col = results$cluster)
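The rates discussed here can be checked directly from the confusion-matrix counts; a small Python sketch using the counts reported in this section (k-means: TP=17, TN=127, FP=23, FN=103; logistic regression test set: TP=29, TN=34, FP=5, FN=6).

```python
def rates(tp, tn, fp, fn):
    """Standard confusion-matrix rates (Data School terminology)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "tp_rate": tp / (tp + fn),   # sensitivity
        "fp_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }

kmeans_rates = rates(tp=17, tn=127, fp=23, fn=103)   # accuracy ≈ 0.53
logistic_rates = rates(tp=29, tn=34, fp=5, fn=6)     # accuracy ≈ 0.85
```

Note that accuracy and the misclassification rate always sum to 1, a quick sanity check on any reported pair of values.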
For this purpose, we compare graphs of the resting blood pressure variable against the serum cholesterol variable, coloured first by the class attribute of the original dataset and then by the clustering results, using the plot commands shown above.

Fig 8. Plots of serum cholesterol vs resting BP. Top graph based on the class of the original dataset; bottom graph based on the cluster results.

The graph comparison allows us to notice that the results are very similar in shape; however, some observations overlap and were assigned to the wrong cluster. These discrepancies could be caused by the Euclidean distance weighting the underlying factors unequally. Moreover, cluster centres that are chosen randomly can lead to the wrong result.

6. Critical Analysis of results

6.1 Comparison of results:
The K-means worked on most of the dataset at once, while logistic regression built a model on a training set and tested it separately. Both algorithms produced output, but logistic regression turned out better: its accuracy was 85% (misclassification error 15%), while the k-means misclassification rate was over 40%. Both algorithms had their pros and cons: k-means had overlapping points in the classification, which contributed to error, while logistic regression's error depended on the coefficient values. Across accuracy, misclassification rate, TP rate, FP rate, specificity, precision and prevalence, logistic regression was better.

K-means confusion matrix and rates (cluster 1 = absent, cluster 2 = present):

                 1 (absent)   2 (present)   Total
actual absent    TN = 127     FP = 23       150
actual present   FN = 103     TP = 17       120
Total            230          40            270

Accuracy = (TP+TN)/Total = (17+127)/270 ≈ 0.53
Misclassification rate = (FP+FN)/Total = (23+103)/270 ≈ 0.47
TP rate = 17/120 ≈ 0.14
FP rate = 23/150 ≈ 0.15
Specificity = TN/actual no = 127/150 ≈ 0.85
Precision = TP/predicted yes = 17/40 ≈ 0.43
Prevalence = actual yes/Total = 120/270 ≈ 0.44
(Data School, 2017)

Overall, the k-means algorithm applied to the dataset for clustering analysis did not provide satisfactory results, as the performance of the classifier is poor. The misclassification rate is too high to conclude that the k-means technique was a success. Moreover, comparing the rates obtained from the k-means algorithm and the logistic regression algorithm shows that the latter had better accuracy and misclassification rates than k-means.
M.M – end

Logistic regression confusion matrix and rates (test set of 74):

                 1 (absent)   2 (present)   Total
actual absent    TN = 34      FP = 5        39
actual present   FN = 6       TP = 29       35
Total            40           34            74

Accuracy = (TP+TN)/Total = (29+34)/74 ≈ 0.85
Misclassification rate = (FP+FN)/Total = (5+6)/74 ≈ 0.15
TP rate = 29/35 ≈ 0.83
FP rate = 5/39 ≈ 0.13
Specificity = TN/actual no = 34/39 ≈ 0.87
Precision = TP/predicted yes = 29/34 ≈ 0.85
Prevalence = actual yes/Total = 35/74 ≈ 0.47
(Data School, 2017)

7. Conclusion
The two algorithms classified the data in different ways: one used the whole dataset, while the other split the data and made predictions from a model built on training data. If I were to do the coursework again, I would run more tests including and excluding variables, with the same comparisons for k-means. This would give a better overview of which algorithm worked better for the dataset. I would also have tried other things, such as cross-validation and different types of classifiers to test the model, and the same for k-means.
C.C – end

8. References:

Agrawal, R. and Ramakrishnan, S. (no date) Fast Algorithms for Mining Association Rules. San Jose: IBM Almaden Research Center.

Alice, M. (2015) How to perform a logistic regression in R. Available at: https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/ (Accessed: 6 January 2017).

Altman, M., Gill, J. and McDonald, M. (2004) Numerical issues in statistical computing for the social scientist. 1st ed. Hoboken, NJ: Wiley, pp. 238-248.

BioMedware (2014) A general equation for logistic regression. [image] Available at: https://www.biomedware.com/files/documentation/spacestat/Statistics/Multivariate_Modeling/Regression/logistic_gen2.jpg (Accessed: 30 December 2016).

BioMedware (2014) BioMedware SpaceStat Help - About Aspatial Logistic Regression. Available at: https://www.biomedware.com/files/documentation/spacestat/Statistics/Multivariate_Modeling/Regression/About_Aspatial_Logistic_Regression.htm (Accessed: 30 December 2016).

Brownlee, J. (2016) Supervised and Unsupervised Machine Learning Algorithms. Available at: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ (Accessed: 30 December 2016).

Cord, M. and Cunningham, P. (2008) Machine learning techniques for multimedia: case studies on organization and retrieval. Berlin: Springer (Chapters 2 & 3).

Data School (2017) Simple guide to confusion matrix terminology. Available at: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/ (Accessed: 3 January 2017).

DBD, U. (2017) Confusion Matrix. Available at: http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html (Accessed: 5 January 2017).

faculty.cas.usf.edu (2016) Logistic regression. Available at: http://faculty.cas.usf.edu/mbrannick/regression/Logistic.html (Accessed: 5 January 2017).

Hartigan, J. A. and Wong, M. A. (1979) 'Algorithm AS 136: A K-Means Clustering Algorithm', Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), pp. 100-108.

Lee, J. A. and Verleysen, M. (2010) 'Unsupervised Dimensionality Reduction: Overview and Recent Advances'. Barcelona: WCCI 2010 IEEE World Congress on Computational Intelligence.

Rai, B. (2015) Logistic Regression with R: Categorical Response Variable at Two Levels. [video] Available at: https://www.youtube.com/watch?v=xrAg3FLQ0ZI&feature=youtu.be (Accessed: 1 January 2017).

Roweis, S. and Saul, L. K. (2000) 'Nonlinear Dimensionality Reduction by Locally Linear Embedding', Science.
Sammut, C. and Webb, G. (2011) Encyclopedia of Machine Learning. 1st ed. New York: Springer, p. 631.

Statistics Solutions (2016) Logistic regression. Available at: http://www.statisticssolutions.com/regression-analysis-logistic-regression/ (Accessed: 5 January 2017).

Stanford University (no date) Supervised Learning [MOOC]. Available at: https://www.coursera.org/learn/machine-learning/supplement/NKVJ0/supervised-learning (Accessed: 9 November 2016).

Stanford University (no date) Unsupervised Learning [MOOC]. Available at: https://www.coursera.org/learn/machine-learning/lecture/olRZo/unsupervised-learning (Accessed: 9 November 2016).

ur Réhman, D.S. (2016) Data Driven Machine Learning (Part 1). Available at: https://moodle.uel.ac.uk/pluginfile.php/789808/mod_resource/content/3/machine%20learning%20intro-lec%201.pdf (Accessed: 9 November 2016).