Activity Recognition Using, Smartphone Based, Accelerometer Sensors

by Ricardo Emanuel Gouveia da Costa Cachucho

Thesis of MADSAD on Data Mining

Supervised by João Mendes Moreira and João Gama

2011

I was born in Madeira in 1986 and lived there until 2004, when I moved to Porto to study economics and then Data Mining at the Faculty of Economics of Porto (FEP). In Madeira I took part in several groups connected to outdoor activities: I was a member of the first paintball team in Madeira and a member of MadeiraEcoChallenges, a group created to organize eco-friendly activities such as bike tourism (Porto-Algarve, 2006 edition), trekking (crossing Madeira on foot in 2008) and other activities related to nature. During my bachelor's studies in Porto I developed a particular will to learn how to model complex problems, and that was one of the major reasons for joining the master's in Data Analysis and Decision Support Systems. During this period I have also been a member of the institution's choir, eCOROmia, a great school in terms of human relations and public performances. It is also with great pleasure that I am a member of the Pedagogic Board of FEP, having had the opportunity to take part in very important implementations at FEP in this past year and a half. After almost 7 years living in the beautiful city of Porto I moved to Leiden, where I plan to stay for the next years to improve my knowledge of Data Mining with the Pattern Recognition group of LIACS while living the Dutch dream: "going by bike to work".

(a) Master's in Data Analysis and Decision Support Systems (b) Field of Knowledge: Data Mining

Acknowledgments

Writing the acknowledgments is the nicest part! A lot of people helped me in this work, directly and indirectly, and I will try to thank them all even if they don't appear here. This work is the result of a partnership between the master's programme I took part in (MADSAD) and LIACS. I have to start by thanking Professor Carlos Soares for pushing this partnership forward.
To both my supervisors, Professor João Gama and João Moreira, I am thankful for their patience, encouragement and sharing of knowledge during this last year; I hope we can continue this relationship in the future, improving this work. To Arno Knobbe, my supervisor in Leiden, a special thanks for receiving me, making me part of your group, encouraging the sharing of knowledge and giving me an opportunity for my life; I hope our expectations come true in the near future. To all the rest of the guys in the Pattern Recognition group of LIACS, Ugo Vespier, Shenfa Miao, Marvin Meeng, Joaquin Vanschoren and the unpronounceable Wouter Duivesteijn, thank you for your daily support in sharing knowledge, brainstorming and reviewing my work. To my group of friends in Madeira, Porto and Leiden: you make my life enjoyable; thank you for inspiring me and making me believe in the future. A special thanks to Irene, who has been by my side, loving me, listening and inspiring my work; being next to you is my daily inspiration. Last but not least, to my family, who have supported me in every way possible: Alexandra, Miguel, Cristina and Manuel, you have been my support in Porto in these last years; Mother Cecília and Father Cachucho, to you I owe almost all I have done until now. The family you have built is admirable, and I thank you for that. And finally to little João Tomé: you were born during one night of writing and reviewing this thesis, and when I learned that you were so perfect I found the strength to work several days without sleeping; so young and yet already so inspiring!

Resumo (Summary)

Activity Recognition is a research area that aims to monitor individuals efficiently and non-intrusively. One of the most discussed approaches is to have individuals wear sensors, for example accelerometers, which collect information for building models and subsequently classifying unlabeled data. The information from this modeling can serve applications in social interaction, marketing or health services. In healthcare, sensors are already widely deployed to capture information, but non-intrusive applications that produce up-to-date results can help with disease prevention, health education (through recommender systems), alert systems for elderly people who wish to lead an independent life but are at high risk of accidents, or monitoring of patient recovery. The goal of this work is to create a general model that can be applied in a mobile environment (smartphones). Proposals for this problem already exist in the literature, but none considers the problems of phone orientation during data collection. In addition, we propose to interpret this problem as a hierarchical classification problem, in which we start from the more general characteristics of the activities (passive or active activities) and move to the more specific ones (walking, running, sitting, standing still). The experiments are designed to show that knowledge about the problem increases through hierarchical learning and that it is possible to obtain better results than with the usual approach to classification problems. We also try to combine the hierarchical approach with the construction of features that can deal with the orientation problems resulting from the rotation of the sensors used to capture the data.

Abstract

Activity Recognition is the task of predicting which activities are taking place at a certain moment when considering only one individual user.
A common approach to this problem is to use body-worn sensors to collect data. The development of small, inexpensive sensors made this area a reality, and most approaches propose the use of accelerometer sensors. For example, when considering the two biggest smartphone operating systems, all their available smartphones have a built-in triaxial accelerometer. With models for activity recognition it is possible to build new applications for social interaction, marketing or healthcare systems. Focusing on the healthcare domain, it is easy to see that this is one of the areas that already relies on sensing. Still, there is the opportunity to create applications that can help, in an efficient and less expensive way, with the recovery of patients, preventing diseases, educating individuals for a healthier life or monitoring elderly people who want to live independently. The main goal of this thesis is to create a generalized model to be applied in a mobile environment. There are some proposals in the literature for this problem, but they don't consider the orientation problem of the device collecting data. Rotation of the sensor needs to be taken into account when collecting data from smartphones. We also found that it is possible to interpret this problem as a hierarchical model, where the classes can be described from more general characteristics down to lower levels of classification with more specific information. The purpose of the experiments is to show the qualities of the hierarchical approach for the problem of activity recognition, and to develop features that are able to deal with orientation problems, in order to enable the application of this model in a mobile environment.

Contents

Chapter 1: Introduction
1.1 Our Problem: Sensor-based, single-user activity recognition
1.2 Motivation
1.3 Challenges
1.4 Contributions
1.5 Outline
Chapter 2: Related Work
2.1 Single User, Sensor-based Activity Recognition
2.2 Feature Transformation
2.2.1 Feature Transformation for Activity Recognition
2.3 Modeling
2.3.1 Hierarchical Classification
Chapter 3: Methodology
3.1 Data Mining Overview: CRISP-DM
3.2 Features Construction and Selection
3.2.1 Sliding Windows
3.2.2 The orientation problem
3.3 Modeling
3.3.1 Learning Algorithm
3.3.2 Hierarchical classification
3.4 Evaluation
3.4.1 Model Selection and Assessment
3.5 Conclusion
Chapter 4: Data Analysis
4.1 Data Collected
4.2 Data Description
4.2.1 Time Series
4.2.2 Non Temporal Analysis, Boxplots
4.2.3 Non Temporal Analysis, Scatterplots
4.2.4 Non Temporal Analysis, Pearson Correlations
Chapter 5: Experiments and Modeling
5.1 Experimental Setups
5.2 Baseline Experiments
5.2.1 All Activities Classification
5.2.2 Active Activities Classification
5.2.3 Passive Activities Classification
5.3 First set of Statistical Measures
5.3.1 All Activities
5.3.2 Active Activities
5.3.3 Passive Activities
5.4 Hierarchical Classification
5.4.1 Sub-grouping into Passive and Active Activities
5.5 The Orientation Problem
5.5.1 Implementing 1st Derivatives
5.6 Model Assessment
5.6.1 Assessment of the Flat Classifier
5.6.2 Assessment of the Hierarchical Classifier
5.7 Discussion
Chapter 6: Conclusions and Future Work
References

Chapter 1: Introduction

In this chapter, we give an overview of the topic discussed in this thesis. We start with a definition of the central problem of activity recognition, and motivate this work in terms of the interactions between man and machine. We then present the main challenges in activity recognition, as inspired by our initial investigation and analysis of the relevant literature. In Section 1.4, we present the main contributions of this thesis to single-user activity recognition, with a particular focus on triaxial accelerometers (sensors that measure acceleration in three perpendicular directions). We conclude with an outline of the remainder of this thesis in Section 1.5.

1.1 Our Problem: Sensor-based, single-user activity recognition

Activity recognition is the task of recognizing the actions of individuals, based on the limited information from sensors on and around the individual. We can identify two main sources of information for this recognition task: environmental inputs (e.g. Wi-Fi, vision-based recognition, or other sensors placed in the environment [1]) or body-worn sensors (smartphone inputs, networks of wearable sensors). In this thesis, we will focus on this second source of input.
We will specifically consider the applicability of a practical mobile device already in use by the large majority of the population in developed countries: the mobile phone, and more specifically the smartphone. This choice makes sense because smartphones typically incorporate a triaxial accelerometer. An accelerometer is a sensor that measures the force acting upon it, be it from physical acceleration or from the Earth's gravity. Most accelerometers measure acceleration along either 2 axes (X and Y) or 3 axes (X, Y and Z). The majority of smartphones are fitted with a triaxial accelerometer (three axes), since the two most used smartphone software platforms (Android and iPhone) require this type of sensor. The accelerometer is used by the phone's operating system to perform orientation-sensitive tasks (such as rotating the screen to match the view of the user), as well as by various applications installed on the phone. With this useful capability already built into the phone, our aim is to 'abuse' the sensor for the task of continuously capturing the forces acting on the phone due to the owner's activities. The data thus produced will be the starting point for training an automated system to recognize the person's activity from acceleration data. To obtain such a system, we build upon the large body of work in the Data Mining field [references]. The input data considered for this mining exercise will be streaming data from accelerometer sensors collected from smartphones running Android (although the proposed method could work just as well on other smartphones).

More formally, the definition of our initial problem is: from a stream of accelerometer data S = {s0, …} with si = (xi, yi, zi), try to predict one activity from a predefined set of activities A = {a1, …, ak}.
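
As a rough illustration of this definition, the sketch below represents a sample si as a small record and the set A as an enumeration of the four activities considered in this thesis. The names `Sample`, `Activity`, `Classifier` and `recognize`, and the choice of non-overlapping batches of a fixed size, are assumptions made here for illustration only and are not taken from the thesis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Iterator, List


class Activity(Enum):
    """The predefined set of activities A."""
    WALKING = "walking"
    RUNNING = "running"
    STANDING = "standing"
    SITTING = "sitting"


@dataclass(frozen=True)
class Sample:
    """One accelerometer reading s_i = (x_i, y_i, z_i)."""
    x: float
    y: float
    z: float


# Hypothetical interface: any function mapping a batch of samples to one activity.
Classifier = Callable[[List[Sample]], Activity]


def recognize(stream: Iterable[Sample], classify: Classifier,
              batch_size: int) -> Iterator[Activity]:
    """Yield one predicted activity per full batch of consecutive samples."""
    batch: List[Sample] = []
    for sample in stream:
        batch.append(sample)
        if len(batch) == batch_size:
            yield classify(batch)
            batch = []  # assumption: batches do not overlap
```

Any trained model can be plugged in as `classify`; samples left over at the end of the stream that do not fill a batch are simply not classified in this sketch.
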
In our experiments, the set of activities covers simple activities from everyday life: A = {walking, running, standing, sitting}. Since the aim of this work is to find a generalized activity recognition model for simple daily-life activities such as walking, running, standing and sitting, the hypothesis tested in this thesis is whether exploiting the hierarchy of the classes can improve the model, evaluated with several measures of accuracy and efficiency. To start exploring our problem and this hypothesis, Section 1.3 presents the challenges I found to be important. Building on these challenges, Chapter 3 presents the modeling methodology, where some aspects of our hypothesis are justified.

1.2 Motivation

During the last decade, the evolution of inexpensive, wearable sensors such as accelerometers, GPS receivers, cameras and microphones, along with computational developments in hardware and software, has opened a new field of opportunities in the mobile applications domain. All of these sensors can easily be found in smartphones, which are widespread in almost all mobile telecommunications markets. This world of sensors is changing the paradigm of the human relationship with the machine. This new paradigm is what the literature calls Ubiquitous Computing or Pervasive Computing. Until now, our relationship with machines has been characterized by the fact that we receive a reaction, or a set of reactions, from a direct instruction that we give. For example, when we drive a car we give instructions to the car and it responds according to them, and when we type something into a computer we normally expect some result from it.
The development of activity recognition models is one of the steps needed to turn our actions into reactions from computers and other automated systems without our needing to give direct instructions, and often without our being aware of it. The most promising field in which to apply these activity recognition models is the healthcare domain, because of the high cost of monitoring people nowadays. We could automatically and constantly check the recovery progress of patients whose activities are constrained, or monitor elderly people who want to live independently but still need some monitoring because their risk of falling is high. Another possibility is the implementation of recommender systems to discourage sedentary lifestyles. This enormous field of application, along with the need that societies have to make healthcare systems more efficient, should be enough for activity recognition to be considered a hot topic. There are also opportunities to enrich activity recognition models with information about networks of people, businesses and places. By classifying activities, locating users and crossing this information with networks, it is possible to create efficient connections. One could imagine connecting people with people, people with services and people with places through a ubiquitous computing system. Systems in which machines adapt to the characteristics of the people interacting with them have such potential that I believe they will be developed within a short period of time. On the other hand, this area is also raising discussions about the privacy of the subjects being monitored, and therefore there is a need to develop an ethical methodology for Ubiquitous Computing.
One good reason to do this is that societies are more and more aware of these systems, and a lack of confidence in the systems' privacy can be a setback for the development of this area. Although I believe that ethical methodological standards for this area must be discussed, it is not my aim to discuss this topic in this thesis.

1.3 Challenges

In this section we give an overview of the challenges that arise when facing a single-user activity recognition problem built from accelerometer data.

What kind of transformations can be applied to the raw data streams in order to classify the activities? Feature construction and selection is a much discussed area in the Data Mining literature and one of the biggest discussions within the topic of activity recognition. The approaches taken so far can be found in Section 2.2.

Is the orientation of the smartphone important for a good classification? Yes! It is possible to detect the orientation of the device because the accelerometer can detect the direction of Earth's gravity. With that in mind, we will build a classifier for smartphones that can be placed in any orientation in the pants pocket. We therefore face the problem of which features to create in order to achieve good accuracy for the classifier. Figure 1-1 shows the possible axis rotations inside a pocket.

[Figure 1-1: Example of a smartphone rotation.]

What kind of activities can be recognized using accelerometer data? Some works have considered many basic daily activities [2, 3] (walking, running, standing, sitting, walking upstairs, walking downstairs…), but for a start I will only consider simple activities that can be classified at any time: walking, running, standing and sitting.

Do different placements of the smartphone change the signal pattern of the accelerometer for a single activity? Until now, all works have considered only one specific placement and orientation for the accelerometer [3, 4] or network of sensors [2, 5]. In the case of data collected from phones, they all consider the pants pocket to be the best place, because the reference work [2] claims that the sensor placed on the hip is the most informative of the network of sensors used, and also because it is a typical location for a smartphone.

How can we collect a sample that allows inference about a normal population? If we think about sampling, then to build a general activity recognition model we need to think about the placement of the device and also about whether the subjects are representative of the population (for example, are the patterns of activities the same for males and females?). The fact is that it is very difficult to collect data from many people: it either requires resources to pay subjects, or it can be a very long process to find candidates who will actually carry out the collection.

What kind of model has a good tradeoff between accuracy and computational cost, considering mobile devices? Section 3.3.1 discusses which characteristics a learning algorithm should have in order to balance the tradeoff between accuracy and performance.

Which kind of learning procedure can we develop if we have only a small amount of labeled data? There is a very good work [6] suggesting that it is possible to train a model with 5% of labeled data. The problem I find here is that those 5% must be representative of the population's activity patterns, so we should start by searching for a good generalized model and only then move to semi-supervised learning.
This approach looks promising, considering that each model would fit its own user, which would bring improvements in accuracy and better information about each user.

Should we consider the temporal continuity of the activities? If temporal continuity is taken into account, the classifier could be improved by considering a chain of activities. Consider a sequence of five classifications, "walking", "walking", "running", "walking", "walking": the probability that "running" is a misclassification is high, and temporal continuity could help to solve this problem.

1.4 Contributions

The claim of this thesis is that it is possible to create a simple generalized hierarchical classification model for activity recognition using a few time-domain features extracted from sliding windows. We build this model from a dataset collected partially for previous work in activity recognition [7] and partially for this research, whose inputs are the timestamp, the acceleration values on the X, Y and Z axes, and a label indicating which activity is taking place for each record. The framework for feature construction relies on knowledge about each activity considered, in order to capture the relevant characteristics that best discriminate between activities. We then present a modeling approach not previously used for activity recognition, which makes it possible to minimize the number of features needed for a good classification model. The state of the art in activity recognition proposes a flat classification model: a class is predicted directly from the transformed input data.
However, looking at these previous works, it is possible to see that most misclassifications fall into similar classes, for example walking upstairs, walking downstairs and regular walking (see the confusion matrices of Kwapisz et al. [3]).

[Figure 1-2: Example of accelerometer data for different activities. Title: All Activities. Panels: Active and Walking, Active and Running, Passive and Sitting, Passive and Standing. Vertical axis: acceleration in G force, from -2 to 2. Data streams: frequency = 20 Hz. Red = X axis; green = Y axis; yellow = Z axis.]

Considering the classification of four activities (walking, sitting, standing and running), it is possible to see, by visualizing the signal produced by the accelerometer for each activity, that we are facing two different kinds of upper-level activities: Passive (activities that don't involve movement: standing and sitting) and Active (activities that involve movement: walking and running). From this hierarchical conceptualization we developed a model that first classifies any activity as Passive or Active, and then applies a lower-level classification model: standing versus sitting when the activity is considered Passive, or walking versus running when it is classified as Active.

1.5 Outline

We briefly provide some background information in the next chapter. Section 2.1 introduces the related work on activity recognition based on sensor information. Section 2.2 describes some concepts and literature on a major task in activity recognition, feature transformation. Section 2.3 presents literature related to the following step in activity recognition: the implementation of learning algorithms in order to build a generalized model.
We also include a short section, 2.3.1, on hierarchical classification, which has never been applied to the problem we are dealing with. Chapter 3 explains in more detail the framework and methodologies proposed to address our problem. Section 3.1 describes some notions of the CRISP-DM standard model and how it helps to give an overview of what was developed during this work. The techniques for transforming the original data in order to build a good model are explained in Section 3.2, where we present the concept of sliding windows and how the construction of features is directly related to the orientation problem described in Section 1.3. Chapter 4 is dedicated to data analysis, as preparation for the chapter where we present the experimental procedures. In Section 4.1 we present the characteristics of the dataset initially collected. Section 4.2 starts by interpreting the dataset as a time series and then moves to a non-temporal analysis, since Section 3.2.1 proposes the use of sliding windows to extract timeless features. Chapter 5 summarizes the experimental setup and outcomes. We start the experimental procedures by presenting the baseline results and then move through Sections 5.2 to 5.5, a chain of related experiments that leads us to the selection of two models to compare in the model assessment of Section 5.6. The chapter closes with a brief discussion of some of the choices made during the experimental procedures or while studying methodologies and related work. Finally, we conclude our work in Chapter 6 with some conclusions and ideas for future work.
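
Before moving on to the related work, the two-level scheme described in Section 1.4 (Passive versus Active first, then the specific activity) can be sketched as below. This is only a minimal illustration under stated assumptions: the class name `HierarchicalClassifier`, the `Model` type and the string labels are hypothetical, and each of the three models is assumed to be any already trained function from a feature vector to a label.

```python
from typing import Callable, Sequence

Features = Sequence[float]
# Hypothetical interface: a trained model maps a feature vector to a label.
Model = Callable[[Features], str]


class HierarchicalClassifier:
    """Two-level classifier: Passive/Active first, then the specific activity."""

    def __init__(self, top: Model, passive: Model, active: Model) -> None:
        self.top = top          # predicts "passive" or "active"
        self.passive = passive  # predicts "standing" or "sitting"
        self.active = active    # predicts "walking" or "running"

    def predict(self, features: Features) -> str:
        """Route the feature vector through the upper level, then the lower one."""
        group = self.top(features)
        if group == "passive":
            return self.passive(features)
        return self.active(features)
```

One consequence of this design is that each lower-level model only ever has to separate two similar activities, so it may need fewer features than a single flat model over all four classes.
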

Chapter 2: Related Work

In this chapter we analyze the state of the art in activity recognition for single-user classifier models that take accelerometer data as input. The overview of related work on this topic can be found in Section 2.1. The data mining approach to activity recognition involves many techniques, such as feature transformation (feature construction and selection) and learning methods. Feature transformation is addressed in Section 2.2, first as a general data mining task and then focusing on what has been done in activity recognition. Finally, we move into the machine learning area in Section 2.3. Since it is impossible to give a full overview of the related work on this topic for the purposes I intend, the starting point for deciding which learning algorithms to consider is the characteristics of the dataset and their influence on that decision. This section also contains some references to the topic of mobile data mining.

2.1 Single User, Sensor-based Activity Recognition

The topic of single-user activity recognition based on accelerometer data is recent, but there have already been many approaches to it. Scientific conferences that can be used as references on this topic include UbiComp, ISWC, Pervasive, ACM SIGKDD and SensorKDD. The common approach to the single-user activity recognition problem [2-4, 8, 9] is a two-stage process. First, features are derived from the raw accelerometer data, extracted from batches of data (normally called sliding windows). Then one or more classifiers are applied to recognize the different activities considered in each work.
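
The two-stage process above can be sketched as follows. The function names, the window size and step, and the choice of per-axis mean and standard deviation as features are assumptions made for illustration. The second stage uses a nearest-centroid rule purely as a simple stand-in; it is not the classifier used in the cited works.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

Reading = Tuple[float, float, float]  # (x, y, z) acceleration


def sliding_windows(readings: Sequence[Reading], size: int,
                    step: int) -> List[Sequence[Reading]]:
    """Cut the raw stream into fixed-size windows; step < size gives overlap."""
    return [readings[i:i + size]
            for i in range(0, len(readings) - size + 1, step)]


def window_features(window: Sequence[Reading]) -> List[float]:
    """Stage 1: mean and standard deviation of each axis within one window."""
    features: List[float] = []
    for axis in range(3):
        values = [reading[axis] for reading in window]
        features += [mean(values), pstdev(values)]
    return features


def fit_centroids(X: List[List[float]], y: List[str]) -> Dict[str, List[float]]:
    """Stage 2 (training): average feature vector per activity label."""
    grouped: Dict[str, List[List[float]]] = {}
    for features, label in zip(X, y):
        grouped.setdefault(label, []).append(features)
    return {label: [mean(column) for column in zip(*rows)]
            for label, rows in grouped.items()}


def predict(centroids: Dict[str, List[float]], features: List[float]) -> str:
    """Stage 2 (prediction): label of the closest centroid (Euclidean distance)."""
    return min(centroids,
               key=lambda label: math.dist(centroids[label], features))
```

In practice the second stage would be replaced by whichever learning algorithm is under study; the point of the sketch is only that the classifier never sees raw readings, just one feature vector per window.
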
Bao & Intille [2] is the most referenced work on this topic and is cited in almost all the works published after 2004. They collected a sample from 20 subjects in an unsupervised way, using a network of sensors placed simultaneously on different parts of the body. This work [2] reached two conclusions that were very important for the direction of the research done afterwards. The first indicates that there are good possibilities to create a generalized model. The second is that the most discriminative accelerometer, in terms of which activity is taking place, was the one placed on the hip. This suggests that the pants pocket is a good placement to collect the input data for a good model, a conclusion that can also be found in other works [3, 4]. To take this kind of model to the application field we also need to consider other common placements for the smartphone (e.g. a handbag or a suit pocket). Inspired by the way Bao & Intille used a network of sensors to measure accelerations of the body in order to classify human activities, many works were developed within the same framework [9-15], where the data is collected from networks of sensors and then sent to a desktop computer where the classification takes place. Another approach uses sensors placed on the body, with the classification taking place on the mobile phone after a generalized model has been learnt on a desktop computer [16]. The work of Győrbíró et al. [16] is also interesting because placing a wearable sensor on the wrist solved their orientation problems.
When smartphones are used to collect data from the built-in accelerometer, it is easy to imagine the rotation of the phone inside a pocket, or even different placements of the phone, causing problems in the classification process. But the question here is: are people really willing to wear a sensor on the wrist every day for the assessment of their activities? Another approach was taken by Miluzzo et al. [4], who developed a mobile application that involves several classifiers, some of them running on the phone and some on a backend, producing several levels of classification. From the raw data collected by the built-in accelerometer, the phone calculates the mean and the standard deviation of the acceleration and the number of peaks in each batch of data, applying sequence-based sliding windows [17]. From these features a decision tree was trained using J48 [18] in the WEKA [19] workbench to classify which activity is taking place. This work deals with the activity recognition problem in a different framework from the previous works, since all the sensing and classification takes place on the mobile device that people already use. However, it does not deal with the mobile orientation problem, and the results on the test dataset are not very good. The WISDM (Wireless Sensor Data Mining) group [3] published in 2010 a work for which they collected a sample of accelerometer-based activity data from 29 subjects. All these subjects were asked to record some daily activities for specific periods of time, carrying a smartphone in their right pants pocket, facing forward. This is by far the most complete sample that we can find in the literature on single-user, sensor-based activity recognition.
Experimental results on batches of instances of 20 and 10 seconds, extracted with time-based sliding windows [17], showed that 10 seconds is better for classifying the activities. The authors did not consider using some technique to find the best length for the sliding window. The classification techniques proposed were J48 decision trees, logistic regression and multilayer perceptron, and they concluded that no classification model outperforms all the others. This raises the question of which model to choose, which is addressed in Section 3.3.1. One of the solutions proposed to improve the classification of the activities is to add a temporal continuity probability that reflects a level of confidence in the prediction. This approach can be found in [9], where Krishnan & Panchanathan propose a method to aid in classifying successive, temporally close frames.

2.2 Feature Transformation

There is a naïve approach to the data mining process: given a dataset, the first thing to do is to open a machine learning software package, e.g. Weka Explorer [19], and try to find the algorithm that fits the dataset best. Although there are several proposals in the literature [20] and in software packages [19, 21] to deal with different kinds of problems and dataset characteristics, a noisy dataset and, even more, an inadequate description of the space domain will always be a problem. There are already learning techniques that deal with noisy datasets [22], such as decision trees or KNN, but constructing features that represent the space in an optimal way, creating non-overlapping domains in the represented space, will always improve the results of a classifier.
This search for an adequate representation language of the space is known in the literature as feature transformation [23-25] and is normally separated into two topics: feature construction and feature selection. Conventional learning algorithms normally rely on the existing features provided by the user. In this case, data analysts take on the task of analyzing the original dataset and extracting the unique characteristics of each class, in order to provide the learning algorithm with the features needed to learn a robust model. This procedure depends on domain knowledge, but there are implemented methods that use multiple operators to improve the representation space of the dataset [23] with less intervention and effort from the data analyst. Feature construction procedures are normally integrated with a selection method in order to minimize the dimension of the space representation without compromising the predictive efficiency of the model. A good example of this interaction is the use of a selective learner, like C4.5 decision trees, with the sophisticated constructive induction components proposed by Pfahringer [26]. Whereas feature construction produces better descriptions of the space domain than the original dataset, improving the learning and classification process, feature selection eliminates features that are irrelevant, either because they are insignificant for the problem or because they are redundant in relation to others. Most data mining problems start with an original dataset that is rich in features, many of which will not be useful for the classification process, so feature selection appears in the general data mining literature [20, 22] more often than feature construction.
Some learning algorithms implement these feature selection procedures internally, such as C4.5 decision trees [18] or MARS [22], but many others are not able to deal with irrelevant features, such as neural networks, SVMs or KNN [22] (see Section 3.3.1). To use learning algorithms that cannot deal with irrelevant features there is a preprocessing stage, normally called feature or attribute selection in the data mining literature [20, 22] and software [19, 21], where several techniques can be applied [23].

2.2.1 Feature Transformation for Activity Recognition

The starting point for our review of feature transformation in the context of our problem is Preece and Goulermas [5]. This work is the main reference in terms of feature extraction for this thesis, since it is the only work comparing different techniques to extract features. These authors compared 14 different works in which the feature extraction techniques used to classify activities from accelerometer data were grouped into time-varying acceleration signal, frequency analysis and wavelet analysis:

Time domain features: mean acceleration, standard deviation, interquartile range, correlation between axes, number or average of peaks (considering sliding windows), time between peaks;
Spectral features: energy, frequency-domain entropy, principal frequency;
Wavelet features (not developed in this thesis).

Time domain features are what we normally know as statistical descriptions of a dataset, but in this case they are extracted from batches of data limited by time (see Section 3.2.1 for sliding windows).
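As an illustration only, the following Python sketch computes some of the time domain and spectral features listed above for a single window of accelerometer samples. The function names, the definition of a peak as a strict local maximum, the removal of the constant component before the transform and the normalization of the energy by the window length are our assumptions for this sketch; they are not taken from [5].

```python
import cmath
import math
import statistics


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant axis is reported as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def count_peaks(values):
    """Count strict local maxima (an assumed, simple definition of a peak)."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


def time_domain_features(window):
    """Summarize one window given as a list of at least two (x, y, z) samples."""
    xs, ys, zs = (list(axis) for axis in zip(*window))
    features = {}
    for name, axis in (("x", xs), ("y", ys), ("z", zs)):
        q1, _, q3 = statistics.quantiles(axis, n=4)
        features["mean_" + name] = statistics.fmean(axis)
        features["std_" + name] = statistics.pstdev(axis)
        features["iqr_" + name] = q3 - q1
        features["peaks_" + name] = count_peaks(axis)
    features["corr_xy"] = pearson(xs, ys)
    features["corr_xz"] = pearson(xs, zs)
    features["corr_yz"] = pearson(ys, zs)
    return features


def spectral_features(axis, sampling_hz):
    """Energy, entropy and principal frequency of one axis via a naive DFT."""
    n = len(axis)
    mean = sum(axis) / n
    centered = [v - mean for v in axis]  # remove the constant (DC) component
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return {"energy": 0.0, "entropy": 0.0, "principal_hz": 0.0}
    probs = [p / total for p in power]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    principal_bin = power.index(max(power)) + 1
    return {
        "energy": total / n,
        "entropy": entropy,
        "principal_hz": principal_bin * sampling_hz / n,
    }
```

The naive transform is quadratic in the window length, which is acceptable for short windows; a real implementation would probably rely on a fast Fourier transform library.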
Features extracted from the three axes, such as means, standard deviations [14], first and third quartiles [10] or correlations between axes, are commonly used for the classification process, but they need some modifications in order to be reliable in a real application environment (see the orientation problem in Section 3.2.2). Since the data collected from the accelerometer is often treated as a signal, many approaches extract the frequency of the cycles from this data. Frequency-domain features normally involve the Fourier transform, from which one can extract, for example, the principal frequencies in the signal or the spectral energy. One of the most referenced works in activity recognition [2] used a mixture of time and frequency-domain features as a way to improve results, but by using networks of sensors placed in specific places the authors did not have to deal with most of the challenges presented in Section 1.3, for example the orientation problem that is conceptualized in Section 3.2.2 and addressed in Chapter 5 during the experiments.

2.3 Modeling

When considering modeling using machine learning techniques, it is important to keep in mind that the choice of algorithm must be made according to the characteristics of the dataset used. According to Friedman et al. [22], when facing a large dataset such as a data stream, with a set of features that have not yet been selected, there are not many options. He suggests the use of decision trees or of multivariate adaptive regression splines (MARS), a method he created and which is implemented in the R language. But there are different approaches when dealing with large datasets whose features have not yet been selected. When the problem is analyzed from the data mining perspective, it is usual to prepare the dataset so that it can be handled by a specific algorithm [20].
According to Witten and Frank, it would be normal to consider applying feature selection and down-sampling techniques and then applying the algorithm that best suits the interests of the data analyst. KNN might be interesting if we consider semi-supervised learning, but for now, keeping in mind the goal of a generalized model for activity recognition, we do not see the need to transform the characteristics of our dataset in order to use a particular learning algorithm. All the previous works in single-user, sensor-based activity recognition that we analyzed proposed a two-step methodology: a number of features are constructed from the raw data by the use of sliding windows and then used to learn a model. We have seen different approaches to building features, but there were also different approaches when it came to choosing a learning algorithm. Some authors decided to apply a set of different learning techniques [3, 13]; others proposed only decision trees, such as J48 [4] in the WEKA workbench, because of their efficiency and light weight in the classification process, or KNN [5] because of its conceptual simplicity, even though KNN is not efficient for a large dataset [22]. Once more, it is important to remember that the original dataset can be transformed so that most of the traditional algorithms found in the literature can be applied to it. Since labeling the datasets for a general classifier is very time consuming, there is already a very interesting proposal [27] to overcome this problem through the ability to learn from unlabeled data for activity recognition; its only drawback is the lack of an implementation in a mobile environment.
Masud [6] also proposes semi-supervised learning from dynamic datasets in general, which could be applied to the problem of activity recognition from accelerometer data streams. When looking for directions with the implementation on a mobile device in mind, we found only a few authors who implement activity classifiers on the mobile device [4]. They implemented decision trees, arguing for the light weight, efficiency and fast implementation of this model. There is a very recent implementation [28] where the training data is provided by the users themselves, who label the activities; a non-linear model called GMT [29] (Geometric Matching Template) is then learnt on the mobile device. This led us to the area of mobile data mining. The recent growth of smartphones, both in numbers in the telecommunications market and in computational and sensing capacity, presents an unprecedented opportunity for mobile data mining. At this moment the number of real-time analyses of data in mobile environments is growing at a very fast rate and across different application domains, such as healthcare systems [30]. Mobile data mining involves the generation of interesting patterns from datasets collected from mobile devices. One of the characteristics of this area is that the datasets grow very fast, so efficiency must be sought in order to implement data mining models on the mobile device. There are proposals to reduce the source data to statistical summaries before performing data mining [31], which can be considered during the implementation of sliding windows.

2.3.1 Hierarchical Classification

Hierarchical classification deals with problems where classes are organized in hierarchies, from more generic to more specific.
There is a good tutorial on hierarchical classification techniques by Freitas and Carvalho [32]. Some studies indicate that it is possible to distinguish static activities, such as sitting and standing, from active ones by applying a threshold to some measure of acceleration [12]. This means that the problem can be decomposed first into active and passive activities and then into subgroups within these two major classes. Given this decomposition, it is my belief that this area can move from more general activities to more specific ones by considering the characteristics of each activity.

Chapter 3: Methodology

In this chapter we explain some of the CRISP-DM [33] stages implemented in this project as a framework for the implementation of some techniques and, most of all, to explain the dynamics between the construction of features and the modeling experiments. Then the most relevant stages of this project are presented, feature construction and modeling, discussing the most important concepts for the development of this thesis. Finally, we present some important concepts needed to evaluate the learnt models presented in chapter 4.

3.1 Data Mining Overview: CRISP-DM

The aim of this thesis is not to give an overview of the data mining world, but it was considered important to define a framework as a guideline for the research project. As a reference we considered the standard process model proposed by the CRISP-DM consortium [33], shown in figure 3-1.

Figure 3-1: CRISP-DM procedures [32].
This work was certainly not developed in an enterprise environment, but assuming that our “business” is activity recognition as described in Section 1.1 and that we are highly motivated by the reasons explored in Section 1.2, our starting point will be a labeled dataset collected by the accelerometers of multiple smartphones, in which the activities proposed in our problem are recorded. Our next stage is to reach a good understanding of the data, which can be found in Chapter 4. Since we are working with a data stream that can be considered a multivariate time series (timestamp, X, Y, Z), we start the data analysis by characterizing the time series that we will be working with. Then a non-temporal data analysis is also presented, to explore as much as possible the characteristics of this type of data. The next two phases, Data Preparation and Modeling, are the most time consuming and highly interactive. The result of the relation between these two phases is described in the Experiments chapter of this thesis (Chapter 5), and the development of this interaction can be followed along the sections of that chapter. The last phase considered for this project was Evaluation. During the period of experiments we found some results that looked very good in terms of model evaluation, but conceptualizing the performance of the model in a real mobile environment was time consuming.

3.2 Features Construction and Selection

One of the main topics in the data mining literature is feature transformation [34]. This area addresses two different kinds of problems that are typical in data mining: feature extraction and/or feature construction. For feature extraction, let us assume that we have a large dataset with many features, some of which carry irrelevant or redundant information.
The aim here is to select from the original dataset the relevant features (for example with sub-group discovery) or combinations of features, in order to reduce the dimension of the problem. The common approaches in feature selection are PCA (Principal Component Analysis) for numerical attributes and MCA (Multiple Correspondence Analysis) for categorical attributes. Another problem can be that the original dataset of a classification problem is composed of features that need to be transformed in order to expand the space representation with new, more predictive features. In this case we are facing a feature construction problem, that is, a modification of the representation of the space provided by the original data. Constructive induction is a general approach to deal with inadequate attributes in the original dataset, in which feature construction and feature selection interact in a dynamic process in order to achieve the best set of features for a specific problem. This approach can be carried out by automatic methods [34] or manually by the data analyst, who in this case takes on the task of determining which attributes, transformed from the original space, could be relevant. A general approach is explained by Bloedorn et al. [23] and shown in figure 3-2.

Figure 3-2: General approach for constructive induction by Bloedorn [32].

The process of feature construction is an important issue because we are trying to avoid many of the assumptions made in the past, such as a fixed placement and orientation of the sensor [2, 3], a fixed frequency of data collection [3] and the use of a network of sensors [2], leading us to an implementable model.
3.2.1 Sliding Windows

When extracting features from a data stream it is important to have a sampling technique that keeps the model up to date. A general approach is to slide a window over the dataset, extracting timeless statistical information [35] from those batches of data and then discarding past examples from the classification process. The window can be sequence-based, with a fixed size of $k$ instances, or timestamp-based, with a fixed duration, as explained by Babcock et al. [17]. For the sequence-based window we decide on a dimension, $k$, over which to extract information from the incoming data. Older data is dismissed at the same rate at which new data arrives at the processor. Let us assume a data stream $x_1, x_2, \ldots, x_t, \ldots$, where $x_t$ is the variable observed at instant $t$, and a window of dimension $k$. The extraction of features in this case starts from instances $1$ to $k$, and when instance $x_{k+1}$ arrives the first instance, $x_1$, is dismissed for this local classification. To extract features, the main question is the dimension of this window. Normally this decision comes from understanding the problem we are trying to classify, or from recommendations in the literature if this technique has already been applied to the problem. Although there are different approaches to decide the dimension of this window, it is always convenient to keep in mind that longer windows are richer in information and thus normally produce better features for the classification process, whereas smaller windows reflect class changes more quickly. There will always be a trade-off between the quality of the features produced and the ability to recognize changes in the classification.
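The two kinds of window described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this thesis: the function names are hypothetical, and the choice of the mean and the standard deviation as the extracted statistics is only an example.

```python
import statistics
from collections import deque


def sequence_windows(stream, k):
    """Yield (mean, std) over the k most recent values of a stream."""
    window = deque(maxlen=k)  # appending to a full deque drops the oldest value
    for value in stream:
        window.append(value)
        if len(window) == k:
            values = list(window)
            yield statistics.fmean(values), statistics.pstdev(values)


def timestamp_windows(stream, duration):
    """Yield (timestamp, mean, count) over samples newer than t - duration.

    `stream` yields (timestamp, value) pairs with increasing timestamps and
    `duration` is a positive length of time in the same unit as the timestamps.
    """
    window = deque()
    for t, value in stream:
        window.append((t, value))
        # Discard samples that fell out of the time span of the window.
        while window[0][0] <= t - duration:
            window.popleft()
        values = [v for _, v in window]
        yield t, statistics.fmean(values), len(values)
```

In the sequence-based version every window holds exactly k instances, while in the timestamp-based version the number of instances per window varies with the collection frequency, which is why the count is returned together with the statistic.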
The timestamp-based window is used when we want to classify a period of time chosen from the understanding of a particular problem, but the data analyst does not know at what frequency the new data will come in. In the case of our sensor there are some variations in the collection frequency, but they can be controlled, so there was no major reason to apply this technique.

3.2.2 The orientation problem

To start this section it is important to recall the challenge of building features with the orientation problem in mind. Placing the mobile device in the pants pocket does not solve the orientation problem, because the device can still rotate inside the pocket, transferring information from the X to the Y axis or vice versa. The initial idea to solve the orientation problem is to aggregate the information of the X and Y axes so that we always obtain the same information regardless of how the mobile phone rotates inside the pocket. In order to achieve good performance in the classification process, we realized that statistical measures of location distinguish between the passive activities, standing and sitting. The explanation is that passive activities do not involve movement, so all the collected data is concentrated around a specific location that measures the gravitational force. The question then must be: how can we aggregate the information of several axes while maintaining the discriminative power of the features? Statistical measures of dispersion are logically useful when considering active activities. It is easy to see that some activities are more vigorous than others, thus involving more movement, as can be seen in figure 1-2.
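To make the idea of aggregating the X and Y axes concrete, the sketch below simulates a rotation of the phone inside the pocket and shows one aggregation that is unaffected by it. The use of the Euclidean norm of the (X, Y) pair is our assumption for this illustration; the passage above does not fix the aggregation function, and the function names are hypothetical.

```python
import math


def rotate_xy(sample, angle_rad):
    """Rotate an (x, y, z) sample around the Z axis by angle_rad radians."""
    x, y, z = sample
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)


def xy_magnitude(sample):
    """Aggregate X and Y into one value (assumed choice: Euclidean norm)."""
    x, y, _ = sample
    return math.hypot(x, y)
```

A rotation around the Z axis moves acceleration from X to Y, so per-axis statistics change, but the value returned by `xy_magnitude` stays the same up to floating-point error for any rotation angle.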
Then our question is: when considering the orientation problem of the mobile device, is it possible to aggregate the X and Y axes for all the statistical measures proposed in the related work on activity recognition [5]? This thesis proposes a new feature able to measure the amount of movement without losing sight of the challenge of the mobile orientation inside the pants pocket. If we consider the amount of movement in the time domain, the sum of the first derivatives of each axis in relation to time is implemented by measuring the accelerations regardless of the orientation. As a result there will be a measure of the amount of movement in each window of data:

$$\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2 + \left(\frac{dz}{dt}\right)^2$$

3.3 Modeling

3.3.1 Learning Algorithm

When discovering the data mining area, it is fascinating to find so many learning algorithms, starting with simple Naïve Bayes, K Nearest Neighbors and linear regression, continuing with non-linear models such as decision trees and neural networks, and then moving to ensemble learning methods; it is easy to conclude that this area is already huge and still growing. At the same time, the data analyst faces the decision of which learning methodology should be applied to a specific problem. There are some dataset characteristics [36] that can help us decide which kind of model should be used. When working with data streams, the first characteristic that stands out is the amount of data we will be dealing with, so the first concern when choosing the learning algorithm should be computational scalability. Figure 3-3 shows that, when facing a large number of instances (large N), decision trees are one of the few good choices among the traditional learning methods.

Figure 3-3: Some characteristics of different learning methods dealing with datasets [36].
Let us assume a classification problem where $Y$ is our target variable with two classes ($Y \in \{c_1, c_2\}$) and $p$ features ($X_1, \ldots, X_p$) are available to discriminate between the classes. From the set of $N$ observations, where each observation is $(x_i, y_i)$, $i = 1, \ldots, N$, the goal is to build a classifier in order to predict $Y$ on unlabeled data. Decision tree learning algorithms, for example C4.5 [18] or CART [37], follow a top-down induction approach, where the classification model grows from the root (the most discriminative feature) down to the leaves (classes). This is done by binary splits: each node represents a decision that separates the dataset into two subgroups, and for each of these two groups a new splitting decision is made, until a threshold is reached and the node becomes a leaf; the path to it forms a set of hierarchical rules that yields a classification. In order to make a splitting decision we need a heuristic to compare the different splitting options. Normally these heuristics have two extremes: a division that keeps the proportion of classes in each subgroup scores the minimum, and a split that differentiates perfectly between the two classes scores the maximum. These measures can be:

1. Measures that compare the distribution of classes in a dataset and in a subgroup of the same dataset, normally called impurity measures. Two good examples are the Gini index [37] and the information gain [18];
2. Misclassification error measures [36];
3. Statistical measures of independence, normally based on the chi-squared test, between the proportions of classes in the dataset and in the subgroups created by the splits.

During the splitting process there is a risk of overfitting the classification model to the dataset used in the learning process.
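Before continuing with overfitting, the impurity measures mentioned in the list above can be illustrated with a short Python sketch. It uses the textbook definitions of the Gini index and of the entropy-based information gain for a binary split; the function names are ours, and the sketch is not the code of C4.5 or CART.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the proportion of each class in a list of labels."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    """Gini index: 0 for a pure node, 0.5 for two balanced classes."""
    if not labels:
        return 0.0
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for two balanced classes."""
    if not labels:
        return 0.0
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children
```

With four instances of two balanced classes, a split that keeps the class proportions in both subgroups has an information gain of 0, and a split that separates the classes perfectly has a gain of 1 bit, which matches the two extremes described above.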
If the decision tree is grown too much on the training sample, we obtain a model completely adapted to the training dataset and, as a consequence, not general enough to be applied in a real environment. This overfitting is represented in figure 3-4 from the point where the prediction error rate on the test samples reaches its minimum and then starts to increase along with the growth of the decision tree (model complexity).

Figure 3-4: Relation between model complexity and prediction error rates [36].

To solve this overfitting problem there are several proposals for pruning the models created by decision trees. There are two pruning strategies: pre-pruning and post-pruning. Most of the proposals can be found in Esposito et al. [38], who propose several rules to stop growing decision trees. For pre-pruning, the usual stopping rule is that all instances in the leaf belong to the same class, or that each leaf holds a minimum number of instances given by a threshold. The second strategy is post-pruning. A very simple method [39] for this kind of pruning uses the misclassification rate. First we calculate the misclassification rate of the bottom-level node, considering the majority class in that node. Then, if there is an upper-level node with a smaller or equal misclassification error, the tree is pruned and this upper-level node becomes a leaf. Finally, we must see which problems we might have when working with decision trees. One major known problem of decision trees is their high variance: often small changes in the data produce big differences in the models created.
This is caused by the hierarchical splitting of the dataset: one error at the top of the tree is propagated down to the rest of the decision tree. Another limitation of this learning algorithm is the lack of smoothness of the decision domain. Even though decision trees are a non-parametric method that can create non-linear models, the regions they create are always rectangles (when comparing two features) or boxes (when comparing three features). According to Hastie et al. [36], the MARS learning algorithm can be seen as a smoother solution to this discriminative problem.

3.3.2 Hierarchical classification

Normally a classification problem is addressed as flat classification, where each example is assigned to one class out of a finite set of classes. But some problems can be addressed as a hierarchical classification process, where examples are first classified into major classes and then, inside each major class, a further classifier assigns subclasses, as shown in figure 3-5. Costa et al. [32] give an introduction offering perspectives on the characteristics of a hierarchical classification problem.

Figure 3-5: Hierarchical classification conceptualization.

Figure 3-5 shows that the hierarchical classification problem starts without any kind of classification, and that the first level of nodes holds three major classes (1, 2, 3). After the first prediction level it is possible to take the classification process to lower levels, with more specific information about each example. For example, an instance can be classified as 1, and then another classification process classifies it as 1.1 or 1.2.
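The top-down procedure described for figure 3-5 can be sketched as a chain of classifiers, where the label predicted at one level selects the classifier applied at the next level. In the Python sketch below, the class labels follow figure 3-5, but the instance attributes `a` and `b` and all decision rules are hypothetical placeholders and not learnt models.

```python
def classify_hierarchically(instance, root_classifier, sub_classifiers):
    """Return the path of labels from the major class down to a leaf.

    root_classifier maps an instance to a major class; sub_classifiers maps a
    class label to the classifier that refines it. Labels without an entry in
    sub_classifiers are treated as leaves.
    """
    label = root_classifier(instance)
    path = [label]
    while label in sub_classifiers:
        label = sub_classifiers[label](instance)
        path.append(label)
    return path


def toy_root(instance):
    """Hypothetical first-level rule choosing between classes 1, 2 and 3."""
    if instance["a"] < 1:
        return "1"
    return "2" if instance["a"] < 2 else "3"


# Hypothetical second-level rules; class 2 has no subclasses in figure 3-5.
TOY_SUB_CLASSIFIERS = {
    "1": lambda inst: "1.1" if inst["b"] < 0 else "1.2",
    "3": lambda inst: "3.1" if inst["b"] < 0 else ("3.2" if inst["b"] < 1 else "3.3"),
}
```

For example, `classify_hierarchically({"a": 0.5, "b": 1.0}, toy_root, TOY_SUB_CLASSIFIERS)` returns the path `["1", "1.2"]`, while an instance assigned to class 2 stops at the first level.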
From this figure we can conclude that this organization of the classification decomposes what would be a complex decision problem into smaller problems. Given this structure, we can require that every example be classified into a leaf, which in this case implies one of the following classifications: 1.1, 1.2, 2, 3.1, 3.2 or 3.3. Alternatively, we can be more flexible and allow the classifier to predict a leaf only above some predictive confidence threshold, and otherwise to classify only as 1, 2 or 3. It is clear then that this decision implies a trade-off between the accuracy of the classification (predicting 1, 2 or 3 is, in general, easier than predicting sub-classes) and its usefulness (classifying into a sub-class normally provides more information than predicting a major class). Organizing the classification procedure in this way also allows more efficiency in terms of computational requirements. Let us assume that one set of features, calculated from the raw data, distinguishes between classes 1, 2 and 3, but that classifying between 1.1 and 1.2 needs a different set of features calculated from the raw data. If an instance is classified as 2, there is no need to calculate the set of features needed to distinguish between 1.1 and 1.2. Finally, if we consider the feature construction process proposed in Section 3.2, it is clear that this hierarchical approach will be useful for building the best set of features to discriminate first between passive and active activities and then between standing and sitting or between running and walking.

3.4 Evaluation

“Evaluation is the key to make real progress in data mining” [20]. From all the techniques that we can find in the literature and in software, it is necessary to decide which techniques to use and whether the produced models are reliable.
In the paragraph above it is possible to find two different goals for a specific data mining problem. First, it is necessary to estimate the performance of different models in order to choose the best one, which the literature usually calls the model selection process. Then there is the model assessment: after choosing a final model, it is necessary to test it on an independent dataset and estimate the prediction errors.

3.4.1 Model Selection and Assessment

In order to make the model selection, the evaluation procedures estimate measures of accuracy and prediction error from the training dataset. There are different methods to estimate these measures, and there are different kinds of measures that can be used to select the best model.

The simplest method is to decide on a partition between the training set and the testing set. Let's assume a dataset, $D$, and a decision to use 25% of this dataset to test the model. A random selection from $D$ extracts 75% of the instances, $D_{train}$, which will be used to learn a model. The remaining dataset, $D_{test} = D \setminus D_{train}$, is then used to evaluate the learnt model.

A well-known method is k-fold cross-validation. In machine learning one normally uses a holdout set of data in order to measure the performance of a model. After choosing a learning algorithm to learn a predictor from the training dataset $D_{train}$, it is necessary to evaluate the quality of this predictor. To do so, the k-fold cross-validation method partitions the dataset into $k$ folds. It uses $k-1$ folds of data to train the predictor and the remaining fold to test it, repeating this $k$ times until every fold has been used once for testing.
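The two partition methods described above can be sketched in a few lines. This is a minimal illustration, not the Weka implementation used later: the function names `holdout_split` and `k_fold_indices`, the fixed random seed and the way folds are built are all assumptions made for the example.

```python
import random
from typing import List, Tuple


def holdout_split(n: int, test_fraction: float = 0.25,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split the indices 0..n-1 into (train, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n: int, k: int,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) pairs; every index is tested exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Train on the k-1 folds that are not the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

For example, `k_fold_indices(len(dataset), 10)` would give the ten train/test index pairs of a 10-fold cross-validation; the predictor is learnt and tested once per pair and the evaluation measures are then aggregated over the folds.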
Besides the partition methods used to evaluate the models, it is necessary to decide which evaluation measures should be used to choose the best model. Considering a binary classification problem, a very interesting output for interpreting the evaluation results is the confusion matrix (example in figure 3-6), which shows what kind of errors were produced by the model. However, this interpretation is not a reliable comparison measure from which we can select among different models.

Figure 3-6: Confusion matrix [36]: true positive = TP; true negative = TN; false positive = FP; false negative = FN.

A naïve measure for model selection is the accuracy of the model, that is, the rate of correct predictions. In unbalanced datasets it is possible that the majority class (for example, class "yes") is predicted well, giving a high accuracy, while the less represented class (class "no") is misclassified. This gives us a good accuracy measure but a poor classification model. To solve this problem it is useful to interpret other measures, such as the area under the ROC curve (AUC) or the relative absolute error (RAE).

A ROC (Receiver Operating Characteristic) curve has two dimensions: (1 − specificity) on the x-coordinate and the sensitivity on the y-coordinate. When comparing ROC curves, one model dominates another if its curve is above and to the left of the other (higher true positive rate and lower false positive rate). The ROC curve displays the relationship between predictions and outcomes by plotting the estimates of sensitivity versus (1 − specificity) for all possible threshold values. There are two boundaries for the AUC measure [40]: 0.5 when we are facing a random prediction and 1 when we achieve the best modeling from the dataset available to make the model selection.
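A small sketch may help to see why accuracy alone can mislead on an unbalanced dataset. The helper names `rates` and `auc` are hypothetical and the counts used in the usage note are invented for illustration. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, counting ties as one half.

```python
from typing import Sequence, Tuple


def rates(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # 1 - false positive rate
    return accuracy, sensitivity, specificity


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve for scores of the positive class (label 1)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count the positive/negative pairs that are ranked correctly (ties count half).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With 95 "yes" and 5 "no" instances, a model that always predicts "yes" gives `rates(95, 0, 5, 0)`: an accuracy of 0.95 with a specificity of 0, which is exactly the situation described above.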
Suppose that for a single instance we have a classification probability vector $(p_1, \dots, p_k)$, where $k$ is the number of classes, and a vector $(a_1, \dots, a_k)$ that encodes the real class: $a_i = 1$ for the real class $i$ and $a_j = 0$ for every other class. Considering that the learning algorithm is a decision tree making deterministic predictions, where one of the probabilities will be 1 and all the rest will be 0, we have as a quadratic loss function for each instance $\sum_{j=1}^{k} (p_j - a_j)^2 = (1 - p_i)^2 + \sum_{j \neq i} p_j^2$, where $i$ is the correct class and the sum contains the incorrect predictions. From here it is possible to apply the quadratic loss function to the whole test sample, accumulating the loss over all its instances.

From the quadratic loss function it is possible to derive another good evaluation measure to select the best model: the RAE (Relative Absolute Error). This measure applies the absolute value, instead of the square, to the differences used in the quadratic loss function. So, in order to select the best model, it is important to minimize the RAE and maximize the area under the ROC curve.

Having chosen the best model through the evaluation explained in this section, there is one last but very important evaluation to make: the model assessment. Using the cross-validation method we will look for the lowest error rates. This comparison will only be performed after the final model is chosen.

3.5 Conclusion

In this chapter we have seen some of the most important techniques and methods that will be used in chapters 4 and 5. It has been shown that the relation between feature construction and model selection will be dynamic.

Chapter 4: Data Analysis

The purpose of this chapter is to understand the characteristics of the data we will be using to select and assess our activity recognition model, and to help the development of the experiments that will be described in chapter 5.
To start, there is a short description of the dataset collected, covering the application used to collect new data and how the data collection was carried out. In section 4.2 we start the statistical analysis of the dataset, beginning with a time series analysis and then moving to a non-temporal analysis, since sliding windows will be applied to the data streams in order to extract timeless statistical information.

4.1 Data Collected

One of the applications [41] available to collect data from Android-based smartphones, whose interface can be found in figure 4-1, collects the acceleration (in m/s²) for each of the three axes that these sensors can measure: X, Y and Z.

Figure 4-1: AccDataRec interface and QR Code (that can be used to download the application).

The starting dataset used for this project is composed of 4 different activities {Walking; Running; Sitting; Standing} and has 293970 labeled examples, collected at 20 Hz, which corresponds to 20 instances per second and gives us approximately 4 hours of activities in total. The data was collected with the AccDataRec application [42], which measures the acceleration (in m/s²) that the accelerometer can recognize, including the gravitational force, and was then scaled to G-force. This data was partially collected for this thesis and partially collected for previous works [7].

There is a constraint in this collection of data concerning the frequency at which it is collected. The intuition behind this problem is that the Android platform changes the power directed to the sensors depending on the status of the mobile (active, stand-by, ...), but the resulting changes in the collection frequency are not significant and depend on the smartphone used.

4.2 Data Description

The first characteristic of this dataset is the discrepancy in the distribution of examples for the activities considered, as figure 4-2 clearly shows.
About 78% of the examples collected are walking, running is about 13% of the dataset, and finally the passive activities (Sitting and Standing) represent only 9% of the dataset.

Figure 4-2: Distribution of examples for the activities (percentage of instances for each class).

4.2.1 Time Series

To start our data analysis, for each activity we present chronograms (in figures 4-3 and 4-4), with a sample of 100 instances per activity, from which we can distinguish active activities (that involve movement) from passive activities. Modeling these time series has not been considered because our problem is not to predict the next activity of a particular user, but to classify the activity that is taking place at the moment.

The interest of the chronograms in figures 4-3 and 4-4 for this research is to indicate which kind of features should be calculated in order to discriminate between classes. Whereas in figure 4-4 we can see clear gravitational information in passive activities, which can be explored by statistical measures of location, in figure 4-3 the different amounts of movement between running and walking might be explored by statistical measures of dispersion.

Figure 4-3: Sample of Active data: Running (left) and Walking (right). Each panel plots the acceleration of 100 instances at 20 Hz (5 seconds of activity). Red = X axis; Blue = Y axis; Yellow = Z axis.
Figure 4-4: Sample of Passive data: Sitting (left) and Standing (right). Each panel plots the acceleration of 100 instances at 20 Hz (5 seconds of activity). Red = X axis; Blue = Y axis; Yellow = Z axis.

Finally, it is important to say that these time series do not show any kind of trend for any activity. In terms of seasonality, it might look appealing to study it for the active activities, but the seasonality analyses, obtained by producing the total and partial autocorrelations for each axis, show it to be insignificant for time series of these dimensions, as can be seen in the example of figure 4-5, where the ACF and PACF of walking were calculated for the Y axis.

Figure 4-5: ACF and PACF of walking with a lag of 100 records.

4.2.2 Non Temporal Analysis, Boxplots

A good way to get an intuition of how different the activities are, and of how the statistical measures (such as means, standard deviations and interquartile ranges) will behave when a sliding window is applied to extract features, is to look at the boxplots of the whole training dataset for each axis in relation to each activity. The boxplots for each axis, represented in figures 4-6, 4-7 and 4-8, indicate that discriminating activities with a simple mean over all the data of each activity will work considerably well only if we consider the passive activities, sitting and standing.
Taking into account all the activities, it can be very difficult to discriminate between activities such as running and walking, since the distributions in each axis domain for the active activities are similar.

Figure 4-6: Boxplot for the X axis data in relation to the different activities.

The difference between the passive activities is even bigger in the plots of figures 4-7 and 4-8. This happens because, for this dataset, the gravitational force when standing is applied on the Y axis, whereas when sitting this force is measured on the Z axis, which can be conceptualized (see figure 1-1) by the way the sensor measures the acceleration.

Figure 4-7: Boxplot for the Y axis data in relation to the different activities.

Figure 4-8: Boxplot for the Z axis data in relation to the different activities.

It can be concluded that features containing statistical measures of location will not be useful to distinguish between the active activities (walking and running), and that there must first be a separation between passive and active activities in order to use statistical location measures for the classification of standing and sitting.

4.2.3 Non Temporal Analysis, Scatterplots

In our analysis of the scatterplots it is important to search for relations between different axes of the raw data in order to define which characteristics the computed features should have.
To start this analysis, consider a scatterplot with only the passive activities (standing and sitting), in figure 4-9, since we already know that measures of location can discriminate these two activities well; the remaining activities are added afterwards.

Figure 4-9: Scatterplot of all axes considering only passive activities.

As suspected when analyzing the boxplots, the scatterplots show two different groups in the passive activities, taking into account only the raw data. Using location measures, such as the mean or the median of each axis, as features will reinforce the discrimination between the two passive activities, but even from the raw data it would be possible to build a good model with a non-linear algorithm such as KNN. This is another good indication of which kind of features should be calculated to discriminate the passive activities, but there is still the problem, discussed in the previous section, of discriminating between passive and active activities. In figure 4-10 we can see part of this problem when running is included in the scatterplot.

Figure 4-10: Scatterplot of all axes considering passive activities and running.

By adding the activity running to the scatterplot we can see overlapping problems in the domain of the three axes. This is natural if we conceptualize the cyclical movement of the body while running: running involves cyclical accelerations in each axis of the accelerometer, so the challenge is to develop features that capture peculiarities in these distributions while solving, at the same time, the orientation problem of the smartphone. If we focus only on the distribution of Y, it is possible to see a saturation problem in this axis, something natural when the range of the sensors is smaller than some of the accelerations produced by active activities.
The problem grows when we consider all the activities in the same scatterplot, as can be seen in figure 4-11.

Figure 4-11: Scatterplot of all axes considering all the activities.

With this last scatterplot the problem of classifying using only raw data becomes clear. The activity walking has a huge majority of instances, and the distributions in the scatterplots tell us that this activity covers almost the whole domain of each axis, as happened with running in figure 4-10. The saturation problem is even more recognizable here. In the distributions of all the axes we can see a peak at the extremes, caused by the measuring limitation of the accelerometers placed in almost all smartphones. For these collections the range of the sensors was from -2.3 g to 2.3 g.

4.2.4 Non Temporal Analysis, Pearson Correlations

To finish this non-temporal data description we will take a look at the cross-correlations between axes for each activity. Since all the variables are numeric, the Pearson correlation coefficient was used, from which it is possible to deduce whether there is a linear dependency between two numerical variables; in this case, between each pair of axes for each activity.

Figure 4-12: Cross-correlation for each pair of axes (X/Y, X/Z, Y/Z) calculated for all the activities. Red = Walking; Green = Running; Yellow = Sitting; Purple = Standing.

Looking at the correlations between axes over the whole dataset, we find that there are no linear relations between different axes for the activities walking and running.
Although we cannot find strong correlations for the two active activities, these correlations will still be calculated over smaller portions of data, using sliding windows, because these features were proposed by related works [5] on activity recognition. A very interesting relation in the passive activities is that they reflect the linear relation between the Y and Z axes. This happens because, when someone who is sitting moves to adjust their position, the gravitational force rises in the Z axis by almost the same amount that it decreases in the Y axis.

At this stage of the document we can already see what the first set of experiments after the baseline (which will be done only with the raw data) will be. The data analysis shows that we must focus on dispersion features in order to characterize active activities, and on location measures in order to find where the gravitational force is being applied and from that predict whether the person is sitting (gravitational force measured in the Z axis) or standing (where the gravitational force will be recorded in the Y axis, in the X axis, or partially by both these axes).

Chapter 5: Experiments and Modeling

In this chapter I present and analyze all the experiments carried out to find the best set of features for learning a generalized model that recognizes the activities I am proposing: walking, running, sitting and standing. The experimental setup works as a methodological guide to the implementations done in this chapter to reach the best set of techniques along the experimental procedures.
All the experimental results can then be found from section 5.2, with the baseline experiments, up to the model assessment in section 5.6, with the best set of features and techniques proposed, followed by a brief discussion in section 5.7.

5.1 Experimental Setups

In this section we start by describing the dataset used in these experimental procedures, the reasons to try a hierarchical approach in this classification problem, the learning algorithm implemented, the attribute selection techniques used to validate the construction of new features, and the evaluation of our experimental results.

The final dataset used in this work was provided by 4 subjects who recorded all 4 activities several times. The collection was done with smartphones equipped with triaxial accelerometers. The baseline dataset is composed of almost 300000 labeled instances collected at an average frequency of 20 Hz, which corresponds to 20 instances per second. This gives us a total of about 4 hours of activities collected. The baseline dataset was composed of the number of instances shown in table 5-1, used in the baseline experiments:

             Walking   Running   Sitting   Standing
Instances    193840    2517      17776     8070
Percentage   87.23%    1.1%      8%        3.63%

Table 5-1: Number of instances and percentage used for the baseline experiments.

Since the chronograms in section 4.2.1 show that we need to construct different features for different activities, it is easy to see that sub-grouping the dataset first into Passive activities (involving a low amount of movement: sitting and standing) and Active activities (involving some or much movement: walking and running) can be a good way to look for the best set of features.
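As a quick check of the class imbalance and of the proposed sub-grouping, the counts of table 5-1 can be aggregated as in the sketch below. The instance counts come from the table; the names `COUNTS`, `GROUP`, `percentages` and `group_counts` are hypothetical and used only for illustration.

```python
from typing import Dict

# Instance counts per activity, as reported in table 5-1.
COUNTS = {"Walking": 193840, "Running": 2517, "Sitting": 17776, "Standing": 8070}

# Sub-grouping proposed in the text: active versus passive activities.
GROUP = {"Walking": "Active", "Running": "Active",
         "Sitting": "Passive", "Standing": "Passive"}


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each class, in percent of the total number of instances."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def group_counts(counts: Dict[str, int], group: Dict[str, str]) -> Dict[str, int]:
    """Add up the instance counts of the classes that belong to each group."""
    totals: Dict[str, int] = {}
    for label, n in counts.items():
        totals[group[label]] = totals.get(group[label], 0) + n
    return totals
```

Calling `group_counts(COUNTS, GROUP)` shows that the Active group holds the large majority of the instances, and that inside it running is a very small minority compared with walking.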
In the baseline experiments we will use the software Weka [19] (Waikato Environment for Knowledge Analysis) to apply the chosen algorithm, the J48 decision tree [18], to the following datasets: "All.csv", "Active.csv" and "Passive.csv"; and from these results see whether the hierarchical approach should be used.

The choice of decision trees for activity recognition in mobile applications comes from evidence that, until now, no learning algorithm has obtained better results than all the others in activity recognition [3], and also from the aim of implementing this model in a mobile environment, where there are computational limitations and decision trees have already proved to be efficient [4].

Decision trees are grown with divide and conquer methods that select the best features, starting at the root of the tree and continuing to grow until the entire dataset domain is conquered and all the classes are separated in an optimal way. In order to confirm the attribute selection of the divide and conquer methodology of C4.5, implemented in Weka as the J48 algorithm, it was decided to use CfsSubsetEval for attribute selection, in the Weka "Select Attributes" menu, for the baseline experiments.

Section 3.4 discussed the evaluation of model performance in order to select the best model for assessment. From the learning results of Weka, using cross-validation, we will try to minimize the Relative Absolute Error (RAE) and maximize the areas under the ROC curve, combining these two evaluation measures. The best set of techniques will then be the one that minimizes the RAE and maximizes the ROC area. In order to do this optimization I will start by comparing the first set of experiments with features against the results of the baseline experiments, and build the best model possible from those results.

5.2 Baseline Experiments

In this baseline experiment we have only used the original data recorded from the mobiles' accelerometers.
There are three input features, the X, Y and Z axes, and the target variable is the label of the activity recorded. We start by applying a divide and conquer learning algorithm (J48) to all the activities collected, and we analyze the evaluation measures from the cross-validation process. Then, since the data description showed that there are different patterns in the data when we separate the dataset into active and passive activities, the problem will be divided into two subgroups and the same learning procedure (the J48 decision tree of Weka) will be applied to classify these two subgroups separately.

5.2.1 All Activities Classification

Taking into account all the activities, the attribute selection process selected all the axes as useful information for learning a classification model, as shown in figure 5-1.

Figure 5-1: Feature Selection using CfsSubsetEval.

The first classification model learnt was a C4.5 decision tree with a pre-pruning decision of having a minimum number of objects in each leaf: 1% of the instances of the smallest class. From the results in figure 5-2, if we took into account only the percentage of correctly classified instances, this model would look perfect. However, since the sample is composed of a large majority of the class walking, and the passive activities have a steady signal that helps the decision tree decide where to cut for these two classes, it is clear that this decision tree was grown considering almost only three classes: walking, sitting and standing. According to the ROC area (70%), the class running is classified only slightly better than by a random guess (which would be 50%).
Figure 5-2: Results from the baseline experiment for all the activities using J48 algorithm.

From the confusion matrix in figure 5-2 we can see that the biggest problem of this top-down approach, with overlapping classes in the axes domains and an unbalanced dataset, is that the most represented class causes classification problems. For example, most of the instances labeled as running were classified as walking. This problem would be even more pronounced if we considered only active activities.

The steady signal of the acceleration for the passive activities is a good indication that, after discriminating between passive and active activities, classifying between standing and sitting should be a trivial problem. We therefore need to focus our efforts on finding one measure that distinguishes between passive and active activities, and a set of features that distinguishes between the classes running and walking.

5.2.2 Active Activities Classification

Another approach in this phase of experiments is to grow a decision tree with only the active activities and an unbalanced dataset (more recordings of walking than of running). Once more, J48 was used as the decision tree learning algorithm, with a pre-pruning decision of having a minimum number of instances in each leaf: at least 25 examples, corresponding to 1% of running. The results of this experiment are in figure 5-3.

Figure 5-3: Results from the baseline experiment only for active activities using J48 algorithm.

The results for this sub-group classification problem are, as anticipated in section 5.2.1, very poor. The RAE is nearly 100% and the ROC area shows a model no better than a random guess.
From the evaluation measures of this model, we can point to two kinds of problems we are facing with our baseline dataset. First, there are similar distributions caused by the ambulatory processes of active activities. Secondly, the disparity between the samples collected for the activity "Running" (1.3% of the dataset "Active.csv") and for "Walking" is causing problems in the divide and conquer method of this learning algorithm. In order to solve these two problems, the development of features will be very important to discriminate these two activities, and the composition of the dataset must be more balanced. Since it is very difficult to find subjects willing to collect accelerometer data, and there are no online datasets for activity recognition, it was decided to balance the dataset in the following experiments by down-sampling the most represented activity, walking, and to direct the sampling efforts to collecting only running activities.

5.2.3 Passive Activities Classification

As at the beginning of this chapter, the attribute selection toolkit of the Weka Explorer was used to find which features to use in the learning algorithm. From the three raw attributes {X, Y and Z}, CfsSubsetEval selected only two axes (Y and Z) to discriminate between the activities standing and sitting, as shown in figure 5-4.

Figure 5-4: Attribute selection from the raw data using CfsSubsetEval and considering only passive activities.

The approach for this subgroup classification problem is to use again the hierarchical learning algorithm J48 to learn a model.
The pre-pruning decision in this case was to grow a decision tree where each leaf must have a minimum number of objects: at least 1% of the cases of the activity with fewer examples (in this case standing: 8273*0.01 ≈ 82 examples per leaf). The input data was the {X, Y and Z} accelerometer signal, and the results and the respective decision tree can be seen in figure 5-5.

Figure 5-5: Results from the baseline experiment only for passive activities using J48 algorithm.

In the case of passive activities we can see that it is possible to grow a very simple decision tree that uses the Z axis only once. With these results in terms of accuracy (≈ 100%), Relative Absolute Error (RAE ≈ 2%) and ROC areas (≈ 1), it is possible to confirm that there will be no problems classifying passive activities using only statistical measures of location, after solving the problem of discriminating between passive and active activities.

5.3 First set of Statistical Measures

The related work on activity recognition shows a huge variety of feature constructions. Most works [3-5] used statistical features or combinations of statistical features and frequency-domain features.

For this first set of experiments we chose a set of statistical features that capture relations between axes, location and dispersion information: means and standard deviations [4, 14], interquartile ranges [5] and correlation measures [2] were computed in sliding windows of 100 instances, which gives us 5 seconds of information on average.

The computation of this first set of features was done in the R language [21], and the idea here is to repeat the experimental setup of the baseline after incorporating the new set of features and balancing the dataset.
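The window-based computation described above can be sketched as follows. This is not the R code used in this work, only a minimal illustration with assumed details: the windows do not overlap (`step` equal to `size`), the quartiles use the "inclusive" interpolation method, and a correlation involving a constant signal is reported as 0. The function names `pearson`, `iqr` and `window_features` are hypothetical; the feature names follow the convention Mean.X, SD.X, IQR.X and Cor.XY.

```python
import math
import statistics
from typing import Dict, List, Sequence

WINDOW = 100  # instances per window: 5 seconds at 20 Hz


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 when either signal is constant (assumption)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def iqr(values: Sequence[float]) -> float:
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def window_features(x: Sequence[float], y: Sequence[float], z: Sequence[float],
                    size: int = WINDOW, step: int = WINDOW) -> List[Dict[str, float]]:
    """Compute location, dispersion and correlation features for each window."""
    axes = {"X": x, "Y": y, "Z": z}
    rows = []
    for start in range(0, len(x) - size + 1, step):
        win = {name: vals[start:start + size] for name, vals in axes.items()}
        row: Dict[str, float] = {}
        for name, vals in win.items():
            row[f"Mean.{name}"] = statistics.fmean(vals)
            row[f"SD.{name}"] = statistics.stdev(vals)
            row[f"IQR.{name}"] = iqr(vals)
        row["Cor.XY"] = pearson(win["X"], win["Y"])
        row["Cor.XZ"] = pearson(win["X"], win["Z"])
        row["Cor.YZ"] = pearson(win["Y"], win["Z"])
        rows.append(row)
    return rows
```

Each window of raw accelerometer readings becomes one row of 12 features; in a labeled dataset the row would carry the activity label of its window.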
The final train and test dataset is more balanced than the dataset used for the data analysis (described in section 4.1) and in the baseline experiments, with more instances of running and a down-sampling of walking. This reduces the number of instances to about 150000, which gives us 2 hours of information over all the activities from which to extract features. Table 5-2 gives the information about the new dataset chosen for the following experimental procedures.

             Walking   Running   Sitting   Standing
Instances    87630     37647     17776     8070
Percentage   58%       24.9%     11.78%    5.35%

Table 5-2: Number of instances and percentages from the dataset used from section 5.3 until section 5.5.

5.3.1 All Activities

When our problem is considered as a flat classification problem, the attribute selection procedure, having as target variable all the activities {Running, Walking, Standing and Sitting} and as input features {Mean.X, Mean.Y, Mean.Z, SD.X, SD.Y, SD.Z, IQR.X, IQR.Y, IQR.Z, Cor.XY, Cor.XZ and Cor.YZ}, selects only information about the Y axis. This selection, in figure 5-6, results from the different oscillation amplitudes of running and walking in SD.Y (standard deviation of Y) and IQR.Y (interquartile range of Y), and from the gravitational force measured in the Y axis, which can be used to distinguish between standing and sitting.

Figure 5-6: Selection from the first set of features using CfsSubsetEval and considering all the activities.

These results do not take into account the rotation that the mobile can undergo inside the pockets. The information that is being measured in the Y axis can go partially or totally to the X axis, but this problem will be addressed in section 5.5.
After this attribute selection stage, J48 was applied to all the activities, with a pre-pruning decision of a minimum number of objects in each leaf: at least 1% of the instances of the least represented activity (Standing). The setup of this experiment is shown in figure 5-7, and figure 5-8 presents the results of the learning process.

Figure 5-7: Experimental setup of the flat classification process with the J48 algorithm.

Figure 5-8: Results from the first set of experiments using statistical features of location, dispersion and relation between axes, considering all the activities and using the J48 algorithm.

From the pruned decision tree created by the J48 algorithm, the first conclusion is that the improvements from the baseline experiments to this stage are remarkable: the ROC areas are all around 1 and the RAE decreased 4%. The decision tree has 31 leaves and 61 nodes in total, as can be seen in figure 5-9, and it uses almost all the features extracted from the axes: means, standard deviations and interquartile ranges.

Figure 5-9: Information about the decision tree created by this experiment.

Apparently, a simple classification problem with only 4 activities can work quite well with the statistical features proposed in the literature. But if we take into account that we want to implement this system in a mobile environment, where there are computational limitations, the question now is whether we can keep these good results with a simpler model.
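The pre-pruning rule used in these experiments (a minimum number of objects per leaf equal to 1% of the least represented activity) amounts to a small calculation. The sketch below assumes that the threshold is computed from the instance counts of table 5-2 and rounded down, as in the baseline example 8273*0.01 ≈ 82; the function name `min_leaf_size` is hypothetical.

```r
# Instance counts per activity, taken from table 5-2.
counts <- c(Walking = 87630, Running = 37647, Sitting = 17776, Standing = 8070)

# Minimum number of objects per leaf: a fraction of the smallest class.
# Rounding down is an assumption based on the baseline example.
min_leaf_size <- function(counts, fraction = 0.01) {
  floor(fraction * min(counts))
}

smallest_class <- names(which.min(counts))   # least represented activity
leaf_threshold <- min_leaf_size(counts)      # 1% of that activity
```

Under these assumptions the smallest class is Standing and the threshold is 80 objects per leaf. The value actually used in each experiment is the one reported in the corresponding setup figure.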
Finally, in most classification problems these results would be considered near perfect, but we need to point out that in a real application these features would not withstand the orientation problems that rotation of the mobile could cause. It would also be good to improve the classification model in order to obtain a cleaner confusion matrix, since the discrimination between walking and running clearly leaves room for improvement. It is also worth comparing the attributes selected by CfsSubsetEval, which chose only features extracted from the Y axis to discriminate between groups, with the features used in the pruned decision tree: there is a big discrepancy between the number of features used by the tree (9 features to build this classifier) and the number of features chosen by CfsSubsetEval.

5.3.2 Active Activities

When the experimental setup of section 5.3.1 is applied only to active activities, it is expected that the statistical measures used to discriminate between walking and running will be measures of dispersion in each axis. The evaluation measures of this learning procedure can be found in figure 5-10. Our initial expectations were correct, as can be seen in figure 5-11, with the additional information that the classifier now used only statistical features of dispersion containing information from the Y axis.

Figure 5-10: Results from the first set of experiments using statistical features of location, dispersion and relation between axes, considering only active activities and using the J48 algorithm.
The results for the active activities alone are surprising: the RAE decreased from almost 100% in the baseline experiments to 3% when statistical measures of dispersion calculated from sliding windows are used, and the areas under the ROC curves are almost perfect. These results are clearly good, and the decision tree produced does not show signs of overfitting.

Figure 5-11: Decision tree produced by the J48 classifier only for active activities.

With the results from this dataset it is easy to say that the problem is almost solved, but it is necessary to keep in mind that all the smartphones were oriented in the same way, which makes it possible to obtain good results using only features created from the Y axis. This happens because the vertical oscillations are bigger when the subject is running than when the subject is walking. But what if the orientation of the phone changes inside the pockets? This problem will be addressed in section 5.5.

5.3.3 Passive Activities

From the baseline experiments it was expected that location measures of the Z axis would be enough to distinguish between standing and sitting. The algorithm used for this classification problem, J48, was configured as before, with a pre-pruning decision of a minimum number of objects in each leaf: at least 1% of the instances of the smallest class in this dataset, standing. This experiment was conducted on a dataset with only passive activities, where the input features are the same as in section 5.3.1 and the target variable is composed only of {Standing, Sitting}. The evaluation measures for this learning process can be found in figure 5-12, and the description of the decision tree created is in figure 5-13. The expectations were met: a perfect model can be built with a single feature, the mean of the Z axis.
The ROC areas are perfect (= 1) and the RAE is 0%, which is in accordance with the expectations created in section 5.2.3.

Figure 5-12: Results from the first set of experiments using statistical features of location, dispersion and relation between axes, considering only passive activities and using the J48 algorithm.

Figure 5-13: Decision tree produced by the J48 classifier only for passive activities.

It is possible to ensure the robustness of this classification problem without the need to create a feature that takes into account the rotation of the mobile inside the pockets. If the subject is sitting, the gravitational force will be measured mostly in the Z axis, but when standing the gravitational force can be measured in the Y axis, the X axis or a combination of both, but never in the Z axis.

5.4 Hierarchical Classification

The data analysis in chapter 4 showed that, in the flat classification problem of four activities, the classes can be aggregated into two major groups, Passive and Active activities, but this methodology was not implemented until now. The idea in this section is to run experiments in which the problem is first divided into Passive and Active activities; then the classifier from section 5.3.2 is used to classify Walking and Running (within active activities) and the classifier from section 5.3.3 is used to classify Standing and Sitting (within passive activities). Until now the experiments have been developed in comparison with the baseline, but now the aim is to see whether any feature from the set used in section 5.3 is able to discriminate between passive and active activities.
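The two-level idea described above can be sketched in R as a simple routing function. This is a minimal illustration under stated assumptions: the three classifiers are stand-in threshold rules, and every threshold value and function name is hypothetical rather than learned by J48 in the thesis.

```r
# Sketch of the hierarchical routing. All thresholds are hypothetical
# placeholders, not values taken from the decision trees in the thesis.
classify_upper   <- function(f) ifelse(f$SD.Y > 2.0, "Active", "Passive")
classify_active  <- function(f) ifelse(f$SD.Y > 5.0, "Running", "Walking")
classify_passive <- function(f) ifelse(f$Mean.Z > 5.0, "Sitting", "Standing")

classify_hierarchical <- function(f) {
  group <- classify_upper(f)                 # upper level: Active or Passive
  label <- character(nrow(f))
  is_active <- group == "Active"
  # Lower level: each window is sent to the classifier of its own group.
  label[is_active]  <- classify_active(f[is_active, , drop = FALSE])
  label[!is_active] <- classify_passive(f[!is_active, , drop = FALSE])
  label
}

# Four made-up feature windows, one per expected activity.
windows <- data.frame(SD.Y = c(0.1, 0.2, 3.0, 7.5), Mean.Z = c(9.6, 0.4, 1.0, 1.2))
labels <- classify_hierarchical(windows)
```

The point of the design is that each lower-level rule only ever sees windows from its own group, so it can rely on very few features.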
5.4.1 Sub-grouping into Passive and Active Activities

Starting from the dataset created for the last section, the class labels associated with the calculated features were renamed: running and walking became "Active", and standing and sitting became "Passive". The R instructions used were:

> # the last column of each data frame holds the class label
> Passive_feat1 <- Passive_feat
> Passive_feat1[, ncol(Passive_feat1)] <- "Passive"
> Active_feat1 <- Active_feat
> Active_feat1[, ncol(Active_feat1)] <- "Active"
> All_feat_bin <- merge(Passive_feat1, Active_feat1, all = TRUE)
> write.csv(All_feat_bin, file = "All_feat_bin.csv")

After this transformation, the goal of the experiments was to find out whether a good discrimination between passive and active activities is possible. The hypothesis at this stage was that if this step reduced the number of features needed for the classification process, it would become part of the methodology proposed to address my problem. First, CfsSubsetEval was applied in the Select Attributes menu of the Weka Explorer in order to see which features would best solve this problem, and once more the features from the Y axis proved to be the most important, as can be seen in figure 5-14.

Figure 5-14: Selection from the first set of features using CfsSubsetEval and considering all the activities after renaming them into Passive and Active activities.

Before stepping into the classification problem, it is necessary to recall the problem discussed at the end of section 5.3.2: there are no guarantees that the mobile will always be carried in the pockets with the same orientation, so the information collected in the Y axis can be partially or totally transferred to the X axis.
This problem will be addressed in section 5.5.

Regardless of the problem described in the paragraph above, it is now important to apply the chosen learning algorithm, J48, with the pre-pruning decision used until now of a minimum number of examples in each leaf: in this case the threshold is 256 cases in each leaf. The target variable has the classes {Passive, Active} and the input features are the same as in section 5.3. The evaluation measures of the model produced are shown in figure 5-15.

Figure 5-15: Results from the upper level of the hierarchical approach using the same statistical features of section 5.3, considering all the activities renamed into Passive and Active activities and using the J48 algorithm.

The evaluation measures of this model are an RAE of around 1.7% and areas under the ROC curves of almost 1. These results must be considered together with the RAEs of sections 5.3.2 and 5.3.3 but, since the RAEs of those sections are almost 0%, we can say that these results are significantly better than the 10% RAE of section 5.3.1, where all the activities are considered together as a flat classification problem.

Secondly, it is also important to compare the number of features needed when all the activities are classified at once with the number needed when the sub-group decomposition approach is implemented. From section 5.3.1 we know that if we consider all the activities for the classification problem, the decision tree will incorporate 9 features with information from all the axes: means, standard deviations and interquartile ranges.
As can be seen in figure 5-16, in the hierarchical approach the first classification, between passive and active activities, needs 4 features. Moving to the lower level of classification, a passive activity needs only one feature, the mean of the Z axis (see figure 5-13), to discriminate between standing and sitting, and an active activity needs 2 features, taken from the Y axis (see figure 5-11), to discriminate between walking and running. The features of the lower level of classification are also used to classify between passive and active activities, so the hierarchical classifier reduces the 9 features needed in section 5.3.1 to 4 features.

Figure 5-16: Decision tree produced by the J48 classifier for all the activities in the upper level classification of the hierarchical approach {Passive, Active}.

The conclusion from this section is that the subgroup approach for these 4 activities is useful because it improves the results of the classification model and at the same time reduces the computational needs of the problem, which can be helpful in a mobile environment.

5.5 The Orientation Problem

In this section we present the development of features intended to solve the orientation problem caused by the mobile device moving inside the subjects' pants pockets. We are still considering the placement of the mobile in one of the pants pockets of an individual, but this can be considered a step towards implementing this model in a real environment. We now propose a feature called the sum of the first derivatives with respect to time.
This feature is proposed because it calculates the amount of movement recorded in all axes, and I believe it can provide the threshold needed to discriminate between Passive and Active activities, as presented in section 3.2.2:

$\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2 + \left(\frac{dz}{dt}\right)^2$

5.5.1 Implementing 1st Derivatives

At this stage of the experimental procedures we present a set of features almost the same as in the last section. The only difference is the extraction of a new feature, named the sum of the 1st derivatives (Sum.1st.Derivative). The boxplot in figure 5-17 gives an idea of the discriminative power of this feature.

[Boxplot: Sum of the 1st Derivatives. Horizontal axis: Activities (Active, Passive). Vertical axis: Sum of 1st Derivatives Distribution, from 0 to 4000.]

Figure 5-17: Boxplot presenting the Sum.1st.Derivative when considering the upper level of the hierarchical classification problem.

The experimental setup of this section uses the input features {Mean.X, Mean.Y, Mean.Z, SD.X, SD.Y, SD.Z, IQR.X, IQR.Y, IQR.Z, Sum.1st.Derivative} to classify the target variable {Active, Passive} with the J48 learning algorithm from Weka, with a pre-pruning decision of at least 1% of objects in each leaf (considering the smaller class: Passive activities). The goal of this experiment is to see whether this new feature reduces the number of features needed to discriminate between passive and active activities and at the same time solves the orientation problems detected previously. The results of this experiment and the decision tree produced are shown in figure 5-18.

Figure 5-18: Results from the upper level of the hierarchical approach after the implementation of the feature Sum.1st.Derivative, and the respective decision tree produced by the J48 algorithm.
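One possible way to compute a feature of this kind per window is sketched below in R. The exact definition belongs to section 3.2.2 and is not reproduced here, so the form used (finite differences at an assumed 20 Hz, combined across axes with the Euclidean norm and summed over the window) and the function name `sum_first_derivative` are assumptions made for illustration only.

```r
# Sketch: an assumed form of the "sum of first derivatives" for one window.
# The derivative is approximated by differences between consecutive samples.
sum_first_derivative <- function(x, y, z, rate_hz = 20) {
  dt <- 1 / rate_hz
  dx <- diff(x) / dt
  dy <- diff(y) / dt
  dz <- diff(z) / dt
  sum(sqrt(dx^2 + dy^2 + dz^2))   # amount of movement over the window
}

time_s <- seq(0, by = 1 / 20, length.out = 100)          # 5 seconds at 20 Hz
x <- sin(2 * pi * time_s)
y <- 3 * sin(4 * pi * time_s)
z <- rep(9.8, 100)

still   <- sum_first_derivative(rep(0, 100), rep(0, 100), z)  # no movement
moving  <- sum_first_derivative(x, y, z)                      # oscillating signal
swapped <- sum_first_derivative(y, x, z)                      # X and Y exchanged
```

In this sketch a motionless window yields zero, and exchanging the X and Y axes leaves the value unchanged. That independence from which axis carries the movement is the property the orientation problem calls for.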
From figure 5-18 we can conclude that Sum.1st.Derivative becomes the most powerful feature to discriminate between passive and active activities, since it is the root of the decision tree. The following comparisons are made in relation to section 5.4.1, since the algorithm and specifications are the same and the only difference is the introduction of the new feature. The evaluation measures of the model also improved in terms of RAE (0.7208%). The number of features needed to build this classification model was also reduced, from 4 features in section 5.4.1 to 3 features now.

This experiment was repeated once more with a new pre-pruning decision of at least 5% of instances in each leaf, considering the smaller class (in this case the passive activities). The minimum number of objects in the leaves this time was 1300, and the results can be seen below in figure 5-19.

Figure 5-19: Results from the upper level of the hierarchical approach with a more severe pre-pruning decision, and the respective decision tree produced by the J48 algorithm.

With evaluation measures similar to those of section 5.4.1, the major conclusion of this experiment is that it is possible to use this feature as a threshold to distinguish between passive and active activities. The confusion matrix shows that most of the incorrectly classified instances were active activities classified as passive, which can be explained by transition moments in which the subject faces some environmental obstacle.

Finally, this section should present transformations of the statistical features of section 5.3 based on the conceptualization described in section 3.2.2.
These transformations are meant to capture all the information of axes X and Y, since the mobile device can rotate inside the pockets (see figure 1-1). Since the dataset used was always collected with the same placement and orientation, it would be worthless to present these transformations in this thesis with the dataset available.

5.6 Model Assessment

In the previous experimental sections we saw two different approaches that must be compared: flat classification and the hierarchical model. In this section we compare the two approaches with the four activities {walking, running, standing, sitting} as target variable and {Mean.X, Mean.Y, Mean.Z, SD.X, SD.Y, SD.Z, IQR.X, IQR.Y, IQR.Z, Sum.1st.Derivative} as input features. A new flat classification model was learnt in order to include the feature developed in section 5.5 (Sum.1st.Derivative), and this model is then compared with the hierarchical model. The comparison is made by estimating the error rates using cross-validation, and also by comparing the size of the models, measured by the number of nodes, and the time taken to learn each approach.

5.6.1 Assessment of the Flat Classifier

To start this comparison, we first learnt a new flat model using one more feature than those used in section 5.3.1: Sum.1st.Derivative. The input features for this model are listed at the beginning of section 5.6 and the target variable is composed of the activities {walking, running, standing, sitting}. The learning algorithm used was J48 and the pre-pruning decision was a minimum number of objects in each leaf: at least 1% of the smallest class (in this case standing). The setup can be seen in figure 5-20.
The decision tree produced has 53 nodes (see figure 5-21), fewer than the 61 nodes of the decision tree produced in the experiment of section 5.3.1. Since the only change from that experiment to this one is the introduction of the feature developed in the experiments of section 5.5, it is possible to claim that this improvement was caused by the feature Sum.1st.Derivative. Figure 5-21 also shows that it took 27 seconds to build this flat classification model.

Figure 5-20: Setup for the flat classification model.

Figure 5-21: Time taken to build and size of the flat classification model.

The evaluation results of the flat classifier can be seen in figure 5-22.

Figure 5-22: Evaluation results for the flat classifier.

As figure 5-22 shows, the estimated RAE is 3.1325% for the flat classifier. All areas under the ROC curves have a value close to one, and in general it could be considered an almost perfect model. These results must now be compared with the results of the hierarchical model.

5.6.2 Assessment of the Hierarchical Classifier

In order to estimate the error rate of the hierarchical classifier, a three-step procedure was implemented. First, we used the model created in section 5.5 (see figure 5-18) to estimate the error rate of the upper classification level. The second step was to split the dataset into passive and active activities and then build a lower-level classifier for each dataset. Finally, after building the models in the second step, the weighted error (RAE) was calculated for the hierarchical model in order to compare it with the error of the flat classification model.

The first step only involves the interpretation of the results presented in figure 5-18. The RAE of the upper-level classifier is 0.7208%.
The size of this tree is seven nodes and the building procedure took 8.41 seconds. Figure 5-23 shows the confusion matrix of this model; the error at this stage will be propagated to the leaves of the lower levels.

Figure 5-23: Confusion matrix of the upper-level classifier, from the experiment shown in figure 5-18.

In the second step, after splitting the dataset into passive and active activities, two lower-level models were produced. For the passive activities dataset the target variable was {standing, sitting} and for the active dataset it was {walking, running}. The input features for these models were the same, as shown in figure 5-24. The purpose of these models is to integrate them in the classification procedure after the activities have been classified as active or passive, as figure 5-25 explains.

Figure 5-24: Passive and active experimental setups.

Figure 5-25: Conceptualization of the hierarchical model.

The output and evaluation results for the passive and active activities models are presented in figure 5-26. For the active model, the J48 algorithm selected {SD.Y, IQR.Y, SD.Z, IQR.X} as relevant features, and for the passive model the only feature selected was {Mean.Z}. The trees produced have 11 nodes for the active model and 3 nodes for the passive model. The time taken to build these models was 0.21 seconds for passive activities and 21.4 seconds for active activities.

Figure 5-26: Full output results for passive and active models, using the J48 algorithm.

Finally, it is necessary to calculate a weighted error for the hierarchical model in order to compare it with the flat classifier.
For the upper-level classifier, the errors are propagated to the leaves of the lower-level classifiers, so this error must be fully counted in the estimated RAE of the hierarchical model. On the other hand, to estimate the error of the lower-level classifiers we must take into account only the instances correctly classified by the upper-level classifier that might be misclassified at this stage, which is done through the sensitivity and specificity of the upper-level classifier. The weighted RAE for this hierarchical classification problem is 4.0084%.

5.7 Discussion

The first conclusion from this set of experiments is that the construction of features is the major problem in building good classifiers. The feature developed in section 5.5 addresses the orientation problem and at the same time produces better discrimination between active and passive activities. This feature is also important when we build the lower-level classifier to distinguish running from walking.

Secondly, from the results seen in section 5.6, it is difficult to make a strong claim about which modeling approach is better. It is our belief that the hierarchical model is better than the flat classification model, even if it produces slightly worse results. The hierarchical approach brings some advantages in terms of knowledge about the activity recognition problem, and when considering new activities not considered in this thesis it will give a better framework to develop new features.

There might also be some discussion about the sampling procedure, which uses sequence-based sliding windows. This method was implemented in this work only because the collection frequency of the accelerometer was fixed at 20 Hz. Considering the need to preserve battery in smartphones, time-based sliding windows should be taken into account when implementing this model in a mobile application.
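The difference between sequence-based and time-based windows mentioned above can be illustrated with a short R sketch. The timestamps (in milliseconds), the throttled sampling rate and both function names are hypothetical and serve only to show how the two schemes diverge when the sampling rate is not constant.

```r
# Sequence-based: every `width` consecutive samples form one window.
sequence_windows <- function(n, width = 100) {
  (seq_len(n) - 1) %/% width + 1
}

# Time-based: every `window_ms` milliseconds of elapsed time form one window.
time_windows <- function(ts_ms, window_ms = 5000) {
  (ts_ms - ts_ms[1]) %/% window_ms + 1
}

# Hypothetical timestamps: 100 samples at 20 Hz, then 100 samples at 5 Hz
# (for example, if the sampling rate were lowered to save battery).
ts_ms <- c(seq(0, by = 50, length.out = 100),
           seq(5000, by = 200, length.out = 100))

by_sequence <- as.vector(table(sequence_windows(length(ts_ms))))
by_time     <- as.vector(table(time_windows(ts_ms)))
```

With these timestamps the sequence-based scheme yields two windows of 100 samples each, but the second one spans almost 20 seconds instead of 5. The time-based scheme yields five windows of 5 seconds each, with 100 samples in the first and 25 in each of the others.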
Finally, our claim at this moment is that this model is not yet ready to be used in an application, given the other problems described in section 1.3 as challenges. Most of these challenges need to be solved first, and only then might we make a strong claim about the robustness of this activity recognition model in a real mobile environment.

Chapter 6: Conclusions and Future Work

The development of this work was really helpful for developing the author's Data Mining knowledge, thanks to the working environment and the knowledge shared by the supervisors. It should be seen as an incomplete work, like all master's theses, but it is our belief that it should be considered a good framework for future work.

As future work it is important to set at least five main goals. First, it is important to develop the model in order to implement an application for the mobile environment. This should be done by addressing most of the challenges in section 1.3 and by developing programming skills.

Secondly, once the first goal is achieved, it is necessary to take new activities into account in the model in order to take advantage of the hierarchical approach. To consider new activities, better sampling methodologies need to be developed. One approach to this problem could be the development of new mobile applications to collect labeled accelerometer data, making the process easier for the user and for the group developing the model.
One of the most interesting directions this work could take is the implementation of a semi-supervised model, where the relations between features are treated as constraints so that the model adapts better to a single user by learning from the data collected by him/her. It is our belief that this approach can give useful information through the interpretation of the adapted models.

Finally, to address the activity recognition problem, it would be interesting to enrich the information of the model. There are several approaches that could be considered. For example, more information could be collected if more sensors were used. The choice of sensors should be restricted to those already integrated in mobile devices. Another approach could be to enrich the model with server-based information in order to interact with the users. It is also an option not to collect more data, but to use the outputs of the model to track the routines of the users.

References

1. Marc Mertens, G.D., Jonas Van Den Bergh, Toon Goedemé, Koen Milisen, Jos Tournoy, Jesse Davis, Tom Croonenborghs and Bart Vanrumste. Towards automatic monitoring of activities using contactless sensors. in 20th Annual Belgian-Dutch Conference on Machine Learning, Benelearn 2011. 2011. The Hague, The Netherlands.
2. Bao, L. and S.S. Intille, Activity Recognition from User-Annotated Acceleration Data. Pervasive Computing: Lecture Notes in Computer Science, 2004. Volume 3001/2004: p. 1-17.
3. Jennifer R. Kwapisz, G.M.W., Samuel A. Moore. Activity Recognition using Cell Phone Accelerometers. in Fourth International Workshop on Knowledge Discovery from Sensor Data. 2010. Washington, DC.
4. Miluzzo, E., et al., Sensing meets mobile social networks: the design, implementation and evaluation of the CenceMe application, in Proceedings of the 6th ACM conference on Embedded network sensor systems. 2008, ACM: Raleigh, NC, USA. p. 337-350.
5. Preece, S.J., J.Y. Goulermas, L.P.J. Kenney and D. Howard, A Comparison of Feature Extraction Methods for the Classification of Dynamic Activities From Accelerometer Data. IEEE Transactions on Biomedical Engineering, 2009. 56(3): p. 871-879.
6. Masud, M.M., et al., A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data, in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. 2008, IEEE Computer Society. p. 929-934.
7. Santos, A.C., et al., Providing user context for mobile and social networking applications. Pervasive Mob. Comput., 2010. 6(3): p. 324-341.
8. Jhun-Ying Yang, Y.-P.C., Gwo-Yun Lee, Shun-Nan Liou and Jeen-Shing Wang, Activity Recognition Using One Triaxial Accelerometer: A Neuro-fuzzy Classifier with Feature Reduction. Lecture Notes in Computer Science, 2007. Volume 4740/2007: p. 395-400.
9. Krishnan, N.C. and S. Panchanathan, Analysis of low resolution accelerometer data for continuous human activity recognition, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008). 2008: Las Vegas, NV. p. 3337-3340.
10. Ermes, M., et al., Detection of Daily Activities and Sports With Wearable Sensors in Controlled and Uncontrolled Conditions. Information Technology in Biomedicine, IEEE Transactions on, 2008. 12(1): p. 20-26.
11. Allen, F.R., et al., Classification of a known sequence of motions and postures from accelerometry data using adapted Gaussian mixture models. Physiological Measurement, 2006. 27(10): p. 935.
12. M. J. Mathie, B.G.C., N. H. Lovell and A. C. F. Coster, Classification of basic daily movements using a triaxial accelerometer. Medical and Biological Engineering and Computing, 2004. Volume 42(Number 5): p. 679-687.
13. Preece, S.J., J.Y. Goulermas, L.P.J. Kenney, D. Howard, K. Meijer and R. Crompton, Activity identification using body-mounted sensors: a review of classification techniques. Physiol Meas, 2009. 30(4): p. 1-33.
14. Susanna Pirttikangas, K.F. and T.N., Feature Selection and Activity Recognition from Wearable Sensors. Lecture Notes in Computer Science, 2006. Volume 4239/2006: p. 516-527.
15. Uwe Maurer, A.R., Asim Smailagic and Daniel Siewiorek, Location and Activity Recognition Using eWatch: A Wearable Sensor Platform. Lecture Notes in Computer Science, 2006. Volume 3864/2006: p. 86-102.
16. Norbert Győrbíró, Á.F. and G.H., An Activity Recognition System For Mobile Phones. Mobile Networks and Applications, 2008. Volume 14(Number 1): p. 82-91.
17. Babcock, B., M. Datar and R. Motwani. Sampling From a Moving Window Over Streaming Data. in SODA. 2002.
18. Quinlan, J.R., C4.5: programs for machine learning. 1993, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
19. Machine Learning Group at University of Waikato, WEKA. 2011.
20. Witten, I.H., E. Frank and M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. 2011, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
21. R: A Language and Environment for Statistical Computing. 2009, R Development Core Team: Vienna, Austria.
22. Trevor Hastie, R.T., Jerome Friedman, ed. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. 2001, Springer.
23. Bloedorn, E. and R.S. Michalski, Data-Driven Constructive Induction. IEEE Intelligent Systems, 1998. 13(2): p. 30-37.
24. Dietterich, T.G. and R.S. Michalski, Inductive Learning of Structural Descriptions: Evaluation Criteria and Comparative Review of Selected Methods. Artificial Intelligence, 1981. 16(3): p. 257-294.
25. Mehra, P., L.A. Rendell and B.W. Wah. Principled Constructive Induction. in Eleventh International Joint Conference on Artificial Intelligence. 1989: Morgan Kaufmann.
26. Pfahringer, B. CIPF 2.0: A Robust Constructive Induction System. in ML-COLT'94. 1994.
27. Guan, D., W. Yuan, Y.-K. Lee, A. Gavrilov and S. Lee. Activity Recognition Based on Semi-supervised Learning. in 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications. 2007: IEEE Computer Society.
28. Frank, J., S. Mannor and D. Precup, Activity Recognition With Mobile Phones. 2011.
29. Frank, J., S. Mannor and D. Precup. A novel similarity measure for time series data with applications to gait and activity recognition. in 12th ACM international conference adjunct papers on Ubiquitous computing. 2010. Copenhagen, Denmark: ACM.
30. Pari Delir Haghighi, A.Z., Shonali Krishnaswamy, Mohamed Medhat Gaber. Mobile Data Mining for Intelligent Healthcare Support. in 42nd Hawaii International Conference on System Sciences. 2009. Hawaii: IEEE Computer Society.
31. Goh, J. and D. Taniar, An Efficient Mobile Data Mining Model, in Parallel and Distributed Processing and Applications. 2005. p. 54-58.
32. Costa, E.P., et al., Comparing several approaches for hierarchical classification of proteins with decision trees, in Proceedings of the 2nd Brazilian conference on Advances in bioinformatics and computational biology. 2007, Springer-Verlag: Angra dos Reis, Brazil. p. 126-137.
33. Chapman, P., et al., CRISP-DM 1.0 Step-by-step data mining guide. 2000.
34. Liu, H. and H. Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective. 1998, Norwell, MA, USA: Kluwer Academic Publishers.
35. Datar, M., A. Gionis, P. Indyk and R. Motwani, Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing, 2002: p. 635-644.
36. Hastie, T., R. Tibshirani, and J.H. Friedman, The elements of statistical learning: data mining, inference, and prediction. 2009: Springer.
37. Leo Breiman, J.H.F., Richard A. Olshen, and Charles J. Stone, Classification and Regression Trees. 1984: Wadsworth International Group.
38. Esposito, F., D. Malerba, and G. Semeraro, Decision Tree Pruning as a Search in the State Space, in Proceedings of the European Conference on Machine Learning. 1993, Springer-Verlag. p. 165-184.
39. Bratko, I., Prolog: programming for artificial intelligence. 3rd ed. 2001, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
40. Fawcett, T., An introduction to ROC analysis. Pattern Recogn. Lett., 2006. 27(8): p. 861-874.
41. Santos, A.C. AndroLib: Accelerometer Data Recorder. 2010; Available from: http://www.androlib.com/android.application.pt-acoelhosantos-android-accnFwm.aspx.
42. Santos, A.C., AccDataRec. 2010, Android Market.