
The Pennsylvania State University
The Graduate School
Department of Industrial and Manufacturing Engineering

A PROPOSED DATA MINING DRIVEN METHODOLOGY FOR MODELING HUMAN GAIT AND GEOSPATIAL TRAJECTORIES

A Thesis in Industrial Engineering
by Yixiang Han

© 2013 Yixiang Han

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

August 2013

The thesis of Yixiang Han was reviewed and approved* by the following:

Conrad S. Tucker, Assistant Professor of Industrial Engineering, Thesis Advisor
Timothy W. Simpson, Professor of Industrial Engineering
Paul Griffin, Professor of Industrial Engineering, Head of the Department of Industrial Engineering

*Signatures are on file in the Graduate School

ABSTRACT

Less than 35% of human communication is verbal (hearing, listening, etc.), whereas greater than 65% of human communication is nonverbal (body posture, facial expressions, etc.). By analyzing nonverbal human communication instead of just verbal communication, researchers may be able to perceive latent human features such as body language, neurological patterns, etc., otherwise missed through verbal communication alone. In this thesis, human kinematics (i.e., human gait and geospatial trajectory) is modeled and analyzed so as to perceive and predict human behavior and kinematic patterns. A data mining driven methodology is proposed for modeling and predicting both human gait (i.e., human walking posture) and human geospatial trajectory (i.e., a sequence of geospatial locations from a moving individual in an indoor space). The human gait mining component of the proposed methodology captures multimodal gait data in order to model and predict neurological patterns that influence human gaits. The human trajectory mining component of the methodology aims to predict common regions of interest (CRI) in indoor design spaces by modeling geospatial trajectory patterns.
A Parkinson’s disease (PD) detection case study is used to validate the human gait component of the methodology, and an engineering design case study involving students working in teams is used to validate the human trajectory component. Analyzing human gait and geospatial trajectories makes it possible to reduce the effect of human variation and to recognize patterns of interest in both, so as to evaluate human movement characteristics and understand human movement dynamics.

TABLE OF CONTENTS

List of Figures
List of Tables
Acknowledgements
Chapter 1 Introduction
Chapter 2 Literature Review
    2.1 Existing Techniques for Modeling Human Movement
        2.1.1 Existing Human Gait Modeling
        2.1.2 Existing Human Geospatial Trajectory Modeling
    2.2 Data Mining based Human Movement Modeling
        2.2.1 Data Mining based Human Gait Modeling
        2.2.2 Data Mining based Human Geospatial Trajectory Modeling
Chapter 3 Methodology
    3.1 Human Gait Modeling Methodology
        3.1.1 Step 1: Sensor Data Acquisition
        3.1.2 Step 2: Data Preprocessing
        3.1.3 Step 3: Data Mining Knowledge Discovery
        3.1.4 Step 4: Model Performance Evaluation and Application
    3.2 Geospatial Trajectory based Human Motion Modeling Methodology
        3.2.1 Step 1: Data Acquisition
        3.2.2 Step 2: Data Transfer
        3.2.3 Step 3: Data Mining Knowledge Discovery
        3.2.4 Step 4: Model Visualization
Chapter 4 Case Studies and Discussion
    4.1 Parkinson’s Disease Detection based Case Study
        4.1.1 PD Data Acquisition and Preprocessing
        4.1.2 PD-based Data Mining Knowledge Discovery and Evaluation
    4.2 Geospatial Trajectory Clustering
        4.2.1 Geospatial Trajectory Data Acquisition and Preprocessing
        4.2.2 Geospatial Trajectory based Knowledge Discovery and Explanation
Chapter 5 Conclusions and Future Work
References

List of Figures

Figure 3-1. Framework of the proposed human gait based methodology.
Figure 3-2. Skeletal image with 20 nodes and example data from Shoulder_Center node.
Figure 3-3. Framework of geospatial trajectory modeling.
Figure 4-1. PD forward walking experiment overhead view.
Figure 4-2. The learning factory layout.
Figure 4-3. Extracted characteristic points of User 1.
Figure 4-4. Visualization of trajectory partitioning for User 1.
Figure 4-5. Clustering visualization.
Figure 4-6. Clustering visualizations in the first period.
Figure 4-7. Clustering visualizations in the second period.
Figure 4-8. Clustering visualizations in the third period.

List of Tables

Table 3-1. Confusion matrix example.
Table 4-1. Algorithms performances in walking experiment.
Table 4-2. Other evaluations among multiple algorithms in walking experiment.
Table 4-3. Original trajectory of User 1.
Table 4-4. Example result based on clustering algorithm.
Table 4-5. Result of clustering algorithm.

Acknowledgements

It is my pleasure to thank everyone who helped make my thesis possible. I would like to express my deepest appreciation to my advisor, Dr. Conrad S. Tucker. He patiently provided the guidance, motivation, remarks, and useful comments that carried me not just through my Master’s study and the writing of this thesis but through my overall academic and professional career as well. He has shaped my growth and development in research and scholarship. Without his tremendous mentorship and persistent help this thesis would not have been possible.

I would also like to offer my special thanks to my thesis committee member, Dr. Timothy W. Simpson, for his guidance, encouragement, insightful comments, and immensely helpful suggestions. His guidance has served me well, and I owe him my heartfelt appreciation for taking the time to advise me through the process of writing this thesis.

I thank my fellow lab mates in the Design Analysis Technology Advancement (D.A.T.A.) Lab in the School of Engineering Design, Technology and Professional Programs for their great support and enlightenment. Their friendship and assistance have meant more to me than I can express in words. Thank you all for your patience and friendly assistance.

It has been a great pleasure to be a student in the Harold and Inge Marcus Department of Industrial and Manufacturing Engineering at the Pennsylvania State University at University Park. I deeply thank Dr. Paul Griffin, the department head, Dr. M. Jeya Chandra, the graduate program coordinator, and all other members of the department and the university.

Last but not least, I would like to thank my family, Xiaokao Han and Fengmei Li, for giving birth to me and supporting me throughout my life. I love you all my life.
Chapter 1 Introduction

Research has shown that more than 65% of human communication is nonverbal (e.g., posture, gesture), while about 35% is considered verbal (e.g., speech, discussion) [1]. Verbal communication conveys a large volume of information, but it may miss latent aspects such as body posture and facial gestures that could provide researchers with added dimensions of knowledge. Within research pertaining to nonverbal communication, human movement behavior modeling is gaining significant interest across domains ranging from public security surveillance to human movement-based disease diagnosis [2–4]. By analyzing human motion, researchers are able to compare and evaluate human movement characteristics in order to capture movement patterns and understand human dynamics.

The objectives of analyzing human gait behavior and geospatial trajectory behavior are both important and complementary. Human gait is defined as the act of self-propulsion achieved by using the human extremities [5]. A human geospatial trajectory is defined as a sequence of geospatial locations from a moving individual, used to recover human motion in a given space [6]. Human gait behavior analysis focuses on the motion of human body segments (e.g., posture detection), while geospatial trajectory behavior analysis focuses on movement through a given space without considering specific parts of the human body structure.

There is already a wide spectrum of applications based on human gait modeling, such as athletic performance evaluation, medical diagnosis, public security surveillance, and video conferencing [7-8]. Analyzing human gait behaviors helps capture and recognize human movement characteristics and interesting gait patterns that could be taken as evidence for different targets and applications.
For example, potential Parkinson’s disease (PD) patients are typically diagnosed from specific symptoms such as muscle rigidity, vocal problems, and gait disorders, observed through several kinematic experiments based on published criteria [8–10]. Another example is that swimmers analyze videotapes of other swimmers in order to learn about performance indicators such as basic speed, stroke mechanics, and starting and turning abilities, which could be helpful in their personal training [12-13]. Similar case studies can be found in other domains such as tennis, basketball, and airport security surveillance [13–15].

In contrast to human gait analysis, human geospatial trajectory methodologies focus on human movement within a given space, capturing geospatial position, velocity, time, acceleration, etc., without considering human body segments (i.e., the human body is considered as one point, and different segments are assumed to have the same movement status). The objective is to detect geospatial movement patterns such as the trajectory shape, common regions of interest (CRI), and the density of multiple trajectories in a specific setting. For example, researchers have proposed methodologies to detect overcrowded situations in an indoor space (e.g., shopping malls, career fairs, railway stations) so as to provide event alarms and better-organized layouts [16]. Furthermore, this tracking strategy could also contribute to traffic control in order to relocate different facilities and improve user experience [17]. Other similar examples can be found in [19-20].

The aforementioned methodologies for modeling human gait and geospatial trajectories usually rely on one of a number of human body models, which range from stick figures and ribbon-based 2-D contours to 3-D volumetric human models. However, there are some limitations that must be addressed [2], [20–22].
First, the standard 2-D and 3-D models do not account for human variation (e.g., body size). Different individuals may have different heights, weights, and other parameters, which could lead to variations in human motion modeling and consequently affect the predictive accuracy. In this thesis, human variation is addressed by introducing ratio components for position, velocity, and acceleration between each pair of joints. In addition, joint correspondence (i.e., identifying every joint in successive frames) is required in certain 2-D and 3-D models, which may restrict the modeling flexibility and make it applicable only to some motion types (e.g., simple walking). In this thesis, each joint in the 3-D model is detected automatically by using a multimodal sensor. In order to mitigate these challenges, a data mining driven methodology is proposed to model human gait and geospatial trajectory in order to normalize and categorize various human gaits and understand utilization density in an indoor space.

The rest of this thesis is organized as follows. This chapter provides an introduction and background relating to human motion analysis. Chapter 2 describes previous work related to the research topic, discusses the pros and cons of these methodologies, and contrasts them with the methodology proposed in this thesis. The human gait and geospatial trajectory components of the methodology are introduced in Chapter 3, with results and discussions presented in Chapter 4. Chapter 5 concludes the thesis and identifies future research directions.

Chapter 2 Literature Review

This chapter discusses the past research that is closely related to the topic of this thesis. The literature review begins by discussing the performance of various existing human motion modeling methodologies in Section 2.1. Data mining based human motion modeling methodologies are then discussed in Section 2.2.
As part of this section, the classification algorithms most relevant to this study are discussed and compared, with emphasis on extracting significant motion features for both human gait and trajectory classification and prediction.

2.1 Existing Techniques for Modeling Human Movement

Existing methodologies proposed to model human movement focus on human motion tracking without system-based automatic motion recognition and classification (e.g., public security surveillance systems) [23–25]. For example, existing passive surveillance systems (e.g., Closed Circuit Television (CCTV) cameras) can only track human movements and require well-trained camera operators to manually view the video feed in order to recognize any suspicious act. In this section, multiple existing modeling methodologies are discussed relating to both human gait modeling (see Section 2.1.1) and geospatial trajectory modeling (see Section 2.1.2).

2.1.1 Existing Human Gait Modeling

One approach to modeling human gait motion is based on 2-D human stick figures, which consist of multiple joints or nodes connected by line segments. This “skeletal” representation of the human body could be a significant aid in tracking and estimating human gait. One example is to model human gait based on a moving light display (MLD) [23]. In this methodology, the human body kinematics is modeled using 12 MLD lights representing the head, shoulders, hips, elbows, wrists, knees, and ankles. This MLD-based model can help translate 3-D human gait into 2-D projections during different human motion experiments. Other similar examples can be found in [25], [27-28]. However, the joint correspondence required in human gait modeling is the most challenging and complex part, since each joint requires node-to-node correspondence between successive frames.
In addition, the 2-D projection provided by these methodologies cannot provide depth data (i.e., it gives only X and Y) for each joint and is not capable of describing real-world 3-D human movements. Depth data is needed for accurate 3-D modeling. These shortcomings are addressed in the proposed methodology by collecting 3-D joint data (i.e., X, Y and Z coordinates) with a multimodal sensor and storing it in a database (the data structure is discussed in more detail in Chapter 3). Once the data is stored in a database, researchers are able to extract information or run predictive models on it, thus making it possible to model and predict human gait, as demonstrated later in this thesis.

Another approach to tracking and recognizing human motion is based on 2-D contour modeling. The objective in this methodology is to model human gait by adding human outlines, which is more precise than applying only the “skeletal” models of the previously discussed approaches [27-28]. For example, human gait could be modeled based on a ribbon-based 2-D model without putting markers on the human body [28]. This 2-D model includes eight joints, which represent the shoulders, elbows, hips, and knees. The moving ribbons are extracted by comparing the ribbons between two frames, and the resulting parameter curves recover motion characteristics in different body parts. Other similar 2-D ribbon-based examples can be found in [26,30]. However, human size variation during gait analysis is not considered. In addition, there are body structure constraints that may restrict the modeling flexibility and make it applicable only to some simple motion types (e.g., simple walking). In order to reduce human size variation and model other human motion types, the proposed data mining driven methodology is introduced in Chapter 3.
Compared with the previous 2-D human gait modeling, 3-D modeling has several advantages, such as viewing each joint independently based on the 3-D angle parameters and modeling other complex and unconstrained human movements [30]. 3-D models usually include two components: the “skeletal” figure and the surrounding tissue [30]. For instance, a 22-DOF model is applied to construct the skeleton of the human body, such as the arms, legs, and torso. Cylinders, spheres, and other primitives are then applied to generate the 3-D model [31]. However, the 3-D model is still incapable of addressing human gait variations, since the parameters of the 3-D model are fixed. In addition, the included body constraints may reduce the model flexibility for complex motions beyond simple walking. Other similar 3-D models can be found in [32–34].

2.1.2 Existing Human Geospatial Trajectory Modeling

Human trajectory analysis addresses geospatial positions without considering the human body structure. Some studies have used qualitative methodologies to collect geospatial trajectory patterns, such as visual observations, interviews, and questionnaires [35], [37-38]. For instance, human geospatial trajectory analysis has been conducted based on a questionnaire at the Osaka Science Museum in Japan [35]. In the experiment, each volunteer was asked after the tour to fill out a questionnaire about their interactions with different robots. By analyzing the popular types of robots and the amount of time spent on each one based on the questionnaire feedback, the researchers found no difference in preference between males and females and no correlation between the age of visitors and the popularity of robots. However, the volunteers’ answers are subject to bias, since different people may give different feedback on the same robot. In addition, this methodology requires a lot of time for human collaboration.
Finally, there could be potential privacy problems during the data collection process. For example, confidential information such as name, age, and phone number might be collected and accessed. Other examples can be found in [37-38].

Other approaches to modeling human geospatial movement are based on technologies such as Bluetooth, Wi-Fi, GPS, and video camera recording, which can record precise time, location, and trajectory data. For example, a Bluetooth sensor was installed in one of the busiest regions inside the Louvre museum in order to record the number of visitors and collect geospatial positions with corresponding times [38]. The researchers concluded that there are strong connections between Samothrace and Hall access, and between Hall access and the big gallery, which could help explain the most frequent trajectory patterns [34]. In their work, visitors’ trajectories could be clearly described, and correlations between different nodes could also be discovered. However, the main limitation is that the busiest nodes are predetermined by officers (i.e., domain knowledge is included). In addition, this methodology may only be able to model one-directional trajectories, while other types (e.g., circular or back-and-forth trajectories) may not be supported. Other similar examples can be found in [40-41].

2.2 Data Mining based Human Movement Modeling

Beyond tracking and modeling individual gait and geospatial trajectories, higher-level objectives such as comparing and recognizing both human gait features and geospatial movement patterns could provide additional insights and discoveries. The aim is to generate clusters of similar features that provide reliable guidance for recognizing the activities of new, previously unseen people. In order to achieve this goal, data mining based methodologies are discussed in this section for both human gait analysis and human geospatial trajectory analysis.
2.2.1 Data Mining based Human Gait Modeling

Charayaphan and Marble proposed a data mining based methodology to detect hand motion and classify different hand signs [41]. In the methodology, a frame grabber with an IBM PC was used to extract multiple frames of hand motion without applying any 2-D or 3-D models. Hand detection is accomplished by comparing the gray scale difference between two successive frames, followed by hand sign classification based on the stop position. Another similar example can be found in [42].

Polana and Nelson proposed a methodology that does not require joint correspondence or tracking of specific parts of an individual as mentioned in Section 2.1 [43]. Instead, human motion is tracked from the moving pixels, followed by spatial translation in which the image frames are reduced to the same size as the object. Finally, the spatial gray-valued frame set generated at each time point t is taken as the feature vector and compared with the reference motion set using the K nearest neighbors (KNN) algorithm [43].

Heisele proposed another methodology to model human gait based on Color Cluster Flow (CCF) [44]. In this methodology, the pedestrian is represented by two initial color-based clusters (i.e., lilac jacket and blue pants). The trajectories of these two clusters are considered an approximation of the real human motion, which is identified by comparing the cluster trajectories with reference trajectories.

However, the main limitation of these methodologies is that they can only be applied to periodic and simple motion types (e.g., parallel walking) and have lower predictive accuracy for complex motions such as rotation of the arms or legs. In addition, a predefined reference motion set is usually required before the motion modeling.
2.2.2 Data Mining based Human Geospatial Trajectory Modeling

Johnson [45] proposed a methodology to model human trajectory based on probability density function (pdf) modeling. In the methodology, the human trajectory is described as a sequence of flow vectors denoted by $Q = \{f_1, f_2, \ldots, f_n\}$, where n is the total number of images captured for one subject. A learning network is then applied to model the pdf, classifying n input data nodes into k output nodes (k and n are predetermined parameters) based on nearest distance.

Fu [46] models and predicts human geospatial trajectories by measuring the similarity between two individual trajectories with a similarity matrix. A two-layer clustering algorithm is then employed, in which the dominant paths and routes are generated in the first layer. The Tightness & Separation Criterion (TSC) is then applied to quantitatively evaluate the clustering results.

However, the main shortcoming of these methodologies is that they only deal with trajectory clustering, without detecting common regions of interest (CRI) that could help explain the motivation behind each individual’s activities. In addition, they may only perform well on single-directional trajectories and cannot be applied to other cases such as circular or rectangular trajectories.

Having reviewed the research problem, motivation, and previous related methodologies with their pros and cons, a data mining driven methodology is proposed to overcome the aforementioned limitations in both human gait analysis and human geospatial trajectory analysis, and to extract significant movement patterns in order to compare and understand human gait and geospatial activities. This methodology is introduced in the next chapter.

Chapter 3 Methodology

Based on the background established above, the proposed data mining driven methodology for modeling human gait and geospatial trajectories is presented in detail in this chapter.
The proposed methodology aims to overcome human motion variation by adding the ratios of position, velocity, and acceleration between each pair of joints. There are four steps in the methodology. In the first step, data acquisition is conducted in order to collect human gait and geospatial trajectory data. In the second step, a data preprocessing technique is proposed for data cleaning and transfer. In Step 3, data mining algorithms are employed to model and extract human movement features so as to explore common movement patterns. Step 4 outlines a validation and evaluation framework that helps determine the robustness of the proposed methodology. The human gait based methodology is discussed in Section 3.1, followed by the human geospatial trajectory based methodology in Section 3.2.

3.1 Human Gait Modeling Methodology

The human gait modeling methodology aims to capture multimodal gait data in order to model and predict neurological patterns that influence human gait. The human gait modeling component of the methodology is partitioned into a total of four steps, shown in Fig. 3-1: data acquisition (Step 1), data preprocessing (Step 2), data mining knowledge discovery (Step 3), and model evaluation and application (Step 4). Step 1 discusses how to set up experiments and collect human gait data with the multimodal sensor employed in this work. Step 2 discusses how to preprocess the collected data and store it on a server. Step 3 discusses how to extract correlated features from the generated data set and apply these features to model, recognize, and evaluate human gait patterns. In Step 4, the trained model is evaluated and can be extended into different domains to help recognize and compare human gait patterns. For example, Step 4 in Fig. 3-1 could apply the proposed human gait modeling to the detection of human movement related diseases (e.g., Parkinson’s disease detection).
Similar applications would be threat detection and athletic performance evaluation.

Figure 3-1. Framework of the proposed human gait based methodology.

3.1.1 Step 1: Sensor Data Acquisition

The first step of human gait modeling outlines the experiment setup for collecting human gait data. In this step, the overall body movement (i.e., human gait) is captured through a multimodal sensor system including an RGB video camera and an infrared depth sensor. In the sensor data acquisition, the human body gait is modeled and captured based on the “skeletal” model shown in Fig. 3-2, where a total of twenty joints are represented by the black circles. Compared with the modeling methodologies mentioned in the Literature Review, this multimodal sensor is able to automatically recognize each of the twenty joints without placing sensors on the human body. In addition, position, velocity, and acceleration ratios are created between each pair of joints to help normalize the shape variation existing in a population. By utilizing the multimodal sensor, this virtual skeletal model is able to capture the movements of joints in a 3-D environment (i.e., X, Y and Z coordinates) in real time while preserving privacy, which is sometimes a desirable feature in human gait modeling (e.g., human gait based Parkinson’s disease detection). In this research study, the Microsoft Kinect sensor is used for data collection; it is capable of tracking human motion by applying a virtual skeletal image similar to that shown in Fig. 3-2. The hardware captures a frame of human gait approximately every 33 ms and generates about 3 MB of data in 4 s.

Figure 3-2. Skeletal image with 20 nodes and example data from Shoulder_Center node.

3.1.2 Step 2: Data Preprocessing

In addition to the initial X, Y, Z position data, the velocity and acceleration of each joint are calculated by taking the derivatives of position and velocity, respectively, which creates additional features from the raw data.
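The derived features described above can be illustrated with a minimal sketch. It assumes, for illustration only, that derivatives are approximated by finite differences between successive frames spaced about 33 ms apart, and that a pairwise ratio is taken on speed magnitudes; the thesis does not specify either choice here. The function names, the `Head` joint label, and the sample coordinates are hypothetical.

```python
from itertools import combinations
from math import sqrt

FRAME_DT = 0.033  # seconds between frames (about 33 ms, as stated in the text)


def finite_difference(samples, dt=FRAME_DT):
    """Approximate the time derivative of a list of (x, y, z) samples.

    ASSUMPTION: a simple difference between successive frames; the thesis
    only says that derivatives of position and velocity are taken.
    """
    return [
        tuple((b - a) / dt for a, b in zip(prev, curr))
        for prev, curr in zip(samples, samples[1:])
    ]


def magnitude(vec):
    """Euclidean length of a 3-D vector."""
    return sqrt(sum(c * c for c in vec))


def pairwise_speed_ratios(velocity_by_joint, eps=1e-9):
    """Ratio of speed magnitudes for every pair of joints in one frame.

    ASSUMPTION: the ratio is defined on magnitudes, and eps guards against
    division by zero for a stationary joint.
    """
    ratios = {}
    for a, b in combinations(sorted(velocity_by_joint), 2):
        ratios[(a, b)] = magnitude(velocity_by_joint[a]) / (
            magnitude(velocity_by_joint[b]) + eps
        )
    return ratios


# Hypothetical positions (meters) of one joint over three frames.
positions = [(0.0, 1.0, 2.0), (0.033, 1.0, 2.0), (0.099, 1.0, 2.0)]
velocities = finite_difference(positions)      # two velocity samples
accelerations = finite_difference(velocities)  # one acceleration sample

# Hypothetical single-frame velocities for two joints.
ratios = pairwise_speed_ratios(
    {"Shoulder_Center": (3.0, 4.0, 0.0), "Head": (0.0, 0.0, 10.0)}
)
```

Each differentiation shortens the series by one sample, so position, velocity, and acceleration features would need to be aligned on common frames before being combined into one record.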
In order to reduce human gait variation (e.g., longer legs may produce a longer stride), the ratios of position, velocity, and acceleration between each pair of joints are also generated. Since not all of the generated features are expected to have the same predictive power for the response variable, only the features most relevant to the response variable should be selected in order to capture and distinguish human gait patterns, which leads to a reduction of the feature space [52]. Since multiple data mining algorithms are applied in this study, a candidate feature selection algorithm should be independent of the classifiers while maintaining good performance [52]. In this thesis, Correlation-based Feature Selection (CFS) is used, which selects the feature set most relevant to the output variable with minimum correlation inside the feature set [48]. In the CFS methodology, the correlation between relevant features (i.e., the features included in the relevant feature set) and irrelevant features (i.e., the features not included in the relevant feature set) is a function of the number of components inside the feature set, the average inter-correlation among the inside features, and the average correlation between the inside components and the outside features, as shown in Equation 3.1. More technical details can be found in [48].

$$ r_{zc} = \frac{k\,\overline{r_{zi}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}} \qquad \text{(Equation 3.1)} $$

where:
$r_{zc}$ is the correlation between the current relevant features and the potential relevant features.
$k$ is the number of features.
$\overline{r_{zi}}$ is the average correlation between the relevant features and the potentially relevant features.
$\overline{r_{ii}}$ is the average correlation between two relevant features.
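As a hedged illustration of Equation 3.1, the sketch below evaluates the standard CFS merit expression, k times the average relevance correlation divided by the square root of k + k(k-1) times the average inter-correlation. In the usual statement of CFS, the numerator term is the mean feature-to-class correlation and the denominator term is the mean feature-to-feature correlation. The function name and the sample values are hypothetical, and the search over candidate subsets that CFS performs is not shown.

```python
from math import sqrt


def cfs_merit(k, mean_r_zi, mean_r_ii):
    """Evaluate the CFS merit r_zc for a subset of k features.

    mean_r_zi: average correlation of the subset's features with the outcome.
    mean_r_ii: average correlation among the features inside the subset.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k * mean_r_zi) / sqrt(k + k * (k - 1) * mean_r_ii)


# Hypothetical comparison of two four-feature subsets with equal relevance.
redundant = cfs_merit(4, 0.5, 1.0)    # features fully correlated with each other
independent = cfs_merit(4, 0.5, 0.0)  # features uncorrelated with each other
```

With equal relevance, the subset whose features are less correlated with one another receives the higher merit, which is the behavior the text describes as minimum correlation inside the feature set.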
3.1.3 Step 3: Data Mining Knowledge Discovery

After data preprocessing is completed, the aim is to develop a function $f(X) = Y$ that maps the selected features $X = (x_1, x_2, ..., x_n)$, where n is the total number of selected features, to the class variable Y. From a theoretical point of view, there are two types of data mining learning methodologies: supervised learning and unsupervised learning. Supervised learning is the machine learning task of inferring a function from labeled training data, while unsupervised learning refers to the problem of finding hidden structure in unlabeled data [49]. Since observations are labeled in human gait modeling, supervised learning is selected. In addition, multiple data mining algorithms, including Binary Logistic Regression, Support Vector Machine, C4.5, Random Forest, and IBK, are employed since they have been shown to perform well in human gait classification problems [50–55]. Based on the performance of the different algorithms, the most accurate and reliable model and partitioning criteria are generated.

Binary Logistic Regression

In Binary Logistic Regression, each selected input variable is given a coefficient in order to formulate a function that maps the input variables to the output variable. Here the coefficients can be considered indicators of predictive power. The equations are shown in Eq. 3.2 and Eq. 3.3. By applying a linear combination of multiple features as the input for a new observation, the model estimates the probability that the observation falls into one category. For example, in the Parkinson’s disease (PD) case study presented in Chapter 4, one category would be PD patients and the other would be controls. More information about logistic regression can be found in [50][51]. A linear model of this kind may help with the classification problem; however, its accuracy is sometimes inconsistent since a linear combination of features may not be able to explain all the variation in the response variable.
Considering these limitations, the support vector machine is introduced.

$$ f(x) = \beta^{T} x $$ (Equation 3.2)

$$ \beta^{*} = \arg\min_{\beta} \sum_{i=1}^{n} \left( f(x_i) - y \right)^2 $$ (Equation 3.3)

where,
β: is the coefficient and f(x) is the logistic function.
$x_i$: is the value of the ith feature for one observation.
y: is the value of the output variable.

Support Vector Machine (SVM)

In addition to the logistic regression model, SVM is another available classifier, which works by maximizing the margin between two different classes. In contrast to other data mining algorithms, SVM gives more attention to the observations that are close to the partition boundary between the classes, and it generates the final separating boundary based on the function shown in Eq. 3.4. In practice, SVMs are made robust by adding “slack variables” that allow the training error to be non-zero. In addition, SVM is able to transform the current data to a higher dimensional space and construct the decision boundary there. Specific technical details can be found in [52][53]. SVM may improve on the accuracy of logistic regression modeling; however, it is sometimes difficult to explain the kernel function and the results of the algorithm, since SVM is a non-parametric technique whose kernel function cannot be represented as a simple parametric function of the input variables, so its results lack transparency [56]. In order to overcome these limitations while maintaining modeling accuracy and robustness, the C4.5 decision tree algorithm is discussed.

$$ f(x_1, x_2, ..., x_n) = \sum_{i=1}^{n} w_i x_i + b $$ (Equation 3.4)

where,
$w_i$: is the coefficient and f(x) is the decision function.
b: is the tolerance of the misclassification error.
$x_i$: is the value of the ith feature for one observation.

C4.5

C4.5 is a well-established classification algorithm proposed by Quinlan as the successor to his ID3 algorithm of 1986 [54][55]. C4.5 is usually employed to classify one type of pattern in binary classification problems [54][55].
The algorithm comprises two main steps: (1) best attribute evaluation and (2) splitting point selection. The attribute evaluation step attempts to select the most informative node in each subset of the training data set (the whole training data set for the root node selection) based on the maximum value of the gain ratio, which is calculated using Eq. 3.5 to Eq. 3.8. The splitting point selection attempts to decide the best numerical split point, i.e., the one with the minimum misclassification error, which is based on Eq. 3.6. More information about decision trees can be found in [50], [59-60].

$$ Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) $$ (Equation 3.5)

$$ Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j) $$ (Equation 3.6)

$$ Gain(A) = Info(D) - Info_A(D) $$ (Equation 3.7)

$$ GR(A) = \frac{Gain(A)}{SplitInfo(A)} = \frac{Info(D) - Info_A(D)}{-\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)} $$ (Equation 3.8)

where,
Info(D): is the expected information needed to classify a tuple in D.
D: is the data set.
m: is the total number of classes.
$p_i$: is the probability that an arbitrary sample belongs to class $C_i$ and is estimated by $|C_{i,D}| / |D|$.
$Info_A(D)$: is the information needed to split D into v partitions by selecting the attribute A.
Gain(A): is the information gained by branching on an attribute A.
I(A): is the information of attribute A.
GR(A): is the gain ratio of attribute A.
SplitInfo(A): is the split information of attribute A.
$|D_j|$: is the number of instances in D that belong to the jth partition.

Random Forest (RF)

Random Forest retains many benefits of decision tree classification algorithms (such as C4.5) while achieving better results through the use of bagging, random subsets of variables, and a voting scheme [57]. First, by using random selection, M random cases are sampled with replacement from the training data set for each tree. Then N features are also randomly sampled to help construct a single tree (M and N are predetermined parameters).
Second, the sampled input variables and cases are used to generate a single tree, following a procedure similar to that of C4.5. Finally, a large number of trees are generated and they vote for the most popular class. This entire procedure is called a random forest (RF). More details can be found in [61-62].

IBK

The IBK classifier is an instance-based machine learning algorithm based on K-Nearest Neighbor (KNN). Instead of constructing explicit abstractions such as a linear logistic regression model, a decision tree model, or an SVM model, IBK compares the similarity between the observations in the training data set and the hold-out observations in the test data set. In addition, this algorithm assumes that similar instances should have similar classifications. By computing the instance similarity (shown in Eq. 3.9), IBK is able to assign each new instance to the class of its nearest neighbors and thereby group similar instances together. More information is given in [59][53].

$$ Similarity(x, y) = -\sqrt{\sum_{i=1}^{n} f(x_i, y_i)} = -\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$ (Equation 3.9)

where,
$x_i$: is the value in one dimension of one observation.
$y_i$: is the value in one dimension of another observation.
n: is the total number of features in the feature space.

3.1.4 Step 4: Model Performance Evaluation and Application

After human gait modeling, the next step is to evaluate model performance based on multiple evaluation measures. Before the following performance metrics are computed, k-fold cross validation is employed. In k-fold cross validation, the whole data set is randomly partitioned into k folds of roughly equal size. Each time, k-1 folds form the training data set used to train the model, while the remaining fold is the test data set used to validate performance. This procedure is repeated k times, so that each fold serves as the test data set once, and the performance is averaged and reported using multiple evaluation measures. Based on the literature review, k is set to 10 [52] [60].
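The cross validation procedure, combined with a nearest-neighbor rule in the spirit of IBK, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the thesis: it uses a single nearest neighbor, plain Euclidean distance, and a one-dimensional toy data set, and all function names and data values are hypothetical.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training instance closest to query (Euclidean distance)."""
    distances = [math.dist(x, query) for x in train_x]
    return train_y[distances.index(min(distances))]


def cross_validated_accuracy(xs, ys, k, seed=0):
    """Average accuracy over k folds; each fold is the test set exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        hits = sum(
            nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]
            for i in test_idx
        )
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Hypothetical one-feature observations forming two well separated groups.
xs = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,),
      (5.0,), (5.1,), (5.2,), (5.3,), (5.4,)]
ys = ["control"] * 5 + ["PD"] * 5
accuracy = cross_validated_accuracy(xs, ys, k=5)
```

Because the two toy groups are far apart, every held-out observation finds its nearest neighbor in its own group; real gait features would of course overlap far more.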
The first evaluation is based on the confusion matrix (an example is shown in Table 3-1), which contains four cells: (1) true positive, (2) false positive, (3) false negative, and (4) true negative. These values are used to generate the Correctly Classified Instance (CCI), precision, recall, F-measure, and ROC curve [61].

Table 3-1. Confusion matrix example.

                        Actual Status
Predicted Status        True                    False
True                    True Positive (TP)      False Positive (FP)
False                   False Negative (FN)     True Negative (TN)

The second evaluation measure, Correctly Classified Instance (CCI), gives the overall accuracy of a model across the two categories. The calculation is shown in Eq. 3.10.

$$ CCI = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% $$ (Equation 3.10)

The third metric, the Kappa statistic (KS) [53][62], measures the agreement between predicted and actual classes, over all positive and negative cases, after accounting for chance prediction. Generally, its value ranges from -1 to 1, and the model is considered reliable when the value lies between 0.8 and 1. In addition, KS<=0.2 (poor); 0.2<KS<=0.6 (fair); 0.6<KS<=0.8 (substantial). The calculation is shown in Eq. 3.11.

$$ KS = \frac{p_0 - p_c}{1 - p_c} $$ (Equation 3.11)

where,
$p_0$: is the probability of total agreement.
$p_c$: is the probability of agreement because of chance.

Other evaluation measures, namely precision, recall, and F-measure, can also be calculated from the confusion matrix. Precision and recall are related to Type I errors (false positives) and Type II errors (false negatives), respectively, and the calculations are shown in Eq. 3.12 and Eq. 3.13. For example, a precision value greater than 0.95 means that more than 95% of the observations the model labels as positive are truly positive. F-measure is another performance indicator, and it can be considered a weighted (harmonic) average of the precision and recall. Note that it reaches its best value at 1 and its worst value at 0. The equation is shown in Eq. 3.14.
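As an informal illustration of the confusion-matrix measures described in this step, the sketch below computes CCI, the Kappa statistic, precision, recall, and F-measure from the four cell counts. The function name and the counts are hypothetical, and the chance-agreement term follows the usual Cohen's kappa definition (the product of marginal proportions per class), which is an assumption since the thesis does not spell out how the chance probability is obtained.

```python
def classification_metrics(tp, fp, fn, tn):
    """Evaluation measures derived from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    p0 = (tp + tn) / total  # observed agreement
    # Chance agreement (assumed Cohen's kappa form): predicted-positive share
    # times actual-positive share, plus the same product for the negatives.
    pc = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "cci_percent": p0 * 100.0,                  # Equation 3.10
        "kappa": (p0 - pc) / (1.0 - pc),            # Equation 3.11
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 classified observations.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

For these made-up counts the overall accuracy is 85%, while the Kappa statistic is noticeably lower because half of the agreement would be expected by chance alone.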
$$ precision = \frac{TP}{TP + FP} $$ (Equation 3.12)

$$ recall = \frac{TP}{TP + FN} $$ (Equation 3.13)

$$ F = 2 \times \frac{precision \times recall}{precision + recall} $$ (Equation 3.14)

The last evaluation measure is the receiver operating characteristic (ROC) curve. Since a classification model is usually applied with particular values of thresholds or parameters, the ROC curve describes the model performance obtained at different threshold values so that the best operating point can be chosen. The best operating point might be chosen so that the classifier gives the best trade-off between the cost of failing to detect positives and the cost of raising false alarms. These costs need not be equal; however, this is a common assumption. Note that the best place to operate the classifier is usually the point on its ROC curve that lies on a 45 degree line closest to the north-west corner (0, 1) of the ROC plot.

Once the human gait modeling and performance evaluation are completed, the most suitable model can be applied to detect and recognize human gait patterns and to visualize the results for decision support. The main benefit of the decision support is its ability to quantify and visualize the human gait results. In addition, the decision support also helps measure and evaluate human gait patterns based on a small subset of relevant features. Finally, this decision support may serve as a reference system for detecting any gait pattern of interest in a particular application domain.

3.2 Geospatial Trajectory based Human Motion Modeling Methodology

The motivation for analyzing human geospatial trajectories is not only to model geospatial movement patterns relative to an indoor space but also to recognize common trajectory patterns from multiple people so as to achieve the objectives of different application domains (e.g., averaging indoor space utilization, maintaining crowd control, etc.).
Here, a trajectory pattern can be understood as a set of regions that are of interest to different individuals. The methodology consists of a total of four steps: data collection (Step 1), data preprocessing (Step 2), data mining knowledge discovery (Step 3), and visualization (Step 4). The framework for human trajectory modeling is shown in Fig. 3-3.

Figure 3-3. Framework of geospatial trajectory modeling.

3.2.1 Step 1: Data Acquisition

Since there is not much novel contribution in Step 2, Step 1 and Step 2 are combined. The first step of the human trajectory based methodology is data acquisition, in which data are captured through a wireless indoor tracking system that updates each individual’s geospatial location (i.e., X and Y coordinates) in real time with corresponding time stamps. By utilizing the GPS-based tracking system, the geospatial locations of each individual can be updated and considered an approximation of the individual’s geospatial trajectory. Researchers are then able to extract individual trajectory patterns and establish common trajectory patterns among multiple people. In this study, the BuzNet Real-Time Locating System (RTLS) was used to track the trajectories of multiple individuals in an indoor space. Once the data collection is completed, the data are stored in a database in a suitable format for subsequent steps in the data mining process.

3.2.2 Step 2: Data Transfer

Human geospatial trajectory data transfer is based on a hardware and software platform that consists of three primary components: (1) Routers, (2) Tags, and (3) a Base Station. Routers are fixed-position devices that form the wireless network infrastructure of the hardware. Tags are wireless, battery-powered mobile devices placed on individuals in an indoor environment. The Base Station is a PC (typically Microsoft Windows-based) that is loaded with the software.
When individuals are walking around in a given indoor space, this system provides an interactive visualization interface that tracks individual locations approximately every 2 minutes. At the same time, the Base Station stores every calculated location for every tag in a database (locally stored or cloud-based) that can be accessed and analyzed.

3.2.3 Step 3: Data Mining Knowledge Discovery

By comparing the individual trajectories of multiple people in an indoor space, researchers can extract common trajectory patterns in order to understand and recognize how the indoor design space is utilized. Since the TRACLUS algorithm does not depend on the trajectory type (e.g., dual direction trajectory), it can extract individual movement features from different trajectories, which provides more information for trajectory clustering [63]. The methodology contains two steps: (1) partitioning and (2) clustering. The first step attempts to capture and recover the real trajectory based on a subset of trajectory points. The second step attempts to group the line segments generated in the previous step so as to recognize, based on clusters, trajectory patterns shared among different people. In this section, the individual trajectory partitioning methodology is explained first, followed by the clustering methodology.

Trajectory partitioning

We assume that the original real individual trajectory can be reproduced from the data collected in the previous step. Some simple trajectories (e.g., one-directional trajectories) could be classified directly; however, most trajectories are not of this type and cannot provide insight if they are classified directly without any partitioning. The partitioning algorithm provides one approach to reproducing the original trajectory, without losing much information, based on an optimal subset of characteristic points.
Given an individual trajectory $T = \{t_1, t_2, ..., t_n\}$, optimal characteristic points $P = \{p_1, p_2, ..., p_n\}$ are generated [63]. Here $t_k$ is any position point collected in the previous data collection step, and $p_j$ is any characteristic point extracted. In the partitioning algorithm, the Minimum Description Length (MDL) function is applied to evaluate each point (the equations are shown in Eq. 3.15 to Eq. 3.18). The first point in the original trajectory is taken as the starting point; as long as the MDL_par cost is less than or equal to the MDL_nonpar cost, the search continues until the first point that violates this requirement is found. Assuming the first point that violates the MDL cost condition is $t_k$, the point preceding it, $t_{k-1}$, is considered a characteristic point, as in [63], and is taken as the new starting point from which to search for the next characteristic point, until all the points in the original trajectory have been checked. Finally, the characteristic point set P is established.

$$ MDL_{par} = L(H) + L(D|H) $$ (Equation 3.15)

$$ L(H) = \log_2(len(p_j p_{j+1})) $$ (Equation 3.16)

$$ L(D|H) = \sum_{k=p_j}^{p_{j+1}-1} \left[ \log_2(d_{\perp}(p_j p_{j+1}, t_k t_{k+1})) + \log_2(d_{\theta}(p_j p_{j+1}, t_k t_{k+1})) \right] $$ (Equation 3.17)

$$ MDL_{nonpar} = \sum_{j=startindex}^{currentindex} \log_2(len(p_j p_{j+1})) $$ (Equation 3.18)

where,
$MDL_{par}$: is the MDL cost of one possible characteristic point.
$MDL_{nonpar}$: is the non-MDL cost of one point.
$L(H)$: is the length of the hypothesis when the next location is added.
$L(D|H)$: is the distance between line segments.
$len(p_j p_{j+1})$: is the Euclidean distance between two points.
$d_{\perp}(p_j p_{j+1}, t_k t_{k+1})$: is the perpendicular distance between two line segments.
$d_{\theta}(p_j p_{j+1}, t_k t_{k+1})$: is the angle distance between two line segments.
$p_j$: is the jth characteristic point in one trajectory.
$p_{j+1}$: is the (j+1)th characteristic point in one trajectory.
$t_k$: is the kth location point in one trajectory.
$t_{k+1}$: is the (k+1)th location point in one trajectory.
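The MDL-based search can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the perpendicular and angle distances follow the definitions of the TRACLUS formulation cited as [63] as commonly described, encoding costs for values below 1 are clamped to zero bits to avoid taking the logarithm of zero, and all function names and the toy path are hypothetical.

```python
import math


def _bits(value):
    """log2 encoding cost; values below 1 cost 0 bits (assumed clamp)."""
    return math.log2(value) if value >= 1.0 else 0.0


def _project(point, start, end):
    """Orthogonal projection of point onto the line through start and end."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    norm_sq = dx * dx + dy * dy
    if norm_sq == 0.0:
        return start
    u = ((point[0] - start[0]) * dx + (point[1] - start[1]) * dy) / norm_sq
    return (start[0] + u * dx, start[1] + u * dy)


def perpendicular_distance(hyp_a, hyp_b, seg_a, seg_b):
    """Order-2 Lehmer mean of the end point offsets from the hypothesis line."""
    l1 = math.dist(seg_a, _project(seg_a, hyp_a, hyp_b))
    l2 = math.dist(seg_b, _project(seg_b, hyp_a, hyp_b))
    return 0.0 if l1 + l2 == 0.0 else (l1 * l1 + l2 * l2) / (l1 + l2)


def angle_distance(hyp_a, hyp_b, seg_a, seg_b):
    """Segment length times sin(theta), or the full length beyond 90 degrees."""
    hx, hy = hyp_b[0] - hyp_a[0], hyp_b[1] - hyp_a[1]
    sx, sy = seg_b[0] - seg_a[0], seg_b[1] - seg_a[1]
    hyp_len, seg_len = math.hypot(hx, hy), math.hypot(sx, sy)
    if hyp_len == 0.0 or seg_len == 0.0:
        return 0.0
    if hx * sx + hy * sy < 0.0:
        return seg_len
    return abs(hx * sy - hy * sx) / hyp_len


def mdl_par(points, i, j):
    """Cost of replacing points i..j by the single segment (i, j)."""
    cost = _bits(math.dist(points[i], points[j]))  # L(H)
    for k in range(i, j):  # L(D|H)
        seg = (points[k], points[k + 1])
        cost += _bits(perpendicular_distance(points[i], points[j], *seg))
        cost += _bits(angle_distance(points[i], points[j], *seg))
    return cost


def mdl_nonpar(points, i, j):
    """Cost of keeping every original segment between points i and j."""
    return sum(_bits(math.dist(points[k], points[k + 1])) for k in range(i, j))


def characteristic_points(points):
    """Indices of the characteristic points of one trajectory."""
    selected, start, length = [0], 0, 1
    while start + length < len(points):
        current = start + length
        if mdl_par(points, start, current) > mdl_nonpar(points, start, current):
            selected.append(current - 1)  # previous point becomes characteristic
            start, length = current - 1, 1
        else:
            length += 1
    selected.append(len(points) - 1)
    return selected


# Hypothetical L-shaped path: three points heading east, then two heading north.
toy_path = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (20.0, 10.0), (20.0, 20.0)]
toy_result = characteristic_points(toy_path)
```

On this toy path the sketch keeps the two end points and the corner, dropping the intermediate points on the straight stretches. The clamp in `_bits` changes the costs for sub-unit distances, so results on real tracking data depend on the coordinate units.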
Trajectory clustering

By classifying different individual movement features into different clusters, researchers are able to understand the density of all the trajectories in the indoor design space in order to improve user experience. Based on the characteristic points selected in the previous section, the original individual trajectory can be approximated and represented as a combination of line segments. In this section, the objective is to classify these line segments into different clusters in which common movement patterns are captured. The clustering algorithm is based on the DBSCAN algorithm, which is a type of density-based clustering algorithm [63]. Given a set of line segments $L = \{l_1, l_2, ..., l_j\}$, multiple clusters $C = \{c_1, c_2, ..., c_k\}$ can be generated, where j and k are the total numbers of line segments and clusters, respectively [63]. The methodology has two parameters: (1) ε and (2) MinLn. ε is a threshold on the distance between any pair of line segments, and MinLn is the minimum number of line segments inside a cluster.

The algorithm contains three steps [63]. First, a queue Q is constructed to hold all the unlabeled line segments during the algorithm. Each time, the ε-neighborhood of one unclassified line segment in the queue is computed based on the distance function shown in Eq. 3.19 and Eq. 3.20. If the number of line segments in the neighborhood satisfies $|N(l_i)| \geq MinLn$, then a density-based set is generated; otherwise, the line segment is considered noise. This is repeated until all the unclassified line segments are examined. Second, the algorithm attempts to expand clusters. Assuming there are M neighborhoods generated in the first step, a similar process is repeated for each neighborhood. If there are other neighborhoods connected to it, then a cluster is generated. Finally, a trajectory cardinality check is conducted to ensure that the line segments inside a cluster come from different individual trajectories.
More details can be found in [63].

$$ N(l_i) = \{ l_j \in Q \mid dist(l_i, l_j) \leq \varepsilon \} $$ (Equation 3.19)

$$ dist(l_i, l_j) = d_{\perp}(l_i, l_j) + d_{\theta}(l_i, l_j) + d_{\parallel}(l_i, l_j) $$ (Equation 3.20)

where,
$N(l_i)$: is the ε-neighborhood of one unclassified line segment $l_i$; the number of line segments it contains is compared with MinLn.
ε: is a threshold on the distance between any pair of line segments.
$dist(l_i, l_j)$: is the distance between two line segments.
$d_{\perp}(l_i, l_j)$: is the perpendicular distance between two line segments.
$d_{\theta}(l_i, l_j)$: is the angle distance between two line segments.
$d_{\parallel}(l_i, l_j)$: is the parallel distance between two line segments.

As mentioned earlier, the goal of trajectory modeling is to detect and recognize movement patterns not only for individuals but also for multiple people. By visualizing the trajectory clustering results based on the methodology discussed in Section 3.2.3, we may be able to understand the dynamics behind the geospatial movement patterns in these two aspects. In addition, the clustering results may also be applied to help achieve some high-level objectives. For instance, relating the clustering results to the specific facility layout of a given setting could help researchers understand indoor space utilization or public space crowd control, so as to generate guidance for relocating various resources and improving user experience. In the next chapter, two case studies on human gait and geospatial trajectory are discussed to validate the methodologies.

3.2.4 Step 4: Model Visualization

The goal of human geospatial trajectory modeling is to discover all possible utilized regions in order to better understand human movement dynamics, to describe, based on the clustering results, how space utilization patterns evolve over different time periods, and to suggest possible improvements to indoor space design.
From Step 1 to Step 3, researchers are able to obtain the total number of clusters, the number of line segments included in each cluster, the number of different individuals included, and the locations of these line segments in each cluster. Clustering visualization helps researchers better understand how the indoor space is utilized and may lead to a better indoor space design. In the previous sections, a utilized region is assumed to be a location in a design space that contains clusters of individuals. Based on the data mining trajectory clustering methodology, several common movement patterns from different individual trajectories can be detected, which is equivalent to finding the clusters of common movement patterns. Second, based on the clustering result, the total number of clusters generated and the number of individuals included in each one can be obtained. In addition, the evolution of indoor space utilization, based on the change of movement patterns over different time periods, is also addressed in order to capture changes in human movement behavior patterns.

Chapter 4 Case Studies and Discussion

In order to validate the methodologies proposed in Chapter 3, a suitable application/case study is chosen for human gait modeling and for geospatial trajectory modeling, respectively. A Parkinson’s disease (PD) detection case study is introduced to illustrate the human gait modeling. The objective of this case study is to propose a non-invasive motion tracking methodology that will serve as a healthcare decision support system, capable of predicting the emergence of PD based on extracted PD gait patterns. An indoor design space utilization case study is presented to validate the human geospatial movement component of the proposed methodology. The objective is to understand how the indoor space is utilized based on the density of all the trajectories. For the data acquisition in the two case studies, volunteer participants from the university were invited.
Since both studies involve human volunteers whose personal information could be identifiable during the experiments and could therefore raise privacy concerns, only skeletal frames are used in the human gait study, while user IDs are tracked anonymously in the geospatial trajectory study. It is important to note that all the experiments were designed and carried out following the guidelines and rules enforced by the Institutional Review Board (IRB) and the Office for Research Protections (ORP) for research involving human participants. The details of PD detection are discussed in Section 4.1, while the trajectory clustering details are discussed in Section 4.2.

4.1 Parkinson’s disease detection based Case Study

Parkinson’s disease (PD) is a motor related disease that affects more than one million people in North America and is the second most common neurological disorder after Alzheimer’s disease [7],[68-69]. PD results from the death of dopamine-generating cells in a region of the midbrain called the substantia nigra [66]. The symptoms of PD include shaking, muscle rigidity, slowness of movement, difficulty with walking, and some vocal problems; however, the most obvious symptoms are gait-related, especially during the early stages of PD [8]. Here the early PD stages are defined as stages I to III in the Hoehn and Yahr Staging of PD [8]. PD is currently diagnosed based on published criteria such as the Unified Parkinson's Disease Rating Scale (UPDRS) [67]. Since the cause of the neuron cell death is still unclear, it is sometimes difficult to diagnose PD accurately, and a misdiagnosis rate of approximately 20%-25% is expected in clinical PD diagnosis [68]. In addition, the current clinical PD diagnosis process places a high demand on human resources and facilities, which could increase the financial burden not only on PD patients but also on insurance providers and even the government.
All these limitations cause PD patients to occupy more healthcare resources and to experience more possible side effects, and they decrease the efficiency of PD management. Some data mining based methodologies have also been applied in PD diagnosis. Even though they have proved effective in PD recognition, their fundamental limitation is that they rely on predetermined assumptions about the biomarkers, which may reduce the final PD modeling performance. For example, hands and feet are usually taken as the biomarkers to track and capture PD [10]. However, these may not be the best features for predicting PD in terms of accuracy and robustness of prediction.

Due to the disadvantages of current PD diagnosis, there is a demand for an integrated PD detection system that is capable of identifying the emergence of PD motor symptoms in a cost-efficient, objective, and non-invasive way. The proposed data mining driven methodology is intended to satisfy this demand.

4.1.1 PD Data Acquisition and Preprocessing

For these experiments, the Microsoft Kinect was configured at an elevation of 3 feet and 10 inches above the floor. Each subject’s body presence was verified, and the camera angle was adjusted by having the subject stand relaxed while facing the Kinect at a distance of 10 feet. Then each volunteer was invited to the walking experiment, where the human gait was updated about every 30 ms (i.e., one frame of human gait was collected about every 30 ms). In this forward walking (FW) experiment, the subject was asked to first take 2-3 steps backward (4 feet) from the point of camera calibration, while still remaining within the distance limit of the device. Subjects were then instructed to walk comfortably toward the Kinect and were not given any specific instructions regarding the side of initiation. Finally, each individual human gait data set was labeled with a class variable (i.e., the subject is PD or control), since the PD status was known before the experiment. The overhead view of the experiment is shown in Fig. 4-1.
In this forward walking experiment, the subject pool consists of a total of seven PD patients without medication and seven controls without PD symptoms. Based on the data sampling rate of 33 Hz, more than one thousand frames were collected for each subject during the walking experiment.

Figure 4-1. PD forward walking experiment overhead view.

In the next step, data preprocessing is conducted to reduce noise in the original data set. For example, in the FW experiment, an arm swing may not be captured when the arm swings behind the body, in which case zeros are generated in multiple successive frames. This step is summarized as follows:

1. The velocity and acceleration of each node are generated in the X, Y, and Z directions, similar to the position data;
2. The ratios of position, velocity, and acceleration are generated between every two nodes in the X, Y, and Z coordinates to reduce human motion variation;
3. PD status is the response variable; PD is coded as TRUE and control is coded as FALSE;
4. Two data sets are finally generated. The first one is the PD-OFF data set, which contains the data from seven PD patients without medication. The second one is the Control data set, which contains the data from seven controls. There are 1891 features included in each of the two data sets.

4.1.2 PD-based Data Mining Knowledge Discovery and Evaluation

In the first step, feature selection based on the CFS algorithm described in Section 3.1 is conducted to generate the optimal feature subset for PD detection in the forward walking experiment. A total of 32 features are selected from the 1890 original features (the last of the 1891 features is the output variable). Among these 32 features, 18 are related to position, 9 are related to velocity, 1 is related to acceleration, and the rest are ratios. In the second step, multiple machine learning algorithms are employed to discover novel knowledge from the data acquired.
As discussed in Section 3.1, different models may have different advantages and lead to different classification accuracies. By evaluating their performance with the 10-fold cross validation technique, the most accurate and reliable classification model can be identified. According to Table 4-1, the IBK classifier is the best classifier since its accuracy is almost 99%, which means the model can identify about 99% of the human gait frames correctly among the seven PD patients and seven controls. At the same time, the accuracies of J48 (a classifier based on C4.5) and random forest both exceed 90%. The worst model is logistic regression, whose accuracy is only about 64.3%. The confusion matrices for the forward walking experiment are given in Table 4-1. This table also confirms that logistic regression has lower accuracy than the SVM, J48, random forest, and IBK models. In addition, the values of the other performance measures are given in Table 4-2. For example, the IBK classifier can recognize 98.6% of the PD frames correctly. Based on these two tables, IBK and random forest are the best classifiers for the forward walking experiment.

Table 4-1. Algorithms performances in walking experiment.

Algorithm                     Confusion matrix                        Accuracy
                                        PD      Control   Sum
IBK                           PD        1498    25        1523        98.8%
                              Control   16      1349      1365
                              Sum       1514    1374      2888
Binary Logistic Regression    PD        1079    444       1523        64.3%
                              Control   556     779       1365
                              Sum       1665    1223      2888
J48                           PD        1414    109       1523        92.1%
                              Control   120     1245      1365
                              Sum       1534    1354      2888
SVM                           PD        1055    468       1523        64.5%
                              Control   556     809       1365
                              Sum       1611    1277      2888
Random Forest                 PD        1472    51        1523        95.4%
                              Control   82      1283      1365
                              Sum       1554    1334      2888

Table 4-2. Other evaluations among multiple algorithms in walking experiment.
Algorithm                     TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
IBK                           0.986     0.014     0.986       0.986    0.986       0.986
Binary Logistic Regression    0.643     0.364     0.643       0.643    0.642       0.705
J48                           0.921     0.08      0.921       0.921    0.921       0.929
SVM                           0.645     0.36      0.645       0.655    0.645       0.643
Random Forest                 0.954     0.048     0.954       0.954    0.954       0.991

To summarize, the performance of these machine learning classifiers differs in the FW experiment. Among the classifiers above, IBK, random forest, and J48 perform better than the other two classifiers on multiple evaluation measures and could be applied in future PD detection applications. For example, by looking at the features extracted from these three models, researchers are able to identify the common features relevant to PD detection. In addition, the average value of these three algorithms may be considered a single quantitative PD detection result for the purpose of PD comparison. However, this case study is a pilot study and has several limitations. The main one is that only seven PD patients and seven controls were involved in the case study to validate human gait based modeling. One possible direction for future work is to involve more subjects of different ages while keeping the same proportion of males and females. Furthermore, future work will attempt to quantify the multi-stage rating scales applied in current long-term PD assessment (e.g., UPDRS), which may help improve long-term PD management.

4.2 Geospatial Trajectory Clustering

In this section, the geospatial trajectory based case study is discussed. The objective is to extract individual movement patterns and compare these patterns in order to generate clusters of common movement patterns that could help explain the motivations behind these activities. In order to achieve this goal, the related data acquisition is discussed in Section 4.2.1, followed by the modeling and visualization in Section 4.2.2.
4.2.1 Geospatial Trajectory Data Acquisition and Preprocessing

The data were collected in the Learning Factory at the Pennsylvania State University, covering the 3,500 sq. ft. of lab, work, and shop space in the facility (see Fig. 4-2) [74-75]. The facility is designed for students in the College of Engineering to carry out design and other related work. The BuzNet Real-Time Locating System (RTLS) was used to track the trajectories of teaching assistants (TAs) at the Learning Factory [71].

Figure 4-2. The learning factory layout.

In the experiment, twelve battery-powered BuzNet tags were provided to TAs while they were on duty. Each TA was assumed to wear a tag while guiding students’ experiments until the work was done and the tag was returned to the container. By collecting and analyzing the TAs’ trajectories over a semester, we are able to understand their trajectory patterns and dynamics. During each experiment, the two-dimensional (X-Y) position data were updated about every two minutes, together with the corresponding time stamp, tag ID, and sequence number. These data were also stored in a database automatically.

4.2.2 Geospatial Trajectory Based Knowledge Discovery and Explanation

The results of the partitioning algorithm show that it can approximate an original individual trajectory with a minimum number of characteristic points. For example, the original trajectory of User 1 contains 13 position nodes (shown in Table 4-3); however, only Points 1, 4, 12 and 13 are selected as characteristic points to approximate it (shown in Fig. 4-3). Similar results are obtained for User 2. For a clearer understanding, the trajectory of User 1 is visualized in Fig. 4-4. The original trajectory of User 1 is represented as black nodes connected by a green line.
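To make the approximation concrete, the following minimal sketch rebuilds the approximated trajectory of User 1 from the four characteristic points named above, using the coordinates listed in Table 4-3. It does not implement the partitioning algorithm itself: the point indices are taken as given, and the helper names (`segments_from_points`, `path_length`) are illustrative assumptions rather than part of the thesis software.

```python
import math

# (x, y) positions of User 1, sequence numbers 1..13, copied from Table 4-3.
USER1 = [
    (18.6, 11.8), (21.4, 14.9), (21.5, 15.1), (20.8, 15.2), (20.5, 15.6),
    (20.7, 15.1), (21.1, 15.0), (21.5, 15.2), (21.2, 15.1), (20.9, 15.4),
    (20.9, 15.4), (21.1, 15.2), (18.6, 11.8),
]

# Characteristic points reported in the text (1-based sequence numbers).
CHARACTERISTIC = [1, 4, 12, 13]


def segments_from_points(points, indices):
    """Connect consecutive characteristic points into line segments."""
    chosen = [points[i - 1] for i in indices]
    return list(zip(chosen[:-1], chosen[1:]))


def path_length(points):
    """Total Euclidean length of the polyline through the given points."""
    return sum(math.dist(a, b) for a, b in zip(points[:-1], points[1:]))


segments = segments_from_points(USER1, CHARACTERISTIC)
segment_lengths = [math.dist(a, b) for a, b in segments]
original_length = path_length(USER1)
approx_length = sum(segment_lengths)

for (a, b), length in zip(segments, segment_lengths):
    print(f"{a} -> {b}: length {length:.2f}")
print(f"{len(USER1) - 1} original steps reduced to {len(segments)} segments")
print(f"path length: original {original_length:.2f}, "
      f"approximated {approx_length:.2f}")
```

Under these assumptions, the twelve steps of the original trajectory collapse into three line segments, which are the kind of input that the clustering step then groups across users.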
The approximation produced by the proposed partitioning algorithm is drawn as red dots connected by a black line.

Table 4-3. Original trajectory of User 1.

user number   sequence number   x location   y location   date        time
1             1                 18.6         11.8         1/20/2012   18:15:16
1             2                 21.4         14.9         1/20/2012   18:17:17
1             3                 21.5         15.1         1/20/2012   18:19:16
1             4                 20.8         15.2         1/20/2012   18:21:17
1             5                 20.5         15.6         1/20/2012   18:23:18
1             6                 20.7         15.1         1/20/2012   18:25:18
1             7                 21.1         15           1/20/2012   18:27:18
1             8                 21.5         15.2         1/20/2012   18:29:18
1             9                 21.2         15.1         1/20/2012   18:31:18
1             10                20.9         15.4         1/20/2012   18:33:17
1             11                20.9         15.4         1/20/2012   18:35:18
1             12                21.1         15.2         1/20/2012   18:37:18
1             13                18.6         11.8         1/20/2012   18:39:18
2             1                 18.6         11.8         1/20/2012   18:43:18
2             2                 20.9         15.6         1/20/2012   18:45:18
2             3                 21.7         15.3         1/20/2012   18:47:18
2             4                 20.9         15.8         1/20/2012   18:49:16

Figure 4-3. Extracted characteristic points of User 1.

Figure 4-4. Visualization of trajectory partitioning for User 1.

The partitioning algorithm generated a total of 1287 line segments. With ε = 1 and MinLn = 8, the clustering algorithm was applied to these line segments to generate clusters. In the end, each effective line segment in the queue was assigned to a cluster, together with the original trajectory to which it belongs. Note that some line segments cannot be assigned to any cluster because they do not satisfy the parameters ε and MinLn; we labeled these line segments as noise. For example, in Table 4-4, Line 1 and Line 2 are grouped into Cluster 1 while Line 3 is grouped into Cluster 2, even though all three lines come from the same trajectory. Because an original trajectory can contain multiple line segments, a single trajectory may be assigned to several clusters, which provides more detail about trajectory patterns.

Table 4-4. Example result based on clustering algorithm.

Line Segment No.   Cluster No.   Trajectory No.
Line 1             C1            Tra 1
Line 2             C1            Tra 1
Line 3             C2            Tra 1
Line 4             C2            Tra 2
Line 5             C2            Tra 2
Line 6             C1            Tra 3
Line 7             C1            Tra 3
Line 8             C1            Tra 3
Line 9             C1            Tra 3

Table 4-5. Result of clustering algorithm.

Cluster No.   Total number of line segments   Cluster cardinality
C1            58                              20
C2            41                              18
C3            8                               3
C4            42                              15
C5            15                              8
C6            59                              14
C7            48                              14
C8            224                             46
C9            322                             44

The final clustering result is presented in Table 4-5. Nine clusters were generated from 817 of the 1287 line segments produced in the first step. That is, about 63.5% of the individual movement patterns are shared among multiple people and are represented by the nine clusters. Moreover, Cluster 8 and Cluster 9, shown in blue and green in Fig. 4-5, are the most common movement patterns, since together they include 546 line segments from 90 individual trajectories. Cluster 8 helps explain movements from about 18.7% of the people in the case study, and most of these movements occur in the middle two spaces (work space and shop space). There are also “back and forth” movement patterns, since most of the line segments are parallel. Cluster 9 explains the movements shared by 17.8% of the sample in the case study. Compared with Cluster 8, it represents more trajectory patterns and covers more spaces, such as the PC room, the presentation room, and the toilet. As with Cluster 8, “back and forth” patterns are still present. Note that some lines fall outside the Learning Factory because people leave the building before they return the tags. Compared with Fig. 4-2, the clustering results shown in Fig. 4-5 provide a clearer picture of the human trajectory movement patterns as well as the indoor space utilization patterns.

Figure 4-5. Clustering visualization.

Figure 4-6. Clustering visualizations in the first period.

Figure 4-7. Clustering visualizations in the second period.

Figure 4-8. Clustering visualizations in the third period.
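The coverage figures quoted for the clustering result follow directly from Table 4-5 and can be re-derived with a few lines of arithmetic. The sketch below is only a cross-check: the dictionary layout and variable names are assumptions made for illustration, and its only inputs are the values printed in Table 4-5 and the total of 1287 line segments.

```python
# Values copied from Table 4-5: cluster -> (line segments, cluster cardinality).
CLUSTERS = {
    "C1": (58, 20), "C2": (41, 18), "C3": (8, 3),
    "C4": (42, 15), "C5": (15, 8), "C6": (59, 14),
    "C7": (48, 14), "C8": (224, 46), "C9": (322, 44),
}
TOTAL_SEGMENTS = 1287  # line segments produced by the partitioning step

# Segments that ended up in one of the nine clusters, and the remainder.
clustered = sum(n for n, _ in CLUSTERS.values())
unclustered = TOTAL_SEGMENTS - clustered
coverage = clustered / TOTAL_SEGMENTS

# The two largest clusters by number of line segments.
top_two = sorted(CLUSTERS, key=lambda c: CLUSTERS[c][0], reverse=True)[:2]
top_segments = sum(CLUSTERS[c][0] for c in top_two)
top_cardinality = sum(CLUSTERS[c][1] for c in top_two)

print(f"clustered: {clustered} ({coverage:.1%}), unclustered: {unclustered}")
print(f"{top_two}: {top_segments} segments, combined cardinality {top_cardinality}")
```

The check reproduces the 817 clustered line segments (about 63.5% of 1287) and the 546 line segments with a combined cardinality of 90 for Clusters 8 and 9.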
In order to detect possible movement pattern evolution, the original trajectory data set was separated into three periods: January 20th, 2012 to February 21st, 2012 for the first period; February 22nd, 2012 to March 22nd, 2012 for the second; and March 23rd, 2012 to April 23rd, 2012 for the last. The visualizations are shown in Fig. 4-6, Fig. 4-7 and Fig. 4-8. Several points need to be addressed. First, the utilized space increases over time from the first picture (Fig. 4-6) to the last (Fig. 4-8). Second, the similarities among clusters increase over time. One possible explanation is that, at the beginning of the semester, students have no specific assignments or tasks and simply wander around to get to know each section of the Learning Factory. As the semester goes on, however, students may need to design a prototype and then go to the machining room for milling. Toward the end of the semester, PC room usage decreases while presentation room usage increases, since students may have already completed their projects and are giving final presentations. To summarize, the methodology proposed in this thesis provides more information about geospatial trajectory patterns than simply mapping location points. In addition, it offers one approach to recognizing the utilization relationships among multiple spaces and thereby capturing indoor space utilization patterns, which can be taken as evidence for optimizing indoor space utilization.

Chapter 5 Conclusions and Future Work

This thesis proposes a human movement tracking methodology for both human gait and geospatial trajectory that preserves privacy, meaning that a person cannot be identified from the movement data collected. The methodology is partitioned into two components. The first component is human gait modeling, where the objective is to model and predict neurological patterns that influence human gait.
In addition, we are able to address the human gait variation problem by introducing ratios of position, velocity and acceleration. The second component is human geospatial trajectory modeling, which aims to predict common regions of interest (CRI) in indoor design spaces in order to capture and optimize indoor space design. The experimental results show that the proposed human gait modeling is able to detect significant gait differences between PD patients and controls, and that the proposed human geospatial trajectory modeling is able to detect common regions of interest from multiple people in the Learning Factory, which can serve as a tool for future indoor space design. These findings demonstrate the feasibility of employing multimodal sensors and supervised machine learning algorithms to model and predict human movement kinematics.

This work can be expanded and improved upon in several ways. In terms of human gait modeling, one possible direction is to identify the relevant features shared among multiple machine learning algorithms in order to find the features most relevant to the human gait class variable. For example, by examining the features selected by the different machine learning algorithms, researchers can recognize the features most predictive of PD. In terms of human geospatial trajectory modeling, one possible extension is to add indoor space layout information in order to optimize indoor space utilization efficiency. For example, by adding the facility layout information of the Learning Factory, designers would be able to better design the space and improve its utilization efficiency.

References

[1] B. James, Body language: 7 easy lessons to master the silent language. Saddle River, New Jersey, 07458: FT Press, 2009.
[2] J. K. Aggarwal and Q. Cai, “Human motion analysis: a review,” in Proceedings of the 1994 IEEE Workshop on. IEEE, 1994, pp. 90–102.
[3] L. Wang, W. Hu, and T. Tan, “Recent developments in human motion analysis,” Pattern Recognition, vol. 36, no. 3, pp. 585–601.
[4] H. Fujiyoshi, “Real-time Human Motion Analysis by Image Skeletonization,” in Fourth IEEE Workshop on. IEEE, 1998, pp. 15–21.
[5] “Human Gait,” http://en.wikipedia.org/wiki/Gait_(human).
[6] A. Hakeem, R. Vezzani, M. Shah, R. Cucchiara, and R. Emilia, Estimating Geospatial Trajectory of a Moving Camera. Hong Kong: ICPR 2006, 2006, pp. 82–87.
[7] D. Gil and D. J. Manuel, “Diagnosing Parkinson by using artificial neural networks and support vector machines,” Global Journal of Computer Science and Technology, vol. 9, no. 4, 2009.
[8] S. J. G. Lewis, T. Foltynie, A. D. Blackwell, T. W. Robbins, A. M. Owen, and R. A. Barker, “Heterogeneity of Parkinson’s disease in the early clinical stages using a data driven approach,” Journal of Neurology, Neurosurgery, and Psychiatry, vol. 76, no. 3, pp. 343–8, Mar. 2005.
[9] D. B. Calne, B. J. Snow, and C. Lee, “Criteria for Diagnosing Parkinson’s disease,” Annals of Neurology, vol. 32, no. Supplement S1, pp. 125–127, 1992.
[10] J. Barth, M. Sunkel, K. Bergner, G. Schickhuber, J. Winkler, J. Klucken, and B. Eskofier, “Combined analysis of sensor data from hand and gait motor function improves automatic recognition of Parkinson’s disease,” in Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE, 2012, pp. 5122–5125.
[11] “Swimming video analysis.” http://www.swimsmooth.com/certifiedcoaches.html.
[12] D. J. Smith, S. R. Norris, and J. M. H., “Performance evaluation of swimmers,” Sports Medicine, vol. 32, no. 9, pp. 539–554, 2002.
[13] G. L. Foresti, “A Real-Time System for Video Surveillance of Unattended Outdoor Environments,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 6, pp. 697–704, 1998.
[14] M. Xu, L. Duan, C. Xu, and Q. Tian, “A fusion scheme of visual and auditory modalities for event detection in sports video,” in Acoustics, Speech, and Signal Processing, 2003, vol. 3, pp. 111–189.
[15] A. F. Smeaton, P. Over, and W. Kraaij, “Multimedia Content Analysis,” in Signals and Communication Technology, 2009, pp. 151–174.
[16] E. Prassler, J. Scholz, and A. E., Tracking People in a Railway Station during Rush-Hour. 1999, pp. 162–179.
[17] C. S. Regazzoni and A. T., “Distributed data fusion for real-time crowding estimation,” Signal Processing, vol. 53, pp. 47–63, 1996.
[18] A. Fod, A. Howard, and A. Overview, “A Laser-Based People Tracker,” in Robotics and Automation, 2002. Proceedings, 2002, no. May, pp. 3024–3029.
[19] M. S. L. Scanners, H. Zhao, and R. Shibasaki, “A Novel System for Tracking Pedestrians Using Multiple Single-Row Laser-Range Scanners,” Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions, vol. 35, no. 2, pp. 283–291, 2005.
[20] L. W. Campbell and A. F. Bobick, “Recognition of human body motion using phase space constraints,” in Proceedings of IEEE International Conference on Computer Vision, 1995, pp. 624–630.
[21] N. H. Goddard, “Incremental Model-Based Discrimination of Articulated Movement from Motion Features,” in Proceedings of the 1994 IEEE Workshop on. IEEE, 1994, pp. 89–94.
[22] I. A. Kakadiaris, D. Metaxas, R. Bajcsy, and I. Science, “Active Part-Decomposition, Shape and Motion Estimation of Articulated Objects: A Physics-based Approach,” Computer Vision and Pattern Recognition, 1994. Proceedings CVPR’94, pp. 980–984, 1994.
[23] R. Rashid, “Towards a system for the interpretation of moving light displays,” in Pattern Analysis and Machine Intelligence, IEEE …, 1980, no. 6, pp. 574–581.
[24] G. Johansson, “Visual perception of biological motion and a model for its analysis,” Perception & Psychophysics, vol. 14, no. 2, pp. 201–211, 1973.
[25] M. K. Leung, Y. Yang, and M. Senior, “First Sight: A Human Body Outline Labeling System,” Pattern Analysis and Machine Intelligence, IEEE Transactions, vol. 17, no. 4, 1995.
[26] G. Johansson, “Visual motion perception,” Scientific American, vol. 232, no. 6, pp. 76–88, 1975.
[27] J. A. J. k. A. Webb, Visually Interpreting The Motion of Objects in Space. Computer Science Department, University of Texas at Austin, 1981.
[28] I.-C. C. H. Chang, “Ribbon-Based Motion Analysis of Human Body Movements,” in Pattern Recognition, Proceedings of the 13th International Conference, 1996, pp. 436–440.
[29] C. S. Works, “Image sequence analysis of real world human motion,” Pattern Recognition, vol. 17, no. 1, 1984.
[30] D. Gavrila, “The Visual Analysis of Human Movement: A Survey,” Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82–98, Jan. 1999.
[31] D. M. Gavrila and L. S. Davis, “3-D model-based tracking of humans in action,” in Computer Vision and Pattern Recognition, 1996, pp. 73–80.
[32] L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona, “Monocular tracking of the human arm in 3D,” in Computer Vision, 1995, pp. 764–770.
[33] I. A. Kakadiaris and D. Metaxas, “3D Human Body Model Acquisition from Multiple Views,” in Computer Vision, 1995. Proceedings., Fifth International Conference on. IEEE, 1995, pp. 618–623.
[34] R. Szeliski, O. K. Square, and S. B. Kang, “Recovering 3D Shape and Motion from Image Streams using Non-Linear Least Squares,” in Computer Vision and Pattern Recognition, 1993. Proceedings CVPR ’93., 1993 IEEE Computer Society Conference, pp. 752–753.
[35] T. Nomura, T. Tasaki, and T. Kanda, “Questionnaire-Based Research on Opinions of Visitors for Communication Robots at an Exhibition in Japan,” in Human-Computer Interaction-INTERACT, 2005, pp. 685–698.
[36] T. Shibata, K. Wada, and K. Tanie, “Tabulation and analysis of questionnaire results of subjective evaluation of seal robot at Science Museum in London,” Proceedings. 11th IEEE International Workshop on Robot and Human Interactive Communication, pp. 23–28, 2002.
[37] F. Girardin, F. D. Fiore, C. Ratti, and J. Blat, “Leveraging explicitly disclosed location information to understand tourist dynamics: a case study,” Journal of Location Based Services, vol. 2, no. 1, pp. 41–56, Mar. 2008.
[38] Y. Yoshimura, F. Girardin, J. P. Carrascal, and C. R., “New Tools for Studying Visitor Behaviors in Museum: A Case Study at the Louvre,” in and Communication Technologies in Tourism 2012. Proceedings of the International conference in Helsingborg (ENTER 2012), pp. 15–27.
[39] H. Cao, N. Mamoulis, D. W. Cheung, P. Road, and H. Kong, “Mining Frequent Spatiotemporal Sequential Patterns,” in Data Mining, Fifth IEEE International Conference, pp. 27–30.
[40] G. Andrienko and S. Augustin, “Visual Analytics Tools for Analysis of Movement Data,” ACM SIGKDD Explorations Newsletter, vol. 9, no. 2, pp. 38–46, 2007.
[41] C. Charayaphan, “Communications Image processing system for interpreting in American Sign Language motion,” Journal of Biomedical Engineering, vol. 14, no. 5, pp. 419–425, 1992.
[42] S. Tamura and S. Kawasaki, “Recognition of sign language motion images,” Pattern Recognition, vol. 21, no. 4, pp. 343–353, Jan. 1988.
[43] F. Polana, R. Nelson, and N. York, “Low Level Recognition of Human Motion,” in Motion of Non-Rigid and Articulated Objects, 1994., Proceedings of the 1994 IEEE Workshop on. IEEE, 1994, pp. 77–82.
[44] U. Kreljel, W. Ritter, R. Dbag, and U. Daimlerbenz, “Tracking Non-Rigid, Moving Objects Based on Color Cluster Flow,” in IEEE Computer Society Conference, 1997, pp. 257–260.
[45] N. Johnson and D. Hogg, “Learning the distribution of object trajectories for event recognition,” Image and Vision Computing, vol. 14, no. 8, pp. 609–615, Aug. 1996.
[46] Z. Fu, W. Hu, and T. T., “Similarity based vehicle trajectory clustering and anomaly detection,” in Image Processing, 2005. ICIP 2005. IEEE International Conference on (Volume: 2), pp. 11–14.
[47] I. K. Fodor, “A Survey of Dimension Reduction Techniques,” 2002.
[48] M. A. Hall, “Correlation-based feature selection for machine learning,” Doctoral dissertation, The University of Waikato, 1999.
[49] B. Fritzke, “Growing Cell Structures: A Self-Organizing Network for Unsupervised and Supervised Learning,” Neural Networks, vol. 7, no. 9, pp. 1441–1460, 1994.
[50] R. G. Ramani and G. Sivagami, “Parkinson disease classification using data mining algorithms,” International Journal of Computer Applications, vol. 32, no. 9, pp. 17–22.
[51] N. Landwehr, M. Hall, and E. Frank, “Logistic Model Trees,” Machine Learning, vol. 59, no. 1–2, pp. 161–205, May 2005.
[52] A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and L. O. Ramig, “Novel speech signal processing algorithms for high-accuracy classification of Parkinson’s disease,” Biomedical Engineering, IEEE Transactions on, vol. 59, no. 5, pp. 1264–1271, 2012.
[53] A. Ozcift, “SVM feature selection based rotation forest ensemble classifiers to improve computer-aided diagnosis of Parkinson disease,” Journal of Medical Systems, vol. 36, no. 4, pp. 2141–2147, 2012.
[54] S. Wu, “A Data Mining Analysis of The Parkinson’s Disease,” iBusiness, vol. 03, no. 01, pp. 71–75, 2011.
[55] E. Gabrilovich and S. M., “Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5,” in Proceedings of the twenty-first international conference on Machine learning, pp. 41–48.
[56] S. Abe, Support vector machines for pattern classification. Springer London Dordrecht Heidelberg New York, 2010.
[57] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[58] M. F. Amasyalı and B. Diri, Automatic Turkish Text Categorization in terms of Author, genre and gender. Springer Berlin Heidelberg, 2006, pp. 221–226.
[59] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, Jan. 1991.
[60] T. D’heygere, P. L. M. M. Goethals, and N. De Pauw, “Use of genetic algorithms to select input variables in decision tree models for the prediction of benthic macroinvertebrates,” Ecological Modelling, vol. 160, no. 3, pp. 291–300, Feb. 2003.
[61] A. H. Fielding and J. F. Bell, “A review of methods for the assessment of prediction errors in conservation presence/absence models,” Environmental Conservation, vol. 24, no. 1, pp. 38–49, 1997.
[62] E. Dakou, T. D’heygere, A. P. Dedecker, P. L. M. Goethals, M. Lazaridou-Dimitriadou, and N. De Pauw, “Decision Tree Models for Prediction of Macroinvertebrate Taxa in the River Axios (Northern Greece),” Aquatic Ecology, vol. 41, no. 3, pp. 399–411, Jul. 2006.
[63] J. Lee and J. Han, “Trajectory Clustering: A Partition-and-Group Framework,” in Proceedings of the 2007 ACM SIGMOD international conference on Management of data, 2007, pp. 593–604.
[64] S. Patel, K. Lorincz, R. Hughes, N. Huggins, J. Growdon, D. Standaert, M. Akay, J. Dy, M. Welsh, and P. Bonato, “Monitoring motor fluctuations in patients with Parkinson’s disease using wearable sensors,” Information Technology in Biomedicine, IEEE Transactions on, vol. 13, no. 6, pp. 864–873, 2009.
[65] X. Huang, H. Chen, W. C. Miller, R. B. Mailman, J. L. Woodard, P. C. Chen, D. Xiang, R. W. Murrow, Y.-Z. Wang, and C. Poole, “Lower low-density lipoprotein cholesterol levels are associated with Parkinson’s disease,” Movement Disorders, vol. 22, no. 3, pp. 377–381, 2007.
[66] “Parkinson’s disease introduction.” [Online]. Available: http://en.wikipedia.org/wiki/Parkinson’s_disease.
[67] P. Martinez-Martin, A. Gil-Nagel, L. M. Gracia, J. B. Gomez, J. Martínez-Sarriés, and F. Bermejo, “Unified Parkinson’s disease rating scale characteristics and structure,” Movement Disorders, vol. 9, no. 1, pp. 76–83, 1994.
[68] A. J. Hughes, S. E. Daniel, L. Kilford, and A. J. Lees, “Accuracy of clinical diagnosis of idiopathic Parkinson’s disease: a clinico-pathological study of 100 cases,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 55, no. 3, pp. 181–184, Mar. 1992.
[69] T. W. Simpson and E. Kisenwether, “Driving entrepreneurial innovation through the learning factory: The power of interdisciplinary capstone design projects,” in ASME Design Engineering Technical Conferences-Design Education Conference, 2013.
[70] J. S. Lamancusa and T. W. Simpson, “The Learning Factory–10 Years of Impact at Penn State,” in International Conference on Engineering Education, pp. 16–21.
[71] “Simple & Reliable Indoor Positioning Overview.” [Online]. Available: http://www.buzbynetworks.com/buznet/buznet-overview.
