Resource optimization in embedded systems
based on data mining
Author: AKOO HEMATBOLLAND
Supervisor (KTH): Professor Timo Koski
Supervisor (Scania CV AB): Håkan Gustavsson
Master thesis
KTH Royal Institute of Technology
M.Sc. in Engineering Physics
SCI School of Engineering Sciences
Stockholm, Sweden 2008
Resource optimization of embedded systems based on Data Mining
AKOO HEMATBOLLAND
Degree project in mathematical statistics, 30 higher education credits,
within the Engineering Physics programme,
KTH Royal Institute of Technology, 2008
Examiner: Professor Timo Koski
Supervisor at Scania: Håkan Gustavsson
KTH Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm
URL: www.csi.kth.se
Sammanfattning (Swedish summary)
This degree project deals with resource optimization in embedded systems for Scania's trucks. The task was to analyze historical sales data in order to better understand the choices the customer makes. A large part of the work consisted of studying suitable methods for the analysis and of evaluating tools. Since large amounts of data are involved, Data Mining methods have been applied to extract relevant information about the customer's choices. The methods and tools were then tested by analyzing five different functions (functions A-E). For confidentiality reasons, the identities of the functions are not available in this public document. With the help of Data Mining, here chiefly Two Step Clustering with BIC (a Bayesian information criterion), increased knowledge about the customer's choices can reduce the company's costs and improve its relationship with the customer.
One surprising result was that a large share of all trucks sold with function A and a bodywork node went to Thailand. Another was that a large share of construction trucks with function B were sold to Dubai with a pneumatic brake system.
Future work of particular interest would be a more comprehensive case study that takes a large number of functions into account.
Abstract
This master thesis discusses resource optimization in embedded systems for Scania's trucks. It is about analyzing historical sales data to learn more about the customer's choices. A large part of the work focused on studying methods and tools appropriate for the analysis. Since we are dealing with a large amount of data, data mining techniques have been used to find relevant information about customer choice. Methods and tools (the main tool being Two Step Clustering with BIC, the Bayesian Information Criterion, and a log-likelihood distance measure) have been tested on five different functions (functions A-E). The identities of these functions have been suppressed in this public version of the final report. With Data Mining, a company's knowledge about customer choices can reduce costs and improve the value of customer relationships.
One example result was that a large proportion of all trucks with function A and BWS (Body Work System) were sold to Thailand. Another was that a large proportion of construction trucks with function B were sold to Dubai with a pneumatic brake system.
Future work of particular interest would be a more extensive case study that considers a larger number of functions.
Acknowledgements
This master thesis constitutes the final part of my education at the Royal Institute of Technology (KTH): an M.Sc. in Engineering Physics, with a specialization in financial mathematics and statistics. The work of this thesis has been carried out at the Scania department of pre-development (REP), part of the Systems Development division.
I would like to thank Håkan Gustavsson, my supervisor at Scania and Professor Timo Koski,
my supervisor at KTH. Thank you!
In particular I would like to thank Saddaf Shabbir at Vectuz Webwork AB for her in-depth
knowledge of programming and Ann Lindqvist at the Scania department of Diagnostic
Communications (RESD) for her knowledge of statistics. I would also like to thank my good
friends Anders Ingårda and Assad Alam. Thank you!
Finally I would like to thank my mother, my father and my sister – always supporting me. I
love you!
Södertälje, Sweden 2008
Akoo Hematbolland
Table of contents
1. Introduction ......................................................... 9
   1.1 Background ....................................................... 9
   1.2 The evolution of the automotive industry ........................ 11
   1.3 Problem statement ............................................... 12
   1.4 Research issues ................................................. 14
   1.5 Scania – a case study ........................................... 15
   1.6 ECU systems ..................................................... 15
   1.7 Problem statement revisited – Data Mining ....................... 18
       1.7.1 Data Mining process ....................................... 18
       1.7.2 What is meant by function? ................................ 18
   1.8 Large data set .................................................. 19
2. Data mining ......................................................... 20
   2.1 Introduction .................................................... 20
       2.1.1 Data Mining Tasks ......................................... 22
   2.2 Data preparation ................................................ 23
       2.2.1 Attributes and Measurement ................................ 23
       2.2.2 The Different Types of Attributes ......................... 24
   2.3 Cluster analysis ................................................ 26
       2.3.1 Hierarchical Clustering ................................... 27
       2.3.2 K-means Clustering ........................................ 27
       2.3.3 Gaussian Mixture Model .................................... 28
       2.3.4 Distance measure .......................................... 28
3. Binary Clustering ................................................... 29
   3.1 Mathematical criteria – A general clustering model for binary data 29
       3.1.1 K-means Clustering ........................................ 31
       3.1.2 The principle of Minimum Description Length (MDL) ......... 32
       3.1.3 Stochastic Complexity (SC) ................................ 32
   3.2 Two Step Cluster in SPSS ........................................ 33
       3.2.1 CF-tree .................................................... 33
       3.2.2 Cluster step .............................................. 34
       3.2.3 Log-Likelihood distance ................................... 34
       3.2.4 Auto Clustering using BIC ................................. 35
   3.3 Data Mapping RPM in VisuMap ..................................... 35
4. Analysis ............................................................ 36
   4.1 K-means in MatLab – A simple example ............................ 36
   4.2 Data Mapping in VisuMap ......................................... 38
   4.3 Two Step Clustering in SPSS ..................................... 39
       4.3.1 Clustering strategy ....................................... 39
   4.4 Results in SPSS ................................................. 39
       4.4.1 Function A ................................................ 40
       4.4.2 Function B ................................................ 42
       4.4.3 Function C ................................................ 44
       4.4.4 Function D ................................................ 46
       4.4.5 Function E ................................................ 48
   4.5 Change over time ................................................ 50
       4.5.1 Function D ................................................ 50
       4.5.2 Function E ................................................ 51
5. Discussion .......................................................... 52
   5.1 Tools ............................................................ 52
   5.2 Binarization ..................................................... 53
6. Related work – Stock market ......................................... 54
7. Future work – Function to function .................................. 55
8. Conclusion .......................................................... 56
9. References .......................................................... 57
10. Appendices ......................................................... 59
    10.1 Appendix A – Importance of "knowing your data" ................ 60
    10.2 Appendix B – Dependency Structure Matrix ...................... 62
    10.3 Appendix C – K-means in MatLab ................................ 65
Reading guide
Chapter 1 introduces the thesis. It provides a motivation, as well as a description of the
research issues and the assignment.
Chapter 2 introduces Data Mining techniques in general.
Chapter 3 discusses Binary Clustering.
Chapter 4 discusses the analyses performed in this thesis.
Chapter 5 outlines a discussion about the methods and tools.
Chapter 6 describes related work.
Chapter 7 presents future work.
Chapter 8 provides the conclusions of this thesis.
For the reader who can spend only a quarter of an hour on the thesis, the conclusions and the section outlining the problem statement may be of greatest interest.
The following parts of the thesis are the most relevant if you are…
Math student
The introductory sections (perhaps 1.7-1.8 and 2.3).
Binary Clustering (chapter 3).
Results, analysis and conclusion, as well as future work.
Importance of "Knowing your data" (Appendix A).
Scania employee
The introductory sections.
Data Mining introduction (Chapter 2.1).
Results, analysis and conclusion, as well as future work.
DSM-clustering (Appendix B).
1. Introduction
This chapter serves as an introduction to the thesis. It introduces the background, the evolution of the automotive industry, the problem statement and the research issues, and provides a motivation for this research.
1.1 Background
To facilitate people's lives in the modern information society, computer, automatic control and communication technology has been developed. Microcontrollers are used widely in electric appliances, automobiles, robots, scientific instruments and medical devices. The term embedded system indicates that computer and automatic control technology has permeated many kinds of products in our lives [1].
Until recently, reuse of software in the automotive industry has been almost entirely an activity of suppliers, who try to reduce the increasing software development costs that stem from the rising complexity and size of software in the modern automobile [4].
Today, not only the suppliers but also the manufacturers have to deal with the problem of reuse. The manufacturers additionally have to integrate the networked hardware components into one automotive system.
The automotive industry is facing a new challenge at the beginning of the third millennium: electronics is expected to account for 90% of the innovations, and of those, 80% will be in software [17]. The development of electronics will be affected by this major change, and there will be a need for more, and more highly interconnected, functionality. Mercer Management Consulting and Hypovereinsbank [2] have done a study that places a remarkably high value on software in the automotive industry. The study claims that by 2010, 13% of the production cost of a vehicle will be software (Figure 1.1).
Figure 1.1. – (Mercer Management Consulting and Hypovereinsbank, 2001)
To respond to this, the development process has to change, and the change must also include the methods for developing software in the automotive domain. Intensive work has been done in parts of this field, covering, among other things, requirements engineering, software quality and model-based software development [4]. The main targets are to decrease the software development time and to increase the quality of the software. Reuse of software is another target to include in this challenge.
Created mainly to meet the specific requirements and standards of the automotive industry for vehicles working in the field, modern controllers consist of numerous different Electronic Control Units (ECUs) based on embedded systems [1]. The Controller Area Network (CAN) configuration in construction machinery is made up of these different ECUs. Their main tasks include measuring, driving or operating control devices for sensor-actuator management, and carrying out a number of tasks in real time. Separating the hardware of an ECU from the embedded software is the main requirement for reusing software in the automotive domain. A few years ago, automotive manufacturers saw the ECUs of a car as single units: they defined them as black boxes when ordering them from the supplier, and tested the delivered samples as black boxes. The drawback of this approach for the manufacturers is that the software has to be developed from scratch for each new project if the supplier is changed, which causes expenses and increases development time. Responsibility for the whole electronic system is another subject that needs more consideration by the manufacturers, since a supplier's view covers only its part of the system. It is essential for the manufacturers to develop processes and methods that make software reuse at the system level possible; such methods will also enable the manufacturers to develop relevant software on their own in the future [4]. Since the truck industry follows the same development track as the car industry (with some latency), the truck manufacturers must deal with the same problem.
To many people, cars and trucks are the same product – the only difference being the size.
This is however far from the truth. There are major differences – differences that will be
explored in this section.
In an article, Zientz [5] discusses the differences between passenger cars and commercial vehicles:
“The main purpose of commercial vehicles is the transportation of goods. This means that the
manufacturers of commercial vehicles, unlike the passenger car sector, must deal with a wide
variety of trucks and special purpose vehicles. Trucks for example are produced in a wide
variety of combinations regarding the maximum load to be carried, the number of axles, the
size of the engine and the size of the truck cabin. The customer base for truck manufacturers
varies a lot, from private business owners to large haulage companies with a fleet of several
hundred trucks. These hauling companies have a strong purchasing power that may influence
cost and feature structures of the vehicle manufacturers. Most European truck manufacturers
are developing vehicles for the global market, in order to ensure necessary production
quantities. This globalisation brings additional challenges to the manufacturer with regard to
different customer demands, regional regulations and competition in regional markets. Hence,
the ability to address strong variation is a key success factor in this business.”
1.2 The evolution of the automotive industry
The growing importance of embedded systems within the automotive industry is a fact. High-end cars can contain well over 50 ECUs, whereas the truck sector is fairly diverse and the number of ECUs in trucks is more on the order of a dozen [5]. The cost of a typical ECU in a truck is approximately 1000 SEK [3]. See Figure 1.2 for the evolution of the number of ECUs in passenger cars.
Figure 1.2. Number of ECUs in recent car releases (Zelke, 2006)
A study made by McKinsey & Company in 2006 [6] expects the value of electronics in automobiles to increase from 25 percent at the time to 40 percent in 2015. According to this study, software and electronics drive about 70 to 90 percent of all innovations in cars, a figure that will increase further by 2015. Electronics is also seen as a major lever allowing manufacturers to differentiate their product offerings and expand into new markets. The McKinsey & Company statistics are for the passenger car sector, but they are also valid for the truck industry, which mostly follows the same evolutionary track, with some latency.
Consider the following example, taken from Erik Persson's thesis [3]:
A particular module supports the function cruise control as well as the function adaptive cruise control. Suppose that the majority of the customers request only cruise control. The module would then contain the code needed to implement adaptive cruise control, code that in this case is not being used. The customer does not have to pay for this latent code, so in a sense it is given away for free. (However, the function that this piece of code implements cannot be used by the customer.)
Adaptive cruise control requires a distance sensor in order to function, which the regular cruise control does not. For a vehicle configuration in which adaptive cruise control has not been chosen, the sensor is not mounted and hence incurs no cost. Yet there are other associated hardware costs: the memory has been dimensioned to harbour both functions, and when only cruise control is used this results in dead space in the memory. Only a fraction of all vehicles have both functions, making it likely that a smaller memory would be sufficient. The conclusion is that the resource utilization of this module is low, and hence its cost-efficiency as well. However, the adaptive cruise control logic is very complex and distributed, which makes it far from straightforward in practice to evaluate this function. An architecture can, as in the case of the cruise control, lead to a situation in which a customer choice incurs unnecessary cost because functionality is not used to its full extent. The architecture that uses the resources best is hence one in which the customer's choices have been taken into consideration.
The conclusion of this section is that electronics is very important from a financial and business perspective. For instance, relatively small savings at the component level may result in savings of nearly 10 million EUR [3] over the production period. Hence, manufacturers' knowledge about functionality is becoming more and more important.
1.3 Problem statement
In the previous section we saw that the cost of electronics has risen rapidly over the last years; a reduction of the product cost of the electronics system would thus have a significant impact on the total cost of the vehicle. A large part of current and future functionality is realized by the electronics system. This system consists of modular components with the same requirements as the traditional mechanical components. The electronics system in vehicles implements distributed functions that employ different hardware and software components in order to realize their functionality. The way in which components are allocated and connected is described by the architecture of the system.
Scania CV AB produces automotive products with a common product platform of modular components in order to keep the product cost low, maintain a high level of quality and offer the customer a maximized range of choice. The customer can virtually tailor the vehicle and may choose between many different functions such as cruise control, anti-spin, ESP, retarder, etc.
The system architecture is the same within a product family, but every produced vehicle can still be unique, as its configuration is chosen by the customer. Figure 1.3 shows that every vehicle has its own "DNA". Hence, there is an almost infinite set of variants, implying that it is virtually impossible to achieve a perfect architecture with respect to resource utilization.
Figure 1.3. Every truck has its own "DNA"
The "DNA" is described by special codes, which describe the physical configuration of every single vehicle. Their values can affect the parameters in one or several control units; our focus will be on the electronic control units.
The purpose of this work is to investigate the resource utilization by using historical sales data, which are described by the DNA codes. The goal is to gain knowledge about the customer's choices by looking at historical sales data for the electronic system of the vehicles.
Figure 1.4 shows how the electric system is allocated in a truck. The set of possible customer choices is enormous, and the number of optional functions can be very large. If we consider 20 functions simultaneously, each with 5 attributes, we get 5^20 = 95,367,431,640,625 different combinations.
Figure 1.4. 20 control units with 5 attributes each lead to astronomical numbers
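The arithmetic behind this number is a simple power: each of the 20 functions independently takes one of 5 attribute values. A minimal sketch, using only the counts given in the text above:

```python
# Size of the configuration space: each of 20 optional functions
# can take one of 5 attribute values, so there are 5^20 combinations.
n_functions = 20
n_attributes = 5

combinations = n_attributes ** n_functions
print(combinations)  # 95367431640625
```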
We will pick out five functions (functions A-E) which depend on DNA codes in the vehicle. These codes can denote components, control units, countries and vehicle types. From a database containing historical sales data we will then pick out the subset of vehicles in which each function is included. The problem is then to find patterns among the trucks based on control units, segment and countries.
Figure 1.5 below shows the abstract model of this work. The inputs are the given functions (A-E) and the historical data, both described by DNA codes; the output is statistics on control units, segment and country.
Figure 1.5. Abstract model of this work
An architecture may in some cases lead to a situation where a customer choice incurs unnecessary costs, as the function requires particular hardware that would otherwise not be necessary. This implies that the architecture that utilizes the resources best is one in which it has been taken into account how the product family is configured with respect to customer choice [3]. In many cases, numerous modular components employed in the electronics system have functions that are not being used, as the customer may have chosen a more low-end, less advanced configuration of the vehicle. The results of this work can be used as support when making architectural decisions.
1.4 Research issues
The purpose of this master thesis is to investigate how historical sales data can be used to find out more about customers' choice of functions. By applying Data Mining techniques we will try to find patterns in our data. If we do not find any interesting pattern for any of our five functions, we will be forced to pick other functions.
In order to fulfil the purpose of this master thesis, four research questions were initially formulated:
Research issues
1 – Can we find patterns where several vehicles use similar configurations of the electronic control units?
Patterns of interest:
2 – What kind of truck is it (segment: distribution, long-haulage or construction)?
3 – Which countries have bought these trucks?
4 – How does the pattern change over time?
1.5 Scania – a case study
Scania is one of the world’s leading manufacturers of trucks and buses for heavy transport
applications. A growing proportion of the company’s operations consist of services. Scania
operates in about 100 countries and employs almost 33 000 people. Research and development are concentrated in Södertälje, Sweden, and production units are located in Europe and Latin America. This master thesis has been carried out at REP, which is a
department of the division Systems Development. REP mainly works with pre-development
of systems and functions realized by electronics and software, and has no responsibility for
parts in production. The department develops vehicle functionality, as well as works with the
long term improvement of methods for systems development, e.g. methods for system
modelling.
Moreover, REP has the responsibility to co-ordinate the pre-development within Systems
Development and to keep contacts with universities, institutes and research programs in the
area. The supervisor from the Scania side has been Håkan Gustavsson, currently pursuing a
PhD within the project Decision methods for E/E-system Architectural Design (DAD).
1.6 ECU systems
An overview of how the electrical system has been designed in Scania’s vehicles is giving in
this section. The network that links the control units plus some of the systems included in this
network is described.
The Electric Control Unit ECU systems write and read “packets” of digital information in a
network called Control Area Network. There are approximately 30 control units (ECUs)
which are linked together in the CAN network. This means that in a vehicle with advance
specifications (high-end), most of the systems interchange information over the CAN network.
The advantage is that the driver and mechanic are able to gain more information about the
condition of the vehicle and regarding any faults [22].
This makes the troubleshooting both simpler and faster. Furthermore, it enables the mechanics
to change functions in the ECU systems. The CAN network shown in the figure 1.6 contains
18 ECU systems. However, there are only five ECUs in the simplest vehicle (low-end).
Figure 1.6. Location of the ECUs that can be part of the CAN network in an advanced vehicle
To reduce the risk of interference from less important messages (radio, ACC, ATA, etc.) with messages between the most important ECU systems (coordinator, brakes, engine and gearbox), the important systems are linked together on a special CAN bus (the red bus). The other systems are divided between two further CAN buses, called the yellow and the green bus [22].
The two figures 1.7 and 1.8 below show the network structure for two different truck configurations. The first is a high-end version, where the customer has chosen almost all of the available functionality; as a consequence, more than 20 ECUs are required to implement this functionality (each box represents an ECU).
[Figures 1.7 and 1.8 are network diagrams. Figure 1.7 shows the ECUs of the high-end configuration, connected via the red, green and yellow CAN buses, a diagnostic bus and a body builder truck connection: COO (Coordinator system), AUS (Audio System), CSS (Crash Safety System), LAS (Locking and Alarm System), ACC (Automatic Climate Control), GMS (Gearbox Management System), EMS (Engine Management System), EEC (Exhaust Emission Control), WTA (Water-To-Air auxiliary heater system), ICL (Instrument Cluster System), TCO (Tachograph System), CTS (Clock and Timer System), RTI (Road Transport Informatics system), VIS (Visibility System), APS (Air Processing System), BWS (Body Work System), BMS (Brake Management System) and SMS (Suspension Management System).]
Figure 1.7. The embedded system of a high-end version of a Scania vehicle
Figure 1.8. The embedded system of a low-end version of a Scania vehicle
1.7 Problem statement revisited – Data Mining
This section revisits the problem statement by highlighting the role of data mining.
1.7.1 Data Mining process
The data from the database will be represented by a large matrix. This work is delimited to viewing the customer's choice of functions as Boolean variables (0/1). This means that the customer's choices are represented by binary attributes; hence the matrix will be built from binary values. The structure in the data set will be investigated with Data Mining techniques. The final part is to visualize the structure in the resulting binary matrix. Figure 1.9 shows the process.
Figure 1.9. Data mining process for this work
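The binarization step described above can be sketched as one-hot encoding of categorical sales records into a 0/1 matrix. The field names and values below are invented stand-ins, since the real DNA codes are confidential:

```python
# One-hot encode categorical sales records into a binary matrix (a sketch;
# the attribute names are invented, not Scania's actual DNA codes).
records = [
    {"country": "SE", "segment": "long-haulage", "ecu": "EMS"},
    {"country": "TH", "segment": "construction", "ecu": "BWS"},
]

# Collect the attribute vocabulary: one column per (field, value) pair.
columns = sorted({(k, v) for rec in records for k, v in rec.items()})

# Build the binary matrix: row i is truck i, entry 1 if the truck has that attribute.
matrix = [[1 if rec.get(k) == v else 0 for (k, v) in columns] for rec in records]

for row in matrix:
    print(row)
```

Each row is then a binary vector of the kind clustered in the later chapters.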
1.7.2 What is meant by function?
Consider the following example to highlight the concept of a function:
Call a fictitious (not realistic) function temperature display. This function uses a sensor and two ECUs to display the temperature. All of these components are described by DNA codes. Here, the ICL and ACC are control units, while the sensor and the temperature display are components in the vehicle. A mathematician could describe this function as:
f(x1, x2, x3, x4) = f(DNA23, DNA4, DNA1, DNA3) = f(sensor, ICL, COO, display)
In this work, five functions will be picked out for analysis. Architects at Scania will help us find appropriate functions for the analysis. These functions depend on DNA codes in the vehicle, which can be both components and control units:
f1(x1, x2, ...) = A
f2(y1, y2, ...) = B
f3(z1, z2, ...) = C
f4(q1, q2, ...) = D
f5(w1, w2, ...) = E
Once the functions are defined, we will look at historical sales data (approximately a quarter of a million trucks from our database) to find the vehicles using each function.
We will now investigate whether we can find patterns based on the control units, countries and segments of these vehicles. Notice that we are only looking at control units, even though the functions are correlated with other functions. See future work for more information.
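Selecting the subset of vehicles that carry a given function then amounts to a set-containment test on its DNA codes. A sketch reusing the fictitious codes from the temperature-display example (the vehicle records themselves are invented):

```python
# A function is realized by a set of DNA codes; a vehicle "has" the function
# when its configuration contains all of them. Codes follow the fictitious
# temperature-display example; the vehicle records are made up.
function_codes = {"DNA23", "DNA4", "DNA1", "DNA3"}

vehicles = [
    {"id": 1, "codes": {"DNA1", "DNA3", "DNA4", "DNA23", "DNA99"}},
    {"id": 2, "codes": {"DNA1", "DNA3"}},
]

# Keep only the vehicles whose code set contains every code of the function.
subset = [v["id"] for v in vehicles if function_codes <= v["codes"]]
print(subset)  # [1]
```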
The following binary matrix shows n trucks with a common function. The rows are the cases (trucks in this case) and the columns describe attributes: electronic control units, countries and segment (distribution, long-haulage or construction). The values shown are illustrative:

            Attribute 1   Attribute 2   Attribute 3   ...   Attribute k
Truck 1          0             1             1        ...        0
Truck 2          0             0             0        ...        1
Truck 3          0             0             1        ...        0
...             ...           ...           ...       ...       ...
Truck n          1             0             0        ...        0
A one in the above matrix indicates that the vehicle has the given control unit, country or
segment, and a zero indicates that it does not.
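The construction of this binary matrix can be sketched in a few lines of Python. This is a hedged illustration: the truck records, attribute vocabulary and values below are invented examples, not actual sales data.

```python
# Hypothetical sketch: turn truck records (ECUs, country, segment)
# into the 0/1 case-by-attribute matrix described above.

def binarize(trucks, attributes):
    """Return one 0/1 row per truck over the given attribute list."""
    matrix = []
    for truck in trucks:
        present = set(truck["ecus"]) | {truck["country"], truck["segment"]}
        matrix.append([1 if a in present else 0 for a in attributes])
    return matrix

trucks = [
    {"ecus": ["ICL", "COO"], "country": "SE", "segment": "long-haulage"},
    {"ecus": ["ICL"],        "country": "TH", "segment": "distribution"},
]
attributes = ["ICL", "COO", "SE", "TH",
              "long-haulage", "distribution", "construction"]

for row in binarize(trucks, attributes):
    print(row)  # [1, 1, 1, 0, 1, 0, 0] and [1, 0, 0, 1, 0, 1, 0]
```

Each attribute (control unit, country or segment) becomes one binary column, exactly as in the matrix above.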
1.8 Large data set
Since the historical data set from the database is large, this problem must be tackled with data
mining techniques. When dealing with binary data (binary vectors) one has to use appropriate
methods and algorithms to classify the data. Moreover, reliable patterns and visualization of
the patterns depend on the nature of the data and the chosen distance measure. Finally, the right
tool (software) must be found. "Right tool" means software that can handle such a large data
set and data type, and that contains the algorithms needed for this problem.
The purpose of this work is to investigate the resource utilization in automotive embedded
systems. The initial phase of the work consisted of formulating a problem statement and
research issues. The next step consisted of studying methods and algorithms based on data
mining techniques. Once the data preparation was finished, the methods and algorithms were
tested. The main tool was SPSS's TSC (Two Step Clustering). SPSS is powerful software
made for statistical analysis [23]. The final part of the work was to visualize the results in
SPSS. The literature survey formalized the background of the problem. It laid the foundation
for the theoretical framework used to evaluate the resource efficiency and to compare
different methods. The literature survey included doctoral theses, Scania internal documents,
published books, articles in various journals and other publications.
(Work flow: Problem statement → Data Mining → Binary Clustering → Analysis Tools → SPSS → Visualization → Conclusions)
2. Data mining
This chapter describes data mining techniques. It also covers data preparation, some
clustering methods and distance measures.
2.1 Introduction
Today, vast amounts of data are collected and stored in computers, with the aim of
extracting useful information later. The relevant information is not known at the initial time
of collection, and therefore the database is not designed to distil any particular information [8].
The nature of the data in the database is unstructured. The science of extracting useful
information from large data sets is usually referred to as "Data Mining" or "Knowledge
Discovery from Data". Hence data mining is the process of sorting through large amounts of
data and picking out relevant information [7]. Here data can be any facts, numbers or text that
can be processed by a computer. Patterns, associations and relationships in the data can
provide information. Figure 2.1 shows the concept of data mining – finding relevant
information in a large data set.
Figure 2.1. Data mining – finding relevant information in large amounts of data
There are many different application areas for data mining, ranging from scientific
applications such as the classification of volcanoes on Venus to internet search engines. Data
mining includes techniques from computer science, statistics, data analysis and
optimization, to name a few. This makes it an interdisciplinary science [8].
Data mining is an integral part of Knowledge Discovery in Databases (KDD) [9], which is the
process of converting raw data into useful information, as shown in Figure 2.2. This process
consists of a series of transformation steps, from data pre-processing to post-processing of
data mining results.
Figure 2.2. The process of knowledge discovery in databases (KDD): Input Data → Data Pre-processing (feature selection, dimensionality reduction, normalization, data subsetting) → Data Mining → Post-processing (filtering patterns, visualization, pattern interpretation) → Information
The input data can be stored in a variety of formats (flat files, spreadsheets or relational
tables). In this work the input was a relational table, which was imported into the software SAS
(Statistical Analysis System) [35] from internal software connected to the database.
Once the appropriate functions were imported into SAS, a binarization of the data was made.
From SAS, the data was exported to different software packages for the analysis. The purpose of
pre-processing is to transform the raw input data into an appropriate format for subsequent
analysis. The steps involved in data pre-processing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and selecting records and features
that are relevant to the data mining task at hand. The pre-processing part of this work was to
select records, binarize the data and handle missing values. Because of the many ways
data can be collected and stored, data pre-processing is perhaps the most laborious and
time-consuming step in the overall knowledge discovery process [10].
An example of post-processing is visualization. Data visualization is the display of
information in a graphic or tabular format. Successful visualization requires data to be
converted into a visual format so that the properties of the data and the relationships among
data items can be analyzed [9]. The visualization part of this work was the graphic
presentation of the results. Since binary data is hard to visualize, this was done in two
different ways: one in visualization software called VisuMap, which offers methods to
visualize high-dimensional data [25], and the other by exporting the results to Excel and
making the graphic presentation there (see chapter 4).
2.1.1 Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Supervised learning: predictive tasks
The objective of these tasks is to predict the value of a particular attribute based on the values
of other attributes. The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the prediction are known as the
explanatory or independent variables.
Unsupervised learning: descriptive tasks
Here, the objective is to derive patterns (correlations, trends, clusters) that summarize the
underlying relationships in the data [9]. Descriptive data mining tasks are often exploratory in
nature and frequently require post-processing techniques to validate and explain the results.
Cluster analysis, which is unsupervised learning, will be used in this work. Cluster analysis
seeks to find groups of closely related observations, so that observations that belong to the
same cluster are more similar to each other than to observations in other clusters.
Clustering has, for example, been used to group sets of related customers and to find areas of
the ocean that have a significant impact on the Earth's climate [10]. An example of the
importance of data preparation is given in Appendix A.
2.2 Data preparation
New research in data mining is often driven by the need to accommodate new application
areas and their new types of data [10]. Data that is to be analyzed can differ in several ways.
The attributes used to describe data objects can be quantitative or qualitative, and different
data types require different tools and methods to analyze the data. Hence it is vital to
represent the data in a way that suits the methods used.
The quality of the data is often far from perfect, owing to the presence of noise, missing values
and inconsistent or duplicate data. Most data mining techniques can handle some
imperfections, but the result is often improved if the quality of the data is increased. In this
work, missing values were handled by searching through the data set and replacing them
with suitable values. Moreover, many if-statements were used to pick out the chosen
functions and merge the data in order to create a correct binary data matrix.
Once again, the pre-processing part in Figure 2.2 above is one of the most important steps in
the data mining process [9]. Pre-processing data is all about making data more suitable for
data mining and analysis.
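A minimal sketch of these cleaning steps, assuming a hypothetical encoding where None marks a missing value (the rows and the selection rule are invented for illustration):

```python
# Sketch: select the records for one function, replace missing values
# (None) with 0 ("attribute not present") and drop duplicate rows.

def preprocess(rows, has_function):
    selected = [r for r in rows if has_function(r)]
    cleaned = [[0 if v is None else v for v in r] for r in selected]
    seen, unique = set(), []
    for r in cleaned:                 # drop exact duplicates, keep order
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

rows = [[1, None, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(preprocess(rows, has_function=lambda r: r[0] == 1))
# → [[1, 0, 0], [1, 1, 0]]
```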
2.2.1 Attributes and Measurement
A data set usually contains a collection of data objects, also called records, points or
observations. Data objects have different attributes that describe the property of an object,
such as the mass or colour of the object. The definition of an attribute is a property or
characteristic of an object that may vary, either from one object to another or from one time to
another. In this work the records were trucks and the attributes were Electronic Control Units,
segment and countries.
In practice, attributes need not be numbers or symbols, but to analyze their characteristics
we can assign them such values, and for that a measurement scale is needed.
The definition of a measurement scale is a rule or function that associates a numerical or
symbolic value with an attribute of an object [9]. This is needed to handle the data
effectively and correctly. Since it is possible to assign different measurement scales to an
attribute, it is obvious that the properties of an attribute need not be the same as the properties
of the values used to measure it. In this work the measurement scale was categorical, since the
attributes were binary (see table 2.1).
The type of an attribute says what properties of the attribute are represented by the values
used to measure it. It is vital to understand and know the type of an attribute, in order to reach
correct conclusions from the resulting analysis.
2.2.2 The Different Types of Attributes
The different types of attributes are derived from the following operations that can be
performed on numbers:
1. Distinctness: = and ≠
2. Order: <, ≤, > and ≥
3. Addition: + and −
4. Multiplication: * and /
From these properties, the four types of attributes are defined: nominal, ordinal, interval and
ratio. Table 2.1 gives a summary of the different types.
Attribute Type   Description                                        Examples
------------------------------------------------------------------------------------------
Nominal          The values of a nominal attribute are just         binary values, eye
(categorical,    different names; nominal values provide only       colour, gender
qualitative)     enough information to distinguish one object
                 from another. (=, ≠)

Ordinal          The values of an ordinal attribute provide         {good, better, best},
(categorical,    enough information to order objects. (<, >)        grades, street numbers
qualitative)

Interval         For interval attributes, the differences           calendar dates,
(numeric,        between values are meaningful, i.e., a unit        temperature in Celsius
quantitative)    of measurement exists. (+, −)

Ratio            For ratio attributes, both differences and         monetary quantities,
(numeric,        ratios are meaningful. (*, /)                      counts, age, mass,
quantitative)                                                       length, electrical
                                                                    current

Table 2.1. Different attribute types
Nominal and ordinal attributes are so-called categorical or qualitative attributes, and most
operations performed on numbers have no meaning for this data.
A discrete attribute can only have a finite set of values. These are often represented by integer
variables, and a special case is binary attributes, which take only two different values,
representing true/false, yes/no, male/female etc. In this work the data set is represented by
Boolean values that can only be 1 or 0 [24].
Interval and ratio attributes, on the other hand, are quantitative or numeric attributes where the
data represents actual values, and they hold the properties of numbers. These attributes can be
both integer-valued and continuous. Continuous attributes have real numbers as their values
and are often represented as floating-point variables in data sets.
One way to distinguish between attributes is by the number of values they can take.
Any measurement scale type (nominal, ordinal, interval or ratio) can be combined with any
number of attribute values (binary, discrete or continuous), but some combinations are not
practical. Typically, nominal and ordinal attributes are discrete or binary, while interval and
ratio attributes are continuous, since they represent realistic data. But this does not always hold:
count attributes, for instance, are discrete but also ratio attributes [9].
2.3 Cluster analysis
Clustering is a popular data mining technique. Cluster analysis divides data into groups (clusters)
that are meaningful, useful or both. Classification of data is a fundamental tool in pattern
recognition and vector quantization, which are applied in image processing and computer
vision [11]. Cluster analysis groups data objects based only on information found in the data
that describes the objects and their relationships. The goal is that the objects within a group
should be similar to one another and different from the objects in other groups. The greater the
similarity (or homogeneity) within a group and the greater the difference between groups, the
better the clustering. The ability to classify things is undoubtedly one of the key features of
human intelligence. It is also well known that the clustering problem is a difficult one, and we
have to resort to approximate solutions [12].
Figure 2.3 shows a set of data points in 3D. Assuming we know that there are two clusters, we
can easily determine visually which points belong to which class. A clustering algorithm takes
the complete set of points and classifies them using some distance measure.
Figure 2.3. Two clusters in R³
When dealing with unsupervised learning, the number of clusters is not always clear. Moreover,
the measure of similarity depends on the application. The three most popular clustering methods
are described in the following sections: hierarchical clustering, K-means clustering and the
Gaussian Mixture Model. DSM (Dependency Structure Matrix) clustering, which can be used
in future work when one considers a large set of functions simultaneously, is described in
Appendix B.
2.3.1 Hierarchical Clustering
Hierarchical clustering groups data over a variety of scales by creating a cluster tree or
dendrogram [13]. Figure 2.4 shows an example of hierarchical clustering using a
dendrogram. In this case, two of the figures are similar in all respects except that one has a
white stomach; the other cases are less similar (because of the colour, and one is an
angry boy!).
Figure 2.4. Hierarchical clustering using a dendrogram (similar cases join at a low level,
dissimilar cases only at the top)
The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one
level are joined as clusters at the next level. This method allows deciding the level of
clustering that is most appropriate for the application at hand. A characteristic of this method
is that it produces a sequence of partitions in one run. The main method in this work is based
on a modified version of hierarchical clustering which is called Two Step Clustering. The
TSC is described in chapter 3.
2.3.2 K-means Clustering
K-means clustering is a partitioning method. Unlike hierarchical clustering, this method
operates on actual observations rather than on the larger set of dissimilarity measures. It creates
only one level of clusters and treats each observation in the data as an object having a location
in space [14]. In this work only a simple example of K-means is given. The algorithm will be
demonstrated in MatLab using the Hamming distance (see chapter 2.3.4).
The disadvantage of this method is that the number of clusters is unknown and must be
specified in advance. The algorithm needs to run multiple times (once for each number of
clusters) to generate a sequence of partitions.
K-means finds a partition in which objects within each cluster are as close to each other as
possible, and as far from objects in other clusters as possible. This clustering method uses an
iterative algorithm that minimizes the sum of distance from each object to its cluster centroid,
over all clusters. A cluster centroid (or just centre) is defined as the vector of cluster means of
each variable.
2.3.3 Gaussian Mixture Model
The Gaussian Mixture Model forms clusters by representing the probability density function of
the observed variables as a mixture of multivariate normal densities. Mixture models are fitted
with expectation maximization (EM), which assigns posterior probabilities to each component
density with respect to each observation [15]. Clusters are assigned by selecting the
component that maximizes the posterior probability. Like K-means clustering, GMM uses an
iterative algorithm.
2.3.4 Distance measure
A very important step in any clustering is to select the right distance measure (metric or distance
function), which determines how the similarity between two elements is calculated [8].
The shape of the clusters will be affected, as some elements may be close to one another
according to one distance and farther away according to another. In this work the metric will
be the probability-based log-likelihood measure. The distance between two different
sub-clusters is related to the decrease in likelihood as they are combined into one
cluster [10]. In calculating the log-likelihood, a multinomial distribution is assumed, since we
are dealing with categorical variables (see table 2.1). It is also assumed that the trucks and their
binary attributes are independent of each other. The metric is defined by
d(i, k) = ξ_i + ξ_k − ξ_⟨i,k⟩

ξ_j = −N_j · Σ_{n=1..K} Ê_jn

Ê_jn = −Σ_{l=1..L_n} (N_jnl / N_j) · log(N_jnl / N_j)

where d(i, k) is the distance between clusters i and k, ⟨i, k⟩ is the index that represents the
cluster formed by combining clusters i and k, N_j is the number of trucks in cluster j and N_jnl is
the number of trucks in cluster j whose n-th variable takes the l-th category. K is the total
number of variables and L_n is the number of categories for the n-th variable.
Another distance function is the Hamming distance, which measures the minimum number of
substitutions required to change one vector into the other. The metric is defined by

d(i, k) = number of places where i and k disagree

Figure 2.5 shows the Hamming distance between two binary vectors, which equals 2:

vector x = 1011101 and vector y = 1001001

Figure 2.5. The Hamming distance between x and y is 2

There are other metrics for binary data, such as Jaccard, Russel & Rao, Sokal & Sneath and
Dice to name a few, but as stated above, the focus in this work will be on the log-likelihood
metric.
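Both measures above can be sketched for binary data. This is a hedged illustration: for binary variables every L_n = 2 (the categories 0 and 1), and the toy clusters are invented.

```python
from math import log

def hamming(x, y):
    """Number of places where two equal-length vectors disagree."""
    return sum(a != b for a, b in zip(x, y))

def xi(cluster):
    """xi_j = -N_j * sum over variables of the category entropy E_jn."""
    n_cases = len(cluster)
    total = 0.0
    for n in range(len(cluster[0])):           # loop over variables
        ones = sum(row[n] for row in cluster)
        for count in (ones, n_cases - ones):   # categories 1 and 0
            if count:
                p = count / n_cases
                total += -p * log(p)           # entropy term
    return -n_cases * total

def ll_distance(ci, ck):
    """d(i, k): decrease in log-likelihood when clusters i and k merge."""
    return xi(ci) + xi(ck) - xi(ci + ck)

print(hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]))  # → 2
ci = [[1, 0], [1, 0]]
ck = [[0, 1], [0, 1]]
print(ll_distance(ci, ck) > 0)  # merging dissimilar clusters costs likelihood
```

A pure cluster has zero entropy, so merging two identical clusters gives d(i, k) = 0, while merging dissimilar clusters gives a positive distance.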
3. Binary Clustering
The main focus of this work is on classifying (clustering) data consisting of binary vectors.
Here clustering means dividing the set of binary vectors into disjoint subsets (i.e.
clusters or subclasses) in such a way that the cost of the classification is minimal.
To measure the cost of classification one can use error measures such as the MSE (mean square
error) or more complex ones such as stochastic complexity. We shall discuss both
and describe the Two Step Clustering algorithm in SPSS.
3.1 Mathematical criteria – A general clustering model for
binary data
Suppose a set B^t of binary vectors of the form X^l = (x_1^l, x_2^l, …, x_d^l), where x_i^l ∈ {0, 1}. Then the set
is described as follows:

B^t = { X^l | l = 1, 2, …, t }

Suppose now that we want to classify B^t into k disjoint classes C = (C_1, C_2, …, C_k), where
C_j = { X^l | l = 1, 2, …, t_j } and j ∈ {1, 2, …, k}. Then for each class C_j one computes the
number of ones in each column i by:

t_ij = Σ_{l=1..t_j} x_i^l    (1)
Assume now that the distance function (metric) from each vector to its class is given by

d(x^l, C_j)

This distance can be the Euclidean distance, the Hamming distance, the log-likelihood distance
or some other distance function. The total error can then be expressed as follows:

Error(B^t, C) = Σ_{j=1..k} Σ_{l=1..t_j} d(x^l, C_j)    (2)
We first present a general model for the binary clustering problem based on the mean square
error. The model is specified as follows:

W = A X Bᵀ + E    (3)

where E is the error component. The first term A X Bᵀ characterizes the information in the
binary data set W = (w_ij)_{n×m} that can be described by the cluster structures. A and B explicitly
designate the cluster membership for data points and features, respectively. X specifies the
cluster representation.
Let Ŵ denote the approximation A X Bᵀ; the goal is to minimize the approximation error.
Before the minimization, let us define the Frobenius norm [37] of a matrix M = (M_ij):

‖M‖_F = √( Σ_{i,j} M_ij² )

The sum of squared errors is now:

Error(A, X, B) = ‖W − Ŵ‖_F² = Trace[ (W − Ŵ)(W − Ŵ)ᵀ ] = Σ_{i=1..n} Σ_{j=1..m} (w_ij − ŵ_ij)²    (4)

= Σ_{i=1..n} Σ_{j=1..m} ( w_ij − Σ_{k=1..K} Σ_{c=1..C} a_ik b_jc x_kc )²

where K is the number of clusters for data points and C is the number of clusters for features.
Suppose now that

A = (a_ik),  a_ik ∈ {0, 1}
B = (b_jc),  b_jc ∈ {0, 1}

and

Σ_{k=1..K} a_ik = 1
Σ_{c=1..C} b_jc = 1

so that A and B denote the data and feature memberships, respectively. Based on equation (4)
above we obtain

Error(A, X, B) = ‖W − Ŵ‖_F² = Σ_{i=1..n} Σ_{j=1..m} ( w_ij − Σ_{k=1..K} Σ_{c=1..C} a_ik b_jc x_kc )²
= Σ_{k=1..K} Σ_{c=1..C} Σ_{i∈P_k} Σ_{j∈Q_c} (w_ij − x_kc)²

where i ∈ P_k means that the i-th data point belongs to cluster P_k and j ∈ Q_c means that the
j-th feature belongs to cluster Q_c.

For fixed P_k and Q_c, the optimal X is obtained by

x_kc = (1 / (p_k · q_c)) · Σ_{i∈P_k} Σ_{j∈Q_c} w_ij

Hence X can be thought of as the matrix of centroids for the simultaneous clustering problem.
X represents the associations between the data clusters and the feature clusters.
Error(A, X, B) can then be minimized via an iterative procedure with the following steps:

1. Given X and B, the feature partition Q is fixed. Error(A, X, B) is then minimized by

   â_ik = 1 if Σ_{c=1..C} Σ_{j∈Q_c} (w_ij − x_kc)² ≤ Σ_{c=1..C} Σ_{j∈Q_c} (w_ij − x_lc)² for all l = 1, …, K, l ≠ k,

   and 0 otherwise.

2. Given X and A, the data partition P is fixed. Error(A, X, B) is then minimized by

   b̂_jc = 1 if Σ_{k=1..K} Σ_{i∈P_k} (w_ij − x_kc)² ≤ Σ_{k=1..K} Σ_{i∈P_k} (w_ij − x_kl)² for all l = 1, …, C, l ≠ c,

   and 0 otherwise.

3. Given A and B, X is computed by:

   x_kc = (1 / (p_k · q_c)) · Σ_{i∈P_k} Σ_{j∈Q_c} w_ij
3.1.1 K-means Clustering
Consider equation (3) above:

W = A X Bᵀ + E

If we choose B = I_{m×m} (the identity matrix), the general model reduces to K-means
clustering (grouping data points into clusters). Hence

W = A X + E

Suppose now that A = (a_ik) with a_ik ∈ {0, 1} and Σ_{k=1..K} a_ik = 1. The optimization model
reduces to

Error(A, X, B) = ‖W − Ŵ‖_F² = Trace[ (W − AX)(W − AX)ᵀ ] = Σ_{i=1..n} Σ_{j=1..m} ( w_ij − Σ_{k=1..K} a_ik x_kj )²

= Σ_{i=1..n} Σ_{k=1..K} a_ik Σ_{j=1..m} (w_ij − x_kj)²
= Σ_{i=1..n} Σ_{k=1..K} a_ik Σ_{j=1..m} (w_ij − y_kj)² + Σ_{k=1..K} p_k Σ_{j=1..m} (y_kj − x_kj)²

where p_k = Σ_{i=1..n} a_ik and y_kj = (1 / p_k) Σ_{i=1..n} a_ik w_ij.

Hence, given A, the error Error(A, X, B) is minimized by setting

x_kj = y_kj = (1 / p_k) Σ_{i=1..n} a_ik w_ij
3.1.2 The principle of Minimum Description Length (MDL)
In order to compress several data vectors together in an optimal manner, one needs to capture
all the common regularities found in the data. The more similar the data vectors in a cluster
are, the better the cluster can be compressed [38]. The sum of all the compressed clusters (the
total code length) is a criterion that captures the dependence between the clusters. The overall
idea is to choose a representation of the data which lets one express it with the shortest message
via a postulated set of models. The code length hence offers a universal scale, making it
possible to compare clusterings of different complexity. The "message" or "description" length
is traditionally measured in bits [39].
3.1.3 Stochastic Complexity (SC)
Stochastic complexity in the minimum description length framework is a central concept in
statistical modelling. Older formalizations of SC are the marginal likelihood and BIC (the
Bayesian Information Criterion); the modern formalization is the Normalized Maximum
Likelihood. SC is the shortest description length of a given data set relative to a model class
𝔐 [38].
The model class 𝔐 can be defined as a set of parametric distributions indexed by elements θ of
Θ ⊆ B^t:

𝔐 = { P(x | θ), θ ∈ Θ }

The maximum likelihood model in the model class 𝔐 with respect to the data set x is

θ̂(x) = arg max_{θ∈Θ} P(x | θ, 𝔐)

Define the stochastic complexity as the result of the following minmax optimization problem
over densities Q [39]:

SC_BIC: P_BIC(x′) = arg min_Q max_{x′} ( log P(x′ | θ̂(x′), 𝔐) − log Q(x′) )

The solution to this minmax problem is

P_BIC(x) = P(x | θ̂(x), 𝔐) / Σ_{x′} P(x′ | θ̂(x′), 𝔐)
3.2 Two Step Cluster in SPSS
The SPSS Two Step Cluster (TSC) method is a modified version of hierarchical cluster
analysis designed to handle very large data sets. The main idea is to pre-cluster the cases (the
trucks in this work) into many small sub-clusters and then cluster the sub-clusters resulting
from the pre-cluster step into the desired number of clusters.
The pre-cluster step uses a sequential clustering approach. It scans the cases one by one and
decides whether the current case should be merged with a previously formed cluster or start a
new cluster, based on the distance criterion. The procedure constructs a modified cluster
feature (CF) tree [40]. The CF tree consists of levels of nodes, and each node contains a number
of entries. An entry in a leaf node represents a final sub-cluster [10]. The internal nodes and
their entries are used to guide a new case to the correct leaf node. Each entry is characterized
by its CF, which consists of counts for each category of the categorical variables (binary here).
Procedure of TSC:

Step 1: Pre-cluster the data into sub-clusters
1. Compute the cluster features (CFs)
2. Build the CF tree

Step 2: Group the sub-clusters into clusters
1. Calculate the BIC for each number of clusters
2. Refine the initial estimate
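The sequential scan in the pre-cluster step can be sketched as a simple "leader" pass. This is a hedged sketch: plain Hamming distance stands in for the log-likelihood criterion, and the threshold T and toy cases are invented.

```python
# Each incoming case joins the nearest existing sub-cluster if it is
# within the threshold T; otherwise it starts a new sub-cluster.

def precluster(cases, T):
    leaders, members = [], []
    for case in cases:
        best, dist = None, None
        for idx, leader in enumerate(leaders):
            d = sum(a != b for a, b in zip(case, leader))
            if dist is None or d < dist:
                best, dist = idx, d
        if best is not None and dist <= T:
            members[best].append(case)     # absorb into the sub-cluster
        else:
            leaders.append(case)           # start a new sub-cluster
            members.append([case])
    return members

cases = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
subclusters = precluster(cases, T=1)
print(len(subclusters))  # → 2
```

The real TSC additionally organizes the sub-clusters in a CF tree (section 3.2.1) so that the nearest leader can be found without scanning them all.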
3.2.1 CF-tree
The information that is maintained about a cluster is summarized in a clustering feature [10]. A
CF tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering. An internal node in the tree has "children" and stores the sums of the CFs of
its children. Hence an internal node represents a cluster made up of all the sub-clusters
represented by its entries. A leaf node likewise represents a cluster made up of all the
sub-clusters represented by its entries. A CF tree has two parameters: a branching factor B,
which specifies the maximum number of children, and a threshold T. The size of any entry has
to be less than the threshold [40]. There is also a limit on the number of entries in a leaf node.
Figure 3.1 shows a CF tree with branching factor B and leaf nodes with at most L entries.
Each case, starting from the root node, is recursively guided by the closest entry (at each
level, choose the sub-tree whose centroid is closest) to find the closest child node,
and descends along the CF tree. Upon reaching a leaf node, it finds the closest leaf entry in
the leaf node. If the case is within the threshold T of the closest leaf entry, it is absorbed into
the leaf entry and the CF of that leaf entry is updated. Otherwise it starts its own leaf entry in
the leaf node. If there is no space in the leaf node to create a new leaf entry, the leaf node is
split into two: the entries in the original leaf node are divided into two groups using the
farthest pair as seeds, and the remaining entries are redistributed based on the closeness
criterion (distance measure). If the CF tree grows beyond the allowed maximum size, it is
rebuilt from the existing CF tree with an increased threshold. The rebuilt CF tree is smaller and
has space for new cases. This process continues until a complete data pass is finished [10].
Figure 3.1. CF tree with branching factor B. A leaf node contains at most L entries.
3.2.2 Cluster step
The cluster step takes the sub-clusters resulting from the pre-cluster step as input and groups
them into the desired number of clusters. Since the number of sub-clusters is much smaller than
the number of original cases, traditional clustering methods can be used effectively. The TSC
uses the hierarchical clustering method.
3.2.3 Log-Likelihood distance
A distance measure for closeness is needed in both pre-cluster and cluster steps. The distance
between two different clusters is related to the decrease in likelihood as they are combined
into one cluster. In calculating log-likelihood, multinomial distribution is assumed. It is also
assumed that the cases and their attributes are independent of each other. The metric is
defined by
d(i, k) = ξ_i + ξ_k − ξ_⟨i,k⟩    (5)

ξ_j = −N_j · Σ_{n=1..K} Ê_jn

Ê_jn = −Σ_{l=1..L_n} (N_jnl / N_j) · log(N_jnl / N_j)

where d(i, k) is the distance between clusters i and k, ⟨i, k⟩ is the index that represents the
cluster formed by combining clusters i and k, N_j is the number of cases in cluster j and N_jnl is
the number of cases in cluster j whose n-th variable takes the l-th category. K is the total
number of variables and L_n is the number of categories for the n-th variable.
3.2.4 Auto Clustering using BIC
The number of clusters depends on the data at hand. A characteristic of hierarchical clustering
is that it produces a sequence of partitions in one run: 1, 2, 3 … clusters. A K-means
algorithm would need to run several times in order to generate the sequence. To determine the
number of clusters automatically, a two-step process that works well with hierarchical
clustering is considered. In the first step, the BIC (see chapter 3.1.3) for each number of
clusters within a specified range is calculated and used to find the initial estimate for the
number of clusters. The initial estimate is refined in the second step by finding the largest
increase in distance between two clusters in each hierarchical clustering stage. Using equation
(5) above the BIC is calculated as:
BIC(V) = −2 · Σ_{v=1..V} ξ_v + m_V · log(N)

where N is the total number of cases and m_V = V · Σ_{k=1..K} (L_k − 1).
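The criterion above can be illustrated for binary data, where every L_k = 2 and hence m_V = V · K. This is a hedged sketch with hand-made candidate partitions of an invented toy data set (a real run would take the partitions from the hierarchical clustering stages).

```python
from math import log

def xi(cluster):
    """xi_j = -N_j * total category entropy over the K variables."""
    n = len(cluster)
    h = 0.0
    for v in range(len(cluster[0])):
        ones = sum(row[v] for row in cluster)
        for count in (ones, n - ones):
            if count:
                p = count / n
                h += -p * log(p)
    return -n * h

def bic(partition, K):
    """BIC(V) = -2 * sum_v xi_v + m_V * log(N), binary case."""
    N = sum(len(c) for c in partition)
    mV = len(partition) * K          # V * sum_k (L_k - 1) with L_k = 2
    return -2 * sum(xi(c) for c in partition) + mV * log(N)

data = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]]
one_cluster = [data]
two_clusters = [data[:3], data[3:]]
print(bic(one_cluster, K=3) > bic(two_clusters, K=3))  # → True
```

Splitting into the two homogeneous clusters lowers the BIC, so the auto-clustering step would prefer V = 2 here.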
3.3 Data Mapping RPM in VisuMap
For many decades, visualizing high-dimensional data has been a key subject. Many of
the methods target high-dimensional data with stylish rendering procedures such as 3D,
landscapes, special glyphs, colours and graphics. Other methods attack the problem by
reducing the dimensionality in a generic way, with few assumptions about the data type. The
RPM (Relational Perspective Map) algorithm belongs to the latter kind. RPM is a
general-purpose method to visualize distance information of data points in high-dimensional
spaces [25].
The goal of the RPM algorithm is to map the data points onto a two- or three-dimensional map
so that the distances between the image points approximate the original distances as closely as
possible. An RPM map thus attempts to preserve as much as possible of the distance
information of the original data set from a geometric point of view. The RPM algorithm's
creation of 2D and 3D maps is shown in figure 3.2.
Figure 3.2. The principle of the RPM algorithm
4. Analysis
This is the main section of this work. Three different tools were used to demonstrate
how the clustering algorithms work. The first is K-means in MatLab with the Hamming
distance on a simple example. The second is data mapping in VisuMap on one of the
five functions, and the last is Two Step Clustering in SPSS. Since TSC handles large
data sets and decides the number of clusters automatically, this tool was preferred. Hence, the
main result is based on TSC in SPSS. However, since the visualization in SPSS is poor, the
results were presented as graphs in Excel.
4.1 K-means in MatLab – A simple example
Figure 4.1 shows the principle of the K-means algorithm. K-means is a partitioning method
where the trucks (based on their attributes) are partitioned into subsets (clusters). The idea is
to minimize the mean square error, MSE (see chapter 3.1.1). The inputs are the data set and
the number of clusters. The output is a set of clusters such that data within a cluster are
similar to each other and dissimilar from data in other clusters. We will demonstrate how this
method works by using K-means in MatLab. The disadvantage of this method is that the
number of clusters must be given in advance. Moreover, the visualization using silhouette
values (figure 4.2) in MatLab is not easy to interpret. The silhouette plot is nevertheless useful
for deciding the number of clusters, but this can be time-consuming since one must run the
algorithm several times, once for each candidate number of clusters, and then compare the
resulting silhouette plots.
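The partitioning loop just described can also be sketched outside MatLab. The following is a minimal Python illustration (not the built-in kmeans used later): a Lloyd-style iteration under Hamming distance where each center is the attribute-wise majority vote of its members. The truck/ECU data here are synthetic, made up only for the demonstration.

```python
import numpy as np

def kmodes_hamming(X, k, n_iter=20, seed=0):
    """Lloyd-style clustering of binary rows under Hamming distance.

    Centers are attribute-wise majority votes ("modes"), so they stay 0/1.
    """
    rng = np.random.default_rng(seed)
    # farthest-point initialization keeps the k starting centers spread out
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d = np.min([(X != c).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Hamming distance from every truck to every center
        dist = (X[:, None, :] != centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(int)
    return labels, centers

# toy stand-in for the 140x5 truck/ECU matrix: two planted option groups
rng = np.random.default_rng(1)
a = (rng.random((70, 5)) < [0.9, 0.9, 0.1, 0.1, 0.1]).astype(int)
b = (rng.random((70, 5)) < [0.1, 0.1, 0.9, 0.9, 0.9]).astype(int)
X = np.vstack([a, b])
labels, centers = kmodes_hamming(X, k=2)
print(centers)
```

The resulting centers recover the two planted option patterns, which is the same behaviour the MatLab example below exploits.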
Figure 4.1. K-means algorithm
The matrix below shows 140 trucks with 5 attributes (here Electronic Control Units):

            ECU1  ECU2  ECU3  ECU4  ECU5
Truck 1      0     1     0     0     0
Truck 2      0     1     1     0     1
Truck 3      0     0     0     1     0
…            …     …     …     …     …
Truck 140    …     …     …     …     …
By using K-means in MatLab with Hamming distance (see 2.3.5) we can find a pattern
between the trucks. If we choose 2 clusters the result will be as in figure 4.2. The figure shows
that ECU2 and ECU4 differ from the rest of the ECUs, since these control units belong to
cluster 1 while the others belong to cluster 2. See Appendix C for the very simple MatLab
code for this example, using the built-in functions kmeans and silhouette.
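The silhouette value behind figure 4.2 can also be computed directly. The sketch below (a Python illustration, not MatLab's silhouette function) uses the usual definition: for each point, a is the mean Hamming distance to its own cluster and b is the smallest mean distance to another cluster, giving s = (b - a) / max(a, b). The toy data are made up for the example.

```python
import numpy as np

def silhouette_hamming(X, labels):
    """Silhouette s(i) = (b - a) / max(a, b) with mean Hamming distance:
    a = mean distance within own cluster, b = mean distance to the
    nearest other cluster. Singleton clusters get s = 0 by convention."""
    n = len(X)
    dist = (X[:, None, :] != X[None, :, :]).sum(axis=2).astype(float)
    clusters = np.unique(labels)
    s = np.zeros(n)
    for i in range(n):
        own = labels[i]
        mask_own = (labels == own)
        if mask_own.sum() <= 1:
            continue
        a = dist[i, mask_own].sum() / (mask_own.sum() - 1)  # exclude self
        b = min(dist[i, labels == c].mean() for c in clusters if c != own)
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

# two perfectly separated binary clusters: every silhouette value is 1.0
X = np.array([[0, 0, 0, 0, 0]] * 5 + [[1, 1, 1, 1, 1]] * 5)
labels = np.array([0] * 5 + [1] * 5)
print(silhouette_hamming(X, labels))
```

Values near 1 indicate well-separated clusters; values near 0 or below indicate points sitting between clusters, which is why the plot helps when choosing the number of clusters.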
Figure 4.2. Silhouette plot in MatLab using K-means with Hamming distance for two clusters (the vertical axis lists ECU2, ECU4, ECU1, ECU3 and ECU5)
4.2 Data Mapping in VisuMap
In this section we describe the use of data mapping in VisuMap [25] (see 3.3). We will
analyze function A. Figure 4.3 shows the data once it has been imported into VisuMap as a
CSV (Comma-Separated Values) file.
Figure 4.3. Imported multidimensional data into VisuMap
By using the RPM method described in 3.3 the function was analyzed. Figure 4.4 shows 5
clusters. The surprising result is that a large proportion (31.5%) of all trucks with this
function are sold to Thailand.
Figure 4.4. Result in VisuMap for function A
4.3 Two Step Clustering in SPSS
The best method is TSC in SPSS, which can handle the size of the data and uses
auto-clustering (see 3.2). Since the visualization in SPSS is poor, the results of the TSC
algorithm were exported to Excel.
4.3.1 Clustering strategy
Five different functions were considered. For each function, patterns between Electronic
Control Units were found by using TSC in SPSS. Once the clusters based on the ECUs were
found, the segments and countries within those clusters were identified.
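SPSS's Two Step procedure picks the number of clusters automatically with BIC. The sketch below imitates that idea only (it is not the SPSS algorithm): binary data are clustered for several candidate cluster counts, each clustering is scored with a Bernoulli log-likelihood BIC, and the count with the lowest score wins. The data, the clustering routine and the parameter count are illustrative assumptions.

```python
import numpy as np

def fit_clusters(X, k, n_iter=25, seed=0):
    """Plain Lloyd k-means with farthest-point initialization.

    Euclidean distance is fine on 0/1 data (it is monotone in Hamming here).
    """
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))].astype(float)]
    while len(centers) < k:
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()].astype(float))
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def bic(X, labels, k):
    """BIC-style score: -2 log-likelihood + (free parameters) * log n.

    Each cluster is modelled with independent Bernoulli attributes; lower wins.
    """
    n, d = X.shape
    loglik = 0.0
    for j in range(k):
        C = X[labels == j]
        if len(C) == 0:
            continue
        p = C.mean(axis=0).clip(1e-9, 1 - 1e-9)
        loglik += (C * np.log(p) + (1 - C) * np.log(1 - p)).sum()
    return -2 * loglik + k * d * np.log(n)

# synthetic sales data: three planted "option packages" over 9 ECUs, 5% noise
rng = np.random.default_rng(2)
protos = np.array([[1]*3 + [0]*6, [0]*3 + [1]*3 + [0]*3, [0]*6 + [1]*3],
                  dtype=bool)
X = np.vstack([(rng.random((60, 9)) < 0.05) ^ p for p in protos]).astype(int)
scores = {k: bic(X, fit_clusters(X, k), k) for k in range(1, 6)}
best_k = min(scores, key=scores.get)
print(best_k)
```

On this synthetic data the lowest BIC lands on the planted number of groups, mirroring how TSC's auto-clustering trades model fit against complexity.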
4.4 Result in SPSS
The main result of this work is shown in this section. The clusters found for the five functions
are presented here. Table 4.1 describes the volume of each function in the data set. For a
description of the different ECUs, please refer to Appendix C.
Function     Volume
Function A   Low
Function B   Low
Function C   High
Function D   Very High
Function E   Very High

Table 4.1. Volumes of the functions in the data set
4.4.1 Function A
Five clusters were found for function A. Figure 4.5 shows the cluster distribution, figure 4.6
shows the ECU-cluster distribution and figure 4.7 the respective segments and countries.
Cluster 4 shows that a large proportion of all trucks with this function have ECUs 23 and 25
and are sold to Thailand. The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 26%, cluster 2 – 11%, cluster 3 – 20%, cluster 4 – 31%, cluster 5 – 12%.
Figure 4.5. Cluster distribution – Function A
Cluster 1
Trucks in this cluster have the ECUs 5, 17, 22, 24 and 25. Moreover they belong to the
segment Long-Haulage and are sold in Sweden.
Cluster 2
These trucks do not have ECU 15; they belong to the segment Long-Haulage and are sold in
Spain.
Cluster 3
These trucks have the ECUs 16, 22 and 24; they belong to the segment Long-Haulage and are
sold in Europe.
Cluster 4
These trucks have the ECUs 23 and 25; they belong to the segments Long-Haulage and
Distribution and are sold in Thailand.
Cluster 5
These trucks are low-end versions; they belong to the segment Long-Haulage and are sold in
Saudi Arabia.
Figure 4.6. Cluster distribution - ECUs with respect to function A
Figure 4.7. Cluster distribution – segment and countries for function A
4.4.2 Function B
Two clusters were found for function B. Figure 4.8 shows the cluster distribution, figure 4.9
shows the ECU-cluster distribution and figure 4.10 the respective segments and countries.
The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 65%, cluster 2 – 35%.
Figure 4.8. Cluster distribution – Function B
Cluster 1
Trucks in this cluster have the ECUs 6 and 16. Moreover they belong to the segment
Long-Haulage and are sold in Germany.
Cluster 2
These trucks have the ECUs 17, 22, 24 and 25; they belong to the segment Long-Haulage and
are sold in Sweden.
Figure 4.9. Cluster distribution - ECUs with respect to function B
Figure 4.10. Cluster distribution – segments and countries for function B
4.4.3 Function C
Four clusters were found for function C. Figure 4.11 shows the cluster distribution, figure
4.12 shows the ECU-cluster distribution and figure 4.13 the respective segments and
countries. Cluster 2 shows that a large proportion of all trucks with this function do not have
ECUs 3 and 4 (pneumatic brake system); the segment is Construction and the trucks have
been sold to Dubai (the Middle East in general). The details of each of the found clusters are
listed below.
Cluster distribution: cluster 1 – 60%, cluster 2 – 11%, cluster 3 – 18%, cluster 4 – 11%.
Figure 4.11. Cluster distribution – Function C
Cluster 1
Trucks in this cluster have the ECU 3. Moreover they belong to the segment Construction and
are sold in France, Turkey and Spain.
Cluster 2
These trucks do not have the ECUs 3 and 4. They belong to the segment Construction and are
sold in Dubai.
Cluster 3
These trucks have the ECUs 3, 14 and 18; they belong to the segment Long-Haulage and are
sold in Europe.
Cluster 4
These trucks have the ECUs 4 and 21; they belong to the segment Distribution and are sold in
Israel.
Figure 4.12. Cluster distribution - ECUs with respect to function C
Figure 4.13. Cluster distribution – segment and countries for function C
4.4.4 Function D
Two clusters were found for function D. Figure 4.14 shows the cluster distribution, figure
4.15 shows the ECU-cluster distribution and figure 4.16 the respective segments and
countries. The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 38%, cluster 2 – 62%.
Figure 4.14. Cluster distribution – Function D
Cluster 1
Trucks in this cluster have ECU 3 and are low-end in general. Moreover they belong to the
segment Construction. The countries vary.
Cluster 2
These trucks are high-end and have the ECUs 1, 4 and 21. They belong to the segment
Long-Haulage and the countries vary.
Figure 4.15. Cluster distribution - ECUs with respect to function D
Figure 4.16. Cluster distribution – segment and countries for function D
4.4.5 Function E
Two clusters were found for function E. Figure 4.17 shows the cluster distribution, figure
4.18 shows the ECU-cluster distribution and figure 4.19 the respective segments and
countries. The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 70%, cluster 2 – 30%.
Figure 4.17. Cluster distribution – Function E
Cluster 1
Trucks in this cluster have ECU 3 and are low-end in general. Moreover they belong to the
segment Construction. The countries vary.
Cluster 2
These trucks are high-end and have the ECUs 1, 4 and 21. They belong to the segment
Long-Haulage and are sold in Italy and Denmark.
Figure 4.18. Cluster distribution - ECUs with respect to function E
Figure 4.19. Cluster distribution – segment and countries for function E
4.5 Change over time
In this section the change over time for function D and function E is considered.
4.5.1 Function D
Figure 4.20 shows the change over time for the two clusters of function D. In the first cluster
the segment was Construction and the trucks were low-end. In the second cluster the segment
was Long-Haulage and the trucks were high-end. The figure makes clear that more and more
construction trucks (the upper graph) have been sold over time.
Figure 4.20. Change over time for function D – the upper graph shows that the construction trucks trend
upward over time
4.5.2 Function E
Figure 4.21 shows the change over time for the two clusters of function E. In the first cluster,
the segment was Construction and the trucks were low-end. In the second cluster, the
segment was Long-Haulage and the trucks were high-end. The figure makes clear that more
construction trucks have been sold during the last years.
Figure 4.21. Change over time for function E – the upper graph shows that more construction trucks have been
sold during the last years compared to the lower graph, which shows the long-haulage trucks.
5. Discussion
5.1 Tools
The research company Gartner Group [16, 34] states that SPSS remains one of the leading
vendors in the customer data-mining application market, behind the well-established
statistical software SAS (Statistical Analysis System). According to Gartner, SAS is
expensive: price-sensitive companies, or those requiring significant justification of the
cost-effectiveness of one solution over another, should evaluate alternatives.
Some good alternatives are SPSS, VisuMap, BayMiner and the MatLab statistics toolbox.
Figure 5.1 and table 5.1 show five different tools. The figure shows performance versus price.
The pluses in table 5.1 mean "good" and the minuses mean "maybe another tool should be
used". VisuMap is the leading tool when it comes to visualization, and SPSS is the only one
with auto-clustering (automatically determining the number of clusters). The usability of both
SPSS and VisuMap is fairly good. The prices include some extra tools needed for data-mining
purposes, for example the Clementine tool in SPSS. Moreover, the prices of BayMiner and
VisuMap depend on the purpose at hand, since the price includes monthly support etc.
Figure 5.1. Tools for Data Mining purposes
Tool                       Handles large data sets                    Visualization  Auto Clustering  Cost, single-computer license                       Usability
SPSS                       yes                                        -              ++               ~90 000 SEK [29]                                    +
VisuMap                    yes                                        ++             -                depends on data size; ~100 000 SEK, no limit [30]   +
MatLab statistics toolbox  yes (max 100 000 records, 500 attributes)  -              -                ~7 000 SEK [31]                                     -
BayMiner                   yes                                        ++             -                ~40 000 SEK [32]                                    ++
SAS                        yes                                        +              -                ~120 000 SEK [33]                                   +

Table 5.1. Tools for Data Mining purposes
5.2 Binarization
Assume that we know the dependence between the functions (an architect or expert at Scania
gives us this information) and we want to use a binary clustering technique. If we let 3 denote
very strong dependence and 0 independence, we can construct a binary matrix. The following
is a simple example of binarization:
Categorical value          Integer value   D1   D2   D3   D4
Independent                0               1    0    0    0
Almost independent         1               0    1    0    0
Strongly dependent         2               0    0    1    0
Very strongly dependent    3               0    0    0    1

Table 5.2. Example of binarization
One can then merge this with the historical sales data. Such a transformation can cause
complications, such as creating unintended relationships among the transformed attributes.
Please see the future work chapter and Appendix B for more information on dealing with a
large set of functions simultaneously.
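The binarization above is one-hot encoding of the 0-3 dependence scale. A minimal Python sketch of it follows; the level names mirror the table, and nothing here is Scania's actual schema.

```python
def binarize(value, n_levels=4):
    """Map an integer dependence level 0..n_levels-1 to a 0/1 indicator vector."""
    if not 0 <= value < n_levels:
        raise ValueError("dependence level out of range")
    return [1 if i == value else 0 for i in range(n_levels)]

levels = {"independent": 0, "almost independent": 1,
          "strongly dependent": 2, "very strongly dependent": 3}
rows = {name: binarize(v) for name, v in levels.items()}
print(rows["strongly dependent"])   # → [0, 0, 1, 0]
```

Each binarized row could then be appended to the sales matrix, but, as noted above, such derived columns can introduce unintended relationships between attributes.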
6. Related work – Stock market
The S&P 500 (Standard & Poor's 500) is a market-value-weighted index whose components
are weighted according to the total market value of their outstanding shares [26]. The stocks
included in the S&P 500 are those of large publicly held companies that trade on the two
largest American stock markets, the New York Stock Exchange and NASDAQ. Almost all of
the stocks included in the index are among the 500 American stocks with the largest market
capitalizations [28] (the total value of all outstanding shares multiplied by the stock price). To
many people a stock market is nothing but a site where it is possible to acquire capital or
influence, or both. However, finance theory and the idea of a stock market can be applied
outside of the finance domain. In this example the concepts are not used to evaluate resource
optimization, but rather to analyze price movements of financial instruments.
By using Data Mining techniques one can analyze the performance of the stocks. Figure 6.1
shows the S&P 500 index stocks based on weekly performance in the year 2002. This was
done in VisuMap by James X. Li [27]. Stocks having similar performance properties are
located close to each other.
Figure 6.1. S&P 500 index stocks – Closely located stocks have similar properties
7. Future work – Function to function
This thesis has focused on creating a framework for evaluating the resource efficiency of
embedded systems. This chapter discusses future work, where the most interesting issue is to
consider a larger number of functions simultaneously.
Given a function, we have considered the patterns between Electronic Control Units,
segments and countries. The most interesting question for future work is: what if we consider
all functions? The answer is not easy and is beyond this work. However, one way to approach
this problem is to find the dependence between the functions by using the Dependency
Structure Matrix (DSM) and combine this with historical sales data. By letting design
architects put weights on the functions as a measure of dependence (for example 5 for very
strongly dependent and 0 for independent) and then using DSM clustering, one can find out
more about the customer's choices. Another way to tackle the problem is to modify the model
in this work by transforming the measure of dependence into binary attributes through
binarization and merging it into the modified binary matrix in a suitable way.
Figure 7.1 demonstrates the use of a DSM. Suppose we are considering four functions: A, B,
C and D. We list A, B, C and D across the columns and down the rows. An "X" is placed in
an entry to indicate an interaction between two functions. Reading across a row we can see
from which other functions information must be passed to the function in that row. For
instance, the third row in the figure shows that function C depends on both functions B and
D. Next, reading down the columns we see which functions depend on the function in that
column. From the fourth column we can see that both functions A and C depend on function
D. Hence the "X" marks indicate a dependency in a general sense. Please see Appendix B for
more on DSM clustering.
      A   B   C   D
A     A           X
B         B
C         X   C   X
D                 D

Figure 7.1. A sample DSM
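The row/column reading rules can be made concrete in code. The sketch below stores the DSM of figure 7.1 as a set of marks; only the dependencies stated in the text (C on B and D; A and C on D) are included.

```python
# The DSM of figure 7.1 as a set of (row, column) marks. Only the dependencies
# stated in the text are included: C depends on B and D, and A depends on D.
marks = {("A", "D"), ("C", "B"), ("C", "D")}

def depends_on(f):
    """Read across row f: the functions f needs input from."""
    return sorted(c for r, c in marks if r == f)

def dependents_of(f):
    """Read down column f: the functions that need input from f."""
    return sorted(r for r, c in marks if c == f)

print(depends_on("C"))      # → ['B', 'D']
print(dependents_of("D"))   # → ['A', 'C']
```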
8. Conclusion
Since electronics is becoming increasingly important in the automotive sector, more and more
functionality is implemented through the embedded system. This means that the
manufacturer's knowledge about functionality is becoming more and more important.
Meanwhile, vast amounts of data are collected and stored in computers. The relevant
information about customers' choices in the collected data can be used to optimize the
resource utilization in the embedded system. A classification of the data can be made using
Data Mining techniques, and the results can be used as support when making architectural
decisions.
The work of the thesis includes a case study of five functions. The analysis showed the
importance of extracting relevant information from large data sets. For instance, trucks with
function A were sold in Thailand, and low-end long-haulage trucks with function A were sold
in Saudi Arabia. Moreover, low-end construction trucks without the pneumatic brake ECUs
and with function C were sold in Dubai, and more construction trucks with function E have
been sold over the past years compared to high-end long-haulage trucks with the same
function. In conclusion, the analysis can be said to give a new perspective when making
design decisions.
As pointed out in the previous chapter, some future work remains: given a function, we have
considered the patterns between electronic control units, segments and countries. The most
interesting question is what happens if we consider a larger number of functions
simultaneously. Still, the resource optimization outlined in this work may prove very helpful
when evaluating customers' choices with respect to historical sales data. It gives the architects
at Scania a basis for decision-making in the current design process.
9. References
[1] Ming-Shan Liu. (2007) Application of Embedded System in Construction Machinery
[2] Mercer Management Consulting and Hypovereinsbank. (2001) Studie,
Automobiltechnologie 2010.
[3] Erik Persson. (2008) Resource utilization in embedded systems – an economical
perspective. M.Sc. thesis at Royal Institute of Technology, Stockholm.
[4] Hardung, B et.al. (2004) Reuse of software in distributed embedded automotive systems.
[5] Zientz, W. (2007) Electronic systems for commercial vehicles. AutoTechnology 5, pp 40-43.
[6] Zielke, A et.al. (2006) The race to master automotive embedded systems development.
McKinsey Company, Automotive and assembly sector business technology office, Germany.
[7] Jiawei Han and Micheline Kamber. (2008) Data Mining: Concepts and Techniques.
[8] Lars Eldén. (2007) Matrix Methods in Data Mining and Pattern Recognition.
[9] Tan Pang-Ning, Steinbach Michael and Kumar Vipin. (2006) Introduction to Data Mining.
[10] Zhang, T. (1996). Birch: An efficient data clustering method for very large databases.
ACM SIGMOD Conference, Montreal, Canada, pp. 103–114.
[11] Gray R.M. (1991) Vector Quantization and Signal Compression, Kluwer Academic
Publishers, 1991.
[12] Fischbacher U. (1996) Finding the maximum a posteriori probability (MAP) in a
Bayesian taxonomic key is NP-hard J. Math. Biol. 34.
[13] Kaufman L. (1990) Finding Groups in Data: An introduction to Cluster Analysis, Wiley.
[14] Chris Ding and Xiaofeng He. (2004) K-means Clustering via Principal Component Analysis,
Canada.
[15] Hartigan John A. (1975) Clustering Algorithms, John Wiley & Sons, New York.
[16] Gartner Group. (1995) High Performance Computing Research Note.
[17] http://etn.se/48017
[18] Eppinger, Steven D., Daniel R. (1994) A Model-based Method for Organizing Tasks in
Product Development, Research in Engineering Design. 1-13.
[19] www.dsmweb.org
[20] Rogers, James L. and McCulley, M. Collin. (1996) Integrating a Genetic Algorithm into a
Knowledge-Based System for Ordering Complex Design Processes. NASA Technical
Memorandum.
[21] James Xinzhi Li. (2004) Visualization of High Dimensional Data with Relational
Perspective Map. Information Visualization, Vol 3, No. 1. 49-59.
[22] Scania Inline. (2003) Electrical System.
[23] http://www.spss.com/statistics/
[24] Ph. Dwinger. (1961) Introduction to Boolean algebras, Wurzburg.
[25] http://www.visumap.net/
[26] http://www.investopedia.com/terms/s/sp500.asp
[27] http://jamesxli.blogspot.com/
[28] http://www.investopedia.com/terms/m/marketcapitalization.asp
[29] http://www.ogs.state.ny.us/purchase/snt/awardnotes/7600600239prices.pdf
[30] http://www.visumap.net/registered/ProductList.aspx
[31] http://www.mathworks.se/store/productIndexLink.do
[32] http://www.bayminer.com/en/pages/positioning.htm
[33] http://www.sas.com/technologies/analytics/datamining/
[34] http://mediaproducts.gartner.com/reprints/sas/vol5/article3/article3.html
[35] http://www.sas.com/technologies/analytics/statistics/stat/
[36] Rissanen, J. (1989) Stochastic Complexity in Statistical Inquiry. Singapore: World
Scientific.
[37] Higham N.J. (1996) Matrix Norms. Philadelphia: Soc. Industrial and Appl. Math.
[38] P.Kontkanen, P.Myllymäki, W.Buntine, H. Tirri, J.Rissanen. (2005) In Advances in
Minimum Description Length: Theory and Applications. The MIT Press.
[39] A.D Lanterman. (2001) Intertwining Themes in Theories of Model Selection.
International Statistical Review 69, pp 189-212.
[40] Rong Liu. (2002) The SPSS TwoStep Cluster. University of North Texas.
10. Appendices
This chapter includes a total of three appendices:
Appendix A Importance of "knowing your data"
Appendix B DSM Clustering
Appendix C K-means in MatLab
10.1. Appendix A – Importance of “knowing your data”
This example is taken from Tan, Steinbach and Kumar [9]. Although this scenario represents
an extreme situation, it highlights the importance of the data preparation or pre-processing
discussed in chapter 2.
Assume that you are a Data Miner and that you receive an email from a medical researcher.
Hi,
I’ve attached the data file that I mentioned in my previous email. Each line contains the
information for a single patient and consists of five fields. We want to predict the last field
using the other fields.
Thanks and see you in a couple of days with my friend, a statistician.
Best regards
Medical Bob
You proceed to analyze the data. The first few rows of the file are as follows:

012 232 33.5 0 10.7
020 121 14.4 2 210.1
027 134 12.2 0 344.3
… … … … …
You put your doubts aside and start the analysis. There are only 100 lines, a smaller data file
than you had hoped for, but two days later you feel that you have made some progress. You
arrive at the meeting and strike up a conversation with the statistician who is also working on
this project (Bob's friend). She asks if you would mind giving her a brief overview of your
results.
Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven’t had much time for analysis, but I do have a few interesting
results.
Statistician: Amazing. There were so many data issues with this set of patients that I couldn’t
do much.
Data Miner: Oh? I didn’t hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to predict. It’s common
knowledge among people who analyze this type of data that results are better if you work with
the log of the values, but I didn’t discover this until later. Was it mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what happened to field 4? It’s supposed to be
measured on a scale from 1 to 10, with 0 indicating a missing value, but because of the data
entry error, all 10’s were changed to 0’s. Unfortunately, since some of the patients have
missing values for this field, it’s impossible to say whether a 0 in this field is a real 0 or a 10.
Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed
that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
– 60 –
Statistician: Anyway, given all those problems, I’m surprised you were able to accomplish
anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of
field 5. I’m surprised that this wasn’t noticed before.
Statistician: What?? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh no! I just remembered. We assigned ID numbers after we sorted the records
based on field 5. There is a strong connection, but it’s meaningless. Sorry!
10.2. Appendix B – Dependency Structure Matrix
The Design Structure Matrix, DSM (also Dependency Structure Matrix), is a useful tool for
optimizing the composition of product development elements in terms of minimizing
interfaces and extra-element interactions. DSM is used in system architecting, engineering
and design [18]. The DSM is a square matrix where rows and columns list the same elements,
and the entries in the matrix record interactions between the elements. The goal of DSM
clustering is to find clusters (subsets) that interact minimally [15]; in other words, a cluster
should absorb most of the interactions internally, while the links between separate clusters
are minimized. The rules of how this clustering is performed vary from application to
application, and so does the type of solution obtained. The goal is thus to identify clusters of
highly interactive functions through a reordering of the matrix [20].
What we need is to modify the matrix and cluster the functions together into highly
interactive groups known as system components. A simple example of DSM clustering is
shown in figure 2.6.
Figure 2.6. A sample DSM (eight elements A–H, with "X" marks recording their pairwise interactions)
According to Figure 2.6, function A has interaction with components D, F and H. By
reordering the above matrix we get the optimized solution shown in figure 2.7.

Figure 2.7. DSM clustering – reordered DSM with two system components
The optimized matrix was obtained by exchanging the positions of groups B and H, and of
groups C and F. In this example, two system components were distinguished as the best
configuration: system component 1 (green) with A, H, F and D, and system component 2
(red) with E, C, G and B.
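The clustering objective (absorb interactions inside clusters, minimize links between them) can be written down directly. The sketch below uses a small hypothetical symmetric DSM, not the matrices in this appendix, and exhaustively scores every two-way split of eight elements into two groups of four.

```python
from itertools import combinations

# A small hypothetical symmetric DSM as an adjacency set; the real matrices
# are larger, but the objective is the same: choose clusters so that as few
# interaction marks as possible cross a cluster boundary.
elements = list("ABCDEFGH")
marks = {frozenset(p) for p in [("A", "H"), ("A", "F"), ("A", "D"), ("H", "F"),
                                ("F", "D"), ("E", "C"), ("C", "G"), ("G", "B"),
                                ("E", "B"), ("D", "E")]}   # D–E is a cross link

def external_links(cluster):
    """Count interactions crossing the boundary of a 2-way partition."""
    other = set(elements) - set(cluster)
    return sum(1 for m in marks
               if any(x in cluster for x in m) and any(x in other for x in m))

best = min(combinations(elements, 4), key=external_links)
print(sorted(best), external_links(best))   # → ['A', 'D', 'F', 'H'] 1
```

For larger matrices an exhaustive search is infeasible, which is why practical DSM tools use heuristics such as the genetic algorithm of [20]; the cost function, however, stays the same.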
Let us complete this section with a concrete example of DSM for a vehicle. The following
DSM describes a Climate Control system [19].
The DSM lists sixteen components: Radiator (A), Engine fan (B), Heater Core (C), Heater
Hoses (D), Condenser (E), Compressor (F), Evaporator Case (G), Evaporator Core (H),
Accumulator (I), Refrigeration Controls (J), Air Controls (K), Sensors (L), Command
Distribution (M), Actuators (N), Blower Controls (O) and Blower Motor (P), with "X" marks
recording their pairwise interactions.
By reordering the above DSM we get an optimized solution in which the interactions are
grouped along the diagonal.
Clustering the "X" marks along the diagonal of the DSM resulted in the creation of three
"chunks" for the Climate Control system:
1. Front End Air Chunk
2. Refrigerant Chunk
3. Interior Air Chunk
In our case the "chunks" could for example be low-end and high-end trucks, segments
(long-haulage, distribution or construction) and so on.
10.3. Appendix C - K-means in MatLab
MatLab code for the simple example in chapter 4.1
clc; clear; clf;

% Historical sales data: 140 trucks and 5 Control Units (0/1 matrix)
data = load('sale.txt');
X = data';   % transpose so that the rows to be clustered are the 5 ECUs

% K-means using Hamming distance, number of clusters = 2
idx = kmeans(X, 2, 'distance', 'hamming');
[s, h] = silhouette(X, idx, 'hamming');

% Label the attributes and pair each ECU with its cluster index
label = {'ECU1' 'ECU2' 'ECU3' 'ECU4' 'ECU5'}';
vektor = {};
for i = 1:length(idx)
    vektor = [vektor {idx(i) [label{i,:}]}];
end
answer = vektor'