Resource optimization in embedded systems
based on data mining
Author: AKOO HEMATBOLLAND
Supervisor (KTH): Professor Timo Koski
Supervisor (Scania CV AB): Håkan Gustavsson
Master thesis
KTH Royal Institute of Technology
M.Sc. in Engineering Physics
SCI School of Engineering Sciences
Stockholm, Sweden 2008
Resource optimization of embedded systems based on Data Mining
AKOO HEMATBOLLAND
Degree project in mathematical statistics, 30 higher education credits,
within the Engineering Physics programme,
KTH Royal Institute of Technology, 2008
Examiner: Professor Timo Koski
Supervisor at Scania: Håkan Gustavsson
KTH Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm
URL: www.csi.kth.se
Sammanfattning (Swedish summary)
This degree project deals with resource optimization in embedded systems for Scania's trucks. The task was to analyze historical sales data in order to better understand the choices the customer makes. A large part of the work consisted of studying suitable methods for the analysis and of evaluating tools. Since large amounts of data are involved, Data Mining methods have been applied to extract relevant information about the customer's choices. The methods and tools were then tested by analyzing five different functions (functions A-E). For confidentiality reasons, the identities of the functions are not available in this public document. With the help of Data Mining, here chiefly Two Step Clustering with BIC (a Bayesian information criterion), increased knowledge about the customer's choices can reduce the company's costs and improve its relationship with the customer.
One surprising result was that a large share of all trucks sold with function A and a bodywork node went to Thailand. Another was that a large share of construction trucks with function B were sold to Dubai with a pneumatic brake system.
Future work of particular interest would be a more comprehensive case study that takes a large number of functions into account.
Abstract
This master thesis discusses resource optimization in embedded systems for Scania's trucks. It is about analyzing historical sales data to learn more about the customer's choices. A large part of the work focused on studying methods and tools appropriate for the analysis. Since we are dealing with a large amount of data, data mining techniques have been used to find relevant information about customer choice. Methods and tools (the main tool being Two Step Clustering with BIC, the Bayesian Information Criterion, and a log-likelihood distance measure) have been tested on five different functions (functions A-E). The identities of these functions have been suppressed in this public version of the final report. With Data Mining, a company's knowledge about customer choices can reduce costs and improve the value of customer relationships.
One example result was that a large proportion of all trucks with function A and BWS (Body Work System) were sold to Thailand. Another was that a large proportion of construction trucks with function B were sold to Dubai with a pneumatic brake system.
Future work of particular interest would be a more extensive case study that considers a larger number of functions.
Acknowledgements
This master thesis constitutes the final part of my education at the Royal Institute of Technology (KTH): an M.Sc. in Engineering Physics, with a specialization in financial mathematics and statistics. The work of this thesis has been carried out at the Scania department of pre-development (REP), part of the Systems Development division.
I would like to thank Håkan Gustavsson, my supervisor at Scania and Professor Timo Koski,
my supervisor at KTH. Thank you!
In particular I would like to thank Saddaf Shabbir at Vectuz Webwork AB for her in-depth
knowledge of programming and Ann Lindqvist at the Scania department of Diagnostic
Communications (RESD) for her knowledge of statistics. I would also like to thank my good
friends Anders Ingårda and Assad Alam. Thank you!
Finally I would like to thank my mother, my father and my sister – always supporting me. I
love you!
Södertälje, Sweden 2008
Akoo Hematbolland
Table of contents
1. Introduction ......................................................... 9
   1.1 Background ....................................................... 9
   1.2 The evolution of the automotive industry ........................ 11
   1.3 Problem statement ............................................... 12
   1.4 Research issues ................................................. 14
   1.5 Scania – a case study ........................................... 15
   1.6 ECU systems ..................................................... 15
   1.7 Problem statement revisited – Data Mining ....................... 18
       1.7.1 Data Mining process ....................................... 18
       1.7.2 What is meant by function? ................................ 18
   1.8 Large data set .................................................. 19
2. Data mining ......................................................... 20
   2.1 Introduction .................................................... 20
       2.1.1 Data Mining Tasks ......................................... 22
   2.2 Data preparation ................................................ 23
       2.2.1 Attributes and Measurement ................................ 23
       2.2.2 The Different Types of Attributes ......................... 24
   2.3 Cluster analysis ................................................ 26
       2.3.1 Hierarchical Clustering ................................... 27
       2.3.2 K-means Clustering ........................................ 27
       2.3.3 Gaussian Mixture Model .................................... 28
       2.3.4 Distance measure .......................................... 28
3. Binary Clustering ................................................... 29
   3.1 Mathematical criteria – A general clustering model for binary data 29
       3.1.1 K-means Clustering ........................................ 31
       3.1.2 The principle of Minimum Description Length (MDL) ......... 32
       3.1.3 Stochastic Complexity (SC) ................................ 32
   3.2 Two Step Cluster in SPSS ........................................ 33
       3.2.1 CF-tree .................................................... 33
       3.2.2 Cluster step .............................................. 34
       3.2.3 Log-Likelihood distance ................................... 34
       3.2.4 Auto Clustering using BIC ................................. 35
   3.3 Data Mapping RPM in VisuMap ..................................... 35
4. Analysis ............................................................ 36
   4.1 K-means in MatLab – A simple example ............................ 36
   4.2 Data Mapping in VisuMap ......................................... 38
   4.3 Two Step Clustering in SPSS ..................................... 39
       4.3.1 Clustering strategy ....................................... 39
   4.4 Results in SPSS ................................................. 39
       4.4.1 Function A ................................................ 40
       4.4.2 Function B ................................................ 42
       4.4.3 Function C ................................................ 44
       4.4.4 Function D ................................................ 46
       4.4.5 Function E ................................................ 48
   4.5 Change over time ................................................ 50
       4.5.1 Function D ................................................ 50
       4.5.2 Function E ................................................ 51
5. Discussion .......................................................... 52
   5.1 Tools ............................................................ 52
   5.2 Binarization ..................................................... 53
6. Related work – Stock market ......................................... 54
7. Future work – Function to function .................................. 55
8. Conclusion .......................................................... 56
9. References .......................................................... 57
10. Appendices ......................................................... 59
    10.1 Appendix A – Importance of "knowing your data" ................ 60
    10.2 Appendix B – Dependency Structure Matrix ...................... 62
    10.3 Appendix C – K-means in MatLab ................................ 65
Reading guide
Chapter 1 introduces the thesis. It provides a motivation, as well as a description of the
research issues and the assignment.
Chapter 2 introduces Data Mining techniques in general.
Chapter 3 discusses Binary Clustering.
Chapter 4 discusses the analyses performed in this thesis.
Chapter 5 outlines a discussion about the methods and tools.
Chapter 6 describes related work.
Chapter 7 presents future work.
Chapter 8 provides the conclusions of this thesis.
For the reader who can spend only a quarter of an hour on the thesis, the conclusions and the section outlining the problem statement may be of greatest interest.
The following parts of the thesis are the most relevant if you are…
Math student
The introductory sections (perhaps 1.7-1.8 and 2.3).
Binary Clustering (chapter 3).
Results, analysis and conclusion, as well as future work.
Importance of "Knowing your data" (Appendix A).
Scania employee
The introductory sections.
Data Mining introduction (Chapter 2.1).
Results, analysis and conclusion, as well as future work.
DSM-clustering (Appendix B).
1. Introduction
This chapter serves as an introduction to the thesis. It introduces the background, the evolution of the automotive industry, the problem statement and the research issues, and provides a motivation for this research.
1.1 Background
To facilitate people's lives in the modern information society, computer, automatic control and communication technology has been developed. Microcontrollers are used widely in electric appliances, automobiles, robots, scientific instruments and medical devices. The term embedded system indicates that computer and automatic control technology has permeated many kinds of products in our lives [1].
Until recently, reuse of software in the automotive industry has been almost entirely an activity of suppliers, who try to reduce the increasing software development costs that stem from the rising complexity and size of software in the modern automobile [4].
Today, not only the suppliers but also the manufacturers have to deal with the problem of reuse. The manufacturers additionally have to integrate the networked hardware components into one automotive system.
The automotive industry is facing a new challenge at the beginning of the third millennium: electronics is expected to account for 90% of the innovations, and of those, 80% will be in software [17]. The development of electronics will be affected by this major change, and there will be a need for more, and more highly interconnected, functionality. Mercer Management Consulting and Hypovereinsbank [2] have done a study that places a remarkably high value on software in the automotive industry. The study claims that by 2010, 13% of the production cost of a vehicle will be software (Figure 1.1).
Figure 1.1. – (Mercer Management Consulting and Hypovereinsbank, 2001)
To respond to this, the development process has to change, and the change must also include the methods for developing software in the automotive domain. Intensive work has been done in parts of this field, covering, among other things, requirements engineering, software quality and model-based software development [4]. The main targets are to decrease the software development time and to increase the quality of the software. Reuse of software is another target to include in this challenge.
Created mainly to meet the specific requirements and standards of the automotive industry for vehicles working in the field, modern controllers consist of numerous different Electronic Control Units (ECUs) based on embedded systems [1]. The Controller Area Network (CAN) configuration in construction machinery is made up of these different ECUs. Their main tasks include measuring, driving or operating control devices for sensor-actuator management, and carrying out a number of tasks in real time. Separating the hardware of an ECU from the embedded software is the main requirement for reusing software in the automotive domain. A few years ago, automotive manufacturers saw the ECUs of a car as single units: they defined them as black boxes when ordering them from the supplier, and tested the delivered samples as black boxes. The drawback of this approach for the manufacturers is that the software has to be developed from scratch for each new project if the supplier is changed, which causes expenses and increases development time. Responsibility for the whole electronic system is another subject that needs more consideration by the manufacturers, since a supplier's view covers only its part of the system. It is essential for the manufacturers to develop processes and methods that make software reuse at the system level possible; such methods will also enable the manufacturers to develop relevant software on their own in the future [4]. Since the truck industry follows the same development track as the car industry (with some latency), the truck manufacturers must deal with the same problem.
To many people, cars and trucks are the same product – the only difference being the size.
This is however far from the truth. There are major differences – differences that will be
explored in this section.
In an article, Zientz [5] discusses the differences between passenger cars and commercial vehicles:
“The main purpose of commercial vehicles is the transportation of goods. This means that the
manufacturers of commercial vehicles, unlike the passenger car sector, must deal with a wide
variety of trucks and special purpose vehicles. Trucks for example are produced in a wide
variety of combinations regarding the maximum load to be carried, the number of axles, the
size of the engine and the size of the truck cabin. The customer base for truck manufacturers
varies a lot, from private business owners to large haulage companies with a fleet of several
hundred trucks. These hauling companies have a strong purchasing power that may influence
cost and feature structures of the vehicle manufacturers. Most European truck manufacturers
are developing vehicles for the global market, in order to ensure necessary production
quantities. This globalisation brings additional challenges to the manufacturer with regard to
different customer demands, regional regulations and competition in regional markets. Hence,
the ability to address strong variation is a key success factor in this business.”
1.2 The evolution of the automotive industry
The growing importance of embedded systems within the automotive industry is a fact. High-end cars can contain well over 50 ECUs, whereas the truck sector is fairly diverse and the number of ECUs in trucks is more on the order of a dozen [5]. The cost of a typical ECU in a truck is approximately 1000 SEK [3]. See Figure 1.2 for the evolution of the number of ECUs in passenger cars.
Figure 1.2. Number of ECUs in recent car releases (Zelke, 2006)
A study made by McKinsey & Company in 2006 [6] expects the value of electronics in automobiles to increase from 25 percent at the time to 40 percent in 2015. According to this study, software and electronics drive about 70 to 90 percent of all innovations in cars, a figure that will increase further by 2015. Electronics is also seen as a major lever allowing manufacturers to differentiate their product offerings and expand into new markets. The McKinsey & Company statistics are for the passenger car sector, but they are also valid for the truck industry, which mostly follows the same evolutionary track, with some latency.
Consider the following example, taken from Erik Persson's thesis [3]:
A particular module supports the function cruise control as well as the function adaptive cruise control. Suppose that the majority of the customers request only cruise control. The module would then contain the code needed to implement adaptive cruise control, code that in this case is not being used. The customer does not have to pay for this latent code, so in a sense it is given away for free. (However, the function that this piece of code implements cannot be used by the customer.)
Adaptive cruise control requires a distance sensor in order to function, which the regular cruise control does not. For a vehicle configuration in which adaptive cruise control has not been chosen, the sensor is not mounted and hence incurs no cost. Yet there are other associated hardware costs: the memory has been dimensioned to harbour both functions, and when only cruise control is used this results in dead space in the memory. Only a fraction of all vehicles have both functions, making it likely that a smaller memory would be sufficient. The conclusion is that the resource utilization of this module is low, and hence its cost-efficiency as well. However, the adaptive cruise control logic is very complex and distributed, which makes it far from straightforward in practice to evaluate this function. An architecture can, as in the case of the cruise control, lead to a situation in which a customer choice incurs unnecessary cost because functionality is not used to its full extent. The architecture that uses the resources best is hence one in which the customer's choices have been taken into consideration.
The conclusion of this section is that electronics is very important from a financial and business perspective. For instance, relatively small savings at the component level may result in savings of nearly 10 million EUR [3] over the production period. Hence, manufacturers' knowledge about functionality is becoming more and more important.
1.3 Problem statement
In the previous section we saw that the cost of electronics has risen rapidly over the last years; a reduction of the product cost of the electronics system would thus have a significant impact on the total cost of the vehicle. A large part of current and future functionality is realized by the electronics system. This system consists of modular components with the same requirements as the traditional mechanical components. The electronics system in vehicles implements distributed functions that employ different hardware and software components in order to realize their functionality. The way in which components are allocated and connected is described by the architecture of the system.
Scania CV AB produces automotive products with a common product platform of modular components in order to keep the product cost low, maintain a high level of quality and offer the customer a maximized range of choice. The customer can virtually tailor the vehicle and may choose between many different functions such as cruise control, anti-spin, ESP, retarder, etc.
The system architecture is the same within a product family, but every produced vehicle can still be unique, as its configuration is chosen by the customer. Figure 1.3 shows that every vehicle has its own "DNA". Hence, there is an almost infinite set of variants, implying that it is virtually impossible to achieve a perfect architecture with respect to resource utilization.
Figure 1.3. Every truck has its own "DNA"
The "DNA" is described by special codes, which describe the physical configuration of every single vehicle. Their values can affect the parameters in one or several control units; our focus will be on the electronic control units.
The purpose of this work is to investigate the resource utilization by using historical sales data, which are described by the DNA codes. The goal is to gain knowledge about the customer's choices by looking at historical sales data for the electronic system of the vehicles.
Figure 1.4 shows how the electric system is allocated in a truck. The set of possible customer choices is enormous, and the number of optional functions can be very large. If we consider 20 functions simultaneously, each with 5 attributes, we get 5^20 = 95,367,431,640,625 different combinations.
Figure 1.4. 20 control units with 5 attributes each lead to astronomical numbers
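The arithmetic behind this number is a simple power: each of the 20 functions independently takes one of 5 attribute values. A minimal sketch, using only the counts given in the text above:

```python
# Size of the configuration space: each of 20 optional functions
# can take one of 5 attribute values, so there are 5^20 combinations.
n_functions = 20
n_attributes = 5

combinations = n_attributes ** n_functions
print(combinations)  # 95367431640625
```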
We will pick out five functions (functions A-E) which depend on DNA codes in the vehicle. These codes can denote components, control units, countries and vehicle types. From a database containing historical sales data we will then pick out the subset of vehicles in which each function is included. The problem is then to find patterns among the trucks based on control units, segment and countries.
Figure 1.5 below shows the abstract model of this work. The inputs are the given functions (A-E) and the historical data, both described by DNA codes; the output is statistics on control units, segment and country.
Figure 1.5. Abstract model of this work
An architecture may in some cases lead to a situation where a customer choice incurs unnecessary costs, as the function requires particular hardware that would otherwise not be necessary. This implies that the architecture that utilizes the resources best is one in which it has been taken into account how the product family is configured with respect to customer choice [3]. In many cases, numerous modular components employed in the electronics system have functions that are not being used, as the customer may have chosen a more low-end, less advanced configuration of the vehicle. The results of this work can be used as support when making architectural decisions.
1.4 Research issues
The purpose of this master thesis is to investigate how historical sales data can be used to find out more about customers' choice of functions. By applying Data Mining techniques we will try to find patterns in our data. If we do not find any interesting pattern for any of our five functions, we will be forced to pick other functions.
In order to fulfil the purpose of this master thesis, four research questions were initially formulated:
Research issues
1 – Can we find patterns where several vehicles use similar configurations of the electronic control units?
Patterns of interest:
2 – What kind of truck is it (segment: distribution, long-haulage or construction)?
3 – Which countries have bought these trucks?
4 – How does the pattern change over time?
1.5 Scania – a case study
Scania is one of the world’s leading manufacturers of trucks and buses for heavy transport
applications. A growing proportion of the company’s operations consist of services. Scania
operates in about 100 countries and employs almost 33 000 people. Research and development are concentrated in Södertälje, Sweden, and production units are located in Europe and Latin America. This master thesis has been carried out at REP, which is a
department of the division Systems Development. REP mainly works with pre-development
of systems and functions realized by electronics and software, and has no responsibility for
parts in production. The department develops vehicle functionality, as well as works with the
long term improvement of methods for systems development, e.g. methods for system
modelling.
Moreover, REP has the responsibility to co-ordinate the pre-development within Systems
Development and to keep contacts with universities, institutes and research programs in the
area. The supervisor from the Scania side has been Håkan Gustavsson, currently pursuing a
PhD within the project Decision methods for E/E-system Architectural Design (DAD).
1.6 ECU systems
An overview of how the electrical system has been designed in Scania’s vehicles is giving in
this section. The network that links the control units plus some of the systems included in this
network is described.
The Electric Control Unit ECU systems write and read “packets” of digital information in a
network called Control Area Network. There are approximately 30 control units (ECUs)
which are linked together in the CAN network. This means that in a vehicle with advance
specifications (high-end), most of the systems interchange information over the CAN network.
The advantage is that the driver and mechanic are able to gain more information about the
condition of the vehicle and regarding any faults [22].
This makes the troubleshooting both simpler and faster. Furthermore, it enables the mechanics
to change functions in the ECU systems. The CAN network shown in the figure 1.6 contains
18 ECU systems. However, there are only five ECUs in the simplest vehicle (low-end).
Figure 1.6. Location of the ECUs that can be part of the CAN network in an advanced vehicle
To reduce the risk of interference from less important messages (radio, ACC, ATA, etc.) with messages between the most important ECU systems (coordinator, brakes, engine and gearbox), the important systems are linked together on a special CAN bus (the red bus). The other systems are divided between two further CAN buses, called the yellow and the green bus [22].
The two figures 1.7 and 1.8 below show the network structure for two different truck configurations. The first is a high-end version, where the customer has chosen almost all of the available functionality; as a consequence, more than 20 ECUs are required to implement this functionality (each box represents an ECU).
[Figures 1.7 and 1.8 are network diagrams. Figure 1.7 shows the ECUs of the high-end configuration, connected via the red, green and yellow CAN buses, a diagnostic bus and a body builder truck connection: COO (Coordinator system), AUS (Audio System), CSS (Crash Safety System), LAS (Locking and Alarm System), ACC (Automatic Climate Control), GMS (Gearbox Management System), EMS (Engine Management System), EEC (Exhaust Emission Control), WTA (Water-To-Air auxiliary heater system), ICL (Instrument Cluster System), TCO (Tachograph System), CTS (Clock and Timer System), RTI (Road Transport Informatics system), VIS (Visibility System), APS (Air Processing System), BWS (Body Work System), BMS (Brake Management System) and SMS (Suspension Management System).]
Figure 1.7. The embedded system of a high-end version of a Scania vehicle
Figure 1.8. The embedded system of a low-end version of a Scania vehicle
1.7 Problem statement revisited – Data Mining
This section revisits the problem statement by highlighting the role of data mining.
1.7.1 Data Mining process
The data from the database will be represented by a large matrix. This work is delimited to viewing the customer's choice of functions as Boolean variables (0/1). This means that the customer's choices are represented by binary attributes; hence the matrix will be built from binary values. The structure in the data set will be investigated with Data Mining techniques. The final part is to visualize the structure in the resulting binary matrix. Figure 1.9 shows the process.
Figure 1.9. Data mining process for this work
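The binarization step described above can be sketched as one-hot encoding of categorical sales records into a 0/1 matrix. The field names and values below are invented stand-ins, since the real DNA codes are confidential:

```python
# One-hot encode categorical sales records into a binary matrix (a sketch;
# the attribute names are invented, not Scania's actual DNA codes).
records = [
    {"country": "SE", "segment": "long-haulage", "ecu": "EMS"},
    {"country": "TH", "segment": "construction", "ecu": "BWS"},
]

# Collect the attribute vocabulary: one column per (field, value) pair.
columns = sorted({(k, v) for rec in records for k, v in rec.items()})

# Build the binary matrix: row i is truck i, entry 1 if the truck has that attribute.
matrix = [[1 if rec.get(k) == v else 0 for (k, v) in columns] for rec in records]

for row in matrix:
    print(row)
```

Each row is then a binary vector of the kind clustered in the later chapters.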
1.7.2 What is meant by function?
Consider the following example to highlight the concept of a function:
Call a fictitious (not realistic) function temperature display. This function uses a sensor and two ECUs to display the temperature. All of these components are described by DNA codes. Here, the ICL and ACC are control units, while the sensor and the temperature display are components in the vehicle. A mathematician could describe this function as:
f(x1, x2, x3, x4) = f(DNA23, DNA4, DNA1, DNA3) = f(sensor, ICL, COO, display)
In this work, five functions will be picked out for analysis. Architects at Scania will help us find appropriate functions for the analysis. These functions depend on DNA codes in the vehicle, which can be both components and control units:
f1(x1, x2, ...) = A
f2(y1, y2, ...) = B
f3(z1, z2, ...) = C
f4(q1, q2, ...) = D
f5(w1, w2, ...) = E
Once the functions are defined, we will look at historical sales data (approximately a quarter of a million trucks from our database) to find the vehicles using each function.
We will now investigate whether we can find patterns based on the control units, countries and segments of these vehicles. Notice that we are only looking at control units, even though the functions are correlated with other functions. See future work for more information.
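Selecting the subset of vehicles that carry a given function then amounts to a set-containment test on its DNA codes. A sketch reusing the fictitious codes from the temperature-display example (the vehicle records themselves are invented):

```python
# A function is realized by a set of DNA codes; a vehicle "has" the function
# when its configuration contains all of them. Codes follow the fictitious
# temperature-display example; the vehicle records are made up.
function_codes = {"DNA23", "DNA4", "DNA1", "DNA3"}

vehicles = [
    {"id": 1, "codes": {"DNA1", "DNA3", "DNA4", "DNA23", "DNA99"}},
    {"id": 2, "codes": {"DNA1", "DNA3"}},
]

# Keep only the vehicles whose code set contains every code of the function.
subset = [v["id"] for v in vehicles if function_codes <= v["codes"]]
print(subset)  # [1]
```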
The following binary matrix shows n trucks with a common function. The rows are the cases (trucks in this case) and the columns describe attributes: electronic control units, countries and segment (distribution, long-haulage or construction). The values shown are illustrative:

            Attribute 1   Attribute 2   Attribute 3   ...   Attribute k
Truck 1          0             1             1        ...        0
Truck 2          0             0             0        ...        1
Truck 3          0             0             1        ...        0
...             ...           ...           ...       ...       ...
Truck n          1             0             0        ...        0
A one in the above matrix indicates that the vehicle has the given control unit, country or
segment, and a zero indicates that it does not.
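The construction of this binary matrix can be sketched in a few lines of Python. This is a hedged illustration: the truck records, attribute vocabulary and values below are invented examples, not actual sales data.

```python
# Hypothetical sketch: turn truck records (ECUs, country, segment)
# into the 0/1 case-by-attribute matrix described above.

def binarize(trucks, attributes):
    """Return one 0/1 row per truck over the given attribute list."""
    matrix = []
    for truck in trucks:
        present = set(truck["ecus"]) | {truck["country"], truck["segment"]}
        matrix.append([1 if a in present else 0 for a in attributes])
    return matrix

trucks = [
    {"ecus": ["ICL", "COO"], "country": "SE", "segment": "long-haulage"},
    {"ecus": ["ICL"],        "country": "TH", "segment": "distribution"},
]
attributes = ["ICL", "COO", "SE", "TH",
              "long-haulage", "distribution", "construction"]

for row in binarize(trucks, attributes):
    print(row)  # [1, 1, 1, 0, 1, 0, 0] and [1, 0, 0, 1, 0, 1, 0]
```

Each attribute (control unit, country or segment) becomes one binary column, exactly as in the matrix above.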
1.8 Large data set
Since the historical data set from the database is large, this problem must be tackled with data
mining techniques. When dealing with binary data (binary vectors) one has to use appropriate
methods and algorithms to classify the data. Moreover, reliable patterns and visualization of
the patterns depend on the nature of the data and the chosen distance measure. Finally, the right
tool (software) must be found. "Right tool" means software that can handle such a large data
set and data type, and that contains the algorithms needed for this problem.
The purpose of this work is to investigate the resource utilization in automotive embedded
systems. The initial phase of the work consisted of formulating a problem statement and
research issues. The next step consisted of studying methods and algorithms based on data
mining techniques. Once the data preparation was finished, the methods and algorithms were
tested. The main tool was SPSS's TSC (Two Step Clustering). SPSS is powerful software
made for statistical analysis [23]. The final part of the work was to visualize the results in
SPSS. The literature survey formalized the background of the problem. It laid the foundation
for the theoretical framework used to evaluate the resource efficiency and to compare
different methods. The literature survey included doctoral theses, Scania internal documents,
published books, articles in various journals and other publications.
(Work flow: Problem statement → Data Mining → Binary Clustering → Analysis Tools → SPSS → Visualization → Conclusions)
2. Data mining
This chapter describes data mining techniques. It also covers data preparation, some
clustering methods and distance measures.
2.1 Introduction
Today, vast amounts of data are collected and stored in computers, with the aim of
extracting useful information later. The relevant information is not known at the initial time
of collection, and therefore the database is not designed to distil any particular information [8].
The nature of the data in the database is unstructured. The science of extracting useful
information from large data sets is usually referred to as "Data Mining" or "Knowledge
Discovery from Data". Hence data mining is the process of sorting through large amounts of
data and picking out relevant information [7]. Here data can be any facts, numbers or text that
can be processed by a computer. Patterns, associations and relationships in the data can
provide information. Figure 2.1 shows the concept of data mining – finding relevant
information in a large data set.
Figure 2.1. Data mining – finding relevant information in large amounts of data
There are many different application areas for data mining, ranging from scientific
applications such as the classification of volcanoes on Venus to internet search engines. Data
mining includes techniques from computer science, statistics, data analysis and
optimization, to name a few. This makes it an interdisciplinary science [8].
Data mining is an integral part of Knowledge Discovery in Databases (KDD) [9], which is the
process of converting raw data into useful information, as shown in Figure 2.2. This process
consists of a series of transformation steps, from data pre-processing to post-processing of
data mining results.
Figure 2.2. The process of knowledge discovery in databases (KDD): Input Data → Data Pre-processing (feature selection, dimensionality reduction, normalization, data subsetting) → Data Mining → Post-processing (filtering patterns, visualization, pattern interpretation) → Information
The input data can be stored in a variety of formats (flat files, spreadsheets or relational
tables). In this work the input was a relational table, which was imported into the software SAS
(Statistical Analysis System) [35] from internal software connected to the database.
Once the appropriate functions were imported into SAS, a binarization of the data was made.
From SAS, the data was exported to different software packages for the analysis. The purpose of
pre-processing is to transform the raw input data into an appropriate format for subsequent
analysis. The steps involved in data pre-processing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and selecting records and features
that are relevant to the data mining task at hand. The pre-processing part of this work was to
select records, binarize the data and handle missing values. Because of the many ways
data can be collected and stored, data pre-processing is perhaps the most laborious and
time-consuming step in the overall knowledge discovery process [10].
An example of post-processing is visualization. Data visualization is the display of
information in a graphic or tabular format. Successful visualization requires data to be
converted into a visual format so that the properties of the data and the relationships among
data items can be analyzed [9]. The visualization part of this work was the graphic
presentation of the results. Since binary data is hard to visualize, this was done in two
different ways: one in visualization software called VisuMap, which offers methods to
visualize high-dimensional data [25], and the other by exporting the results to Excel and
making the graphic presentation there (see chapter 4).
2.1.1 Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Supervised learning: predictive tasks
The objective of these tasks is to predict the value of a particular attribute based on the values
of other attributes. The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the prediction are known as the
explanatory or independent variables.
Unsupervised learning: descriptive tasks
Here, the objective is to derive patterns (correlations, trends, clusters) that summarize the
underlying relationships in the data [9]. Descriptive data mining tasks are often exploratory in
nature and frequently require post-processing techniques to validate and explain the results.
Cluster analysis, which is unsupervised learning, will be used in this work. Cluster analysis
seeks to find groups of closely related observations, so that observations that belong to the
same cluster are more similar to each other than to observations in other clusters.
Clustering has, for example, been used to group sets of related customers and to find areas of
the ocean that have a significant impact on the Earth's climate [10]. An example of the
importance of data preparation is given in Appendix A.
2.2 Data preparation
New research in data mining is often driven by the need to accommodate new application
areas and their new types of data [10]. Data that is to be analyzed can differ in several ways.
The attributes used to describe data objects can be quantitative or qualitative, and different
data types require different tools and methods to analyze the data. Hence it is vital to
represent the data in a way that suits the methods used.
The quality of the data is often far from perfect, owing to the presence of noise, missing values
and inconsistent or duplicate data. Most data mining techniques can handle some
imperfections, but the result is often improved if the quality of the data is increased. In this
work, missing values were handled by searching through the data set and replacing them
with suitable values. Moreover, many if-statements were used to pick out the chosen
functions and merge the data in order to create a correct binary data matrix.
Once again, the pre-processing part in Figure 2.2 above is one of the most important steps in
the data mining process [9]. Pre-processing data is all about making data more suitable for
data mining and analysis.
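A minimal sketch of these cleaning steps, assuming a hypothetical encoding where None marks a missing value (the rows and the selection rule are invented for illustration):

```python
# Sketch: select the records for one function, replace missing values
# (None) with 0 ("attribute not present") and drop duplicate rows.

def preprocess(rows, has_function):
    selected = [r for r in rows if has_function(r)]
    cleaned = [[0 if v is None else v for v in r] for r in selected]
    seen, unique = set(), []
    for r in cleaned:                 # drop exact duplicates, keep order
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

rows = [[1, None, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(preprocess(rows, has_function=lambda r: r[0] == 1))
# → [[1, 0, 0], [1, 1, 0]]
```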
2.2.1 Attributes and Measurement
A data set usually contains a collection of data objects, also called records, points or
observations. Data objects have different attributes that describe the property of an object,
such as the mass or colour of the object. The definition of an attribute is a property or
characteristic of an object that may vary, either from one object to another or from one time to
another. In this work the records were trucks and the attributes were Electronic Control Units,
segment and countries.
In practice, attributes need not be numbers or symbols, but to analyze their characteristics
we can assign them such values, and for that a measurement scale is needed.
The definition of a measurement scale is a rule or function that associates a numerical or
symbolic value with an attribute of an object [9]. This is needed to handle the data
effectively and correctly. Since it is possible to assign different measurement scales to an
attribute, it is obvious that the properties of an attribute need not be the same as the properties
of the values used to measure it. In this work the measurement scale was categorical, since the
attributes were binary (see table 2.1).
The type of an attribute says what properties of the attribute are represented by the values
used to measure it. It is vital to understand and know the type of an attribute, in order to reach
correct conclusions from the resulting analysis.
2.2.2 The Different Types of Attributes
The different types of attributes are derived from the following operations that can be
performed on numbers:
1. Distinctness: = and ≠
2. Order: <, ≤, > and ≥
3. Addition: + and −
4. Multiplication: * and /
From these properties, the four types of attributes are defined: nominal, ordinal, interval and
ratio. Table 2.1 gives a summary of the different types.
Attribute Type   Description                                        Examples
------------------------------------------------------------------------------------------
Nominal          The values of a nominal attribute are just         binary values, eye
(categorical,    different names; nominal values provide only       colour, gender
qualitative)     enough information to distinguish one object
                 from another. (=, ≠)

Ordinal          The values of an ordinal attribute provide         {good, better, best},
(categorical,    enough information to order objects. (<, >)        grades, street numbers
qualitative)

Interval         For interval attributes, the differences           calendar dates,
(numeric,        between values are meaningful, i.e., a unit        temperature in Celsius
quantitative)    of measurement exists. (+, −)

Ratio            For ratio attributes, both differences and         monetary quantities,
(numeric,        ratios are meaningful. (*, /)                      counts, age, mass,
quantitative)                                                       length, electrical
                                                                    current

Table 2.1. Different attribute types
Nominal and ordinal attributes are so-called categorical or qualitative attributes, and most
operations performed on numbers have no meaning for this data.
A discrete attribute can only have a finite set of values. These are often represented by integer
variables, and a special case is binary attributes, which take only two different values,
representing true/false, yes/no, male/female etc. In this work the data set is represented by
Boolean values that can only be 1 or 0 [24].
Interval and ratio attributes, on the other hand, are quantitative or numeric attributes where the
data represents actual values, and they hold the properties of numbers. These attributes can be
both integer-valued and continuous. Continuous attributes have real numbers as their values
and are often represented as floating-point variables in data sets.
One way to distinguish between attributes is by the number of values they can take.
Any measurement scale type (nominal, ordinal, interval or ratio) can be combined with any
number of attribute values (binary, discrete or continuous), but some combinations are not
practical. Typically, nominal and ordinal attributes are discrete or binary, while interval and
ratio attributes are continuous, since they represent realistic data. But this does not always hold:
count attributes, for instance, are discrete but also ratio attributes [9].
2.3 Cluster analysis
Clustering is a popular data mining technique. Cluster analysis divides data into groups (clusters)
that are meaningful, useful or both. Classification of data is a fundamental tool in pattern
recognition and vector quantization, which are applied in image processing and computer
vision [11]. Cluster analysis groups data objects based only on information found in the data
that describes the objects and their relationships. The goal is that the objects within a group
should be similar to one another and different from the objects in other groups. The greater the
similarity (or homogeneity) within a group and the greater the difference between groups, the
better the clustering. The ability to classify things is undoubtedly one of the key features of
human intelligence. It is also well known that the clustering problem is a difficult one, and we
have to resort to approximate solutions [12].
Figure 2.3 shows a set of data points in 3D. Assuming we know that there are two clusters, we
can easily determine visually which points belong to which class. A clustering algorithm takes
the complete set of points and classifies them using some distance measure.
Figure 2.3. Two clusters in R³
When dealing with unsupervised learning, the number of clusters is not always clear. Moreover,
the measure of similarity depends on the application. The three most popular clustering methods
are described in the following sections: hierarchical clustering, K-means clustering and the
Gaussian Mixture Model. DSM (Dependency Structure Matrix) clustering, which can be used
in future work when one considers a large set of functions simultaneously, is described in
Appendix B.
2.3.1 Hierarchical Clustering
Hierarchical clustering groups data over a variety of scales by creating a cluster tree or
dendrogram [13]. Figure 2.4 shows an example of hierarchical clustering using a
dendrogram. In this case, two of the figures are similar in all respects except that one has a
white stomach; the other cases are less similar (because of the colour, and one is an
angry boy!).
Figure 2.4. Hierarchical clustering using a dendrogram (similar cases join at a low level,
dissimilar cases only at the top)
The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one
level are joined as clusters at the next level. This method allows deciding the level of
clustering that is most appropriate for the application at hand. A characteristic of this method
is that it produces a sequence of partitions in one run. The main method in this work is based
on a modified version of hierarchical clustering which is called Two Step Clustering. The
TSC is described in chapter 3.
2.3.2 K-means Clustering
K-means clustering is a partitioning method. Unlike hierarchical clustering, this method
operates on actual observations rather than on the larger set of dissimilarity measures. It creates
only one level of clusters and treats each observation in the data as an object having a location
in space [14]. In this work only a simple example of K-means is given. The algorithm will be
demonstrated in MatLab using the Hamming distance (see chapter 2.3.4).
The disadvantage of this method is that the number of clusters is unknown and must be
specified in advance. The algorithm needs to run multiple times (once for each number of
clusters) to generate a sequence of partitions.
K-means finds a partition in which objects within each cluster are as close to each other as
possible, and as far from objects in other clusters as possible. This clustering method uses an
iterative algorithm that minimizes the sum of distance from each object to its cluster centroid,
over all clusters. A cluster centroid (or just centre) is defined as the vector of cluster means of
each variable.
2.3.3 Gaussian Mixture Model
The Gaussian Mixture Model forms clusters by representing the probability density function of
the observed variables as a mixture of multivariate normal densities. Mixture models are fitted
with expectation maximization (EM), which assigns posterior probabilities to each component
density with respect to each observation [15]. Clusters are assigned by selecting the
component that maximizes the posterior probability. Like K-means clustering, GMM uses an
iterative algorithm.
2.3.4 Distance measure
A very important step in any clustering is to select the right distance measure (metric or distance
function), which determines how the similarity between two elements is calculated [8].
The shape of the clusters will be affected, as some elements may be close to one another
according to one distance and farther away according to another. In this work the metric will
be the probability-based log-likelihood measure. The distance between two different
sub-clusters is related to the decrease in likelihood as they are combined into one
cluster [10]. In calculating the log-likelihood, a multinomial distribution is assumed, since we
are dealing with categorical variables (see table 2.1). It is also assumed that the trucks and their
binary attributes are independent of each other. The metric is defined by
d(i, k) = ξ_i + ξ_k − ξ_⟨i,k⟩

ξ_j = −N_j · Σ_{n=1..K} Ê_jn

Ê_jn = −Σ_{l=1..L_n} (N_jnl / N_j) · log(N_jnl / N_j)

where d(i, k) is the distance between clusters i and k, ⟨i, k⟩ is the index that represents the
cluster formed by combining clusters i and k, N_j is the number of trucks in cluster j and N_jnl is
the number of trucks in cluster j whose n-th variable takes the l-th category. K is the total
number of variables and L_n is the number of categories for the n-th variable.
Another distance function is the Hamming distance, which measures the minimum number of
substitutions required to change one vector into the other. The metric is defined by

d(i, k) = number of places where i and k disagree

Figure 2.5 shows the Hamming distance between two binary vectors, which equals 2:

vector x = 1011101 and vector y = 1001001

Figure 2.5. The Hamming distance between x and y is 2

There are other metrics for binary data, such as Jaccard, Russel & Rao, Sokal & Sneath and
Dice to name a few, but as stated above, the focus in this work will be on the log-likelihood
metric.
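Both measures above can be sketched for binary data. This is a hedged illustration: for binary variables every L_n = 2 (the categories 0 and 1), and the toy clusters are invented.

```python
from math import log

def hamming(x, y):
    """Number of places where two equal-length vectors disagree."""
    return sum(a != b for a, b in zip(x, y))

def xi(cluster):
    """xi_j = -N_j * sum over variables of the category entropy E_jn."""
    n_cases = len(cluster)
    total = 0.0
    for n in range(len(cluster[0])):           # loop over variables
        ones = sum(row[n] for row in cluster)
        for count in (ones, n_cases - ones):   # categories 1 and 0
            if count:
                p = count / n_cases
                total += -p * log(p)           # entropy term
    return -n_cases * total

def ll_distance(ci, ck):
    """d(i, k): decrease in log-likelihood when clusters i and k merge."""
    return xi(ci) + xi(ck) - xi(ci + ck)

print(hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]))  # → 2
ci = [[1, 0], [1, 0]]
ck = [[0, 1], [0, 1]]
print(ll_distance(ci, ck) > 0)  # merging dissimilar clusters costs likelihood
```

A pure cluster has zero entropy, so merging two identical clusters gives d(i, k) = 0, while merging dissimilar clusters gives a positive distance.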
3. Binary Clustering
The main focus of this work is on classifying (clustering) data consisting of binary vectors.
Here clustering means dividing the set of binary vectors into disjoint subsets (i.e.
clusters or subclasses) in such a way that the cost of the classification is minimal.
To measure the cost of classification one can use error measures such as the MSE (mean square
error) or more complex ones such as stochastic complexity. We shall discuss both
and describe the Two Step Clustering algorithm in SPSS.
3.1 Mathematical criteria – A general clustering model for
binary data
Suppose a set B^t of binary vectors of the form X^l = (x_1^l, x_2^l, …, x_d^l), where x_i^l ∈ {0, 1}. Then the set
is described as follows:

B^t = { X^l | l = 1, 2, …, t }

Suppose now that we want to classify B^t into k disjoint classes C = (C_1, C_2, …, C_k), where
C_j = { X^l | l = 1, 2, …, t_j } and j ∈ {1, 2, …, k}. Then for each class C_j one computes the
number of ones in each column i by:

t_ij = Σ_{l=1..t_j} x_i^l    (1)
Assume now that the distance function (metric) from each vector to its class is given by

d(x^l, C_j)

This distance can be the Euclidean distance, the Hamming distance, the log-likelihood distance
or some other distance function. The total error can then be expressed as follows:

Error(B^t, C) = Σ_{j=1..k} Σ_{l=1..t_j} d(x^l, C_j)    (2)
We first present a general model for the binary clustering problem based on the mean square
error. The model is specified as follows:

W = A X Bᵀ + E    (3)

where E is the error component. The first term A X Bᵀ characterizes the information in the
binary data set W = (w_ij)_{n×m} that can be described by the cluster structures. A and B explicitly
designate the cluster membership for data points and features, respectively. X specifies the
cluster representation.
Let Ŵ denote the approximation A X Bᵀ; the goal is to minimize the approximation error.
Before the minimization, let us define the Frobenius norm [37] of a matrix M = (M_ij):

‖M‖_F = √( Σ_{i,j} M_ij² )

The sum of squared errors is now:

Error(A, X, B) = ‖W − Ŵ‖_F² = Trace[ (W − Ŵ)(W − Ŵ)ᵀ ] = Σ_{i=1..n} Σ_{j=1..m} (w_ij − ŵ_ij)²    (4)

= Σ_{i=1..n} Σ_{j=1..m} ( w_ij − Σ_{k=1..K} Σ_{c=1..C} a_ik b_jc x_kc )²

where K is the number of clusters for data points and C is the number of clusters for features.
Suppose now that

A = (a_ik),  a_ik ∈ {0, 1}
B = (b_jc),  b_jc ∈ {0, 1}

and

Σ_{k=1..K} a_ik = 1
Σ_{c=1..C} b_jc = 1

so that A and B denote the data and feature memberships, respectively. Based on equation (4)
above we obtain

Error(A, X, B) = ‖W − Ŵ‖_F² = Σ_{i=1..n} Σ_{j=1..m} ( w_ij − Σ_{k=1..K} Σ_{c=1..C} a_ik b_jc x_kc )²
= Σ_{k=1..K} Σ_{c=1..C} Σ_{i∈P_k} Σ_{j∈Q_c} (w_ij − x_kc)²

where i ∈ P_k means that the i-th data point belongs to cluster P_k and j ∈ Q_c means that the
j-th feature belongs to cluster Q_c.

For fixed P_k and Q_c, the optimal X is obtained by

x_kc = (1 / (p_k · q_c)) · Σ_{i∈P_k} Σ_{j∈Q_c} w_ij

Hence X can be thought of as the matrix of centroids for the simultaneous clustering problem.
X represents the associations between the data clusters and the feature clusters.
Error(A, X, B) can then be minimized via an iterative procedure with the following steps:

1. Given X and B, the feature partition Q is fixed. Error(A, X, B) is then minimized by

   â_ik = 1 if Σ_{c=1..C} Σ_{j∈Q_c} (w_ij − x_kc)² ≤ Σ_{c=1..C} Σ_{j∈Q_c} (w_ij − x_lc)² for all l = 1, …, K, l ≠ k,

   and 0 otherwise.

2. Given X and A, the data partition P is fixed. Error(A, X, B) is then minimized by

   b̂_jc = 1 if Σ_{k=1..K} Σ_{i∈P_k} (w_ij − x_kc)² ≤ Σ_{k=1..K} Σ_{i∈P_k} (w_ij − x_kl)² for all l = 1, …, C, l ≠ c,

   and 0 otherwise.

3. Given A and B, X is computed by:

   x_kc = (1 / (p_k · q_c)) · Σ_{i∈P_k} Σ_{j∈Q_c} w_ij
3.1.1 K-means Clustering
Consider equation (3) above:

W = A X Bᵀ + E

If we choose B = I_{m×m} (the identity matrix), the general model reduces to K-means
clustering (grouping data points into clusters). Hence

W = A X + E

Suppose now that A = (a_ik) with a_ik ∈ {0, 1} and Σ_{k=1..K} a_ik = 1. The optimization model
reduces to

Error(A, X, B) = ‖W − Ŵ‖_F² = Trace[ (W − AX)(W − AX)ᵀ ] = Σ_{i=1..n} Σ_{j=1..m} ( w_ij − Σ_{k=1..K} a_ik x_kj )²

= Σ_{i=1..n} Σ_{k=1..K} a_ik Σ_{j=1..m} (w_ij − x_kj)²
= Σ_{i=1..n} Σ_{k=1..K} a_ik Σ_{j=1..m} (w_ij − y_kj)² + Σ_{k=1..K} p_k Σ_{j=1..m} (y_kj − x_kj)²

where p_k = Σ_{i=1..n} a_ik and y_kj = (1 / p_k) Σ_{i=1..n} a_ik w_ij.

Hence, given A, the error Error(A, X, B) is minimized by setting

x_kj = y_kj = (1 / p_k) Σ_{i=1..n} a_ik w_ij
3.1.2 The principle of Minimum Description Length (MDL)
In order to compress several data vectors together in an optimal manner, one needs to capture
all the common regularities found in the data. The more similar the data vectors in a cluster
are, the better the cluster can be compressed [38]. The sum of all the compressed clusters (the
total code length) is a criterion that captures the dependence between the clusters. The overall
idea is to choose a representation of the data which lets one express it with the shortest message
via a postulated set of models. The code length hence offers a universal scale, making it
possible to compare clusterings of different complexity. The "message" or "description" length
is traditionally measured in bits [39].
3.1.3 Stochastic Complexity (SC)
Stochastic complexity in the minimum description length framework is a central concept in
statistical modelling. Older formalizations of SC are the marginal likelihood and BIC (the
Bayesian Information Criterion); the modern formalization is the Normalized Maximum
Likelihood. SC is the shortest description length of a given data set relative to a model class
𝔐 [38].
The model class 𝔐 can be defined as a set of parametric distributions indexed by elements θ of
Θ ⊆ B^t:

𝔐 = { P(x | θ), θ ∈ Θ }

The maximum likelihood model in the model class 𝔐 with respect to the data set x is

θ̂(x) = arg max_{θ∈Θ} P(x | θ, 𝔐)

Define the stochastic complexity as the result of the following minmax optimization problem
over densities Q [39]:

SC_BIC: P_BIC(x′) = arg min_Q max_{x′} ( log P(x′ | θ̂(x′), 𝔐) − log Q(x′) )

The solution to this minmax problem is

P_BIC(x) = P(x | θ̂(x), 𝔐) / Σ_{x′} P(x′ | θ̂(x′), 𝔐)
3.2 Two Step Cluster in SPSS
The SPSS Two Step Cluster (TSC) method is a modified version of hierarchical cluster
analysis designed to handle very large data sets. The main idea is to pre-cluster the cases (the
trucks in this work) into many small sub-clusters and then cluster the sub-clusters resulting
from the pre-cluster step into the desired number of clusters.
The pre-cluster step uses a sequential clustering approach. It scans the cases one by one and
decides whether the current case should be merged with a previously formed cluster or start a
new cluster, based on the distance criterion. The procedure constructs a modified cluster
feature (CF) tree [40]. The CF tree consists of levels of nodes, and each node contains a number
of entries. An entry in a leaf node represents a final sub-cluster [10]. The internal nodes and
their entries are used to guide a new case to the correct leaf node. Each entry is characterized
by its CF, which consists of counts for each category of the categorical variables (binary here).
Procedure of TSC:

Step 1: Pre-cluster the data into sub-clusters
1. Compute the cluster features (CFs)
2. Build the CF tree

Step 2: Group the sub-clusters into clusters
1. Calculate the BIC for each number of clusters
2. Refine the initial estimate
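The sequential scan in the pre-cluster step can be sketched as a simple "leader" pass. This is a hedged sketch: plain Hamming distance stands in for the log-likelihood criterion, and the threshold T and toy cases are invented.

```python
# Each incoming case joins the nearest existing sub-cluster if it is
# within the threshold T; otherwise it starts a new sub-cluster.

def precluster(cases, T):
    leaders, members = [], []
    for case in cases:
        best, dist = None, None
        for idx, leader in enumerate(leaders):
            d = sum(a != b for a, b in zip(case, leader))
            if dist is None or d < dist:
                best, dist = idx, d
        if best is not None and dist <= T:
            members[best].append(case)     # absorb into the sub-cluster
        else:
            leaders.append(case)           # start a new sub-cluster
            members.append([case])
    return members

cases = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
subclusters = precluster(cases, T=1)
print(len(subclusters))  # → 2
```

The real TSC additionally organizes the sub-clusters in a CF tree (section 3.2.1) so that the nearest leader can be found without scanning them all.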
3.2.1 CF-tree
The information that is maintained about a cluster is summarized in a clustering feature [10]. A
CF tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering. An internal node in the tree has "children" and stores the sums of the CFs of
its children. Hence an internal node represents a cluster made up of all the sub-clusters
represented by its entries. A leaf node likewise represents a cluster made up of all the
sub-clusters represented by its entries. A CF tree has two parameters: a branching factor B,
which specifies the maximum number of children, and a threshold T. The size of any entry has
to be less than the threshold [40]. There is also a limit on the number of entries in a leaf node.
Figure 3.1 shows a CF tree with branching factor B and leaf nodes with at most L entries.
Each case, starting from the root node, is recursively guided by the closest entry (at each
level, choose the sub-tree whose centroid is closest) to find the closest child node,
and descends along the CF tree. Upon reaching a leaf node, it finds the closest leaf entry in
the leaf node. If the case is within the threshold T of the closest leaf entry, it is absorbed into
the leaf entry and the CF of that leaf entry is updated. Otherwise it starts its own leaf entry in
the leaf node. If there is no space in the leaf node to create a new leaf entry, the leaf node is
split into two: the entries in the original leaf node are divided into two groups using the
farthest pair as seeds, and the remaining entries are redistributed based on the closeness
criterion (distance measure). If the CF tree grows beyond the allowed maximum size, it is
rebuilt from the existing CF tree with an increased threshold. The rebuilt CF tree is smaller and
has space for new cases. This process continues until a complete data pass is finished [10].
Figure 3.1. CF tree with branching factor B. A leaf node contains at most L entries.
3.2.2 Cluster step
The cluster step takes the sub-clusters resulting from the pre-cluster step as input and groups
them into the desired number of clusters. Since the number of sub-clusters is much smaller than
the number of original cases, traditional clustering methods can be used effectively. The TSC
uses the hierarchical clustering method.
3.2.3 Log-Likelihood distance
A distance measure for closeness is needed in both pre-cluster and cluster steps. The distance
between two different clusters is related to the decrease in likelihood as they are combined
into one cluster. In calculating log-likelihood, multinomial distribution is assumed. It is also
assumed that the cases and their attributes are independent of each other. The metric is
defined by
d(i, k) = ξ_i + ξ_k − ξ_⟨i,k⟩    (5)

ξ_j = −N_j · Σ_{n=1..K} Ê_jn

Ê_jn = −Σ_{l=1..L_n} (N_jnl / N_j) · log(N_jnl / N_j)

where d(i, k) is the distance between clusters i and k, ⟨i, k⟩ is the index that represents the
cluster formed by combining clusters i and k, N_j is the number of cases in cluster j and N_jnl is
the number of cases in cluster j whose n-th variable takes the l-th category. K is the total
number of variables and L_n is the number of categories for the n-th variable.
3.2.4 Auto Clustering using BIC
The number of clusters depends on the data at hand. A characteristic of hierarchical clustering
is that it produces a sequence of partitions in one run: 1, 2, 3 … clusters. A K-means
algorithm would need to run several times in order to generate the sequence. To determine the
number of clusters automatically, a two-step process that works well with hierarchical
clustering is considered. In the first step, the BIC (see chapter 3.1.3) for each number of
clusters within a specified range is calculated and used to find the initial estimate for the
number of clusters. The initial estimate is refined in the second step by finding the largest
increase in distance between two clusters in each hierarchical clustering stage. Using equation
(5) above the BIC is calculated as:
BIC(V) = −2 · Σ_{v=1..V} ξ_v + m_V · log(N)

where N is the total number of cases and m_V = V · Σ_{k=1..K} (L_k − 1).
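The criterion above can be illustrated for binary data, where every L_k = 2 and hence m_V = V · K. This is a hedged sketch with hand-made candidate partitions of an invented toy data set (a real run would take the partitions from the hierarchical clustering stages).

```python
from math import log

def xi(cluster):
    """xi_j = -N_j * total category entropy over the K variables."""
    n = len(cluster)
    h = 0.0
    for v in range(len(cluster[0])):
        ones = sum(row[v] for row in cluster)
        for count in (ones, n - ones):
            if count:
                p = count / n
                h += -p * log(p)
    return -n * h

def bic(partition, K):
    """BIC(V) = -2 * sum_v xi_v + m_V * log(N), binary case."""
    N = sum(len(c) for c in partition)
    mV = len(partition) * K          # V * sum_k (L_k - 1) with L_k = 2
    return -2 * sum(xi(c) for c in partition) + mV * log(N)

data = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]]
one_cluster = [data]
two_clusters = [data[:3], data[3:]]
print(bic(one_cluster, K=3) > bic(two_clusters, K=3))  # → True
```

Splitting into the two homogeneous clusters lowers the BIC, so the auto-clustering step would prefer V = 2 here.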
3.3 Data Mapping RPM in VisuMap
For many decades, visualizing high-dimensional data has been a key subject. Many of
the methods target high-dimensional data with stylish rendering procedures such as 3D,
landscapes, special glyphs, colours and graphics. Other methods attack the problem by
reducing the dimensionality in a generic way, with few assumptions about the data type. The
RPM (Relational Perspective Map) algorithm belongs to the latter kind. RPM is a
general-purpose method to visualize distance information of data points in high-dimensional
spaces [25].
The goal of the RPM algorithm is to map the data points onto a two- or three-dimensional map
so that the distances between the image points approximate the original distances as closely as
possible. An RPM map thus attempts to preserve as much as possible of the distance
information of the original data set from a geometric point of view. The RPM algorithm's
creation of 2D and 3D maps is shown in figure 3.2.
Figure 3.2. The principle of the RPM algorithm
4. Analysis
This is the main section of this work. Three different tools were used to demonstrate
how the clustering algorithms work. The first is K-means in MatLab with the Hamming
distance on a simple example. The second is data mapping in VisuMap on one of the
five functions, and the last is Two Step Clustering in SPSS. Since TSC handles large
data sets and decides the number of clusters automatically, this tool was preferred. Hence, the
main result is based on TSC in SPSS. However, since the visualization in SPSS is poor, the
results were presented as graphs in Excel.
4.1 K-means in MatLab – A simple example
Figure 4.1 shows the principle of the K-means algorithm. K-means is a partitioning method
where the trucks (based on their attributes) are partitioned into subsets (clusters). The idea is
to minimize the mean square error, MSE (see chapter 3.1.1). The inputs are the data set and
the number of clusters. The output is a set of clusters such that data within a cluster are
similar to each other and dissimilar from data in other clusters. We will demonstrate how this
method works by using K-means in MatLab. The disadvantage of this method is that the
number of clusters must be given in advance. Moreover, the visualization using silhouette
values (figure 4.2) in MatLab is not easy to interpret. The silhouette plot is nevertheless useful
for deciding the number of clusters, but this can be time-consuming since one must run the
algorithm several times, once for each candidate number of clusters, and then compare the
resulting silhouette plots.
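The partitioning loop just described can also be sketched outside MatLab. The following is a minimal Python illustration (not the built-in kmeans used later): a Lloyd-style iteration under Hamming distance where each center is the attribute-wise majority vote of its members. The truck/ECU data here are synthetic, made up only for the demonstration.

```python
import numpy as np

def kmodes_hamming(X, k, n_iter=20, seed=0):
    """Lloyd-style clustering of binary rows under Hamming distance.

    Centers are attribute-wise majority votes ("modes"), so they stay 0/1.
    """
    rng = np.random.default_rng(seed)
    # farthest-point initialization keeps the k starting centers spread out
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d = np.min([(X != c).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Hamming distance from every truck to every center
        dist = (X[:, None, :] != centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(int)
    return labels, centers

# toy stand-in for the 140x5 truck/ECU matrix: two planted option groups
rng = np.random.default_rng(1)
a = (rng.random((70, 5)) < [0.9, 0.9, 0.1, 0.1, 0.1]).astype(int)
b = (rng.random((70, 5)) < [0.1, 0.1, 0.9, 0.9, 0.9]).astype(int)
X = np.vstack([a, b])
labels, centers = kmodes_hamming(X, k=2)
print(centers)
```

The resulting centers recover the two planted option patterns, which is the same behaviour the MatLab example below exploits.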
Figure 4.1. K-means algorithm
The matrix below shows 140 trucks with 5 attributes (here Electronic Control Units):

            ECU1  ECU2  ECU3  ECU4  ECU5
Truck 1      0     1     0     0     0
Truck 2      0     1     1     0     1
Truck 3      0     0     0     1     0
…            …     …     …     …     …
Truck 140    …     …     …     …     …
By using K-means in MatLab with Hamming distance (see 2.3.5) we can find a pattern
between the trucks. If we choose 2 clusters the result will be as in figure 4.2. The figure shows
that ECU2 and ECU4 differ from the rest of the ECUs, since these control units belong to
cluster 1 while the others belong to cluster 2. See Appendix C for the very simple MatLab
code for this example, using the built-in functions kmeans and silhouette.
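The silhouette value behind figure 4.2 can also be computed directly. The sketch below (a Python illustration, not MatLab's silhouette function) uses the usual definition: for each point, a is the mean Hamming distance to its own cluster and b is the smallest mean distance to another cluster, giving s = (b - a) / max(a, b). The toy data are made up for the example.

```python
import numpy as np

def silhouette_hamming(X, labels):
    """Silhouette s(i) = (b - a) / max(a, b) with mean Hamming distance:
    a = mean distance within own cluster, b = mean distance to the
    nearest other cluster. Singleton clusters get s = 0 by convention."""
    n = len(X)
    dist = (X[:, None, :] != X[None, :, :]).sum(axis=2).astype(float)
    clusters = np.unique(labels)
    s = np.zeros(n)
    for i in range(n):
        own = labels[i]
        mask_own = (labels == own)
        if mask_own.sum() <= 1:
            continue
        a = dist[i, mask_own].sum() / (mask_own.sum() - 1)  # exclude self
        b = min(dist[i, labels == c].mean() for c in clusters if c != own)
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

# two perfectly separated binary clusters: every silhouette value is 1.0
X = np.array([[0, 0, 0, 0, 0]] * 5 + [[1, 1, 1, 1, 1]] * 5)
labels = np.array([0] * 5 + [1] * 5)
print(silhouette_hamming(X, labels))
```

Values near 1 indicate well-separated clusters; values near 0 or below indicate points sitting between clusters, which is why the plot helps when choosing the number of clusters.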
Figure 4.2. Silhouette plot in MatLab using K-means with Hamming distance for two clusters (the vertical axis lists ECU2, ECU4, ECU1, ECU3 and ECU5)
4.2 Data Mapping in VisuMap
In this section we describe the use of data mapping in VisuMap [25] (see 3.3). We will
analyze function A. Figure 4.3 shows the data once it has been imported into VisuMap as a
CSV (Comma-Separated Values) file.
Figure 4.3. Imported multidimensional data into VisuMap
By using the RPM method described in 3.3 the function was analyzed. Figure 4.4 shows 5
clusters. The surprising result is that a large proportion (31.5%) of all trucks with this
function are sold to Thailand.
Figure 4.4. Result in VisuMap for function A
4.3 Two Step Clustering in SPSS
The best method is TSC in SPSS, which can handle the size of the data and uses
auto-clustering (see 3.2). Since the visualization in SPSS is poor, the results of the TSC
algorithm were exported to Excel.
4.3.1 Clustering strategy
Five different functions were considered. For each function, patterns between Electronic
Control Units were found by using TSC in SPSS. Once the clusters based on the ECUs were
found, the segments and countries within those clusters were identified.
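SPSS's Two Step procedure picks the number of clusters automatically with BIC. The sketch below imitates that idea only (it is not the SPSS algorithm): binary data are clustered for several candidate cluster counts, each clustering is scored with a Bernoulli log-likelihood BIC, and the count with the lowest score wins. The data, the clustering routine and the parameter count are illustrative assumptions.

```python
import numpy as np

def fit_clusters(X, k, n_iter=25, seed=0):
    """Plain Lloyd k-means with farthest-point initialization.

    Euclidean distance is fine on 0/1 data (it is monotone in Hamming here).
    """
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))].astype(float)]
    while len(centers) < k:
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()].astype(float))
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def bic(X, labels, k):
    """BIC-style score: -2 log-likelihood + (free parameters) * log n.

    Each cluster is modelled with independent Bernoulli attributes; lower wins.
    """
    n, d = X.shape
    loglik = 0.0
    for j in range(k):
        C = X[labels == j]
        if len(C) == 0:
            continue
        p = C.mean(axis=0).clip(1e-9, 1 - 1e-9)
        loglik += (C * np.log(p) + (1 - C) * np.log(1 - p)).sum()
    return -2 * loglik + k * d * np.log(n)

# synthetic sales data: three planted "option packages" over 9 ECUs, 5% noise
rng = np.random.default_rng(2)
protos = np.array([[1]*3 + [0]*6, [0]*3 + [1]*3 + [0]*3, [0]*6 + [1]*3],
                  dtype=bool)
X = np.vstack([(rng.random((60, 9)) < 0.05) ^ p for p in protos]).astype(int)
scores = {k: bic(X, fit_clusters(X, k), k) for k in range(1, 6)}
best_k = min(scores, key=scores.get)
print(best_k)
```

On this synthetic data the lowest BIC lands on the planted number of groups, mirroring how TSC's auto-clustering trades model fit against complexity.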
4.4 Result in SPSS
The main result of this work is shown in this section. The clusters found for the five functions
are presented here. Table 4.1 describes the volume of each function in the data set. For a
description of the different ECUs, please refer to Appendix C.
Function     Volume
Function A   Low
Function B   Low
Function C   High
Function D   Very High
Function E   Very High

Table 4.1. Volumes of the functions in the data set
4.4.1 Function A
Five clusters were found for function A. Figure 4.5 shows the cluster distribution, figure 4.6
shows the ECU-cluster distribution and figure 4.7 the respective segments and countries.
Cluster 4 shows that a large proportion of all trucks with this function have ECUs 23 and 25
and are sold to Thailand. The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 26%, cluster 2 – 11%, cluster 3 – 20%, cluster 4 – 31%, cluster 5 – 12%.
Figure 4.5. Cluster distribution – Function A
Cluster 1
Trucks in this cluster have the ECUs 5, 17, 22, 24 and 25. Moreover they belong to the
segment Long-Haulage and are sold in Sweden.
Cluster 2
These trucks do not have ECU 15; they belong to the segment Long-Haulage and are sold in
Spain.
Cluster 3
These trucks have the ECUs 16, 22 and 24; they belong to the segment Long-Haulage and are
sold in Europe.
Cluster 4
These trucks have the ECUs 23 and 25; they belong to the segments Long-Haulage and
Distribution and are sold in Thailand.
Cluster 5
These trucks are low-end versions; they belong to the segment Long-Haulage and are sold in
Saudi Arabia.
Figure 4.6. Cluster distribution - ECUs with respect to function A
Figure 4.7. Cluster distribution – segment and countries for function A
4.4.2 Function B
Two clusters were found for function B. Figure 4.8 shows the cluster distribution, figure 4.9
shows the ECU-cluster distribution and figure 4.10 the respective segments and countries.
The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 65%, cluster 2 – 35%.
Figure 4.8. Cluster distribution – Function B
Cluster 1
Trucks in this cluster have the ECUs 6 and 16. Moreover they belong to the segment
Long-Haulage and are sold in Germany.
Cluster 2
These trucks have the ECUs 17, 22, 24 and 25; they belong to the segment Long-Haulage and
are sold in Sweden.
Figure 4.9. Cluster distribution - ECUs with respect to function B
Figure 4.10. Cluster distribution – segments and countries for function B
4.4.3 Function C
Four clusters were found for function C. Figure 4.11 shows the cluster distribution, figure
4.12 shows the ECU-cluster distribution and figure 4.13 the respective segments and
countries. Cluster 2 shows that a large proportion of all trucks with this function do not have
ECUs 3 and 4 (pneumatic brake system); the segment is Construction and the trucks have
been sold to Dubai (the Middle East in general). The details of each of the found clusters are
listed below.
Cluster distribution: cluster 1 – 60%, cluster 2 – 11%, cluster 3 – 18%, cluster 4 – 11%.
Figure 4.11. Cluster distribution – Function C
Cluster 1
Trucks in this cluster have the ECU 3. Moreover they belong to the segment Construction and
are sold in France, Turkey and Spain.
Cluster 2
These trucks do not have the ECUs 3 and 4. They belong to the segment Construction and are
sold in Dubai.
Cluster 3
These trucks have the ECUs 3, 14 and 18; they belong to the segment Long-Haulage and are
sold in Europe.
Cluster 4
These trucks have the ECUs 4 and 21; they belong to the segment Distribution and are sold in
Israel.
Figure 4.12. Cluster distribution - ECUs with respect to function C
Figure 4.13. Cluster distribution – segment and countries for function C
4.4.4 Function D
Two clusters were found for function D. Figure 4.14 shows the cluster distribution, figure
4.15 shows the ECU-cluster distribution and figure 4.16 the respective segments and
countries. The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 38%, cluster 2 – 62%.
Figure 4.14. Cluster distribution – Function D
Cluster 1
Trucks in this cluster have ECU 3 and are low-end in general. Moreover they belong to the
segment Construction. The countries vary.
Cluster 2
These trucks are high-end and have the ECUs 1, 4 and 21. They belong to the segment
Long-Haulage and the countries vary.
Figure 4.15. Cluster distribution - ECUs with respect to function D
Figure 4.16. Cluster distribution – segment and countries for function D
4.4.5 Function E
Two clusters were found for function E. Figure 4.17 shows the cluster distribution, figure
4.18 shows the ECU-cluster distribution and figure 4.19 the respective segments and
countries. The details of each of the found clusters are listed below.
Cluster distribution: cluster 1 – 70%, cluster 2 – 30%.
Figure 4.17. Cluster distribution – Function E
Cluster 1
Trucks in this cluster have ECU 3 and are low-end in general. Moreover they belong to the
segment Construction. The countries vary.
Cluster 2
These trucks are high-end and have the ECUs 1, 4 and 21. They belong to the segment
Long-Haulage and are sold in Italy and Denmark.
Figure 4.18. Cluster distribution - ECUs with respect to function E
Figure 4.19. Cluster distribution – segment and countries for function E
4.5 Change over time
In this section the change over time for function D and function E is considered.
4.5.1 Function D
Figure 4.20 shows the change over time for the two clusters of function D. In the first cluster
the segment was Construction and the trucks were low-end. In the second cluster the segment
was Long-Haulage and the trucks were high-end. The figure makes clear that more and more
construction trucks (the upper graph) have been sold over time.
Figure 4.20. Change over time for function D – the upper graph shows that the construction trucks trend
upward over time
4.5.2 Function E
Figure 4.21 shows the change over time for the two clusters of function E. In the first cluster,
the segment was Construction and the trucks were low-end. In the second cluster, the
segment was Long-Haulage and the trucks were high-end. The figure makes clear that more
construction trucks have been sold during the last years.
Figure 4.21. Change over time for function E – the upper graph shows that more construction trucks have been
sold during the last years compared to the lower graph, which shows the long-haulage trucks.
5. Discussion
5.1 Tools
The research company Gartner Group [16, 34] states that SPSS remains one of the leading
vendors in the customer data-mining application market, behind the well-established
statistical software SAS (Statistical Analysis System). According to Gartner, SAS is
expensive: price-sensitive companies, or those requiring significant justification of the
cost-effectiveness of one solution over another, should evaluate alternatives.
Some good alternatives are SPSS, VisuMap, BayMiner and the MatLab statistics toolbox.
Figure 5.1 and table 5.1 show five different tools. The figure shows performance versus price.
The pluses in table 5.1 mean "good" and the minuses mean "maybe another tool should be
used". VisuMap is the leading tool when it comes to visualization, and SPSS is the only one
with auto-clustering (automatically determining the number of clusters). The usability of both
SPSS and VisuMap is fairly good. The prices include some extra tools needed for data-mining
purposes, for example the Clementine tool in SPSS. Moreover, the prices of BayMiner and
VisuMap depend on the purpose at hand, since the price includes monthly support etc.
Figure 5.1. Tools for Data Mining purposes
Tool                       Handles large data sets                    Visualization  Auto Clustering  Cost, single-computer license                       Usability
SPSS                       yes                                        -              ++               ~90 000 SEK [29]                                    +
VisuMap                    yes                                        ++             -                depends on data size; ~100 000 SEK, no limit [30]   +
MatLab statistics toolbox  yes (max 100 000 records, 500 attributes)  -              -                ~7 000 SEK [31]                                     -
BayMiner                   yes                                        ++             -                ~40 000 SEK [32]                                    ++
SAS                        yes                                        +              -                ~120 000 SEK [33]                                   +

Table 5.1. Tools for Data Mining purposes
5.2 Binarization
Assume that we know the dependence between the functions (an architect or expert at Scania
gives us this information) and we want to use a binary clustering technique. If we let 3 denote
very strong dependence and 0 independence, we can construct a binary matrix. The following
is a simple example of binarization:
Categorical value          Integer value   D1   D2   D3   D4
Independent                0               1    0    0    0
Almost independent         1               0    1    0    0
Strongly dependent         2               0    0    1    0
Very strongly dependent    3               0    0    0    1

Table 5.2. Example of binarization
One can then merge this with the historical sales data. Such a transformation can cause
complications, such as creating unintended relationships among the transformed attributes.
Please see the future work chapter and Appendix B for more information on dealing with a
large set of functions simultaneously.
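The binarization above is one-hot encoding of the 0-3 dependence scale. A minimal Python sketch of it follows; the level names mirror the table, and nothing here is Scania's actual schema.

```python
def binarize(value, n_levels=4):
    """Map an integer dependence level 0..n_levels-1 to a 0/1 indicator vector."""
    if not 0 <= value < n_levels:
        raise ValueError("dependence level out of range")
    return [1 if i == value else 0 for i in range(n_levels)]

levels = {"independent": 0, "almost independent": 1,
          "strongly dependent": 2, "very strongly dependent": 3}
rows = {name: binarize(v) for name, v in levels.items()}
print(rows["strongly dependent"])   # → [0, 0, 1, 0]
```

Each binarized row could then be appended to the sales matrix, but, as noted above, such derived columns can introduce unintended relationships between attributes.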
6. Related work – Stock market
The S&P 500 (Standard & Poor's 500) is a market-value-weighted index whose components
are weighted according to the total market value of their outstanding shares [26]. The stocks
included in the S&P 500 are those of large publicly held companies that trade on the two
largest American stock markets, the New York Stock Exchange and NASDAQ. Almost all of
the stocks included in the index are among the 500 American stocks with the largest market
capitalizations [28] (the total value of all outstanding shares multiplied by the stock price). To
many people a stock market is nothing but a site where it is possible to acquire capital or
influence, or both. However, finance theory and the idea of a stock market can be applied
outside of the finance domain. In this example the concepts are not used to evaluate resource
optimization, but rather to analyze price movements of financial instruments.
By using Data Mining techniques one can analyze the performance of the stocks. Figure 6.1
shows the S&P 500 index stocks based on weekly performance in the year 2002. This was
done in VisuMap by James X. Li [27]. Stocks having similar performance properties are
located close to each other.
Figure 6.1. S&P 500 index stocks – Closely located stocks have similar properties
7. Future work – Function to function
This thesis has focused on creating a framework for evaluating the resource efficiency of
embedded systems. This chapter discusses future work, where the most interesting issue is to
consider a larger number of functions simultaneously.
Given a function, we have considered the patterns between Electronic Control Units,
segments and countries. The most interesting question for future work is: what if we consider
all functions? The answer is not easy and is beyond this work. However, one way to approach
this problem is to find the dependence between the functions by using the Dependency
Structure Matrix (DSM) and combine this with historical sales data. By letting design
architects put weights on the functions as a measure of dependence (for example 5 for very
strongly dependent and 0 for independent) and then using DSM clustering, one can find out
more about the customer's choices. Another way to tackle the problem is to modify the model
in this work by transforming the measure of dependence into binary attributes through
binarization and merging it into the modified binary matrix in a suitable way.
Figure 7.1 demonstrates the use of a DSM. Suppose we are considering four functions: A, B,
C and D. We list A, B, C and D across the columns and down the rows. An "X" is placed in
an entry to indicate an interaction between two functions. Reading across a row we can see
from which other functions information must be passed to the function in that row. For
instance, the third row in the figure shows that function C depends on both functions B and
D. Next, reading down the columns we see which functions depend on the function in that
column. From the fourth column we can see that both functions A and C depend on function
D. Hence the "X" marks indicate a dependency in a general sense. Please see Appendix B for
more on DSM clustering.
      A   B   C   D
A     A           X
B         B
C         X   C   X
D                 D

Figure 7.1. A sample DSM
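The row/column reading rules can be made concrete in code. The sketch below stores the DSM of figure 7.1 as a set of marks; only the dependencies stated in the text (C on B and D; A and C on D) are included.

```python
# The DSM of figure 7.1 as a set of (row, column) marks. Only the dependencies
# stated in the text are included: C depends on B and D, and A depends on D.
marks = {("A", "D"), ("C", "B"), ("C", "D")}

def depends_on(f):
    """Read across row f: the functions f needs input from."""
    return sorted(c for r, c in marks if r == f)

def dependents_of(f):
    """Read down column f: the functions that need input from f."""
    return sorted(r for r, c in marks if c == f)

print(depends_on("C"))      # → ['B', 'D']
print(dependents_of("D"))   # → ['A', 'C']
```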
8. Conclusion
Since electronics is becoming increasingly important in the automotive sector, more and more
functionality is implemented through the embedded system. This means that the
manufacturer's knowledge about functionality is becoming more and more important.
Meanwhile, vast amounts of data are collected and stored in computers. The relevant
information about customers' choices in the collected data can be used to optimize the
resource utilization in the embedded system. A classification of the data can be made using
Data Mining techniques, and the results can be used as support when making architectural
decisions.
The work of the thesis includes a case study of five functions. The analysis showed the
importance of extracting relevant information from large data sets. For instance, trucks with
function A were sold in Thailand, and low-end long-haulage trucks with function A were sold
in Saudi Arabia. Moreover, low-end construction trucks without the pneumatic brake ECUs
and with function C were sold in Dubai, and more construction trucks with function E have
been sold over the past years compared to high-end long-haulage trucks with the same
function. In conclusion, the analysis can be said to give a new perspective when making
design decisions.
As pointed out in the previous chapter, some future work remains: given a function, we have
considered the patterns between electronic control units, segments and countries. The most
interesting question is what happens if we consider a larger number of functions
simultaneously. Still, the resource optimization outlined in this work may prove very helpful
when evaluating customers' choices with respect to historical sales data. It gives the architects
at Scania a basis for decision-making in the current design process.
9. References
[1] Ming-Shan Liu. (2007) Application of Embedded System in Construction Machinery
[2] Mercer Management Consulting and Hypovereinsbank. (2001) Studie,
Automobiltechnologie 2010.
[3] Erik Persson. (2008) Resource utilization in embedded systems – an economical
perspective. M.Sc. thesis at Royal Institute of Technology, Stockholm.
[4] Hardung, B et.al. (2004) Reuse of software in distributed embedded automotive systems.
[5] Zientz, W. (2007) Electronic systems for commercial vehicles. AutoTechnology 5, pp 40-43.
[6] Zielke, A et.al. (2006) The race to master automotive embedded systems development.
McKinsey Company, Automotive and assembly sector business technology office, Germany.
[7] Jiawei Han and Micheline Kamber. (2008) Data Mining: Concepts and Techniques.
[8] Lars Eldén. (2007) Matrix Methods in Data Mining and Pattern Recognition.
[9] Tan Pang-Ning, Steinbach Michael and Kumar Vipin. (2006) Introduction to Data Mining.
[10] Zhang, T. (1996). Birch: An efficient data clustering method for very large databases.
ACM SIGMOD Conference, Montreal, Canada, pp. 103–114.
[11] Gray R.M. (1991) Vector Quantization and Signal Compression, Kluwer Academic
Publishers, 1991.
[12] Fischbacher U. (1996) Finding the maximum a posteriori probability (MAP) in a
Bayesian taxonomic key is NP-hard J. Math. Biol. 34.
[13] Kaufman L. (1990) Finding Groups in Data: An introduction to Cluster Analysis, Wiley.
[14] Chris Ding and Xiaofeng He. (2004) K-means Clustering via Principal Component Analysis,
Canada.
[15] Hartigan John A. (1975) Clustering Algorithms, John Wiley & Sons, New York.
[16] Gartner Group. (1995) High Performance Computing Research Note.
[17] http://etn.se/48017
[18] Eppinger, Steven D., Daniel R. (1994) A Model-based Method for Organizing Tasks in
Product Development, Research in Engineering Design. 1-13.
[19] www.dsmweb.org
[20] Rogers, James L. and McCulley, M. Collin. (1996) Integrating a Genetic Algorithm into a
Knowledge-Based System for Ordering Complex Design Processes. NASA Technical
Memorandum.
[21] James Xinzhi Li. (2004) Visualization of High Dimensional Data with Relational
Perspective Map. Information Visualization, Vol 3, No. 1. 49-59.
[22] Scania Inline. (2003) Electrical System.
[23] http://www.spss.com/statistics/
[24] Ph. Dwinger. (1961) Introduction to Boolean algebras, Wurzburg.
[25] http://www.visumap.net/
[26] http://www.investopedia.com/terms/s/sp500.asp
[27] http://jamesxli.blogspot.com/
[28] http://www.investopedia.com/terms/m/marketcapitalization.asp
[29] http://www.ogs.state.ny.us/purchase/snt/awardnotes/7600600239prices.pdf
[30] http://www.visumap.net/registered/ProductList.aspx
[31] http://www.mathworks.se/store/productIndexLink.do
[32] http://www.bayminer.com/en/pages/positioning.htm
[33] http://www.sas.com/technologies/analytics/datamining/
[34] http://mediaproducts.gartner.com/reprints/sas/vol5/article3/article3.html
[35] http://www.sas.com/technologies/analytics/statistics/stat/
[36] Rissanen, J. (1989) Stochastic Complexity in Statistical Inquiry. Singapore: World
Scientific.
[37] Higham N.J. (1996) Matrix Norms. Philadelphia: Soc. Industrial and Appl. Math.
[38] P.Kontkanen, P.Myllymäki, W.Buntine, H. Tirri, J.Rissanen. (2005) In Advances in
Minimum Description Length: Theory and Applications. The MIT Press.
[39] A.D Lanterman. (2001) Intertwining Themes in Theories of Model Selection.
International Statistical Review 69, pp 189-212.
[40] Rong Liu. (2002) The SPSS TwoStep Cluster. University of North Texas.
10. Appendices
This chapter includes a total of three appendices:
Appendix A Importance of "knowing your data"
Appendix B DSM Clustering
Appendix C K-means in MatLab
10.1. Appendix A – Importance of “knowing your data”
This example is taken from Tan, Steinbach and Kumar [9]. Although this scenario represents
an extreme situation, it highlights the importance of the data preparation or pre-processing
discussed in chapter 2.
Assume that you are a Data Miner and that you receive an email from a medical researcher.
Hi,
I’ve attached the data file that I mentioned in my previous email. Each line contains the
information for a single patient and consists of five fields. We want to predict the last field
using the other fields.
Thanks and see you in a couple of days with my friend, a statistician.
Best regards
Medical Bob
You proceed to analyze the data. The first few rows of the file are as follows:

012 232 33.5 0 10.7
020 121 14.4 2 210.1
027 134 12.2 0 344.3
… … … … …
You put your doubts aside and start the analysis. There are only 100 lines, a smaller data file
than you had hoped for, but two days later you feel that you have made some progress. You
arrive at the meeting and strike up a conversation with the statistician who is also working on
this project (Bob's friend). She asks if you would mind giving her a brief overview of your
results.
Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven’t had much time for analysis, but I do have a few interesting
results.
Statistician: Amazing. There were so many data issues with this set of patients that I couldn’t
do much.
Data Miner: Oh? I didn’t hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to predict. It’s common
knowledge among people who analyze this type of data that results are better if you work with
the log of the values, but I didn’t discover this until later. Was it mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what happened to field 4? It’s supposed to be
measured on a scale from 1 to 10, with 0 indicating a missing value, but because of the data
entry error, all 10’s were changed to 0’s. Unfortunately, since some of the patients have
missing values for this field, it’s impossible to say whether a 0 in this field is a real 0 or a 10.
Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed
that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
– 60 –
Statistician: Anyway, given all those problems, I’m surprised you were able to accomplish
anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of
field 5. I’m surprised that this wasn’t noticed before.
Statistician: What?? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh no! I just remembered. We assigned ID numbers after we sorted the records
based on field 5. There is a strong connection, but it’s meaningless. Sorry!
10.2. Appendix B – Dependency Structure Matrix
The Design Structure Matrix, DSM (also Dependency Structure Matrix), is a useful tool for
optimizing the composition of product development elements in terms of minimizing
interfaces and extra-element interactions. DSM is used in system architecting, engineering
and design [18]. The DSM is a square matrix where rows and columns list the same elements,
and the entries in the matrix record interactions between the elements. The goal of DSM
clustering is to find clusters (subsets) that interact minimally [15]; in other words, a cluster
should absorb most of the interactions internally, while the links between separate clusters
are minimized. The rules of how this clustering is performed vary from application to
application, and so does the type of solution obtained. The goal is thus to identify clusters of
highly interactive functions through a reordering of the matrix [20].
What we need is to modify the matrix and cluster the functions together into highly
interactive groups known as system components. A simple example of DSM clustering is
shown in figure 2.6.
Figure 2.6. A sample DSM (eight elements A–H, with "X" marks recording their pairwise interactions)
According to Figure 2.6, function A has interaction with components D, F and H. By
reordering the above matrix we get the optimized solution shown in figure 2.7.

Figure 2.7. DSM clustering – reordered DSM with two system components
The optimized matrix was obtained by exchanging the positions of groups B and H, and of
groups C and F. In this example, two system components were distinguished as the best
configuration: system component 1 (green) with A, H, F and D, and system component 2
(red) with E, C, G and B.
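The clustering objective (absorb interactions inside clusters, minimize links between them) can be written down directly. The sketch below uses a small hypothetical symmetric DSM, not the matrices in this appendix, and exhaustively scores every two-way split of eight elements into two groups of four.

```python
from itertools import combinations

# A small hypothetical symmetric DSM as an adjacency set; the real matrices
# are larger, but the objective is the same: choose clusters so that as few
# interaction marks as possible cross a cluster boundary.
elements = list("ABCDEFGH")
marks = {frozenset(p) for p in [("A", "H"), ("A", "F"), ("A", "D"), ("H", "F"),
                                ("F", "D"), ("E", "C"), ("C", "G"), ("G", "B"),
                                ("E", "B"), ("D", "E")]}   # D–E is a cross link

def external_links(cluster):
    """Count interactions crossing the boundary of a 2-way partition."""
    other = set(elements) - set(cluster)
    return sum(1 for m in marks
               if any(x in cluster for x in m) and any(x in other for x in m))

best = min(combinations(elements, 4), key=external_links)
print(sorted(best), external_links(best))   # → ['A', 'D', 'F', 'H'] 1
```

For larger matrices an exhaustive search is infeasible, which is why practical DSM tools use heuristics such as the genetic algorithm of [20]; the cost function, however, stays the same.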
Let us complete this section with a concrete example of DSM for a vehicle. The following
DSM describes a Climate Control system [19].
The DSM lists sixteen components: Radiator (A), Engine fan (B), Heater Core (C), Heater
Hoses (D), Condenser (E), Compressor (F), Evaporator Case (G), Evaporator Core (H),
Accumulator (I), Refrigeration Controls (J), Air Controls (K), Sensors (L), Command
Distribution (M), Actuators (N), Blower Controls (O) and Blower Motor (P), with "X" marks
recording their pairwise interactions.
By reordering the above DSM we get an optimized solution in which the interactions are
grouped along the diagonal.
Clustering the "X" marks along the diagonal of the DSM resulted in the creation of three
"chunks" for the Climate Control system:
1. Front End Air Chunk
2. Refrigerant Chunk
3. Interior Air Chunk
In our case the "chunks" could for example be low-end and high-end trucks, segments
(long-haulage, distribution or construction) and so on.
10.3. Appendix C - K-means in MatLab
MatLab code for the simple example in chapter 4.1
clc; clear; clf;

% Historical sales data: 140 trucks and 5 Control Units (0/1 matrix)
data = load('sale.txt');
X = data';   % transpose so that the rows to be clustered are the 5 ECUs

% K-means using Hamming distance, number of clusters = 2
idx = kmeans(X, 2, 'distance', 'hamming');
[s, h] = silhouette(X, idx, 'hamming');

% Label the attributes and pair each ECU with its cluster index
label = {'ECU1' 'ECU2' 'ECU3' 'ECU4' 'ECU5'}';
vektor = {};
for i = 1:length(idx)
    vektor = [vektor {idx(i) [label{i,:}]}];
end
answer = vektor'