Intl. Trans. in Op. Res. 7 (2000) 159–170
www.elsevier.com/locate/orms
Operations research and knowledge discovery: a data
mining method applied to health care management
L. Delesie*, L. Croes
Centre for Health & Nursing Research, Catholic University of Leuven, Kapucijnenvoer 35, 3000 Leuven, Belgium
* Corresponding author. Tel.: +32-16-336972; fax: +32-16-336970. E-mail address: [email protected] (L. Delesie).
Received 2 June 1999; received in revised form 15 January 2000; accepted 26 January 2000
Abstract
The exponential growth of databanks creates opportunities to expand Operational Research. An
example is the development of scientific approaches to "mine" intelligently the huge databanks that
complex systems rely on for their management. The contribution presents an approach to exploit a
health insurance databank to evaluate the performance of cardiovascular surgery nationwide. © 2000
IFORS. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Knowledge discovery; Datamining; Health care management; Cardiovascular surgery
1. Problem formulation
The management of complex systems increasingly depends on the discovery of management
knowledge in the large databases that these systems are implementing. Banks, retail chains,
insurance companies, health care organisations and the like install information and communication
technology (ICT) to monitor the millions of transactions that characterise their daily
operations. But as "management does things right, leadership does the right things", it
becomes mandatory to develop methods to discover strategic business knowledge. Success
depends on the close co-operation among the decision-makers, the experts in the field of
application, and skilled Operational Research (OR) analysts.
Most databanks can be looked upon as a set of tables, cross-tabulations or matrices. Such a
matrix can have thousands of rows that represent the objects of investigation: bank outlets,
retail shops, hospitals, cardiovascular surgery departments, and patients with health insurance
claims. The matrix also has hundreds of columns that represent the variables that measure the
objects of investigation. One easily arrives at millions of matrix cells that contain the actual
measurement data. The type of data, or the measurement level of the variables, can be
categorical (including binary and ordinal) or numerical (including interval and ratio) (Stevens, 1951).
The problem is not to aggregate the matrix. Aggregation over all objects is common, e.g.
total count by category, average value by variable. These operations of aggregation are easily
understood but hardly generate knowledge, e.g., counts and averages assume homogeneous
objects, while management wants to focus on troublesome and promising objects. The problem
is to perform a sequence of operations on the data in such a way that, in the end, some display
shows patterns and/or relations that reveal knowledge to the decision maker: e.g. leverage,
influential, outlier and extreme objects, abnormal or fraudulent practices, opportunities for
further activity. The problem is to discover something that is covered up in the database. Tufte
(1997, p. 45) points out that "there are displays that reveal the truth and displays that do not".
Knowledge discovery in databases is often defined as a six-stage iterative process: (1)
develop an understanding of the proposed application; (2) create a target data set; (3) remove
or correct corrupted data; (4) apply data-reduction algorithms; (5) apply a data-mining
algorithm; and (6) interpret the mined patterns (Brodley et al., 1999). This paper concentrates
on stages 4–6 and is limited to categorical variables and the one-matrix case.
2. Approach
The approach that we suggest after many years of experience consists of four phases: (1)
determine a criterion to measure the degree of similarity between any pair of representative,
influential, core, peer or benchmark objects on the basis of the available knowledge about the
objects, the type of data variables and the cleaned data themselves; (2) display the
benchmark objects in accordance with their mutual similarity; (3) highlight those variables that
characterise the overall display to the detriment of those variables that only marginally
influence the display; (4) locate new, additional objects within the framework established by the
benchmark objects.
2.1. Criterion for degree of similarity
The search for an index to measure the resemblance between two or more objects has a long
tradition. It is an early application of data reduction: one indicator to summarise a range of
features for one single object. Indeed, such an index would allow one to speak, in a generalised
way, of the distance or dissimilarity between any pair of objects, to test its statistical
significance and to classify and cluster any population of objects (Tildesley, 1921). Ten
similarity criteria have been investigated (Wishart, 1996).
Inferential statistical methods concentrate on the variables. Test statistics to measure
similarity for categorical variables were introduced with Karl Pearson's chi-squared statistic (χ²)
in 1900 and Wilks' loglikelihood ratio test statistic (G²) in 1935 (Wilks, 1935). Much depends
on the variables and the assumptions. Most publications assume the (multi-) normal
distribution. Real-life databases hardly confirm this textbook situation. Highly skewed
distributions, missing data, and continuous data shifts due to market, technology or cultural trends
are common. Cressie and Read (1984) develop the power-divergence family of statistics that
includes the χ² and G² statistics and review (Cressie and Read, 1989) the assumptions,
corrections and recommendations to measure and test discrepancy between observed and
hypothesised frequencies on the basis of this family of statistics. Fisher (1990) developed exact
methods for 2×2 tables around 1935. Unfortunately, computing capability limited their usefulness
for a long time to small samples only. In 1983, Mehta and Patel expanded their use to a few
hundred cells of objects and categories by developing specific computer algorithms (Mehta and
Patel, 1983). An evaluation of the existing statistics and measures of association became
possible and revealed a major lack of accuracy (Mehta, 1994, 1995). Recently, software that
allows one to do away with many corrections and approximations of the past has become available
(SPSS, 1998). To date, this software is limited to applications of less than 500 cells.
Cluster analysis concentrates on the objects: a manageable number of homogeneous groups
of objects on the basis of their salient characteristics as measured in the variables. By now, a
lot of companies offer datamining, datadrilling and business mining software algorithms that
cluster in many different ways: Business Objects, SAP, SAS, Silicon Graphics, SPSS. The
classic parameters are the mean and the variance or standard deviation. Unfortunately, these
algorithms hide the outlier and extreme cases that are of particular interest. ICT routinely
collects and stores data in great volume and detail and at great cost, but analysis and
evaluation on the aggregate level is confined to some global, simplistic conclusions that forego
the real-life diversity and variability. Recently, more sophisticated algorithms have become available,
e.g. non-parametric algorithms based on kernel density curves, qualitative-attribute algorithms
based on information theory, or "learning" neural networks. They do away with the normal
distribution assumptions and now manage huge databases with fuzzy and irregularly shaped
clusters. Averages still prevail to identify the clusters.
But our experience indicates that the analysis of diversity and variability often reveals a
continuum within which only a few clusters are present. In most cases a lot of objects do not
belong to a clear-cut cluster one way or another. On the other hand, if clear-cut clusters exist,
a good similarity criterion should reveal them. Also, cluster analysis does not maintain an
overall view that helps one to understand the global context as well as to grasp exceptions, e.g.
"special patients", "famous practitioners". The popularity of cluster analysis stems from its
analytic approach. It generates a lengthy list of clusters that can readily be applied, to the
detriment of knowledge discovery. The hidden basis of cluster analysis, though statistically
sound, aggravates the uncertainty intrinsic to the real world. Managers demand to understand
the scope of their decisions. They prefer holistic decision rules that apply to all objects, can be
argued on the basis of individual object characteristics and do not depend on abstract,
mathematically defined clusters.
At a more essential level of investigation, one discovers that inferential statistical methods
and cluster analysis run into the same problem: to find an index to measure the resemblance
between two or more objects on the basis of the variables in the database. Hence, we focused
on the development of another, new index of similarity. Formula (1) gives the index of
similarity SP_ij between any pair of objects i and j when the data represent the proportion of the
quantity under investigation by category k:

SP_{ij} = \frac{1}{2} \sum_{k} \left[ p_{ik} \log_2 \frac{p_{ik}}{p_{i+j,k}} + p_{jk} \log_2 \frac{p_{jk}}{p_{i+j,k}} \right], \qquad p_{i+j,k} = \frac{p_{ik} + p_{jk}}{2}    (1)

where p_{ik} and p_{jk} are the proportions of row quantities i and j in column k.
This index of similarity is closely related to the amount of information as defined by Wiener
(1948) and Krippendorf (1986). The amount of information and our index of similarity are 0
when the two objects are completely similar with respect to all variables. When the two objects
are completely different, the amount of information is some value in the interval (0, 1] while
our index of similarity is 1.
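As an illustration of formula (1), the short Python sketch below computes the index of similarity from two proportion vectors. The function name and the convention that empty categories contribute nothing (0 · log2(0/x) = 0) are our own choices for the example; readers may also recognise formula (1) as the base-2 Jensen-Shannon divergence between the two proportion vectors.

```python
import numpy as np

def similarity_index(p_i, p_j):
    """Index of similarity SP_ij of formula (1) between two objects whose rows
    of the data matrix are given as proportion vectors over the same
    categories k. Empty categories follow the convention 0 * log2(0/x) = 0."""
    p_i = np.asarray(p_i, dtype=float)
    p_j = np.asarray(p_j, dtype=float)
    p_mid = (p_i + p_j) / 2.0                    # p_{i+j,k} in formula (1)
    sp = 0.0
    for pik, pjk, pmk in zip(p_i, p_j, p_mid):
        if pik > 0:
            sp += pik * np.log2(pik / pmk)
        if pjk > 0:
            sp += pjk * np.log2(pjk / pmk)
    return 0.5 * sp

# Identical objects give 0; objects with no category in common give 1.
print(similarity_index([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))   # 0.0
print(similarity_index([1.0, 0.0], [0.0, 1.0]))             # 1.0
```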
The index stays sensitive and specific with hundreds of categories, with highly skewed
distributions, when many categories have small numbers, and when zeros are common. This
characteristic is important for management, where leverage, outlier or extreme data are time
and again considered most interesting and management does not want to hide them in
some rest category. Managers understand the index of similarity SP_ij quickly. It is directly
interpretable within a large group of objects and does not switch contexts between pairs of
objects. Also, management often learns as much from the variables or categories that are not
scored at all in some objects as from the variables and categories that are scored (Hearst, 1991).
The index is also geared toward a micro–macro approach. Classifications for variables and
objects are often hierarchically structured, e.g. individual subcategories of activities are
grouped into main categories, and individual medical specialists are grouped into hospitals.
One can use the same or different similarity indexes on each level of zoom, e.g. more detailed
categories when zooming in. The calculation of the probability density distribution of the index
of similarity, though, is no easy matter: from what value of the index of similarity onward is one
object statistically significantly different from another? We developed a Monte Carlo based
approach that gives a confidence interval as small as required for any level of significance
requested. The algorithms, though, are cumbersome: hours of computer time. Further
explanation on our index of similarity, its characteristics and its distribution is given elsewhere
(Delesie et al., 1998).
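The Monte Carlo procedure itself is documented elsewhere (Delesie et al., 1998), so the sketch below is only one plausible reading, not the authors' algorithm: assuming the proportions arise from counts with known totals n_i and n_j, the distribution of SP_ij under the null hypothesis of no real difference is simulated by resampling both objects from their pooled proportions. It reuses similarity_index from the previous sketch.

```python
import numpy as np

def sp_significance(p_i, p_j, n_i, n_j, n_sim=5000, seed=0):
    """Illustrative Monte Carlo assessment of an observed SP_ij: simulate the
    index under the null hypothesis that objects i and j share the pooled
    distribution, given assumed underlying totals n_i and n_j."""
    rng = np.random.default_rng(seed)
    pooled = (np.asarray(p_i, dtype=float) + np.asarray(p_j, dtype=float)) / 2.0
    pooled /= pooled.sum()
    observed = similarity_index(p_i, p_j)
    sims = np.empty(n_sim)
    for s in range(n_sim):
        q_i = rng.multinomial(n_i, pooled) / n_i   # resampled proportions for i
        q_j = rng.multinomial(n_j, pooled) / n_j   # resampled proportions for j
        sims[s] = similarity_index(q_i, q_j)
    p_value = float((sims >= observed).mean())     # how extreme is the observed index?
    return observed, p_value, np.quantile(sims, [0.95, 0.99])
```

Increasing n_sim narrows the Monte Carlo error on the simulated critical values, which mirrors the remark that a confidence interval as small as required comes at the price of hours of computer time.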
2.2. Visualisation of the benchmark objects
The second phase will "display" the square symmetric matrix that we obtain by
calculating the indexes of similarity between all pairs of objects. The display is a
scatterplot on one single page or window that visualises overall patterns in a
comprehensive and accessible way. We fine-tuned a multidimensional scaling (MDS)
algorithm to generate this display. The approach is comparable to the projection that
everybody has to use when representing the real spherical globe by way of a map of the
world, e.g. a Mercator map.
The class of MDS methods belongs to the domain of descriptive statistics. They model the
relationships between objects and their multiple variables by way of an optimal scatterplot
(Gifi, 1991; Molinero, 1998). The conditions of optimality can be specified and pertain to the
(non-linear) variable transformations, the dimensionality of the projection space and the
orthogonality of the space, of which every scatterplot displays two dimensions. The scatterplot
visually relates each object to all other objects. Objects that are completely similar are on the
same spot in the scatterplot. Objects that are dissimilar are located far from each other. The
approach provides insight into the global solution as well as into the contribution of each
individual object to this global solution. Any direction in the scatterplot shows a (non-linear)
relationship between the columns in the matrix. E.g. if the user prefers to optimise on the basis
of the least-squares criterion, then the x-axis represents the direction of maximum "explained
variance". Our approach uses a square symmetric matrix of indexes of similarity. Hence, the
x-axis would be the direction along which the dissimilarity between the objects in the database is
greatest. The same reasoning goes for the y-axis, and so on. Several optimality criteria have
been proposed. Some MDS computer algorithms are already commercially available (SPSS,
1994; SAS, 1995). Our MDS approach is based on the work of Young (Young and Hamer,
1987).
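As a minimal illustration of this phase, the sketch below builds the square symmetric matrix of similarity indexes and hands it to an off-the-shelf MDS routine as precomputed dissimilarities. It uses scikit-learn's metric MDS as a stand-in for the authors' fine-tuned algorithm (Young and Hamer, 1987) and only shows the data flow; similarity_index is the function from the earlier sketch.

```python
import numpy as np
from sklearn.manifold import MDS

def benchmark_display(objects):
    """Phase 2 sketch: compute the square symmetric matrix of SP indexes for a
    list of proportion vectors and embed it in two dimensions for the
    scatterplot of benchmark objects."""
    n = len(objects)
    sp = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sp[i, j] = sp[j, i] = similarity_index(objects[i], objects[j])
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(sp)     # one (x, y) point per benchmark object
    return sp, coords
```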
MDS still has its computational limits. No algorithm exists that will manage a square symmetric
matrix of thousands of rows or columns. It takes four days for a high-speed 1997 Unix
machine to calculate a display of some 3500 objects. Fortunately, it makes little sense to
display thousands of objects and variable lines (see Phase 3) in one single overall scatterplot. A
graphic output becomes unreadable when more than a few hundred objects are
involved. A 'huge' display overwhelms the user and no longer communicates the embedded
knowledge. It becomes useless.
Consequently, we suggest starting with the careful selection of benchmark objects and
variables. Our experience indicates that management has little problem identifying these
benchmark objects and variables. Upper and lower limits for volume of activity or total costs
are readily accepted. Our approach also allows weighting particular objects in accordance with
the demands of management with respect to a "good" benchmark: influential objects, e.g.
university hospitals, get a greater weight in the overall display. It is also common to rely on
some variable that is exogenous to the criterion of similarity: e.g. teaching hospitals only want
to be compared with other teaching hospitals; big city outlets only want to be compared with
their "peers" in similar surroundings. When no reasonable limit, no acceptable weights and no
exogenous variable seem available, we suggest structuring the problem hierarchically. In a first
step, groups of objects and main categories of variables are investigated by way of the index of
similarity and the MDS method. Subsequently, one can proceed from this global approach
down. The application will show that the approach allows one to zoom in as well as to zoom out
easily: the micro–macro approach. When objects are aggregated into types of objects, the type
is always situated at the centroid of all the objects of the group.
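The centroid rule at the end of the previous paragraph translates directly into code; this small helper (the names are ours) returns the position of each type of object as the mean of the display coordinates of its members.

```python
import numpy as np

def type_centroids(coords, type_labels):
    """Place each type of object at the centroid of the display coordinates of
    its member objects; type_labels holds one label per row of coords."""
    labels = np.asarray(type_labels)
    return {t: coords[labels == t].mean(axis=0) for t in np.unique(labels)}
```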
2.3. Display of important variables
The scatterplot displays the objects. In order to discover knowledge it is also important to
understand the relationships among the variables and between the variables and the objects.
For the same management reasons that not all objects are equally important, not all
variables are equally important in arriving at an optimal scatterplot. Essentially, we adapted a
non-linear, monotone regression technique to order the variables with respect to their
importance for the visual display (Kruskal, 1964; Carroll, 1972). We choose to represent each
variable as a radiating line in the scatterplot. Experience indicates that this representation suits
the user best. As the number of variables can be quite large, a selection of variables is again
mandatory in order not to overwhelm the display. Also, some variables may be important to
some groups or subgroups of objects while being less important for the overall display. Finally,
as rotation of the axes does not change the distances (the surrogate indicator for similarity), the
user can rotate the display in any direction to steer the analysis and highlight particular variables
of interest.
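The exact adaptation of the monotone regression is not spelled out in the paper, so the sketch below substitutes a simple linear "property fitting" as an illustration of the idea only: each variable is regressed on the two display coordinates, the fitted coefficients give the direction of its radiating line, and R² serves as a surrogate for its importance.

```python
import numpy as np

def variable_directions(coords, data):
    """Regress each variable (column of data, shape objects x variables) on the
    2-D display coordinates; return line directions, R^2 values and the
    variables ordered from most to least important for the display."""
    X = np.column_stack([coords, np.ones(len(coords))])     # coordinates plus intercept
    directions, r_squared = [], []
    for k in range(data.shape[1]):
        y = data[:, k]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        fitted = X @ beta
        ss_res = np.sum((y - fitted) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r_squared.append(1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0)
        directions.append(beta[:2])                         # (x, y) direction of the line
    order = np.argsort(r_squared)[::-1]                     # most important variables first
    return np.array(directions), np.array(r_squared), order
```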
An understanding of the application and an active participation of management are
prerequisites. In the same way that no map of the world shows all information in an exact way,
e.g. a Mercator map strongly misrepresents distances in the polar regions, all displays uncover
some knowledge in the database and cover up some other knowledge. Careful monitoring of the
objects, their variables, the index of similarity, the MDS method and its optimality criterion
and rotation is required at all times: no magic formula exists.
2.4. Adding new objects to the display of benchmark objects
The last phase locates additional objects within the framework set by the benchmark objects:
"[A] micro/macro design [that] enforces both local and global comparison and at the same
time, avoids the disruption of context switching. All told, exactly what is needed for reasoning
about information" (Tufte, 1990). This phase allows investigating the relationships of the
so-called leverage, outlier and extreme objects to the benchmark objects.
We use a plain hill-climbing optimisation technique to determine the optimal position of
each new object within the framework of the benchmark objects. We first calculate the vector
of similarity indexes between each new object and all the benchmark objects. We then convert
these to distances using the characteristics of the original display, e.g. the (non-)linear variable
transformations and the optimality criterion used in the original display. Finally, we locate the
new object by minimising its distance from the benchmark objects. As no display gives perfect
knowledge, again careful monitoring is required.
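A minimal sketch of this phase is given below, under the simplifying assumption that the similarity indexes to the benchmarks can be compared directly with Euclidean display distances (the paper instead reuses the variable transformations and optimality criterion of the original display). The hill-climbing itself is a plain four-direction search with a shrinking step.

```python
import numpy as np

def place_new_object(sp_to_benchmarks, benchmark_coords,
                     n_steps=2000, step=1.0, shrink=0.5):
    """Find the 2-D position whose distances to the benchmark points best match
    (least squares) the new object's similarity indexes to those benchmarks."""
    target = np.asarray(sp_to_benchmarks, dtype=float)
    pos = benchmark_coords.mean(axis=0)              # start from the centroid

    def badness(p):
        d = np.linalg.norm(benchmark_coords - p, axis=1)
        return np.sum((d - target) ** 2)

    current = badness(pos)
    for _ in range(n_steps):
        improved = False
        for move in step * np.array([[1, 0], [-1, 0], [0, 1], [0, -1]]):
            candidate = pos + move
            value = badness(candidate)
            if value < current:
                pos, current, improved = candidate, value, True
        if not improved:
            step *= shrink                           # refine the step size
            if step < 1e-6:
                break
    return pos
```

Placing a unit once for each year of data then traces its trend against the fixed benchmark, in the spirit of Fig. 5 below.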
3. Application
The application establishes the "degree of specialisation" of all 29 cardiovascular surgery
hospital departments in Belgium. The application starts from a "warehoused" database that
contains a 50% sample of all patients that received cardiology and cardiovascular surgery
services in one of the 290 hospitals in Belgium for the years 1994 up until 1998. The data are
cleaned already. The target data matrix is a table of medical procedures (variables) by
hospital (objects). The data in the matrix cells represent the proportion of insurance
reimbursements for each procedure. Fig. 1 shows the micro–macro approach. This document
covers the two top levels only. Each level uses its own management priorities. For example, a
federal financing system is only interested in the overall unit activity profile; the unit may want
to delineate specialised care programmes for organisational purposes; the physicians (not
shown here) may want to focus on care paths and the allocation of resources to individual
patients. Each level of decision-making demands its own level of detail and units of
measurement. This is true to such an extent that what may look similar on one aggregate level
may look dissimilar on a lower, more detailed level of management. On the federal level, 42
relevant medical activities were selected. A rather large group of primarily diagnostic medical
activities is grouped into one "REST" category. They are considered to be irrelevant for the
identification of the severity of the patients or the intensity of cardiology and cardiovascular
medical care.
The first phase computes the index of similarity SP_ij between each pair of hospitals on
the basis of their vectors of care: the percentage of reimbursement for each of the 42 activities.
Fig. 2 shows the index of similarity between the group of non-university cardiovascular surgery
units and the group of university cardiovascular surgery units. The index is 0.0345, which is
rather small. This basically implies that the medical activity in both types of units is pretty
much the same. The top two medical activities in the figure contribute overwhelmingly, 24%
and 16% respectively, to their dissimilarity: see right column. The proportion of insurance claims
for "myocard revascularisation by way of the arteria mamalia" (bypass surgery) is twice as
large in the non-university units (14%) as in the university units (7%): see left columns. On
the other hand, the proportion of insurance claims for open-heart surgery during the first 2
years of life is 1.7% in the university units while it is minimal (0.1%) in the non-university
units.
Fig. 1. Overall approach.
Fig. 2. Similarity index: cardiovascular surgery university/non-university centres.
Fig. 3. Cardiology and cardiovascular surgery activity in 290 hospitals in Belgium (1994–1996).

The next phase makes the display of the benchmark objects. Fig. 3 gives the display of all
290 Belgian hospitals on the basis of their total cardiology and cardiovascular surgery activity
for the years 1994–1996. This three-year period was selected as the base line. The data were
complete, thoroughly controlled and beyond any doubt. This display shows some clear-cut
clusters. The cluster of 33 hospitals/89 years clearly differentiates itself from the large cluster of 257
hospitals/741 years. More detailed analysis (not shown) reveals that this "dissimilarity" is
completely due to the cardiovascular surgery activity profile of the hospitals. The 257 hospitals/
741 years cluster shows "no relevant" cardiovascular surgery activity at all. Some dispersed
hospitals only accentuate this major difference: the 8 hospitals/15 years "start" a
cardiovascular surgery activity; the 9 hospitals/13 years "outsource" their limited
cardiovascular surgery activity to one of the main units. Finally, the 1 hospital/3 years unit is a
highly specialised university children's hospital with a prominent activity profile. Based upon
the Kruskal stress formula I optimality criterion, the fit of this display to the data is 91.6%
(Kruskal and Wish, 1978).
Fig. 4 zooms in on the group of 33 hospitals/89 years that are clear-cut cardiovascular
surgery units. The inclusion of the 1997 and 1998 data (not shown) reveals that 29 units
can be considered as cardiovascular surgery units for the period 1994–1998: two units
drastically reduced their activity while two other units merged. A selection of units is shown
by use of a code, e.g. P20. The radiating lines in Fig. 4 (from left to right in the bottom
half of the figure) indicate the eight prominent medical activities that characterise the
cardiovascular surgery units (Table 1).
Fig. 4. Cardiovascular surgery in 29 centers in Belgium (1994–1998): salient activities.
Table 1
Salient medical activities to characterise the cardiovascular surgery units (from left to right in the bottom half of Fig. 4)

  Percutaneous endovascular repair of the aorta
  Percutaneous endovascular dilation of coronary artery
  Percutaneous endovascular dilation of coronary arteries, with complications
  Electrophysiologic study for tachycardia
  Open heart surgery or arteriectomy with thoracic graft replacement (a)
  Ablation, heart, by cardiac catheter (a)
  Valvoplasty, open heart technique, more than one valve
  REST

(a) Lines on same spot.
The lines significantly facilitate knowledge discovery by allowing on-the-spot interpretation.
The lines induced the "expert" managers to characterise different schools of cardiovascular
surgery in Belgium: "complicated open heart surgery" units, "percutaneous" units, "bypass"
units, and so on. In 1999, Belgium is delineating cardiology and cardiovascular surgery care
programmes and a national peer review programme is being developed to start from 2000
onwards. Based upon the Kruskal stress formula I optimality criterion, the fit of this display
to the data is 91.8% (Kruskal and Wish, 1978).
Fig. 5. Cardiovascular surgery in 29 Benchmark units in Belgium: trend of UZL (1994–1998).

Once the benchmark and the benchmark period are established, one can add new objects to the
scatterplot. Fig. 5 traces one cardiovascular surgery unit, e.g. unit "UZL", over the period
1994–1998. The purpose of this type of application is often to look at evolutions or trends.
The benchmark context does not switch and is exactly the same as in Fig. 4. The benchmark
units are indicated by the symbols P1–P29. As expected, the five points UZL94–UZL98 move
only marginally with respect to the benchmark. This indicates that this unit did not drastically
change its activity profile from year to year. This common fact leads to one enticing result for
management purposes: it shows that one needs to recalibrate the benchmark only periodically,
e.g. every 5 years. This is exactly what one expects from a benchmark. Our experience so far
indicates that many displays are pretty stable.
The example also shows the need for zooming in and zooming out for knowledge discovery
purposes. Indeed, unit "UZL" shows a trend as it moves upward over the period 1994–1998,
but a clear break exists between the years 1994–1996 and the years 1997–1998. As a
result of a meeting with the unit in 1997, the cardiovascular surgeons involved started to code
their activity more thoroughly: a more detailed activity profile starts to show as of 1997, even
within the limited context of the 42 medical activities selected.
4. Discussion
A lot of ICT resources and energy still go to the registration, classification, warehousing,
communication, archiving, sorting, retrieving, cleaning, auditing, listing and tabulating of data.
The need to start discovering knowledge within these very large and still growing databases is
becoming a major concern. Intelligent analysis of databases must result not only in
operational, efficiency and productivity advantages for the organisation but also in new
strategic opportunities (Hand, 1997). Too often these databanks are still exploited by way of
the old toolbox of tables, averages, listings, inferential statistics and general linear models. New
insights and new approaches are needed. We anticipate these to be based on the visualisation of
data and on graphic software that allows one to zoom in and to zoom out with ease. The huge
volume of data, the speed, and the level of detail and fine shadings required for management
purposes push this evolution.
Examples are starting to be published in banking, marketing and retailing. This paper presents an
approach which has so far found some applications in health care management in Belgium.
Three of the six stages of knowledge discovery are presented: (1) the application of data-reduction
algorithms by way of the similarity index; (2) the application of data-mining algorithms by way
of the MDS scatterplot; (3) the interpretation of the mined patterns by way of the display of
salient variables and the comparison of new objects within the framework of the benchmark
established. The example of the Mercator map of the world, which dates from the 16th century,
shows that no data reduction, data mining or pattern interpretation, however, is perfect. Each
map has its strengths and weaknesses and its specific management context of application.
Nevertheless, experienced pilots remain required at all times. OR analysts can improve their skills.
References
Brodley, C.E., Lane, T., Stough, T.M., 1999. Knowledge discovery and data mining. American Scientist 87 (1), 54–61.
Carroll, J.D., 1972. Individual differences and multidimensional scaling. In: Shepard, R.N., Romney, A.K., Nerlove, S.B. (Eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, vol. 1. Seminar Press, New York.
Cressie, N., Read, T.R., 1984. Multinomial goodness-of-fit tests. J. R. Statistical Society, Series B 46, 440–464.
Cressie, N., Read, T.R., 1989. Pearson's χ² and the loglikelihood ratio statistic G²: a comparative review. Int. Statistical Review 57 (1), 19–43.
Delesie, L., Croes, L., Vanlanduyt, J., De Bisschop, I., Haspeslagh, M., 1998. Measuring Diversity: A Survey, an Approach and a Case Study for Cardio-Surgery in Belgium. Catholic University of Leuven, Leuven, Belgium.
Fisher, R.A., 1990. Statistical Methods, 14th ed. Oxford University Press, Oxford, p. 96.
Gifi, A., 1991. Non-Linear Multivariate Analysis. Wiley, Chichester, UK (2nd corr. reprint).
Hand, D.J., 1997. Intelligent data analysis: issues and opportunities. In: Liu, X., Cohen, P., Berthold, M. (Eds.), Advances in Intelligent Data Analysis: Reasoning about Data. Springer, Berlin.
Hearst, E., 1991. Psychology and nothing. American Scientist 79 (9/10), 432–443.
Krippendorf, K., 1986. Information Theory: Structural Models for Qualitative Data. Sage University Papers, Quantitative Applications in the Social Sciences, Beverly Hills, USA, pp. 87–88.
Kruskal, J.B., 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1–27.
Kruskal, J.B., Wish, M., 1978. Multidimensional Scaling. Sage University Papers, Sage Publications, London, p. 93.
Mehta, C.R., Patel, N.R., 1983. A network algorithm for performing Fisher's exact test in r × c contingency tables. J. Am. Stat. Assoc. 78 (382), 427–434.
Mehta, C.R., 1994. The exact analysis of contingency tables in medical research. Statistical Methods in Medical Research 3 (2), 177–202.
Mehta, C.R., 1995. Exact Methods for Contingency Tables and Logistic Regression. Catholic University Leuven, University Centre of Statistics, Lecture notes, October 26–27, Leuven, Belgium.
Molinero, C.M., 1998. Geometrical Approaches to Data Analysis. Department of Management, University of Southampton.
SAS Institute, 1995. The SAS System. The Corporation, Cary, NC, USA.
SPSS, 1994. Categories. The Corporation, Chicago.
SPSS, 1998. StatXact 3 for Windows, SPSS Inc., Chicago, and Cytel LogXact 2.1, Cytel Software Corporation, Cambridge, MA.
Stevens, S.S., 1951. Mathematics, measurement, and psychophysics. In: Stevens, S. (Ed.), Handbook of Experimental Psychology. Wiley, New York, pp. 1–49.
Tildesley, M.L., 1921. A first study of the Burmese skull. Biometrika 13, 176 (in this paper by Miss M.L. Tildesley, Karl Pearson proposed a measure of racial likeness (C.R.L.) for purposes of classifying skeletal remains).
Tufte, E.R., 1990. Envisioning Information. Graphics Press, Cheshire, CT, USA, p. 50.
Tufte, E.R., 1997. Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire, CT, USA.
Wiener, N., 1948. Cybernetics. MIT Press, Cambridge, MA, p. 62.
Wilks, S.S., 1935. The likelihood test of independence in contingency tables. Ann. Math. Statistics 6, 190–196.
Wishart, D., 1996. Cluster Analysis Software. Computing Laboratory, University of St. Andrews/Clustan Ltd, Scotland.
Young, F.W., Hamer, R.M. (Eds.), 1987. Multidimensional Scaling: History, Theory, and Applications. Lawrence Erlbaum Associates, Hillsdale, NJ, USA.