Multi-Dimensional Data Presentation
Karl Glaser, Roche Colorado Corporation, Boulder, Colorado
ABSTRACT
Human beings cannot see associations in data beyond two or three dimensions. In this paper, simple SAS tools will
be used to display relationships in data in three, four, and five dimensions, with some interesting and perhaps
surprising results.
INTRODUCTION
It’s a fact! Humans cannot discern relationships in data beyond two or three dimensions. One of the jobs of the
statistician, then, is to present multi-dimensional data in a form amenable to human interpretation. In this paper, data
from different sources are obtained and merged to form a single data set; variables from this set are then compared
and displayed in dimensions increasing from one to five, so that associations may easily be seen.
METHODS
Data were obtained from three different sources via the Internet, in three different formats: one set was an EXCEL
spreadsheet, one was a space-delimited .TXT file, and the third was in an unknown word processing output format.
All were converted to EXCEL spreadsheets, and the data were examined to standardize the country names, which
serve as the MERGE BY variable. The spreadsheets were then uploaded into SAS using the ACCESS procedure
and match-merged by country, giving a single data set containing a number of variables of interest.
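The standardize-then-merge step can be sketched as follows. This is a minimal illustration, not the author's actual code: the data set names GDP_RAW, FREE_RAW, and UNIFIED_DEMO, the fictional country names, and every value in the DATALINES blocks are invented for the example. Only the variable names ID, GDP, and PSCORE come from the listings in this paper.

```sas
/* Toy inputs: fictional countries and made-up values, for illustration only */
DATA GDP_RAW;
   LENGTH ID $40;
   INFILE DATALINES DSD;          /* comma-delimited, so names may contain blanks */
   INPUT ID $ GDP;
   DATALINES;
Ruritania,12000
Kingdom of Freedonia,3000
;
RUN;

DATA FREE_RAW;
   LENGTH ID $40;
   INFILE DATALINES DSD;
   INPUT ID $ PSCORE;
   DATALINES;
ruritania,2.5
Freedonia,6.0
;
RUN;

/* Standardize the BY variable: trim, upper-case, and map known variants */
DATA GDP_STD;
   SET GDP_RAW;
   ID = UPCASE(STRIP(ID));
   IF ID = 'KINGDOM OF FREEDONIA' THEN ID = 'FREEDONIA';
RUN;

DATA FREE_STD;
   SET FREE_RAW;
   ID = UPCASE(STRIP(ID));
RUN;

/* A match-merge requires both inputs to be sorted by the BY variable */
PROC SORT DATA=GDP_STD;  BY ID; RUN;
PROC SORT DATA=FREE_STD; BY ID; RUN;

DATA UNIFIED_DEMO;
   MERGE GDP_STD (IN=IN_GDP) FREE_STD (IN=IN_FREE);
   BY ID;
   /* Flag countries present in only one source, a sign of a name mismatch */
   ONE_SOURCE_ONLY = NOT (IN_GDP AND IN_FREE);
RUN;

PROC PRINT DATA=UNIFIED_DEMO; RUN;
```

Printing the flagged observations before trusting the merge is a cheap way to catch country names that were spelled differently in the different sources.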
ONE DIMENSION – PER CAPITA GROSS DOMESTIC PRODUCT
The per capita Gross Domestic Product (GDP) data were obtained from the CIA website (1), converted to an EXCEL
spreadsheet, then uploaded into SAS using PROC ACCESS. A histogram plus some summary data were
developed using the UNIVARIATE procedure, then sent to Microsoft WORD as a Rich Text Format (.RTF) File using
the SAS Output Delivery System (ODS). The relevant code appears below:
ODS LISTING CLOSE;
ODS RTF FILE="U:\SAS_RSLT\WUSS1.RTF" NEWFILE=PROC;
GOPTIONS DEVICE = SASEMF;
TITLE2 '1D - Per Capita Gross Domestic Product';
PROC UNIVARIATE DATA=UNIFIED NOPRINT;
   VAR GDP;
   /* Plot histogram, overlay EXP distribution */
   HISTOGRAM / EXPONENTIAL MIDPOINTS= 1000 TO 70000 BY 2000;
   INSET MEAN MEDIAN MODE / CFILL=BLANK CSHADOW=GRAY HEADER = 'Summary Statistics' POS=NE;
   INSET EXPONENTIAL / CFILL=BLANK CSHADOW=GRAY HEADER = 'Reference Distribution' POS=NW;
RUN;
The histogram, with the overlaid exponential distribution and the summary statistics, appears below:
[Figure: histogram titled "1D - Per Capita Gross Domestic Product"; vertical axis: Percent (0 to 25); horizontal axis: GDP ($), midpoints labeled from 1000 to 67000.]
As expected, the histogram shows a few rich countries, many poor ones, and a reasonable fit to an exponential falloff in the number of countries with increasing GDP. A single, easily understood graph conveys a vast amount
of data (from 232 countries).
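The listing in this section opens the RTF destination but does not show it being closed. ODS finishes writing the file only when the destination is closed, so a session would normally end along these lines (a sketch; whether the author's program did exactly this is an assumption):

```sas
ODS RTF CLOSE;    /* finish writing the .RTF file */
ODS LISTING;      /* reopen the default listing destination */
```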
TWO DIMENSIONS – ECONOMIC AND POLITICAL FREEDOM
Political Freedom data were obtained from Freedom House (2). Economic Freedom data were obtained from The
Heritage Foundation (3). In order to see if an association exists between political and economic freedom, a
scatterplot was made using the GPLOT procedure. Although strictly speaking linear regression cannot be
legitimately performed on data of these types (since the data are not continuous), a “best fit” straight line is
superimposed on the scatterplot, to give a sense of the association. The brief, relevant code is seen below:
AXIS5 LABEL=(J=C '<--- More          Economic Freedom          Less --->');
AXIS6 LABEL=(J=C A=90 '<--- More          Political Freedom          Less --->')
      ORDER=(0.5 TO 7.0 BY 0.5) MINOR=NONE;
SYMBOL1 I=NONE C=BLACK VALUE=PLUS;   /* Black plus sign */
SYMBOL2 I=RL C=BLUE;                 /* Best fit straight line */
PROC GPLOT DATA=UNIFIED;
   TITLE2 '2D - Political Freedom vs Economic Freedom';
   PLOT PSCORE*ESCORE=1 PSCORE*ESCORE=2 / FRAME OVERLAY HAXIS=AXIS5 VAXIS=AXIS6;
RUN;
The scatterplot was then output as an .RTF file using ODS as above, with the following result:
[Figure: scatterplot titled "2D - Political Freedom vs Economic Freedom"; vertical axis: Political Freedom score, 0.5 to 7 (more free at the bottom); horizontal axis: Economic Freedom score, 1 to 5 (more free at the left).]
From a quick glance at the graph (containing data from 192 countries), it is clear that there is an association
between economic and political freedom, but it is also evident that there are significant outliers. Neither of these
observations would be immediately apparent from looking at the lists of numerical data.
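Because the freedom scores are not continuous, a rank-based measure of association is one alternative to the least-squares line. The sketch below is not part of the original analysis; it assumes the UNIFIED data set and the PSCORE and ESCORE variables used in the listing above.

```sas
/* Spearman rank correlation between the two freedom scores */
PROC CORR DATA=UNIFIED SPEARMAN NOSIMPLE;
   VAR PSCORE ESCORE;
RUN;
```

The SPEARMAN option works on the ranks of the scores, so it does not depend on the spacing between score values.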
THREE DIMENSIONS – PER CAPITA GDP VS ECONOMIC AND POLITICAL FREEDOM
It might be interesting to see if an association exists between the per capita GDP and political and economic
freedom. To this end, a bubbleplot was prepared using the BUBBLE statement in PROC GPLOT. In this plot, the
area of the bubble is proportional to the per capita GDP. The relevant code is seen below:
PROC GPLOT DATA=UNIFIED;
TITLE2 '3D - GDP vs Economic and Political Freedom';
BUBBLE PSCORE*ESCORE=GDP / FRAME HAXIS=AXIS5 VAXIS=AXIS6;
RUN;
The resulting bubbleplot, sent to Microsoft WORD as an .RTF file using ODS as above, is seen below:
[Figure: bubbleplot titled "3D - GDP vs Economic and Political Freedom"; bubble area: per capita GDP; vertical axis: Political Freedom score, 0.5 to 7 (more free at the bottom); horizontal axis: Economic Freedom score, 1 to 5 (more free at the left).]
It is apparent that there is an association between economic and political freedom and per capita GDP, as seen from
the cluster of large bubbles at the lower left. However, there are clearly some outliers. It would be
virtually impossible to discern these facts by visually comparing lists of numerical data; with just four lines of
SAS code, a graph was produced that makes them obvious.
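PROC GPLOT scales bubbles by area by default, which is what makes the bubble area proportional to per capita GDP. The sketch below states that choice explicitly with BSCALE=AREA and enlarges the bubbles with BSIZE=. The value 10 is an arbitrary assumption, and the AXIS5 and AXIS6 definitions from the 2D listing are assumed to be in effect.

```sas
PROC GPLOT DATA=UNIFIED;
   TITLE2 '3D - GDP vs Economic and Political Freedom';
   /* BSCALE=AREA: bubble area (not radius) tracks GDP */
   /* BSIZE=: overall size multiplier for the bubbles  */
   BUBBLE PSCORE*ESCORE=GDP / FRAME HAXIS=AXIS5 VAXIS=AXIS6
                              BSCALE=AREA BSIZE=10;
RUN;
QUIT;
```

With BSCALE=RADIUS instead, a country with four times the GDP would be drawn with sixteen times the area, which visually exaggerates the differences.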
FOUR DIMENSIONS – PER CAPITA GDP, ECONOMIC AND POLITICAL FREEDOM, WITH LABELS
In the above graph, it is seen that countries with the highest per capita GDP tend to be also those with the most
political freedom and the most economic freedom (the lower left). There are some dramatic outliers, and these
outliers could prove to be very interesting.
To label the outliers, an ANNOTATE data set was constructed, which contains the names of the countries which are the
outliers, plus some others for reference. The relevant code is seen below:
DATA ANNO1;   /* Create an ANNOTATE data set */
SET UNIFIED;
IF (((PSCORE > 4) AND (ESCORE < 2.5) AND (ESCORE NE .))    /* Politically unfree, economically free */
OR ((PSCORE > 5) AND (ESCORE > 2.5) AND (GDP > 20000))     /* Politically unfree, economically unfree, big GDP */
OR ((PSCORE > 6.6 ) AND (GDP > 8000))                      /* Dictatorship, big GDP */
OR (ID IN('JAPAN' 'SOUTH KOREA' 'CROATIA' 'TRINIDAD AND TOBAGO'
'VENEZUELA' 'IRAN' 'CHINA'
'UNITED STATES' 'IRELAND')));
The resulting graph containing the labels is seen below:
[Figure: bubbleplot titled "4D - GDP vs Economic and Political Freedom, plus Labels"; bubble area: per capita GDP; selected countries labeled; vertical axis: Political Freedom score, 0.5 to 7 (more free at the bottom); horizontal axis: Economic Freedom score, 1 to 5 (more free at the left).]
In this graph, labeling the outliers allows a better picture to emerge. The most extreme of these (Singapore, furthest
away from neighboring points) is economically one of the most free places on earth, although the same cannot be
said about its political freedom. Singapore makes “its” money from mercantilism and manufacture for export.
A common theme among a large percentage of the outliers (i.e., large GDP, small freedom) may be oil.
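The ANNOTATE listing in this section shows only the subsetting IF that selects the countries to label. The statements that turn each selected observation into a text label are not shown in the paper. A minimal sketch of how they might look, using standard ANNOTATE variables, follows. The data set name ANNO_DEMO, the shortened country list, and the POSITION and SIZE values are assumptions made for illustration.

```sas
DATA ANNO_DEMO;
   LENGTH FUNCTION $8 TEXT $40 XSYS YSYS POSITION $1;
   SET UNIFIED;
   /* Subsetting IF: keep only the countries to be labeled (shortened list) */
   IF ID IN ('JAPAN' 'UNITED STATES' 'IRELAND');
   XSYS = '2';            /* X is in data units of the horizontal axis */
   YSYS = '2';            /* Y is in data units of the vertical axis   */
   X = ESCORE;
   Y = PSCORE;
   FUNCTION = 'LABEL';    /* draw text */
   TEXT = ID;
   POSITION = '3';        /* text starts at the point, one cell above it */
   SIZE = 1;
   KEEP FUNCTION TEXT XSYS YSYS X Y POSITION SIZE;
RUN;

/* Assumes the AXIS5 and AXIS6 definitions from the 2D listing are in effect */
PROC GPLOT DATA=UNIFIED;
   BUBBLE PSCORE*ESCORE=GDP / FRAME HAXIS=AXIS5 VAXIS=AXIS6 ANNOTATE=ANNO_DEMO;
RUN;
QUIT;
```

Using the data coordinate system ('2') for both XSYS and YSYS ties each label to its bubble, so the labels stay in place if the axis ranges change.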
FIVE DIMENSIONS – PER CAPITA GDP, ECONOMIC AND POLITICAL FREEDOM, OIL IMPORT/EXPORT STATUS, LABELS
To investigate the influence of oil on this relationship, the ANNOTATE data set was enhanced to add color to the
labels, indicating net importers or exporters of oil. The oil import/export data are from the CIA website (1). The
relevant code is listed below:
IF (IMP_EXP = 'EXPORT') THEN COLOR='BLACK';
ELSE IF (IMP_EXP = 'IMPORT') THEN COLOR='RED';
ELSE COLOR='BLUE';
The label will be black if the country is a net exporter of oil, red if a net importer, and blue if it is neither a net importer
nor a net exporter.
The resulting graph is seen below:
[Figure: bubbleplot titled "5D - GDP vs Economic and Political Freedom, plus Labels and Colors"; bubble area: per capita GDP; label color: oil import/export status; vertical axis: Political Freedom score, 0.5 to 7 (more free at the bottom); horizontal axis: Economic Freedom score, 1 to 5 (more free at the left).]
The interrelationships become immediately obvious: countries with little political and economic freedom (upper
right) that also have large per capita GDPs are net oil exporters.
FIVE DIMENSIONS – OIL IMPORT/EXPORT, ECONOMIC AND POLITICAL FREEDOM, LABELS
In looking at the above graph, another question suggests itself: “What is the relationship between the actual amounts of
oil imported or exported and political and economic freedom?” To investigate this relationship, another
ANNOTATE data set was constructed and another plot produced. In this plot, oil imports are represented by
dashed circles and oil exports by solid ones. Only the largest importers and exporters are included, since they
account for most of the world’s oil transport. Unfortunately, not all of the labels fit on the graph without overlapping,
so a few were excluded. The relevant code is included below:
PROC GPLOT DATA=UNIFIED;
TITLE2 '5D - Oil vs Economic and Political Freedom, plus Labels and Colors';
BUBBLE PSCORE*ESCORE=NET / FRAME HAXIS=AXIS5 VAXIS=AXIS6 ANNOTATE=ANNO3;
RUN;
The resulting graph is seen below:
[Figure: bubbleplot titled "5D - Oil vs Economic and Political Freedom, plus Labels and Colors"; bubble size: net oil imports (dashed) or exports (solid); vertical axis: Political Freedom score, 0.5 to 7 (more free at the bottom); horizontal axis: Economic Freedom score, 1 to 5 (more free at the left).]
The graph should include labels for the big net oil importers Germany, Spain, France, and Italy in the lower left;
these labels were excluded to prevent overlapping. It can easily be seen that the largest oil exporters tend
to be in the upper right (i.e., mostly unfree) and the largest oil importers tend to be in the lower left (i.e., mostly
free).
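The listing in this section plots a variable named NET, whose construction is not shown. One plausible construction, offered only as a sketch: if the merged data held oil exports and imports in the hypothetical variables OIL_EXP and OIL_IMP, NET could be their difference. PROC GPLOT draws bubbles for negative values with dashed lines and for positive values with solid lines, which matches the description of importers and exporters given above. The data set name OIL_DEMO and the CUTOFF value used to keep only the largest traders are likewise assumptions.

```sas
%LET CUTOFF = 1000000;   /* hypothetical threshold, in the units of OIL_EXP and OIL_IMP */

DATA OIL_DEMO;
   SET UNIFIED;
   NET = OIL_EXP - OIL_IMP;           /* > 0: net exporter, < 0: net importer */
   IF ABS(NET) >= &CUTOFF;            /* keep only the largest net traders    */
RUN;

/* Assumes the AXIS5 and AXIS6 definitions from the 2D listing are in effect */
PROC GPLOT DATA=OIL_DEMO;
   BUBBLE PSCORE*ESCORE=NET / FRAME HAXIS=AXIS5 VAXIS=AXIS6;
RUN;
QUIT;
```

Observations with a missing NET fail the subsetting IF and are dropped, so countries without oil data do not appear in the plot.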
CONCLUSION
A number of conclusions may be drawn:
- There exists a vast amount of data, easily accessed through the Internet, which is amenable to analysis
  using SAS.
- Associations between variables in many dimensions may be seen using simple SAS tools, with a minimum
  of time and effort.
- There is an association between political and economic freedom. The association contains quite a bit of
  scatter, and is thus not perfect.
- There is an association between political and economic freedom and per capita GDP, a measure of the
  wealth of a country (although not necessarily of its people). More freedom is generally associated with
  greater wealth.
- Oil is a factor in most of the outliers; unfree countries with large GDPs are generally oil exporters.
- Big oil exporters tend, though not always, to be unfree. Big oil importers tend, though not always, to be free.
REFERENCES
(1) CIA Website http://www.cia.gov
(2) Freedom House website http://www.freedomhouse.org
(3) Heritage Foundation website http://www.heritage.org
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Karl Glaser
Roche Colorado Corporation
2075 N. 55th Street
Boulder, Colorado 80301
Work Phone: 303 938 6348
Fax:303 938 6590
Email:[email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.