Disseminating statistical data by short quantified
sentences of natural language
Miroslav Hudec ([email protected])1
Keywords: data dissemination, linguistic summary, linguistic quantifier, quality of
summary
1. INTRODUCTION
Summarizing data by statistical methods is convenient, but the result is understandable
only to a rather small group of specialists [1]. Another option is linguistic summarization,
which is less terse than summarization by numbers. For example, we can say: the mean
value is 2358.42 with a standard deviation of 428.3265, or linguistically: most entities are
near the mean value, few entities are near the mean value, and the like. The latter
structure, known as a linguistic summary (LS), provides valuable summarization for a
variety of statistical data users such as businesses and journalists.
Further, LSs can be used as query conditions in data retrieval tasks [2]. An example of
such a condition is SELECT regions WHERE most of municipalities have high
unemployment rate and high ratio of arable land. Obviously, a region may fully or
partially meet the query condition, which allows us to rank regions downwards from the
best one and, moreover, to visualize the result on a thematic map by marking regions with
different hues according to their respective query matching degrees.
Generally, entities are described by attributes or dimensions. Users of data from national
statistical institutes (NSIs) may be interested in whether a particular summary, such as
most of visits from remote countries have short stay, holds [3]. Another option is mining
all relevant summaries (summaries with high validity) for a particular data set. This case
can be solved as an operational research task [4]. To illustrate this approach, Section 2
briefly explains LSs, Section 3 is dedicated to short examples and discussion, and
Section 4 concludes the paper.
2. METHODOLOGY OF LINGUISTIC SUMMARIZATION FROM THE DATA
Linguistic summaries have been developed to express relational, concise and easily
understandable knowledge about the data. The concept of LSs was introduced by
Yager [5] and further developed in, e.g., [6] and [7]. Since natural language is the best
way for people to communicate and mine information, LSs are in line with the concept of
computing with words introduced by Zadeh [8].
An LS summarizing the whole data set has the following structure: Q entities in
database are (have) S, where Q is a relative quantifier and S is a summarizer, both
expressed by linguistic terms. The validity is computed in the following way [5]:
$$v(Qx(Px)) = \mu_Q\left(\frac{1}{n}\sum_{i=1}^{n}\mu_S(x_i)\right) \qquad (1)$$
where n is the number of tuples (records) in a data set (cardinality),
$\frac{1}{n}\sum_{i=1}^{n}\mu_S(x_i)$ is the proportion of tuples in a data set that satisfy
summarizer S and $\mu_Q$ is the membership function of the chosen relative quantifier
(few, about half, most of, …).
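Equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `validity_simple` and the parameter names `mu_s` and `mu_q` are hypothetical, and the membership functions are assumed to be plain callables returning degrees in the unit interval.

```python
from typing import Callable, Sequence


def validity_simple(
    values: Sequence[float],
    mu_s: Callable[[float], float],
    mu_q: Callable[[float], float],
) -> float:
    """Validity of 'Q entities are S' following Eq. (1)."""
    if not values:
        raise ValueError("the data set must contain at least one tuple")
    # Proportion of tuples satisfying summarizer S (mean membership degree).
    proportion = sum(mu_s(x) for x in values) / len(values)
    # Degree to which that proportion matches the relative quantifier Q.
    return mu_q(proportion)


# Hypothetical usage: a crisp summarizer "value >= 10" and an identity quantifier,
# chosen only so that the returned number equals the proportion itself.
example = validity_simple(
    [5, 12, 20, 8],
    lambda x: 1.0 if x >= 10 else 0.0,
    lambda y: y,
)
```

In practice `mu_s` and `mu_q` would be fuzzy membership functions rather than the crisp placeholders used here.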
1 Faculty of Economic Informatics, University of Economics in Bratislava, Slovakia
An LS focused on a restricted part of a data set has the form Q R entities in database are
(have) S, where R is a restriction (expressed by a linguistic term). The validity is
computed in the following way [6]:
$$v(Qx(Px)) = \mu_Q\left(\frac{\sum_{i=1}^{n} t(\mu_S(x_i), \mu_R(x_i))}{\sum_{i=1}^{n}\mu_R(x_i)}\right) \qquad (2)$$
where $\frac{\sum_{i=1}^{n} t(\mu_S(x_i), \mu_R(x_i))}{\sum_{i=1}^{n}\mu_R(x_i)}$ is the
proportion of tuples in a data set that satisfy S and belong to R, t is a t-norm (often the
minimum function is used) and $\mu_Q$ is the membership function of the chosen relative
quantifier.
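A minimal sketch of Eq. (2) follows, with the minimum t-norm as the default. The name `validity_restricted` and its parameters are hypothetical, and the handling of an empty restriction (returning 0.0 when no tuple belongs to R) is an assumption of this sketch, since Eq. (2) is undefined in that case.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def validity_restricted(
    records: Sequence[T],
    mu_s: Callable[[T], float],
    mu_r: Callable[[T], float],
    mu_q: Callable[[float], float],
    t_norm: Callable[[float, float], float] = min,
) -> float:
    """Validity of 'Q R entities are S' following Eq. (2)."""
    r_degrees = [mu_r(rec) for rec in records]
    denominator = sum(r_degrees)
    if denominator == 0:
        # No tuple belongs to R, even partially. Returning 0.0 is an
        # assumption of this sketch, not something stated in the source.
        return 0.0
    # Sum of t-norm values of the degrees of satisfying S and belonging to R.
    numerator = sum(t_norm(mu_s(rec), r) for rec, r in zip(records, r_degrees))
    return mu_q(numerator / denominator)


# Hypothetical usage: each record already stores (degree for S, degree for R),
# and the quantifier is the identity so the raw proportion is returned.
records = [(1.0, 1.0), (0.5, 1.0), (0.0, 0.5), (1.0, 0.0)]
example = validity_restricted(
    records,
    mu_s=lambda rec: rec[0],
    mu_r=lambda rec: rec[1],
    mu_q=lambda y: y,
)
```

Passing a different callable as `t_norm` (for example the product) changes only how the two membership degrees are combined.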
For instance, the summarizer or restriction high pollution (HP) can be expressed as an
R-type fuzzy set (Fig. 1a):
Figure 1. Concept high pollution expressed as fuzzy set (a) and crisp set (b)
In Fig. 1a the values 50 and 60 delimit the uncertain area, i.e. the area where belonging to
the set is a matter of degree. If we apply a classical (crisp) set (Fig. 1b), then two similar
values are treated differently: a measured pollutant value of 55 mg is not considered high
pollution, whereas a value of 55.000003 mg is.
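The contrast between the two sets can be sketched as follows. The bounds 50 and 60 and the crisp threshold 55 come from the text; the linear increase between the bounds is an assumption, because the figure itself is not reproduced here, and both function names are hypothetical.

```python
def mu_high_pollution(x: float, lower: float = 50.0, upper: float = 60.0) -> float:
    """Fuzzy set 'high pollution': 0 up to `lower`, 1 from `upper` on.

    The linear ramp between the two bounds is an assumed shape.
    """
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / (upper - lower)


def crisp_high_pollution(x: float, threshold: float = 55.0) -> float:
    """Crisp set 'high pollution': full membership strictly above the threshold."""
    return 1.0 if x > threshold else 0.0


# The two nearly identical measurements discussed in the text.
fuzzy_pair = (mu_high_pollution(55.0), mu_high_pollution(55.000003))
crisp_pair = (crisp_high_pollution(55.0), crisp_high_pollution(55.000003))
```

Under these assumptions the fuzzy set assigns both measurements almost the same degree (about 0.5), whereas the crisp set jumps from 0 to 1 between them.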
The quantifier most of is a relaxation of the universal quantifier all (Fig. 2):
$$\mu_Q(y) = \begin{cases} 0, & y \le 0.6 \\ \dfrac{y - 0.6}{0.25}, & 0.6 < y < 0.85 \\ 1, & y \ge 0.85 \end{cases}$$
where y stands for the proportion in (1) and (2). The domain of a relative quantifier is
the unit interval.
Figure 2. Linguistic quantifier most of
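The quantifier most of, with the parameters 0.6 and 0.85 given above, can be written directly as a piecewise function. The function name `mu_most_of` and the input check are illustrative additions, not part of the paper.

```python
def mu_most_of(y: float) -> float:
    """Membership degree of proportion y in the relative quantifier 'most of'."""
    if not 0.0 <= y <= 1.0:
        # A relative quantifier is defined on the unit interval only.
        raise ValueError("proportion must lie in [0, 1]")
    if y <= 0.6:
        return 0.0
    if y >= 0.85:
        return 1.0
    # Linear increase between the two parameters (0.85 - 0.6 = 0.25).
    return (y - 0.6) / 0.25
```

For example, a proportion of 0.5 yields validity 0, a proportion of 0.9 yields validity 1, and proportions between 0.6 and 0.85 yield intermediate degrees.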
3. ILLUSTRATIVE EXAMPLES AND DISCUSSION
This section illustrates the approach suggested in Section 2 and provides further discussion.
3.1. Illustrative examples
Illustrative example 1 A user wishes to get regions where most of municipalities have
small altitude above sea level. Parameters expressing the fuzzy set small altitude are
mined from the data. The result is shown in Table 1. Table 1 shows that the regions
Bratislava, Trnava and Nitra fully meet the query condition, whereas the region Banská
Bystrica is more hilly than flat. Two regions are not selected because they do not meet
the query condition. The result corresponds with the map of the Slovak Republic.
Table 1: Retrieved regions
Region             Validity of the summary
Bratislava         1
Trnava             1
Nitra              1
Trenčín            0.7719
Košice             0.6314
Banská Bystrica    0.2116
In addition, this way of dissemination keeps sensitive data undisclosed, because LSs are
calculated from data on the lower level (microdata), but the result is aggregated to the
respective higher levels [9].
Illustrative example 2 An agency wishes to know which summaries explain the length of
visits of tourists from remote countries. The attribute length is divided into three
overlapping granules: short, medium and long. The term set for the relative quantifier
consists of the terms few, about half and most of. Hence, we should evaluate nine
sentences. Construction of the sets short, medium and long for the attribute length
depends on the particular categorization or the user’s preferences, which are not further
examined due to limited space. A possible answer is shown in Table 2. We see that short
and long stays dominate, with few medium-length stays.
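Evaluating the nine sentences amounts to combining every quantifier with every summarizer in Eq. (1). The sketch below is purely illustrative: only the parameters of most of (0.6 and 0.85) come from the paper, whereas the parameters of few and about half, the granules short, medium and long, and the list `stays` are hypothetical placeholders, so the resulting numbers do not reproduce Table 2.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Membership = Callable[[float], float]


def trapezoid(a: float, b: float, c: float, d: float) -> Membership:
    """Trapezoidal membership function with support [a, d] and core [b, c]."""
    def mu(x: float) -> float:
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


# Relative quantifiers on the unit interval.
quantifiers: Dict[str, Membership] = {
    "few": trapezoid(0.0, 0.0, 0.1, 0.3),           # hypothetical parameters
    "about half": trapezoid(0.3, 0.45, 0.55, 0.7),  # hypothetical parameters
    "most of": trapezoid(0.6, 0.85, 1.0, 1.0),      # parameters from the paper
}

# Overlapping granules for length of stay in days (all hypothetical).
summarizers: Dict[str, Membership] = {
    "short": trapezoid(0.0, 0.0, 3.0, 5.0),
    "medium": trapezoid(3.0, 5.0, 8.0, 10.0),
    "long": trapezoid(8.0, 10.0, 365.0, 365.0),
}

# Hypothetical lengths of stay, used only to make the sketch runnable.
stays = [1, 2, 2, 3, 4, 6, 9, 12, 14, 21]


def evaluate_all(
    values: Sequence[float],
    quants: Dict[str, Membership],
    summs: Dict[str, Membership],
) -> Dict[Tuple[str, str], float]:
    """Validity of every 'Q visits are of S stay' sentence via Eq. (1)."""
    results = {}
    for (q_name, mu_q), (s_name, mu_s) in product(quants.items(), summs.items()):
        proportion = sum(mu_s(x) for x in values) / len(values)
        results[(q_name, s_name)] = mu_q(proportion)
    return results


validities = evaluate_all(stays, quantifiers, summarizers)
```

The dictionary `validities` holds one entry per (quantifier, summarizer) pair, which is the shape of the result reported in Table 2.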
3.2. Discussion
LSs are able to capture the vagueness or semantic uncertainty of analysed phenomena by
fuzzy sets and to present results in an understandable way. A linguistically summarized
sentence can be read out by a text-to-speech synthesis system, which is a valuable option
when visual attention should not be disturbed, as well as for people with disabilities.
Although LSs are applied at the final stages of statistical data production, they could
improve data collection through tailored motivation of respondents [10]. For example, we
could offer sophisticated LSs to businesses that cooperate extensively in surveys.
Businesses are often interested in summarized information rather than long sheets of
data. This holds especially for smaller businesses, which cannot afford data mining
specialists. By this approach we can mitigate the paradox explained by Ross [11]: “We find
that a paradox is steadily developing in a rapidly changing world, in that statistical users
are becoming ever more demanding for timely data, but are less willing to provide their
own data to statistical institutes”. This paradox presumably arises from the fact that
respondents cooperate in many official surveys, but often cannot easily find relevant
information extracted from databases on NSI data portals.
Table 2: Summaries and their respective validities
LS                                                            Validity
few visits from remote countries are of short stay            0.1472
few visits from remote countries are of medium stay           0.8575
few visits from remote countries are of long stay             0
about half visits from remote countries are of short stay     0.8528
about half visits from remote countries are of medium stay    0.1425
about half visits from remote countries are of long stay      1
most of visits from remote countries are of short stay        0
most of visits from remote countries are of medium stay       0
most of visits from remote countries are of long stay         0
4. CONCLUSION
Linguistic summaries play a pivotal role in summarizing information from the data when
uncertainty related to the semantic meaning of the phenomena cannot be neglected. In this
paper we have explored possibilities for applying LSs in statistical data dissemination,
because linguistically summarized information is understandable to a wide range of
statistical data users. Furthermore, a linguistically summarized sentence can be read out
by a text-to-speech synthesis system, which benefits people with disabilities and users
whose visual attention is focused on something else. In addition, when summarization is
focused on territorial units, validities of summaries can be visualised on thematic maps by
different hues of the selected colour. Finally, this novel way of data dissemination could
motivate respondents to cooperate in surveys.
Future tasks should focus on adjusting quality measures of LSs to the particularities of
statistical data, analysing dissemination needs, summarizing from SDMX data cubes and
developing a tool. These tasks can be solved in cooperation between NSIs’ data
dissemination units and scientists working in this field.
REFERENCES
[1] Yager, R.R., Ford, M., Canas, A.J.: An approach to the linguistic summarization of data. In:
3rd International Conference of Information Processing and Management of Uncertainty in
Knowledge-based Systems (IPMU 1990), pp. 456-468, Paris (1990).
[2] Hudec, M.: Fuzziness in Information Systems. Springer International Publishing,
Switzerland (2016).
[3] Mišút, M., Hudec, M.: Linguistically summarizing mobile positioning data managed in the
STAR scheme. In review.
[4] Liu, B.: Uncertain logic for modeling human language. J. Uncertain Syst. 5, 3-20 (2011).
[5] Yager, R.R.: A new approach to the summarization of data. Inf. Sciences 28, 69-86 (1982).
[6] Rasmussen, D., Yager, R.R.: Summary SQL - A Fuzzy Tool for Data Mining. Intell.
Data Analysis 1, 49-58 (1997).
[7] Kacprzyk, J., Zadrożny, S.: Protoforms of linguistic database summaries as a human
consistent tool for using natural language in data mining. International Journal of Software
Science and Computational Intelligence 1, 100-111 (2009).
[8] Zadeh, L.A.: From computing with numbers to computing with words - from
manipulation of measurements to manipulation of perceptions. In: Wang, P. (Ed.), Computing
with Words, pp. 35-68. John Wiley & Sons, New York (2001).
[9] Hudec, M.: Fuzzy database queries in official statistics: Perspective of using linguistic
terms in query conditions. Statistical Journal of the IAOS 29(4), 315-323 (2013).
[10] Hudec, M., Torres Van Grinsven, V.: Business’ participants motivation in official
surveys by fuzzy logic. In: 1st Eurasian Multidisciplinary Forum (EMF 2013), Tbilisi, 24-26
October, Vol. 3, pp. 42-52 (2013).
[11] Ross, M.P.: Official Statistics in Malta - implications of Membership of the European
Statistical System for a small country/NSI. 95th DGINS Conference (2009).