Extended NTTS 2017 abstract

Profiling Users by Estimating Composite
and Multi-valued Attributes from Big Data Sources for
Social Statistics Purposes
Jacek Maślankowski ([email protected])1
Keywords: Big Data, Social Statistics, Text Mining, Web Mining, Social Mining
1. INTRODUCTION
In recent years the value of big data in statistics has increased with the
development of useful tools and methods that ensure the reliability of the results of
analysis. In particular, the comparison of social media sentiment analysis results with
other indicators, such as consumer confidence, looks promising [1]. However, there are
still gaps in the methodology, especially in terms of the representativeness of data
scraped from the web. It is therefore necessary to provide information on the population
with detailed attributes that can be extracted from the data or at least estimated. Those
attributes can be divided into composite (e.g., address, name) and multi-valued
(e.g., phone numbers). The problem of the low quality of such attributes has long been
known, and several methods for increasing their quality have been proposed [2].
The goal is to incorporate such methods into big data methodology to ensure high
data quality. This is usually referred to as the reliability of the data, in terms of
completeness, accuracy, consistency and integrity [3].
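To make the distinction concrete, the following minimal Python sketch models a composite attribute (an address with sub-fields) and a multi-valued attribute (a list of phone numbers), and computes a simple completeness score, one of the quality dimensions listed above. The class names, field names and the scoring rule are illustrative assumptions and are not taken from the methodology of the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    """Composite attribute: one logical value made of several components."""
    street: Optional[str] = None
    city: Optional[str] = None
    postal_code: Optional[str] = None


@dataclass
class Profile:
    """Hypothetical statistical unit with a composite and a multi-valued attribute."""
    name: Optional[str] = None
    address: Address = field(default_factory=Address)
    phone_numbers: List[str] = field(default_factory=list)  # multi-valued


def completeness(profile: Profile) -> float:
    """Share of elementary slots that are filled (assumed scoring rule)."""
    slots = [
        profile.name,
        profile.address.street,
        profile.address.city,
        profile.address.postal_code,
        profile.phone_numbers or None,  # an empty list counts as missing
    ]
    filled = sum(1 for value in slots if value)
    return filled / len(slots)
```

A profile with only a name, a city and at least one phone number fills three of the five slots under this rule; other weightings (for example, counting each phone number separately) would be equally defensible.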
The aim of the paper is to present a methodology for extracting data from big data
sources for social statistics purposes. Because it is quite difficult to profile users on
most web pages, a methodology for estimating attributes has been proposed. It
assumes that, in some circumstances, the set of attributes used to describe a
statistical unit can be estimated from the content of the text being analysed. The paper
presents both the proposed methodology and the results of a case study that applies
it. The case study, which uses web page data, confirms the need for attribute
estimation in big data analysis.
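As an illustration of what estimating an attribute from text content can look like, the sketch below guesses a user's region from place names mentioned in their comments. It is a deliberately simple keyword approach under stated assumptions: the gazetteer, the function name and the `min_mentions` threshold are hypothetical, and this is not the estimation procedure used in the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical gazetteer mapping place names to regions (illustrative only).
GAZETTEER: Dict[str, str] = {
    "gdańsk": "Pomorskie",
    "gdynia": "Pomorskie",
    "kraków": "Małopolskie",
}


def estimate_region(texts: Iterable[str], min_mentions: int = 2) -> Optional[str]:
    """Return the most frequently mentioned region, or None if evidence is weak."""
    counts: Counter = Counter()
    for text in texts:
        for token in text.lower().split():
            # Strip surrounding punctuation before the gazetteer lookup.
            region = GAZETTEER.get(token.strip(".,;:!?()\"'"))
            if region:
                counts[region] += 1
    if not counts:
        return None
    region, mentions = counts.most_common(1)[0]
    return region if mentions >= min_mentions else None
```

Returning `None` when there are too few mentions keeps the attribute explicitly missing rather than filling it with a weak guess, which matters when the estimates feed a quality assessment.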
The hypotheses of this paper are as follows: H1: The reliability of the data can be
increased by estimating values for specific entities; H2: The limited representativeness
of web data does not allow it to be applied directly for social statistics purposes.
Therefore text mining and machine learning tools cannot be applied without knowing
what type of entities will be used for further analysis. This leads to the conclusion that
selecting proper entities and attributes from big data sources makes it possible to
enhance social statistics surveys.
1 Department of Business Informatics, Faculty of Management, University of Gdańsk, Poland;
Central Statistical Office, Statistical Office in Gdańsk, Poland
2. METHODS
A simple form of big data analysis can be based on the MapReduce paradigm, a
programming model that research and industrial communities find easy to implement [4].
However, this model does not provide most of the tools needed to extract an
entity with all its attributes from an unstructured data source. Typical MapReduce
algorithms, including WordCount and regular expression matching, analyse the
text without separating the results into the different attributes of the entities. Although
it is possible to apply text mining and machine learning tools to increase the value of
the results, new methods need to be developed according to the requirements of the
particular survey.
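The limitation described above can be sketched in plain Python, without a cluster. The first function is a classic WordCount written in map and reduce steps; the second keys its output by (entity, attribute) so that extracted values stay attached to a user. The input records, the toy phone number pattern and the function names are assumptions made for illustration only.

```python
import re
from collections import defaultdict
from functools import reduce

# Illustrative input: (user_id, text) pairs; the records are made up.
RECORDS = [
    ("u1", "Call me at 555-0101 or 555-0102, I live in Gdansk"),
    ("u2", "Moved to Sopot last year"),
]

PHONE = re.compile(r"\b\d{3}-\d{4}\b")  # assumed toy phone format


def word_count(records):
    """Classic WordCount: map every word to 1, then reduce by key."""
    mapped = (
        (word.lower(), 1)
        for _, text in records
        for word in re.findall(r"[A-Za-z]+", text)
    )

    def reducer(acc, pair):
        acc[pair[0]] += pair[1]
        return acc

    return dict(reduce(reducer, mapped, defaultdict(int)))


def phones_per_user(records):
    """Attribute-aware variant: group values under (entity, attribute) keys."""
    grouped = defaultdict(list)
    for user_id, text in records:
        for number in PHONE.findall(text):
            grouped[(user_id, "phone")].append(number)
    return dict(grouped)
```

`word_count` discards the link between a word and the user who wrote it, whereas `phones_per_user` keeps a multi-valued attribute per entity; the same keying idea carries over to a distributed engine.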
Before starting the analysis, the first step is to assess the readiness of the data
source [6]. Although several different methods have been proposed, there is still no
framework that can be applied to any data source.
Therefore we propose a method, based on text mining techniques, for profiling users
from social media and other data sources (web pages in the case study).
Methods for analysing social media data are sometimes known as social
big data [5] or social mining. The paper also uses the subclass of text
mining known as web mining. In particular, this makes it possible to identify
demographic attributes from human-generated information [7]. User
profiling in this paper means using this information for social statistics purposes.
The data were extracted using algorithms implemented in Python on Apache
Spark. The survey population was a selected group of social media users
and people who comment on selected web portals. Because of the accessibility of its
API, Twitter was chosen for the social media part of the case study. Web
scraping was used as a second method, for accessing data from public news portals.
For this purpose, machine learning algorithms were prepared and tested. It can also
be concluded that current machine learning algorithms do not fulfil all the expectations
for big data analysis [8].
The paper concentrates on the analysis of unstructured data. Such data account for up
to approximately 95% of the sources used for big data processing [9]. This is an effect
of the data-driven character of methods that apply big data tools [10]. Therefore the
decision was to develop a framework for gathering composite and multi-valued attributes
from unstructured datasets. However, to verify the results of the analysis, structured
and semi-structured data were used to confirm the first hypothesis of the paper.
3. RESULTS
The main finding of the paper is a proposed set of combined methods for extracting
user profiles from both social media and web pages. The results of the case study,
which was in fact a survey conducted on social media and web pages, show that several
useful and reliable attributes can be extracted to enhance social statistics surveys.
These include, in particular, social confidence and intention to vote. In addition, a new
phenomenon such as media education can also be analysed. The results are presented
using geographic and demographic attributes of the entities.
In the survey, the entity is a person who is active on social media or who
comments on various events in the country. Although there is noise in the
data, and opinions can in some circumstances be confusing for algorithms, valuable
and reliable information can be extracted. However, the general conclusion from the
analysis is that machine learning algorithms have to be modified as data patterns
in the data source change. Therefore the testing phase is repeated several times at
regular intervals.
In the paper, three different cases were analysed to enhance social
statistics: intention to vote (mostly covered in statistics from the OECD [11]), media
education (how much people trust the media), and social confidence. The framework
presented in the paper can also be applied to other social statistics purposes, taking
into account the specifics of the data source.
4. CONCLUSIONS
The goal of the paper has been achieved and the results of the analysis using the
framework were presented. The survey conducted using big data tools makes it possible
to draw conclusions on the applicability of big data sources to social statistics,
especially in terms of the quality of the data and the availability of methods to increase
data quality. First, the data sources are very noisy, which is well known. There
are still no methods that cleanse them and provide reliable information regarding
the attributes of the entities being analysed.
Hypothesis H1 has been confirmed by comparing the results of the analysis with data
from official statistics. Although there are differences between the results from
traditional surveys and from big data sources, the changes over time are correlated in
both sources. It has to be noted that the big data source covers a larger population than
traditional surveys. On the other hand, the population represented in big data sources is
limited to active social media users and people who leave comments on web pages. This
means that some differences in the results have to be expected. It confirms the second
hypothesis, which refers to the representativeness of alternative data sources in social
statistics, such as big data.
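The kind of check described here, where levels differ between sources but changes over time move together, can be sketched as a Pearson correlation of period-to-period differences. The function names are assumptions, and the two series are placeholder values chosen for illustration; they are not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def changes(series: Sequence[float]) -> List[float]:
    """Period-to-period differences of a time series."""
    return [b - a for a, b in zip(series, series[1:])]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values, NOT results from the paper.
official = [50.0, 52.0, 51.0, 55.0]   # e.g. an indicator from a traditional survey
big_data = [70.0, 74.0, 72.0, 80.0]   # e.g. the same indicator from web data
r = pearson(changes(official), changes(big_data))
```

In this toy input the two series sit at different levels, yet their changes are proportional, so the correlation of changes is high even though the raw values disagree.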
Therefore there is a need to build a framework for extracting high-quality multi-valued
and composite attributes from unstructured datasets. However, this will not resolve
all the issues related to providing reliable information. In fact, it is very risky to
substitute big data analysis for traditional data sources when identifying the scale of
intention to vote, social confidence and media education. Nevertheless, the results
presented in the paper are very promising and may have a big impact on the way social
surveys are conducted in the future.
REFERENCES
[1] P.J.H. Daas, M.J.H. Puts, Social media sentiment and consumer confidence,
Statistics Paper Series, No. 5, September, ECB, (2014).
[2] I. Kononenko, On Biases in Estimating Multi-Valued Attributes, IJCAI'95
Proceedings of the 14th International Joint Conference on Artificial Intelligence,
Volume 2, Morgan Kaufmann Publishers, (1995), 1034-1040.
[3] L. Cai, Y. Zhu, The Challenges of Data Quality and Data Quality Assessment in the
Big Data Era, Data Science Journal 14, May, (2015).
[4] S. Sakr, A. Liu, A. G. Fayoumi, The family of MapReduce and large-scale data
processing systems, ACM Computing Surveys (CSUR): Volume 46 Issue 1,
October, (2013).
[5] G. Bello-Orgaz, J.J. Jung, D. Camacho, Social big data: Recent achievements and
new challenges, Information Fusion, Volume 28, March (2016), 45–59.
[6] Y. Lu, X. Fang, J. Zhan, Data Readiness Level for Unstructured Data,
BigDataScience '14 Proceedings of the 2014 International Conference on Big Data
Science and Computing, No. 36, ACM New York (2014).
[7] B. Fortuna, D. Mladenic, M. Grobelnik, Application of semantic annotations to
predicting users' demographics, ESAIR '10 Proceedings of the third workshop on
Exploiting semantic annotations in information retrieval, ACM New York (2010).
[8] T. Condie, P. Mineiro, N. Polyzotis, M. Weimer, Machine learning for big data,
SIGMOD '13 Proceedings of the 2013 ACM SIGMOD International Conference on
Management of Data, (2013).
[9] A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and
analytics, International Journal of Information Management, Volume 35, Issue 2,
April (2015), 137–144.
[10] X. Wu, X. Zhu, G.-Q. Wu, Data mining with big data, IEEE Transactions on
Knowledge and Data Engineering, Volume: 26, Issue: 1, January (2014), 97-107.
[11] Education at a Glance 2016, OECD, Paris (2016).