Download PDF Full-text

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Life-cycle assessment wikipedia , lookup

Environmentalism wikipedia , lookup

Transcript
sustainability
Article
Monitoring Environmental Quality by Sniffing
Social Media
Zhibo Wang 1,2 , Lei Ke 1 , Xiaohui Cui 1, *, Qi Yin 1 , Longfei Liao 1 , Lu Gao 1 and Zhenyu Wang 1
1
2
*
International School of Software, Wuhan University, Wuhan 430079, China; [email protected] (Z.W.);
[email protected] (L.K.); [email protected] (Q.Y.); [email protected] (L.L.); [email protected] (L.G.);
[email protected] (Z.W.)
School of Software, East China University of Technology, Nanchang 330013, China
Correspondence: [email protected]; Tel.: +86-27-6877-8720
Academic Editors: Yichun Xie, Xinyue Ye and Clio Andris
Received: 9 November 2016; Accepted: 5 January 2017; Published: 10 February 2017
Abstract: Nowadays, the environmental pollution and degradation in China has become a serious
problem with the rapid development of Chinese heavy industry and increased energy generation.
With sustainable development being the key to solving these problems, it is necessary to develop
proper techniques for monitoring environmental quality. Compared to traditional environment
monitoring methods utilizing expensive and complex instruments, we recognized that social media
analysis is an efficient and feasible alternative to achieve this goal with the phenomenon that a
growing number of people post their comments and feelings about their living environment on
social media, such as blogs and personal websites. In this paper, we self-defined a term called the
Environmental Quality Index (EQI) to measure and represent people’s overall attitude and sentiment
towards an area’s environmental quality at a specific time; it includes not only metrics for water
and food quality but also people’s feelings about air pollution. In the experiment, a high sentiment
analysis and classification precision of 85.67% was obtained utilizing the support vector machine
algorithm, and we calculated and analyzed the EQI for 27 provinces in China using the text data
related to the environment from the Chinese Sina micro-blog and Baidu Tieba collected from January
2015 to June 2016. By comparing our results to with the data from the Chinese Academy of Sciences
(CAS), we showed that the environment evaluation model we constructed and the method we
proposed are feasible and effective.
Keywords: social media; environmental quality; environment monitoring; Support Vector
Machine (SVM)
1. Introduction
With the popularity and development of the internet, social media, such as blogs and personal
websites, etc., have greatly changed peoples’ lives. These social media propel the spread of social
news, public opinion, and personal daily information in society and play an important role in current
information dissemination and life sharing [1,2]. For example, in China, according to the survey of the
China Internet Network Information Center (CNNIC) (as shown in Figure 1), people are spending
more and more time on the internet, and that internet time will keep increasing in the future [3].
Since people spend so much time on social media, it is worthwhile to utilize them to uncover and
collect various kinds of environmental information. Considering the environment field, we checked
the previous related works that attempted to gather environment pollution information based on
different methods. In 2008, Honicky et al. [4] suggested collecting pollution information by sensors
attached to mobile phones. In 2009, Aoki et al. [5] used vehicles to monitor air quality by deploying
mobile air quality sensing platforms on street sweeping trucks in San Francisco. In 2012, the work
Sustainability 2017, 9, 85; doi:10.3390/su9020085
www.mdpi.com/journal/sustainability
Sustainability 2017, 9, 85
2 of 14
Sustainability 2017, 9, 85
2 of 13
by Xu et al. [6] monitored spatial-temporal signals from social media. In 2014, Zheng et al. [7] and
Chen and
et al.
[8] estimated
the air quality
in big cities
fusing
monitoring
station
data
with
social
media
Chen
et al. [8] estimated
the air quality
in big by
cities
by fusing
monitoring
station
data
with
social
data,
Mei
et al. [9]on
focused
on predicting
air pollution
108 in
cities
in China
the text
data, media
and Mei
et and
al. [9]
focused
predicting
air pollution
in 108 in
cities
China
fromfrom
the text
content
in the
Chinese
Sina micro-blog.
More recently,
2015,
et al.researched
[10] researched
how
Chinese
in thecontent
Chinese
Sina
micro-blog.
More recently,
in 2015,inKay
etKay
al. [10]
how
Chinese
social
media influenced
the airInquality.
In 2016, Sammarco
et used
al. [11]a used
a geographical
mediasocial
influenced
the air quality.
2016, Sammarco
M et al. M
[11]
geographical
searchsearch
about air
about air pollution related posts on social networks as an effective air-impurities measurement and
pollution
related posts on social networks as an effective air-impurities measurement and utilized it to
utilized it to predict pollution values, and Tse et al. [12] verified the feasibility of sensing air pollution
predict pollution values, and Tse et al. [12] verified the feasibility of sensing air pollution from social
from social networks and of integrating such information with real sensor feeds. To the best of our
networks
and of integrating such information with real sensor feeds. To the best of our knowledge,
knowledge, there has been very little research done on comprehensive environmental quality
there through
has been
very Chinese
little research
done on comprehensive environmental quality through mining
mining
social media.
Chinese social media.
Figure
1. Time
spent
onononline
byChinese
Chineseinternet
internet
users.
Figure
1. Time
spent
online(per
(per week)
week) by
users.
This paper aims to monitor environmental quality in regions of different provinces in China by
This paper aims to monitor environmental quality in regions of different provinces in China
collecting large amount of text data from two of China’s major social medias: Sina micro-blog
by collecting
large amount of text data from two of China’s major social medias: Sina micro-blog
(http://weibo.com) and Baidu TieBa (http://tieba.baidu.com). Sina micro-blog is a popular micro(http://weibo.com)
Baidu
micro-blog
a popular
blogging service and
platform
like TieBa
Twitter,(http://tieba.baidu.com).
and Baidu Tieba is known asSina
the world’s
largestisChinese
micro-blogging
service
platform
like
Twitter,
and
Baidu
Tieba
is
known
as
the
world’s
largest
Chinese
community, which has 1.5 billion registered users and 64.6 billion total messages. We further
community,
which
has 1.5 billion
registered
users
and 64.6
total messages.
Wesupport
furthervector
analyzed
analyzed
and classified
people’s
attitude and
sentiment
to billion
the environment
using the
machine algorithm
85.67%
precision
on average.
In addition, weusing
constructed
a new environment
and classified
people’s with
attitude
and
sentiment
to the environment
the support
vector machine
evaluation
model and
achieved
goal ofIn
environment
monitoring
in specific
effectively
by
algorithm
with 85.67%
precision
onthe
average.
addition, we
constructed
a newregions
environment
evaluation
calculating
and
analyzing
the
Environmental
Quality
Index
(EQI),
which
represents
people’s
overall
model and achieved the goal of environment monitoring in specific regions effectively by calculating
attitude and sentiment towards an area’s environmental quality at a specific time. Finally, by utilizing
and analyzing
the Environmental Quality Index (EQI), which represents people’s overall attitude
the environment evaluation model, we monitored the environmental quality for parts of major cities
and sentiment towards an area’s environmental quality at a specific time. Finally, by utilizing the
or provinces in China and compared the difference in environmental quality for 27 provinces in
environment
evaluation model, we monitored the environmental quality for parts of major cities or
China in 2015. The experimental results we obtained are very closed to the “China livable city report”
provinces
in
China
and
compared
the difference
environmental
quality
for 27
provinces
in China in
published by the
Chinese
Academy
of Sciences in
(CAS)
in 2015, which
illustrates
that
the environment
2015. evaluation
The experimental
results
we obtained
are very closed to the “China livable city report” published
model we
constructed
is feasible.
by the Chinese
Academy
of Sciences
(CAS)
in 2015,
which
illustrates
that the environment
evaluation
This work
is an important
and
innovative
step
in monitoring
environmental
quality based
on
social media.
It explores the process of classification according to short texts and constructs
modelChinese
we constructed
is feasible.
an environment
model
calculate EQI.
support
vector machine
algorithms
This
work is anevaluation
important
and to
innovative
stepByincombining
monitoring
environmental
quality
based on
and
experiment
results
we
show
that
the
model
is
a
feasible
and
efficient
way
to
analyze
and
compare an
Chinese social media. It explores the process of classification according to short texts and constructs
the environmental quality in different regions in China. Besides, it can provide a foundation for
environment
evaluation model to calculate EQI. By combining support vector machine algorithms
monitoring environmental quality by analyzing social media information for some researchers.
and experiment results we show that the model is a feasible and efficient way to analyze and compare
the environmental
quality in different regions in China. Besides, it can provide a foundation for
2. Methods
monitoring environmental quality by analyzing social media information for some researchers.
2.1. Formal Definition
2. Methods
In this paper, we wanted to establish a model that can be used to evaluate the environmental
quality
of different places in China at different times. Given enough text data related to the
2.1. Formal Definition
environment for a specific region in the required time period, we should be able to determine the
In this paper, we wanted to establish a model that can be used to evaluate the environmental
quality of different places in China at different times. Given enough text data related to the environment
Sustainability 2017, 9, 85
3 of 14
for a specific region in the required time period, we should be able to determine the status of the
environment by analyzing the writers’ emotional attitudes. In order to make this problem clearer,
the first step we took was to define the vocabulary of region and emotion (Definitions 1–3).
Definition 1 (Text words w). The words in the text may be related to emotional attitudes toward the
environment. Applying statistics to every word in the text is the key point in processing raw data. Using the
tuple to represent: w = {t, a}, where t stands for the text value of the word, and a stands for the frequency of w in
this text.
Definition 2 (Emotional dictionary D). For each emotional tendency, we can build a more significant
thesaurus to represent it, namely, an emotional thesaurus. An emotional thesaurus depends on the analysis of
the attitude toward the environment. We use tuples to represent the emotional word: D = {d, v}, where d is
the lexicon of each word, and v is the vector measuring emotional tendency. In this way, if the v is close to the
central vector, the text content is more likely to have emotional tendency. The words in the lexicon can also be
represented as tuples: d = {t, c}, where t is the text representation of the word d, and c is the correlation for the
word d with the emotional tendency.
Definition 3 (Regional dictionary R). For each region or city, we need large amount of its text data, T, as the
basis for the analysis of the region’s environmental quality.
After we made the three definitions above, the problem was formalized as follows: Use the
data on social media to build dictionary D, Regional dictionary R, and large amounts of text data T,
then process the data and forecast the environmental quality.
2.2. Establishing the Emotional and Regional Dictionary
Currently, we have both a geographical thesaurus and an emotional thesaurus. The “Sogou
thesaurus” (http://pinyin.sogou.com/dict) provided the geographical thesaurus and the Dalian
University of Technology Laboratory of Information Retrieval provided the emotional thesaurus [13].
However, the accuracy of each thesaurus has a crucial impact on the experiment, so verifying the
emotional and regional thesaurus was necessary. This paper utilized a CHI (Chi-square) prescribing
test and TFIDF (Term Frequency Inverse Document Frequency) algorithms and a certain degree of
manual inspection to check the thesauruses. By comparing the keyword set related to environment
and sentiment obtained by the two methods with the above thesauruses, we found many words exist
in both thesauruses. At the same time, some words that are irrelevant with the emotional tendency
and environment were also found, such as “expert” and “people”. Since the total number of words is
just less than 120, we can obtain a more accurate result using artificial selection. As shown in Table 1,
the average appearance rate for the “Sogou thesaurus” is 92.41% and the average appearance rate
for another thesaurus is 94.38%. From this result, we can determine that the two thesauruses are
accurate and reliable. Also, we manually added some keywords from the keywords set we obtained as
a supplement, if they were relevant but not contained in the thesauruses.
Table 1. The appearance rate for the two thesauruses under CHI prescribing and TFIDF.
Algorithm
geographical thesaurus
emotional thesaurus
Appearance Rate (%)
CHI Prescribing
TFIDF
90.01%
92.63%
94.81%
96.13%
CHI: Chi-square; TFIDF: Term Frequency Inverse Document Frequency.
Sustainability 2017, 9, 85
4 of 14
2.3. Filtering the Irrelevant Text Content
It was very important to find a method to determine whether a text is related to environmental
quality so that we could establish the environment evaluation model. The model was resolved by the
“TextRank” method mentioned in Section 3. The formula for calculating the relation is:


R( T ) = 
∑
( f (w) + f 0(w) × F (w))/( F (w) + 1) +
∀w∈( T (w)∩ D (w))
∑
f 0(w)/L( T )
(1)
∀w/
∈( T (w)∩ D (w))
In the formula, R( T ) represents the environmental relation value for a piece of text data,
T represents a whole text content. f (w) is the value of the word w defined in the word frequency table
we set up, f 0(w) is the “TextRank” method’s value of the word, and F (w) is the word occurrences
time in the frequency table. L( T ) is number of words and w is each word that makes up the whole
text. T (w) is the word set that constitutes the sentence, and D (w) is the word set of the whole word
frequency table. There are more specific definitions for the “TextRank” method and the word frequency
table in Section 3.2 of this paper.
2.4. Environmental Quality Analysis Method Combining Support Vector Machine
For the classification, machine learning was widely used. In this paper, we utilized the Support
Vector Machine (SVM) for classifying [14]. The Support Vector Machine is currently one of the best
algorithms in data mining [15]. It is just a hyper plane used to divide data into different groups [16–18].
When we enter the processed data into it, the data are distributed in both sides of the hyper plane.
Ultimately, we get the label data that have no label and do prediction [19]. The detailed process for
constructing the support vector machine model is in Section 3.3.
Algorithm: Environmental quality analysis using the support vector machine and utilizing
Sequential Minimal Optimization (SMO):
•
Input content:
1
2
3
4
5
6
7
•
Output content:
1
2
3
•
dataMatInTrain: processed data which has the characteristic about environment in
Training set
classLabelsTrain: the label for the Corresponding data in Training set
C: threshold
toler: fault tolerant rate
maxIter: iterator number
kTup: kernel function
dataMatInTest: processed data which has the characteristic about environment in test set
w: the normal vector for hyper plane
b: the constant for hyper plane
label: predicted label for data
Pseudo code:
1.
2.
Data = process data and get the characteristic toward environment
Get data matrix:
dataMatInTrain, classLabelsTrain = loadData (Data)
3.
Get key parameter:
w, b = SMO (dataMatIn, classLabels, C, toler ,maxIter, kTup)
Sustainability 2017, 9, 85
4.
5 of 14
Test the correct rate of this hyper plane:
Rate = judgeData (w, data, classLabel, b)
5.
If Rate > n:
Use this hyper plane to do the prediction:
Label = Predict (w, data, b)
Else:
Change some input parameter or training data to increase the correct rate
Do step 5
6.
Draw conclusions
In the pseudo code above, the most important step is function SMO (Sequential Minimal
Optimization), which breaks the large quadratic programming (QP) problem into a series of the
smallest possible QP problems in training support vector machines, and we wrote our own SMO
function code based on the pseudo-code mentioned in the research paper of John C. Platt [20].
2.5. Establishing the Environmental Quality Index
By analyzing thousands of sentiment analyses of text data on social networks associated with the
environment in a region within a certain time period, utilizing the Support Vector Machine model,
we can artificially create a representative of environmental quality in that area which we call the EQI
(Environmental Quality Index) F (c, t):
F (c, t) = −
∑
E( T )
(2)
∀ T ∈S(c,t)
where t stands for the time period, and c stands for the region of a text content. T represents a text,
E( T ) is the emotional evaluation value towards the environment status which is obtained from the
SVM model we constructed in Section 3.3, and S(c, t) is the whole text data set we have at time period t
in region c. The EQI can be used to measure the environmental quality in a specific region at a specific
period. The practical operation of it will be discussed in detail in Section 3.4.
3. Experiment
In this part, we followed the steps in Figure 2 to do the predictions. Firstly, we used the crawler to
get data about environment from the Sina micro-blog and Baidu TieBa. When we got enough data from
users, we tried to eliminate the irrelevant data and use text model to express the content. Secondly,
we input the training data into the SVM to get the most suitable hyper plane. Then, we predicated
the emotional tendency toward the environment based on this hyper plane. Finally, we constructed
the environment evaluation model and obtained the environment monitoring result for some major
cities in China every month from January 2015 to June 2016 and environmental quality ranking for
27 provinces in China in 2015.
predicated the emotional tendency toward the environment based on this hyper plane. Finally, we
constructed the environment evaluation model and obtained the environment monitoring result for
some major cities in China every month from January 2015 to June 2016 and environmental quality
ranking for 27 provinces in China in 2015.
Sustainability 2017, 9, 85
6 of 14
Figure
2. Experiment
flowchart.
Figure
2. Experiment
flowchart.
3.1. The Collection and Preprocessing of the Data
We collected the data from Baidu TieBa and the Sina micro-blog by using the Web Spider.
The Baidu TieBa Spider works as follow: Firstly, the Baidu Tieba Spider got access to the
home of the related Baidu TieBa and analyzed the page elements in order to get the link of all
the posts. Then, we collected content and the post time of each post. Next, we used Jieba
(https://pypi.python.org/pypi/jieba), which is an open source and popular python package and
provides functions for Chinese word segmentation, to split the text information and match each word
with the keywords about environment we set in advance for the text filtering. After that, Jieba found the
most probable combination based on the word frequency using dynamic programming. If there were
unknown words, we used the Viterbi algorithm [21] which is a HMM (Hidden Markov Model)-based
model to build a directed acyclic graph (DAG) for all possible word combinations. Then, it was written
into the database if the matching process was successful. The Sina micro-blog Spider was slightly
modified from a GitHub user Liu’s “Sina_spider1” (https://github.com/LiuXingMing/SinaSpider).
This spider has the advantages of simple authentication and fast accessing speed of the micro-blogging
mobile version so that it can collect the data with the pre-set account and save the data into the Mongo
database. However, the account cannot login if it is verified, so that we modified the login source code
and used the cookies for login through the browser directly. In addition, the text filtering process was
the same as Baidu TieBa information filtering.
We collected a total of 3,500,210 pieces of text by using the Sina micro-blog Spider and Baidu
TieBa Spider from January 2015 to June 2016. These included about 2,500,000 pieces of text from
Sina micro-blog and the other 1,000,000 pieces of text from Baidu TieBa. Then, we did the data
preprocessing after the collection of data. First, we noticed that Micro-blog users produce a lot of
interactive information in the interaction process, which have no emotion tendencies. Therefore,
we utilized regular expressions to make the filtering rules, and filter the information. However, some
of the text data that we collected still had little relationship to the environment although we used
the keywords about environment to filter the texts. So we filtered the text content again by using the
Sustainability 2017, 9, 85
7 of 14
TextRank algorithm, which is a kind of graph based on a ranking algorithm and a way of deciding
on the importance of a vertex within a graph by taking into account global information recursively
computed from the entire graph, rather than relying only on local vertex-specific information [22].
3.2. Further Filtering and Classification of Data
In this section, we manually selected about 8329 texts from the high environment-related crawling
data. We utilized the python package Jieba, which has a module function for keyword extraction based
on TextRank algorithm, to help us determine if a text was related to environmental quality and the
relevant calculation formula for this method is introduced in Section 3.3 of the paper. The module
function in Jieba:
jieba.analyse.textrank (raw_text)
where the parameter raw_text is the text content needed to be extracted, and the output of the program
is composed of the words which make up these text and their corresponding value which indicate the
importance of each word to the environment. Each keyword and its corresponding value is stored
in a word frequency table. For example, words such as “pollution”, “air”, and some other words
which have a higher frequency, which means they repeatedly appeared, correspond to a larger value,
while the words that are irrelevant to the environment correspond to a small value.
Further, we used the method to build a word frequency table which contains the words most
associated with the environment and record their corresponding value. The following procedure was
used for each new Chinese text for input: Firstly, Jieba completed Chinese word segmentation by
splitting these texts into a series of words, and then from the word frequency table we set up before,
we obtained the corresponding value for each word. Finally, we calculated the sum of all the values
corresponding to each word, and we got the correlation data for the input text. As for the result,
the high correlation value meant this text hd a greater degree of association with the environment,
Sustainability
2017,
77 of
Sustainability
2017, 9,
9,it85
85meant this text was irrelevant to the environment, as shown in Figures 3 and 4.
of 13
13
and conversely,
Figure
corresponding to
to aa high
high relation
relation
value.
Figure 3.
3. Text
Text data
data related
related to
to the
the environment
environment corresponding
Figure
3.
Text
relation value.
value.
Figure 4. Text
Text data related to the environment corresponding to a low relation value.
Figure
Figure 4.
4. Text data related to the environment corresponding to a low relation value.
value.
After
collected data,
1,500,000 remaining
After filtering
filtering the
the collected
data, aa total
total of
of 1,500,000
remaining texts
texts with
with aa high
high degree
degree of
of
correlation
with
the
environment
remained.
In
this
paper,
we
classified
these
texts
geographically
In this
this paper, we classified these texts geographically
correlation with the environment remained. In
according
Chinese provinces
provinces and
and months
months by
by the
the utilization
utilization of
of the
the geographical
geographical information
information and
and
according to
to Chinese
time
information
in
the
data.
However,
some
provinces,
such
as
“Tibet”
and
“Xinjiang”,
did
not
have
However, some provinces, such as “Tibet” and “Xinjiang”, did not have
time information in the data. However,
enough
texts,
provinces
into
Upon
the
work
do do
not not
take take
thosethose
provinces
into account.
Upon completing
the work of
geographic
enough texts,
texts,soso
sowewe
we
do
not
take
those
provinces
into account.
account.
Upon completing
completing
the
work of
of
geographic
and
temporal
classification
of
the
data,
we
obtained
the
geographic
and
period
and
temporal
classification
of
the
data,
we
obtained
the
geographic
and
period
distribution
of
the
data
geographic and temporal classification of the data, we obtained the geographic and period
distribution
of
data
which
is
in
source, which
shown
in Figures
5 and
6.
distribution
ofisthe
the
data source,
source,
which
is shown
shown
in Figures
Figures 55 and
and 6.
6.
correlation with the environment remained. In this paper, we classified these texts geographically
according to Chinese provinces and months by the utilization of the geographical information and
time information in the data. However, some provinces, such as “Tibet” and “Xinjiang”, did not have
enough texts, so we do not take those provinces into account. Upon completing the work of
geographic and temporal classification of the data, we obtained the geographic and period
Sustainability
2017, 9, of
85 the data source, which is shown in Figures 5 and 6.
8 of 14
distribution
Sustainability 2017, 9, 85
Figure 5. The geographical distribution of data sources.
8 of 13
Figure 5. The geographical distribution of data sources.
Figure
Figure6.6.The
Themonth
monthdistribution
distributionofofdata
datasources.
sources.
3.3.
3.3.Using
UsingSupport
SupportVector
VectorMachine
MachinetotoDetermine
DetermineEmotional
EmotionalAttitudes
Attitudestoward
towardEnvironment
Environment
Before
Beforewe
weused
usedthe
theSupport
SupportVector
VectorMachine
Machinetotodetermine
determineemotional
emotionalattitudes
attitudestoward
towardthe
the
environment,
what
we
needed
to
do
was
feature
classification,
referring
to
the
feature
classification
environment, what we needed to do was feature classification, referring to the feature classification
methods
aboutthe
theconstruction
construction
ontological
emotional
vocabulary
To determine
methodsin
inaa paper
paper about
ofof
ontological
emotional
vocabulary
[23]. [23].
To determine
feature
feature
vectors
of
the
data,
we
divided
the
attitude
characteristics
into
four
aspects,
which
included
vectors of the data, we divided the attitude characteristics into four aspects, which included view,
view,
emotion,
evaluation,
and degree.
eachthesaurus,
feature thesaurus,
it was subdivided
into
emotion,
evaluation,
and degree.
For eachFor
feature
it was subdivided
into positive
andpositive
negative
and
negative
parts.
For
piecewe
of made
text data,
we made
the segmentation
Chinese wordby
segmentation
invoking
parts.
For each
piece
of each
text data,
the Chinese
word
invoking theby
Jieba
method
the
Jieba
method
in
a
python
program
to
split
the
text
and
compare
it
with
the
thesaurus
forif
in a python program to split the text and compare it with the thesaurus for four characteristics to check
four
characteristics
to check
it was
a match.
If there
wasthen
a match
between positive
words then
it was
a match. If there
was aifmatch
between
positive
words
the corresponding
characteristic
value
was plus 1, if a match existed in negative words, then the corresponding characteristic value wasvminus
1. Through the accumulated score of each feature, we obtained the final result of it, which was used as a
measurement value for the character feature value in the passage below. In brief, after we got enough data
from the Sina micro-blog and Baidu TieBa, we processed the raw data before we could use it as input for
the SVM. The four aspects above are four features for these data and we accumulated the score of each
Sustainability 2017, 9, 85
9 of 14
the corresponding characteristic value was plus 1, if a match existed in negative words, then the
corresponding characteristic value wasvminus 1. Through the accumulated score of each feature,
we obtained the final result of it, which was used as a measurement value for the character feature
value in the passage below. In brief, after we got enough data from the Sina micro-blog and Baidu
TieBa, we processed the raw data before we could use it as input for the SVM. The four aspects above
are four features for these data and we accumulated the score of each feature and marked the label for
each text to create the training set. Since the purpose of this paper was to determine emotional attitudes
toward the local environment for a Chinese text, we simply divided emotional attitude tendency into
positive and negative, represented by 1 and −1.
We selected 7500 pieces of text data randomly, 5000 from Sina micro-blog and 2500 from the Baidu
Tieba, to construct the training set. For each piece of data, we marked its emotional attitude tendency
artificially according to the previously mentioned category. We utilized the python procedures to obtain
the text feature value mentioned above and to construct a corresponding CSV (Comma-Separated
Values) format file for the whole training set. Data format is shown in Table 2, the value for column
classification represents the emotional attitude tendency and characteristic values for view, emotion,
evaluation, and degree are the feature extraction results for each text content.
Table 2. Text sentiment feature value in CSV format file.
Classification
View
Emotion
Evaluation
Degree
−1
−1
1
0
−1
48
36
54
78
90
−4
−4
0
2
−6
1
2
1
6
4
0
0
0
0
4
After the data pretreatment, which was one of the most important steps, we started our key
step: training the support vector machine model. We first wrote a class in python to realize the SMO
algorithm [18,24]. We used the LoadTestData method in the python package to obtain the data matrix
and run the SMOP method to get the hyper plane:
SMOP (dataMatIn, classLabels, C, toler, maxIter, kTup)
In this function, we needed six parameters: dataMatIn is the input data matrix, classLabels is
the classifications which the corresponding input data belong to, C is a constant which you can set
yourself to decrease error rate, toler is the fault tolerant rate which you can set to what you want,
maxIter is the iteration number for the whole loop, and kTup is the kind of kernel function you use to
process the input data.
The SMOP method includes two main steps: the constructor of OptStruct and the innerL method:
OptStruct (mat (dataMatIn), mat (classLabels).transpose (), C, toler, kTup)
innerL (i, oS)
The function above is used to construct a special struct to store the processed data matrix,
classLabel matrix, the parameter C, toler, and the kind of kernel function kTup. The kTup can be set as
“rbf” which stands for Gaussian kernel function and “lk” which stands for Laplacian kernel function.
The innerL is used to change the value of key parameter we need. After we fixed all the parameters,
we called the function calcWs to calculate the hyper plane:
calcWs (alphas, dataArr, classLabels)
Sustainability 2017, 9, 85
10 of 14
In this function, the input parameters are output in last step. Then, we checked the accuracy by
using the judgeData function:
judgeData (w, data, classLabel, b)
The w is from calcWs, and other parameters are prepared.
After we constructed the support vector machine model, we verified the accuracy of the model
by selecting 5000 pieces of text data randomly as the test set. Also, we gave each piece of data an
emotional tendency value manually in advance to compare with the classification result produced by
the support vector machine algorithm. After the comparison and calculation, the accuracy for SVM
was 85.67%.
We input all the data set as a whole to make emotional classification. The classification process
was predicted according to the characteristics of the input data set, using the following method:
Predict (w, data, b)
In this function, the input parameter was the same as the judgeData function, but the data was
from the prediction set which had no class labels. The prediction set can be divide into two parts.
One part represents a good emotion toward the environment and the other part represents a bad
emotion to the environment.
Classifications generated by the resulting data were stored in the array, merging the input data
and output results to get a clear classification on social media text data and emotional tendency results
towards environment, which is shown Table 3.
Table 3. Emotional tendency classification for some text content.
Text Serial Number
Corresponding Feature Set
(View, Emotion, Evaluation, Degree)
Output Results
Emotional Tendency
1
2
3
4
(15, 4, 1, 2)
(33, −4, −1, 0)
(45, −3, −3, 0)
(21, 2, 3, 1)
1
−1
−1
1
Positive
Negative
Negative
Positive
Sustainability 2017, 9, 85
10 of 13
Further, we made an emotional tendency classification for the whole data source, the statistical
Further, we made an emotional tendency classification for the whole data source, the statistical result
result is shown in Figure 7, from which we found that 43.9% of the texts expressed negative attitudes
is shown in Figure 7, from which we found that 43.9% of the texts expressed negative attitudes towards
towards the local environment and 56.1% expressed a positive emotional tendency. Also, we made a
the local environment and 56.1% expressed a positive emotional tendency. Also, we made a more detailed
more detailed statistical analysis for Hubei Province from January 2015 to June 2016 shown in Figure 8;
statistical analysis for Hubei Province from January 2015 to June 2016 shown in Figure 8; there are 80,100
there are 80,100 pieces of text and the majority of texts also express a negative sentiment towards
pieces of text and the majority of texts also express a negative sentiment towards the environment.
the environment.
Figure 7. The emotional tendency distribution.
Figure 7. The emotional tendency distribution.
1600
1400
1200
Figure 7. The emotional tendency distribution.
Sustainability 2017, 9, 85
11 of 14
1600
1400
1200
1000
Positive Attitude
800
Negative Attitude
600
400
200
2016/6
2016/5
2016/4
2016/3
2016/2
2016/1
2015/12
2015/11
2015/9
2015/10
2015/8
2015/7
2015/6
2015/5
2015/4
2015/3
2015/2
2015/1
0
Figure 8. Sentiment distribution for text data in Hubei province.
Figure 8. Sentiment distribution for text data in Hubei province.
3.4.
Calculating
3.4.
Calculatingthe
theEQI
EQIScore
Score
ByBy
using
we manually
manuallycreated
createdaarepresentative
representative
environmental
usingthe
thesupport
supportvector
vectormachine
machine (SVM),
(SVM), we
environmental
indicator
of
environmental
quality
in
that
area,
which
we
call
the
Environmental
Quality
Index.
indicator of environmental quality in that area, which we call the Environmental Quality Index. This index
This
has the
ability and
to monitor
forecast the environmental
quality
in at
a specific
hasindex
the ability
to monitor
forecast and
the environmental
quality in a specific
region
a specificregion
period.at a
Also, itperiod.
can be used
environmental
quality
for differentquality
Chinese
WeChinese
calculated
the
specific
Also,toit compare
can be used
to compare
environmental
forregions.
different
regions.
EQI
for
three
large
regions
in
China
from
January
2015
to
June
2016
according
to
the
formula
defined
in to
We calculated the EQI for three large regions in China from January 2015 to June 2016 according
2.5, defined
and as the
9 proved,
quality was
in summer and
deteriorated
theSection
formula
inFigure
Section
2.5, andthe
asenvironmental
the Figure 9 proved,
thebest
environmental
quality
was best
the approach
of winter. with
In Table
and Figureof10,
we compared
for 28
in with
summer
and deteriorated
the4approach
winter.
In Tablethe
4 environmental
and Figure 10,quality
we compared
s
provinces
in
China
in
2015
using
the
EQI
score.
The
EQI
score
is
defined
as:
the environmental quality for 28 provinces in China in 2015 using the EQI score. The EQI score s is
defined as:
s  1  F ( t ) / max{ F ( t ),  t  C }
(3)
s = 1 − F (t)/max{ F (t), ∀t ∈ C }
(3)
F (t ) is the EQI for a specific province defined in Section 2.5 and C is the whole province
where
where
F (t) is the
EQI for a specific province defined in Section 2.5 and C is the whole province set
set for 27 Chinese provinces. The lager EQI score represents a better performance in comprehensive
for 27 Chinese provinces. The lager EQI score represents a better performance in comprehensive
environmental
quality,
ranges from 0 to 1. The result we got for the rank of environmental11quality
Sustainability 2017,
9, 85 which
of 13
environmental
quality,
which ranges from 0 to 1. The result we got for the rank of environmental
quality evaluation is veryclose to the “China livable city report” published by the Chinese Academy of
evaluation is veryclose to the “China livable city report” published by the Chinese Academy of Sciences
Sciences
(CAS)
inHowever,
2015. However,
thatcontained
this report
contained
only of
a small
(CAS)
in 2015.
we regretwe
thatregret
this report
only
a small number
cities innumber
China. of cities
in China.
6000
“environmental status”index
5450
5080
5000
4510
4000
3371
3000
2000
3540
3789
3421
4239
3782
3713
3347
3020 2972
2893
2743
2600 2476 2491
2419
2348
2312
2176 2245
2187
2150
2134
2040
1937
1934
1934
1873
1794
1624 1501 1658 1672 1613
1423
3254
2670 2674 2789
2334
2194
1965
3800
3593
3972
2987
3140
2877
3152
3292
3146
1000
20
15
/1
20
15
/2
20
15
/3
20
15
/4
20
15
/5
20
15
/6
20
15
/7
20
15
/8
20
15
/9
20
15
/1
20 0
15
/1
20 1
15
/1
2
20
16
/1
20
16
/2
20
16
/3
20
16
/4
20
16
/5
20
16
/6
0
Beijing
Shanghai
Hubei
Figure 9. The Environmental Quality Index (EQI) change for three large provinces in China.
Figure 9. The Environmental Quality Index (EQI) change for three large provinces in China.
Table 4. The rank of environmental quality evaluation results for 27 provinces in China in 2015.
Rank
1
2
3
Province
Hainan
Yunnan
Guizhou
Score
0.8501
0.8209
0.6753
Rank
10
11
12
Province
Jiangxi
Shanghai
Jilin
Score
0.5723
0.5650
0.5192
Rank
21
22
21
Province
Tianjin
Shandong
Shanxi
Score
0.3894
0.3671
0.3721
2
2
2
2
2
2
20
20
20
2
2
2
2
2
2
2
2
2
Beijing
Shanghai
Sustainability 2017, 9, 85
Hubei
Figure 9. The Environmental Quality Index (EQI) change for three large provinces in China.
12 of 14
Table4.4.The
Therank
rankofofenvironmental
environmentalquality
qualityevaluation
evaluationresults
resultsfor
for2727provinces
provincesininChina
Chinainin2015.
2015.
Table
Rank
Province
Score
Rank
Province
Score
Rank
Rank
1
Province
Hainan
Score
0.8501
Rank
10
Province
Jiangxi
Score
0.5723 Rank21
21
32
43
54
65
76
87
98
Yunnan
Hainan
Guizhou
Yunnan
Anhui
Guizhou
Guangxi
Anhui
Chongqing
Guangxi
Fujian
Chongqing
Shanxi
Fujian
Guangdong
Shanxi
9
Guangdong
0.8209
0.8501
0.6753
0.8209
0.6355
0.6753
0.6201
0.6355
0.6191
0.6201
0.6121
0.6191
0.5864
0.6121
0.5852
0.5864
0.5852
11
10
12
11
13
12
14
13
15
14
16
15
17
16
18
17
18
Shanghai
Jiangxi
Jilin
Shanghai
Zhejiang
Jilin
Ningxia
Zhejiang
Jiangsu
Ningxia
Henan
Jiangsu
Hebei
Henan
Hunan
Hebei
Hunan
0.5650
0.5723
0.5192
0.5650
0.5118
0.5192
0.5014
0.5118
0.4625
0.5014
0.4241
0.4625
0.4138
0.4241
0.4031
0.4138
0.4031
21
22
21
22
23
24
25
26
27
Province
Province
Tianjin
22
Shandong
Tianjin
21
Shanxi
Shandong
22
Heilongjiang
Shanxi
23 Heilongjiang
Hubei
24
Sichuan
Hubei
25
Beijing
Sichuan
26
Liaoning
Beijing
27
Gansu
Liaoning
Gansu
Score
Score
0.3894
0.3671
0.3894
0.3721
0.3671
0.3142
0.3721
0.2561
0.3142
0.2054
0.2561
0.1732
0.2054
0.1121
0.1732
0
0.1121
0
Figure 10. The EQI score for 27 provinces in China in 2015.
Figure 10. The EQI score for 27 provinces in China in 2015.
4. Conclusions
This paper proposed a new method to evaluate a region’s environmental quality by using Chinese
social media. We made a sentiment analysis and classification by utilizing the support vector machine
algorithm, which had an accuracy rate of 85.67%. We self-defined a term called Environmental Quality
Index (EQI) and constructed an environment evaluation model to monitor and forecast a local area’s
environmental quality successfully. The result of environmental quality comparison between different
provinces in China is also very close to the air pollution ranking information which is published by
the Ministry of Environmental Protection (MEP) of the People’s Republic of China. We think that our
method can also be used in other related fields to process Chinese social media data for Chinese big
data analysis.
The alternative we used is innovative and feasible compared to the traditional environment
monitoring methods that use expensive and complex instruments. The useful and available text data
related to the environment and stored in the large repository of social networks will grow as more and
more people post their comments and feelings about water, food quality, and air pollution in a local
environment on social media in the future; this will make our method and evaluation model more
accurate and reliable.
For the sustainable development of the environment in the future, we need to take into
consideration social media data from more sources, not only from Sina micro-blog and Baidu TieBa,
and expand the amount of text data for more time spans. Also, we can further study and analyze some
of the change phenomenon and laws for EQI we obtained, such as why the EQI for different provinces
Sustainability 2017, 9, 85
13 of 14
in China reached its highest point in March, and try to understand the pattern of human activities in
social media which may help in predicting the EQI.
Acknowledgments: This research is supported in part by the National Nature Science Foundation of China
No. 61440054, Nature Science Foundation of Hubei, China No. 2014CFA048, and Fundamental Research Funds
for the Central Universities of China No. 216274213. Outstanding Academic Talents Startup Funds of Wuhan
University, No. 216-410100003. Youth Funds of Science and Technology in Jiangxi Province Department of
Education (No. GJJ150572). National Natural Science Foundation of China (No. 61462004). Natural Science
Foundation of Jiangxi Province (No. 20151BAB207042).
Author Contributions: Zhibo Wang and Xiaohui Cui conceived and designed the experiments; Lei Ke performed
the experiments; Qi Yin and Longfei Liao analyzed the data; Lu Gao and Zhenyu Wang contributed materials and
analysis tools; Lei Ke and Zhibo Wang wrote the paper together.
Conflicts of Interest: The authors declare no conflict of interest.
References
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
Cui, X.; Yang, N.; Wang, Z.; Hu, C.; Zhu, W.; Li, H.; Ji, Y.; Liu, C. Chinese social media analysis for disease
surveillance. Pers. Ubiquitous Comput. 2015, 19, 1125–1132. [CrossRef]
Wang, Z.; Cui, X.; Gao, L.; Yin, Q.; Ke, L.; Zhang, S. A hybrid model of sentimental entity recognition on
mobile social media. EURASIP J. Wirel. Commun. Netw. 2016. [CrossRef]
The Growing Impact of Social Media. Available online: http://www.sociallyawareblog.com/2012/11/21/
time-americans-spendper-month-on-social-media-sites/ (accessed on 22 December 2016).
Honicky, R.J.; Brewer, E.A.; Paulos, E.; White, R.M. N-Smarts: Networked Suite of Mobile Atmospheric
Real-Time Sensors. In Proceedings of the Second ACM SIGCOMM Workshop on Networked Systems for
Developing Regions, Seattle, WA, USA, 17–22 August 2008.
Aoki, P.M.; Honicky, R.J.; Mainwaring, A.; Myers, C.; Paulos, E.; Subramanian, S.; Woodruff, A. A Vehicle
for Research: Using Street Sweepers to Explore the Landscape of Environmental Community Action.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Boston, MA, USA,
4–9 April 2009.
Xu, J.M.; Bhargava, A.; Nowak, R.; Zhu, X. Socioscope: Spatio-Temporal Signal Recovery from Social
Media. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, Bristol, UK, 24–28 September 2012.
Zheng, Y.; Liu, F.; Hsieh, H.P. U-Air: When Urban Air Quality Inference Meets Big Data. In Proceedings of
the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL,
USA, 11–14 August 2013.
Chen, J.; Chen, H.; Zheng, G.; Pan, J.Z.; Wu, H.; Zhang, N. Big Smog Meets Web Science: Smog Disaster
Analysis Based on Social Media and Device Data on the Web. In Proceedings of the 23rd International
Conference on World Wide Web, Seoul, Korea, 7–11 April 2014.
Mei, S.; Li, H.; Fan, J.; Zhu, X.; Dyer, C.R. Inferring Air Pollution by Sniffing Social Media. In Proceedings
of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
(ASONAM), Beijing, China, 17–20 August 2014.
Kay, S.; Zhao, B.; Sui, D. Can social media clear the air? A case study of the air pollution problem in Chinese
cities. Prof. Geogr. 2015, 67, 351–363. [CrossRef]
Sammarco, M.; Tse, R.; Pau, G.; Marfia, G. Using geosocial search for urban air pollution monitoring.
Pervasive Mob. Comput. 2016. [CrossRef]
Tse, R.; Xiao, Y.; Pau, G.; Fdida, S.; Roccetti, M.; Marfia, G. Sensing pollution on online social networks:
A transportation perspective. Mob. Netw. Appl. 2016, 21, 688–707. [CrossRef]
Xu, L.; Lin, H.; Pan, Y.; Ren, H.; Chen, J. Constructing the affective lexicon ontology. J. China Soc. Sci. Tech. Inf.
2008, 2, 6.
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998,
2, 121–167. [CrossRef]
Platt, J.C. Fast training of support vector machines using sequential minimal optimization. In Advances in
Kernel Methods; MIT Press: Cambridge, MA, USA, 1999; pp. 185–208.
Sustainability 2017, 9, 85
17.
18.
19.
20.
21.
22.
23.
24.
14 of 14
Schuldt, C.; Laptev, I.; Caputo, B. Recognizing Human Actions: A Local SVM Approach. In Proceedings of
the 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK, 23–26 August 2004.
Rakotomamonjy, A. Variable selection using SVM-based criteria. J. Mach. Learn. Res. 2003, 3, 1357–1370.
Duan, K.; Keerthi, S.S.; Poo, A.N. Evaluation of simple performance measures for tuning SVM
hyperparameters. Neurocomputing 2003, 51, 41–59. [CrossRef]
Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Available
online: https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fastalgorithm-for-training-support-vector-machines/ (accessed on 22 December 2016).
Forney, G.D. The Viterbi Algorithm. Available online: http://www.systems.caltech.edu/EE/Courses/
EE127/EE127A/handout/ForneyViterbi.pdf (accessed on 22 December 2016).
Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Texts. In Proceedings of the 42nd Annual Meeting of
the Association for Computational Linguistics, Bacelona, Spain, 21–26 July 2004.
Xu, L.; Lin, H.; Pan, Y. Construction of ontology emotional vocabulary. J. China Soc. Sci. Tech. Inf. 2008, 27,
180–185.
Joachims, T. Making large scale SVM learning practical. In Advances in Kernel Methods; MIT Press: Cambridge,
MA, USA, 1999.
© 2017 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).