Sustainability
Article
Monitoring Environmental Quality by Sniffing Social Media

Zhibo Wang 1,2, Lei Ke 1, Xiaohui Cui 1,*, Qi Yin 1, Longfei Liao 1, Lu Gao 1 and Zhenyu Wang 1

1 International School of Software, Wuhan University, Wuhan 430079, China; [email protected] (Z.W.); [email protected] (L.K.); [email protected] (Q.Y.); [email protected] (L.L.); [email protected] (L.G.); [email protected] (Z.W.)
2 School of Software, East China University of Technology, Nanchang 330013, China
* Correspondence: [email protected]; Tel.: +86-27-6877-8720

Academic Editors: Yichun Xie, Xinyue Ye and Clio Andris
Received: 9 November 2016; Accepted: 5 January 2017; Published: 10 February 2017

Abstract: Environmental pollution and degradation have become a serious problem in China with the rapid development of Chinese heavy industry and increased energy generation. With sustainable development being the key to solving these problems, it is necessary to develop proper techniques for monitoring environmental quality. Compared to traditional environment monitoring methods that rely on expensive and complex instruments, social media analysis is an efficient and feasible alternative, because a growing number of people post comments and feelings about their living environment on social media, such as blogs and personal websites. In this paper, we define the Environmental Quality Index (EQI) to measure and represent people's overall attitude and sentiment towards an area's environmental quality at a specific time; it includes not only metrics for water and food quality but also people's feelings about air pollution.
In the experiment, a sentiment analysis and classification precision of 85.67% was obtained using the support vector machine algorithm, and we calculated and analyzed the EQI for 27 provinces in China using environment-related text data collected from the Chinese Sina micro-blog and Baidu Tieba from January 2015 to June 2016. By comparing our results with the data from the Chinese Academy of Sciences (CAS), we showed that the environment evaluation model we constructed and the method we proposed are feasible and effective.

Keywords: social media; environmental quality; environment monitoring; Support Vector Machine (SVM)

1. Introduction

With the popularity and development of the internet, social media, such as blogs and personal websites, have greatly changed people's lives. Social media propel the spread of social news, public opinion, and personal daily information in society and play an important role in current information dissemination and life sharing [1,2]. For example, in China, according to the survey of the China Internet Network Information Center (CNNIC) (as shown in Figure 1), people are spending more and more time on the internet, and this time is expected to keep increasing [3]. Since people spend so much time on social media, it is worthwhile to use social media to uncover and collect various kinds of environmental information.

In the environmental field, previous works have attempted to gather environmental pollution information with different methods. In 2008, Honicky et al. [4] suggested collecting pollution information with sensors attached to mobile phones. In 2009, Aoki et al. [5] used vehicles to monitor air quality by deploying mobile air quality sensing platforms on street sweeping trucks in San Francisco.
In 2012, the work by Xu et al. [6] monitored spatial-temporal signals from social media. In 2014, Zheng et al. [7] and Chen et al. [8] estimated the air quality in big cities by fusing monitoring station data with social media data, and Mei et al. [9] focused on predicting air pollution in 108 cities in China from the text content in the Chinese Sina micro-blog. More recently, in 2015, Kay et al. [10] researched how Chinese social media influenced the air quality. In 2016, Sammarco et al. [11] used a geographical search for air pollution related posts on social networks as an effective air-impurities measurement and utilized it to predict pollution values, and Tse et al. [12] verified the feasibility of sensing air pollution from social networks and of integrating such information with real sensor feeds. To the best of our knowledge, there has been very little research on comprehensive environmental quality through mining Chinese social media.

Figure 1. Time spent online (per week) by Chinese internet users.

This paper aims to monitor environmental quality in regions of different provinces in China by collecting a large amount of text data from two of China's major social media platforms: Sina micro-blog (http://weibo.com) and Baidu Tieba (http://tieba.baidu.com). Sina micro-blog is a popular micro-blogging service platform like Twitter, and Baidu Tieba is known as the world's largest Chinese community, which has 1.5 billion registered users and 64.6 billion total messages. We further analyzed and classified people's attitude and sentiment toward the environment using the support vector machine algorithm with 85.67% precision on average. In addition, we constructed a new environment evaluation model and achieved the goal of environment monitoring in specific regions by calculating and analyzing the Environmental Quality Index (EQI), which represents people's overall attitude and sentiment towards an area's environmental quality at a specific time. Finally, by utilizing the environment evaluation model, we monitored the environmental quality for parts of major cities or provinces in China and compared the difference in environmental quality for 27 provinces in China in 2015. The experimental results we obtained are very close to the "China livable city report" published by the Chinese Academy of Sciences (CAS) in 2015, which illustrates that the environment evaluation model we constructed is feasible.

This work is an important and innovative step in monitoring environmental quality based on Chinese social media. It explores the process of classification according to short texts and constructs an environment evaluation model to calculate EQI. By combining support vector machine algorithms and experiment results, we show that the model is a feasible and efficient way to analyze and compare the environmental quality in different regions in China. In addition, it can provide a foundation for researchers who monitor environmental quality by analyzing social media information.

Sustainability 2017, 9, 85; doi:10.3390/su9020085; www.mdpi.com/journal/sustainability
2. Methods

2.1. Formal Definition

In this paper, we wanted to establish a model that can be used to evaluate the environmental quality of different places in China at different times. Given enough text data related to the environment for a specific region in the required time period, we should be able to determine the status of the environment by analyzing the writers' emotional attitudes. To make this problem clearer, the first step we took was to define the vocabulary of region and emotion (Definitions 1–3).

Definition 1 (Text words w). The words in the text may be related to emotional attitudes toward the environment. Applying statistics to every word in the text is the key point in processing raw data. We represent each word as a tuple: w = {t, a}, where t stands for the text value of the word, and a stands for the frequency of w in this text.

Definition 2 (Emotional dictionary D). For each emotional tendency, we can build a more significant thesaurus to represent it, namely, an emotional thesaurus. An emotional thesaurus depends on the analysis of the attitude toward the environment. We use tuples to represent the emotional word: D = {d, v}, where d is the lexicon of each word, and v is the vector measuring emotional tendency.
In this way, if v is close to the central vector, the text content is more likely to have an emotional tendency. The words in the lexicon can also be represented as tuples: d = {t, c}, where t is the text representation of the word d, and c is the correlation of the word d with the emotional tendency.

Definition 3 (Regional dictionary R). For each region or city, we need a large amount of its text data, T, as the basis for the analysis of the region's environmental quality.

With the three definitions above, the problem was formalized as follows: use the data on social media to build the emotional dictionary D, the regional dictionary R, and a large amount of text data T, then process the data and forecast the environmental quality.

2.2. Establishing the Emotional and Regional Dictionary

Currently, we have both a geographical thesaurus and an emotional thesaurus. The "Sogou thesaurus" (http://pinyin.sogou.com/dict) provided the geographical thesaurus and the Dalian University of Technology Laboratory of Information Retrieval provided the emotional thesaurus [13]. However, the accuracy of each thesaurus has a crucial impact on the experiment, so verifying the emotional and regional thesauruses was necessary. This paper utilized a CHI (Chi-square) prescribing test and the TFIDF (Term Frequency Inverse Document Frequency) algorithm, together with a certain degree of manual inspection, to check the thesauruses. By comparing the keyword sets related to environment and sentiment obtained by the two methods with the above thesauruses, we found that many words exist in both. At the same time, some words that are irrelevant to emotional tendency and the environment were also found, such as "expert" and "people". Since the total number of such words is fewer than 120, we can obtain a more accurate result by manual selection. As shown in Table 1, the average appearance rate for the "Sogou thesaurus" is 92.41% and the average appearance rate for the other thesaurus is 94.38%.
From this result, we can determine that the two thesauruses are accurate and reliable. We also manually added some keywords from the keyword sets we obtained as a supplement, if they were relevant but not contained in the thesauruses.

Table 1. The appearance rate (%) for the two thesauruses under CHI prescribing and TFIDF.

Thesaurus | CHI Prescribing | TFIDF
Geographical thesaurus | 90.01% | 94.81%
Emotional thesaurus | 92.63% | 96.13%

CHI: Chi-square; TFIDF: Term Frequency Inverse Document Frequency.

2.3. Filtering the Irrelevant Text Content

To establish the environment evaluation model, it was very important to find a method to determine whether a text is related to environmental quality. We addressed this with the "TextRank" method mentioned in Section 3. The formula for calculating the relation is:

$$R(T) = \sum_{\forall w \in (T(w) \cap D(w))} \frac{f(w) + f'(w) \times F(w)}{F(w) + 1} + \sum_{\forall w \notin (T(w) \cap D(w))} \frac{f'(w)}{L(T)} \tag{1}$$

In the formula, R(T) represents the environmental relation value for a piece of text data, and T represents the whole text content. f(w) is the value of the word w defined in the word frequency table we set up, f'(w) is the "TextRank" value of the word, and F(w) is the number of occurrences of the word in the frequency table. L(T) is the number of words, and w is each word that makes up the whole text. T(w) is the word set that constitutes the sentence, and D(w) is the word set of the whole word frequency table. More specific definitions of the "TextRank" method and the word frequency table are given in Section 3.2 of this paper.

2.4. Environmental Quality Analysis Method Combining Support Vector Machine

Machine learning is widely used for classification. In this paper, we utilized the Support Vector Machine (SVM) for classifying [14]. The Support Vector Machine is currently one of the most widely used algorithms in data mining [15]. In essence, it finds a hyperplane that divides the data into different groups [16–18].
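The role of the hyperplane can be illustrated with a minimal sketch. It assumes a linear decision rule sign(w · x + b); the function names and the numeric values are hypothetical and are not taken from the paper.

```python
def decision_value(w, x, b):
    """Signed score w . x + b of a feature vector x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Return +1 on the non-negative side of the hyperplane, -1 otherwise."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Illustrative values only: the hyperplane x1 - x2 = 0 in two dimensions.
w, b = [1.0, -1.0], 0.0
print(classify(w, [3.0, 1.0], b))  # point below the diagonal
print(classify(w, [1.0, 3.0], b))  # point above the diagonal
```

Points on opposite sides of the hyperplane receive opposite labels, which is the sense in which the hyperplane divides the data into groups.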
When the processed data are fed into the SVM, they are distributed on both sides of the hyperplane. Ultimately, we use the hyperplane to predict labels for data that have no label [19]. The detailed process for constructing the support vector machine model is in Section 3.3.

Algorithm: Environmental quality analysis using the support vector machine and Sequential Minimal Optimization (SMO)

• Input content:
1. dataMatInTrain: processed data which has the characteristic about environment in the training set
2. classLabelsTrain: the labels for the corresponding data in the training set
3. C: threshold
4. toler: fault tolerant rate
5. maxIter: iterator number
6. kTup: kernel function
7. dataMatInTest: processed data which has the characteristic about environment in the test set

• Output content:
1. w: the normal vector for the hyperplane
2. b: the constant for the hyperplane
3. label: predicted label for data

• Pseudo code:
1. Data = process data and get the characteristic toward environment
2. Get data matrix: dataMatInTrain, classLabelsTrain = loadData (Data)
3. Get key parameters: w, b = SMO (dataMatInTrain, classLabelsTrain, C, toler, maxIter, kTup)
4. Test the correct rate of this hyperplane: Rate = judgeData (w, data, classLabel, b)
5. If Rate > n (the required correct rate): use this hyperplane to do the prediction: Label = Predict (w, data, b). Else: change some input parameters or training data to increase the correct rate, then repeat from step 3
6. Draw conclusions

In the pseudo code above, the most important step is the function SMO (Sequential Minimal Optimization), which breaks the large quadratic programming (QP) problem that arises in training support vector machines into a series of the smallest possible QP problems. We wrote our own SMO function code based on the pseudo-code in the research paper of John C. Platt [20].

2.5. Establishing the Environmental Quality Index

By aggregating the sentiment classifications, obtained with the Support Vector Machine model, of thousands of social network texts associated with the environment in a region within a certain time period, we can construct a representative indicator of environmental quality in that area, which we call the EQI (Environmental Quality Index) F(c, t):

$$F(c, t) = -\sum_{\forall T \in S(c, t)} E(T) \tag{2}$$

where t stands for the time period, and c stands for the region of a text content. T represents a text, E(T) is the emotional evaluation value towards the environment status, which is obtained from the SVM model we constructed in Section 3.3, and S(c, t) is the whole text data set we have at time period t in region c. The EQI can be used to measure the environmental quality in a specific region at a specific period. Its practical operation will be discussed in detail in Section 3.4.

3. Experiment

In this part, we followed the steps in Figure 2 to make the predictions. Firstly, we used the crawler to get data about the environment from the Sina micro-blog and Baidu Tieba. When we got enough data from users, we eliminated the irrelevant data and used a text model to express the content. Secondly, we input the training data into the SVM to get the most suitable hyperplane. Then, we predicted the emotional tendency toward the environment based on this hyperplane. Finally, we constructed the environment evaluation model and obtained the environment monitoring result for some major cities in China for every month from January 2015 to June 2016, and the environmental quality ranking for 27 provinces in China in 2015.
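As a rough illustration of the last step of this workflow, Equation (2) can be sketched as a grouped sum. This is a minimal sketch, assuming each classified text is available as a (region, period, label) record with label in {+1, −1}; the record layout, the function name, and the sample data are hypothetical.

```python
from collections import defaultdict


def eqi_by_region_and_period(records):
    """Compute F(c, t) = -sum of E(T) over all texts T in S(c, t).

    records: iterable of (region, period, label) tuples, where label is the
    classifier output E(T), assumed here to be +1 (positive) or -1 (negative).
    Returns a dict mapping (region, period) to the EQI value.
    """
    totals = defaultdict(int)
    for region, period, label in records:
        totals[(region, period)] += label
    # The leading minus sign follows Equation (2) as printed.
    return {key: -total for key, total in totals.items()}


# Hypothetical sample: three texts for one region and month, one for another.
sample = [
    ("Hubei", "2015-01", 1),
    ("Hubei", "2015-01", -1),
    ("Hubei", "2015-01", -1),
    ("Beijing", "2015-01", 1),
]
print(eqi_by_region_and_period(sample))
```

Taking the sign in Equation (2) at face value, a region and period with more negative than positive texts receives a larger value.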
Figure 2. Experiment flowchart.

3.1. The Collection and Preprocessing of the Data

We collected the data from Baidu Tieba and the Sina micro-blog by using Web Spiders. The Baidu Tieba Spider works as follows. Firstly, the spider accessed the home page of the related Baidu Tieba forum and analyzed the page elements in order to get the links of all the posts. Then, we collected the content and the post time of each post. Next, we used Jieba (https://pypi.python.org/pypi/jieba), a popular open source Python package that provides functions for Chinese word segmentation, to split the text and match each word with the environment keywords we set in advance for text filtering. Jieba builds a directed acyclic graph (DAG) of all possible word combinations and finds the most probable combination based on word frequency using dynamic programming. For unknown words, it uses an HMM (Hidden Markov Model)-based model decoded with the Viterbi algorithm [21]. The text was then written into the database if the matching process was successful.

The Sina micro-blog Spider was slightly modified from the GitHub user Liu's "Sina_spider1" (https://github.com/LiuXingMing/SinaSpider). This spider has the advantages of simple authentication and fast access to the mobile version of the micro-blog, so it can collect data with a pre-set account and save the data into a Mongo database. However, an account cannot log in this way if it is subject to verification, so we modified the login source code and used the cookies from a direct browser login. The text filtering process was the same as for Baidu Tieba.

We collected a total of 3,500,210 pieces of text by using the Sina micro-blog Spider and Baidu Tieba Spider from January 2015 to June 2016.
These included about 2,500,000 pieces of text from Sina micro-blog and the other 1,000,000 pieces of text from Baidu Tieba.

We then preprocessed the collected data. First, we noticed that micro-blog users produce a lot of interactive information in the interaction process, which has no emotional tendency. Therefore, we used regular expressions to define filtering rules and filter out this information. However, some of the text data that we collected still had little relationship to the environment, although we had used the environment keywords to filter the texts. So we filtered the text content again by using the TextRank algorithm, a graph-based ranking algorithm that decides the importance of a vertex within a graph by taking into account global information recursively computed from the entire graph, rather than relying only on local vertex-specific information [22].

3.2. Further Filtering and Classification of Data

In this section, we manually selected 8329 texts from the crawled data that were highly related to the environment. We utilized the Python package Jieba, which has a module function for keyword extraction based on the TextRank algorithm, to help us determine whether a text was related to environmental quality; the relevant calculation formula for this method is introduced in Section 2.3 of the paper. The module function in Jieba is:

jieba.analyse.textrank(raw_text)

where the parameter raw_text is the text content from which keywords are extracted, and the output of the program is composed of the words which make up the text and their corresponding values, which indicate the importance of each word to the environment. Each keyword and its corresponding value is stored in a word frequency table.
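The construction of such a word frequency table can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes the keyword extractor returns (word, weight) pairs, as `jieba.analyse.textrank(text, withWeight=True)` does, and it keeps a mean weight and an occurrence count per word, which is an assumption about how the table aggregates values. `toy_extract` is a hypothetical stand-in so the example runs without Jieba.

```python
def build_word_table(texts, extract_keywords):
    """Accumulate keyword weights over a corpus.

    extract_keywords(text) must return an iterable of (word, weight) pairs.
    Returns {word: (mean_weight, occurrence_count)}.
    """
    sums, counts = {}, {}
    for text in texts:
        for word, weight in extract_keywords(text):
            sums[word] = sums.get(word, 0.0) + weight
            counts[word] = counts.get(word, 0) + 1
    return {word: (sums[word] / counts[word], counts[word]) for word in sums}


def toy_extract(text):
    """Hypothetical extractor: whitespace tokens weighted by relative frequency."""
    tokens = text.split()
    return [(t, tokens.count(t) / len(tokens)) for t in sorted(set(tokens))]


table = build_word_table(["air pollution air", "air quality"], toy_extract)
for word, (mean_weight, count) in sorted(table.items()):
    print(word, round(mean_weight, 3), count)
```

With Jieba installed, a lambda wrapping `jieba.analyse.textrank(text, withWeight=True)` could be passed in place of `toy_extract`.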
For example, words such as "pollution" and "air", and other words that appeared repeatedly and therefore have a higher frequency, correspond to a larger value, while words that are irrelevant to the environment correspond to a small value. We used this method to build a word frequency table which contains the words most associated with the environment and records their corresponding values. The following procedure was used for each new Chinese text: firstly, Jieba performed Chinese word segmentation by splitting the text into a series of words; then, from the word frequency table we had set up, we obtained the corresponding value for each word. Finally, we calculated the sum of the values corresponding to each word to get the correlation value for the input text. A high correlation value meant that the text had a greater degree of association with the environment; conversely, a low value meant that the text was irrelevant to the environment, as shown in Figures 3 and 4.

Figure 3. Text data related to the environment corresponding to a high relation value.

Figure 4. Text data related to the environment corresponding to a low relation value.

After filtering the collected data, a total of 1,500,000 texts with a high degree of correlation with the environment remained. We classified these texts by Chinese province and by month, using the geographical information and time information in the data. However, some provinces, such as "Tibet" and "Xinjiang", did not have enough texts, so we did not take those provinces into account. After completing the geographic and temporal classification of the data, we obtained the geographic and period distribution of the data source, which is shown in Figures 5 and 6.

Figure 5. The geographical distribution of data sources.

Figure 6. The month distribution of data sources.
3.3. Using Support Vector Machine to Determine Emotional Attitudes toward Environment

Before we used the Support Vector Machine to determine emotional attitudes toward the environment, we needed to perform feature classification, referring to the feature classification methods in a paper about the construction of an ontological emotional vocabulary [23]. To determine the feature vectors of the data, we divided the attitude characteristics into four aspects: view, emotion, evaluation, and degree. Each feature thesaurus was subdivided into positive and negative parts. For each piece of text data, we performed Chinese word segmentation by invoking the Jieba method in a Python program to split the text, and compared the words with the thesauruses for the four characteristics to check for matches. If a word matched a positive entry, the corresponding characteristic value was increased by 1; if it matched a negative entry, the corresponding characteristic value was decreased by 1. The accumulated score of each feature was used as the feature value in the passage below.

In brief, after we got enough data from the Sina micro-blog and Baidu Tieba, we processed the raw data before we could use it as input for the SVM. The four aspects above are the four features of these data; we accumulated the score of each feature and marked the label of each text to create the training set.

Since the purpose of this paper was to determine emotional attitudes toward the local environment for a Chinese text, we simply divided emotional attitude tendency into positive and negative, represented by 1 and −1. We randomly selected 7500 pieces of text data, 5000 from Sina micro-blog and 2500 from Baidu Tieba, to construct the training set. For each piece of data, we manually marked its emotional attitude tendency according to the previously mentioned categories. We used Python procedures to obtain the text feature values mentioned above and to construct a corresponding CSV (Comma-Separated Values) format file for the whole training set. The data format is shown in Table 2: the value in the Classification column represents the emotional attitude tendency, and the values for View, Emotion, Evaluation, and Degree are the feature extraction results for each text.

Table 2. Text sentiment feature value in CSV format file.

Classification | View | Emotion | Evaluation | Degree
−1 | 48 | −4 | 1 | 0
−1 | 36 | −4 | 2 | 0
1 | 54 | 0 | 1 | 0
0 | 78 | 2 | 6 | 0
−1 | 90 | −6 | 4 | 4

After the data pretreatment, which was one of the most important steps, we started our key step: training the support vector machine model. We first wrote a class in Python to realize the SMO algorithm [18,24]. We used the LoadTestData method in the Python package to obtain the data matrix and ran the SMOP method to get the hyperplane:

SMOP (dataMatIn, classLabels, C, toler, maxIter, kTup)

This function takes six parameters: dataMatIn is the input data matrix, classLabels is the classifications to which the corresponding input data belong, C is a constant that can be set to decrease the error rate, toler is the fault tolerant rate, maxIter is the iteration number for the whole loop, and kTup is the kind of kernel function used to process the input data. The SMOP method includes two main steps: the constructor of OptStruct and the innerL method:

OptStruct (mat (dataMatIn), mat (classLabels).transpose (), C, toler, kTup)
innerL (i, oS)

OptStruct constructs a special structure to store the processed data matrix, the classLabel matrix, the parameters C and toler, and the kind of kernel function kTup. The kTup can be set as "rbf", which stands for the Gaussian kernel function, or "lk", which stands for the Laplacian kernel function. The innerL method is used to update the values of the key parameters we need. After we fixed all the parameters, we called the function calcWs to calculate the hyperplane:

calcWs (alphas, dataArr, classLabels)

The input parameters of this function are the outputs of the previous step. Then, we checked the accuracy by using the judgeData function:

judgeData (w, data, classLabel, b)

Here w comes from calcWs, and the other parameters have already been prepared.
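The last two steps can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes a linear kernel, for which the normal vector is w = Σ αᵢ yᵢ xᵢ, and the names `calc_ws`, `predict`, and `accuracy` are hypothetical stand-ins for calcWs, Predict, and judgeData. The multipliers and data are made up for illustration.

```python
def calc_ws(alphas, data, labels):
    """Normal vector of the hyperplane: w = sum_i alpha_i * y_i * x_i."""
    n_features = len(data[0])
    w = [0.0] * n_features
    for alpha, x, y in zip(alphas, data, labels):
        if alpha == 0:
            continue  # only support vectors (alpha > 0) contribute
        for j in range(n_features):
            w[j] += alpha * y * x[j]
    return w


def predict(w, x, b):
    """Label +1 or -1 according to the sign of w . x + b."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def accuracy(w, data, labels, b):
    """Fraction of samples whose predicted label matches the given label."""
    hits = sum(1 for x, y in zip(data, labels) if predict(w, x, b) == y)
    return hits / len(labels)


# Made-up multipliers and samples: two support vectors on the first axis.
alphas = [0.5, 0.5]
data = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
w = calc_ws(alphas, data, labels)
print(w, accuracy(w, data, labels, b=0.0))
```

For a non-linear kernel such as "rbf", an explicit w is not available in the input space and prediction would instead sum kernel evaluations against the support vectors.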
After constructing the support vector machine model, we verified its accuracy on a test set of 5000 randomly selected pieces of text data. Each piece of data was manually assigned an emotional tendency value in advance, for comparison with the classification result produced by the support vector machine algorithm. After the comparison and calculation, the accuracy of the SVM was 85.67%. We then fed the whole data set to the model for emotional classification. Each text was classified according to its feature values, using the following method:

    Predict (w, data, b)

The input parameters are the same as for the judgeData function, except that the data come from the prediction set, which has no class labels. The prediction set can be divided into two parts: one part expresses a good emotion toward the environment and the other a bad emotion. The resulting classifications were stored in an array, and the input data and output results were merged to pair each social media text with its emotional tendency towards the environment, as shown in Table 3.

Table 3. Emotional tendency classification for some text content.
Text Serial Number   Corresponding Feature Set (View, Emotion, Evaluation, Degree)   Output Results   Emotional Tendency
1                    (15, 4, 1, 2)                                                   1                Positive
2                    (33, −4, −1, 0)                                                 −1               Negative
3                    (45, −3, −3, 0)                                                 −1               Negative
4                    (21, 2, 3, 1)                                                   1                Positive

Further, we made an emotional tendency classification for the whole data source. The statistical result is shown in Figure 7, from which we found that 43.9% of the texts expressed negative attitudes towards the local environment and 56.1% expressed a positive emotional tendency. We also made a more detailed statistical analysis for Hubei Province from January 2015 to June 2016, shown in Figure 8; there are 80,100 pieces of text, and the majority of them express a negative sentiment towards the environment.

Figure 7. The emotional tendency distribution.

Figure 8. Sentiment distribution for text data in Hubei province (monthly numbers of texts with positive and negative attitude, 2015/1 to 2016/6).

3.4. Calculating the EQI Score

By using the support vector machine (SVM), we constructed a representative indicator of the environmental quality in an area, which we call the Environmental Quality Index. This index can be used to monitor and forecast the environmental quality in a specific region during a specific period. It can also be used to compare environmental quality across different Chinese regions. We calculated the EQI for three large regions in China from January 2015 to June 2016 according to the formula defined in Section 2.5. As Figure 9 shows, the environmental quality was best in summer and deteriorated with the approach of winter. In Table 4 and Figure 10, we compared the environmental quality of 27 provinces in China in 2015 using the EQI score. The EQI score s is defined as:

$$ s = 1 - \frac{F(t)}{\max\{F(t),\ \forall t \in C\}} \qquad (3) $$

where F(t) is the EQI for a specific province defined in Section 2.5 and C is the whole province set for 27 Chinese provinces.
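The normalization in Equation (3) can be illustrated with a short sketch. The function names `eqi_scores` and `rank_by_score` and the sample EQI values are hypothetical and chosen only for illustration; the single step taken from the text is s = 1 − F(t)/max F(t) over the province set C.

```python
from typing import Dict, List


def eqi_scores(eqi: Dict[str, float]) -> Dict[str, float]:
    """Map each province's EQI F(t) to s = 1 - F(t) / max F over all provinces."""
    if not eqi:
        raise ValueError("need at least one province")
    f_max = max(eqi.values())
    if f_max <= 0:
        raise ValueError("the maximum EQI must be positive")
    return {province: 1.0 - f / f_max for province, f in eqi.items()}


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Provinces ordered from the largest score to the smallest."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Invented EQI values for three placeholder provinces.
    sample = {"A": 10.0, "B": 40.0, "C": 25.0}
    scores = eqi_scores(sample)
    for name in rank_by_score(scores):
        print(name, scores[name])
```

By construction, the province with the largest EQI receives a score of exactly 0, and every other province falls between 0 and 1 as long as its EQI is non-negative.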
A larger EQI score, which ranges from 0 to 1, represents better overall environmental quality. The resulting rank of environmental quality evaluation is very close to the "China livable city report" published by the Chinese Academy of Sciences (CAS) in 2015. However, we regret that this report covered only a small number of cities in China.

Figure 9. The Environmental Quality Index (EQI) change for three large provinces in China ("environmental status" index by month, 2015/1 to 2016/6, for Beijing, Shanghai, and Hubei).

Table 4. The rank of environmental quality evaluation results for 27 provinces in China in 2015.

Rank   Province       Score
1      Hainan         0.8501
2      Yunnan         0.8209
3      Guizhou        0.6753
4      Anhui          0.6355
5      Guangxi        0.6201
6      Chongqing      0.6191
7      Fujian         0.6121
8      Shanxi         0.5864
9      Guangdong      0.5852
10     Jiangxi        0.5723
11     Shanghai       0.5650
12     Jilin          0.5192
13     Zhejiang       0.5118
14     Ningxia        0.5014
15     Jiangsu        0.4625
16     Henan          0.4241
17     Hebei          0.4138
18     Hunan          0.4031
19     Tianjin        0.3894
20     Shandong       0.3671
21     Shanxi         0.3721
22     Heilongjiang   0.3142
23     Hubei          0.2561
24     Sichuan        0.2054
25     Beijing        0.1732
26     Liaoning       0.1121
27     Gansu          0

Figure 10. The EQI score for 27 provinces in China in 2015.

4. Conclusions

This paper proposed a new method to evaluate a region's environmental quality by using Chinese social media. We performed sentiment analysis and classification with the support vector machine algorithm, which reached an accuracy of 85.67%. We defined a term called the Environmental Quality Index (EQI) and constructed an environment evaluation model to monitor and forecast a local area's environmental quality.
The result of the environmental quality comparison between different provinces in China is also very close to the air pollution ranking information published by the Ministry of Environmental Protection (MEP) of the People's Republic of China. We think that our method can also be used in other related fields to process Chinese social media data for Chinese big data analysis. The alternative we used is innovative and feasible compared to the traditional environment monitoring methods that use expensive and complex instruments. The useful text data related to the environment and stored in the large repository of social networks will grow as more and more people post their comments and feelings about water, food quality, and air pollution in their local environment on social media; this will make our method and evaluation model more accurate and reliable.

For the sustainable development of the environment in the future, we need to take into consideration social media data from more sources, not only the Sina micro-blog and Baidu Tieba, and to expand the amount of text data over longer time spans. We can also further study and analyze some of the patterns in the EQI we obtained, such as why the EQI for different provinces in China reached its highest point in March, and try to understand the patterns of human activity on social media that may help in predicting the EQI.

Acknowledgments: This research is supported in part by the National Nature Science Foundation of China No. 61440054, Nature Science Foundation of Hubei, China No. 2014CFA048, and Fundamental Research Funds for the Central Universities of China No. 216274213. Outstanding Academic Talents Startup Funds of Wuhan University, No. 216-410100003. Youth Funds of Science and Technology in Jiangxi Province Department of Education (No. GJJ150572). National Natural Science Foundation of China (No. 61462004).
Natural Science Foundation of Jiangxi Province (No. 20151BAB207042).

Author Contributions: Zhibo Wang and Xiaohui Cui conceived and designed the experiments; Lei Ke performed the experiments; Qi Yin and Longfei Liao analyzed the data; Lu Gao and Zhenyu Wang contributed materials and analysis tools; Lei Ke and Zhibo Wang wrote the paper together.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Cui, X.; Yang, N.; Wang, Z.; Hu, C.; Zhu, W.; Li, H.; Ji, Y.; Liu, C. Chinese social media analysis for disease surveillance. Pers. Ubiquitous Comput. 2015, 19, 1125–1132. [CrossRef]
2. Wang, Z.; Cui, X.; Gao, L.; Yin, Q.; Ke, L.; Zhang, S. A hybrid model of sentimental entity recognition on mobile social media. EURASIP J. Wirel. Commun. Netw. 2016. [CrossRef]
3. The Growing Impact of Social Media. Available online: http://www.sociallyawareblog.com/2012/11/21/time-americans-spendper-month-on-social-media-sites/ (accessed on 22 December 2016).
4. Honicky, R.J.; Brewer, E.A.; Paulos, E.; White, R.M. N-Smarts: Networked Suite of Mobile Atmospheric Real-Time Sensors. In Proceedings of the Second ACM SIGCOMM Workshop on Networked Systems for Developing Regions, Seattle, WA, USA, 17–22 August 2008.
5. Aoki, P.M.; Honicky, R.J.; Mainwaring, A.; Myers, C.; Paulos, E.; Subramanian, S.; Woodruff, A. A Vehicle for Research: Using Street Sweepers to Explore the Landscape of Environmental Community Action. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Boston, MA, USA, 4–9 April 2009.
6. Xu, J.M.; Bhargava, A.; Nowak, R.; Zhu, X. Socioscope: Spatio-Temporal Signal Recovery from Social Media. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Bristol, UK, 24–28 September 2012.
7. Zheng, Y.; Liu, F.; Hsieh, H.P. U-Air: When Urban Air Quality Inference Meets Big Data. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013.
8. Chen, J.; Chen, H.; Zheng, G.; Pan, J.Z.; Wu, H.; Zhang, N. Big Smog Meets Web Science: Smog Disaster Analysis Based on Social Media and Device Data on the Web. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Korea, 7–11 April 2014.
9. Mei, S.; Li, H.; Fan, J.; Zhu, X.; Dyer, C.R. Inferring Air Pollution by Sniffing Social Media. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Beijing, China, 17–20 August 2014.
10. Kay, S.; Zhao, B.; Sui, D. Can social media clear the air? A case study of the air pollution problem in Chinese cities. Prof. Geogr. 2015, 67, 351–363. [CrossRef]
11. Sammarco, M.; Tse, R.; Pau, G.; Marfia, G. Using geosocial search for urban air pollution monitoring. Pervasive Mob. Comput. 2016. [CrossRef]
12. Tse, R.; Xiao, Y.; Pau, G.; Fdida, S.; Roccetti, M.; Marfia, G. Sensing pollution on online social networks: A transportation perspective. Mob. Netw. Appl. 2016, 21, 688–707. [CrossRef]
13. Xu, L.; Lin, H.; Pan, Y.; Ren, H.; Chen, J. Constructing the affective lexicon ontology. J. China Soc. Sci. Tech. Inf. 2008, 2, 6.
14. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
15. Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [CrossRef]
16. Platt, J.C. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods; MIT Press: Cambridge, MA, USA, 1999; pp. 185–208.
17. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing Human Actions: A Local SVM Approach. In Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK, 23–26 August 2004.
18. Rakotomamonjy, A. Variable selection using SVM-based criteria. J. Mach. Learn. Res. 2003, 3, 1357–1370.
19. Duan, K.; Keerthi, S.S.; Poo, A.N. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing 2003, 51, 41–59. [CrossRef]
20. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Available online: https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fastalgorithm-for-training-support-vector-machines/ (accessed on 22 December 2016).
21. Forney, G.D. The Viterbi Algorithm. Available online: http://www.systems.caltech.edu/EE/Courses/EE127/EE127A/handout/ForneyViterbi.pdf (accessed on 22 December 2016).
22. Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Texts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004.
23. Xu, L.; Lin, H.; Pan, Y. Construction of ontology emotional vocabulary. J. China Soc. Sci. Tech. Inf. 2008, 27, 180–185.
24. Joachims, T. Making large scale SVM learning practical. In Advances in Kernel Methods; MIT Press: Cambridge, MA, USA, 1999.

© 2017 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).