Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT OF ANDHRA REGION N.Rajasekhar1, Dr. T. V. Rajini Kanth2 1 (Assistant Professor, Department of Computer Science & Engineering, VNRVJIET, Hyderabad) 2 (Professor, Department of Computer Science &Engineering, SNIST, Hyderabad) Abstract- Weather Prediction is the application of science and technology to estimate the state of atmosphere at a particular spatial location. Due to the availability of huge data researchers got interest to analyze and forecast the weather. By predicting accurately it helps in safeguarding human life and their wealth. Forecasting techniques are useful in effective weather prediction; crops yield growth, traffic congestions, marine navigation, forests growth & defense purposes. The Data Mining techniques / algorithms are proved to be better than the existing techniques / methodologies / traditional statistical methods. In this paper, we have proposed a new Hybrid SVM (Support vector machines) model for effective weather prediction by analyzing the given huge weather data sets and to find the suitable patterns existing in it. SVM is one of the supervised learning methods under classifications & regression. Good results were arrived in weather prediction than the other Data Mining Techniques. For this, as a case study we have considered Krishna district weather data sets were considered for analysis using this proposed hybrid SVM technique. Keywords - Weather Prediction, Spatial Location, Hybrid Support Vector Machine, Machine Learning Algorithms, Data Mining Techniques. I. INTRODUCTION Analysis of Weather data is used for finding the atmospheric pattern and apart from knowing the status of weather at a specific location for desired time duration. Krishna district [Ref.1] is a district in the Coastal Andhra region of Andhra Pradesh, India. It is named after the Krishna River, the third longest river that flows within India, flows through the district and joins Bay of Bengal near Hamsaladevi village of the district. Machilipatnam is the administrative headquarters for the district. Vijayawada is the biggest city in the district and also the commercial center of the state. According to 2011 Census, it has a population of 4,529,009, of which 41.00% is urban and a literacy rate of 74.37%. The district is bounded to the north-west and west by Telangana, to the north-east by West Godavari District, to the south-east by the Bay of Bengal, to the southwest by Guntur District. All the numerical data about the previous and current status of the rainfall is collected for the estimation of weather. It is useful to study and analyze weather data sets for effective prediction. The present weather prediction models are useful in estimation of weather based on the atmospheric conditions. In expert systems there is no automation methods exists in order to pick up the best possible predictive models for effective weather forecast. For this, new mathematical models are to be designed in order to address very huge time and computational complexities that occurs due to uncertainty of weather conditions. By this, the new models developed will benefit us in enhancing weather prediction accuracies. Effective forecasting of weather has lots of uses like by providing early information about natural disasters to the public to save their lives and minimize the property damages. These predictions make the agriculturalists and business people to plan crops and yields. Weather prediction is classified depending on time factor in to four types’ namely Short range- forecast up to 48 hours, Extendedforecast from 3 – 5 days, Medium range–3 to 7 days and Long-range forecasts– more than 7 days [1]. We consider the rainfall data of three districts of Andhra Region. Here we considered Krishna district weather data for study and analysis for effective weather prediction. II. LITERATURE SURVEY A. K- Means clustering algorithm: K- Means clustering algorithm [6, 7, 8] is one of the most widely used clustering algorithm in the real world applications. It has been successfully applied in domains such as relational databases, gene expression data and decision support systems. One of the drawbacks of this algorithm is the choice of centroids location which is at random at the beginning of the algorithm; variables treatment as numbers and the number of clusters K is unknown. The drawback of the first one is overcome by multiple runs or particular initialization methods. But these initialization methods are not better than random centroids. The second drawback is overcome by handling the categorical parameters by using the measure of matching dissimilarity. For the third drawback, the input parameter that is number of clusters is fixed before in the regular Kmeans algorithm. This problem can be addressed by the use HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT OF ANDHRA REGION of cluster validity indices. Like other data mining algorithms, reliability through K-means reduced when treating highdimensional data. In this type data sets are nearly always too sparse. The application of Euclidean distance becomes meaningless in high dimensional sparse spaces. By combining K-means with feature extraction methods such as Principal Component Analysis (PCA) and Self Organizing Maps (SOM) there is a possibility of solution. B. Support vector machine: Support vector machine (SVMs) [2, 5] is a supervised learning method that generates a decision-making system to predict new values. When a set of training samples were given the SVM algorithm develops a model to predict to which category this samples falls. SVM model segregates as distant as possible the given samples by a clear gap by representing them as points in space. The given examples to be predicted are then mapped into the either side of the gap based on the category that they belongs. SVM constructs a hyper plane or set of hyper-planes in the high dimensional space that can be used for machine learning algorithms like classification, or regression. The largest distanced hyper plane has a good margin even to the closest training data points to whatever class they belongs to. If there is higher separation there will be smaller generalization of classifier error. This technique can be explained as a combination of two steps: Learning: This step consists of examples along with SVM training. Prediction: It is necessary when ever results are not known by inclusion of new samples. Every example is represented as a pair of (input, output) in which data set is the input and how to categorize is denoted by output. In this every example is represented as a pair (x, y) in mathematical format where x is the real numbers vector and y is the Boolean value (i.e. y/n, 1/-1, T/F), or a real number. Out of these two cases first one, is treated as a problem of classification, and the second one treated as a problem of regression. Learning results are compared to the (x, y) points representing a real function in the space and SVM generates the interpolated function interpolates these series of points to the best. The “Training Set” consists of group of pairs that make up the SVM, where as in “Test Set” it contains the group of pairs for prediction. The data set is derived after clustering the data set. III. PROPOSED APPROACH The realistic data sets that are collected are in raw format. In first step they have to be converted in to experimental suitable data [3] sets format by preprocessing techniques like noise removal etc. Initially the K-means clustering technique will be applied on the data sets and after that SVM classification technique will be applied over that clustered data set. This SVM hybrid technique [4] is suitable for effective analysis of weather data sets. 3.1 Implementation of the Proposed Approach: K-means clustering was applied on the data set for about 102 years namely Monthly Mean for each year Average Temperature (1901-2002). K-means clustering algorithm was applied on the data set [3] Monthly mean for each year Average Temperature and the Clustering model is on full training set. There are 13 attributes namely year, Jan, Feb, Mar, April, May, June, July, August, Sep, Oct, Nov and Dec. Euclidean distance was applied for making clusters. The number of iterations is 6 and within cluster sum of squared errors is 33.373323263149274. Time taken to build model (full training data) is 0.03 seconds. In this, missing values are globally replaced with mean/mode. Table 1 is representing the cluster centroids for about 5 clusters. Total number of Instances is 102. Cluster 0 has 33(32%), Cluster 1 has 15(15%), Cluster 2 has 24(24%), Cluster 3 has 13(13%) and Cluster 4 has 17(17%) number of instances. Fig.1 shows clusters Histogram. Fig.2 shows cluster graph of Monthly Mean for each year Average temperature with Instance number along x-axis and year along y-axis. SVM classification was done with 14 attributes namely year, names of 12 months and Clusters. The total number of Instances is 102. It was evaluated on total training set. LibSVM wrapper, actual code by Yasser EL-Manzalawy (= WLSVM). The time taken to build the model is 0.14 seconds and 0.02 seconds to test the model on training data set. Agreement between the classifications and the true classes is represented by Kappa Statistic. It is a Chance-corrected measure. It is measured by considering expected agreement by chance away from observed agreement and divides it by highest possible agreement. If the value is more than 0 it means that the classifier is doing better. The Kappa statistic is near to 1 so it is near to perfect agreement. The kappa statistic measures the agreement of prediction with the true class 0.8828 signifies almost there is complete agreement. The correctly classified instances are 92 and incorrectly classified instances are 10. That means it has classified 2 | Page HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT OF ANDHRA REGION perfectly. There is no considerable error most of the errors are near to zero except the root relative squared error. Fig.3 shows SVM classifier graph of monthly mean for each year Average temperature. In this graph Clusters were taken along x-axis and predicted cluster along y-axis. The Annual Mean values of Minimum, Average and Maximum temperatures for all the years 1901-2002 are shown in Fig.4 along with their trend line Polynomial equations. In this month’s are taken along x-axis and temperature along y-axis. The equation for Mean Minimum Temperature is given by Wet day Frequency + * Potential Evapotranspiration +13.3485 --- (4) IV. FIGURES AND TABLES OF CLUSTER CENTROIDS Cluster# Attribute Full Data y = -0.0019x5 + 0.0631x4 - 0.7665x3 + 3.8138x2 - 3.9871 3 0 1 (33) (15) 2 4 5.4917x + 20.378 ---- (1) (102) (13) (24) (17) Equation for Mean Average Temperature is given by ========================================== ============================= y = -0.0024x5 + 0.0787x4 - 0.9231x3 + 4.3635x2 Year 5.9399x + 25.725 --- (2) 1951.5 1977.3333 1924.697 1986.6923 1930.2 1958.9412 Equation for Mean Maximum Temperature is Jan given by 23.3345 23.5955 y = -0.0029x5 + 0.0943x4 - 1.0808x3 + 4.9155x2 6.3828x + 31.093 --- (3) Feb 24.1856 25.1828 25.6661 Mar The regression [9] equation for Annual Mean values is given by 30.6396 24.8517 26.159 27.6269 28.0848 Apr 23.1897 27.0164 28.5775 30.3108 30.0328 30.975 22.7651 23.0984 24.5215 24.9804 27.1878 27.8259 29.5188 30.5771 Averagetemperature = 0.0442 * cloudcover + (May 1.2564) * DiurnalTemperatureRange + 32.8083 (-9.431)* GroundFrostFrequency + 32.3848 32.1833 32.6614 31.8497 32.4384 0.0208 * precipitation + 0.0824 * Vapour Pressure + -0.7168 * Jun 31.8466 31.0933 31.0168 30.7131 30.7815 30.7443 3 | Page HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT OF ANDHRA REGION Jul 28.838 29.185 Aug 29.5832 28.4052 28.7352 Sep 28.2424 27.1811 23.9412 26.91 27.7857 24.7899 25.1427 Dec 28.0239 28.9003 27.3202 Nov 28.4061 28.6475 28.6095 Oct 28.6428 24.2983 25.6115 23.1654 22.8922 28.7983 28.192 28.0938 28.0271 28.2679 27.6229 27.5127 26.7561 25.2432 Fig.2: Clustered graph generated using k-means clustering algorithm 24.2178 23.3564 23.493 DETAILED ACCURACY BY CLASS 22.4715 TP Rate F-Measure MCC TABLE1: CLUSTERED CENTROIDS OF MONTHLY MEAN FOR EACH YEAR AVERAGE TEMPERATURE ROC Area 0.970 0.914 0.873 0.749 0.895 0.917 0.910 0.935 Weighted Avg. Fig.1: Clusters Histogram 0.897 0.875 0.038 1.000 0.988 0.902 0.885 0.933 0.600 0.958 cluster2 1.000 0.846 cluster3 0.895 0.895 0.036 0.970 cluster1 0.866 0.024 Class cluster0 0.858 0.000 Recall 0.865 0.659 0.923 1.000 0.944 0.000 0.960 0.846 PRC Area 0.848 0.800 0.958 0.920 0.072 0.949 0.600 0.750 FP Rate Precision 1.000 cluster4 0.912 0.902 0.833 4 | Page HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT OF ANDHRA REGION 1. TABLE 2: DETAILED ACCURACY BY CLUSTERED CLASS CONFUSION MATRIX a b c d e 32 0 0 0 1 | a = cluster0 5 1 0 0 | b = cluster1 0 0 23 0 1 | c = cluster2 9 0 0 <-- classified as 2 11 0 | d = cluster3 0 0 0 2. 0 17 | e = cluster4 Fig.4: Graph of Annual Mean Temperatures for all the years 1901-2002 along with Trend line equations TABLE 3: CONFUSION MATRIX Fig.3: SVM classifier graph of monthly mean for each year Average temperature Fig.5: Graph of Annual mean values of weather parameters for all the years (1901-2002) V. CONCLUSION: In the K-means cluster analysis given by Table.1 cluster 0 is 33 with highest number of instances has 1925 year as the centroid. Cluster 4 has lowest temperature 22.47150c in the month of December. Cluster3 has highest temperature values in all months except May and June with 13 instances. Cluster2 with number of instances 24 has highest temperature values in these two months May and June. Detailed accuracies are given by Table.2 indicates that 5 | Page HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT OF ANDHRA REGION cluster 4 has TP rate as 1 and F-measure as 0.944. Cluster0 and cluster2 has TP rate values as 0.970 and 0.958 and Fmeasure values are 0.914 and 0.920 respectively. It tells that cluster 4 was classified accurately followed by cluster0 and cluster2. In the confusion matrix represented by the Table.3 each column represents instances in the predicted class and each row indicates instances in actual class. The classifier has classified accurately for the cluster4, cluster0 and cluster2 but less accurate in case of cluster1 and cluster 3 indicated by confusion matrix. Only cluster 4 was classified accurately. Fig.4. gives the following conclusions or inferences in which months are taken along x-axis and different weather parameter values are taken along y-axis. There is a correlation between mean precipitation and mean cloud cover its value is 0.86114364. Likewise there is a correlation between average temperature and mean vapour pressure and is 0.851278951. There is a correlation between mean diurnal temperature and mean potential evapotranspiration and is 0.665903472. There is negative correlation between Mean ground frost frequency and mean wet day frequency and its value is -0.729960042. Mean average temperature has correlation with all the parameters namely mean cloud cover, diurnal temperature range, ground frost frequency, precipitation, vapour pressure, wet day frequency, potential evapotranspiration and reference crop evapotranspiration. It has highest correlation with reference crop evapotranspiration and negative correlation with ground frost frequency. The future scope is this Hybrid SVM technique can be extended to any weather data belongs to any spatial location for further analysis based on various weather parameters. REFERENCES [1]http://www.timeanddate.com/weather/forecast-accuracytime.html [2] http://en.wikipedia.org/wiki/Weather_forecasting [3] Folorunsho Olaiya, Application of Data Mining Techniques in Weather Prediction and Climate Change Studies, I.J. Information Engineering and Electronic Business, 2012, 1, 51-59, Published Online February 2012 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijieeb.2012.01.07 [4]Dr. M.H.Dunham, Companion slides for the text, “Data Mining, Introductory and Advanced Topics”, Prentice Hall, 2002. [5] Y.Radhika and M.Shashi, “Atmospheric Temperature Prediction using Support Vector Machines” International Journal of Computer Theory and Engineering, Vol. 1, No. 1, April 2009 1793-8201 [6] Dr.T. V. Rajini Kanth, Ananthoju Vijay Kumar, Estimation of the Influence of Fertilizer Nutrients Consumption on the Wheat Crop yield in India- a Data mining Approach, 30 Dec 2013, Volume 3, Issue 2, Pg.No: 316-320, ISSN: 2249-8958 (Online). [7] Dr.T. V. Rajini Kanth, Ananthoju Vijay Kumar, A Data Mining Approach for the Estimation of Climate Change on the Jowar Crop Yield in India, 25Dec2013,Volume 2 Issue 2, Pg.No:16-20, ISSN: 2319-6378 (Online). [8] A. Vijay Kumar, Dr. T. V. Rajini Kanth “Estimation of the Influential Factors of rice yield in India” 2nd International Conference on Advanced Computing methodologies ICACM2013, 02-03 Aug 2013, Elsevier Publications, Pg. No: 459-465, ISBN No: 978-93-35107-14-95. [9] Dr.David, B.Stephenson, Data analysis methods in weather and climate research Department of Meteorology University of Reading, July,20, 2005 , http://www.met.rdg.ac.uk /cag/courses N.Rajasekhar did B.Tech in Mechanical Engineering from NBKRIST, Vidyanagar, Nellore and M.Tech in Computer Science & Engineering from Bharath University, Chennai, INDIA, in 2006. Currently pursuing Ph.D from Acharya Nagarjuna University, Guntur. He is presently working as Assistant Professor in the Department of Computer Science & Engineering VNR Vignana Jyothi Institute of Engineering & Technology, Hyderabad, India. He published papers in reputed journals and conferences. He is a professional member of ISTE Dr.T.V.K.Rajinikanth received M.Tech from Osmania University, Hyderabad in 2001, PhD from Osmania University,Hyderabad in 2007. He is currently working as a Professor, Department of Computer Science & Engineering, Sree Nidhi Institute Of Science & Technology, Hyderabad, India. He has written and attended several National & International journals and conferences. His research interests include Data warehouse & mining, Semantic Web, Spatial Data mining, ANN. He is a professional member of CSI and ISTE. 6 | Page