Download hybrid svm datamining techniques for weather data analysis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

K-means clustering wikipedia , lookup

Transcript
HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA
ANALYSIS OF KRISHNA DISTRICT OF ANDHRA REGION
N.Rajasekhar1, Dr. T. V. Rajini Kanth2
1
(Assistant Professor, Department of Computer Science & Engineering, VNRVJIET, Hyderabad)
2
(Professor, Department of Computer Science &Engineering, SNIST, Hyderabad)
Abstract- Weather Prediction is the application of science and
technology to estimate the state of atmosphere at a particular
spatial location. Due to the availability of huge data
researchers got interest to analyze and forecast the weather. By
predicting accurately it helps in safeguarding human life and
their wealth. Forecasting techniques are useful in effective
weather prediction; crops yield growth, traffic congestions,
marine navigation, forests growth & defense purposes. The
Data Mining techniques / algorithms are proved to be better
than the existing techniques / methodologies / traditional
statistical methods. In this paper, we have proposed a new
Hybrid SVM (Support vector machines) model for effective
weather prediction by analyzing the given huge weather data
sets and to find the suitable patterns existing in it. SVM is one
of the supervised learning methods under classifications &
regression. Good results were arrived in weather prediction
than the other Data Mining Techniques. For this, as a case
study we have considered Krishna district weather data sets
were considered for analysis using this proposed hybrid SVM
technique.
Keywords - Weather Prediction, Spatial Location, Hybrid
Support Vector Machine, Machine Learning Algorithms, Data
Mining Techniques.
I. INTRODUCTION
Analysis of Weather data is used for finding the
atmospheric pattern and apart from knowing the status of
weather at a specific location for desired time duration.
Krishna district [Ref.1] is a district in the Coastal Andhra
region of Andhra Pradesh, India. It is named after the
Krishna River, the third longest river that flows within India,
flows through the district and joins Bay of Bengal near
Hamsaladevi village of the district. Machilipatnam is the
administrative headquarters for the district. Vijayawada is
the biggest city in the district and also the commercial center
of the state. According to 2011 Census, it has a population of
4,529,009, of which 41.00% is urban and a literacy rate of
74.37%. The district is bounded to the north-west and west
by Telangana, to the north-east by West Godavari District, to
the south-east by the Bay of Bengal, to the southwest by
Guntur District. All the numerical data about the previous
and current status of the rainfall is collected for the
estimation of weather. It is useful to study and analyze
weather data sets for effective prediction. The present
weather prediction models are useful in estimation of
weather based on the atmospheric conditions. In expert
systems there is no automation methods exists in order to
pick up the best possible predictive models for effective
weather forecast. For this, new mathematical models are to
be designed in order to address very huge time and
computational complexities that occurs due to uncertainty of
weather conditions. By this, the new models developed will
benefit us in enhancing weather prediction accuracies.
Effective forecasting of weather has lots of uses like by
providing early information about natural disasters to the
public to save their lives and minimize the property
damages. These predictions make the agriculturalists and
business people to plan crops and yields. Weather prediction
is classified depending on time factor in to four types’
namely Short range- forecast up to 48 hours, Extendedforecast from 3 – 5 days, Medium range–3 to 7 days and
Long-range forecasts– more than 7 days [1]. We consider the
rainfall data of three districts of Andhra Region. Here we
considered Krishna district weather data for study and
analysis for effective weather prediction.
II. LITERATURE SURVEY
A. K- Means clustering algorithm:
K- Means clustering algorithm [6, 7, 8] is one of the
most widely used clustering algorithm in the real world
applications. It has been successfully applied in domains
such as relational databases, gene expression data and
decision support systems. One of the drawbacks of this
algorithm is the choice of centroids location which is at
random at the beginning of the algorithm; variables
treatment as numbers and the number of clusters K is
unknown. The drawback of the first one is overcome by
multiple runs or particular initialization methods. But these
initialization methods are not better than random centroids.
The second drawback is overcome by handling the
categorical parameters by using the measure of matching
dissimilarity. For the third drawback, the input parameter
that is number of clusters is fixed before in the regular Kmeans algorithm. This problem can be addressed by the use
HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT
OF ANDHRA REGION
of cluster validity indices. Like other data mining algorithms,
reliability through K-means reduced when treating highdimensional data. In this type data sets are nearly always too
sparse. The application of Euclidean distance becomes
meaningless in high dimensional sparse spaces. By
combining K-means with feature extraction methods such as
Principal Component Analysis (PCA) and Self Organizing
Maps (SOM) there is a possibility of solution.
B. Support vector machine:
Support vector machine (SVMs) [2, 5] is a supervised
learning method that generates a decision-making system to
predict new values. When a set of training samples were
given the SVM algorithm develops a model to predict to
which category this samples falls. SVM model segregates as
distant as possible the given samples by a clear gap by
representing them as points in space. The given examples to
be predicted are then mapped into the either side of the gap
based on the category that they belongs.
SVM constructs a hyper plane or set of hyper-planes in
the high dimensional space that can be used for machine
learning algorithms like classification, or regression. The
largest distanced hyper plane has a good margin even to the
closest training data points to whatever class they belongs to.
If there is higher separation there will be smaller
generalization of classifier error. This technique can be
explained as a combination of two steps:
Learning: This step consists of examples along with
SVM training.
Prediction: It is necessary when ever results are not
known by inclusion of new samples.
Every example is represented as a pair of (input, output)
in which data set is the input and how to categorize is
denoted by output. In this every example is represented as a
pair (x, y) in mathematical format where x is the real
numbers vector and y is the Boolean value (i.e. y/n, 1/-1,
T/F), or a real number. Out of these two cases first one, is
treated as a problem of classification, and the second one
treated as a problem of regression. Learning results are
compared to the (x, y) points representing a real function in
the space and SVM generates the interpolated function
interpolates these series of points to the best.
The “Training Set” consists of group of pairs that make
up the SVM, where as in “Test Set” it contains the group of
pairs for prediction. The data set is derived after clustering
the data set.
III. PROPOSED APPROACH
The realistic data sets that are collected are in raw format.
In first step they have to be converted in to experimental
suitable data [3] sets format by preprocessing techniques like
noise removal etc. Initially the K-means clustering technique
will be applied on the data sets and after that SVM
classification technique will be applied over that clustered
data set. This SVM hybrid technique [4] is suitable for
effective analysis of weather data sets.
3.1 Implementation of the Proposed Approach: K-means
clustering was applied on the data set for about 102 years
namely Monthly Mean for each year Average Temperature
(1901-2002). K-means clustering algorithm was applied on
the data set [3] Monthly mean for each year Average
Temperature and the Clustering model is on full training set.
There are 13 attributes namely year, Jan, Feb, Mar, April,
May, June, July, August, Sep, Oct, Nov and Dec. Euclidean
distance was applied for making clusters. The number of
iterations is 6 and within cluster sum of squared errors is
33.373323263149274. Time taken to build model (full
training data) is 0.03 seconds. In this, missing values are
globally replaced with mean/mode. Table 1 is representing
the cluster centroids for about 5 clusters. Total number of
Instances is 102. Cluster 0 has 33(32%), Cluster 1 has
15(15%), Cluster 2 has 24(24%), Cluster 3 has 13(13%) and
Cluster 4 has 17(17%) number of instances. Fig.1 shows
clusters Histogram. Fig.2 shows cluster graph of Monthly
Mean for each year Average temperature with Instance
number along x-axis and year along y-axis. SVM
classification was done with 14 attributes namely year,
names of 12 months and Clusters. The total number of
Instances is 102. It was evaluated on total training set.
LibSVM wrapper, actual code by Yasser EL-Manzalawy (=
WLSVM). The time taken to build the model is 0.14 seconds
and 0.02 seconds to test the model on training data set.
Agreement between the classifications and the true classes is
represented by Kappa Statistic. It is a Chance-corrected
measure. It is measured by considering expected agreement
by chance away from observed agreement and divides it by
highest possible agreement. If the value is more than 0 it
means that the classifier is doing better. The Kappa statistic
is near to 1 so it is near to perfect agreement. The kappa
statistic measures the agreement of prediction with the true
class 0.8828 signifies almost there is complete agreement.
The correctly classified instances are 92 and incorrectly
classified instances are 10. That means it has classified
2 | Page
HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT
OF ANDHRA REGION
perfectly. There is no considerable error most of the errors
are near to zero except the root relative squared error. Fig.3
shows SVM classifier graph of monthly mean for each year
Average temperature. In this graph Clusters were taken along
x-axis and predicted cluster along y-axis. The Annual Mean
values of Minimum, Average and Maximum temperatures
for all the years 1901-2002 are shown in Fig.4 along with
their trend line Polynomial equations. In this month’s are
taken along x-axis and temperature along y-axis. The
equation for Mean Minimum Temperature is given by
Wet
day
Frequency
+
*
Potential
Evapotranspiration +13.3485 --- (4)
IV. FIGURES AND TABLES OF CLUSTER CENTROIDS
Cluster#
Attribute Full Data
y = -0.0019x5 + 0.0631x4 - 0.7665x3 + 3.8138x2 -
3.9871
3
0
1
(33)
(15)
2
4
5.4917x + 20.378 ---- (1)
(102)
(13)
(24)
(17)
Equation for Mean Average Temperature is given
by
==========================================
=============================
y = -0.0024x5 + 0.0787x4 - 0.9231x3 + 4.3635x2 Year
5.9399x + 25.725 --- (2)
1951.5
1977.3333
1924.697
1986.6923
1930.2
1958.9412
Equation for Mean Maximum Temperature is
Jan
given by
23.3345
23.5955
y = -0.0029x5 + 0.0943x4 - 1.0808x3 + 4.9155x2 6.3828x + 31.093 --- (3)
Feb
24.1856
25.1828
25.6661
Mar
The regression [9] equation for Annual Mean
values is given by
30.6396
24.8517
26.159
27.6269
28.0848
Apr
23.1897
27.0164
28.5775
30.3108
30.0328
30.975
22.7651
23.0984
24.5215
24.9804
27.1878
27.8259
29.5188
30.5771
Averagetemperature = 0.0442 * cloudcover + (May
1.2564) * DiurnalTemperatureRange +
32.8083
(-9.431)* GroundFrostFrequency +
32.3848
32.1833
32.6614
31.8497
32.4384
0.0208 *
precipitation + 0.0824 * Vapour Pressure + -0.7168 *
Jun
31.8466
31.0933
31.0168
30.7131
30.7815
30.7443
3 | Page
HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT
OF ANDHRA REGION
Jul
28.838
29.185
Aug
29.5832
28.4052
28.7352
Sep
28.2424
27.1811
23.9412
26.91
27.7857
24.7899
25.1427
Dec
28.0239
28.9003
27.3202
Nov
28.4061
28.6475
28.6095
Oct
28.6428
24.2983
25.6115
23.1654
22.8922
28.7983
28.192
28.0938
28.0271
28.2679
27.6229
27.5127
26.7561
25.2432
Fig.2: Clustered graph generated using k-means clustering algorithm
24.2178
23.3564
23.493
DETAILED ACCURACY BY CLASS
22.4715
TP Rate
F-Measure
MCC
TABLE1: CLUSTERED CENTROIDS OF MONTHLY MEAN FOR EACH YEAR
AVERAGE TEMPERATURE
ROC Area
0.970
0.914
0.873
0.749
0.895
0.917
0.910
0.935
Weighted Avg.
Fig.1: Clusters Histogram
0.897
0.875
0.038
1.000
0.988
0.902
0.885
0.933
0.600
0.958
cluster2
1.000
0.846
cluster3
0.895
0.895
0.036
0.970
cluster1
0.866
0.024
Class
cluster0
0.858
0.000
Recall
0.865
0.659
0.923
1.000
0.944
0.000
0.960
0.846
PRC Area
0.848
0.800
0.958
0.920
0.072
0.949
0.600
0.750
FP Rate Precision
1.000
cluster4
0.912
0.902
0.833
4 | Page
HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT
OF ANDHRA REGION
1.
TABLE 2: DETAILED ACCURACY BY CLUSTERED CLASS
CONFUSION MATRIX
a
b
c d
e
32 0
0
0
1 | a = cluster0
5
1
0
0 | b = cluster1
0 0 23
0
1 | c = cluster2
9
0 0
<-- classified as
2 11 0 | d = cluster3
0 0
0
2.
0 17 | e = cluster4
Fig.4: Graph of Annual Mean Temperatures for all the years 1901-2002
along with Trend line equations
TABLE 3: CONFUSION MATRIX
Fig.3: SVM classifier graph of monthly mean for each year Average
temperature
Fig.5: Graph of Annual mean values of weather parameters for all the
years (1901-2002)
V. CONCLUSION:
In the K-means cluster analysis given by Table.1 cluster
0 is 33 with highest number of instances has 1925 year as the
centroid. Cluster 4 has lowest temperature 22.47150c in the
month of December. Cluster3 has highest temperature values
in all months except May and June with 13 instances.
Cluster2 with number of instances 24 has highest
temperature values in these two months May and June.
Detailed accuracies are given by Table.2 indicates that
5 | Page
HYBRID SVM DATAMINING TECHNIQUES FOR WEATHER DATA ANALYSIS OF KRISHNA DISTRICT
OF ANDHRA REGION
cluster 4 has TP rate as 1 and F-measure as 0.944. Cluster0
and cluster2 has TP rate values as 0.970 and 0.958 and Fmeasure values are 0.914 and 0.920 respectively. It tells that
cluster 4 was classified accurately followed by cluster0 and
cluster2. In the confusion matrix represented by the Table.3
each column represents instances in the predicted class and
each row indicates instances in actual class. The classifier
has classified accurately for the cluster4, cluster0 and
cluster2 but less accurate in case of cluster1 and cluster 3
indicated by confusion matrix. Only cluster 4 was classified
accurately. Fig.4. gives the following conclusions or
inferences in which months are taken along x-axis and
different weather parameter values are taken along y-axis.
There is a correlation between mean precipitation and mean
cloud cover its value is 0.86114364. Likewise there is a
correlation between average temperature and mean vapour
pressure and is 0.851278951. There is a correlation between
mean
diurnal
temperature
and
mean
potential
evapotranspiration and is 0.665903472. There is negative
correlation between Mean ground frost frequency and mean
wet day frequency and its value is -0.729960042. Mean
average temperature has correlation with all the parameters
namely mean cloud cover, diurnal temperature range, ground
frost frequency, precipitation, vapour pressure, wet day
frequency, potential evapotranspiration and reference crop
evapotranspiration. It has highest correlation with reference
crop evapotranspiration and negative correlation with ground
frost frequency. The future scope is this Hybrid SVM
technique can be extended to any weather data belongs to
any spatial location for further analysis based on various
weather parameters.
REFERENCES
[1]http://www.timeanddate.com/weather/forecast-accuracytime.html
[2] http://en.wikipedia.org/wiki/Weather_forecasting
[3] Folorunsho Olaiya, Application of Data Mining Techniques in
Weather Prediction and Climate Change Studies, I.J. Information
Engineering and Electronic Business, 2012, 1, 51-59, Published
Online February 2012 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijieeb.2012.01.07
[4]Dr. M.H.Dunham, Companion slides for the text, “Data
Mining, Introductory and Advanced Topics”, Prentice Hall, 2002.
[5] Y.Radhika and M.Shashi, “Atmospheric Temperature
Prediction using Support Vector Machines” International Journal
of Computer Theory and Engineering, Vol. 1, No. 1, April 2009
1793-8201
[6] Dr.T. V. Rajini Kanth, Ananthoju Vijay Kumar, Estimation of
the Influence of Fertilizer Nutrients Consumption on the Wheat
Crop yield in India- a Data mining Approach, 30 Dec 2013,
Volume 3, Issue 2, Pg.No: 316-320, ISSN: 2249-8958 (Online).
[7] Dr.T. V. Rajini Kanth, Ananthoju Vijay Kumar, A Data
Mining Approach for the Estimation of Climate Change on the
Jowar Crop Yield in India, 25Dec2013,Volume 2 Issue 2,
Pg.No:16-20, ISSN: 2319-6378 (Online).
[8] A. Vijay Kumar, Dr. T. V. Rajini Kanth “Estimation of the
Influential Factors of rice yield in India” 2nd International
Conference on Advanced Computing methodologies ICACM2013, 02-03 Aug 2013, Elsevier Publications, Pg. No: 459-465,
ISBN No: 978-93-35107-14-95.
[9] Dr.David, B.Stephenson, Data analysis methods in weather and
climate research Department of Meteorology University of
Reading, July,20, 2005 , http://www.met.rdg.ac.uk /cag/courses
N.Rajasekhar did B.Tech in
Mechanical Engineering from NBKRIST,
Vidyanagar, Nellore and M.Tech in
Computer Science & Engineering from
Bharath University, Chennai, INDIA, in
2006. Currently pursuing Ph.D from Acharya Nagarjuna
University, Guntur. He is presently working as Assistant
Professor in the Department of Computer Science &
Engineering VNR Vignana Jyothi Institute of Engineering
& Technology, Hyderabad, India. He published papers in
reputed journals and conferences. He is a professional
member of ISTE
Dr.T.V.K.Rajinikanth
received
M.Tech
from
Osmania
University, Hyderabad in 2001, PhD
from Osmania University,Hyderabad in
2007. He is currently working as a
Professor, Department of Computer
Science & Engineering, Sree Nidhi Institute Of Science &
Technology, Hyderabad, India. He has written and attended
several National & International journals and conferences.
His research interests include Data warehouse & mining,
Semantic Web, Spatial Data mining, ANN. He is a
professional member of CSI and ISTE.
6 | Page