International Journal of Research in Computer and
Communication Technology, Vol 3, Issue 10, October - 2014
ISSN (Online) 2278- 5841
ISSN (Print) 2320- 5156
Analyzing Feedback Patterns Using Data Mining Technique
Shaunak Chheda
Department of Computer Engineering
D.J. Sanghvi College of Engineering
Mumbai, India
Email: [email protected]
Prof. Lynette D’mello
Department of Computer Engineering
D.J. Sanghvi College of Engineering
Mumbai, India
Email: [email protected]
Abstract: This paper presents a data mining technique for
studying which courses a student is more likely to be
interested in during his graduation. The raw data was collected
from feedback forms of an institution offering various courses.
We preprocessed the raw data and performed t-weight
calculations to present useful results.
Keywords: Data mining, KDD, Data preprocessing, t-weight.
I. INTRODUCTION
The data mining approach, a relatively new technique, is
deployed in large databases to find novel and useful
patterns that might otherwise remain unknown. This paper
presents a data mining approach to studying course preference
patterns among students of the various departments in a
university.
The data mining process consists of a series of transformation
steps, as shown in Figure 1. It is the overall process of
converting raw data into useful information.
Figure 1. KDD (Knowledge Discovery in Databases) Process
I. Data Preprocessing
Data preprocessing takes the raw data collected from disparate
sources and converts it into a uniform format [2]. Its main aim
is to select the data relevant to the data mining task at hand.
It consists of the following tasks:
1) Data cleaning
Data cleaning attempts to “clean” the data by filling in
missing values, smoothing noisy data, identifying or removing
outliers, and resolving inconsistencies. Ambiguous data can
confuse the mining procedure and result in undesirable output.
2) Data integration
Data is drawn from various sources for analysis, which involves
integrating multiple databases, files, etc. All the data should
be represented in a consistent manner without redundancies.
3) Data reduction
Data reduction eliminates irrelevant data and thereby reduces
the size of the data set. It obtains a reduced representation
of the data set that is much smaller in volume, yet produces
the same analytical result. The original data can be compressed
or replaced by an alternative, smaller representation.
4) Data transformation
Data transformation converts the data into forms appropriate
for mining, so that the mining process is more efficient and
the patterns found are easier to understand.
II. Data mining
Data mining is the process of discovering interesting patterns
and knowledge from large amounts of data. It is the essential
step in which intelligent methods are applied to extract useful
patterns from the preprocessed data.
III. Data Post-processing
Data post-processing presents the information from the data
mining phase in a form that is easier for a user to understand.
It consists of:
1) Pattern evaluation
It deals with identifying truly interesting patterns from the
processed data.
2) Knowledge presentation
It uses data visualization and knowledge representation
techniques such as graphs and charts to present the mined
knowledge to users [2].
II. DESCRIPTION OF DATA
The University comprises three departments, namely P, Q and R.
Each department consists of 50 students. The University offers
various courses for its students.
In this paper, we analyzed and counted the feedback from all
the students who attended the respective seminar. Positive
feedback was recorded as a “Yes” and negative feedback as a
“No”. This study looks at the likelihood that a student
belonging to a particular department would prefer to take a
particular course in the future.
III. DATA PRE-PROCESSING
The data for this study was collected from feedback forms
filled in by students of different departments and read into a
spreadsheet. This data was then cleaned up by removing extra or
unnecessary information and integrated into a database.
The resulting database contains 150 rows (one for each
student) and four attributes (one for the department id of the
student and one for each course covered).
In Table 1 we present a sample of the raw data in
the database:
Table 1: Sample raw data
DEPT ID  A1  A2  …  B1  B2  …  C1  C2  …
P        Y   N
Q        N   Y
R        N   N
N  N  N  Y  Y  Y  Y  Y  Y  Y  N  Y
The University has three major departments, P, Q and R, each
with a strength of 50 students. The dataset also contained a
single student whose department was “S”. Since this was the
only row for Department S, we eliminated it from our data set.
Some students did not submit feedback for a particular course.
For any course that lacked feedback from a particular student,
we filled in the majority feedback of that student's department
for that course. So, if a student from Department Q gave no
feedback for course C1, but a majority of the other students
from Department Q voted “Y” for that course, we replaced the
“No feedback” with a “Y”.
The same process was carried out for students who could not
attend the course.
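The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the authors actually ran: the row layout (dictionaries with a "dept" key and one key per course), the use of None for missing feedback, and the sample rows are all hypothetical. The paper does not say how ties are broken, so the sketch simply keeps whichever answer was seen first.

```python
from collections import Counter

def impute_majority(rows, courses):
    """Fill missing feedback (None) with the department's majority answer."""
    # Tally the answers actually given, per (department, course).
    tallies = {}
    for row in rows:
        for course in courses:
            answer = row[course]
            if answer is not None:
                key = (row["dept"], course)
                tallies.setdefault(key, Counter())[answer] += 1

    filled = []
    for row in rows:
        new_row = dict(row)
        for course in courses:
            if new_row[course] is None:
                counts = tallies.get((row["dept"], course))
                if counts:  # leave the gap if nobody in the department answered
                    # Ties (not covered by the paper) go to the first answer seen.
                    new_row[course] = counts.most_common(1)[0][0]
        filled.append(new_row)
    return filled

# Hypothetical sample rows, not taken from the paper's data set.
sample = [
    {"dept": "Q", "C1": "Y"},
    {"dept": "Q", "C1": "Y"},
    {"dept": "Q", "C1": "N"},
    {"dept": "Q", "C1": None},   # student who gave no feedback
    {"dept": "S", "C1": "Y"},    # lone student of Department S
]

# Drop departments represented by a single student, as was done for "S".
dept_sizes = Counter(row["dept"] for row in sample)
kept = [row for row in sample if dept_sizes[row["dept"]] > 1]

cleaned = impute_majority(kept, ["C1"])
print(cleaned[3])
```

Only answers that were actually given are tallied, so a student's own missing entry never influences the majority used to fill it.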
This made our dataset (or database, as shown in Table 1)
ready to be processed or mined. We start our quantitative
analysis with an exploratory tool: t-weight calculations.
IV. QUANTITATIVE CHARACTERISTIC RULE
A logic rule that carries quantitative information is called a
quantitative rule.
t-weights are an exploratory quantitative data analysis tool
that presents visualizations of within-class comparisons.
Let $q_a$ be a generalized tuple describing the target class.
The t-weight of $q_a$ is the percentage of tuples of the target
class from the initial working relation that are covered by
$q_a$:
$$ \mathrm{t\_weight} = \frac{\mathrm{count}(q_a)}{\sum_{i=1}^{n} \mathrm{count}(q_i)} $$
where $n$ is the total number of tuples for the target class in
the generalized relation, $q_1, q_2, \dots, q_n$ are the tuples
for the target class, and $q_a$ is one of $q_1, q_2, \dots, q_n$.
The range for t-weight is [0.0, 1.0] or [0%, 100%].
The t-weight rule is expressed in the form:
$$ \forall X,\ \mathrm{target\_class}(X) \Rightarrow \mathrm{condition}_1(X)\,[t : w_1] \lor \dots \lor \mathrm{condition}_m(X)\,[t : w_m] $$
The above rule can be understood as:
If X is in the target class, then there is a probability of
$w_i$ that the tuple X satisfies $\mathrm{condition}_i$.
For example, in this set of data, t-weights measure, for each
course, the probability that a student of each class
(department) gives a yes feedback or a no feedback. So, for
each course, we count the number of yes feedbacks and no
feedbacks for each target class.
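As a minimal sketch of this calculation, the Python snippet below divides each count by the total for its target class, following the t-weight definition given above. The function name `t_weights` and the dictionary layout are illustrative assumptions. The counts are those reported for course C1 in Table 2.

```python
def t_weights(counts):
    """Return the t-weight of each generalized tuple within one target class.

    counts maps a condition (here "Y" or "N") to count(q_i); the t-weight of
    a tuple is its count divided by the sum of all counts in the class.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("target class has no tuples")
    return {condition: count / total for condition, count in counts.items()}

# Counts for course C1 as listed in Table 2, keyed by department.
c1_counts = {
    "P": {"Y": 10, "N": 40},
    "Q": {"Y": 45, "N": 5},
    "R": {"Y": 15, "N": 35},
}

c1_weights = {dept: t_weights(counts) for dept, counts in c1_counts.items()}
for dept, weights in c1_weights.items():
    print(dept, {answer: f"{w:.0%}" for answer, w in weights.items()})
```

Within each department the two t-weights sum to 1, which is why every rule lists complementary percentages.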
Using the data from Table 1, we generated t-weights.
Table 2. t-weights – Course C1
DEPT  C1  COUNT  t-weight
P     Y   10     20%
P     N   40     80%
Q     Y   45     90%
Q     N   5      10%
R     Y   15     30%
R     N   35     70%
The t-weights of Table 2 can be converted into logic rules.
Let the target class be Dept(X). Then the corresponding
characteristic rule in logic form is:
Rule 1:
∀X, Dept(X) = ‘P’ ⇒ (C1=“Y”)[t:20%] ∨ (C1=“N”)[t:80%]
This rule says that if X is in the target class, that is, if a
student of the university belongs to department P, there is a
20% probability that this student gave a “Yes” feedback for
course C1, and an 80% probability that this student gave a “No”
feedback for course C1.
Similarly, the next rules that can be generated from Table 2
are:
Rule 2:
∀X, Dept(X) = ‘Q’ ⇒ (C1=“Y”)[t:90%] ∨ (C1=“N”)[t:10%]
Rule 3:
∀X, Dept(X) = ‘R’ ⇒ (C1=“Y”)[t:30%] ∨ (C1=“N”)[t:70%]
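Turning a table of t-weights into rules of this form is mechanical, and a short sketch makes this concrete. The helper name `characteristic_rule` and its parameters are assumptions made for illustration. The t-weights are the ones listed for course C1 in Table 2, written as fractions.

```python
def characteristic_rule(target, value, course, weights):
    """Format one quantitative characteristic rule from a class's t-weights."""
    terms = [
        f'({course}="{answer}")[t:{weight:.0%}]'
        for answer, weight in weights.items()
    ]
    # The disjunction lists every condition with its t-weight.
    return f"∀X, {target}(X) = '{value}' ⇒ " + " ∨ ".join(terms)

# t-weights for course C1 as listed in Table 2.
c1_t_weights = {
    "P": {"Y": 0.20, "N": 0.80},
    "Q": {"Y": 0.90, "N": 0.10},
    "R": {"Y": 0.30, "N": 0.70},
}

rules = [
    characteristic_rule("Dept", dept, "C1", weights)
    for dept, weights in c1_t_weights.items()
]
for number, rule in enumerate(rules, start=1):
    print(f"Rule {number}: {rule}")
```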
Table 3. t-weights – Course C2
DEPT  C2  COUNT  t-weight
P     Y   42     84%
P     N   8      16%
Q     Y   25     50%
Q     N   25     50%
R     Y   12     24%
R     N   38     76%
Table 4. t-weights – Course C3
DEPT  C3  COUNT  t-weight
P     Y   6      12%
P     N   44     88%
Q     Y   21     42%
Q     N   29     58%
R     Y   47     94%
R     N   3      6%
The t-weights of Table 3 can be converted to logic rules:
Rule 4:
∀X, Dept(X) = ‘P’ ⇒ (C2=“Y”)[t:84%] ∨ (C2=“N”)[t:16%]
Rule 5:
∀X, Dept(X) = ‘Q’ ⇒ (C2=“Y”)[t:50%] ∨ (C2=“N”)[t:50%]
Rule 6:
∀X, Dept(X) = ‘R’ ⇒ (C2=“Y”)[t:24%] ∨ (C2=“N”)[t:76%]
The t-weights of Table 4 can be converted to logic rules:
Rule 7:
∀X, Dept(X) = ‘P’ ⇒ (C3=“Y”)[t:12%] ∨ (C3=“N”)[t:88%]
Rule 8:
∀X, Dept(X) = ‘Q’ ⇒ (C3=“Y”)[t:42%] ∨ (C3=“N”)[t:58%]
Rule 9:
∀X, Dept(X) = ‘R’ ⇒ (C3=“Y”)[t:94%] ∨ (C3=“N”)[t:6%]
V. CONCLUSION FOR T-WEIGHTS
From the t-weight rules, we can draw the following
conclusions:
• There is a higher probability of students of
Department P as well as Department R giving a no
feedback, and of Department Q giving a yes feedback,
for course C1.
• There is a higher probability of students of
Department P giving a yes feedback, and of Department
R giving a no feedback, for course C2. Both outcomes
are equally likely in the case of Department Q.
• There is a higher probability of students of
Department P as well as Department Q giving a no
feedback, and of Department R giving a yes feedback,
for course C3.
VI. CONCLUSION
In this paper, data mining techniques are presented that can be
used to study or mine which courses a student belonging to a
department would prefer to take during his graduation tenure
in the University.
We have shown the whole data mining process, from
preprocessing the input data to presenting information (in the
form of rules) and conclusions.
The exploratory data mining technique, t-weights, gave us a
picture of what percentage of students from each department
gave a positive or a negative feedback for a particular course.
This study presents interesting patterns which can be utilized
in determining which courses a student from a particular
department would be more interested in taking up during his
term. Such results will be beneficial for the students as well
as for the University in deciding which courses it should offer.
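The department-level preferences summarised here can be read directly off the t-weights. The sketch below is one possible way to do so and is not part of the original study: it picks, for each department and course, the answer with the larger t-weight and reports a tie when both are equal. The names `T_WEIGHTS` and `likely_feedback` are illustrative, and the values are the t-weights from Tables 2 to 4 written as fractions.

```python
# t-weights from Tables 2-4, keyed by course and then by department.
T_WEIGHTS = {
    "C1": {"P": {"Y": 0.20, "N": 0.80},
           "Q": {"Y": 0.90, "N": 0.10},
           "R": {"Y": 0.30, "N": 0.70}},
    "C2": {"P": {"Y": 0.84, "N": 0.16},
           "Q": {"Y": 0.50, "N": 0.50},
           "R": {"Y": 0.24, "N": 0.76}},
    "C3": {"P": {"Y": 0.12, "N": 0.88},
           "Q": {"Y": 0.42, "N": 0.58},
           "R": {"Y": 0.94, "N": 0.06}},
}

def likely_feedback(weights):
    """Return "Y" or "N" when one answer has the larger t-weight, else "tie"."""
    if weights["Y"] == weights["N"]:
        return "tie"
    return max(weights, key=weights.get)

# More likely answer for every (course, department) pair.
summary = {
    course: {dept: likely_feedback(w) for dept, w in depts.items()}
    for course, depts in T_WEIGHTS.items()
}

# Courses each department is more likely to rate positively.
preferred = {
    dept: [course for course in T_WEIGHTS if summary[course][dept] == "Y"]
    for dept in ("P", "Q", "R")
}
print(summary)
print(preferred)
```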
Our future work aims at improving the results obtained by using
advanced data mining techniques such as decision tree analysis
and association rule mining.
We can also use WEKA, a data mining tool that supports several
standard data mining tasks. The WEKA workbench contains a
collection of visualization tools and algorithms for data
analysis, which can be used to provide a better and more
efficient representation of the results.
VII. REFERENCES
[1] Sikha Bagui, Dustin Mink, and Patrick Cash, “Data
mining techniques to study voting patterns in the US”,
Data Science Journal.
[2] Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining:
Concepts and Techniques”, Morgan Kaufmann Publishers.
[3] Oded Maimon and Lior Rokach, “Introduction to knowledge
discovery in databases”.