Data Mining Technology in e-Learning
Abdel-Badeeh M. Salem
Computer Science Department
Faculty of Computer & Information Sciences
Ain Shams University, Abbassia, Cairo, Egypt
E-mail: [email protected]
Abstract
Data mining technology deals with the discovery of hidden knowledge, unexpected patterns and new rules from large databases. It is currently regarded as the key element of a much more elaborate process called "knowledge discovery in databases" (KDD). From the artificial intelligence point of view, the term KDD refers to the whole process of extracting knowledge from data. Recently, researchers have begun to investigate various data mining methods to help teachers improve the capabilities of e-Learning systems. These methods allow them to discover new knowledge based on students' usage data, which makes knowledge extraction one of the most promising application areas. This talk presents the application of data mining techniques and concepts in e-Learning systems.
1. Introduction
The knowledge discovery in databases (KDD) process and data mining aim to extract useful information and discover hidden patterns from huge databases, patterns which statistical approaches alone cannot discover. KDD is a multidisciplinary field of research that includes machine learning, databases, statistics, expert systems, visualization, high performance computing, rough sets, neural networks, and knowledge representation. Some of the most useful data mining tasks and methods are statistics, visualization, clustering, classification and association rule mining [1]. Recently, researchers have begun to investigate various data mining methods to help instructors and administrators improve e-learning systems [2]. These methods discover new, interesting and useful knowledge based on students’ usage data. The main e-learning problems to which data mining techniques have been applied are: assessing students’ learning performance; providing course adaptation and learning recommendations based on students’ learning behavior; evaluating learning material and web-based educational courses; providing feedback to both teachers and students of e-learning courses; and detecting atypical student learning behavior.
2. KDD and Data Mining Methodology
The Knowledge Discovery in Databases process involves the following: (a) using the database along with any required selection, preprocessing, subsampling, and transformations of it; (b) applying data mining methods (algorithms) to enumerate patterns from it; and (c) evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user.
What follows is a brief description of each step [1]:
1) Define the Goal of the KDD Process: Developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint.
2) Selection of Target Dataset: Creating a target data set, selecting a data set, or focusing on a subset of variables or data samples on which discovery is to be performed.
3) Data Cleaning and Preprocessing: The following tasks are performed: (a) removing noise if appropriate, (b) collecting the necessary information to model or account for noise, (c) deciding on strategies for handling missing data fields, and (d) accounting for time-sequence information and known changes.
4) Data Reduction and Transformation: Finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
5) Matching the Goals of the KDD Process to a particular data mining method, e.g. summarization, regression, classification and clustering.
6) Exploratory Analysis and Model and Hypothesis Selection: Choosing the mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This step includes (a) deciding which models and parameters might be appropriate (models for categorical data are different from models for vectors) and (b) matching a particular data mining method with the overall criteria of the KDD process (the end user might be more interested in understanding the model than in its predictive capabilities).
7) Data Mining: Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
8) Interpreting Mined Patterns: This step involves visualization of the extracted patterns and models, or visualization of the data given the extracted models, possibly returning to any of steps 1 through 7 for further iteration.
9) Acting on the Discovered Knowledge: Using the knowledge directly, incorporating the knowledge into another system for further actions, or simply documenting it and reporting it to interested parties. This step includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
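As a rough illustration of the dataset selection, cleaning and transformation steps described above, the following Python sketch prepares a tiny usage log for mining. The records, the field names and the choice of mean imputation and min-max scaling are assumptions made for this example only; the paper does not prescribe a particular strategy.

```python
from statistics import mean

# Hypothetical raw usage log: one record per student; None marks a missing field.
raw = [
    {"student": "s1", "logins": 12, "minutes": 340, "browser": "x"},
    {"student": "s2", "logins": None, "minutes": 120, "browser": "y"},
    {"student": "s3", "logins": 4, "minutes": None, "browser": "x"},
    {"student": "s4", "logins": 8, "minutes": 200, "browser": "z"},
]

def select(records, fields):
    """Selection of the target dataset: keep only the variables relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]

def impute_mean(records, fields):
    """Cleaning: one possible missing-data strategy, fill gaps with the field mean."""
    cleaned = [dict(r) for r in records]
    for f in fields:
        fill = mean(r[f] for r in records if r[f] is not None)
        for r in cleaned:
            if r[f] is None:
                r[f] = fill
    return cleaned

def min_max_scale(records, fields):
    """Transformation: rescale each field to [0, 1] so that fields are comparable."""
    scaled = [dict(r) for r in records]
    for f in fields:
        lo = min(r[f] for r in records)
        hi = max(r[f] for r in records)
        for r in scaled:
            r[f] = (r[f] - lo) / (hi - lo) if hi > lo else 0.0
    return scaled

numeric = ["logins", "minutes"]
target = select(raw, ["student"] + numeric)
cleaned = impute_mean(target, numeric)
prepared = min_max_scale(cleaned, numeric)
```

The `prepared` records could then be handed to whichever mining method matches the goal; in practice the imputation and scaling choices would themselves be revisited as the process iterates.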
3. Data Mining Tasks
Data mining is supported by a host of models that capture the character of data in several different ways.
• Clustering: The key objective is to find natural groupings (clusters) in high-dimensional data. Clustering is an example of unsupervised learning, and it is a part of pattern recognition.
• Regression Models: These originate from standard regression analysis and its applied part known as system identification. The underlying idea is to construct a linear or nonlinear function that fits the data.
• Classification: This concerns learning that classifies data into predetermined categories. The term originates from pattern recognition, in which a vast number of classifiers have been developed.
• Summarization: This is an approach towards characterizing data via a small number of features/attributes. In the simplest scenario one can think of the mean and standard deviation as two extremely compact descriptors of the data. This technique is often applied in interactive exploratory data analysis and automated report generation.
• Link Analysis: It is concerned with the determination of relationships (dependencies) between fields in a database. In a particular case we may be interested in determining the correlation between variables.
• Sequence Analysis: This type of analysis is geared toward problems of modeling sequential data. Pertinent models embrace time series analysis, time series models, and temporal neural networks.
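The summarization and link analysis tasks above can be made concrete with a short Python sketch. It computes the mean and standard deviation as compact descriptors and the Pearson correlation between two fields; the sample values (weekly study hours and exam scores) are invented for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def summarize(values):
    """Summarization: describe a field by its mean and (population) standard deviation."""
    return {"mean": mean(values), "std": pstdev(values)}

def pearson(xs, ys):
    """Link analysis: Pearson correlation between two fields of equal length."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: study hours per week and exam scores for four students.
hours = [2, 4, 6, 8]
scores = [50, 60, 70, 80]

summary = summarize(hours)
correlation = pearson(hours, scores)
```

Because the invented scores grow linearly with the hours, the correlation here comes out at (numerically) 1; real tracking data would of course be noisier.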
4. Data Mining Techniques
This section gives a brief account of well-known data mining techniques [3].
4.1. Neural Networks
Neural networks (NN) are inspired by biological models of brain functioning. They are capable of learning from examples and generalizing the acquired knowledge. Due to these abilities, neural networks are widely used to find nonlinear relations which otherwise could not be unveiled due to analytical constraints. The learned knowledge is hidden in their structure, so it cannot be easily extracted and interpreted. The structure of the multilayer perceptron (MLP), i.e. the number of hidden layers and the number of neurons, determines its capacity, while the knowledge about the relations between input and output data is stored in the weights of the connections between neurons. The values of the weights are updated in a supervised training process with a set of known and representative input-output data samples.
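To show what "knowledge stored in the weights" and "supervised weight updates" mean in practice, here is a minimal Python sketch of an MLP with one hidden layer trained by gradient descent on squared error. The network size, learning rate, number of epochs and the toy task (logical OR) are all assumptions chosen for brevity, not settings taken from the paper.

```python
import copy
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_weights(n_in, n_hidden, seed=0):
    """Random initial weights; the last entry of every weight list is a bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    # zip stops at the shorter sequence, so the trailing bias is added separately.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + row[-1]) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, output

def mean_squared_error(samples, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in samples) / len(samples)

def train(samples, w_hidden, w_out, epochs=2000, rate=0.5):
    """Supervised training: adjust copies of the weights after every sample."""
    w_hidden, w_out = copy.deepcopy(w_hidden), list(w_out)
    for _ in range(epochs):
        for x, target in samples:
            hidden, y = forward(x, w_hidden, w_out)
            # Error signal at the output neuron (squared error through the sigmoid).
            delta_out = (y - target) * y * (1 - y)
            # Error signals at the hidden neurons, using the not-yet-updated output weights.
            delta_hidden = [delta_out * w_out[j] * h * (1 - h) for j, h in enumerate(hidden)]
            for j, h in enumerate(hidden):
                w_out[j] -= rate * delta_out * h
            w_out[-1] -= rate * delta_out
            for j, row in enumerate(w_hidden):
                for i, xi in enumerate(x):
                    row[i] -= rate * delta_hidden[j] * xi
                row[-1] -= rate * delta_hidden[j]
    return w_hidden, w_out

# Toy input-output samples (logical OR), standing in for real training data.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w_h0, w_o0 = init_weights(n_in=2, n_hidden=3)
loss_before = mean_squared_error(samples, w_h0, w_o0)
w_h, w_o = train(samples, w_h0, w_o0)
loss_after = mean_squared_error(samples, w_h, w_o)
```

After training, everything the network has learned lives in `w_h` and `w_o`, which illustrates why the acquired knowledge is hard to read off directly.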
4.2. Support Vector Machines (SVM)
SVM are a new learning-by-example paradigm for classification and regression problems [4]. SVM have demonstrated significant efficiency when compared with neural networks. Their main advantage lies in the structure of the learning algorithm, which consists of a constrained quadratic optimization problem (QP), thus avoiding the local minima drawback of NN. The approach has its roots in statistical learning theory (SLT) and provides a way to build “optimum classifiers” according to an optimality criterion referred to as the maximal margin criterion. An interesting development in SLT is the introduction of the Vapnik-Chervonenkis (VC) dimension, which is a measure of the complexity of the model. Equipped with a sound mathematical background, support vector machines treat both the problem of how to minimize complexity in the course of learning and how high generalization might be attained. This trade-off between complexity and accuracy led to a range of principles for finding the optimal compromise. The work of Vapnik and co-authors has shown the generalization error to be bounded by the sum of the training error and a term depending on the VC dimension of the learning machine, leading to the formulation of the structural risk minimization (SRM) principle. By minimizing this upper bound, which typically depends on the margin of the classifier, the resulting algorithms achieve high generalization in the learning process.
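For reference, one commonly quoted form of this bound is given below. The notation is an assumption of this note rather than the paper's: R is the expected (generalization) error, R_emp the training error, h the VC dimension, l the number of training samples, and the inequality holds with probability at least 1 - \eta.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}
```

The second term grows with h and shrinks with l, which is the complexity versus accuracy trade-off that structural risk minimization tries to balance.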
4.3. Clustering
Clustering techniques apply when the instances of data are to be divided into natural groups. The classical clustering technique is k-means, where the number of clusters is specified in advance, prior to application of the algorithm. This number corresponds to the parameter k. First, k points are chosen at random as cluster centers. All instances are assigned to their closest cluster center according to the Euclidean distance metric. Next, the centroid, or mean, of the instances in each cluster is calculated. These centroids are taken to be the new cluster centers for their respective clusters. The whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds. At this point the cluster centers have stabilized and will remain the same [3]. There are many variants of clustering, even of the k-means algorithm itself, depending upon the method of choosing the initial centers.
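The k-means procedure described above can be sketched in a few lines of Python. This is a minimal illustration only: the sample points, the seed, and the decision to keep the old center when a cluster becomes empty are assumptions of the sketch.

```python
import random
from math import dist

def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means on tuples of coordinates; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k points chosen at random as centers
    assignment = None
    for _ in range(max_iter):
        # Assign every instance to its closest center (Euclidean distance).
        new_assignment = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_assignment == assignment:     # same assignment twice: centers are stable
            break
        assignment = new_assignment
        # Recompute each center as the centroid (mean) of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # assumption: keep old center if cluster is empty
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2)
```

Changing `seed` changes the initial centers, which is exactly the degree of freedom that the variants mentioned above address.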
4.4. Association rule mining
Association rule mining is one of the most well-studied data mining tasks. It discovers relationships among attributes in databases, producing if-then statements concerning attribute values [5]. An association rule X ⇒ Y expresses that in those transactions in the database where X occurs, there is a high probability of having Y as well. X and Y are called respectively the antecedent and consequent of the rule. The strength of such a rule is measured by its support and confidence. The confidence of the rule is the percentage of transactions with X in the database that also contain the consequent Y. The support of the rule is the percentage of transactions in the database that contain both the antecedent and the consequent. Association rule mining has been applied to e-learning systems for traditional association analysis (finding correlations between items in a dataset). An efficient algorithm to discover these association rules was first introduced in [5]. The algorithm constructs a candidate set of frequent item sets of length k, counts the number of occurrences, keeps only the frequent ones, then constructs a candidate set of item sets of length k+1 from the frequent item sets of smaller length. It continues iteratively until no candidate item set can be constructed. This works because every subset of a frequent item set must also be frequent. The rules are then generated from the frequent item sets, with probabilities attached to them indicating the likelihood (called confidence) that the association occurs. We use this idea of association rules to train our recommender agent to build a model representing the web page access behavior or associations between on-line learning activities.
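The level-wise search and the support and confidence measures described above can be sketched as follows. This is a simplified Python illustration in the spirit of the algorithm of [5], not a reproduction of it; the transactions (sets of on-line activities per session) and the thresholds are invented for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent sets of length k seed the candidates of length k+1."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while candidates:
        level = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent sets, then drop candidates that have an infrequent subset.
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in unions
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, min_confidence):
    """Generate rules X => Y with confidence = support(X and Y) / support(X)."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    out.append((antecedent, itemset - antecedent, sup, conf))
    return out

# Hypothetical sessions: which activities each learner used.
transactions = [
    {"lecture", "quiz", "forum"},
    {"lecture", "quiz"},
    {"lecture", "forum"},
    {"quiz", "forum"},
    {"lecture", "quiz"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
found = rules(frequent, min_confidence=0.7)
```

With these thresholds only the pair {lecture, quiz} survives as a frequent item set of length two, and it yields one rule in each direction, each with support 0.6 and confidence of about 0.75.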
4.5. Rough sets
Rough set theory was proposed as a new approach to vague concept description from incomplete data. Rough set theory is one of the most useful techniques in many real-life applications such as medicine, pharmacology, engineering, banking and market analysis. This theory provides a powerful foundation to reveal and discover important structures in data and to classify complex objects. One of the main advantages of rough set theory is that it does not need any preliminary or additional information about data. Information about rough sets software for data analysis was given in [6]. In our research group at Ain Shams, a rough set-based medical system for mining patient data for predictive rules to determine thrombosis disease was developed in [6]. This system aims to search for patterns specific/sensitive to thrombosis disease. It reduced the number of attributes that describe thrombosis disease from 60 to 16 significant attributes, in addition to extracting some decision rules by applying decision algorithms, which can help young physicians to predict thrombosis disease.
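The core construction of rough set theory, describing a vague concept by a lower and an upper approximation, can be sketched in Python as follows. The patient table, attribute names and target concept are hypothetical and are not taken from the system of [6].

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of `target` with respect to `attributes`.

    objects: dict mapping object name -> dict of attribute values
    target:  set of object names forming the concept to describe
    """
    # Objects with identical values on the chosen attributes are indiscernible.
    classes = defaultdict(set)
    for name, description in objects.items():
        classes[tuple(description[a] for a in attributes)].add(name)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:      # whole class inside the concept: certainly belongs
            lower |= block
        if block & target:       # class overlaps the concept: possibly belongs
            upper |= block
    return lower, upper

# Hypothetical patient records and the set of patients diagnosed as ill.
patients = {
    "p1": {"fever": "yes", "pain": "yes"},
    "p2": {"fever": "yes", "pain": "no"},
    "p3": {"fever": "yes", "pain": "no"},
    "p4": {"fever": "no", "pain": "no"},
    "p5": {"fever": "no", "pain": "yes"},
}
ill = {"p1", "p2", "p5"}

lower, upper = approximations(patients, ["fever", "pain"], ill)
boundary = upper - lower
```

Here p2 and p3 have identical descriptions but different diagnoses, so they form the boundary region: the available attributes cannot decide them. Comparing such approximations across attribute subsets is one way the significance of attributes can be assessed.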
4.6. Genetic Algorithms
Many classification models have been proposed in the literature, such as distributed algorithms, restricted search, data reduction algorithms, parallel algorithms, neural networks, decision trees and genetic algorithms. These approaches either cause loss of accuracy or cannot effectively uncover the data structure. Genetic Algorithms (GA) provide an approach to learning that is based loosely on simulated evolution. The GA methodology hinges on a population of potential solutions, and as such exploits the mechanisms of natural selection well known in evolution. Rather than searching from general to specific hypotheses, or from simple to complex, GA generates successive hypotheses by repeatedly mutating and recombining parts of the best currently known hypotheses. The GA operates by iteratively updating a pool of hypotheses (the population). On each iteration, the members of the population are evaluated according to a fitness function. A new generation is then generated by probabilistically selecting the fittest individuals from the current population. Some of these selected individuals are carried forward into the next generation intact; others are used as the basis for creating new offspring individuals by applying genetic operations such as crossover and mutation. In our research group we developed a hybrid classifier that integrates the strengths of genetic algorithms and decision trees. The algorithm was applied to a medical database of 20 MB size for predicting thrombosis disease [7]. The results show that our classifier is a very promising tool for thrombosis disease prediction in terms of predictive accuracy.
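The generational loop described above can be illustrated with a small Python sketch. It is not the hybrid classifier of [7]: hypotheses are plain bit strings, the fitness function simply counts ones, and the population size, rates and the use of elitism (always copying the current best individual) are assumptions made to keep the example short and deterministic.

```python
import random

def fitness(bits):
    """Toy fitness: number of ones in the bit string."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, keep=0.2, p_mut=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []                                  # best fitness seen in each generation
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        history.append(max(scores))
        weights = [s + 1 for s in scores]         # +1 keeps every weight positive
        # Elitism (an assumption of this sketch): the best individual always survives.
        next_gen = [list(max(population, key=fitness))]
        # Some probabilistically selected individuals are carried forward unchanged.
        n_keep = int(keep * pop_size)
        next_gen += [list(ind) for ind in rng.choices(population, weights=weights, k=n_keep)]
        # The rest are offspring created by single-point crossover and mutation.
        while len(next_gen) < pop_size:
            mother, father = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, n_bits)
            child = mother[:cut] + father[cut:]
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness), history

best, history = evolve()
```

Because of the elitism step, the best fitness recorded in `history` can never decrease from one generation to the next; without it, a good hypothesis could be lost by chance.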
5. Applications of data mining techniques in e-learning
This section presents the applications of different data mining methods and tasks in the e-learning domain [8].
5.1. Application of association rule mining in web-based education systems
Association rule mining has been applied to web-based education systems for the following tasks:
• Building recommender agents that could recommend on-line learning activities or shortcuts.
• Diagnosing student learning problems and offering students advice.
• Guiding the learner’s activities automatically and recommending learning materials.
• Determining which learning materials are the most suitable to be recommended to the user.
• Identifying attributes characterizing patterns of performance disparity between various groups of students.
• Discovering interesting relationships from students’ usage information in order to provide feedback to the course author.
• Finding out relationships in learners’ behaviour patterns.
• Finding students’ mistakes that often accompany each other.
• Guiding the search for best fitting transfer models of student learning.
• Optimizing the content of the e-learning portal by determining what most interests the user.
5.2. Information Visualization
Information visualization is a branch of computer graphics and user interface design which is concerned with the presentation of interactive or animated digital images so that users can understand data [9]. These techniques facilitate the analysis of large amounts of information by representing the data in some visual display. Normally, large quantities of raw instance data are represented or plotted as spreadsheet charts, scatter plots and 3D representations. Information visualization can be used to graphically render complex, multidimensional student tracking data collected by web-based educational systems [10]. The information visualized in e-learning can concern, for example, complementary assignments, admitted questions and exam scores.
Visualization tools enable instructors to manipulate the graphical representations generated, which allows them to gain an understanding of their learners and become aware of what is happening in distance classes. The most common specific visualization tools in the educational domain are:
• CourseVis visualizes data from a Java on-line distance course inside WebCT.
• GISMO uses Moodle students’ tracking data as source data, and generates graphical representations that can be explored by course instructors.
• Listen tool browses vast student–tutor interaction logs from Project LISTEN’s automated Reading Tutor.
5.3. Clustering
Clustering is a process of grouping objects into classes of similar objects [11]. It is an unsupervised classification or partitioning of patterns (observations, data items, or feature vectors) into groups or subsets (clusters) based on their locality and connectivity within an n-dimensional space.
In e-learning, clustering has been used for:
• Finding clusters of students with similar learning characteristics, promoting group-based collaborative learning and providing incremental learner diagnosis.
• Discovering patterns reflecting user behaviors, and supporting collaboration management by characterizing similar behavior groups in unstructured collaboration spaces.
• Grouping students and personalized itineraries for courses based on learning objects.
• Grouping students in order to give them differentiated guidance according to their skills and other characteristics.
• Grouping tests and questions into related groups based on the data in the score matrix.
• Grouping users based on time-framed navigation sessions.
5.4. Classification
A classifier is a mapping from a (discrete or continuous) feature space X to a discrete set of labels Y [12]. Classification, or discriminant analysis, predicts class labels. In supervised classification, a collection of labeled (preclassified) patterns is provided, the problem being to label a newly encountered, still unlabeled pattern. In e-learning, classification has been used for:
• Discovering potential student groups with similar characteristics and reactions to a specific pedagogical strategy.
• Predicting students’ performance and their final grade.
• Detecting students’ misuse or students playing around.
• Predicting students’ performance and assessing the relevance of the attributes involved.
• Grouping students as hint-driven or failure-driven and finding students’ common misconceptions.
• Identifying learners with little motivation and finding remedial actions in order to lower drop-out rates.
• Predicting course success.
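As a minimal illustration of a classifier in the sense defined above, a mapping from a feature space to a discrete set of labels learned from preclassified patterns, the following Python sketch uses a k-nearest-neighbour vote. The choice of k-NN, the two features (weekly study hours, average quiz score) and the labelled examples are assumptions of this sketch; the studies surveyed here use a variety of classifiers.

```python
from collections import Counter
from math import dist

def knn_predict(labelled, x, k=3):
    """Label a new pattern by majority vote among its k nearest labelled patterns."""
    nearest = sorted(labelled, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical preclassified students: (hours per week, average quiz score) -> outcome.
labelled = [
    ((2, 40), "fail"), ((3, 45), "fail"), ((4, 50), "fail"),
    ((8, 75), "pass"), ((9, 80), "pass"), ((10, 90), "pass"),
]

prediction_a = knn_predict(labelled, (9, 85))
prediction_b = knn_predict(labelled, (3, 42))
```

In a real setting the features would normally be rescaled first, since here the quiz score dominates the Euclidean distance simply because of its larger range.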
5.5. Sequential Pattern Mining (SPM)
SPM is a more restrictive form of association rule mining in which the order of the accessed items is taken into account. It tries to discover whether the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes [13].
The applications of sequential patterns in e-learning can be summarized as follows:
• Evaluating learners’ activities in order to adapt and customize resource delivery.
• Discovering learners’ behavioral patterns and comparing them with the expected patterns specified by the instructor that describe an ideal learning path.
• Indicating how best to organize the educational web space and making suggestions to learners who share similar characteristics.
• Generating personalized activities for different groups of learners.
• Supporting the evaluation and validation of learning site designs.
• Identifying interaction sequences indicative of problems and patterns that are markers of success.
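The difference from plain association rules, namely that order matters, can be seen in a small Python sketch that counts how often one item is followed (not necessarily immediately) by another within a session. It only handles patterns of length two and is a deliberate simplification, not the algorithm of [13]; the sessions and the threshold are invented.

```python
from collections import Counter

def sequential_pairs(sessions, min_support):
    """Support of ordered patterns 'a ... b' (a occurs before b in a session)."""
    counts = Counter()
    for session in sessions:
        seen = set()
        for i, first in enumerate(session):
            for later in session[i + 1:]:
                if first != later:
                    seen.add((first, later))
        counts.update(seen)              # a session supports each pattern at most once
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical time-ordered sessions of learner activity.
sessions = [
    ["lecture", "quiz", "forum"],
    ["lecture", "forum", "quiz"],
    ["quiz", "lecture"],
    ["lecture", "quiz"],
]
patterns = sequential_pairs(sessions, min_support=0.5)
```

Here "lecture followed by quiz" is supported by three of the four sessions, whereas the reverse order appears only once and is filtered out, even though both items co-occur in every session.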
5.6. Text Mining
Text mining (TM) can be viewed as an extension of data mining to text data, and it is closely related to web content mining. Text mining methods can work with unstructured or semi-structured data sets such as full-text documents, HTML files and emails [14]. In e-learning, text mining techniques can be used for the following:
• Grouping documents according to their topics and similarities and providing summaries.
• Finding and organizing material using semantic information.
• Supporting editors when gathering and preparing the materials.
• Evaluating the progress of a threaded discussion to see what the contribution to the topic is.
• Supporting collaborative learning and discussion boards with evaluation between peers.
• Identifying the main blocks of multimedia presentations.
• Selecting articles and automatically constructing e-textbooks and personalized courseware.
• Detecting the conversation focus of threaded discussions, classifying topics and estimating the technical depth of contributions.
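Grouping documents by topic similarity, the first application listed above, usually starts from some numeric measure of how alike two texts are. The Python sketch below uses a bag-of-words representation with cosine similarity, which is one common choice and an assumption of this example; the forum posts are invented.

```python
import re
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Lower-case the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical discussion-board posts.
posts = {
    "p1": "How do I normalize a database table?",
    "p2": "Database table normalization removes redundancy",
    "p3": "When is the exam on neural networks?",
}
vectors = {name: bag_of_words(text) for name, text in posts.items()}

sim_related = cosine(vectors["p1"], vectors["p2"])
sim_unrelated = cosine(vectors["p1"], vectors["p3"])
```

The first two posts share the words "database" and "table" and so score higher than the unrelated pair. Note that "normalize" and "normalization" are not matched; stemming or semantic information would be needed for that.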
5.7. Applying data mining tools for learning management systems
Nowadays, data mining tools are normally designed more for power and flexibility than for simplicity. What follows is a brief description of the general, public domain and specific educational data mining tools:
• General and specific data mining tools and frameworks; e.g. DBMiner, SPSS Clementine, DB2 and Intelligent Miner.
• Public domain mining tools; e.g. Weka and Keel.
• Specific educational data mining tools:
  o Tools for association and pattern and text mining [15]; e.g. TheMining tool, EPRules, Simulog, Sequential Mining tool, O3R and KAON.
  o Tools for statistics and visualization [16]; e.g. Synergo/ColAT, GISMO, Listen tool and TADA-Ed.
  o Tools for association and classification [17]; e.g. MultiStar and CIECoF.
  o Tools for learning paths and performance [18]; e.g. Tool and MINEL.
6. Conclusions
The paper discusses the application of data mining techniques in e-learning tasks and domains. The following techniques are discussed from an e-learning perspective: visualization, clustering, classification, sequential pattern mining, and text mining. Data mining techniques can enhance on-line education for educators as well as learners. While some tools that use data mining techniques to help educators and learners are being developed, the research is still in its infancy. Data mining techniques are a very promising approach to the analysis of the data on student activities and behavior accumulated by learning management systems. Most of the current data mining tools are too complex for educators to use; their features go well beyond the scope of what an educator might require.
7. References
[1] Cios, K. J., Pedrycz, W. and Swiniarski, R. W. Data Mining Methods for Knowledge Discovery. Kluwer, 1998.
[2] Romero, C. and Ventura, S. Data Mining in E-Learning. Southampton, UK: WIT Press, 2006.
[3] Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Elsevier, 2005.
[4] Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, vol. 20, pp. 273-297, 1995.
[5] Agrawal, R., Imielinski, T. and Swami, A. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pp. 207-216, Washington, D.C., May 1993.
[6] Salem, A. M. and Mahmoud, Safia A. Mining Patient Data Based on Rough Set Theory to Determine Thrombosis Disease. In Proceedings of the First International Conference on Intelligent Computing and Information Systems (ICICIS 2002), pp. 291-296, Cairo, Egypt, June 24-26, 2002.
[7] Salem, Abdel-Badeeh M. and Mahmoud, Abeer M. A Hybrid Genetic Algorithm-Decision Tree Classifier. In Proceedings of the 3rd International Conference on New Trends in Intelligent Information Processing and Web Mining, Zakopane, Poland, pp. 221-232, June 2-5, 2003.
[8] Romero, C., Ventura, S. and García, E. Data mining in course management systems: Moodle case study and tutorial. Computers & Education, 2007.
[9] Spence, R. Information Visualization. Addison-Wesley, 2001.
[10] Cadez, I., Heckerman, D. and Meek, C. Visualization of navigation patterns on web site using model based clustering. In ACM Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD’00), pp. 280-284, Boston, USA, August 2000.
[11] Jain, A. K., Murty, M. N. and Flynn, P. J. Data clustering: A review. ACM Computing Surveys, 31(3), pp. 264-323, 1999.
[12] Duda, R. O., Hart, P. E. and Stork, D. G. Pattern Classification. Wiley Interscience, 2000.
[13] Agrawal, R. and Srikant, R. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, pp. 3-14, 1995.
[14] Feldman, R. and Sanger, J. The Text Mining Handbook. Cambridge University Press, 2006.
[15] Zaïane, O. and Luo, J. Web usage mining for a better web-based learning environment. In Proceedings of Conference on Advanced Technology for Education, Banff, Alberta, pp. 60-64, 2001.
[16] Mazza, R. and Milani, C. Exploring usage analysis in learning systems: Gaining insights from visualisations. In Workshop on Usage Analysis in Learning Systems at 12th International Conference on Artificial Intelligence in Education, New York, USA, pp. 1-6, 2005.
[17] Silva, D. and Vieira, M. Using data warehouse and data mining resources for ongoing assessment in distance learning. In IEEE International Conference on Advanced Learning Technologies, Kazan, Russia, pp. 40-45, 2002.
[18] Bellaachia, A., Vommina, E. and Berrada, B. Minel: A framework for mining e-learning logs. In Proceedings of the Fifth IASTED International Conference on Web-based Education, Mexico, pp. 259-263, 2006.