Improving quality of graduate
students by data mining
Asst. Prof. Kitsana Waiyamai, Ph.D.
Dept. of Computer Engineering
Faculty of Engineering, Kasetsart University
Bangkok, Thailand
Content
PART I
- Introduction to data mining
- Data mining technique: association rule discovery
- Data mining technique: data classification
PART II
- Improving quality of graduate students by data mining
- Conclusion
What Is Data Mining?
Knowledge Discovery from Data (KDD), or data mining:
the process of nontrivial extraction of patterns from data. Patterns that are:
• implicit,
• previously unknown, and
• potentially useful
Patterns must be comprehensible for human users.
Knowledge Discovery Process:
Iterative & Interactive Process
[Figure: cycle of stages organized around the mining objective]
- Data sources: databases, flat files, complex data, data warehouses
- Preprocessing data: gathering, cleaning and selecting data
- Search for patterns (data mining): neural nets, machine learning, statistics and others
- Analyst reviews output
- Interpret results
- Report findings
- Take actions based on findings
What kind of data can be mined?
- Relational databases
- Data warehouses
- Transactional databases and flat files
- Advanced DB systems and information repositories:
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases, multimedia databases
  - Heterogeneous and legacy databases
  - World Wide Web
  - Bioinformatic data
Two modes of data mining
Predictive data mining
- Predict behavior based on historic data
- Use data with known results to build a model that can later be
  used to explicitly predict values for different data
- Methods: classification, prediction, etc.
Descriptive data mining
- Describe patterns in existing data that may be used to guide decisions
- Methods: association rule discovery, sequence pattern discovery,
  clustering, etc.
Data Mining Techniques
Data Clustering
Association rule discovery
Data Classification
Outlier detection
Data regression
Etc.
Data Classification
Classification is the process of assigning new objects to
predefined categories or classes:
- Given a set of labeled records
- Build a model
- Predict labels for future unlabeled records
Example:
- Age, educational background, annual income, current debts,
  housing location => Making decision
- Degree="Master" and Income=7500 => Credit="Excellent"
Three-Step Process of Classification
1. Model construction: a classifier model is built from training data
2. Model evaluation: the classifier model is evaluated on testing data
3. Classification: the classifier model is applied to unseen data
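The three steps can be sketched in Python as follows. This is only an illustration: the majority-vote rule learner, the function names (`train`, `accuracy`, `classify`) and all records are assumptions made for this example, not the classifier or the data used in the talk.

```python
from collections import Counter, defaultdict


def train(records, feature):
    """Step 1, model construction: learn the majority label per feature value."""
    counts = defaultdict(Counter)
    for record, label in records:
        counts[record[feature]][label] += 1
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fallback label for feature values never seen during training.
    default = Counter(label for _, label in records).most_common(1)[0][0]
    return {"feature": feature, "rules": rules, "default": default}


def classify(model, record):
    """Step 3, classification: predict the label of one record."""
    value = record[model["feature"]]
    return model["rules"].get(value, model["default"])


def accuracy(model, records):
    """Step 2, model evaluation: share of held-out records predicted correctly."""
    hits = sum(classify(model, record) == label for record, label in records)
    return hits / len(records)


# Hypothetical labeled records (not taken from the talk).
training = [
    ({"degree": "Master", "income": "high"}, "Excellent"),
    ({"degree": "Master", "income": "high"}, "Excellent"),
    ({"degree": "Bachelor", "income": "low"}, "Fair"),
    ({"degree": "Bachelor", "income": "high"}, "Fair"),
]
testing = [
    ({"degree": "Master", "income": "low"}, "Excellent"),
    ({"degree": "Bachelor", "income": "low"}, "Fair"),
]

model = train(training, "degree")
test_accuracy = accuracy(model, testing)
prediction = classify(model, {"degree": "Master", "income": "high"})
```

Keeping the testing records separate from the training records is what makes the accuracy in step 2 an estimate of how the model may behave on unseen data.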
Data Mining Tools
ANGOSS KnowledgeStudio
IBM Intelligent Miner
Megaputer PolyAnalyst
SAS Enterprise Miner
SGI Mineset
SPSS Clementine
Many others
More at http://www.kdnuggets.com/software
Data Mining Projects
Checklist:
- Start with well-defined questions
- Define measures of success and failure
Main difficulty: No automation
- Understanding the problem
- Data preparation
- Selection of the right mining methods
- Interpretation
Using Data Mining for Improving Quality
of Engineering Graduates
Objective:
Discover knowledge from large databases of
engineering student records.
Discovered knowledge is useful in:
- Assisting in the development of new curricula,
- Improving existing curricula,
- Helping students to select the appropriate major
Using a data mining technique to help
students in selecting their majors
Motivation:
- Major selection is a very important factor in a student's success.
- Students lack experience and information about each major.
Solution:
- Find the profiles of good students for each major using the
  student profile database and the course enrollment student
  databases (10 years)
- Determine the most appropriate major for each student
A Data Mining based Approach for Improving
Quality of Engineering Graduates
[Figure: system architecture. Components shown: Data Mining Tool;
student profile database; course enrollment student databases;
SQL Server; DB2; Java Servlet; User]
Data for Data Mining
Student profile database:

Stu_code   Sex    Address   Sch_GPA   .....   GPA
37058063   male   Bangkok   2.5       .....   2.3
37058167   male   Songkla   3.4       .....   3.2
........   ....   .......   ......    .....   ....

Course enrollment student databases:

Stu_code   Sub_code   Term   Year   Grade
37058063   204111     1      2537   C+
37058063   403111     1      2537   D
37058063   208111     1      2537   B+
Data preparation for a classification model

Student profile database:

Stu_code   Sex    Address   Sch_GPA   ....   GPA
37058063   male   Bangkok   2.5       ....   2.3
37058167   male   Songkla   3.4       ....   3.2
........   ....   .......   ...       ....   ...

+

Course enrollment student databases:

Stu_code   Sub_code   Term   Year   Grade
37058063   204111     1      2537   C+
37058063   403111     1      2537   D
37058063   208111     1      2537   B+
........   ......     ....   ....   .....

Resulting table:

Stu_code   Sex    204111   403111   ....   GPA
37058063   male   Medium   Low      ....   2.3
37058167   male   High     High     ....   3.2
........   ....   ......   ......   ....   ...
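A minimal Python sketch of this preparation step, assuming the two sources are joined on `Stu_code` and each course grade becomes one column. The `GRADE_LEVEL` mapping is an assumption: the slides only show that C+ becomes Medium and D becomes Low, and the thresholds for the other grades are guessed here. The function name `build_training_rows` is likewise hypothetical.

```python
# Assumed discretization of letter grades. Only "C+" -> Medium and
# "D" -> Low are visible on the slides; the other entries are guesses.
GRADE_LEVEL = {
    "A": "High", "B+": "High", "B": "High",
    "C+": "Medium", "C": "Medium",
    "D+": "Low", "D": "Low", "F": "Low",
}

# Rows copied from the slide tables (elided columns are left out).
profiles = [
    {"Stu_code": "37058063", "Sex": "male", "Address": "Bangkok",
     "Sch_GPA": 2.5, "GPA": 2.3},
    {"Stu_code": "37058167", "Sex": "male", "Address": "Songkla",
     "Sch_GPA": 3.4, "GPA": 3.2},
]
enrollments = [
    # (Stu_code, Sub_code, Term, Year, Grade)
    ("37058063", "204111", 1, 2537, "C+"),
    ("37058063", "403111", 1, 2537, "D"),
    ("37058063", "208111", 1, 2537, "B+"),
]


def build_training_rows(profiles, enrollments):
    """Join both sources on Stu_code, one discretized column per course."""
    rows = {p["Stu_code"]: dict(p) for p in profiles}
    for stu_code, sub_code, _term, _year, grade in enrollments:
        if stu_code in rows:
            rows[stu_code][sub_code] = GRADE_LEVEL[grade]
    return list(rows.values())


rows = build_training_rows(profiles, enrollments)
```

Students with no enrollment rows in the input simply keep their profile columns, so a real pipeline would still need a policy for missing course values.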
Global Classification Model
A global decision tree determines which major
is appropriate for each student.
Each internal node represents
a test on the student's profile.
Each leaf node represents an
appropriate major to be
selected
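As an illustration of this structure, the sketch below walks a small hand-written tree in Python. The tree itself, the attributes it tests, and the names "Major A" to "Major C" are hypothetical; the tree learned in the talk is not reproduced in the slides.

```python
# Hypothetical tree: internal nodes test one profile attribute,
# leaf nodes carry the major to propose.
tree = {
    "attribute": "204111",
    "branches": {
        "High": {"major": "Major A"},
        "Medium": {
            "attribute": "403111",
            "branches": {
                "High": {"major": "Major B"},
                "Medium": {"major": "Major B"},
                "Low": {"major": "Major C"},
            },
        },
        "Low": {"major": "Major C"},
    },
}


def predict_major(node, profile):
    """Follow the branch matching each tested attribute until a leaf."""
    while "major" not in node:
        value = profile.get(node["attribute"])
        node = node["branches"].get(value)
        if node is None:
            # The tree has no branch for this attribute value.
            return None
    return node["major"]


proposed = predict_major(tree, {"204111": "Medium", "403111": "Low"})
```

A single tree of this kind returns exactly one major per student, with no ranking of alternatives.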
Drawbacks of Global Classification
Model
- Low precision (~50%) due to the large
number of majors
- The number of students differs between
departments => the model cannot correctly
predict the best major to be selected.
- The model proposes a single major; a set of
possible majors ordered by appropriateness
score would be preferred.
Classification Model for Each Major
- A decision tree predicts whether a student is likely to be a good
  student in a given major.
- Good students are those who graduate within 4 years and rank in the
  first 40% of a given major.
- Leaf nodes represent two classes: Good and Bad
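The Good/Bad labeling rule can be sketched as below. Ranking students by final GPA is an assumption (the slides do not say which score defines the ranking), and the student records and the function name `label_students` are hypothetical.

```python
import math


def label_students(students, top_fraction=0.4, max_years=4):
    """Label one major's students as Good or Bad.

    Good = graduated within max_years AND ranked in the top_fraction
    of the major. Ranking by GPA is an assumption of this sketch.
    """
    ranked = sorted(students, key=lambda s: s["gpa"], reverse=True)
    cutoff = math.floor(len(ranked) * top_fraction)
    top_ids = {s["id"] for s in ranked[:cutoff]}
    return {
        s["id"]: "Good" if s["id"] in top_ids and s["years"] <= max_years else "Bad"
        for s in students
    }


# Hypothetical students of a single major.
students = [
    {"id": "s1", "gpa": 3.8, "years": 4},
    {"id": "s2", "gpa": 3.5, "years": 5},  # top 40% but took 5 years
    {"id": "s3", "gpa": 3.0, "years": 4},
    {"id": "s4", "gpa": 2.5, "years": 4},
    {"id": "s5", "gpa": 2.2, "years": 6},
]
labels = label_students(students)
```

These labels would then serve as the class attribute when a decision tree is trained for that major.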
Advantages of Major's Classification Model
- Good precision: 80%
- The model predicts the best major to be
  selected even if the number of students in
  each major is different
- It proposes a set of possible majors to be
  selected, ordered by appropriateness score.
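One way such an ordered proposal could be produced is to score the student with every per-major model and sort the results. The sketch below assumes each model returns a score between 0 and 1; the slides do not state how the appropriateness score is computed, so the lambda functions and the names "Major A" to "Major C" are hypothetical stand-ins for trained trees.

```python
def rank_majors(student, major_models):
    """Score the student with every per-major model, best score first."""
    scores = [(major, model(student)) for major, model in major_models.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for trained per-major decision trees. Each returns
# an assumed appropriateness score in [0, 1], for example the share of Good
# students at the leaf the student reaches.
major_models = {
    "Major A": lambda s: 0.8 if s["204111"] == "High" else 0.3,
    "Major B": lambda s: 0.6 if s["403111"] != "Low" else 0.2,
    "Major C": lambda s: 0.5,
}

student = {"204111": "Medium", "403111": "High"}
ranking = rank_majors(student, major_models)
```

Because every major is scored independently, a major with few students is not crowded out by larger ones, which is the second advantage listed above.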
Encountered problems
- Database size
- Other factors that could affect a student's decision:
  teacher preference, etc.
Presentation of Discovered
Knowledge
Applying Association rule discovery
for Grade prediction
[Figure: basket analysis transferred to education; the items are course grade levels]

Sub_code      204111   403111   417167   417168
Grade level   Medium   High     Medium   Medium
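A minimal sketch of how a grade-prediction rule could be measured with the two standard association rule metrics, support and confidence. Each student is treated as one transaction of (course, grade level) items. The first transaction mirrors the row above; the other transactions and the rule itself are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule cannot fire
    return support(transactions, set(antecedent) | set(consequent)) / base


# One transaction per student; all but the first are made up.
transactions = [
    {("204111", "Medium"), ("403111", "High"), ("417167", "Medium"), ("417168", "Medium")},
    {("204111", "Medium"), ("403111", "High"), ("417167", "Medium")},
    {("204111", "Medium"), ("403111", "High"), ("417167", "Low")},
    {("204111", "High"), ("403111", "High"), ("417167", "High")},
]

# Hypothetical rule: 204111=Medium and 403111=High => 417167=Medium
antecedent = {("204111", "Medium"), ("403111", "High")}
consequent = {("417167", "Medium")}

rule_support = support(transactions, antecedent | consequent)
rule_confidence = confidence(transactions, antecedent, consequent)
```

A rule with high confidence can be read as a prediction: a student whose earlier grades match the antecedent is expected to obtain the consequent grade level in the coming course.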
Grade Prediction for the Coming Term
Presentation of Discovered Knowledge
Conclusion & Future Work
- Application of data mining in education
- Use of data mining techniques for improving the
  quality of engineering students
- Apply data mining techniques to several other
  educational domains.