Educational Data Mining
Ryan S.J.d. Baker
PSLC/HCII
Carnegie Mellon University
Richard Scheines
Professor of Statistics, Machine Learning, and Human-Computer Interaction
Carnegie Mellon University
Ken Koedinger
CMU Director of PSLC
Professor of Human-Computer Interaction & Psychology
Carnegie Mellon University
In this segment…

We will give a brief overview of classes of
Educational Data Mining methods

Discussing in detail

Causal Data Mining


An important Educational Data Mining method
Bayesian Knowledge Tracing

One of the key building blocks of many Educational Data
Mining analyses
EDM Methods
(Baker, under review)

Prediction
Clustering
Relationship Mining
Discovery with Models
Distillation of Data for Human Judgment
Coverage at EDM2008
(of 31 papers; not mutually exclusive)

Prediction – 45%
Clustering – 6%
Relationship Mining – 19%
Discovery with Models – 13%
Distillation of Data for Human Judgment – 16%

None of the Above – 6%




We will talk about three
approaches now


2 types of Prediction
1 type of Relationship Mining
Tomorrow, 9:30am: Discovery with Models
Yesterday: Some examples of Distillation of
Data for Human Judgment
Prediction

Pretty much what it says

A student is using a tutor right now.
Is he gaming the system or not?
(“attempting to succeed in an interactive learning environment by
exploiting properties of the system rather than by learning the
material”)

A student has used the tutor for the last half hour.
How likely is it that she knows the knowledge component in
the next step?

A student has completed three years of high school.
What will be her score on the SAT-Math exam?
Two Key Types of Prediction
This slide adapted from slide by Andrew W. Moore, Google
http://www.cs.cmu.edu/~awm/tutorials
Classification


There is something you want to predict (“the
label”)
The thing you want to predict is categorical

The answer is one of a set of categories, not a number

CORRECT/WRONG (sometimes expressed as 0,1)
HELP REQUEST/WORKED EXAMPLE
REQUEST/ATTEMPT TO SOLVE
WILL DROP OUT/WON’T DROP OUT
WILL SELECT PROBLEM A,B,C,D,E,F, or G



Classification

Associated with each label is a set of
“features”, which you may be able to use to
predict the label
KnowledgeComp   pknow   time   totalactions   right
ENTERINGGIVEN   0.704   9      1              WRONG
ENTERINGGIVEN   0.502   10     2              RIGHT
USEDIFFNUM      0.049   6      1              WRONG
ENTERINGGIVEN   0.967   7      3              RIGHT
REMOVECOEFF     0.792   16     1              WRONG
REMOVECOEFF     0.792   13     2              RIGHT
USEDIFFNUM      0.073   5      2              RIGHT
….
Classification

The basic idea of a classifier is to determine
which features, in which combination, can
predict the label
Many algorithms you can use

Decision Trees (e.g. C4.5, J48, etc.)
Logistic Regression
Etc, etc

In your favorite Machine Learning package





WEKA
RapidMiner
KEEL
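To make the idea of a classifier concrete, here is a minimal sketch of a one-level decision tree (a "decision stump") over the numeric features in the example table. It is far simpler than C4.5 or J48, and it only reports accuracy on the same seven rows it was built from, so treat it as an illustration rather than an evaluation. The function name `best_stump` and the rule form "predict RIGHT when feature >= threshold" are assumptions made for this example.

```python
# Rows copied from the example table on the Classification slides.
ROWS = [
    # (pknow, time, totalactions, right)
    (0.704, 9, 1, "WRONG"),
    (0.502, 10, 2, "RIGHT"),
    (0.049, 6, 1, "WRONG"),
    (0.967, 7, 3, "RIGHT"),
    (0.792, 16, 1, "WRONG"),
    (0.792, 13, 2, "RIGHT"),
    (0.073, 5, 2, "RIGHT"),
]
FEATURES = ["pknow", "time", "totalactions"]


def best_stump(rows):
    """Return (feature, threshold, accuracy) for the best rule of the form
    'predict RIGHT when feature >= threshold, otherwise WRONG'."""
    best = (None, None, 0.0)
    for col, name in enumerate(FEATURES):
        # Every observed value of the feature is a candidate threshold.
        for threshold in sorted({row[col] for row in rows}):
            hits = sum(
                ("RIGHT" if row[col] >= threshold else "WRONG") == row[3]
                for row in rows
            )
            accuracy = hits / len(rows)
            if accuracy > best[2]:
                best = (name, threshold, accuracy)
    return best


feature, threshold, accuracy = best_stump(ROWS)
print(f"predict RIGHT when {feature} >= {threshold} (training accuracy {accuracy:.2f})")
```

On these seven rows the stump selects totalactions with a threshold of 2. A real analysis would check the rule on held-out data before trusting it.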
Regression


There is something you want to predict (“the
label”)
The thing you want to predict is numerical



Number of hints student requests (0, 1, 2, 3...)
How long student takes to answer (4.7 s., 8.9 s.,
88.2 s., 0.3 s.)
What will the student’s test score be (95%, 84%,
33%, 100%)
Regression

Associated with each label is a set of
“features”, which you may be able to use to
predict the label
KnowledgeComp   pknow   time   totalactions   numhints
ENTERINGGIVEN   0.704   9      1              0
ENTERINGGIVEN   0.502   10     2              0
USEDIFFNUM      0.049   6      1              3
ENTERINGGIVEN   0.967   7      3              0
REMOVECOEFF     0.792   16     1              1
REMOVECOEFF     0.792   13     2              0
USEDIFFNUM      0.073   5      2              0
….
Regression

The basic idea of regression is to determine
which features, in which combination, can
predict the label’s value
Linear Regression

The most classic form of regression is linear
regression

Numhints = 0.12*Pknow + 0.932*Time - 0.11*Totalactions
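As a small sketch of what such a model does once it has been fit, the function below evaluates the slide's equation for one row of the example table. The coefficients are taken from the slide as written; they appear to be illustrative rather than fitted to the seven rows shown (the first row's recorded numhints is 0). The function name `predict_numhints` is an assumption made for this example.

```python
def predict_numhints(pknow, time, totalactions):
    """Evaluate the example linear model: a weighted sum of the features."""
    return 0.12 * pknow + 0.932 * time - 0.11 * totalactions


# First row of the example table: pknow=0.704, time=9, totalactions=1.
print(round(predict_numhints(0.704, 9, 1), 3))  # 8.362
```

Fitting a regression means choosing the three weights so that predictions like this one are as close as possible to the observed labels.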
Many more complex algorithms…


Neural Networks
Support Vector Machines

Surprisingly, Linear Regression performs quite
well in many cases despite being overly simple
Particularly when you have a lot of data

Which increasingly is not a problem in EDM…

Relationship Mining

Richard Scheines will now talk about one
type of relationship mining, Causal Data
Mining
Bayesian Knowledge-Tracing
The algorithm behind the skill bars …
Being improved by Educational Data Mining
Key in many EDM analyses and models
Bayesian Knowledge Tracing

Goal: For each knowledge component (KC),
infer the student’s knowledge state from
performance.

Suppose a student has six opportunities to
apply a KC and makes the following
sequence of correct (1) and incorrect (0)
responses. Has the student learned the
rule?
001011
Model Learning Assumptions

Two-state learning model

Each skill is either learned or unlearned

In problem-solving, the student can learn a skill
at each opportunity to apply the skill

A student does not forget a skill, once he or she
knows it

Only one skill per action
Model Performance Assumptions

If the student knows a skill, there is still some
chance the student will slip and make a
mistake.

If the student does not know a skill, there is
still some chance the student will guess
correctly.
Corbett and Anderson’s Model
[Diagram: two knowledge states, Not learned and Learned. The student starts in Learned with probability p(L0); at each opportunity, Not learned moves to Learned with probability p(T). A response is correct with probability p(G) from Not learned and with probability 1-p(S) from Learned.]
Two Learning Parameters
p(L0)
Probability the skill is already known before the first opportunity to use the skill in
problem solving.
p(T)
Probability the skill will be learned at each opportunity to use the skill.
Two Performance Parameters
p(G)
Probability the student will guess correctly if the skill is not known.
p(S)
Probability the student will slip (make a mistake) if the skill is known.
Bayesian Knowledge Tracing

Whenever the student has an opportunity to
use a skill, the probability that the student
knows the skill is updated using formulas
derived from Bayes’ Theorem.
Formulas
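The update is usually written as follows, using the four parameters p(L0), p(T), p(G), and p(S). This is the standard formulation from the knowledge-tracing literature; the subscript notation is an assumption here, with P(L_{n-1}) standing for the probability that the skill was known before the n-th opportunity and P(L_0) = p(L0).

```latex
\begin{align*}
P(L_{n-1} \mid \text{correct}_n) &=
  \frac{P(L_{n-1})\,(1 - P(S))}
       {P(L_{n-1})\,(1 - P(S)) + (1 - P(L_{n-1}))\,P(G)} \\[4pt]
P(L_{n-1} \mid \text{incorrect}_n) &=
  \frac{P(L_{n-1})\,P(S)}
       {P(L_{n-1})\,P(S) + (1 - P(L_{n-1}))\,(1 - P(G))} \\[4pt]
P(L_n) &= P(L_{n-1} \mid \text{evidence}_n)
  + \bigl(1 - P(L_{n-1} \mid \text{evidence}_n)\bigr)\,P(T)
\end{align*}
```

The first two lines apply Bayes' Theorem to the observed response. The third adds the chance that the student learned the skill at this opportunity, and it has no forgetting term because the model assumes a learned skill is not forgotten.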
Knowledge Tracing

How do we know if a knowledge tracing model is
any good?

Our primary goal is to predict knowledge

But knowledge is a latent trait

But we can check those knowledge predictions
by checking how well the model predicts
performance
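A minimal sketch of that check: the function below runs the Bayesian Knowledge Tracing update over a response sequence and records, at each opportunity, the predicted probability of a correct response before the response is seen. These predictions can then be compared with what the student actually did. The parameter values are hypothetical, chosen only for illustration, and the name `bkt_trace` is an assumption made for this example.

```python
def bkt_trace(responses, p_l0, p_t, p_g, p_s):
    """For each response, return (predicted P(correct), P(known) after the update)."""
    p_known = p_l0
    trace = []
    for correct in responses:
        # Predicted performance before seeing the response:
        # known and no slip, or not known and a lucky guess.
        p_correct = p_known * (1 - p_s) + (1 - p_known) * p_g
        # Bayes' Theorem: probability the skill was known, given the response.
        if correct:
            posterior = p_known * (1 - p_s) / p_correct
        else:
            posterior = p_known * p_s / (1 - p_correct)
        # Chance to learn at this opportunity; the model has no forgetting.
        p_known = posterior + (1 - posterior) * p_t
        trace.append((p_correct, p_known))
    return trace


# Hypothetical parameter values: p(L0)=0.3, p(T)=0.2, p(G)=0.2, p(S)=0.1.
RESPONSES = [0, 0, 1, 0, 1, 1]
for n, (p_correct, p_known) in enumerate(bkt_trace(RESPONSES, 0.3, 0.2, 0.2, 0.1), 1):
    print(f"opportunity {n}: predicted P(correct)={p_correct:.3f}, "
          f"P(known) after={p_known:.3f}")
```

The knowledge estimate itself cannot be observed, but the predicted P(correct) column can be scored against the actual 0/1 responses.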
Fitting a Knowledge-Tracing Model

In principle, any set of four parameters can
be used by knowledge-tracing

But parameters that predict student
performance better are preferred
Knowledge Tracing

So, we pick the knowledge tracing parameters
that best predict performance

Defined as whether a student’s action will be
correct or wrong at a given time

Effectively a classifier
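One simple way to pick such parameters is a brute-force search: try many combinations and keep the one whose predictions are closest to the observed responses. The sketch below does this on a coarse grid, using summed squared error as the fit criterion. The grid, the error measure, the response sequences, and the function names are all assumptions made for illustration; this is not presented as the fitting procedure used by the authors.

```python
from itertools import product


def predicted_correct(responses, p_l0, p_t, p_g, p_s):
    """Predicted P(correct) at each opportunity under Bayesian Knowledge Tracing."""
    p_known, predictions = p_l0, []
    for correct in responses:
        p_correct = p_known * (1 - p_s) + (1 - p_known) * p_g
        predictions.append(p_correct)
        # Update the knowledge estimate with Bayes' Theorem, then allow learning.
        if correct:
            posterior = p_known * (1 - p_s) / p_correct
        else:
            posterior = p_known * p_s / (1 - p_correct)
        p_known = posterior + (1 - posterior) * p_t
    return predictions


def squared_error(sequences, params):
    """Sum of squared differences between predictions and observed responses."""
    return sum(
        (p - r) ** 2
        for seq in sequences
        for p, r in zip(predicted_correct(seq, *params), seq)
    )


def fit_by_grid_search(sequences, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Try every (p_l0, p_t, p_g, p_s) combination on the grid; keep the best."""
    return min(
        product(grid, repeat=4),
        key=lambda params: squared_error(sequences, params),
    )


# Hypothetical response sequences for one skill, one list per student.
SEQUENCES = [[0, 0, 1, 0, 1, 1], [0, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1]]
best = fit_by_grid_search(SEQUENCES)
print("best (p_l0, p_t, p_g, p_s):", best)
```

A finer grid or a proper optimizer would give more precise estimates; the point here is only that "best" is defined by how well performance is predicted.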
Recent Advances

Recently, there has been work towards
contextualizing the guess and slip parameters
(Baker, Corbett, & Aleven, 2008a, 2008b)

The intuition:
Do we really think the chance that an incorrect
response was a slip is equal when


Student has never gotten action right; spends 78
seconds thinking; answers; gets it wrong
Student has gotten action right 3 times in a row;
spends 1.2 seconds thinking; answers; gets it wrong
Recent Advances

In this work, P(G) and P(S) are determined
by a model that looks at time, previous
history, the type of action, etc.

Significantly improves predictive power of
method

Probability of distinguishing correct from incorrect
increases by about 15% of potential gain

To 71%, so still room for improvement
Uses

Outside of EDM, can be used to drive tutorial
decisions

Within educational data mining, there are
several things you can do with these models
Uses of Knowledge Tracing

Often key components in models of other
constructs



Help-Seeking and Metacognition (Aleven et al.,
2004, 2008)
Gaming the System (Baker et al., 2004, in press)
Off-Task Behavior (Baker, 2007)
Uses of Knowledge Tracing

If you want to understand a student’s
strategic/meta-cognitive choices, it is helpful
to know whether the student knew the skill

Gaming the system means something
different if a student already knows the step,
versus if the student doesn’t know it

A student who doesn’t know a skill should
ask for help; a student who does, shouldn’t
Uses of Knowledge Tracing

Can be interpreted to learn about skills
Skills from the Algebra Tutor
skill                                      L0      T
AddSubtractTypeinSkillIsolatepositiveIso   0.01    0.01
ApplyExponentExpandExponentsevalradicalE   0.333   0.497
CalculateEliminateParensTypeinSkillElimi   0.979   0.001
CalculatenegativecoefficientTypeinSkillM   0.953   0.001
Changingaxisbounds                         0.01    0.01
Changingaxisintervals                      0.01    0.01
ChooseGraphicala                           0.001   0.306
combineliketermssp                         0.943   0.001
Which skills could probably be
removed from the tutor?
Which skills could use better
instruction?
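These two questions can be sketched as simple filters over the fitted L0 and T values. The rows below assume the skills on the slide are listed alphabetically with one (L0, T) pair each, and the skill names are kept truncated exactly as they appear. The cutoffs (0.9 and 0.05) and the function names are hypothetical, chosen only to illustrate the idea.

```python
# (skill, L0, T); pairing of names and values assumes alphabetical row order.
SKILLS = [
    ("AddSubtractTypeinSkillIsolatepositiveIso", 0.01, 0.01),
    ("ApplyExponentExpandExponentsevalradicalE", 0.333, 0.497),
    ("CalculateEliminateParensTypeinSkillElimi", 0.979, 0.001),
    ("CalculatenegativecoefficientTypeinSkillM", 0.953, 0.001),
    ("Changingaxisbounds", 0.01, 0.01),
    ("Changingaxisintervals", 0.01, 0.01),
    ("ChooseGraphicala", 0.001, 0.306),
    ("combineliketermssp", 0.943, 0.001),
]


def already_known(skills, min_l0=0.9):
    """Skills most students seem to know before the first opportunity."""
    return [name for name, l0, t in skills if l0 >= min_l0]


def rarely_learned(skills, max_l0=0.05, max_t=0.05):
    """Skills that start unknown and are seldom learned at any opportunity."""
    return [name for name, l0, t in skills if l0 <= max_l0 and t <= max_t]


print("Candidates for removal:", already_known(SKILLS))
print("Candidates for better instruction:", rarely_learned(SKILLS))
```

A high L0 suggests practice on the skill adds little, while a low L0 combined with a low T suggests the tutor is not teaching it effectively. Either reading should be checked against how well the model fits that skill.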
END

This last example is a simple example of
Discovery with Models

Tomorrow at 9:30am, we’ll discuss some
more complex examples