Transcript
In this presentation, you will be introduced to data mining and its relationship
with meaningful use.
Data mining refers to the art and science of intelligent data analysis. It is the
application of machine learning algorithms to large data sets with the primary
aim of discovering meaningful insights and knowledge from that data.
Data mining essentially is the construction of data models that instantiate a
machine learning algorithm on specific data elements. The model captures the
essence of the discovered knowledge and helps us in our understanding of the
world. Oftentimes, these models are predictive. For instance, data mining
models have been applied to healthcare data to predict readmissions, risk of
disease, and efficacy of medications.
Modeling is the process of turning all that data into some structured form or
model that reflects the supplied data in a useful way. The aim of modeling is to
explore the data to address a specific problem by modeling or mimicking the
real world. For instance, a lot of research has been done in modeling the way in
which we make decisions. Machine learning algorithms that use artificial
intelligence develop models that closely represent how a human would make a
decision. The same methods can be applied to healthcare data where we
attempt to model decision making. For instance, we might want to develop a
model to predict drug relapse in patients with a history of drug addiction. The
machine learning algorithms, using artificial intelligence, would look at all of the
data elements to come up with a decision on the likelihood of whether a
patient will relapse. Unfortunately, no model can perfectly represent the world.
For instance, we might find that our model predicts a patient will relapse even
if the patient does not have a history of drug addiction. In the real world, we
would never make this mistake, but due to the rules governing the machine
learning algorithm, such mistakes are possible.
To ensure that the model is constructed in such a way as to limit such mistakes
and represent the real world as closely as possible, there is a set of eight steps
that can be followed. First, you must have a clear understanding of the data
and the business of healthcare. If you do not know what the data mean, it is
likely that your model will not make sense. Second, you must partition your
data into training, validation, and testing datasets when building, tuning, and
evaluating your model. This way, three different sets of data are used to
validate your model. Third, build multiple models and compare their
performance. You may find that you favor one model, such as a neural
network, but that model may not be the most effective. Therefore, comparing
the performance of multiple models will yield the most effective end product.
Fourth, if you end up developing a perfect model, something went wrong.
Healthcare data is messy and complex. It’s unlikely that you will develop a
model that makes perfect decisions. The laws of probability suggest that your
model will at times make mistakes. Fifth, don't overlook how the
model is to be deployed. Some algorithms are very difficult to deploy.
For instance, neural networks are a black box and difficult to automate into a
system. However, rule-based algorithms such as decision trees are very simple
to deploy. Sixth, when constructing your models they should be repeatable and
efficient. That is, if you were to take a different set of data and apply your
model, you should get similar results. Also, your model shouldn’t take 3 days
to run; it should be almost instantaneous, otherwise it is unlikely that it can be
implemented in a healthcare setting where everything is fast-paced. Seventh,
let the data talk to you, but do not let it mislead you. If the results of
your analysis seem doubtful, you should question them. Don't assume that
the results are the truth. Test it, test it again, and again. Lastly, after you
have constructed your model and tested it, communicate your discoveries effectively
and visually.
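For illustration only, a minimal sketch of the third step, comparing two candidate
models on held-out data, might look like this in Python with scikit-learn (the
dataset below is a random placeholder, not real healthcare data):

```python
# Minimal sketch of step three: build multiple models and compare their
# performance on held-out data. The dataset below is a random placeholder.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                 # placeholder input variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # placeholder outcome

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "logistic regression": LogisticRegression(),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_valid, model.predict(X_valid))
    print(f"{name}: validation accuracy = {score:.2f}")
```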
There are many tools available for data mining and constructing models. One
of the most popular is SAS Enterprise Miner. The platform is
powerful and relatively easy to use. Weka is an open-source platform that
supports the development of a variety of different algorithms. Rattle is a
package available in the open-source analytics environment R and is also very
powerful and diverse. Rattle also supports the Predictive Model Markup
Language (PMML) for deploying data mining models. There are many other
applications available.
Data mining has some terminologies that should be understood. A dataset is a
collection of data. Oftentimes, a dataset will have multiple columns and many
rows. In mathematical terms, this is referred to as a matrix while in database
terms this would be referred to as a table. The observations make up the rows
of data while the variables make up the columns. The dimension of a dataset
is the number of observations, or rows, by the number of variables, or
columns.
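As a small illustration (assuming Python with pandas; the patient values are
invented), the rows, columns, and dimension of a dataset can be inspected directly:

```python
# Illustrative only: a tiny dataset in which each row is an observation and
# each column is a variable; .shape gives the dimension (rows by columns).
import pandas as pd

data = pd.DataFrame({
    "patient_id":   ["P001", "P002", "P003"],  # identifier
    "age":          [54, 61, 47],              # numeric variable
    "systolic_bp":  [142, 128, 155],           # numeric variable
    "hypertension": ["yes", "no", "yes"],      # categorical variable
})

print(data.shape)  # (3, 4): 3 observations (rows) by 4 variables (columns)
```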
Input variables include the measured data items. These can take many
different forms: text, or numbers on a nominal, ordinal, interval, or ratio scale. Other
names for input variables include predictors, covariates, independent
variables, observed variables, or descriptive variables. An example would be
systolic blood pressure, diastolic blood pressure, medications, weight, age,
gender, and so on.
Output variables are those that are influenced by the input variables. They are
also known as target, response, or dependent variables. An example might be
a diagnosis of hypertension.
We build models to predict the output variables in terms of the input variables.
So if we were given data that includes systolic blood pressure, diastolic blood
pressure, medications, weight, age, and gender, we could use that data as
inputs for predicting the output of a diagnosis of hypertension.
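A minimal sketch of this idea, assuming Python with scikit-learn and invented
patient records, might look like this:

```python
# Sketch: using input variables (predictors) to predict an output variable
# (a hypertension diagnosis). The patient records are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# input variables: systolic BP, diastolic BP, weight (kg), age
X = [[150, 95, 98, 64],
     [118, 76, 70, 35],
     [145, 92, 105, 58],
     [122, 80, 68, 29]]
# output variable: hypertension diagnosis (1 = yes, 0 = no)
y = [1, 0, 1, 0]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[140, 90, 90, 50]]))  # predicted diagnosis for a new patient
```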
There’s one caveat. Some data mining models may not have any output
variables. These are referred to as descriptive models and an example is
clustering. We will get to these in a moment.
Identifiers are unique variables for a particular observation. They may include
a patient’s name or a patient ID. Categorical variables are those that take on a
single value from a discrete set. They can be nominal, where there is no order to
them (for example, eye color), or ordinal, where there is a natural order (for
example age groups). Numeric variables, also known as continuous variables,
are values that are integers or real numbers (for example weight).
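These variable types can also be made explicit in code. A brief sketch, assuming
pandas and illustrative column names:

```python
# Illustrative variable types: an identifier, a nominal and an ordinal
# categorical variable, and a numeric (continuous) variable.
import pandas as pd

df = pd.DataFrame({
    "patient_id": ["P001", "P002"],       # identifier: unique per observation
    "eye_color":  ["brown", "blue"],      # nominal: no natural order
    "age_group":  ["18-39", "40-64"],     # ordinal: natural order
    "weight_kg":  [82.5, 70.1],           # numeric (continuous)
})

# Record the natural order of the ordinal variable explicitly.
df["age_group"] = pd.Categorical(
    df["age_group"], categories=["18-39", "40-64"], ordered=True)
print(df.dtypes)
```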
There are three datasets that are used when constructing a model: training,
validation, and testing datasets. The training dataset is the data that you use to
build the initial models. The validation dataset assesses the performance of the
model that you develop using the training dataset. This step helps fine-tune the model
as appropriate. The testing dataset applies the refined model and assesses
expected performance on future datasets.
When developing a data mining model, you start with one large dataset and
partition that into training, validation, and testing datasets. The partitioning is
done by randomly selecting observations to one of the three datasets. The
training set typically has more data than the other datasets. For instance, if we
take a large dataset we can partition our three datasets as follows: 70% of the
observations go to the training dataset, 15% to the validation, and 15% to the
testing dataset.
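A minimal sketch of such a 70/15/15 random partition, assuming Python with
scikit-learn and placeholder data:

```python
# Sketch of a random 70/15/15 partition into training, validation, and
# testing datasets; the data here are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)               # 1000 observations, 5 variables
y = np.random.randint(0, 2, size=1000)    # placeholder outcome

# Split off 70% for training, then divide the remaining 30% in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42)

print(len(X_train), len(X_valid), len(X_test))  # 700 150 150
```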
The data mining process that is widely accepted is known as CRISP-DM, or
CRoss Industry Standard Process for Data Mining. The process includes 6
steps from understanding the business all the way to deploying a model.
The slide on your screen shows a description of the six steps. The first step
emphasizes the business understanding for planning your data mining project
so that it aligns with the organization’s goals. The second is data understanding
so that you can assess the quality of the data and define each data element.
Data preparation is next where you select the relevant data, clean the data up,
carry out basic descriptive statistics, and reformat the data as necessary.
Modeling is next where you construct a data model or several models.
Evaluation is the step where you evaluate the performance of each of the
models constructed and choose the best performing model. Last is
deployment where you determine how you will deploy your model and present
the findings to the necessary parties.
The CRISP-DM process relates very well to specific data mining tasks. For
instance, business understanding relates to developing questions about the
data and data selection. The data understanding step is where we explore the
data. The data preparation step is where the data is transformed. The
modeling step is where we choose and build a model. The evaluation step is
where we validate and test our model. Finally, deployment is where we export
the model.
When building a model, there are two main categories. The first is descriptive
models, also known as unsupervised learning. These are models that are
constructed when we do not have a target variable; they provide a
representation of the knowledge discovered without necessarily modeling a
specific outcome. An example of a descriptive model is a clustering analysis.
Predictive models, or supervised learning, are those that can be developed
when we have a target variable. We can predict the target variable with our
given set of input variables. The goal of a predictive model is to extract
knowledge from historic data and represent it in such a form that we can apply
the resulting model to new situations. In that way, we are predicting the
occurrence of an event of interest. The historic data will already be associated
with the outcome, and we can learn to make this association on future data.
Common predictive algorithms include decision trees, boosting, and neural
networks.
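For example, a descriptive clustering analysis might be sketched as follows,
assuming Python with scikit-learn and placeholder data with no target variable:

```python
# Sketch of a descriptive (unsupervised) model: k-means clustering.
# There is no target variable; the data are placeholders.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)                       # input variables only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])                       # discovered cluster assignments
```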
If the model is found effective and ready for use in real time, the next step is
deployment. One method to deploy models is through the use of a language called
the Predictive Model Markup Language (PMML). It is an XML-based standard that
is supported by many major commercial data mining vendors and many open-source
data mining tools.
Descriptive and/or predictive models can be used on specific datasets.
Different models and algorithms have advantages and disadvantages.
Therefore, it is recommended to construct multiple models and choose the
best. Deployment of a successful model can be simple using PMML.
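As one hedged illustration of PMML-based deployment, assuming Python with the
third-party sklearn2pmml package and a toy model:

```python
# Sketch: exporting a fitted model to PMML for deployment. Assumes the
# third-party sklearn2pmml package (which requires a Java runtime).
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X = [[150, 95], [118, 76], [145, 92], [122, 80]]  # toy input variables
y = [1, 0, 1, 0]                                  # toy target variable

pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier())])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "hypertension_model.pmml")  # writes an XML-based PMML file
```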
When considering the role of health IT and meaningful use, data mining
techniques have great potential for the development of clinical decision support
systems and outbreak detection, fostering better patient outcomes. Also, as the
government invests more into health IT, the adoption of data mining approaches will
become more of a priority. New ways of analyzing and interpreting the data will
be sought after and it is anticipated that data mining will be center stage.