In this presentation, you will be introduced to data mining and its relationship with Meaningful Use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine learning algorithms to large data sets with the primary aim of discovering meaningful insights and knowledge from that data. Data mining is essentially the construction of data models that instantiate a machine learning algorithm on specific data elements. The model captures the essence of the discovered knowledge and helps us in our understanding of the world. Often, these models are predictive. For instance, data mining models have been applied to healthcare data to predict readmissions, risk of disease, and efficacy of medications.

Modeling is the process of turning all that data into some structured form, or model, that reflects the supplied data in a useful way. The aim of modeling is to explore the data to address a specific problem by modeling or mimicking the real world. For instance, a lot of research has been done in modeling the way in which we make decisions. Machine learning algorithms that use artificial intelligence develop models that closely represent how a human would make a decision. The same methods can be applied to healthcare data where we attempt to model decision making. For instance, we might want to develop a model to predict drug relapse in patients with a history of drug addiction. The machine learning algorithms, using artificial intelligence, would look at all of the data elements to come up with a decision on the likelihood that a patient will relapse.

Unfortunately, no model can perfectly represent the world. For instance, we might find that our model predicts a patient will relapse even if the patient does not have a history of drug addiction. In the real world, we would never make this mistake, but due to the rules governing the machine learning algorithm, such mistakes are possible.
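One common way to quantify mistakes like the one just described is to compare a model's predictions against what actually happened and count each kind of outcome. The sketch below is illustrative only: the function name `confusion_counts` and the six patient outcomes are invented for this example and do not come from the presentation. A predicted relapse for a patient who did not relapse is counted as a false positive.

```python
def confusion_counts(actual, predicted):
    """Count the four possible outcomes of a yes/no prediction."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["true_positive"] += 1    # predicted relapse, patient relapsed
        elif p and not a:
            counts["false_positive"] += 1   # predicted relapse, patient did not
        elif not p and not a:
            counts["true_negative"] += 1    # predicted no relapse, correct
        else:
            counts["false_negative"] += 1   # missed a real relapse
    return counts


# Hypothetical outcomes for six patients (True = relapsed); invented data.
actual = [True, False, False, True, False, True]
predicted = [True, True, False, False, False, True]

counts = confusion_counts(actual, predicted)
accuracy = (counts["true_positive"] + counts["true_negative"]) / len(actual)
print(counts, accuracy)
```

Counting errors by type, rather than reporting a single accuracy figure, makes it easier to see which kind of mistake a model tends to make.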
To ensure that the model is constructed in a way that limits such mistakes and represents the real world as closely as possible, there is a set of eight steps that can be followed.

First, you must have a clear understanding of the data and the business of healthcare. If you do not know what the data mean, it is likely that your model will not make sense.

Second, you must partition your data into training, validation, and testing datasets when building, tuning, and evaluating your model. This way, three different sets of data are used to validate your model.

Third, build multiple models and compare their performance. You may find that you favor one model, such as a neural network, but that model may not be the most effective. Therefore, comparing the performance of multiple models will yield the most effective end product.

Fourth, if you end up developing a perfect model, something went wrong. Healthcare data is messy and complex. It’s unlikely that you will develop a model that makes perfect decisions. The laws of probability suggest instead that your model will at times make mistakes.

Fifth, don’t overlook how the model is to be deployed. Some of the algorithms are very difficult to deploy. For instance, neural networks are a black box and difficult to automate into a system. However, rule-based algorithms such as decision trees are very simple to deploy.

Sixth, your models should be repeatable and efficient. That is, if you were to take a different set of data and apply your model, you should get similar results. Also, your model shouldn’t take three days to run. It should be almost instantaneous; otherwise, it’s unlikely that it can be implemented in a healthcare setting, where everything is fast-paced.

Seventh, let the data talk to you but not mislead you. If the results of your analysis seem doubtful, you should question them. Don’t assume that the results are the truth. Test them, then test them again and again.
Lastly, after you have constructed your model and tested it, communicate your discoveries effectively and visually.

There are many tools available for data mining and constructing models. One of the most popular tools is SAS Enterprise Miner. The platform is powerful and relatively easy to use. Weka is an open-source platform that supports the development of a variety of different algorithms. Rattle is a package available in the open-source analytics environment R and is also very powerful and diverse. Rattle also supports the Predictive Model Markup Language (PMML) for deploying data mining models. There are many other applications available.

Data mining has some terminology that should be understood. A dataset is a collection of data. Often, a dataset will have multiple columns and many rows. In mathematical terms, this is referred to as a matrix, while in database terms it is referred to as a table. The observations make up the rows of data while the variables make up the columns. The dimension of a dataset is the number of observations, or rows, by the number of variables, or columns.

Input variables are the measured data items. They can take many different forms, such as text or numbers, and can be measured on a nominal, ordinal, interval, or ratio scale. Other names for input variables include predictors, covariates, independent variables, observed variables, or descriptive variables. Examples include systolic blood pressure, diastolic blood pressure, medications, weight, age, gender, and so on. Output variables are those that are influenced by the input variables. They are also known as target, response, or dependent variables. An example might be a diagnosis of hypertension. We build models to predict the output variables in terms of the input variables.
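To make this terminology concrete, here is a minimal sketch of a dataset as a table of observations (rows) and variables (columns). Every value and column name below is invented for illustration; only the choice of blood pressure, weight, age, and gender as inputs and hypertension as the output follows the examples in the presentation.

```python
# Hypothetical patient records: each dict is one observation (row),
# each key is one variable (column). All values are made up.
dataset = [
    {"systolic": 152, "diastolic": 96, "weight_kg": 88.0, "age": 61, "gender": "F", "hypertension": True},
    {"systolic": 118, "diastolic": 76, "weight_kg": 70.5, "age": 34, "gender": "M", "hypertension": False},
    {"systolic": 145, "diastolic": 92, "weight_kg": 95.2, "age": 57, "gender": "M", "hypertension": True},
    {"systolic": 121, "diastolic": 79, "weight_kg": 64.0, "age": 45, "gender": "F", "hypertension": False},
]

target = "hypertension"  # the output (dependent) variable
input_variables = [name for name in dataset[0] if name != target]

# Dimension = number of observations (rows) by number of variables (columns).
dimension = (len(dataset), len(dataset[0]))

# Separate the inputs from the output, as a model-building step would expect.
inputs = [[row[name] for name in input_variables] for row in dataset]
outputs = [row[target] for row in dataset]

print(dimension, input_variables)
```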
So if we were given data that includes systolic blood pressure, diastolic blood pressure, medications, weight, age, and gender, we could use that data as inputs for predicting the output of a diagnosis of hypertension. There’s one caveat. Some data mining models may not have any output variables. These are referred to as descriptive models, and an example is clustering. We will get to these in a moment.

Identifiers are unique variables for a particular observation. They may include a patient’s name or a patient ID. Categorical variables are ones whose values come from a discrete set, with each observation taking a single value. They can be nominal, where there is no order to them (for example, eye color), or ordinal, where there is a natural order (for example, age groups). Numeric variables, also known as continuous variables, are values that are integers or real numbers (for example, weight).

There are three datasets that are used when constructing a model: training, validation, and testing datasets. The training dataset is the data that you use to build the initial models. The validation dataset is used to assess the performance of the model that you develop using the training dataset. This step helps fine-tune the model as appropriate. The testing dataset is used to apply the refined model and assess its expected performance on future datasets. When developing a data mining model, you start with one large dataset and partition it into training, validation, and testing datasets. The partitioning is done by randomly assigning observations to one of the three datasets. The training set typically has more data than the other datasets. For instance, if we take a large dataset, we can partition it as follows: 70% of the observations go to the training dataset, 15% to the validation dataset, and 15% to the testing dataset.

The data mining process that is widely accepted is known as CRISP-DM, or CRoss Industry Standard Process for Data Mining.
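Before turning to the steps of CRISP-DM, the random 70/15/15 partition described above can be sketched in a few lines. This is one possible approach, not the method of any particular tool: the function name `partition`, its parameters, and the fixed seed are assumptions made for this example, and the observations are simply the numbers 0 to 999.

```python
import random


def partition(observations, train_frac=0.70, valid_frac=0.15, seed=42):
    """Randomly assign observations to training, validation, and testing sets."""
    shuffled = list(observations)           # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains goes to testing
    return train, valid, test


# Stand-in for a large dataset: 1000 hypothetical observation indices.
observations = list(range(1000))
train, valid, test = partition(observations)
print(len(train), len(valid), len(test))
```

Shuffling before slicing is what makes the assignment random, and every observation ends up in exactly one of the three datasets.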
The CRISP-DM process includes six steps, from understanding the business all the way to deploying a model. The slide on your screen shows a description of the six steps. The first step emphasizes business understanding for planning your data mining project so that it aligns with the organization’s goals. The second is data understanding, so that you can assess the quality of the data and define each data element. Data preparation is next, where you select the relevant data, clean the data up, carry out basic descriptive statistics, and reformat the data as necessary. Modeling is next, where you construct a data model or several models. Evaluation is the step where you evaluate the performance of each of the models constructed and choose the best-performing model. Last is deployment, where you determine how you will deploy your model and present the findings to the necessary parties.

The CRISP-DM process relates very well to specific data mining tasks. For instance, business understanding relates to developing questions about the data and data selection. The data understanding step is where we explore the data. The data preparation step is where the data is transformed. The modeling step is where we choose and build a model. The evaluation step is where we validate and test our model. Finally, deployment is where we export the model.

When building a model, there are two main categories to choose from. The first is descriptive models, also known as unsupervised learning. These are models that are constructed when we do not have a target variable; they provide a representation of the knowledge discovered without necessarily modeling a specific outcome. An example of a descriptive model is a clustering analysis. Predictive models, or supervised learning, are those that can be developed when we have a target variable. We can predict the target variable with our given set of input variables.
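The difference between the two categories can be illustrated with a deliberately small sketch. Both functions below are simplified stand-ins written for this example: `two_means` is a one-dimensional clustering routine (k-means with two clusters) that never sees a target variable, and `fit_threshold` is a single-cut-off rule, roughly a one-level decision tree, that is fitted against a target variable. The blood pressure readings and labels are invented.

```python
def two_means(values, iterations=10):
    """Descriptive sketch: group values into two clusters (1-D k-means, k=2).

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)     # initial cluster centres
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)      # move each centre to the
        high = sum(near_high) / len(near_high)   # mean of its cluster
    return near_low, near_high


def fit_threshold(values, labels):
    """Predictive sketch: pick the cut-off that misclassifies the fewest cases."""
    best_cut, best_errors = None, len(values) + 1
    for cut in sorted(set(values)):
        # Predict True when the value is at or above the candidate cut-off.
        errors = sum((v >= cut) != label for v, label in zip(values, labels))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


# Invented systolic readings and hypertension diagnoses for eight patients.
systolic = [112, 118, 121, 125, 148, 152, 160, 171]
hypertension = [False, False, False, False, True, True, True, True]

clusters = two_means(systolic)               # no target variable used
cut = fit_threshold(systolic, hypertension)  # target variable required
print(clusters, cut)
```

The clustering step only describes structure in the inputs, while the threshold rule can be applied to a new patient's reading to predict the outcome.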
The goal of a predictive model is to extract knowledge from historic data and represent it in such a form that we can apply the resulting model to new situations, predicting the occurrence of an event of interest. The historic data will already be associated with the outcome, and we can learn to make this association on future data. Common predictive algorithms include decision trees, boosting, and neural networks.

If the model is found effective and ready for use in real time, the next step is deployment. One method to deploy models is through the use of a language called the Predictive Model Markup Language (PMML). It is an XML-based standard that is supported by many major commercial data mining vendors and many open-source data mining tools.

Descriptive and/or predictive models can be used on specific datasets. Different models and algorithms have advantages and disadvantages. Therefore, it is recommended to construct multiple models and choose the best. Deployment of a successful model can be simple using PMML.

When considering the role of Health IT and Meaningful Use and the implications for data mining, data mining techniques have great potential for the development of clinical decision support systems and outbreak detection to foster better patient outcomes. Also, as the government invests more in health IT, the adoption of data mining approaches will become more of a priority. New ways of analyzing and interpreting the data will be sought after, and it is anticipated that data mining will take center stage.
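The recommendation to construct multiple models and choose the best can be sketched as follows. This is a toy illustration under stated assumptions: the three candidate "models" are hand-written rules with arbitrary cut-offs chosen for the example (not clinical guidance), and the validation and testing records are invented. The best candidate is chosen on the validation data and then checked once on the testing data.

```python
def accuracy(model, rows):
    """Fraction of rows whose outcome the model predicts correctly."""
    correct = sum(model(row) == row["hypertension"] for row in rows)
    return correct / len(rows)


# Three hypothetical candidate models, each a simple hand-written rule.
candidates = {
    "systolic_rule": lambda row: row["systolic"] >= 140,
    "diastolic_rule": lambda row: row["diastolic"] >= 90,
    "age_rule": lambda row: row["age"] >= 60,
}

# Invented validation and testing observations.
validation = [
    {"systolic": 150, "diastolic": 95, "age": 45, "hypertension": True},
    {"systolic": 142, "diastolic": 92, "age": 62, "hypertension": False},
    {"systolic": 128, "diastolic": 80, "age": 70, "hypertension": False},
    {"systolic": 145, "diastolic": 85, "age": 58, "hypertension": True},
    {"systolic": 160, "diastolic": 100, "age": 66, "hypertension": True},
    {"systolic": 120, "diastolic": 78, "age": 30, "hypertension": False},
]
testing = [
    {"systolic": 155, "diastolic": 98, "age": 50, "hypertension": True},
    {"systolic": 138, "diastolic": 88, "age": 64, "hypertension": True},
    {"systolic": 118, "diastolic": 75, "age": 41, "hypertension": False},
    {"systolic": 130, "diastolic": 82, "age": 55, "hypertension": False},
]

# Compare every candidate on the validation data and keep the best one.
scores = {name: accuracy(model, validation) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Estimate future performance of the chosen model on the untouched testing data.
test_score = accuracy(candidates[best_name], testing)
print(scores, best_name, test_score)
```

Note that even the best candidate here makes mistakes on both datasets, which is consistent with the earlier warning that a perfect model should be treated with suspicion.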