Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining A special presentation for BCIS 4660 Spring 2012 Dr. Nick Evangelopoulos, ITDS Dept. Some slide material taken from: Groth, Han and Kamber, SAS Institute Overview of this Presentation • Introduction to Data Mining • Examples of Data Mining applications • The SEMMA Methodology • SAS EM Demo: The Home Equity Loan Case Logistic Regression Decision Trees Neural Networks Introduction to DM “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia" What Is Data Mining? • Data mining (knowledge discovery in databases): – A process of identifying hidden patterns and relationships within data (Groth) • Data mining: – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases Multidisciplinary Statistics Pattern Neurocomputing Recognition Machine Data Mining Learning Databases KDD AI Architecture of a Typical Data Mining System Graphical user interface Pattern evaluation Data mining engine Knowledge-base Database or data warehouse server Filtering Data cleaning & data integration Databases Data Warehouse A Data Mining example: The Home Equity Loan Case • The analytic goal is to determine who should be approved for a home equity loan. • The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. • The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries. HMEQ case overview – The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections). – The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded. The HMEQ Loan process 1. An applicant comes forward with a specific property and a reason for the loan (HomeImprovement, Debt-Consolidation) 2. Background info related to job and credit history is collected 3. The loan gets approved or rejected 4. Upon approval, the Applicant becomes a Customer 5. Information related to how the loan is serviced is maintained, including the Status of the loan (Current, Delinquent, Defaulted, Paid-Off) The HMEQ Loan Transactional Database • Entity Relationship Diagram (ERD), Logical Design: Loan Reason Approval Date APPLICANT Applies for HMEQ Loan on… using… becomes PROPERTY Balance Status OFFICER CUSTOMER ACCOUNT MonthlyPayment has HISTORY HMEQ Transactional database: the relations • Entity Relationship Diagram (ERD), Physical Design: Officer HMEQLoanApplication OFFICERID OFFICERNAME PHONE FAX OFFICERID APPLICANTID PROPERTYID LOAN REASON DATE APPROVAL Applicant APPLICANTID NAME JOB DEBTINC YOJ DEROG CLNO DELINQ CLAGE NINQ Property PROPERTYID ADDRESS VALUE MORTDUE Customer Account CUSTOMERID APPLICANTID NAME ADDRESS ACCOUNTID CUSTOMERID PROPERTYID ADDRESS BALANCE MONTHLYPAYMENT STATUS History HISTORYID ACCOUNTID PAYMENT DATE The HMEQ Loan Data Warehouse Design • We have some slowly changing attributes: HMEQLoanApplication: Loan, Reason, Date Applicant: Job and Credit Score related attributes Property: Value, Mortgage, Balance • An applicant may reapply for a loan, then some of these attributes may have changed. – Need to introduce “Key” attributes and make them primary keys The HMEQ Loan Data Warehouse Design STAR 1 – Loan Application facts • Fact Table: HMEQApplicationFact • Dimensions: Applicant, Property, Officer, Time STAR 2 – Loan Payment facts • Fact Table: HMEQPaymentFact • Dimensions: Customer, Property, Account, Time Two Star Schemas for HMEQ Loans Applicant APPLICANTKEY APPLICANTID NAME JOB DEBTINC YOJ DEROG CLNO DELINQ CLAGE NINQ Officer OFFICERKEY OFFICERID OFFICERNAME PHONE FAX Property Customer PROPERTYKEY PROPERTYID ADDRESS VALUE MORTDUE CUSTOMERKEY CUSTOMERID APPLICANTID NAME ADDRESS HMEQApplicationFact HMEQPaymentFact APPLICANTKEY PROPERTYKEY OFFICERKEY TIMEKEY LOAN REASON APPROVAL CUSTOMERKEY PROPERTYKEY ACCOUNTKEY TIMEKEY BALANCE PAYMENT STATUS Time Account TIMEKEY DATE MONTH YEAR ACCOUNTKEY LOAN MATURITYDATE MONTHLYPAYMENT The HMEQ Loan DW: Questions asked by management • How many applications were filed each month during the last year? What percentage of them were approved each month? • How has the monthly average loan amount been fluctuating during the last year? Is there a trend? • Which customers were delinquent in their loan payment during the month of September? • How many loans have defaulted each month during the last year? Is there an increasing or decreasing trend? • How many defaulting loans were approved last year by each loan officer? Who are the officers with the largest number of defaulting loans? The HMEQ Loan DW: Some more involved questions • Are there any patterns suggesting which applicants are more likely to default on their loan after it is approved? • Can we relate loan defaults to applicant job and credit history? Can we estimate probabilities to default based on applicant attributes at the time of application? Are there applicant segments with higher probability? • Can we look at relevant data and build a predictive model that will estimate such probability to default on the HMEQ loan? If we make such a model part of our business policy, can we decrease the percentage of loans that eventually default by applying more stringent loan approval criteria? Selecting Task-relevant attributes Applicant APPLICANTKEY APPLICANTID NAME JOB DEBTINC YOJ DEROG CLNO DELINQ CLAGE NINQ Officer OFFICERKEY OFFICERID OFFICERNAME PHONE FAX Property Customer PROPERTYKEY PROPERTYID ADDRESS VALUE MORTDUE CUSTOMERKEY CUSTOMERID APPLICANTID NAME ADDRESS HMEQApplicationFact HMEQPaymentFact APPLICANTKEY PROPERTYKEY OFFICERKEY TIMEKEY LOAN REASON APPROVAL CUSTOMERKEY PROPERTYKEY ACCOUNTKEY TIMEKEY BALANCE PAYMENT STATUS Time Account TIMEKEY DATE MONTH YEAR ACCOUNTKEY LOAN MATURITYDATE MONTHLYPAYMENT HMEQ final task-relevant data file Name Model Role Measurement Level Description BAD Target Binary 1=defaulted on loan, 0=paid back loan REASON Input Binary HomeImp=home improvement, DebtCon=debt consolidation JOB Input Nominal Six occupational categories LOAN Input Interval Amount of loan request MORTDUE Input Interval Amount due on existing mortgage VALUE Input Interval Value of current property DEBTINC Input Interval Debt-to-income ratio YOJ Input Interval Years at present job DEROG Input Interval Number of major derogatory reports CLNO Input Interval Number of trade lines DELINQ Input Interval Number of delinquent trade lines CLAGE Input Interval Age of oldest trade line in months NINQ Input Interval Number of recent credit inquiries HMEQ: Modeling Goal – The credit scoring model should compute the probability of a given loan applicant to default on loan repayment. A threshold is to be selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection. – Using the HMEQ task-relevant data file, three competing models will be built: A logistic Regression model, a Decision Tree, and a Neural Network – Model assessment will allow us to select the best of the three alternative models Predictive Modeling Inputs Cases .. .. .. .. .. .. .. .. .. . . . . . . . . . Target ... ... ... ... ... ... ... ... ... ... .. .. . . Introducing SAS Enterprise Miner (EM) The SEMMA Methodology – Introduced By SAS Institute – Implemented in SAS Enterprise Miner (EM) – Organizes a DM effort into 5 activity groups: Sample Explore Modify Model Assess Sample Input Data Source Sampling Data Partition Explore Distribution Explorer Association Multiplot Variable Selection Insight Link Analysis Modify Data Set Attributes Clustering Transform Variables Filter Outliers Self-Organized Maps Kohonen Networks Time Series Replacement Model Regression User Defined Model Tree Ensemble Neural Network Memory Based Reasoning Princomp/ Dmneural Two-Stage Model Assess Assessment Reporter SAS EM Demo: HMEQ Case