Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Paper SDA-04 Detecting Medicaid Data Anomalies Using Data Mining Techniques Shenjun Zhu, Qiling Shi, Aran Canes, AdvanceMed Corporation, Nashville, TN ABSTRACT TM The purpose of this study is to use statistical and data mining techniques in Base SAS(R) and SAS(R) Enterprise Miner to proactively reduce the number of false positives caused by data anomalies in Medicaid pharmacy claim data when employing a rule-based approach to identify overpayments. Typically rule-based techniques are based on specific state Medicaid laws and policies using certain formulas to detect and identify over charged payments. False positives are defined as an identified overpayment that is erroneously positive when a claim was paid correctly due to data anomalies or unknown factors. False positives substantially increase the amount of time and resources spent by the auditors. The specific objective of the study is to detect and reduce data anomalies by examining the relationships among key variables such as Medicaid amount paid (MAP), average wholesale price (AWP) and quantity of service in Medicaid pharmacy claim data. Pharmacy claim data were simulated and the overpayment was calculated by a rule-based approach developed by AdvanceMed Corporation. Different data mining techniques such as the studentized residual, leverage, Cook‟s distance, DFFITS and clustering were utilized to capture the abnormal claims and reduce the number of false positives. The results of this analysis indicated that the clustering statistical method is the best approach to detect these kinds of data anomalies, followed by the DFFITS method. INTRODUCTION AdvanceMed specializes in helping healthcare organizations evaluate and assess the integrity of their health and pharmacy benefit programs. AdvanceMed conducts sophisticated data analysis to detect potential fraud cases from both the pre and post payment perspective using rule violations, statistical outliers, etc. to identify health care fraud and abuse. AdvanceMed aligns itself with cutting edge resources, developments, and capabilities which allows for progressive healthcare integrity in today‟s fluid environment. Through these efforts, AdvanceMed brings forth all the necessary elements to provide the client with 1 the means to successfully meet its missions. METHODOLOGY A simulation was conducted based on Medicaid data. Abnormal claims were added into the simulation data to test different data mining techniques used to detect data anomalies. Below is a rule-based calculation methodology used by AdvanceMed to detect the overpayment from pharmacy claim data. This rule-based algorithm is to identify overpayments where state Medicaid paid more drug units than state policy allowed. If quantity of service (QOS) is greater than the maximum units (max units) permitted by the state, AdvanceMed can calculate the overpayment by the following formula: Overpayment= MAP- (AWP * discount_rate*max_units + dispense_fee). (1) The discount rate and dispensing fee are constants for a specific state. Hence by (1), we will have many false positives for identified overpayment if there exist abnormal claims related to MAP or AWP. With the exception of strikeouts and errors, MAP should be calculated by a formula using AWP and QOS for each prescription. Below is a formula AdvanceMed uses to define the relationship between MAP and QOS if no other third party payment exists. MAP= AWP * QOS * discount_rate + dispense_fee. (2) The discount rate and dispensing fee are constant for any prescribed prescription. We can infer from this equation that there is a linear relationship between MAP and the product of AWP and QOS. Hence we define a new variable called „AQ‟ and let AQ=AWP*QOS. Then we perform the bivariate association analysis computing the Pearson correlation coefficients between MAP and AQ. In the simulated data the Pearson coefficient equals 0.91 which means there is a strong positive linear relationship between MAP and AQ. We then perform regression analysis predicting MAP from AQ. 1 Consider the linear regression model MAP= 1 + 2 *AQ + where the errors ε are independent and all have the same variance. Observations which have an extreme studentized residual or leverage for the fitted regression model can be identified as outliers. Cook's distance is a measurement of the influence of the i-th data point on all the other data points. The higher Cook's distance is the more influential the point is. We consider the claims when Cook‟s distance is greater than 4/n as outliers. DFFITS shows how influential a point is in a statistical regression. More specifically, it is the difference between the fitted (predicted) values calculated with and without the i-th observation. We identify the claims with DFFITS greater than 2*sqrt(k/n) as outliers (where k is the number of predictors and n is the number of observations). Clustering is a statistical method of unsupervised learning. It puts a set of observations into subsets (called clusters) so that observations are clustered which have similar patterns between the variables. Since there are three distinct drugs in the table, we determined the number of clusters as k not less than three. SAS Enterprise Miner uses the clustering cubic criterion (CCC) cutoff value as its main criteria in the selection of number of clusters. In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair is made up of one object from each group. The segment identifier is assigned a role of segment. The cluster selects initial seeds that are very wellseparated using a full replacement algorithm. The clustering methods in the Cluster node perform disjoint cluster analysis on the basis of Euclidean distances. SAS Enterprise Miner uses the Convergence Criterion Value property to specify the value of the convergence criterion in the computation of cluster seeds. The default convergence value is 0.0001. RESULTS The simulated pharmacy claim dataset consists of information about Medicaid pharmacy services. The response variable is the overpayment, calculated based on state policy. Possible explanatory variables include various measures of Medicaid pharmacy service. We add some aberrant records to the AWP in the simulated dataset to evaluate the effects of AWP data anomalies on the identified overpayments in the results. The data structure employed to calculate overpayment by a rulebased methodology is as below: Table 1: Data Structure for Simulated Pharmacy Claim Table with Calculated Overpayment Type of Normal Claims Abnormal Claims Total Total Observations 2,998 300 3,298 Number of Observations for Overpayments 63 9(False Positives) 72 Identified Claim Count Rate (%) 2.10% 3.00% 2.18% The five highest and lowest overpayments for each drug are below: 2 Figure 1: The Five Highest and Lowest Overpayments for Each Drug We examined the regression command predicting MAP from AQ. We outputted several statistics that will be needed for the next few analyses as a dataset called “rx_res”. These statistics include the studentized residual (called r), leverage (called lev), Cook's Distance (called cd) and DFFITS (called dffit). First, we used studentized residuals to identify outliers. The studentized residuals were retrieved from the previous regression analysis output. Ninety-two claims with studentized residuals either less than -1.42 or greater than 3 were identified as outliers (data anomalies). Figure 2: Studentized Residuals Distribution Second, we assess the leverages to identify observations that have a potentially large influence on regression coefficient estimates. 3 Figure 3: Leverage Distribution After we closely examine the observations in the simulated dataset as plotted below, the “claim_pk” which is the ID number for claims in (3258,3236,3036,3270,3300,3228,3260,3136,3130,3111) displays high leverage. As a result, 200 claims with leverage>0.001 were identified as outliers. 4 Figure 4: “claim_pk” Plot for Leverage and R-squared 0. 0014 2473 2083 2994 2454 2263 2435 2942 2493 2235 2910 2320 2328 2363 2985 2967 2331 0. 0013 0. 0012 0. 0011 0. 0010 0. 0009 0. 0008 0. 0007 0. 0006 0. 0005 0. 0004 2774 2880 2503 2354 2031 2323 2697 2999 2596 2398 2965 2150 2330 2084 2008 2492 2147 2425 2148 2829 2038 2067 2485 2337 2410 2228 2349 2140 2558 2996 2215 2935 2205 2671 2736 2312 2417 2763 2591 2953 2807 2116 2687 2200 2839 2057 2085 2854 2240 2834 2411 2262 2709 2077 2783 2907 2546 2158 2344 2023 2981 2614 2889 2377 2515 2066 2042 3000 2689 2905 2157 2182 2692 2812 2466 2539 2014 2891 2144 2188 2197 2102 2347 2126 2682 2146 2719 2966 2721 2105 2356 2427 2416 2874 2431 2777 2248 2589 2094 2097 2137 2366 2652 2728 2851 2635 2654 2717 2266 2099 2267 2184 2982 2629 2073 2562 2852 2339 2934 2198 2618 2297 2468 2606 2143 2735 2013 2634 2455 2988 2290 2107 2196 2120 2998 2665 2903 2550 2666 2269 2062 2831 2927 2931 2132 2292 2275 2112 2522 2835 2837 2922 2712 2163 2460 2914 2280 2755 2973 2306 2693 2704 2752 2244 2415 2257 2669 2668 2769 2771 2044 2095 2246 2767 2125 2376 3168 2203 2233 2327 2659 3166 2632 2740 2744 2001 2342 2467 2939 2748 2930 2938 2048 2154 2164 2318 2322 2761 2884 2963 3156 2551 3188 2024 2111 2353 2401 2507 2642 2647 2664 2711 2754 2406 2487 2707 2856 2876 3133 2370 2858 2894 2130 2345 2403 2462 2508 2608 2785 2976 3164 3174 2542 2582 2584 2633 2772 2071 2070 2247 2350 2358 2517 2564 2051 2213 2346 2457 2661 2731 2945 2971 3176 3196 2414 2536 2604 2868 3109 3197 2045 2149 2216 2365 2423 2464 2534 2563 2766 3119 2207 2434 2569 2623 2729 2947 2987 3157 2236 2273 2382 2580 3110 2004 2100 2265 2277 2359 2371 2400 2498 2540 2638 2684 2780 2078 2152 2271 2675 2911 2950 3101 3183 2007 2049 2072 2302 2421 2716 2830 2848 2251 2326 2443 2644 2756 2768 2819 2867 3136 3155 2142 2232 2231 2268 2670 2804 2814 2978 2227 2512 2532 2576 2673 2765 2827 2975 3107 2101 2217 2340 2901 2955 2995 3105 3131 3180 3182 3192 2185 2303 2317 2389 2402 2444 2495 2552 2733 2775 2781 2860 3193 2445 3149 3153 3190 3194 2136 2165 2252 2489 2639 2678 2751 2906 2993 3144 3143 3142 2178 2479 2520 2531 2890 2980 2036 2069 2222 2438 2449 2613 2793 2986 2020 2047 2245 2388 2617 2720 2923 2928 2991 2172 2762 2784 2838 2005 2506 2525 2560 2570 2636 2699 2715 2730 3150 3173 3189 2040 2281 2408 2627 2676 3103 3148 2059 2475 2518 2586 2663 3172 2035 2593 2656 2759 2779 2796 2128 2239 2677 2885 2027 2151 3140 2145 2229 2272 2386 2418 2609 2802 3135 3146 2075 2313 2776 3171 2037 2174 2284 2395 2441 2585 2832 2843 3112 2367 2392 2452 2818 2878 2899 3124 3200 2131 2962 2983 3114 3132 3178 2015 2296 2959 2984 2012 2230 2301 2300 2764 2786 2805 3138 3158 2021 2096 2162 2500 2789 2823 2883 2925 3163 3177 3185 2171 2429 2577 2463 3154 3186 2329 2450 2499 3113 3116 3141 2448 2488 2549 3195 2054 2134 2334 2713 2801 3118 2557 2660 3160 2496 2626 3129 2836 3121 2969 3122 3120 3179 2159 2451 2547 2622 3108 3117 3137 3165 3102 3127 3145 3159 3181 3191 2641 3161 3123 3152 3198 2412 2787 2258 3115 3126 3147 3175 3184 2840 3125 2794 2624 3187 3106 3139 3169 2833 2428 2180 2816 2422 3128 3151 3167 2896 2181 3162 3170 2482 3104 2091 2378 2685 2195 2364 2087 2224 2369 2758 2847 2949 2029 2859 2989 2018 2396 2537 2897 2990 2088 2695 2533 2797 2291 2478 3199 2156 2571 2958 2010 2153 2177 2199 2379 2476 2619 2881 2104 2283 2282 2310 2387 2541 2637 2701 2893 2006 2274 2409 2599 2888 3134 2135 2238 2293 2472 2597 2773 2788 2861 2997 2086 2138 2315 2314 2461 2484 2553 2873 2902 2937 3111 2288 2413 2118 2161 2385 2578 2864 2951 2956 2065 2243 2278 2374 2458 2977 2061 2186 2556 2573 2841 2220 2226 2316 2380 2519 2667 2725 2017 2026 2141 2214 2264 2343 2381 2465 2501 2611 2648 2782 2909 2919 2043 2194 2567 2630 2681 2710 2778 2790 2815 2912 2032 2122 2299 2305 2394 2548 2587 2615 2680 2791 2850 2957 2210 2225 2294 2324 2333 2433 2446 2481 2516 2658 2760 2806 2853 2430 2502 2795 2855 2915 2944 2483 2511 2590 2698 2746 2809 2808 2863 2869 2898 2041 2093 2259 2368 2509 2574 2821 2824 2916 2003 2384 2420 2477 2581 2820 2879 2016 2060 2121 2459 2583 2600 2724 2734 2866 3130 2046 2391 2703 2706 2870 2926 2968 2139 2279 2397 2497 2857 2974 2336 2360 2895 2929 2033 2191 2256 2335 2471 2491 2527 2686 2936 2028 2074 2110 2241 2307 2419 2828 2940 2114 2373 2426 2108 2170 2179 2362 2561 2568 2653 2732 2844 2908 2924 2076 2160 2168 2332 2572 2030 2124 2405 2708 2865 2871 2952 2063 2189 2657 2954 2285 2319 2424 2440 2588 2133 2447 2513 2592 2610 2749 2964 1119 2103 2169 2219 2260 2612 2714 2826 2117 2289 2535 2621 2080 2270 2348 2524 2595 2645 2800 2123 2155 2201 2211 2757 2822 2050 2175 2691 2166 2510 2529 1109 2206 2375 2875 2970 2079 2204 2221 2544 2747 2504 2887 2352 2528 2543 2900 2439 2192 2726 2287 2579 2941 2523 2039 2325 2625 2025 2690 2620 2480 1287 1330 2190 2862 2770 2242 2750 1442 1725 1848 2009 2286 2486 2700 2825 2845 2872 2321 2813 1467 2019 2393 2469 2616 2792 2886 2237 2882 1300 1344 1576 2106 2212 2932 2209 2357 2913 1317 1960 2119 2167 2651 2811 2817 1476 1775 2053 2193 1408 2202 1582 1945 2208 2249 2933 2948 2082 2846 2127 2187 2295 2311 2603 2737 2918 2034 2055 2304 2341 2554 1308 2173 2183 2530 2598 2628 2261 2702 2743 2892 2081 2662 2877 2943 2992 2113 2679 2705 2727 2917 2255 2254 2351 2361 2372 2494 2904 2058 2355 2566 2437 2649 2655 2722 2753 2799 2453 2456 2526 2696 2718 2849 2972 2092 2250 2399 2404 2490 2607 2741 2098 2383 2602 2601 2631 2745 2309 2575 2129 2276 2521 2646 2688 1602 2022 2056 2298 2407 2672 2803 2052 2234 2308 2594 2640 2723 2960 1523 1831 2176 2218 2650 2674 2921 1361 1573 2436 2514 2798 1194 1673 2643 2694 1291 1423 2011 1143 1181 1285 1491 1813 1937 2338 2683 2946 1188 1286 1333 1387 1443 1510 1663 2109 1138 1731 1030 1040 1049 1052 1074 1073 1223 1311 1675 1709 1745 1862 2253 2559 1051 1090 1150 1360 1391 2810 1236 1372 1398 1469 1603 1606 1632 1833 1957 2470 1016 1086 1167 1198 1241 1485 1592 1755 1786 2920 1005 1020 1038 1249 1289 1305 1591 1610 1697 1913 2089 2538 2842 1125 1148 1338 1374 1455 1526 1667 1686 1844 2223 1029 1035 1176 1206 1219 1336 1362 1410 1426 1528 1689 1770 1797 1987 1037 1235 1302 1588 1594 1605 1932 2555 1033 1063 1495 1566 1649 1220 1870 2738 1187 1288 1463 1533 1555 1902 1920 1983 1205 1233 1699 1754 1825 1583 1660 1659 1714 1600 1677 1371 1279 1334 2115 1011 1874 1822 1411 1543 1683 1948 1768 1993 1966 1369 1409 1155 1695 1142 1179 1553 1911 1919 1025 1214 1532 1078 1693 1730 1425 1631 1856 1882 1343 1946 1817 1982 1061 1805 1257 1674 1710 1713 1787 1044 1131 1807 1985 1057 1267 1759 1315 1413 1161 1380 1386 1826 1435 1744 1050 1650 1678 1196 1705 1162 1118 1192 1297 1889 1906 1961 1245 1304 1351 1484 1716 1963 1803 1776 1397 1827 1031 1322 1840 1022 1085 1145 1464 1525 1554 1584 1853 1881 1095 1144 1172 1404 1515 1711 1941 1984 1225 1596 2505 1345 1186 1190 1072 1383 1479 1505 1550 1561 1571 1652 1783 1858 1990 1027 1500 1587 1618 1639 1739 1753 1698 1477 1104 1106 1347 1428 1790 1811 1835 1034 1071 1103 1185 1231 1295 1540 1796 1922 1004 1135 1475 1559 1608 1704 1742 1928 1446 1501 1551 1624 1644 1666 1842 1924 1615 2000 2090 2979 1574 1580 1370 1432 1445 1458 1944 1048 1108 1173 1262 1294 1303 1417 1564 1581 1877 1940 1942 1001 1152 1177 1180 1377 1508 1647 1761 1997 1019 1077 1213 1478 1743 1113 1158 1209 1321 1331 1416 1449 1687 1741 1933 1136 1394 1635 1642 1671 1794 1307 1665 1696 1828 1904 1002 1195 1228 1750 1066 1278 1355 1884 1045 1264 1298 1368 1552 1703 1758 1900 1059 1088 1254 1306 1339 1346 1378 1390 1531 1548 1327 1332 1366 1456 1270 1969 1487 1128 1480 1124 1437 1681 1751 1903 1995 1046 1165 1516 1518 1746 1841 1878 1082 1777 1917 1348 1601 1723 1815 2442 1283 1414 1977 1122 1329 1271 1341 1356 1578 1771 1820 1080 1232 1405 1440 1444 1454 1662 1700 1850 1854 1589 1901 1036 1097 1217 1293 1792 1975 1184 1680 2739 2961 1393 1598 1389 1951 1972 1091 1326 1373 1784 1253 1502 1829 1083 1472 1412 1422 1719 1866 1534 1748 1259 1973 1499 1918 1512 1134 1268 1541 1672 1956 1123 1872 1028 1646 1692 1047 20021981 1137 1247 1707 1722 1494 1617 1522 1178 1905 1221 1949 1832 1846 2064 2432 1140 1024 1855 2742 1400 1804 1898 1316 1538 1868 1939 1712 2068 1342 1607 2474 2545 1627 1434 1468 1064 1773 1929 1936 1251 1447 2390 1740 1376 1738 1238 1058 1482 1954 1892 1488 1062 1274 1296 1821 1089 1392 1470 1201 1542 1171 1622 1363 1979 1312 1726 1292 1791 1943 1299 1385 1473 1664 1830 1636 1139 1418 1657 1992 1075 1955 1182 1396 1766 1887 1053 1087 1272 1728 1756 1893 1191 1679 1875 1266 1504 1585 1641 1888 1967 1202 1067 1197 1529 1685 1808 1837 1117 1421 1560 1669 1708 1814 1839 1968 1546 1565 1670 1733 1861 1381 1729 1970 1702 1812 1867 1407 1141 1962 1965 1098 1157 1301 1453 1497 1637 1895 1915 1930 1989 1160 1107 1100 1222 1401 1629 1980 1524 1764 1834 1099 1282 1384 1613 1654 1653 1847 1215 1230 1772 1780 1860 1926 1054 1403 1558 1824 1896 1156 1340 1430 1527 1843 1934 1263 1514 1623 1013 1101 1212 1218 1007 1012 1023 1042 1503 1324 1691 1706 1243 1175 1246 1248 1110 1237 1520 1694 1907 1189 1616 1724 1763 1539 1688 1947 1779 1009 1250 1382 1427 1474 1886 1310 1076 1070 1120 1159 1597 1701 1204 1658 1778 1782 1789 1798 1950 1224 1547 1645 1864 1923 1325 1402 1431 1795 1931 1935 2605 1450 1424 1656 1065 1079 1147 1164 1203 1364 1483 1638 1890 1988 1575 1604 1661 1115 1352 1557 1595 1873 1130 1277 1126 1256 1281 1535 1041 1462 1621 1747 1026 1318 1375 1486 1549 1556 1563 1816 1899 1313 1507 1676 1570 1801 1879 1640 1018 1769 2565 1823 1851 1309 1055 1153 1239 1429 1737 1974 1419 1567 1765 1273 1509 1809 1579 1852 1994 1353 1114 1151 1200 1395 1865 1032 1319 1489 1496 1655 1863 1891 1436 1819 1094 1976 1003 1999 1379 1388 1800 1668 1799 1216 1883 1269 1415 1234 1978 1451 1459 1481 1350 1785 1836 1365 1433 1986 1227 1420 1328 1043 1163 1727 1068 1105 1460 1818 1536 1154 1859 1006 1060 1908 1017 1349 1149 1466 1760 1229 1927 1015 1717 1735 1628 1255 1897 1359 1959 1069 1530 1736 1885 1916 1682 1958 1880 1593 1146 1452 1625 1633 1648 1718 1199 1612 1614 1952 1183 1261 1207 1788 1914 1211 1265 1998 1810 1493 1513 1242 1252 1537 1690 1774 1909 1129 1849 1971 1609 1008 1092 1684 1406 1490 1569 1996 1323 1572 1749 1335 1471 1762 1938 1876 1599 1492 1438 1465 1715 1314 1337 1102 1244 1802 1111 1132 1869 1991 1367 1857 1448 1210 1626 1910 1021 1752 1517 1619 1093 1039 1399 1806 1014 1168 1290 1871 1611 1133 1441 1521 1280 1838 1112 1461 1276 1734 1116 1562 1767 1010 1568 1721 1912 1240 1498 1056 1193 1357 1781 1121 1226 1590 1084 1634 1169 1793 1953 1170 1096 1894 1208 1260 1511 1845 1320 1732 1964 1354 1545 1519 1921 1620 1651 1643 1925 1457 1544 1127 1284 1577 1720 1506 1258 1081 1439 1586 1630 1275 1757 1358 3021 3063 351 163 538 3034 3093 3007 3051 3089 3031 3082 3016 3064 3070 3067 3018 3049 3035 3100 3030 3096 3059 3013 3024 3098 3080 3037 3047 3039 3087 3071 3056 3086 3003 3043 730 3022 644 3040 23 3002 3068 247 548 788 148 973 853 323 31 8 3006 3038 22 108 3092 240 3052 3078 3010 3084 3004 3033 3073 3083 3062 3077 3046 3058 3076 3099 3017 3011 3045 3090 3072 3057 3015 3032 3009 3026 3065 3036 3048 3054 3028 3019 3014 651 3074 3081 3012 3097 3091 3044 3008 3053 3020 3042 3050 3088 3023 3029 3075 3095 216 318 231 574 3085 3055 112 483 887 918 863 120 501 680 873 722 809 365 394 620 990 145 428 557 608 700 99 117 701 103 131 260 291 518 723 826 637 828 916 685 856 894 141 397 573 836 403 255 470 603 923 544 684 787 963 746 760 664 223 72 232 89 513 3060 422 147 492 688 732 267 102 177 480 628 920 922 221 239 400 477 532 829 885 903 972 176 294 334 488 776 844 63 213 500 830 967 34 51 455 724 32 57 3094 3001 838 866 974 364 708 919 86 114 142 180 569 650 955 971 276 412 559 951 48 580 733 704 899 26 38 317 485 568 586 673 855 911 982 54 375 927 772 3066 694 683 779 3069 3027 3079 452 172 508 462 454 798 132 646 331 753 290 804 475 979 275 698 521 961 3005 781 332 3041 690 3025 604 192 3061 124 556 316 342 653 737 562 868 545 272 796 975 719 151 195 322 347 386 389 388 493 537 542 583 657 679 808 846 859 143 161 208 215 677 705 786 842 875 891 907 924 154 280 496 553 582 821 867 874 915 991 219 226 451 642 667 728 774 912 952 95 306 834 29 135 25 30 44 77 93 94 50 448 752 890 309 466 498 550 566 602 861 101 271 315 393 441 482 502 510 514 588 618 906 933 130 156 162 212 237 236 311 339 346 362 373 519 527 572 610 615 617 631 630 640 666 676 675 678 710 709 716 734 749 754 757 105 146 159 182 191 202 207 249 265 293 299 324 349 354 366 421 424 431 450 476 479 478 499 511 541 551 599 611 665 691 725 729 775 805 877 959 968 125 127 235 329 337 340 405 434 468 489 635 661 720 770 784 802 833 841 869 871 879 934 965 41 189 197 284 303 319 396 408 456 670 682 693 718 748 931 936 945 980 989 81 251 358 984 12 19 52 220 837 24 69 73 92 14 78 10 234 766 194 581 639 759 800 840 946 128 185 457 523 671 687 771 847 878 944 115 144 150 209 242 292 307 419 418 427 442 495 613 636 699 780 785 864 988 999 160 166 186 285 361 395 534 555 762 811 816 819 827 889 949 119 138 175 190 198 298 367 440 520 529 576 590 669 707 851 913 938 941 962 168 363 371 996 15 56 82 763 888 21 27 55 59 68 42 66 6 7 278 460 252 504 206 301 813 850 937 90 149 155 164 230 273 312 547 795 895 901 909 970 261 314 333 352 357 384 443 494 638 674 695 738 756 854 860 80 250 469 696 9 295 536 612 928 803 711 947 121 297 387 413 430 444 503 552 554 561 563 731 761 898 28 85 183 233 325 378 391 426 458 474 486 491 517 622 647 660 745 758 807 814 820 849 852 886 902 35 100 110 113 116 122 173 170 179 187 193 199 210 217 277 296 305 310 313 330 344 359 368 377 401 411 423 429 445 481 497 505 524 560 570 587 629 641 645 686 692 703 712 727 726 744 777 797 812 824 858 862 884 917 935 954 969 985 4 16 65 71 171 253 281 345 343 369 379 415 459 461 533 571 585 609 663 672 751 764 893 892 900 926 925 932 960 997 1000 83 139 463 857 1 11 60 64 97 104 109 111 153 152 158 201 200 222 266 286 350 360 372 382 407 414 417 438 437 436 484 522 584 594 593 592 591 605 627 625 643 652 706 717 768 782 793 930 950 957 983 987 986 994 3 2 5 13 18 17 20 33 37 36 40 39 43 47 46 45 49 53 58 62 61 67 70 76 75 74 79 84 88 87 91 98 96 107 106 118 123 126 129 134 133 137 136 140 157 165 167 169 174 178 181 184 188 196 205 204 203 211 214 218 225 224 229 228 227 238 241 246 245 244 243 248 254 259 258 257 256 264 263 262 270 269 268 274 279 283 282 289 288 287 300 302 304 308 321 320 328 327 326 336 335 338 341 348 353 356 355 370 374 376 383 381 380 385 390 392 399 398 402 404 406 410 409 416 420 425 433 432 439 435 447 446 449 453 465 464 467 473 472 471 487 490 509 507 506 512 516 515 526 525 528 531 530 535 540 539 543 546 549 558 565 564 567 575 579 578 577 589 598 597 596 595 601 600 607 606 614 616 619 621 626 624 623 634 633 632 649 648 656 655 654 659 658 662 668 681 689 697 702 715 714 713 721 736 735 743 742 741 740 739 747 750 755 765 769 767 773 778 783 794 792 791 790 789 799 801 806 810 815 818 817 823 822 825 832 831 835 839 843 845 848 865 870 872 876 883 882 881 880 897 896 905 904 908 910 914 921 929 940 939 943 942 948 953 958 956 964 966 978 977 976 981 995 993 992 998 3269 3282 3201 3247 3215 3264 3203 3257 3265 3224 3262 3230 3290 3235 3280 3246 3207 3287 3213 3208 3209 3248 3205 3276 3259 3220 3295 3239 3296 3245 3271 3232 3214 3263 3274 3222 3291 3225 3254 3242 3217 3233 3218 3277 3283 3226 3293 3223 3298 3278 3206 3240 3229 3212 3238 3243 3289 3252 3237 3255 3250 3284 3231 3294 3227 3253 3266 3261 3288 3279 3275 3202 3251 3281 3285 3210 3244 3297 3272 3204 3211 3299 3241 3219 3292 3256 3258 3236 3216 3273 3270 33003260 3228 3249 3221 3268 3286 3267 3234 0. 0003 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 r squar ed SAS code: proc univariate data=rx_res plots; var lev; run; proc sql; create table rx_res2 as select *, r**2 as rsquared from rx_res; quit; goptions reset=all; axis1 label=(r=0 a=90); symbol1 pointlabel = ("#claim_pk") font=simplex value=none; proc gplot data=rx_res2; plot lev*rsquared / vaxis=axis1; run; quit; The results of Cook‟ distance showed that there were 118 claims with Cook‟s distance>4/3298 and 195 claims with an absolute value of DFFITS>2*sqrt(1/3298) which were considered as outliers. TM We used the SAS(R) Enterprise Miner to do the cluster analysis. Each observation represents a claim for overpayment detection. The following is the flow diagram of this clustering model design. 5 Figure 5: Flow Diagram of the Clustering Model In the “RX_SIMU” node, we did not use any target information created by a rule-based algorithm because it is not necessary for the unsupervised learning model. In the “Replacement” node, we replaced the missing value of character variables with “Unknown” and ignored the missing values of interval variables. In the “Transform Variables” node, we created a new variable “log_aq” by employing the formula: log_aq=log (AWP*quantity_of_service+1). To reduce the variance of the variable “AQ” which has a skewness of 17.76, a log transformation on “AQ” was performed and a new variable log_aq was created. Below are the statistics after the log transformation. Figure 6: Transformation Statistics of “log_aq” The cluster selects initial seeds that are very well-separated using a full replacement algorithm. The following pie chart shows there are 6 segments selected for this clustering. 6 Figure 7: Segment Size Plot There are 3 segments with sizes of around 1000 observations each, and 3 segments which have sizes of 100 observations each. From the distribution of each variable within the segments, we know that most of them are evenly distributed within each segment and they appear the same in the pairs of (1, 6), (2, 4) and (3, 5). Figure 8: Cluster Proximities Plot Cluster proximity for average clustering is defined as the average pairwise distance of all pair of points from different clusters. From the plot of cluster proximities, the pattern becomes obvious. The distance of cluster proximities for the segment pairs of (1, 6), (2, 4) and (3, 5) are very close to each other. From the segment size plot, the sizes of segment 4, 5 and 6 are very small compared to their closest segments and hence can be identified as abnormal claims. After figuring out which variable caused this abnormality we used SAS code node to delete segments=4, 5 and 6. There are 300 claims in segments 4, 5 and 6 identified as abnormal claims. SAS code: 7 libname cls "C:\Documents and Settings\Administrator\Desktop\paper reference"; data cls.rx_clus; set &em_import_data.; if _segment_ in (1,2,3); drop _segment_ distance im_awp im_log_aq im_max_units im_medicaid_amount_paid im_period im_quantity_of_service im_national_drug_code _impute_ log_aq; run; The following is the summary of experiment results for Student Residual, Leverage, Cook‟s distance, DFFITS and Clustering. Table 2: Summary of Experiment Results Number of Abnormal Claims Removed Abnormal Claims Capture Rate Number of False Positives Removed Student Residual 87 29% 0 0% 5 0.17% Leverage 2 1% 0 0% 198 6.60% Cook's Distance 195 65% 0 0% 3 0.10% DFFITS 182 61% 6 67% 35 1.17% Clustering 300 100% 9 100% 0 0.00% Statistical Techniques False Positives Capture Rate Number of Normal Removed Normal Claims Misclassification Rate CONCLUSION When working with Medicaid data, AdvanceMed has learned that there are different types of data anomalies in Medicaid pharmacy claim data. A simulation of the pharmacy claim file shows that false positives are caused by these anomalies in a rule based algorithm. To avoid false positives, we introduced five different statistical approaches to detect and eliminate the abnormal claims. The results of this study indicate that clustering technique is the best approach, followed by DFFITS. 8 REFERENCES 1 AdvanceMed Corporation http://www.csc.com/advancemed/ds/11858-about_us ACKNOWLEDGMENTS Special Thanks to Tom Mathis, who is the program director of AdvanceMed Corporation, for his patience and support. Huge thanks to Rick Wells, who is the project director of AdvanceMed Corporation, for his incredibly understanding and sincere encouragement. Finally, to all of the colleagues who perfectly demonstrate creative excellences — thank you. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the authors at: SHENJUN ZHU Chief Statistician, AdvanceMed Corporation, 2636 Elm Hill Pike, Suite 110, Nashville, TN 37214 p: 615.425.2481 | f: 615.872.0272 [email protected] | www.admedcorp.com QILING SHI Data Analyst, Mathematics PhD AdvanceMed Corporation, 2636 Elm Hill Pike, Suite 110, Nashville, TN 37214 p: 615.425.2451 | f: 615.872.0272 [email protected], [email protected]| www.admedcorp.com ARAN CANES Data Analyst, Economics MA AdvanceMed Corporation, 2636 Elm Hill Pike, Suite 110, Nashville, TN 37214 p: 615.872.0272 | f: 615.872.0272 [email protected] | www.admedcorp.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. 9