The problem scenario of this lab is from our optional textbook, Fundamentals of Machine Learning for Predictive Data Analytics.
A credit card issuer has built two different credit scoring models that predict each customer's propensity to default on their loan. The output scores of the two models on a test dataset are shown in the table below and in CSV format; since a higher score means a higher predicted propensity to default, a lower score is better for a customer:
| ID | Target | Model 1 Score | Model 2 Score |
|----|--------|---------------|---------------|
| 1 | bad | 0.634 | 0.230 |
| 2 | bad | 0.782 | 0.859 |
| 3 | good | 0.464 | 0.154 |
| 4 | bad | 0.593 | 0.325 |
| 5 | bad | 0.827 | 0.952 |
| 6 | bad | 0.815 | 0.900 |
| 7 | bad | 0.855 | 0.501 |
| 8 | good | 0.500 | 0.650 |
| 9 | bad | 0.600 | 0.940 |
| 10 | bad | 0.803 | 0.806 |
| 11 | bad | 0.976 | 0.507 |
| 12 | good | 0.504 | 0.251 |
| 13 | good | 0.303 | 0.597 |
| 14 | good | 0.391 | 0.376 |
| 15 | good | 0.238 | 0.285 |
| 16 | good | 0.072 | 0.421 |
| 17 | bad | 0.567 | 0.842 |
| 18 | bad | 0.738 | 0.891 |
| 19 | bad | 0.325 | 0.480 |
| 20 | bad | 0.863 | 0.340 |
| 21 | bad | 0.625 | 0.962 |
| 22 | good | 0.119 | 0.238 |
| 23 | bad | 0.995 | 0.362 |
| 24 | bad | 0.958 | 0.848 |
| 25 | bad | 0.726 | 0.915 |
| 26 | good | 0.117 | 0.096 |
| 27 | good | 0.295 | 0.319 |
| 28 | good | 0.064 | 0.740 |
| 29 | good | 0.141 | 0.211 |
| 30 | good | 0.670 | 0.152 |
The lab task is to write a function to calculate the area under the ROC curve of ONE model using approximate integration.
To calculate the ROC area, your program should read in one model's prediction results at a time, and the data can be pre-processed so that it is sorted by model score. For example, to calculate the ROC area for Model 1, the data fed into your program should look like this (and in CSV format):
| Target | Model Score |
|--------|-------------|
| good | 0.064 |
| good | 0.072 |
| good | 0.117 |
| good | 0.119 |
| good | 0.141 |
| good | 0.238 |
| good | 0.295 |
| good | 0.303 |
| bad | 0.325 |
| good | 0.391 |
| good | 0.464 |
| good | 0.500 |
| good | 0.504 |
| bad | 0.567 |
| bad | 0.593 |
| bad | 0.600 |
| bad | 0.625 |
| bad | 0.634 |
| good | 0.670 |
| bad | 0.726 |
| bad | 0.738 |
| bad | 0.782 |
| bad | 0.803 |
| bad | 0.815 |
| bad | 0.827 |
| bad | 0.855 |
| bad | 0.863 |
| bad | 0.958 |
| bad | 0.976 |
| bad | 0.995 |
Then, your program should use each score as a threshold value and calculate the FPR (false positive rate) and TPR (true positive rate) at that threshold. Each (FPR, TPR) pair gives one point on the ROC curve.
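The per-threshold computation could look like the sketch below, treating "bad" (default) as the positive class and predicting positive when the score is at least the threshold. The helper name is illustrative, not part of the assignment:

```python
def rates_at_threshold(records, threshold):
    """Compute (FPR, TPR) when predicting 'bad' for scores >= threshold.

    records: list of (label, score) pairs; 'bad' is the positive class.
    """
    tp = sum(1 for label, s in records if label == "bad" and s >= threshold)
    fp = sum(1 for label, s in records if label == "good" and s >= threshold)
    pos = sum(1 for label, _ in records if label == "bad")
    neg = len(records) - pos
    return fp / neg, tp / pos
```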
Finally, calculate the area under the ROC curve using (simplified) approximate integration based on these points, e.g., by summing the trapezoids between consecutive points.
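Putting the threshold sweep and the integration together, one minimal sketch is shown below. It assumes "bad" is the positive class and that higher scores mean "more likely to default"; the function name and the added (0, 0)/(1, 1) endpoint sentinels are our choices, not requirements of the assignment:

```python
def roc_auc(records):
    """Approximate the area under the ROC curve with the trapezoidal rule.

    records: list of (label, score) pairs; 'bad' is the positive class.
    """
    pos = sum(1 for label, _ in records if label == "bad")
    neg = len(records) - pos
    # Use each distinct score as a threshold, swept from high to low so
    # the ROC points run left to right; add sentinels so the curve spans
    # (0, 0) to (1, 1).
    thresholds = sorted({s for _, s in records}, reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for label, s in records if label == "bad" and s >= t)
        fp = sum(1 for label, s in records if label == "good" and s >= t)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    # Trapezoidal integration over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```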
(Optional) Use your program to calculate the ROC areas for Model 1 and Model 2 in the credit card prediction example above, and determine which model is better.
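For the optional part, one way to cross-check the trapezoidal result is the pair-counting (Mann–Whitney) formulation of the AUC, which gives the same area when every score is used as a threshold. A self-contained sketch using the full table above (the function name is illustrative):

```python
# Full test data from the table above: (target, model 1 score, model 2 score).
data = [
    ("bad", 0.634, 0.230), ("bad", 0.782, 0.859), ("good", 0.464, 0.154),
    ("bad", 0.593, 0.325), ("bad", 0.827, 0.952), ("bad", 0.815, 0.900),
    ("bad", 0.855, 0.501), ("good", 0.500, 0.650), ("bad", 0.600, 0.940),
    ("bad", 0.803, 0.806), ("bad", 0.976, 0.507), ("good", 0.504, 0.251),
    ("good", 0.303, 0.597), ("good", 0.391, 0.376), ("good", 0.238, 0.285),
    ("good", 0.072, 0.421), ("bad", 0.567, 0.842), ("bad", 0.738, 0.891),
    ("bad", 0.325, 0.480), ("bad", 0.863, 0.340), ("bad", 0.625, 0.962),
    ("good", 0.119, 0.238), ("bad", 0.995, 0.362), ("bad", 0.958, 0.848),
    ("bad", 0.726, 0.915), ("good", 0.117, 0.096), ("good", 0.295, 0.319),
    ("good", 0.064, 0.740), ("good", 0.141, 0.211), ("good", 0.670, 0.152),
]

def pairwise_auc(records):
    """AUC as the fraction of (bad, good) pairs ranked correctly.

    A pair is ranked correctly when the 'bad' customer scores higher
    than the 'good' one; ties count as half. With distinct scores this
    equals the trapezoidal area under the ROC curve.
    """
    bads = [s for label, s in records if label == "bad"]
    goods = [s for label, s in records if label == "good"]
    correct = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return correct / (len(bads) * len(goods))

auc1 = pairwise_auc([(t, m1) for t, m1, _ in data])
auc2 = pairwise_auc([(t, m2) for t, _, m2 in data])
print(f"Model 1 AUC: {auc1:.3f}")
print(f"Model 2 AUC: {auc2:.3f}")
```

With this data, Model 1's area comes out around 0.95 and Model 2's around 0.85, so Model 1 ranks the customers more accurately.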