The problem scenario of this lab is from our optional textbook, Fundamentals of Machine Learning for Predictive Data Analytics.
A credit card issuer has built two different credit scoring models that predict each customer's propensity to default on their loan. The output scores of the two models on a test dataset are shown in the table below and in CSV format; since a higher score means a higher predicted propensity to default, a lower score is better for a customer:
| ID | Target | Model 1 Score | Model 2 Score |
|----|--------|---------------|---------------|
| 1 | bad | 0.634 | 0.230 |
| 2 | bad | 0.782 | 0.859 |
| 3 | good | 0.464 | 0.154 |
| 4 | bad | 0.593 | 0.325 |
| 5 | bad | 0.827 | 0.952 |
| 6 | bad | 0.815 | 0.900 |
| 7 | bad | 0.855 | 0.501 |
| 8 | good | 0.500 | 0.650 |
| 9 | bad | 0.600 | 0.940 |
| 10 | bad | 0.803 | 0.806 |
| 11 | bad | 0.976 | 0.507 |
| 12 | good | 0.504 | 0.251 |
| 13 | good | 0.303 | 0.597 |
| 14 | good | 0.391 | 0.376 |
| 15 | good | 0.238 | 0.285 |
| 16 | good | 0.072 | 0.421 |
| 17 | bad | 0.567 | 0.842 |
| 18 | bad | 0.738 | 0.891 |
| 19 | bad | 0.325 | 0.480 |
| 20 | bad | 0.863 | 0.340 |
| 21 | bad | 0.625 | 0.962 |
| 22 | good | 0.119 | 0.238 |
| 23 | bad | 0.995 | 0.362 |
| 24 | bad | 0.958 | 0.848 |
| 25 | bad | 0.726 | 0.915 |
| 26 | good | 0.117 | 0.096 |
| 27 | good | 0.295 | 0.319 |
| 28 | good | 0.064 | 0.740 |
| 29 | good | 0.141 | 0.211 |
| 30 | good | 0.670 | 0.152 |
The lab task is to write a function to calculate the area under the ROC curve of ONE model using approximate integration.
To calculate the ROC area, your program should read in one model's prediction results at a time, and the data can be pre-processed so that it is sorted by model score. For example, to calculate the ROC area for Model 1, the data fed into your program should look like this (and in CSV format):
| Target | Model Score |
|--------|-------------|
| good | 0.064 |
| good | 0.072 |
| good | 0.117 |
| good | 0.119 |
| good | 0.141 |
| good | 0.238 |
| good | 0.295 |
| good | 0.303 |
| bad | 0.325 |
| good | 0.391 |
| good | 0.464 |
| good | 0.500 |
| good | 0.504 |
| bad | 0.567 |
| bad | 0.593 |
| bad | 0.600 |
| bad | 0.625 |
| bad | 0.634 |
| good | 0.670 |
| bad | 0.726 |
| bad | 0.738 |
| bad | 0.782 |
| bad | 0.803 |
| bad | 0.815 |
| bad | 0.827 |
| bad | 0.855 |
| bad | 0.863 |
| bad | 0.958 |
| bad | 0.976 |
| bad | 0.995 |
Then, your program should use each score as a threshold value and calculate the FPR (false positive rate) and TPR (true positive rate) at that threshold. Each (FPR, TPR) pair gives one point on the ROC curve.
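The per-threshold computation could look like the sketch below, treating "bad" (default) as the positive class and predicting positive when the score is at least the threshold. The helper name is illustrative, not part of the assignment:

```python
def rates_at_threshold(records, threshold):
    """Compute (FPR, TPR) when predicting 'bad' for scores >= threshold.

    records: list of (label, score) pairs; 'bad' is the positive class.
    """
    tp = sum(1 for label, s in records if label == "bad" and s >= threshold)
    fp = sum(1 for label, s in records if label == "good" and s >= threshold)
    pos = sum(1 for label, _ in records if label == "bad")
    neg = len(records) - pos
    return fp / neg, tp / pos
```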
Finally, calculate the area under the ROC curve using (simplified) approximate integration based on these points, e.g., by summing the trapezoids between consecutive points.
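Putting the threshold sweep and the integration together, one minimal sketch is shown below. It assumes "bad" is the positive class and that higher scores mean "more likely to default"; the function name and the added (0, 0)/(1, 1) endpoint sentinels are our choices, not requirements of the assignment:

```python
def roc_auc(records):
    """Approximate the area under the ROC curve with the trapezoidal rule.

    records: list of (label, score) pairs; 'bad' is the positive class.
    """
    pos = sum(1 for label, _ in records if label == "bad")
    neg = len(records) - pos
    # Use each distinct score as a threshold, swept from high to low so
    # the ROC points run left to right; add sentinels so the curve spans
    # (0, 0) to (1, 1).
    thresholds = sorted({s for _, s in records}, reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for label, s in records if label == "bad" and s >= t)
        fp = sum(1 for label, s in records if label == "good" and s >= t)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    # Trapezoidal integration over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```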
(Optional) Use your program to calculate the ROC areas for Model 1 and Model 2 in the credit card prediction example above, and determine which model is better.
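For the optional part, one way to cross-check the trapezoidal result is the pair-counting (Mann–Whitney) formulation of the AUC, which gives the same area when every score is used as a threshold. A self-contained sketch using the full table above (the function name is illustrative):

```python
# Full test data from the table above: (target, model 1 score, model 2 score).
data = [
    ("bad", 0.634, 0.230), ("bad", 0.782, 0.859), ("good", 0.464, 0.154),
    ("bad", 0.593, 0.325), ("bad", 0.827, 0.952), ("bad", 0.815, 0.900),
    ("bad", 0.855, 0.501), ("good", 0.500, 0.650), ("bad", 0.600, 0.940),
    ("bad", 0.803, 0.806), ("bad", 0.976, 0.507), ("good", 0.504, 0.251),
    ("good", 0.303, 0.597), ("good", 0.391, 0.376), ("good", 0.238, 0.285),
    ("good", 0.072, 0.421), ("bad", 0.567, 0.842), ("bad", 0.738, 0.891),
    ("bad", 0.325, 0.480), ("bad", 0.863, 0.340), ("bad", 0.625, 0.962),
    ("good", 0.119, 0.238), ("bad", 0.995, 0.362), ("bad", 0.958, 0.848),
    ("bad", 0.726, 0.915), ("good", 0.117, 0.096), ("good", 0.295, 0.319),
    ("good", 0.064, 0.740), ("good", 0.141, 0.211), ("good", 0.670, 0.152),
]

def pairwise_auc(records):
    """AUC as the fraction of (bad, good) pairs ranked correctly.

    A pair is ranked correctly when the 'bad' customer scores higher
    than the 'good' one; ties count as half. With distinct scores this
    equals the trapezoidal area under the ROC curve.
    """
    bads = [s for label, s in records if label == "bad"]
    goods = [s for label, s in records if label == "good"]
    correct = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return correct / (len(bads) * len(goods))

auc1 = pairwise_auc([(t, m1) for t, m1, _ in data])
auc2 = pairwise_auc([(t, m2) for t, _, m2 in data])
print(f"Model 1 AUC: {auc1:.3f}")
print(f"Model 2 AUC: {auc2:.3f}")
```

With this data, Model 1's area comes out around 0.95 and Model 2's around 0.85, so Model 1 ranks the customers more accurately.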