Part II. Association rules in a large dataset of transactions

1.    Dataset description

Download the following dataset: marketbasket.csv. It contains point-of-sale transaction data from a small supermarket.

Open the file in the WEKA Explorer.

The dataset consists of 1361 transactions, and the total number of distinct items is 255. WEKA reads all attributes as numeric, although they are in fact binary: each value is either 0 (not purchased) or 1 (purchased).

The first thing we need to do is convert the attributes to nominal. Apply Filter->unsupervised->attribute->NumericToNominal. In the class attribute dropdown box, select the "No class" option; if you do not, the class attribute will not be converted to nominal. Then click Apply to run the filter. Save the resulting file in ARFF format as marketbasket.arff.
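To make the effect of NumericToNominal concrete, here is a minimal Python sketch of the same conversion done by hand: it reads a 0/1 table in CSV form and emits ARFF text in which every attribute is declared as nominal with the values {0,1}. The three-item sample and the function name csv_to_nominal_arff are made up for illustration. This is not how WEKA implements the filter, and the real marketbasket.csv has 255 item columns.

```python
import csv
import io

# Hypothetical three-item sample in the same 0/1 layout as marketbasket.csv.
SAMPLE_CSV = """bread,milk,eggs
1,0,1
0,1,1
1,1,0
"""

def csv_to_nominal_arff(csv_text, relation="marketbasket"):
    """Return ARFF text that declares every column as nominal {0,1}."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for name in header:
        # A nominal attribute lists its allowed values in braces.
        # (Names containing spaces would need quoting; not handled here.)
        lines.append("@attribute " + name + " {0,1}")
    lines += ["", "@data"]
    for row in data:
        if any(v not in ("0", "1") for v in row):
            raise ValueError("non-binary value in row: %r" % (row,))
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"

arff_text = csv_to_nominal_arff(SAMPLE_CSV)
print(arff_text)
```

Compared with a numeric declaration, the only change is in the header: each attribute line ends in {0,1} instead of a numeric type, while the data rows stay the same.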

Now the dataset corresponds exactly to the binary input for frequent pattern mining (as in the Pizza toppings dataset in slide 37 of our first lecture about the Apriori algorithm). Though it is tempting to try Apriori, do not attempt it in the lab: it will cause a memory overflow and WEKA will crash. You can try it at home, where you know how to stop a non-responsive program and how to recover from a memory overflow.
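A rough count suggests why Apriori struggles here. The sketch below only counts how many k-item combinations exist for a given number of items; it is an upper bound on the candidate space, not a measurement of what WEKA actually allocates. The second figure, 510, assumes that each attribute=value pair (both 0 and 1) is treated as a separate item, which is an assumption about how nominal attributes are handled rather than something stated in this handout.

```python
from math import comb

def itemset_counts(n_items, max_size=3):
    """Number of possible itemsets of each size 1..max_size."""
    return {k: comb(n_items, k) for k in range(1, max_size + 1)}

# 255 items as in marketbasket; 510 if every attribute=value pair is an item
# (an assumption, see the paragraph above).
for n in (255, 510):
    print(n, itemset_counts(n))
```

Even at size 3 the number of possible combinations of 255 items runs into the millions, so any approach that materialises many candidates at once can exhaust memory quickly.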

In the previous lab, we applied the Apriori algorithm to categorical attributes with 5 different categories for each attribute. Unlike Apriori, the FP-growth algorithm takes as input only the binary format, expressed as nominal attributes with 2 values: 0 and 1. This is exactly what we have, so now we can try the FP-growth algorithm in the Associate tab.
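FP-growth works on lists of purchased items rather than on the 0/1 columns directly. As an illustration of its first pass, the sketch below turns binary rows into item lists, counts item frequencies, drops items below a minimum count, and orders each transaction by descending frequency, which is the order in which items would be inserted into the FP-tree. The item names, rows, and the helper name ordered_transactions are invented for this example; tree construction and mining are omitted.

```python
from collections import Counter

ITEMS = ["bread", "milk", "eggs", "jam"]   # hypothetical item columns
ROWS = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]

def ordered_transactions(items, rows, min_count):
    """First FP-growth pass: filter infrequent items, sort by frequency."""
    # Keep only the items whose column value is 1 (purchased).
    transactions = [[it for it, v in zip(items, row) if v == 1] for row in rows]
    counts = Counter(it for t in transactions for it in t)
    frequent = {it for it, c in counts.items() if c >= min_count}

    def order(it):
        # Descending count; ties broken alphabetically for determinism.
        return (-counts[it], it)

    return [sorted((it for it in t if it in frequent), key=order)
            for t in transactions]

for t in ordered_transactions(ITEMS, ROWS, min_count=2):
    print(t)
```

Because every transaction lists its items in the same global order, transactions that share frequent items also share prefixes, which is what lets the FP-tree store them compactly.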

2.    FP-growth with default parameters

Select FP-growth and run it with default parameters. No rules found! The default thresholds are too strict for this dataset.

3.    Adjusting parameters  

Click on the parameters line to open the options dialog. Set lowerBoundMinSupport to 0.01 and minMetric to 0.7, then click Start.
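The two thresholds can be read as follows, assuming the metric is confidence (an assumption about the default metric type, which the steps above do not change). With 1361 transactions, a minimum support of 0.01 means an itemset must occur in at least 14 transactions (0.01 x 1361 = 13.61, rounded up), and a rule A => B is kept only if at least 70% of the transactions containing A also contain B. The toy baskets and function names below are made up for illustration.

```python
from math import ceil

TRANSACTIONS = [            # hypothetical baskets
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A and B) / support(A) for the rule A => B."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Smallest transaction count that satisfies a 0.01 support bound on 1361 rows.
min_count = ceil(0.01 * 1361)

print(min_count)
print(support({"bread", "milk"}, TRANSACTIONS))
print(confidence({"bread"}, {"milk"}, TRANSACTIONS))
```

In the toy data, the rule bread => milk has a confidence of about 0.67, so it would be rejected by a 0.7 threshold even though its support is high.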

4.    Results

This time 40,664 rules were generated within a few seconds. This demonstrates the power of the FP-growth algorithm. The output is shown in Figure 7.

Figure 7

End of Part II