CSCI 479 (Machine Learning)

Problem Description:

The training data set and the testing data set used in Assignment 2 use the same data format as the Netflix Prize.

Each data item is a quadruplet of the form <user, movie, date of grade, grade>. The user and movie fields are integer IDs identifying the user and the movie respectively, the date of grade has the format "yyyy-mm-dd", and the grade is an integer number of stars from 1 to 5 inclusive.
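
As a rough illustration of the record format, the sketch below parses one data item. It assumes, which is not stated above, that the four fields are stored on one line separated by commas; check the actual files before relying on it. The names Rating and parse_rating are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class Rating(NamedTuple):
    """One <user, movie, date of grade, grade> quadruplet."""
    user: int
    movie: int
    graded_on: date
    grade: int


def parse_rating(line: str) -> Rating:
    """Parse one 'user,movie,yyyy-mm-dd,grade' line.

    The comma-separated layout is an assumption made for this sketch.
    """
    user, movie, graded_on, grade = (field.strip() for field in line.split(","))
    grade_value = int(grade)
    # Grades are integers from 1 to 5 inclusive.
    if not 1 <= grade_value <= 5:
        raise ValueError(f"grade out of range: {grade_value}")
    return Rating(int(user), int(movie), date.fromisoformat(graded_on), grade_value)
```

Rejecting out-of-range grades at parse time keeps malformed records out of the later stages.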

To use the given data in similarity-based learning algorithms, it must first be transformed into a suitable format.

Your tasks:

Design and implement a program to transform the data set from its current format to a vector format in which each user is represented by one vector.
You don't have to use this recommended user vector format. You can propose your own data format, but the same data format must be used consistently by your distance function (in the next task) and your recommendation algorithm (in Assignment 2).
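
One possible shape for the transformation is sketched below. It assumes the data items have already been read into (user, movie, date, grade) tuples, and it stores each user vector sparsely as a dictionary from movie ID to grade, so unrated movies take no space. Both the sparse layout and the name build_user_vectors are choices made for this sketch, not requirements of the assignment.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sparse user vector: movie ID -> grade; unrated movies are absent.
UserVector = Dict[int, int]


def build_user_vectors(
    items: Iterable[Tuple[int, int, str, int]]
) -> Dict[int, UserVector]:
    """Group <user, movie, date, grade> items into one vector per user."""
    vectors: Dict[int, UserVector] = defaultdict(dict)
    for user, movie, _graded_on, grade in items:
        # If a user graded the same movie twice, the later item in the
        # input overwrites the earlier one (an assumption of this sketch).
        vectors[user][movie] = grade
    return dict(vectors)


items = [
    (1, 10, "2005-01-01", 5),
    (1, 20, "2005-01-02", 3),
    (2, 10, "2005-02-01", 4),
]
user_vectors = build_user_vectors(items)
```

This sketch ignores the date of grade; a format that needs it could store (grade, date) pairs as the dictionary values instead.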

Then, based on the format of the transformed data generated in the previous task, design and implement a similarity (or distance) function that calculates the distance between any two user vectors.
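
A minimal sketch of one such distance function follows, assuming each user vector is a dictionary from movie ID to grade (one possible format, not a required one). It computes the root-mean-square difference in grade over the movies both users have graded, and returns infinity when the two users share no movies. The name user_distance and the choice of this particular measure are illustrative; cosine or Pearson-based measures are equally reasonable alternatives.

```python
import math
from typing import Dict


def user_distance(a: Dict[int, int], b: Dict[int, int]) -> float:
    """Root-mean-square grade difference over movies graded by both users."""
    shared = a.keys() & b.keys()
    if not shared:
        # No movie in common: treat the users as maximally distant.
        return math.inf
    squared_error = sum((a[movie] - b[movie]) ** 2 for movie in shared)
    # Dividing by the overlap size keeps users with many shared movies
    # from looking farther apart merely because more terms are summed.
    return math.sqrt(squared_error / len(shared))
```

For example, user_distance({10: 5, 20: 3}, {10: 4, 20: 1}) averages the squared differences 1 and 4 over two shared movies and returns the square root of 2.5.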