CSCI 485 --- Fall 2023

Lab A

Problem Description:

Generally speaking, most machine learning algorithms fall into the category of supervised inductive learning, and more specifically, into the category of building a predictive model to perform classification. A learning agent learns a model from a set of historical examples (also called observations), which together form the dataset.

Therefore, before feeding the dataset to a learning agent, a very important step in machine learning is to understand this historical dataset. This includes, but is not limited to, producing a statistical summary of the data: how many data items there are, how many attributes each data item has, the range of each attribute, how the values of each attribute are distributed over that range, how many data items have missing values in their attributes, etc.
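To make the idea of a statistical summary concrete, here is a minimal Python sketch using only the standard library. It is one possible approach, not a required one. The function name summarize, the "?" missing-value marker, and the tiny sample rows are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # assumed missing-value marker; adjust it to match your data


def summarize(rows, names):
    """Return a per-attribute summary for a list of equal-length rows."""
    report = {}
    for i, name in enumerate(names):
        column = [row[i] for row in rows]
        present = [v for v in column if v != MISSING]
        info = {"count": len(column), "missing": len(column) - len(present)}
        try:
            # Treat the attribute as numeric only if every present value parses.
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = None
        if numbers:
            info.update(min=min(numbers), max=max(numbers), mean=mean(numbers))
        else:
            # Discrete attribute: report how often each value occurs.
            info["distribution"] = Counter(present)
        report[name] = info
    return report


# Tiny made-up sample, not taken from any real data set.
sample = [
    ["39", "Private"],
    ["50", "?"],
    ["?", "Private"],
    ["38", "State-gov"],
]
report = summarize(sample, ["age", "workclass"])
for name, info in report.items():
    print(name, info)
```

For a numeric attribute the sketch reports the range and the mean; for a discrete attribute it reports how often each value occurs. A fuller report might also include the median or a histogram.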

Because most machine learning algorithms can't tolerate missing values in the dataset, an important step in pre-processing the dataset is to clean up the missing values. The general approaches to dealing with missing values include:

remove the data items that have a missing value in any of their attributes;
replace the missing value with the attribute's average value over all data items;
replace the missing value with the attribute's average value over all data items that belong to the same class label.
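The three approaches above can be sketched in Python as follows. This is only an illustration under stated assumptions: rows are lists that have already been parsed, a missing value is represented by None, and the filled attribute is numeric (an average is not defined for a non-numeric attribute). The function names and the sample rows are made up.

```python
from collections import defaultdict
from statistics import mean


def drop_incomplete(rows):
    """Approach 1: keep only the rows that have no missing value."""
    return [row for row in rows if None not in row]


def fill_with_mean(rows, col):
    """Approach 2: fill column `col` with the mean over all rows."""
    avg = mean(row[col] for row in rows if row[col] is not None)
    return [_filled(row, col, avg) for row in rows]


def fill_with_class_mean(rows, col, label_col):
    """Approach 3: fill column `col` with the mean over rows of the same class.

    Raises KeyError if some class has no known value at all in `col`.
    """
    by_class = defaultdict(list)
    for row in rows:
        if row[col] is not None:
            by_class[row[label_col]].append(row[col])
    means = {label: mean(values) for label, values in by_class.items()}
    return [_filled(row, col, means[row[label_col]]) for row in rows]


def _filled(row, col, value):
    """Return a copy of `row` with `value` in `col` if that cell was missing."""
    copy = list(row)
    if copy[col] is None:
        copy[col] = value
    return copy


# Made-up rows: [age, class label]; None marks a missing age.
rows = [
    [30.0, "<=50K"],
    [None, "<=50K"],
    [40.0, "<=50K"],
    [50.0, ">50K"],
    [None, ">50K"],
    [60.0, ">50K"],
]
print(drop_incomplete(rows))
print(fill_with_mean(rows, 0))
print(fill_with_class_mean(rows, 0, 1))
```

Each function returns new rows and leaves the input untouched, so the three approaches can be compared on the same data.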

As an example, you can download the Census Income data set from the Adult Data Set page at UCI. The data set has already been split into a training data set (adult.data) and a testing data set (adult.test). Each dataset file is in CSV format. The website also includes a description of the data set.
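A minimal Python sketch for reading such a file is shown below. It assumes that fields are separated by a comma followed by optional spaces, that the files have no header row, that missing values are written as "?", and that a file may contain blank lines or a non-data line starting with "|". Check these assumptions against the data set description and the files themselves. The function name read_rows and the sample lines are made up for illustration.

```python
import csv
import io


def read_rows(lines):
    """Parse comma-separated lines into lists of stripped strings."""
    reader = csv.reader(lines, skipinitialspace=True)
    rows = []
    for fields in reader:
        fields = [field.strip() for field in fields]
        # Skip blank lines and lines that start with "|" (assumed non-data).
        if not fields or not any(fields) or fields[0].startswith("|"):
            continue
        rows.append(fields)
    return rows


# Made-up lines in the assumed layout; the real files are much larger.
text = "|a comment line\n39, State-gov, Bachelors, <=50K\n50, ?, HS-grad, >50K\n\n"
rows = read_rows(io.StringIO(text))
print(rows)
```

To read a real file, pass an open file object instead of the in-memory text, for example: with open("adult.data", newline="") as f: rows = read_rows(f).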

Your tasks:

Understand the data.

Produce a statistical summary report for each attribute.

Pre-process the data to deal with the missing values in both the training and testing datasets.

There are multiple ways to perform the above tasks:

Write your own program to process the data
Open data files in Excel, and then use Excel tools (Excel formula and/or VBA programming) to process the data
Load data into a database table using a database such as SQLite, and then use queries to process the data
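As an illustration of the database option, the following Python sketch uses the standard sqlite3 module with an in-memory database. The table name people, its three columns, and the sample rows are hypothetical; missing values are assumed to have been stored as NULL when the data was loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age REAL, workclass TEXT, income TEXT)")

# Made-up rows; None is stored as NULL and stands for a missing value.
sample = [
    (39, "Private", "<=50K"),
    (50, None, ">50K"),
    (None, "Private", "<=50K"),
    (38, "State-gov", "<=50K"),
]
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", sample)

# Summary of a numeric attribute. COUNT(age) ignores NULLs, so the
# difference from COUNT(*) is the number of missing values.
total, missing_age, min_age, max_age, avg_age = conn.execute(
    "SELECT COUNT(*), COUNT(*) - COUNT(age), MIN(age), MAX(age), AVG(age) "
    "FROM people"
).fetchone()
print(total, missing_age, min_age, max_age, avg_age)

# Value distribution of a discrete attribute.
workclass_counts = conn.execute(
    "SELECT workclass, COUNT(*) FROM people "
    "WHERE workclass IS NOT NULL GROUP BY workclass ORDER BY workclass"
).fetchall()
print(workclass_counts)

# Replace missing ages with the overall average computed above.
conn.execute("UPDATE people SET age = ? WHERE age IS NULL", (avg_age,))
remaining = conn.execute(
    "SELECT COUNT(*) FROM people WHERE age IS NULL"
).fetchone()[0]
print(remaining)
```

The same queries should work against a database file once the CSV data has been loaded into a table with matching columns.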

Optional task:

Use the Adult Data Set to build a predictive model.

On the data set's website, the prediction task is to determine whether a person makes over 50K a year. However, you can choose any discrete-valued attribute as your predictive model's target class label.

The recommended model type is a decision tree. That is, the restriction bias in this learning process limits the models considered to decision trees only.
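One common way to grow a decision tree is to pick, at each node, the attribute with the highest information gain, as in ID3-style learners. The Python sketch below shows only that selection step for discrete attributes. It is an assumption about how you might proceed, not a prescribed method, and the function names and sample rows are made up.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, col, label_col):
    """Entropy reduction obtained by splitting `rows` on column `col`."""
    labels = [row[label_col] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[label_col])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, candidate_cols, label_col):
    """Return the candidate column with the highest information gain."""
    return max(candidate_cols, key=lambda c: information_gain(rows, c, label_col))


# Made-up rows: [education, sex, class label].
rows = [
    ["Bachelors", "Male", ">50K"],
    ["Bachelors", "Female", ">50K"],
    ["HS-grad", "Male", "<=50K"],
    ["HS-grad", "Female", "<=50K"],
]
print(information_gain(rows, 0, 2))
print(information_gain(rows, 1, 2))
print(best_attribute(rows, [0, 1], 2))
```

A full tree builder would apply this selection recursively to each subset of rows and stop when a subset is pure or no attributes remain. Continuous attributes such as age would first need to be discretized or split on a threshold.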