CSCI 479 -- Exam Preparation
Final Exam: (Time: 9:00 - 12:00, 17 April 2026, Friday;
Location: Building 315/Room 216)
The final is based on:
Topics covered:
- Introduction
- Data
- Why is it important to understand the data collection?
- Attribute types (and later, how they affect the learning algorithms)
- Common issues and how they are handled: missing values, noisy data
- Data aggregation, reduction and transformation
- Information Based Learning
- product: decision tree
- how to build it: recursive algorithm, greedy algorithm
- information in "information based": entropy, Gini index, misclassification
rate
- how to use it
- issues (and how to handle them)
- bad split situation (use information gain ratio instead of
information gain)
- overfitting (pre-pruning, post-pruning)
- continuous descriptive attribute (discretization - equal
width, equal depth, information based)
- continuous target attribute (regression tree - using variance
instead of entropy)
- model ensembles (decision forest instead of tree)
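As a refresher on the "information" part, the entropy and information gain computations at the heart of tree building can be sketched in a few lines of Python (a toy illustration with made-up data, not the course's reference implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction achieved by splitting on the given attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Toy dataset: attribute 0 perfectly separates the classes,
# attribute 1 carries no information.
rows = [("sunny", "hot"), ("sunny", "cold"), ("rainy", "hot"), ("rainy", "cold")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # 1.0 bit
print(information_gain(rows, labels, 1))  # 0.0 bits
```

The greedy tree-building step is then just "pick the attribute with the highest gain, split, and recurse on each partition."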
- Similarity Based Learning
- feature space (data format)
- similarity (distance) metrics
- how to use it: nearest neighbor algorithm
- issues (and how to handle them)
- noisy data (K nearest neighbor algorithm)
- feature difference (normalization)
- continuous target attribute (average instead of class label)
- too many descriptive attributes (dimension reduction)
- improve efficiency - indexing (such as K-D tree), pre-screening
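The core of the k nearest neighbor algorithm is small enough to sketch directly. A minimal Python version using Euclidean distance and majority vote (toy clusters, no normalization or indexing):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two toy clusters: class "a" near the origin, class "b" near (5, 5).
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.5, 0.5), k=3))  # "a"
print(knn_predict(train, (5.5, 5.5), k=3))  # "b"
```

Using k > 1 is exactly the noisy-data fix noted above; for a continuous target, the majority vote would be replaced by an average of the k neighbors' values.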
- Probability Based Learning
- some statistics concepts
- how to get the probabilities
- Bayes' Theorem
- how to use probability based classifier in general?
- Naive Bayes classifier, and its BIG assumption
- Bayesian Belief Networks
- what does it look like?
- Markov blanket
- how to use it?
- how to build it?
- issues (and how to handle them)
- insufficient data (smoothing)
- continuous descriptive attribute (probability density function
instead of a fixed set of probabilities)
- missing parent attribute values - hidden variables
(consider all possible values and their probabilities)
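A minimal Naive Bayes sketch in Python, including add-one (Laplace) smoothing for the insufficient-data issue noted above; the weather-style data is made up for illustration:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors and per-attribute conditional counts."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (attr_index, class) -> value counts
    values = defaultdict(set)      # attr_index -> observed attribute domain
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
            values[i].add(v)
    return priors, counts, values, len(labels)

def predict_nb(model, row):
    """Pick the class maximizing P(class) * product of P(value | class),
    using the naive conditional-independence assumption."""
    priors, counts, values, n = model
    best, best_score = None, float("-inf")
    for y, ny in priors.items():
        score = ny / n
        for i, v in enumerate(row):
            # add-one (Laplace) smoothing over the attribute's domain
            score *= (counts[(i, y)][v] + 1) / (ny + len(values[i]))
        if score > best_score:
            best, best_score = y, score
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rainy", "cool")))  # "yes"
```

The product over attributes is exactly the BIG assumption: descriptive attributes are treated as conditionally independent given the class.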
- Error Based Learning
- multivariable linear regression
- gradient descent algorithm
- Artificial Neural networks (modelling non-linear functions)
- Neuron model
- Neural Network model
- Model applying (Forward propagation)
- Model learning (Backward propagation)
- issues (and how to handle them)
- categorical target attribute (find the decision boundary)
- non-linear relationships (use basis functions)
- multinomial (instead of binary) output (multiple one-vs-all models)
- computationally too expensive dot product operation
(use kernel trick)
- overfitting (stop training at an appropriate time)
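The gradient descent algorithm for the simplest (one-variable) linear regression case can be sketched as follows; the learning rate and epoch count are arbitrary illustration values:

```python
def gradient_descent(xs, ys, lr=0.01, epochs=2000):
    """Fit y = w*x + b by minimizing mean squared error with batch
    gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradients of MSE = (1/n) * sum((w*x + b - y)^2)
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1, so the fit should recover w ~ 2, b ~ 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = gradient_descent(xs, ys)
print(round(w, 2), round(b, 2))  # approximately 2.0 1.0
```

Backpropagation in a neural network is the same idea applied layer by layer via the chain rule, with the error computed by a forward pass first.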
- Classification Model Evaluation
- statistical measurements based on the confusion matrix
- Receiver Operating Characteristic Curve and ROC Index
- Model Evaluation after Deployment
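A quick sketch of the standard confusion-matrix measurements (precision, recall, accuracy) for a binary problem; the spam/ham labels are just an example:

```python
def confusion_metrics(actual, predicted, positive):
    """Precision, recall, and accuracy from a binary confusion matrix."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

actual    = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
print(confusion_metrics(actual, predicted, positive="spam"))
# precision 2/3, recall 2/3, accuracy 0.6
```

The ROC curve plots the true positive rate (recall) against the false positive rate as the classifier's decision threshold varies; the ROC index is the area under that curve.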
- Cluster Analysis
- similarity based and density based
- Clustering algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering
- Cluster Validity Evaluation
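A bare-bones K-means (Lloyd's algorithm) sketch; it uses a naive deterministic initialization (the first k points) purely so the toy example is reproducible, whereas real implementations use random or k-means++ seeding:

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster; repeat."""
    centroids = list(points[:k])  # naive deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # mean of each cluster; keep the old centroid if a cluster is empty
        centroids = [tuple(sum(col) / len(col) for col in zip(*c)) if c
                     else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

# Two well-separated toy clusters; centroids should land near
# (0.5, 0.5) and (5.5, 5.5).
points = [(0, 0), (5, 5), (1, 1), (6, 6), (0, 1), (5, 6), (1, 0), (6, 5)]
print(sorted(kmeans(points, 2)))
```

Hierarchical and density-based clustering replace this centroid loop with, respectively, repeated merging of the closest clusters and region-growing from dense neighborhoods.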
- Outlier Detection
- Definition of Outlier
- detection methods: graphical and statistical based; distance based;
model based
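One of the statistical detection methods, flagging points more than a chosen number of standard deviations from the mean (a z-score rule), can be sketched as follows; the threshold of 2 and the data are arbitrary illustration values:

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag values whose z-score (distance from the mean, measured in
    population standard deviations) exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 50]
print(zscore_outliers(data))  # [50]
```

Distance-based methods instead flag points with few neighbors within a radius, and model-based methods flag points the fitted model assigns low probability.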
- Association Analysis
- data format - market-basket model
- find frequent itemsets
- A-Priori algorithm, hash tree
- tree projection
- ECLAT, vertical database format
- FP growth algorithm, FP-tree, conditional FP-tree
- rule generation for A-Priori algorithm
- Multiple Minimum Support and its effect on A-Priori algorithm
- Association Rule Evaluation - objective statistical based measures
(usually use contingency table) and subjective measures
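A compact sketch of the A-Priori level-wise search (without the hash-tree optimization), run on a made-up market-basket example; it relies on the key pruning property that a (k+1)-itemset can only be frequent if all of its k-subsets are frequent:

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return every frequent itemset with its support count, level by level."""
    freq = {}
    # level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = [frozenset([i]) for i in items
               if sum(i in b for b in baskets) >= min_support]
    k = 1
    while current:
        for s in current:
            freq[s] = sum(s <= b for b in baskets)
        # candidate generation: join frequent k-itemsets...
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # ...then prune by subsets and count support against the baskets
        current = [c for c in candidates
                   if all(frozenset(sub) in freq for sub in combinations(c, k))
                   and sum(c <= b for b in baskets) >= min_support]
        k += 1
    return freq

baskets = [frozenset(b) for b in
           [{"bread", "milk"}, {"bread", "milk", "eggs"},
            {"bread", "eggs"}, {"milk", "eggs"}]]
print(apriori(baskets, min_support=2))
```

On this data every single item and every pair is frequent, but the triple {bread, milk, eggs} appears in only one basket and is pruned. Rule generation then splits each frequent itemset into antecedent and consequent and keeps rules with sufficient confidence.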
Past exams:
Midterm of Fall 2022
Final Exam of Fall 2022