(a) If a classifier $C_{1}$ performs better than another classifier $C_{2}$ on the training set, then $C_{1}$ performs better than $C_{2}$ on the test set. True or false? Justify your answer.
Answer. [Write your solution here. Add cells as needed.]
(b) If a class of models $H_{1}$ (e.g. linear functions) is a subset of another class $H_{2}$ (e.g. polynomials), then we should always use the larger class $H_{2}$ so as to achieve better generalization performance. True or false? Justify your answer.
Answer. [Write your solution here. Add cells as needed.]
(c) Load the diabetes dataset using the sklearn.datasets.load_diabetes() function. How many examples and features are in the dataset?
Answer. [Write your solution here. Add cells as needed.]
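A minimal starter sketch (one possible way to approach this, assuming scikit-learn is installed):

```python
# Load the diabetes dataset and inspect its dimensions.
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, y = data.data, data.target
print(X.shape)  # (442, 10): 442 examples, 10 features
```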
(d) Randomly split the data into a training set and a test set, with 70\% in the training set, and 30\% in the test set. What are the average feature vectors for the training set and the test set?
Answer. [Write your solution here. Add cells as needed.]
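One possible sketch, reusing `X` and `y` from part (c); the `random_state` value is an arbitrary choice:

```python
# Split 70%/30% and report the average feature vector of each split.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.mean(axis=0))  # average feature vector, training set
print(X_test.mean(axis=0))   # average feature vector, test set
```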
(e) Train a $k$NN model for $k=1$ to 30. Plot the training set and test set mean squared errors (MSEs) against $k$. Is a model's training set MSE a good estimate of its test set MSE?
Answer. [Write your solution here. Add cells as needed.]
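A minimal sketch, assuming the split from part (d):

```python
# Train kNN regressors for k = 1..30 and record training/test MSEs.
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

ks = range(1, 31)
train_mses, test_mses = [], []
for k in ks:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mses.append(mean_squared_error(y_train, knn.predict(X_train)))
    test_mses.append(mean_squared_error(y_test, knn.predict(X_test)))

plt.plot(ks, train_mses, label='training MSE')
plt.plot(ks, test_mses, label='test MSE')
plt.xlabel('$k$')
plt.ylabel('MSE')
plt.legend()
plt.show()
```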
(f) Which $k$ has the smallest test set MSE and what is its test set MSE?
Answer. [Write your solution here. Add cells as needed.]
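A short sketch, reusing `ks` and `test_mses` from part (e):

```python
import numpy as np

best = int(np.argmin(test_mses))
print(ks[best], test_mses[best])  # best k and its test set MSE
```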
(g) Perform 10-fold cross-validation to choose the value of $k$ for $k$NN among values from 1 to 30. If the model trained with the $i$-th fold left out has an MSE $e_{i}$ on the $i$-th fold, then the 10-fold CV MSE is $\frac{1}{10} \sum_{i} e_{i}$. What is the $k$ value with the smallest 10-fold CV MSE? What is the test set MSE of the chosen $k$NN model?
Answer. [Write your solution here. Add cells as needed.]
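A minimal sketch, assuming the training set from part (d). Note that scikit-learn's `cross_val_score` with `scoring='neg_mean_squared_error'` returns negated MSEs, so the sign is flipped back:

```python
# 10-fold CV over k = 1..30, then evaluate the chosen k on the test set.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

cv_mses = []
for k in range(1, 31):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k),
                             X_train, y_train, cv=10,
                             scoring='neg_mean_squared_error')
    cv_mses.append(-scores.mean())

best_k = int(np.argmin(cv_mses)) + 1
knn = KNeighborsRegressor(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, mean_squared_error(y_test, knn.predict(X_test)))
```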
(h) Plot the 10-fold CV MSEs and the test set MSEs against $k$. Are the 10-fold CV MSEs better estimates for the test set MSEs as compared to the training set MSEs?
Answer. [Write your solution here. Add cells as needed.]
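A short sketch, reusing `cv_mses` from part (g) and `test_mses` from part (e):

```python
import matplotlib.pyplot as plt

ks = range(1, 31)
plt.plot(ks, cv_mses, label='10-fold CV MSE')
plt.plot(ks, test_mses, label='test MSE')
plt.xlabel('$k$')
plt.ylabel('MSE')
plt.legend()
plt.show()
```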
In this question, we compare the bias and variance of two simple models on a toy problem.
In the model selection lecture, for simplicity, we discussed the bias-variance decomposition for a single example. Specifically, consider an arbitrary regression algorithm. Let $D = \{(\mathbf{x}_{1}, y_{1}), \ldots, (\mathbf{x}_{n}, y_{n})\}$ be a random set of training examples drawn independently from a data-generating distribution $P$, let $(X, Y)$ be a random test example with $X$ fixed to be $\mathbf{x}$, and let $Y'$ be the output predicted on $\mathbf{x}$ by the model trained on $D$.
\begin{equation} \overbrace{\mathbb{E}\bigl((Y'-Y)^2\bigr)}^{\text{expected prediction error}} = \overbrace{\mathbb{E}\bigl((Y' - \mathbb{E}(Y'))^2\bigr)}^{\text{variance}} + \overbrace{\bigl(\mathbb{E}(Y') - \mathbb{E}(Y)\bigr)^2}^{\text{bias (squared)}} + \overbrace{\mathbb{E}\bigl((Y - \mathbb{E}(Y))^2\bigr)}^{\text{irreducible noise}}. \end{equation}

We can often view a regression algorithm as a function $y = f(\mathbf{x}, D)$: it accepts a training set $D$ and a test example $\mathbf{x}$, and outputs a prediction $y$ for $\mathbf{x}$.
In general, we are not interested in the expected prediction error (EPE), squared bias (or simply bias), and variance of a regression algorithm for a single fixed input. Instead, we are interested in the EPE, bias, and variance of a regression algorithm for a random test example. Informally, this simply involves taking another expectation with respect to the random test input $X$ on both sides of the decomposition above. To be precise, we have \begin{align} \text{EPE} &= \mathbb{E}_{X, Y, D} \bigl((Y'-Y)^2\bigr), \\ \text{bias} &= \mathbb{E}_{X} \Bigl(\bigl(\mathbb{E}_{D}(Y' \mid X) - \mathbb{E}_{Y}(Y \mid X)\bigr)^2\Bigr), \\ \text{variance} &= \mathbb{E}_{X} \Bigl(\mathbb{E}_{D} \bigl((Y'- \mathbb{E}_{D}(Y' \mid X))^2 \mid X\bigr)\Bigr). \end{align} Note that $Y' = f(X, D)$ in the above equations. Here the notation $\mathbb{E}_{A}(B \mid C)$ denotes the conditional expectation of $B$ given $C$: it is computed by treating $C$ as given and averaging over the randomness of $B$ due to $A$.
The above equations allow us to estimate the EPE, bias, and variance by sampling. For example, a naive method for estimating the EPE is to draw many samples of $X, Y, D$ and average the resulting values of $(f(X, D) - Y)^{2}$.
Assume that the input $X$ and output $Y$ follow a distribution in which $X \sim U[0, 1]$ and $Y = 3X + 1 + \epsilon$ with $\epsilon \sim N(0, 1)$. We also assume that the training examples are i.i.d.
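For instance, sampling from this distribution can be wrapped in a small helper (a minimal sketch; the function name and seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dataset(n):
    """Draw n i.i.d. examples with X ~ U[0, 1] and Y = 3X + 1 + N(0, 1) noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 3.0 * x + 1.0 + rng.normal(0.0, 1.0, size=n)
    return x, y
```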
(a) Consider the mean regression method, which learns a model that always predicts the mean output of the training examples. Propose an efficient procedure to estimate the expected prediction error, bias, and variance of this method by drawing 1,000 training sets of 10 examples each, and a random test set of 10,000 examples labeled with both the random output $Y$ and the expected output $\mathbb{E}(Y \mid X)$.
Answer. [Write your solution here. Add cells as needed.]
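One possible sketch of such a procedure, reusing `draw_dataset` from above: a single test set of 10,000 inputs is shared across all 1,000 training sets and labeled with both the noisy output $Y$ and the expected output $\mathbb{E}(Y \mid X) = 3X + 1$:

```python
import numpy as np

n_sets, n_train, n_test = 1000, 10, 10_000
x_test, y_test = draw_dataset(n_test)
ey_test = 3.0 * x_test + 1.0  # expected output E(Y | X)

# Mean regression predicts the training-set mean output for every input,
# so each training set yields one constant prediction over all test inputs.
preds = np.empty((n_sets, n_test))
for i in range(n_sets):
    _, y_train = draw_dataset(n_train)
    preds[i, :] = y_train.mean()

epe = np.mean((preds - y_test) ** 2)   # averages over D, X, and Y
mean_pred = preds.mean(axis=0)         # E_D(Y' | X) for each test input
bias = np.mean((mean_pred - ey_test) ** 2)
variance = np.mean((preds - mean_pred) ** 2)
print(epe, bias, variance)
```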
(b) Similarly, estimate the expected prediction error, bias, and variance for linear regression.
Answer. [Write your solution here. Add cells as needed.]
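A minimal sketch, reusing `draw_dataset`, `n_sets`, `n_train`, `n_test`, `x_test`, `y_test`, and `ey_test` from part (a); `np.polyfit` with `deg=1` stands in for ordinary least-squares linear regression:

```python
import numpy as np

preds_lr = np.empty((n_sets, n_test))
for i in range(n_sets):
    x_train, y_train = draw_dataset(n_train)
    slope, intercept = np.polyfit(x_train, y_train, deg=1)  # least-squares line
    preds_lr[i, :] = slope * x_test + intercept

epe = np.mean((preds_lr - y_test) ** 2)
mean_pred = preds_lr.mean(axis=0)
bias = np.mean((mean_pred - ey_test) ** 2)
variance = np.mean((preds_lr - mean_pred) ** 2)
print(epe, bias, variance)
```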
(c) Comment on the results that you obtain for the two methods.
Answer. [Write your solution here. Add cells as needed.]