\n",
""
],
"text/plain": [
" Id MSSubClass MSZoning LotFrontage SaleType SaleCondition SalePrice\n",
"0 1 60 RL 65.0 WD Normal 208500\n",
"1 2 20 RL 80.0 WD Normal 181500\n",
"2 3 60 RL 68.0 WD Normal 223500\n",
"3 4 70 RL 60.0 WD Abnorml 140000"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that in each example, the first feature is the ID. \n",
"This helps us identify each training example. \n",
"While this is convenient, it doesn't carry \n",
"any information for prediction purposes. \n",
"Hence we remove it from the dataset before feeding the data into the network."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "30"
}
},
"outputs": [],
"source": [
"all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preprocessing\n",
"\n",
"As stated above, we have a wide variety of datatypes. \n",
"Before we feed the data into a deep network,\n",
"we need to perform some amount of processing. \n",
"Let's start with the numerical features. \n",
"We begin by replacing missing values with the mean. \n",
"This is a reasonable strategy if features are missing at random. \n",
"To adjust them to a common scale,\n",
"we rescale them to zero mean and unit variance. \n",
"This is accomplished as follows:\n",
"\n",
"$$x \\leftarrow \\frac{x - \\mu}{\\sigma}$$\n",
"\n",
"To check that this transforms $x$ to data \n",
"with zero mean and unit variance, simply calculate \n",
"$\\mathbf{E}[(x-\\mu)/\\sigma] = (\\mu - \\mu)/\\sigma = 0$. \n",
"To check the variance, we use $\\mathbf{E}[(x-\\mu)^2] = \\sigma^2$, \n",
"so that $\\mathbf{E}[((x-\\mu)/\\sigma)^2] = \\sigma^2/\\sigma^2 = 1$ \n",
"and thus the transformed variable has unit variance. \n",
"The reason for 'normalizing' the data is that \n",
"it brings all features to the same order of magnitude. \n",
"After all, we do not know *a priori* \n",
"which features are likely to be relevant."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "6"
}
},
"outputs": [],
"source": [
"numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index\n",
"all_features[numeric_features] = all_features[numeric_features].apply(\n",
"    lambda x: (x - x.mean()) / (x.std()))\n",
"# After standardization every feature has mean 0, so filling missing\n",
"# values with 0 is the same as imputing the mean\n",
"all_features[numeric_features] = all_features[numeric_features].fillna(0)"
]
},
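{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the rescaling above, the following minimal sketch \n",
"standardizes a small made-up column (the name `toy` and its values are illustrative assumptions, \n",
"not part of the dataset). \n",
"It prints the mean and standard deviation of the observed entries after rescaling, \n",
"which should come out as approximately 0 and 1, \n",
"while the missing entry is mapped to 0, i.e. to the mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy column with one missing value\n",
"toy = pd.Series([60.0, 80.0, np.nan, 68.0, 65.0])\n",
"\n",
"# mean() and std() skip NaN by default, as in the cell above\n",
"toy_std = (toy - toy.mean()) / toy.std()\n",
"\n",
"# After standardization the mean is 0, so filling with 0 imputes the mean\n",
"toy_std = toy_std.fillna(0)\n",
"\n",
"# Mean and standard deviation over the originally observed entries\n",
"observed = toy_std[toy.notna()]\n",
"print(observed.mean(), observed.std())"
]
},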
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we deal with discrete values. \n",
"These include variables such as 'MSZoning'.\n",
"We replace them by a one-hot encoding \n",
"in the same manner as we transformed multiclass classification data \n",
"into a vector of $0$ and $1$. \n",
"For instance, 'MSZoning' assumes the values 'RL' and 'RM'. \n",
"They map into vectors $(1,0)$ and $(0,1)$ respectively. \n",
"Pandas does this automatically for us."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "7"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(2919, 331)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dummy_na=True treats a missing value as a valid category and\n",
"# creates an indicator feature for it\n",
"all_features = pd.get_dummies(all_features, dummy_na=True)\n",
"all_features.shape"
]
},
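{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `dummy_na=True` does, here is a minimal sketch on a made-up frame \n",
"(the frame `toy` and its values are illustrative assumptions, not the real data). \n",
"Each distinct string value becomes its own indicator column, \n",
"and missing values get a separate indicator column of their own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame with two categories and one missing value\n",
"toy = pd.DataFrame({'MSZoning': ['RL', 'RM', np.nan, 'RL']})\n",
"\n",
"# One indicator column per category, plus one for missing values\n",
"encoded = pd.get_dummies(toy, dummy_na=True)\n",
"\n",
"print(list(encoded.columns))\n",
"print(encoded.astype(int).values)"
]
},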
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that this conversion increases the number of features \n",
"from 79 to 331. \n",
"Finally, via the `values` attribute,\n",
" we can extract the NumPy format from the Pandas dataframe \n",
" and convert it into MXNet's native NDArray representation for training."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "9"
}
},
"outputs": [],
"source": [
"n_train = train_data.shape[0]\n",
"train_features = nd.array(all_features[:n_train].values)\n",
"test_features = nd.array(all_features[n_train:].values)\n",
"train_labels = nd.array(train_data.SalePrice.values).reshape((-1, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training\n",
"\n",
"To get started we train a linear model with squared loss. \n",
"Not surprisingly, our linear model will not lead \n",
"to a competition-winning submission,\n",
"but it provides a sanity check to see whether \n",
"there's meaningful information in the data.\n",
"If we can't do better than random guessing here,\n",
"then there is a good chance \n",
"that we have a data processing bug.\n",
"And if things work, the linear model will serve as a baseline,\n",
"giving us some intuition about how close the simple model\n",
"gets to the best reported models and a sense \n",
"of how much gain we should expect from fancier models."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "13"
}
},
"outputs": [],
"source": [
"loss = gloss.L2Loss()\n",
"\n",
"def get_net():\n",
"    # A single dense layer with one output is a linear regression model\n",
"    net = nn.Sequential()\n",
"    net.add(nn.Dense(1))\n",
"    net.initialize()\n",
"    return net"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With house prices, as with stock prices, we care about relative quantities more than absolute quantities. \n",
"More concretely, we tend to care more \n",
"about the relative error $\\frac{y - \\hat{y}}{y}$ \n",
"than about the absolute error $y - \\hat{y}$. \n",
"For instance, if our prediction is off by USD 100,000\n",
"when estimating the price of a house in rural Ohio,\n",
"where the value of a typical house is USD 125,000,\n",
"then we are probably doing a horrible job. \n",
"On the other hand, if we err by this amount in Los Altos Hills, California, \n",
"this might represent a stunningly accurate prediction \n",
"(there, the median house price exceeds USD 4 million).\n",
"\n",
"One way to address this problem is to \n",
"measure the discrepancy in the logarithm of the price estimates. \n",
"In fact, this is also the official error metric\n",
"used by the competition to measure the quality of submissions. \n",
"After all, a small bound $|\\log y - \\log \\hat{y}| \\leq \\delta$ \n",
"translates into $e^{-\\delta} \\leq \\frac{\\hat{y}}{y} \\leq e^\\delta$. \n",
"This leads to the following loss function:\n",
"\n",
"$$L = \\sqrt{\\frac{1}{n}\\sum_{i=1}^n\\left(\\log y_i -\\log \\hat{y}_i\\right)^2}$$"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "11"
}
},
"outputs": [],
"source": [
"def log_rmse(net, features, labels):\n",
"    # To keep the logarithm stable, clip predictions below 1 to 1\n",
"    clipped_preds = nd.clip(net(features), 1, float('inf'))\n",
"    # L2Loss includes a factor of 1/2, so multiply by 2 to get the mean\n",
"    # squared error before taking the square root\n",
"    rmse = nd.sqrt(2 * loss(clipped_preds.log(), labels.log()).mean())\n",
"    return rmse.asscalar()"
]
},
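{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following sketch recomputes the metric above with plain NumPy on made-up prices \n",
"(the helper `log_rmse_np` and all numbers are illustrative assumptions). \n",
"It shows why the logarithm measures relative rather than absolute error: \n",
"the same USD 100,000 miss yields a much larger value for a house worth USD 125,000 \n",
"than for one worth USD 4 million."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def log_rmse_np(y_hat, y):\n",
"    # Hypothetical helper: RMSE between log predictions and log labels\n",
"    y_hat = np.clip(y_hat, 1, None)\n",
"    return np.sqrt(np.mean((np.log(y_hat) - np.log(y)) ** 2))\n",
"\n",
"# Both predictions are off by USD 100,000\n",
"cheap = log_rmse_np(np.array([225000.0]), np.array([125000.0]))\n",
"expensive = log_rmse_np(np.array([4100000.0]), np.array([4000000.0]))\n",
"print(cheap, expensive)"
]
},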
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike in previous sections, our training functions here \n",
"will rely on the Adam optimizer \n",
"(a slight variant on SGD that we will describe in greater detail later).\n",
"The main appeal of Adam over vanilla SGD is that, \n",
"although it does no better (and sometimes worse)\n",
"given unlimited resources for hyperparameter optimization,\n",
"people tend to find it significantly less sensitive \n",
"to the initial learning rate. \n",
"This will be covered in further detail later on \n",
"in the separate chapter on [Optimization Algorithms](../chapter_optimization/index.md)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "14"
}
},
"outputs": [],
"source": [
"def train(net, train_features, train_labels, test_features, test_labels,\n",
"          num_epochs, learning_rate, weight_decay, batch_size):\n",
"    train_ls, test_ls = [], []\n",
"    train_iter = gdata.DataLoader(gdata.ArrayDataset(\n",
"        train_features, train_labels), batch_size, shuffle=True)\n",
"    # The Adam optimization algorithm is used here\n",
"    trainer = gluon.Trainer(net.collect_params(), 'adam', {\n",
"        'learning_rate': learning_rate, 'wd': weight_decay})\n",
"    for epoch in range(num_epochs):\n",
"        for X, y in train_iter:\n",
"            with autograd.record():\n",
"                l = loss(net(X), y)\n",
"            l.backward()\n",
"            trainer.step(batch_size)\n",
"        train_ls.append(log_rmse(net, train_features, train_labels))\n",
"        if test_labels is not None:\n",
"            test_ls.append(log_rmse(net, test_features, test_labels))\n",
"    return train_ls, test_ls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## k-Fold Cross-Validation\n",
"\n",
"If you are reading in a linear fashion, \n",
"you might recall that we introduced k-fold cross-validation \n",
"in the section where we discussed how to deal \n",
"with [“Model Selection, Underfitting and Overfitting”](underfit-overfit.md). We will put this to good use to select the model design \n",
"and to adjust the hyperparameters. \n",
"We first need a function that returns \n",
"the i-th fold of the data in a k-fold cross-validation procedure. \n",
"It proceeds by slicing out the i-th segment as validation data \n",
"and returning the rest as training data. \n",
"Note that this is not the most efficient way of handling data \n",
"and we would definitely do something much smarter\n",
"if our dataset was considerably larger.\n",
"But this added complexity might obfuscate our code unnecessarily,\n",
"so we can safely omit it here owing to the simplicity of our problem."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def get_k_fold_data(k, i, X, y):\n",
"    assert k > 1\n",
"    fold_size = X.shape[0] // k\n",
"    X_train, y_train = None, None\n",
"    for j in range(k):\n",
"        idx = slice(j * fold_size, (j + 1) * fold_size)\n",
"        X_part, y_part = X[idx, :], y[idx]\n",
"        if j == i:\n",
"            X_valid, y_valid = X_part, y_part\n",
"        elif X_train is None:\n",
"            X_train, y_train = X_part, y_part\n",
"        else:\n",
"            X_train = nd.concat(X_train, X_part, dim=0)\n",
"            y_train = nd.concat(y_train, y_part, dim=0)\n",
"    return X_train, y_train, X_valid, y_valid"
]
},
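{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the slicing concrete, here is a small NumPy sketch of the same idea \n",
"(the helper `k_fold_indices` is hypothetical and is not used elsewhere in this section). \n",
"It lists which row indices fall into the validation fold and which into the training folds. \n",
"Note that, as in the function above, rows beyond `k * fold_size` are dropped \n",
"when the number of examples is not divisible by `k`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def k_fold_indices(k, i, n):\n",
"    # Hypothetical helper: row indices for fold i out of k over n examples\n",
"    assert k > 1 and 0 <= i < k\n",
"    fold_size = n // k\n",
"    idx = np.arange(k * fold_size)\n",
"    valid = idx[i * fold_size:(i + 1) * fold_size]\n",
"    train = np.concatenate([idx[:i * fold_size], idx[(i + 1) * fold_size:]])\n",
"    return train, valid\n",
"\n",
"# With 10 examples and 3 folds, each fold has 3 rows and the last row is unused\n",
"for i in range(3):\n",
"    train_idx, valid_idx = k_fold_indices(3, i, 10)\n",
"    print(i, train_idx, valid_idx)"
]
},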
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The average training and validation errors are returned \n",
"after we train $k$ times in k-fold cross-validation."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "15"
}
},
"outputs": [],
"source": [
"def k_fold(k, X_train, y_train, num_epochs,\n",
"           learning_rate, weight_decay, batch_size):\n",
"    train_l_sum, valid_l_sum = 0, 0\n",
"    for i in range(k):\n",
"        data = get_k_fold_data(k, i, X_train, y_train)\n",
"        net = get_net()\n",
"        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,\n",
"                                   weight_decay, batch_size)\n",
"        train_l_sum += train_ls[-1]\n",
"        valid_l_sum += valid_ls[-1]\n",
"        if i == 0:\n",
"            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse',\n",
"                         range(1, num_epochs + 1), valid_ls,\n",
"                         ['train', 'valid'])\n",
"        print('fold %d, train rmse: %f, valid rmse: %f' % (\n",
"            i, train_ls[-1], valid_ls[-1]))\n",
"    return train_l_sum / k, valid_l_sum / k"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Selection\n",
"\n",
"In this example, we pick an un-tuned set of hyperparameters \n",
"and leave it up to the reader to improve the model. \n",
"Finding a good choice can take quite some time, \n",
"depending on how many things one wants to optimize over.\n",
"Within reason, the k-fold cross-validation approach \n",
"is resilient against multiple testing. \n",
"However, if we were to try out an unreasonably large number of options \n",
"it might fail since we might just get lucky \n",
"on the validation split with a particular set of hyperparameters."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "16"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [