{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "i6sPnhMH0raw" }, "source": [ "# House Price Prediction\n", "\n", "Think of finding the perfect house as a complex journey involving negotiations, research, and decision-making. Now, imagine having a smart guide that helps you navigate through this maze by analyzing data and predicting outcomes. Linear regression is that guide. In this tutorial, we'll explore how linear regression helps us understand and predict relationships in data, just like finding the ideal house by matching features with price. Let's get started and discover how this powerful tool can simplify your data-driven decisions!" ] }, { "cell_type": "markdown", "metadata": { "id": "huwdh4Es3EZ2" }, "source": [ "## Setup\n", "\n", "The House Price Prediction Dataset contains 13 features\n", "\n", "| # | Column Name | Description |\n", "|----|----------------|---------------------------------------------------------------------|\n", "| 1 | Id | To count the records. |\n", "| 2 | MSSubClass | Identifies the type of dwelling involved in the sale. |\n", "| 3 | MSZoning | Identifies the general zoning classification of the sale. |\n", "| 4 | LotArea | Lot size in square feet. |\n", "| 5 | LotConfig | Configuration of the lot |\n", "| 6 | BldgType | Type of dwelling |\n", "| 7 | OverallCond | Rates the overall condition of the house |\n", "| 8 | YearBuilt | Original construction year |\n", "| 9 | YearRemodAdd | Remodel date (same as construction date if no remodeling or additions). |\n", "| 10 | Exterior1st | Exterior covering on house |\n", "| 11 | BsmtFinSF2 | Type 2 finished square feet. |\n", "| 12 | TotalBsmtSF | Total square feet of basement area |\n", "| 13 | SalePrice | To be predicted |\n", "\n", "\n", "Run the cell below to download dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W5Y7ZPKz0nf8", "outputId": "786a4293-3f63-4065-c005-4d5b11d8c15b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2024-08-23 11:26:23-- https://docs.google.com/spreadsheets/d/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs/export?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366\n", "Resolving docs.google.com (docs.google.com)... 172.217.164.14, 2607:f8b0:4025:803::200e\n", "Connecting to docs.google.com (docs.google.com)|172.217.164.14|:443... connected.\n", "HTTP request sent, awaiting response... 307 Temporary Redirect\n", "Location: https://doc-08-30-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/pq9qv18vh410aflmm0u38bfc4o/1724412380000/115253717745408081083/*/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366 [following]\n", "Warning: wildcards not supported in HTTP.\n", "--2024-08-23 11:26:24-- https://doc-08-30-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/pq9qv18vh410aflmm0u38bfc4o/1724412380000/115253717745408081083/*/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366\n", "Resolving doc-08-30-sheets.googleusercontent.com (doc-08-30-sheets.googleusercontent.com)... 172.217.12.1, 2607:f8b0:4025:815::2001\n", "Connecting to doc-08-30-sheets.googleusercontent.com (doc-08-30-sheets.googleusercontent.com)|172.217.12.1|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: unspecified [text/csv]\n", "Saving to: ‘dataset.csv’\n", "\n", "dataset.csv [ <=> ] 171.41K --.-KB/s in 0.03s \n", "\n", "2024-08-23 11:26:24 (5.15 MB/s) - ‘dataset.csv’ saved [175524]\n", "\n" ] } ], "source": [ "!wget -O dataset.csv \"https://docs.google.com/spreadsheets/d/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs/export?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366\"" ] }, { "cell_type": "markdown", "metadata": { "id": "-r572AXndHZn" }, "source": [ "### Importing libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Vrb6A7dQdGK2" }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.model_selection import train_test_split\n", "import numpy as np\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.metrics import mean_absolute_error\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "id": "KLdgiCGJ9dN-" }, "source": [ "### Create DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NhdMHsM4831s" }, "outputs": [], "source": [ "file_path = '/content/dataset.csv'\n", "df = pd.read_csv(file_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "KtYExjQRHBQo" }, "source": [ "# Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x7n5GmAWwoO_" }, "outputs": [], "source": [ "df.drop(['Id'],axis=1,inplace=True)\n", "df['SalePrice'].fillna(df['SalePrice'].mean(), inplace=True)\n", "df_copy=df.copy()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2LrDmZryG9Pm" }, "outputs": [], "source": [ "new_data = df_copy.dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Q0PeWhNYHTdl", "outputId": "f3988361-0a4c-4155-dbf2-d852e1d90ccd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Categorical variables:\n", "['MSZoning', 'LotConfig', 'BldgType', 'Exterior1st']\n", "No. of. categorical features: 4\n" ] } ], "source": [ "s = (new_data.dtypes == 'object')\n", "object_cols = list(s[s].index)\n", "print(\"Categorical variables:\")\n", "print(object_cols)\n", "print('No. of. categorical features: ',\n", "\tlen(object_cols))" ] }, { "cell_type": "markdown", "metadata": { "id": "qzkrtVtEd2ab" }, "source": [ "### One Hot Encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 307 }, "id": "HwH4nB9QILS8", "outputId": "a1ccab90-5d6d-4c09-b5b8-c5d9a9b831a8" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py:975: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n" ] }, { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_final" }, "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MSSubClassLotAreaOverallCondYearBuiltYearRemodAddBsmtFinSF2TotalBsmtSFSalePriceMSZoning_C (all)MSZoning_FV...Exterior1st_CemntBdExterior1st_HdBoardExterior1st_ImStuccExterior1st_MetalSdExterior1st_PlywoodExterior1st_StoneExterior1st_StuccoExterior1st_VinylSdExterior1st_Wd SdngExterior1st_WdShing
06084505200320030.0856.0208500.00.00.0...0.00.00.00.00.00.00.01.00.00.0
12096008197619760.01262.0181500.00.00.0...0.00.00.01.00.00.00.00.00.00.0
260112505200120020.0920.0223500.00.00.0...0.00.00.00.00.00.00.01.00.00.0
37095505191519700.0756.0140000.00.00.0...0.00.00.00.00.00.00.00.01.00.0
460142605200020000.01145.0250000.00.00.0...0.00.00.00.00.00.00.01.00.00.0
\n", "

5 rows × 38 columns

\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "text/plain": [ " MSSubClass LotArea OverallCond YearBuilt YearRemodAdd BsmtFinSF2 \\\n", "0 60 8450 5 2003 2003 0.0 \n", "1 20 9600 8 1976 1976 0.0 \n", "2 60 11250 5 2001 2002 0.0 \n", "3 70 9550 5 1915 1970 0.0 \n", "4 60 14260 5 2000 2000 0.0 \n", "\n", " TotalBsmtSF SalePrice MSZoning_C (all) MSZoning_FV ... \\\n", "0 856.0 208500.0 0.0 0.0 ... \n", "1 1262.0 181500.0 0.0 0.0 ... \n", "2 920.0 223500.0 0.0 0.0 ... \n", "3 756.0 140000.0 0.0 0.0 ... \n", "4 1145.0 250000.0 0.0 0.0 ... \n", "\n", " Exterior1st_CemntBd Exterior1st_HdBoard Exterior1st_ImStucc \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " Exterior1st_MetalSd Exterior1st_Plywood Exterior1st_Stone \\\n", "0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " Exterior1st_Stucco Exterior1st_VinylSd Exterior1st_Wd Sdng \\\n", "0 0.0 1.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 1.0 0.0 \n", "3 0.0 0.0 1.0 \n", "4 0.0 1.0 0.0 \n", "\n", " Exterior1st_WdShing \n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 \n", "\n", "[5 rows x 38 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "OH_encoder = OneHotEncoder(sparse=False)\n", "OH_cols = pd.DataFrame(OH_encoder.fit_transform(new_data[object_cols]))\n", "OH_cols.index = new_data.index\n", "OH_cols.columns = OH_encoder.get_feature_names_out()\n", "df_final = new_data.drop(object_cols, axis=1)\n", "df_final = pd.concat([df_final, OH_cols], axis=1)\n", "df_final.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "eZDFB6FZv1k-" }, "source": [ "### Scaling" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ve81LFh9wEs0" }, "outputs": [], "source": [ "scaler = MinMaxScaler()\n", "df_normalized = pd.DataFrame(scaler.fit_transform(df_final), columns=df_final.columns)" ] }, { "cell_type": "markdown", "metadata": { "id": "TCfu66R_cyAJ" }, "source": [ "# Data splitting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wiiNfRQRgR1G" }, "outputs": [], "source": [ "X, y = df_normalized.loc[:, df_normalized.columns != 'SalePrice'], df_normalized[['SalePrice']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gMCC4I4bgjf2" }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "o5k0PNG_pMGY", "outputId": "59927321-5524-4bc6-cf99-b579566c6527" }, "outputs": [ { "data": { "text/plain": [ "((2330, 37), (583, 37), (2330, 1), (583, 1))" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape, X_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "yKhUdAyntrv8", "outputId": "d60ccb04-1f08-4cf2-8503-0f5d980322c3" }, "outputs": [ { "data": { "text/plain": [ "(755000.0, 34900.0)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label_max, label_min = df[\"SalePrice\"].max(), df[\"SalePrice\"].min()\n", "label_max,label_min" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7c7wVOvwtk7V" }, "outputs": [], "source": [ "y_reg_test = y_test * (label_max - label_min) + label_min\n", "feature_names=X_train.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "egogzbyuca1N" }, "source": [ "# Model Building" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jC2CYGELsB-G" }, "outputs": [], "source": [ "def init_params(n_features):\n", " w = np.zeros((n_features, 1)) # no of weights should be 1 + no. of featuers, to account for the bias term\n", " b = 0\n", " return w,b\n", "\n", "def log_likelihood(X, y,w,b):\n", " # Assume variance = 1\n", " N = len(y)\n", " y_pred = np.dot(X, w) + b\n", " residual = y - y_pred\n", " log_likelihood = -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum(residual ** 2)\n", " return log_likelihood\n", "\n", "def update(X, y,w,b,n_iterations=400,learning_rate=0.001):\n", " # Assume variance = 1\n", " N = len(y)\n", " for i in range(n_iterations):\n", " y_pred = np.dot(X, w) + b\n", "\n", " dw = np.dot(X.T, (y - y_pred)) / N\n", " db = np.sum(y - y_pred) / N\n", "\n", " w = w - learning_rate * dw\n", " b = b - learning_rate * db\n", "\n", " return w,b\n", "\n", "def train_model(X, y):\n", "\n", " w,b=init_params(X.shape[1])\n", "\n", " w,b=update(X, y,w,b)\n", "\n", " return w,b\n", "\n", "def predict(X,w,b):\n", " return np.dot(X, w) + b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3OPBEw2KrAH7" }, "outputs": [], "source": [ "W, b = train_model(X_train.values,y_train.values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GfLO0M_EvFxU" }, "outputs": [], "source": [ "y_pred = predict(X_test,W,b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jYP5Zc18vNqm" }, "outputs": [], "source": [ "y_pred=y_pred * (label_max - label_min) + label_min\n", "mae=mean_absolute_error(y_reg_test,y_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9XYUL6hpvYTA", "outputId": "422ec562-f641-48d1-ddb6-4211fefd66ad" }, "outputs": [ { "data": { "text/plain": [ "858342.4210579091" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mae" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PxShUdVuv5jn" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 4 }