{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "i6sPnhMH0raw" }, "source": [ "# House Price Prediction\n", "\n", "Finding the right house means weighing many factors at once: lot size, building type, condition, age, and more. Linear regression is a simple tool for quantifying how features like these relate to price. In this tutorial, we'll use linear regression to model the relationship between a house's features and its sale price, then use that model to predict prices. Let's get started!" ] }, { "cell_type": "markdown", "metadata": { "id": "huwdh4Es3EZ2" }, "source": [ "## Setup\n", "\n", "The House Price Prediction Dataset contains 13 columns:\n", "\n", "| # | Column Name | Description |\n", "|----|----------------|---------------------------------------------------------------------|\n", "| 1 | Id | Record identifier. |\n", "| 2 | MSSubClass | Identifies the type of dwelling involved in the sale. |\n", "| 3 | MSZoning | Identifies the general zoning classification of the sale. |\n", "| 4 | LotArea | Lot size in square feet. |\n", "| 5 | LotConfig | Configuration of the lot. |\n", "| 6 | BldgType | Type of dwelling. |\n", "| 7 | OverallCond | Rates the overall condition of the house. |\n", "| 8 | YearBuilt | Original construction year. |\n", "| 9 | YearRemodAdd | Remodel date (same as construction date if no remodeling or additions). |\n", "| 10 | Exterior1st | Exterior covering on house. |\n", "| 11 | BsmtFinSF2 | Type 2 finished basement area in square feet. 
|\n", "| 12 | TotalBsmtSF | Total square feet of basement area. |\n", "| 13 | SalePrice | Sale price of the house (the target to be predicted). |\n", "\n", "\n", "Run the cell below to download the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W5Y7ZPKz0nf8", "outputId": "786a4293-3f63-4065-c005-4d5b11d8c15b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2024-08-23 11:26:23-- https://docs.google.com/spreadsheets/d/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs/export?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366\n", "Resolving docs.google.com (docs.google.com)... 172.217.164.14, 2607:f8b0:4025:803::200e\n", "Connecting to docs.google.com (docs.google.com)|172.217.164.14|:443... connected.\n", "HTTP request sent, awaiting response... 307 Temporary Redirect\n", "Location: https://doc-08-30-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/pq9qv18vh410aflmm0u38bfc4o/1724412380000/115253717745408081083/*/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366 [following]\n", "Warning: wildcards not supported in HTTP.\n", "--2024-08-23 11:26:24-- https://doc-08-30-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/pq9qv18vh410aflmm0u38bfc4o/1724412380000/115253717745408081083/*/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366\n", "Resolving doc-08-30-sheets.googleusercontent.com (doc-08-30-sheets.googleusercontent.com)... 172.217.12.1, 2607:f8b0:4025:815::2001\n", "Connecting to doc-08-30-sheets.googleusercontent.com (doc-08-30-sheets.googleusercontent.com)|172.217.12.1|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: unspecified [text/csv]\n", "Saving to: ‘dataset.csv’\n", "\n", "dataset.csv [ <=> ] 171.41K --.-KB/s in 0.03s \n", "\n", "2024-08-23 11:26:24 (5.15 MB/s) - ‘dataset.csv’ saved [175524]\n", "\n" ] } ], "source": [ "!wget -O dataset.csv \"https://docs.google.com/spreadsheets/d/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs/export?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366\"" ] }, { "cell_type": "markdown", "metadata": { "id": "-r572AXndHZn" }, "source": [ "### Importing libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Vrb6A7dQdGK2" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import mean_absolute_error, mean_squared_error\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import MinMaxScaler, OneHotEncoder" ] }, { "cell_type": "markdown", "metadata": { "id": "KLdgiCGJ9dN-" }, "source": [ "### Create DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NhdMHsM4831s" }, "outputs": [], "source": [ "# wget saved the file to the current working directory (/content on Colab)\n", "file_path = 'dataset.csv'\n", "df = pd.read_csv(file_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "KtYExjQRHBQo" }, "source": [ "## Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x7n5GmAWwoO_" }, "outputs": [], "source": [ "# Id only counts the records, so it is dropped before modelling\n", "df = df.drop(columns=['Id'])\n", "# Fill missing SalePrice values with the column mean\n", "df['SalePrice'] = df['SalePrice'].fillna(df['SalePrice'].mean())\n", "df_copy = df.copy()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2LrDmZryG9Pm" }, "outputs": [], "source": [ "# Drop the remaining rows that contain missing values\n", "new_data = df_copy.dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Q0PeWhNYHTdl", "outputId": "f3988361-0a4c-4155-dbf2-d852e1d90ccd" }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Categorical variables:\n", "['MSZoning', 'LotConfig', 'BldgType', 'Exterior1st']\n", "No. of. categorical features: 4\n" ] } ], "source": [ "s = (new_data.dtypes == 'object')\n", "object_cols = list(s[s].index)\n", "print(\"Categorical variables:\")\n", "print(object_cols)\n", "print('No. of. categorical features: ',\n", "\tlen(object_cols))" ] }, { "cell_type": "markdown", "metadata": { "id": "qzkrtVtEd2ab" }, "source": [ "### One Hot Encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 307 }, "id": "HwH4nB9QILS8", "outputId": "a1ccab90-5d6d-4c09-b5b8-c5d9a9b831a8" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py:975: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n" ] }, { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_final" }, "text/html": [ "\n", "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>MSSubClass</th><th>LotArea</th><th>OverallCond</th><th>YearBuilt</th><th>YearRemodAdd</th><th>BsmtFinSF2</th><th>TotalBsmtSF</th><th>SalePrice</th><th>MSZoning_C (all)</th><th>MSZoning_FV</th><th>...</th><th>Exterior1st_CemntBd</th><th>Exterior1st_HdBoard</th><th>Exterior1st_ImStucc</th><th>Exterior1st_MetalSd</th><th>Exterior1st_Plywood</th><th>Exterior1st_Stone</th><th>Exterior1st_Stucco</th><th>Exterior1st_VinylSd</th><th>Exterior1st_Wd Sdng</th><th>Exterior1st_WdShing</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>60</td><td>8450</td><td>5</td><td>2003</td><td>2003</td><td>0.0</td><td>856.0</td><td>208500.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>1</th><td>20</td><td>9600</td><td>8</td><td>1976</td><td>1976</td><td>0.0</td><td>1262.0</td><td>181500.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>60</td><td>11250</td><td>5</td><td>2001</td><td>2002</td><td>0.0</td><td>920.0</td><td>223500.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>70</td><td>9550</td><td>5</td><td>1915</td><td>1970</td><td>0.0</td><td>756.0</td><td>140000.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>60</td><td>14260</td><td>5</td><td>2000</td><td>2000</td><td>0.0</td><td>1145.0</td><td>250000.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 38 columns</p>\n", "