Data Science

US House Price Prediction

Building a house is one of the most challenging undertakings in our lives. Before construction, it is possible to estimate the price of your house based on the prices of previous houses. The factors that most strongly affect the house price include the total number of rooms (bed, bath, etc.) and the area of the land. With this estimate, we can plan the budget needed to construct the house.

In this guide, we will see how to predict the price of US houses using Machine Learning in Python. First, we discuss the dataset that we use and then preprocess the data. After that, we visualize the attributes present in the dataset and apply different Machine Learning algorithms to the training dataset (Seattle, Washington, August 2022 – December 2022). Finally, we end this guide by predicting the price of the houses present in the Test dataset. Before implementing this project, we need to understand the Machine Learning terminology used in it.

Regression

In Machine Learning, if you are working with numeric data, you need to understand Regression. Regression is a Supervised Learning technique used to understand the relationship between independent attributes and a dependent attribute (class label/target). The machine predicts the house price by learning from each record present in the dataset. Hence, it is supervised learning.

For example, in our scenario, the independent attributes are the number of beds, number of baths, size of the land, zip code, etc. Based on these, we are able to predict the house price. So, these are the independent attributes, which do not depend on any other attribute. The price is the target attribute, or class label, which depends upon them.
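As a small illustration (a hypothetical single-record DataFrame; the column names mirror the dataset used later in this guide), the independent attributes and the target can be separated like this:

```python
import pandas as pd

# Hypothetical record: independent attributes plus the price target
house = pd.DataFrame({
    'beds': [3], 'baths': [2], 'size': [1500],
    'lot_size': [4000], 'price': [450000]
})

X = house.drop(columns=['price'])  # independent attributes
y = house['price']                 # dependent attribute (target)
print(list(X.columns))  # ['beds', 'baths', 'size', 'lot_size']
```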

1. Linear Regression

The Linear Regression algorithm shows a linear relationship between the dependent attribute (Y) and independent attribute (X) variables. Mathematically, we can evaluate it as follows:

Y=aX+b

Here, “a” is the slope and “b” is the intercept; together they are the linear coefficients.

In Python, LinearRegression() is available in the “sklearn.linear_model” module. We will see how to specify this while implementing the project. The model can be created with its default parameters.
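As a quick sketch (toy data, not the house dataset), the model recovers the coefficients of Y = aX + b:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows Y = aX + b with a = 3 and b = 5
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = 3 * X.ravel() + 5

model = LinearRegression()
model.fit(X, Y)
print(model.coef_[0], model.intercept_)  # ≈ 3.0 and 5.0
```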

2. Decision Tree

Basically, a Decision Tree is a graphical representation of all the possible solutions to a problem based on given conditions, expressed using nodes. A Decision node is used to make a decision, and a Leaf node holds the output of a specific decision. We can predict the price of our house with the Decision Tree Regressor.

In Python, the DecisionTreeRegressor is available in the “sklearn.tree” module. We will see how to specify this while implementing the project. The model can be created with its default parameters.
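A minimal sketch with made-up numbers (the price driven only by the number of beds):

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up training data: price depends on the number of beds
X = [[1], [2], [3], [4]]
prices = [100000, 150000, 200000, 250000]

tree = DecisionTreeRegressor()
tree.fit(X, prices)
print(tree.predict([[3]]))  # the tree has seen 3 beds -> 200000
```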

3. Random Forest

Random Forest provides the same functionality as a Decision Tree, but it builds a forest (a collection of Decision Trees) and combines their outputs by taking the mean. For example, suppose the Random Forest size is 3. Internally, three Decision Trees are created. The house price outcome of the first Decision Tree is 20000, the outcome of the second is 20000, and the outcome of the last is 10000. The final outcome is 16,666.67 ((20000 + 20000 + 10000) / 3).

In Python, RandomForestRegressor is available in the “sklearn.ensemble” module. We can specify the number of trees with the “n_estimators” parameter; it is 100 by default.
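The averaging from the example above can be sketched directly, along with setting the number of trees (the model is defined here but not trained):

```python
from sklearn.ensemble import RandomForestRegressor

# A forest of 3 trees; the default for n_estimators is 100
forest = RandomForestRegressor(n_estimators=3)

# Averaging the three tree outcomes from the example above
tree_outputs = [20000, 20000, 10000]
final_outcome = sum(tree_outputs) / len(tree_outputs)
print(final_outcome)  # ≈ 16666.67
```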

Implementation

Let’s quickly walk through the steps involved in predicting US house prices. We use the houses_train dataset (a CSV file) with 2016 records to train the Machine Learning model. Then, we predict the price of the 505 houses present in the house_test file.

1. Loading the Train and Test Datasets

Pandas is a Python module used for data analysis. We utilize this module to load the datasets into the Python environment. Here, we use Google Colab as the code environment. It is available for free; just a Google account is needed.

First, we need to load the files from our local PC to the Colab Env. Download the datasets from here.

# Upload houses_train.csv and house_test.csv files into your Google Colab

# one after another.

from google.colab import files

files.upload()

read_csv() is the function used to load CSV data into a variable. It takes the file name as a parameter.

import pandas

# Load the houses_train.csv into the train_data variable

train_data=pandas.read_csv('houses_train.csv')

# Load the house_test.csv into the test_data variable

test_data=pandas.read_csv('house_test.csv')

# Store a copy of test_data in the test_data1 variable
# (copy() avoids modifying the original DataFrame through the alias)

test_data1=test_data.copy()

Let’s view the columns and non-null records count in each column. The Pandas.DataFrame.info() is used to get this information.

print(train_data.info())

print(test_data1.info())

Output:

2. Data Preprocessing

In both datasets, the “lot_size” column holds values in both sqft and acres (you can see the difference by looking at the rows of the “lot_size_units” column). But the format should be sqft. So, we need to convert the acre values in the “lot_size” column to sqft. The same has to be done for “test_data1”.

The DataFrame.loc[] is utilized here to find the “lot_size_units” with “acre” and multiply the value that is present in “lot_size” with 43560.

# Convert the lot_size acre values into Square feet in train_data

train_data.loc[(train_data["lot_size_units"]=="acre"),"lot_size"]=train_data["lot_size"]* 43560

# Convert the lot_size acre values into Square feet in test_data1

test_data1.loc[(test_data1["lot_size_units"]=="acre"),"lot_size"]=test_data1["lot_size"]* 43560

print(train_data.head())

print(test_data1.head())

Output:

Now, you will see that all the values in the “lot_size” column are sqft values.

There are still some missing values in this column. Let’s replace the NaN values present in the column with the mean of the same column in both datasets.

The DataFrame[‘column_name’].fillna() is used to fill the missing values with the mean; DataFrame[‘column_name’].mean() is passed as the parameter to the fillna() function. Let’s fill the values, display the mean, and check the count:

# Fill the missing values present in the lot_size column with Mean of existing values

train_data['lot_size']=train_data['lot_size'].fillna(train_data['lot_size'].mean())

# Display Mean

print("Train data Mean Value: ", train_data['lot_size'].mean())

print(len(train_data['lot_size']))

# Fill the missing values present in the lot_size column with Mean of existing values

test_data1['lot_size']=test_data1['lot_size'].fillna(test_data1['lot_size'].mean())

# Display Mean

print("Test data Mean Value: ", test_data1['lot_size'].mean())

print(len(test_data1['lot_size']))

Output:

The missing values present in the “lot_size” column of the Train Dataset are replaced by the mean value of 18789.95194, and the missing values in the “lot_size” column of the Test Dataset are replaced by the mean value of 8961.0.

3. Data Cleaning

While training the model, some attributes are not required to predict the outcome. In our case, three attributes, “lot_size_units”, “zip_code”, and “size_units”, are to be removed from both datasets. The pandas.DataFrame.drop() is used to remove these three columns.

train_data=train_data.drop(['lot_size_units','zip_code','size_units'],axis=1)

test_data1=test_data1.drop(['lot_size_units','zip_code','size_units'],axis=1)

print(train_data.info())

print(test_data1.info())

Output:

Now, the datasets are in good shape. Unnecessary columns are removed and the missing values do not exist.

4. Data Visualization

Let’s create a histogram for the columns of the Train data. The pandas.DataFrame.hist() function generates histograms for all numeric attributes.

train_data.hist(figsize=(4,9))

Output:

Histograms are generated for the beds, baths, size, lot_size, and price columns of the Train data.

Let’s compute the correlation of all fields with respect to each other. The plotly.express module is utilized to plot the correlation matrix.

import plotly.express

corr = train_data.corr()

# Plot the correlated data

view_fig = plotly.express.imshow(corr,text_auto=True)

# Display

view_fig.show()

Output:

  1. The beds are 0.2935 correlated with the price, -0.059 correlated with the lot_size, 0.77 correlated with the size, and 0.652 correlated with the baths.
  2. The baths are 0.3173 correlated with the price, -0.054 correlated with the lot_size, 0.667 correlated with the size, and 0.652 correlated with the beds.
  3. The size is 0.444 correlated with the price, -0.044 correlated with the lot_size, 0.667 correlated with the baths, and 0.77 correlated with the beds.

5. Model Preparation

We need to set the price as the target by removing it from train_data. Make sure that the attributes present in the Train and Test data are the same in this phase.

target=train_data['price']

train_data=train_data.drop(['price'],axis=1)

print(train_data.info())

print(test_data1.info())

Output:

Now, there are four independent attributes (beds, baths, size, and lot_size) and the price is the dependent attribute that depends on these four attributes.
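As a sanity check, the Train and Test attributes can be compared directly. This is a sketch using empty hypothetical frames with the column names assumed from this dataset; in the project itself you would compare train_data and test_data1.

```python
import pandas as pd

# Hypothetical frames carrying only the four feature columns used above
train_cols = pd.DataFrame(columns=['beds', 'baths', 'size', 'lot_size'])
test_cols = pd.DataFrame(columns=['beds', 'baths', 'size', 'lot_size'])

# Both datasets must expose exactly the same feature columns
assert list(train_cols.columns) == list(test_cols.columns)
print("Columns match")
```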

6. Training the Model

First, we apply the RandomForestRegressor algorithm, an ensemble technique. Import it from the “sklearn.ensemble” module.

  1. Create a model from the RandomForestRegressor(). We are not passing any parameter to this model, so the number of Decision Trees is 100 by default.
  2. Use the fit() method to fit the model. It takes two parameters: the first is the independent attributes and the second is the class label/target.
  3. Use the score() method to see the model’s R² score (displayed here as a percentage). It takes the same parameters as the fit() method.

from sklearn.ensemble import RandomForestRegressor

# Define the Model

model1 = RandomForestRegressor()

# Fit the model

model1.fit(train_data, target)

# Model R² score (as a percentage)

print(model1.score(train_data, target) * 100)

Output:

86.08400889419033

7. Test the Model and Store the Results

This is the final step where we need to predict the result and store them.

  1. The predict() method is used to predict the Test data. It is called on the model and takes a DataFrame (or nested list of values).
  2. Use the to_csv() method to store the results in a CSV file.
  3. Download the file from the Python environment (Google Colab).

# Predict the test_data1 with the model1.

test_data['Price']=model1.predict(test_data1)

# Save the test_data to test_results.csv

test_data.to_csv('test_results.csv')

# Download this file from the Colab

files.download('test_results.csv')

Output:

Let’s show 20 records out of 505 records. You can see that the Price column holds the predicted values for each house.

Other Models

Let’s predict the house prices using the DecisionTreeRegressor. You can import it from the “sklearn.tree” module.

from sklearn.tree import DecisionTreeRegressor

# Define the Model

model2 = DecisionTreeRegressor()

# Fit the model

model2.fit(train_data, target)

# Model R² score (as a percentage)

print(model2.score(train_data, target) * 100)

# Predict the test_data1 with model2.

test_data['Price']=model2.predict(test_data1)

# Save the test_data to test_results.csv

test_data.to_csv('test_results.csv')

# Download this file from the Colab

files.download('test_results.csv')

Output:

99.94183165335028

Note that this score is computed on the training data. A Decision Tree can memorize the training records almost perfectly, so a near-100% training score usually indicates overfitting rather than better real-world performance.

You can see the predicted result here:

Let’s predict the house prices using LinearRegression. Import the model from the “sklearn.linear_model” module.

from sklearn.linear_model import LinearRegression

# Define the Model

model3 = LinearRegression()

# Fit the model

model3.fit(train_data, target)

# Predict the test_data1 with model3.

test_data['Price']=model3.predict(test_data1)

# Save the test_data to test_results.csv

test_data.to_csv('test_results.csv')

# Download this file from the Colab

files.download('test_results.csv')

You can see the predicted result here:

Conclusion

Now, you are able to predict your house price based on attributes like the number of rooms, the area of your land, etc. In this guide, we considered real house data from Seattle, Washington. Using regression techniques like Linear Regression, Decision Tree, and Random Forest, we predicted the prices of 505 houses. All the steps that have to be done before training the model (Data Preprocessing, Data Cleaning, and Data Visualization) are explained step by step with code snippets and outputs.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain