Another awesome framework (a.k.a library) in Python is scikit-learn. The framework provides a simple and consistent API for data modeling and analytics. With scikit-learn, it is easy to implement all the key machine learning algorithms e.g. Linear Regression, Support Vector Machines, Naives Bayes, Neural Network, Decision Tree, K-means … etc.
In this case study, we will build a model to predict the price of a condo unit using a Multivariate Regression algorithm. Actual condo transactions data from Q1 2019 was downloaded from Urban Redevelopment Authority (https://www.ura.gov.sg) into a csv file. The downloaded data is used as a baseline to construct a training dataset for our model.
Prior to training the model, an exploratory data analysis is done on the dataset to determine the relationships among the different variables (a.k.a. features). This analysis would yield useful information for selecting the appropriate features as the Dependent Variables to predict the Sale Price (i.e. Target). See the histograms and pair-plots of the variables below.
From the exploratory data analysis, we select the appropriate dependent variables to predict Sale Price. The selected variables are ‘Floor Area’, ‘Age’, ‘Tenure’, ‘Near MRT’, ‘District’ and ‘Floor Level’. In order for the algorithm to compute, we transform categorical variables ‘Tenure’, ‘Near MRT’, ‘District’ and ‘Floor Level’ variables into numerical variables. See the transformation of the condo DataFrame below.
Once the training of the model is completed, we could use it to predict the Sale Price of a condo unit. See the result of a test run to predict the Sale Price of a condo unit having specified features.
The Python code is shown below: