Scrape and perform ETL process on Star Property House Rental website

12 min readJan 22, 2021

This article will be discussing about how data is scraped from one website and the method that is performed to solve miscellaneous error inside the dataset so that cleaned data can be obtained in the end. This group project is as assessment for the course that we took which is Introduction to Data Engineering.

Python language is used in this article as it is a universal or general-purpose coding language which means it can be used for any type of programming.

Ready? Let’s get started!

First of all, we need a library that can help us in scraping data from website. Therefore, BeautifulSoup library will be used as our scrapper. Request module will also be used as we need something that can indicate either the SIP (Session Initial Protocol) response code for our action is success or fail. To make it easier, value range of 200 will indicate a successful responses while range of 400 will indicate a client failure responses.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import pandas as pdurl= "https://www.starproperty.my/to-rent"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
response

Output:

<Response [200]>

Then, variable data will be use to store all the HTML text data from the website so that soup can use it as a reference to parse extracted data based on preferences.

data = response.text
soup = BeautifulSoup(data, 'html.parser')

Phase 1: Extraction

The website that we chose is about information of house rentals that can be found around Malaysia. 10 attributes have been identified from this website which are agent name, rental price, number of bathroom, number of bedroom, location, date, phone number, square feet of the house, property name and accommodation category. Steps below shown on how we extract for each data.

Attribute 1 : Agent Name

agent_names = soup.find_all("a",{"class":"property__agentdetails"})
for agent_name in agent_names:
print(agent_name.text)

The output as follows :-

Estee Chong ( E2250 ) 
Estee Chong ( E2250 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Simeon Tang ( REN 00413 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Eric Yen ( PEA 2176 ) 
Cedric Goey ( E2528 ) 
Eric Yen ( PEA 2176 )

Attribute 2 : Date

dates = soup.find_all("span",{"class":"property__listed-date"})
for date in dates:
print(date.text)

The output as follows :-

Listed 4 hour ago                           
Listed 4 hour ago                           
Listed 4 hour ago                                   
listed 4 hour ago                                   
listed 4 hour ago                                   
listed 8 hour ago                                   
listed 8 hour ago                                   
listed 8 hour ago                                   
listed 8 hour ago                                   
listed 8 hour ago                                   
listed 8 hour ago                                   
listed on 21/01/2021                                   
listed on 21/01/2021                                   
listed on 21/01/2021                                   
listed on 19/01/2021                                   
listed on 15/01/2021                                   
listed on 12/01/2021                                   
listed on 11/01/2021                                   
listed on 09/01/2021                                   
listed on 09/01/2021

Attribute 3 : Location

locations = soup.find_all("p",{"class":"property__location"})
for location in locations:
print(location.text)

The output as follows :-

Petaling Jaya 
Petaling Jaya 
Petaling Jaya 
Mutiara Damansara 
Mutiara Damansara 
Mutiara Damansara 
Mutiara Damansara 
Wangsa Maju 
Wangsa Maju 
Wangsa Maju 
Wangsa Maju 
Old Klang Road 
Old Klang Road 
Old Klang Road 
Old Klang Road 
Penang 
Old Klang Road 
Old Klang Road 
Old Klang Road 
Old Klang Road

Attribute 4 : Number of Bedroom

bedrooms= soup.find_all("span",{"itemprop":"numberOfBedrooms"})
for bedroom in bedrooms:
print(bedroom.text)

The output as follows :-

Attribute 5 : Number of Bathroom

bathrooms = soup.find_all("span",{"itemprop":"numberOfBathroomsTotal"})
for bathroom in bathrooms:
print(bathroom.text)

The output as follows :-

Attribute 6: Rental Price

prices = soup.find_all("h4",{"class":"property__price"})
for price in prices:
print(price.text)

The output as follows :-

RM 5,500.00 
RM 6,000.00 
RM 3,100.00 
RM 2,400.00 
RM 2,400.00 
RM 2,400.00 
RM 2,400.00 
RM 1,300.00 
RM 1,300.00 
RM 1,300.00 
RM 1,300.00 
RM 2,300.00 
RM 2,300.00 
RM 2,300.00 
RM 2,300.00 
RM 200,000.00 
RM 2,200.00 
RM 2,200.00 
RM 2,200.00 
RM 2,200.00

Attribute 7 : Phone Number

phones = soup.find_all("div",{"class":"property__agentcontacts"})
for phone in phones:
print(phone.text)

The output as follows :-

+60122004539                                                 +60122004539                                                 +60122004539                                                 +60122004539                                                     +60122004539                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60123190970                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754

Attribute 8 : Square Feet

feets = soup.find_all("span",{"itemprop":"floorSize"})
for feet in feets:
print(feet.text)

The output as follows :-

3400 sq ft  
2200 sq ft 
850 sq ft 
850 sq ft 
1150 sq ft 
1150 sq ft 
1094 sq ft 
1094 sq ft 
814 sq ft 
814 sq ft 
850 sq ft 
850 sq ft 
850 sq ft 
1400 sq ft 
1094 sq ft 
1150 sq ft 
1150 sq ft 
2096 sq ft 
2097 sq ft 
2097 sq ft

Attribute 9 : Property Name

pnames = soup.find_all("h3", {"class":"property__name"})
for pname in pnames:
print(pname.text)

The output as follows :-

Aman Suria Damansara                          
Aman Suria Damansara                          
Aman Suria Damansara                          
Surian Residences 
Surian Residences 
Mutiara Damansara 
Mutiara Damansara 
Wangsa Maju 
Wangsa Maju 
Wangsa Metroview 
Wangsa Metroview 
Citizen 
Citizen 
Old Klang Road 
Old Klang Road 
Quay West Residence 
Old Klang Road 
Old Klang Road 
Residency V 
Residency V

Attribute 10 : Accommadation Category

categories = soup.find_all("div",{"itemprop":"accommodationCategory"})
for category in categories:
print(category.text)

The output as follows :-

Fully Furnished                                                                 Semi-D                                                                                                                                   Semi-Furnished                                                          Bungalow                                                                                                                                   Semi-Furnished                                                               House                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Condominium                                             F/H  Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Semi-Furnished Apartment                                                                              Semi-Furnished Apartment                                                                              Semi-Furnished Apartment                                                                              Semi-Furnished Apartment                                                                              House, 2-sty terrace/link house, 3-sty...                                             F/H  Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium

After we finish extracting the data, we saved them and combine it all into a pandas dataframe.

rents = soup.find_all("div",{"class":"property property--half mb-4"})
rent_no = 0
rents_list = {}
for rent in rents:date = rent.find("span",{"class":"property__listed-date"}).text
agent_name = rent.find("a",{"class":"property__agentdetails"}).text
location = rent.find("p",{"class":"property__location"}).text
price = rent.find("h4",{"class":"property__price"}).text
no_bedroom = rent.find("span",{"itemprop":"numberOfBedrooms"}).text
no_bathroom = rent.find("span",{"itemprop":"numberOfBathroomsTotal"}).text
phone = rent.find("div",{"class":"property__agentcontacts"}).text
feet = rent.find("span",{"itemprop":"floorSize"}).text
pname = rent.find("h3",{"class":"property__name"}).text
category = rent.find("div",{"itemprop":"accommodationCategory"}).textrent_no += 1
rents_list[rent_no] = [agent_name,price, no_bedroom,no_bathroom,location, date,phone, feet, pname,category]
rents_df = pd.DataFrame.from_dict(rents_list, orient ="index", columns=['Agent Name','Rental Price','Number of Bedroom', 'Number of Bathroom', 'Location','Date','Number Phone','Square Feet', 'Property Name','Accommodation Category'])rents_df.head()

Phase 2 : Transformation

Second stage in ETL is transformation where we have to do some data cleaning in order to fix or remove the inaccurate data. For every attribute we identify the incorrect data beforehand to know which cleaning process we have to do. After we’re done cleaning, we proceed with some visualization.

Cleaning task 1 : Change Letters to Lowercase

This is done to ensure that the naming convention of the data is standardized without having mixed-case values in the dataset. We chose “Agent Name” , “Location” and “Property Name” columns to lowercase their values.

Column — Agent Name

Before

1         Estee Chong ( E2250 ) 
2         Estee Chong ( E2250 ) 
3         Eric Yen ( PEA 2176 ) 
4         Eric Yen ( PEA 2176 ) 
5         Eric Yen ( PEA 2176 ) 
6         Eric Yen ( PEA 2176 ) 
7         Eric Yen ( PEA 2176 ) 
8         Eric Yen ( PEA 2176 ) 
9         Eric Yen ( PEA 2176 ) 
10        Eric Yen ( PEA 2176 ) 
11        Eric Yen ( PEA 2176 ) 
12    Simeon Tang ( REN 00413 ) 
13        Eric Yen ( PEA 2176 ) 
14        Eric Yen ( PEA 2176 ) 
15        Eric Yen ( PEA 2176 ) 
16        Eric Yen ( PEA 2176 ) 
17        Eric Yen ( PEA 2176 ) 
18        Eric Yen ( PEA 2176 ) 
19        Cedric Goey ( E2528 ) 
20        Eric Yen ( PEA 2176 )

After

rents_df['Agent Name'] = rents_df['Agent Name'].str.lower()
rents_df['Agent Name']

Changed output :-

1         estee chong ( e2250 ) 
2         estee chong ( e2250 ) 
3         eric yen ( pea 2176 ) 
4         eric yen ( pea 2176 ) 
5         eric yen ( pea 2176 ) 
6         eric yen ( pea 2176 ) 
7         eric yen ( pea 2176 ) 
8         eric yen ( pea 2176 ) 
9         eric yen ( pea 2176 ) 
10        eric yen ( pea 2176 ) 
11        eric yen ( pea 2176 ) 
12    simeon tang ( ren 00413 ) 
13        eric yen ( pea 2176 ) 
14        eric yen ( pea 2176 ) 
15        eric yen ( pea 2176 ) 
16        eric yen ( pea 2176 ) 
17        eric yen ( pea 2176 ) 
18        eric yen ( pea 2176 ) 
19        cedric goey ( e2528 ) 
20        eric yen ( pea 2176 )

Column — Location

Before

1    Mutiara Damansara 
2    Mutiara Damansara 
3          Wangsa Maju 
4          Wangsa Maju 
5       Old Klang Road

After

rents_df[‘Location’] = rents_df[‘Location’].str.lower()
rents_df[‘Location’].head()

Output changed :-

1    mutiara damansara 
2    mutiara damansara 
3          wangsa maju 
4          wangsa maju 
5       old klang road

Column — Property Name

Before

1    Surian Residences 
2    Mutiara Damansara 
3          Wangsa Maju 
4     Wangsa Metroview 
5              Citizen

After

rents_df['Property Name'] = rents_df['Property Name'].str.lower()
rents_df['Property Name'].head()

Output changed :-

1    surian residences 
2    mutiara damansara 
3          wangsa maju 
4     wangsa metroview 
5              citizen

Cleaning task 2 : Handling missing value

Based on the scraped data, there are columns that contain missing values such as “Number of Bedroom” and “Number of Bathroom”. Therefore, cleaning need to be done to ensure that there are no missing values in the column. As we can see, there are two value that can be categorized as unknown which stated as N/A (null) and also values that has “+1”. So we decided to replace the “+1” value with the correct calculated one and N/A (null) value with the mean value for number of bedroom and bathroom.

Before

Column — Number of Bedroom

Visualisation :-

After

rents_df['Number of Bedroom'].replace("N/A", np.nan, inplace = True)
rents_df['Number of Bedroom'].replace("4 +1", "5", inplace = True)
rents_df['Number of Bedroom'].replace("3 +1", "4", inplace = True)
missing_data = rents_df['Number of Bedroom']
mean_data = rents_df['Number of Bedroom'].astype("float").mean()
mean_data = round(mean_data)
rents_df['Number of Bedroom'].replace(np.nan, mean_data, inplace = True)
rents_df['Number of Bedroom'].astype(int)

Changed output:-

Visualization :-

%matplotlib inline
import matplotlib.pyplot as pltrents_df['Price-Binned'].value_counts().plot(kind='bar', x='Rental Price', y='Number of Bedroom',figsize=(10, 6), color='lightblue')plt.title('Histogram of the Rental Price for Property') 
plt.ylabel('Number of Property') 
plt.xlabel('Rental Price')
plt.show()

Column — Number of Bathroom

Before

Visualization :-

After

rents_df['Number of Bathroom'].replace("N/A", np.nan, inplace = True) avg_1 = rents_df['Number of Bathroom'].astype("float").mean(axis = 0) avg_1 = round(avg_1) rents_df['Number of Bathroom'].replace(np.nan,avg_1, inplace = True) rents_df['Number of Bathroom'].astype(int)

Changed output :-

Visualization :-

%matplotlib inline 
import matplotlib.pyplot as plt rents_df['Number of Bathroom'].value_counts().plot(kind='bar', x='Number of Bathroom', y='Frequency',    figsize=(10, 6), color='lightblue') plt.title('Histogram of the Frequency Number of Bathroom for Property')  
plt.ylabel('Frequency')  
plt.xlabel('Number of Bathroom') 
plt.show()

Cleaning task 3 : Binning data

After we done scraping the rental price data, we can see that there are so many numbers and therefore we don’t know how to estimate the property either it is cheap or expensive. Therefore, we decided to bin the rental price data so that we can classify the property according to their price.

Before

1 RM 2,400.00 
2 RM 2,400.00 
3 RM 1,300.00 
4 RM 1,300.00 
5 RM 2,300.00 
6 RM 2,300.00 
7 RM 2,200.00 
8 RM 2,200.00 
9 RM 1,300.00 
10 RM 1,300.00 
11 RM 1,300.00 
12 RM 1,500.00 
13 RM 2,300.00 
14 RM 1,300.00 
15 RM 1,300.00 
16 RM 1,500.00 
17 RM 1,500.00 
18 RM 1,500.00 
19 RM 4,000.00 
20 RM 2,200.00

Visualization :-

After

rents_df['Rental Price'] = rents_df['Rental Price'].replace('RM ','', regex=True)rents_df['Rental Price'] = rents_df['Rental Price'].replace(',','', regex=True)bins = [0.00, 1500.00, 2500.00, 3500.00, 20000.00]rents_df['Binned'] = pd.cut(rents_df['Rental Price'].astype(float), bins)labels = ['Cheap','Medium','Expensive', 'Super Expensive']rents_df['Price-Binned'] = pd.cut(rents_df['Rental Price'].astype(float), bins,labels=labels)print (rents_df)

Changed output :-

Visualization :-

%matplotlib inlineimport matplotlib.pyplot as pltrents_df['Price-Binned'].value_counts().plot(kind='bar', x='Rental Price', y='Number of Bedroom',figsize=(10, 6), color='lightblue')plt.title('Histogram of the Rental Price for Property')plt.ylabel('Number of Property')plt.xlabel('Rental Price')plt.show()

Cleaning task 4 : Remove irrelevant data

Column — Number Phone

We decided to remove the newlines and ‘+6’ from “Number Phone” columns. It is to ensure the “Number Phone” does not have mixed data type and has integer data type only so that it will become easier for user to read it.

Before

1    \n\n\n+60122004539\n\n\n\n                    ... 
2    \n\n\n+60122004539\n\n\n\n                    ... 
3    \n\n\n+60122811754\n\n\n\n                    ... 
4    \n\n\n+60122811754\n\n\n\n                    ... 
5    \n\n\n+60122811754\n\n\n\n                    ...

After

rents_df['Number Phone'] =  rents_df['Number Phone'].replace('\n','', regex=True)rents_df['Number Phone'] =  rents_df['Number Phone'].replace('6','', regex=True)rents_df['Number Phone'].head()

Changed output :-

1    0122004539                                    
2    0122004539                                    
3    0122811754                                    
4    0122811754                                   
5    0122811754

Column — Square Feet

For the “Square Feet” column, we can see that there is ‘sq ft’ from “Square Feet” data. So we decided to remove ‘sq ft’ to standardize it to become integer data type values.

Before

1     850 sq ft 
2     850 sq ft 
3    1150 sq ft 
4    1150 sq ft 
5    1094 sq ft

After

rents_df['Square Feet'] =  rents_df['Square Feet'].replace('sq ft','', regex=True)rents_df['Square Feet'].head()

Changed output :-

Column — Date

We decided to remove the newlines and ‘listed on’ from “Date” columns. It is to ensure the “Date” does not have mixed data type and has integer data type only so that it will become easier for user to read it.

Before

1                                    listed 4 hour ago 
2                                    listed 4 hour ago 
3                                    listed 8 hour ago 
4                                    listed 8 hour ago 
5                                    listed 8 hour ago

After

rents_df['Date'] =  rents_df['Date'].replace('\n','', regex=True)text = ["listed","listed on"]
for char in text:
rents_df['Date'] = rents_df['Date'].str.replace(char, ' ')rents_df['Date'].head()

Changed output :-

1                                    4 hour ago 
2                                    4 hour ago 
3                                    8 hour ago 
4                                    8 hour ago 
5                                    8 hour ago

Column — Accommodation Category

We decided to replace the word ‘fully furnished’ to ‘fully-furnished’ in Accommodation Category column to make it easier to remove everything after the words ‘fully-furnished’, ‘semi-furnished’ and ‘unfurnished’. This will allows us to make a visualization of the furnishing types.

Before

1     fully-furnished condominium                   ... 
2     fully-furnished condominium                   ... 
3     fully-furnished condominium                   ... 
4     fully-furnished condominium                   ... 
5     fully-furnished condominium                   ...

After

rents_df['Accommodation Category'] =  rents_df['Accommodation Category'].replace('fully furnished','fully-furnished', regex=True)rents_df['Accommodation Category']

Changed output :-

1     fully-furnished 
2     fully-furnished 
3     fully-furnished 
4     fully-furnished 
5     fully-furnished

Visualisation :-

%matplotlib inline
import matplotlib.pyplot as pltrents_df['Accommodation Category'].value_counts().plot(kind='bar', x='Accommodation Category', y='Frequency',
figsize=(10, 6), color='purple')plt.title('Histogram of the Frequency of Furnishing')
plt.ylabel('Frequency')
plt.xlabel('Furnishing')
plt.show()

Final dataframe after cleaning all of the attributes will look like as shown below

Phase 3 : Load

Last but not least we have to store data in a csv file and download it.

rents_df.to_csv('cleaned_data_rents_df.csv')

Here is the complete code of this project.

That’s all from us! I hope you guys gain something from this article 😊

Scrape and perform ETL process on Star Property House Rental website

Phase 1: Extraction

Phase 2 : Transformation

Phase 3 : Load

Written by Farah Izzlyn