Scrape and perform ETL process on Star Property House Rental website
This article will be discussing about how data is scraped from one website and the method that is performed to solve miscellaneous error inside the dataset so that cleaned data can be obtained in the end. This group project is as assessment for the course that we took which is Introduction to Data Engineering.
Python language is used in this article as it is a universal or general-purpose coding language which means it can be used for any type of programming.
Ready? Let’s get started!
First of all, we need a library that can help us in scraping data from website. Therefore, BeautifulSoup library will be used as our scrapper. Request module will also be used as we need something that can indicate either the SIP (Session Initial Protocol) response code for our action is success or fail. To make it easier, value range of 200 will indicate a successful responses while range of 400 will indicate a client failure responses.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import pandas as pdurl= "https://www.starproperty.my/to-rent"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
response
Output:
<Response [200]>
Then, variable data will be use to store all the HTML text data from the website so that soup can use it as a reference to parse extracted data based on preferences.
data = response.text
soup = BeautifulSoup(data, 'html.parser')
Phase 1: Extraction

The website that we chose is about information of house rentals that can be found around Malaysia. 10 attributes have been identified from this website which are agent name, rental price, number of bathroom, number of bedroom, location, date, phone number, square feet of the house, property name and accommodation category. Steps below shown on how we extract for each data.
Attribute 1 : Agent Name
agent_names = soup.find_all("a",{"class":"property__agentdetails"})
for agent_name in agent_names:
print(agent_name.text)
The output as follows :-
Estee Chong ( E2250 )
Estee Chong ( E2250 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Simeon Tang ( REN 00413 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Cedric Goey ( E2528 )
Eric Yen ( PEA 2176 )
Attribute 2 : Date
dates = soup.find_all("span",{"class":"property__listed-date"})
for date in dates:
print(date.text)
The output as follows :-
Listed 4 hour ago
Listed 4 hour ago
Listed 4 hour ago
listed 4 hour ago
listed 4 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed on 21/01/2021
listed on 21/01/2021
listed on 21/01/2021
listed on 19/01/2021
listed on 15/01/2021
listed on 12/01/2021
listed on 11/01/2021
listed on 09/01/2021
listed on 09/01/2021
Attribute 3 : Location
locations = soup.find_all("p",{"class":"property__location"})
for location in locations:
print(location.text)
The output as follows :-
Petaling Jaya
Petaling Jaya
Petaling Jaya
Mutiara Damansara
Mutiara Damansara
Mutiara Damansara
Mutiara Damansara
Wangsa Maju
Wangsa Maju
Wangsa Maju
Wangsa Maju
Old Klang Road
Old Klang Road
Old Klang Road
Old Klang Road
Penang
Old Klang Road
Old Klang Road
Old Klang Road
Old Klang Road
Attribute 4 : Number of Bedroom
bedrooms= soup.find_all("span",{"itemprop":"numberOfBedrooms"})
for bedroom in bedrooms:
print(bedroom.text)
The output as follows :-
5
6
5
1
1
3
3
3 +1
3 +1
2
2
3
3
3
3
3 +1
3
3
4
N/A
Attribute 5 : Number of Bathroom
bathrooms = soup.find_all("span",{"itemprop":"numberOfBathroomsTotal"})
for bathroom in bathrooms:
print(bathroom.text)
The output as follows :-
5
6
3
1
N/A
2
2
2
2
2
2
2
2
2
2
2
2
2
3
3
Attribute 6: Rental Price
prices = soup.find_all("h4",{"class":"property__price"})
for price in prices:
print(price.text)
The output as follows :-
RM 5,500.00
RM 6,000.00
RM 3,100.00
RM 2,400.00
RM 2,400.00
RM 2,400.00
RM 2,400.00
RM 1,300.00
RM 1,300.00
RM 1,300.00
RM 1,300.00
RM 2,300.00
RM 2,300.00
RM 2,300.00
RM 2,300.00
RM 200,000.00
RM 2,200.00
RM 2,200.00
RM 2,200.00
RM 2,200.00
Attribute 7 : Phone Number
phones = soup.find_all("div",{"class":"property__agentcontacts"})
for phone in phones:
print(phone.text)
The output as follows :-
+60122004539 +60122004539 +60122004539 +60122004539 +60122004539 +60122811754 +60122811754 +60122811754 +60122811754 +60122811754 +60122811754 +60122811754 +60122811754 +60122811754 +60123190970 +60122811754 +60122811754 +60122811754 +60122811754 +60122811754
Attribute 8 : Square Feet
feets = soup.find_all("span",{"itemprop":"floorSize"})
for feet in feets:
print(feet.text)
The output as follows :-
3400 sq ft
2200 sq ft
850 sq ft
850 sq ft
1150 sq ft
1150 sq ft
1094 sq ft
1094 sq ft
814 sq ft
814 sq ft
850 sq ft
850 sq ft
850 sq ft
1400 sq ft
1094 sq ft
1150 sq ft
1150 sq ft
2096 sq ft
2097 sq ft
2097 sq ft
Attribute 9 : Property Name
pnames = soup.find_all("h3", {"class":"property__name"})
for pname in pnames:
print(pname.text)
The output as follows :-
Aman Suria Damansara
Aman Suria Damansara
Aman Suria Damansara
Surian Residences
Surian Residences
Mutiara Damansara
Mutiara Damansara
Wangsa Maju
Wangsa Maju
Wangsa Metroview
Wangsa Metroview
Citizen
Citizen
Old Klang Road
Old Klang Road
Quay West Residence
Old Klang Road
Old Klang Road
Residency V
Residency V
Attribute 10 : Accommadation Category
categories = soup.find_all("div",{"itemprop":"accommodationCategory"})
for category in categories:
print(category.text)
The output as follows :-
Fully Furnished Semi-D Semi-Furnished Bungalow Semi-Furnished House Fully Furnished Condominium Fully Furnished Condominium Fully Furnished Condominium Fully Furnished Condominium Fully Furnished Condominium Fully Furnished Condominium Condominium F/H Fully Furnished Condominium Fully Furnished Condominium Semi-Furnished Apartment Semi-Furnished Apartment Semi-Furnished Apartment Semi-Furnished Apartment House, 2-sty terrace/link house, 3-sty... F/H Fully Furnished Condominium Fully Furnished Condominium Fully Furnished Condominium
After we finish extracting the data, we saved them and combine it all into a pandas dataframe.
rents = soup.find_all("div",{"class":"property property--half mb-4"})
rent_no = 0
rents_list = {}
for rent in rents:date = rent.find("span",{"class":"property__listed-date"}).text
agent_name = rent.find("a",{"class":"property__agentdetails"}).text
location = rent.find("p",{"class":"property__location"}).text
price = rent.find("h4",{"class":"property__price"}).text
no_bedroom = rent.find("span",{"itemprop":"numberOfBedrooms"}).text
no_bathroom = rent.find("span",{"itemprop":"numberOfBathroomsTotal"}).text
phone = rent.find("div",{"class":"property__agentcontacts"}).text
feet = rent.find("span",{"itemprop":"floorSize"}).text
pname = rent.find("h3",{"class":"property__name"}).text
category = rent.find("div",{"itemprop":"accommodationCategory"}).textrent_no += 1
rents_list[rent_no] = [agent_name,price, no_bedroom,no_bathroom,location, date,phone, feet, pname,category]
rents_df = pd.DataFrame.from_dict(rents_list, orient ="index", columns=['Agent Name','Rental Price','Number of Bedroom', 'Number of Bathroom', 'Location','Date','Number Phone','Square Feet', 'Property Name','Accommodation Category'])rents_df.head()

Phase 2 : Transformation
Second stage in ETL is transformation where we have to do some data cleaning in order to fix or remove the inaccurate data. For every attribute we identify the incorrect data beforehand to know which cleaning process we have to do. After we’re done cleaning, we proceed with some visualization.
Cleaning task 1 : Change Letters to Lowercase
This is done to ensure that the naming convention of the data is standardized without having mixed-case values in the dataset. We chose “Agent Name” , “Location” and “Property Name” columns to lowercase their values.
Column — Agent Name
Before
1 Estee Chong ( E2250 )
2 Estee Chong ( E2250 )
3 Eric Yen ( PEA 2176 )
4 Eric Yen ( PEA 2176 )
5 Eric Yen ( PEA 2176 )
6 Eric Yen ( PEA 2176 )
7 Eric Yen ( PEA 2176 )
8 Eric Yen ( PEA 2176 )
9 Eric Yen ( PEA 2176 )
10 Eric Yen ( PEA 2176 )
11 Eric Yen ( PEA 2176 )
12 Simeon Tang ( REN 00413 )
13 Eric Yen ( PEA 2176 )
14 Eric Yen ( PEA 2176 )
15 Eric Yen ( PEA 2176 )
16 Eric Yen ( PEA 2176 )
17 Eric Yen ( PEA 2176 )
18 Eric Yen ( PEA 2176 )
19 Cedric Goey ( E2528 )
20 Eric Yen ( PEA 2176 )
After
rents_df['Agent Name'] = rents_df['Agent Name'].str.lower()
rents_df['Agent Name']
Changed output :-
1 estee chong ( e2250 )
2 estee chong ( e2250 )
3 eric yen ( pea 2176 )
4 eric yen ( pea 2176 )
5 eric yen ( pea 2176 )
6 eric yen ( pea 2176 )
7 eric yen ( pea 2176 )
8 eric yen ( pea 2176 )
9 eric yen ( pea 2176 )
10 eric yen ( pea 2176 )
11 eric yen ( pea 2176 )
12 simeon tang ( ren 00413 )
13 eric yen ( pea 2176 )
14 eric yen ( pea 2176 )
15 eric yen ( pea 2176 )
16 eric yen ( pea 2176 )
17 eric yen ( pea 2176 )
18 eric yen ( pea 2176 )
19 cedric goey ( e2528 )
20 eric yen ( pea 2176 )
Column — Location
Before
1 Mutiara Damansara
2 Mutiara Damansara
3 Wangsa Maju
4 Wangsa Maju
5 Old Klang Road
After
rents_df[‘Location’] = rents_df[‘Location’].str.lower()
rents_df[‘Location’].head()
Output changed :-
1 mutiara damansara
2 mutiara damansara
3 wangsa maju
4 wangsa maju
5 old klang road
Column — Property Name
Before
1 Surian Residences
2 Mutiara Damansara
3 Wangsa Maju
4 Wangsa Metroview
5 Citizen
After
rents_df['Property Name'] = rents_df['Property Name'].str.lower()
rents_df['Property Name'].head()
Output changed :-
1 surian residences
2 mutiara damansara
3 wangsa maju
4 wangsa metroview
5 citizen
Cleaning task 2 : Handling missing value
Based on the scraped data, there are columns that contain missing values such as “Number of Bedroom” and “Number of Bathroom”. Therefore, cleaning need to be done to ensure that there are no missing values in the column. As we can see, there are two value that can be categorized as unknown which stated as N/A (null) and also values that has “+1”. So we decided to replace the “+1” value with the correct calculated one and N/A (null) value with the mean value for number of bedroom and bathroom.
Before
Column — Number of Bedroom
1 1
2 1
3 3
4 3
5 3 +1
6 3 +1
7 2
8 2
9 3
10 3
11 3
12 3
13 3 +1
14 3
15 3
16 4
17 4
18 4
19 N/A
20 2
Visualisation :-

After
rents_df['Number of Bedroom'].replace("N/A", np.nan, inplace = True)
rents_df['Number of Bedroom'].replace("4 +1", "5", inplace = True)
rents_df['Number of Bedroom'].replace("3 +1", "4", inplace = True)
missing_data = rents_df['Number of Bedroom']
mean_data = rents_df['Number of Bedroom'].astype("float").mean()
mean_data = round(mean_data)
rents_df['Number of Bedroom'].replace(np.nan, mean_data, inplace = True)
rents_df['Number of Bedroom'].astype(int)
Changed output:-
1 1
2 1
3 3
4 3
5 4
6 4
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 3
15 3
16 4
17 4
18 4
19 3
20 2
Visualization :-
%matplotlib inline
import matplotlib.pyplot as pltrents_df['Price-Binned'].value_counts().plot(kind='bar', x='Rental Price', y='Number of Bedroom',figsize=(10, 6), color='lightblue')plt.title('Histogram of the Rental Price for Property')
plt.ylabel('Number of Property')
plt.xlabel('Rental Price')
plt.show()

Column — Number of Bathroom
Before
1 1
2 N/A
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 3
19 2
20 2
Visualization :-

After
rents_df['Number of Bathroom'].replace("N/A", np.nan, inplace = True) avg_1 = rents_df['Number of Bathroom'].astype("float").mean(axis = 0) avg_1 = round(avg_1) rents_df['Number of Bathroom'].replace(np.nan,avg_1, inplace = True) rents_df['Number of Bathroom'].astype(int)
Changed output :-
1 1
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 3
19 2
20 2
Visualization :-
%matplotlib inline
import matplotlib.pyplot as plt rents_df['Number of Bathroom'].value_counts().plot(kind='bar', x='Number of Bathroom', y='Frequency', figsize=(10, 6), color='lightblue') plt.title('Histogram of the Frequency Number of Bathroom for Property')
plt.ylabel('Frequency')
plt.xlabel('Number of Bathroom')
plt.show()

Cleaning task 3 : Binning data
After we done scraping the rental price data, we can see that there are so many numbers and therefore we don’t know how to estimate the property either it is cheap or expensive. Therefore, we decided to bin the rental price data so that we can classify the property according to their price.
Before
1 RM 2,400.00
2 RM 2,400.00
3 RM 1,300.00
4 RM 1,300.00
5 RM 2,300.00
6 RM 2,300.00
7 RM 2,200.00
8 RM 2,200.00
9 RM 1,300.00
10 RM 1,300.00
11 RM 1,300.00
12 RM 1,500.00
13 RM 2,300.00
14 RM 1,300.00
15 RM 1,300.00
16 RM 1,500.00
17 RM 1,500.00
18 RM 1,500.00
19 RM 4,000.00
20 RM 2,200.00
Visualization :-

After
rents_df['Rental Price'] = rents_df['Rental Price'].replace('RM ','', regex=True)rents_df['Rental Price'] = rents_df['Rental Price'].replace(',','', regex=True)bins = [0.00, 1500.00, 2500.00, 3500.00, 20000.00]rents_df['Binned'] = pd.cut(rents_df['Rental Price'].astype(float), bins)labels = ['Cheap','Medium','Expensive', 'Super Expensive']rents_df['Price-Binned'] = pd.cut(rents_df['Rental Price'].astype(float), bins,labels=labels)print (rents_df)
Changed output :-

Visualization :-
%matplotlib inlineimport matplotlib.pyplot as pltrents_df['Price-Binned'].value_counts().plot(kind='bar', x='Rental Price', y='Number of Bedroom',figsize=(10, 6), color='lightblue')plt.title('Histogram of the Rental Price for Property')plt.ylabel('Number of Property')plt.xlabel('Rental Price')plt.show()

Cleaning task 4 : Remove irrelevant data
Column — Number Phone
We decided to remove the newlines and ‘+6’ from “Number Phone” columns. It is to ensure the “Number Phone” does not have mixed data type and has integer data type only so that it will become easier for user to read it.
Before
1 \n\n\n+60122004539\n\n\n\n ...
2 \n\n\n+60122004539\n\n\n\n ...
3 \n\n\n+60122811754\n\n\n\n ...
4 \n\n\n+60122811754\n\n\n\n ...
5 \n\n\n+60122811754\n\n\n\n ...
After
rents_df['Number Phone'] = rents_df['Number Phone'].replace('\n','', regex=True)rents_df['Number Phone'] = rents_df['Number Phone'].replace('6','', regex=True)rents_df['Number Phone'].head()
Changed output :-
1 0122004539
2 0122004539
3 0122811754
4 0122811754
5 0122811754
Column — Square Feet
For the “Square Feet” column, we can see that there is ‘sq ft’ from “Square Feet” data. So we decided to remove ‘sq ft’ to standardize it to become integer data type values.
Before
1 850 sq ft
2 850 sq ft
3 1150 sq ft
4 1150 sq ft
5 1094 sq ft
After
rents_df['Square Feet'] = rents_df['Square Feet'].replace('sq ft','', regex=True)rents_df['Square Feet'].head()
Changed output :-
1 850
2 850
3 1150
4 1150
5 1094
Column — Date
We decided to remove the newlines and ‘listed on’ from “Date” columns. It is to ensure the “Date” does not have mixed data type and has integer data type only so that it will become easier for user to read it.
Before
1 listed 4 hour ago
2 listed 4 hour ago
3 listed 8 hour ago
4 listed 8 hour ago
5 listed 8 hour ago
After
rents_df['Date'] = rents_df['Date'].replace('\n','', regex=True)text = ["listed","listed on"]
for char in text:
rents_df['Date'] = rents_df['Date'].str.replace(char, ' ')rents_df['Date'].head()
Changed output :-
1 4 hour ago
2 4 hour ago
3 8 hour ago
4 8 hour ago
5 8 hour ago
Column — Accommodation Category
We decided to replace the word ‘fully furnished’ to ‘fully-furnished’ in Accommodation Category column to make it easier to remove everything after the words ‘fully-furnished’, ‘semi-furnished’ and ‘unfurnished’. This will allows us to make a visualization of the furnishing types.
Before
1 fully-furnished condominium ...
2 fully-furnished condominium ...
3 fully-furnished condominium ...
4 fully-furnished condominium ...
5 fully-furnished condominium ...
After
rents_df['Accommodation Category'] = rents_df['Accommodation Category'].replace('fully furnished','fully-furnished', regex=True)rents_df['Accommodation Category']
Changed output :-
1 fully-furnished
2 fully-furnished
3 fully-furnished
4 fully-furnished
5 fully-furnished
Visualisation :-
%matplotlib inline
import matplotlib.pyplot as pltrents_df['Accommodation Category'].value_counts().plot(kind='bar', x='Accommodation Category', y='Frequency',
figsize=(10, 6), color='purple')plt.title('Histogram of the Frequency of Furnishing')
plt.ylabel('Frequency')
plt.xlabel('Furnishing')
plt.show()

Final dataframe after cleaning all of the attributes will look like as shown below

Phase 3 : Load

Last but not least we have to store data in a csv file and download it.
rents_df.to_csv('cleaned_data_rents_df.csv')
Here is the complete code of this project.
That’s all from us! I hope you guys gain something from this article 😊