Scrape and perform ETL process on Star Property House Rental website

Farah Izzlyn
12 min readJan 22, 2021

This article will be discussing about how data is scraped from one website and the method that is performed to solve miscellaneous error inside the dataset so that cleaned data can be obtained in the end. This group project is as assessment for the course that we took which is Introduction to Data Engineering.

Python language is used in this article as it is a universal or general-purpose coding language which means it can be used for any type of programming.

Ready? Let’s get started!

First of all, we need a library that can help us in scraping data from website. Therefore, BeautifulSoup library will be used as our scrapper. Request module will also be used as we need something that can indicate either the SIP (Session Initial Protocol) response code for our action is success or fail. To make it easier, value range of 200 will indicate a successful responses while range of 400 will indicate a client failure responses.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import pandas as pd
url= "https://www.starproperty.my/to-rent"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
response

Output:

<Response [200]>

Then, variable data will be use to store all the HTML text data from the website so that soup can use it as a reference to parse extracted data based on preferences.

data = response.text
soup = BeautifulSoup(data, 'html.parser')

Phase 1: Extraction

The website that we chose is about information of house rentals that can be found around Malaysia. 10 attributes have been identified from this website which are agent name, rental price, number of bathroom, number of bedroom, location, date, phone number, square feet of the house, property name and accommodation category. Steps below shown on how we extract for each data.

Attribute 1 : Agent Name

agent_names = soup.find_all("a",{"class":"property__agentdetails"})
for agent_name in agent_names:
print(agent_name.text)

The output as follows :-

Estee Chong ( E2250 ) 
Estee Chong ( E2250 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Simeon Tang ( REN 00413 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Eric Yen ( PEA 2176 )
Cedric Goey ( E2528 )
Eric Yen ( PEA 2176 )

Attribute 2 : Date

dates = soup.find_all("span",{"class":"property__listed-date"})
for date in dates:
print(date.text)

The output as follows :-

Listed 4 hour ago                           
Listed 4 hour ago
Listed 4 hour ago
listed 4 hour ago
listed 4 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed 8 hour ago
listed on 21/01/2021
listed on 21/01/2021
listed on 21/01/2021
listed on 19/01/2021
listed on 15/01/2021
listed on 12/01/2021
listed on 11/01/2021
listed on 09/01/2021
listed on 09/01/2021

Attribute 3 : Location

locations = soup.find_all("p",{"class":"property__location"})
for location in locations:
print(location.text)

The output as follows :-

Petaling Jaya 
Petaling Jaya
Petaling Jaya
Mutiara Damansara
Mutiara Damansara
Mutiara Damansara
Mutiara Damansara
Wangsa Maju
Wangsa Maju
Wangsa Maju
Wangsa Maju
Old Klang Road
Old Klang Road
Old Klang Road
Old Klang Road
Penang
Old Klang Road
Old Klang Road
Old Klang Road
Old Klang Road

Attribute 4 : Number of Bedroom

bedrooms= soup.find_all("span",{"itemprop":"numberOfBedrooms"})
for bedroom in bedrooms:
print(bedroom.text)

The output as follows :-

5  
6
5
1
1
3
3
3 +1
3 +1
2
2
3
3
3
3
3 +1
3
3
4
N/A

Attribute 5 : Number of Bathroom

bathrooms = soup.find_all("span",{"itemprop":"numberOfBathroomsTotal"})
for bathroom in bathrooms:
print(bathroom.text)

The output as follows :-

5  
6
3
1
N/A
2
2
2
2
2
2
2
2
2
2
2
2
2
3
3

Attribute 6: Rental Price

prices = soup.find_all("h4",{"class":"property__price"})
for price in prices:
print(price.text)

The output as follows :-

RM 5,500.00 
RM 6,000.00
RM 3,100.00
RM 2,400.00
RM 2,400.00
RM 2,400.00
RM 2,400.00
RM 1,300.00
RM 1,300.00
RM 1,300.00
RM 1,300.00
RM 2,300.00
RM 2,300.00
RM 2,300.00
RM 2,300.00
RM 200,000.00
RM 2,200.00
RM 2,200.00
RM 2,200.00
RM 2,200.00

Attribute 7 : Phone Number

phones = soup.find_all("div",{"class":"property__agentcontacts"})
for phone in phones:
print(phone.text)

The output as follows :-

+60122004539                                                 +60122004539                                                 +60122004539                                                 +60122004539                                                     +60122004539                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60123190970                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                     +60122811754                                                    

Attribute 8 : Square Feet

feets = soup.find_all("span",{"itemprop":"floorSize"})
for feet in feets:
print(feet.text)

The output as follows :-

3400 sq ft  
2200 sq ft
850 sq ft
850 sq ft
1150 sq ft
1150 sq ft
1094 sq ft
1094 sq ft
814 sq ft
814 sq ft
850 sq ft
850 sq ft
850 sq ft
1400 sq ft
1094 sq ft
1150 sq ft
1150 sq ft
2096 sq ft
2097 sq ft
2097 sq ft

Attribute 9 : Property Name

pnames = soup.find_all("h3", {"class":"property__name"})
for pname in pnames:
print(pname.text)

The output as follows :-

Aman Suria Damansara                          
Aman Suria Damansara
Aman Suria Damansara
Surian Residences
Surian Residences
Mutiara Damansara
Mutiara Damansara
Wangsa Maju
Wangsa Maju
Wangsa Metroview
Wangsa Metroview
Citizen
Citizen
Old Klang Road
Old Klang Road
Quay West Residence
Old Klang Road
Old Klang Road
Residency V
Residency V

Attribute 10 : Accommadation Category

categories = soup.find_all("div",{"itemprop":"accommodationCategory"})
for category in categories:
print(category.text)

The output as follows :-

Fully Furnished                                                                 Semi-D                                                                                                                                   Semi-Furnished                                                          Bungalow                                                                                                                                   Semi-Furnished                                                               House                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Condominium                                             F/H  Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Semi-Furnished Apartment                                                                              Semi-Furnished Apartment                                                                              Semi-Furnished Apartment                                                                              Semi-Furnished Apartment                                                                              House, 2-sty terrace/link house, 3-sty...                                             F/H  Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              Fully Furnished Condominium                                                                              

After we finish extracting the data, we saved them and combine it all into a pandas dataframe.

rents = soup.find_all("div",{"class":"property property--half mb-4"})
rent_no = 0
rents_list = {}
for rent in rents:
date = rent.find("span",{"class":"property__listed-date"}).text
agent_name = rent.find("a",{"class":"property__agentdetails"}).text
location = rent.find("p",{"class":"property__location"}).text
price = rent.find("h4",{"class":"property__price"}).text
no_bedroom = rent.find("span",{"itemprop":"numberOfBedrooms"}).text
no_bathroom = rent.find("span",{"itemprop":"numberOfBathroomsTotal"}).text
phone = rent.find("div",{"class":"property__agentcontacts"}).text
feet = rent.find("span",{"itemprop":"floorSize"}).text
pname = rent.find("h3",{"class":"property__name"}).text
category = rent.find("div",{"itemprop":"accommodationCategory"}).text
rent_no += 1
rents_list[rent_no] = [agent_name,price, no_bedroom,no_bathroom,location, date,phone, feet, pname,category]
rents_df = pd.DataFrame.from_dict(rents_list, orient ="index", columns=['Agent Name','Rental Price','Number of Bedroom', 'Number of Bathroom', 'Location','Date','Number Phone','Square Feet', 'Property Name','Accommodation Category'])
rents_df.head()

Phase 2 : Transformation

Second stage in ETL is transformation where we have to do some data cleaning in order to fix or remove the inaccurate data. For every attribute we identify the incorrect data beforehand to know which cleaning process we have to do. After we’re done cleaning, we proceed with some visualization.

Cleaning task 1 : Change Letters to Lowercase

This is done to ensure that the naming convention of the data is standardized without having mixed-case values in the dataset. We chose “Agent Name” , “Location” and “Property Name” columns to lowercase their values.

Column — Agent Name

Before

1         Estee Chong ( E2250 ) 
2 Estee Chong ( E2250 )
3 Eric Yen ( PEA 2176 )
4 Eric Yen ( PEA 2176 )
5 Eric Yen ( PEA 2176 )
6 Eric Yen ( PEA 2176 )
7 Eric Yen ( PEA 2176 )
8 Eric Yen ( PEA 2176 )
9 Eric Yen ( PEA 2176 )
10 Eric Yen ( PEA 2176 )
11 Eric Yen ( PEA 2176 )
12 Simeon Tang ( REN 00413 )
13 Eric Yen ( PEA 2176 )
14 Eric Yen ( PEA 2176 )
15 Eric Yen ( PEA 2176 )
16 Eric Yen ( PEA 2176 )
17 Eric Yen ( PEA 2176 )
18 Eric Yen ( PEA 2176 )
19 Cedric Goey ( E2528 )
20 Eric Yen ( PEA 2176 )

After

rents_df['Agent Name'] = rents_df['Agent Name'].str.lower()
rents_df['Agent Name']

Changed output :-

1         estee chong ( e2250 ) 
2 estee chong ( e2250 )
3 eric yen ( pea 2176 )
4 eric yen ( pea 2176 )
5 eric yen ( pea 2176 )
6 eric yen ( pea 2176 )
7 eric yen ( pea 2176 )
8 eric yen ( pea 2176 )
9 eric yen ( pea 2176 )
10 eric yen ( pea 2176 )
11 eric yen ( pea 2176 )
12 simeon tang ( ren 00413 )
13 eric yen ( pea 2176 )
14 eric yen ( pea 2176 )
15 eric yen ( pea 2176 )
16 eric yen ( pea 2176 )
17 eric yen ( pea 2176 )
18 eric yen ( pea 2176 )
19 cedric goey ( e2528 )
20 eric yen ( pea 2176 )

Column — Location

Before

1    Mutiara Damansara 
2 Mutiara Damansara
3 Wangsa Maju
4 Wangsa Maju
5 Old Klang Road

After

rents_df[‘Location’] = rents_df[‘Location’].str.lower()
rents_df[‘Location’].head()

Output changed :-

1    mutiara damansara 
2 mutiara damansara
3 wangsa maju
4 wangsa maju
5 old klang road

Column — Property Name

Before

1    Surian Residences 
2 Mutiara Damansara
3 Wangsa Maju
4 Wangsa Metroview
5 Citizen

After

rents_df['Property Name'] = rents_df['Property Name'].str.lower()
rents_df['Property Name'].head()

Output changed :-

1    surian residences 
2 mutiara damansara
3 wangsa maju
4 wangsa metroview
5 citizen

Cleaning task 2 : Handling missing value

Based on the scraped data, there are columns that contain missing values such as “Number of Bedroom” and “Number of Bathroom”. Therefore, cleaning need to be done to ensure that there are no missing values in the column. As we can see, there are two value that can be categorized as unknown which stated as N/A (null) and also values that has “+1”. So we decided to replace the “+1” value with the correct calculated one and N/A (null) value with the mean value for number of bedroom and bathroom.

Before

Column — Number of Bedroom

1       1  
2 1
3 3
4 3
5 3 +1
6 3 +1
7 2
8 2
9 3
10 3
11 3
12 3
13 3 +1
14 3
15 3
16 4
17 4
18 4
19 N/A
20 2

Visualisation :-

After

rents_df['Number of Bedroom'].replace("N/A", np.nan, inplace = True)
rents_df['Number of Bedroom'].replace("4 +1", "5", inplace = True)
rents_df['Number of Bedroom'].replace("3 +1", "4", inplace = True)
missing_data = rents_df['Number of Bedroom']
mean_data = rents_df['Number of Bedroom'].astype("float").mean()
mean_data = round(mean_data)
rents_df['Number of Bedroom'].replace(np.nan, mean_data, inplace = True)
rents_df['Number of Bedroom'].astype(int)

Changed output:-

1     1 
2 1
3 3
4 3
5 4
6 4
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 3
15 3
16 4
17 4
18 4
19 3
20 2

Visualization :-

%matplotlib inline
import matplotlib.pyplot as plt
rents_df['Price-Binned'].value_counts().plot(kind='bar', x='Rental Price', y='Number of Bedroom',figsize=(10, 6), color='lightblue')plt.title('Histogram of the Rental Price for Property')
plt.ylabel('Number of Property')
plt.xlabel('Rental Price')
plt.show()

Column — Number of Bathroom

Before

1      1  
2 N/A
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 3
19 2
20 2

Visualization :-

After

rents_df['Number of Bathroom'].replace("N/A", np.nan, inplace = True) avg_1 = rents_df['Number of Bathroom'].astype("float").mean(axis = 0) avg_1 = round(avg_1) rents_df['Number of Bathroom'].replace(np.nan,avg_1, inplace = True) rents_df['Number of Bathroom'].astype(int)

Changed output :-

1     1 
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 3
19 2
20 2

Visualization :-

%matplotlib inline 
import matplotlib.pyplot as plt
rents_df['Number of Bathroom'].value_counts().plot(kind='bar', x='Number of Bathroom', y='Frequency', figsize=(10, 6), color='lightblue') plt.title('Histogram of the Frequency Number of Bathroom for Property')
plt.ylabel('Frequency')
plt.xlabel('Number of Bathroom')
plt.show()

Cleaning task 3 : Binning data

After we done scraping the rental price data, we can see that there are so many numbers and therefore we don’t know how to estimate the property either it is cheap or expensive. Therefore, we decided to bin the rental price data so that we can classify the property according to their price.

Before

1 RM 2,400.00 
2 RM 2,400.00
3 RM 1,300.00
4 RM 1,300.00
5 RM 2,300.00
6 RM 2,300.00
7 RM 2,200.00
8 RM 2,200.00
9 RM 1,300.00
10 RM 1,300.00
11 RM 1,300.00
12 RM 1,500.00
13 RM 2,300.00
14 RM 1,300.00
15 RM 1,300.00
16 RM 1,500.00
17 RM 1,500.00
18 RM 1,500.00
19 RM 4,000.00
20 RM 2,200.00

Visualization :-

After

rents_df['Rental Price'] = rents_df['Rental Price'].replace('RM ','', regex=True)rents_df['Rental Price'] = rents_df['Rental Price'].replace(',','', regex=True)bins = [0.00, 1500.00, 2500.00, 3500.00, 20000.00]rents_df['Binned'] = pd.cut(rents_df['Rental Price'].astype(float), bins)labels = ['Cheap','Medium','Expensive', 'Super Expensive']rents_df['Price-Binned'] = pd.cut(rents_df['Rental Price'].astype(float), bins,labels=labels)print (rents_df)

Changed output :-

Visualization :-

%matplotlib inlineimport matplotlib.pyplot as pltrents_df['Price-Binned'].value_counts().plot(kind='bar', x='Rental Price', y='Number of Bedroom',figsize=(10, 6), color='lightblue')plt.title('Histogram of the Rental Price for Property')plt.ylabel('Number of Property')plt.xlabel('Rental Price')plt.show()

Cleaning task 4 : Remove irrelevant data

Column — Number Phone

We decided to remove the newlines and ‘+6’ from “Number Phone” columns. It is to ensure the “Number Phone” does not have mixed data type and has integer data type only so that it will become easier for user to read it.

Before

1    \n\n\n+60122004539\n\n\n\n                    ... 
2 \n\n\n+60122004539\n\n\n\n ...
3 \n\n\n+60122811754\n\n\n\n ...
4 \n\n\n+60122811754\n\n\n\n ...
5 \n\n\n+60122811754\n\n\n\n ...

After

rents_df['Number Phone'] =  rents_df['Number Phone'].replace('\n','', regex=True)rents_df['Number Phone'] =  rents_df['Number Phone'].replace('6','', regex=True)rents_df['Number Phone'].head()

Changed output :-

1    0122004539                                    
2 0122004539
3 0122811754
4 0122811754
5 0122811754

Column — Square Feet

For the “Square Feet” column, we can see that there is ‘sq ft’ from “Square Feet” data. So we decided to remove ‘sq ft’ to standardize it to become integer data type values.

Before

1     850 sq ft 
2 850 sq ft
3 1150 sq ft
4 1150 sq ft
5 1094 sq ft

After

rents_df['Square Feet'] =  rents_df['Square Feet'].replace('sq ft','', regex=True)rents_df['Square Feet'].head()

Changed output :-

1     850  
2 850
3 1150
4 1150
5 1094

Column — Date

We decided to remove the newlines and ‘listed on’ from “Date” columns. It is to ensure the “Date” does not have mixed data type and has integer data type only so that it will become easier for user to read it.

Before

1                                    listed 4 hour ago 
2 listed 4 hour ago
3 listed 8 hour ago
4 listed 8 hour ago
5 listed 8 hour ago

After

rents_df['Date'] =  rents_df['Date'].replace('\n','', regex=True)text = ["listed","listed on"]
for char in text:
rents_df['Date'] = rents_df['Date'].str.replace(char, ' ')
rents_df['Date'].head()

Changed output :-

1                                    4 hour ago 
2 4 hour ago
3 8 hour ago
4 8 hour ago
5 8 hour ago

Column — Accommodation Category

We decided to replace the word ‘fully furnished’ to ‘fully-furnished’ in Accommodation Category column to make it easier to remove everything after the words ‘fully-furnished’, ‘semi-furnished’ and ‘unfurnished’. This will allows us to make a visualization of the furnishing types.

Before

1     fully-furnished condominium                   ... 
2 fully-furnished condominium ...
3 fully-furnished condominium ...
4 fully-furnished condominium ...
5 fully-furnished condominium ...

After

rents_df['Accommodation Category'] =  rents_df['Accommodation Category'].replace('fully furnished','fully-furnished', regex=True)rents_df['Accommodation Category']

Changed output :-

1     fully-furnished 
2 fully-furnished
3 fully-furnished
4 fully-furnished
5 fully-furnished

Visualisation :-

%matplotlib inline
import matplotlib.pyplot as plt
rents_df['Accommodation Category'].value_counts().plot(kind='bar', x='Accommodation Category', y='Frequency',
figsize=(10, 6), color='purple')
plt.title('Histogram of the Frequency of Furnishing')
plt.ylabel('Frequency')
plt.xlabel('Furnishing')
plt.show()

Final dataframe after cleaning all of the attributes will look like as shown below

Phase 3 : Load

Last but not least we have to store data in a csv file and download it.

rents_df.to_csv('cleaned_data_rents_df.csv')

Here is the complete code of this project.

That’s all from us! I hope you guys gain something from this article 😊

--

--