Web scraping using Beautiful Soup & Python: Wikipedia (II)
After my first test of how to use BeautifulSoup, the next step is to try to answer a question using information from different Wikipedia pages, to check how easily it can solve a common task.
Some questions that could be answered
- Are the inhabitants of the happiest countries in the world living in fully developed countries? Is living in a context where everyday needs are easily met a significant component of happiness?
Step by step
- Collect the data from the World Happiness Report page using web scraping
- Collect the data from the Human Development Index page using web scraping
- Look for a correlation between the two tables using graphs, the easiest and most direct way to get a first idea of how the data behaves
- Compute the correlation coefficient to quantify the relationship
Collecting the data
url = "https://en.wikipedia.org/wiki/World_Happiness_Report"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
tables = soup.find_all("table")
happiness_table = soup.find('table', class_='wikitable sortable')
table_rows = happiness_table.find_all('tr')
data = []
for row in table_rows:
data.append([t.text.strip() for t in row.find_all('td')])
df_happiness = pd.DataFrame(data, columns=['Happiness_ranking', 'country', 'score', 'GPD', 'social_support', 'healthy_life_expectancy', 'freedom', 'generosity', 'perception_of_corruption'])
df_happiness = df_happiness[~df_happiness['country'].isnull()]
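A quick sanity check helps confirm that the rows were captured before moving on; the exact output depends on the live page, so treat it as illustrative:
# Inspect the shape and the first rows of the scraped table
print(df_happiness.shape)
print(df_happiness.head())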
url = "https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
tables = soup.find_all("table")
HDI_table = soup.find('table', class_='wikitable sortable')
table_rows_2 = HDI_table.find_all('tr')
data2 = []
for row in table_rows_2:
data2.append([t.text.strip() for t in row.find_all('td')])
df_HDI = pd.DataFrame(data2, columns=['HDI_ranking','1','country','3','4'])
df_HDI = df_HDI.drop(['1','3','4'], axis=1)
df_HDI = df_HDI[~df_HDI['country'].isnull()]
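One caveat worth hedging: Wikipedia country names sometimes carry footnote markers or stray whitespace, which would prevent the later merge on 'country' from matching. A small optional clean-up, not part of the original snippet, could look like this:
# Strip bracketed footnote markers (e.g. "[a]") and surrounding whitespace from country names
for df in (df_happiness, df_HDI):
    df['country'] = df['country'].str.replace(r'\[.*?\]', '', regex=True).str.strip()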
Joining data
# joining data
df_result = pd.merge(df_happiness, df_HDI, on='country', how='left')
# Convert both rankings to integers; countries without a match become 0
df_result.HDI_ranking = pd.to_numeric(df_result.HDI_ranking, errors='coerce').fillna(0).astype(int)
df_result.Happiness_ranking = pd.to_numeric(df_result.Happiness_ranking, errors='coerce').fillna(0).astype(int)
# Checking the change
df_result.dtypes
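Since the merge is a left join and unmatched countries were filled with 0, it is also worth checking how many rows failed to match; the exact list depends on the live pages, so this is only a suggested check:
# Countries from the happiness table that found no HDI ranking
unmatched = df_result[df_result['HDI_ranking'] == 0]
print(len(unmatched), "countries without an HDI match")
print(unmatched['country'].tolist())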
Analysis
Statistical analysis is a process of understanding how variables in a dataset relate to each other and how those relationships depend on other variables. Visualization can be a core component of this process because, when data are visualized properly, the human visual system can see trends and patterns that indicate a relationship.
Relating the happiness ranking to the Human Development Index using a scatterplot
The scatter plot is a mainstay of statistical visualization. It depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. This depiction allows the eye to infer a substantial amount of information about whether there is any meaningful relationship between them.
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x='Happiness_ranking', y='HDI_ranking', data=df_result)
plt.show()
What is the correlation between the happiness ranking and the Human Development Index?
df_result['Happiness_ranking'].corr(df_result['HDI_ranking'])
0.7446511487457009
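Because both variables are rankings rather than raw scores, a rank-based coefficient such as Spearman's is arguably the more natural choice; this is an additional check beyond the original analysis, and the exact value depends on the scraped data:
# Spearman correlation operates on ranks and captures any monotonic relationship
df_result['Happiness_ranking'].corr(df_result['HDI_ranking'], method='spearman')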
Conclusion
Are the inhabitants of the happiest countries in the world living in fully developed countries? Is living in a context where everyday needs are easily met a significant component of happiness?
There is a correlation between the happiness ranking and the HDI ranking, which can be appreciated visually in the scatterplot and numerically via the correlation coefficient.
We can conclude that people who live in more developed countries tend to be happier than people who struggle to meet their everyday needs.