HW5_visualization_and_eda

Homework Assignment #5: Visualization and EDA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

From the given data (covid20200416_rev.csv), write code to answer the following questions.

df_covid = pd.read_csv('./covid20200416_rev.csv')
print(len(df_covid))
df_covid.head()

` 212 `

   country  total_cases  total_deaths  total_recovered  Tot_cases_per_1Mpop  Deaths_per_1Mpop         region
0      USA       644089         28529            48701               1946.0              86.0  North America
1    Spain       180659         18812            70853               3864.0             402.0         Europe
2    Italy       165155         21645            38092               2732.0             358.0         Europe
3   France       147863         17167            30955               2265.0             263.0         Europe
4  Germany       134753          3804            72600               1608.0              45.0         Europe

df_covid.describe()

         total_cases  total_deaths  total_recovered  Tot_cases_per_1Mpop  Deaths_per_1Mpop
count     212.000000    212.000000       212.000000           210.000000        162.000000
mean     9826.905660    634.976415      2407.268868           626.120143         33.438457
std     50052.898899   3155.560189     10488.024252          1463.639958        107.465073
min         1.000000      0.000000         0.000000             0.030000          0.030000
25%        39.500000      1.000000         5.000000            18.000000          0.600000
50%       382.500000      6.000000        67.000000           104.500000          3.000000
75%      2270.250000     52.750000       346.250000           465.750000         17.000000
max    644089.000000  28529.000000     77892.000000         11582.000000       1061.000000

(1) Visualize the mean total_cases according to each region by the bar chart.

# your code here
# Mean total_cases per region, sorted ascending so the bars read left to right
mean_total_cases_by_region = df_covid.groupby('region')['total_cases'].mean().sort_values()
x = mean_total_cases_by_region.index.tolist()
y = mean_total_cases_by_region.values.tolist()

plt.title('Mean Total Cases by Region')
plt.xlabel("Region")
plt.ylabel("Mean cases")
plt.bar(x, y)
plt.xticks(rotation=-15)
plt.show()

output_6_0
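
As a quick sanity check on the aggregation, the same `groupby` can be run on only the five rows that `df_covid.head()` printed earlier. This is a minimal sketch: `sample` is a hypothetical stand-in for the CSV, so its means differ from the chart, which uses every row of the file.

```python
import pandas as pd

# Hypothetical stand-in: only the five rows shown by df_covid.head()
sample = pd.DataFrame({
    'country': ['USA', 'Spain', 'Italy', 'France', 'Germany'],
    'total_cases': [644089, 180659, 165155, 147863, 134753],
    'region': ['North America', 'Europe', 'Europe', 'Europe', 'Europe'],
})

# Same aggregation as the answer: mean total_cases per region, sorted ascending
mean_by_region = sample.groupby('region')['total_cases'].mean().sort_values()
print(mean_by_region)
```

Europe averages its four rows, (180659 + 165155 + 147863 + 134753) / 4 = 157107.5, while North America has a single row in this sample.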

(2) Plot the distribution (histogram) of total cases per 1M population (Tot_cases_per_1Mpop) using the Seaborn package.

  • Note: Before visualizing the distribution, you may need to filter out rows with missing values (NaN).

# your code here 2-1 : find NaN
print("duplicated data?:", df_covid.duplicated().sum())
sns.heatmap(df_covid.isnull(), cbar=False)
plt.show()

` duplicated data?: 0 `

output_8_1

# your code here 2-2 : dropna
df_covid = df_covid.dropna(subset=['Tot_cases_per_1Mpop'], axis='index')

# Count the NaN values left in the column (should be 0 after dropna)
count = df_covid['Tot_cases_per_1Mpop'].isnull().sum()
print("NaN of 'Tot_cases_per_1Mpop':", count)

sns.heatmap(df_covid.isnull(), cbar=False)
plt.show()

` NaN of 'Tot_cases_per_1Mpop': 0 `

output_9_1
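
The heatmap shows where values are missing; a numeric count per column makes the same check explicit. The sketch below is illustrative only: `sample` is a hypothetical three-row frame copied from rows printed in this post, and it drops on Deaths_per_1Mpop because the rows with a missing Tot_cases_per_1Mpop are not shown anywhere in the output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: three rows copied from the tables printed in this post
sample = pd.DataFrame({
    'country': ['USA', 'Nepal', 'Yemen'],
    'Tot_cases_per_1Mpop': [1946.0, 0.5, 0.03],
    'Deaths_per_1Mpop': [86.0, np.nan, np.nan],
})

# Number of missing values in each column
nan_counts = sample.isnull().sum()
print(nan_counts)

# Drop rows where the chosen column is missing; other columns are ignored
cleaned = sample.dropna(subset=['Deaths_per_1Mpop'])
print(len(cleaned))
```

On the full data, the `count` row of the first `df_covid.describe()` output already hints at the same thing: Tot_cases_per_1Mpop has 210 values out of 212 rows, and Deaths_per_1Mpop has 162.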

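# your code here 2-3 : draw plot
# sns.distplot is deprecated and missing from recent Seaborn releases;
# histplot with kde=True draws a comparable histogram with a density curve
sns.histplot(df_covid['Tot_cases_per_1Mpop'], kde=True)
plt.show()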

output_10_0

(3) Plot the scatter plot according to the following conditions

  • x values: ‘total_cases’
  • y values: ‘total_deaths’

# your code here 3-1 : describe dataframe
df_covid.describe()

         total_cases  total_deaths  total_recovered  Tot_cases_per_1Mpop  Deaths_per_1Mpop
count     210.000000    210.000000       210.000000           210.000000        162.000000
mean     9917.061905    640.957143      2427.128571           626.120143         33.438457
std     50283.196920   3170.021744     10536.046314          1463.639958        107.465073
min         1.000000      0.000000         0.000000             0.030000          0.030000
25%        41.500000      1.000000         5.250000            18.000000          0.600000
50%       382.500000      6.000000        67.000000           104.500000          3.000000
75%      2426.750000     58.250000       333.000000           465.750000         17.000000
max    644089.000000  28529.000000     77892.000000         11582.000000       1061.000000

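# your code here 3-2 : draw plot

x = df_covid['total_cases']
y = df_covid['total_deaths']

plt.scatter(x, y, marker='+', s=150, color='red')
# Label the axes so the plot states which column is on which axis
plt.xlabel('total_cases')
plt.ylabel('total_deaths')
plt.show()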

output_13_0

(4) Plot the heatmap of correlations among all numerical variables. Which variable (column) is the most correlated with the value “total_cases”?

# your code here : 4-1 calculate corr
# Keep only numeric columns; recent pandas versions raise an error when
# corr() meets string columns such as country and region
df_corr = df_covid.select_dtypes(include='number')
df_corr.corr()

                     total_cases  total_deaths  total_recovered  Tot_cases_per_1Mpop  Deaths_per_1Mpop
total_cases             1.000000      0.885806         0.652631             0.156965          0.220542
total_deaths            0.885806      1.000000         0.664417             0.209220          0.379455
total_recovered         0.652631      0.664417         1.000000             0.160642          0.244011
Tot_cases_per_1Mpop     0.156965      0.209220         0.160642             1.000000          0.852964
Deaths_per_1Mpop        0.220542      0.379455         0.244011             0.852964          1.000000

# your code here : 4-2 draw heatmap
h = sns.heatmap(df_corr.corr(), annot=True, cmap='RdYlGn_r')
h.set_xticklabels(h.get_xticklabels(), rotation=-15, fontsize='small')
plt.show()

output_16_0
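
The question also asks which column is most correlated with total_cases. One way to read that off programmatically is sketched below; to stay self-contained, `corr` is rebuilt by hand from the correlation values printed above rather than recomputed from the CSV.

```python
import pandas as pd

cols = ['total_cases', 'total_deaths', 'total_recovered',
        'Tot_cases_per_1Mpop', 'Deaths_per_1Mpop']

# Correlation matrix copied from the corr() output printed above
corr = pd.DataFrame([
    [1.000000, 0.885806, 0.652631, 0.156965, 0.220542],
    [0.885806, 1.000000, 0.664417, 0.209220, 0.379455],
    [0.652631, 0.664417, 1.000000, 0.160642, 0.244011],
    [0.156965, 0.209220, 0.160642, 1.000000, 0.852964],
    [0.220542, 0.379455, 0.244011, 0.852964, 1.000000],
], index=cols, columns=cols)

# Drop the self-correlation (always 1.0), then rank the remaining columns
ranked = corr['total_cases'].drop('total_cases').sort_values(ascending=False)
most_correlated = ranked.idxmax()
print(most_correlated, ranked.max())
```

On these values the answer is total_deaths (0.885806), followed by total_recovered (0.652631).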

(5) Plot the boxplots of mean total_cases according to each “region”.

# your code here
sns.catplot(x='region', y='total_cases', kind='box', data=df_covid)
plt.xticks(rotation = -15)
plt.show()

output_18_0
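
A box plot summarizes each region by its median and quartiles rather than its mean, so the numbers behind the boxes can be cross-checked with a grouped quantile. This is a minimal sketch on a hypothetical `sample` frame holding only the five rows from `df_covid.head()`, not the full data.

```python
import pandas as pd

# Hypothetical stand-in: only the five rows shown by df_covid.head()
sample = pd.DataFrame({
    'country': ['USA', 'Spain', 'Italy', 'France', 'Germany'],
    'total_cases': [644089, 180659, 165155, 147863, 134753],
    'region': ['North America', 'Europe', 'Europe', 'Europe', 'Europe'],
})

# Quartiles of total_cases per region: one row per region, one column per quantile
summary = (
    sample.groupby('region')['total_cases']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```

The 0.5 column is the line drawn inside each box, and the 0.25 and 0.75 columns are the box edges.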

Open question: Can you find any other insights from the data?

#your code here : Open question-1
sns.pairplot(data=df_covid, hue='region')
plt.show()

output_20_1

#your code here : Open question-2
df_covid.sort_values(by='total_deaths', ascending=False)

         country  total_cases  total_deaths  total_recovered  Tot_cases_per_1Mpop  Deaths_per_1Mpop         region
  0          USA       644089         28529            48701              1946.00              86.0  North America
  2        Italy       165155         21645            38092              2732.00             358.0         Europe
  1        Spain       180659         18812            70853              3864.00             402.0         Europe
  3       France       147863         17167            30955              2265.00             263.0         Europe
  5           UK        98476         12868              344              1451.00             190.0         Europe
...          ...          ...           ...              ...                  ...               ...            ...
175        Nepal           16             0                1                 0.50               NaN           Asia
176     Dominica           16             0                8               222.00               NaN  North America
178      Namibia           16             0                3                 6.00               NaN         Africa
179  Saint Lucia           15             0               11                82.00               NaN  North America
211        Yemen            1             0                0                 0.03               NaN           Asia

210 rows × 7 columns

#your code here : Open question-3
df_covid.sort_values(by='total_recovered', ascending=False)

              country  total_cases  total_deaths  total_recovered  Tot_cases_per_1Mpop  Deaths_per_1Mpop         region
  6             China        82341          3342            77892                57.00               2.0           Asia
  4           Germany       134753          3804            72600              1608.00              45.0         Europe
  1             Spain       180659         18812            70853              3864.00             402.0         Europe
  7              Iran        76389          4777            49933               909.00              57.0           Asia
  0               USA       644089         28529            48701              1946.00              86.0  North America
...               ...          ...           ...              ...                  ...               ...            ...
172            Belize           18             2                0                45.00               5.0  North America
158             Haiti           41             3                0                 4.00               0.3  North America
157     Guinea-Bissau           43             0                0                22.00               NaN         Africa
149  French Polynesia           55             0                0               196.00               NaN        Oceania
211             Yemen            1             0                0                 0.03               NaN           Asia

210 rows × 7 columns

#your code here : Open question-4
df_covid_no_usa = df_covid[df_covid['country'] != 'USA']
sns.pairplot(data=df_covid_no_usa, hue='region')
plt.show()

output_23_1

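Conclusion

  • North America (the USA) has far more infections than any other country.
  • Excluding the USA, European countries (Italy in particular) have high death rates relative to their total cases.
  • Asian countries (China, Iran) tend to have a higher recovery rate for a comparable number of cases.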
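
The ratios behind these conclusions can be checked directly from the rows printed in the sorted tables. This is a rough sketch: `sample` holds only five hand-copied rows, and `death_rate` and `recovery_rate` are hypothetical column names introduced here, not columns of the CSV.

```python
import pandas as pd

# Rows copied from the sorted tables printed above
sample = pd.DataFrame({
    'country': ['USA', 'Italy', 'Spain', 'China', 'Iran'],
    'total_cases': [644089, 165155, 180659, 82341, 76389],
    'total_deaths': [28529, 21645, 18812, 3342, 4777],
    'total_recovered': [48701, 38092, 70853, 77892, 49933],
}).set_index('country')

# Hypothetical helper columns: share of reported cases that died or recovered
sample['death_rate'] = sample['total_deaths'] / sample['total_cases']
sample['recovery_rate'] = sample['total_recovered'] / sample['total_cases']

print(sample[['death_rate', 'recovery_rate']].round(3))
```

Among these five rows, Italy has the highest deaths-to-cases ratio, and China and Iran have the highest recovered-to-cases ratios.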
