HW5_visualization_and_eda


Homework Assignment #5 : Visualization and EDA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Using the given data (covid20200416_rev.csv), write code to answer the following questions.

df_covid = pd.read_csv('./covid20200416_rev.csv')
print(len(df_covid))
df_covid.head()

` 212 `

	country	total_cases	total_deaths	total_recovered	Tot_cases_per_1Mpop	Deaths_per_1Mpop	region
0	USA	644089	28529	48701	1946.0	86.0	North America
1	Spain	180659	18812	70853	3864.0	402.0	Europe
2	Italy	165155	21645	38092	2732.0	358.0	Europe
3	France	147863	17167	30955	2265.0	263.0	Europe
4	Germany	134753	3804	72600	1608.0	45.0	Europe

df_covid.describe()

	total_cases	total_deaths	total_recovered	Tot_cases_per_1Mpop	Deaths_per_1Mpop
count	212.000000	212.000000	212.000000	210.000000	162.000000
mean	9826.905660	634.976415	2407.268868	626.120143	33.438457
std	50052.898899	3155.560189	10488.024252	1463.639958	107.465073
min	1.000000	0.000000	0.000000	0.030000	0.030000
25%	39.500000	1.000000	5.000000	18.000000	0.600000
50%	382.500000	6.000000	67.000000	104.500000	3.000000
75%	2270.250000	52.750000	346.250000	465.750000	17.000000
max	644089.000000	28529.000000	77892.000000	11582.000000	1061.000000

(1) Visualize the mean total_cases for each region with a bar chart.

# your code here
# Mean total_cases per region, sorted ascending so the bars grow from left to right
mean_total_cases_by_region = df_covid.groupby('region')['total_cases'].mean().sort_values()
x = mean_total_cases_by_region.index.tolist()
y = mean_total_cases_by_region.values.tolist()

plt.title('Mean Total Cases by Each Region')
plt.xlabel("Region")
plt.ylabel("Mean cases")
plt.bar(x, y)
plt.xticks(rotation=-15)  # tilt the region names so they do not overlap
plt.show()

output_6_0
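A regional mean is sensitive to a single very large country, so it can help to look at the median next to it. The following is a minimal sketch on a made-up toy frame; the region labels and numbers are illustrative assumptions, not values from covid20200416_rev.csv.

```python
import pandas as pd

# Toy data for illustration only; not the real covid20200416_rev.csv values
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'total_cases': [10, 20, 600, 100, 110, 120],
})

# Compute the mean and the median per region in one pass
summary = toy.groupby('region')['total_cases'].agg(['mean', 'median'])
print(summary)
```

In region A the one large value pulls the mean far above the median, while in region B the two agree. On the real data the same call would be applied to df_covid.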

(2) Plot the distribution (histogram) of total cases per 1M population (Tot_cases_per_1Mpop) using the Seaborn package.

Note: Before plotting the distribution, you may need to filter out rows with missing values (NaN).

# your code here 2-1 : find NaN
print("duplicated data?:", df_covid.duplicated().sum())
sns.heatmap(df_covid.isnull(), cbar=False)
plt.show()

` duplicated data?: 0 `

output_8_1

# your code here 2-2 : dropna
# Drop the rows where Tot_cases_per_1Mpop is missing
df_covid = df_covid.dropna(subset=['Tot_cases_per_1Mpop'], axis='index')

# Confirm that no NaN remains in that column
count = df_covid['Tot_cases_per_1Mpop'].isnull().sum()
print("NaN of 'Tot_cases_per_1Mpop':", count)

sns.heatmap(df_covid.isnull(), cbar=False)
plt.show()

` NaN of 'Tot_cases_per_1Mpop': 0 `

output_9_1
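The missing-value check and the column-specific drop can also be tried in isolation. This is a small sketch on a hypothetical three-row frame (the country names X, Y, Z and the numbers are made up); it only uses `isnull().sum()` and `dropna(subset=...)`, the same calls as above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real data
toy = pd.DataFrame({
    'country': ['X', 'Y', 'Z'],
    'Tot_cases_per_1Mpop': [12.0, np.nan, 3.5],
    'Deaths_per_1Mpop': [np.nan, np.nan, 0.1],
})

# Number of missing values in each column
nan_counts = toy.isnull().sum()
print(nan_counts)

# Drop only the rows where Tot_cases_per_1Mpop is missing;
# NaN values in other columns are left untouched
cleaned = toy.dropna(subset=['Tot_cases_per_1Mpop'])
print(cleaned)
```

Because `subset` limits the check to one column, row X survives even though its Deaths_per_1Mpop is NaN. This matches the real data, where Deaths_per_1Mpop still has only 162 non-null values after the drop.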

# your code here 2-3 : draw plot
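# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement
sns.histplot(df_covid['Tot_cases_per_1Mpop'], kde=True, stat='density')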
plt.show()

output_10_0

(3) Draw a scatter plot with the following settings:

x values: ‘total_cases’
y values: ‘total_deaths’

# your code here 3-1 : describe dataframe
df_covid.describe()

	total_cases	total_deaths	total_recovered	Tot_cases_per_1Mpop	Deaths_per_1Mpop
count	210.000000	210.000000	210.000000	210.000000	162.000000
mean	9917.061905	640.957143	2427.128571	626.120143	33.438457
std	50283.196920	3170.021744	10536.046314	1463.639958	107.465073
min	1.000000	0.000000	0.000000	0.030000	0.030000
25%	41.500000	1.000000	5.250000	18.000000	0.600000
50%	382.500000	6.000000	67.000000	104.500000	3.000000
75%	2426.750000	58.250000	333.000000	465.750000	17.000000
max	644089.000000	28529.000000	77892.000000	11582.000000	1061.000000

# your code here 3-2 : draw plot

x = df_covid['total_cases']
y = df_covid['total_deaths']

plt.scatter(x, y, marker = '+', s=150, color='red')
plt.show()

output_13_0
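On linear axes one very large point (the USA) pushes most countries into the lower-left corner. One possible variation, shown here as a sketch rather than as part of the assignment, is to use logarithmic axes. The sample below reuses the five rows from the df_covid.head() output; the `Agg` backend line is an assumption made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The five rows shown by df_covid.head() earlier in this post
sample = pd.DataFrame({
    'country': ['USA', 'Spain', 'Italy', 'France', 'Germany'],
    'total_cases': [644089, 180659, 165155, 147863, 134753],
    'total_deaths': [28529, 18812, 21645, 17167, 3804],
})

# A log axis cannot show zeros, so keep strictly positive rows only
positive = sample[(sample['total_cases'] > 0) & (sample['total_deaths'] > 0)]

fig, ax = plt.subplots()
ax.scatter(positive['total_cases'], positive['total_deaths'],
           marker='+', s=150, color='red')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('total_cases')
ax.set_ylabel('total_deaths')
```

With the full df_covid, the positive-value filter matters because many countries report 0 deaths.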

(4) Plot the heatmap of correlations among all numerical variables. Which variable (column) is the most correlated with the value “total_cases”?

# your code here : 4-1 calculate corr
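# Keep only the numeric columns; corr() raises an error on text columns such as country in pandas 2.0+
df_corr = df_covid.select_dtypes(include='number')
df_corr.corr()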

	total_cases	total_deaths	total_recovered	Tot_cases_per_1Mpop	Deaths_per_1Mpop
total_cases	1.000000	0.885806	0.652631	0.156965	0.220542
total_deaths	0.885806	1.000000	0.664417	0.209220	0.379455
total_recovered	0.652631	0.664417	1.000000	0.160642	0.244011
Tot_cases_per_1Mpop	0.156965	0.209220	0.160642	1.000000	0.852964
Deaths_per_1Mpop	0.220542	0.379455	0.244011	0.852964	1.000000

# your code here :4-2 draw heatmap
h = sns.heatmap(df_corr.corr(), annot=True, cmap='RdYlGn_r')
h.set_xticklabels(h.get_xticklabels(),rotation=-15, fontsize='small')
plt.show()

output_16_0
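To answer the question in (4) directly, the total_cases row of the correlation table can be ranked after dropping the self-correlation. The sketch below copies that row from the table printed above; on the real data, `df_corr.corr()['total_cases']` would presumably take the place of the hand-typed Series.

```python
import pandas as pd

cols = ['total_cases', 'total_deaths', 'total_recovered',
        'Tot_cases_per_1Mpop', 'Deaths_per_1Mpop']

# total_cases row copied from the correlation table above
total_cases_corr = pd.Series(
    [1.000000, 0.885806, 0.652631, 0.156965, 0.220542], index=cols)

# Drop the trivial self-correlation, then rank the remaining columns
ranked = total_cases_corr.drop('total_cases').sort_values(ascending=False)
most_correlated = ranked.index[0]
print(most_correlated, ranked.iloc[0])
```

By this ranking, total_deaths (0.885806) is the column most correlated with total_cases, followed by total_recovered (0.652631).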

(5) Plot boxplots of total_cases for each “region”.

# your code here
sns.catplot(x='region', y='total_cases', kind='box', data=df_covid)
plt.xticks(rotation = -15)
plt.show()

output_18_0
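To read the boxplot, it helps to know what each box encodes: the quartiles, plus whiskers that by default reach at most 1.5 times the interquartile range beyond the box. The sketch below reproduces those numbers on a made-up toy frame (regions A and B and their values are illustrative assumptions).

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'region': ['A'] * 5 + ['B'] * 5,
    'total_cases': [1, 2, 3, 4, 100, 10, 20, 30, 40, 50],
})

# Quartiles per region: the bottom, middle line, and top of each box
q = toy.groupby('region')['total_cases'].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ['q1', 'median', 'q3']

# Points above q3 + 1.5 * IQR are drawn as individual outliers
q['upper_fence'] = q['q3'] + 1.5 * (q['q3'] - q['q1'])
outliers = toy[toy['total_cases'] > toy['region'].map(q['upper_fence'])]

print(q)
print(outliers)
```

In the toy data only the value 100 in region A lies above its fence, so it would be the single point drawn beyond the whisker.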

Open question: Can you find any other insight from the data?

#your code here : Open question-1
sns.pairplot(data=df_covid, hue='region')
plt.show()

output_20_1

#your code here : Open question-2
df_covid.sort_values(by='total_deaths', ascending=False)

	country	total_cases	total_deaths	total_recovered	Tot_cases_per_1Mpop	Deaths_per_1Mpop	region
0	USA	644089	28529	48701	1946.00	86.0	North America
2	Italy	165155	21645	38092	2732.00	358.0	Europe
1	Spain	180659	18812	70853	3864.00	402.0	Europe
3	France	147863	17167	30955	2265.00	263.0	Europe
5	UK	98476	12868	344	1451.00	190.0	Europe
...	...	...	...	...	...	...	...
175	Nepal	16	0	1	0.50	NaN	Asia
176	Dominica	16	0	8	222.00	NaN	North America
178	Namibia	16	0	3	6.00	NaN	Africa
179	Saint Lucia	15	0	11	82.00	NaN	North America
211	Yemen	1	0	0	0.03	NaN	Asia

210 rows × 7 columns

#your code here : Open question-3
df_covid.sort_values(by='total_recovered', ascending=False)

	country	total_cases	total_deaths	total_recovered	Tot_cases_per_1Mpop	Deaths_per_1Mpop	region
6	China	82341	3342	77892	57.00	2.0	Asia
4	Germany	134753	3804	72600	1608.00	45.0	Europe
1	Spain	180659	18812	70853	3864.00	402.0	Europe
7	Iran	76389	4777	49933	909.00	57.0	Asia
0	USA	644089	28529	48701	1946.00	86.0	North America
...	...	...	...	...	...	...	...
172	Belize	18	2	0	45.00	5.0	North America
158	Haiti	41	3	0	4.00	0.3	North America
157	Guinea-Bissau	43	0	0	22.00	NaN	Africa
149	French Polynesia	55	0	0	196.00	NaN	Oceania
211	Yemen	1	0	0	0.03	NaN	Asia

210 rows × 7 columns

#your code here : Open question-4
df_covid_no_usa = df_covid.drop(df_covid[df_covid['country'] == 'USA'].index)
sns.pairplot(data=df_covid_no_usa, hue='region')
plt.show()

output_23_1

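Conclusion

North America (the USA) has far more infections than any other country.
Excluding the USA, European countries (Italy in particular) have high death rates relative to total cases.
Asian countries (China, Iran) tend to have a higher recovery rate for the same number of cases.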
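The conclusions about death and recovery rates can be checked with simple ratios. The sketch below uses only rows that appear in the tables printed above; the column names `death_rate` and `recovery_rate` are introduced here for illustration and are not part of the dataset.

```python
import pandas as pd

# Rows copied from the sorted tables above:
# (country, total_cases, total_deaths, total_recovered)
rows = [
    ('USA', 644089, 28529, 48701),
    ('Spain', 180659, 18812, 70853),
    ('Italy', 165155, 21645, 38092),
    ('France', 147863, 17167, 30955),
    ('Germany', 134753, 3804, 72600),
    ('UK', 98476, 12868, 344),
    ('China', 82341, 3342, 77892),
    ('Iran', 76389, 4777, 49933),
]
top = pd.DataFrame(
    rows, columns=['country', 'total_cases', 'total_deaths', 'total_recovered']
).set_index('country')

# Share of confirmed cases that ended in death or recovery in this snapshot
top['death_rate'] = top['total_deaths'] / top['total_cases']
top['recovery_rate'] = top['total_recovered'] / top['total_cases']

print(top[['death_rate', 'recovery_rate']]
      .round(3)
      .sort_values('death_rate', ascending=False))
```

Among these eight countries, Italy has the highest deaths-to-cases ratio and China the highest recovered-to-cases ratio. These are crude ratios from a single snapshot, and recovered counts depend on how each country reports them (the UK lists only 344), so they are best read as rough indicators.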
