
Upstage AI Lab 2nd Cohort

Upstage AI Lab 2nd Cohort [Day015-022] EDA Group Project (1) Basic Statistics

Thursday, January 4, 2024, Day_017

Hyeyun's work

 

1. histplot for BMI

 

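import matplotlib.pyplot as plt
import seaborn as sns

# df_diabetes_binary is assumed to be loaded earlier in the notebook (not shown in this post)
plt.figure(figsize=(8,6))
sns.histplot(data=df_diabetes_binary, x='BMI', bins=15)
plt.title('BMI Distribution of All Subjects', pad=20)
plt.show()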

# Same plot split by diabetes status, with 15 bins
plt.figure(figsize=(8,6))
sns.histplot(data=df_diabetes_binary, x='BMI', bins=15, hue='Diabetes_bin_cat')
plt.title('BMI Distribution of Diabetic/Non-Diabetic (15 bins)', pad=20)
plt.show()

# Same plot again with coarser bins (10) for comparison
plt.figure(figsize=(8,6))
sns.histplot(data=df_diabetes_binary, x='BMI', bins=10, hue='Diabetes_bin_cat')
plt.title('BMI Distribution of Diabetic/Non-Diabetic (10 bins)', pad=20)
plt.show()

plt.figure(figsize=(8,6))
sns.histplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 0], x='BMI', bins=15)
plt.title('BMI Distribution of Non-Diabetic Subjects', pad=20)
plt.show()

 

 

2. countplot for Mental Health

 

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary, x='MentHlth')
plt.title('Mental Health Distribution of All Subjects', pad=20)
plt.show()

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary, x='MentHlth', hue='Diabetes_bin_cat')
plt.title('Mental Health Distribution of Non-Diabetic/Diabetic Subjects', pad=20)
plt.show()

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 0], x='MentHlth', hue='Diabetes_bin_cat')
plt.title('Mental Health Distribution of Non-Diabetic Subjects', pad=20)
plt.show()

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 1], x='MentHlth', hue='Diabetes_bin_cat')
plt.title('Mental Health Distribution of Diabetic Subjects', pad=20)
plt.show()

 

 

3. countplot for Physical Health

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary, x='PhysHlth')
plt.title('Physical Health Distribution of All Subjects', pad=20)
plt.show()

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary, x='PhysHlth', hue='Diabetes_bin_cat')
plt.title('Physical Health Distribution of Non-Diabetic/Diabetic Subjects', pad=20)
plt.show()

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 0], x='PhysHlth', hue='Diabetes_bin_cat')
plt.title('Physical Health Distribution of Non-Diabetic Subjects', pad=20)
plt.show()

plt.figure(figsize=(8,6))
sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 1], x='PhysHlth', hue='Diabetes_bin_cat')
plt.title('Physical Health Distribution of Diabetic Subjects', pad=20)
plt.show()

 

* To do next: also look at the Mental Health / Physical Health distributions after excluding rows whose value is 0 or 30
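A minimal sketch of that follow-up, not something from the original notebook. The helper name `exclude_boundary_days`, the toy DataFrame, and the 'Non-Diabetic'/'Diabetic' labels are assumptions for illustration; in the real notebook the filter would be applied to df_diabetes_binary instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def exclude_boundary_days(df, column, boundary_values=(0, 30)):
    """Return only the rows whose `column` value is not in `boundary_values`."""
    return df[~df[column].isin(boundary_values)]


# Toy stand-in for df_diabetes_binary (made-up values, for illustration only).
toy_df = pd.DataFrame({
    "MentHlth": [0, 0, 2, 5, 30, 15, 0, 30, 7, 2],
    "PhysHlth": [0, 30, 1, 0, 30, 10, 3, 0, 14, 2],
    "Diabetes_bin_cat": ["Non-Diabetic", "Diabetic"] * 5,
})

for column in ["MentHlth", "PhysHlth"]:
    # Filter each column separately so a 0 in one does not drop the other.
    trimmed = exclude_boundary_days(toy_df, column)
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.countplot(data=trimmed, x=column, hue="Diabetes_bin_cat", ax=ax)
    ax.set_title(f"{column} Distribution Excluding 0 and 30", pad=20)
    plt.close(fig)  # replace with plt.show() in a notebook
```

Filtering per column (rather than dropping a row when either column is 0 or 30) keeps more data in each plot; whether that is the right choice depends on what the comparison is meant to show.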

 

 

4. countplot for General Health

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

sns.countplot(data=df_diabetes_binary, x='GenHlth', ax=axes[0, 0])
axes[0, 0].set_title('General Health Distribution of All Subjects', pad=20)

sns.countplot(data=df_diabetes_binary, x='GenHlth', hue='Diabetes_bin_cat', ax=axes[0, 1])
axes[0, 1].set_title('General Health Distribution of Non-Diabetic/Diabetic Subjects', pad=20)

sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 0], x='GenHlth', hue='Diabetes_bin_cat', ax=axes[1, 0])
axes[1, 0].set_title('General Health Distribution of Non-Diabetic Subjects', pad=20)

sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 1], x='GenHlth', hue='Diabetes_bin_cat', ax=axes[1, 1])
axes[1, 1].set_title('General Health Distribution of Diabetic Subjects', pad=20)

plt.tight_layout(pad=2.0)
plt.show()

 

5. countplot for Age

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

sns.countplot(data=df_diabetes_binary, x='Age', ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution of All Subjects', pad=20)

sns.countplot(data=df_diabetes_binary, x='Age', hue='Diabetes_bin_cat', ax=axes[0, 1])
axes[0, 1].set_title('Age Distribution of Non-Diabetic/Diabetic Subjects', pad=20)

sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 0], x='Age', hue='Diabetes_bin_cat', ax=axes[1, 0])
axes[1, 0].set_title('Age Distribution of Non-Diabetic Subjects', pad=20)

sns.countplot(data=df_diabetes_binary[df_diabetes_binary['Diabetes_binary'] == 1], x='Age', hue='Diabetes_bin_cat', ax=axes[1, 1])
axes[1, 1].set_title('Age Distribution of Diabetic Subjects', pad=20)

plt.tight_layout(pad=2.0)
plt.show()
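The General Health and Age sections repeat the same 2x2 layout, so it could be wrapped in a helper. This is only a sketch: the function name `plot_count_grid` and the toy DataFrame are hypothetical, and the 'Non-Diabetic'/'Diabetic' labels for Diabetes_bin_cat are assumed. If the dataset has 'Education' and 'Income' columns (assumed from the section titles in this post), the same call could be reused for them.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_count_grid(df, column, label,
                    status_col="Diabetes_binary", hue_col="Diabetes_bin_cat"):
    """Draw the four countplots (all / split by hue / non-diabetic / diabetic)."""
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    panels = [
        (df, None, f"{label} Distribution of All Subjects"),
        (df, hue_col, f"{label} Distribution of Non-Diabetic/Diabetic Subjects"),
        (df[df[status_col] == 0], hue_col, f"{label} Distribution of Non-Diabetic Subjects"),
        (df[df[status_col] == 1], hue_col, f"{label} Distribution of Diabetic Subjects"),
    ]
    for ax, (subset, hue, title) in zip(axes.flat, panels):
        sns.countplot(data=subset, x=column, hue=hue, ax=ax)
        ax.set_title(title, pad=20)
    fig.tight_layout(pad=2.0)
    return fig, axes


# Toy stand-in for df_diabetes_binary (made-up values, for illustration only).
toy_df = pd.DataFrame({
    "GenHlth": [1, 2, 3, 3, 4, 5, 2, 3],
    "Diabetes_binary": [0, 0, 0, 1, 1, 1, 0, 1],
})
toy_df["Diabetes_bin_cat"] = toy_df["Diabetes_binary"].map(
    {0: "Non-Diabetic", 1: "Diabetic"}
)

fig, axes = plot_count_grid(toy_df, column="GenHlth", label="General Health")
plt.close(fig)  # replace with plt.show() in a notebook
```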

 

6. countplot for Education

 

7. countplot for Income

 

 

8. pairplot for 'PhysHlth', 'GenHlth', 'DiffWalk', 'MentHlth'

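From the barplot and the sorted correlations, I expected 'PhysHlth', 'GenHlth', and 'MentHlth' to be highly correlated with one another,

so I drew a pairplot, but the result was no good.

 (1) Probably because GenHlth is categorical data on a 1-to-5 scale

 (2) 'PhysHlth' and 'MentHlth' range from 0 to 30, so I hoped some pattern would show up in the scatter plots, but none did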
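One possible alternative, not from the original notebook: when both variables take only a few discrete values, scatter points pile on top of each other, so a count heatmap or a rank correlation may be easier to read than a pairplot. The toy DataFrame below is made up for illustration; in the real notebook df_diabetes_binary would take its place.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for df_diabetes_binary (made-up values, for illustration only).
toy_df = pd.DataFrame({
    "PhysHlth": [0, 0, 5, 30, 10, 30, 2, 15],
    "GenHlth":  [1, 2, 3, 5, 3, 4, 2, 4],
    "MentHlth": [0, 3, 0, 30, 5, 20, 0, 10],
})

cols = ["PhysHlth", "GenHlth", "MentHlth"]

# Spearman works on ranks, so it suits ordinal columns such as GenHlth.
spearman = toy_df[cols].corr(method="spearman")

# Count how many rows fall into each (GenHlth, PhysHlth) combination.
counts = pd.crosstab(toy_df["GenHlth"], toy_df["PhysHlth"])

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(counts, cmap="Blues", ax=ax)
ax.set_title("PhysHlth Counts by GenHlth", pad=20)
plt.close(fig)  # replace with plt.show() in a notebook
```

The heatmap shows where the observations concentrate, which a scatter plot hides when thousands of rows share the same integer coordinates.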