Upstage AI Lab Cohort 2 [Day015-022] EDA Team Project (5): Combinations of Explanatory Variables

Because of the f1-score problem on the imbalanced data, the experiments below use the binary 5050 (balanced) dataset.

 

              no diabetes   diabetes   total
sample size   33,960        35,097     69,057
              49.18%        50.82%     100%

 

1. Preprocessing

  • Dropped 'MentHlth' and 'PhysHlth' because the data were not valid
  • Categorized BMI three ways (① 'obese' / 'overweight' / 'healthy' / 'underweight', ② 'obese'+'overweight' vs. 'healthy'+'underweight', ③ 'obese' vs. 'not obese')
  • Scaling (see the sketch after the helper functions below):
    StandardScaler - 'Age', 'GenHlth'
    MinMaxScaler - 'Education', 'Income'
# BMI category labels in ascending order of BMI.
obese_order_list = ['underweight', 'healthy', 'overweight', 'obese']

def cat_obesity(bmi):
    """Map a BMI value to one of four standard categories (scheme ①)."""
    if bmi >= 30:
        return 'obese'
    elif bmi >= 25:
        return 'overweight'
    elif bmi >= 18.5:
        return 'healthy'
    else:
        return 'underweight'

def obese_cat_to_num(obesity):
    """Encode the four categories ordinally: underweight=0 ... obese=3."""
    if obesity == 'obese':
        return 3
    elif obesity == 'overweight':
        return 2
    elif obesity == 'healthy':
        return 1
    else:
        return 0

def obese_cat_to_bin(obesity):
    """Scheme ②: obese/overweight -> 1, healthy/underweight -> 0."""
    return 1 if obesity in ('obese', 'overweight') else 0

def obese_or_not(obesity):
    """Scheme ③: obese -> 1, everything else -> 0."""
    return 1 if obesity == 'obese' else 0
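
For reference, a minimal sketch of the other preprocessing steps listed above, assuming the working DataFrame is the df_diabetes_binary_5050 used in the VIF snippet below:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Drop the columns judged invalid during EDA.
df = df_diabetes_binary_5050.drop(columns=['MentHlth', 'PhysHlth'])

# Standardize the ordinal age/health variables.
df[['Age', 'GenHlth']] = StandardScaler().fit_transform(df[['Age', 'GenHlth']])

# Squash the bounded ordinal variables into [0, 1].
df[['Education', 'Income']] = MinMaxScaler().fit_transform(df[['Education', 'Income']])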

 

2. VIF Check and Logistic Regression

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vars_for_VIF = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income', 'Obesity_cat_num']
df = df_diabetes_binary_5050[vars_for_VIF]

def calculate_vif(df):
    """Compute the variance inflation factor for every column in df."""
    vif_data = pd.DataFrame()
    vif_data['Variable'] = df.columns
    vif_data['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

vif_result = calculate_vif(df)
vif_result.sort_values(by='VIF', ascending=False)
Variable             VIF
Education            21.355463
Income               11.004932
Age                  10.20476
GenHlth               9.448657
Obesity_cat_num       8.827424
Veggies               4.997879
PhysActivity          3.635329
HighBP                3.090527
Fruits                2.763437
HighChol              2.477878
Smoker                2.030551
Sex                   1.96979
DiffWalk              1.902939
HvyAlcoholConsump     1.066131
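
Each trial below fits the same logistic regression on a different ind_var_list and reports the test f1 score. A minimal sketch of that loop, assuming X_train/X_test/y_train/y_test come from a train/test split not shown in this post (the helper name fit_and_score is hypothetical):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_and_score(ind_var_list):
    """Fit a logistic regression on the given columns; return the test f1 score."""
    model = LogisticRegression(n_jobs=-1)
    model.fit(X_train[ind_var_list], y_train)
    return f1_score(y_test, model.predict(X_test[ind_var_list]))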

 

- trial 1. Remove 'Education' (highest VIF)

Explanatory variables: 13 in total (obesity status can be encoded with any of the three schemes above)

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Income', 'Obese']

f1 score: 0.7580846634281748

 

- trial 2-1. Remove 'Fruits' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Income', 'Obese']

f1 score: 0.7586206896551725

 

- trial 2-2. Remove 'Age' (high VIF)

Variable             VIF
Age                   9.584987
GenHlth               9.005169
Obesity_cat_num       8.430411
Income                7.412648
Veggies               4.857749
PhysActivity          3.451595
HighBP                3.079206
Fruits                2.74397
HighChol              2.477765
Smoker                2.030179
Sex                   1.969684
DiffWalk              1.902916
HvyAlcoholConsump     1.064133
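
The table above comes from re-running the same helper after dropping 'Education' (a sketch):

vif_result = calculate_vif(df.drop(columns=['Education']))
vif_result.sort_values(by='VIF', ascending=False)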

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Income', 'Obese']

f1 score: 0.745327597150716

 

- trial 3-1. Remove 'Sex' (low correlation)
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7559055118110236

 

- trial 3-2. Remove 'Age' (high VIF)
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Income', 'Obese']

f1 score: 0.745391623702239

 

- trial 4. Remove 'Veggies' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7559668777398928

 

- trial 5. Remove 'Smoker' (low correlation)
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7561468273316152

 

- trial 6. Remove 'HvyAlcoholConsump' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7554226918798665

 

- trial 7. Remove 'PhysActivity' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7555401180965613

 

 

3. Hyperparameter Tuning

Start from trial 5, which had the highest f1 score among the preceding modeling runs.
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

Parameters tuned: multi_class / solver / class_weight

 

- trial 1: multi_class='multinomial'

f1 score: 0.7561468273316152

There was no difference across solver choices ('newton-cg', 'sag', 'saga', 'lbfgs').

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Start from the variable set with the highest f1 score in round 1 (trial 5).
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
logreg_model = LogisticRegression(n_jobs=-1, multi_class='multinomial')
logreg_model.fit(X_train[ind_var_list], y_train)
y_pred = logreg_model.predict(X_test[ind_var_list])
f1 = f1_score(y_test, y_pred)
f1

 

- trial 2: multi_class='multinomial', penalty='l1', solver='saga'

f1 score: 0.7560601838952354

 

- trial 3: multi_class='multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga'

f1 score: 0.7561468273316152

 

l1_ratio=0.5

f1 score: 0.7560601838952354

 

l1_ratio=0.9

f1 score: 0.7560601838952354

 

l1_ratio=0.01

f1 score: 0.7561468273316152
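
The l1_ratio values above can be swept in one small loop; a sketch (saga may need a higher max_iter to converge cleanly):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

for l1_ratio in [0.01, 0.1, 0.5, 0.9]:
    model = LogisticRegression(multi_class='multinomial', penalty='elasticnet',
                               solver='saga', l1_ratio=l1_ratio,
                               max_iter=1000, n_jobs=-1)
    model.fit(X_train[ind_var_list], y_train)
    print(l1_ratio, f1_score(y_test, model.predict(X_test[ind_var_list])))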

 

- trial 4: multi_class='multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga'

'Obesity_cat_bin' instead of 'Obese'

f1 score: 0.7530650412135486

 

'Obesity_cat_num' (scaled) instead of 'Obese'

f1 score: 0.7611650485436894
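
A sketch of how the alternative obesity encodings plug in, reusing the helpers from the preprocessing section (assuming the raw BMI column is named 'BMI'):

# Derive the BMI category once, then the two encodings compared in this trial.
obesity_cat = df_diabetes_binary_5050['BMI'].apply(cat_obesity)
df_diabetes_binary_5050['Obesity_cat_bin'] = obesity_cat.apply(obese_cat_to_bin)
df_diabetes_binary_5050['Obesity_cat_num'] = obesity_cat.apply(obese_cat_to_num)  # scaled afterwards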

 


 

- trial 5: class_weight tuning

class_weight='balanced'

f1 score:  0.7587172664892873

ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obesity_cat_num']
class_weights = {0: 2, 1: 2, 2 : 1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1}

f1 score:  0.7611650485436894
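
For reference, sklearn's class_weight maps class labels to weights, so for this binary target only the keys 0 and 1 take effect (recent scikit-learn ignores the extra keys); a sketch with illustrative weights:

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights classes inversely proportional to their frequencies.
model = LogisticRegression(class_weight='balanced', n_jobs=-1)

# Explicit per-class weights: upweight the positive (diabetes) class.
model = LogisticRegression(class_weight={0: 1, 1: 2}, n_jobs=-1)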

 

 

 

Difficulty: it is hard to improve beyond this point.

Removing any further variables past this point actually lowers the f1 score slightly.

ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obesity_cat_num']
logreg_model = LogisticRegression(n_jobs=-1, multi_class='multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga')
logreg_model.fit(X_train[ind_var_list], y_train)
y_pred = logreg_model.predict(X_test[ind_var_list])
f1 = f1_score(y_test, y_pred)
f1

# 0.7611650485436894

 

 

Instructor feedback (January 10)

There are two powerful debugging checks:
1. Remove one variable that contributes strongly to performance and check how performance changes.
2. Fully overfit the model and look at the performance.
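
A sketch of what those two checks could look like in this project's setup (the choice of 'GenHlth' as the strong variable and the large C value are illustrative assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. Ablation: drop one strong variable and watch how much f1 moves.
ablated = [v for v in ind_var_list if v != 'GenHlth']
model = LogisticRegression(n_jobs=-1).fit(X_train[ablated], y_train)
print('without GenHlth:', f1_score(y_test, model.predict(X_test[ablated])))

# 2. Overfit check: weaken regularization (large C) and compare train vs.
#    test f1; a model that cannot even overfit the training set usually
#    signals a data or pipeline bug.
model = LogisticRegression(C=1e6, max_iter=5000, n_jobs=-1)
model.fit(X_train[ind_var_list], y_train)
print('train f1:', f1_score(y_train, model.predict(X_train[ind_var_list])))
print('test  f1:', f1_score(y_test, model.predict(X_test[ind_var_list])))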