Because of the f1-score problem, the experiments below were run on the binary 5050 (balanced) dataset.
|             | no diabetes | diabetes | total  |
|-------------|-------------|----------|--------|
| sample size | 33,960      | 35,097   | 69,057 |
| ratio       | 49.18%      | 50.82%   | 100%   |
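A minimal sketch of loading the balanced dataset and reproducing the class ratio above; the CSV file name and the target column name 'Diabetes_binary' are assumptions (they are not shown in the original notes).

import pandas as pd

# Assumed file name for the balanced (50/50) BRFSS diabetes dataset.
df_diabetes_binary_5050 = pd.read_csv('diabetes_binary_5050split_health_indicators_BRFSS2015.csv')

# Class balance of the binary target (assumed column name 'Diabetes_binary').
counts = df_diabetes_binary_5050['Diabetes_binary'].value_counts()
print(counts)                 # absolute counts per class
print(counts / counts.sum())  # class ratios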
1. Preprocessing:
- Dropped 'MentHlth' and 'PhysHlth' because the data were not valid.
- Categorized BMI in three ways (① 'obese' / 'overweight' / 'healthy' / 'underweight', ② 'obese' & 'overweight' vs. 'healthy' & 'underweight', ③ 'obese' vs. 'not obese')
- Scaling (applied as in the sketch after the BMI functions below):
  StandardScaler - 'Age', 'GenHlth'
  MinMaxScaler - 'Education', 'Income'
obese_order_list = ['underweight', 'healthy', 'overweight', 'obese']

# ① Map BMI to a four-level category using the standard cut-offs.
def cat_obesity(bmi):
    if bmi >= 30:
        return 'obese'
    elif bmi >= 25:
        return 'overweight'
    elif bmi >= 18.5:
        return 'healthy'
    else:
        return 'underweight'

# Ordinal encoding of the four-level category (0-3).
def obese_cat_to_num(obesity):
    if obesity == 'obese':
        return 3
    elif obesity == 'overweight':
        return 2
    elif obesity == 'healthy':
        return 1
    else:
        return 0

# ② Binary: 'obese' & 'overweight' (1) vs. 'healthy' & 'underweight' (0).
def obese_cat_to_bin(obesity):
    if obesity == 'obese':
        return 1
    elif obesity == 'overweight':
        return 1
    elif obesity == 'healthy':
        return 0
    else:
        return 0

# ③ Binary: 'obese' (1) vs. 'not obese' (0).
def obese_or_not(obesity):
    if obesity == 'obese':
        return 1
    else:
        return 0
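A minimal sketch of applying the functions above and the scalers described in the preprocessing step; the raw BMI column name ('BMI') and the derived column names are assumptions chosen to match the variable names used later ('Obese', 'Obesity_cat_bin', 'Obesity_cat_num').

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Derived obesity encodings (assumed column names, matching the variables used below).
df_diabetes_binary_5050['Obesity_cat'] = df_diabetes_binary_5050['BMI'].apply(cat_obesity)
df_diabetes_binary_5050['Obesity_cat_num'] = df_diabetes_binary_5050['Obesity_cat'].apply(obese_cat_to_num)
df_diabetes_binary_5050['Obesity_cat_bin'] = df_diabetes_binary_5050['Obesity_cat'].apply(obese_cat_to_bin)
df_diabetes_binary_5050['Obese'] = df_diabetes_binary_5050['Obesity_cat'].apply(obese_or_not)

# Scaling as described above.
df_diabetes_binary_5050[['Age', 'GenHlth']] = StandardScaler().fit_transform(
    df_diabetes_binary_5050[['Age', 'GenHlth']])
df_diabetes_binary_5050[['Education', 'Income']] = MinMaxScaler().fit_transform(
    df_diabetes_binary_5050[['Education', 'Income']])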
2. Check VIF and logistic regression
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vars_for_VIF = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income', 'Obesity_cat_num']
df = df_diabetes_binary_5050[vars_for_VIF]

# Compute the variance inflation factor (VIF) for every column of df.
def calculate_vif(df):
    vif_data = pd.DataFrame()
    vif_data['Variable'] = df.columns
    vif_data['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

vif_result = calculate_vif(df)
vif_result.sort_values(by='VIF', ascending=False)
| Variable | VIF |
|---|---|
| Education | 21.355463 |
| Income | 11.004932 |
| Age | 10.20476 |
| GenHlth | 9.448657 |
| Obesity_cat_num | 8.827424 |
| Veggies | 4.997879 |
| PhysActivity | 3.635329 |
| HighBP | 3.090527 |
| Fruits | 2.763437 |
| HighChol | 2.477878 |
| Smoker | 2.030551 |
| Sex | 1.96979 |
| DiffWalk | 1.902939 |
| HvyAlcoholConsump | 1.066131 |
- trial 1. Removed 'Education', which had the highest VIF
Explanatory variables: 13 in total (obesity status can be encoded in any of the three ways above); the evaluation code used for each trial is sketched below.
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Income', 'Obese']
f1 score: 0.7580846634281748
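A minimal sketch of how each trial's f1 score was presumably obtained; the train/test split parameters and the target column name 'Diabetes_binary' are assumptions, since the original notes do not show how X_train/X_test/y_train/y_test were created.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Assumed split; only the columns in ind_var_list are used for fitting.
X = df_diabetes_binary_5050.drop(columns=['Diabetes_binary'])
y = df_diabetes_binary_5050['Diabetes_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg_model = LogisticRegression(n_jobs=-1)
logreg_model.fit(X_train[ind_var_list], y_train)
y_pred = logreg_model.predict(X_test[ind_var_list])
print(f1_score(y_test, y_pred))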
- trial 2-1. Removed 'Fruits', which had a low correlation with the target
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Income', 'Obese']
f1 score: 0.7586206896551725
- trial 2-2. Removed 'Age', which had a high VIF
VIF recomputed after dropping 'Education' ('Age' now has the highest VIF):

| Variable | VIF |
|---|---|
| Age | 9.584987 |
| GenHlth | 9.005169 |
| Obesity_cat_num | 8.430411 |
| Income | 7.412648 |
| Veggies | 4.857749 |
| PhysActivity | 3.451595 |
| HighBP | 3.079206 |
| Fruits | 2.74397 |
| HighChol | 2.477765 |
| Smoker | 2.030179 |
| Sex | 1.969684 |
| DiffWalk | 1.902916 |
| HvyAlcoholConsump | 1.064133 |
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Income', 'Obese']
f1 score: 0.745327597150716
- trial 3-1. Removed 'Sex', which had a low correlation with the target
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
f1 score: 0.7559055118110236
- trial 3-2. Removed 'Age', which had a high VIF
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Income', 'Obese']
f1 score: 0.745391623702239
- trial 4. Removed 'Veggies', which had a low correlation with the target
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
f1 score: 0.7559668777398928
- trial 5. Removed 'Smoker', which had a low correlation with the target
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
f1 score: 0.7561468273316152
- trial 6. Removed 'HvyAlcoholConsump', which had a low correlation with the target
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
f1 score: 0.7554226918798665
- trial 7. Removed 'PhysActivity', which had a low correlation with the target
ind_var_list = ['HighBP', 'HighChol', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
f1 score: 0.7555401180965613
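The manual trials above could also be run as a loop; a minimal sketch, assuming the same X_train/X_test/y_train/y_test split as in the earlier sketch and default LogisticRegression settings. Only the correlation-based path of trials is listed for brevity.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

base_vars = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies',
             'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Income', 'Obese']
# Variables removed at each trial (cumulative along the corr-based path).
drops = {'trial 1': [],
         'trial 2-1': ['Fruits'],
         'trial 3-1': ['Fruits', 'Sex'],
         'trial 4': ['Fruits', 'Sex', 'Veggies'],
         'trial 5': ['Fruits', 'Sex', 'Veggies', 'Smoker']}

for name, dropped in drops.items():
    cols = [v for v in base_vars if v not in dropped]
    model = LogisticRegression(n_jobs=-1).fit(X_train[cols], y_train)
    print(name, f1_score(y_test, model.predict(X_test[cols])))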
3. Hyperparameter tuning
Starting from trial 5, which had the best f1 score in the preceding modeling, as the baseline
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
# multi_class / solver / class_weight
- trial 1 : multi_class = 'multinomial'
f1 score: 0.7561468273316152
There was no difference in f1 score across solvers ('newton-cg', 'sag', 'saga', 'lbfgs')
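A minimal sketch of the solver comparison behind that observation, assuming the same split and the trial-5 feature list; max_iter is raised so 'sag'/'saga' converge.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

for solver in ['newton-cg', 'sag', 'saga', 'lbfgs']:
    model = LogisticRegression(multi_class='multinomial', solver=solver,
                               max_iter=1000, n_jobs=-1)
    model.fit(X_train[ind_var_list], y_train)
    print(solver, f1_score(y_test, model.predict(X_test[ind_var_list])))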
# Start from the configuration with the highest f1 score in the first round of trials
# trial 5
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
logreg_model = LogisticRegression(n_jobs=-1, multi_class= 'multinomial')
logreg_model.fit(X_train[ind_var_list], y_train)
y_pred = logreg_model.predict(X_test[ind_var_list])
f1 = f1_score(y_test, y_pred)
f1
- trial 2 : multi_class = 'multinomial', penalty='l1', solver='saga'
f1 score: 0.7560601838952354
- trial 3 : multi_class= 'multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga'
f1 score: 0.7561468273316152
l1_ratio=0.5
f1 score: 0.7560601838952354
l1_ratio=0.9
f1 score: 0.7560601838952354
l1_ratio=0.01
f1 score: 0.7561468273316152
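A minimal sketch of sweeping the l1_ratio values above, under the same assumptions (split and feature list as in the trials of this section).

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

for l1_ratio in [0.01, 0.1, 0.5, 0.9]:
    model = LogisticRegression(multi_class='multinomial', penalty='elasticnet',
                               solver='saga', l1_ratio=l1_ratio,
                               max_iter=1000, n_jobs=-1)
    model.fit(X_train[ind_var_list], y_train)
    print(l1_ratio, f1_score(y_test, model.predict(X_test[ind_var_list])))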
- trial 4 : multi_class= 'multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga'
'Obesity_cat_bin' instead of 'Obese'
f1 score: 0.7530650412135486
'Obesity_cat_num' (scaled) instead of 'Obese'
f1 score: 0.7611650485436894
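A minimal sketch of the comparison in trial 4, swapping only the obesity encoding while keeping the other variables and hyperparameters fixed (same split assumed).

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

other_vars = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump',
              'GenHlth', 'DiffWalk', 'Age', 'Income']

for obesity_col in ['Obese', 'Obesity_cat_bin', 'Obesity_cat_num']:
    cols = other_vars + [obesity_col]
    model = LogisticRegression(multi_class='multinomial', penalty='elasticnet',
                               solver='saga', l1_ratio=0.1, max_iter=1000, n_jobs=-1)
    model.fit(X_train[cols], y_train)
    print(obesity_col, f1_score(y_test, model.predict(X_test[cols])))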
- trial 5 : class_weight tuning
class_weight='balanced'
f1 score: 0.7587172664892873
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obesity_cat_num']
class_weights = {0: 2, 1: 2, 2 : 1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1}
f1 score: 0.7611650485436894
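A minimal sketch of passing class_weight to LogisticRegression; the {0: 2, 1: 1} dictionary is a hypothetical example for the binary target, not necessarily the exact weighting tried above.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

for cw in ['balanced', {0: 2, 1: 1}]:  # hypothetical example weights for the binary target
    model = LogisticRegression(multi_class='multinomial', penalty='elasticnet',
                               solver='saga', l1_ratio=0.1, class_weight=cw,
                               max_iter=1000, n_jobs=-1)
    model.fit(X_train[ind_var_list], y_train)
    print(cw, f1_score(y_test, model.predict(X_test[ind_var_list])))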
Difficulty: it was hard to improve beyond this point.
Removing further variables from here actually lowered the f1 score slightly.
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obesity_cat_num']
logreg_model = LogisticRegression(n_jobs=-1, multi_class= 'multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga')
logreg_model.fit(X_train[ind_var_list], y_train)
y_pred = logreg_model.predict(X_test[ind_var_list])
f1 = f1_score(y_test, y_pred)
f1
# 0.7611650485436894
sklearn logistic regression documentation
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Instructor feedback (January 10)
There are two powerful debugging checks (a sketch of both follows below):
1. Remove one variable that contributes heavily to performance and check how performance changes.
2. Deliberately overfit the model completely and look at its performance.
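A minimal sketch of both checks, under assumptions not in the feedback itself: 'GenHlth' stands in for "a variable that helps performance a lot", and the overfitting check simply turns regularization effectively off (very large C) and compares train vs. test f1.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Check 1: drop one strong variable (assumed here to be 'GenHlth') and see how f1 changes.
reduced = [v for v in ind_var_list if v != 'GenHlth']
model = LogisticRegression(n_jobs=-1).fit(X_train[reduced], y_train)
print('without GenHlth:', f1_score(y_test, model.predict(X_test[reduced])))

# Check 2: effectively unregularized fit; a large train-test gap would indicate overfitting.
model = LogisticRegression(C=1e6, max_iter=5000, n_jobs=-1).fit(X_train[ind_var_list], y_train)
print('train f1:', f1_score(y_train, model.predict(X_train[ind_var_list])))
print('test f1 :', f1_score(y_test, model.predict(X_test[ind_var_list])))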