Upstage AI Lab Cohort 2 [Day015-022] EDA Team Project (5): Combinations of Explanatory Variables

Because of the f1-score problem on the imbalanced data, the experiments below use the binary 5050 (balanced) dataset.

 

              no diabetes   diabetes   total
sample size   33,960        35,097     69,057
              49.18%        50.82%     100%

 

1. Preprocessing

  • Dropped 'MentHlth' and 'PhysHlth' because the data were not valid
  • Categorized BMI three ways (① 'obese' / 'overweight' / 'healthy' / 'underweight', ② 'obese'+'overweight' vs. 'healthy'+'underweight', ③ 'obese' vs. 'not obese')
  • Scaling (see the sketch after the helper functions below):
    StandardScaler - 'Age', 'GenHlth'
    MinMaxScaler - 'Education', 'Income'
# BMI category labels in ascending order of BMI.
obese_order_list = ['underweight', 'healthy', 'overweight', 'obese']

def cat_obesity(bmi):
    """Map a BMI value to one of four standard categories (scheme ①)."""
    if bmi >= 30:
        return 'obese'
    elif bmi >= 25:
        return 'overweight'
    elif bmi >= 18.5:
        return 'healthy'
    else:
        return 'underweight'

def obese_cat_to_num(obesity):
    """Encode the four categories ordinally: underweight=0 ... obese=3."""
    if obesity == 'obese':
        return 3
    elif obesity == 'overweight':
        return 2
    elif obesity == 'healthy':
        return 1
    else:
        return 0

def obese_cat_to_bin(obesity):
    """Scheme ②: obese/overweight -> 1, healthy/underweight -> 0."""
    return 1 if obesity in ('obese', 'overweight') else 0

def obese_or_not(obesity):
    """Scheme ③: obese -> 1, everything else -> 0."""
    return 1 if obesity == 'obese' else 0
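
For reference, a minimal sketch of the other preprocessing steps listed above, assuming the working DataFrame is the df_diabetes_binary_5050 used in the VIF snippet below:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Drop the columns judged invalid during EDA.
df = df_diabetes_binary_5050.drop(columns=['MentHlth', 'PhysHlth'])

# Standardize the ordinal age/health variables.
df[['Age', 'GenHlth']] = StandardScaler().fit_transform(df[['Age', 'GenHlth']])

# Squash the bounded ordinal variables into [0, 1].
df[['Education', 'Income']] = MinMaxScaler().fit_transform(df[['Education', 'Income']])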

 

2. VIF Check and Logistic Regression

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vars_for_VIF = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income', 'Obesity_cat_num']
df = df_diabetes_binary_5050[vars_for_VIF]

def calculate_vif(df):
    """Compute the variance inflation factor for every column in df."""
    vif_data = pd.DataFrame()
    vif_data['Variable'] = df.columns
    vif_data['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

vif_result = calculate_vif(df)
vif_result.sort_values(by='VIF', ascending=False)
Variable             VIF
Education            21.355463
Income               11.004932
Age                  10.20476
GenHlth               9.448657
Obesity_cat_num       8.827424
Veggies               4.997879
PhysActivity          3.635329
HighBP                3.090527
Fruits                2.763437
HighChol              2.477878
Smoker                2.030551
Sex                   1.96979
DiffWalk              1.902939
HvyAlcoholConsump     1.066131
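
Each trial below fits the same logistic regression on a different ind_var_list and reports the test f1 score. A minimal sketch of that loop, assuming X_train/X_test/y_train/y_test come from a train/test split not shown in this post (the helper name fit_and_score is hypothetical):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_and_score(ind_var_list):
    """Fit a logistic regression on the given columns; return the test f1 score."""
    model = LogisticRegression(n_jobs=-1)
    model.fit(X_train[ind_var_list], y_train)
    return f1_score(y_test, model.predict(X_test[ind_var_list]))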

 

- trial 1. Remove 'Education' (highest VIF)

Explanatory variables: 13 in total (obesity status can be encoded with any of the three schemes above)

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Income', 'Obese']

f1 score: 0.7580846634281748

 

- trial 2-1. Remove 'Fruits' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Age', 'Income', 'Obese']

f1 score: 0.7586206896551725

 

- trial 2-2. Remove 'Age' (high VIF)

Variable             VIF
Age                   9.584987
GenHlth               9.005169
Obesity_cat_num       8.430411
Income                7.412648
Veggies               4.857749
PhysActivity          3.451595
HighBP                3.079206
Fruits                2.74397
HighChol              2.477765
Smoker                2.030179
Sex                   1.969684
DiffWalk              1.902916
HvyAlcoholConsump     1.064133
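
The table above comes from re-running the same helper after dropping 'Education' (a sketch):

vif_result = calculate_vif(df.drop(columns=['Education']))
vif_result.sort_values(by='VIF', ascending=False)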

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Income', 'Obese']

f1 score: 0.745327597150716

 

- trial 3-1. Remove 'Sex' (low correlation)
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7559055118110236

 

- trial 3-2. Remove 'Age' (high VIF)
ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'Veggies', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Sex', 'Income', 'Obese']

f1 score: 0.745391623702239

 

- trial 4. Remove 'Veggies' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'Smoker', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7559668777398928

 

- trial 5. Remove 'Smoker' (low correlation)
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7561468273316152

 

- trial 6. Remove 'HvyAlcoholConsump' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7554226918798665

 

- trial 7. Remove 'PhysActivity' (low correlation)

ind_var_list = ['HighBP', 'HighChol', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

f1 score: 0.7555401180965613

 

 

3. Hyperparameter Tuning

Start from trial 5, which had the highest f1 score among the preceding modeling runs.
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']

Parameters tuned: multi_class / solver / class_weight

 

- trial 1: multi_class='multinomial'

f1 score: 0.7561468273316152

There was no difference across solver choices ('newton-cg', 'sag', 'saga', 'lbfgs').

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Start from the variable set with the highest f1 score in round 1 (trial 5).
ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obese']
logreg_model = LogisticRegression(n_jobs=-1, multi_class='multinomial')
logreg_model.fit(X_train[ind_var_list], y_train)
y_pred = logreg_model.predict(X_test[ind_var_list])
f1 = f1_score(y_test, y_pred)
f1

 

- trial 2: multi_class='multinomial', penalty='l1', solver='saga'

f1 score: 0.7560601838952354

 

- trial 3: multi_class='multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga'

f1 score: 0.7561468273316152

 

l1_ratio=0.5

f1 score: 0.7560601838952354

 

l1_ratio=0.9

f1 score: 0.7560601838952354

 

l1_ratio=0.01

f1 score: 0.7561468273316152
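
The l1_ratio values above can be swept in one small loop; a sketch (saga may need a higher max_iter to converge cleanly):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

for l1_ratio in [0.01, 0.1, 0.5, 0.9]:
    model = LogisticRegression(multi_class='multinomial', penalty='elasticnet',
                               solver='saga', l1_ratio=l1_ratio,
                               max_iter=1000, n_jobs=-1)
    model.fit(X_train[ind_var_list], y_train)
    print(l1_ratio, f1_score(y_test, model.predict(X_test[ind_var_list])))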

 

- trial 4: multi_class='multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga'

'Obesity_cat_bin' instead of 'Obese'

f1 score: 0.7530650412135486

 

'Obesity_cat_num' (scaled) instead of 'Obese'

f1 score: 0.7611650485436894
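
A sketch of how the alternative obesity encodings plug in, reusing the helpers from the preprocessing section (assuming the raw BMI column is named 'BMI'):

# Derive the BMI category once, then the two encodings compared in this trial.
obesity_cat = df_diabetes_binary_5050['BMI'].apply(cat_obesity)
df_diabetes_binary_5050['Obesity_cat_bin'] = obesity_cat.apply(obese_cat_to_bin)
df_diabetes_binary_5050['Obesity_cat_num'] = obesity_cat.apply(obese_cat_to_num)  # scaled afterwards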

 


 

- trial 5: class_weight tuning

class_weight='balanced'

f1 score:  0.7587172664892873

ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obesity_cat_num']
class_weights = {0: 2, 1: 2, 2 : 1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1}

f1 score:  0.7611650485436894
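
For reference, sklearn's class_weight maps class labels to weights, so for this binary target only the keys 0 and 1 take effect (recent scikit-learn ignores the extra keys); a sketch with illustrative weights:

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights classes inversely proportional to their frequencies.
model = LogisticRegression(class_weight='balanced', n_jobs=-1)

# Explicit per-class weights: upweight the positive (diabetes) class.
model = LogisticRegression(class_weight={0: 1, 1: 2}, n_jobs=-1)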

 

 

 

Difficulty: it is hard to improve beyond this point.

Removing any further variables past this point actually lowers the f1 score slightly.

ind_var_list = ['HighBP', 'HighChol', 'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'DiffWalk', 'Age', 'Income', 'Obesity_cat_num']
logreg_model = LogisticRegression(n_jobs=-1, multi_class='multinomial', penalty='elasticnet', l1_ratio=0.1, solver='saga')
logreg_model.fit(X_train[ind_var_list], y_train)
y_pred = logreg_model.predict(X_test[ind_var_list])
f1 = f1_score(y_test, y_pred)
f1

# 0.7611650485436894

 

 

Instructor feedback (January 10)

There are two powerful debugging checks:
1. Remove one variable that contributes strongly to performance and check how performance changes.
2. Fully overfit the model and look at the performance.
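
A sketch of what those two checks could look like in this project's setup (the choice of 'GenHlth' as the strong variable and the large C value are illustrative assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. Ablation: drop one strong variable and watch how much f1 moves.
ablated = [v for v in ind_var_list if v != 'GenHlth']
model = LogisticRegression(n_jobs=-1).fit(X_train[ablated], y_train)
print('without GenHlth:', f1_score(y_test, model.predict(X_test[ablated])))

# 2. Overfit check: weaken regularization (large C) and compare train vs.
#    test f1; a model that cannot even overfit the training set usually
#    signals a data or pipeline bug.
model = LogisticRegression(C=1e6, max_iter=5000, n_jobs=-1)
model.fit(X_train[ind_var_list], y_train)
print('train f1:', f1_score(y_train, model.predict(X_train[ind_var_list])))
print('test  f1:', f1_score(y_test, model.predict(X_test[ind_var_list])))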