Upstage AI Lab 2기 [Day015-022] EDA 조별 프로젝트 (4) 가설 설정

기초 통계량을 바탕으로 두가지 방향의 가설설정이 가능함

1. 당뇨병 예측에 필요한 변수는 [ ], [ ], [ ] 일 것이다.

2. 당뇨병 설문을 위해 [ ], [ ], [ ] 변수는 [ ] 변수만으로 충분히 설명된다.

단일 변수 검증

dependent : 'Diabetes_binary'

chi-squared(categorical independent variable) :

		'Diabetes_binary'와의 correlation	chi-squared test p-value
binary	'HighBP'	0.254318	0.0
	'HighChol'	0.194944	0.0
	'CholCheck'	0.072523	0.0
	'Smoker'	0.045504	0.0
	'Stroke'	0.099193	0.0
	'HeartDiseaseorAttack'	0.168213	0.0
	'PhysActivity'	-0.100404	0.0
	'Fruits'	-0.024805	0.0
	'Veggies'	-0.041734	0.0
	'HvyAlcoholConsump'	-0.065950	0.0
	'AnyHealthcare'	0.025331	0.0
	'NoDocbcCost'	0.020048	0.0
	'DiffWalk'	0.205302	0.0
	'Sex'	0.032724	0.0
categorical	'GenHlth'	0.276940	0.0
	'Age'	0.177263	0.0
	'Education'	-0.102686	0.0
	'Income'	-0.140659	0.0

개별변수와 당뇨병 환자를 종속변수로 설정한 logistic regression()

'BMI', 'MentHlth', 'PhysHlth', 'GenHlth', 'Age', 'Education', 'Income'

생각하고 있는 가설들

'CholCheck'/ 'AnyHealthcare' / 'NoDocbcCost' 변수는 영향이 없을 것이다.
'Stroke'/ 'HeartDiseaseorAttack' 변수는 당뇨병에 대한 설명변수보다는 당뇨병에 의한 결과일 수 있다.
'GenHlth' / 'MentHlth' / 'PhysHlth' 를 별도로 구분하지 않아도 될 것이다.
BMI의 카테고리(① 'underweight', 'healthy', 'overweight', 'obese' / ② 'not obese' / 'obese' )는 당뇨병 예측에 중요한 변수일 것이다.

+ 어떤 변수들의 조합이 예측력을 높일 수 있을 것인가.

가설들에 대한 근거

1. 'CholCheck'/ 'AnyHealthcare' / 'NoDocbcCost' 변수는 영향이 없을 것이다.

'CholCheck'

(1) Non-Diabetic과 Diabetic의 'CholCheck' mean (barplot)

sns.barplot(data = df_diabetes_binary, y = 'CholCheck', x = 'Diabetes_binary')

(2) 'CholCheck' - 'Diabetes_binary'의 상관계수 : 0.072523

'AnyHealthcare'

(1) Non-Diabetic과 Diabetic의 'AnyHealthcare' mean (barplot)

sns.barplot(data = df_diabetes_binary, y = 'AnyHealthcare', x = 'Diabetes_binary')

(2) 'AnyHealthcare' - 'Diabetes_binary'의 상관계수 : 0.025331

'NoDocbcCost'

(1) Non-Diabetic과 Diabetic의 'NoDocbcCost' mean (barplot)

(2) 'NoDocbcCost' - 'Diabetes_binary'의 상관계수 : 0.020048

note. 'Diabetes_binary' 와의 correlation 절댓값 순서

NoDocbcCost	Fruits	AnyHealthcare	Sex	Veggies	Smoker	MentHlth	HvyAlcoholConsump	CholCheck	Stroke	PhysActivity	Education	Income	PhysHlth	HeartDiseaseorAttack	Age	HighChol	BMI	DiffWalk	HighBP	GenHlth
0.020048	0.024805	0.025331	0.032724	0.041734	0.045504	0.054153	0.06595	0.072523	0.099193	0.100404	0.102686	0.140659	0.156211	0.168213	0.177263	0.194944	0.205086	0.205302	0.254318	0.27694

2. 'Stroke'/ 'HeartDiseaseorAttack' 변수는 당뇨병에 대한 설명변수보다는 당뇨병에 의한 결과일 수 있다.

'Stroke'	'HeartDiseaseorAttack'

출처 : Stegmayr, B., & Asplund, K. (1995). Diabetes as a risk factor for stroke. A population perspective. Diabetologia, 38, 1061-1068.

Stroke incidence, case fatality and mortality in diabetic patients were compared to non-diabetic subjects in a 35-74-year-old population in northern Sweden (target population 241,000).

During an 8-year period, 1,544 stroke events in diabetic patients and 4,826 events in non-diabetic subjects were recorded.

The crude incidence of stroke was 1,000 per 100,000 in the diabetic men vs 247 in the non-diabetic men (relative risk 4.1; 95 % confidence interval 3.2-5.2).

Among diabetic women, the crude incidence was 757 per 100,000 and 152 in non-iabetic women (relative risk 5.8; 95 % confidence interval 3.7-6.9).

The 28-day case fatality among men was similar in the diabetic and non-diabetic stroke patients (18.6 vs 17.1%; p = 0.311), but significantly higher in diabetic women compared with non-diabetic women (22.2 vs 17.9 %; p = 0.02). When compared with the non-diabetic population, the overall mortality from stroke in the diabetic population (first and recurrent) was 4.4-times higher in male and 5.1 times higher in the female patients. Hypertension, atrial fibrillation, heart failure or myocardial infarction were all significantly more common in diabetic than in non-diabetic stroke patients. The
population attributable risk, a crude estimate of all strokes ascribed to diabetes mellitus, was 18 % in men and 22% in women. In Sweden, about 50 strokes are annually directly attributed to diabetes in a population of 100,000 in this age group. [Diabetologia (1995) 38: 1061-1068]

3. 'GenHlth' / 'MentHlth' / 'PhysHlth' 를 별도로 구분하지 않아도 될 것이다.

변수간 상관관계가 높은 편

+ ) 'MentHlth' / 'PhysHlth' 0이 많아서 고민

'MentHlth' 중 응답 0 의 비율 : 0.663801

Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? scale 1-30 days

Behavioral Risk Factor Surveillance System 2015 Codebook Report

'PhysHlth' 중 응답 0 의 비율 : 0.595179

Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? scale 1-30 days

https://www.cdc.gov/brfss/annual_data/annual_2015.html

모델 검증을 위해

1. correlation이 0에 가까운 변수부터 제거?

corr_series = df_diabetes_binary.drop('Diabetes_binary', axis=1).corrwith(df_diabetes_binary.Diabetes_binary)
corr_series.abs().sort_values()

NoDocbcCost	Fruits	AnyHealthcare	Sex	Veggies	Smoker	MentHlth	HvyAlcoholConsump	CholCheck	Stroke	PhysActivity	Education	Income	PhysHlth	HeartDiseaseorAttack	Age	HighChol	BMI	DiffWalk	HighBP	GenHlth
0.020048	0.024805	0.025331	0.032724	0.041734	0.045504	0.054153	0.06595	0.072523	0.099193	0.100404	0.102686	0.140659	0.156211	0.168213	0.177263	0.194944	0.205086	0.205302	0.254318	0.27694

2. VIF 높은 값부터 제거?

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculat_vif(df) :
    vif_data = pd.DataFrame()
    vif_data['Variable'] = df.columns
    vif_data['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data
    
vif_result = calculat_vif(df_diabetes_binary.drop(columns=['Diabetes_binary', 'Diabetes_bin_cat', 'Obesity_cat']))
vif_result.sort_values(by='VIF', ascending=False)

	19	2	11	3	20	13	18	9	7	8	0	1	15	4	17	16	14	6	12	5	10
Variable	Education	CholCheck	AnyHealthcare	BMI	Income	GenHlth	Age	Veggies	PhysActivity	Fruits	HighBP	HighChol	PhysHlth	Smoker	Sex	DiffWalk	MentHlth	HeartDiseaseorAttack	NoDocbcCost	Stroke	HvyAlcoholConsump
VIF	27.02937	21.45925	18.92962	17.34323	12.57782	10.96276	9.754024	5.279982	4.136394	2.825639	2.352183	2.066881	2.011309	1.985653	1.903811	1.850008	1.469319	1.296633	1.218706	1.128022	1.091476

3. 로지스틱 회귀 이외의 모델로 검증?

https://www.indeed.com/career-advice/career-development/multivariate-logistic-regression#:~:text=Multivariate%20logistic%20regression%20analysis%20is,data%20science%20and%20machine%20learning.

In a multivariate regression, there are multiple independent variables and multiple outcomes. In multivariable regression, there are multiple independent variables, but only one outcome.

'Upstage AI Lab 2기' 카테고리의 다른 글

Upstage AI Lab 2기 [Day024] git-협업 (0)	2024.01.15
Upstage AI Lab 2기 [Day015-022] EDA 조별 프로젝트 (5) 설명변수 조합 (0)	2024.01.09
통계학 복습 (0)	2024.01.06
Upstage AI Lab 2기 [Day015-022] EDA 조별 프로젝트 (데이터 개요) (1)	2024.01.04
Upstage AI Lab 2기 [Day015-022] EDA 조별 프로젝트 (1) 기초통계 (0)	2024.01.04

연역적 인간의 귀납적 세상에서 살아남기

Upstage AI Lab 2기 [Day015-022] EDA 조별 프로젝트 (4) 가설 설정

'Upstage AI Lab 2기' 카테고리의 다른 글

티스토리툴바

Upstage AI Lab 2기 [Day015-022] EDA 조별 프로젝트 (4) 가설 설정

'Upstage AI Lab 2기' 카테고리의 다른 글

'Upstage AI Lab 2기' Related Articles

티스토리툴바