Upstage AI Lab 2기 [Day026] Online Lecture

January 17, 2024 (Wed) Day_026

Day_026 Online Lecture : Basic Statistics

Project1 - Setting up hypotheses, running tests, and interpreting the results with real data

CH04_05. t-test

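The appropriate test depends on the data, the sample size, and the nature of the groups being compared.

Independent-samples t-test : compares two mutually independent groups

Paired-samples t-test : compares the same group before and after a treatment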
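To make the difference between the two tests concrete, here is a minimal sketch with made-up data. The group means, spreads, and sizes are assumptions for illustration only. `stats.ttest_ind` is used for two unrelated groups and `stats.ttest_rel` for before/after measurements on the same subjects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Independent samples: two unrelated groups (hypothetical data)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)
ind_result = stats.ttest_ind(group_a, group_b)

# Paired samples: the same 40 subjects measured before and after a treatment
before = rng.normal(loc=50, scale=5, size=40)
after = before + rng.normal(loc=2, scale=1, size=40)  # assumed treatment effect
rel_result = stats.ttest_rel(before, after)

print("independent:", ind_result.statistic, ind_result.pvalue)
print("paired     :", rel_result.statistic, rel_result.pvalue)
```

The paired test works on the per-subject differences, so both arrays must have the same length and the same ordering of subjects.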

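Steps for the independent-samples t-test

Sample size 10~30 : normality test
- Normality O : equal-variance test
- Normality X : rank-sum test
Sample size 30 or more : equal-variance test
- Equal variance O : independent-samples t-test assuming equal variances
- Equal variance X : independent-samples t-test assuming unequal variances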
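The decision flow above could be sketched roughly as follows. `compare_two_groups` is a hypothetical helper name, not something from the lecture. It assumes a 0.05 significance level, that samples of 30 or more skip the normality check, and that `stats.ranksums` (Wilcoxon rank-sum) stands in for the rank-sum test.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: pick a two-group test and return (test_name, p_value)."""
    a, b = np.asarray(a), np.asarray(b)

    # Small samples: check normality first, fall back to the rank-sum test
    if min(len(a), len(b)) < 30:
        normal = (stats.shapiro(a).pvalue > alpha
                  and stats.shapiro(b).pvalue > alpha)
        if not normal:
            return "rank-sum", stats.ranksums(a, b).pvalue

    # Equal-variance check decides which t-test variant to run
    equal_var = stats.levene(a, b).pvalue > alpha
    name = "t-test (equal variance)" if equal_var else "t-test (unequal variance)"
    return name, stats.ttest_ind(a, b, equal_var=equal_var).pvalue

rng = np.random.default_rng(0)
name, p_value = compare_two_groups(rng.normal(0, 1, 40), rng.normal(0.5, 1, 40))
print(name, p_value)
```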

Homoskedasticity (equal variance) : the variance must stay constant, without any particular pattern.

np.random.normal(mean, standard deviation, n)

-> returns the samples as a NumPy array
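A quick check that the second argument of `np.random.normal` is the standard deviation (`scale`), not the variance. The mean of 10, scale of 2, and sample size are arbitrary values chosen for this sketch.

```python
import numpy as np

np.random.seed(0)
sample = np.random.normal(10, 2, 10000)  # loc=10, scale=2 (standard deviation)

print(type(sample))   # the result is a NumPy array
print(sample.mean())  # close to 10
print(sample.std())   # close to 2
print(sample.var())   # close to 4, i.e. scale squared
```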

pd.DataFrame(house_a.tolist())

array -> list -> DataFrame

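tmp1 = pd.concat([pd.DataFrame(['A']*40), pd.DataFrame(house_a.tolist())], axis=1)
tmp2 = pd.concat([pd.DataFrame(['B']*40), pd.DataFrame(house_b.tolist())], axis=1)
# Name the columns so that df['grp'] and df['value'] can be used below
tmp1.columns = ['grp', 'value']
tmp2.columns = ['grp', 'value']

df = pd.concat([tmp1, tmp2], axis=0)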
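As an alternative sketch, the same long-format table can be built in one step from a dictionary. The means and standard deviations used for `house_a` and `house_b` here are assumptions, since the notes do not show how the arrays were generated.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
house_a = np.random.normal(100, 10, 40)  # hypothetical parameters
house_b = np.random.normal(105, 10, 40)  # hypothetical parameters

# One row per observation: group label plus measured value
df = pd.DataFrame({
    "grp": ["A"] * 40 + ["B"] * 40,
    "value": np.concatenate([house_a, house_b]),
})

print(df.groupby("grp")["value"].agg(["count", "mean", "std"]))
```

A NumPy array can be passed to pandas directly, so the `.tolist()` conversion is optional.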

### Test for equal variances

Null hypothesis : the two groups have equal variances.
Alternative hypothesis : the two groups have different variances.

stats.levene()

stats.levene(np.array(df[df['grp'] == 'A']['value']), np.array(df[df['grp'] == 'B']['value']))
# LeveneResult(statistic=0.1482823966555207, pvalue=0.7012302503912982)

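The p-value (about 0.70) is above the usual 0.05 significance level, so the null hypothesis is not rejected. In other words, the variances are treated as equal.

∴ Run the independent-samples t-test assuming equal variances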

stats.ttest_ind(np.array(df[df['grp'] == 'A']['value'])
              , np.array(df[df['grp'] == 'B']['value'])
              , equal_var=True)
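For comparison, here is a small sketch of what the `equal_var` flag changes. With `equal_var=False`, SciPy runs Welch's t-test, which does not assume equal variances. The data below are synthetic stand-ins, not the lecture's values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(100, 10, 40)  # hypothetical group A
b = rng.normal(104, 10, 40)  # hypothetical group B

pooled = stats.ttest_ind(a, b, equal_var=True)   # Student's t-test (pooled variance)
welch = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch :", welch.statistic, welch.pvalue)
```

When both groups have the same size, as here, the two t statistics coincide and only the degrees of freedom (and therefore the p-value) can differ slightly.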

CH04_07. ANOVA

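Null Hypothesis : sepal_width does not differ between the groups (target).

Alt Hypothesis : sepal_width differs between the groups (target).

Normality test

Null Hypothesis : the data follow a normal distribution.

Alt Hypothesis : the data do not follow a normal distribution.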

shapiro()

from scipy.stats import shapiro

shapiro()
# ShapiroResult(statistic=  , pvalue=  )

Perform the Shapiro-Wilk test for normality.

(to see if the data follows normal distribution)

print(shapiro(df.sepal_width[df.target==0]))
print(shapiro(df.sepal_width[df.target==1]))
print(shapiro(df.sepal_width[df.target==2]))

# ShapiroResult(statistic=0.971718966960907, pvalue=0.2715126574039459)
# ShapiroResult(statistic=0.9741329550743103, pvalue=0.3379843533039093)
# ShapiroResult(statistic=0.9673907160758972, pvalue=0.18089871108531952)

Conclusion : all three p-values are above 0.05, so normality is not rejected for sepal_width in targets 0, 1, and 2.
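The three `shapiro` calls can also be written as a loop over groups. The sketch below uses a synthetic stand-in for the iris table (the column names `target` and `sepal_width` follow the notes, but the values are randomly generated assumptions) and a 0.05 significance level.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

rng = np.random.default_rng(0)
# Synthetic stand-in for the iris data: 3 groups of 50 rows each
df = pd.DataFrame({
    "target": np.repeat([0, 1, 2], 50),
    "sepal_width": np.concatenate([
        rng.normal(3.4, 0.4, 50),
        rng.normal(2.8, 0.3, 50),
        rng.normal(3.0, 0.3, 50),
    ]),
})

alpha = 0.05
p_values = {}
for target, values in df.groupby("target")["sepal_width"]:
    stat, p = shapiro(values)
    p_values[target] = p
    verdict = "normality not rejected" if p > alpha else "normality rejected"
    print(f"target={target}: W={stat:.4f}, p={p:.4f} -> {verdict}")
```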

Equal-variance test

Null Hypothesis : the groups have equal variances.

Alt Hypothesis : the groups do not have equal variances.

levene()

from scipy.stats import levene

Perform Levene test for equal variances.

print(levene(df.sepal_width[df.target==0], df.sepal_width[df.target==1], df.sepal_width[df.target==2]))

# LeveneResult(statistic=0.5902115655853319, pvalue=0.5555178984739075)

Conclusion : the p-value (about 0.56) is above 0.05, so equal variance is not rejected.
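One detail worth knowing: `scipy.stats.levene` centers each group on its median by default (`center="median"`), and `center="mean"` gives the original mean-based form of the test. A minimal sketch on synthetic groups (the values are assumptions, not the iris data):

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)
g0 = rng.normal(3.4, 0.4, 50)  # hypothetical group 0
g1 = rng.normal(2.8, 0.3, 50)  # hypothetical group 1
g2 = rng.normal(3.0, 0.3, 50)  # hypothetical group 2

res_median = levene(g0, g1, g2)               # default: center="median"
res_mean = levene(g0, g1, g2, center="mean")  # mean-centered variant

print("median-centered:", res_median.statistic, res_median.pvalue)
print("mean-centered  :", res_mean.statistic, res_mean.pvalue)
```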

One-way ANOVA

stats.f_oneway()

import scipy.stats as stats

stats.f_oneway()

stats.f_oneway(df.sepal_width[df.target==0], df.sepal_width[df.target==1], df.sepal_width[df.target==2])

# F_onewayResult(statistic=49.160040089612075, pvalue=4.492017133309115e-17)

Conclusion : the p-value is far below 0.05, so the null hypothesis is rejected: sepal_width differs between the groups.
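To see where the F statistic comes from, the sketch below computes it by hand (between-group mean square divided by within-group mean square) and compares it with `stats.f_oneway`. The three groups are synthetic assumptions, not the iris values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [
    rng.normal(3.4, 0.4, 50),
    rng.normal(2.8, 0.3, 50),
    rng.normal(3.0, 0.3, 50),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), len(all_values)

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)  # upper tail of the F distribution

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```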

Post-hoc Analysis

The more hypothesis tests are run, the higher the FWER (family-wise error rate: the probability of making at least one Type I error across all the tests).
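A rough illustration of how fast the FWER grows. It assumes the tests are independent and each uses the same significance level, in which case FWER = 1 - (1 - alpha)^m for m tests. `fwer` is a hypothetical helper name.

```python
def fwer(alpha, m):
    """Family-wise error rate for m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

# 3 groups need 3 pairwise comparisons; 5 groups need 10
for m in (1, 3, 10):
    print(m, round(fwer(0.05, m), 4))
```

With three pairwise comparisons at alpha = 0.05 the FWER is already about 0.14, which is why a post-hoc method that controls it is used instead of repeated t-tests.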

pairwise_tukeyhsd()
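A minimal usage sketch of `pairwise_tukeyhsd` from statsmodels, assuming statsmodels is installed. The values and group labels are synthetic stand-ins for sepal_width and target.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(3.4, 0.4, 50),
    rng.normal(2.8, 0.3, 50),
    rng.normal(3.0, 0.3, 50),
])
labels = np.repeat(["0", "1", "2"], 50)  # hypothetical group labels

# Compares every pair of groups while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())
print(tukey.reject)  # one True/False per pair of groups
```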

CH04_09. Chi-squared

Test of independence
Goodness-of-fit test
Test of homogeneity

chi-squared = sum of ((observed value - expected value)^2 / expected value)
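The formula can be worked through by hand for a test of independence, where the expected counts come from the row and column totals. The 2x2 table below is a hypothetical example, not data from the lecture.

```python
import numpy as np

observed = np.array([[200, 100],
                     [400,  50]])  # hypothetical contingency table

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(chi2, dof)
```

This is the plain formula with no continuity correction applied.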

stats.chi2_contingency()

# Chi2ContingencyResult(statistic=54.17534722222223, pvalue=1.833731033899248e-13, dof=1, expected_freq=array([[240.,  60.],
#       [360.,  90.]]))

The result holds the chi-squared statistic, the p-value, the degrees of freedom, and the expected frequencies (given as an array).
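The table passed to `stats.chi2_contingency()` is not shown in these notes. The sketch below uses a hypothetical 2x2 table chosen to have the same expected frequencies as the output above; it is one of the tables consistent with that output, not necessarily the one used in the lecture. For 2x2 tables (dof = 1), SciPy applies Yates' continuity correction by default, which `correction=False` turns off.

```python
import numpy as np
from scipy import stats

observed = np.array([[200, 100],
                     [400,  50]])  # hypothetical table (assumption)

# Default: Yates' continuity correction is applied when dof == 1
chi2, p, dof, expected = stats.chi2_contingency(observed)

# Without the correction the statistic matches the plain chi-squared formula
chi2_plain = stats.chi2_contingency(observed, correction=False)[0]

print(chi2, p, dof)
print(expected)
print(chi2_plain)
```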
