본문 바로가기

Upstage AI Lab 2기

Upstage AI Lab 2기 [Day043] ML 프로젝트 (day1-2) (2) 업무분담

  • credit_bureau_a_2 : (188298452, 19)
  • static_cb_0 : (1500476, 53)
  • applprev_1 : (6525979, 41)

credit_bureau_a_2

(188298452, 19)

더보기

columns

- 'case_id'
- 'collater_typofvalofguarant_298M' : Collateral valuation type (active contract).
- 'collater_typofvalofguarant_407M' : Collateral valuation type (closed contract).
- 'collater_valueofguarantee_1124L' : Value of collateral for active contract.
- 'collater_valueofguarantee_876L' : Value of collateral for closed contract.
- 'collaterals_typeofguarante_359M' : Type of collateral that was used as a guarantee for a closed contract.
- 'collaterals_typeofguarante_669M' : Collateral type for the active contract.
- 'num_group1'
- 'num_group2'
- 'pmts_dpd_1073P' : Days past due of the payment for the active contract 
(num_group1 - existing contract, num_group2 - payment).
- 'pmts_dpd_303P' : Days past due of the payment for terminated contract according to credit bureau 
(num_group1 - terminated contract, num_group2 - payment).
- 'pmts_month_158T' : Month of payment for a closed contract 
(num_group1 - existing contract, num_group2 - payment).
- 'pmts_month_706T' : Month of payment for active contract (num_group1 - terminated contract, num_group2 - payment).
- 'pmts_overdue_1140A' : Overdue payment for an active contract (num_group1 - existing contract, num_group2 - payment).
- 'pmts_overdue_1152A' : Overdue payment for a closed contract (num_group1 - terminated contract, num_group2 - payment).
- 'pmts_year_1139T' : Year of payment for an active contract (num_group1 - existing contract, num_group2 - payment).
- 'pmts_year_507T' : Payment year for a closed credit contract (num_group1 - terminated contract, num_group2 - payment).
- 'subjectroles_name_541M' : Name of subject role in closed credit contract (num_group1 - terminated contract, num_group2 - subject roles).
- 'subjectroles_name_838M' : Name of subject role in active credit contract (num_group1 - existing contract, num_group2 - subject roles).

  active contract closed contract
Collateral  valuation type 298M 407M
Value 1124L 876L
Type 669M 359M
DPD 1073P 303P
Month of payment 158T 706T
Overdue payment 1140A 1152A
Year of payment 1139T 507T
subject role 838M 541M

 

collateral : 담보

 

    data type null_유무 null 개수 null 비율 0 개수 nunique
- 'case_id'   int64 FALSE 0 0.00 0 1,385,288
- 'collater_typofvalofguarant_298M' M object FALSE 0 0.00 0
- 'collater_typofvalofguarant_407M' M object FALSE 0 0.00 0
- 'collater_valueofguarantee_1124L' L float64 TRUE 185,545,117 0.99 2,545,706 83,274 
- 'collater_valueofguarantee_876L' L float64 TRUE 181,414,202 0.96 6,155,687 148,992 
- 'collaterals_typeofguarante_359M' M object FALSE 0 0.00   15 
- 'collaterals_typeofguarante_669M' M object FALSE 0 0.00   15 
- 'num_group1'   int64 FALSE 0 0.00   333
- 'num_group2'   int64 FALSE 0 0.00   101
- 'pmts_dpd_1073P' P float64 TRUE 152,991,152 0.81 33,295,090 4,414
- 'pmts_dpd_303P' P float64 TRUE 113,465,446 0.60 62,769,627 4,771
- 'pmts_month_158T' T float64 TRUE 122,168,636 0.65 0 12
- 'pmts_month_706T' T float64 TRUE 29,568,980 0.16 0 12
- 'pmts_overdue_1140A' A float64 TRUE 152,850,520 0.81 33,402,002 1,115,375
- 'pmts_overdue_1152A' A float64 TRUE 113,377,640 0.60 62,326,665 3,260,905
- 'pmts_year_1139T' T float64 TRUE 122,168,636 0.65 0 13
- 'pmts_year_507T' T float64 TRUE 29,568,980 0.16 0 26
- 'subjectroles_name_541M' M object FALSE 0 0.00 0 11
- 'subjectroles_name_838M' M object FALSE 0 0.00 0 9

 

int 로변환해도 되는 컬럼

- 'pmts_month_158T'
- 'pmts_month_706T'

- 'pmts_year_1139T'
- 'pmts_year_507T'

 

기타 특이사항

- 'pmts_month_158T' : 모든 value count가 5510818로 동일

- 'pmts_month_706T' : 모든 value count가 13227456로 동일

a55475b1이 null일지도???

 

 

 

 

 

 

 

credit_bureau_a_2.case_id.value_counts()

 

case_id value_counts
214097 9404
221467 7859
973076 6764
1706379 5986
941237 4968
  ...
1285135 12
937186 12
631714 12
774521 2
218801 1

 

Name: case_id, Length: 1385288, dtype: int64

 

bureau에 정보가 많다는 건 무슨 뜻일까?

 

마스킹된 데이터는 그 값이 무슨 의미인지 찾기보다 그 값에 따른 경향을 봐야할 듯

 


static_cb_0

(1500476, 53)

더보기

- case_id
- assignmentdate_238D : Tax authority data - date of assignment.
- assignmentdate_4527235D : Tax authority data - Date of assignment.
- assignmentdate_4955616D : Tax authority assignment date.
- birthdate_574D : Client's date of birth (credit bureau data).
- contractssum_5085716L : Total sum of values of contracts retrieved from external credit bureau.
- dateofbirth_337D : Client's date of birth.
- dateofbirth_342D : Client's date of birth.
- days120_123L : Number of credit bureau queries for the last 120 days.
- days180_256L : Number of credit bureau queries for last 180 days.
- days30_165L : Number of credit bureau queries for the last 30 days.
- days360_512L : Number of Credit Bureau queries for last 360 days.
- days90_310L : Number of credit bureau queries for the last 90 days.
- description_5085714M : Categorization of clients by credit bureau.
- education_1103M : Level of education of the client provided by external source.
- education_88M : Education level of the client.
- firstquarter_103L : Number of results obtained from credit bureau in the first quarter.
- for3years_128L : Number of rejected applications in the past 3 years.
- for3years_504L : Client's credit history data over the last three years.
- for3years_584L : Number of cancellations in the last 3 years.
- formonth_118L : Number of rejections in a month.
- formonth_206L : Number of cancelations in the previous month.
- formonth_535L : Credit history for the last month.
- forquarter_1017L : Number of cancellations recorded in the credit bureau in the last quarter.
- forquarter_462L : Number of credit applications that were rejected in the last quarter.

- forquarter_634L : Credit history for the last quarter.
- fortoday_1092L : Client's credit history for today.
- forweek_1077L : Number of cancelations in the last week.
- forweek_528L : Credit history for the last week.
- forweek_601L : Number of rejected applications in the last week.

- foryear_618L : Number of application rejections in the previous year.
- foryear_818L : Number of cancelations that occurred in last year.
- foryear_850L : Credit history for the last year.
- fourthquarter_440L : Number of results in fourth quarter.
- maritalst_385M : Marital status of the client.
- maritalst_893M : Marital status of the client.
- numberofqueries_373L : Number of queries to credit bureau.
- pmtaverage_3A : Average of tax deductions.
- pmtaverage_4527227A : Average of tax deductions.
- pmtaverage_4955615A : Average of tax deductions.

- pmtcount_4527229L : Number of tax deductions.
- pmtcount_4955617L : Number of tax deductions.
- pmtcount_693L : Number of tax deductions.
- pmtscount_423L : Number of tax deduction payments.
- pmtssum_45A : Sum of tax deductions for the client.
- requesttype_4525192L : Tax authority request type.
- responsedate_1012D : Tax authority's response date.
- responsedate_4527233D : Tax authority's response date.
- responsedate_4917613D : Tax authority's response date.
- riskassesment_302T : Estimated probability that the client will default on their credit obligation within the next year.
- riskassesment_940T : Estimate of client's creditworthiness.
- secondquarter_766L : Number of results in second quarter.
- thirdquarter_1082L : Number of results in third quarter.

tax deductions : 세금 공제액

 

L    32
D     9
M     5
A     4
T     2

 

- days30_165L : Number of credit bureau queries for the last 30 days.

- days90_310L : Number of credit bureau queries for the last 90 days.

- days120_123L : Number of credit bureau queries for the last 120 days.
- days180_256L : Number of credit bureau queries for last 180 days.
- days360_512L : Number of Credit Bureau queries for last 360 days.
- numberofqueries_373L : Number of queries to credit bureau.

 

- pmtcount_4527229L : Number of tax deductions.
- pmtcount_4955617L : Number of tax deductions.
- pmtcount_693L : Number of tax deductions.

 

  rejections  cancellations  Credit history
in the last 3 years. 128L 584L 504L
in the previous month. 118L 206L 535L
in the last quarter. 462L 1017L 634L
in the last week 601L 1077L 528L
in the previous year. 618L 818L 850L
fortoday     1092L

 

 

quarter_1 103L
quarter_2 766L
quarter_3 1082L
quarter_4 440L

 

 

assignment date(tax authority) 238D 4527235D 4955616D
date of birth 574D
(credit bureau data)
337D 342D

 

  30 90 120 180 360 total?
num of queries 165L 310L 123L 256L 512L 373L

 

education 1103M 88M
marital status 385M 893M

 

Average of tax deductions. 3A 4527227A 4955615A    
Number of tax deductions. 4527229L 4955617L 693L 423L 45A (sum)

 

request type. 4525192L    
response date 1012D 4527233D 4917613D

 

probability of default 302T
creditworthiness. 940T

 

 

 

train_static_cb_0.columns[6]

dateofbirth_337D

null : 114785

count 합계 : 1385691

 

train_static_cb_0.columns[7]

dateofbirth_342D

null : 1463976

count 합계 :  36500

 

왜 같은 dateofbirth 컬럼이 두개가 있는지....?

 

null값 114785개가 반복되는 것 같음

null값 1463962개도 몇번 반복됨

 

cancellation과 rejection의 차이?

 

cancellation / rejection / credit history - 기간을 조금씩 다르게 하여 컬럼들이 있음, 기간들을 하나의 수치로 변환해서 한개의 컬럼으로 만든다든가 등의 조작이 가능할 것 같음

 

Average of tax deductions.

#### 37_train_static_cb_0.pmtaverage_3A

#### 38_train_static_cb_0.pmtaverage_4527227A

#### 39_train_static_cb_0.pmtaverage_4955615A

 

Number of tax deductions.

#### 40_train_static_cb_0.pmtcount_4527229L

#### 41_train_static_cb_0.pmtcount_4955617L

#### 42_train_static_cb_0.pmtcount_693L

 

 

#### 45_train_static_cb_0.requesttype_4525192L

- Tax authority request type.

데이터 타입은? object
NA 값이 있나요? (True, 827212, 0.5512997208885714)
분포는? 
DEDUCTION_6    550387
PENSION_6      117047
SOCIAL_6         5830
Name: requesttype_4525192L, dtype: int64
합계 : 673264
count 최소 최대: (5830, 550387)
num of unique values : 3

 

Tax authority's response date.

#### 46_train_static_cb_0.responsedate_1012D

#### 47_train_static_cb_0.responsedate_4527233D

#### 48_train_static_cb_0.responsedate_4917613D

 

#### 49_train_static_cb_0.riskassesment_302T
- 302T
- Estimated probability that the client will default on their credit obligation within the next year.

데이터 타입은? object
NA 값이 있나요? (True, 1446917, 0.9643053271095305)
분포는? 
3% - 4%       6445
2% - 3%       6147
4% - 6%       5681
2% - 2%       4948
6% - 8%       4561
67% - 100%    4549
1% - 1%       4103
8% - 11%      3922
11% - 15%     2989
15% - 19%     2244
20% - 25%     1569
33% - 41%     1556
26% - 33%     1495
41% - 49%     1370
50% - 58%     1187
59% - 66%      793
Name: riskassesment_302T, dtype: int64
합계 : 53559
count 최소 최대: (793, 6445)
num of unique values : 16

 

encoding 필요

 

downpayment : 착수금? 계약금?

 

 

분명 case_id로 합쳐야하는데.... mean을 해야하는 컬럼과, sum을 해야하는 컬럼이 아직 감이 안 잡힘

아직 num_group1, num_group2의 분포를 모르겠음

 

applprev_1

(6525979, 41)

더보기

- 'case_id'
- 'actualdpd_943P' : 
- 'annuity_853A'
- 'approvaldate_319D'
- 'byoccupationinc_3656910L'
- 'cancelreason_3545846M'
- 'childnum_21L'
- 'creationdate_885D'
- 'credacc_actualbalance_314A'
- 'credacc_credlmt_575A'
- 'credacc_maxhisbal_375A'
- 'credacc_minhisbal_90A'
- 'credacc_status_367L'
- 'credacc_transactions_402L'
- 'credamount_590A'
- 'credtype_587L'
- 'currdebt_94A'
- 'dateactivated_425D'
- 'district_544M'
- 'downpmt_134A'
- 'dtlastpmt_581D'
- 'dtlastpmtallstes_3545839D'
- 'education_1138M'
- 'employedfrom_700D'
- 'familystate_726L'
- 'firstnonzeroinstldate_307D'
- 'inittransactioncode_279L'
- 'isbidproduct_390L'
- 'isdebitcard_527L'
- 'mainoccupationinc_437A'
- 'maxdpdtolerance_577P'
- 'num_group1'
- 'outstandingdebt_522A'
- 'pmtnum_8L'
- 'postype_4733339M'
- 'profession_152M'
- 'rejectreason_755M'
- 'rejectreasonclient_4145042M'
- 'revolvingaccount_394A'
- 'status_219L'
- 'tenor_203L'

 

 

 

 

 

todo

  • data 찍어보기
    • credit_bureau_a_2
    • static_cb_0
    • applprev_1
  • home credit wiki 읽어보기
  • 찍어본 data의 의미 파악하기

 

 

null 시각화 잘 되어있는 discussion

https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/473950

 

 

variables, description, dtype, tables

https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/475922

 

https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/475695