- credit_bureau_a_2 : (188298452, 19)
- static_cb_0 : (1500476, 53)
- applprev_1 : (6525979, 41)
credit_bureau_a_2
(188298452, 19)
columns
- 'case_id'
- 'collater_typofvalofguarant_298M' : Collateral valuation type (active contract).
- 'collater_typofvalofguarant_407M' : Collateral valuation type (closed contract).
- 'collater_valueofguarantee_1124L' : Value of collateral for active contract.
- 'collater_valueofguarantee_876L' : Value of collateral for closed contract.
- 'collaterals_typeofguarante_359M' : Type of collateral that was used as a guarantee for a closed contract.
- 'collaterals_typeofguarante_669M' : Collateral type for the active contract.
- 'num_group1'
- 'num_group2'
- 'pmts_dpd_1073P' : Days past due of the payment for the active contract
(num_group1 - existing contract, num_group2 - payment).
- 'pmts_dpd_303P' : Days past due of the payment for terminated contract according to credit bureau
(num_group1 - terminated contract, num_group2 - payment).
- 'pmts_month_158T' : Month of payment for a closed contract
(num_group1 - existing contract, num_group2 - payment).
- 'pmts_month_706T' : Month of payment for active contract (num_group1 - terminated contract, num_group2 - payment).
- 'pmts_overdue_1140A' : Overdue payment for an active contract (num_group1 - existing contract, num_group2 - payment).
- 'pmts_overdue_1152A' : Overdue payment for a closed contract (num_group1 - terminated contract, num_group2 - payment).
- 'pmts_year_1139T' : Year of payment for an active contract (num_group1 - existing contract, num_group2 - payment).
- 'pmts_year_507T' : Payment year for a closed credit contract (num_group1 - terminated contract, num_group2 - payment).
- 'subjectroles_name_541M' : Name of subject role in closed credit contract (num_group1 - terminated contract, num_group2 - subject roles).
- 'subjectroles_name_838M' : Name of subject role in active credit contract (num_group1 - existing contract, num_group2 - subject roles).
active contract | closed contract | ||
Collateral | valuation type | 298M | 407M |
Value | 1124L | 876L | |
Type | 669M | 359M | |
DPD | 1073P | 303P | |
Month of payment | 158T | 706T | |
Overdue payment | 1140A | 1152A | |
Year of payment | 1139T | 507T | |
subject role | 838M | 541M |
collateral : 담보
data type | null_유무 | null 개수 | null 비율 | 0 개수 | nunique | ||
- 'case_id' | int64 | FALSE | 0 | 0.00 | 0 | 1,385,288 | |
- 'collater_typofvalofguarant_298M' | M | object | FALSE | 0 | 0.00 | 0 | 5 |
- 'collater_typofvalofguarant_407M' | M | object | FALSE | 0 | 0.00 | 0 | 8 |
- 'collater_valueofguarantee_1124L' | L | float64 | TRUE | 185,545,117 | 0.99 | 2,545,706 | 83,274 |
- 'collater_valueofguarantee_876L' | L | float64 | TRUE | 181,414,202 | 0.96 | 6,155,687 | 148,992 |
- 'collaterals_typeofguarante_359M' | M | object | FALSE | 0 | 0.00 | 15 | |
- 'collaterals_typeofguarante_669M' | M | object | FALSE | 0 | 0.00 | 15 | |
- 'num_group1' | int64 | FALSE | 0 | 0.00 | 333 | ||
- 'num_group2' | int64 | FALSE | 0 | 0.00 | 101 | ||
- 'pmts_dpd_1073P' | P | float64 | TRUE | 152,991,152 | 0.81 | 33,295,090 | 4,414 |
- 'pmts_dpd_303P' | P | float64 | TRUE | 113,465,446 | 0.60 | 62,769,627 | 4,771 |
- 'pmts_month_158T' | T | float64 | TRUE | 122,168,636 | 0.65 | 0 | 12 |
- 'pmts_month_706T' | T | float64 | TRUE | 29,568,980 | 0.16 | 0 | 12 |
- 'pmts_overdue_1140A' | A | float64 | TRUE | 152,850,520 | 0.81 | 33,402,002 | 1,115,375 |
- 'pmts_overdue_1152A' | A | float64 | TRUE | 113,377,640 | 0.60 | 62,326,665 | 3,260,905 |
- 'pmts_year_1139T' | T | float64 | TRUE | 122,168,636 | 0.65 | 0 | 13 |
- 'pmts_year_507T' | T | float64 | TRUE | 29,568,980 | 0.16 | 0 | 26 |
- 'subjectroles_name_541M' | M | object | FALSE | 0 | 0.00 | 0 | 11 |
- 'subjectroles_name_838M' | M | object | FALSE | 0 | 0.00 | 0 | 9 |
int 로변환해도 되는 컬럼
- 'pmts_month_158T'
- 'pmts_month_706T'
- 'pmts_year_1139T'
- 'pmts_year_507T'
기타 특이사항
- 'pmts_month_158T' : 모든 value count가 5510818로 동일
- 'pmts_month_706T' : 모든 value count가 13227456로 동일
a55475b1이 null일지도???
credit_bureau_a_2.case_id.value_counts()
case_id | value_counts |
214097 | 9404 |
221467 | 7859 |
973076 | 6764 |
1706379 | 5986 |
941237 | 4968 |
... | |
1285135 | 12 |
937186 | 12 |
631714 | 12 |
774521 | 2 |
218801 | 1 |
Name: case_id, Length: 1385288, dtype: int64
bureau에 정보가 많다는 건 무슨 뜻일까?
마스킹된 데이터는 그 값이 무슨 의미인지 찾기보다 그 값에 따른 경향을 봐야할 듯
static_cb_0
(1500476, 53)
- case_id
- assignmentdate_238D : Tax authority data - date of assignment.
- assignmentdate_4527235D : Tax authority data - Date of assignment.
- assignmentdate_4955616D : Tax authority assignment date.
- birthdate_574D : Client's date of birth (credit bureau data).
- contractssum_5085716L : Total sum of values of contracts retrieved from external credit bureau.
- dateofbirth_337D : Client's date of birth.
- dateofbirth_342D : Client's date of birth.
- days120_123L : Number of credit bureau queries for the last 120 days.
- days180_256L : Number of credit bureau queries for last 180 days.
- days30_165L : Number of credit bureau queries for the last 30 days.
- days360_512L : Number of Credit Bureau queries for last 360 days.
- days90_310L : Number of credit bureau queries for the last 90 days.
- description_5085714M : Categorization of clients by credit bureau.
- education_1103M : Level of education of the client provided by external source.
- education_88M : Education level of the client.
- firstquarter_103L : Number of results obtained from credit bureau in the first quarter.
- for3years_128L : Number of rejected applications in the past 3 years.
- for3years_504L : Client's credit history data over the last three years.
- for3years_584L : Number of cancellations in the last 3 years.
- formonth_118L : Number of rejections in a month.
- formonth_206L : Number of cancelations in the previous month.
- formonth_535L : Credit history for the last month.
- forquarter_1017L : Number of cancellations recorded in the credit bureau in the last quarter.
- forquarter_462L : Number of credit applications that were rejected in the last quarter.
- forquarter_634L : Credit history for the last quarter.
- fortoday_1092L : Client's credit history for today.
- forweek_1077L : Number of cancelations in the last week.
- forweek_528L : Credit history for the last week.
- forweek_601L : Number of rejected applications in the last week.
- foryear_618L : Number of application rejections in the previous year.
- foryear_818L : Number of cancelations that occurred in last year.
- foryear_850L : Credit history for the last year.
- fourthquarter_440L : Number of results in fourth quarter.
- maritalst_385M : Marital status of the client.
- maritalst_893M : Marital status of the client.
- numberofqueries_373L : Number of queries to credit bureau.
- pmtaverage_3A : Average of tax deductions.
- pmtaverage_4527227A : Average of tax deductions.
- pmtaverage_4955615A : Average of tax deductions.
- pmtcount_4527229L : Number of tax deductions.
- pmtcount_4955617L : Number of tax deductions.
- pmtcount_693L : Number of tax deductions.
- pmtscount_423L : Number of tax deduction payments.
- pmtssum_45A : Sum of tax deductions for the client.
- requesttype_4525192L : Tax authority request type.
- responsedate_1012D : Tax authority's response date.
- responsedate_4527233D : Tax authority's response date.
- responsedate_4917613D : Tax authority's response date.
- riskassesment_302T : Estimated probability that the client will default on their credit obligation within the next year.
- riskassesment_940T : Estimate of client's creditworthiness.
- secondquarter_766L : Number of results in second quarter.
- thirdquarter_1082L : Number of results in third quarter.
tax deductions : 세금 공제액
L 32
D 9
M 5
A 4
T 2
- days30_165L : Number of credit bureau queries for the last 30 days.
- days90_310L : Number of credit bureau queries for the last 90 days.
- days120_123L : Number of credit bureau queries for the last 120 days.
- days180_256L : Number of credit bureau queries for last 180 days.
- days360_512L : Number of Credit Bureau queries for last 360 days.
- numberofqueries_373L : Number of queries to credit bureau.
- pmtcount_4527229L : Number of tax deductions.
- pmtcount_4955617L : Number of tax deductions.
- pmtcount_693L : Number of tax deductions.
rejections | cancellations | Credit history | |
in the last 3 years. | 128L | 584L | 504L |
in the previous month. | 118L | 206L | 535L |
in the last quarter. | 462L | 1017L | 634L |
in the last week | 601L | 1077L | 528L |
in the previous year. | 618L | 818L | 850L |
fortoday | 1092L |
quarter_1 | 103L |
quarter_2 | 766L |
quarter_3 | 1082L |
quarter_4 | 440L |
assignment date(tax authority) | 238D | 4527235D | 4955616D |
date of birth | 574D (credit bureau data) |
337D | 342D |
30 | 90 | 120 | 180 | 360 | total? | |
num of queries | 165L | 310L | 123L | 256L | 512L | 373L |
education | 1103M | 88M |
marital status | 385M | 893M |
Average of tax deductions. | 3A | 4527227A | 4955615A | ||
Number of tax deductions. | 4527229L | 4955617L | 693L | 423L | 45A (sum) |
request type. | 4525192L | ||
response date | 1012D | 4527233D | 4917613D |
probability of default | 302T |
creditworthiness. | 940T |
train_static_cb_0.columns[6]
dateofbirth_337D
null : 114785
count 합계 : 1385691
train_static_cb_0.columns[7]
dateofbirth_342D
null : 1463976
count 합계 : 36500
왜 같은 dateofbirth 컬럼이 두개가 있는지....?
null값 114785개가 반복되는 것 같음
null값 1463962개도 몇번 반복됨
cancellation과 rejection의 차이?
cancellation / rejection / credit history - 기간을 조금씩 다르게 하여 컬럼들이 있음, 기간들을 하나의 수치로 변환해서 한개의 컬럼으로 만든다든가 등의 조작이 가능할 것 같음
Average of tax deductions.
#### 37_train_static_cb_0.pmtaverage_3A
#### 38_train_static_cb_0.pmtaverage_4527227A
#### 39_train_static_cb_0.pmtaverage_4955615A
Number of tax deductions.
#### 40_train_static_cb_0.pmtcount_4527229L
#### 41_train_static_cb_0.pmtcount_4955617L
#### 42_train_static_cb_0.pmtcount_693L
#### 45_train_static_cb_0.requesttype_4525192L
- Tax authority request type.
데이터 타입은? object
NA 값이 있나요? (True, 827212, 0.5512997208885714)
분포는?
DEDUCTION_6 550387
PENSION_6 117047
SOCIAL_6 5830
Name: requesttype_4525192L, dtype: int64
합계 : 673264
count 최소 최대: (5830, 550387)
num of unique values : 3
Tax authority's response date.
#### 46_train_static_cb_0.responsedate_1012D
#### 47_train_static_cb_0.responsedate_4527233D
#### 48_train_static_cb_0.responsedate_4917613D
#### 49_train_static_cb_0.riskassesment_302T
- 302T
- Estimated probability that the client will default on their credit obligation within the next year.
데이터 타입은? object
NA 값이 있나요? (True, 1446917, 0.9643053271095305)
분포는?
3% - 4% 6445
2% - 3% 6147
4% - 6% 5681
2% - 2% 4948
6% - 8% 4561
67% - 100% 4549
1% - 1% 4103
8% - 11% 3922
11% - 15% 2989
15% - 19% 2244
20% - 25% 1569
33% - 41% 1556
26% - 33% 1495
41% - 49% 1370
50% - 58% 1187
59% - 66% 793
Name: riskassesment_302T, dtype: int64
합계 : 53559
count 최소 최대: (793, 6445)
num of unique values : 16
encoding 필요
downpayment : 착수금? 계약금?
분명 case_id로 합쳐야하는데.... mean을 해야하는 컬럼과, sum을 해야하는 컬럼이 아직 감이 안 잡힘
아직 num_group1, num_group2의 분포를 모르겠음
applprev_1
(6525979, 41)
- 'case_id'
- 'actualdpd_943P' :
- 'annuity_853A'
- 'approvaldate_319D'
- 'byoccupationinc_3656910L'
- 'cancelreason_3545846M'
- 'childnum_21L'
- 'creationdate_885D'
- 'credacc_actualbalance_314A'
- 'credacc_credlmt_575A'
- 'credacc_maxhisbal_375A'
- 'credacc_minhisbal_90A'
- 'credacc_status_367L'
- 'credacc_transactions_402L'
- 'credamount_590A'
- 'credtype_587L'
- 'currdebt_94A'
- 'dateactivated_425D'
- 'district_544M'
- 'downpmt_134A'
- 'dtlastpmt_581D'
- 'dtlastpmtallstes_3545839D'
- 'education_1138M'
- 'employedfrom_700D'
- 'familystate_726L'
- 'firstnonzeroinstldate_307D'
- 'inittransactioncode_279L'
- 'isbidproduct_390L'
- 'isdebitcard_527L'
- 'mainoccupationinc_437A'
- 'maxdpdtolerance_577P'
- 'num_group1'
- 'outstandingdebt_522A'
- 'pmtnum_8L'
- 'postype_4733339M'
- 'profession_152M'
- 'rejectreason_755M'
- 'rejectreasonclient_4145042M'
- 'revolvingaccount_394A'
- 'status_219L'
- 'tenor_203L'
todo
- data 찍어보기
credit_bureau_a_2static_cb_0applprev_1
- home credit wiki 읽어보기
- 찍어본 data의 의미 파악하기
null 시각화 잘 되어있는 discussion
https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/473950
variables, description, dtype, tables
https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/475922
https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/475695
'Upstage AI Lab 2기' 카테고리의 다른 글
Upstage AI Lab 2기 [Day055] Deep Learning - 손실 함수 (0) | 2024.03.03 |
---|---|
Upstage AI Lab 2기 [Day055] Deep Learning (1) | 2024.02.29 |
Upstage AI Lab 2기 [Day043] ML 프로젝트 (day1) (0) | 2024.02.13 |
Upstage AI Lab 2기 [Day042] 자료구조 및 알고리즘 (3) (0) | 2024.02.09 |
Upstage AI Lab 2기 [Day041] 자료구조 및 알고리즘 (2) (0) | 2024.02.07 |