Data Science / Introduction to Data Science, 2024. 3. 17. 13:48

Predictive Modeling

General procedure
- Build the model that best represents the data.
- Apply the model to new data to predict the outcome.

Model

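: A simplified representation of reality, built for a specific purpose.

The simplification is driven by what matters and what does not matter for that purpose.
Irrelevant information is discarded while the relevant information is kept.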

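Types of Model: Predictive Model

: A formula for estimating an unknown value of interest (the target).

Forms the formula can take
- A mathematical expression (e.g. linear regression)
- A logical statement or rule (e.g. decision tree)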

Types of Model: Descriptive Model

: The main purpose is to gain insight from the data (e.g. clustering, profiling).

	Predictive model	Descriptive model

Performance criterion	predictive accuracy	intelligibility or understandability
	quantitative	qualitative

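Terminologies for Classification

Model: estimates the target value as a function of the attribute values.
- (That is, the value of the target is estimated as a function of the values of a data instance.)
Label: the value of the target attribute in the training data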

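Model Induction

Creating a model from data → generalization
A model is a general rule in a statistical sense; it is not correct 100% of the time.
Terms
- Induction algorithm: the procedure that creates a model from data
- Training data: the data fed to the induction algorithm to train the model
- Labeled data: data whose target attribute value is known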

Classification

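Basic idea
- Split the data into subgroups that differ in the target attribute, so that the instances inside each subgroup have similar target values.
- Once the data has been segmented by attribute values, the segments can be used to predict the target attribute.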

Selecting Informative Attributes

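Informative attributes are used to split the data into meaningful groups.
Informative attribute
- An attribute that carries important information about the target value
- It helps predict the target value
Pure: every member of the group has the same value of the target variable.
⇒ Impure: at least one member has a different target value.
: In other words, a group is pure when it contains only instances with the same target attribute value.
It is best to split into groups that are as pure as possible. → Find the attribute that reduces impurity!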
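
The idea of pure and impure groups can be sketched in a few lines of Python. The data and the names `split_by` and `is_pure` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

def split_by(rows, attribute, target):
    """Group the target labels by the value each row has for `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    return dict(groups)

def is_pure(labels):
    """A group is pure when all of its members share one target value."""
    return len(set(labels)) == 1

# Hypothetical toy data, invented for illustration only.
rows = [
    {"employed": "yes", "write_off": "no"},
    {"employed": "yes", "write_off": "no"},
    {"employed": "no", "write_off": "yes"},
    {"employed": "no", "write_off": "no"},
]
groups = split_by(rows, "employed", "write_off")
purity = {value: is_pure(labels) for value, labels in groups.items()}
print(purity)
```

Here the `employed = yes` group is pure and the `employed = no` group is impure, because one of its members has a different target value.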

Complications

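Attributes rarely split a group perfectly.
How should a nominal attribute be split well? (a split that yields one more pure subgroup vs. a split whose subgroups are all mixed but purer on average)
How should the data be split when using numerical attributes?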

Splitting Criterion

Use a formula that evaluates how well an attribute splits the data (that is, one that can measure purity).

Information gain
- The most common splitting criterion
- Based on entropy

Entropy

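: A measure of disorder, that is, how mixed the classes in a set are.

Equation for entropy

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the proportion of instances in $S$ that belong to class $i$.

H(S) is maximized when the classes are evenly mixed,
and minimized (H(S) = 0) when all instances belong to the same class.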
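
A minimal sketch of this measure, assuming the usual Shannon entropy with a base-2 logarithm, H(S) = -Σ p_i log2 p_i, where p_i is the share of class i in the set. The function name and the sample labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

mixed = ["yes", "no", "yes", "no"]   # two classes, evenly mixed
pure = ["yes", "yes", "yes", "yes"]  # every member has the same class
print(entropy(mixed), entropy(pure))
```

The evenly mixed two-class set gives the largest value for two classes (1 bit), and the pure set gives 0, which matches the two extremes described above.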

Information Gain (IG)

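Entropy only tells us how impure a single set is.

Information gain measures how much an attribute improves (that is, decreases) entropy over the whole segmentation it creates.

Equation

$$IG(\text{parent}, \text{children}) = H(\text{parent}) - \sum_{j} p(c_j)\, H(c_j)$$

where $c_j$ is the $j$-th child segment and $p(c_j)$ is the proportion of the parent's instances that fall into it.

When there are several attributes, compare their IG values and build the tree by splitting first on the attribute with the largest IG.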
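
Comparing attributes by IG can be sketched as follows, assuming IG is the entropy of the parent set minus the size-weighted entropy of the child segments. The attribute names and rows are invented for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    parent = [row[target] for row in rows]
    # Group the target labels by the value each row has for `attribute`.
    children = {}
    for row in rows:
        children.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy(parent) - weighted

# Hypothetical toy data, invented for illustration only.
rows = [
    {"owns_home": "yes", "employed": "yes", "write_off": "no"},
    {"owns_home": "yes", "employed": "no", "write_off": "no"},
    {"owns_home": "no", "employed": "yes", "write_off": "yes"},
    {"owns_home": "no", "employed": "no", "write_off": "yes"},
]
gains = {a: information_gain(rows, a, "write_off") for a in ("owns_home", "employed")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data `owns_home` separates the classes perfectly while `employed` leaves both children as mixed as the parent, so `owns_home` would be chosen first.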

Numerical Variables

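The categorical attributes above were easy to split.

Numerical variables need a different idea.

Choose one or more split points to separate the numeric values (splitting into two classes) → 10C2
- The result is then treated like a categorical attribute.
Choose several split points (splitting into three or more classes) → 10Cn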
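
One common way to pick a single split point is sketched below. It assumes that candidate thresholds are the midpoints between consecutive distinct values and that candidates are scored by information gain; the notes do not prescribe either choice, and the ages and labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`."""
    parent_entropy = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate split points: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent_entropy - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical toy data, invented for illustration only.
ages = [22, 25, 31, 47, 52, 60]
labels = ["yes", "yes", "yes", "no", "no", "no"]
threshold, gain = best_threshold(ages, labels)
print(threshold, gain)
```

After the threshold is chosen, the test `age <= threshold` behaves like a two-valued categorical attribute, as described above.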

Tree-Structured Segmentation

The data can also be split on two or more attributes. Splitting repeatedly in this way produces a tree structure.

Classification Tree (Decision Tree)

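Structure
- Interior node: contains a test on an attribute.
- Terminal node (leaf node): represents a segment and its classification.
- Branch: represents a distinct value or a range of values of the attribute.
- Path from the root to a leaf: describes the characteristics of the segment.
Procedure
- To classify an instance whose class is unknown, start at the root node and, at each interior node, follow the branch that matches the instance's attribute value. The terminal node that is reached gives the classification.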
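
The procedure can be sketched with a tree stored as nested dictionaries. The dictionary layout, the attribute names, and every leaf except the Balance/Age "Write-off" path are assumptions made for this illustration; attribute values are assumed to be already binned into the branch labels.

```python
# Interior nodes hold a test on one attribute; leaves hold a class label.
tree = {
    "attribute": "Balance",
    "branches": {
        "<=50K": {
            "attribute": "Age",
            "branches": {
                "<50": {"label": "Write-off"},
                ">=50": {"label": "No Write-off"},  # hypothetical leaf
            },
        },
        ">50K": {"label": "No Write-off"},  # hypothetical leaf
    },
}

def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches the instance."""
    while "label" not in node:
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node["label"]

print(classify(tree, {"Balance": "<=50K", "Age": "<50"}))
print(classify(tree, {"Balance": ">50K", "Age": "<50"}))
```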

Creating Classification Tree (Tree induction)

Uses a divide-and-conquer approach.
- First, select the attribute that best splits the entire dataset.
- Then apply the same step recursively to each subgroup.
Induction process summary
- Split the data recursively.
- At each step, select the attribute based on IG.
- Stop when
  - every leaf node is pure,
  - there are no more variables to split on, or
  - earlier than that, to avoid overfitting.
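
A minimal recursive sketch of this process, for categorical attributes only. The `max_depth` parameter is a hypothetical stand-in for "stop earlier", since the notes do not say which early-stopping rule to use, and the rows are invented.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    children = {}
    for row in rows:
        children.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy([row[target] for row in rows]) - weighted

def build_tree(rows, attributes, target, max_depth=3, depth=0):
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: segment is pure, no attribute is left, or the depth limit is reached.
    if len(set(labels)) == 1 or not attributes or depth >= max_depth:
        return {"label": majority}
    # Divide: pick the attribute with the highest IG on this subgroup.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Conquer: repeat the same step on every subgroup.
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target, max_depth, depth + 1)
    return {"attribute": best, "branches": branches}

# Hypothetical toy data, invented for illustration only.
rows = [
    {"balance": "low", "age": "young", "class": "write-off"},
    {"balance": "low", "age": "young", "class": "write-off"},
    {"balance": "low", "age": "old", "class": "no write-off"},
    {"balance": "high", "age": "young", "class": "no write-off"},
    {"balance": "high", "age": "old", "class": "no write-off"},
    {"balance": "high", "age": "old", "class": "no write-off"},
    {"balance": "high", "age": "young", "class": "no write-off"},
]
tree = build_tree(rows, ["balance", "age"], "class")
print(tree)
```

On this data the root tests `balance`, the `high` branch is already pure, and the `low` branch is split once more on `age`.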

Visualizing Segmentation

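Once a classification tree has been built,

it is often useful to visualize how the tree partitions the instance space.

Visualization is only possible for two or three attributes at a time, so one option is to pick up to three attributes and plot them.
Still, such visualizations provide insight that carries over to higher-dimensional spaces.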

Trees as Sets of Rules

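Classification trees are popular because they are easy to interpret; they are not as complicated as a mathematical formula.
A classification tree can also be read as a set of logical statements.
- Each path from the root to a leaf node represents a rule.
- Each rule joins the conditions along its path with AND.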
ex) if (Balance≤50K) AND (Age<50) THEN Class=Write-off
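
Reading rules off a tree can be sketched as a walk over every root-to-leaf path. The nested-dictionary layout and all leaves other than the Write-off path from the example are assumptions made for this illustration.

```python
# Hypothetical tree; only the Balance/Age "Write-off" path comes from the example.
tree = {
    "attribute": "Balance",
    "branches": {
        "<=50K": {
            "attribute": "Age",
            "branches": {
                "<50": {"label": "Write-off"},
                ">=50": {"label": "No Write-off"},
            },
        },
        ">50K": {"label": "No Write-off"},
    },
}

def extract_rules(node, conditions=()):
    """Return one IF ... THEN rule per root-to-leaf path."""
    if "label" in node:
        antecedent = " AND ".join(f"({c})" for c in conditions) or "TRUE"
        return [f"IF {antecedent} THEN Class={node['label']}"]
    rules = []
    for test, child in node["branches"].items():
        condition = f"{node['attribute']}{test}"
        rules.extend(extract_rules(child, conditions + (condition,)))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each leaf yields exactly one rule, so a tree with three leaves reads as three rules.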

Probability Estimation

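Sometimes we want to estimate a probability rather than just a class.

Probability estimation tree
- Each leaf node gives a probability. The closer to 1, the more confident the prediction; the closer to 0.5, the more uncertain.
Constructing a probability estimation tree
- Frequency-based probability estimation: compute the probability from counts.
  - Count the instances in each leaf node to estimate the class probability.
  - If a leaf node contains n positive instances and m negative instances,
  - the estimated probability of the positive class is n/(n+m).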
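
The frequency-based estimate is a one-line calculation. The function name and the counts below are hypothetical.

```python
def leaf_probability(n_positive, n_negative):
    """Frequency-based estimate of the positive-class probability at a leaf: n / (n + m)."""
    total = n_positive + n_negative
    if total == 0:
        raise ValueError("leaf contains no instances")
    return n_positive / total

# Hypothetical leaf with 30 positive and 10 negative training instances.
p = leaf_probability(30, 10)
print(p)
```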

Example: Churn Prediction Problem

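Probabilities are often used to find customers who are likely to leave so that they can be retained.

A probability above 0.5 means the customer is likely to leave; below 0.5 means the customer is likely to stay.

→ It is important to hold on to the borderline groups whose probability is only slightly above 0.5.

The root node is the attribute with the highest IG on the entire dataset.
Each of the other interior nodes is the attribute with the highest IG on its own subgroup.
When should tree building stop?
- Stop before the model becomes too complex.
- This is tied to model generality and overfitting.
How is the accuracy of the tree measured?
- Suppose the tree is applied to the original dataset and is observed to have an accuracy of 00%.
- Applied to a different dataset, the accuracy will change.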
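
The last point can be illustrated by scoring one fixed rule on two datasets. Everything below is hypothetical: the rule reuses the Balance/Age thresholds from the rule example, and both datasets are invented, so the numbers say nothing about a real tree.

```python
def accuracy(predict, rows, target):
    """Fraction of rows whose predicted class equals the true class."""
    correct = sum(1 for row in rows if predict(row) == row[target])
    return correct / len(rows)

def predict(row):
    # Hypothetical rule read off a tree.
    if row["balance"] <= 50_000 and row["age"] < 50:
        return "write-off"
    return "no write-off"

# Hypothetical data the tree was built from.
original = [
    {"balance": 30_000, "age": 35, "class": "write-off"},
    {"balance": 40_000, "age": 62, "class": "no write-off"},
    {"balance": 80_000, "age": 41, "class": "no write-off"},
    {"balance": 120_000, "age": 58, "class": "no write-off"},
]
# Hypothetical data the tree has not seen.
other = [
    {"balance": 20_000, "age": 29, "class": "write-off"},
    {"balance": 45_000, "age": 45, "class": "no write-off"},
    {"balance": 60_000, "age": 33, "class": "no write-off"},
    {"balance": 90_000, "age": 70, "class": "no write-off"},
]
acc_original = accuracy(predict, original, "class")
acc_other = accuracy(predict, other, "class")
print(acc_original, acc_other)
```

The same rule fits the original rows perfectly but misclassifies one of the unseen rows, which is why accuracy on the original dataset alone can be misleading.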

Summary

Predictive modeling
- One of the main tasks of data science.
- Build and use a model to estimate the target value for new examples.
Finding and selecting informative attributes
- A useful data mining procedure in its own right
- Find attributes that give information about another attribute (the target).
- Use information gain, which is based on entropy, a basic measure of purity.
Tree induction
- A modeling technique based on finding informative attributes
- Recursively finds informative attributes for subsets of the data.
- Partitions the space of all possible instances into a set of segments with different predicted values.
Tree induction is a very popular data mining procedure
- Easy to understand, easy to explain, and easy to run.
- Works well in practice.
