Support Vector Machines

Binary classification task

There are many hyperplanes that can separate the training data.

Principle of support vector classification

Training dataset $D = \left \{ (\mathbf{x_{1}},y_{1} ), (\mathbf{x_{2}},y_{2} ), ..., (\mathbf{x_{n}},y_{n} ) \right \}$

$\mathbf{x_{i}} = (x_{i}^{1}, ..., x_{i}^{d}) \in R^{d}$ is the i-th input vector, made up of d input variables.

$y_{i} \in \left \{ -1, +1 \right \}$ is the label of the corresponding output variable.

The SVM finds the maximum-margin hyperplane $\mathbf{w}^{T}\mathbf{x} + b = 0$ between the positive ($y_{i} = +1$) and negative ($y_{i} = -1$) data points.

Margin : $2 / \left \| \mathbf{w} \right \|$

Prediction $\hat{y} = f(\mathbf{x}) = sign(\mathbf{w}^{T}\mathbf{x} + b)$
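The margin and prediction formulas above can be checked with a few lines of NumPy. This is only a sketch: the values of `w`, `b` and `X` are made up for illustration and do not come from a trained model.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration.
w = np.array([3.0, 4.0])
b = -2.0

# Margin width: 2 / ||w||
margin = 2.0 / np.linalg.norm(w)

# Prediction: sign(w^T x + b) for each row of X
X = np.array([[1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]])
scores = X @ w + b
y_hat = np.sign(scores).astype(int)

print(margin)  # 0.4
print(y_hat)   # [ 1 -1  1]
```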

Hard-Margin Formulation & Soft-Margin Formulation

Hard-margin formulation

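No errors are allowed: no training point lies between $H_{1}$ and $H_{2}$, the two margin hyperplanes $\mathbf{w}^{T}\mathbf{x} + b = \pm 1$.

$\min J(\mathbf{w}, b) = \frac{1}{2}\mathbf{w}^{T}\mathbf{w}$ <- maximize the margin

subject to $y_{i}(\mathbf{w}^{T}\mathbf{x_{i}} + b) \geq 1, \forall i$ <- every training data point lies on or outside the margin boundary.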

Soft-margin formulation

Slack variables $\xi _{i} \geq 0$ are added to allow errors.

$\min J(\mathbf{w}, b, \xi _{i}) = \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i}\xi_{i}$

subject to $y_{i}(\mathbf{w}^{T}\mathbf{x_{i}}+b) \geq 1 - \xi_{i}$, $\xi_{i} \geq 0$, $\forall i$

$\frac{1}{2}\mathbf{w}^{T}\mathbf{w}$ <- maximize the margin

$C\sum_{i}\xi_{i}$ <- minimize the empirical risk (hinge loss); $C$ is the trade-off hyperparameter

$L(y, \mathbf{w}^{T}\mathbf{x} + b) = \max(0, 1 - y(\mathbf{w}^{T}\mathbf{x} + b))$

$y_{i}(\mathbf{w}^{T}\mathbf{x_{i}}+b) \geq 1 - \xi_{i}$ <- most training data points lie outside the margin, but some do not.
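To make the role of the slack variables concrete, here is a minimal sketch that evaluates the soft-margin objective for a fixed `w` and `b`. The helper name `soft_margin_objective` and the toy data are assumptions for illustration; at the optimum each slack equals the hinge loss of its point.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective value, slack per point) for fixed w and b."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)  # hinge loss = smallest feasible xi_i
    return 0.5 * (w @ w) + C * slacks.sum(), slacks

# Toy data: the second and fourth points violate the margin.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

objective, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(slacks)     # [0.  0.5 0.  1.5]
print(objective)  # 2.5
```

A point with slack between 0 and 1 sits inside the margin but on the correct side; a slack above 1 means the point is misclassified.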

Dual Problem (Quadratic Programming) -> a QP solver is required.

$\max L(\alpha) = \sum_{i}\alpha_{i} - \frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}\mathbf{x}_{i}^{T}\mathbf{x_{j}}$

subject to $\sum_{i}\alpha_{i}y_{i} = 0$, $0 \leq \alpha_{i} \leq C$, $\forall i$

Convex optimization -> Global optimum is guaranteed
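Rather than calling a QP solver by hand, one way to see the dual constraints is to inspect a fitted scikit-learn `SVC`, whose `dual_coef_` attribute holds $\alpha_{i}y_{i}$ for the support vectors. The synthetic data below is an assumption made only for this sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clusters (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ stores alpha_i * y_i, for the support vectors only.
signed_alpha = clf.dual_coef_.ravel()
alpha = np.abs(signed_alpha)

print("sum_i alpha_i y_i :", signed_alpha.sum())  # close to 0
print("largest alpha_i   :", alpha.max())         # never above C
```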

Trained support vector classification model

Soft-margin formulation

The optimal parameters $\mathbf{w}^{*}$ and $b^{*}$ are obtained.

$\mathbf{w}^{*} = \sum_{i=1}^{n}\alpha_{i}y_{i}\mathbf{x_{i}}$

$b^{*} = \frac{1}{y_{sv}} - \mathbf{w}^{*T}\mathbf{x}_{sv} = \frac{1}{y_{sv}} - \sum_{i=1}^{n} \alpha_{i}y_{i}\mathbf{x}_{i}^{T}\mathbf{x}_{sv}$

where $(\mathbf{x}_{sv}, y_{sv}) \in \left \{ (\mathbf{x_{i}}, y_{i}) \mid 0 < \alpha_{i} < C \right \}$

The trained model

$f(\mathbf{x}) = sign(\mathbf{w}^{*T}\mathbf{x} + b^{*}) = sign(\sum_{(\mathbf{x_{i}}, y_{i}) \in D} \alpha_{i}y_{i}\mathbf{x}_{i}^{T}\mathbf{x} + b^{*})$

Let $D_{SV} = \left \{ (\mathbf{x_{i}}, y_{i}) \in D \mid \alpha_{i} > 0 \right \}$, $f(\mathbf{x}) = sign(\sum_{(\mathbf{x_{i}}, y_{i}) \in D_{SV} } \alpha_{i}y_{i}\mathbf{x_{i}^{T}}\mathbf{x} + b^{*})$ (sparse solution)
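The sparse solution can be reproduced from a fitted linear `SVC`: only the support vectors are needed to rebuild $\mathbf{w}^{*}$ and the decision values. The data here is synthetic and serves only as an assumed example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w* = sum over support vectors of (alpha_i * y_i) * x_i
w_star = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b_star = clf.intercept_[0]

# Decision values rebuilt by hand: w*^T x + b*
manual = X @ w_star + b_star

n_sv = clf.support_vectors_.shape[0]
print("support vectors:", n_sv, "of", len(X))
print("matches coef_             :", np.allclose(w_star, clf.coef_.ravel()))
print("matches decision_function :", np.allclose(manual, clf.decision_function(X)))
```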

Support vector classification

import matplotlib.pyplot as plt
import mglearn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = mglearn.datasets.make_forge()

fig, axes = plt.subplots(1, 2, figsize=(10, 3))

# fit each linear model and draw its decision boundary on its own axis
for model, ax in zip([LinearSVC(), LogisticRegression()], axes):
    clf = model.fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=False, eps=0.5, ax=ax, alpha=0.7)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title(clf.__class__.__name__)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
axes[0].legend()

Using the forge dataset, LogisticRegression and LinearSVC models were fitted, and the decision boundaries produced by these linear models are plotted.

mglearn.plots.plot_linear_svc_regularization()

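Hyperparameter C (regularization strength) trade-off

Lowering C increases regularization.

The model pushes the coefficient vector (w) toward 0. -> underfitting

Raising C decreases regularization.

The model tries to fit the training set as closely as possible. -> overfitting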
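A quick numerical check of this trade-off is to compare $\left \| \mathbf{w} \right \|$ across values of C. The sketch below uses `SVC(kernel="linear")` rather than `LinearSVC` (an assumption made here because it solves the soft-margin problem exactly as formulated above), on synthetic overlapping clusters.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

norms = {}
for C in [0.001, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # smaller C -> stronger regularization -> coefficients shrink toward 0
    norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in norms.items():
    print(f"C={C:<7} ||w||={norm:.4f}")
```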

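Linear models and nonlinear features

Lines and hyperplanes are not flexible, so linear models are very limited on low-dimensional datasets.

The kernel support vector machine is an extension that can build more complex models, ones that are not defined by a simple hyperplane in the input space.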

from sklearn.datasets import make_blobs
X, y = make_blobs(centers=4, random_state=8)
y = y % 2

mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

from sklearn.svm import LinearSVC
linear_svm = LinearSVC().fit(X, y)

mglearn.plots.plot_2d_separator(linear_svm, X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

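A linear model for classification can only split the data points with a straight line, so it fits some datasets poorly.

The decision boundary shown here was found by a linear SVM.

One way to make a linear model more flexible is to add new features, for example by multiplying features together or raising a feature to a power.

As an example, let's expand the input by adding feature1 ** 2, the square of the second feature, as a new feature.

import numpy as np

# add the squared second feature (column index 1) as a third column
X_new = np.hstack([X, X[:, 1:] ** 2])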


from mpl_toolkits.mplot3d import Axes3D, axes3d
figure = plt.figure()
# visualize in 3D (add_subplot attaches the 3D axes to the figure on current matplotlib)
ax = figure.add_subplot(projection='3d')
ax.view_init(elev=-152, azim=-26)
# plot first all the points with y == 0, then all with y == 1
mask = y == 0
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b',
           s=60, edgecolor='k')
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^',
           s=60, edgecolor='k')
ax.set_xlabel("feature0")
ax.set_ylabel("feature1")
ax.set_zlabel("feature1 ** 2")

linear_svm_3d = LinearSVC().fit(X_new, y)
coef, intercept = linear_svm_3d.coef_.ravel(), linear_svm_3d.intercept_

# show linear decision boundary
figure = plt.figure()
ax = figure.add_subplot(projection='3d')
ax.view_init(elev=-152, azim=-26)
xx = np.linspace(X_new[:, 0].min() - 2, X_new[:, 0].max() + 2, 50)
yy = np.linspace(X_new[:, 1].min() - 2, X_new[:, 1].max() + 2, 50)

XX, YY = np.meshgrid(xx, yy)
# solve coef[0]*x + coef[1]*y + coef[2]*z + intercept = 0 for z
ZZ = (coef[0] * XX + coef[1] * YY + intercept) / -coef[2]
ax.plot_surface(XX, YY, ZZ, rstride=8, cstride=8, alpha=0.3)
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b',
           s=60, edgecolor='k')
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^',
           s=60, edgecolor='k')

ax.set_xlabel("feature0")
ax.set_ylabel("feature1")
ax.set_zlabel("feature1 ** 2")

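Adding a nonlinear feature to the dataset made the linear model much more powerful.

In many cases, however, we do not know which features to add, and adding many features makes the computation expensive.

Fortunately, there is a mathematical trick that lets us train a classifier in a high-dimensional space without actually creating many new features.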

Kernel support vector classification

Nonlinear SVM classification: the kernel trick

The inputs are mapped into a higher-dimensional feature space by a function $\psi$ (for the Gaussian kernel, this feature space is infinite-dimensional).

$\mathbf{x_{i}}$ is replaced with $\psi (\mathbf{x_{i}})$.

For example,

$\psi (x_{1}, x_{2}) = (x_{1}^{2}, \sqrt{2}x_{1}x_{2}, x_{2}^{2})$

$\psi : R^{2} \to R^{3}$

$(x_{1}, x_{2}) \mapsto (z_{1}, z_{2}, z_{3}) := (x_{1}^{2}, \sqrt{2}x_{1}x_{2}, x_{2}^{2})$

The kernel function is defined as the inner product in the transformed space.

$k(\mathbf{x_{i}}, \mathbf{x_{j}}) = \psi (\mathbf{x_{i}})^{T} \psi (\mathbf{x_{j}})$, so $k$ is used in place of $\psi$.

$\mathbf{x_{i}}^{T} \mathbf{x_{j}}$ is replaced with $k(\mathbf{x_{i}}, \mathbf{x_{j}})$.

Not every function can be a kernel (Mercer's theorem).
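For the quadratic map in the example, the kernel works out to $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^{T}\mathbf{z})^{2}$, which can be verified numerically. The function names `psi` and `k` and the two sample points are chosen here for illustration.

```python
import numpy as np

def psi(x):
    """Explicit feature map R^2 -> R^3 from the example."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, z):
    """Kernel computed in the original 2-D space, without calling psi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = psi(x) @ psi(z)  # inner product in the 3-D feature space
implicit = k(x, z)          # same value, computed in 2-D

print(explicit, implicit)   # both equal 1.0 up to floating-point rounding
```

The kernel never builds the 3-D vectors, which is what keeps the cost low when the feature space is very large or infinite-dimensional.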

Examples of kernel functions

Linear Kernel $k(\mathbf{x}, \mathbf{x'}) = \mathbf{x}^{T} \mathbf{x'}$

Polynomial Kernel $k(\mathbf{x}, \mathbf{x'}) = (1 + \mathbf{x}^{T} \mathbf{x'})^{p}$

Tanh Kernel $k(\mathbf{x}, \mathbf{x'}) = tanh(a + b\mathbf{x}^{T} \mathbf{x'})$

RBF Kernel $k(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma \left \| \mathbf{x} - \mathbf{x'} \right \|^{2})$ <- The most widely used; it is the default in scikit-learn.
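The four kernels are short enough to write directly in NumPy. The sketch below uses assumed default values for `p`, `a`, `b` and `gamma`, and cross-checks two of the functions against scikit-learn's `polynomial_kernel` and `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

def linear_k(x, z):
    return x @ z

def poly_k(x, z, p=2):
    return (1.0 + x @ z) ** p

def tanh_k(x, z, a=0.0, b=1.0):
    return np.tanh(a + b * (x @ z))

def rbf_k(x, z, gamma=0.5):
    # squared Euclidean distance inside the exponent
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# scikit-learn expects 2-D inputs, so each vector becomes one row.
sk_poly = polynomial_kernel(x[None, :], z[None, :], degree=2, gamma=1.0, coef0=1.0)[0, 0]
sk_rbf = rbf_kernel(x[None, :], z[None, :], gamma=0.5)[0, 0]

print(linear_k(x, z), poly_k(x, z), tanh_k(x, z), rbf_k(x, z))
print(np.isclose(poly_k(x, z), sk_poly), np.isclose(rbf_k(x, z), sk_rbf))
```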

Soft-margin formulation

The optimal parameters $\mathbf{w}^{*}$ and $b^{*}$ are obtained.

$\mathbf{w}^{*} = \sum_{i=1}^{n}\alpha_{i}y_{i}\psi(\mathbf{x_{i}}) $

$b^{*} = \frac{1}{y_{sv}} - \mathbf{w}^{*T}\psi(\mathbf{x}_{sv}) = \frac{1}{y_{sv}} - \sum_{i=1}^{n} \alpha_{i}y_{i}k(\mathbf{x}_{i}, \mathbf{x}_{sv})$

where $(\mathbf{x}_{sv}, y_{sv}) \in \left \{ (\mathbf{x_{i}}, y_{i}) \mid 0 < \alpha_{i} < C \right \}$

The trained model

$f(\mathbf{x}) = sign(\mathbf{w}^{*T}\psi(\mathbf{x}) + b^{*}) = sign(\sum_{(\mathbf{x_{i}}, y_{i}) \in D} \alpha_{i}y_{i}k(\mathbf{x}_{i}, \mathbf{x}) + b^{*})$

Let $D_{SV} = \left \{ (\mathbf{x_{i}}, y_{i}) \in D \mid \alpha_{i} > 0 \right \}$, $f(\mathbf{x}) = sign(\sum_{(\mathbf{x_{i}}, y_{i}) \in D_{SV} } \alpha_{i}y_{i}k(\mathbf{x_{i}}, \mathbf{x}) + b^{*})$ (sparse solution)
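The kernelized decision function can be rebuilt by hand from a fitted RBF `SVC`, using only its support vectors. The moons dataset and the value `gamma = 0.5` are assumptions picked for this sketch.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)

gamma = 0.5  # set explicitly so the manual kernel matches the model
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# k(x_i, x) for every support vector x_i and every sample x
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)  # shape (n_sv, n_samples)

# sum_i alpha_i y_i k(x_i, x) + b*, with dual_coef_ holding alpha_i * y_i
manual = (clf.dual_coef_ @ K).ravel() + clf.intercept_[0]

print(np.allclose(manual, clf.decision_function(X)))
```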

SVM parameter tuning

fig, axes = plt.subplots(3, 3, figsize=(15, 10))

for ax, C in zip(axes, [-1, 0, 3]):
    for a, gamma in zip(ax, range(-1, 2)):
        mglearn.plots.plot_svm(log_C=C, log_gamma=gamma, ax=a)
        
axes[0, 0].legend(["class 0", "class 1", "sv class 0", "sv class 1"],
                  ncol=4, loc=(.9, 1.2))

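SVM hyperparameter tuning

C, kernel (default='rbf'), gamma (if 'rbf' kernel) :

gamma determines how far the influence of a single training sample reaches.

A small value means a wide radius of influence (low model complexity, underfitting),

while a large value restricts the influence to a narrow range (high model complexity, overfitting).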
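The effect of gamma shows up when training and test accuracy are compared side by side. The sketch below assumes a noisy moons dataset and three arbitrary gamma values; exact numbers will depend on the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X_tr, y_tr)
    # store (training accuracy, test accuracy) per gamma
    results[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))

for gamma, (train_acc, test_acc) in results.items():
    print(f"gamma={gamma:<6} train={train_acc:.3f} test={test_acc:.3f}")
```

A very large gamma typically pushes training accuracy up while test accuracy stalls or drops, which is the overfitting pattern described above.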

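Kernel support vector classification (multiclass)

For multiclass classification:

decision_function_shape = 'ovr' (default) or 'ovo'

One-vs.-Rest (OVR) Approach

The one-vs.-rest approach trains one binary classifier per class, each separating that class from all the others. -> c models

This yields as many binary classifiers as there are classes. At prediction time all of them are run, and the class whose classifier gives the highest score is chosen.

One-vs.-One (OVO) Approach

One binary classifier is trained for each pair of classes. -> c(c-1)/2 models

At prediction time the class is chosen by majority voting.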
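The model counts c and c(c-1)/2 are visible in the shape of `decision_function`. This sketch assumes a synthetic 4-class dataset. Note that, per the scikit-learn documentation, `SVC` always trains one-vs-one internally, and `decision_function_shape='ovr'` only changes the shape of the returned scores.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0)
c = len(np.unique(y))  # number of classes (4 here)

ovr = SVC(decision_function_shape="ovr").fit(X, y)
ovo = SVC(decision_function_shape="ovo").fit(X, y)

ovr_shape = ovr.decision_function(X).shape  # one column per class
ovo_shape = ovo.decision_function(X).shape  # one column per class pair

print(ovr_shape)  # (200, 4)
print(ovo_shape)  # (200, 6)
```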

Breast cancer dataset example - SVC

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

Let's apply an RBF-kernel SVM to the breast cancer dataset.

The data is split into a training set and a test set.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

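SVMs are very sensitive to the scale of the data. In particular, the input features should have similar ranges.

The features were therefore rescaled so that their ranges are comparable.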
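As an alternative to scaling by hand, the scaler and the classifier can be chained with scikit-learn's `make_pipeline`, so the scaler is always fitted on the training data only. This is a sketch of that option, not the approach used in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# fit() scales with training statistics; score() reuses them on the test set
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)

train_acc = pipe.score(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
print("train accuracy :", train_acc)
print("test accuracy :", test_acc)
```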

clf = SVC(C=100)
clf.fit(X_train_scaled, y_train)

The model is built and fitted.

y_train_hat = clf.predict(X_train_scaled)
print('train accuracy :', accuracy_score(y_train, y_train_hat))
y_test_hat = clf.predict(X_test_scaled)
print('test accuracy :', accuracy_score(y_test, y_test_hat))

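Accuracy is high on both the training set and the test set, and the gap between them is small, so there is little sign of overfitting. If anything, the model may still be slightly underfitting, which leaves room to try a more complex setting (larger C or gamma).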

import pandas as pd

C_gamma_data = pd.DataFrame(columns=('C', 'gamma', 'training accuracy',
                                     'test accuracy'))

C_settings = [0.01, 1, 100]
gamma_settings = [0.01, 0.1, 1]
for C in C_settings:
    for gamma in gamma_settings:
        # build the model
        clf = SVC(C=C, kernel='rbf', gamma=gamma)
        clf.fit(X_train_scaled, y_train)

        # accuracy on the training set
        train_acc = accuracy_score(y_train, clf.predict(X_train_scaled))

        # accuracy on the test set
        test_acc = accuracy_score(y_test, clf.predict(X_test_scaled))

        # append one result row per (C, gamma) pair
        C_gamma_data.loc[len(C_gamma_data)] = [C, gamma, train_acc, test_acc]

C_gamma_data

Training and test performance were compared across different values of the hyperparameters C and gamma.
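The same grid can also be searched with cross-validation through `GridSearchCV`, which avoids choosing hyperparameters by looking at the test set. This swaps in a different selection procedure from the manual loop above; the 5-fold setting is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes
param_grid = {"svc__C": [0.01, 1, 100], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                      param_grid, cv=5)
search.fit(X_train, y_train)

test_acc = search.score(X_test, y_test)
print("best parameters :", search.best_params_)
print("cross-validation accuracy :", search.best_score_)
print("test accuracy :", test_acc)
```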

Support Vector Regression

There are many possible functions that fit the data.

A similar idea applies to regression. -> Support Vector Regression

$\min_{\mathbf{w}, b, \xi_{i}, \xi_{i}^{*}} \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C(\sum_{i} \xi_{i} + \sum_{i} \xi_{i}^{*})$

subject to $y_{i} - (\mathbf{w}^{T} \psi (\mathbf{x_{i}}) + b) \leq \epsilon + \xi_{i}$,

$(\mathbf{w}^{T} \psi (\mathbf{x_{i}}) + b) - y_{i} \leq \epsilon + \xi_{i}^{*}$,

$\xi_{i}, \xi_{i}^{*} \geq 0$, $i = 1, ..., N$

The trained model

$f(\mathbf{x}) = \mathbf{w}^{T} \psi (\mathbf{x}) + b = \sum_{i=1}^{N} (\alpha_{i} - \alpha_{i}^{*})k(\mathbf{x_{i}}, \mathbf{x}) + b$

Let $D_{SV} = \left \{ (\mathbf{x_{i}}, y_{i}) \in D \mid \alpha_{i} > 0 \text{ or } \alpha_{i}^{*} > 0 \right \}$, $f(\mathbf{x}) = \sum_{(\mathbf{x_{i}}, y_{i}) \in D_{SV}} (\alpha_{i} - \alpha_{i}^{*})k(\mathbf{x_{i}}, \mathbf{x}) + b$ (sparse solution)
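The two SVR constraints say that errors inside a tube of half-width $\epsilon$ cost nothing, and only the excess is penalized. A minimal sketch of the resulting slack values, with made-up targets and predictions:

```python
import numpy as np

def svr_slacks(y, f, epsilon):
    """Smallest feasible (xi, xi_star) for targets y and predictions f."""
    xi = np.maximum(0.0, (y - f) - epsilon)       # prediction too low
    xi_star = np.maximum(0.0, (f - y) - epsilon)  # prediction too high
    return xi, xi_star

y = np.array([1.0, 2.0, 3.0])
f = np.array([1.05, 2.5, 2.0])  # hypothetical model outputs

xi, xi_star = svr_slacks(y, f, epsilon=0.1)
loss = xi + xi_star             # epsilon-insensitive loss per point

print(xi)       # [0.  0.  0.9]
print(xi_star)  # [0.  0.4 0. ]
print(loss)     # [0.  0.4 0.9]
```

The first point lies inside the tube, so it contributes no loss and would not become a support vector.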

Extended Boston dataset example - SVR

import mglearn
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

The regression example uses the extended Boston dataset.

The dataset is loaded and split into a training set and a test set.

from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler()
scalerX.fit(X_train)
X_train_scaled = scalerX.transform(X_train)
X_test_scaled = scalerX.transform(X_test)
scalerY = StandardScaler()
scalerY.fit(y_train)
y_train_scaled = scalerY.transform(y_train)
y_test_scaled = scalerY.transform(y_test)

Both the inputs and the target are rescaled.
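Scaling the target by hand means remembering to invert it after every prediction. One option that handles this automatically is scikit-learn's `TransformedTargetRegressor`. The sketch below uses synthetic data from `make_regression` in place of the Boston data, purely as an assumption to keep the example self-contained.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),  # scales X
    transformer=StandardScaler(),                      # scales y, inverted on predict
)
model.fit(X_train, y_train)

# predictions come back in the original units of y
y_test_hat = model.predict(X_test)
print("R_square :", r2_score(y_test, y_test_hat))
```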

reg = SVR()
# SVR expects a 1-D target, so flatten the scaled column vector
reg.fit(X_train_scaled, y_train_scaled.ravel())

The model is built and fitted.

# predict() returns a 1-D array; reshape it to a column before undoing the target scaling
y_train_hat_scaled = reg.predict(X_train_scaled)
y_train_hat = scalerY.inverse_transform(y_train_hat_scaled.reshape(-1, 1))
print('train MAE :', mean_absolute_error(y_train, y_train_hat))
print('train RMSE :', mean_squared_error(y_train, y_train_hat)**0.5)
print('train R_square :', r2_score(y_train, y_train_hat))
y_test_hat_scaled = reg.predict(X_test_scaled)
y_test_hat = scalerY.inverse_transform(y_test_hat_scaled.reshape(-1, 1))
print('test MAE :', mean_absolute_error(y_test, y_test_hat))
print('test RMSE :', mean_squared_error(y_test, y_test_hat)**0.5)
print('test R_square :', r2_score(y_test, y_test_hat))

The performance on the training set and the test set is shown above.

The R-squared score is high on the training set (0.91) but low on the test set (0.63).

The model is overfitting.

import pandas as pd

C_ep_ga_data = pd.DataFrame(columns=('C', 'epsilon', 'gamma', 'training r2',
                                     'test r2'))

C_settings = [1, 100]
epsilon_settings = [0.001, 0.01, 0.1]
gamma_settings = [0.01, 0.1]
for C in C_settings:
    for epsilon in epsilon_settings:
        for gamma in gamma_settings:
            # build the model (SVR expects a 1-D target)
            reg = SVR(C=C, kernel='rbf', epsilon=epsilon, gamma=gamma)
            reg.fit(X_train_scaled, y_train_scaled.ravel())

            # r2 on the training set, in the original target units
            y_train_hat = scalerY.inverse_transform(
                reg.predict(X_train_scaled).reshape(-1, 1))
            train_r2 = r2_score(y_train, y_train_hat)

            # r2 on the test set, in the original target units
            y_test_hat = scalerY.inverse_transform(
                reg.predict(X_test_scaled).reshape(-1, 1))
            test_r2 = r2_score(y_test, y_test_hat)

            # append one result row per (C, epsilon, gamma) combination
            C_ep_ga_data.loc[len(C_ep_ga_data)] = [C, epsilon, gamma, train_r2, test_r2]

C_ep_ga_data

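Training and test performance were compared across different values of the hyperparameters C, epsilon and gamma.

Conclusion

Main hyperparameters of support vector machines

C, kernel, kernel-specific hyperparameters (for both SVC and SVR)

epsilon (for SVR)

In general, the model that performs best on the validation data is selected.

Data preprocessing matters (feature scaling, one-hot encoding).

Strengths

They work well on a wide variety of datasets.

They can build complex decision boundaries even when the data has only a few features.

Weaknesses

They do not scale well with the number of samples. With datasets of 100,000 samples or more, they become challenging in terms of runtime and memory.

They require careful data preprocessing and hyperparameter tuning.

SVM models are also hard to inspect. It is difficult to understand why a particular prediction was made, and difficult to explain the model to a non-expert.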

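This summary is based on the book 파이썬 라이브러리를 활용한 머신러닝 (Introduction to Machine Learning with Python) and on the lectures of Professor Seokho Kang at Sungkyunkwan University.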
