Logistic Regression
对率几率回归(Logistic Regression)
引言
Logistic Regression,称为对率几率回归。听名字是一个回归模型,但是实际上该模型用于解决二分类问题。线性回归的目的在于找到一条直线或超平面,用于拟合所有的数据点,但是对率几率回归的目的是在于找到一条直线或者超平面尽可能的将所有点分为两类,这个过程中都包含了找到一条直线这个过程(即对率几率回归的线性部分是做了一个回归的)。
原理推导
根据引言,我们需要在线性模型
z
=
w
T
x
+
b
z=w^Tx+b
z=wTx+b上进行二分类,即对输入数据进行判别输出,输出应当非0即1,也就是
y
∈
0
,
1
y\in {0,1}
y∈0,1,那么就可以在线性模型的输出上进行映射处理,套上一个函数
y
=
g
(
z
)
y=g(z)
y=g(z),将输出映射到{0,1},最简单的应该就是“单位阶跃函数”,表示为:
y
=
{
0
,
z
<
0
;
0.5
,
z
=
0
;
1
,
z
>
0.
y= \begin{cases} 0,\quad z<0;\\ 0.5, \quad z=0;\\ 1, \quad z>0. \end{cases}
y=⎩
⎨
⎧0,z<0;0.5,z=0;1,z>0.
此时其实就是将
z
=
w
T
x
+
b
z=w^Tx+b
z=wTx+b看作一个分割线,在这个先上方的判别为0,线下方的判别为1。由于问题是一个优化问题(找到一个最好的模型来尽可能多的区分出两个类别),那么对于这个问题target function(目标函数)就最好要连续可导,阶跃函数是个分段函数,函数性质较差,所以可以找一个形似的连续函数来进行代替,Sigmoid函数中的对率几率函数就是常用的替代函数,之所以叫Sigmoid是因为他形似一个S形。
y
=
1
1
+
e
−
z
y=\frac{1}{1+e^{-z}}
y=1+e−z1
那么此时,模型的数学公示就变为
y
=
1
1
+
e
−
(
w
T
x
+
b
)
y=\frac{1}{1+e^{-(w^Tx+b)}}
y=1+e−(wTx+b)1
对率几率函数有一个有趣的性质:它的导数可以用自身来进行表示,即
y
′
(
z
)
=
y
(
z
)
(
1
−
y
(
z
)
)
y\prime(z) = y(z)(1-y(z))
y′(z)=y(z)(1−y(z))
如果假设
y
y
y为正例,
1
−
y
1-y
1−y为负例,几率就是二者之比
y
1
−
y
\frac{y}{1-y}
1−yy。几率反应了样本为正例的相对可能性。
对率几率就是对几率取对数
ln
y
1
−
y
\ln\frac{y}{1-y}
ln1−yy,对率几率实际上就是前面提到的Sigmoid函数,推导如下:
假设存在一个线性关系:
ln
y
1
−
y
=
w
T
x
+
b
\ln\frac{y}{1-y} = w^Tx+b
ln1−yy=wTx+b 进行等式变换后就有
y
=
1
1
+
e
−
(
w
T
x
+
b
)
y=\frac{1}{1+e^{-(w^Tx+b)}}
y=1+e−(wTx+b)1
我们将
y
=
1
1
+
e
−
(
w
T
x
+
b
)
y=\frac{1}{1+e^{-(w^Tx+b)}}
y=1+e−(wTx+b)1 带入到对率几率中,得到
ln
y
1
−
y
=
w
T
x
+
b
\ln\frac{y}{1-y} = w^Tx+b
ln1−yy=wTx+b
可以看出,sigmoid实际上就是用线性回归模型的预测结果取逼近真实值的对数几率,因此逻辑回归也被称为“对率几率回归”。
Loss函数推导
一般回归问题使用的都是平均误差平方损失函数(MSE):
L
=
1
n
∑
i
=
1
n
(
y
^
−
y
)
2
L=\frac{1}{n}\sum_{i=1}^n(\hat{y}-y)^2
L=n1i=1∑n(y^−y)2
对率几率回归和线性回归的最大区别就是:对率几率回归用于解决分类问题,得到的是{0,1},但是通过公式最后求解出的是一个概率,通过这个概率来决定是0还是1.因此损失函数可以分为两种:
- 如果给定样本的真实类别y=1,则估计出来的概率p越小,损失函数越大(估计错误)
- 如果给定样本的真实类别y=0,则估计出来的概率p越大,损失函数越大(估计错误)
以上两种情况可以使用如下函数进行表示
L = { − l o g ( y ^ ) i f y = 1 − l o g ( 1 − y ^ ) i f y = 0 L = \begin{cases} -log(\hat y) \quad if\enspace y=1\\ -log(1-\hat y) \quad if\enspace y=0 \end{cases} L={−log(y^)ify=1−log(1−y^)ify=0
由于这是一个二分类问题,分类结果非0即1,那么可以巧妙的将上述两个式子合并为一个
L = − [ y l o g ( y ^ ) + ( 1 − y ) l o g ( 1 − y ^ ) ] L = - [ylog(\hat y) + (1-y)log(1-\hat y)] L=−[ylog(y^)+(1−y)log(1−y^)]
这个损失函数通常称为对数损失,取负对数的原因是,我们希望损失函数能够尽量的小,从而让模型更加优化,不取负对数那么要做的优化就是让对数尽量大,这与我们优化的目标相反,所以需要取负对数。
另一种推导方式
我们假设数据的两个分类都服从伯努利分布,那么则有
P
(
样本是正例
)
=
Φ
P(样本是正例)=\Phi
P(样本是正例)=Φ,
P
(
样本是负例
)
=
1
−
Φ
P(样本是负例)=1-\Phi
P(样本是负例)=1−Φ。那么我们可以有分类公式:
P
(
Y
=
1
∣
x
)
=
p
=
1
1
+
e
−
W
x
P
(
Y
=
0
∣
x
)
=
1
−
p
=
e
−
W
x
1
+
e
−
W
x
P(Y=1|x)=p=\frac{1}{1+e^{-Wx}} \\ P(Y=0|x)=1-p=\frac{e^{-Wx}}{1+e^{-Wx}}
P(Y=1∣x)=p=1+e−Wx1P(Y=0∣x)=1−p=1+e−Wxe−Wx
两个式子相比,取对数得到
ln
p
1
−
p
=
W
x
\ln\frac{p}{1-p} = Wx
ln1−pp=Wx
对于给定的样本,可以采用极大似然来估计最佳的模型参数。假设有m个样本,输入特征有n个,设
P
(
Y
=
1
∣
x
)
=
Φ
(
x
)
P(Y=1|x)= \Phi(x)
P(Y=1∣x)=Φ(x),
P
(
Y
=
0
∣
x
)
=
1
−
Φ
(
x
)
P(Y=0|x)=1-\Phi(x)
P(Y=0∣x)=1−Φ(x)得到似然估计函数:
∏
i
=
1
m
[
Φ
(
x
)
]
y
i
[
1
−
Φ
(
x
i
)
]
1
−
y
i
\prod_{i = 1}^{m}[\Phi(x)]^{y_i}[1-\Phi(x_i)]^{1-y_i}
i=1∏m[Φ(x)]yi[1−Φ(xi)]1−yi
将其取对数得到:
L
(
W
)
=
∑
i
=
1
m
y
i
ln
Φ
(
x
i
)
+
(
1
−
y
i
)
ln
(
1
−
Φ
(
x
i
)
)
L(W)= \sum_{i = 1}^{m}y_i\ln\Phi(x_i)+(1-y_i)\ln(1-\Phi(x_i))
L(W)=i=1∑myilnΦ(xi)+(1−yi)ln(1−Φ(xi))
最大化
L
(
W
)
L(W)
L(W),即最小化
J
(
W
)
=
−
L
(
W
)
J(W)=-L(W)
J(W)=−L(W)。
通常使用梯度下降法和拟牛顿法来求最佳参数。
梯度下降优化求解
损失函数
L
(
w
,
b
=
−
[
y
l
o
g
y
^
+
(
1
−
y
)
l
o
g
(
1
−
y
^
)
]
L(w,b = -[ylog\hat y + (1-y)log(1-\hat y)]
L(w,b=−[ylogy^+(1−y)log(1−y^)]
其中
y
^
=
1
1
+
e
−
z
\hat y = \frac{1}{1+e^{-z}}
y^=1+e−z1,
z
=
w
T
x
+
b
z=w^Tx+b
z=wTx+b
现在的学习任务转化为数学优化形式就是:
(
w
∗
,
b
∗
)
=
a
r
g
m
i
n
w
,
b
L
(
w
,
b
)
(w^*,b^*)=argmin_{w,b}L(w,b)
(w∗,b∗)=argminw,bL(w,b)
由于损失函数连续可微,可以使用梯度下降进行优化求解。
w
←
w
−
α
∂
L
∂
w
w \leftarrow w - \alpha \frac{\partial L}{\partial w}
w←w−α∂w∂L
b
←
b
−
α
∂
L
∂
w
b \leftarrow b - \alpha \frac{\partial L}{\partial w}
b←b−α∂w∂L
根据链式法则有
∂
L
∂
y
^
=
y
^
−
y
y
^
(
1
−
y
^
)
\frac{\partial L}{\partial \hat y} = \frac {\hat y - y}{\hat y (1-\hat y)}
∂y^∂L=y^(1−y^)y^−y
∂
y
^
=
(
1
1
+
e
−
z
)
′
=
∂
e
−
z
(
1
+
e
−
z
)
2
=
y
^
(
1
−
y
^
)
\partial \hat y = (\frac {1}{1+e^{-z}})\prime\\ = \partial {e^{-z}}{(1+e^{-z})^2} \\ = \hat y (1-\hat y)
∂y^=(1+e−z1)′=∂e−z(1+e−z)2=y^(1−y^)
此时有
∂
L
∂
z
=
y
^
−
y
y
^
(
1
−
y
^
)
⋅
y
^
(
1
−
y
^
)
=
y
^
−
y
\frac{\partial L }{\partial z} = \frac {\hat y -y }{\hat y (1-\hat y)} \cdot \hat y (1-\hat y ) = \hat y -y
∂z∂L=y^(1−y^)y^−y⋅y^(1−y^)=y^−y
那么就可以得到
∂
L
∂
w
=
∂
L
∂
z
∂
z
∂
w
=
(
y
^
−
y
)
x
\frac {\partial L}{\partial w} = \frac{\partial L}{\partial z}\frac {\partial z}{\partial w} = (\hat y -y)x
∂w∂L=∂z∂L∂w∂z=(y^−y)x
∂
L
∂
b
=
∂
L
∂
z
∂
z
∂
b
=
(
y
^
−
y
)
\frac {\partial L}{\partial b} = \frac{\partial L}{\partial z}\frac {\partial z}{\partial b} = (\hat y -y)
∂b∂L=∂z∂L∂b∂z=(y^−y)
那么就可以对参数w和b进行更新
w
←
w
−
α
(
y
^
−
y
)
x
b
←
b
−
α
(
y
^
−
y
)
w \leftarrow w - \alpha(\hat y - y )x\\ b \leftarrow b - \alpha(\hat y - y)
w←w−α(y^−y)xb←b−α(y^−y)
代码实现(梯度下降)
纯python
# -*- coding: utf-8 -*-
import numpy as np
class LogisticRegression(object):
def __init__(self, learning_rate=0.1, max_iter=100, seed=None):
self.seed = seed
self.lr = learning_rate
self.max_iter = max_iter
def fit(self, x, y):
np.random.seed(self.seed)
self.w = np.random.normal(loc=0.0, scale=1.0, size=x.shape[1])
self.b = np.random.normal(loc=0.0, scale=1.0)
self.x = x
self.y = y
for i in range(self.max_iter):
self._update_step()
# print('loss: \t{}'.format(self.loss()))
# print('score: \t{}'.format(self.score()))
# print('w: \t{}'.format(self.w))
# print('b: \t{}'.format(self.b))
def _sigmoid(self, z):
return 1.0 / (1.0 + np.exp(-z))
def _f(self, x, w, b):
z = x.dot(w) + b
return self._sigmoid(z)
def predict_proba(self, x=None):
if x is None:
x = self.x
y_pred = self._f(x, self.w, self.b)
return y_pred
def predict(self, x=None):
if x is None:
x = self.x
y_pred_proba = self._f(x, self.w, self.b)
y_pred = np.array([0 if y_pred_proba[i] < 0.5 else 1 for i in range(len(y_pred_proba))])
return y_pred
def score(self, y_true=None, y_pred=None):
if y_true is None or y_pred is None:
y_true = self.y
y_pred = self.predict()
acc = np.mean([1 if y_true[i] == y_pred[i] else 0 for i in range(len(y_true))])
return acc
def loss(self, y_true=None, y_pred_proba=None):
if y_true is None or y_pred_proba is None:
y_true = self.y
y_pred_proba = self.predict_proba()
return np.mean(-1.0 * (y_true * np.log(y_pred_proba) + (1.0 - y_true) * np.log(1.0 - y_pred_proba)))
def _calc_gradient(self):
y_pred = self.predict()
d_w = (y_pred - self.y).dot(self.x) / len(self.y)
d_b = np.mean(y_pred - self.y)
return d_w, d_b
def _update_step(self):
d_w, d_b = self._calc_gradient()
self.w = self.w - self.lr * d_w
self.b = self.b - self.lr * d_b
return self.w, self.b
# -*- coding: utf-8 -*-
import numpy as np
def generate_data(seed):
np.random.seed(seed)
data_size_1 = 300
x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
y_1 = [0 for _ in range(data_size_1)]
data_size_2 = 400
x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
y_2 = [1 for _ in range(data_size_2)]
x1 = np.concatenate((x1_1, x1_2), axis=0)
x2 = np.concatenate((x2_1, x2_2), axis=0)
x = np.hstack((x1.reshape(-1,1), x2.reshape(-1,1)))
y = np.concatenate((y_1, y_2), axis=0)
data_size_all = data_size_1+data_size_2
shuffled_index = np.random.permutation(data_size_all)
x = x[shuffled_index]
y = y[shuffled_index]
return x, y
def train_test_split(x, y):
split_index = int(len(y)*0.7)
x_train = x[:split_index]
y_train = y[:split_index]
x_test = x[split_index:]
y_test = y[split_index:]
return x_train, y_train, x_test, y_test
# -*- coding: utf-8 -*-
import numpy as np
def generate_data(seed):
np.random.seed(seed)
data_size_1 = 300
x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
y_1 = [0 for _ in range(data_size_1)]
data_size_2 = 400
x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
y_2 = [1 for _ in range(data_size_2)]
x1 = np.concatenate((x1_1, x1_2), axis=0)
x2 = np.concatenate((x2_1, x2_2), axis=0)
x = np.hstack((x1.reshape(-1,1), x2.reshape(-1,1)))
y = np.concatenate((y_1, y_2), axis=0)
data_size_all = data_size_1+data_size_2
shuffled_index = np.random.permutation(data_size_all)
x = x[shuffled_index]
y = y[shuffled_index]
return x, y
def train_test_split(x, y):
split_index = int(len(y)*0.7)
x_train = x[:split_index]
y_train = y[:split_index]
x_test = x[split_index:]
y_test = y[split_index:]
return x_train, y_train, x_test, y_test
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
# data generation
x, y = generate_data(seed=272)
x_train, y_train, x_test, y_test = train_test_split(x, y)
# visualize data
# plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
# plt.show()
# plt.scatter(x_test[:,0], x_test[:,1], c=y_test, marker='.')
# plt.show()
# data normalization
x_train = (x_train - np.min(x_train, axis=0)) / (np.max(x_train, axis=0) - np.min(x_train, axis=0))
x_test = (x_test - np.min(x_test, axis=0)) / (np.max(x_test, axis=0) - np.min(x_test, axis=0))
# Logistic regression classifier
clf = LogisticRegression(learning_rate=0.1, max_iter=500, seed=272)
clf.fit(x_train, y_train)
# plot the result
split_boundary_func = lambda x: (-clf.b - clf.w[0] * x) / clf.w[1]
xx = np.arange(0.1, 0.6, 0.1)
plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
plt.plot(xx, split_boundary_func(xx), c='red')
plt.show()
# loss on test set
y_test_pred = clf.predict(x_test)
y_test_pred_proba = clf.predict_proba(x_test)
print(clf.score(y_test, y_test_pred))
print(clf.loss(y_test, y_test_pred_proba))
# print(y_test_pred_proba)
sklearn库
import numpy as np
import matplotlib.pyplot as plt
# 设置随机种子,以便结果可复现
np.random.seed(42)
# 生成随机数据
# 两个特征的均值和方差
mean_1 = [2, 2]
cov_1 = [[2, 0], [0, 2]]
mean_2 = [-2, -2]
cov_2 = [[1, 0], [0, 1]]
# 生成类别1的样本
X1 = np.random.multivariate_normal(mean_1, cov_1, 50)
y1 = np.zeros(50)
# 生成类别2的样本
X2 = np.random.multivariate_normal(mean_2, cov_2, 50)
y2 = np.ones(50)
# 合并样本和标签
X = np.concatenate((X1, X2), axis=0)
y = np.concatenate((y1, y2))
# 绘制散点图
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Logistic Regression Dataset')
plt.show()
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#所使用数据集同上X,y
# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 创建Logistic回归模型
logreg = LogisticRegression()
# 训练模型
logreg.fit(X_train, y_train)
# 预测测试集
y_pred = logreg.predict(X_test)
# 计算预测准确率
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
牛顿法
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
M = 3 #属性个数+1 属性加 偏移项b, 一个3个参数
N = 50#二分类。每类样本N个
#随机生成两个属性的N个第一类样本
feature11 = np.random.randint(0, 10, size = N)
feature12 = np.random.randint(0, 5, size= N)
splt = np.ones((1, N))
temp_X11 = np.row_stack((feature11, feature12))
temp_X1 = np.vstack((temp_X11, splt))
X_t1 = np.mat(temp_X1)
X1 = X_t1.T
Y1 = np.mat(np.zeros((N, 1)))
#随机生成两个属性的N个第二类样本
feature21 = np.random.randint(0,10, size= N)
feature22 = np.random.randint(6, 10, size= N)
splt = np.ones((1, N))
temp_X21 = np.row_stack((feature21, feature22))
temp_X2 = np.vstack((temp_X21, splt))
X_t2 = np.mat(temp_X2)
X2 = X_t2.T
Y2 = np.mat(np.ones((N, 1)))
#画样本散点图
fig = plt.figure(1)
plt.scatter(feature11, feature12, marker='o', color = 'b')
plt.scatter(feature21, feature22, marker='*', color = 'y')
plt.xlabel('feature1')
plt.ylabel('feature2')
plt.title('samples')
#牛顿迭代法,求Omega
X = np.vstack((X1, X2))
Y = np.vstack((Y1, Y2))
Omega = np.mat(np.zeros((M, 1)))
Epsilon = 0.001 #输出精度
Delta = 1
counts =0
while Delta > Epsilon :
counts += 1
df = np.mat(np.zeros((M, 1)))
d2f = np.mat(np.zeros((3)))
for i in range(2*N) :
f = X[i, :]*Omega
p1 = np.math.exp(f) / (1 + np.math.exp(f))
df -= X[i, :].T*(Y[i, 0] - p1)
d2f = d2f + X[i, :].T*X[i, :]*p1*(1 - p1)
Omega = Omega - np.linalg.pinv(d2f)*df
Delta = np.linalg.norm(df)
#print(Omega, end='\n')
#print("迭代次数{}, Delta = {}".format(counts, Delta), end='\n')
#分类函数
def Classficate(sample):
f = Omega.T*sample
y = 1/(1+np.math.exp(-f))
return y
#画分类面
K = 50
xx = np.linspace(0,10, num= K)
yy = np.linspace(0,10, num= K)
xx_1, yy_1 = np.meshgrid(xx, yy)
Omega_h = np.array(Omega.T)
r = np.exp(-(Omega_h[0, 0]*xx_1 + Omega_h[0, 1]*yy_1 + Omega_h[0, 2]))
zz_1 = 1/(1 + r)
fig = plt.figure(2)
ax1 = Axes3D(fig,auto_add_to_figure=False)
fig.add_axes(ax1)
ax1.plot_surface(xx_1, yy_1, zz_1, alpha= 0.6, color= 'r')
ax1.set_xlabel('feature1')
ax1.set_ylabel('feature2')
ax1.set_zlabel('class')
ax1.set_title('LogisiticRegression model')
plt.show()
原文地址:https://blog.csdn.net/m0_52049033/article/details/142845787
免责声明:本站文章内容转载自网络资源,如本站内容侵犯了原著者的合法权益,可联系本站删除。更多内容请关注自学内容网(zxcms.com)!