Machine Learning (1): Understanding Logistic Regression, Interview Questions, and Code Implementation

Latest recommended article published on 2025-06-30 10:16:26

新名字的故事

License: CC 4.0 BY-SA

Category: Machine Learning. Tags: machine learning, algorithms, logistic regression, interviews, python

Article link: https://blog.csdn.net/sabrinalx/article/details/105875879

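This article walks through the principles of logistic regression, covering model construction, the choice of strategy, and the parameter optimization algorithm. It then shows logistic regression applied to a real dataset through a code implementation and discusses the model's strengths and limitations.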

These are notes on my own learning process, covering the following topics.

Key points and understanding

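Following the three aspects introduced in 《统计学习方法》 (Statistical Learning Methods), logistic regression can be understood in terms of (1) the model, (2) the strategy, and (3) the algorithm.
Model: how the problem is modeled. Logistic regression is mainly used for binary classification, and it assumes that the probability of a sample being positive is:
$P(Y=1|x)=\frac{e^{w^Tx}}{1+e^{w^Tx}}$
The probability of a sample being negative is: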
$P(Y=0|x)=1-P(Y=1|x)=\frac{1}{1+e^{w^Tx}}$
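Here x is an (n+1)-dimensional column vector, where n is the number of input features: $x=(x^{(1)},x^{(2)},...,x^{(n)},1)^T$, $w=(w^{(1)},w^{(2)},...,w^{(n)},b)^T$, and b is the bias. As the assumed probabilities above show, logistic regression assumes that the label Y given x follows a Bernoulli distribution.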
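The two probabilities above can be evaluated directly. The following is a minimal sketch, assuming NumPy; the function name `positive_probability` and the numeric values of `w` and `x` are made up for illustration and do not come from the original post.

```python
import numpy as np

def positive_probability(w, x):
    """P(Y=1|x) = e^{w^T x} / (1 + e^{w^T x}) for an augmented x that ends in 1."""
    z = np.dot(w, x)
    return np.exp(z) / (1.0 + np.exp(z))

# Hypothetical example with n = 2 features, so w and x both have n + 1 = 3 entries.
w = np.array([0.5, -0.25, 0.1])  # (w1, w2, b), made-up values
x = np.array([2.0, 4.0, 1.0])    # (x1, x2, 1), made-up values

p1 = positive_probability(w, x)  # P(Y=1|x)
p0 = 1.0 - p1                    # P(Y=0|x)
print(p1, p0)
```

Appending the constant 1 to x is what lets the bias b be treated as one more entry of w.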
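Strategy: how to judge how good a model is. Logistic regression estimates the model parameters by maximum likelihood. First construct the likelihood function: let $P(Y=1|x)=\pi(x),P(Y=0|x)=1-\pi(x)$; the likelihood function is then: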
$L(w)=\displaystyle\prod_{i=1}^N[\pi(x_i)]^{y_i}[1-\pi(x_i)]^{1-y_i}$
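Maximum likelihood estimation maximizes the likelihood function. Since we usually minimize a loss function, take the negative logarithm of the likelihood to obtain the loss function, still written $L(w)$:
$\begin{aligned} L(w)&=-\sum_{i=1}^N\left[y_i\log\pi(x_i)+(1-y_i)\log(1-\pi(x_i))\right]\\ &=-\sum_{i=1}^N\left[y_i\log\frac{\pi(x_i)}{1-\pi(x_i)}+\log(1-\pi(x_i))\right]\\ &=-\sum_{i=1}^N\left(y_iw^Tx_i-\log(1+e^{w^Tx_i})\right) \end{aligned}$
The whole learning process is the process of minimizing this loss function.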
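As a numerical check on the derivation, the first and last forms of the loss can be evaluated on the same data and compared. This is a sketch assuming NumPy; the function names and the random data are hypothetical and serve only to illustrate that the two expressions agree.

```python
import numpy as np

def nll_from_probabilities(w, X, y):
    """First form: -sum[y*log(pi) + (1-y)*log(1-pi)]."""
    pi = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def nll_simplified(w, X, y):
    """Last form: -sum[y*w^T x - log(1 + e^{w^T x})]."""
    z = X @ w
    return -np.sum(y * z - np.log1p(np.exp(z)))

# Made-up data: 20 samples, 2 features, plus the constant-1 column for the bias.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 2)), np.ones((20, 1))])
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

loss_a = nll_from_probabilities(w, X, y)
loss_b = nll_simplified(w, X, y)
print(loss_a, loss_b)  # the two values should agree up to floating-point error
```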
Algorithm: how to optimize the parameters so that the loss function is minimized. Logistic regression uses gradient descent.

Questions you may encounter in interviews

1. Why does logistic regression use the log loss rather than the squared loss (or the absolute loss)?
Differentiating the loss function above gives:
$\frac{\partial{L(w)}}{\partial{w^{(j)}}}=-\sum_{i=1}^Nx_i^{(j)}(y_i-\frac{e^{w^Tx_i}}{1+e^{w^Tx_i}})=-\sum_{i=1}^Nx_i^{(j)}(y_i-\pi(x_i))$
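The gradient of the loss is linear in $y_i-\pi(x_i)$: the further the prediction is from the true value, the larger the gradient, and the closer they are, the smaller the gradient, which helps the model converge. Now consider the squared loss. Differentiating it shows that its gradient is proportional to the gradient of $\pi(x)$. $\pi(x)$ is just the sigmoid function, whose gradient $\pi(x)(1-\pi(x))$ is close to zero everywhere except near the center. Imagine that the model's prediction is large, that is, on the right-hand side of the sigmoid, so this gradient is close to zero, while the true label of the sample is actually 0. Updating the parameters in that situation is very difficult.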
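The saturation argument can be seen numerically for a single sample. The sketch below assumes NumPy, a scalar feature x = 1, and a squared loss defined as $\frac{1}{2}(y-\pi)^2$; that definition and the function names are assumptions for illustration, not taken from the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_grad(z, y, x=1.0):
    """Gradient of the log loss w.r.t. w for one sample: -(y - pi) * x."""
    return -(y - sigmoid(z)) * x

def squared_loss_grad(z, y, x=1.0):
    """Gradient of 0.5*(y - pi)^2 w.r.t. w: -(y - pi) * pi * (1 - pi) * x."""
    p = sigmoid(z)
    return -(y - p) * p * (1.0 - p) * x

# A confidently wrong prediction: the score z is large, but the true label is 0.
z, y = 6.0, 0.0
g_log = log_loss_grad(z, y)
g_sq = squared_loss_grad(z, y)
print('log loss gradient:', g_log)
print('squared loss gradient:', g_sq)
```

With z = 6 the log loss gradient is roughly 1, while the squared loss gradient is roughly 0.0025, so the squared loss barely moves the weights even though the prediction is badly wrong.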

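2. What are the advantages and disadvantages of logistic regression?
Advantages:
(1) The model is simple and interpretable: the feature weights show how strongly each feature influences the final result.
(2) It uses few resources and converges quickly; the number of model parameters depends only on the feature dimension.
(3) The output is easy to adjust: the model outputs class probabilities directly, so the classification threshold can be tuned to suit the data.
Disadvantages:
(1) The model is too simple to fit the true distribution of many datasets, so classification performance is often not ideal.
(2) Logistic regression can only solve linearly separable problems, because the model uses a linear function to model the mapping from input to output.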
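Advantage (3) above, adjusting the threshold on the predicted probabilities, can be sketched as follows. The helper `classify` and the probability values are hypothetical; in practice the threshold would be chosen from validation data.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Turn predicted probabilities P(Y=1|x) into 0/1 labels."""
    return (np.asarray(probabilities) >= threshold).astype(int)

probs = np.array([0.15, 0.35, 0.55, 0.80])  # made-up model outputs

default_labels = classify(probs)                # threshold 0.5
lenient_labels = classify(probs, threshold=0.3) # flags more samples as positive
print(default_labels, lenient_labels)
```

Lowering the threshold trades more false positives for fewer missed positives, which can be useful when missing a positive sample is the costlier mistake.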

Code implementation

A class represents logistic regression; the fit method solves for the parameters using gradient descent.

import numpy as np
import matplotlib.pyplot as plt

class LogisticRegression(object):
	"""Binary logistic regression trained with batch gradient descent."""
	def __init__(self, feature_num, lr=1, max_iter=500):
		super(LogisticRegression, self).__init__()
		self.weight = np.zeros(feature_num+1)  # n feature weights plus the bias
		self.max_iter = max_iter
		self.lr = lr

	def fit(self, x, y):
		"""
		x: np.array (n, m), n is total sample number, m is feature_dim
		y: np.array (n)
		"""
		n, m = x.shape
		x = np.concatenate((x, np.ones((n, 1))), axis=-1)

		for i in range(self.max_iter):
			predict = self.sigmoid(np.dot(x, self.weight))
			error = y - predict  # difference between the true and predicted values, as in the gradient formula
			self.weight = self.weight + self.lr * np.dot(x.T, error)

		return self.weight

	def sigmoid(self, x):
		return 1.0 / (1 + np.exp(-x))


	def predict(self, x):
		n, m = x.shape
		x = np.concatenate((x, np.ones((n, 1))), axis=-1)
		return self.sigmoid(np.dot(x, self.weight))
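To try the same update rule without the data files, the sketch below trains on synthetic, well-separated data. `train_logistic` is a hypothetical compact stand-in for the class above, not the class itself, and it differs in one labeled way: it averages the gradient over the samples and uses a smaller learning rate (both are assumptions made here so the step size does not depend on the dataset size).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(x, y, lr=0.1, max_iter=500):
    """Gradient descent on the mean negative log-likelihood (hypothetical helper)."""
    n = x.shape[0]
    x_aug = np.concatenate((x, np.ones((n, 1))), axis=-1)  # append the constant 1
    weight = np.zeros(x_aug.shape[1])
    for _ in range(max_iter):
        error = y - sigmoid(x_aug @ weight)           # y_i - pi(x_i)
        weight = weight + lr * (x_aug.T @ error) / n  # averaged update (assumption)
    return weight

# Synthetic, well-separated data, made up for illustration.
rng = np.random.default_rng(0)
x_neg = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
x_pos = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
x = np.vstack([x_neg, x_pos])
y = np.concatenate([np.zeros(50), np.ones(50)])

weight = train_logistic(x, y)
x_aug = np.concatenate((x, np.ones((100, 1))), axis=-1)
predicted = (sigmoid(x_aug @ weight) >= 0.5).astype(float)
train_error = np.mean(predicted != y)
print('training error on synthetic data: {:.2%}'.format(train_error))
```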

Read the data, train the model, and test how well the model performs.

def getData(path, num):
	data, label = [], []
	with open(path, 'r') as f:
		for line in f:
			items = line.strip().split('\t')
			if len(items) != num:  # skip rows that do not have the expected number of columns
				continue
			data.append(list(map(float, items[:-1])))
			label.append(float(items[-1]))
	return np.array(data), np.array(label)

def errorRate(predict, label):
	# Threshold into a new array so the caller's probability array is not overwritten
	predicted_label = (predict >= 0.5).astype(float)
	return np.mean(predicted_label != label)

def plot_best(weight, x, y):
	y = y.astype('int')
	x_1 = x[y==1]
	x_0 = x[y==0]
	fig = plt.figure()
	graph = fig.add_subplot(111)
	graph.scatter(x_1[:,0], x_1[:,1], s=3, color='red', marker='^')
	graph.scatter(x_0[:,0], x_0[:,1], s=3, color='yellow', marker='s')
	x = np.arange(-3.0, 3.0, 0.1)
	# Decision boundary: x1*w1 + x2*w2 + w3 = 0
	y = (-weight[2] - x*weight[0]) / weight[1]  # x2 is the vertical (y) coordinate in the plot
	graph.plot(x, y)
	plt.xlabel('x')
	plt.ylabel('y')
	plt.show()


def test():
	data, label = getData('../data/5.Logistic/TestSet.txt', 3)
	lr = LogisticRegression(2)
	weight = lr.fit(data, label)
	plot_best(weight, data, label)

if __name__ == '__main__':
	# test()
	train_data, train_label = getData('../data/5.Logistic/HorseColicTraining.txt', 22)
	test_data, test_label = getData('../data/5.Logistic/HorseColicTest.txt', 22)
	lr = LogisticRegression(21)
	weight = lr.fit(train_data, train_label)
	predict = lr.predict(test_data)
	error_rate = errorRate(predict, test_label)
	print('test error is {}%'.format(error_rate*100))  # author's reported result: 47.76%, which is too high; training plus testing took about 0.7 s in total