sklearn + Python: a linear regression example

2023-07-31 18:07:04

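Predicting Boston house prices with a first-order linear model

The data set ships with sklearn and contains features and prices for 506 houses, collected in Boston before 1993. It is loaded with load_boston().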

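# Requires scikit-learn < 1.2: load_boston was deprecated in 1.0 and removed in 1.2.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import time
from sklearn.linear_model import LinearRegression


boston = load_boston()

X = boston.data
y = boston.target

print("X.shape:{}.y.shape:{}".format(X.shape, y.shape))
print('boston.feature_name:{}'.format(boston.feature_names))

# Hold out 20% of the samples as a test set; random_state fixes the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

model = LinearRegression()

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
start = time.perf_counter()
model.fit(X_train, y_train)

# For a regressor, score() returns the coefficient of determination R^2.
train_score = model.score(X_train, y_train)
cv_score = model.score(X_test, y_test)

print('timeused:{0:.6f};train_score:{1:.6f},sv_score:{2:.6f}'.format((time.perf_counter() - start),
                                                                      train_score, cv_score))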

The output is:

X.shape:(506, 13).y.shape:(506,)
boston.feature_name:['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
timeused:0.012403;train_score:0.723941,sv_score:0.794958

Neither the training score (0.72) nor the test score (0.79) is high, so the model is probably underfitting.
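For a regressor such as LinearRegression, score() returns the coefficient of determination R^2 rather than a classification accuracy. The sketch below computes R^2 by hand with NumPy on made-up numbers; the helper name r2_score_manual and the sample arrays are illustrative assumptions, not part of the original code.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper, for illustration only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0, 4.0])    # exact predictions
mean_only = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
bad = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])        # reversed predictions
print(perfect, mean_only, bad)
```

A score of 1 is a perfect fit, 0 is no better than always predicting the mean, and negative values occur when the predictions are worse than the mean.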

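Linear regression with polynomial features

The example above underfits: the model is too simple to fit the data. To increase the model's complexity, we introduce polynomial features.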

For example, if the original features are [a, b], then with degree 2 the polynomial features become [1, a, b, a^2, ab, b^2]. Other degrees work analogously.

Polynomial features increase the complexity of both the data and the model, which lets the model fit the data more closely.
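As a small check of this expansion, the following sketch applies sklearn's PolynomialFeatures to a single made-up sample with a = 2 and b = 3 (the sample values are assumptions chosen for illustration).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2.0, 3.0]])  # one row: a = 2, b = 3

# include_bias defaults to True, which adds the constant column 1.
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
print(expanded)  # columns: 1, a, b, a^2, ab, b^2

# With include_bias=False the constant column is dropped.
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(sample)
print(no_bias)   # columns: a, b, a^2, ab, b^2
```

Dropping the constant column is a reasonable choice when the downstream LinearRegression fits its own intercept.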

The code below uses Pipeline to chain the polynomial feature transform and the linear regression estimator, then reports the scores for degree 1, 2 and 3.

# Requires scikit-learn < 1.2: load_boston was deprecated in 1.0 and removed in 1.2.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import time
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline


def polynomial_model(degree=1):
    # Expand the inputs into polynomial terms up to the given degree.
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)

    # normalize was deprecated in scikit-learn 1.0 and removed in 1.2,
    # the same release that removed load_boston.
    linear_regression = LinearRegression(normalize=True)
    pipeline = Pipeline([('polynomial_features', polynomial_features),
                         ('linear_regression', linear_regression)])
    return pipeline


boston = load_boston()
X = boston.data
y = boston.target
print("X.shape:{}.y.shape:{}".format(X.shape, y.shape))
print('boston.feature_name:{}'.format(boston.feature_names))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

for i in range(1, 4):
    print('degree:{}'.format(i))
    model = polynomial_model(degree=i)

    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
    start = time.perf_counter()
    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    cv_score = model.score(X_test, y_test)

    print('timeused:{0:.6f};train_score:{1:.6f},sv_score:{2:.6f}'.format((time.perf_counter() - start),
                                                                          train_score, cv_score))

The output is:

X.shape:(506, 13).y.shape:(506,)
boston.feature_name:['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
degree:1
timeused:0.003576;train_score:0.723941,sv_score:0.794958
degree:2
timeused:0.030123;train_score:0.930547,sv_score:0.860465
degree:3
timeused:0.137346;train_score:1.000000,sv_score:-104.429619

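Degree 1 gives the same scores as the plain linear model above. Degree 3 scores 1 on the training set and a negative value on the test set, which is clear overfitting.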
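One way to see why degree 3 can fit the training set perfectly is to count features. With n input features, a polynomial expansion up to degree d without the bias column has C(n+d, d) - 1 terms. The sketch below evaluates this for the 13 Boston features and compares it with the training set size implied by test_size=0.2 on 506 samples; the variable names are illustrative.

```python
import math

n_features = 13
n_samples = 506
# train_test_split rounds the test size up, so 0.2 * 506 gives 102 test rows.
n_train = n_samples - math.ceil(0.2 * n_samples)

counts = {}
for degree in range(1, 4):
    # Number of monomials of total degree 1..degree in n_features variables.
    counts[degree] = math.comb(n_features + degree, degree) - 1
    print('degree:{} features:{} train samples:{}'.format(degree, counts[degree], n_train))
```

At degree 3 there are more features (559) than training samples (404), so an unregularized linear model has enough freedom to reproduce the training targets exactly. That is consistent with a training score of 1 and says nothing about how the model behaves on unseen data.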

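So the degree-2 model is the one to choose.

The second-order polynomial is much better than the first-order one, but there is still a noticeable gap between the training and test scores. This is probably due to insufficient data; more data would be needed to improve the model further.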

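Normal equation versus gradient descent

Besides approaching the optimum iteratively with gradient descent, the solution can also be computed directly with the normal equation.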

According to Andrew Ng's course, the optimal solution for linear regression is:

theta = (X^T X)^(-1) X^T y
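The closed-form solution can be tried out directly with NumPy. The sketch below uses synthetic data (the coefficients, sample size and noise level are made up for illustration), prepends a column of ones so that the intercept is part of theta, and solves the linear system (X^T X) theta = X^T y instead of forming the inverse explicitly, which is a common and numerically safer variant of the same formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, known coefficients (intercept first).
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
features = rng.normal(size=(200, 3))
X = np.hstack([np.ones((200, 1)), features])  # column of ones for the intercept
y = X @ true_theta + rng.normal(scale=0.1, size=200)

# Normal equation: solve (X^T X) theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta)
print(theta_lstsq)
```

Both results should be close to the coefficients used to generate the data, up to the added noise.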

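Each method has its own advantages and disadvantages:

Gradient descent:

Disadvantages: a learning rate must be chosen, and many iterations are needed

Advantages: still works at a reasonable speed when there are many features (more than 10,000)

Normal equation:

Advantages: no learning rate to set, no iterations

Disadvantages: requires computing X^T X and its inverse, which costs O(n^3) in the number of features n; becomes very slow when there are many features (more than 10,000)

The normal equation does not apply to non-linear models such as those used for classification, so gradient descent is applicable more widely.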
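To make the comparison concrete, here is a minimal batch gradient descent sketch for the least-squares objective, checked against the normal-equation solution on synthetic data. The function name gradient_descent, the learning rate 0.1 and the iteration count are illustrative assumptions, not values from the original article.

```python
import numpy as np


def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the halved mean squared error (illustrative helper)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient of (1 / 2m) * ||X theta - y||^2 with respect to theta.
        gradient = X.T @ (X @ theta - y) / m
        theta -= lr * gradient
    return theta


rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])  # ones column = intercept
y = X @ np.array([4.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

theta_gd = gradient_descent(X, y)             # iterative solution
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)  # normal-equation solution
print(np.max(np.abs(theta_gd - theta_ne)))
```

On this small, well-scaled problem both approaches reach the same coefficients. Gradient descent needs a learning rate and many passes over the data, while the normal equation needs a single linear solve whose cost grows quickly with the number of features.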
