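University course | Statistics Fundamentals and Python Analysis 2 | Hands-on: evaluating linear regression (all course materials free to download)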

Published: 2024/5/15 16:23:49

All materials for this article are free to download.

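Evaluating a linear regression model

1 Simple linear regression code

1.1 Model evaluation code

1.2 Output

1.3 Interpreting the results

2 Quadratic regression in one variable

2.1 Quadratic regression code

2.2 Model evaluation

2.3 Output

2.4 Interpreting the results

3 Another way to get the R-squared value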

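久菜盒子工作室: a big-data science team; search for 久菜盒子工作室 on any platform.
Who we are: master's and PhD graduates of 985 universities, a fully funded doctoral student in the US, a product lead with 7 years in computing, and a medical researcher at a medical big-data company, with two SCI Q1 papers, one paper in a Nature-family journal, and one paper in a Chinese Q2 core journal.
Main areas: medical big-data analysis, economics and management data analysis, financial models, mathematical statistics fundamentals, statistics, health economics, epidemiology and statistics.
Software: R, Python, Stata, SPSS, MATLAB, MySQL.

Team philosophy: start from zero, so that everyone gets a high-quality research education.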


Editor for this article: 久菜老师

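Evaluating a linear regression model

R-squared is the R^2 of statistics (the coefficient of determination). It measures how good the linear fit is: the share of the variance in the target variable that the model explains. For a least-squares fit with an intercept it ranges from 0 to 1, and the closer it is to 1, the better the fit.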
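As a rough illustration of the definition, R^2 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name r_squared and the toy numbers below are made up for this sketch; they are not part of the course data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy observations, invented for illustration only
y_obs = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_obs, y_obs)                    # predictions equal the observations
mean_only = r_squared(y_obs, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
print(perfect, mean_only)
```

A model that reproduces the data exactly scores 1, and a model that only ever predicts the mean scores 0, which is why values near 1 are read as a good fit.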

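Adj. R-squared is the Adjusted R^2. It also measures how good the linear fit is, but it penalizes every additional feature variable, so it is the fairer number when comparing models with different numbers of terms. It is never larger than R^2 (and can fall slightly below 0 for a very poor model); the closer it is to 1, the better the fit.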
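The usual textbook formula is Adj. R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of feature variables (not counting the constant). The sketch below plugs in the rounded R-squared values and n = 100 from the two summary tables in this article; the function name is an assumption for illustration, and because the inputs are rounded the results only agree with the tables to about three decimals.

```python
def adjusted_r_squared(r2, n_obs, n_features):
    """Adjusted R^2 from R^2, sample size and number of features (hypothetical helper)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Rounded values taken from the summary tables in this article
adj_linear = adjusted_r_squared(0.855, 100, 1)      # first-degree model, one feature
adj_quadratic = adjusted_r_squared(0.931, 100, 2)   # quadratic model, two features
print(round(adj_linear, 3), round(adj_quadratic, 3))
```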

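The P-value measures the significance of a feature variable. It is a probability, so it also ranges from 0 to 1: roughly, the probability of seeing an estimate at least this far from zero if the true coefficient were zero. The lower the P-value (the closer to 0), the more significant the feature variable, that is, the stronger the evidence that it really is related to the target variable. 0.05 is the usual threshold: below 0.05, the feature variable is considered to have a statistically significant relationship with the target variable.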
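To make the idea concrete, here is a minimal sketch of how a two-sided P-value follows from a t statistic and the residual degrees of freedom. It assumes scipy is installed (statsmodels depends on it); the helper name is hypothetical. The inputs t = -2.311 and 97 degrees of freedom are the x1 row and Df Residuals of the quadratic model's summary in section 2.3.

```python
from scipy import stats

def two_sided_p_value(t_stat, df_resid):
    """Two-sided P-value of a t statistic under Student's t distribution."""
    # sf() is the upper-tail probability; doubling it covers both tails
    return 2.0 * stats.t.sf(abs(t_stat), df_resid)

# t statistic of the first-degree term in the quadratic model (section 2.3)
p_linear_term = two_sided_p_value(-2.311, 97)
print(round(p_linear_term, 3))
```

The result should land close to the 0.023 printed in the P>|t| column of that table, which is below the 0.05 threshold.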

The evaluation requires the statsmodels library:

pip install statsmodels

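1 Simple linear regression code

# Read the data
import pandas as pd
df = pd.read_excel('IT行业收入表.xlsx')  # the IT industry income table
# The independent variable must be two-dimensional
x = df[['工龄']]  # years of service; double brackets return a DataFrame
# The dependent variable only needs to be one-dimensional
y = df['薪水']  # salary; single brackets return a Series

# Build the model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(x, y)  # x must be 2-D and y 1-D; a 1-D x raises an error

# Visualize the model
from matplotlib import pyplot as plt
# Font setting so that the Chinese axis labels display correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
# x is a DataFrame; x.values converts it to an array that plot() can read
# plt.plot(x, regr.predict(x), color='red'), i.e. x without .values, can raise an error
plt.scatter(x, y)
plt.plot(x.values, regr.predict(x), color='red')
plt.xlabel('工龄')
plt.ylabel('薪水')
plt.show()

# Build the regression equation y = a*x + b
print('Coefficient a: ' + str(regr.coef_[0]))
print('Intercept b: ' + str(regr.intercept_))
print()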

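1.1 Model evaluation code

# Import statsmodels, used here to evaluate the linear regression model
import statsmodels.api as sm
# add_constant() adds a constant column to the feature variable x; the result is stored in x2,
# so that y = a*x + b has a constant term, the intercept b
x2 = sm.add_constant(x)  # Note: pass x for the first-degree model, x_ for the polynomial model
# Fit the regression of y on x2 with OLS() and fit()
est = sm.OLS(y, x2).fit()
# Print the model summary
print(est.summary())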

1.2 Output

Coefficient a: 2497.1513476046866

Intercept b: 10143.131966873787

                            OLS Regression Results
==============================================================================
Dep. Variable:                   薪水   R-squared:                       0.855
Model:                            OLS   Adj. R-squared:                  0.854
Method:                 Least Squares   F-statistic:                     578.5
Date:                Sun, 30 Jan 2022   Prob (F-statistic):           6.69e-43
Time:                        02:25:40   Log-Likelihood:                -930.83
No. Observations:                 100   AIC:                             1866.
Df Residuals:                      98   BIC:                             1871.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.014e+04    507.633     19.981      0.000    9135.751    1.12e+04
工龄        2497.1513    103.823     24.052      0.000    2291.118    2703.185
==============================================================================
Omnibus:                        0.287   Durbin-Watson:                   0.555
Prob(Omnibus):                  0.867   Jarque-Bera (JB):                0.463
Skew:                           0.007   Prob(JB):                        0.793
Kurtosis:                       2.667   Cond. No.                         9.49
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

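1.3 Interpreting the results

R-squared = 0.855, close to 1: the fit is good.

Adj. R-squared = 0.854, close to 1: the fit is good.

The coef column has two values:

const, the constant term (intercept): its P-value is shown as 0.000 (that is, below 0.0005), so it is highly significant.

工龄 (years of service), the coefficient of the feature variable: its P-value is also shown as 0.000, so it is highly significant.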
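The other columns of the coefficient table follow from coef and std err. As a hedged sketch (assuming scipy is available), the t column is coef divided by std err, and the [0.025, 0.975] columns are coef plus or minus the 97.5% quantile of Student's t distribution times std err. The numbers are the 工龄 row and Df Residuals from the table in 1.2; since they are rounded, small differences in the last digit are expected.

```python
from scipy import stats

# Values read off the 工龄 row of the summary in section 1.2
coef, std_err, df_resid = 2497.1513, 103.823, 98

t_stat = coef / std_err                     # the "t" column
t_crit = stats.t.ppf(0.975, df_resid)       # two-sided 95% critical value
ci_low = coef - t_crit * std_err            # the "[0.025" column
ci_high = coef + t_crit * std_err           # the "0.975]" column
print(round(t_stat, 3), round(ci_low, 1), round(ci_high, 1))
```

The interval does not contain 0, which is another way of seeing that the coefficient is significant at the 5% level.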

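2 Quadratic regression in one variable

2.1 Quadratic regression code

# Read the data
import pandas as pd
df = pd.read_excel('IT行业收入表.xlsx')  # the IT industry income table
# The independent variable must be two-dimensional
x = df[['工龄']]  # years of service; a DataFrame
# The dependent variable only needs to be one-dimensional
y = df['薪水']  # salary; a Series

# Import PolynomialFeatures, which generates the polynomial terms
from sklearn.preprocessing import PolynomialFeatures
# Set the highest degree to 2, in preparation for generating the squared term (x^2)
poly_reg = PolynomialFeatures(degree=2)
# Convert x into a new 2-D array x_ whose columns are the constant 1, the original x, and the new x^2
x_ = poly_reg.fit_transform(x)

# Fit the quadratic regression model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(x_, y)

# Visualize the quadratic regression model
from matplotlib import pyplot as plt
# Font setting so that the Chinese axis labels display correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.scatter(x, y)
plt.plot(x.values, regr.predict(x_), color='red')
plt.xlabel('工龄')
plt.ylabel('薪水')
plt.show()

# Build the quadratic regression equation y = a*x^2 + b*x + c
print(str(regr.coef_))  # [coefficient of the constant column (0), b, a]
print(str(regr.intercept_))  # constant term c
print()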
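To see what fit_transform() produces, the following small sketch runs PolynomialFeatures on two made-up values. Each row becomes [1, x, x^2]. Because the first column is all ones and LinearRegression fits its own intercept, that column carries no extra information, which is why the first entry of regr.coef_ comes out as 0.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two invented x values, shaped as a 2-D column like df[['工龄']]
x_small = np.array([[2.0], [3.0]])
x_poly = PolynomialFeatures(degree=2).fit_transform(x_small)
print(x_poly)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```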

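2.2 Model evaluation

# Import statsmodels, used here to evaluate the linear regression model
import statsmodels.api as sm
# Pass x_, which contains the squared term (x^2), to add_constant()
# x_ already has a column of ones, so by default add_constant() leaves it unchanged
x2 = sm.add_constant(x_)
# Fit the regression with OLS() and fit()
est = sm.OLS(y, x2).fit()
# Print the model summary
print(est.summary())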

2.3 Output

[ 0. -743.68080444 400.80398224]

13988.159332096886

                            OLS Regression Results
==============================================================================
Dep. Variable:                   薪水   R-squared:                       0.931
Model:                            OLS   Adj. R-squared:                  0.930
Method:                 Least Squares   F-statistic:                     654.8
Date:                Sun, 30 Jan 2022   Prob (F-statistic):           4.70e-57
Time:                        02:50:37   Log-Likelihood:                -893.72
No. Observations:                 100   AIC:                             1793.
Df Residuals:                      97   BIC:                             1801.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.399e+04    512.264     27.307      0.000     1.3e+04     1.5e+04
x1          -743.6808    321.809     -2.311      0.023   -1382.383    -104.979
x2           400.8040     38.790     10.333      0.000     323.816     477.792
==============================================================================
Omnibus:                        2.440   Durbin-Watson:                   1.137
Prob(Omnibus):                  0.295   Jarque-Bera (JB):                2.083
Skew:                          -0.352   Prob(JB):                        0.353
Kurtosis:                       3.063   Cond. No.                         102.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

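2.4 Interpreting the results

R-squared = 0.931, close to 1: the fit is good, and better than the first-degree regression (0.855).

Adj. R-squared = 0.930, close to 1: the fit is good, and better than the first-degree regression (0.854). Because Adj. R-squared penalizes the extra term, this is the fairer of the two comparisons.

The coef column has three values:

const = 1.399e+04, the constant term: P-value 0.000, highly significant.

x1 = -743.6808, the coefficient of the first-degree term: P-value 0.023, below the 0.05 threshold, so it is significant at the 5% level, though less strongly than the other two terms.

x2 = 400.8040, the coefficient of the squared term: P-value 0.000, highly significant.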
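Putting the three coefficients together gives the fitted equation y = 400.80*x^2 - 743.68*x + 13988.16. As a small sketch, the function below evaluates that equation for a given number of years of service; the function name is made up, and the constants are copied from the output in 2.3.

```python
def predict_salary(years):
    """Evaluate the fitted quadratic from section 2.3 (hypothetical helper)."""
    a = 400.80398224          # coefficient of the squared term
    b = -743.68080444         # coefficient of the first-degree term
    c = 13988.159332096886    # constant term (intercept)
    return a * years ** 2 + b * years + c

for years in (0, 5, 10):
    print(years, round(predict_salary(years), 2))
```

Predictions are only meaningful inside the range of years of service covered by the data; the quadratic keeps growing quickly outside it.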

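3 Another way to get the R-squared value

# Get the R-squared value
# Here regr is the first-degree model from section 1; for the quadratic model, pass regr.predict(x_) instead
from sklearn.metrics import r2_score
r2 = r2_score(y, regr.predict(x))
print(r2)

Output: 0.8551365584870814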
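For a scikit-learn LinearRegression, the model's own score() method returns the same R-squared as r2_score() applied to its predictions. The sketch below checks this on synthetic data (invented for illustration, not the course spreadsheet); the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: y = 1 + 0.5*x + noise
rng = np.random.default_rng(1)
x_syn = rng.uniform(0, 10, size=(50, 1))
y_syn = 1.0 + 0.5 * x_syn[:, 0] + rng.normal(0, 0.3, size=50)

model = LinearRegression().fit(x_syn, y_syn)
r2_metric = r2_score(y_syn, model.predict(x_syn))   # the approach used in this section
r2_method = model.score(x_syn, y_syn)               # built-in shortcut
print(r2_metric, r2_method)
```

Note the argument order of r2_score(): the true values come first, the predictions second.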
