Linear Regression with scikit-learn in Practice: A Full Walkthrough from House Price Prediction to Model Evaluation
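House Price Prediction with scikit-learn: A Hands-On Guide to Linear Regression and Model Optimization

Suppose you have the California housing dataset in hand. It contains 8 features, including median income, house age, and number of rooms, and the goal is to predict house prices (the target is the median house value of each census block group, in units of $100,000). As a data analyst new to machine learning, how do you build a reliable predictive model from scratch? This article walks through the full linear regression workflow, covering data exploration, feature engineering, model training, and evaluation, and shares a few practical techniques for improving model performance.

1. Data Preparation and Exploratory Analysis

The first step of any machine learning project is understanding the data. We use the California housing dataset bundled with sklearn. This classic dataset contains 20640 samples, each with 8 features:

```python
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the dataset and wrap the features in a DataFrame
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
```

1.1 Key Features

Looking at the basic summary statistics is a quick way to spot potential problems:

```python
print(X.describe())
```

The output shows that the features are on very different scales:

- MedInc (median income): roughly 0 to 15
- HouseAge (house age): 0 to 52 years
- AveRooms (average number of rooms): roughly 1 to 141, with outliers

1.2 Visualizing the Data Distribution

Use a histogram to inspect the distribution of the target variable (house price):

```python
import matplotlib.pyplot as plt

plt.hist(y, bins=50)
plt.xlabel("House Price")
plt.ylabel("Frequency")
plt.show()
```

Note: linear regression does not require the target variable itself to be normally distributed. The normality assumption applies to the residuals, and it matters mainly for statistical inference. A heavily skewed target often leads to skewed or heteroscedastic residuals, however, so a log transform is worth considering in that case.

2. Building a Baseline Model

2.1 Data Splitting and Standardization

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

2.2 Training the Initial Model

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
```

Inspect the model coefficients:

```python
coef_df = pd.DataFrame({
    "Feature": housing.feature_names,
    "Coefficient": lr.coef_,
})
print(coef_df.sort_values("Coefficient", ascending=False))
```

Example output (excerpt):

```
     Feature  Coefficient
0     MedInc     0.829619
3  AveBedrms    -0.153758
5   AveOccup    -0.266628
```

3. Model Evaluation and Diagnostics

3.1 Common Evaluation Metrics

| Metric | Formula | Characteristics | Ideal value |
| --- | --- | --- | --- |
| MSE | $\frac{1}{n}\sum(y-\hat{y})^2$ | Sensitive to outliers | Close to 0 |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Proportion of variance explained | Close to 1 |

Compute the performance on the test set:

```python
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_test_scaled)
print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
```

3.2 Residual Analysis

The residuals of a healthy model should satisfy the following:

- Mean of 0
- Homoscedasticity (no funnel shape)
- No autocorrelation

Plot the residuals:

```python
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color="r", linestyle="--")
plt.show()
```

4. Feature Engineering Optimization

4.1 Handling Nonlinear Relationships

Try a polynomial expansion of the MedInc feature:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train[["MedInc"]])
X_test_poly = poly.transform(X_test[["MedInc"]])

# Combine with the other (standardized) features; column 0 is MedInc
X_train_enhanced = np.hstack([X_train_scaled[:, 1:], X_train_poly])
X_test_enhanced = np.hstack([X_test_scaled[:, 1:], X_test_poly])
```

Note that the polynomial terms here are built from the raw MedInc column and are not standardized, which affects the regularized models used below.

4.2 Feature Selection

Use Lasso regression to select features automatically:

```python
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5)
lasso.fit(X_train_enhanced, y_train)
print("Selected features:", np.sum(lasso.coef_ != 0))
```

5. Advanced Optimization Strategies

5.1 Hyperparameter Tuning with Cross-Validation

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

param_grid = {"alpha": [0.01, 0.1, 1, 10]}
ridge = GridSearchCV(Ridge(), param_grid, cv=5)
ridge.fit(X_train_enhanced, y_train)
print("Best alpha:", ridge.best_params_)
```

5.2 Ensemble Learning

Try combining several models with a voting regressor:

```python
from sklearn.ensemble import VotingRegressor
from sklearn.svm import SVR

models = [
    ("lr", LinearRegression()),
    ("ridge", Ridge(alpha=1)),
    ("svr", SVR(kernel="linear")),
]
ensemble = VotingRegressor(models)
ensemble.fit(X_train_enhanced, y_train)
```

6. Model Deployment and Monitoring

Save the best model as a pkl file:

```python
import joblib

joblib.dump(ensemble, "house_price_model.pkl")
```

A real project also needs:

- A data drift detection mechanism
- Alerts for model performance decay
- An automated retraining pipeline

In one real estate valuation project, I found that when economic policy changed abruptly, the model's prediction error rose by 30% within 2 weeks. The solution was to set up a monitoring metric: when the mean absolute percentage error stays above a threshold for 3 consecutive days, model retraining is triggered.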
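The retraining rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `daily_mape` and `RetrainMonitor`, the 15% threshold, and the sample error values are all hypothetical and do not come from the original project.

```python
from collections import deque


def daily_mape(y_true, y_pred):
    """Mean absolute percentage error for one day's predictions, in percent."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors) * 100


class RetrainMonitor:
    """Signals retraining after `patience` consecutive days above `threshold`."""

    def __init__(self, threshold=15.0, patience=3):
        self.threshold = threshold  # hypothetical MAPE limit, in percent
        self.patience = patience    # number of consecutive days required
        self.recent = deque(maxlen=patience)  # keeps only the last `patience` days

    def update(self, mape):
        """Record one day's MAPE; return True when retraining should trigger."""
        self.recent.append(mape > self.threshold)
        return len(self.recent) == self.patience and all(self.recent)


monitor = RetrainMonitor(threshold=15.0, patience=3)
history = [12.0, 16.5, 18.2, 14.0, 17.1, 19.3, 21.0]  # made-up daily MAPE values
flags = [monitor.update(m) for m in history]
print(flags)
```

With these made-up values, the flag turns on only on the last day, the first point at which three days in a row exceed the threshold. In a real pipeline the `True` result would start the retraining job instead of being printed, and the threshold would be chosen from the model's historical error.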