Ridge polynomial regression: implementing it directly in Python

Problem description

I want to implement ridge polynomial regression directly in Python, without using sklearn or similar libraries. The weights are computed directly from the closed-form solution:

w = (X^T X + lambda * I)^(-1) X^T y

The code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from openpyxl import load_workbook

wb = load_workbook('data.xlsx')
data = wb['data']
xv = []
yv = []
# Read the two data columns from the worksheet (rows 1-99)
for i in range(1, 100):
    xv.append(float(data.cell(row=i, column=1).value))
    yv.append(float(data.cell(row=i, column=2).value))

n = 5  # polynomial degree
m = len(xv)  # number of samples
        
# Build the polynomial design matrix: row i is [1, x_i, x_i^2, ..., x_i^n]
x = [[xv[i]**j for j in range(n + 1)] for i in range(m)]

lam = 5  # ridge penalty (lambda)
X = np.array(x)
XtX = (X.T).dot(X)
Xty = (X.T).dot(np.array(yv))
I = np.identity(XtX.shape[0])
XtXLI = XtX + lam * I  # X^T X + lambda * I
XtXLI_inv = np.linalg.inv(XtXLI)
teta = XtXLI_inv.dot(Xty)  # w = (X^T X + lambda * I)^-1 X^T y

def h(c):
    # Evaluate the fitted polynomial at c
    h = 0
    for i in range(0, n + 1):
        h = h + teta[i] * c**i
    return h

hv = [h(xv[i]) for i in range(0, m)]

I expected to get a better fit by tuning the lambda parameter. However, as lambda increases, the error increases significantly. How can I fix this?

Solution

You can check this implementation I made, based on the normal equations A^T * A * x = A^T * B:

import numpy as np
import matplotlib.pyplot as plt

DATA = np.array([(-5,12),(-3,2),(-2,-7),(-1,-4),(2,3),(3,1),(5,4),(7,9)])

n,m = DATA.shape

def regression(degree: int):
    A = np.empty(shape=(n,degree + 1))

    for i, data in enumerate(DATA):
        # Row i of the design matrix: [1, x_i, x_i^2, ..., x_i^degree]
        A[i] = np.array([data[0]**x for x in range(degree + 1)])

    # @ is a special python operator which performs matrix multiplication
    x = A.T @ A
    y = A.T @ np.array([d[1] for d in DATA])

    # Solves the linear system
    r = np.linalg.solve(x,y)

    # Evaluates in order to plot values
    x = np.linspace(DATA[0][0],DATA[-1][0],num=1000)
    y = np.array([np.sum(np.array([r[i]*(j**i) for i in range(len(r))])) for j in x])
    # Plots the polynomial
    plt.plot(x,y)

    # Plots the data points
    for data in DATA:
        plt.scatter(*data)

    # y has to be recalculated because linspace creates extra values in order to plot the graph
    y = np.array([np.sum(np.array([r[i] * (d[0] ** i) for i in range(len(r))])) for d in DATA])

    error = sum(abs(DATA[i][1] - y[i])**2 for i in range(n))**0.5

    # If the error is small enough, the fit is treated as exact
    if error > 1e-10:
        plt.title(f"Degree: {degree}, Error: {error}")
    else:
        plt.title(f"Degree: {degree}, Perfect approximation")

    plt.show()

for i in range(1,n):
    regression(i)
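Note that the system above is the unregularized normal equation; adding the ridge term is a one-line change. A minimal standalone sketch (the data, `degree`, and `lam` values here are illustrative, and `np.vander` replaces the explicit loop that builds the design matrix):

```python
import numpy as np

# Illustrative data and hyperparameters, not taken from the question
DATA = np.array([(-5, 12), (-3, 2), (-2, -7), (-1, -4),
                 (2, 3), (3, 1), (5, 4), (7, 9)])
degree, lam = 3, 1.0

# Design matrix: columns are x^0, x^1, ..., x^degree
A = np.vander(DATA[:, 0].astype(float), degree + 1, increasing=True)

# Ridge normal equations: (A^T A + lam * I) r = A^T b
x = A.T @ A + lam * np.identity(degree + 1)  # ridge term added here
y = A.T @ DATA[:, 1]
r = np.linalg.solve(x, y)
```

With `lam = 0` this reduces to the answer's ordinary least-squares system.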

It depends on which error you are talking about. Ridge regression is a way of regularizing polynomial regression. The hyperparameter lambda (or alpha) controls how much regularization is applied to the model. If you increase lambda, you increase the regularization: your model will perform worse on the training data but better on the test data (it will generalize better).
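As a minimal sketch of that effect (using synthetic data, since the original spreadsheet is not available), the ridge system can be solved with `np.linalg.solve` instead of an explicit inverse, which is numerically more stable, and increasing `lam` shrinks the coefficients toward zero:

```python
import numpy as np

# Hypothetical synthetic data standing in for the spreadsheet values
rng = np.random.default_rng(0)
xv = rng.uniform(-1, 1, size=50)
yv = 2 * xv**3 - xv + rng.normal(scale=0.1, size=50)

def ridge_fit(xv, yv, degree, lam):
    # Vandermonde design matrix: columns are x^0, x^1, ..., x^degree
    X = np.vander(xv, degree + 1, increasing=True)
    A = X.T @ X + lam * np.identity(degree + 1)
    # Solve (X^T X + lam * I) w = X^T y rather than inverting the matrix
    return np.linalg.solve(A, X.T @ yv)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(xv, yv, 5, lam)
    print(lam, np.linalg.norm(w))  # coefficient norm shrinks as lam grows
```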

Don't forget to scale your data, because ridge regression is sensitive to the scale of the input features.
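A minimal sketch of that scaling step (assuming a plain NumPy pipeline; the function names are illustrative): standardize each polynomial feature column, excluding the intercept, before solving the ridge system, and apply the same transform at prediction time:

```python
import numpy as np

def ridge_scaled(xv, yv, degree, lam):
    X = np.vander(np.asarray(xv, float), degree + 1, increasing=True)
    # Standardize every column except the intercept (column 0),
    # so the penalty treats all powers of x comparably
    mu = X[:, 1:].mean(axis=0)
    sigma = X[:, 1:].std(axis=0)
    Xs = X.copy()
    Xs[:, 1:] = (X[:, 1:] - mu) / sigma
    A = Xs.T @ Xs + lam * np.identity(degree + 1)
    w = np.linalg.solve(A, Xs.T @ np.asarray(yv, float))
    return w, mu, sigma

def predict(w, mu, sigma, xnew, degree):
    # Re-apply the training-time scaling before evaluating the polynomial
    X = np.vander(np.asarray(xnew, float), degree + 1, increasing=True)
    X[:, 1:] = (X[:, 1:] - mu) / sigma
    return X @ w
```

Without this step, the columns x^1 through x^n can differ by many orders of magnitude, so a single lambda penalizes the high-degree coefficients far more heavily than the others.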