python – theano hard_sigmoid() breaks gradient descent

To highlight the problem, let's follow this tutorial.

Theano has 3 ways to compute the sigmoid of a tensor, namely sigmoid, ultra_fast_sigmoid and hard_sigmoid. It seems that using the latter two breaks the gradient descent algorithm.

The traditional sigmoid converges as expected, but the other sigmoids show strange, inconsistent behaviour. ultra_fast_sigmoid throws an outright error, MethodNotDefined ('grad', ultra_fast_sigmoid), as soon as the gradient is computed, while hard_sigmoid compiles fine but fails to converge to a solution.
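
A minimal way to reproduce the error in isolation (a stripped-down sketch, separate from the full network below):

import theano
import theano.tensor as T
import theano.tensor.nnet as nnet

z = T.dscalar()
g_sigmoid = T.grad(nnet.sigmoid(z), z)       # fine
g_hard = T.grad(nnet.hard_sigmoid(z), z)     # also fine (built from elementwise ops)
# g_fast = T.grad(nnet.ultra_fast_sigmoid(z), z)  # raises MethodNotDefined('grad', ...), see traceback below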

Does anyone know the source of this behaviour? Nothing in the documentation points out that this should happen, and it seems counter-intuitive.

Code:

import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np

x = T.dvector()
y = T.dscalar()

def layer(x,w):
    b = np.array([1],dtype=theano.config.floatX)
    new_x = T.concatenate([x,b])
    m = T.dot(w.T,new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1

    h = nnet.sigmoid(m) ## THIS SIGMOID RIGHT HERE

    return h

def grad_desc(cost,theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost,wrt=theta))

theta1 = theano.shared(np.array(np.random.rand(3,3),dtype=theano.config.floatX))
theta2 = theano.shared(np.array(np.random.rand(4,1),dtype=theano.config.floatX))

hid1 = layer(x,theta1) #hidden layer

out1 = T.sum(layer(hid1,theta2)) #output layer
fc = (out1 - y)**2 #cost expression

cost = theano.function(inputs=[x,y], outputs=fc, updates=[
        (theta1, grad_desc(fc,theta1)), (theta2, grad_desc(fc,theta2))])
run_forward = theano.function(inputs=[x],outputs=out1)

inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 1, 0, 0]) #training data Y
cur_cost = 0
for i in range(2000):
    for k in range(len(inputs)):
        cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
    if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
        print('Cost: %s' % (cur_cost,))

print(run_forward([0,1]))
print(run_forward([1,0]))
print(run_forward([0,0]))

I changed the following lines of the code to make the output for this post shorter (they differ from the tutorial, but have been included in the code above):

from theano.tensor.nnet import binary_crossentropy as cross_entropy #imports
fc = cross_entropy(out1,y) #cost expression
for i in range(4000): #training iteration

sigmoid

Cost: 1.62724279493
Cost: 0.545966632545
Cost: 0.156764560912
Cost: 0.0534911098234
Cost: 0.0280394147992
Cost: 0.0184933786794
Cost: 0.0136444190935
Cost: 0.0107482836159
0.993652087577
0.00848194143055
0.990829396285
0.00878482655791

ultra_fast_sigmoid

  File "test.py",line 30,in dist-packages/theano/gradient.py",line 545,in grad
    grad_dict,wrt,cost_name)
  File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py",line 1283,in _populate_grad_dict
    rval = [access_grad_cache(elem) for elem in wrt]
  File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py",line 1241,in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py",line 951,in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py",line 1089,in access_term_cache
    input_grads = node.op.grad(inputs,new_output_grads)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/elemwise.py",line 662,in grad
    rval = self._bgrad(inputs,ograds)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/elemwise.py",line 737,in _bgrad
    scalar_igrads = self.scalar_op.grad(scalar_inputs,scalar_ograds)
  File "/usr/local/lib/python2.7/dist-packages/theano/scalar/basic.py",line 878,in grad
    self.__class__.__name__)
theano.gof.utils.MethodNotDefined: ('grad',larsigmoid'>,'UltraFastScalarsigmoid')

hard_sigmoid

Cost: 1.19810193303
Cost: 0.684360309062
Cost: 0.692614056124
Cost: 0.697902474354
Cost: 0.701540531661
Cost: 0.703807604483
Cost: 0.70470238116
Cost: 0.704385738831
0.4901260624
0.486248177053
0.489490785078
0.493368670425
Best answer
Here is the source code of hard_sigmoid:

def hard_sigmoid(x):
    """An approximation of sigmoid.
    More approximate and faster than ultra_fast_sigmoid.
    Approx in 3 parts: 0, scaled linear, 1
    Removing the slope and shift does not make it faster.
    """
    # Use the same dtype as determined by "upgrade_to_float",
    # and perform the computation in that dtype.
    out_dtype = scalar.upgrade_to_float(scalar.Scalar(dtype=x.dtype))[0].dtype
    slope = tensor.constant(0.2, dtype=out_dtype)
    shift = tensor.constant(0.5, dtype=out_dtype)
    x = (x * slope) + shift
    x = tensor.clip(x, 0, 1)
    return x

So hard_sigmoid is simply implemented as a piecewise linear function, hard_sigmoid(x) = clip(0.2*x + 0.5, 0, 1): its gradient is 0.2 inside the interval (-2.5, 2.5) and 0 everywhere else. That means that whenever the input falls outside the region (-2.5, 2.5), its gradient is exactly zero and no learning takes place.

So it may not be suitable for training, but it can be used for approximating the prediction result.
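
For instance (a sketch of that idea, not code from the answer above): keep the exact sigmoid in the training graph and use hard_sigmoid only in a separate, cheaper prediction function:

import numpy as np
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet

x = T.dvector()
w = theano.shared(np.random.rand(3))

pre = T.dot(w, x)
train_out = nnet.sigmoid(pre)       # exact sigmoid: smooth, nonzero gradient everywhere
pred_out = nnet.hard_sigmoid(pre)   # cheap piecewise-linear approximation for prediction

cost = (train_out - 1.0) ** 2                # toy cost, just to show the gradient path
g = T.grad(cost, w)                          # well-behaved gradient for training
predict = theano.function([x], pred_out)     # fast approximate forward pass
print(predict(np.array([0.1, 0.2, 0.3])))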

Edit:
To evaluate the gradient of the network parameters, we normally use backpropagation.
Here is a very simple example.

import numpy
import theano

x = theano.tensor.scalar()
w = theano.shared(numpy.float32(1))
y = theano.tensor.nnet.hard_sigmoid(w*x)  # y = hard_sigmoid(w*x); w is initialized to 1

dw = theano.grad(y, w)  # gradient wrt w, which equals slope*x inside the linear region
net = theano.function([x], [y, dw])

print net(-3)
print net(-1)
print net(0)
print net(1)
print net(3)

Output:
[array(0.0),array(-0.0)]  # zero gradient because the slope is zero
[array(0.3),array(-0.2)]
[array(0.5),array(0.0)]  # zero gradient because x is zero
[array(0.7),array(0.2)]
[array(1.0),array(0.0)]  # zero gradient because the slope is zero

OP's edit:
If you look at the source implementation, ultra_fast_sigmoid fails because it is hard-coded in Python rather than being handled by tensor expressions.
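
A possible workaround (a sketch, assuming your Theano version exposes the local_ultra_fast_sigmoid graph optimization): build the graph with the ordinary sigmoid so that T.grad works, and ask the optimizer to substitute ultra_fast_sigmoid into the compiled function afterwards:

import numpy as np
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet

x = T.dvector()
w = theano.shared(np.float64(1.0))
cost = T.sum(nnet.sigmoid(w * x))  # ordinary sigmoid, so the gradient is defined
gw = T.grad(cost, w)

# Enable the (assumed) substitution optimization at compile time; the same thing can be
# requested globally with THEANO_FLAGS=optimizer_including=local_ultra_fast_sigmoid.
mode = theano.compile.get_default_mode().including('local_ultra_fast_sigmoid')
f = theano.function([x], [cost, gw], mode=mode)

print(f(np.array([0.5, -1.0, 2.0])))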
