如何使用 pymc3 对分类效果进行建模

问题描述

我有一个数据集,它看起来像是正态分布的样本,其中有几个独立的影响移动特定数据点。应用于任何点的效果都是已知的,需要从数据中恢复偏移。 我应该如何重新参数化这个模型以使其适合数据?

这是一个玩具数据集生成器:

import os
os.environ["MKL_THREADING_LAYER"] = "GNU"

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'
import seaborn as sns

import pymc3 as pm
import arviz as az
import theano.tensor as tt
import theano as T

np.random.seed(0)
numeff = 2
numdata = 10000
shifts = []
index = []

for i in range(numeff):
    n = np.random.randint(2,4)
    s = np.random.uniform(size=n)
    s -= np.mean(s) #keep each effect shifts centered
    shifts.append(s)
    
data = np.random.normal(0.0,0.03,size=numdata)
for s in shifts: 
    idx = np.random.randint(0,len(s),size=numdata)
    index.append(idx)
    data += s[idx]

plt.hist(data,100)
print(shifts)

[数组([-0.12571057,0.12571057]),数组([ 0.00673915,-0.11448923,0.10775008])]

Histogram of example data

我正在尝试使用 pymc3 对这些数据进行建模,如下所示:

coords = {}
for i in range(numeff):
    name = "effect_{}".format(i)
    value = np.arange(len(shifts[i]))
    coords[name]=value

with pm.Model(coords=coords) as model:  
    idx = pm.Data("idx",np.array(index))
    
    theta = tt.zeros(numdata)

    for i in range(numeff):
        dimname = "effect_{}".format(i)
        mu = pm.normal("mu_{}".format(i),mu=0.0,sigma=10.0,dims=(dimname))
        mu -= mu.mean(keepdims=True)
        theta += mu[idx[i,:]]
    
    sigma = pm.Exponential('sigma',1.0)    
    obs = pm.normal('obs',theta,sigma=sigma,observed=data)
    
pm.model_to_graphviz(model)

enter image description here

后采样步骤:

with model:
    idata = pm.sample(draws=1000,tune=1000,chains=4,cores=4,return_inferencedata=True,target_accept=0.99)

我有一个非常缓慢且行为不端的采样器:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma,mu_1,mu_0]

 100.00% [8000/8000 09:59<00:00 Sampling 4 chains,0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 599 seconds.
The chain reached the maximum tree depth. Increase max_treedepth,increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth,increase target_accept or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

az.plot_trace(idata)

enter image description here

有没有更好的方法来设计这样的模型?在理想的世界中,我应该有 3 种不同的影响来驱动数据……但以我目前的模型设计,这是无望的。

解决方法

暂无找到可以解决该程序问题的有效方法,小编努力寻找整理中!

如果你已经找到好的解决方法,欢迎将解决方案带上本链接一起发送给小编。

小编邮箱:dio#foxmail.com (将#修改为@)