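Problem description
I am trying to show that a possible astrological effect is statistically insignificant, but to no avail. I am using Pearson's chi-square test to compare the sun-sign distributions of two different populations, one of astronauts and the other of celebrities. Something must be wrong, but I cannot find it; the mistake is probably on the statistics side.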
import numpy as np
import pandas as pd
import ephem
from collections import Counter,namedtuple
import matplotlib.pyplot as plt
from scipy import stats
models = pd.read_csv('models.csv',delimiter=',')
astronauts = pd.read_csv('astronauts.csv')
models = models.sample(229)
astronauts = astronauts.sample(229)
sun = ephem.Sun()
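def get_planet_constellation(planet,dataset):
    # Collect the constellation the planet is in on each person's birth date
    person_planet_constellation = []
    for person in dataset['Birth Date']:
        planet.compute(person)
        # ephem.constellation returns (abbreviation, name); keep the name
        person_planet_constellation += [ephem.constellation(planet)[1]]
    return person_planet_constellation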
def plot_bar_group(planet,data1,data2):
    fig,ax = plt.subplots()
    plt.bar(data1.keys(),data1.values(),alpha=0.5)
    plt.bar(data2.keys(),data2.values(),alpha=0.5)
    plt.legend(['astronauts','models'])
    ylabel = 'Percentages of ' + planet.name + ' in constellation'
    ax.set_ylabel(ylabel)
    title = 'Histogram of ' + planet.name + ' in constellation by group'
    ax.set_title(title)
    plt.show()
astronaut_sun_constellation = Counter(
    get_planet_constellation(sun,astronauts))
model_sun_constellation = Counter(get_planet_constellation(sun,models))
plot_bar_group(sun,astronaut_sun_constellation,model_sun_constellation)
a = list(astronaut_sun_constellation.values())
b = list(model_sun_constellation.values())
s = np.array([a,b])
stat,p,dof,expected = stats.chi2_contingency(s)
print(stat,expected)
prob = 0.95
critical = stats.chi2.ppf(prob,dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
# interpret p-value
alpha = 1.0 - prob
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
https://www.dropbox.com/s/w7rye6m5lbihjlh/astronauts.csv https://www.dropbox.com/s/xlxanr0pxqtxcvv/models.csv
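Solution
I finally found the bug. When the counters are passed as lists to the chi-square function (stats.chi2_contingency in the code above), they must first be sorted so that both lists follow the same constellation order. A Counter returns its values in the order the keys were first seen, so without sorting, the same column can hold different constellations in the two rows and the test sees large differences between the counts. With the counts aligned, none of the astrological effects is significant at the 0.95 level, as expected.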
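The fix described above can be sketched as follows. This is a minimal illustration, not the exact code from the original post: the helper name `aligned_counts` and the sample counts are made up for the example. It builds the contingency table from the sorted union of keys, so each column refers to the same constellation in both rows and a constellation missing from one group is counted as 0.

```python
from collections import Counter

import numpy as np
from scipy import stats


def aligned_counts(counter_a, counter_b):
    """Hypothetical helper: build a 2 x k table with matching columns."""
    # Sorted union of keys, so both rows use the same category order
    categories = sorted(set(counter_a) | set(counter_b))
    table = np.array([
        [counter_a[c] for c in categories],  # a missing key counts as 0
        [counter_b[c] for c in categories],
    ])
    return categories, table


# Made-up counts with identical distributions but different insertion order
a = Counter({'Leo': 30, 'Virgo': 20, 'Libra': 10})
b = Counter({'Libra': 10, 'Leo': 30, 'Virgo': 20})

# Naive version: relies on insertion order, so the columns are misaligned
naive = np.array([list(a.values()), list(b.values())])
_, p_naive, _, _ = stats.chi2_contingency(naive)

# Aligned version: every column is the same constellation in both rows
categories, table = aligned_counts(a, b)
stat, p, dof, expected = stats.chi2_contingency(table)

print('naive p-value:  ', p_naive)
print('aligned p-value:', p)
```

In the question's code, the same idea would apply to `astronaut_sun_constellation` and `model_sun_constellation` in place of the lists `a` and `b`.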