Problem description
I used mapply() in R to compute the percentile rank of each feature value for selected rows of a dataset. Here is the R code:
library(MASS)
boston = Boston
# Suburb(s) with lowest median home value
low.medv <- boston[boston$medv == min(boston$medv),]
low.medv
#        crim zn indus chas   nox    rm age    dis rad tax ptratio  black lstat medv
# 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90 30.59    5
# 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97 22.98    5
# percentile ranks of each feature value for the lowest-medv suburbs
perc = data.frame(round(mapply(function(x,y) ecdf(x)(y),boston,low.medv),3),row.names = paste(rownames(low.medv),'_P',sep = ""))
Expected output
perc
#        crim    zn indus  chas   nox    rm age   dis rad  tax ptratio black lstat  medv
# 399_P 0.988 0.735 0.887 0.931 0.858 0.077   1 0.057   1 0.99   0.889  1.00 0.978 0.004
# 406_P 0.996 0.735 0.887 0.931 0.858 0.136   1 0.042   1 0.99   0.889  0.35 0.899 0.004
Question:
I am trying to replicate this in Python.
Here is the Python code to reproduce the setup:
import pandas as pd
import numpy as np
from scipy.stats import percentileofscore as ptile
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
boston = load_boston()
df = pd.DataFrame(boston.data,columns=boston.feature_names)
df['medv'] = boston.target
df.columns
# Suburb(s) with lowest median home value
low_medv = df[df.medv == min(df.medv)]
low_medv
It can be done with two for loops:
perc = pd.DataFrame()
for c in low_medv.columns:
    for i in low_medv.index:
        perc.loc[i,c] = round(ptile(df[c],low_medv.loc[i,c]),3)
# ptile calculates the percentile rank of a given value
perc
But is this the most efficient way?
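One detail matters when comparing the loop with the R result: `percentileofscore` reports values on a 0 to 100 scale, and its default `kind='rank'` averages the ranks of tied values, whereas R's `ecdf(x)(y)` is the fraction of values less than or equal to `y`. The minimal sketch below uses a hypothetical toy array (not the Boston data) to show that `kind='weak'` appears to be the closer match to `ecdf`.

```python
import numpy as np
from scipy.stats import percentileofscore

# Hypothetical toy data with ties (not the Boston dataset)
values = np.array([0, 0, 0, 1, 2])
score = 0

# Default kind='rank': tied values share their average percentile rank
rank_pct = percentileofscore(values, score)

# kind='weak': percentage of values less than or equal to the score
weak_pct = percentileofscore(values, score, kind="weak")

# The same quantity as R's ecdf(values)(score), scaled to 0-100
ecdf_pct = 100 * np.mean(values <= score)

print(rank_pct, weak_pct, ecdf_pct)
```

With many ties in a column, the two kinds can differ substantially, so the choice of `kind` should follow whichever definition of percentile rank is intended.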
Solution
pandas has rank:
# min(df.medv) is not vectorized
low_medv = df[df.medv == df.medv.min()]
(df.rank(method='average',pct=True)
   .loc[low_medv.index]
   .mul(100).round(3)
)
Output:
       CRIM      ZN   INDUS   CHAS     NOX      RM    AGE    DIS     RAD     TAX  PTRATIO       B   LSTAT   medv
398  98.814  36.858  75.791  46.64  84.486   7.708  95.85  5.731  87.055  86.067   75.198  88.142  97.826  0.296
405  99.605  36.858  75.791  46.64  84.486  13.636  95.85  4.150  87.055  86.067   75.198  34.980  89.921  0.296
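Some columns in this output differ from the R result (ZN is 36.858 here but 0.735 in R). A likely reason is tie handling: `method='average'` gives tied values their average rank, while R's `ecdf(x)(y)` counts every value less than or equal to `y`, which corresponds to `method='max'`. The sketch below checks this on a hypothetical toy frame (not the Boston data); `ecdf_value` and `rows` are illustrative names, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with ties (not the Boston dataset)
toy = pd.DataFrame({
    "a": [0, 0, 0, 1, 2],
    "b": [5.0, 3.0, 3.0, 1.0, 4.0],
})

def ecdf_value(col, y):
    """Fraction of values in col that are <= y, like R's ecdf(col)(y)."""
    return float(np.mean(col.to_numpy() <= y))

# method='max' gives each tied value the highest rank in its group,
# so rank / n equals the ECDF evaluated at that value.
pct_max = toy.rank(method="max", pct=True)
pct_avg = toy.rank(method="average", pct=True)

rows = [0, 3]  # hypothetical rows of interest
ecdf_rows = pd.DataFrame(
    {c: [ecdf_value(toy[c], toy.loc[i, c]) for i in rows] for c in toy.columns},
    index=rows,
)

print(pct_max.loc[rows])
print(pct_avg.loc[rows])
print(ecdf_rows)
```

Under this assumption, `df.rank(method='max', pct=True).loc[low_medv.index].round(3)` would be the closer analogue of the R snippet, on a 0 to 1 scale without the `.mul(100)` step.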