Calculating the frequency and percentage frequency of all categorical variables in a Python DataFrame

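Question

I am new to Python and am working on a requirement to list all unique values in each categorical column, together with the frequency of each value and its percentage frequency within the column, using a for loop over the complete dataset. I am also not sure whether I have to use pd.Series to append the data to a DataFrame as in the attached screenshot, because the columns have different lengths depending on how many unique values each one contains.

Thanks for your help.

Below is the code I tried. It computes the counts, but I could not work out how to get the unique values and percentage frequencies for the other columns and build a DataFrame from them that can be exported to CSV.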

Count_df = []
for item in df.columns:
    Count_df_ = pd.DataFrame(df1[item].value_counts())
    Count_df.append(Count_df_)
Count_dfdf = pd.DataFrame(Count_df)
Count_dfdf
Count_dfdf.to_csv(path_or_buf = Output + '_' + 'Count_.csv')

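The expected input and output are attached below:

[Input data and expected output][1]

Thanks in advance.

Answer

There is no magic here. Just patiently build the output DataFrame one column at a time.

Here I assume the output is a single .csv file with 4 columns. In my experience, this format is more convenient for spreadsheet software than separate files. However, separate files can also be written inside the loop.

Code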

import pandas as pd

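# please provide copy-able sample data next time
# (values below are chosen to reproduce the counts in the output; row order is arbitrary)
df = pd.DataFrame(
    data={
        "Name": ["A", "B", "C", "A", "C", "F"],
        "col2": [True, False, False, True, False, False],
        "col3": [1, 2, 3, 1, 3, 1],
    }
)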

# Collect one frequency table per column and concatenate them once at the end.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
# Note: var_count stays an integer column this way; the float display (2.0)
# in the output came from appending to an empty frame in older pandas.
frames = []

# process each column
for col in df.columns:

    # get variable name and count
    df_col_count = df[col].value_counts().to_frame().reset_index()
    # rename columns
    df_col_count.columns = ["var_name", "var_count"]

    # compute frequency
    df_col_count["var_freq"] = df_col_count["var_count"] / df_col_count["var_count"].sum()

    # append column name
    df_col_count["col_name"] = col

    # sort (optional)
    # (1) by name
    df_col_count.sort_values(by="var_name", inplace=True)
    # (2) by descending frequency
    # df_col_count.sort_values(by="var_freq", ascending=False, inplace=True)

    # append
    frames.append(df_col_count)

    # For separate CSV output, write here (and "col_name" can be removed)
    # df_col_count.to_csv(f"/path/to/{col}_freq.csv")

# combine the per-column tables and reorder columns
df_ans = pd.concat(frames)[["col_name", "var_name", "var_count", "var_freq"]]
# reindex
df_ans.reset_index(drop=True, inplace=True)

# write csv
# df_ans.to_csv(f"/path/to/all_freq.csv")

Output

# Each column (variable) is sorted by name.
df_ans   

Out[12]: 
  col_name var_name  var_count  var_freq
0     Name        A        2.0  0.333333
1     Name        B        1.0  0.166667
2     Name        C        2.0  0.333333
3     Name        F        1.0  0.166667
4     col2    False        4.0  0.666667
5     col2     True        2.0  0.333333
6     col3        1        3.0  0.500000
7     col3        2        1.0  0.166667
8     col3        3        2.0  0.333333
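
Since the question asks for a percentage rather than a fraction, the same long-format table can also be sketched with `value_counts(normalize=True)`, which returns each value's share of the column directly. This is only one possible variant, not the method used above: the function name `frequency_table`, the `var_pct` column, and the sample data are all made up for illustration. Note that `value_counts` skips missing values by default, so the percentages refer to the non-missing rows.

```python
import pandas as pd


def frequency_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (column, value) with its count and percentage.

    Hypothetical helper for illustration; not part of pandas.
    """
    parts = []
    for col in df.columns:
        counts = df[col].value_counts()                # absolute counts
        shares = df[col].value_counts(normalize=True)  # fractions summing to 1
        part = pd.DataFrame(
            {
                "col_name": col,
                "var_name": counts.index,
                "var_count": counts.to_numpy(),
                # align shares to the order of counts before converting to percent
                "var_pct": (shares.reindex(counts.index) * 100).round(2).to_numpy(),
            }
        )
        parts.append(part)
    # stack the per-column tables into one long table with a fresh index
    return pd.concat(parts, ignore_index=True)


# Made-up sample data, not taken from the question
sample = pd.DataFrame(
    {
        "fruit": ["apple", "pear", "apple", "apple"],
        "ripe": [True, False, True, True],
    }
)
result = frequency_table(sample)
print(result)
```

The result can be exported in the same way as `df_ans`, for example with `result.to_csv(...)` and a path of your choice.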