How do I write a function in Python that puts the file names in a directory, together with my function's output, into a DataFrame?

Problem description

I am very new to Python and am trying to understand how to walk through a directory.

# My code, which works:

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
df = pd.read_csv("1003285474_1003285465_0a54173ed4c58b7354e0dd48.csv",encoding="utf-8")
s = ' '.join(df['transcript'])

sid = SentimentIntensityAnalyzer()
sid.polarity_scores(s)
Out[68]: {'neg': 0.046,'neu': 0.707,'pos': 0.247,'compound': 0.9922}

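As you can see above, I have two functions: one joins all the rows of a column, and the other returns the sentiment polarity scores. My goal is to walk through a folder and perform the steps above on every CSV file in it. My end goal is a data frame like this: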

filename                                             neg    neu    pos     compound
1003285474_1003285465_0a54173ed4c58b7354e0dd48.csv   0.046  0.707  0.247   0.9922
1003285474_1003285465_0a54173ed4c58b7354e0dd41.csv   0.192  0.731  0.122   0.7222

How should I loop over all the CSV files, apply the functions above to each one, and collect the results for all of them in a data frame?

Solution

import os
from glob import glob
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# use glob to get a list of csv files in a folder
files = glob('path/to/folder/*.csv')
sid = SentimentIntensityAnalyzer()
# use a dict comprehension to apply your analysis to each file
data = {os.path.basename(file): sid.polarity_scores(' '.join(pd.read_csv(file,encoding="utf-8")['transcript'])) for file in files}
# create a data frame from the dictionary above
df = pd.DataFrame.from_dict(data,orient='index')
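With orient='index', the file names end up as the index of the data frame rather than as a filename column like the one in the desired output. A minimal sketch of one way to move them into a column, assuming a data dict shaped like the one above (the values here are copied from the question's example table, not computed):

```python
import pandas as pd

# Example scores keyed by file name, shaped like the `data` dict above.
# The numbers are taken from the question's example table for illustration.
data = {
    "1003285474_1003285465_0a54173ed4c58b7354e0dd48.csv":
        {"neg": 0.046, "neu": 0.707, "pos": 0.247, "compound": 0.9922},
    "1003285474_1003285465_0a54173ed4c58b7354e0dd41.csv":
        {"neg": 0.192, "neu": 0.731, "pos": 0.122, "compound": 0.7222},
}

df = pd.DataFrame.from_dict(data, orient="index")

# Name the index "filename", then turn it into a regular column.
result = df.rename_axis("filename").reset_index()
print(result)
```

Whether to keep the file names as the index or as a column is a matter of taste; the index form makes lookups by file name (df.loc[name]) convenient.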

First, create a function that wraps your analysis:

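import os
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyse_data(file_path):
    df = pd.read_csv(file_path,encoding='utf-8')
    # join every row of the transcript column into one string
    s = ' '.join(df['transcript'])

    sid = SentimentIntensityAnalyzer()
    score = sid.polarity_scores(s)
    score['filename'] = os.path.basename(file_path)
    return score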

This function takes a file path and returns one row of the final data frame. An example return value is:

{'filename': '1003285474_1003285465_0a54173ed4c58b7354e0dd48.csv','neg': 0.046,'neu': 0.707,'pos': 0.247,'compound': 0.9922}

Then use os.walk to go through all the files in the directory and apply the function to each one.

def create_dataframe(root_dir):
    data = []
    for path,subdirs,files in os.walk(root_dir):
        for file_name in files:
            full_path = os.path.join(path,file_name)
            data.append(analyse_data(full_path))

    return pd.DataFrame(data)

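I am assuming that only CSV files exist under root_dir and its subdirectories, so there is no need to check the file type before applying the analysis function.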
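If that assumption does not hold and other file types may be present, a simple extension check could be added. The sketch below uses a hypothetical helper, iter_csv_paths (not part of the answer above), that yields only paths ending in .csv:

```python
import os

def iter_csv_paths(root_dir):
    """Yield the full path of every .csv file under root_dir, recursively.

    Hypothetical helper for illustration; the name is not from the original answer.
    """
    for path, subdirs, files in os.walk(root_dir):
        for file_name in files:
            # Compare case-insensitively so that "DATA.CSV" also matches.
            if file_name.lower().endswith(".csv"):
                yield os.path.join(path, file_name)
```

The nested loops in create_dataframe could then be replaced by a single loop over iter_csv_paths(root_dir), calling the analysis function on each yielded path.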
