How to fix a FileNotFoundError when accessing files in a corpus

Problem description

I am trying to write code to access the files in a corpus called Mini-CORE. Printing the file names and extracting the genre codes from them works without problems. However, when I try to open the files themselves to extract the text, I get

FileNotFoundError: [Errno 2] No such file or directory: '1+IN+EN+IN-IN-IN-IN+EN-EN-EN-EN+WIKI+9990014.txt'

which is the first file name in the folder. So I am confused: why does it show me the file name while claiming the file does not exist? Did I just make a syntax error somewhere? Here is my code:

import os
import re
import spacy
from spacy import displacy
from collections import Counter

nlp = spacy.load('en')

entries = os.listdir('Mini-CORE')
entry_list = []

# this returns the genre codes for each file
def genre_code(filename):
    for entry in entries:
        regex1 = r'((?<=1\+)\w*)'  # This captures the genre code
        genre = re.findall(regex1,entry)
        entry_list.append(genre)

genre_code(entries)
print(entry_list)

# FileNotFoundError???
# This captures the text after after the <h> or <p> tags
def relevant_text(filename):
    for filename in entries:
        with open(filename) as current_file:
            text = current_file.read()
            regex2 = r'((?<=<h>|<p>).*)'
            text2 = re.findall(regex2,text)
            print(text2)

print(relevant_text(entries))

Solution

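os.listdir returns file names without their directory. To open a file you also need its parent directory; otherwise open looks for the name in the current working directory and fails. pathlib is an object-oriented path library that makes it easier to pass paths around without handling directory and file names separately.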
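To make the cause concrete, here is a minimal sketch of the os.listdir behaviour and one way to work around it with os.path.join. The temporary folder, the sample file name, and the helper list_full_paths are assumptions made up for illustration; they are not part of the original question.

```python
import os
import tempfile


def list_full_paths(directory):
    """Return directory + file name for every entry in `directory` (hypothetical helper)."""
    return [os.path.join(directory, name) for name in sorted(os.listdir(directory))]


# Build a throwaway folder standing in for the corpus, so the sketch runs anywhere.
corpus_dir = tempfile.mkdtemp()
sample_name = "1+IN+EN+sample.txt"  # made-up file name in the corpus naming style
with open(os.path.join(corpus_dir, sample_name), "w") as handle:
    handle.write("<p>Hello corpus")

bare_names = os.listdir(corpus_dir)       # file names only, no directory part
full_paths = list_full_paths(corpus_dir)  # paths that open() can actually find

# Opening the joined path works regardless of the current working directory.
with open(full_paths[0]) as handle:
    first_text = handle.read()
```

Passing bare_names[0] to open would only succeed if the script happened to run from inside the corpus folder, which is why the original code raised FileNotFoundError.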

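Use Path.glob to list the directory: each returned path carries both the file name and its directory, so your program can open it directly. With some cleanup, your code could become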

from pathlib import Path
import re
import spacy
from spacy import displacy
from collections import Counter

nlp = spacy.load('en')

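# Path.glob returns a one-shot generator; keep the results in a list
# so that both functions below can iterate over them
entries = list(Path('Mini-CORE').glob("*"))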

# this returns the genre codes for each file
def genre_code(entries):
    entry_list = []
    for entry in entries:
        regex1 = r'((?<=1\+)\w*)'  # This captures the genre code
        genre = re.findall(regex1,entry.name)
        entry_list.append(genre)
    return entry_list
    
entry_list = genre_code(entries)
print(entry_list)

# This captures the text after the <h> or <p> tags
def relevant_text(entries):
    texts = []
    for filename in entries:
        with open(filename) as current_file:
            text = current_file.read()
        regex2 = r'((?<=<h>|<p>).*)'
        texts.append(re.findall(regex2, text))
    return texts

print(relevant_text(entries))
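
As a self-contained check of the pathlib approach, the following sketch runs the same two regular expressions against a throwaway folder. The temporary directory, the sample file name, and its contents are assumptions for illustration only. Note that Path.glob returns a generator that can be consumed only once, so the sketch stores the paths in a list before iterating over them twice.

```python
import re
import tempfile
from pathlib import Path

# Throwaway folder standing in for Mini-CORE (illustrative assumption).
corpus_dir = Path(tempfile.mkdtemp())
sample = corpus_dir / "1+IN+EN+sample.txt"  # made-up file name
sample.write_text("<h>Title\n<p>Body text\n")

# Materialise the generator so the paths can be reused.
entries = list(corpus_dir.glob("*"))

# entry.name is the bare file name; the genre code follows the leading "1+".
genre_codes = [re.findall(r'((?<=1\+)\w*)', entry.name) for entry in entries]

# Each entry is a full path, so it can be read without joining directories by hand.
texts = [re.findall(r'((?<=<h>|<p>).*)', entry.read_text()) for entry in entries]

print(genre_codes)
print(texts)
```

Because each Path object knows its own directory, entry.name serves the genre-code step while the full path serves the file-reading step.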