Python在文件中找到不存在的字符,替换为非预期字符非英语字符的编码问题

问题描述

我在python中创建了一个脚本,用于修复.srt文件中错误编码的土耳其语字符。 例如用正确的字符“ı”代替“ý”。

我打开文件(读取),将行遍历到.replace('ý','ı'),然后使用'w',encoding='utf8'将新的行集写入新文件。 第一次很棒!问题在于,每次迭代都通过用其他2个字符替换固定字符来弄乱固定字符。如有需要,可以提供更多信息!

部分输入内容:

yakýn deðillerdi,ama
bir þeyler yapmak istedim

第一次输出:

yakın değillerdi,ama
bir şeyler yapmak istedim

第二次输出:

yakın değillerdi,ama
bir ÅŸeyler yapmak istedim

第三次输出:

yakın değillerdi,ama
bir ÅŸeyler yapmak istedim

每次运行都会变得更糟。有什么想法吗?如果我不得不猜测,我找到的字符('ý')与文件中已有的('ı')匹配,然后将其替换为('ı'),该字符被错误地编码为('ı' )?这也不是每次都进行的系统更改(请参阅第二->第三迭代),所以我很困惑。 我是个新手,所以请原谅我可能没有的任何“明显”知识!

编辑: 要求的代码:

import os

directoryPath = 'D:\\tv\\b99'

fileTypes = ['.srt']

fullFilePaths = []

def get_filepaths(directory,filetype):
    """
    This function will generate the file names in a directory
    tree by walking the tree either top-down or bottom-up. For each
    directory in the tree rooted at directory top (including top itself),it yields a 3-tuple (dirpath,dirnames,filenames).
    """
    filePathslist = []
    for root,directories,files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root,filename)
            # include only the specific file types,except their hidden/shadow files
            if filepath.endswith(filetype) and not filename.startswith('.'):
                filePathslist.append(filepath)  # Add it to the list.
    return filePathslist

n=0
def replaceChars(folderAsListOfPaths):
    """
    This function takes a list as argument,containing file paths.
    The file is read line by line,and for each of the "special" 
    characters in Turkish that get encoded incorrectly,the appropriate 
    replacement - shown below - is made,and the existing file is overwritten.
    ('ý'->'ı') / ('Ý'->'İ') / ('þ'->'ş') / ('Þ'->'Ş') / ('ð'->'ğ')
    The filenames are printed when the replacement is done,for confirmation.
    """

    # read file line by line
    file = open(folderAsListOfPaths[n],"r")
    lines = file.readlines()

    newFileContent = ''
    for line in lines:
        origLine = line
        fixedLine = origLine.replace('ý','ı')
        fixedLine = fixedLine.replace('Ý','İ')
        fixedLine = fixedLine.replace('þ','ş')
        fixedLine = fixedLine.replace('Þ','Ş')
        fixedLine = fixedLine.replace('ð','ğ')
        newFileContent += fixedLine
    file.close()

    newFile = open(folderAsListOfPaths[n],'w',encoding='utf8')
    # print(newFileContent)
    newFile.write(newFileContent)
    newFile.close()

    cleaned_name = folderAsListOfPaths[n].replace(directoryPath,'')
    cleaned_name = cleaned_name.replace('\\','')
    print(cleaned_name)


for type in fileTypes: 
    fullFilePaths.extend(get_filepaths(directoryPath,type))
# filled the fullFilePaths list with the files

print('Finished with files:')

for file in fullFilePaths:  # for every file in this folder
    replaceChars(fullFilePaths) # replace the characters
    n+=1    # move onto the next file

解决方法

暂无找到可以解决该程序问题的有效方法,小编努力寻找整理中!

如果你已经找到好的解决方法,欢迎将解决方案带上本链接一起发送给小编。

小编邮箱:dio#foxmail.com (将#修改为@)