Clean URLs and save them to a txt file in Python 3

Problem description

I am trying to clean and normalize the URLs in a text file.

Here is my current code:

import re

with open("urls.txt",encoding='utf-8') as f:
    content = f.readlines()
content = [x.strip() for x in content]

url_format = "https://www.google"
for item in content:
    if not item.startswith(url_format):
        old_item = item
        new_item = re.sub(r'.*google',url_format,item)
        content.append(new_item)
        content.remove(old_item)

with open('result.txt',mode='wt',encoding='utf-8') as myfile:
    myfile.write('\n'.join(content))

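The problem is that when I print the old item and the new item inside the loop, every URL appears to have been cleaned. But when I print the list of URLs outside the loop, some URLs are still not cleaned: some of the bad ones were removed and others were not.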
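
The symptom can be reproduced with a small sketch. The three URLs below are made up for illustration and are not from the original file; the loop body is the same as in the code above:

```python
import re

# Hypothetical sample data: two URLs that need cleaning, then one good one.
content = ["http://google.com/a", "google.com/b", "https://www.google.com/c"]

url_format = "https://www.google"
for item in content:
    if not item.startswith(url_format):
        old_item = item
        new_item = re.sub(r'.*google', url_format, item)
        content.append(new_item)
        content.remove(old_item)

# 'google.com/b' is still in the list, uncleaned: the loop never visited it.
print(content)
# ['google.com/b', 'https://www.google.com/c', 'https://www.google.com/a']
```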

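Why are the bad URLs still in the list after my for loop removes them and appends the cleaned versions? Should this be solved in a different way?

In addition, I have noticed that running the code on a large number of URLs takes a long time; maybe I should use a different tool?

Any help would be appreciated.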

Solution

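This happens because you are removing items from the list while iterating over it, which is a bad thing to do: every removal shifts the remaining items one position to the left, so the loop skips the item that moved into the freed slot. You can either create another list with the new values and append to it, or modify the list in place using indexing. You could also just use a list comprehension for this task:

content = [item if item.startswith(url_format) else re.sub(r'.*google', url_format, item) for item in content]

Or, using another list:

new_content = []

for item in content:
    if item.startswith(url_format):
        new_content.append(item)
    else:
        # re.sub takes the replacement string as its second argument
        new_content.append(re.sub(r'.*google', url_format, item))

Or, modifying the list in place using indexing:

for i, item in enumerate(content):
    if not item.startswith(url_format):
        # assign by index instead of calling append/remove
        content[i] = re.sub(r'.*google', url_format, item)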
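
On the speed concern: `list.remove` scans the list on every call, so the original loop does roughly quadratic work on a large file. Below is a minimal single-pass sketch that reuses the pattern and prefix from the question. The helper names `clean_url` and `clean_file` are hypothetical, and unlike the original code this version skips blank lines and ends the output with a newline:

```python
import re

URL_FORMAT = "https://www.google"
# Compiled once and reused for every line; same pattern as in the question.
PATTERN = re.compile(r".*google")


def clean_url(url, url_format=URL_FORMAT):
    """Return url rewritten to start with url_format (hypothetical helper)."""
    url = url.strip()
    if url.startswith(url_format):
        return url
    # Replace everything up to and including 'google' with the wanted prefix.
    return PATTERN.sub(url_format, url)


def clean_file(src, dst):
    """Read src line by line and write the cleaned URLs to dst."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, mode="w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():  # assumption: blank lines are not wanted
                fout.write(clean_url(line) + "\n")
```

With the file names from the question, this would be called as `clean_file("urls.txt", "result.txt")`. Each line is handled exactly once and the whole file never has to be held in a list.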
