从文本文件中读取网址后，如何将所有响应保存到单独的文件中？

问题描述

我有一个脚本，该脚本从文本文件读取url，执行请求，然后将所有响应保存在一个文本文件中。如何将每个响应保存在不同的文本文件中，而不是全部保存在同一文件中？例如，如果我标记为input.txt的文本文件具有20个url，我想将响应保存在20个不同的.txt文件中，例如output1.txt，output2.txt，而不只是一个.txt文件。因此，对于每个请求，响应都保存在一个新的.txt文件中。谢谢

import requests
from bs4 import BeautifulSoup


with open('input.txt','r') as f_in:
    for line in map(str.strip,f_in):
        if not line:
            continue

        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data,'html.parser')
        categories = soup.find_all("a",{"class":'navlabellink nvoffset nnormal'})

        for category in categories:
            data = line + "," + category.text

            with open('output.txt','a+') as f:
                f.write(data + "\n")
                print(data)

解决方法

这是一种实现其他人提示的快速方法：

import requests
from bs4 import BeautifulSoup


with open('input.txt','r') as f_in:
    for i,line in enumerate(map(str.strip,f_in)):
        if not line:
            continue

        ...

            with open(f'output_{i}.txt','w') as f:
                f.write(data + "\n")
                print(data)

您可以使用open('something.txt','w')创建一个新文件。如果找到该文件，它将删除其内容。否则，它将创建一个名为“ something.txt”的新文件。现在，您可以使用file.write()来编写您的信息！

我不确定我是否理解您的问题。

我将创建一个数组/列表，并为每个url请求和响应创建一个对象。然后将对象添加到数组/列表中，并为每个对象写入不同的文件。

至少有两种方法可以为每个URL生成文件。如下所示，一个方法是创建文件的某些数据唯一数据的哈希。在这种情况下，我选择了类别，但是您也可以使用文件的全部内容。这样会创建一个用于文件名的唯一字符串，这样，具有相同类别文本的两个链接在保存时不会相互覆盖。

未显示的另一种方法是在数据本身中找到一些唯一值，并将其用作文件名而不散列。但是，由于不应该信任Internet上的数据，这可能会导致更多的问题，而无法解决。

这是您的代码，其中MD5哈希用于文件名。 MD5 is not a secure hashing function用于输入密码，但可以安全地创建唯一的文件名。

更新的代码段

import hashlib

import requests
from bs4 import BeautifulSoup



with open('input.txt','r') as f_in:
    for line in map(str.strip,f_in):
        if not line:
            continue

    response = requests.get(line)
    data = response.text
    soup = BeautifulSoup(data,'html.parser')
    categories = soup.find_all("a",{"class":'navlabellink nvoffset nnormal'})

    for category in categories:
        data = line + "," + category.text
        filename = hashlib.sha256()
        filename.update(category.text.encode('utf-8'))
        with open('{}.html'.format(filename.hexdigest()),'w') as f:
            f.write(data + "\n")
            print(data)

已添加代码

filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()),'w') as f:

捕获更新的页面

如果您关心在不同的时间点捕获页面的内容，请对文件的全部内容进行哈希处理。这样，即使页面中的任何内容发生更改，页面的先前内容也不会丢失。在这种情况下，我将同时对url和文件内容进行哈希处理，并将哈希值与URL哈希值（随后是文件内容的哈希值）连接起来。这样，对目录进行排序时，文件的所有版本都是可见的。

hashed_contents = hashlib.sha256()
hashed_contents.update(category['href'].encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()),'w') as f:

for category in categories:
        data = line + "," + category.text
        hashed_url = hashlib.sha256()
        hashed_url.update(category['href'].encode('utf-8'))
        page = requests.get(category['href'])
        hashed_content = hashlib.sha256()
        hashed_content.update(page.text.encode('utf-8')
        filename = '{}_{}.html'.format(hashed_url.hexdigest(),hashed_content.hexdigest())
        with open('{}.html'.format(filename.hexdigest()),'w') as f:
            f.write(data + "\n")
            print(data)

beautifulsoup file file file http-post python python-requests