Is there a way to strip HTTP headers from a partially downloaded file?

Problem description

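I have a function in a Discord bot that downloads uploaded log files for parsing (using aiohttp). To prevent abuse from very large files, it fetches only the first few and the last few bytes.

Here is my download function: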

import aiohttp

async def download_file(self, log_url):
    async with aiohttp.ClientSession() as session:
        # Grab only the first and last few bytes of the log file to prevent abuse from large files
        headers = {"Range": "bytes=0-20000,-6000"}
        async with session.get(log_url, headers=headers) as response:
            return await response.text("UTF-8")

However, with larger files I run into a problem: headers show up in the response text. Compare the following examples.

Normal file:

00:00:00.021 |N| Application PrintSystemInfo: Ryujinx Version: 1.0.6769\r\n
00:00:00.025 |N| Application Print: Operating System: Microsoft Windows 10.0.19042 (X64)\r\n
...

File with the unwanted headers:

\r\n--00000000000014435092\r\n
Content-Type: text/plain;%20charset=Windows-1252\r\n
Content-Range: bytes 0-20000/243182\r\n\r\n
00:00:00.024 |N| Application PrintSystemInfo: Ryujinx Version: 1.0.6781\r\n
00:00:00.028 |N| Application Print: Operating System: Microsoft Windows 10.0.19042 (X64)
...
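
The extra lines in the second sample look like a multipart/byteranges body, which is what a server sends with a 206 response when a request asks for more than one range (here `0-20000` and `-6000`); the boundary string is announced in the response's Content-Type header. A minimal sketch, assuming that is what is happening, is to let the standard library's `email` parser split the body instead of matching it with a regex. The helper name `split_byteranges` is hypothetical and not part of aiohttp:

```python
from email import message_from_bytes


def split_byteranges(content_type: str, body: bytes) -> list[tuple[str, bytes]]:
    """Return (Content-Range, payload) pairs from a multipart/byteranges body.

    `content_type` is the value of the HTTP response's Content-Type header;
    it carries the boundary string the parser needs.
    """
    # Prepend the Content-Type header so the MIME parser knows the boundary.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)

    # Small files (or servers that ignore Range) come back as a plain body.
    if not message.is_multipart():
        return [("", body)]

    # Each part has its own headers; the payload is that range's raw bytes.
    return [
        (part.get("Content-Range", ""), part.get_payload(decode=True))
        for part in message.get_payload()
    ]
```

With aiohttp this could be called as `split_byteranges(response.headers.get("Content-Type", ""), await response.read())`. Each payload is raw bytes and still needs decoding before it is parsed as log text.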

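I have been stripping the header information with a regex, but I wonder whether there is a more elegant or more "correct" way to handle this. Here is my current stripping approach: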

log_file = await self.download_file(attached_log.url)
# Large files show a header value when not downloaded completely.
# This regex makes sure the log text starts at the first timestamp, ignoring headers.
log_file_header_regex = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3}.*", re.DOTALL)
log_file = log_file_header_regex.search(log_file).group(0)

Any help is greatly appreciated, thanks!
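
One alternative that avoids the multipart body altogether is to send two single-range requests, since a 206 response to a single range carries just the requested bytes with no part headers. A minimal sketch under that assumption follows; `fetch_range` and `download_head_and_tail` are hypothetical names, and `session` is assumed to be an `aiohttp.ClientSession`:

```python
import asyncio


async def fetch_range(session, url: str, byte_range: str) -> str:
    """Fetch a single byte range and return it as text.

    `session` is assumed to be an aiohttp.ClientSession (or anything with
    a compatible `get` that works as an async context manager).
    """
    headers = {"Range": f"bytes={byte_range}"}
    async with session.get(url, headers=headers) as response:
        raw = await response.read()
        # A range may cut a multi-byte character in half, so decode leniently.
        return raw.decode("utf-8", errors="replace")


async def download_head_and_tail(session, url: str, head: int = 20000, tail: int = 6000):
    """Request the start and the end of the file as two separate ranges."""
    return await asyncio.gather(
        fetch_range(session, url, f"0-{head}"),   # same first range as the original request
        fetch_range(session, url, f"-{tail}"),    # suffix range: the last `tail` bytes
    )
```

Inside the bot this might be called as `head, tail = await download_head_and_tail(session, attached_log.url)`. For files shorter than the combined ranges the two pieces overlap, and a server that ignores Range returns the whole file with status 200, so the caller may still want to check `response.status`.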


aiohttp download http-headers python