Docker PostgreSQL - Inserting multiple rows with string and bytes data

Problem description

I'm having some trouble inserting binary values into a PostgreSQL database using bytea (via psycopg2). I'm new to Postgres, so please forgive my ignorance.

So, here is the scenario:

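I'm scraping my university's website (using BeautifulSoup) and retrieving the notices they post in the "Notifications" section.
It's a poorly built site, so the notices are just posted on a web page, with each notice's text hyperlinked to a PDF document.
I scrape 7 subpages. I know I'm not using proper web development terminology, so here is what I mean: the root URL is https://exampleuniversity.com/, and the subpages are the final path segments of 'https://exampleuniversity.com/univ-circulars/', 'https://exampleuniversity.com/univ-notification/', and so on - there are 7 of them.
These 7 subpages are kept in a list (called paths). I iterate over each one (appending it to the root URL) and scrape the notice text, the URL, and the PDF file itself. This all works like a charm. The problem comes when I need to write them to a (Docker) PostgreSQL table.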

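FYI - the Docker PostgreSQL container itself works fine.

As you can imagine, as I iterate over the list of subpages I get a batch of (scraped) texts + URLs + PDF files, which I need to write to the database.

Following https://www.postgresqltutorial.com/postgresql-python/blob/, I learned that I have to read the file with open(file,'rb').read() and wrap the result in psycopg2.Binary(<variable_name>) when inserting.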

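Fine. But this has to be done on every iteration, so I end up with one list per column (e.g., texts is one column, urls is another, and so on). So, following "psycopg2: insert multiple rows with one query" (ant32's answer), I learned that I have to do args_str = ','.join(cur.mogrify("(%s,%s,%s)",x) for x in tup) and then cur.execute("INSERT INTO table VALUES " + args_str) - please assume I have placed the variables and placeholders correctly.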

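So, on each iteration I zip the retrieved texts + URLs + open('file','rb').read() results into a tuple and try to follow the example above, and (naturally) it fails. FYI: each item in the tuple is now a group of elements that forms one row to be inserted into the database, as opposed to the columns discussed earlier. This is where I'm stuck.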
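To make the column-versus-row distinction concrete, here is a minimal sketch. The sample strings, bytes, and hash values are made-up placeholders, not data from the real site; only the variable names match the code in this post.

```python
# Hypothetical sample data standing in for one scraped batch.
texts = ["Exam schedule", "Holiday notice"]
urls = [
    "https://exampleuniversity.com/a.pdf",
    "https://exampleuniversity.com/b.pdf",
]
pdf_obj = [b"%PDF-1.4 first", b"%PDF-1.4 second"]
url_hash = ["hash-a", "hash-b"]

# Each list is one column; zip pairs the lists up so each tuple is one table row.
data_to_insert = tuple(zip(texts, urls, pdf_obj, url_hash))

for row in data_to_insert:
    # Print the text, URL, size of the binary payload, and hash for each row.
    print(row[0], row[1], len(row[2]), row[3])
```

Note that zip stops at the shortest list, so a failed download that leaves pdf_obj shorter than texts would silently drop rows.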

The error thrown is: sequence item 0: expected str instance, bytes found

In the code below, some comments are marked "TO STACKOVERFLOW". They explain more of my thinking.

The code is as follows:

for path in paths:
    for i in range(retry_attempts):
        try:
            # Create the directory to save the PDF files
            dir_to_create = f'./app/{today}/{path}'
            if not os.path.exists(dir_to_create):
                os.makedirs(dir_to_create)

            print(f'\n****** Scraping - {path} ******'.upper())
            univ_notification_url = univ_url + path
            univ_notification_page = requests.get(univ_notification_url)
            soup = BeautifulSoup(univ_notification_page.content,'html.parser')
            items = soup.find_all('div',{'class': 'entry'})
            desired_p_tags = items[0].find_all('p')

            # Drop <p> tags with no href (those containing only the heading strings below).
            # Build a new list instead of calling remove() while iterating, which skips elements.
            desired_p_tags = [
                p for p in desired_p_tags
                if p.text != 'University Circulars' and 'Circulars,Notifications,Letters' not in p.text
            ]

            # Keep only <p> tags that link to a PDF file (links to other web pages are omitted)
            desired_p_tags = [_ for _ in desired_p_tags if '.pdf' in (_.find('a'))['href']]

            texts = [_.text for _ in desired_p_tags]
            urls = [(_.find('a'))['href'] for _ in desired_p_tags]

            url_hash = [] #TO STACKOVERFLOW: I am also creating a blake2b hash for the files that I download and save it in the db. So the zipped tuple will contain this too
            pdf_obj = []
            for url in urls:
                filename = os.path.basename(url) # To get the name of the downloaded file 
                urllib.request.urlretrieve(url,f'{dir_to_create}/{filename}') # Downloading the file
                pdf_obj.append(open(f'{dir_to_create}/{filename}','rb').read()) # Appending into the list the binary form of the downloaded file
                url_hash.append(hashlib.blake2b(open(f'{dir_to_create}/{filename}','rb').read()).hexdigest()) # Hashing the file
                print(f"Downloaded {filename}\n")
                
            data_to_insert = tuple(zip(texts,urls,pdf_obj,url_hash)) #Zipping the variables into a tuple

            # DATABASE PART
            conn = psycopg2.connect(
                host = 'pg_db',port = '5432',database = 'bu_notifications',user = 'superuser',password = 'User1234!'
            )

            cur = conn.cursor()

            args_str = ','.join(cur.mogrify("(%s,%s,%s,%s)",x) for x in data_to_insert)
            cur.execute("INSERT INTO table00 (notification_text,item_url,pdf_file,pdf_hash) VALUES " + args_str)
            conn.commit()
            # Close connection and cursor (also not writing down the except clause that follows after a try)
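A side note on the download loop: it opens each file twice (once for the bytes, once for the hash) and never closes the handles explicitly. A minimal sketch of reading the file once and hashing the same bytes follows. The helper name read_and_hash is hypothetical and not part of the original code.

```python
import hashlib
from pathlib import Path


def read_and_hash(file_path):
    """Return (file_bytes, blake2b_hex_digest) for the file at file_path.

    Hypothetical helper: reads the file once and hashes the bytes in memory.
    """
    data = Path(file_path).read_bytes()  # opens, reads, and closes the file
    digest = hashlib.blake2b(data).hexdigest()
    return data, digest
```

Inside the loop this could be used as data, digest = read_and_hash(f'{dir_to_create}/{filename}'), appending data to pdf_obj and digest to url_hash.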

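Solution

OK, I managed to solve it. Here is how I did it:

I created a list containing the psycopg2.Binary(open(f'{dir_to_create}/{filename}','rb').read()) values themselves, alongside the existing lists of URLs and texts. Then I created the tuples from these lists.

Then I looped over the tuples and inserted each one into the table like this (note: there is one additional column):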

for row in data_to_insert:
    # Four columns need four placeholders; psycopg2 fills them from the row tuple
    cur.execute("INSERT INTO table00 (notification_text,item_url,pdf_file,pdf_hash) VALUES (%s,%s,%s,%s)", row)
conn.commit()
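The same loop can also be written with the cursor's executemany method. The sketch below assumes conn is an open psycopg2 connection and that each row is a 4-tuple in column order. The function name insert_rows is hypothetical. On Python 3, psycopg2 should adapt plain bytes to bytea, so the psycopg2.Binary wrapper is optional there, though wrapping as described above also works.

```python
INSERT_SQL = (
    "INSERT INTO table00 (notification_text, item_url, pdf_file, pdf_hash) "
    "VALUES (%s, %s, %s, %s)"
)


def insert_rows(conn, rows):
    """Insert every 4-tuple in rows with one executemany call, then commit once.

    Hypothetical helper; conn is assumed to be an open psycopg2 connection.
    """
    rows = list(rows)
    with conn.cursor() as cur:  # the cursor is closed when the block exits
        cur.executemany(INSERT_SQL, rows)
    conn.commit()
    return len(rows)
```

This is mainly tidier rather than faster: the psycopg2 documentation notes that executemany does not outperform calling execute in a loop. For larger batches, psycopg2.extras.execute_values is the usual suggestion.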

It now writes to the database (at least I can see it in pgAdmin). Now I need to find a way to download it locally, but that is for another Stack Overflow post.
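For anyone hitting the same error: it appears to come from Python rather than from Postgres. On Python 3, cur.mogrify returns bytes, and a str separator cannot join bytes items. The sketch below reproduces the message without a database. The bytes literals are stand-ins for what mogrify might return and are assumptions for illustration only.

```python
# Stand-ins for what cur.mogrify(...) returns on Python 3: bytes, not str.
fragments = [
    b"('text','url','\\x2550'::bytea,'hash')",
    b"('t2','u2','\\x2551'::bytea,'h2')",
]

try:
    ",".join(fragments)  # str separator with bytes items
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Joining with a bytes separator works; decode if a str query is needed.
args_bytes = b",".join(fragments)
args_str = args_bytes.decode("utf-8")
```

So the multi-row approach from the question could likely be kept by joining with b',' (or decoding each mogrify result) instead of switching to one execute call per row.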

Tags: bytea, insert, postgresql, psycopg2, python
