How do I clean up code content after scraping code snippets with bs4?

Problem description

I'm trying to scrape everything inside the code elements, but the result of my code_snippet = soup.find('code') looks a bit odd, because it shows varying data like this:

<code class="language-plaintext Highlighter-rouge">backend/src</code>
None
hh2019/09/22/dragonteaser19-rms/
<code>What do?
        list [p]ending requests
        list [f]inished requests
        [v]iew result of request
        [a]dd new request
        [q]uit
Choice? [pfvaq]
</code>
None
hh2019/01/02/exploiting-math-expm1-v8/
<code class="language-plaintext Highlighter-rouge">nc 35.246.172.142 1</code>
None
hh2018/12/23/xmas18-white-rabbit/
<code class="MathJax_Preview">n</code>
None
hh2018/12/02/pwn2win18-tpm20/
<code>Welcome to my trusted platform. Tell me what do you want:
hh2018/05/21/rctf18-stringer/
<code class="language-plaintext Highlighter-rouge">calloc</code>
None

However, printing soup = BeautifulSoup(content['value'],"html.parser") returns the correct data. I'm only interested in the content of the pre > code tags, which looks like this:

<h3 id="overview">Overview</h3>
<p>The challenge shipped with several cave templates.
A user can build a cave from an existing template and populate it with treasures in random positions.
For caves created by the gamebot,the treasures are flags.
Any user can visit a cave by providing a program written in a custom programming language.
The program has to navigate around the cave.
If it terminates on a treasure,the treasure’s contents will be printed.</p>
<p>I was drawn to this challenge because the custom programming language is compiled to machine code using LLVM,and then executed.
It seemed like a fun place to look for bugs.</p>
<p>The challenge ships the backend’s source code in <code class="language-plaintext Highlighter-rouge">backend/src</code>,some program samples in <code class="language-plaintext Highlighter-rouge">backend/samples</code>,and the prebuilt binaries in <code class="language-plaintext Highlighter-rouge">backend/build</code>.
The <code class="language-plaintext Highlighter-rouge">backend/build/SaarlangCompiler</code> executable is a standalone compiler for the language.
It’s useful for testing,but it is not used in the challenge.
The actual server is <code class="language-plaintext Highlighter-rouge">backend/build/SchlossbergCaveServer</code>.
It binds to the local port 9081,and it is exposed to other teams through a Nginx reverse proxy on port 9080.
I will use port 9081 in examples and exploits so that they can be tested locally without Nginx.</p>
<h3 id="api-interactions">API interactions</h3>
<p>The APIs are defined in <code class="language-plaintext Highlighter-rouge">backend/src/api.cpp</code>.
We will take a look at some typical API interactions.
I will prettify JSON responses for your convenience.</p>
<p>First,we need to register a user:</p>
<div class="language-plaintext Highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -c cookies -X POST -H 'Content-Type: application/json' \
       -d '{"username": "abiondo","password": "secret"}'     \
       http://localhost:9081/api/users/register
{
    "username": "abiondo"
}
</code></pre></div></div>

I want to scrape every <pre *><code> and clean it with code_snippet.get_text(), but I'm not sure what I'm missing. I'm using asyncio + feedparser + bs4 for the scraper, and at some point it gives me the wrong data.

for entrie in entries:
    print(entrie['link'])
    for content in entrie['content']:
        soup = BeautifulSoup(content['value'],"html.parser")
        code_snippet = soup.find('code')
        print(soup)

Solution

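You can try using

[code.text for code in soup.find_all("code", {"class": "language-plaintext"})]

Note that find_all returns a list-like collection of tags, so .text has to be read from each tag rather than from the result as a whole.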
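To see why soup.find('code') produced the odd output in the question, here is a minimal sketch on a made-up HTML fragment (the markup below is an assumption modelled on the page, not copied from the feed). find returns only the first <code> in the document, which is often an inline snippet, and it returns None when there is no match. Filtering on the language-plaintext class likewise picks up the inline snippets, so the parent tag has to be checked to get the <pre> blocks.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one inline <code> and one <pre><code> block.
html = (
    '<p>See <code class="language-plaintext highlighter-rouge">backend/src</code>.</p>'
    '<pre class="highlight"><code>$ curl http://localhost:9081/\n</code></pre>'
)
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, which here is the inline snippet.
first = soup.find("code")

# Filtering by class matches the inline snippet only; the block has no class.
inline_texts = [c.get_text() for c in soup.find_all("code", class_="language-plaintext")]

# Keeping only <code> tags whose direct parent is <pre> yields the blocks.
block_texts = [c.get_text() for c in soup.find_all("code") if c.parent.name == "pre"]

# A tag that does not occur gives None, which explains the None lines.
missing = soup.find("table")
```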

Alternatively, you can use the CSS selector pre > code. It selects every <code> element that is a direct child of a <pre>.

import requests
from bs4 import BeautifulSoup


url = 'https://abiondo.me/2020/03/22/saarctf20-schlossberg/'
soup = BeautifulSoup(requests.get(url).content,'html.parser')

for code in soup.select('pre > code'):
    print(code.get_text())
    print('-' * 80)

Prints:

$ curl -c cookies -X POST -H 'Content-Type: application/json' \
       -d '{"username": "abiondo","password": "secret"}'     \
       http://localhost:9081/api/users/register
{
    "username": "abiondo"
}

--------------------------------------------------------------------------------
$ curl -b cookies -X POST -H 'Content-Type: application/json' \
       -d '{"name": "MyFancyCave","template": 1}'            \
       http://localhost:9081/api/caves/rent
{
    "created": 1584867401,"id": "1584867401_1345632849","name": "MyFancyCave","owner": "abiondo","template_id": 1,"treasure_count": 0,"treasures": []
}

--------------------------------------------------------------------------------
$ curl -b cookies -X POST -H 'Content-Type: application/json' \
       -d '{"cave_id": "1584867401_1345632849","names": [    \
                "SAAR{OneFancyFlagOneFancyFlag00000000}",\
                "SAAR{TwoFancyFlagsTwoFancyFlags000000}"]}'   \
       http://localhost:9081/api/caves/hide-treasures
{
    "created": 1584867401,"treasure_count": 2,"treasures": [
        {
            "name": "SAAR{OneFancyFlagOneFancyFlag00000000}","x": 645,"y": 97
        },{
            "name": "SAAR{TwoFancyFlagsTwoFancyFlags000000}","x": 505,"y": 14
        }
    ]
}

--------------------------------------------------------------------------------


...and so on.
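The same selector should also work on the feed content from the question, since each content['value'] is an HTML string. The sketch below uses a hand-written list of dicts as a stand-in for the parsed feed entries. The links, the HTML and the helper name extract_code_blocks are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the parsed feed entries used in the question.
entries = [
    {
        "link": "https://example.com/post-1/",
        "content": [
            {
                "value": (
                    "<p>Source lives in <code>backend/src</code>.</p>"
                    "<pre><code>$ echo hello\nhello\n</code></pre>"
                )
            }
        ],
    },
    {
        "link": "https://example.com/post-2/",
        "content": [{"value": "<p>Only inline <code>calloc</code> here.</p>"}],
    },
]


def extract_code_blocks(html):
    """Return the text of every <code> that is a direct child of a <pre>."""
    soup = BeautifulSoup(html, "html.parser")
    return [code.get_text() for code in soup.select("pre > code")]


# Collect the code blocks per entry link; inline snippets are skipped.
snippets_by_link = {}
for entry in entries:
    blocks = []
    for content in entry["content"]:
        blocks.extend(extract_code_blocks(content["value"]))
    snippets_by_link[entry["link"]] = blocks
```

Entries that contain only inline code end up with an empty list instead of None, so the caller does not need a separate None check.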