DataFrame contains only the last row when parsing XML with BeautifulSoup

Problem description

Hello, I am trying to extract some information from a published XML dataset. Here is the first part of my code:

from bs4 import BeautifulSoup as bs
import pandas as pd

content = []
with open("phosphiltestfilepmc.xml","r") as file:
    content = file.readlines()
    content = "".join(content)
    bs_content = bs(content,"lxml")
    available_contacts = 139
    start_list = 0
    input_tag = bs_content.find_all(attrs={'ref-type': 'corresp'})

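I am using the find_all function to return every element whose 'ref-type' attribute is 'corresp'. This gives me a ResultSet.

From there I iterate over the results and get each parent element, as follows: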

    l = []
    a = []
    for i in range(start_list,available_contacts):
        d = {}
        b = {}
        try:
            d['firstname'] = input_tag[i].parent('given-names')
        except:
            None
        try:
            d['lastname'] = input_tag[i].parent('surname')
        except:
            None
        try:
            d['email'] = input_tag[i].parent.parent.parent.parent('corresp')[0]('email')
        except:
            d['email'] = 'j@g.com'
        l.append(d)
    print(l)

The result of print(l) is a list of dictionaries (this is a snippet): [{'firstname': [<given-names>Inn-Ho</given-names>],'lastname': [<surname>Tsai</surname>],'email': [<email>bc201@gate.sinica.edu.tw</email>]}]

I am trying to get the text out of these dictionaries. I don't think get_text() can be used on a ResultSet.
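
For reference, a ResultSet behaves like a list of Tag objects, so get_text() has to be called on each element rather than on the ResultSet itself. Below is a minimal sketch of that idea. The `sample` string and the variable names are made up for illustration, and html.parser is used only so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for one <name> element, not the real data file
sample = "<name><surname>Tsai</surname><given-names>Inn-Ho</given-names></name>"
soup = BeautifulSoup(sample, "html.parser")

# Calling a Tag like a function is shorthand for find_all()
result = soup.find("name")("given-names")

# A ResultSet is list-like: call get_text() on each Tag it contains
names = [tag.get_text(strip=True) for tag in result]
print(names)
```

When only one match is expected, find() or select_one() returns a single Tag (or None), which avoids the extra loop.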

My solution was to iterate over them again, this time using text.strip(), see below:

        for tag,tag2,tag3,in zip(d['firstname'],d['lastname'],d['email']):
            try:
                b['First Name'] = tag.text.strip()
            except:
                None
            try:
                b['Last Name'] = tag2.text.strip()
            except:
                None
            try:
                b['Email Address'] = tag3.text.strip()
            except:
                None
            a.append(b)
    print(a)

The output of a is a list of dictionaries (this is just a snippet): [{'First Name': 'José María','Last Name': 'Gutiérrez','Email Address': 'jgutierr@icp.ucr.ac.cr'}]

The problem surfaces when I try to build a DataFrame from a:

import pandas
df = pandas.DataFrame(a)
df

The output is only the last name in the list a. Please help.

Here is a snippet of the XML:

<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article">
  <?properties open_access?>
  <front>
    <journal-Meta>
      <journal-id journal-id-type="nlm-ta">Braz J Med Biol Res</journal-id>
      <journal-id journal-id-type="iso-abbrev">Braz. J. Med. Biol. Res</journal-id>
      <journal-id journal-id-type="publisher-id">bjmbr</journal-id>
      <journal-title-group>
        <journal-title>Brazilian Journal of Medical and Biological Research</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">0100-879X</issn>
      <issn pub-type="epub">1414-431X</issn>
      <publisher>
        <publisher-name>Associa&#xE7;&#xE3;o Brasileira de Divulga&#xE7;&#xE3;o Cient&#xED;fica</publisher-name>
      </publisher>
    </journal-Meta>
    <article-Meta>
      <article-id pub-id-type="pmid">31721904</article-id>
      <article-id pub-id-type="pmc">6853074</article-id>
      <article-id pub-id-type="other">00606</article-id>
      <article-id pub-id-type="doi">10.1590/1414-431X20198441</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Behavioral effects of <italic>Bj</italic>-PRO-7a,a proline-rich oligopeptide from <italic>Bothrops jararaca</italic> venom</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-4646-5682</contrib-id>
          <name>
            <surname>Turones</surname>
            <given-names>L.C.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-2318-9809</contrib-id>
          <name>
            <surname>da Cruz</surname>
            <given-names>K.R.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-4061-8804</contrib-id>
          <name>
            <surname>Camargo-Silva</surname>
            <given-names>G.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-1799-1106</contrib-id>
          <name>
            <surname>Reis-Silva</surname>
            <given-names>L.L.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-4997-2658</contrib-id>
          <name>
            <surname>Graziani</surname>
            <given-names>D.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Ferreira</surname>
            <given-names>P.M.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-2836-5565</contrib-id>
          <name>
            <surname>Galdino</surname>
            <given-names>P.M.</given-names>
          </name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-0488-5400</contrib-id>
          <name>
            <surname>Pedrino</surname>
            <given-names>G.R.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0001-8738-5852</contrib-id>
          <name>
            <surname>Santos</surname>
            <given-names>R.</given-names>
          </name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-1996-0901</contrib-id>
          <name>
            <surname>Costa</surname>
            <given-names>E.A.</given-names>
          </name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0001-5709-9329</contrib-id>
          <name>
            <surname>Ianzer</surname>
            <given-names>D.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="corresp" rid="cor1">*</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-4006-8213</contrib-id>
          <name>
            <surname>Xavier</surname>
            <given-names>C.H.</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="corresp" rid="cor1">*</xref>
        </contrib>
        <aff id="aff1">
<label>1</label>Laborat&#xF3;rio de Neurobiologia de Sistemas,Departamento de Ci&#xEA;ncias Fisiol&#xF3;gicas,Instituto de Ci&#xEA;ncias Biol&#xF3;gicas,Universidade Federal de Goi&#xE1;s,Goi&#xE2;nia,GO,Brasil</aff>
        <aff id="aff2">
<label>2</label>Laborat&#xF3;rio de Farmacologia de Produtos Naturais e Sint&#xE9;ticos,Departamento de Farmacologia,Brasil</aff>
        <aff id="aff3">
<label>3</label>Departamento de Fisiologia e Biof&#xED;sica,Universidade Federal de Minas Gerais,Belo Horizonte,MG,Brasil</aff>
      </contrib-group>
      <author-notes>
        <corresp id="cor1">Correspondence: C.H. Xavier: &lt;<email>carlosxavier@ufg.br</email>&gt;</corresp>
        <fn fn-type="equal" id="fn1">
          <p>*These authors contributed equally to his work.</p>
        </fn>
      </author-notes>
      <pub-date pub-type="epub">
        <day>07</day>
        <month>11</month>
        <year>2019</year>
      </pub-date>
      <pub-date pub-type="collection">
        <year>2019</year>
      </pub-date>
      <volume>52</volume>
      <issue>11</issue>
      <elocation-id>e8441</elocation-id>
      <history>
        <date date-type="received">
          <day>12</day>
          <month>2</month>
          <year>2019</year>
        </date>
        <date date-type="accepted">
          <day>30</day>
          <month>8</month>
          <year>2019</year>
        </date>
      </history>
      <permissions>
        <license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License,which permits unrestricted use,distribution,and reproduction in any medium,provided the original work is properly cited.</license-p>
        </license>
      </permissions>
      <abstract>
        <p>The heptapeptide <italic>Bj</italic>-PRO-7a,isolated and identified from <italic>Bothrops jararaca</italic> (<italic>Bj</italic>) venom,produces antihypertensive and other cardiovascular effects that are independent on angiotensin converting enzyme inhibition,possibly relying on cholinergic muscarinic receptors subtype 1 (M<sub>1</sub>R). However,whether <italic>Bj</italic>-PRO-7a acts upon the central nervous system and modifies behavior is yet to be determined. Therefore,the aims of this study were: i) to assess the effects of acute administration of <italic>Bj</italic>-PRO-7a upon behavior; ii) to reveal mechanisms involved in the effects of <italic>Bj</italic>-PRO-7a upon locomotion/exploration,anxiety,and depression-like behaviors. For this purpose,adult male Wistar (WT,wild type) and spontaneous hypertensive rats (SHR) received intraperitoneal injections of vehicle (0.9% NaCl),diazepam (2 mg/kg),imipramine (15 mg/kg),<italic>Bj</italic>-PRO-7a (71,213 or 426 nmol/kg),pirenzepine (852 nmol/kg),&#x3B1;-methyl-DL-tyrosine (200 mg/kg),or chlorpromazine (2 mg/kg),and underwent elevated plus maze,open field,and forced swimming tests. The heptapeptide promoted anxiolytic and antidepressant-like effects and increased locomotion/exploration. These effects of <italic>Bj</italic>-PRO-7a seem to be dependent on M<sub>1</sub>R activation and dopaminergic receptors and rely on catecholaminergic pathways.</p>
      </abstract>
      <kwd-group>
        <kwd><italic>Bj</italic>-PRO-7a</kwd>
        <kwd>Snake venom</kwd>
        <kwd>Neuroactive compounds</kwd>
        <kwd>Anxiety</kwd>
        <kwd>Depression</kwd>
        <kwd>Behavior</kwd>
      </kwd-group>
      <counts>
        <fig-count count="9"/>
        <table-count count="0"/>
        <equation-count count="0"/>
        <ref-count count="35"/>
      </counts>
    </article-Meta>
  </front>

Here is the whole script:

from bs4 import BeautifulSoup as bs
import pandas as pd

content = []
with open("phosphiltestfilepmc.xml","r") as file:
    content = file.readlines()
    content = "".join(content)
    bs_content = bs(content,"lxml")
    available_contacts = 139
    start_list = 0
    #article_Meta = bs_content.find_all('article-Meta')
    input_tag = bs_content.find_all(attrs={'ref-type': 'corresp'})
    
    # something = []
    # for link in input_tag:
    #     something.append(link.parent.get('given-names'))
    # print(something)
    
    l = []
    a = []
    for i in range(start_list,available_contacts):
        d = {}
        b = {}
        try:
            d['firstname'] = input_tag[i].parent('given-names')
        except:
            None
        try:
            d['lastname'] = input_tag[i].parent('surname')
        except:
            None
        try:
            d['email'] = input_tag[i].parent.parent.parent.parent('corresp')[0]('email')
        except:
            d['email'] = 'j@g.com'
        l.append(d)
    #print(l)
    
        for tag,tag2,tag3,in zip(d['firstname'],d['lastname'],d['email']):
            try:
                b['First Name'] = tag.text.strip()
            except:
                None
            try:
                b['Last Name'] = tag2.text.strip()
            except:
                None
            try:
                b['Email Address'] = tag3.text.strip()
            except:
                None
            a.append(b)
    print(a)
    
import pandas
df = pandas.DataFrame(a)
df
  
    

Solution

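I hope I understood your question correctly: you want to extract the names from the <contrib> tags that contain an <xref ref-type="corresp"> (txt holds the XML snippet from the question):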

import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt,'html.parser')

all_data = []
for contrib in soup.select('contrib:has(> xref[ref-type="corresp"])'):
    cor_id = contrib.select_one('xref[ref-type="corresp"]')['rid']
    email = soup.select_one('corresp#{} email'.format(cor_id))
    email = email.text if email else '-'

    all_data.append({
        'First Name': contrib.select_one('given-names').text,
        'Last Name': contrib.select_one('surname').text,
        'Email Address': email
    })

df = pd.DataFrame(all_data)
print(df)

Prints:

  First Name Last Name        Email Address
0         D.    Ianzer  carlosxavier@ufg.br
1       C.H.    Xavier  carlosxavier@ufg.br
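
Since txt is not defined in the snippet above, here is a self-contained sketch of the same selector logic wrapped in a function. The function name `extract_corresponding_authors` and the trimmed `sample` XML are hypothetical, introduced only for illustration, and the missing-value fallback of '-' is an assumption carried over from the email handling above:

```python
from bs4 import BeautifulSoup


def extract_corresponding_authors(xml_text):
    """Return one dict per <contrib> that has a direct <xref ref-type="corresp"> child."""
    soup = BeautifulSoup(xml_text, "html.parser")
    rows = []
    # :has(> xref[...]) keeps only <contrib> tags with a matching direct child
    for contrib in soup.select('contrib:has(> xref[ref-type="corresp"])'):
        rid = contrib.select_one('xref[ref-type="corresp"]')["rid"]
        # Follow the rid to the <corresp> note that carries the email
        email = soup.select_one("corresp#{} email".format(rid))
        given = contrib.select_one("given-names")
        surname = contrib.select_one("surname")
        rows.append({
            "First Name": given.get_text(strip=True) if given else "-",
            "Last Name": surname.get_text(strip=True) if surname else "-",
            "Email Address": email.get_text(strip=True) if email else "-",
        })
    return rows


# Trimmed, illustrative sample: only the second author is a corresponding author
sample = """
<contrib-group>
  <contrib contrib-type="author">
    <name><surname>Costa</surname><given-names>E.A.</given-names></name>
    <xref ref-type="aff" rid="aff2">2</xref>
  </contrib>
  <contrib contrib-type="author">
    <name><surname>Xavier</surname><given-names>C.H.</given-names></name>
    <xref ref-type="aff" rid="aff1">1</xref>
    <xref ref-type="corresp" rid="cor1">*</xref>
  </contrib>
</contrib-group>
<author-notes>
  <corresp id="cor1">Correspondence: C.H. Xavier: <email>carlosxavier@ufg.br</email></corresp>
</author-notes>
"""

rows = extract_corresponding_authors(sample)
print(rows)
```

The returned list of dictionaries can be passed straight to pd.DataFrame(), which creates one row per dictionary. To use it on the file from the question, read the file contents into a string and pass that string in place of sample.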