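Parsing a bibtex file fails when trying to read a new file

Problem description

I usually keep all my references in Mendeley. From there I export selected groups of references to a .bib file, which normally causes no problems. However, when I recently tried to do this, I ended up with the following message in the console: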

    Traceback (most recent call last):
      File "C:/Users/.../Abstract_App/Abstract_Read_test_5.py",line 42,in <module>
        articles_books_misc = [_ for _ in articles_books_misc if _[0] == '{']
      File "C:/Users/jcst/PycharmProjects/Abstract_App/Abstract_Read_test_5.py",in <listcomp>
        articles_books_misc = [_ for _ in articles_books_misc if _[0] == '{']
    IndexError: string index out of range

In the code below, the failing line is marked with "CODE FAILS HERE".

The code is as follows:

    import warnings
    
    warnings.simplefilter(action='ignore',category=FutureWarning)
    warnings.simplefilter(action='ignore',category=DeprecationWarning)
    warnings.simplefilter(action='ignore',category=RuntimeWarning)
    warnings.simplefilter(action='ignore',category=UserWarning)
    
    from collections import Counter
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.rcParams.update({'font.size': 7})
    import pandas as pd
    
    sns.set_style("white")
    
    # Directly read the bibtex file
    filename = 'C:\\Users\\jcst\\...\\My_Collection_Dip_Bach_UK_literature.txt' # Test_Collection1_1235_new1.txt
    
    with open(filename,encoding="ISO-8859-1") as f:
        lines = f.read().splitlines()
    
    
    rawtext = ''.join(lines)
    
    
    
    # Split by the identifiers @article,@book,@misc
    
    articles = rawtext.split('@article')
    
    articles_books = []
    
    for article in articles:
        articles_books.extend(article.split('@book'))
    
    articles_books_misc = []
    
    for article in articles_books:
        articles_books_misc.extend(article.split('@misc'))
    
    articles_books_misc = [_ for _ in articles_books_misc if _[0] == '{']  # <== CODE FAILS HERE!!!
    
    print('A total of {} articles/books/misc.'.format(len(articles_books_misc)))
    
    """
    A simple parser to parse the bibtex file. This will not work if the text contains special characters such as = or }
    
    """
    
    parsed_articles = []
    for article in articles_books_misc:
        # if '56 research ' in article:
        #    raise
        identifier = article.split(',')[0]
        content = article[len(identifier) + 1:]
        identifiers = [_.split(',')[-1] for _ in content.split(' = {')[:-1]]
        contents = [_.split('},')[0] for _ in content.split(' = {')[1:]]
        article = {}
        for k,v in zip(identifiers,contents):
            article[k] = v
    
        article['identifier'] = identifier
    
        parsed_articles.append(article)
    
    df = pd.DataFrame(parsed_articles)
    print(df)
    df.to_csv("C:\\Users\\...\\Articles_stored_2.csv",sep=";",index=False)
    
    # Remove entries with nan years
    df = df[pd.notnull(df['year'])]
    
    print('A total of {} elements remaining'.format(len(df)))
    
    # Parse files to match format
    
    df['Year'] = df['year'].astype(str).str[0:4].astype(int)
    
    # Authors
    authors_flat = []
    for authors in list(df["author"].dropna()):
        authors_flat.extend(authors.split(' and '))  # FAU format seems to be better here..
    
    publication_data = df
    
    publication_data.to_csv("C:\\Users\\jcst\\OneDrive - Danmarks Tekniske Universitet\\Skrivebord\\Private\\Python data\\pubs_test.csv",index=False)
    
    # Top 25 authors
    
    # plt.figure(figsize=(10,10),dpi=600)
    top10authors = pd.DataFrame.from_records(
        Counter(authors_flat).most_common(25),columns=["Name","Count"]
    )
    plt.figure(figsize=(8,8),dpi=600)
    sns.barplot(x="Count",y="Name",data=top10authors,palette="RdBu_r")
    plt.title("Top 25 Authors")
    plt.tight_layout()
    plt.subplot(1,1,1)
    plt.subplots_adjust(top=1,bottom=0,left=0,right=1,hspace=0.2,wspace=0.2)
    plt.show()
    plt.savefig("C:\\Users\\jcst\\...\\My_Collection_topAU.png",dpi=600)  # format='pdf'
    
    # Publications over Time
    plt.figure(figsize=(8,8),dpi=600)
    yearly = pd.DataFrame(publication_data["Year"].value_counts().reset_index())
    yearly.columns = ["Year","Count"]
    sns.lineplot(x="Year",y="Count",data=yearly)
    plt.title("Publications over Time")
    plt.xlim([1995,2019])
    plt.tight_layout()
    plt.subplots_adjust(top=1,hspace=0.3,wspace=0.3)
    plt.show()
    plt.savefig("C:\\Users\\jcst\\...\\My_Collection_topPB.png")
    
    plt.figure(figsize=(8,8),dpi=600)
    # Top 25 Journals
    top10journals = pd.DataFrame.from_records(
        Counter(publication_data["journal"]).most_common(25),columns=["Journal","Count"],)
    
    sns.barplot(x="Count",y="Journal",data=top10journals,palette="RdBu_r")
    plt.title("Top 25 Journals")
    plt.tight_layout()
    plt.show()
    plt.savefig("C:\\Users\\...\\My_Collection_topJN.png")
    
    # Top associated keywords
    
    flat_kw = [
        _.lower()
        for kws in list(publication_data["keywords"].dropna())
        for _ in kws.split(",")
    ]
    
    top10kw = pd.DataFrame.from_records(
        Counter(flat_kw).most_common(25),columns=["Keyword","Count"]
    )
    
    sns.barplot(x="Count",y="Keyword",data=top10kw,palette="RdBu_r")
    plt.title("Top 25 Associated Keywords")
    plt.tight_layout()
    plt.show()
    plt.savefig("C:\\Users\\...\\My_Collection_topKW.png")

This is the expected output (using an export that works). The other three plots are produced as well; the following is just an example:

    A total of 1235 articles/books/misc.
                                                     author  ... mendeley-tags
    0              Enterprises,Medium and Mots,Innovation  ...           NaN
    1               Gadatsch,Can Adam Albayrak und Andreas  ...           NaN
    2     Beier,Michael and Wagner,Kerstin and Beier,...  ...           NaN
    3     Emrich,Andreas and Klein,Sabine and Frey,Mi...  ...           NaN
    4               Rampersad,Giselle and Troshani,Indrit  ...           NaN
    ...                                                 ...  ...           ...
    1230                 Team,Marine Corps and Shield,Sea  ...           NaN
    1231                                          Day,marc  ...           NaN
    1232  Vitae,...........................................  ...           NaN
    1233                                                NaN  ...           NaN
    1234                                                NaN  ...           NaN
    
    [1235 rows x 23 columns]
    A total of 1119 elements remaining
    
    Process finished with exit code 0

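Comparing the export files, I noticed that they differ; for example, some references carry attributes such as:

    number = {4},pages = {412--429}

Others do not even have the same attributes, yet the algorithm runs without any problems.

Sample data from an export on which the algorithm runs smoothly: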

            @article{Enterprises2015,author = {Enterprises,Innovation},file = {:C$\backslash$:/Users/jcst/OneDrive - .../Digital{\_}capabilities{\_}for{\_}SMEs{\_}innovation.pdf:pdf},keywords = {- collaborative networks,- it,absorptive capacity,acap,cn,dc,digital capability},number = {Dc},pages = {1--20},title = {{Title : Digital capabilities for SMEs ' innovation in collaborative networks : A literature review Titre : Capacit{\'{e}}s digitales pour l ' innovation des PMEs en r{\'{e}}seaux collaboratifs : Une revue de la litt{\'{e}}rature}},year = {2015}
    }
    @article{Gadatsch2018,abstract = {A study carried out by the authors showed clear indications that many small and medium-sized enterprises (SMEs) currently do not have sufficient maturity for digital transformation. To solve the problem,it is proposed to develop an agile IT management concept in order to control the IT area dynamically and without the formal burden of classic IT management},author = {Gadatsch,Can Adam Albayrak und Andreas},doi = {.1037//0033-2909.I26.1.78},file = {:C$\backslash$:/Users/jcst/OneDrive - .../ARE SMALL AND MEDIUM-SIZED ENTERPRISES ALREADY PREPARED FOR DIGITAL TRANSFORMATION.pdf:pdf},isbn = {5856420187},journal = {Conference Paper},keywords = {Grade Coding Instructions and Tables,May 2018,SEER Program Coding and Staging Manual 2015},pages = {10},title = {{ARE SMALL AND MEDIUM-SIZED ENTERPRISES ALREADY PREPARED FOR DIGITAL TRANSFORMATION?}},url = {papers2://publication/uuid/512EBCE8-D635-4348-A67D-22DD52988F4C},volume = {Volume 201},year = {2018}
    }

or...

    @article{Beedie2016,author = {Beedie,Chris J and Hurst,Philip and Coleman,damian and Foad,Abby and Kingdom,United and Christ,Canterbury and Manikowske,Trista L and brown,Jessica M and jansson,Cristina and Smith,Jeremy D and Hayward,Reid and Colorado,northern and Morielli,Andria R and Usmani,Nawaid and Boul{\'{e}},normand G and Nijjar,Tirath and Kurian,Joseph and Tankel,Keith and Severin,Diane and Courneya,Kerry S},file = {:C$\backslash$:/Users/jcst/.../Placebo And Nocebo Effects Of A Purported Ergogenic Aid On Repeat Sprint Performance.pdf:pdf},pages = {2016},title = {{Placebo And Nocebo Effects Of A Purported Ergogenic Aid On Repeat Sprint Performance C-17 Thematic Poster - Exercise Therapy in Cancer Self-Reported Fatigue Does Not Highly Correlate with Objectively Measured Fatigue in Cancer Survivors Feasibility of an }},year = {2016}
    }
    @article{Holt2009,abstract = {Growth hormone (GH) was first isolated from the pituitary gland in the 1940s. It is believed that athletes have been abusing GH for its anabolic and lipolytic effects since the early 1980s,at least a decade before endocrinologists began to treat adults with GH deficiency. Most of our kNowledge about GH abuse is anecdotal but a number of high-profile athletes have admitted using GH. Despite its widespread abuse,there is debate about whether GH is ergogenic. Indeed most scientific studies have not shown a performance enhancing effect. This review will addresswhy this discrepancy of opinion between athletes and scientists exists and why the author believes that the scientists are wrong. copyright {\textcopyright} 2009 John Wiley {\&} Sons,Ltd.},author = {Holt,Richard I.G.},doi = {10.1002/dta.58},file = {:C$\backslash$:/.../Is human growth hormone an ergogenic aid.pdf:pdf},issn = {19427603},journal = {Drug Testing and Analysis},keywords = {Anabolic,Clinical trial,Growth hormone,Lipolytic,Performance},number = {9-10},pages = {412--418},title = {{Is human growth hormone an ergogenic aid?}},volume = {1},year = {2009}
    }

Data for which the algorithm fails with the error shown above:

    @article{Hasler2013,abstract = {....},author = {Hasler,Carol C.},doi = {10.4414/smw.2013.13714},file = {:C$\backslash$:/.../Back pain during growth.pdf:pdf},issn = {14247860},journal = {Swiss Medical Weekly},keywords = {Adolescents,Back pain,Children,Growth},number = {January},pmid = {23299906},title = {{Back pain during growth}},volume = {143},year = {2013}
    }
    @article{Swetha2018,abstract = {Introduction: .....},author = {Swetha,T. Vinaya and Babu,K. Yuvaraj and Mohanraj,Karthik Ganesh},file = {:C$\backslash$:/.../Survey on back pain.pdf:pdf},issn = {09757619},journal = {Drug Invention Today},keywords = {Acute pain,Chronic pain,Health organizations,Injury,Low back Pain,Pregnancy,Sitting posture},number = {12},pages = {2477--2480},title = {{Survey on back pain}},volume = {10},year = {2018}
    }
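
For reference, `_[0]` can only raise `IndexError: string index out of range` when `_` is an empty string, and `str.split` produces empty strings whenever the text starts or ends with the delimiter or two delimiters are adjacent. The sketch below reproduces that with a hypothetical, shortened two-entry text; whether the failing export really begins directly with `@article` (while the working exports begin with some other text) is an assumption, since the beginnings of the files are not shown here.

```python
# Hypothetical miniature of an export whose text starts directly with "@article".
rawtext = "@article{Hasler2013,year = {2013}}@article{Swetha2018,year = {2018}}"

chunks = rawtext.split('@article')
print(repr(chunks[0]))  # the chunk before the first delimiter is an empty string

# Indexing an empty string raises IndexError, exactly as in the traceback.
try:
    chunks[0][0]
except IndexError as exc:
    print('IndexError:', exc)

# str.startswith is safe on empty strings, so the filter no longer crashes.
entries = [c for c in chunks if c.startswith('{')]
print(len(entries))
```

Under that assumption, replacing `_[0] == '{'` with `_.startswith('{')` in the filter would skip the empty chunks instead of failing on them.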

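Could the reason I end up with the traceback shown above be that the data set is too small? Any help on how to proceed and generalize the parsing for all future export files from my database would be much appreciated. Thanks @hubsandspokes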
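
As one possible way to generalize the parsing, here is a minimal sketch that does not split on fixed entry types or on `' = {'`. It finds each `@type{key,` header with a regular expression and reads every field value by counting braces, so values containing `=`, `},` or nested braces are kept intact. The names `parse_entries` and `read_braced` and the `@book{Demo2018` entry are made up for illustration. The sketch assumes every field value is wrapped in braces, as in the samples above, and that braces are balanced; it is not a full BibTeX parser, and a dedicated library such as bibtexparser may be more robust.

```python
import re

# "@article{Key2015," -> entry type and citation key
ENTRY_START = re.compile(r'@(\w+)\s*\{\s*([^,\s]+)\s*,')
# "author = {" or ",author = {" -> field name; the value starts after the brace
FIELD_START = re.compile(r'\s*,?\s*([\w-]+)\s*=\s*\{')


def read_braced(text, start):
    """Read a brace-balanced value; `start` is the index just after its opening brace.

    Returns the value and the index just after the matching closing brace.
    """
    depth = 1
    i = start
    while i < len(text):
        if text[i] == '{':
            depth += 1
        elif text[i] == '}':
            depth -= 1
            if depth == 0:
                return text[start:i], i + 1
        i += 1
    return text[start:], len(text)  # unbalanced input: take the rest


def parse_entries(text):
    """Return one dict per entry, whatever its type (@article, @book, @misc, ...)."""
    entries = []
    pos = 0
    while True:
        header = ENTRY_START.search(text, pos)
        if header is None:
            return entries
        entry = {'type': header.group(1).lower(), 'identifier': header.group(2)}
        pos = header.end()
        while True:
            field = FIELD_START.match(text, pos)
            if field is None:
                break  # no further "name = {" at this position: entry is done
            value, pos = read_braced(text, field.end())
            entry[field.group(1)] = value
        entries.append(entry)


# Shortened sample; the second entry is hypothetical.
sample = (
    "@article{Hasler2013,author = {Hasler,Carol C.},"
    "title = {{Back pain during growth}},year = {2013}\n}\n"
    "@book{Demo2018,title = {A {\\'{e}} test},year = {2018}\n}"
)
entries = parse_entries(sample)
for entry in entries:
    print(entry['type'], entry['identifier'], entry.get('year'))
```

The resulting list of dicts has the same shape as `parsed_articles` in the script above, so it could be passed to `pd.DataFrame` in the same way; fields that an entry lacks would simply be missing from its dict.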

bibtex parsing parsing python-3.x text