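Problem description
I downloaded files from an online database. Every file contains one or two paragraphs in CAML format, which makes them awkward for me to use.
For example: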
"<caml:Content xmlns:caml=\"http://lc.ca.gov/legalservices/schemas/caml.1#\"><p>(a)<span class=\"EnSpace\"/>Every home solicitation contract or offer for home improvement goods or services which provides for a lien on real property is subject to the provisions of Chapter 1 (commencing with Section 1801) of Title 2 of Part 4 of Division 3.</p><p>(b)<span class=\"EnSpace\"/>For purposes of this section,“home improvement goods or services” means goods and services,as defined in Section 1689.5,which are bought in connection with the improvement of real property. Such home improvement goods and services include,but are not limited to,burglar alarms,carpeting,texture coating,fencing,air conditioning or heating\nequipment,and termite extermination. Home improvement goods include goods which,at the time of sale or subsequently,are to be so affixed to real property as to become a part of real property whether or not severable therefrom.</p></caml:Content>"
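I have a shell script that pushes this data (stored in thousands of files) into a local database. I plan to use the data in my application, but I only need the text itself, without any of the CAML tags.
I am looking for a script or tool that converts all the CAML paragraphs in all the files to plain text. I will then repopulate my local MySQL database. Does anyone know the best way to do this?
Thanks!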
Solution
After some experimenting, my solution was a Python script. More specifically, I used the BeautifulSoup library.
Like this:
# Requires the third-party packages beautifulsoup4 and lxml.
import bs4 as bs
source = """"<caml:Content xmlns:caml=\"http://lc.ca.gov/legalservices/schemas/caml.1#\"><p>(a)<span class=\"EnSpace\"/>Every home solicitation contract or offer for home improvement goods or services which provides for a lien on real property is subject to the provisions of Chapter 1 (commencing with Section 1801) of Title 2 of Part 4 of Division 3.</p><p>(b)<span class=\"EnSpace\"/>For purposes of this section,“home improvement goods or services” means goods and services,as defined in Section 1689.5,which are bought in connection with the improvement of real property. Such home improvement goods and services include,but are not limited to,burglar alarms,carpeting,texture coating,fencing,air conditioning or heating\nequipment,and termite extermination. Home improvement goods include goods which,at the time of sale or subsequently,are to be so affixed to real property as to become a part of real property whether or not severable therefrom.</p></caml:Content>"""
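# Parse the fragment with the lenient lxml HTML parser.
soup = bs.BeautifulSoup(source, "lxml")
# get_text() drops every tag and returns only the text nodes.
print(soup.get_text())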
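Output:
"(a)Every home solicitation contract or offer for home improvement goods or services which provides for a lien on real property is subject to the provisions of Chapter 1 (commencing with Section 1801) of Title 2 of Part 4 of Division 3.(b)For purposes of this section,“home improvement goods or services” means goods and services,as defined in Section 1689.5,which are bought in connection with the improvement of real property. Such home improvement goods and services include,but are not limited to,burglar alarms,carpeting,texture coating,fencing,air conditioning or heating
equipment,and termite extermination. Home improvement goods include goods which,at the time of sale or subsequently,are to be so affixed to real property as to become a part of real property whether or not severable therefrom.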
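Since the question is about thousands of files, here is a rough sketch of how the same idea could be applied in bulk. It is not the code from the answer above: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and it makes several assumptions that should be checked against the real data. It assumes each file holds one fragment, optionally stored as a JSON-style quoted string with \" and \n escapes as in the example, that the EnSpace span stands for a space, and that each <p> should become its own line. The names caml_to_text and convert_directory, and the "*.txt" pattern, are hypothetical.

```python
import json
import re
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects text nodes and ends the current line at each closing </p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: <span class="EnSpace"/> represents a space character.
        if tag == "span" and ("class", "EnSpace") in attrs:
            self.parts.append(" ")

    def handle_data(self, data):
        # Collapse line breaks inside the text so only </p> starts a new line.
        self.parts.append(re.sub(r"\s+", " ", data))

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")


def caml_to_text(raw):
    """Return the plain text of one CAML fragment, one paragraph per line."""
    text = raw.strip()
    # Assumption: the fragment may be a JSON-style quoted string. If it does
    # not decode to a string, it is treated as raw markup instead.
    try:
        decoded = json.loads(text)
        if isinstance(decoded, str):
            text = decoded
    except ValueError:
        pass
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    return "\n".join(line.strip() for line in lines if line.strip())


def convert_directory(src_dir, dst_dir, pattern="*.txt"):
    """Write a plain-text copy of every matching file; return the file count."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob(pattern)):
        text = caml_to_text(path.read_text(encoding="utf-8"))
        (dst / path.name).write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

Writing the results to a separate directory keeps the downloaded files untouched, so the existing shell script could simply be pointed at the converted copies when repopulating the database.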