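Problem description
I want to extract, as an integer, the total cost that appears between "total" and "USD" in the message string.
Example DataFrame: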
id name lastname message
0 1 John Doe John have 100 USD,so he buy 5 eggs which total cost 10 USD
1 2 Mar Aye Mar have 10 USD,he just buy a banana from another shop for 16 USD
So the final result should be:
id name lastname message total
0 1 John Doe John have 100 USD,so he buy 5 eggs which total cost 10 USD 10
1 2 Mar Aye Mar have 10 USD,he just buy a banana from another shop for 16 USD 0
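Solution
You can use a regular expression to capture the number that appears between "total" and "USD".
The code below captures the first run of digits after "total" and converts it to int; rows with no match are filled with 0. Accepting floats would need a small change to the pattern, but since the result should be an int, that is not necessary here.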
df['total'] = df['message'].str.extract(r'total.*?(\d+).*?USD', expand=False).fillna(0).astype(int)
Result:
id name lastname message total
0 1 John Doe John have 100 USD,so he buy 5 eggs which total cost 10 USD 10
1 2 Mar Aye Mar have 10 USD,he just buy a banana from another shop for 16 USD 0
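For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same idea. The helper name extract_total and the float pattern are my own illustrative additions, not part of the original answer, and I run the captured text through pd.to_numeric before filling missing values so the conversion does not depend on the column's string dtype.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mar"],
    "lastname": ["Doe", "Aye"],
    "message": [
        "John have 100 USD,so he buy 5 eggs which total cost 10 USD",
        "Mar have 10 USD,he just buy a banana from another shop for 16 USD",
    ],
})

# First run of digits after "total" that is later followed by "USD".
INT_PATTERN = r"total.*?(\d+).*?USD"
# Assumed variant that also accepts a decimal part such as "10.5".
FLOAT_PATTERN = r"total.*?(\d+(?:\.\d+)?).*?USD"


def extract_total(messages: pd.Series, pattern: str, dtype=int) -> pd.Series:
    """Hypothetical helper: capture the number, use 0 where nothing matches."""
    # expand=False returns a Series because the pattern has one capture group.
    raw = messages.str.extract(pattern, expand=False)
    # Convert captured strings to numbers, then replace missing values with 0.
    return pd.to_numeric(raw).fillna(0).astype(dtype)


df["total"] = extract_total(df["message"], INT_PATTERN, int)
print(df[["id", "name", "total"]])
```

Passing FLOAT_PATTERN with dtype=float instead keeps the decimal part if the totals are not always whole numbers.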