Problem description
My df has rows like the following:
index | text
0 | '28,3" LEDTV K98765 AB12345 EU'
1 | '65" LEDTV K98765 AB12345 EU'
2 | '55,3" LEDTV K98765 AB12345 EU'
3 | 'MON 22,8" LED U754 PL333 DE'
4 | 'DAB Radio Work 34RT55 Blue'
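Every TV entry starts with its screen size in inches (28,3" / 65" / 55,3") and has the word "TV" somewhere in the text.
I need to know which products are TVs and, if they are, whether their screen size is larger than 55 inches.
In this example, rows 1 and 2 meet both conditions.
The final result should be: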
index | text | tvandbiggerthan55
0 | '28,3" LEDTV K98765 AB12345 EU' | 0
1 | '65" LEDTV K98765 AB12345 EU' | 1
2 | '55,3" LEDTV K98765 AB12345 EU' | 1
3 | 'MON 22,8" LED U754 PL333 DE' | 0
4 | 'DAB Radio Work 34RT55 Blue' | 0
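How can I check the whole column in one go?
Solution
Use Series.str.extract to get the number before the " character, replace the , with . and convert to float, so the values can be compared with Series.gt. Build a second mask with Series.str.contains to find "TV". Then combine both masks with & and convert the boolean result to 1/0 integers: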
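# m1: size in inches (the number before the " character) is greater than 55
m1 = (df['text'].str.extract(r'(\d+,\d+|\d+)"', expand=False)
                .str.replace(',', '.', regex=False)
                .astype(float)
                .gt(55))
# m2: the text contains "TV"
m2 = df['text'].str.contains('TV')
# combine both masks and convert True/False to 1/0
df['tvandbiggerthan55'] = (m1 & m2).astype(int)
print(df)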
text tvandbiggerthan55
0 '28,3" LEDTV K98765 AB12345 EU' 0
1 '65" LEDTV K98765 AB12345 EU' 1
2 '55,3" LEDTV K98765 AB12345 EU' 1
3 'MON 22,8" LED U754 PL333 DE' 0
4 'DAB Radio Work 34RT55 Blue' 0
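For readers who want to run the approach above end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and wraps the two masks in a helper; the function name `flag_big_tvs` and its `min_inches` parameter are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "text": [
        '28,3" LEDTV K98765 AB12345 EU',
        '65" LEDTV K98765 AB12345 EU',
        '55,3" LEDTV K98765 AB12345 EU',
        'MON 22,8" LED U754 PL333 DE',
        'DAB Radio Work 34RT55 Blue',
    ]
})


def flag_big_tvs(frame, min_inches=55.0):
    """Hypothetical helper: 1 where the row is a TV larger than min_inches, else 0."""
    # Number directly before the inch sign, with an optional decimal comma.
    size = (
        frame["text"]
        .str.extract(r'(\d+(?:,\d+)?)"', expand=False)
        .str.replace(",", ".", regex=False)
        .astype(float)
    )
    # Rows without an inch sign yield NaN, and NaN > min_inches is False.
    is_tv = frame["text"].str.contains("TV", regex=False)
    return (size.gt(min_inches) & is_tv).astype(int)


df["tvandbiggerthan55"] = flag_big_tvs(df)
print(df)
```

Rows that contain no inch sign end up with a missing size, so they fall out of the result without any special handling.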
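Try this chained solution. The pattern ^\d|TV keeps rows that start with a digit or contain "TV", and only the whole-inch part of the size is tested for being at least 55: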
df['tvandbiggerthan55'] = (
    df.assign(tvandbiggerthan55=df[df.text.str.contains(r'^\d|TV')])
      ['tvandbiggerthan55']
      .str.extract(r'(^\d+)')
      .astype(float)
      .ge(55)
      .astype(int)
)
text tvandbiggerthan55
0 28,3" LEDTV K98765 AB12345 EU 0
1 65" LEDTV K98765 AB12345 EU 1
2 55,3" LEDTV K98765 AB12345 EU 1
3 MON 22,8" LED U754 PL333 DE 0
4 DAB Radio Work 34RT55 Blue 0
How it works
# Keep the rows whose text begins with a digit or contains "TV"; rows matching neither become NaN
df.assign(tvandbiggerthan55=df[df.text.str.contains(r'^\d|TV')])
# From the column built above, extract the leading digits (the whole-inch part of the TV size)
df.assign(tvandbiggerthan55=df[df.text.str.contains(r'^\d|TV')])['tvandbiggerthan55'].str.extract(r'(^\d+)')
# Convert the extracted inches to float and test whether they are greater than or equal to 55
((df.assign(tvandbiggerthan55=df[df.text.str.contains(r'^\d|TV')])['tvandbiggerthan55'].str.extract(r'(^\d+)')).astype(float)>=55)
# Convert the booleans from above into integers by chaining
.astype(int)
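The chained version gives the expected result on the sample data, but its filter and comparison are looser than the question's wording. The sketch below compares it with a stricter variant on a few hypothetical rows that are not from the original question. `Series.where` stands in for the `assign` step here, and the `edge`, `chained` and `strict` names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows, not from the original question, chosen to probe edge cases.
edge = pd.DataFrame({
    "text": [
        '55" LEDTV X1',     # exactly 55 inches, is a TV
        '65" Monitor X2',   # starts with a digit but is not a TV
        '55,3" LEDTV X3',   # just over 55 inches, is a TV
    ]
})

# Chained approach: keep rows that start with a digit OR contain "TV",
# then compare only the leading whole-inch part with >= 55.
kept = edge["text"].where(edge["text"].str.contains(r'^\d|TV'))
chained = (kept.str.extract(r'(^\d+)', expand=False).astype(float) >= 55).astype(int)

# Stricter variant: parse the full decimal size, require "TV", and use > 55.
size = (
    edge["text"]
    .str.extract(r'^(\d+(?:,\d+)?)"', expand=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)
strict = (size.gt(55) & edge["text"].str.contains("TV", regex=False)).astype(int)

print(chained.tolist())  # [1, 1, 1]
print(strict.tolist())   # [0, 0, 1]
```

Which behaviour is wanted depends on the data: if every row that starts with a digit is known to be a TV and sizes of exactly 55 inches should count, the chained form may be enough.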