Problem description
I am trying to get a better understanding of regexp_extract in PySpark, and I am experimenting with the example below.
Here is my DataFrame:
data = [('2345','Checked|by John|for kamal'),('2398','Checked|by John|for kamal '),('2328','Verified|by Srinivas|for kamal than some random text'),('3983','Verified|for Stacy|by John')]
df = sc.parallelize(data).toDF(['ID','Notes'])
df.show(truncate=False)
+----+-----------------------------------------------------+
| ID| Notes |
+----+-----------------------------------------------------+
|2345|Checked|by John|for kamal |
|2398|Checked|by John|for kamal |
|2328|Verified|by Srinivas|for kamal than some random text |
|3983|Verified|for Stacy|by John |
+----+-----------------------------------------------------+
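What I want to determine here is whether an ID was Checked or Verified by John.
With help from SO members, I worked out how to use regexp_extract and came up with the following solution:
from pyspark.sql.functions import col, regexp_extract
result = df.withColumn('Employee',regexp_extract(col('Notes'),'(Checked|Verified)(\\|by John)',1))
result.show(truncate=False)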
+----+------------------------------------------------+------------+
| ID| Notes |Employee|
+----+------------------------------------------------+------------+
|2345|Checked|by John|for kamal | Checked|
|2398|Checked|by John|for kamal | Checked|
|2328|Verified|by Srinivas|for kamal than some random text| |
|3983|Verified|for Stacy|by John | |
+----+------------------------------------------------+------------+
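This gives the right result for some IDs, but for the last ID it does not print Verified. Can someone tell me whether I need to do anything else in the regular expression above?
I think (Checked|Verified)(\\|by John) only matches when the two values are adjacent. I tried * and $, but Verified is still not printed for ID 3983.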
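The adjacency suspicion can be checked without a Spark session. The following is a minimal sketch, assuming Python's re module treats this simple pattern the same way as the Java regex engine that Spark uses; the helper extract_group is a made-up stand-in for regexp_extract, which returns an empty string when nothing matches.

```python
import re

# The four Notes values from the question, keyed by ID
notes = {
    "2345": "Checked|by John|for kamal",
    "2398": "Checked|by John|for kamal ",
    "2328": "Verified|by Srinivas|for kamal than some random text",
    "3983": "Verified|for Stacy|by John",
}

def extract_group(pattern, text, idx=1):
    """Hypothetical stand-in for regexp_extract: group idx of the first match, or ''."""
    m = re.search(pattern, text)
    return m.group(idx) if m else ""

# Same pattern as in the question; '\\|' in a normal string is the regex \| (a literal pipe)
adjacent = "(Checked|Verified)(\\|by John)"
original_result = {k: extract_group(adjacent, v) for k, v in notes.items()}
print(original_result)
# {'2345': 'Checked', '2398': 'Checked', '2328': '', '3983': ''}
```

Row 3983 comes back empty because the pattern demands that |by John follows Verified immediately, while in that row |for Stacy sits in between.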
Solutions
I would phrase the regex as:
(Checked|Verified)\b.*\bby John
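This pattern looks for Checked/Verified followed by by John, with any amount of text in between. Note that I only use word boundaries here rather than pipes.
Updated code (written as a raw string so that Python does not turn \b into a backspace character):
result = df.withColumn('Employee',regexp_extract(col('Notes'),r'\b(Checked|Verified)\b.*\bby John',1))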
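One thing to watch when passing a word-boundary pattern from Python: inside a normal string literal, \b is the backspace character, not the regex word boundary, so the pattern should be a raw string (or use \\b). The sketch below illustrates the difference with Python's re module, on the assumption that it behaves like Spark's Java regex engine for this pattern; extract_all is a hypothetical helper, not a PySpark function.

```python
import re

rows = [
    "Checked|by John|for kamal",
    "Checked|by John|for kamal ",
    "Verified|by Srinivas|for kamal than some random text",
    "Verified|for Stacy|by John",
]

def extract_all(pattern):
    """Hypothetical helper: group 1 of the first match per row, '' if no match."""
    out = []
    for text in rows:
        m = re.search(pattern, text)
        out.append(m.group(1) if m else "")
    return out

# A normal string literal turns \b into the backspace character 0x08
print('\b' == '\x08')  # True

# Non-raw string: the regex engine receives literal backspaces and finds nothing
broken = extract_all('\b(Checked|Verified)\b.*\bby John')
# Raw string: the regex engine receives \b and treats it as a word boundary
fixed = extract_all(r'\b(Checked|Verified)\b.*\bby John')

print(broken)  # ['', '', '', '']
print(fixed)   # ['Checked', 'Checked', '', 'Verified']
```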
Another approach is to check whether the Notes column contains the string by John:
import pyspark.sql.functions as F
df.withColumn('Employee',F.when(F.col('Notes').like('Checked|%by John%'),'Checked').when(F.col('Notes').like('Verified|%by John%'),'Verified').otherwise(" ")).show(truncate=False)
+----+----------------------------------------------------+--------+
|ID |Notes |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal |Checked |
|2398|Checked|by John|for kamal |Checked |
|2328|Verified|by Srinivas|for kamal than some random text| |
|3983|Verified|for Stacy|by John |Verified|
+----+----------------------------------------------------+--------+
You can try this regular expression:
import pyspark.sql.functions as F
result = df.withColumn('Employee',F.regexp_extract('Notes','(Checked|Verified)\\|.*by John',1))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...| |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
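The pipe-based pattern here and a word-boundary variant such as \b(Checked|Verified)\b.*\bby John agree on the question's data, but they are not interchangeable. The sketch below compares them with Python's re module, assuming it matches Spark's Java regex behaviour for these patterns; the third sample row and the helper first_group are invented for illustration and do not appear in the original data.

```python
import re

pipe_pattern = '(Checked|Verified)\\|.*by John'           # requires a literal | right after the keyword
boundary_pattern = r'\b(Checked|Verified)\b.*\bby John'   # only requires word boundaries

samples = [
    "Verified|for Stacy|by John",                            # row 3983 from the question
    "Verified|by Srinivas|for kamal than some random text",  # row 2328 from the question
    "Checked by John",                                       # hypothetical row with no pipe
]

def first_group(pattern, text):
    """Hypothetical helper: group 1 of the first match, or '' if there is none."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""

comparison = [
    (first_group(pipe_pattern, s), first_group(boundary_pattern, s))
    for s in samples
]
print(comparison)
# [('Verified', 'Verified'), ('', ''), ('', 'Checked')]
```

Which one is preferable depends on whether the pipe delimiter is guaranteed in Notes: the pipe version is stricter, the word-boundary version is more tolerant of free-form text.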