How to use regexp_extract when the pattern can appear anywhere in the string - pyspark

Problem description

I am trying to get a better understanding of regexp_extract in pyspark and am experimenting with it.

Below is my DataFrame:

data = [('2345','Checked|by John|for kamal'),('2398','Checked|by John|for kamal '),('2328','Verified|by Srinivas|for kamal than some random text'),('3983','Verified|for Stacy|by John')]

df = sc.parallelize(data).toDF(['ID','Notes'])

df.show()

+----+-----------------------------------------------------+
|  ID|               Notes                                 |
+----+-----------------------------------------------------+
|2345|Checked|by John|for kamal                            |
|2398|Checked|by John|for kamal                            |
|2328|Verified|by Srinivas|for kamal than some random text |
|3983|Verified|for Stacy|by John                           |
+----+-----------------------------------------------------+

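Here I am trying to determine whether an ID was checked or verified by John.

With help from SO members, I worked out how to use regexp_extract and came up with the following solution: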

from pyspark.sql.functions import col, regexp_extract

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))

result.show()

+----+----------------------------------------------------+--------+
|  ID|Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           | Checked|
|2398|Checked|by John|for kamal                           | Checked|
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+

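This gives the right result for the first few IDs, but for the last ID it does not print Verified. Can someone tell me whether the regex needs anything else?

I suspect that (Checked|Verified)(\\|by John) only matches when the two parts are adjacent. I tried * and $, but Verified is still not printed for ID 3983.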

Solution

I would phrase the regex as:

(Checked|Verified)\b.*\bby John


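This pattern looks for Checked or Verified followed by by John, with any amount of text between the two. Note that it relies on word boundaries (\b) rather than the literal pipe.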
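The difference between the question's pattern and this one can be checked outside Spark. The sketch below uses Python's built-in re module rather than Spark's Java regex engine, on the assumption that the two behave the same for the simple constructs involved (alternation, \b, .*). The names notes and first_group are illustrative and do not come from the question.

```python
import re

# Notes values copied from the question's DataFrame.
notes = [
    'Checked|by John|for kamal',
    'Checked|by John|for kamal ',
    'Verified|by Srinivas|for kamal than some random text',
    'Verified|for Stacy|by John',
]

# Question's pattern: "|by John" must come immediately after the keyword.
adjacent = r'(Checked|Verified)(\|by John)'
# Relaxed pattern: any text may sit between the keyword and "by John".
anywhere = r'(Checked|Verified)\b.*\bby John'


def first_group(pattern, text):
    """Return capture group 1, or '' when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else ''


adjacent_hits = [first_group(adjacent, n) for n in notes]
anywhere_hits = [first_group(anywhere, n) for n in notes]

print(adjacent_hits)
print(anywhere_hits)
```

Only the last note differs between the two lists: the adjacent pattern yields an empty string for it, while the relaxed pattern yields Verified.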

Updated code:

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'\b(Checked|Verified)\b.*\bby John', 1))
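A detail worth checking when passing this pattern from Python: in an ordinary (non-raw) string literal, '\b' is the backspace character, not the regex word boundary, so the pattern should be written as a raw string or with doubled backslashes. The short check below uses only Python's re module. It is assumed, not verified here, that Spark's Java regex engine likewise treats a literal backspace as an ordinary character to match.

```python
import re

note = 'Verified|for Stacy|by John'

# Ordinary literal: each '\b' is converted to a backspace character (0x08).
plain = '\b(Checked|Verified)\b.*\bby John'
# Raw literal: the backslash and the 'b' both reach the regex engine.
raw = r'\b(Checked|Verified)\b.*\bby John'

plain_match = re.search(plain, note)
raw_match = re.search(raw, note)

print(repr(plain[0]))      # first character of the non-raw pattern
print(plain_match)         # no match: the note contains no backspace
print(raw_match.group(1))  # the captured keyword
```

Writing the pattern as '\\b(Checked|Verified)\\b.*\\bby John' is equivalent to the raw-string form.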

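Another approach is to check whether the Notes column contains the string by John:

import pyspark.sql.functions as F

# Assumes a row that mentions "by John" but not "Checked|by John" was verified by John.
df.withColumn('Employee', F.when(F.col('Notes').like('%Checked|by John%'), 'Checked').when(F.col('Notes').like('%by John%'), 'Verified').otherwise(' ')).show(truncate=False)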

+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |Verified|
+----+----------------------------------------------------+--------+

You can try this regex:

import pyspark.sql.functions as F

result = df.withColumn('Employee',F.regexp_extract('Notes','(Checked|Verified)\\|.*by John',1))

result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...|        |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
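This pattern and the word-boundary pattern from the earlier answer agree on the question's data but can differ on other inputs. The sketch below compares them with Python's re module, again assuming Spark's Java regex engine behaves the same for these constructs. The sample strings are hypothetical and do not come from the question.

```python
import re

pipe_pattern = r'(Checked|Verified)\|.*by John'
boundary_pattern = r'(Checked|Verified)\b.*\bby John'

# Hypothetical notes chosen to expose the differences.
samples = {
    'pipe_separated': 'Verified|for Stacy|by John',
    'space_separated': 'Verified for Stacy by John',
    'embedded_by': 'Verified|standby John',
}


def matched(pattern, text):
    """True when the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# For each sample: (pipe_pattern matched, boundary_pattern matched).
results = {
    name: (matched(pipe_pattern, text), matched(boundary_pattern, text))
    for name, text in samples.items()
}

for name, outcome in results.items():
    print(name, outcome)
```

The pipe pattern requires a literal | right after the keyword, so it misses the space-separated note. Without \b it also accepts by John inside a longer word such as standby, which the word-boundary pattern rejects.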