How do I convert a semi-structured JSON string column into a DataFrame in PySpark?

Problem description

I am trying to convert the following semi-structured JSON string, held in a single column, into a structured DataFrame:

2020-09-24T08:03:01.633Z 10.1.20.1 {"timstamp":"2020-09-24 13:33:01","sourcename":"local","Keys":-9serverkey,"Type":"status","key1":2,"key2":"INFO","key3":5145,"key4":"valuekey4","key5":"{valuekey5}","key6":0,"key7":12,"key8":0,"key9":76,"key10":5,"other_key1":5,"other_key2":"value2","other_key3":"other value 3\r\n\t\r\nSubject:\r\n\tsecurity other_key4:\t\totherKey4\r\n\taccount otherkey5:\t\tothervalue5$\r\n\taccount}

I first created a schema to load the data above into a DataFrame:


from pyspark.sql.types import StructType, StructField, DateType, StringType

# `session` is an existing SparkSession
schema = StructType([
    StructField("Date", DateType()),
    StructField("Source IP", StringType()),
    StructField("Event Type", StringType()),
])

df = session.read.option("header", "true").option("delimiter", " ").csv(
    "mypath\\logs.txt", schema=schema)

The output has the following structure:

+----------+-------------+--------------------+
|      Date|    Source IP|          Event Type|
+----------+-------------+--------------------+
|2020-09-2 |10.1.20.1    |{"timstamp":"202...|

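Now I only need the JSON fields from "timstamp" through "key10" in the log data above; the rest of the JSON string can be discarded. How can I convert the "Event Type" column, which holds the JSON string, into structured columns in this case?

Any help is appreciated.

Thanks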

Solution

No verified solution to this problem has been found yet.
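
One possible approach, offered as a sketch rather than a verified answer: read each log line as plain text and pull out the wanted keys with regular expressions instead of a JSON parser. Two observations about the sample motivate this. A space delimiter also splits inside the JSON wherever a value contains a space, and the `Keys` value (`-9serverkey`) is unquoted, so a strict parser such as `from_json` would most likely reject the payload. The names `WANTED_KEYS`, `key_pattern`, `parse_line`, `to_dataframe`, `log_time` and `source_ip` are hypothetical, and the pattern assumes the wanted values contain no commas or embedded double quotes.

```python
import re

# Keys to keep, in the order they appear in the sample ("timstamp" is the
# spelling used in the log itself). Everything after "key10" is ignored.
WANTED_KEYS = ["timstamp", "sourcename", "Keys", "Type"] + [
    f"key{i}" for i in range(1, 11)
]


def key_pattern(key):
    """Regex whose group 1 is the value of `key`, whether quoted or bare.

    The leading double quote keeps "key1" from matching inside "other_key1",
    and the '":' after the name keeps "key1" from matching "key10".
    """
    return r'"%s":"?([^",]*)"?' % re.escape(key)


def parse_line(line):
    """Plain-Python version: split one log line into a dict of strings."""
    # Only the first two spaces separate fields; the rest is the payload.
    log_time, source_ip, payload = line.split(" ", 2)
    row = {"log_time": log_time, "source_ip": source_ip}
    for key in WANTED_KEYS:
        match = re.search(key_pattern(key), payload)
        row[key] = match.group(1) if match else None
    return row


def to_dataframe(spark, path):
    """Same extraction with Spark column expressions (requires pyspark).

    Note: regexp_extract yields an empty string, not null, when a key is
    missing, unlike parse_line, which yields None.
    """
    # Imported here so parse_line can be tried without Spark installed.
    from pyspark.sql import functions as F

    lines = spark.read.text(path)  # one string column named "value"
    prefix = r"^(\S+) (\S+) "
    columns = [
        F.regexp_extract("value", prefix, 1).alias("log_time"),
        F.regexp_extract("value", prefix, 2).alias("source_ip"),
    ]
    columns += [
        F.regexp_extract("value", key_pattern(key), 1).alias(key)
        for key in WANTED_KEYS
    ]
    return lines.select(*columns)
```

With an existing session this could be called as `to_dataframe(session, "mypath\\logs.txt")`. Every extracted column is a string, so numeric fields such as `key3` would still need a `cast` afterwards. If the real logs contain valid JSON, parsing the payload with `from_json` and an explicit schema is probably the cleaner choice.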
