Null values for a column when selecting with dot or bracket notation, but not when using a UDF

Problem description

I am trying to clean up some nested data and extract the fields I care about.

The schema of my nested value is:

 |-- maritalstatus: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- text_: string (nullable = true)
 |    |-- text__extensions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- extension: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

I want to extract the text_ field into its own column.

I tried df.select(col("maritalstatus.text_")).show() and df.select(col("maritalstatus")["text_"]).show(), but both return:

+-----+
|text_|
+-----+
| null|
| null|
 ...
| null|
+-----+

When I define a UDF as:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
def getMaritalStatus(ms):
    return ms.text_
gms = udf(getMaritalStatus, StringType())

and run df.select(gms(col("maritalstatus"))).show(), it returns the data I expect.

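Interestingly, I have another nested struct field with a similar structure, but with numbers instead of names as keys, and for that one the df.select(col("birthdate")["0"]).show() notation works.

Schema of birthdate: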

root
 |-- birthdate: struct (nullable = true)
 |    |-- 0: date (nullable = true)
 |    |-- 1: integer (nullable = true)

Is it possible to extract maritalstatus.text_ without using a UDF? I have heard that UDFs perform worse than the built-in column operations.
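For reference, here is a minimal sketch of the built-in (non-UDF) ways to read a struct field. It runs against a small local DataFrame, not the Cassandra-backed one, so it only shows what each notation is normally expected to do; it does not explain the nulls above. The helper names struct_field_expr and demo, and the sample values "Married" and "Single", are made up for illustration.

```python
def struct_field_expr(struct_col, field, alias=None):
    """Build a Spark SQL expression string that selects one struct field.

    Each identifier is wrapped in backticks so that names such as `0`
    or `text_` are parsed as identifiers rather than literals.
    """
    def quote(name):
        # A literal backtick inside an identifier is escaped by doubling it.
        return "`" + name.replace("`", "``") + "`"

    return f"{quote(struct_col)}.{quote(field)} AS {quote(alias or field)}"


def demo():
    """Run the four selections on a local DataFrame (requires pyspark)."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("struct-field-demo")
        .getOrCreate()
    )

    # Hypothetical sample rows with a reduced version of the struct.
    df = spark.createDataFrame(
        [(("1", "Married"),), (("2", "Single"),)],
        "maritalstatus struct<id:string,text_:string>",
    )

    df.select(col("maritalstatus.text_")).show()               # dot notation
    df.select(col("maritalstatus")["text_"]).show()            # bracket notation
    df.select(col("maritalstatus").getField("text_")).show()   # explicit getField
    df.selectExpr(struct_field_expr("maritalstatus", "text_")).show()  # SQL string

    spark.stop()
```

Calling demo() on a machine with pyspark installed runs all four variants. If they behave as expected locally but still return null on the real table, that would point toward how the data source hands the struct to Spark rather than toward the notation itself, though that is a guess and not something confirmed in this post.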

Cassandra table structure:

CREATE TABLE keyspace.patient (
    id text PRIMARY KEY,active boolean,active_extensions list<text>,address list<frozen<address>>,birthdate frozen<tuple<date,int>>,birthdate_extensions list<text>,communication list<frozen<patient_communication>>,contact list<frozen<patient_contact>>,contained list<frozen<tuple<text,text,text>>>,deceasedboolean boolean,deceasedboolean_extensions list<text>,deceaseddatetime frozen<tuple<timestamp,deceaseddatetime_extensions list<text>,extension list<text>,gender text,gender_extensions list<text>,generalpractitioner list<text>,identifier list<frozen<identifier>>,implicitrules text,implicitrules_extensions list<text>,language text,language_extensions list<text>,link list<frozen<patient_link>>,managingorganization text,maritalstatus frozen<codeableconcept>,meta frozen<meta>,modifierextension list<text>,multiplebirthboolean boolean,multiplebirthboolean_extensions list<text>,multiplebirthinteger int,multiplebirthinteger_extensions list<text>,name list<frozen<humanname>>,photo list<frozen<attachment>>,telecom list<frozen<contactpoint>>,text_ frozen<narrative>
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL','rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy','max_threshold': '32','min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64','class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

The codeableconcept type:

CREATE TYPE keyspace.codeableconcept (
    extension list<text>,text_ text,text__extensions list<text>,id text,coding list<frozen<coding>>
);
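One thing that can be checked mechanically from the two listings in this post: the Spark schema for maritalstatus and the codeableconcept type do not list the same fields in the same order. The sketch below just makes that comparison explicit in plain Python. Whether the mismatch has anything to do with the null results is not established here. The helper names spark_struct_fields and cql_type_fields are hypothetical, and the two string constants are copied from the listings above.

```python
import re

# Copied from the printSchema() output in the question.
SPARK_SCHEMA = """
 |-- maritalstatus: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- text_: string (nullable = true)
 |    |-- text__extensions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- extension: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
"""

# Copied from the body of CREATE TYPE keyspace.codeableconcept.
CQL_TYPE_BODY = (
    "extension list<text>,text_ text,text__extensions list<text>,"
    "id text,coding list<frozen<coding>>"
)

LINE_RE = re.compile(r"^\s*((?:\|\s+)*)\|-- ([^:]+):")


def spark_struct_fields(tree, struct_name):
    """Return the direct child field names of struct_name in printSchema() text."""
    fields, base = [], None
    for line in tree.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        depth, name = match.group(1).count("|"), match.group(2)
        if base is None:
            if name == struct_name:
                base = depth          # found the struct; children are one level deeper
        elif depth == base + 1:
            fields.append(name)
        elif depth <= base:
            break                     # left the struct
    return fields


def cql_type_fields(body):
    """Return field names from a CQL type body, splitting only on top-level commas."""
    names, depth, current = [], 0, ""
    for ch in body:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            names.append(current.split()[0])
            current = ""
        else:
            current += ch
    if current.strip():
        names.append(current.split()[0])
    return names


spark_fields = spark_struct_fields(SPARK_SCHEMA, "maritalstatus")
cql_fields = cql_type_fields(CQL_TYPE_BODY)
missing_in_spark = [f for f in cql_fields if f not in spark_fields]
same_order = spark_fields == [f for f in cql_fields if f in spark_fields]

print("Spark fields:", spark_fields)
print("CQL fields:  ", cql_fields)
print("In CQL type but not in Spark schema:", missing_in_spark)
print("Shared fields in the same order:", same_order)
```

On the listings in this post, the comparison reports that coding is absent from the Spark schema and that the shared fields appear in a different order. That may be worth mentioning when asking about this on the connector's issue tracker or mailing list.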
