PySpark: unpacking a column of nested lists of strings with differing lengths

Problem Description

I have a PySpark dataframe containing nested lists of differing lengths. I need to unpack it so that the ID is retained for each entry of the nested lists, giving the following columns:

ID          BioID    Pvalue    Significance
Sample1     "AATC"    0.01          1
Sample2     "AATC"    0.01          1
Sample2     "AATG"    0.02          0
Sample2     "AAAA"    0.50          0
Sample3     "TGCC"    0.04          0

I tried exploding, but it just gives me more lists:

df.select("ID",F.explode("results")).show(5)

ID          col    
Sample1     ["AATC","AATC","AATG","AAAA","TGCC"]
Sample2     [0.01,0.02,0.50,0.04]
Sample2     [1,1,0]             
Sample2     ["AATC","TGCC"]             
Sample3     [0.01,0.04]     

Edit: adding the schema, based on a suggestion:

root
 |-- ID: string (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: string (containsNull = true)         
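Why explode alone is not enough here: if the schema is really Array -> Array -> string (as the solution below assumes), explode unnests only one level and emits one row per inner array, so every output row is still a list. A minimal sketch of this behavior, with hypothetical sample data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical input: an array of [BioID, Pvalue, Significance] triples per ID
nested = spark.createDataFrame(
    [("Sample1", [["AATC", "0.01", "1"], ["AATG", "0.02", "0"]])],
    "ID string, results array<array<string>>",
)

# explode() unnests one level only: each inner array becomes its own row
nested.select("ID", F.explode("results")).show(truncate=False)

#+-------+-------------+
#|ID     |col          |
#+-------+-------------+
#|Sample1|[AATC,0.01,1]|
#|Sample1|[AATG,0.02,0]|
#+-------+-------------+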


Solution

If you have a nested list with the schema shown below (Array -> Array -> string), use the higher-order function transform to combine the desired columns into structs inside the array, then use inline to explode the array of structs and get the desired output.

df.show(truncate=False)

#+-------+------------------------------------------+
#|ID     |Features                                  |
#+-------+------------------------------------------+
#|Sample1|[[AATC,0.01,1]]                           |
#|Sample2|[[AATC,0.01,1],[AATG,0.02,0],[AAAA,0.5,0]]|
#|Sample3|[[TGCC,0.04,0]]                           |
#+-------+------------------------------------------+

df.printSchema()

#root
# |-- ID: string (nullable = true)
# |-- Features: array (nullable = true)
# |    |-- element: array (containsNull = true)
# |    |    |-- element: string (containsNull = true)

from pyspark.sql import functions as F

df.withColumn("Features",F.expr("""transform(Features,x-> struct(x[0] as BioID,x[1] as Pvalue,x[2] as Significance))"""))\
  .select("ID",F.expr("""inline(Features)""")).show()

#+-------+-----+------+------------+
#|     ID|BioID|Pvalue|Significance|
#+-------+-----+------+------------+
#|Sample1| AATC|  0.01|           1|
#|Sample2| AATC|  0.01|           1|
#|Sample2| AATG|  0.02|           0|
#|Sample2| AAAA|   0.5|           0|
#|Sample3| TGCC|  0.04|           0|
#+-------+-----+------+------------+
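For reference, a self-contained sketch that reproduces the whole pipeline end to end. The sample rows are taken from the tables above; the session setup and the string-typed inner arrays are assumptions, and transform requires Spark 2.4 or later:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# rebuild the example input: an array of [BioID, Pvalue, Significance]
# triples per ID, stored as array<array<string>>
df = spark.createDataFrame(
    [
        ("Sample1", [["AATC", "0.01", "1"]]),
        ("Sample2", [["AATC", "0.01", "1"], ["AATG", "0.02", "0"], ["AAAA", "0.5", "0"]]),
        ("Sample3", [["TGCC", "0.04", "0"]]),
    ],
    "ID string, Features array<array<string>>",
)

result = (
    df.withColumn(
        "Features",
        F.expr("transform(Features, x -> struct(x[0] as BioID, x[1] as Pvalue, x[2] as Significance))"),
    )
    .select("ID", F.expr("inline(Features)"))
)

result.show()

Note that because the inner arrays are array<string>, Pvalue and Significance come out of inline as strings; cast them if numeric types are needed downstream, e.g. result.withColumn("Pvalue", F.col("Pvalue").cast("double")).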
