Problem description
I have a PySpark DataFrame that contains nested lists of different lengths. I need to unnest the DataFrame so that the ID is kept for each nested list, producing the following columns:
ID BioID Pvalue Significance
Sample1 "AATC" 0.01 1
Sample2 "AATC" 0.01 1
Sample2 "AATG" 0.02 0
Sample2 "AAAA" 0.50 0
Sample3 "TGCC" 0.04 0
I tried explode, but it just gives me more lists:
df.select("ID",F.explode("results")).show(5)
ID col
Sample1 ["AATC","AATC","AATG","AAAA","TGCC"]
Sample2 [0.01,0.02,0.50,0.04]
Sample2 [1,1,0]
Sample2 ["AATC","TGCC"]
Sample3 [0.01,0.04]
Edit: adding the schema, as suggested:
root
|-- ID: string (nullable = true)
|-- features: array (nullable = true)
| |-- element: string (containsNull = true)
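Solution
If you have a nested list whose schema looks like the one below (array -> array -> string), use the higher-order function transform (available in Spark SQL since 2.4) to convert each inner array into a struct holding the required columns, then use inline to explode the resulting array of structs and get the desired output.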
df.show(truncate=False)
#+-------+--------------------------------------------------+
#|ID |Features |
#+-------+--------------------------------------------------+
#|Sample1|[[AATC,0.01,1]] |
#|Sample2|[[AATC,0.01,1],[AATG,0.02,0],[AAAA,0.5,0]]|
#|Sample3|[[TGCC,0.04,0]] |
#+-------+--------------------------------------------------+
df.printSchema()
#root
# |-- ID: string (nullable = true)
# |-- Features: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: string (containsNull = true)
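from pyspark.sql import functions as F
# transform() maps each inner array [BioID, Pvalue, Significance] to a named struct;
# inline() then expands the array of structs into one row per struct.
df.withColumn("Features",F.expr("""transform(Features,x-> struct(x[0] as BioID,x[1] as Pvalue,x[2] as Significance))"""))\
.select("ID",F.expr("""inline(Features)""")).show()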
#+-------+-----+------+------------+
#| ID|BioID|Pvalue|Significance|
#+-------+-----+------+------------+
#|Sample1| AATC| 0.01| 1|
#|Sample2| AATC| 0.01| 1|
#|Sample2| AATG| 0.02| 0|
#|Sample2| AAAA| 0.5| 0|
#|Sample3| TGCC| 0.04| 0|
#+-------+-----+------+------------+
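The two steps used in the answer (map each inner array to a struct, then expand the array of structs into rows) can be mirrored in plain Python to show what happens to the data. This is only an illustrative sketch, not Spark code: the names Feature and transform_and_inline are hypothetical, and the input is an ordinary list of (ID, nested list) pairs standing in for the DataFrame.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    """Hypothetical stand-in for the struct built inside transform()."""
    BioID: str
    Pvalue: str
    Significance: str


def transform_and_inline(
    rows: List[Tuple[str, List[List[str]]]]
) -> List[Tuple[str, str, str, str]]:
    """Mimic transform(...) followed by inline(...) on ordinary lists."""
    flat = []
    for sample_id, features in rows:
        # "transform": map every inner array to a struct with named fields
        structs = [Feature(x[0], x[1], x[2]) for x in features]
        # "inline": emit one output row per struct, keeping the ID alongside
        for s in structs:
            flat.append((sample_id, s.BioID, s.Pvalue, s.Significance))
    return flat


# Values match the output table above; every element is a string because
# the inner arrays are arrays of strings.
data = [
    ("Sample1", [["AATC", "0.01", "1"]]),
    ("Sample2", [["AATC", "0.01", "1"], ["AATG", "0.02", "0"], ["AAAA", "0.5", "0"]]),
    ("Sample3", [["TGCC", "0.04", "0"]]),
]

result = transform_and_inline(data)
for row in result:
    print(row)
```

Because the inner arrays hold strings, Pvalue and Significance come out as strings in this sketch, and the same should be expected of the Spark columns unless they are cast to numeric types afterwards.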