How to distribute rank when prior rank is zero, part 2

Problem description

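This is an extension of my earlier question, How to distribute values when prior rank is zero. That solution works great in a Postgres environment, but now I need to replicate it in a Databricks environment (Spark SQL).

The problem is the same as before, but now I am trying to work out how to convert this Postgres query to Spark SQL. Basically, it sums up the allocation amounts wherever there are gaps in the data (i.e. no micro_geo when grouping by location and geo3). The "imputed allocation" will sum to 1 for every location and zip3 group.

This is the Postgres query, which works great: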

    select location_code,geo3,distance_group,has_micro_geo,imputed_allocation from 
        (
        select ia.*,(case when has_micro_geo > 0
                     then sum(allocation) over (partition by location_code,grp)
                     else 0
                end) as imputed_allocation
        from (select s.*,count(*) filter (where has_micro_geo <> 0) over (partition by location_code,geo3 order by distance_group desc) as grp
              from staging_groups s
             ) ia
        )z

But it does not translate well, and produces this error in Databricks:

    Error in sql statement: ParseException: 
    mismatched input 'from' expecting <EOF>(line 1,pos 78)

    == sql ==
    select location_code,imputed_allocation from 
    ------------------------------------------------------------------------------^^^
        (
        select ia.*,geo3 order by distance_group desc) as grp
              from staging_groups s
             ) ia
        )z

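Or, at the least, how can I convert just the part of this inner query that creates "grp", so that the rest might then work? I have been trying to replace the filter-where logic with something else, but my attempts have not worked as intended.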

    select s.*,count(*) filter (where has_micro_geo <> 0) over (partition by location_code,geo3 order by distance_group desc) as grp
    from staging_groups s

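Here is a db-fiddle with data: https://www.db-fiddle.com/f/wisvDZJL9BkWxNFkfLXdEu/0. It is currently set to Postgres, but again, I need to run this in a Spark SQL environment. I have tried breaking this up and creating different tables, but my groups do not come out right.

Below is an image to better visualize the output: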

Solution

You need to rewrite this subquery:

    select s.*,count(*) filter (where has_micro_geo <> 0) over (partition by location_code,geo3 order by distance_group desc) as grp
    from staging_groups s

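Although the filter() clause for window and aggregate functions is standard SQL, few databases support it so far. Instead, consider a conditional window sum(), which yields the same result: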

    select s.*,sum(case when has_micro_geo <> 0 then 1 else 0 end) over (partition by location_code,geo3 order by distance_group desc) as grp
    from staging_groups s

I think the rest of the query should run fine in Spark SQL.
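To sanity-check the conditional-sum rewrite without a cluster, here is a minimal sketch that runs the whole query through Python's built-in sqlite3 module. SQLite is only a stand-in here, not Spark SQL, and it needs a SQLite build with window-function support (3.25 or later). The sample rows, the column types, and the helper name imputed_allocations are made up for illustration and are not taken from the fiddle.

```python
import sqlite3

# Hypothetical sample rows:
# (location_code, geo3, distance_group, has_micro_geo, allocation).
# Each location has a single geo3, because the outer window below
# partitions by location_code and grp only, as in the original query.
SAMPLE_ROWS = [
    ("A", "100", 1, 0, 0.25),
    ("A", "100", 2, 1, 0.25),
    ("A", "100", 3, 0, 0.25),
    ("A", "100", 4, 1, 0.25),
    ("B", "200", 1, 1, 0.5),
    ("B", "200", 2, 1, 0.5),
]

# Same shape as the Postgres query, with count(*) filter (...) replaced
# by a conditional window sum.
QUERY = """
select location_code, geo3, distance_group, has_micro_geo,
       case when has_micro_geo > 0
            then sum(allocation) over (partition by location_code, grp)
            else 0
       end as imputed_allocation
from (
    select s.*,
           sum(case when has_micro_geo <> 0 then 1 else 0 end)
               over (partition by location_code, geo3
                     order by distance_group desc) as grp
    from staging_groups s
) ia
order by location_code, geo3, distance_group
"""


def imputed_allocations(rows):
    """Load rows into an in-memory SQLite table and run the rewritten query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table staging_groups ("
        "location_code text, geo3 text, distance_group integer, "
        "has_micro_geo integer, allocation real)"
    )
    conn.executemany("insert into staging_groups values (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in imputed_allocations(SAMPLE_ROWS):
        print(row)
```

In this sample, each row without a micro geo hands its allocation to the nearest row with a higher distance_group that has one, so every location and geo3 pair still sums to 1. If a location can span several geo3 values, the outer window would presumably need geo3 in its partition as well; that is an assumption, not something the original query does.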

Since has_micro_geo is already a 0/1 flag, you can rewrite the count(filter) as

    sum(has_micro_geo) over (partition by location_code,geo3 order by distance_group desc rows unbounded preceding) as grp

Adding rows unbounded preceding avoids the default range unbounded preceding, which might be less performant.

By the way, I already wrote this in a comment on Gordon's solution to your previous question :-)

apache-spark-sql databricks gaps-and-islands sql window-functions