Problem description
This is what my table looks like:
SELECT
user_id,order_date,product
FROM example_table
WHERE user_id = 1
ORDER BY order_date ASC
user_id | order_date | product |
---|---|---|
1 | 2021-01-01 | A |
1 | 2021-01-02 | B |
1 | 2021-01-04 | A |
1 | 2021-01-07 | C |
1 | 2021-01-09 | C |
1 | 2021-01-20 | A |
This is what I want to achieve:
user_id | order_date | product | cum_dist_count |
---|---|---|---|
1 | 2021-01-01 | A | 1 |
1 | 2021-01-02 | B | 2 |
1 | 2021-01-04 | A | 2 |
1 | 2021-01-07 | C | 3 |
1 | 2021-01-09 | C | 3 |
1 | 2021-01-20 | A | 3 |
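In other words, I want to see how many distinct products a customer has bought so far, and to be able to read that count as of any given date (in the example above, they had bought 2 distinct products as of 2021-01-04, and 3 as of 2021-01-07).
I tried grouping by user_id and product with min(order_date) in a CTE and then applying ROW_NUMBER over that CTE. That partially works: I can see the dates on which the distinct product count changes (in this example 2021-01-01, 2021-01-02 and 2021-01-07), but I lose the rows in between, which I still want to be able to access.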
with cte as (
-- first purchase date of each product, per user
SELECT
user_id,product,min(order_date) as first_order
FROM example_table
GROUP BY 1,2
)
-- number each user's products in order of first purchase
SELECT
user_id,first_order,product,ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY first_order) AS number_of_unique_products
FROM cte
WHERE user_id = 1
With the above, I would get:
user_id | order_date | product | cum_dist_count |
---|---|---|---|
1 | 2021-01-01 | A | 1 |
1 | 2021-01-02 | B | 2 |
1 | 2021-01-07 | C | 3 |
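Any help is greatly appreciated!
Solution
For each product, you can flag the earliest date on which it appears for a user, then add up those flags as a running count: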
select et.* except (seqnum),countif(seqnum = 1) over (partition by user_id order by order_date) as running_distinct_count
from (select et.*,row_number() over (partition by user_id,product order by order_date) as seqnum
from example_table et
) et
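If you want to check this logic without BigQuery access, here is a minimal sketch that runs the same idea locally through Python's built-in sqlite3 module. It is a stand-in under stated assumptions, not the BigQuery query itself: SQLite has no countif or `* except`, so the sketch uses sum(case ...) and lists the columns explicitly, and it assumes SQLite 3.25 or newer for window functions. The helper name running_distinct_counts and the ROWS constant are made up for illustration; the rows are the six from the question's target table.

```python
import sqlite3

# Sample rows (user_id, order_date, product) from the question's target table.
ROWS = [
    (1, "2021-01-01", "A"),
    (1, "2021-01-02", "B"),
    (1, "2021-01-04", "A"),
    (1, "2021-01-07", "C"),
    (1, "2021-01-09", "C"),
    (1, "2021-01-20", "A"),
]

# seqnum = 1 marks the first time a user buys a product.
# Summing those marks in date order gives the running distinct count.
# sum(case ...) is used here as a portable substitute for BigQuery's countif.
QUERY = """
select user_id, order_date, product,
       sum(case when seqnum = 1 then 1 else 0 end)
           over (partition by user_id order by order_date) as running_distinct_count
from (
    select et.*,
           row_number() over (partition by user_id, product order by order_date) as seqnum
    from example_table et
) et
order by user_id, order_date
"""


def running_distinct_counts(rows):
    """Load rows into an in-memory SQLite table and run QUERY against it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table example_table (user_id integer, order_date text, product text)"
        )
        conn.executemany("insert into example_table values (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_distinct_counts(ROWS):
        print(row)
```

The last column should come out as 1, 2, 2, 3, 3, 3, matching the cum_dist_count column the question asks for.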
Another option for BigQuery:
select * except(cum_products),(select count(distinct product) from t.cum_products product) as cum_dist_count
from (
select *,array_agg(product) over prev_rows as cum_products
from example_table
window prev_rows as (partition by user_id order by order_date)
) t
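The idea behind this query is to collect, for each row, every product the user has bought up to that row's date and then count the distinct ones. Here is a small plain-Python sketch of that idea, handy for reasoning about edge cases. It assumes that rows sharing the same order_date are treated as peers, which as far as I know is what BigQuery's default window frame does when ORDER BY is given without an explicit frame. The function name cumulative_distinct and the SAMPLE constant are made up for illustration.

```python
from itertools import groupby

# Sample rows (user_id, order_date, product) from the question's target table.
SAMPLE = [
    (1, "2021-01-01", "A"),
    (1, "2021-01-02", "B"),
    (1, "2021-01-04", "A"),
    (1, "2021-01-07", "C"),
    (1, "2021-01-09", "C"),
    (1, "2021-01-20", "A"),
]


def cumulative_distinct(rows):
    """Return (user_id, order_date, product, cum_dist_count) tuples.

    For each row, the count covers every product the same user bought on or
    before that row's order_date, so rows on the same date share one count.
    """
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _user_id, user_rows in groupby(ordered, key=lambda r: r[0]):
        seen = set()  # products this user has bought so far
        for _order_date, same_day in groupby(user_rows, key=lambda r: r[1]):
            same_day = list(same_day)
            # Add the whole day's products before counting, so same-day rows are peers.
            seen.update(product for _, _, product in same_day)
            out.extend((u, d, p, len(seen)) for u, d, p in same_day)
    return out


if __name__ == "__main__":
    for row in cumulative_distinct(SAMPLE):
        print(row)  # counts: 1, 2, 2, 3, 3, 3
```

One consequence worth knowing: if A and B were both ordered on 2021-01-01, both of those rows would show 2 under this assumption, because neither row comes "before" the other within the same date.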
If applied to the sample data in your question:
with example_table as (
select 1 user_id,'2021-01-01' order_date,'A' product union all
select 1,'2021-01-02','B' union all
select 1,'2021-01-04','A' union all
select 1,'2021-01-07','C' union all
select 1,'2021-01-09','C' union all
select 1,'2021-01-20','A'
)
The output is: