Problem description
I have a table in Amazon Redshift with data like this. It almost always follows one of two formats, where the counts and process_ids columns hold these two kinds of values:

client_id day counts process_ids
--------------------------------------------------------------------------------------------
abc1 Feb-01-2021 3 C1,C2 | C3,C4,C5 | C6,C7
abc2 Feb-05-2021 2,3 C10,C11,C12 | C13,C14 # C15,C16 | C17,C18
Now I want to get the following output after splitting counts and process_ids in the data above:

client_id day counts process_ids
--------------------------------------------------------
abc1 Feb-01-2021 3 C1
abc1 Feb-01-2021 3 C2
abc1 Feb-01-2021 3 C3
abc1 Feb-01-2021 3 C4
abc1 Feb-01-2021 3 C5
abc1 Feb-01-2021 3 C6
abc1 Feb-01-2021 3 C7
abc2 Feb-05-2021 2 C10
abc2 Feb-05-2021 2 C11
abc2 Feb-05-2021 2 C12
abc2 Feb-05-2021 2 C13
abc2 Feb-05-2021 2 C14
abc2 Feb-05-2021 3 C15
abc2 Feb-05-2021 3 C16
abc2 Feb-05-2021 3 C17
abc2 Feb-05-2021 3 C18
Basically, if counts and process_ids follow either of these formats, the idea is to split process_ids and counts according to the following two use cases.
Use case 1

The counts column has only a single digit and the process_ids column uses the | separator.
Use case 2

The counts column has two numbers separated by a , and the process_ids column uses both the pipe and the # separator.

I'm working with Amazon Redshift here, but I'm confused about how to split these as required. Is this even possible?
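To make the two rules concrete, the intended transformation can be sketched in plain Python (the `split_rows` helper is hypothetical, not part of the question or of Redshift): split on `#` into groups, pair each group with its count (or reuse the single count), then split each group on `|` and `,`.

```python
def split_rows(client_id, day, counts, process_ids):
    """Sketch of the desired split. One count per '#'-group (use case 2),
    or a single count covering everything (use case 1)."""
    out = []
    count_parts = [c.strip() for c in counts.split(",")]
    groups = [g.strip() for g in process_ids.split("#")]
    for i, group in enumerate(groups):
        # Use the i-th count when counts was a list, else the lone count.
        cnt = count_parts[i] if len(count_parts) > 1 else count_parts[0]
        for chunk in group.split("|"):
            for pid in chunk.split(","):
                out.append((client_id, day, cnt, pid.strip()))
    return out

rows = split_rows("abc2", "Feb-05-2021", "2,3",
                  "C10,C11,C12 | C13,C14 # C15,C16 | C17,C18")
# rows pairs C10..C14 with count 2 and C15..C18 with count 3
```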
Solution
This may look a little hairy at first glance, but it is built from solid techniques and produces the expected result...
SQL
WITH seq_0_9 AS (
    SELECT 0 AS d
    UNION ALL SELECT 1 AS d
    UNION ALL SELECT 2 AS d
    UNION ALL SELECT 3 AS d
    UNION ALL SELECT 4 AS d
    UNION ALL SELECT 5 AS d
    UNION ALL SELECT 6 AS d
    UNION ALL SELECT 7 AS d
    UNION ALL SELECT 8 AS d
    UNION ALL SELECT 9 AS d
), numbers AS (
    -- numbers table: 1 to 1000 inclusive
    SELECT a.d + b.d * 10 + c.d * 100 + 1 AS n
    FROM seq_0_9 a, seq_0_9 b, seq_0_9 c
), processed AS (
    -- strip spaces; normalise '|' to ',' inside process_ids
    SELECT client_id,
           day,
           REPLACE(counts, ' ', '') AS counts,
           REPLACE(REPLACE(process_ids, ' ', ''), '|', ',') AS process_ids
    FROM tbl
), split_pids AS (
    -- explode process_ids into its '#'-separated groups
    SELECT client_id,
           day,
           counts,
           split_part(process_ids, '#', n) AS process_ids,
           n AS n1
    FROM processed
    CROSS JOIN numbers
    WHERE split_part(process_ids, '#', n) IS NOT NULL
      AND split_part(process_ids, '#', n) != ''
), split_counts AS (
    -- explode counts into its ','-separated values
    SELECT client_id,
           day,
           split_part(counts, ',', n) AS counts,
           process_ids,
           n1,
           n AS n2
    FROM split_pids
    CROSS JOIN numbers
    WHERE split_part(counts, ',', n) IS NOT NULL
      AND split_part(counts, ',', n) != ''
), matched_up AS (
    -- keep each group of ids paired with its corresponding count
    SELECT * FROM split_counts WHERE n1 = n2
)
SELECT client_id,
       day,
       counts,
       split_part(process_ids, ',', n) AS process_ids
FROM matched_up
CROSS JOIN numbers
WHERE split_part(process_ids, ',', n) IS NOT NULL
  AND split_part(process_ids, ',', n) != '';
Demo

Online rextester demo (uses PostgreSQL, but should be compatible with Redshift): https://rextester.com/FNA16497
Brief explanation

Cross-joining three copies of the seq_0_9 digit table generates a numbers table (from 1 to 1000 inclusive). split_part is then applied repeatedly, across several Common Table Expressions, to perform all of the splits in a single SQL statement.
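For intuition, here is a rough Python model (hypothetical names, not Redshift code) of the split_part semantics the statement relies on: parts are 1-indexed, and an out-of-range part yields an empty string, so cross-joining against the numbers table and discarding empty results behaves like an "explode" of the delimited string into rows.

```python
def split_part(s, delimiter, part):
    """Approximation of Redshift's split_part(string, delimiter, part):
    1-indexed; returns '' when part is out of range."""
    pieces = s.split(delimiter)
    return pieces[part - 1] if 1 <= part <= len(pieces) else ""

# The CROSS JOIN against numbers 1..1000 plus the != '' filter
# turns split_part into a row generator:
row = "C1,C2,C3"
exploded = [p for n in range(1, 1001)
            if (p := split_part(row, ",", n)) != ""]
# exploded is ["C1", "C2", "C3"]
```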
Using Miller, I have built an example script, starting from this TSV
client_id day counts process_ids
abc1 Feb-01-2021 3 C1,C2 | C3,C4,C5 | C6,C7
abc2 Feb-05-2021 2,3 C10,C11,C12 | C13,C14 # C15,C16 | C17,C18
which pretty-prints as
+-----------+-------------+--------+-------------------------------------------+
| client_id | day         | counts | process_ids                               |
+-----------+-------------+--------+-------------------------------------------+
| abc1      | Feb-01-2021 | 3      | C1,C2 | C3,C4,C5 | C6,C7                  |
| abc2      | Feb-05-2021 | 2,3    | C10,C11,C12 | C13,C14 # C15,C16 | C17,C18 |
+-----------+-------------+--------+-------------------------------------------+
and I have written this Miller pipeline
mlr --tsv clean-whitespace then put -S '
if ($process_ids=~"|" && $counts=~"^[0-9]$")
{$process_ids=gsub($process_ids," *[|] *",",")}
elif($process_ids=~"[#]")
{$process_ids=gsub(gsub($process_ids," *[|] *",",")," *# *","#");$counts=gsub($counts,",","#")}' then \
put '
asplits = splitnv($counts,"#");
bsplits = splitnv($process_ids,"#");
n = length(asplits);
for (int i = 1; i <= n; i += 1) {
outrec = $*;
outrec["counts"] = asplits[i];
outrec["process_ids"] = bsplits[i];
emit outrec;
}
' then \
uniq -a then \
filter -x -S '$counts=~"[#]"' then \
cat -n then \
nest --explode --values --across-records -f process_ids --nested-fs "," then \
cut -x -f n input.tsv
which gives you
client_id day counts process_ids
abc1 Feb-01-2021 3 C1
abc1 Feb-01-2021 3 C2
abc1 Feb-01-2021 3 C3
abc1 Feb-01-2021 3 C4
abc1 Feb-01-2021 3 C5
abc1 Feb-01-2021 3 C6
abc1 Feb-01-2021 3 C7
abc2 Feb-05-2021 2 C10
abc2 Feb-05-2021 2 C11
abc2 Feb-05-2021 2 C12
abc2 Feb-05-2021 2 C13
abc2 Feb-05-2021 2 C14
abc2 Feb-05-2021 3 C15
abc2 Feb-05-2021 3 C16
abc2 Feb-05-2021 3 C17
abc2 Feb-05-2021 3 C18