How can I split data from two columns using two different delimiters in Redshift?

Problem Description

I have a table with data like this. It roughly follows two formats, and the counts column can hold either type of data.

client_id     day           counts    process_ids
--------------------------------------------------------------------------------------------
abc1          Feb-01-2021   3         C1,C2 | C3,C4,C5 | C6,C7
abc2          Feb-05-2021   2,3       C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

Now I want to get the following output after splitting the data:

client_id     day           counts    process_ids
--------------------------------------------------------
abc1          Feb-01-2021   3         C1
abc1          Feb-01-2021   3         C2
abc1          Feb-01-2021   3         C3
abc1          Feb-01-2021   3         C4
abc1          Feb-01-2021   3         C5
abc1          Feb-01-2021   3         C6
abc1          Feb-01-2021   3         C7
abc2          Feb-05-2021   2         C10
abc2          Feb-05-2021   2         C11
abc2          Feb-05-2021   2         C12
abc2          Feb-05-2021   2         C13
abc2          Feb-05-2021   2         C14
abc2          Feb-05-2021   3         C15
abc2          Feb-05-2021   3         C16
abc2          Feb-05-2021   3         C17
abc2          Feb-05-2021   3         C18

Basically, the idea is to split counts and process_ids depending on which of the two formats below a row follows.

Use Case 1

If the counts column has only a single digit and the process_ids column uses the | delimiter.


Use Case 2

If the counts column has two digits separated by the , delimiter, and the process_ids column uses both the # and | (pipe) delimiters.

I'm working with Amazon Redshift here, but I'm confused about how to split the values as required.

Is this even possible?
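Before reaching for SQL, the intended transformation can be pinned down with a small Python sketch. This is purely illustrative (not Redshift code); the rows are the sample data above:

```python
rows = [
    ("abc1", "Feb-01-2021", "3", "C1,C2 | C3,C4,C5 | C6,C7"),
    ("abc2", "Feb-05-2021", "2,3", "C10,C11,C12 | C13,C14 # C15,C16 | C17,C18"),
]

result = []
for client_id, day, counts, process_ids in rows:
    if "#" in process_ids:
        # Use case 2: each '#'-separated group pairs with one comma-separated count.
        groups = process_ids.split("#")
        count_list = counts.split(",")
    else:
        # Use case 1: a single count applies to the whole pipe-separated list.
        groups = [process_ids]
        count_list = [counts]
    for count, group in zip(count_list, groups):
        # Within a group, '|' and ',' both separate individual process ids.
        for pid in group.replace("|", ",").split(","):
            pid = pid.strip()
            if pid:
                result.append((client_id, day, count.strip(), pid))

for row in result:
    print(row)
```

Running this yields the 16 desired rows, with counts 2 paired to C10..C14 and counts 3 paired to C15..C18 for abc2.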

Solution

This may look a bit hairy at first glance, but it is built from solid techniques and produces the expected output...

SQL

WITH seq_0_9 AS (
  SELECT 0 AS d
  UNION ALL SELECT 1 AS d
  UNION ALL SELECT 2 AS d
  UNION ALL SELECT 3 AS d
  UNION ALL SELECT 4 AS d
  UNION ALL SELECT 5 AS d
  UNION ALL SELECT 6 AS d
  UNION ALL SELECT 7 AS d
  UNION ALL SELECT 8 AS d
  UNION ALL SELECT 9 AS d
),numbers AS (
  SELECT a.d + b.d * 10 + c.d * 100 + 1 AS n
  FROM seq_0_9 a,seq_0_9 b,seq_0_9 c
),processed AS
  (SELECT client_id,day,REPLACE(counts,' ','') AS counts,
          REPLACE(REPLACE(process_ids,' ',''),'|',',') AS process_ids
   FROM tbl),split_pids AS
  (SELECT
     client_id,day,counts,split_part(process_ids,'#',n) AS process_ids,n AS n1
   FROM processed
   CROSS JOIN numbers
   WHERE
     split_part(process_ids,'#',n) IS NOT NULL
     AND split_part(process_ids,'#',n) != ''),split_counts AS
  (SELECT
     client_id,day,split_part(counts,',',n) AS counts,process_ids,n1,n AS n2
   FROM split_pids
   CROSS JOIN numbers
   WHERE
     split_part(counts,',',n) IS NOT NULL
     AND split_part(counts,',',n) != ''),matched_up AS
  (SELECT * FROM split_counts WHERE n1 = n2)
SELECT
  client_id,day,counts,split_part(process_ids,',',n) AS process_ids
FROM
  matched_up
CROSS JOIN
  numbers
WHERE
  split_part(process_ids,',',n) IS NOT NULL
  AND split_part(process_ids,',',n) != '';

Demo

Online rextester demo (uses PostgreSQL but should be compatible with Redshift): https://rextester.com/FNA16497

Brief Explanation

This technique is used to generate a numbers table (from 1 to 1000 inclusive) by cross-joining a single-digit CTE with itself three times. split_part is then applied repeatedly across several Common Table Expressions so that all the splitting happens in a single SQL statement.
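To see why the n1 = n2 match lines groups up with their counts, here is a Python emulation of the pipeline for the use-case-2 row as it looks after the processed step (illustration only; split_part below mimics Redshift's SPLIT_PART semantics):

```python
def split_part(s, delim, n):
    """Mimics Redshift's SPLIT_PART: 1-based position, '' when n is out of range."""
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

numbers = range(1, 1001)  # the three cross-joined digit CTEs give 1..1000

# The abc2 row after 'processed': spaces stripped, '|' replaced with ','
counts = "2,3"
process_ids = "C10,C11,C12,C13,C14#C15,C16,C17,C18"

out = []
for n1 in numbers:                      # split_pids: split groups on '#'
    group = split_part(process_ids, "#", n1)
    if group == "":
        continue
    for n2 in numbers:                  # split_counts: split counts on ','
        cnt = split_part(counts, ",", n2)
        if cnt == "" or n1 != n2:       # matched_up: keep aligned pairs only
            continue
        for n in numbers:               # final query: split each group on ','
            pid = split_part(group, ",", n)
            if pid:
                out.append((cnt, pid))

print(out)
```

Group 1 (C10..C14) pairs with count 2 and group 2 (C15..C18) with count 3, exactly because both were split at the same position n1 = n2.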

Alternative Solution

I have built an example script, starting from this TSV input:

client_id   day counts  process_ids
abc1    Feb-01-2021 3   C1,C2 | C3,C4,C5 | C6,C7
abc2    Feb-05-2021 2,3 C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

And here it is pretty-printed:

+-----------+-------------+--------+-------------------------------------------+
| client_id | day         | counts | process_ids                               |
+-----------+-------------+--------+-------------------------------------------+
| abc1      | Feb-01-2021 | 3      | C1,C2 | C3,C4,C5 | C6,C7                  |
| abc2      | Feb-05-2021 | 2,3    | C10,C11,C12 | C13,C14 # C15,C16 | C17,C18 |
+-----------+-------------+--------+-------------------------------------------+

I have written this Miller procedure:

mlr --tsv clean-whitespace then put -S '
  if ($process_ids=~"[|]" && $counts=~"^[0-9]$")
    {$process_ids=gsub($process_ids," *[|] *",",")}
  elif($process_ids=~"[#]")
    {$process_ids=gsub(gsub($process_ids," *[|] *",",")," *# *","#");$counts=gsub($counts,",","#")}'  then \
put '
  asplits = splitnv($counts,"#");
  bsplits = splitnv($process_ids,"#");
  n = length(asplits);
  for (int i = 1; i <= n; i += 1) {
    outrec = $*;
    outrec["counts"] = asplits[i];
    outrec["process_ids"] = bsplits[i];
    emit outrec;
  }
' then \
uniq -a then \
filter -x -S '$counts=~"[#]"' then \
cat -n then \
nest --explode --values --across-records -f process_ids --nested-fs "," then \
cut -x -f n input.tsv

which gives you:

client_id       day     counts  process_ids
abc1    Feb-01-2021     3       C1
abc1    Feb-01-2021     3       C2
abc1    Feb-01-2021     3       C3
abc1    Feb-01-2021     3       C4
abc1    Feb-01-2021     3       C5
abc1    Feb-01-2021     3       C6
abc1    Feb-01-2021     3       C7
abc2    Feb-05-2021     2       C10
abc2    Feb-05-2021     2       C11
abc2    Feb-05-2021     2       C12
abc2    Feb-05-2021     2       C13
abc2    Feb-05-2021     2       C14
abc2    Feb-05-2021     3       C15
abc2    Feb-05-2021     3       C16
abc2    Feb-05-2021     3       C17
abc2    Feb-05-2021     3       C18
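The first put block in the Miller script does the real work: it normalizes both fields so that '#' separates aligned groups and ',' separates ids within a group. A rough Python equivalent of that normalization step (the regexes mirror the ones in the script; this is a sketch, not part of the Miller pipeline):

```python
import re

def normalize(counts, process_ids):
    """After this step, '#' separates aligned groups in both fields
    and ',' separates individual ids within a group."""
    if "|" in process_ids and re.fullmatch(r"[0-9]", counts):
        # Use case 1: a single count, so the pipes just become commas.
        process_ids = re.sub(r" *\| *", ",", process_ids)
    elif "#" in process_ids:
        # Use case 2: pipes -> commas inside groups, tidy the '#' separators,
        # and make counts use '#' as its group separator too.
        process_ids = re.sub(r" *# *", "#", re.sub(r" *\| *", ",", process_ids))
        counts = counts.replace(",", "#")
    return counts, process_ids

print(normalize("3", "C1,C2 | C3,C4,C5 | C6,C7"))
print(normalize("2,3", "C10,C11,C12 | C13,C14 # C15,C16 | C17,C18"))
```

From there, splitnv on '#' pairs each count with its group, and the nest verb explodes the remaining commas into one record per process id.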