BigQuery and dbt - Finding the frequency distribution of STRING columns

Problem description

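I am using dbt to find the frequency distribution of the STRING-typed columns of every table in a given schema. This is the macro I came up with after following a blog post.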

{%- macro col_freq(table_schema) -%}

{{ config(schema='profiles') }}

{% set tables = dbt_utils.get_relations_by_prefix(table_schema,'') %}

SELECT * FROM (
{% for table in tables %}
  SELECT *
  FROM
(
  WITH
    `table` AS (SELECT * FROM {{ table }}),
    -- Serialize each row to JSON and strip the outer braces.
    table_as_json AS (
      SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'^{|}$', '') AS ROW
      FROM `table` AS t
    ),
    -- Split each row into (column_name, column_value) pairs.
    pairs AS (
      SELECT
        REPLACE(column_name, '"', '') AS column_name,
        IF(SAFE_CAST(column_value AS STRING) = 'null', NULL, column_value) AS column_value
      FROM table_as_json,
        UNNEST(SPLIT(ROW, ',"')) AS z,
        UNNEST([SPLIT(z, ':')[SAFE_OFFSET(0)]]) AS column_name,
        UNNEST([SPLIT(z, ':')[SAFE_OFFSET(1)]]) AS column_value
    ),
    -- Attach the catalog, schema and table name parsed from the relation.
    str_cols AS (
      SELECT
        SPLIT(REPLACE('{{ table }}', '`', ''), '.')[SAFE_OFFSET(0)] AS table_catalog,
        SPLIT(REPLACE('{{ table }}', '`', ''), '.')[SAFE_OFFSET(1)] AS table_schema,
        SPLIT(REPLACE('{{ table }}', '`', ''), '.')[SAFE_OFFSET(2)] AS table_name,
        column_name,
        column_value
      FROM pairs
    ),
    -- Keep only the columns that INFORMATION_SCHEMA reports as STRING.
    matching_cols AS (
      SELECT str_cols.table_catalog, str_cols.table_schema, str_cols.table_name, str_cols.column_name, column_value
      FROM str_cols
      JOIN
      (
        SELECT
          table_catalog,table_schema,table_name,column_name
        FROM
          {{ table_schema }}.INFORMATION_SCHEMA.COLUMNS
        WHERE data_type='STRING'
      ) sample
      ON str_cols.table_catalog = sample.table_catalog
      AND str_cols.table_schema = sample.table_schema
      AND str_cols.table_name = sample.table_name
      AND str_cols.column_name = sample.column_name
    ),profile AS (
      SELECT
            table_catalog, table_schema, table_name, column_name, column_value, COUNT(1) AS frequency
      FROM matching_cols
      GROUP BY table_catalog, table_schema, table_name, column_name, column_value
    )

    SELECT * FROM profile
)
{%- if not loop.last %}
    UNION ALL
{%- endif %}
{% endfor %}
)

{%- endmacro -%}

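This works well, except for unique columns such as email. I want to skip the frequency distribution for any column whose ratio of unique values is 1.0. I am able to compute that ratio, but I am stuck on how to use it as a filter. Any ideas on how to do this?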

SAFE_DIVIDE(COUNT(DISTINCT column_value), COUNT(*)) AS pct_unique
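
One possible way to apply this expression as a filter is to compute it once per column in a separate CTE and join that back before counting frequencies. The query below is only a sketch under assumptions: `matching_cols` stands in for the CTE of the same name in the macro, reduced to `table_name` and `column_name` as keys and filled with made-up sample rows so the query runs on its own; the `uniqueness` CTE and its aliases are hypothetical names.

```sql
-- Stand-in for the macro's matching_cols CTE, reduced to two key columns
-- and filled with made-up sample rows so the query runs on its own.
WITH matching_cols AS (
  SELECT 'users' AS table_name, 'email' AS column_name, 'a@example.com' AS column_value UNION ALL
  SELECT 'users', 'email', 'b@example.com' UNION ALL
  SELECT 'users', 'email', 'c@example.com' UNION ALL
  SELECT 'users', 'country', 'US' UNION ALL
  SELECT 'users', 'country', 'US' UNION ALL
  SELECT 'users', 'country', 'DE'
),
uniqueness AS (
  -- One row per column with its ratio of distinct values to rows.
  SELECT
    table_name,
    column_name,
    SAFE_DIVIDE(COUNT(DISTINCT column_value), COUNT(*)) AS pct_unique
  FROM matching_cols
  GROUP BY table_name, column_name
),
profile AS (
  -- Frequency distribution, restricted to columns that are not fully unique.
  SELECT
    m.table_name,
    m.column_name,
    m.column_value,
    COUNT(1) AS frequency
  FROM matching_cols AS m
  JOIN uniqueness AS u
    ON m.table_name = u.table_name
   AND m.column_name = u.column_name
  WHERE u.pct_unique < 1.0
  GROUP BY m.table_name, m.column_name, m.column_value
)
SELECT * FROM profile
ORDER BY table_name, column_name, frequency DESC
```

With these sample rows, `email` has a ratio of 1.0 and is dropped, while `country` (ratio 2/3) is kept. In the macro, the same idea would presumably mean placing `uniqueness` between `matching_cols` and `profile` and joining on all four key columns. Note that `COUNT(DISTINCT column_value)` ignores NULLs while `COUNT(*)` counts them, so a column containing NULLs never reaches 1.0; using `COUNT(column_value)` as the denominator is an alternative if NULLs should be ignored.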
