BigQuery UNION DISTINCT where values are not in a previous dataset

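Problem description

I am trying to reconcile some student databases with GSuite emails, where usernames have been created inconsistently over the years.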

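The gist of the query I want to run on BigQuery is:

  1. Emails that follow email pattern 1, joined to students
  2. Emails that follow email pattern 2, joined to students
  3. Emails that are in neither 1 nor 2.

Or, in SQL: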

with mymatches as (
    with emaildataset as (
        select 'testA' as col
        union all
        select 'testB'
        union all
        select 'testC'
        union all
        select 'testD'

    )

    select * from emaildataset where col like '%A'
    union distinct
    select * from emaildataset where col like '%B'
),emaildataset2 as (
        select 'testA' as col
        union all
        select 'testB'
        union all
        select 'testC'
        union all
        select 'testD'
)

select * from mymatches

union distinct

select * from emaildataset2 where emaildataset2.col not in (select col from mymatches)

This runs fine, but when I run the real code I get duplicates.

The real code is currently:

with matchedEmails as (
    with g as (
        select * from gsuite.StudentUsers
        union all
        select * from gsuite.AlumniUsers
    )

    select
    std.STDCODE,g.*

    from g
    inner join quick.all_students_alumni as std
    on split(lower(g.Email),'@')[offset(0)] = split(quick.studentEmail(std.FNAME,std.MNAME,std.LNAME,std.STATUSTYPE),'@')[offset(0)]

    where g.OU like '/Student%' or OU like '/Alumni%'

    union distinct select
    std.STDCODE,'','@')[offset(0)]

    where g.OU like '/Student%' or OU like '/Alumni%'

)

select * from matchedEmails

union distinct select

'NOT MATCHED' as STDCODE,g.*

from (
    select * from gsuite.StudentUsers
    union all
    select * from gsuite.AlumniUsers
) as g

where g.Email not in (select Email from matchedEmails)
and g.OU like '/Student%' or OU like '/Alumni%'

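However, the result has duplicates in the Email column which, based on my knowledge and the test above, should not be there because of the where g.Email not in (select Email from matchedEmails) clause.

Am I doing something wrong?

Solution

I think the last WHERE clause should be fixed as follows: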

where g.Email not in (select Email from matchedEmails)
and (g.OU like '/Student%' or OU like '/Alumni%')    

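As you can see, the parentheses around g.OU like '/Student%' or OU like '/Alumni%' were missing. Because AND binds more tightly than OR, the original condition is evaluated as (g.Email not in (...) and g.OU like '/Student%') or OU like '/Alumni%', so every row whose OU starts with /Alumni passes the filter regardless of the NOT IN check.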
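To see the precedence effect in isolation, here is a minimal sketch in BigQuery standard SQL. The sample emails and OU values are invented for illustration, and the inline g and matchedEmails CTEs only stand in for the real tables; they are not the actual data.

```sql
-- Hypothetical sample data, invented purely for illustration.
with g as (
    select 'a@example.com' as Email, '/Student/2020' as OU
    union all
    select 'b@example.com', '/Alumni/2015'
    union all
    select 'c@example.com', '/Student/2021'
),
matchedEmails as (
    -- Assume a@ and b@ were already matched by the earlier joins.
    select 'a@example.com' as Email
    union all
    select 'b@example.com'
)

select
    g.Email,
    -- No parentheses: parsed as (NOT IN ... AND Student) OR Alumni.
    (g.Email not in (select Email from matchedEmails)
        and g.OU like '/Student%' or g.OU like '/Alumni%') as without_parens,
    -- With parentheses: NOT IN applies to both OU branches.
    (g.Email not in (select Email from matchedEmails)
        and (g.OU like '/Student%' or g.OU like '/Alumni%')) as with_parens
from g
order by g.Email
```

With this data, the two columns should differ only for b@example.com: it is already matched, yet it passes the unparenthesized filter because its OU starts with /Alumni. In the real query such a row carries 'NOT MATCHED' as STDCODE, so it differs as a whole row from its matched counterpart, union distinct keeps both, and the same Email shows up twice.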

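There may be other things that still need fixing, but this should answer the following part of your question:

However, the result has duplicates in the Email column which, based on my knowledge and the test above, should not be there because of the where g.Email not in (select Email from matchedEmails) clause.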