How to delete duplicate rows in PostgreSQL that are less than 10 seconds apart

Problem description

Currently, the table looks like this.

datetime             number  content
2018-01-01 02:49:04  1       spring
2018-01-01 02:49:10  1       spring  *
2018-01-01 02:49:24  1       spring
2018-01-01 02:49:29  1       summer
2018-01-01 02:49:44  1       spring
2018-01-01 02:49:49  1       spring  *
2018-01-01 02:49:50  1       winter
2018-01-01 02:49:51  1       spring  *

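If the 'number' and 'content' columns have the same values, the rows must be more than 10 seconds apart (* marks a row that should be deleted). So the table should look like this.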

datetime             number  content
2018-01-01 02:49:04  1       spring
2018-01-01 02:49:24  1       spring
2018-01-01 02:49:29  1       summer
2018-01-01 02:49:44  1       spring
2018-01-01 02:49:50  1       winter

I looked at Delete Duplicate Data on PostgreSQL, but that case is quite different from mine. I think the code would look something like this:

DELETE FROM table a USING table b 
WHERE
(DATETIME CALCULATION CODE)
AND a.number = b.number 
AND a.content = b.content

Thanks in advance for your help.

Solution

You can use EXISTS():


DELETE FROM foo del
WHERE EXISTS (
        SELECT * FROM foo x
        WHERE x.znumber = del.znumber AND x.content = del.content
        AND x.dt < del.dt
        AND x.dt >= del.dt - '10 sec'::interval
        )
        ;

select * from foo;

Result:


DELETE 16
 id |         dt          | znumber | content 
----+---------------------+---------+---------
  1 | 2018-01-01 02:49:04 |       1 | spring
 16 | 2018-01-01 02:49:29 |       1 | summer
 17 | 2018-01-01 02:49:44 |       1 | spring
 19 | 2018-01-01 02:49:50 |       1 | winter
(4 rows)
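
To check the rule outside the database, here is a minimal Python sketch of what the EXISTS subquery tests: a row is deleted when another row with the same number and content is strictly earlier and at most 10 seconds before it. It uses the eight rows from the question rather than the larger sample table; the helper name has_close_predecessor is made up for this illustration.

```python
from datetime import datetime, timedelta

# Rows from the question: (datetime, number, content).
rows = [
    ("2018-01-01 02:49:04", 1, "spring"),
    ("2018-01-01 02:49:10", 1, "spring"),
    ("2018-01-01 02:49:24", 1, "spring"),
    ("2018-01-01 02:49:29", 1, "summer"),
    ("2018-01-01 02:49:44", 1, "spring"),
    ("2018-01-01 02:49:49", 1, "spring"),
    ("2018-01-01 02:49:50", 1, "winter"),
    ("2018-01-01 02:49:51", 1, "spring"),
]
parsed = [(datetime.fromisoformat(dt), num, content) for dt, num, content in rows]
window = timedelta(seconds=10)


def has_close_predecessor(row, table):
    """Model of the EXISTS subquery: same number and content,
    strictly earlier, and at most 10 seconds before `row`."""
    dt, num, content = row
    return any(
        n == num and c == content and dt - window <= d < dt
        for d, n, c in table
    )


# Every test looks at the original table, as a single DELETE statement does.
kept = [r for r in parsed if not has_close_predecessor(r, parsed)]
for dt, num, content in kept:
    print(dt, num, content)
```

On this input the sketch keeps the same five rows as the desired table in the question (02:49:04, 02:49:24, 02:49:29, 02:49:44 and 02:49:50).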

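Extra: if you want to remove the rows in the middle but keep the records on both ends, you can use the following. [This always keeps the first and the last record of each group, even if they are too close together.]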


WITH laglead AS (
        SELECT id
        , dt AS this
        , lag(dt) OVER www AS prev
        , lead(dt) OVER www AS next
        FROM foo x
        WINDOW www AS (PARTITION BY znumber,content ORDER BY dt)
        )
DELETE FROM foo del
WHERE EXISTS (
        SELECT *
        FROM laglead x
        WHERE x.id = del.id
        AND x.next < x.prev + '10 sec'::interval
        )
        ;
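
The lag/lead condition can be modelled the same way. The sketch below is only an illustration in Python, applied to the eight rows from the question: inside each (number, content) group it deletes a row when its next neighbour is less than 10 seconds after its previous neighbour. The first and last rows of a group have no prev or next (NULL in SQL), so they are never deleted.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Rows from the question: (datetime, number, content).
rows = [
    ("2018-01-01 02:49:04", 1, "spring"),
    ("2018-01-01 02:49:10", 1, "spring"),
    ("2018-01-01 02:49:24", 1, "spring"),
    ("2018-01-01 02:49:29", 1, "summer"),
    ("2018-01-01 02:49:44", 1, "spring"),
    ("2018-01-01 02:49:49", 1, "spring"),
    ("2018-01-01 02:49:50", 1, "winter"),
    ("2018-01-01 02:49:51", 1, "spring"),
]
window = timedelta(seconds=10)

# Sort by (number, content, datetime), like PARTITION BY ... ORDER BY dt.
parsed = sorted(
    (num, content, datetime.fromisoformat(dt)) for dt, num, content in rows
)

deleted = []
for key, group in groupby(parsed, key=lambda r: (r[0], r[1])):
    times = [r[2] for r in group]
    # Slide over (prev, this, next) triples of the original, unmodified group.
    for prev, this, nxt in zip(times, times[1:], times[2:]):
        if nxt < prev + window:
            deleted.append((key, this))

for (num, content), dt in deleted:
    print(dt, num, content)
```

On the question's rows this variant removes only the 02:49:49 spring row, so it is not a drop-in replacement for the first query; it answers a different requirement.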

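You can first flag every row whose gap to the previous row is less than 10 seconds, then delete those rows by their ctid values. If you have a real unique key (for example an identity column), it is better to use that, because comparisons on ctid are quite slow.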

with flagged as (
  select ctid as rid,
         row_number() over w as rn,
         datetime - lag(datetime) over w < interval '10 seconds' as small_gap
  from the_table
  window w as (partition by number,content order by datetime)
)
delete from the_table
where ctid in (select rid
               from flagged
               where small_gap);


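Assuming the records in your table have unique ids, first use the lag window function to find the ids of the records to delete, then delete them by id. It is done in two steps because you cannot use window functions in a WHERE clause.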

with t as
(
 select id,"datetime" - lag("datetime",1) over (partition by "number","content" order by "datetime") < interval '10 seconds' as to_delete
 from _table
)
delete from _table where id in (select id from t where to_delete);

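I don't think you can do this with window functions in the general case.

Suppose the table contains a sequence of timestamps, one per second. Your requirement that "the time difference must exceed 10 seconds" means keeping the first timestamp and skipping the next 9 records, but the record after those is 10 seconds away from the first one, so it should be kept. This means the problem cannot be solved by comparing the current row only with the previous row. Instead, the current row must be compared with the last row that will be kept: delete the row if the difference is less than 10 seconds, otherwise keep it.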
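
The difference between the two rules is easy to see in a small Python sketch. It builds the one-row-per-second sequence described above (30 rows here, an arbitrary choice for the illustration) and applies both rules; the function names are made up for this example.

```python
from datetime import datetime, timedelta

start = datetime(2018, 1, 1, 2, 49, 0)
times = [start + timedelta(seconds=i) for i in range(30)]  # one row per second
window = timedelta(seconds=10)


def keep_by_previous_row(ts):
    """Keep a row only if it is at least 10 s after the row just before it."""
    return [t for i, t in enumerate(ts) if i == 0 or t - ts[i - 1] >= window]


def keep_by_last_kept(ts):
    """Keep a row only if it is at least 10 s after the last row that was kept."""
    kept = []
    for t in ts:
        if not kept or t - kept[-1] >= window:
            kept.append(t)
    return kept


print([t.second for t in keep_by_previous_row(times)])  # [0]
print([t.second for t in keep_by_last_kept(times)])     # [0, 10, 20]
```

Comparing with the previous row keeps only the very first record, because every gap is 1 second. Comparing with the last kept row keeps one record every 10 seconds, which is the behaviour described in the paragraph above.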

So... plpgsql.

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (
    id      SERIAL PRIMARY KEY,
    dt      TIMESTAMP WITHOUT TIME ZONE,
    number  INT NOT NULL,
    content TEXT NOT NULL
);
-- The data rows below must be tab-separated (the default \copy text format).
\copy foo(dt,number,content) FROM stdin
2018-01-01 02:49:04 1   spring
2018-01-01 02:49:10 1   spring
2018-01-01 02:49:11 1   spring
2018-01-01 02:49:12 1   spring
2018-01-01 02:49:13 1   spring
2018-01-01 02:49:14 1   spring
2018-01-01 02:49:15 1   spring
2018-01-01 02:49:16 1   spring
2018-01-01 02:49:17 1   spring
2018-01-01 02:49:18 1   spring
2018-01-01 02:49:19 1   spring
2018-01-01 02:49:20 1   spring
2018-01-01 02:49:21 1   spring
2018-01-01 02:49:22 1   spring
2018-01-01 02:49:24 1   spring
2018-01-01 02:49:29 1   summer
2018-01-01 02:49:44 1   spring
2018-01-01 02:49:49 1   spring
2018-01-01 02:49:50 1   winter
2018-01-01 02:49:51 1   spring
\.

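CREATE OR REPLACE FUNCTION foo_del( )
RETURNS SETOF INT
AS $$
DECLARE
    row         foo%ROWTYPE;
    last_row    foo%ROWTYPE;   -- last kept row, with dt already moved forward by 10 seconds
BEGIN
    -- Order by content too, so that all rows of one (number, content) group are adjacent.
    FOR row IN SELECT * FROM foo ORDER BY number,content,dt
    LOOP
        IF last_row.id IS NULL OR row.number != last_row.number OR row.content != last_row.content 
            OR row.dt >= last_row.dt THEN
            -- First row, new group, or at least 10 seconds after the last kept row: keep it.
            last_row := row;
            last_row.dt := last_row.dt + '10 SECOND'::INTERVAL;
        ELSE
            -- Less than 10 seconds after the last kept row of the same group: return its id.
            IF row.dt < last_row.dt THEN
                RETURN NEXT row.id;
            END IF;
        END IF;
    END LOOP;
END;
$$ LANGUAGE PLPGSQL;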

SELECT * FROM foo LEFT JOIN (SELECT * FROM foo_del()) d ON (foo.id=d.foo_del) ORDER BY id;

DELETE FROM foo WHERE id IN (SELECT * FROM foo_del());

SELECT * FROM foo ORDER BY id;
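
As a cross-check, the keep-or-delete rule described in this answer can be modelled in Python on the same 20 sample rows. This is only a sketch of the rule (compare each row with the last kept row of the same number and content), not a replacement for the function; the name ids_to_delete is invented for the example, and the ids assume that SERIAL numbers the rows 1 to 20 in the order of the \copy data.

```python
from datetime import datetime, timedelta

# Sample rows in the same order as the \copy data, so ids run from 1 to 20.
seconds_and_content = (
    [(4, "spring")]
    + [(s, "spring") for s in range(10, 23)]
    + [(24, "spring"), (29, "summer"), (44, "spring"),
       (49, "spring"), (50, "winter"), (51, "spring")]
)
rows = [
    (i, datetime(2018, 1, 1, 2, 49, s), 1, content)
    for i, (s, content) in enumerate(seconds_and_content, start=1)
]
window = timedelta(seconds=10)


def ids_to_delete(table):
    """Walk the rows in time order and flag every row that is less than
    10 s after the last kept row of its (number, content) group."""
    flagged = []
    last_kept = {}  # (number, content) -> dt of the last kept row
    for row_id, dt, number, content in sorted(table, key=lambda r: r[1]):
        key = (number, content)
        if key in last_kept and dt - last_kept[key] < window:
            flagged.append(row_id)
        else:
            last_kept[key] = dt
    return sorted(flagged)


print(ids_to_delete(rows))
# [2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 18, 20]
```

Under this rule the rows with ids 1, 6, 15, 16, 17 and 19 remain: unlike the previous-row comparisons, the spring rows at 02:49:14 and 02:49:24 survive because each is exactly 10 seconds after the last kept spring row.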