Problem description
Currently, the table looks like this.
datetime | number | content |
---|---|---|
2018-01-01 02:49:04 | 1 | spring |
2018-01-01 02:49:10 | 1 | spring | *
2018-01-01 02:49:24 | 1 | spring |
2018-01-01 02:49:29 | 1 | summer |
2018-01-01 02:49:44 | 1 | spring |
2018-01-01 02:49:49 | 1 | spring | *
2018-01-01 02:49:50 | 1 | winter |
2018-01-01 02:49:51 | 1 | spring | *
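If rows have the same values in the 'number' and 'content' columns, the time difference between them must be more than 10 seconds. (* marks a row that should be deleted.) So the table should look like this.
datetime | number | content |
---|---|---|
2018-01-01 02:49:04 | 1 | spring |
2018-01-01 02:49:24 | 1 | spring |
2018-01-01 02:49:29 | 1 | summer |
2018-01-01 02:49:44 | 1 | spring |
2018-01-01 02:49:50 | 1 | winter |
I looked at "Delete Duplicate Data on PostgreSQL", but it is quite different from my situation. I think the code would look something like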
DELETE FROM table a USING table b
WHERE
(DATETIME CALCULATION CODE)
AND a.number = b.number
AND a.content = b.content
Thanks in advance for your help.
Solution
You can use EXISTS():
DELETE FROM foo del
WHERE EXISTS (
SELECT * FROM foo x
WHERE x.znumber = del.znumber AND x.content = del.content
AND x.dt < del.dt
AND x.dt >= del.dt - '10 sec'::interval
)
;
select * from foo;
Result:
DELETE 16
id | dt | znumber | content
----+---------------------+---------+---------
1 | 2018-01-01 02:49:04 | 1 | spring
16 | 2018-01-01 02:49:29 | 1 | summer
17 | 2018-01-01 02:49:44 | 1 | spring
19 | 2018-01-01 02:49:50 | 1 | winter
(4 rows)
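Before running a DELETE like the one above, it can help to preview which rows it would remove. The following is only a sketch: it reuses the same EXISTS condition in a plain SELECT and assumes the `foo` table has the `id`, `dt`, `znumber` and `content` columns shown in the result.

```sql
-- Preview only: lists the rows the DELETE would remove, without changing the table
SELECT del.*
FROM foo del
WHERE EXISTS (
    SELECT 1
    FROM foo x
    WHERE x.znumber = del.znumber
      AND x.content = del.content
      AND x.dt <  del.dt                            -- an earlier row ...
      AND x.dt >= del.dt - interval '10 seconds'    -- ... at most 10 seconds before
)
ORDER BY del.znumber, del.content, del.dt;
```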
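Extra: if you want to eliminate the records in the middle but keep those on both sides, you can use the following. [This will always keep the first and last records, even if they are too close to each other.]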
WITH laglead AS (
SELECT id
--,dt AS this,znumber,content
,lag(dt) OVER www AS prev
,lead(dt) OVER www AS next
FROM foo x
WINDOW www AS (PARTITION BY znumber,content ORDER BY dt)
)
DELETE FROM foo del
WHERE EXISTS (
SELECT *
FROM laglead x
WHERE x.id = del.id
AND x.next < x.prev + '10 sec'::interval
)
;
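You can first find all rows whose gap to the previous row is less than 10 seconds, then delete them using their ctid values. If you have a real unique key (e.g. an identity column), it is better to use that, because comparisons on ctid are quite slow.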
with flagged as (
select ctid as rid,row_number() over w as rn,datetime - lag(datetime) over w < interval '10 seconds' as small_gap
from the_table
window w as (partition by number,content order by datetime)
)
delete from the_table
where ctid in (select rid
from flagged
where small_gap);
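Assuming the records in your table have unique ids, first use the lag window function to find the ids of the records to delete, then delete them by id. It is done in two steps because you cannot use window functions in a WHERE clause.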
with t as
(
select id,"datetime" - lag("datetime",1) over (partition by "number","content" order by "datetime") < interval '10 seconds' as to_delete
from _table
)
delete from _table where id in (select id from t where to_delete);
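I don't think you can do this with window functions in the general case.
Suppose the table contains a sequence of timestamps, one per second. Your requirement that "the time difference must be more than 10 seconds" means keeping the first timestamp and then skipping the next 9 records, but the record after those is 10 seconds away from the first one, so it should be kept. This means the problem cannot be solved by comparing only the current row with the previous row. Instead, you must compare the current row with the last row that will be kept: if the difference is less than 10 seconds, delete the row, otherwise keep it.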
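The point above can be checked with a small self-contained sketch. It needs only a PostgreSQL session: the `ticks` and `flagged` names are made up for illustration, and `generate_series` produces 30 timestamps one second apart. Comparing each row only with its predecessor flags every row except the first, whereas the rule described above would keep the rows at 0, 10 and 20 seconds.

```sql
-- Hypothetical data: 30 timestamps, one per second (not part of the original table)
WITH ticks AS (
    SELECT dt
    FROM generate_series(timestamp '2018-01-01 00:00:00',
                         timestamp '2018-01-01 00:00:29',
                         interval '1 second') AS g(dt)
),
flagged AS (
    -- small_gap is NULL for the first row, true when the previous row is < 10 s away
    SELECT dt,
           dt - lag(dt) OVER (ORDER BY dt) < interval '10 seconds' AS small_gap
    FROM ticks
)
SELECT count(*)                                      AS total_rows,
       count(*) FILTER (WHERE small_gap)             AS flagged_by_lag,
       count(*) FILTER (WHERE small_gap IS NOT TRUE) AS kept_by_lag
FROM flagged;
```

With this data the lag-based comparison keeps a single row out of 30, which is why the decision has to depend on the last row that was actually kept.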
So... plpgsql.
DROP TABLE IF EXISTS foo;
CREATE TABLE foo( id SERIAL PRIMARY KEY,dt TIMESTAMP WITHOUT TIME ZONE,number INT NOT NULL,content TEXT NOT NULL);
-- CSV format is used here so that the column separators stay visible
\copy foo(dt,number,content) FROM stdin WITH (FORMAT csv)
2018-01-01 02:49:04,1,spring
2018-01-01 02:49:10,1,spring
2018-01-01 02:49:11,1,spring
2018-01-01 02:49:12,1,spring
2018-01-01 02:49:13,1,spring
2018-01-01 02:49:14,1,spring
2018-01-01 02:49:15,1,spring
2018-01-01 02:49:16,1,spring
2018-01-01 02:49:17,1,spring
2018-01-01 02:49:18,1,spring
2018-01-01 02:49:19,1,spring
2018-01-01 02:49:20,1,spring
2018-01-01 02:49:21,1,spring
2018-01-01 02:49:22,1,spring
2018-01-01 02:49:24,1,spring
2018-01-01 02:49:29,1,summer
2018-01-01 02:49:44,1,spring
2018-01-01 02:49:49,1,spring
2018-01-01 02:49:50,1,winter
2018-01-01 02:49:51,1,spring
\.
CREATE OR REPLACE FUNCTION foo_del( )
RETURNS SETOF INT
AS $$
DECLARE
row foo%ROWTYPE;
last_row foo%ROWTYPE;
BEGIN
-- Order by content as well, so all rows of one (number,content) group are consecutive
FOR row IN SELECT * FROM foo ORDER BY number,content,dt
LOOP
IF last_row.id IS NULL OR row.number != last_row.number OR row.content != last_row.content
OR row.dt >= last_row.dt THEN
last_row := row;
last_row.dt := last_row.dt + '10 SECOND'::INTERVAL;
ELSE
IF row.dt < last_row.dt THEN
RETURN NEXT row.id;
END IF;
END IF;
END LOOP;
END;
$$ LANGUAGE PLPGSQL;
SELECT * FROM foo LEFT JOIN (SELECT * FROM foo_del()) d ON (foo.id=d.foo_del) ORDER BY id;
DELETE FROM foo WHERE id IN (SELECT * FROM foo_del());
SELECT * FROM foo ORDER BY id;
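As a sanity check after the DELETE, a self-join can look for any remaining pair of rows in the same (number, content) group that are less than 10 seconds apart. This is only a sketch against the `foo` table defined above; if the cleanup worked as described, it should return no rows.

```sql
-- Any row returned here is a pair that still violates the 10-second rule
SELECT a.id AS earlier_id,
       b.id AS later_id,
       a.dt AS earlier_dt,
       b.dt AS later_dt
FROM foo a
JOIN foo b
  ON  b.number  = a.number
  AND b.content = a.content
  AND b.dt > a.dt
  AND b.dt < a.dt + interval '10 seconds'
ORDER BY a.number, a.content, a.dt;
```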