跨多个依赖条件在 data.table 中高效索引/加入停止检测算法

问题描述

编辑：可用的真实数据集here

感谢

Wang、Rui、Fanglin Chen、Zhenyu Chen、Tianxing Li、Gabriella Harari、Stefanie Tignor、Xia Zhou、Dror Ben-Zeev 和 Andrew T. Campbell。 “StudentLife：评估大学生使用智能手机的心理健康、学习成绩和行为趋势。”在 ACM 无处不在计算会议的论文集。 2014.

说明

我正在进行一项模拟研究，其中我根据相对简单的标准对位置数据（纬度/经度坐标）执行停止检测。

如果在 A 之后存在时间戳至少为 180 秒的另一个位置 (B)，并且如果 A 和 B 之间的所有位置与 A 的距离小于或等于 80，则一个位置 (A) 是一个站点米。

我已尝试减少数据以使其仍然有效但不需要实际坐标。

data <- data.table(id = c(1,2,3,4,5,6,7,8,9,10),latlon = c(0,50,80,90,100,190,110,110),time = c(0,60,120,180,240,300,360,420,480,520))

id 1 不是停靠点，因为时间差 > 180 (id 5) 的第一个位置的经纬度距离为 90。

id 2 是一个停止点，因为它与第一个时间差 > 180 (id 6) 的所有位置之间的距离都小于 80 (0,30,40,50）。

id 6 不是停止，因为即使 id 10 在时间上相差 > 180，但落在之间的 id 7 的距离大于 80。

id 8 不是停靠点，因为在至少 180 秒之后没有位置。

最终，我需要能够贪婪地分配“停止 ID”，例如，如果我发现 id 2 具有满足距离要求的点，直到 id 7，位置 ID 为 2:7停止 ID 为 2。

矩阵和for循环

如果我运行这个：

nrows <- nrow(data)

latlon_dist <- outer(data$latlon,data$latlon,`-`)
latlon_dist[upper.tri(latlon_dist)] <- NA
time_window <- outer(data$time,data$time,`-`)
time_window[upper.tri(time_window)] <- NA

foo <- function(x){
  mindist <- min(which(x[,1] > 80),nrows)
  if (mindist >= min(which(x[,2] > 180),nrows + 1)) mindist else NA
}

bar <- array(c(latlon_dist,time_window),dim = c(nrows,nrows,2))


apply(bar,foo)

它返回阈值 > NA 7 7 NA NA NA NA NA NA NA，我可以在 for 循环中使用它来适当地设置停止 ID。

threshholds <- apply(bar,foo) - 1

prevIoUs_threshhold <- 0
for (i in seq_along(threshholds)) {
  current_threshhold <- threshholds[i]
  
  if (!is.na(current_threshhold) && current_threshhold > prevIoUs_threshhold) {
    data[i:current_threshhold,stop_id := i]
    prevIoUs_threshhold <- current_threshhold
  }
}

此时，这是我能够保证准确性的唯一方法。我尝试过的所有其他方法我都认为是正确的，只是发现它的行为与这种情况不同。但是，正如您可能想象的那样，这是非常低效的，并且在我的模拟研究中运行了 116,000 次。

我的假设是处理这个问题的最好方法是在 data.table 中使用非等值连接。

当数据集中的行数使数组过于占用内存时，我目前正在运行的另一个实现功能会更好。我不会把它翻译成处理数据，但它在这里是为了给任何人提供想法。我把它放在一个 while 循环中，这样当它已经为多个点分配了一个 stop_id 时，它可以跳过一些迭代。如果点 1:7 都属于 stop_id 1，则它们本身不被视为候选停靠点，我们将继续在第 8 点再次进行测试。从技术上讲，它返回不同的解决方案，但“足够接近”的停靠点稍后会合并这个过程，所以最终的结果不太可能有太大的不同。

For 循环，无矩阵

stopFinder <- function(dt){
  
  nrows <- nrow(dt)
  
  if (nrows < 20000){
    return(quickStopFinder(dt))
  }
  i <- 1
  remove_indices <- 0
  while (i < nrows) {
    this_ends  <- dt[!remove_indices,Position(
                       geodist_vec(rep(longitude[1],.N),rep(latitude[1],longitude,latitude,paired = TRUE),f = function(x) x > 80,nomatch = .N + 1) ] + i - 1
    # A) Do some number of points occur within the distance?
    # B) If so,is it at least three minutes out?
    if (this_ends > (i + 1) && dt[c(i,(this_ends - 1)),timestamp[.N] > time_window[1]]) {
      # Last index is the one before the distance is broken
      last_index_of_stop <- this_ends - 1
      
      # Next run,we will remove all prior considerations
      remove_indices <- c(1:last_index_of_stop)
      
      # Set the point itself
      dt[i,`:=`(candidate_stop = TRUE,stop_id = id,within_stop = TRUE)]
      # Set the attached points
      dt[(i + 1):last_index_of_stop,`:=`(within_stop = TRUE,stop_id = i)]
      
      # Start iterating again on the point that broke the distance
      i <- this_ends
    } else {
      # If no stop,move on and leave out this point
      remove_indices <- c(1:i)
      i <- i + 1
    }
  }
  dt[]
}

quickStopFinder 或多或少是我一开始分享的实现，它占用大量内存且速度较慢，但比 stopFinder 慢一些。

以前，我有这样的东西作为基础，但它需要很多后续步骤，并不总是给我想要的结果，但我会添加它以供后代使用。

  res <- dt[dt,on = .(timestamp >= timestamp_dup,timestamp <= time_window)]
  res[,dist := geodist_vec(x1 = longitude,y1 = latitude,x2 = i.longitude,y2 = i.latitude,paired = TRUE,measure = "haversine")]
  res[,candidate_stop := all(dist <= 80),i.id]

新的真实数据

使用真实数据中的示例进行编辑：

这可以处理连接的情况，但增长得太快太快了。数据少时速度很快。

sm2 <- read.csv(file = "http://daniellemc.cool/sm.csv",row.names = NULL)
sm <- copy(sm2)
setDT(sm)
sm <- sm[,.(timestamp,id)]
sm[,timestamp := as.POSIXct(timestamp)]
sm[,id2 := id]

# This is problematic on my data because of how quickly it grows.
test <- sm[sm,on = .(id >= id)]
test[,i.id2 := NULL]
setnames(test,c("time.2","longitude.2","latitude.2","id.1","id.2","time.1","longitude.1","latitude.1"))


# Time and distance differences calculated between each pair
test[,distdiff := geodist_vec(longitude.1,latitude.1,longitude.2,latitude.2,paired = TRUE)]
test[,timediff := time.2 - time.1]

# Include the next distance to make sure there's at least one within distance and 
# over 180 timediff.
test[,nextdistdiff := shift(distdiff,-1),id.1]

# Are all distances within 180 sec within 80,and is the next following also < 80
test[,dist_met := FALSE]
test[timediff < 180,dist_met := all(distdiff < 80 & nextdistdiff < 80),id.1]
test[,dist_met := any(dist_met),id.1]

# Test how many occur consecutively 
# This keeps us from having > 80 dist but then coming back within 80
test[,consecutive := FALSE]
test[distdiff < 80,consecutive :=  c(TRUE,cummin(diff(id.2) == 1) == 1),id.1]
test[consecutive == TRUE & dist_met == TRUE,stop_id := min(id.1),id.2]
test[test[consecutive == TRUE & dist_met == TRUE],stop_id := i.stop_id,on = .(id.1 = id.2)]
test <- unique(test[,.(stop_id,id.1)])

# Join it back to the data.
sm[test,stop_id := stop_id,on = .(id = id.1)]

解决方法

使用 data.table 的非对等连接功能，您可以将数据连接到自身，同时避免过于昂贵的笛卡尔积。

由于 data.table 只允许 >、< 或 =，连接首先在矩形区域上完成，然后过滤出适当的距离。根据您提供的真实数据，这使计算量减少了 7 倍。

library(data.table)
library(geosphere)

data <- copy(sm)

minduration <- 180
maxdistance <- 80

data[,latmin := destPoint(cbind(longitude,latitude),b = 180,d=maxdistance)[,2]]
data[,latmax := destPoint(cbind(longitude,b = 0,2]]

data[,lonmin := destPoint(cbind(longitude,b = 270,1]]
data[,lonmax := destPoint(cbind(longitude,b = 90,1]]

data[,timestampmin := timestamp+minduration]

# Cross product with space and time windows
cross <- data[data,.(i.id,x.id,i.latitude,i.longitude,x.latitude,x.longitude,dist = distGeo(cbind(x.longitude,x.latitude),cbind(i.longitude,i.latitude)),i.timestamp,x.timestamp),on=.(timestamp>timestampmin,longitude >= lonmin,longitude<=lonmax,latitude >= latmin,latitude <= latmax)][
                    dist<maxdistance]

# Summarizing the results
cross[,.(keep=cumsum(fifelse(diff(x.id-i.id)==1,1,NA_integer_))),by=i.id][
      !is.na(keep),.(startid = i.id,nextid = i.id+keep)][
      !(startid %in% nextid)][,.(maxid=max(nextid)),by=startid][,.(stopid = min(startid)),by=maxid]

     maxid stopid
  1:     6      1
  2:    18     10
  3:    26     22
  4:    33     28
  5:    48     40
 ---             
162:  4273   4269
163:  4276   4274
164:  4295   4294
165:  4303   4301
166:  4306   4305

data.table data.table non-equi-join r r self-join