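Problem description
I am building large directed graphs (using igraph from R) and have found a strange issue in which vertices are apparently duplicated for certain vertex names. The issue does not occur in small graphs and only appears once the vertex names reach 1e+05. There is a clear regularity to which vertices get duplicated. To jump ahead, the vertex duplication looks like this (produced in section 2 of the code below):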
    name_vertex id_cell id_vertex
 1:      100000  100000     97355
 2:       1e+05  100000   1435205
 3:      200000  200000    197106
 4:       2e+05  200000   1435206
 5:      400000  400000    396605
 6:       4e+05  400000   1435207
 7:      500000  500000    496356
 8:       5e+05  500000   1435208
 9:      700000  700000    695855
10:       7e+05  700000   1435209
11:      800000  800000    795606
12:       8e+05  800000   1435210
13:     1000000 1000000    995105
14:       1e+06 1000000   1435211
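The duplication starts when 1e+05 is reached: a duplicate is generated for that vertex and for every subsequent vertex of the form xe+0n, where x is in 1:9 and n >= 5. (Note that, by construction, this graph has no vertex valued 3e+05. That cell lies on the matrix margin, which is why it is absent.)
All the x0.. versions of the vertices hold the outgoing edges, and all the xe+0.. versions hold the incoming edges.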
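As a small illustration of why both rows of each pair in the table share one id_cell, the sketch below parses the two spellings of the same number. The vector `names_seen` is a made-up stand-in for the graph's vertex names, not output from the real graph.

```r
# Hypothetical stand-in for two vertex names taken from the graph
names_seen <- c("100000", "1e+05")

# Distinct as strings, so they can label two different vertices
names_seen[1] == names_seen[2]   # FALSE

# Both parse to the same integer, so id_cell collides
ids <- as.integer(names_seen)
ids                              # 100000 100000
duplicated(ids)                  # FALSE TRUE
```

This is the same check that section (3) of the code applies to the full vertex table via duplicated(id_cell).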
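Reproducible example: (Note: the way I generate the adjacency data frame owes more to the pipeline I have been using to build graphs for my use case. The issue could presumably be reproduced more directly.)
The code below generates a matrix, identifies the adjacencies of each cell, and then constructs a graph from them. Cells on the matrix margin are assigned a value of 0 so that they are removed from the adjacency table (to prevent adjacencies wrapping around the edges).
There are three sections:
(1) a run with a 100x100 matrix: correct behaviour
(2) a run with a 1200x1200 matrix: duplication
(3) unpicking the duplication issue
Note: generating the graph in (2) takes around 30 seconds and 3-4 GB of memory.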
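To make the adjacency arithmetic and the wrap-around problem concrete, here is a minimal sketch on a 5x5 matrix. The helper `neighbours_of` is a hypothetical, trimmed-down copy of the get_adjacent function defined in the code that follows; it assumes R's column-major cell numbering.

```r
# Hypothetical helper mirroring get_adjacent(): the 8 neighbours of a cell
# in a matrix whose cells are numbered column by column
neighbours_of <- function(cells, n_row) {
  c(cells - n_row - 1, cells - n_row, cells - n_row + 1,
    cells - 1, cells + 1,
    cells + n_row - 1, cells + n_row, cells + n_row + 1)
}

m <- matrix(seq_len(25), nrow = 5)

# Interior cell 13 (row 3, column 3): all offsets point at true neighbours
centre <- sort(neighbours_of(13L, nrow(m)))
centre                                    # 7 8 9 12 14 17 18 19

# Margin cell 5 (row 5, column 1): the "+1" offset lands on cell 6,
# which is row 1 of column 2, i.e. it wraps around the bottom edge
wrapped <- neighbours_of(5L, nrow(m))
6 %in% wrapped                            # TRUE
c(row(m)[6], col(m)[6])                   # 1 2
```

Zeroing the margin cells, as the code below does, keeps such wrapped pairs out of the adjacency table.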
# packages
library(data.table); library(igraph)
# function to get adjacent cells in a matrix
get_adjacent <- function(cells, n_row, n_col) {
  adjacencies_i <- c(cells - n_row - 1, cells - n_row, cells - n_row + 1,
                     cells - 1, cells + 1,
                     cells + n_row - 1, cells + n_row, cells + n_row + 1)
  return(adjacencies_i)
}
# function to get the margins of a matrix (i.e. 1-deep outer margin of cells)
get_margins <- function(matrix) {
  dims <- dim(matrix)
  bottom_right <- prod(dims)
  top_right <- (bottom_right - dims[1])
  c(1:dims[1],                          # first column
    top_right:bottom_right,             # last column
    seq(1, top_right, dims[1]),         # top row
    seq(dims[1], bottom_right, dims[1])) # bottom row
}
# (1) Before creating the failure case, produce a much smaller graph that
# has the correct behaviour
# generate a matrix of 1-valued cells
test_mat <- matrix(1,ncol=100,nrow=100)
# remove edge cells to prevent the adjacencies wrapping around the edges
test_mat[get_margins(test_mat)] <- 0
# plot: all black cells are those that should be represented in the graph,and
# each of these cells should each be linked to their immediately adjacent neighbours
# (including diagonals - see get_adjacent function)
image(test_mat,asp=1,col=c("red","black"))
# calculate the adjacency dataframe to calculate a graph from
permitted_cells <- which(test_mat[] == 1)
n_row <- dim(test_mat)[1]
n_col <- dim(test_mat)[2]
# full set of adjacencies
adj <- data.table(from = rep(permitted_cells,(1*2 + 1)^2 - 1),to = get_adjacent(permitted_cells,n_row,n_col))
# remove those that are 0-valued
adj_permitted <- adj[to %in% permitted_cells,]
# calculate graph
g <- graph_from_data_frame(adj_permitted[,list(from,to)],directed = T)
# get vertex names
vertex_names <- names(V(g))
graph_vertices <- data.table(name_vertex = vertex_names,id_cell = as.integer(vertex_names),id_vertex = 1:length(vertex_names))
setorder(graph_vertices,id_cell)
# looks good: same number of vertices in graph as there are 1-valued cells in the
# original matrix
print(paste0("n_vertices: ",nrow(graph_vertices)))
print(paste0("n_cells: ",sum(test_mat)))
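## (2) failure case. Code is identical to the above, save for the dimensions of
## the matrix being much larger (1200 rather than 100), and the image() function
## is commented out.
# generate a matrix of 1-valued cells
test_mat <- matrix(1,ncol=1200,nrow=1200)
# remove edge cells to prevent the adjacencies wrapping around the edges
test_mat[get_margins(test_mat)] <- 0
# plot (commented out for the large matrix)
# image(test_mat,asp=1,col=c("red","black"))
# calculate the adjacency dataframe to calculate a graph from
permitted_cells <- which(test_mat[] == 1)
n_row <- dim(test_mat)[1]
n_col <- dim(test_mat)[2]
# full set of adjacencies
adj <- data.table(from = rep(permitted_cells,(1*2 + 1)^2 - 1),to = get_adjacent(permitted_cells,n_row,n_col))
# remove those that are 0-valued
adj_permitted <- adj[to %in% permitted_cells,]
# calculate graph
g <- graph_from_data_frame(adj_permitted[,list(from,to)],directed = T)
# get vertex names
vertex_names <- names(V(g))
graph_vertices <- data.table(name_vertex = vertex_names,id_cell = as.integer(vertex_names),id_vertex = 1:length(vertex_names))
setorder(graph_vertices,id_cell)
# there are 7 more vertices than there are 1-valued cells
print(paste0("n_vertices: ",nrow(graph_vertices)))
print(paste0("n_cells: ",sum(test_mat)))
print(paste0("n_extra_vertices: ",nrow(graph_vertices) - sum(test_mat)))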
# (3) What are these extra vertices?
# get duplicated vertices
duplicated_vertices <-
graph_vertices[id_cell %in% graph_vertices[duplicated(id_cell),id_cell]]
setorder(duplicated_vertices,id_cell,id_vertex)
# the 7 additional vertices arise through duplication
nrow(duplicated_vertices)
print(duplicated_vertices)
# xe+.. version has the incoming edges
incoming <- adjacent_vertices(g,duplicated_vertices$id_vertex,mode="in")
incoming[unlist(lapply(incoming,function(x) length(x) != 0))]
# x0.. version has outgoing edges
outgoing <- adjacent_vertices(g,duplicated_vertices$id_vertex,mode="out")
outgoing[unlist(lapply(outgoing,function(x) length(x) != 0))]
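(Finally) to the question. What is going on here? Is there anything I can do to prevent this behaviour? My current workaround is to take the incoming edges received by the xe+0.. versions of the vertices, add them to the corresponding x0.. versions, and then delete the xe+0.. versions.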
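Solution
The problem appears to be caused by R (or igraph) treating the two forms 100000 and 1e+05 as equivalent. I managed to resolve it by adding the statement options(scipen=99) at the start of the script, which stops R from using e notation.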
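A possible complementary approach, offered as an assumption rather than a confirmed diagnosis: in the question's code the `from` column comes from which() and is integer, while the `to` column is computed with the double literal 1 and is therefore double. If vertex names are derived by converting each column to character, the two types may be spelled differently for the same value. The sketch below uses made-up one-element columns to show the type difference and one way to align the spellings.

```r
# Made-up stand-ins for one entry of each column in the question
from <- 100000L        # integer, as returned by which()
to   <- 99999L + 1     # integer + double literal gives a double

class(from)            # "integer"
class(to)              # "numeric"

# Integers are written out in full when converted to character
name_from <- as.character(from)            # "100000"

# Coercing the double column to integer first yields the same spelling.
# (How as.character(to) itself prints may depend on R version and options,
# so it is deliberately not shown here.)
name_to <- as.character(as.integer(to))

identical(name_from, name_to)              # TRUE
```

Applied to the question's pipeline, this would amount to something like adj_permitted[, to := as.integer(to)] before calling graph_from_data_frame; that line is untested against the original data and assumes all cell ids fit in R's integer range.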