Optimizing a MongoDB aggregation query on a large indexed collection

Problem Description

I have 20 million documents in my MongoDB collection. It is currently running on an M30 MongoDB instance with 7.5 GB RAM and 40 GB of disk.

The data is stored in the collection like this:

{
  _id: xxxxx,
  id: 1 (int),
  from: xxxxxxxx (int),
  to: xxxxxx (int),
  status: xx (int)
  ...
},
{
  _id: xxxxx,
  id: 2 (int),
  status: xx (int)
  ...
}
... and so on

"id" has a unique index, and "from" is indexed on this collection.
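For reference, a minimal sketch of how such indexes can be created with the PHP library (the exact calls are not part of my original setup notes, just an illustration of the index definitions described above):

    // unique index on "id"
    $collection->createIndex(['id' => 1], ['unique' => true]);
    // plain single-field index on "from"
    $collection->createIndex(['from' => 1]);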

I am running a query that groups on "to" and returns the maximum "id" for each group, sorted by that maximum "id", for a given condition on "from":

$collection->aggregate([
    ['$project' => ['id' => 1, 'to' => 1, 'from' => 1]],
    ['$match' => [
        '$and' => [
            ['from' => xxxxxxxxxx],
            ['status' => xx],
        ],
    ]],
    ['$group' => [
        '_id' => '$to',
        'max_revision' => ['$max' => '$id'],
    ]],
    ['$sort' => ['max_revision' => -1]],
    ['$limit' => 20],
]);

The above query runs fine (~2 seconds) on the "from" index for small data sets, e.g. when about 50-100k documents in the collection share the same "from" value. But in cases where around 2 million documents share the same "from" value, it takes more than 10 seconds to execute and return the result.

A simple example:

Case 1: if the same query is executed with "from" = 12345, it runs within 2 seconds, because 12345 occurs about 50k times in the collection.

Case 2: if the query is executed with "from" = 98765, it takes more than 10 seconds, because 98765 occurs about 2 million times in the collection.

Edit: explain output below:

{
  "command": {
    "aggregate": "mycollection",
    "pipeline": [
      { "$project": { "id": 1, "to": 1, "from": 1 } },
      { "$match": { "$and": [ { "from": { "$numberLong": "12345" } }, { "status": 22 } ] } },
      { "$group": { "_id": "$to", "max_revision": { "$max": "$id" } } },
      { "$sort": { "max_revision": -1 } },
      { "$limit": 20 }
    ],
    "allowDiskUse": false,
    "cursor": {},
    "$db": "mongo_jc",
    "lsid": { "id": { "$binary": "8LktsSkpTjOzF3GIC+m1DA==", "$type": "03" } },
    "$clusterTime": {
      "clusterTime": { "$timestamp": { "t": 1597230985, "i": 1 } },
      "signature": {
        "hash": { "$binary": "PHh4eHh4eD4=", "$type": "00" },
        "keyId": { "$numberLong": "6859724943999893507" }
      }
    }
  },
  "planSummary": [ { "IXSCAN": { "from": 1 } } ],
  "keysExamined": 1246529,
  "docsExamined": 1246529,
  "hasSortStage": 1,
  "cursorExhausted": 1,
  "numYields": 9747,
  "nreturned": 0,
  "queryHash": "29DAFB9E",
  "planCacheKey": "F5EBA6AE",
  "reslen": 231,
  "locks": {
    "ReplicationStateTransition": { "acquireCount": { "w": 9847 } },
    "Global": { "acquireCount": { "r": 9847 } },
    "Database": { "acquireCount": { "r": 9847 } },
    "Collection": { "acquireCount": { "r": 9847 } },
    "Mutex": { "acquireCount": { "r": 100 } }
  },
  "storage": {
    "data": {
      "bytesRead": { "$numberLong": "6011370213" },
      "timeReadingMicros": 4350129
    },
    "timeWaitingMicros": { "cache": 2203 }
  },
  "protocol": "op_msg",
  "millis": 8548
}

Solution

For this specific case, the mongod query executor can use an index for the initial match, but not for the sort.

If you reorder and modify the stages, an index on {from: 1, status: 1, id: 1} can be used for both the match and the sort:

$collection->aggregate([
    ['$match' => [
        '$and' => [
            ['from' => xxxxxxxxxx],
            ['status' => xx],
        ],
    ]],
    ['$sort' => ['id' => -1]],
    ['$project' => ['id' => 1, 'to' => 1, 'from' => 1]],
    ['$group' => [
        '_id' => '$to',
        'max_revision' => ['$first' => '$id'],
    ]],
    ['$limit' => 20],
]);

That way it should be able to combine the $match and $sort stages into a single index scan.
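For that combined scan to be possible, the compound index mentioned above has to exist on the collection. A minimal sketch of creating it with the PHP library (this call is an illustration, not something already in your setup):

    // Compound index: equality on "from" and "status", then "id" for the sort.
    // With "from" and "status" fixed by the $match, the pipeline's
    // ['$sort' => ['id' => -1]] can traverse this index in reverse order.
    $collection->createIndex(['from' => 1, 'status' => 1, 'id' => 1]);

Because documents then arrive already sorted by "id" descending, the $group stage can take $first instead of $max to pick the highest revision for each "to" value.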