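Performance of LINQ queries on large data sets

Problem description

I am running a method that "transactionalizes" data stored in a ConcurrentQueue<T>. In CPU profiling, the main hit appears to be: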

foreach (Item inSequence in items.Where(w => w.SequenceNumber == i.SequenceNumber && w.Device == i.Device)) {}

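With 1,000 and 10,000 items it is actually very fast. At 100,000 items, performance becomes critical: this particular LINQ query goes from about 4.5% of total runtime CPU to more than 58%. I assume the degradation is mainly caused by the size of the ConcurrentQueue, but I am not sure what to do about it. If avoiding the LINQ query solves the problem, I can do that; it is simply what I am doing now. Is there another concurrent type that would be more efficient?

It is a CQ because the data is produced and read asynchronously. In this particular method, however, the work happens after the data has been built and before it is read out, and it runs on a single thread.

A very loose example is here: https://dotnetfiddle.net/hjDOva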

using System;
using System.Diagnostics;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    static int count = 100000;

    public static void Main()
    {
        var items = new ConcurrentQueue<Item>();
        var r = new Random();
        for (int i = 0; i < count; i++)
        {
            items.Enqueue(new Item());
        }

        var sw = Stopwatch.StartNew();
        foreach (Item i in items.DistinctBy(d => new { d.SequenceNumber,d.Device }))
            foreach (Item inSequence in items.Where(w => w.Device == i.Device && w.SequenceNumber == i.SequenceNumber))
            {

            }

        Console.WriteLine(sw.Elapsed);
    }
}

public static class Extensions
{
    public static IEnumerable<TSource> DistinctBy<TSource,TKey>(this IEnumerable<TSource> source,Func<TSource,TKey> keySelector)
    {
        HashSet<TKey> seenKeys = new HashSet<TKey>();
        foreach (TSource element in source)
        {
            if (seenKeys.Add(keySelector(element)))
            {
                yield return element;
            }
        }
    }
}

public class Item
{
    #region Fields
    protected bool fixDates;
    protected string randomSerial;
    protected decimal amount;
    protected string device;
    protected DateTime depositTime;
    public int SequenceNumber = -1;
    [NonSerialized()]
    protected System.Random rnd = new Random(Int32.Parse(Guid.NewGuid().ToString().Substring(0,8),System.Globalization.NumberStyles.HexNumber));
    #endregion

    #region Properties
    public bool FixDates
    {
        get
        {
            return this.fixDates;
        }

        set
        {
            this.fixDates = value;
        }
    }

    public string Amount
    {
        get
        {
            return this.amount.ToString();
        }

        set
        {
            this.amount = Convert.ToDecimal(value);
        }
    }

    public string RandomSerial
    {
        get { return randomSerial; }
        set { randomSerial = value; }
    }

    public string Device
    {
        get { return this.device; }
        set { this.device = value; }
    }

    public DateTime DepositTime
    {
        get { return this.depositTime; }
        set { this.depositTime = value; }
    }
    #endregion

    #region Constructors
    public Item()
    {
        fixDates = false;
        RandomSerial = Guid.NewGuid().ToString().Substring(0,8);
        this.amount = 5.00m;
        this.device = "IC" + rnd.Next(6).ToString();
        this.depositTime = DateTime.Now;
        this.SequenceNumber = rnd.Next(10);
    }
    #endregion
}

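However, the fiddle does not provide the memory needed for 100,000 items.

Regarding the question about using a CQ: yes, I understand that a queue is not ideal for this. The tool generates data to test imports for various product types. Only one product requires this method, Transactionalize(). Most of the time this code is not used.

It is a CQ because the system creates the objects in parallel (which significantly improves performance when it happens), and in most cases they are also dequeued in parallel.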

Solution

Assuming the purpose of the following code is to process the items in groups, where each group shares the same SequenceNumber and Device,

foreach (Item i in items.DistinctBy(d => new { d.SequenceNumber,d.Device }))
    foreach (Item inSequence in items
        .Where(w => w.Device == i.Device && w.SequenceNumber == i.SequenceNumber))
    {

    }

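...you can do the same thing more efficiently with the LINQ method GroupBy, which builds all groups in a single pass over the queue instead of rescanning the whole queue once per distinct key: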

var groups = items.GroupBy(i => (i.SequenceNumber, i.Device));
// SequenceNumber is an int and Device is a string, so the key is (int, string).
foreach (IGrouping<(int, string), Item> group in groups)
    foreach (Item inSequence in group)
    {

    }

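Note that I used a lightweight ValueTuple as the key instead of an anonymous type. A ValueTuple is a struct, so the keys do not create extra objects for the garbage collector to reclaim.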
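
As a rough illustration of the single-pass grouping, here is a minimal, self-contained sketch. It is not the asker's code: the `Rec` class, the `Build` and `CountPerGroup` helpers, and the fixed random seed are hypothetical stand-ins that only mimic the question's value ranges (10 sequence numbers, 6 devices).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public static class GroupByDemo
{
    // Hypothetical stand-in for the question's Item class.
    public sealed class Rec
    {
        public int SequenceNumber;
        public string Device = "";
    }

    // Builds a queue of n records using the same value ranges as the question.
    // The fixed seed is an assumption, used only to make runs repeatable.
    public static ConcurrentQueue<Rec> Build(int n, int seed = 42)
    {
        var rnd = new Random(seed);
        var queue = new ConcurrentQueue<Rec>();
        for (int k = 0; k < n; k++)
        {
            queue.Enqueue(new Rec { SequenceNumber = rnd.Next(10), Device = "IC" + rnd.Next(6) });
        }
        return queue;
    }

    // One pass over the source: every record lands in exactly one group,
    // keyed by a (SequenceNumber, Device) value tuple.
    public static Dictionary<(int, string), int> CountPerGroup(IEnumerable<Rec> items)
    {
        return items
            .GroupBy(r => (r.SequenceNumber, r.Device))
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var items = Build(100000);
        var counts = CountPerGroup(items);
        Console.WriteLine($"groups: {counts.Count}, items: {counts.Values.Sum()}");
    }
}
```

With at most 60 distinct keys, the nested DistinctBy/Where version enumerates the queue up to 61 times, while the grouped version enumerates it once; the group sizes always sum to the number of items.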

If you also want to be able to search for a specific group later, very efficiently, use the similar ToLookup instead of GroupBy.
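
A minimal sketch of the ToLookup variant might look like the following. The `Rec` class and the `Index` helper are hypothetical names introduced for illustration; only `ToLookup`, the `ILookup` indexer, and `Contains` are actual LINQ members.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupDemo
{
    // Hypothetical stand-in for the question's Item class.
    public sealed class Rec
    {
        public int SequenceNumber;
        public string Device = "";
    }

    // ToLookup executes immediately and indexes every record by its key.
    public static ILookup<(int, string), Rec> Index(IEnumerable<Rec> items)
    {
        return items.ToLookup(r => (r.SequenceNumber, r.Device));
    }

    public static void Main()
    {
        var items = new List<Rec>
        {
            new Rec { SequenceNumber = 1, Device = "IC0" },
            new Rec { SequenceNumber = 1, Device = "IC0" },
            new Rec { SequenceNumber = 2, Device = "IC3" },
        };
        var lookup = Index(items);

        // Indexing by key returns that group's records;
        // a key that is not present yields an empty sequence rather than throwing.
        Console.WriteLine(lookup[(1, "IC0")].Count());  // 2
        Console.WriteLine(lookup[(9, "IC9")].Count());  // 0
        Console.WriteLine(lookup.Contains((2, "IC3"))); // True
    }
}
```

Unlike GroupBy, which is deferred and regroups the source each time it is enumerated, the lookup is built once and can then be queried repeatedly by key.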
