Getting percentiles of a dataset grouped by month

Answer

Ew. This is a real brain teaser. First, the table schema I used for testing is:

Create Table scores 
( 
    Id int not null identity(1,1) primary key clustered
    , [Date] datetime not null
    , score int not null
)

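I first calculated the values with CTEs in SQL Server 2008 to check my answers, and then built a solution that works in SQL Server 2000. So, in SQL Server 2008, we do the following: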

;With 
    SummaryStatistics As
    (
        Select Year([Date]) As YearNum
            , Month([Date]) As MonthNum
            , Min(score) As Minscore
            , Max(score) As Maxscore
            , Avg(score) As Avgscore
        From scores
        Group By Month([Date]), Year([Date])
    )
    , percentiles As
    (
        Select Year([Date]) As YearNum
            , Month([Date]) As MonthNum
            , score
            , NTile( 100 ) Over ( Partition By Month([Date]), Year([Date]) Order By score ) As percentile
        From scores
    )
    , Reportedpercentiles As
    (
        Select YearNum, MonthNum
            , Min(Case When percentile = 45 Then score End) As percentile45
            , Min(Case When percentile = 55 Then score End) As percentile55
        From percentiles
        Where percentile In(45,55)
        Group By YearNum, MonthNum
    )
Select SS.YearNum, SS.MonthNum
    , SS.Minscore, SS.Maxscore, SS.Avgscore
    , RP.percentile45, RP.percentile55
From SummaryStatistics As SS
    Join Reportedpercentiles As RP
        On  RP.YearNum = SS.YearNum
            And RP.MonthNum = SS.MonthNum
Order By SS.YearNum, SS.MonthNum
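
One caveat with the NTile approach: NTile(100) numbers its buckets 1 to n when a month has n rows and n is below 100, so a month with fewer than 45 rows drops out of the result entirely and one with fewer than 55 rows reports a NULL percentile55. A minimal sketch of an alternative, assuming SQL Server 2005 or later and the same scores table, ranks the rows with Row_Number and compares each rank to the month's row count. The nearest-rank definition used here is an assumption for illustration, not part of the original query, and the names Ranked, RowNum and RowsInMonth are made up for the example.

```sql
-- Sketch only: nearest-rank percentiles per month (assumes SQL Server 2005+).
;With Ranked As
(
    Select Year([Date]) As YearNum
        , Month([Date]) As MonthNum
        , score
        -- Position of this row within its month, lowest score first
        , Row_Number() Over ( Partition By Year([Date]), Month([Date]) Order By score ) As RowNum
        -- Total number of rows in the same month
        , Count(*) Over ( Partition By Year([Date]), Month([Date]) ) As RowsInMonth
    From scores
)
Select YearNum, MonthNum
    -- Lowest score whose rank reaches 45% (or 55%) of the month's rows
    , Min(Case When RowNum * 100.0 / RowsInMonth >= 45 Then score End) As percentile45
    , Min(Case When RowNum * 100.0 / RowsInMonth >= 55 Then score End) As percentile55
From Ranked
Group By YearNum, MonthNum
Order By YearNum, MonthNum
```

Because every month has at least one row whose rank reaches 100% of the count, this version returns a value for every month, at the cost of coarse results when a month holds only a handful of scores.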

Now for a SQL Server 2000 solution. Essentially, the trick is to use a couple of temporary tables to count the occurrences of each score.

If object_id('tempdb..#Working') is not null
    DROP TABLE #Working
GO
Create Table #Working 
    (
    YearNum int not null
    , MonthNum int not null
    , score int not null
    , Occurances int not null
    , Constraint PK_#Working Primary Key Clustered ( MonthNum, YearNum, score )
    )
GO
Insert #Working(MonthNum, YearNum, score, Occurances)
Select Month([Date]), Year([Date]), score, Count(*)
From scores
Group By Month([Date]), Year([Date]), score
GO
If object_id('tempdb..#SummaryStatistics') is not null
    DROP TABLE #SummaryStatistics
GO
Create Table #SummaryStatistics
    (
    MonthNum int not null
    , YearNum int not null
    , score int not null
    , Occurances int not null
    , Cumulativetotal int not null
    , percentile float null
    , Constraint PK_#SummaryStatistics Primary Key Clustered ( MonthNum, YearNum, score )
    )
GO
Insert #SummaryStatistics(YearNum, MonthNum, score, Occurances, Cumulativetotal)
Select W2.YearNum, W2.MonthNum, W2.score, W2.Occurances, Sum(W1.Occurances)-W2.Occurances
From #Working As W1
    Join #Working As W2 
        On W2.YearNum = W1.YearNum
            And W2.MonthNum = W1.MonthNum
Where W1.score <= W2.score
Group By W2.YearNum, W2.MonthNum, W2.score, W2.Occurances

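-- NullIf guards against a divide-by-zero error when a month contains only one
-- distinct score (Total = 0); the percentile is left NULL for such a month.
Update #SummaryStatistics
Set percentile = SS.Cumulativetotal * 100.0 / NullIf(MonthTotal.Total, 0)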
From #SummaryStatistics As SS
    Join    (
            Select SS1.YearNum, SS1.MonthNum, Max(SS1.Cumulativetotal) As Total
            From #SummaryStatistics As SS1
            Group By SS1.YearNum, SS1.MonthNum
            ) As MonthTotal
        On MonthTotal.YearNum = SS.YearNum
            And MonthTotal.MonthNum = SS.MonthNum

Select GeneralStats.*, percentiles.percentile45, percentiles.percentile55
From    (
        Select  Year(S1.[Date]) As YearNum
            , Month(S1.[Date]) As MonthNum
            , Min(S1.score) As Minscore
            , Max(S1.score) As Maxscore
            , Avg(S1.score) As Avgscore
        From scores As S1
        Group By Month(S1.[Date]), Year(S1.[Date])
        ) As GeneralStats
    Join    (
            Select SS1.YearNum, SS1.MonthNum
                , Min(Case When SS1.percentile >= 45 Then score End) As percentile45
                , Min(Case When SS1.percentile >= 55 Then score End) As percentile55
            From #SummaryStatistics As SS1
            Group By SS1.YearNum, SS1.MonthNum 
            ) As percentiles
        On percentiles.YearNum = GeneralStats.YearNum
            And percentiles.MonthNum = GeneralStats.MonthNum
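
To make the cumulative-count trick easier to follow, here is a small self-contained sketch of the self-join on its own. It uses a table variable named @Working and a hypothetical single month containing the scores 2, 4, 4 and 10; both the name and the sample values are assumptions for illustration and do not come from the original data.

```sql
-- Hypothetical one-month sample: scores 2, 4, 4, 10 (illustration only).
Declare @Working Table
    (
    score int not null primary key
    , Occurances int not null
    )

Insert @Working(score, Occurances) Values (2, 1)
Insert @Working(score, Occurances) Values (4, 2)
Insert @Working(score, Occurances) Values (10, 1)

-- Cumulativetotal = number of rows scoring strictly below this score:
-- the running total of occurrences up to and including the score,
-- minus the score's own occurrences.
Select W2.score
    , W2.Occurances
    , Sum(W1.Occurances) - W2.Occurances As Cumulativetotal
From @Working As W1
    Join @Working As W2
        On W1.score <= W2.score
Group By W2.score, W2.Occurances
Order By W2.score
-- Rows returned: (2, 1, 0), (4, 2, 1), (10, 1, 3)
```

Dividing each Cumulativetotal by the largest one (3) gives percentiles of 0, 33.3 and 100, so for this sample the first score at or above the 45 mark is 10.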

Question

I have a SQL table with a whole load of records that look like this:

| Date       | Score |
+ -----------+-------+
| 01/01/2010 |     4 |
| 02/01/2010 |     6 |
| 03/01/2010 |    10 |
  ...
| 16/03/2010 |     2 |

I plot this on a chart, so I get a nice line across the graph showing the score over time. Lovely.

Now, what I need to do is include the average score on the chart, so we can see how it changes over time. I can simply add this to the mix:

SELECT 
    YEAR(SCOREDATE) 'Year',MONTH(SCOREDATE) 'Month',MIN(SCORE) MinScore,AVG(SCORE) AverageScore,MAX(SCORE) MaxScore
FROM SCORES
GROUP BY YEAR(SCOREDATE),MONTH(SCOREDATE) 
ORDER BY YEAR(SCOREDATE),MONTH(SCOREDATE)

No problems so far.

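The problem is, how can I easily calculate the percentiles for each time period? I'm not sure that's the correct phrase. What I need in total is: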

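  • A line on the chart for the scores (easy)
  • A line on the chart for the average (easy)
  • A line on the chart showing the band that 95% of the scores occupy (stumped)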

It's the third one that I don't get. I need to calculate the 5% percentile figures, which I can do one month at a time:

SELECT MAX(SubQ.SCORE) FROM
    (SELECT TOP 45 PERCENT SCORE 
    FROM SCORES
    WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
    ORDER BY SCORE ASC) AS SubQ

SELECT MIN(SubQ.SCORE) FROM
    (SELECT TOP 45 PERCENT SCORE 
    FROM SCORES
    WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
    ORDER BY SCORE DESC) AS SubQ

But I can't work out how to get a table covering all the months.

| Date       | Average | 45% | 55% |
+ -----------+---------+-----+-----+
| 01/01/2010 |      13 |  11 |  15 |
| 02/01/2010 |      10 |   8 |  12 |
| 03/01/2010 |       5 |   4 |  10 |
  ...
| 16/03/2010 |       7 |   7 |   9 |

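At the moment I'm going to have to load all of this into my application and calculate the figures myself, or run a large number of individual queries and collate the results.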