sql-server – 通过在T-SQL中设置种子,从均匀分布生成随机值

我想从T-sql中给定数​​据表的每行的均值分布生成一个随机值,其中mean = 0,标准devation = 1.另外,我想设置一个种子,以确保分析的重现性.以下是不起作用的想法:

>使用具有声明数字的函数RAND()不符合此目标:为数据集的每一行生成相同的随机值.
>这样的解决方案:

SELECT ABS(CAST(CAST(NEWID()AS VARBINARY)AS INT))AS [RandomNumber]

由于不可重复,因此也不能解决问题.

编辑:

表现很重要,因为我的桌子有数亿条记录.

解决方法

这里的主要问题IMHO是怎么看到“可重复性”?或者不同的问题:什么驱动着随机性?只要数据不变,我可以设想一个解决方案,每个运行的每个记录都附加相同的随机数.但是,如果数据发生变化,预计会发生什么?

为了乐趣,我对一个(不是非常有代表性的)100万行的测试表进行了以下测试:

-- seed
SELECT Rand(0)

-- will show the same random number for EVERY record
SELECT Number,blah = Convert(varchar(100),NewID()),random = Rand()
  INTO #test
  FROM master.dbo.fn_int_list(1,1000000)

CREATE UNIQUE CLUSTERED INDEX uq0_test ON #test (Number)

SET NOCOUNT ON

GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP,@c_number int

-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !

-- seed
SELECT @c_number = Rand(0) 

-- update 1 by 1 
DECLARE cursor_no_transaction CURSOR LOCAL STATIC
    FOR SELECT Number
          FROM #test
         ORDER BY Number
OPEN cursor_no_transaction 
FETCH NEXT FROM cursor_no_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE #test 
           SET random = Rand()
         WHERE Number = @c_number

        FETCH NEXT FROM cursor_no_transaction INTO @c_number
    END
CLOSE cursor_no_transaction 
DEALLOCATE cursor_no_transaction 

PRINT 'Time needed (no transaction) : ' + Convert(nvarchar(100),DateDiff(ms,@start_time,CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random),_stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP,@c_number int

BEGIN TRANSACTION

-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !

-- seed
SELECT @c_number = Rand(0) 

-- update 1 by 1 but all of it inside 1 single transaction
DECLARE cursor_single_transaction CURSOR LOCAL STATIC
    FOR SELECT Number
          FROM #test
         ORDER BY Number
OPEN cursor_single_transaction 
FETCH NEXT FROM cursor_single_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE #test 
           SET random = Rand()
         WHERE Number = @c_number

        FETCH NEXT FROM cursor_single_transaction INTO @c_number
    END
CLOSE cursor_single_transaction 
DEALLOCATE cursor_single_transaction 

COMMIT TRANSACTION

PRINT 'Time needed (single transaction) : ' + Convert(nvarchar(100),_stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP

-- update each record (single operation),use the Number column to reseed the Rand() function for every record
UPDATE #test 
    SET random = Rand(Number)

PRINT 'Time needed Rand(Number) : ' + Convert(nvarchar(100),use 'a bunch of fields' to reseed the Rand() function for every record
UPDATE #test 
    SET random = Rand(BINARY_CHECKSUM(Number,blah))

PRINT 'Time needed Rand(BINARY_CHECKSUM(Number,blah)) : ' + Convert(nvarchar(100),_stdev = STDEV(random) FROM #test

结果或多或少是如预期的:

Time needed (no transaction) : 24570 ms.
_avg                   _stdev
---------------------- ----------------------
0.499630943538644      0.288686960086461

Time needed (single transaction) : 14813 ms.
_avg                   _stdev
---------------------- ----------------------
0.499630943538646      0.288686960086461

Time needed Rand(Number) : 1203 ms.
_avg                   _stdev
---------------------- ----------------------
0.499407423620328      0.291093824839539

Time needed Rand(BINARY_CHECKSUM(Number,blah)) : 1250 ms.
_avg                   _stdev
---------------------- ----------------------
0.499715398881586      0.288579510523627

所有这些都是“可重复的”,问题是“可重复”是指你想要的意思.我已经坚持使用AVG()和STDEV()来获取一个粗略的分布概念,我会留给你看看他们是否真的适合账单(如果不是,如何改进它)

1百万行的1.2秒对于100万行IMHO来说听起来不太好.也就是说,如果您的表格包含额外的列,它将占用更多的空间,因此需要更多的时间!

希望这能让你开始…

相关文章

SELECT a.*,b.dp_name,c.pa_name,fm_name=(CASE WHEN a.fm_n...
if not exists(select name from syscolumns where name=&am...
select a.*,pano=a.pa_no,b.pa_name,f.dp_name,e.fw_state_n...
要在 SQL Server 2019 中设置定时自动重启,可以使用 Window...
您收到的错误消息表明数据库 'EastRiver' 的...
首先我需要查询出需要使用SQL Server Profiler跟踪的数据库标...