Randomly select 10 subjects and keep all of their observations

Problem description

I have run into the following problem in SAS. I have a dataset in this format:

[Image: sample of the dataset]

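The dataset consists of 500 IDs, each with a different number of observations. I am trying to randomly select 5 IDs while keeping all of their observations. First, I built a random generator that stores a vector of 10 numbers drawn from the interval [1, 500]. However, things got clumsy when I tried to use that vector to select the IDs matching its random numbers. To be clearer, I want my final result to be a dataset that includes all observations for IDs 1, 10, 43, 22, 67, or any other sequence of 5 numbers.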

Any tips would be greatly appreciated!

Solution

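Based on your question, I assume you already have your 10 random numbers. If they are saved in a table/dataset, you can run a left join by ID between them and your original dataset. This pulls all original observations that share those IDs.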

Let's say your randomly selected numbers are saved in a table called "random_ids". Then you can do the following:

proc sql;
create table want as
select distinct
t1.id,t2.*
from random_ids as t1
left join have as t2 on t1.id = t2.id;
quit;
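
The join above assumes that random_ids already exists. As a minimal sketch of one way to build it (not something the original answer shows), the distinct IDs of have can be shuffled and the first few kept. The table name random_ids and the column id follow the answer; the sample size of 5 is an assumption taken from the question, and the draw is unseeded, so it changes on every run.

```sas
/* Sketch: draw 5 distinct IDs at random from HAVE.            */
/* OUTOBS= limits the rows written to the output table; SAS    */
/* logs a warning that the statement stopped early, which is   */
/* expected here.                                              */
proc sql outobs=5;
  create table random_ids as
  select id
  from (select distinct id from have)
  order by rand('uniform');   /* random order, not reproducible */
quit;
```

Because the IDs are sampled from the data rather than from the range [1, 500], this also works when the ID values are not consecutive integers.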

If your random numbers are not saved in a dataset, you can simply copy them into a where clause, for example:

proc sql;
create table want as
select distinct
*
from have
where id in (1 10 43 22 67) /*here you put the ids you want*/
;
quit;

Best,

Proc SURVEYSELECT is your friend.

data have;
  call streaminit(123);
  do _n_ = 1 to 500;
    id = rand('integer',1e6);
    do seq = 1 to rand('integer',35);
      output;
    end;
  end;
run;

proc surveyselect noprint data=have sampsize=5 out=want;
  cluster id;
run;

proc sql noprint;
  select count(distinct id) into :id_count trimmed from want;
quit;

%put NOTE: &=id_count;
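
The SURVEYSELECT call above draws a different sample on each run. If a repeatable selection is needed, a seed can be supplied. The following is a sketch under that assumption: the seed value 123 and the output name want_seeded are illustrative choices, not part of the original answer, and method=srs simply spells out simple random sampling.

```sas
/* Sketch: same cluster sample of 5 IDs, made reproducible with SEED=. */
proc surveyselect noprint data=have method=srs
                  sampsize=5 seed=123 out=want_seeded;
  cluster id;   /* sample whole IDs, keeping every observation of each */
run;
```

Rerunning this step with the same seed and the same input data should return the same 5 IDs.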

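If you do not have that procedure as part of your SAS license, you can select the sample with the k/n algorithm: each ID is kept with probability k/N, where k is the number of IDs still to be selected and N is the number of IDs not yet examined. Note: the earliest archived post on k/n is a May 1996 SAS-L message, whose code is based on a 1995 article in the SAS Observations magazine.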

proc sql noprint;
  select count(distinct id) into :N trimmed from have;
quit;

proc sort data=have;
  by id;
run;

data want_kn;
  retain N &N k 5;

  if _n_ = 1 then call streaminit(123);

  keep = rand('uniform') < k / N;
  if keep then k = k - 1;

  do until (last.id);
    set have;
    by id;
    if keep then output;
  end;

  if k = 0 then stop;

  N = N - 1;

  drop k N keep;
run;

proc sql noprint;
  select count(distinct id) into :id_count trimmed from want_kn;
quit;

%put NOTE: &=id_count;