Problem description
I have a sales data frame. I need to aggregate df by two columns, ProductID and Day, and sum the values from a different column, Amount, for each aggregated group so that the total is shown. I want to keep the other columns that can be grouped (their values are identical across the grouped rows), in this case just Product. The last column, Store, will not be kept, since its values may differ across the grouped rows. However, I need to add a column, UniqueStores, that counts the unique stores for each group with the same ProductID and Day. For example, the first group, with ProductID = 1 and Day = Monday, has one unique store, "N", so the value would be 1.
I tried to sketch the table here as text but could not format it properly, so the data before aggregation is given as code below.
I tried aggregating with group_by + summarise and with df[, sum, by], but neither keeps the variables that are not supplied as grouping indices. Is there a workaround that avoids manually re-adding every remaining column?
Thanks, and I hope I have made myself clear.
Input:
df <- data.frame(
  ProductID = c(1, 1, 1, 1, 2, 2, 2),
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday", "Friday"),
  Amount = c(5, 5, 3, 7, 6, 9, 7),
  Product = c("Food", "Food", "Food", "Food", "Toys", "Toys", "Toys"),
  Store = c("N", "N", "W", "S", "S", "W", "S")
)
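For reference, printing this data frame shows the pre-aggregation table the question alludes to. Note the input code above was garbled in the question, so the rows here are a reconstruction inferred from the aggregated outputs shown further down:
  ProductID       Day Amount Product Store
1         1    Monday      5    Food     N
2         1    Monday      5    Food     N
3         1   Tuesday      3    Food     W
4         1   Tuesday      7    Food     S
5         2 Wednesday      6    Toys     S
6         2 Wednesday      9    Toys     W
7         2    Friday      7    Toys     S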
Solution
We can group by 'ProductID', 'Day' and 'Product', and inside summarise from dplyr take the sum of 'Amount' and n_distinct (the number of distinct elements) of 'Store'.
library(dplyr)
df %>%
  group_by(ProductID, Day, Product) %>%
  summarise(Amount = sum(Amount), UniqueStores = n_distinct(Store), .groups = 'drop')
# A tibble: 4 x 5
# ProductID Day Product Amount UniqueStores
# <dbl> <chr> <chr> <dbl> <int>
#1 1 Monday Food 10 1
#2 1 Tuesday Food 10 2
#3 2 Friday Toys 7 1
#4 2 Wednesday Toys 15 2
If there are many columns and we only want to summarise a subset while keeping the rest, an option is to use mutate instead of summarise and then keep the first row of each group with distinct, as sketched below.
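A minimal sketch of that mutate + distinct variant, assuming the same df as above; the ungroup() and select(-Store) steps are my additions to make the de-duplication work, not part of the original answer:
library(dplyr)
df %>%
  group_by(ProductID, Day) %>%
  # attach the group aggregates to every row of the group
  mutate(Amount = sum(Amount), UniqueStores = n_distinct(Store)) %>%
  ungroup() %>%
  # Store varies within groups, so drop it before de-duplicating
  select(-Store) %>%
  distinct()
Because the aggregates are recycled onto every row, all constant columns (such as Product) survive without being listed by hand; distinct() then collapses each group to a single row.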
Or in data.table:
library(data.table)
setDT(df)[, .(Amount = sum(Amount, na.rm = TRUE),
              UniqueStores = uniqueN(Store, na.rm = TRUE)),
          by = .(ProductID, Day, Product)]
Output:
ProductID Day Product Amount UniqueStores
1: 1 Monday Food 10 1
2: 1 Tuesday Food 10 2
3: 2 Wednesday Toys 15 2
4: 2 Friday Toys 7 1
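For the data.table route, a hedged sketch of the same keep-every-column idea (update by reference with :=, then de-duplicate); this variant is my addition and not part of the original answer:
library(data.table)
dt <- as.data.table(df)
# overwrite Amount with the group total and add UniqueStores on every row
dt[, `:=`(Amount = sum(Amount), UniqueStores = uniqueN(Store)),
   by = .(ProductID, Day)]
# Store varies within groups, so remove it, then collapse duplicate rows
dt[, Store := NULL]
unique(dt)
As with the dplyr mutate variant, constant columns such as Product are kept automatically without naming them in the aggregation.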