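Problem description
Normally the grouping variable passed to aggregate() groups a data frame, and that grouping variable is itself a column of the data frame being aggregated: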
aggregate(iris[,1:4],by = list(iris$Species),mean)
In hierarchical clustering, however, the vector returned by cutree() is passed to aggregate() to build a summary of each cluster:
members <- cutree(c,k = 9)
aggregate(customer_sample[,2:4],by = list(members),mean)
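In my case, members holds the cluster numbers (1 to 8) together with the unique IDs, whereas my data frame customer_sample contains only the unique IDs. What I don't understand is how aggregate() links the unique IDs in members to the unique IDs in customer_sample.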
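A small sketch on made-up data may help here. As far as base R's aggregate() is concerned, the elements of `by` are matched to the rows of the data frame purely by position: element i of the grouping vector goes with row i, which is why each `by` element must have as many entries as the data frame has rows. The IDs that cutree() attaches are only the names of the returned vector (taken from the row names of the clustered data) and are not used for matching. The data frame `toy` and its values are hypothetical.

```r
# Toy data: six hypothetical customers with an ID and two numeric features.
toy <- data.frame(
  customer_id = c(101, 102, 103, 104, 105, 106),
  recency     = c(1, 2, 30, 31, 60, 61),
  amount      = c(10, 12, 50, 55, 90, 95)
)

# Cluster on the scaled features; the row names carry the IDs, as in the question.
feat <- scale(toy[, c("recency", "amount")])
rownames(feat) <- toy$customer_id
hc <- hclust(dist(feat), method = "ward.D2")
members <- cutree(hc, k = 3)

# members is a named integer vector: names are the IDs, values are cluster numbers.
print(members)

# Grouping with the vector as returned by cutree().
by_named <- aggregate(toy[, 2:3], by = list(cluster = members), FUN = mean)

# Deliberately attach the wrong names: if aggregate() matched on IDs,
# the cluster means would change. They do not, because only position matters.
scrambled <- members
names(scrambled) <- rev(names(members))
by_scrambled <- aggregate(toy[, 2:3], by = list(cluster = scrambled), FUN = mean)

print(all.equal(by_named$amount, by_scrambled$amount))
print(all.equal(by_named$recency, by_scrambled$recency))
```

The practical consequence is that the result is only meaningful when the rows of the aggregated data frame are in the same order as the observations that were clustered.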
Here is my complete code.
library(dplyr)  # needed below for %>%, group_by(), summarize() and n()
data <- read.table("purchases.txt")
head(data)
colnames(data) = c('customer_id','purchase_amount','date_of_purchase')
#----------------Set Date and extract No of days elapsed ---------------------
data$date_of_purchase = as.Date(data$date_of_purchase,"%Y-%m-%d")
data$days_since = as.numeric(difftime(time1 = "2016-01-01",time2 = data$date_of_purchase,units = "days"))
#----------------Compute Recency,Frequency,Monetary Value-------------------
customers <- data %>% group_by(customer_id) %>%
summarize(recency = min(days_since),freq = n(),amount = mean(purchase_amount))
#----------------Explore Recency,Monetary Value-------------------
head(customers)
summary(customers)
hist(customers$recency)
hist(customers$freq)
hist(customers$amount)
hist(customers$amount,breaks = 100)
#-------------------------Make a copy of customers df ------------------------
new_data <- customers
head(new_data)
#--------------Transform Data to compute similarity/dissimilarity-------------
new_data$amount <- log(new_data$amount)
hist(new_data$amount)
vec_id <- new_data$customer_id
new_data <- subset(new_data,select= -c(customer_id),drop = FALSE)
rownames(new_data) <- vec_id
head(new_data)
#---------------------------- Standardize Data -------------------------------
new_data = scale(new_data)
head(new_data)
#-------------------- Take small sample for efficiency -----------------------
sample = seq(1,18417,10)
head(sample)
customer_sample <- customers[sample,]
new_data_sample <- new_data[sample,]
#/////////////////////////////////////////////////////////////////////////////
#---------------------------- Hierarchical Clustering ------------------------
#/////////////////////////////////////////////////////////////////////////////
#-------------------------------- distance Matrix ----------------------------
d <- dist(new_data_sample)
#---------------------------------- Make Clusters ----------------------------
c = hclust(d,method = "ward.D2")
#------------------------------- Plot Dendrogram ----------------------------
plot(c)
#------------------------------- Cut the Dendrogram --------------------------
members <- cutree(c,k = 9) #k gives the number of clusters/segments
#--------------------------- Show first 30 customers -------------------------
members[1:30]
#---------------------- Compute frequency in each cluster -------------------
table(members)
#------------------------- Show profile of each cluster ----------------------
aggregate(customer_sample[,2:4],by = list(members),mean)
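In the code above, customer_sample and new_data_sample are both taken with the same row index, so row i of one corresponds to row i of the other and the positional pairing in aggregate() lines up. If you would rather not rely on row order, one possible approach (a sketch on hypothetical data, not the only way) is to check the alignment explicitly, or to turn the names of members into an ID column and join on it with merge() before aggregating. The values below and the helper name `lookup` are made up for illustration.

```r
# Hypothetical stand-in for customer_sample (values are made up).
customer_sample <- data.frame(
  customer_id = c(10, 20, 30, 40, 50, 60),
  recency     = c(5, 7, 200, 210, 900, 950),
  freq        = c(9, 8, 3, 4, 1, 1),
  amount      = c(20, 25, 60, 65, 150, 160)
)

# Same preparation as in the question: scale the features, keep IDs as row names.
features <- scale(customer_sample[, c("recency", "freq", "amount")])
rownames(features) <- customer_sample$customer_id
members <- cutree(hclust(dist(features), method = "ward.D2"), k = 3)

# 1) Check that positional matching is safe: the names on members should
#    equal the IDs in customer_sample, in the same order.
stopifnot(identical(names(members), as.character(customer_sample$customer_id)))

# 2) Or join explicitly on the ID, so that row order no longer matters.
lookup <- data.frame(
  customer_id = as.numeric(names(members)),
  cluster     = unname(members)
)
shuffled <- customer_sample[c(4, 1, 6, 2, 5, 3), ]  # deliberately reordered rows
joined   <- merge(shuffled, lookup, by = "customer_id")

# Cluster profile computed from the joined table.
profile <- aggregate(joined[, c("recency", "freq", "amount")],
                     by = list(cluster = joined$cluster), FUN = mean)
print(profile)
```

The join costs one extra step, but it keeps the cluster label attached to the customer ID, so later reordering or filtering of the data frame cannot silently misassign clusters.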