How to implement a Twitter tweet, reply, and retweet database schema without null columns/cells

Problem description

How would you design the database for a Twitter-like application in the best and most efficient way?

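In this application, users can tweet, reply, and retweet. Tweets, replies, and retweets all end up creating a new tweet, so we could easily come up with a schema like this: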

tweets
  - id - integer
  - text - string -> this is non nullable
  - userId - integer
  - createdAt - date
  - parentId - integer -> when replying, this will be the parent tweet of the reply
  - isRetweet - boolean -> this becomes true only when the current tweet is a retweet. The parentId will have the id of the tweet that was retweeted

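But let's introduce another feature: users can retweet without writing any text (this is how Twitter currently works). In that case we have to make the text column nullable. Suddenly this means we can also create tweets and replies without text, because there is no longer a database constraint preventing it (let's ignore application-layer validation for now).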

My approach is to create 3 more tables for this. In the end, we would have 4 tables:

tweets
  - id - integer
  - text - string
  - userId - integer
  - createdAt - date

replies
  - tweetId - integer -> this will contain the content of the reply
  - parentId - integer -> this is the parent of the reply

retweetsWithText
  - tweetId - integer -> this will contain the content of the retweet
  - parentId - integer -> this is the parent of the retweet

retweetsWithoutText
  - tweetId - integer -> this will contain the reference for the retweeted tweet
  - userId - integer -> who retweeted the tweet. We need this because we won't save anything in the `tweets` table regarding retweets without text
  - createdAt - date

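I know this might be complex to query, but in the end I think it gives the best storage efficiency (no need for parentId and isRetweet). It also solves the problem of retweets without text. Another problem with the first approach is that parentId will most likely be null, because based on statistics only 30% of tweets get comments or retweets. So we would end up with a lot of NULL values in the parentId column.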

Is there a way to solve this with fewer tables and, ideally, without null/redundant cells?

Solution

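In the last schema, the one with multiple tables: how do you plan to query a user's timeline without querying several tables, merging the results, and then slicing the merged result to allow pagination?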
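To make that concern concrete, here is a minimal sketch of what a paginated timeline query could look like against the four-table schema. It uses Python's built-in sqlite3 module purely for illustration; the sample rows, the `timeline` helper, the `kind` labels, and the column types are assumptions and not part of the question. It shows one possible way to do the merge inside the database with UNION ALL, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (
        id INTEGER PRIMARY KEY, text TEXT NOT NULL,
        userId INTEGER NOT NULL, createdAt TEXT NOT NULL
    );
    CREATE TABLE replies (
        tweetId INTEGER PRIMARY KEY REFERENCES tweets(id),
        parentId INTEGER NOT NULL REFERENCES tweets(id)
    );
    CREATE TABLE retweetsWithText (
        tweetId INTEGER PRIMARY KEY REFERENCES tweets(id),
        parentId INTEGER NOT NULL REFERENCES tweets(id)
    );
    CREATE TABLE retweetsWithoutText (
        tweetId INTEGER NOT NULL REFERENCES tweets(id),
        userId INTEGER NOT NULL, createdAt TEXT NOT NULL
    );
    -- Made-up sample data: user 2 replies, retweets without text, then tweets.
    INSERT INTO tweets VALUES
        (1, 'hello world', 1, '2020-01-01T10:00'),
        (2, 'a reply by user 2', 2, '2020-01-01T10:05'),
        (3, 'own tweet by user 2', 2, '2020-01-01T10:20');
    INSERT INTO replies VALUES (2, 1);
    INSERT INTO retweetsWithoutText VALUES (1, 2, '2020-01-01T10:10');
""")

# Everything with text lives in `tweets`; two LEFT JOINs are needed just to
# tell what kind of row it is. Text-less retweets come from a separate table,
# so the two result sets are merged, sorted, and only then sliced.
TIMELINE_SQL = """
    SELECT t.id AS tweetId, t.createdAt AS happenedAt,
           CASE WHEN rp.tweetId IS NOT NULL THEN 'reply'
                WHEN rt.tweetId IS NOT NULL THEN 'retweet with text'
                ELSE 'tweet' END AS kind
    FROM tweets t
    LEFT JOIN replies rp ON rp.tweetId = t.id
    LEFT JOIN retweetsWithText rt ON rt.tweetId = t.id
    WHERE t.userId = :user
    UNION ALL
    SELECT r.tweetId, r.createdAt, 'retweet without text'
    FROM retweetsWithoutText r
    WHERE r.userId = :user
    ORDER BY happenedAt DESC
    LIMIT :limit OFFSET :offset
"""

def timeline(user, limit, offset=0):
    """Return one page of (tweetId, happenedAt, kind) rows, newest first."""
    params = {"user": user, "limit": limit, "offset": offset}
    return conn.execute(TIMELINE_SQL, params).fetchall()

page_one = timeline(2, limit=2)
page_two = timeline(2, limit=2, offset=2)
```

Even in this small sketch, every page request touches all four tables, which is the cost the question above is pointing at.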

That said, even if the approach that relies on NULL may waste some space on disk, it is faster to query.
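If you stay with the single table, the worry about text-less tweets and replies can be handled in the database itself. The following is only a sketch of one option, using CHECK constraints in SQLite through Python's sqlite3 module. The constraints, the `insert` and `rejected` helpers, and the sample values are assumptions added for illustration; the same idea would need to be adapted and tested on whatever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tweets (
        id        INTEGER PRIMARY KEY,
        text      TEXT,                       -- nullable, but see CHECK below
        userId    INTEGER NOT NULL,
        createdAt TEXT NOT NULL,
        parentId  INTEGER REFERENCES tweets(id),
        isRetweet INTEGER NOT NULL DEFAULT 0 CHECK (isRetweet IN (0, 1)),
        -- a retweet must point at the tweet being retweeted
        CHECK (isRetweet = 0 OR parentId IS NOT NULL),
        -- only a retweet may omit its text
        CHECK (text IS NOT NULL OR isRetweet = 1)
    )
""")

def insert(text, user_id, created_at, parent_id=None, is_retweet=0):
    """Insert one row and return its id."""
    cur = conn.execute(
        "INSERT INTO tweets (text, userId, createdAt, parentId, isRetweet) "
        "VALUES (?, ?, ?, ?, ?)",
        (text, user_id, created_at, parent_id, is_retweet),
    )
    return cur.lastrowid

def rejected(*args, **kwargs):
    """Return True if the database refuses the row because of a constraint."""
    try:
        insert(*args, **kwargs)
    except sqlite3.IntegrityError:
        return True
    return False

first = insert("hello world", 1, "2020-01-01T10:00")
# A retweet without text is accepted.
bare_retweet = insert(None, 2, "2020-01-01T10:10", parent_id=first, is_retweet=1)
# A plain tweet or a reply without text is refused.
tweet_without_text_refused = rejected(None, 3, "2020-01-01T10:15")
reply_without_text_refused = rejected(None, 3, "2020-01-01T10:16", parent_id=first)
```

With this, text stays nullable only where the feature needs it, and the timeline remains a query against one table.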

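In database schema design there is always a trade-off between insert speed and query speed. For a Twitter-like application, query speed wins (which is the case most of the time for OLTP workloads).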

You need to benchmark both solutions, and then add a third option (relying on table inheritance if you use PostgreSQL).

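Hint: I don't know much about how PostgreSQL serializes rows to disk, but if it is anything like the FoundationDB tuple module, encoding a NULL value in a column of a row takes only 1 byte.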

database database-schema relational-database