Join returns an unexpected result with DATEDIFF

Problem description

I have two tables:

[Date Master] is a table of dates with an additional [Workday] column, which tells us whether a given date is an actual working day.

+-------------------------+---------+
|       Master Date       | Workday |
+-------------------------+---------+
| 2020-03-16 00:00:00.000 |       1 |
| 2020-03-17 00:00:00.000 |       1 |
| 2020-03-18 00:00:00.000 |       1 |
| 2020-03-19 00:00:00.000 |       1 |
| 2020-03-20 00:00:00.000 |       1 |
| 2020-03-21 00:00:00.000 |       0 |
| 2020-03-22 00:00:00.000 |       0 |
| 2020-03-23 00:00:00.000 |       1 |
| 2020-03-24 00:00:00.000 |       1 |
| 2020-03-25 00:00:00.000 |       1 |
| 2020-03-26 00:00:00.000 |       1 |
| 2020-03-27 00:00:00.000 |       1 |
| 2020-03-28 00:00:00.000 |       0 |
| 2020-03-29 00:00:00.000 |       0 |
| 2020-03-30 00:00:00.000 |       1 |
| 2020-03-31 00:00:00.000 |       1 |
+-------------------------+---------+

The second table, [MAIN], is a kind of performance table in which we store the office attendance of various colleagues.

+--------+------------+------------+
|   ID   | Start Date |  End Date  |
+--------+------------+------------+
| 528950 | 2020-03-19 | 2020-03-23 |
+--------+------------+------------+

I wrote a SELECT that should show the difference between [Start Date] and [End Date] in business days, using the [Workday] column mentioned above.

Here comes the interesting part: this SELECT returns 4 days:


+--------+---------------------------------------+
|   ID   | Start Date - End Date (Business Days) |
+--------+---------------------------------------+
| 528950 |                                     4 |
+--------+---------------------------------------+

But if I do the calculation by hand, I get 3 days:

+-------------------------+---------+
|       Master Date       | Workday |
+-------------------------+---------+
| 2020-03-19 00:00:00.000 |       1 |
| 2020-03-20 00:00:00.000 |       1 |
| 2020-03-21 00:00:00.000 |       0 |
| 2020-03-22 00:00:00.000 |       0 |
| 2020-03-23 00:00:00.000 |       1 |
+-------------------------+---------+
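
The manual count can be reproduced with a short sketch. Python is used here purely for illustration, and the names `calendar` and `business_days` are hypothetical; they do not come from the original query. The sketch sums the Workday flags for each date in the range, both endpoints included, and compares that with a plain calendar-day difference. That the plain difference also happens to be 4 is only a possible hint about where the unexpected value comes from, not a confirmed diagnosis.

```python
from datetime import date, timedelta

# Calendar rows copied from the table above: date -> Workday flag (1 = working day).
calendar = {
    date(2020, 3, 19): 1,
    date(2020, 3, 20): 1,
    date(2020, 3, 21): 0,
    date(2020, 3, 22): 0,
    date(2020, 3, 23): 1,
}


def business_days(start, end, calendar):
    """Sum the Workday flags for every date from start to end, inclusive."""
    total = 0
    day = start
    while day <= end:
        # Assumption: a date missing from the calendar counts as a non-working day.
        total += calendar.get(day, 0)
        day += timedelta(days=1)
    return total


start, end = date(2020, 3, 19), date(2020, 3, 23)

workdays = business_days(start, end, calendar)  # flags: 1 + 1 + 0 + 0 + 1
calendar_diff = (end - start).days              # plain day difference, ignores Workday

print(workdays, calendar_diff)  # prints "3 4"
```

If the query computes a plain date difference rather than summing the Workday flags of the joined calendar rows, it would produce 4 for this ID instead of 3.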

What am I doing wrong? It may be something simple, but I am stuck.

Thanks.
