Problem description
I have two tables:
import org.apache.spark.sql.types.{IntegerType,StringType,StructField,StructType}
val sourcesFolders = List("/home/mykolavasyliv/tmp/source1/","/home/mykolavasyliv/tmp/source2/","/home/mykolavasyliv/tmp/source3/")
// :~/tmp$ tree
// .
// ├── source1
// │   └── person-data-1.csv
// ├── source2
// │   └── person-data-2.csv
// └── source3
//     └── person-data-3.csv
// person-data-1.csv:
// source-1-1,Mykola,23,100
// source-1-2,Jon,34,76
// source-1-3,Marry,25,123
// person-data-2.csv
// source-2-1,100
// source-2-2,76
// source-2-3,123
// person-data-3.csv
// source-3-1,100
// source-3-2,76
// source-3-3,123
// Read the CSV files without specifying a schema
val sourceDF = spark.read.csv(sourcesFolders:_*)
sourceDF.show(false)
// +----------+-------+---+---+
// |_c0 |_c1 |_c2|_c3|
// +----------+-------+---+---+
// |source-1-1|Mykola |23 |100|
// |source-1-2|Jon |34 |76 |
// |source-1-3|Marry |25 |123|
// |source-2-1|Mykola |23 |100|
// |source-2-2|Jon |34 |76 |
// |source-2-3|Marry |25 |123|
// |source-3-1|Mykola |23 |100|
// |source-3-2|Jon |34 |76 |
// |source-3-3|Marry |25 |123|
// +----------+-------+---+---+
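// Read the CSV files with an explicit schema
// (the type of "NotKNow" is assumed to be IntegerType, based on the values shown)
val schema = StructType(
  List(
    StructField("id", StringType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true),
    StructField("NotKNow", IntegerType, true)
  )
)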
val source2DF = spark.read.schema(schema).csv(sourcesFolders:_*)
source2DF.show(false)
// +----------+-------+---+-------+
// |id |name |age|NotKNow|
// +----------+-------+---+-------+
// |source-1-1|Mykola |23 |100 |
// |source-1-2|Jon |34 |76 |
// |source-1-3|Marry |25 |123 |
// |source-2-1|Mykola |23 |100 |
// |source-2-2|Jon |34 |76 |
// |source-2-3|Marry |25 |123 |
// |source-3-1|Mykola |23 |100 |
// |source-3-2|Jon |34 |76 |
// |source-3-3|Marry |25 |123 |
// +----------+-------+---+-------+
[Date Master] is a table of dates. It also has a [Workday] column, which tells us whether a given date is an actual working day.
+-------------------------+---------+
| Master Date             | Workday |
+-------------------------+---------+
| 2020-03-16 00:00:00.000 | 1       |
| 2020-03-17 00:00:00.000 | 1       |
| 2020-03-18 00:00:00.000 | 1       |
| 2020-03-19 00:00:00.000 | 1       |
| 2020-03-20 00:00:00.000 | 1       |
| 2020-03-21 00:00:00.000 | 0       |
| 2020-03-22 00:00:00.000 | 0       |
| 2020-03-23 00:00:00.000 | 1       |
| 2020-03-24 00:00:00.000 | 1       |
| 2020-03-25 00:00:00.000 | 1       |
| 2020-03-26 00:00:00.000 | 1       |
| 2020-03-27 00:00:00.000 | 1       |
| 2020-03-28 00:00:00.000 | 0       |
| 2020-03-29 00:00:00.000 | 0       |
| 2020-03-30 00:00:00.000 | 1       |
| 2020-03-31 00:00:00.000 | 1       |
+-------------------------+---------+
The second table, [MAIN], is a kind of performance table in which we store the office attendance of various colleagues.
+--------+------------+------------+
| ID     | Start Date | End Date   |
+--------+------------+------------+
| 528950 | 2020-03-19 | 2020-03-23 |
+--------+------------+------------+
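I wrote a select that should show the difference between [Start Date] and [End Date], using the [Workday] column values mentioned above.
Here comes the interesting part: this select returns 4 days: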
+--------+---------------------------------------+
| ID     | Start Date - End Date (Business Days) |
+--------+---------------------------------------+
| 528950 | 4                                     |
+--------+---------------------------------------+
But if I do the calculation by hand, I get 3 days:
+-------------------------+---------+
| Master Date             | Workday |
+-------------------------+---------+
| 2020-03-19 00:00:00.000 | 1       |
| 2020-03-20 00:00:00.000 | 1       |
| 2020-03-21 00:00:00.000 | 0       |
| 2020-03-22 00:00:00.000 | 0       |
| 2020-03-23 00:00:00.000 | 1       |
+-------------------------+---------+
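To make the manual count reproducible, here is a minimal sketch in plain Scala (no database involved). It sums the Workday flags for every calendar row between the two dates, both ends inclusive, and compares that with the plain calendar-day difference. The names `BusinessDays`, `dateMaster`, `businessDays` and `calendarDays` are made up for this illustration, and the rows are copied from the excerpt above. The original select is not shown here, so whether it returned a plain date difference is only a guess; the sketch just shows that 4 is the calendar difference while 3 is the workday count.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object BusinessDays {
  // Hypothetical in-memory copy of the [Date Master] rows shown above:
  // date -> Workday flag (1 = working day, 0 = not a working day).
  val dateMaster: Map[LocalDate, Int] = Map(
    LocalDate.parse("2020-03-19") -> 1,
    LocalDate.parse("2020-03-20") -> 1,
    LocalDate.parse("2020-03-21") -> 0,
    LocalDate.parse("2020-03-22") -> 0,
    LocalDate.parse("2020-03-23") -> 1
  )

  // Sum the Workday flags of all calendar rows in [start, end], ends inclusive.
  def businessDays(start: LocalDate, end: LocalDate): Int =
    dateMaster.toSeq
      .filter { case (day, _) => !day.isBefore(start) && !day.isAfter(end) }
      .map { case (_, flag) => flag }
      .sum

  // Plain calendar difference, ignoring the Workday flag entirely.
  def calendarDays(start: LocalDate, end: LocalDate): Long =
    ChronoUnit.DAYS.between(start, end)

  def main(args: Array[String]): Unit = {
    val start = LocalDate.parse("2020-03-19")
    val end   = LocalDate.parse("2020-03-23")
    println(s"business days: ${businessDays(start, end)}") // prints "business days: 3"
    println(s"calendar days: ${calendarDays(start, end)}") // prints "calendar days: 4"
  }
}
```

If the select gives the second number, it is worth checking that the Workday flag is really applied to each calendar row in the range, rather than only to the start or end date.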
What am I doing wrong? It may be something simple, but I'm stuck.
Thanks.