Is there a way to read all the files under a Parquet partition into a single Spark partition?

Problem Description

The data is stored in Parquet format. The Parquet files are partitioned by a partition key column, which is a hash of the UserId column.
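The question does not show the writer job, so for concreteness here is a minimal sketch of how such a layout could be produced (the source path, the bucket count of 200, and the bucket column name are all assumptions, not details from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, pmod, lit}

val spark = SparkSession.builder().appName("write-bucketed").getOrCreate()
val raw = spark.read.parquet("../rawUserData")  // hypothetical source path

// Derive the partition key as a hash of UserId modulo an assumed bucket
// count of 200, then write one directory per bucket value.
val bucketed = raw.withColumn("bucket", pmod(hash(col("UserId")), lit(200)))
bucketed.write.partitionBy("bucket").parquet("../userData")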

Given this partitioning scheme, we know that:

  1. All of a given user's data lands in the same partition
  2. A single partition can contain data for more than one user

When reading the data, I want all rows for one user to fall into the same Spark partition. A Spark partition may hold more than one user, but it must then hold all rows for each of those users; the sketch below makes that invariant concrete.
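Here is a small check of the desired property (assuming the UserId column and the path from the question; this is an illustration, not code from the original post):

import org.apache.spark.sql.functions.{col, countDistinct, spark_partition_id}

// After reading, every UserId should map to exactly one Spark partition id.
val df = spark.read.parquet("../userData")
val violations = df
  .withColumn("pid", spark_partition_id())
  .groupBy(col("UserId"))
  .agg(countDistinct(col("pid")).as("numPartitions"))
  .filter(col("numPartitions") > 1)

assert(violations.count() == 0, "some users are spread across partitions")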

Currently, I am using: SparkSession.read.parquet("../userData").repartition(200, col("UserId"))

(I also tried partitionBy with a custom partitioner; the chain of operations was DataFrame -> RDD -> keyed RDD -> partitionBy -> RDD -> DataFrame. The deserialization-into-objects step that happens before partitionBy makes the shuffle writes explode; a sketch of that round trip follows.)
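A minimal sketch of the RDD round trip described above (the key extraction, the key type, and the use of HashPartitioner are assumptions about the original attempt):

import org.apache.spark.HashPartitioner

val df = spark.read.parquet("../userData")
// DataFrame -> keyed RDD: each Row is deserialized from Spark's internal
// binary format into a JVM object here, which is the step that inflates
// the shuffle writes.
val keyed = df.rdd.map(row => (row.getAs[String]("UserId"), row))
// Shuffle so that all rows with the same key land in the same partition.
val partitioned = keyed.partitionBy(new HashPartitioner(200)).values
// RDD -> DataFrame again, reusing the original schema.
val result = spark.createDataFrame(partitioned, df.schema)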

Is there a way to avoid the repartition and instead exploit the input folder structure to place each user's data in a single Spark partition?
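Not a confirmed answer to the question, but for context: Spark's built-in mechanism for co-locating rows by key at write time is bucketBy. It only works through saveAsTable (a metastore table), so it may not fit the existing plain-directory layout; the bucket count and table name below are assumptions:

// Writing with Spark's bucketing support.
df.write
  .bucketBy(200, "UserId")
  .sortBy("UserId")
  .saveAsTable("user_data_bucketed")

// Subsequent reads of the table can avoid a shuffle for operations keyed on
// UserId (subject to settings such as spark.sql.sources.bucketing.enabled).
val bucketedDf = spark.table("user_data_bucketed")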

Solution

No effective solution to this problem has been found yet; one is still being sought.
