Inserting selected columns from a Spark DataFrame into a SQL Server table

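Problem description

I have a SQL Server table whose schema differs from that of my DataFrame. I want to select some columns from the DataFrame and "insert" the selected values into the table.

Basically something like the following code, but in pyspark: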

INSERT INTO Cust_Diff_Schema_tbl
(acct_num,name)
SELECT account_no,name
FROM customers
WHERE customer_id > 5000;

I can read the data over JDBC with spark.read, as shown below:

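# Set either "dbtable" or "query", not both: Spark rejects a JDBC read that specifies the two together.
# The query is passed without outer parentheses because Spark wraps it in a subquery itself.
df_s3 = spark.read.format("jdbc")\
                .option("driver", db_driver_name)\
                .option("url", db_url + ":1433;databaseName=" + stage_db)\
                .option("query", "select * from customers")\
                .option("user", db_username)\
                .option("password", db_password)\
                .load()

df_s3.printSchema()
df_s3.show(20)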

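To write/append the selected values to the table, I believe I can still use "df_s3.write", but I need an example of how to issue the insert statement through the ".option" function, or through another approach if that does not work.

Thanks.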

Solution

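// Create the DataFrame first (fetch it from a database, read a file, or build it some other way)
// val df = ...

// Append the DataFrame's rows to the target table over JDBC
df.write.format("jdbc")
      .option("numPartitions",20)
      .option("batchsize",10000)
      .option("truncate","true")   // only takes effect with mode("overwrite"); ignored when appending
      .option("url","jdbcURL")
      .option("driver","Driver name")
      .option("dbtable","tablename")
      .mode("append")
      .save()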
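
Since the question asks for pyspark and for only some of the columns, here is a minimal sketch of the same append in Python. It mirrors the INSERT ... SELECT from the question: filter the rows, keep the two columns, rename account_no to acct_num, then append. The helper names jdbc_options and append_selected_columns are made up for illustration, and the column and table names are taken from the question. As far as I know, the JDBC writer builds its INSERT from the DataFrame's column names, so the rename is what lines the data up with the target table.

```python
def jdbc_options(url, table, user, password, driver):
    """Collect the JDBC connection settings in one dict (hypothetical helper)."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


def append_selected_columns(df, options):
    """Append only the needed columns of `df` to the target table.

    Assumes `df` is a Spark DataFrame with the columns customer_id,
    account_no and name, as in the question's customers table.
    """
    selected = (
        df.filter("customer_id > 5000")                    # WHERE customer_id > 5000
          .selectExpr("account_no AS acct_num", "name")    # SELECT account_no, name
    )
    (
        selected.write.format("jdbc")
        .options(**options)
        .mode("append")   # add rows; leave the existing table and its schema alone
        .save()
    )
```

With the variables from the question, a call could look like append_selected_columns(df_s3, jdbc_options(db_url + ":1433;databaseName=" + stage_db, "Cust_Diff_Schema_tbl", db_username, db_password, db_driver_name)). Any column of the target table that is not in the selection would need to be nullable or have a default on the SQL Server side.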

apache-spark-sql aws-glue-spark jdbc pyspark