Athena cannot read multi-line text in a CSV field

Problem description

This Athena table correctly reads the first line of the file.

CREATE EXTERNAL TABLE `test_delete_email5`(
  `col1` string, `col2` string, `col3` string, `col4` string, `col5` string,
  `col6` string, `col7` string, `col8` string, `col9` string, `col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'LINES TERMINATED BY' = '\n',
  'ESCAPED BY' = '\\',
  'quoteChar' = '\"'
)
LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')

The table is not imported correctly because of the HTML code found in the 5th column. Is there any other way to do this?

Solution

Your file seems to have a lot of multi-line text in the textbody field. This does not follow the CSV standard (or at least, OpenCSVSerde cannot understand it).

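As a test, I made a simple file:

"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"

  • Row 1 is the header
  • Row 2 is normal
  • Row 3 has a field containing escaped quotes
  • Row 4 has an escaped newline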

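I then ran the command from your question and pointed it at this data file.

Result:

  • Rows 1-3 were returned (including the header row)
  • Row 4 only worked up to the \ and the data after it was lost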
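
The result above is consistent with the input being cut into records at every newline before any CSV parsing happens. As a rough illustration (this is not Athena's actual code path), the Python sketch below compares a plain newline split with a quote-aware parse of the same four-row test data. The names `SAMPLE`, `physical_lines`, and `logical_records` are hypothetical and introduced only for this example, and the backslash escaping is assumed to match the test file.

```python
import csv
import io

# Hypothetical stand-in for the four-row test file described above
# (backslash-escaped quotes in row 3, backslash-escaped newline in row 4).
SAMPLE = (
    '"newsletterid","name","format","subject","textbody",'
    '"htmlbody","createdate","active","archive","ownerid"\n'
    '"one","two","three","four","five","six","seven","eight","nine","ten"\n'
    '"one","two","three","four","five \\" quote \\" five2",'
    '"six","seven","eight","nine","ten"\n'
    '"one","two","three","four","five \\\nfive2",'
    '"six","seven","eight","nine","ten"\n'
)


def physical_lines(text):
    """Split on newlines first, as a line-oriented reader would."""
    return text.splitlines()


def logical_records(text):
    """Parse with a CSV reader that lets quoted fields span several lines."""
    reader = csv.reader(
        io.StringIO(text),
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    return list(reader)


if __name__ == "__main__":
    lines = physical_lines(SAMPLE)
    records = logical_records(SAMPLE)
    print(len(lines), "physical lines,", len(records), "logical records")
    # Row 4 is cut in the middle of the textbody field by the newline split.
    print(repr(lines[3]))
    print(repr(lines[4]))
```

On this sample the newline split produces five lines for four records, so a reader that works line by line sees row 4 as two incomplete rows.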

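Bottom line: Your file format is not compatible with the CSV format.

You might be able to find some Serde that can handle it, but OpenCSVSerde doesn't seem to understand it, because rows are normally split by newlines.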
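
If no suitable Serde turns up, another option might be to pre-process the files before they land in S3, so that every record sits on one physical line. The sketch below is only one possible approach, under these assumptions: line breaks inside a field can be replaced by a space without losing anything you need, and the files use backslash escaping as in the test file. `flatten_multiline_csv` and `SAMPLE` are hypothetical names, and the output should still be checked against the real table.

```python
import csv
import io

# Hypothetical sample: a header, one normal row, one row whose
# textbody field contains a backslash-escaped newline.
SAMPLE = (
    '"newsletterid","textbody","ownerid"\n'
    '"one","five","ten"\n'
    '"two","five \\\nfive2","ten"\n'
)


def flatten_multiline_csv(src, dst, replacement=" "):
    """Copy CSV records from src to dst, replacing line breaks inside fields.

    Returns the number of records written.
    """
    reader = csv.reader(src, quotechar='"', escapechar="\\", doublequote=False)
    writer = csv.writer(
        dst,
        quotechar='"',
        escapechar="\\",
        doublequote=False,
        quoting=csv.QUOTE_ALL,
        lineterminator="\n",
    )
    count = 0
    for record in reader:
        cleaned = [
            field.replace("\r\n", replacement)
            .replace("\n", replacement)
            .replace("\r", replacement)
            for field in record
        ]
        writer.writerow(cleaned)
        count += 1
    return count


if __name__ == "__main__":
    out = io.StringIO()
    written = flatten_multiline_csv(io.StringIO(SAMPLE), out)
    print(written, "records written")
    print(out.getvalue(), end="")
```

For real files, open the input with `newline=""` so the `csv` module sees the original line endings. Replacing line breaks changes the text, so if the HTML bodies must keep their exact layout, a different placeholder or a different file format may be the better choice.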