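Problem description
I ran into this problem when loading a CSV file through my HQL code and running it on HDFS. After inserting the data, NULL values show up in the partition part and some columns are dropped. I have tried several different ways of inserting the data, but I still get strange symbols and missing columns, as if Hive cannot read the CSV file. Here is the code:`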
CREATE database covid_db;
use covid_db;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=500;
set hive.exec.max.dynamic.partitions.pernode=500;
CREATE TABLE IF NOT EXISTS covid_db.covid_staging
(
Country STRING,Total_Cases DOUBLE,New_Cases DOUBLE,Total_Deaths DOUBLE,New_Deaths DOUBLE,Total_Recovered DOUBLE,Active_Cases DOUBLE,Serious DOUBLE,Tot_Cases DOUBLE,Deaths DOUBLE,Total_Tests DOUBLE,Tests DOUBLE,CASES_per_Test DOUBLE,Death_in_Closed_Cases STRING,Rank_by_Testing_rate DOUBLE,Rank_by_Death_rate DOUBLE,Rank_by_Cases_rate DOUBLE,Rank_by_Death_of_Closed_Cases DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_LZ'
tblproperties ("skip.header.line.count"="1","serialization.null.format" = "''");
CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned
(
Country STRING,Rank_by_Death_of_Closed_Cases DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';
FROM
covid_db.covid_staging
INSERT INTO TABLE covid_db.covid_ds_partitioned PARTITION(COUNTRY_NAME)
SELECT *,Country WHERE Country is not null;
CREATE EXTERNAL TABLE covid_db.covid_final_output
(
TOP_DEATH STRING,TOP_TEST STRING
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_FINAL_OUTPUT';
`
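Solution
1st: You are inspecting the contents of the data files, and partition columns are not stored in the files. They are stored in the metadata, and dynamically created partitions also get directories named in the form key=value. So the last column you see in the file is not the partition column; it is Rank_by_Death_of_Closed_Cases.
2nd: You did not specify a delimiter or a NULL format in the DDL of the second table. The default delimiter is '\001' (Ctrl-A), which is why the file appears to contain strange symbols. You can specify a delimiter, for example TAB (\t), and the NULL representation you want: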
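To confirm where the partition value actually lives, the checks below may help. This is a minimal sketch that assumes the table and location from the question and a client that accepts `dfs` commands (the Hive CLI does); nothing here changes any data.

```sql
-- List the partitions registered in the metastore,
-- shown as country_name=<value>.
SHOW PARTITIONS covid_db.covid_ds_partitioned;

-- List the partition directories under the table location (key=value layout).
dfs -ls /user/cloudera/ds/COVID_HDFS_PARTITIONED;

-- COUNTRY_NAME is filled in from the partition directory name,
-- not from the bytes stored in the data files.
SELECT Country, Rank_by_Death_of_Closed_Cases, COUNTRY_NAME
FROM covid_db.covid_ds_partitioned
LIMIT 10;
```

If the query returns the expected COUNTRY_NAME values, the partitioning itself worked, and only the file format needs attention.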
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
NULL DEFINED AS ''
STORED AS TEXTFILE;
However, if you want to be able to distinguish NULL from an empty string, it is better not to redefine the NULL format.
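Putting the pieces together, the second table could be recreated roughly as follows. This is a sketch under stated assumptions, not the only possible fix: the comma delimiter is chosen here only to match the staging table, the NULL format is left at Hive's default (`\N`) so that NULL and empty strings stay distinguishable, and the SELECT lists the two table columns explicitly with the partition column last.

```sql
-- Sketch only. Dropping an EXTERNAL table removes just the metadata;
-- files already written under LOCATION with the old '\001' delimiter
-- stay on HDFS and would have to be removed or rewritten separately.
DROP TABLE IF EXISTS covid_db.covid_ds_partitioned;

CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned
(
  Country STRING,
  Rank_by_Death_of_Closed_Cases DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','   -- assumption: comma, matching covid_staging
STORED AS TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The dynamic partition column must come last in the SELECT list.
INSERT INTO TABLE covid_db.covid_ds_partitioned PARTITION (COUNTRY_NAME)
SELECT Country,
       Rank_by_Death_of_Closed_Cases,
       Country AS COUNTRY_NAME
FROM covid_db.covid_staging
WHERE Country IS NOT NULL;
```

With this layout, each data file contains only Country and Rank_by_Death_of_Closed_Cases separated by a comma, while COUNTRY_NAME appears only in the directory name.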