Data insertion problem

Problem description

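I ran into this problem when loading a CSV file through my HQL code and running it on HDFS. After the insert, the partition column shows NULL values and some columns are missing. I have tried several different ways of inserting the data, but I still get strange symbols and missing columns, as if Hive could not read the CSV file. Here is the code:
`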

CREATE database covid_db;

use covid_db;


SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=500;
set hive.exec.max.dynamic.partitions.pernode=500;


CREATE TABLE IF NOT EXISTS covid_db.covid_staging 
(
 Country                            STRING,
 Total_Cases                        DOUBLE,
 New_Cases                          DOUBLE,
 Total_Deaths                       DOUBLE,
 New_Deaths                         DOUBLE,
 Total_Recovered                    DOUBLE,
 Active_Cases                       DOUBLE,
 Serious                            DOUBLE,
 Tot_Cases                          DOUBLE,
 Deaths                             DOUBLE,
 Total_Tests                        DOUBLE,
 Tests                              DOUBLE,
 CASES_per_Test                     DOUBLE,
 Death_in_Closed_Cases              STRING,
 Rank_by_Testing_rate               DOUBLE,
 Rank_by_Death_rate                 DOUBLE,
 Rank_by_Cases_rate                 DOUBLE,
 Rank_by_Death_of_Closed_Cases      DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_LZ'
tblproperties ("skip.header.line.count"="1","serialization.null.format" = "''");

CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned 
(
 Country                            STRING,
 Rank_by_Death_of_Closed_Cases      DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';

FROM
covid_db.covid_staging
INSERT INTO TABLE covid_db.covid_ds_partitioned PARTITION(COUNTRY_NAME)
SELECT *,Country WHERE Country is not null;


CREATE EXTERNAL TABLE covid_db.covid_final_output 
(
 TOP_DEATH                          STRING,
 TOP_TEST                           STRING
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_FINAL_OUTPUT';

`

Solution

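1st: You are looking at the file contents, and partition columns are not stored in the data files. They are stored in the metadata, and dynamically created partitions additionally get directories named key=value. So the last column you see in the file is not the partition column; it is Rank_by_Death_of_Closed_Cases.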
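One way to check this, as a rough sketch that reuses the table name and location from the question: list the partitions Hive has registered, then select the partition column next to a regular column. The `dfs` line assumes you are in the Hive CLI; the other two statements are plain HiveQL.

```sql
-- Partitions registered in the metastore; each line has the form country_name=<value>.
SHOW PARTITIONS covid_db.covid_ds_partitioned;

-- Hive CLI only: list the table location. Each dynamic partition should show up
-- as a subdirectory named COUNTRY_NAME=<value>.
dfs -ls /user/cloudera/ds/COVID_HDFS_PARTITIONED;

-- COUNTRY_NAME is taken from the partition directory, not from the data files,
-- so it appears in query results even though the files do not contain it.
SELECT Country, COUNTRY_NAME
FROM covid_db.covid_ds_partitioned
LIMIT 10;
```

If the query returns COUNTRY_NAME values while the raw files end with Rank_by_Death_of_Closed_Cases, the partitioning itself is working as intended.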

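2nd: You did not specify a field delimiter or a NULL format in the DDL of the second table. The default field delimiter is '\001' (Ctrl-A), which is most likely the strange symbol you see when you open the files. You can specify a delimiter, for example TAB ('\t'), and the NULL representation you want: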

ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
NULL DEFINED AS ''
STORED AS TEXTFILE;

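However, if you want to be able to distinguish NULL from an empty string, it is better not to redefine the NULL format. By default, Hive writes NULL as \N in text files.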
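Putting the second point together, here is a minimal sketch of the partitioned table with an explicit delimiter, followed by a reload from staging. It rests on several assumptions: the column list is copied exactly as it appears in the question, a comma is chosen as the delimiter to match the staging table (any character that does not occur in the data would do), the NULL format is left at its default, and the dynamic partition settings from the question are still in effect. Because the table is EXTERNAL, dropping it removes only the metadata, not the files.

```sql
-- Drop the old definition first; IF NOT EXISTS would otherwise keep it unchanged.
DROP TABLE IF EXISTS covid_db.covid_ds_partitioned;

CREATE EXTERNAL TABLE covid_db.covid_ds_partitioned
(
  Country                        STRING,
  Rank_by_Death_of_Closed_Cases  DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','        -- explicit delimiter instead of the default '\001'
STORED AS TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';

-- Name the columns explicitly; the dynamic partition column must come last.
-- OVERWRITE replaces the files in each partition that this insert writes to.
INSERT OVERWRITE TABLE covid_db.covid_ds_partitioned PARTITION (COUNTRY_NAME)
SELECT Country,
       Rank_by_Death_of_Closed_Cases,
       Country AS COUNTRY_NAME
FROM covid_db.covid_staging
WHERE Country IS NOT NULL;
```

Listing the columns instead of using SELECT * keeps the select list aligned with the target table, so a column count mismatch is caught when the statement is read rather than discovered in the output files.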