Random authentication errors - Oozie shell action in an Oozie workflow

Problem description

I am trying to run about 30 Oozie workflows, each of which contains the following actions:

  1. One shell action: identifies the last record updated in a set of 10 Hive tables
  2. 10 forked Sqoop actions: query the RDBMS for the records updated after that point in the corresponding tables
  3. 10 forked Hive actions: merge the newly imported data into the corresponding Hive tables.

Here is the shell action section of the workflow XML file:

<start to="shell-node_SET10_wf_hive_tables"/>
    <action name="shell-node_SET10_wf_hive_tables">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>LAST_VALUE_10.sh</exec>
            <argument>${hive_table1}</argument>
            <argument>${hive_table2}</argument>
            <argument>${hive_table3}</argument>
            <argument>${hive_table4}</argument>
            <argument>${hive_table5}</argument>
            <argument>${hive_table6}</argument>
            <argument>${hive_table7}</argument>
            <argument>${hive_table8}</argument>
            <argument>${hive_table9}</argument>
            <argument>${hive_table10}</argument>
            <file>${last_value_script_path}#LAST_VALUE_10.sh</file>
            <file>${keytabpath}/${keytabaccount}#${keytabaccount}</file>
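            <!-- capture-output exposes the script's key=value stdout to downstream actions via wf:actionData() -->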
            <capture-output/>
        </shell>
        <ok to="forking"/>
        <error to="Failed-notification-email_SET10_SHELL"/>
    </action>

The LAST_VALUE_10.sh bash script called by the shell action above contains the following:

#!/bin/bash

kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM

last_val_temp1=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $1"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp2=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $2"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp3=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $3"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp4=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $4"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp5=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $5"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp6=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $6"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp7=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $7"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp8=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $8"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp9=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $9"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp10=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from ${10}"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM

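# Beeline returns the literal string NULL for an empty table; fall back to 0 as the initial last value.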
[ "$last_val_temp1" = "NULL" ] && last_val_temp_1=0 || last_val_temp_1=$last_val_temp1
[ "$last_val_temp2" = "NULL" ] && last_val_temp_2=0 || last_val_temp_2=$last_val_temp2
[ "$last_val_temp3" = "NULL" ] && last_val_temp_3=0 || last_val_temp_3=$last_val_temp3
[ "$last_val_temp4" = "NULL" ] && last_val_temp_4=0 || last_val_temp_4=$last_val_temp4
[ "$last_val_temp5" = "NULL" ] && last_val_temp_5=0 || last_val_temp_5=$last_val_temp5
[ "$last_val_temp6" = "NULL" ] && last_val_temp_6=0 || last_val_temp_6=$last_val_temp6
[ "$last_val_temp7" = "NULL" ] && last_val_temp_7=0 || last_val_temp_7=$last_val_temp7
[ "$last_val_temp8" = "NULL" ] && last_val_temp_8=0 || last_val_temp_8=$last_val_temp8
[ "$last_val_temp9" = "NULL" ] && last_val_temp_9=0 || last_val_temp_9=$last_val_temp9
[ "${last_val_temp10}" = "NULL" ] && last_val_temp_10=0 || last_val_temp_10=${last_val_temp10}

printf "last_val_1=$last_val_temp_1\nlast_val_2=$last_val_temp_2\nlast_val_3=$last_val_temp_3\nlast_val_4=$last_val_temp_4\nlast_val_5=$last_val_temp_5\nlast_val_6=$last_val_temp_6\nlast_val_7=$last_val_temp_7\nlast_val_8=$last_val_temp_8\nlast_val_9=$last_val_temp_9\nlast_val_10=${last_val_temp_10}"

This is the same pattern across all 30 of my workflows, each importing 10 tables. Each workflow has its own unique LAST_VALUE script, and I copied the keytab file 30 times so that each workflow uses its own unique keytab file name. I have scheduled these jobs with my Oozie coordinators to start daily, staggered 15 minutes apart between workflows.

Every day I see a few workflows fail at random with the error below. A workflow that fails today will succeed on its next run, but may fail at random again perhaps 10 days later. Each day, one or two workflows that had been running successfully for days fail, always with the same error.

Error

[main] INFO com.unraveldata.agent.ResourceCollector - Unravel Sensor 4.5.1.1rc0013/1.3.11.3 initializing.
./LAST_VALUE_10.sh: line 3: kinit: command not found
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
scan complete in 2ms
Connecting to jdbc:hive2://ntd001:10000/hadoop_instance_1;principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM
20/10/08 02:31:12 [main]: ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
    at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
    at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
    at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:168)
    at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:208)
    at org.apache.hive.beeline.DatabaseConnection.connect(DatabaseConnection.java:146)
    at org.apache.hive.beeline.DatabaseConnection.getConnection(DatabaseConnection.java:211)
    at org.apache.hive.beeline.Commands.connect(Commands.java:1529)
    at org.apache.hive.beeline.Commands.connect(Commands.java:1424)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:52)
    at org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1139)
    at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1178)
    at org.apache.hive.beeline.BeeLine.initArgs(BeeLine.java:818)
    at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:898)
    at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:518)
    at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
    at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
    at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
    at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
    ... 35 more
Unknown HS2 problem when communicating with Thrift server.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://ntd001:10000/hadoop_instance_1;principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM: GSS initiate failed (state=08S01,code=0)
No current connection

I cannot pinpoint the cause of these random failures and need help identifying and fixing it. I have tried several approaches, such as splitting the shell action in each workflow so that each bash script queries only a single Hive table, and so on, but nothing has resolved it.
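One clue in the log above is the line ./LAST_VALUE_10.sh: line 3: kinit: command not found: on the failing runs, kinit itself does not resolve on whichever NodeManager host the action happened to land on, so the subsequent beeline call has no Kerberos TGT. A hedged way to test that theory is a diagnostic preamble at the top of each script that records where the action actually ran:

# Hypothetical diagnostic preamble: record which host ran the action and
# whether the Kerberos client tools are visible there.
echo "DEBUG host=$(hostname -f) user=$(whoami)" >&2
echo "DEBUG PATH=$PATH" >&2
command -v kinit >&2 || echo "DEBUG kinit not found on this host" >&2

If the failures correlate with particular hosts, installing the Kerberos clients there or invoking kinit by absolute path (as in the earlier sketch) would be the natural next step.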

Solution

No working solution has been found for this problem yet.
