Problem description
I am running about 30 Oozie workflows, each of which contains shell actions like the one below.
Here is the shell-action section of the workflow XML file:
<start to="shell-node_SET10_wf_hive_tables"/>
<action name="shell-node_SET10_wf_hive_tables">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>LAST_VALUE_10.sh</exec>
<argument>${hive_table1}</argument>
<argument>${hive_table2}</argument>
<argument>${hive_table3}</argument>
<argument>${hive_table4}</argument>
<argument>${hive_table5}</argument>
<argument>${hive_table6}</argument>
<argument>${hive_table7}</argument>
<argument>${hive_table8}</argument>
<argument>${hive_table9}</argument>
<argument>${hive_table10}</argument>
<file>${last_value_script_path}#LAST_VALUE_10.sh</file>
<file>${keytabpath}/${keytabaccount}#${keytabaccount}</file>
<capture-output/>
</shell>
<ok to="forking"/>
<error to="Failed-notification-email_SET10_SHELL"/>
</action>
The contents of the LAST_VALUE_10.sh bash script invoked by the shell action above are as follows:
#!/bin/bash
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp1=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $1"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp2=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $2"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp3=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $3"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp4=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $4"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp5=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $5"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp6=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $6"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp7=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $7"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp8=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $8"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp9=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from $9"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
last_val_temp10=`beeline --showHeader=false --outputformat=tsv2 --hiveconf mapreduce.job.queuename=TEST_QUEUE -u 'jdbc:hive2://ntd001:10000/'"hadoop_instance_1"';principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' -e "select max(ORA_ROWSCN) from ${10}"`
kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
[ "$last_val_temp1" = "NULL" ] && last_val_temp_1=0 || last_val_temp_1=$last_val_temp1
[ "$last_val_temp2" = "NULL" ] && last_val_temp_2=0 || last_val_temp_2=$last_val_temp2
[ "$last_val_temp3" = "NULL" ] && last_val_temp_3=0 || last_val_temp_3=$last_val_temp3
[ "$last_val_temp4" = "NULL" ] && last_val_temp_4=0 || last_val_temp_4=$last_val_temp4
[ "$last_val_temp5" = "NULL" ] && last_val_temp_5=0 || last_val_temp_5=$last_val_temp5
[ "$last_val_temp6" = "NULL" ] && last_val_temp_6=0 || last_val_temp_6=$last_val_temp6
[ "$last_val_temp7" = "NULL" ] && last_val_temp_7=0 || last_val_temp_7=$last_val_temp7
[ "$last_val_temp8" = "NULL" ] && last_val_temp_8=0 || last_val_temp_8=$last_val_temp8
[ "$last_val_temp9" = "NULL" ] && last_val_temp_9=0 || last_val_temp_9=$last_val_temp9
[ "${last_val_temp10}" = "NULL" ] && last_val_temp_10=0 || last_val_temp_10=${last_val_temp10}
printf "last_val_1=$last_val_temp_1\nlast_val_2=$last_val_temp_2\nlast_val_3=$last_val_temp_3\nlast_val_4=$last_val_temp_4\nlast_val_5=$last_val_temp_5\nlast_val_6=$last_val_temp_6\nlast_val_7=$last_val_temp_7\nlast_val_8=$last_val_temp_8\nlast_val_9=$last_val_temp_9\nlast_val_10=${last_val_temp_10}"
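As an aside, the ten copy-pasted query blocks can be collapsed into a loop over the positional arguments. This is only a minimal sketch with the same behaviour, assuming the same keytab, queue, and beeline URL as the script above; it is untested against a real cluster:

```shell
#!/bin/bash
# Sketch of a condensed LAST_VALUE script: loops over the table names
# passed as positional arguments by the Oozie shell action.

# beeline prints the literal string NULL for max() over an empty table;
# the capture-output consumers expect 0 instead.
normalize() {
  [ "$1" = "NULL" ] && echo 0 || echo "$1"
}

query_max() {
  # Re-acquire a ticket before every query, as the original script does.
  kinit -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
  beeline --showHeader=false --outputformat=tsv2 \
    --hiveconf mapreduce.job.queuename=TEST_QUEUE \
    -u 'jdbc:hive2://ntd001:10000/hadoop_instance_1;principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM' \
    -e "select max(ORA_ROWSCN) from $1"
}

main() {
  local i
  for i in $(seq 1 "$#"); do
    # ${!i} is the i-th positional argument (a table name).
    printf 'last_val_%d=%s\n' "$i" "$(normalize "$(query_max "${!i}")")"
  done
}

# On the cluster this would be invoked as:
#   main "$@"
```

This emits the same `last_val_N=value` lines for `<capture-output/>` and would make all 30 scripts identical apart from the keytab name.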
All 30 of my workflows follow this same pattern, each importing 10 tables. Each workflow has its own unique LAST_VALUE script, and I have copied the keytab file 30 times, giving each workflow its own uniquely named copy. I have scheduled these jobs with my Oozie coordinators so that the workflow start times are staggered by 15 minutes.
Every day I see a few workflows fail at random with the error below. A workflow that fails today succeeds on its next run, but may fail again at random perhaps ten days later. Each day one or two workflows that have been running successfully for days fail, always with the same error.
Error:
[main] INFO com.unraveldata.agent.ResourceCollector - Unravel Sensor 4.5.1.1rc0013/1.3.11.3 initializing.
./LAST_VALUE_10.sh: line 3: kinit: command not found
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
scan complete in 2ms
Connecting to jdbc:hive2://ntd001:10000/hadoop_instance_1;principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM
20/10/08 02:31:12 [main]: ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:168)
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.hive.beeline.DatabaseConnection.connect(DatabaseConnection.java:146)
at org.apache.hive.beeline.DatabaseConnection.getConnection(DatabaseConnection.java:211)
at org.apache.hive.beeline.Commands.connect(Commands.java:1529)
at org.apache.hive.beeline.Commands.connect(Commands.java:1424)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:52)
at org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1139)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1178)
at org.apache.hive.beeline.BeeLine.initArgs(BeeLine.java:818)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:898)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:518)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
... 35 more
Unknown HS2 problem when communicating with Thrift server.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://ntd001:10000/hadoop_instance_1;principal=hive/ntd001.ABC.DEF.GHI.COM@ABC.DEF.GHI.COM: GSS initiate failed (state=08S01,code=0)
No current connection
I cannot pinpoint the cause of these random failures and need help identifying and fixing them. I have tried several approaches, such as splitting the shell action in each workflow so that each bash script queries only a single Hive table, and so on, but nothing has resolved it.
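One observation that may narrow this down: the first failure in the log is `./LAST_VALUE_10.sh: line 3: kinit: command not found`, so on the failing runs no Kerberos ticket is ever acquired, and the later `No valid credentials provided` from beeline is just the downstream consequence. Since an Oozie shell action runs on whichever NodeManager host YARN happens to schedule, a plausible explanation for the apparent randomness is that `kinit` is missing from the launcher's PATH on some hosts only. A defensive sketch follows; the candidate paths are assumptions, so check where the Kerberos client actually lives on your nodes:

```shell
#!/bin/bash
# find_tool: print the first executable path among the candidates given,
# so the script does not depend on the launcher's PATH, which can differ
# between NodeManager hosts.
find_tool() {
  local c
  for c in "$@"; do
    [ -n "$c" ] && [ -x "$c" ] && { printf '%s\n' "$c"; return 0; }
  done
  return 1
}

# In LAST_VALUE_10.sh each `kinit` call would then become, e.g.:
#   KINIT="$(find_tool "$(command -v kinit || true)" /usr/bin/kinit /usr/kerberos/bin/kinit)" \
#     || { echo "kinit not found on $(hostname)" >&2; exit 1; }
#   "$KINIT" -kt XYZ_10.keytab XYZ@ABC.DEF.GHI.COM
```

Failing fast with the hostname in the error would at least reveal whether the failures cluster on particular hosts; alternatively, installing the Kerberos client package on every NodeManager host, or exporting a full PATH at the top of the script, addresses the same symptom.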