Problem description
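I need to run data quality tests, so I am using Amazon Deequ. With the code below I can get the pass/fail status of each data quality check, but next I want to collect every row that failed a check and store those rows in another DataFrame or a Hive table. How can I do that? Also, can Amazon Deequ be run on several datasets at the same time? Below is the code I am running; I need help with the part that stores the failed records.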
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.constraints.ConstraintStatus

object Test extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("amazon-deequ-test")
    .getOrCreate()

  // Sample rows. Several fields were lost from the original listing; they are
  // assumed here to have been nulls (and 0 for row 2's numViews).
  val rows: Seq[(Int, String, String, String, Int)] = Seq(
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", null, 0),
    (3, null, null, "low", 5),
    (4, "Thingy D", "checkout https://thingd.ca", null, -10),
    (5, "Thingy E", null, null, 12))
  val cols = Seq("id", "productName", "description", "priority", "numViews")
  val data = spark.createDataFrame(rows).toDF(cols: _*)
  data.show(false)

  val verificationResult: VerificationResult = VerificationSuite()
    .onData(data)
    .addCheck(
      Check(CheckLevel.Error, "integrity checks")
        // we expect 5 records
        .hasSize(_ == 5)
        // 'id' should never be NULL
        .isComplete("id")
        // 'id' should not contain duplicates
        .isUnique("id")
        // 'productName' should never be NULL
        .isComplete("productName")
        // 'priority' should only contain the values "high" and "low"
        .isContainedIn("priority", Array("high", "low"))
        // 'numViews' should not contain negative values
        .isNonNegative("numViews"))
    .addCheck(
      Check(CheckLevel.Warning, "distribution checks")
        // at least half of the 'description's should contain a url
        .containsURL("description", _ >= 0.5)
        // half of the items should have less than 10 'numViews'
        .hasApproxQuantile("numViews", 0.5, _ <= 10))
    .run()

  // One row per constraint, with its status and message (not one row per record)
  val resultDataFrame = checkResultsAsDataFrame(spark, verificationResult)
  resultDataFrame.show(false)
}
Solution
No verified solution to this problem has been found yet.
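As an unverified starting point for the first question: the checks in the question are evaluated as aggregate metrics over the whole DataFrame, so the check result only reports a status per constraint. One way to get at the offending records is to mirror each row-level rule as a plain Spark predicate and filter on it. The sketch below does that with Spark alone, not through Deequ. The object name `FailedRowsSketch`, the `failedChecks` column, and the table name `dq_failed_rows` are all hypothetical, and the null handling (nulls pass the `priority` and `numViews` rules) is an assumption about the intended semantics. Whole-dataset rules such as `hasSize` or the quantile check have no per-row equivalent and are left out.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat_ws, count, lit, when}

object FailedRowsSketch {
  // Each entry pairs a label with a predicate that is true when a row VIOLATES
  // the rule. These are hand-written mirrors of the checks, not Deequ output.
  val violations: Seq[(String, Column)] = Seq(
    "id is null" -> col("id").isNull,
    "id is duplicated" -> (count(lit(1)).over(Window.partitionBy("id")) > 1),
    "productName is null" -> col("productName").isNull,
    // Assumption: a null priority is not counted as a violation here
    "priority not in (high, low)" ->
      (col("priority").isNotNull && !col("priority").isin("high", "low")),
    "numViews is negative" -> (col("numViews") < 0)
  )

  // Adds a `failedChecks` column listing the broken rules and keeps only
  // the rows that broke at least one of them.
  def failedRows(df: DataFrame): DataFrame = {
    // when(...) without otherwise yields null if the rule holds; concat_ws skips nulls
    val labels = violations.map { case (name, cond) => when(cond, lit(name)) }
    df.withColumn("failedChecks", concat_ws("; ", labels: _*))
      .filter(col("failedChecks") =!= "")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("failed-rows-sketch")
      .getOrCreate()

    val rows: Seq[(Int, String, String, String, Int)] = Seq(
      (1, "Thingy A", "awesome thing.", "high", 0),
      (3, null, null, "low", 5),
      (4, "Thingy D", "checkout https://thingd.ca", null, -10))
    val data = spark.createDataFrame(rows)
      .toDF("id", "productName", "description", "priority", "numViews")

    val failed = failedRows(data)
    failed.show(false)

    // To persist into Hive (hypothetical table name; the session would need
    // .enableHiveSupport() and a reachable metastore):
    // failed.write.mode("append").saveAsTable("dq_failed_rows")

    spark.stop()
  }
}
```

Keeping the predicates in one list means the same labels can be reused for reporting, and a new rule only needs one extra entry.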
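On the second question, a `VerificationSuite` run is bound to a single DataFrame through `onData`, so checking several datasets means one run per dataset. A minimal sketch, assuming every dataset shares the columns the check refers to: each run is submitted from its own `Future` so the Spark jobs can overlap, and the per-constraint results are tagged with a dataset name and unioned. The object name `MultiDatasetChecks`, the helper `runAll`, the `dataset` column, and the dataset names are hypothetical; whether the runs truly execute in parallel depends on the cluster's resources and scheduler settings.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}

object MultiDatasetChecks {
  // Runs the same check against every named DataFrame and returns one combined
  // result DataFrame. `datasets` must be non-empty, since reduce fails on empty input.
  def runAll(spark: SparkSession, datasets: Map[String, DataFrame], check: Check): DataFrame = {
    val runs = datasets.toSeq.map { case (name, df) =>
      Future {
        val result = VerificationSuite().onData(df).addCheck(check).run()
        // Tag each constraint result with the dataset it came from
        checkResultsAsDataFrame(spark, result).withColumn("dataset", lit(name))
      }
    }
    Await.result(Future.sequence(runs), Duration.Inf).reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-dataset-sketch")
      .getOrCreate()

    // Two tiny illustrative datasets; the second one has a duplicate id
    val first = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    val second = spark.createDataFrame(Seq((1, "x"), (1, "y"))).toDF("id", "name")

    val check = Check(CheckLevel.Error, "shared checks")
      .isComplete("id")
      .isUnique("id")

    runAll(spark, Map("first" -> first, "second" -> second), check).show(false)
    spark.stop()
  }
}
```

If the datasets need different rules, the map could carry a `Check` per dataset instead of sharing one; the combined result can then be written out the same way as any other DataFrame.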