How do I pass a CustomDataAsset to a DataContext to run custom expectations on a batch?

Problem description

I have a CustomPandasDataset with a custom expectation:

from datetime import date, datetime, timedelta

from great_expectations.data_asset import DataAsset
from great_expectations.dataset import PandasDataset


class CustomPandasDataset(PandasDataset):

    _data_asset_type = "CustomPandasDataset"

    @DataAsset.expectation(["column", "datetime_match", "datetime_diff"])
    def expect_column_max_value_to_match_datetime(
        self, column: str, datetime_match: datetime = None, datetime_diff: tuple = None
    ) -> dict:
        """
        Check if data is constantly updated by matching the max datetime column to a
        datetime value or to a datetime difference.
        """
        max_datetime = self[column].max()

        # Default to today's date when no reference value is given.
        if datetime_match is None:
            datetime_match = date.today()

        if datetime_diff:
            # datetime_diff is unpacked into timedelta(), e.g. (4,) means 4 days.
            success = (datetime_match - timedelta(*datetime_diff)) <= max_datetime <= datetime_match
        else:
            success = max_datetime == datetime_match

        result = {
            "data_max_value": max_datetime,
            "expected_max_value": str(datetime_match),
            "expected_datetime_diff": datetime_diff,
        }

        return {"success": success, "result": result}
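The success check inside the expectation can be tried out without Great Expectations. The following is a minimal sketch using only the standard library; the helper name max_within_window and the sample dates are made up for illustration and are not part of the original code. It uses date values throughout, since mixing date and datetime in an ordering comparison raises a TypeError in plain Python.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def max_within_window(
    max_value: date,
    match: Optional[date] = None,
    diff: Optional[Tuple[int, ...]] = None,
) -> bool:
    """Hypothetical helper that mirrors the expectation's success check."""
    # Fall back to today's date when no reference value is supplied.
    if match is None:
        match = date.today()
    if diff:
        # diff is unpacked into timedelta(), so (4,) means a 4-day window.
        return match - timedelta(*diff) <= max_value <= match
    # Without a window, the maximum must equal the reference exactly.
    return max_value == match


reference = date(2021, 3, 10)  # illustrative reference date
print(max_within_window(date(2021, 3, 8), reference, (4,)))  # True
print(max_within_window(date(2021, 3, 1), reference, (4,)))  # False
print(max_within_window(date(2021, 3, 10), reference))       # True
```

Note that datetime_diff must be a tuple such as (4,), because it is unpacked positionally into timedelta().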

I want to run the expectation expect_column_max_value_to_match_datetime on a given Pandas DataFrame:

expectation_suite_name = "df-raw-expectations"

suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)

df_ge = ge.from_pandas(df, dataset_class=CustomPandasDataset)

batch_kwargs = {'dataset': df_ge, 'datasource': 'df_raw_datasource'}

# Get batch of data
batch = context.get_batch(batch_kwargs, suite)

This gives me a batch from the DataContext. When I then run the expectation on that batch:

datetime_diff = (4,)
batch.expect_column_max_value_to_match_datetime(column='DATE', datetime_diff=datetime_diff)

I get the following error:

AttributeError: 'PandasDataset' object has no attribute 'expect_column_max_value_to_match_datetime'

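Following the docs, I specified dataset_class=CustomPandasDataset when constructing the Great Expectations dataset. Running the expectation directly on df_ge does work, but it does not work on the batch.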

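Solution

According to the docs:

To use custom expectations in a datasource or DataContext, you need to define the custom DataAsset in the datasource configuration or in the batch_kwargs for a specific batch.

So pass CustomPandasDataset through the data_asset_type parameter of get_batch():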

# Get batch of data
batch = context.get_batch(batch_kwargs, suite, data_asset_type=CustomPandasDataset)

Or define it in the context configuration:

from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import BaseDataContext

data_context_config = DataContextConfig(
    ...
    datasources={
        "sales_raw_datasource": {
            "data_asset_type": {
                "class_name": "CustomPandasDataset",
                "module_name": "custom_dataset",
            },
            "class_name": "PandasDatasource",
            "module_name": "great_expectations.datasource",
        }
    },
    ...
)
context = BaseDataContext(project_config=data_context_config)

Here CustomPandasDataset must be importable from the module/script custom_dataset.py.
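One way to picture why the error occurs: the message names 'PandasDataset', which suggests the context wraps the data in its default class instead of keeping the class of the object you pass in. The toy model below only illustrates that idea. It is not Great Expectations code, and every name in it (BaseDataset, CustomDataset, this get_batch function, expect_max_at_least) is hypothetical.

```python
class BaseDataset:
    """Stand-in for a default dataset class (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)


class CustomDataset(BaseDataset):
    """Stand-in for a subclass that adds a custom expectation."""

    def expect_max_at_least(self, threshold):
        return {"success": max(self.rows) >= threshold}


def get_batch(data, data_asset_type=None):
    """Toy factory: re-wraps the rows in the requested class, else the default."""
    cls = data_asset_type or BaseDataset
    return cls(data.rows)


wrapped = CustomDataset([1, 5, 3])

# Without an explicit class, the custom type of `wrapped` is lost.
default_batch = get_batch(wrapped)
print(type(default_batch).__name__)                   # BaseDataset
print(hasattr(default_batch, "expect_max_at_least"))  # False

# Naming the class explicitly keeps the custom expectation available.
custom_batch = get_batch(wrapped, data_asset_type=CustomDataset)
print(custom_batch.expect_max_at_least(4))            # {'success': True}
```

Under that assumption, both fixes above do the same thing: they tell the context which class to use, either per call or once in the datasource configuration.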