Problem description
I'd say this is mostly opinion-based, although it does look unnecessarily verbose, and Python Transformers don't integrate well with the rest of the Pipeline API.
It is also worth pointing out that everything you have here can be easily achieved with SQLTransformer. For example:
from pyspark.ml.feature import SQLTransformer

def column_selector(columns):
    # build a SQLTransformer that keeps only the given columns
    return SQLTransformer(
        statement="SELECT {} FROM __THIS__".format(", ".join(columns))
    )
or
def na_dropper(columns):
    # build a SQLTransformer that drops rows with NULL in any of the given columns
    return SQLTransformer(
        statement="SELECT * FROM __THIS__ WHERE {}".format(
            " AND ".join(["{} IS NOT NULL".format(x) for x in columns])
        )
    )
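Because SQLTransformer is itself a pipeline stage, these helpers plug straight into a Pipeline. A minimal sketch, assuming a DataFrame df with columns x and y (the names are placeholders):

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    column_selector(["x", "y"]),
    na_dropper(["x", "y"]),
])
result = pipeline.fit(df).transform(df)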
With a little effort you can also combine SQLAlchemy with a Hive dialect to avoid writing the SQL by hand.
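A rough sketch of that idea, assuming SQLAlchemy 1.4+ and PyHive's HiveDialect are available; the helper name is hypothetical, and SQLTransformer's __THIS__ placeholder is kept as the table name:

from sqlalchemy import column, select, table
from pyhive.sqlalchemy_hive import HiveDialect

def na_dropper_statement(columns):
    # describe the query with SQLAlchemy constructs instead of string formatting
    t = table("__THIS__", *[column(c) for c in columns])
    stmt = select(*t.columns).where(*[c.isnot(None) for c in t.columns])
    # compile against the Hive dialect to get a SQL string for SQLTransformer
    return str(stmt.compile(dialect=HiveDialect()))

The resulting string can then be passed as the statement of a SQLTransformer, as above.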
Workaround
Getting started with the Pipeline API in pyspark.ml, I found myself writing custom transformers for typical preprocessing tasks so I could use them in pipelines. Example:
from pyspark.ml import Pipeline, Transformer

class CustomTransformer(Transformer):
    # lazy workaround - a transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()
class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns
    - to be used as pipeline stage"""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        return data.select(self.columns)
class ColumnRenamer(CustomTransformer):
    """Transformer that renames one column"""

    def __init__(self, rename):
        self.rename = rename

    def _transform(self, data):
        (colNameBefore, colNameAfter) = self.rename
        return data.withColumnRenamed(colNameBefore, colNameAfter)
class NaDropper(CustomTransformer):
    """Drops rows with at least one not-a-number element"""

    def __init__(self, cols=None):
        self.cols = cols

    def _transform(self, data):
        return data.dropna(subset=self.cols)
class ColumnCaster(CustomTransformer):
    """Transformer that casts one column to a given type"""

    def __init__(self, col, toType):
        self.col = col
        self.toType = toType

    def _transform(self, data):
        return data.withColumn(self.col, data[self.col].cast(self.toType))
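For completeness, a minimal usage sketch, assuming a DataFrame df (the column names a, b and c are placeholders):

pipe = Pipeline(stages=[
    ColumnSelector(["a", "b"]),
    ColumnRenamer(("a", "a2")),
    NaDropper(cols=["a2"]),
    ColumnCaster("b", "double"),
])
result = pipe.fit(df).transform(df)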
They do work, but I am wondering whether this is a pattern or an anti-pattern - are such transformers a good way to use the Pipeline API? Was it necessary to implement them, or is equivalent functionality provided somewhere else?