Problem description
I have trained a word2vec model with the Python h2o package. Is there a simple way to save the word2vec model and load it back later?
I have tried the h2o.save_model() and h2o.load_model() functions, but with no luck. With that approach I get an error like:
ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://localhost:54321/99/Models.bin/)
water.exceptions.H2OIllegalArgumentException
[1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: dir of function: importModel:
I am training and reloading the model with the same version of h2o, so the issue outlined in this question does not apply: Can't import binay h2o model with h2o.loadModel() function: 412 Precondition Failed
Does anyone have any insight into how to save and load an h2o word2vec model?
import h2o
from h2o.estimators import H2OWord2vecEstimator

df["text"] = df["text"].ascharacter()

# Break text into a sequence of words
words = tokenize(df["text"])

# Initialize h2o
print('Initializing h2o.')
h2o.init(ip=h2o_ip, port=h2o_port, min_mem_size=h2o_min_memory)

# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate=0.0, epochs=10)
w2v_model.train(training_frame=words)

# Calculate a vector for each row
word_vecs = w2v_model.transform(words, aggregate_method="AVERAGE")

# Save model to path
wv_path = '/models/wordvec/'
model_path = h2o.save_model(model=w2v_model, path=wv_path, force=True)

# Load model in a later script
w2v_model = h2o.load_model(model_path)
Solution
It sounds like there may be an access issue with the directory you are trying to read from. I just tested this on H2O 3.30.0.1, following the w2v example from docs, and it worked fine:
job_titles = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv",
    col_names=["category", "jobtitle"],
    col_types=["string", "string"],
    header=1,
)

STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can","lines","re","what","there","all","we","one","the","a","an","of","or","in","for","by","on","but","is","not","with","as","was","if","they","are","this","and","it","have","from","at","my","be","that","to","com","org","like","likes","so"]

# Make the 'tokenize' function:
def tokenize(sentences, stop_word=STOP_WORDS):
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    # Keep tokens of at least two characters (and NA rows)
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()), :]
    # Drop tokens containing digits
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]", invert=True, output_logical=True), :]
    # Drop stop words
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~tokenized_words.isin(STOP_WORDS)), :]
    return tokenized_words

# Break job titles into a sequence of words:
words = tokenize(job_titles["jobtitle"])

# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate=0.0, epochs=10)
w2v_model.train(training_frame=words)

# Save model
wv_path = 'models/'
model_path = h2o.save_model(model=w2v_model, path=wv_path, force=True)

# Load model
w2v_model2 = h2o.load_model(model_path)
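As a follow-up on the directory-access theory: before calling h2o.save_model, it can help to confirm that the target directory exists and is writable from Python. This is a minimal sketch using only the standard library; the wv_path name simply mirrors the snippet above:

```python
import os

# Resolve the model directory to an absolute path; relative paths are
# resolved by the H2O backend process, which can cause surprises.
wv_path = os.path.abspath("models/")

# Create the directory if it does not exist yet.
os.makedirs(wv_path, exist_ok=True)

# Verify the current process can actually write there before saving.
if not os.access(wv_path, os.W_OK):
    raise PermissionError("No write access to " + wv_path)
```

Note that h2o.save_model is executed by the H2O server, not the Python client, so if this check passes but the 412 error persists, the server process (e.g. one running on a remote host or under a different user) may still lack access to that path.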