Is there a preferred way to save and load an h2o word2vec model in Python?

Problem description

I have trained a word2vec model with the Python h2o package. Is there a simple way to save the word2vec model and load it back later?

I have tried the h2o.save_model() and h2o.load_model() functions, but with no luck. When I use them, I get errors like the following:
ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://localhost:54321/99/Models.bin/)

water.exceptions.H2OIllegalArgumentException
[1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: dir of function: importModel:
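
The exception complains about an illegal `dir` argument, which points at the model directory rather than the model itself. Because h2o.save_model() and h2o.load_model() are executed inside the H2O server process, the path has to exist and be accessible on the machine running that server, not just from the Python client. A minimal sanity-check sketch using only the standard library (the helper name and the temp directory are illustrative, not part of the h2o API):

```python
import os
import tempfile

def check_model_dir(path):
    """Report whether a directory exists and is readable/writable.

    h2o.save_model()/h2o.load_model() run inside the H2O server
    process, so the path must be accessible from that process,
    not just from the Python client.
    """
    return {
        "exists": os.path.isdir(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }

# Example against a directory we know exists:
tmp = tempfile.mkdtemp()
print(check_model_dir(tmp))  # all three flags True for a fresh temp dir
```

If the H2O cluster runs on a remote host, this check has to be run on that host.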

I am training and reloading the model with the same version of h2o, so the issue outlined in this question does not apply: Can't import binay h2o model with h2o.loadModel() function: 412 Precondition Failed

Does anyone have any insight into how to save and load an h2o word2vec model?

Sample code with the relevant snippets

import h2o
from h2o.estimators import H2OWord2vecEstimator

df['text'] = df['text'].ascharacter()

# Break text into a sequence of words
words = tokenize(df["text"])

# Initialize h2o
print('Initializing h2o.')
h2o.init(ip=h2o_ip, port=h2o_port, min_mem_size=h2o_min_memory)

# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate=0.0, epochs=10)
w2v_model.train(training_frame=words)

# Calculate a vector for each row
word_vecs = w2v_model.transform(words, aggregate_method="AVERAGE")

# Save model to path
wv_path = '/models/wordvec/'
model_path = h2o.save_model(model=w2v_model, path=wv_path, force=True)

# Load model in a later script
w2v_model = h2o.load_model(model_path)
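
One thing worth double-checking in the snippet above is the save path: '/models/wordvec/' is an absolute path at the filesystem root, which the H2O server process usually cannot create or write to. A safer pattern (a sketch; the directory names and the `ensure_model_dir` helper are placeholders, not h2o functions) is to create the directory up front and hand H2O an absolute path:

```python
import os

def ensure_model_dir(base, sub="wordvec"):
    """Create the model directory if needed and return it as an absolute path."""
    path = os.path.abspath(os.path.join(base, sub))
    os.makedirs(path, exist_ok=True)
    return path

# Build e.g. <cwd>/models/wordvec instead of /models/wordvec:
wv_path = ensure_model_dir("models")

# Then save exactly as before:
# model_path = h2o.save_model(model=w2v_model, path=wv_path, force=True)
```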

Solution

It sounds like there may be an access issue with the directory you are trying to read from. I just tested this on H2O 3.30.0.1, following the w2v example from the docs, and it works fine:

import h2o
from h2o.estimators import H2OWord2vecEstimator

job_titles = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv",
    col_names=["category", "jobtitle"],
    col_types=["string", "string"],
    header=1,
)

STOP_WORDS = ["ax", "i", "you", "edu", "s", "t", "m", "subject", "can", "lines",
              "re", "what", "there", "all", "we", "one", "the", "a", "an", "of",
              "or", "in", "for", "by", "on", "but", "is", "not", "with", "as",
              "was", "if", "they", "are", "this", "and", "it", "have", "from",
              "at", "my", "be", "that", "to", "com", "org", "like", "likes", "so"]

# Make the 'tokenize' function:
def tokenize(sentences, stop_words=STOP_WORDS):
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()), :]
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]", invert=True, output_logical=True), :]
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~ tokenized_words.isin(stop_words)), :]
    return tokenized_words

# Break job titles into a sequence of words:
words = tokenize(job_titles["jobtitle"])

# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate=0.0, epochs=10)
w2v_model.train(training_frame=words)

# Save model
wv_path = 'models/'
model_path = h2o.save_model(model=w2v_model, path=wv_path, force=True)

# Load model
w2v_model2 = h2o.load_model(model_path)
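
For intuition about what the H2O tokenize() helper above is doing, the same filtering steps can be mirrored in plain Python: split on non-word characters, lowercase, then drop one-character tokens, tokens containing digits, and stop words. This is a standalone sketch (with a shortened stop-word set), not part of the h2o API:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "or", "in", "for", "by", "on", "at"}

def tokenize_plain(sentence, stop_words=STOP_WORDS):
    """Plain-Python analogue of the H2O tokenize() pipeline:
    \\W+ split, lowercase, then length/digit/stop-word filters."""
    tokens = re.split(r"\W+", sentence.lower())
    return [t for t in tokens
            if len(t) >= 2
            and not re.search(r"[0-9]", t)
            and t not in stop_words]

print(tokenize_plain("Senior Software Engineer 2"))
# ['senior', 'software', 'engineer']
```

The H2O version additionally keeps NA rows, so the sentence boundaries between job titles are preserved in the frame that word2vec trains on.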