Why are window_length / hop_length multiplied by the sample rate in librosa.core.stft in this example?

Problem description

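I am new to speech recognition, and I am working through the details of this implementation of speaker verification. In data_preprocess.py, the author uses the librosa library. Here is a simplified version of the code: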

import os

import librosa
import numpy as np
from tqdm import tqdm


def preprocess_data(data_dir, res_dir, N, M, tdsv_frame, sample_rate, nfft, window_len, hop_len):
    os.makedirs(res_dir, exist_ok=True)
    batch_frames = N * M * tdsv_frame  # number of STFT frames per saved batch
    batch_number = 0
    batch = []
    batch_len = 0
    for i, path in enumerate(tqdm(os.listdir(data_dir))):
        data, sr = librosa.core.load(os.path.join(data_dir, path), sr=sample_rate)
        # STFT of one utterance; window and hop are passed as sample counts
        S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
        batch.append(S)
        batch_len += S.shape[1]
        if batch_len < batch_frames:
            continue
        # Concatenate along the time axis and trim to exactly batch_frames frames
        batch = np.concatenate(batch, axis=1)[:, :batch_frames]
        np.save(os.path.join(res_dir, "voice_%d.npy" % batch_number), batch)
        batch_number += 1
        batch = []
        batch_len = 0


N = 2               # number of speakers per batch
M = 400             # number of utterances per speaker
tdsv_frame = 80     # feature size
sample_rate = 8000  # sampling rate (Hz)
nfft = 512          # fft kernel size
window_len = 0.025  # window length (s), i.e. 25 ms
hop_len = 0.01      # hop size (s), i.e. 10 ms
data_dir = "./data/clean_testset_wav/"
res_dir = "./data/clean_testset_wav_prep/"
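
For reference, here is a small sketch of the arithmetic behind those two arguments. It assumes window_len and hop_len are durations in seconds (0.025 s = 25 ms, 0.01 s = 10 ms); librosa's stft takes win_length and hop_length as numbers of samples, so a duration has to be multiplied by the sampling rate before it is passed in. The helper name seconds_to_samples is hypothetical and not part of the original script, and it uses round() where the script uses int().

```python
# Assumption: window_len and hop_len are durations in seconds.
sample_rate = 8000   # samples per second
nfft = 512
window_len = 0.025   # 25 ms
hop_len = 0.01       # 10 ms


def seconds_to_samples(duration_s, rate_hz):
    """Hypothetical helper: convert a duration in seconds to a sample count."""
    # round() guards against floating-point results such as 199.999...
    return int(round(duration_s * rate_hz))


win_length = seconds_to_samples(window_len, sample_rate)  # samples per analysis window
hop_length = seconds_to_samples(hop_len, sample_rate)     # samples between window starts
n_freq_bins = 1 + nfft // 2                               # rows of the STFT matrix

print(win_length, hop_length, n_freq_bins)
```

With the values from the question this gives a 200-sample window, an 80-sample hop, and 257 frequency bins per frame.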

They want to create batches of features of size (N*M)*tdsv_frame, following the figure in the paper.
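
To make the batch size concrete, here is a toy sketch of the concatenate-and-trim step from the code above. Random arrays stand in for the STFT outputs, and the values of N, M and tdsv_frame are shrunk so it runs instantly; the names specs and n_bins are illustrative and do not appear in the original script.

```python
import numpy as np

# Toy values; the question uses N = 2, M = 400, tdsv_frame = 80.
N, M, tdsv_frame = 2, 3, 4
batch_frames = N * M * tdsv_frame   # total frames kept per batch
n_bins = 257                        # 1 + nfft // 2 for nfft = 512

rng = np.random.default_rng(0)
# Stand-ins for per-file STFT matrices of shape (n_bins, n_frames).
specs = [rng.standard_normal((n_bins, t)) for t in (10, 9, 8)]

# Join along the time axis (axis=1), then keep exactly batch_frames columns.
batch = np.concatenate(specs, axis=1)[:, :batch_frames]
print(batch.shape)
```

Each saved batch therefore has one row per frequency bin and N * M * tdsv_frame columns, regardless of how many files contributed to it.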

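I think I understand the concepts of window_length and hop_length, but what puzzles me is how the author sets these parameters. Why are these lengths multiplied by sample_rate, as is done here: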

S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))

Thanks.

Solution

No working solution has been posted for this problem yet.
