I first studied the material from the book, then implemented word2vec with the companion code, available here: https://github.com/PacktPublishing/Natural-Language-Processing-with-TensorFlow/blob/master/ch3/ch3_word2vec.ipynb
Since my goal is recognizing domain-specific terms, I did not experiment with the English dataset that comes with the code. Instead I used a small domain-specific corpus I collected myself, segmented it with jieba, and then ran word2vec on it. This post records the errors I hit and how I fixed them; I will go through the code in detail when I have time.
1. In the "Generating Batches of Data for Skip-Gram" stage, this error is raised:
print(' batch:', [reverse_dictionary[bi] for bi in batch])
KeyError: 326960996
Cause: batch is created with np.ndarray, which allocates an uninitialized array holding arbitrary values. When batch_size is not an integer multiple of the number of samples written per target word (num_samples, at most 2 * window_size), the trailing slots of batch are never overwritten, so a leftover garbage value (such as the 326960996 in the error above) gets used as a key into reverse_dictionary and inevitably raises a KeyError. The following example makes this clear:
# data = [44, 45, 46, 47, 48, 49, 0, 0, 0, 5, 0, 0, 0, 15, 16, ...]
# Example: batch_size=16, window_size=1, buffer length=3, num_samples=2
# batch = [45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 0, 0, 0, 0, 0,
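To make the example concrete, here is a minimal sketch of a batch generator with the same structure as the notebook's (this is my simplified reconstruction, not the repo's exact code): it fills batch in chunks of num_samples entries per target word, so any slots beyond the last full chunk are never written. Replacing the uninitialized np.ndarray allocation with np.zeros (or asserting batch_size % num_samples == 0) avoids the KeyError:

```python
import numpy as np

def generate_batch_skip_gram(data, batch_size, window_size, num_samples, data_index=0):
    """Simplified sketch of the notebook's skip-gram batch generator
    (assumed structure, not the exact repo code)."""
    # BUG source in the original: np.ndarray((batch_size,)) returns
    # UNINITIALIZED memory, so unwritten slots keep garbage like 326960996.
    # FIX: allocate with np.zeros so leftover slots hold a valid index (0).
    batch = np.zeros(batch_size, dtype=np.int64)
    labels = np.zeros((batch_size, 1), dtype=np.int64)
    span = 2 * window_size + 1                  # buffer: [context, target, context]
    num_targets = batch_size // num_samples     # leftover slots are never written!
    for i in range(num_targets):
        target = data_index + window_size       # position of the center word
        contexts = [data_index + k for k in range(span) if k != window_size]
        for j, c in enumerate(contexts[:num_samples]):
            batch[i * num_samples + j] = data[target]   # target word, repeated
            labels[i * num_samples + j, 0] = data[c]    # one context word
        data_index += 1
    return batch, labels

# Same parameters as the comment example above
data = [44, 45, 46, 47, 48, 49, 0, 0, 0, 5, 0, 0, 0, 15, 16]
batch, labels = generate_batch_skip_gram(data, batch_size=16, window_size=1, num_samples=2)
print(batch)  # [45 45 46 46 47 47 48 48 49 49  0  0  0  0  0  0]
```

With batch_size=16 and num_samples=2 the batch happens to fill completely, but with, say, batch_size=15 the last slot would stay at whatever np.ndarray left there; with np.zeros it stays 0, a valid reverse_dictionary key.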