Question about google.bin #16

Open
burette opened this issue Dec 2, 2019 · 7 comments

Comments

@burette

burette commented Dec 2, 2019

Hello,
In your code there is this commented-out line:
#tf.flags.DEFINE_string("word2vec", "./data/rt-polaritydata/google.bin", "Word2vec file with pre-trained embeddings (default: None)")
Is this google.bin file Google's GoogleNews-vectors-negative300.bin?

@Aliang-CN

Have you tried the negative300.bin file?

@burette
Author

burette commented Dec 6, 2019

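Have you tried the negative300.bin file?

Yes, I tried it. I used the pre-trained GoogleNews-vectors-negative300.bin. The original code targets Python 2.7 and I am on Python 3.5. The part of the original code that reads this file fails under Python 3 with an out-of-memory error. Under Python 3, the following snippet reads negative300.bin: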
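for line in tqdm(range(vocab_size)):
    # f is the .bin file opened in 'rb' mode, so f.read(1) returns bytes
    # in Python 3 and must be compared against b' ', not ' '.
    # (The earlier list-based version, word = [] / word.append(ch),
    # is replaced by the bytes accumulator below.)
    word = b''
    while True:
        ch = f.read(1)
        if ch == b' ':
            break
        # Skip a newline left over from the previous record, as the
        # earlier version did; otherwise it becomes part of the word.
        if ch != b'\n':
            word += ch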
With this, the whole pipeline runs end to end.
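For anyone who wants the loop above in a self-contained form, here is a minimal sketch of a Python 3 reader for the word2vec binary layout (a text header line, then for each entry the word, a space, and the raw float32 values). The function name `read_word2vec_bin` and its `limit` parameter are hypothetical and not taken from the repository. The sketch also assumes little-endian floats and UTF-8 words, and it returns plain tuples rather than NumPy arrays to stay within the standard library.

```python
import struct


def read_word2vec_bin(path, limit=None):
    """Read a word2vec binary file into a {word: tuple_of_floats} dict.

    Assumptions (not confirmed by the thread): little-endian float32
    values and UTF-8 encoded words.
    """
    vectors = {}
    with open(path, "rb") as f:
        # The header is a text line: b"<vocab_size> <vector_size>\n".
        vocab_size, dim = map(int, f.readline().split())
        vector_bytes = 4 * dim  # float32 is 4 bytes per component
        if limit is not None:
            vocab_size = min(vocab_size, limit)
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b"":
                    # Without this check a truncated file loops forever.
                    raise EOFError("file ended while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # newline left over from the previous entry
                    chars.append(ch)
            # Decode so that keys are str and match a str vocabulary.
            word = b"".join(chars).decode("utf-8", errors="replace")
            buf = f.read(vector_bytes)
            if len(buf) != vector_bytes:
                raise EOFError("file ended while reading a vector")
            vectors[word] = struct.unpack("<%df" % dim, buf)
    return vectors
```

Decoding each word to `str` matters in Python 3 because a `bytes` key such as `b'movie'` never compares equal to the `str` `'movie'`.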

@Aliang-CN

Have you tried reading the .bin file with gensim?

@Aliang-CN

Reading the file your way is too slow. It takes 3 hours.

@burette
Author

burette commented Dec 9, 2019

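Reading the file your way is too slow. It takes 3 hours.

Could the three hours be down to machine performance? On my side, several machines, all with i5 CPUs, finish reading the file in a few minutes.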

@burette
Author

burette commented Dec 9, 2019

Have you tried reading the .bin file with gensim?

from gensim.models.keyedvectors import KeyedVectors
# limit=300000 loads only the first 300,000 vectors in the file
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=300000)

@Aliang-CN

I tried both methods. When I iterate over the words in vocabulary_user, none of them are found in the model. What do you see on your side?
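The thread does not establish why the lookups miss. One way to narrow it down is a small coverage check like the sketch below. It tests a few possible causes, all of which are assumptions rather than confirmed diagnoses: keys stored as `bytes` instead of `str`, stray whitespace such as a leading newline in the keys, and case differences between the vocabulary and the embedding keys. The function name `vocabulary_coverage` is hypothetical.

```python
def vocabulary_coverage(vocab_words, embedding_keys):
    """Report how many vocabulary words match the embedding keys, and how."""
    keys = set()
    for key in embedding_keys:
        # Keys from a raw binary read may be bytes; normalise them to str.
        if isinstance(key, bytes):
            key = key.decode("utf-8", errors="replace")
        # Drop stray whitespace such as a leading newline.
        keys.add(key.strip())
    lowered = {k.lower() for k in keys}

    report = {"exact": 0, "case_insensitive": 0, "missing": []}
    for word in vocab_words:
        if word in keys:
            report["exact"] += 1
        elif word.lower() in lowered:
            report["case_insensitive"] += 1
        else:
            report["missing"].append(word)
    return report
```

If `exact` is zero but the count becomes non-zero after this normalisation, the mismatch is probably in how the keys are stored rather than in the file itself. With the gensim call above, words outside the first 300,000 entries are also absent because of `limit`.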
