The Word2Vec algorithm is one of the best-known algorithms in NLP. It learns vector representations of words from text data, and these embeddings can serve as input to other NLP algorithms.
This Word2Vec module was originally developed to support another commonly used network embedding algorithm, Node2Vec. The Node2Vec algorithm is divided into two phases:
- Random walks over the network
- Running Word2Vec on the generated walks
We only provide the implementation of the second phase here.
When used for network embedding, Word2Vec needs to handle networks with billions of nodes. We implement the SkipGram model with negative sampling, following Yahoo's paper [1].
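For reference, negative sampling replaces the full softmax with a handful of sampled negatives: for each observed (word, context) pair $(w, c)$, the model maximizes the standard SGNS objective

$$\log \sigma(\mathbf{u}_c^{\top}\mathbf{v}_w) \;+\; \sum_{i=1}^{k} \log \sigma(-\mathbf{u}_{c_i}^{\top}\mathbf{v}_w), \qquad c_i \sim P_n,$$

where $\mathbf{v}_w$ and $\mathbf{u}_c$ are the word and context vectors (each of dimension vectorDim), $\sigma$ is the sigmoid function, $k$ is the number of negative samples (negSample), and $P_n$ is the noise distribution over words.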
- input: HDFS path of the input data: sentences produced by random walks, with words (node ids) consecutively numbered from 0 and separated by whitespace or commas, e.g.: 0 1 3 5 9 2 1 5 1 7 3 1 4 2 8 3 2 5 1 3 4 1 2 9 4 (a sketch of preparing such input follows this list)
- modelPath: HDFS path for saving the model; the final save path is hdfs:///.../epoch_checkpoint_x, where x is the epoch number
- modelCPInterval: checkpoint the model every this many epochs
- vectorDim: the dimension of the embedding space; both the word vectors and the context vectors have this dimension
- negSample: the number of negative samples
- learningRate: the learning rate for mini-batch gradient descent
- batchSize: the size of each mini-batch
- maxEpoch: the number of epochs over the samples (the samples are shuffled after each epoch)
- window: the size of the training window; context words are drawn from up to window positions on either side of each word
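A minimal sketch of preparing such input, assuming the `hdfs://my-hdfs/data` directory (the `input` path used in the submit script below) is writable on your cluster:

```bash
# Each line is one random walk; node ids are consecutive integers from 0,
# separated by whitespace (commas also work).
cat > walks.txt <<'EOF'
0 1 3 5 9
2 1 5 1 7
3 1 4 2 8
EOF
# Upload to the HDFS directory that will be passed as `input`.
hadoop fs -mkdir -p hdfs://my-hdfs/data
hadoop fs -put walks.txt hdfs://my-hdfs/data/
```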
Several steps must be completed before editing the submit script and running (a command sketch follows this list):
- confirm Hadoop and Spark are ready in your environment
- unzip `sona-<version>-bin.zip` to a local directory (SONA_HOME)
- upload the `sona-<version>-bin` directory to HDFS (SONA_HDFS_HOME)
- edit $SONA_HOME/bin/spark-on-angel-env.sh, setting SPARK_HOME, SONA_HOME, SONA_HDFS_HOME and ANGEL_VERSION
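A sketch of these steps as shell commands; the release version and HDFS destination below are placeholders, so substitute your own:

```bash
version=x.y.z                                     # placeholder release version
unzip sona-${version}-bin.zip                     # unpacked directory becomes SONA_HOME
export SONA_HOME=$PWD/sona-${version}-bin
# Upload the whole directory to HDFS; this location is SONA_HDFS_HOME.
hadoop fs -put $SONA_HOME hdfs://my-hdfs/sona-${version}-bin
# Finally, set SPARK_HOME, SONA_HOME, SONA_HDFS_HOME and ANGEL_VERSION here:
vi $SONA_HOME/bin/spark-on-angel-env.sh
```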
Here's an example submit script; remember to adjust the parameters and fill in the paths according to your own task.
```bash
HADOOP_HOME=my-hadoop-home
input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/model
queue=my-queue
export HADOOP_HOME=$HADOOP_HOME
source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--conf spark.yarn.allocation.am.maxMemory=55g \
--conf spark.yarn.allocation.executor.maxMemory=55g \
--conf spark.driver.maxResultSize=20g \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.ps.instances=2 \
--conf spark.ps.cores=2 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=15g \
--conf spark.ps.log.level=INFO \
--conf spark.offline.evaluate=200 \
--conf spark.hadoop.angel.model.partitioner.max.partition.number=1000 \
--conf spark.hadoop.angel.ps.backup.interval.ms=2000000000 \
--conf spark.hadoop.angel.matrixtransfer.request.timeout.ms=60000 \
--conf spark.hadoop.angel.ps.jvm.direct.factor.use.direct.buff=0.20 \
--queue $queue \
--name "word2vec sona" \
--jars $SONA_SPARK_JARS \
--driver-memory 5g \
--num-executors 2 \
--executor-cores 2 \
--executor-memory 10g \
--class org.apache.spark.angel.examples.graph.Word2vecExample \
./lib/angelml-$SONA_VERSION.jar \
input:$input output:$output embedding:100 negative:5 window:5 epoch:5 stepSize:0.1 numParts:20 batchSize:2560 subSample:false modelType:cbow interval:10000
```