Recently I became interested in your idea of modeling word pairs in a document for short texts, but I'm a bit confused about how biterms are counted in BTM. You did a nice job implementing it in C++, but I'm not good at C++ and find the code hard to read. I wonder whether every word pair within a document gets a count of one, and whether the biterm vector for the whole biterm set is built up by computing the word pairs document by document. I hope you can clear up my confusion. Thank you very much!
Not exactly. A biterm is defined as a pair of words co-occurring in the same text window. For example,
suppose a doc is "A B C B" and the window size is 3. Then there are two text windows, which generate biterms as follows:
text window "A B C" => "A B", "B C", "A C"
text window "B C B" => "B C", "C B", "B B"
Since a biterm is an unordered word pair, "B C" = "C B". Thus, the doc counts the biterm "B C" 3 times, and the biterms "A B", "A C", and "B B" 1 time each.
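For readers who find the C++ hard to follow, here is a minimal Python sketch of the counting described above. It is not taken from the repository's code; the function names `count_biterms` and `count_corpus_biterms` are hypothetical, and treating a doc shorter than the window as a single window is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_biterms(tokens, window=3):
    """Count unordered word pairs over every sliding text window of a doc."""
    counts = Counter()
    n = len(tokens)
    if n <= window:
        # Assumption: a doc no longer than the window forms one window.
        windows = [tokens]
    else:
        windows = [tokens[i:i + window] for i in range(n - window + 1)]
    for win in windows:
        # All pairs of positions inside the window.
        for a, b in combinations(win, 2):
            # Sort the pair so that "B C" and "C B" share one key.
            counts[tuple(sorted((a, b)))] += 1
    return counts


def count_corpus_biterms(docs, window=3):
    """Accumulate biterm counts document by document."""
    total = Counter()
    for tokens in docs:
        total.update(count_biterms(tokens, window))
    return total


if __name__ == "__main__":
    print(count_biterms("A B C B".split(), window=3))
```

On the doc "A B C B" with window size 3, this gives "B C" a count of 3 and "A B", "A C", "B B" a count of 1 each, matching the example. Corpus-level counts are just the sum of the per-doc counts.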
PS: Thanks to other contributors, you can find implementations of BTM in other languages (e.g., Python, Julia, Scala) on GitHub :)