How does BTM create its biterm sets? #10

Open
LjessonS opened this issue Mar 15, 2017 · 3 comments

@LjessonS

Recently I have become interested in your idea of modeling word pairs within a document for short texts, but I am a bit confused about how the biterm sets are counted in BTM. You did a nice job implementing it in C++, but I am not good at C++ and find the code hard to read. Is the count of every word pair within a document one, and is the biterm vector over the whole biterm set updated by computing the word pairs document by document? I hope you can clear this up. Thank you very much!

@xiaohuiyan
Owner

Not exactly right. A biterm is defined as a pair of words co-occurring in the same text window. For example, take the doc "A B C B" and suppose the window size is 3, so there are two text windows, which generate biterms as follows:

  • text window "A B C" => "A B", "B C", "A C"
  • text window "B C B" => "B C", "C B", "B B"

Since a biterm is an unordered word pair, "B C" = "C B". Thus, the doc counts the biterm "B C" 3 times, and the biterms "A B", "A C", and "B B" 1 time each.
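
For readers who find the C++ hard to follow, the window-based counting described in this comment could be sketched in Python roughly as below. This is only an illustration of the description above, not a port of the C++ code, and the actual implementation may enumerate pairs differently. The function name `count_biterms`, its parameters, and the handling of documents shorter than the window (treated as a single window) are assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations


def count_biterms(tokens, window_size=3):
    """Count unordered word pairs (biterms) over sliding text windows.

    Hypothetical helper: mirrors the worked example in the comment,
    where overlapping windows each contribute their own pairs.
    """
    # Assumption: a document shorter than the window forms one window.
    n_windows = max(1, len(tokens) - window_size + 1)
    counts = Counter()
    for start in range(n_windows):
        window = tokens[start:start + window_size]
        # Every pair of positions inside the window yields one biterm.
        for w1, w2 in combinations(window, 2):
            # Sort the pair so that "B C" and "C B" share one key.
            counts[tuple(sorted((w1, w2)))] += 1
    return counts


counts = count_biterms("A B C B".split(), window_size=3)
for biterm, n in sorted(counts.items()):
    print(biterm, n)
```

For the doc "A B C B" with a window size of 3, this reproduces the counts given above: "B C" three times, and "A B", "A C", "B B" once each. Corpus-level counts would then be the sum of these per-document counters.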

PS: Thanks to other contributors, you can find implementations of BTM in other languages (e.g., Python, Julia, Scala) on GitHub :)

@himanshi-sinha

Hi, could you please provide the link to the Python implementation of BTM?

@rtrad89

rtrad89 commented Jul 7, 2020

Hi, could you please provide the link to the Python implementation of BTM?

Here.
