Vocabulary extraction #280
Hi Alex, thanks for the PR.
src/document.jl
ordered_dict = OrderedDict{String,Int}()
sizehint!(ordered_dict, length(string_vector))

# reverse the order of the keys and values in the enumerate iterator to get an ordered dict.
The comment is unclear. The code doesn't actually change any order.
I will remove the comment if it makes things unclear. What I meant is that enumerate yields the tuples in (index, key) order, and inside the loop we swap them by storing the index under the key. However, it might be confusing and is probably redundant as a comment, so I would remove it entirely.
for (index, key) in enumerate(string_vector)
    ordered_dict[key] = index
end
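For readers following the thread, the key-to-index mapping discussed above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `build_vocab` is a hypothetical name, the example assumes the OrderedCollections package is installed (DataStructures re-exports its `OrderedDict`), and skipping repeated tokens is an assumption about the intended behavior.

```julia
using OrderedCollections  # assumption: provides OrderedDict

# Hypothetical helper, not the vocab() function from the PR.
# Maps each distinct token to the position at which it first appeared
# among the distinct tokens, preserving insertion order.
function build_vocab(tokens::AbstractVector{<:AbstractString})
    vocab = OrderedDict{String,Int}()
    sizehint!(vocab, length(tokens))
    for token in tokens
        # get! inserts only when the key is missing,
        # so a repeated token keeps its first index.
        get!(vocab, String(token), length(vocab) + 1)
    end
    return vocab
end

vocab = build_vocab(["the", "cat", "sat", "on", "the", "mat"])
```

Here the second "the" is ignored, so the indices stay contiguous. With the plain `ordered_dict[key] = index` loop, a repeated token would instead be overwritten with its last position.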
Hi there! Thanks for the comments!
Sorry for the style changes. They come from the VS Code editor automatically applying these kinds of changes on save. I guess I will have to undo the specific changes it makes (deleting tabs in exports, most of the time).
Sounds like a good idea. On the other hand, this is the place related to language modeling. I am not sure either. I will give it some thought and come back soon.
Indeed!
So, I guess this means that in order to have it pass the tests, I simply need to use the using statement in the beginning of the file just above @testset, right? It seems that was the case and it worked.
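A minimal sketch of what such a test file might look like, assuming the OrderedCollections package is available in the test environment (loading DataStructures instead would also bring `OrderedDict` into scope). The testset name and the assertions are illustrative and are not taken from the PR.

```julia
using Test
using OrderedCollections  # assumption: brings OrderedDict into scope for this file

# Without the `using` line above, referring to OrderedDict inside the
# testset raises `UndefVarError: OrderedDict not defined`.
@testset "vocabulary dict (illustrative)" begin
    d = OrderedDict{String,Int}("a" => 1, "b" => 2)
    @test d isa OrderedDict{String,Int}
    @test collect(keys(d)) == ["a", "b"]
end
```

Keeping the `using` statements at the top of each test file also lets the file run on its own, independent of whatever the main test runner happens to load.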
Thanks! I updated the comments.
Nice!
VS Code has a tool for making partial commits (https://www.youtube.com/watch?v=sYTwr1OSUlo). Regarding the naming conversation: technically, there is the style guide at https://docs.julialang.org/en/v1/manual/style-guide/ , but the strict requirement for function names is no longer present. In most scripting languages, snake-case function names are considered more readable than Julia's original math-style names without any separators.
Correct. Also, my personal preference is to always have independent test files and be able to run them separately. This makes it easy to develop and debug your own code when you are working on a few specific functions.
It would be good to take a pause and review this again. We need to avoid spreading similar functionality around the codebase. On the other hand, some changes could be made to the existing structure.
Thanks! I will check out the video and install it.
Snake style it is then. Thanks for giving me the background!
I think you are probably right. Having them separately is clearer.
Yes, I agree with both points. I'm planning to conduct a survey comparing Python and R's analogous packages/libraries, and I'll return with suggestions. I believe Slack is a better platform for discussing package design. However, I have two preliminary points:
I realize these suggestions involve significant changes and discussions. But, having worked in the humanities for years and conducted tutorials for various audiences (e.g. philologists, linguists, teachers, psychologists, historians, librarians), I believe my insights are valuable in attracting people from these fields.
Dear maintainers (@rssdev10),
One of the common challenges I faced with the StringDocument type was extracting a vocabulary in the form of an OrderedDict for use in the co-occurrence matrix. Previously, it was necessary to create custom functions for the OrderedDict. A ready-made vocabulary dictionary would be an extremely useful tool for various tasks involved in processing a corpus of texts or a single text.
I added the function vocab() with its docstring in document.jl with two types of input (StringDocument and Vector{String}) that returns the OrderedDict of the vocabulary. I also added two tests in test.jl and updated the documentation in coom.jl for TextAnalysis.coo_matrix().
PS: I do not know why I get the "UndefVarError: OrderedDict not defined" error in the tests. I thought that DataStructures was imported. Locally, it runs OK!