Properly detect emoticons #25

Open
diasks2 opened this issue Jan 21, 2016 · 2 comments

@diasks2
Owner

diasks2 commented Jan 21, 2016

#4
it 'preserves emoticons' do
  text = "lol :-D"
  pt = PragmaticTokenizer::Tokenizer.new(text, downcase: false)
  expect(pt.tokenize).to eq(
    ["lol", ":-D"]
  )
end
@maia
Collaborator

maia commented Jan 30, 2016

I've just come across retext-emoji and wonder if it might be smart to convert emoticons to emoji:

When encode, converts short-codes into their unicode equivalent (e.g., :heart: and <3 to ❤️)

While I think it might be too much to attempt to convert all possible emoticons, one could pragmatically do so for the 10-20 most common emoticons very early in the processing, and only later handle the remaining punctuation.
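
A minimal sketch of that early conversion step in Ruby, assuming a small hand-picked table. The constant names, the `emoticons_to_emoji` helper, and the chosen emoji are all hypothetical; none of them come from retext-emoji or from PragmaticTokenizer.

```ruby
# Hypothetical table of a few common emoticons and an emoji stand-in.
# The entries are illustrative only, not a list taken from retext-emoji.
EMOTICON_TO_EMOJI = {
  ":-D" => "😃",
  ":-)" => "🙂",
  ":)"  => "🙂",
  ":-(" => "🙁",
  ";-)" => "😉",
  "<3"  => "❤️"
}.freeze

# Regexp.union escapes each string; longer emoticons are listed first so
# they are tried before shorter ones that start at the same position.
EMOTICON_PATTERN = Regexp.union(
  EMOTICON_TO_EMOJI.keys.sort_by { |emoticon| -emoticon.length }
)

# Replace every known emoticon with its emoji before any punctuation handling.
def emoticons_to_emoji(text)
  text.gsub(EMOTICON_PATTERN) { |match| EMOTICON_TO_EMOJI[match] }
end
```

Running this before punctuation is split off would mean `":-D"` is never seen as a colon, a hyphen, and a letter.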

@diasks2
Owner Author

diasks2 commented Jan 31, 2016

Interesting idea. It might be something to consider doing internally so we don't confuse emoticons with other punctuation. However, in the output that is returned to a user I am kind of a purist and think that it should match the original input (i.e. <3 in the input text would not return ❤️ as a token).
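
One way this could look, as a rough sketch only: swap each emoticon for an internal placeholder, tokenize, then restore the original text in the returned tokens. The emoticon list, the placeholder format, the `tokenize_preserving_emoticons` name, and the simple `scan`-based splitter are assumptions made for illustration; the splitter stands in for the real tokenizer and is not how PragmaticTokenizer works internally.

```ruby
# Hypothetical short list of common emoticons to protect.
COMMON_EMOTICONS = [":-D", ":-)", ":)", ":-(", ";-)", "<3"].freeze
EMOTICON_REGEX = Regexp.union(COMMON_EMOTICONS.sort_by { |e| -e.length })

def tokenize_preserving_emoticons(text)
  originals = []

  # Swap each emoticon for a placeholder made only of letters and digits,
  # remembering the original so it can be restored afterwards.
  protected_text = text.gsub(EMOTICON_REGEX) do |match|
    originals << match
    " EMOTICONPLACEHOLDER#{originals.length - 1} "
  end

  # Stand-in for the real tokenizer: words, or single punctuation marks.
  tokens = protected_text.scan(/\w+|[^\w\s]/)

  # Restore the original emoticon text so the output matches the input.
  tokens.map do |token|
    if (m = token.match(/\AEMOTICONPLACEHOLDER(\d+)\z/))
      originals[m[1].to_i]
    else
      token
    end
  end
end
```

With this approach `<3` in the input comes back as `<3`, not as an emoji, while the punctuation step never sees its individual characters.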
