Is it possible to remove punctuations but exclude cases like "drive-thru"? #19

Open
Jess0-0 opened this issue Nov 17, 2021 · 4 comments

Comments

@Jess0-0

Jess0-0 commented Nov 17, 2021

I'd like to remove punctuation from the text but would like to keep "-".
For example, "text---cleaning" should become "text cleaning", but "drive-thru" should still be "drive-thru" after cleaning.

@jfilter
Owner

jfilter commented Nov 17, 2021

Right now, this is not possible, but it seems to me like a feature this package should provide. I will look into it, but it may take a while.

@jfilter
Owner

jfilter commented Jan 29, 2022

You are mainly interested in keeping hyphens in compound words, right? So other punctuation such as "." or "," should get removed.

@Jess0-0
Author

Jess0-0 commented Feb 4, 2022

Yes that's correct. Other punctuation such as "." or "," should get removed.
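
To make the requested behavior concrete, here is a minimal standard-library sketch. It is not part of clean-text; the function name is hypothetical, and it assumes that a "compound-word hyphen" means a single hyphen with a word character on both sides.

```python
import re


def strip_punct_keep_compound_hyphens(text: str) -> str:
    """Illustrative only: drop punctuation but keep hyphens inside words."""
    # A hyphen that lacks a word character on either side is treated as
    # ordinary punctuation and replaced by a space.
    text = re.sub(r"(?<!\w)-|-(?!\w)", " ", text)
    # Replace every other punctuation character with a space.
    # Note: \w also matches "_", so underscores are kept by this sketch.
    text = re.sub(r"[^\w\s-]", " ", text)
    # Collapse the runs of whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


print(strip_punct_keep_compound_hyphens("text---cleaning"))
print(strip_punct_keep_compound_hyphens("Stop at the drive-thru, please."))
```

With this rule, runs such as "---" disappear because no hyphen in the run has word characters on both sides, while "drive-thru" is left untouched.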

@tanwirahmad

I had the same kind of scenario. I solved it like this.

from cleantext import clean


def clean_with_exceptions(text, *args, **kwargs):
    # Substrings listed in `exceptions` are shielded from clean().
    exceptions = kwargs.pop("exceptions", [])
    # Swap each exception for a letters-only placeholder that is
    # intended to pass through clean() unchanged.
    for idx, exp in enumerate(exceptions):
        text = text.replace(exp, "exp{}exp".format("z" * (idx + 1)))
    text = clean(text, *args, **kwargs)
    # Put the original substrings back.
    for idx, exp in enumerate(exceptions):
        text = text.replace("exp{}exp".format("z" * (idx + 1)), exp)
    return text


# Sample input, added here only so the snippet runs as written.
text = "text---cleaning at the drive-thru."

cleaned_text = clean_with_exceptions(
    text,
    exceptions=["-"],
    no_line_breaks=True,
    no_urls=True,  # replace all URLs with a special token
    no_emails=True,  # replace all email addresses with a special token
    no_currency_symbols=True,  # replace all currency symbols with a special token
    no_punct=True,
)

It is a bit hackish, but it worked for my case.
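
One caveat with passing `exceptions=["-"]`: every hyphen is protected, so the "---" in "text---cleaning" would survive as well. A possible variant, sketched below under assumptions, protects only hyphens that sit between two word characters. The names `PLACEHOLDER`, `clean_keep_compound_hyphens`, and `demo_cleaner` are hypothetical, and `demo_cleaner` is only a stand-in so the example runs without clean-text installed.

```python
import re
from typing import Callable

# Hypothetical token: it must not occur in the input and must survive the cleaner.
PLACEHOLDER = "hyphenplaceholderzz"


def clean_keep_compound_hyphens(text: str, cleaner: Callable[[str], str]) -> str:
    # Protect only hyphens that have a word character on both sides.
    protected = re.sub(r"(?<=\w)-(?=\w)", PLACEHOLDER, text)
    cleaned = cleaner(protected)
    # Restore the protected hyphens after cleaning.
    return cleaned.replace(PLACEHOLDER, "-")


def demo_cleaner(text: str) -> str:
    # Stand-in cleaner (an assumption, not clean-text itself): lowercase,
    # turn punctuation into spaces, then collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


print(clean_keep_compound_hyphens("text---cleaning at the drive-thru.", demo_cleaner))
```

In principle the same wrapper could be called with `lambda t: clean(t, no_punct=True)` as the cleaner, but I have not verified how clean-text treats the placeholder or the remaining "---", so treat that as untested.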
