Add pantip clean website to readable data #250

Open: wants to merge 9 commits into base: main

phasinA1learn

Why this PR

This PR adds cleaning of HTML tags, website links, and error messages.

Changes

  • Add remove_error function
  • Add remove_website function

Related Issues

Close #

Checklist

  • PR title should follow the naming convention
  • Assign yourself in Assignees
  • Tag related issues
  • Constant names should be ALL_CAPITAL, function names should be snake_case, and class names should be CamelCase
  • Complex functions/algorithms should have a docstring
  • A PR should not change more than 200 lines (test files excepted). If it does, please open multiple PRs
  • At least one PR reviewer must come from the task's team (model, eval, data)


codecov bot commented Jun 28, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (c441682) 94.95% compared to head (6b95346) 94.95%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #250   +/-   ##
=======================================
  Coverage   94.95%   94.95%           
=======================================
  Files          12       12           
  Lines         337      337           
=======================================
  Hits          320      320           
  Misses         17       17           
Flag        Coverage Δ
unittests   94.95% <ø> (ø)


Collaborator

@boat1603 left a comment

Add tests

"""
check_point = 0
minus_N = 10  # pivot of range - minus_N
while check_point < len(text) - minus_N:
Collaborator

Is there a simpler way to write this logic?

Author

done

current_text = ""

for item in reader.iter(skip_invalid=True):
    tid = item["tid"]
Collaborator

Please use constants here.
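A minimal sketch of what that could look like, assuming the comment refers to the hard-coded `"tid"` key. The names `TID_KEY` and `collect_ids` are hypothetical and not part of this PR.

```python
# Hypothetical constant name; only the "tid" key itself appears in the diff.
TID_KEY = "tid"


def collect_ids(items):
    """Return the thread id of every record that has one (illustrative helper)."""
    return [item[TID_KEY] for item in items if TID_KEY in item]


# Example records made up for this sketch.
records = [{"tid": 1, "text": "a"}, {"text": "no id"}, {"tid": 2}]
print(collect_ids(records))
```

Keeping the key in one place means a rename in the source data only needs a single change.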

Author

done

Collaborator

@boat1603 left a comment

Please add remove_web_and_tag and remove_error tests.

def test_clean_data():
    for test_case in CLEAN_HTML_TAGS_TEST_CASES:
        assert clean_data(test_case["data"]) == test_case["expected_output"]
        print(clean_data(test_case["data"]))
Collaborator

I think we should delete the print call.
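One possible shape for the test without the print, as a sketch only. The `clean_data` below is a stand-in so the snippet runs on its own, and the test cases are made up; neither is the PR's actual code.

```python
import re


def clean_data(text: str) -> str:
    """Stand-in for the PR's clean_data (assumption): strips HTML-like tags."""
    return re.sub(r"<[^>]+>", "", text)


# Hypothetical cases in the same shape as CLEAN_HTML_TAGS_TEST_CASES.
CLEAN_HTML_TAGS_TEST_CASES = [
    {"data": "<b>hello</b>", "expected_output": "hello"},
    {"data": "no tags", "expected_output": "no tags"},
]


def test_clean_data():
    for test_case in CLEAN_HTML_TAGS_TEST_CASES:
        actual = clean_data(test_case["data"])
        # The assertion message takes over the role of the debugging print.
        assert actual == test_case["expected_output"], f"unexpected output: {actual!r}"


test_clean_data()
```

With an assertion message, the failing value still shows up in the test report, so nothing is lost by dropping the print.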

Author

text = re.sub(r"\[.*?\/.*?\]", "", text)
# Remove website
# Remove http
text = re.sub(r"http\S+", " website", text, flags=re.MULTILINE)
Collaborator

The " website" replacement string should be a constant.
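A rough sketch of the suggestion, assuming the replacement lives in a function like the `remove_website` listed in the PR description. The constant names `WEBSITE_PLACEHOLDER` and `URL_PATTERN` and the function signature are hypothetical.

```python
import re

# Hypothetical constant names; the values are taken from the line under review.
WEBSITE_PLACEHOLDER = " website"
URL_PATTERN = re.compile(r"http\S+")


def remove_website(text: str) -> str:
    """Replace every http/https URL with the placeholder (assumed signature)."""
    return URL_PATTERN.sub(WEBSITE_PLACEHOLDER, text)


print(remove_website("read more at https://pantip.com/topic/123"))
```

Side note: `re.MULTILINE` only changes how `^` and `$` match, so it has no effect on this pattern and could be dropped.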

Author

done

3 participants