Add pantip clean website to readable data #250
base: main
Codecov Report
Patch and project coverage have no change.
Coverage diff, main vs. #250:
Coverage   94.95%   94.95%
Files          12       12
Lines         337      337
Hits          320      320
Misses         17       17
Add tests
"""
check_point = 0
minus_N = 10  # pivot of range - minus_N
while check_point < len(text) - minus_N:
Is there a simpler way to write this logic?
done
current_text = ""

for item in reader.iter(skip_invalid=True):
    tid = item["tid"]
Please use constants here.
done
Please add tests for remove_web_and_tag and remove_error.
def test_clean_data():
    for test_case in CLEAN_HTML_TAGS_TEST_CASES:
        assert clean_data(test_case["data"]) == test_case["expected_output"]
        print(clean_data(test_case["data"]))
I think we should remove the print call.
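A minimal sketch of the test without the print call. The real clean_data and CLEAN_HTML_TAGS_TEST_CASES are not shown in this conversation, so the versions below are stand-ins: clean_data is assembled only from the two substitutions visible in the diff, and the test cases are hypothetical. Putting the actual value in the assertion message gives the same debugging information the print provided, but only when a case fails.

```python
import re


def clean_data(text: str) -> str:
    """Stand-in for the PR's clean_data, NOT its real implementation."""
    # Remove bracketed closing tags such as "[/b]" (pattern taken from the diff).
    text = re.sub(r"\[.*?\/.*?\]", "", text)
    # Replace URLs with a placeholder (pattern taken from the diff).
    text = re.sub(r"http\S+", " website", text)
    return text


# Hypothetical test cases; the PR's actual cases are not visible here.
CLEAN_HTML_TAGS_TEST_CASES = [
    {"data": "hello [/b]world", "expected_output": "hello world"},
    {"data": "see http://example.com", "expected_output": "see  website"},
]


def test_clean_data():
    for test_case in CLEAN_HTML_TAGS_TEST_CASES:
        actual = clean_data(test_case["data"])
        # The message is shown only on failure, so no print is needed.
        assert actual == test_case["expected_output"], (
            f"input={test_case['data']!r} got={actual!r}"
        )
```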
text = re.sub(r"\[.*?\/.*?\]", "", text)
# Remove website
# Remove http
text = re.sub(r"http\S+", " website", text, flags=re.MULTILINE)
The " website" string should be a constant.
done
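One way the suggestion could look, as a sketch only: the names WEBSITE_PLACEHOLDER, URL_PATTERN, and replace_urls are hypothetical, since the identifiers the PR ended up using are not shown in this thread. The pattern is the one from the diff. The re.MULTILINE flag is left out because it only changes how ^ and $ match, and this pattern uses neither.

```python
import re

# Hypothetical constant names; the PR's actual identifiers are not shown.
WEBSITE_PLACEHOLDER = " website"
# Same pattern as in the diff: "http" followed by any non-whitespace characters.
URL_PATTERN = re.compile(r"http\S+")


def replace_urls(text: str) -> str:
    """Replace every URL-like token with the placeholder constant."""
    return URL_PATTERN.sub(WEBSITE_PLACEHOLDER, text)
```

Because the placeholder begins with a space, a URL that already follows a space leaves two consecutive spaces in the output; whether that matters depends on later cleaning steps.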
Why this PR
Why do we need this PR? This PR cleans HTML tags, website URLs, and error messages.
Changes
Related Issues
Close #
Checklist