Scrape Admincourt Pipeline [LM-244] #332

Open
wants to merge 2 commits into
base: main

boss-chanon (Contributor) commented Nov 27, 2023

Why this PR

Add a pipeline to scrape data from admincourt (The Administrative Court website).

Changes

  • Add a pipeline that scrapes data from admincourt
  • Add a pipeline that converts the scraped data to JSONL
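
For reviewers, the JSONL conversion step could look roughly like the sketch below. It is not the code in this PR: the function names and the `text` / `source` fields are hypothetical, chosen only to illustrate writing one JSON object per line while keeping Thai text unescaped.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl(records: Iterable[Dict], path: str) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Thai characters as-is instead of \uXXXX escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: str) -> List[Dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Writing with `encoding="utf-8"` and `ensure_ascii=False` keeps the output human-readable for Thai documents; whether the actual pipeline does this is an assumption.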

Related Issues

Close #

Checklist

  • PR should follow the naming convention
  • Assign yourself under Assignees
  • Tag related issues
  • Constant names should be ALL_CAPITAL, function names should be snake_case, and class names should be CamelCase
  • Complex functions/algorithms should have a docstring
  • One PR should not change more than 200 lines (test files excepted). If it does, please open multiple PRs
  • At least one PR reviewer must come from the task's team (model, eval, data)


linear bot commented Nov 27, 2023

LM-244 Scrape The Administrative Court (ศาลปกครอง) - Academic Articles (บทความวิชาการ)

Rationale: Use The Administrative Court (ศาลปกครอง) as part of the Law dataset to expand our pre-trained model's knowledge base.

Step by Step

  1. Download data from this website: Academic Articles (บทความวิชาการ)

  2. The information can be scraped in PDF format

  3. Scrape all documents available in this sub-section

  4. Exclude the preface (คำนำ) and the table of contents (สารบัญ)

  5. Convert the data into JSONL format

    image.png

  6. Pull request to our GitHub repository
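
Step 4 above could be sketched as follows. This is a minimal illustration, not the PR's implementation: it assumes the PDF text has already been split into sections with a `title` and a `text` key (a hypothetical shape), and the helper names are made up for this example.

```python
from typing import Dict, List

# Section titles to drop, per step 4: preface (คำนำ) and table of contents (สารบัญ).
EXCLUDED_TITLES = {"คำนำ", "สารบัญ"}


def is_front_matter(title: str) -> bool:
    """Return True if the title, ignoring all whitespace, is an excluded heading."""
    # PDF extraction often inserts stray spaces inside Thai words, so strip them all.
    normalized = "".join(title.split())
    return normalized in EXCLUDED_TITLES


def build_document(sections: List[Dict[str, str]]) -> str:
    """Join the text of every section that is not front matter."""
    kept = [s["text"].strip() for s in sections if not is_front_matter(s["title"])]
    # Drop sections that are empty after stripping, then separate the rest by blank lines.
    return "\n\n".join(t for t in kept if t)
```

Matching on exact titles is deliberately conservative: a heading such as "บทนำ" (introduction) is kept, since the ticket only asks to exclude the preface and the table of contents.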

Reviewers kwankoravich


codecov bot commented Nov 27, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4d5c647) 64.47% compared to head (646c304) 64.47%.
Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #332   +/-   ##
=======================================
  Coverage   64.47%   64.47%           
=======================================
  Files          11       11           
  Lines         425      425           
=======================================
  Hits          274      274           
  Misses        151      151           
Flag Coverage Δ
unittests 64.47% <ø> (ø)


@boss-chanon boss-chanon requested a review from boat1603 December 1, 2023 12:19