Adding Thai rules for CV Sentence Extractor #137

bact · 2021-04-15T03:45:56Z

th.toml:

other_patterns borrowed from:
https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/th.js
(BEGIN_REGEX, END_REGEX, STRUCTURE_REGEX, and ABBREVIATION_REGEX, with a few adjustments)
replacements borrowed from:
https://github.com/common-voice/sentence-collector/blob/main/server/lib/cleanup/languages/th.js
(with some adjustments)
min_word_count and max_word_count are set by treating a "word" as "a group of characters between two whitespace or punctuation characters", since the extractor currently has no Thai word tokenization.
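To make that definition of "word" concrete, here is a minimal sketch in Python. It is not the extractor's actual tokenizer: the function name `count_chunks` is hypothetical, and the punctuation set is assumed to be ASCII punctuation only.

```python
import re
import string

# Assumption: delimiters are ASCII punctuation and any whitespace.
# The real extractor may use a different set of delimiters.
_DELIMS = re.compile("[" + re.escape(string.punctuation) + r"\s]+")


def count_chunks(sentence):
    """Count runs of characters between whitespace/punctuation delimiters."""
    return len([chunk for chunk in _DELIMS.split(sentence) if chunk])
```

Under this definition a long Thai clause written without spaces counts as a single "word", which is why the word-count limits behave differently for Thai than for space-delimited languages.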

This will close #133

How many sentences did you get at the end?

478

How did you create the blocklist file?

Since the current tokenizer does not work well with a language that does not use spaces as word delimiters, cvtools doesn't seem to work, so I haven't created one.

Review / Error ratio

(from 184 samples)

Category	%
OK	88
A: Spelling is not correct	1
B: Grammar is not correct	0
C: It's not easily speakable (including uncommon non-native words)	1
D: Other	10

"D" are mostly sentences with a "dangling word" at the beginning (a word that was meant to be the last word of the previous sentence).

Since the total number of sentences I have is just below 500, and the suggested random sample size is "100-500", I'm not sure whether the number of sentences I have is unexpectedly low.

I would like to clarify this before I ask more people to review.

(I may have to "relax" the rules, but I'm still not sure whether this is related to the way the punkt sentence tokenizer works.)

The extracted sentences are here: https://docs.google.com/spreadsheets/d/1pKBH_YQiO9ZdXIduvrb37HvCLlKBt8mGeDCpX8e8dT4/edit?usp=sharing

Questions

Does the original number of articles in Wikipedia also affect the number of extracted sentences?

I tried to extract all the articles, without applying any rules, with this command:

cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt

I got this:

$ wc -l wiki.th.*
 1985699 wiki.th.all.txt
     478 wiki.th.txt

We actually have a lot of lines extracted in wiki.th.all.txt (1,314,274 lines after removing blank lines), but these "sentences" tend to be very long. In fact, many lines contain more than one sentence (some are a whole paragraph).

And the longer the line/sentence is, the more likely it is to be hit by one of the disallowing rules.

A few sample lines from wiki.th.all.txt (no rules applied):
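One way to check how long the extracted lines are is sketched below in Python. The helper `line_stats` and the `max_chars` threshold of 100 are illustrative assumptions, not values taken from th.toml.

```python
def line_stats(lines, max_chars=100):
    """Return (non-blank line count, count of lines longer than max_chars).

    max_chars is a hypothetical threshold chosen for illustration only.
    """
    non_blank = [line.strip() for line in lines if line.strip()]
    too_long = sum(1 for line in non_blank if len(line) > max_chars)
    return len(non_blank), too_long


# Possible usage against the unfiltered output:
#   with open("wiki.th.all.txt", encoding="utf-8") as f:
#       print(line_stats(f))
```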

ดาราศาสตร์เป็นหนึ่งในสาขาของวิทยาศาสตร์ที่เก่าแก่ที่สุด นักดาราศาสตร์ในวัฒนธรรมโบราณสังเกตการณ์ดวงดาวบนท้องฟ้าในเวลากลางคืน และวัตถุทางดาราศาสตร์หลายอย่างก็ได้ถูกค้นพบเรื่อยมาตามยุคสมัย อย่างไรก็ตาม กล้องโทรทรรศน์เป็นสิ่งประดิษฐ์ที่จำเป็นก่อนที่จะมีการพัฒนามาเป็นวิทยาศาสตร์สมัยใหม่ ตั้งแต่อดีตกาล ดาราศาสตร์ประกอบไปด้วสาขาที่หลากหลายเช่น การวัดตำแหน่งดาว การเดินเรือดาราศาสตร์ ดาราศาสตร์เชิงสังเกตการณ์ การสร้างปฏิทิน และรวมทั้งโหราศาสตร์ แต่ดาราศาสตร์ทุกวันนี้ถูกจัดว่ามีความหมายเหมือนกับฟิสิกส์ดาราศาสตร์ ตั้งแต่คริสต์ศตวรรษที่ 20 เป็นต้นมา ดาราศาสตร์ได้แบ่งออกเป็นสองสาขาได้แก่ ดาราศาสตร์เชิงสังเกตการณ์ และดาราศาสตร์เชิงทฤษฎี ดาราศาสตร์เชิงสังเกตการณ์จะให้ความสำคัญไปที่การเก็บและการวิเคราะห์ข้อมูล โดยการใช้ความรู้ทางกายภาพเบื้องต้นเป็นหลัก ส่วนดาราศาสตร์เชิงทฤษฎีให้ความสำคัญไปที่การพัฒนาคอมพิวเตอร์หรือแบบจำลองเชิงวิเคราะห์ เพื่ออธิบายวัตถุท้องฟ้าและปรากฏการณ์ต่าง ๆ ทั้งสองสาขานี้เป็นองค์ประกอบซึ่งกันและกัน กล่าวคือ ดาราศาสตร์เชิงทฤษฎีใช้อธิบายผลจากการสังเกตการณ์ และดาราศาสตร์เชิงสังเกตการณ์ใช้ในการรับรองผลจากทางทฤษฎี
เมื่อสังคมมีวิวัฒนาการขึ้นในดินแดนต่าง ๆ การสังเกตการณ์ทางดาราศาสตร์ก็ซับซ้อนมากขึ้น โดยเฉพาะอย่างยิ่งใน เมโสโปเตเมีย กรีก จีน อียิปต์ อินเดีย และ มายา เริ่มมีแนวคิดเกี่ยวกับความสัมพันธ์ของธรรมชาติแห่งจักรวาลกว้างขวางขึ้น ผลการศึกษาดาราศาสตร์ในยุคแรก ๆ จะเป็นการบันทึกแผนที่ตำแหน่งของดวงดาวต่าง ๆ อันเป็นศาสตร์ที่ปัจจุบันเรียกกันว่า การวัดตำแหน่งดาว (astrometry) ผลจากการเฝ้าสังเกตการณ์ทำให้แนวคิดเกี่ยวกับการเคลื่อนที่ของดวงดาวต่าง ๆ เริ่มก่อตัวเป็นรูปร่างขึ้น ธรรมชาติการเคลื่อนที่ของดวงอาทิตย์ ดวงจันทร์ และโลก นำไปสู่แนวคิดเชิงปรัชญาเพื่อพยายามอธิบายปรากฏการณ์เหล่านั้น ความเชื่อดั้งเดิมคือโลกเป็นศูนย์กลางของจักรวาล โดยมีดวงอาทิตย์ ดวงจันทร์ และดวงดาวต่าง ๆ เคลื่อนที่ไปโดยรอบ แนวคิดนี้เรียกว่า แบบจำลองแบบโลกเป็นศูนย์กลางจักรวาล (geocentric model)
เคปเลอร์ได้คิดค้นระบบแบบใหม่ขึ้นโดยปรับปรุงจากแบบจำลองเดิมของโคเปอร์นิคัส ทำให้รายละเอียดการโคจรต่าง ๆ ของดาวเคราะห์และดวงอาทิตย์ที่ศูนย์กลางสมบูรณ์ถูกต้องมากยิ่งขึ้น แต่เคปเลอร์ก็ไม่ประสบความสำเร็จในการนำเสนอทฤษฎีนี้เนื่องจากกฎหมายในยุคสมัยนั้น จนกระทั่งต่อมาถึงยุคสมัยของเซอร์ ไอแซค นิวตัน ผู้คิดค้นหลักกลศาสตร์ท้องฟ้าและกฎแรงโน้มถ่วงซึ่งสามารถอธิบายการเคลื่อนที่ของดาวเคราะห์ได้อย่างสมบูรณ์ นิวตันยังได้คิดค้นกล้องโทรทรรศน์แบบสะท้อนแสงขึ้นด้วย
ไม่ควรสับสนระหว่างดาราศาสตร์โบราณกับโหราศาสตร์ ซึ่งเป็นความเชื่อที่นำเอาเหตุการณ์และพฤติกรรมของมนุษย์ไปเกี่ยวโยงกับตำแหน่งของวัตถุท้องฟ้า แม้ว่าทั้งดาราศาสตร์และโหราศาสตร์เกิดมาจากจุดร่วมเดียวกัน และมีส่วนหนึ่งของวิธีการศึกษาที่เหมือนกัน เช่นการบันทึกตำแหน่งดาว (ephemeris) แต่ทั้งสองอย่างก็แตกต่างกัน

I guess if we can make the lines shorter, we can get more extracted sentences in wiki.th.txt.

Need some suggestions here. Thank you.

bact · 2021-04-15T04:24:56Z

What is the general recommendation for numbers (0-9), by the way?

I see that languages like en and de allow them, but a language like ka doesn't.

MichaelKohler · 2021-04-15T17:20:53Z

Thanks for your efforts here. This perfectly well shows how broken the sentence segmentation is for some languages :( There's #11 already on file for this issue. I've also created a discussion/proposal at https://discourse.mozilla.org/t/future-of-the-sentence-extractor-your-input-is-required/78139 .

bact · 2021-04-16T05:46:03Z

Reviewed 184 samples from the current extracted sentences, got "OK" for 88%.

The remaining errors are mostly due to a "dangling word": a word that was meant to be the first/last word of the next/previous sentence but got incorrectly included in the sentence in question (probably due to a space).

I updated the first comment with the error table.

bact · 2021-06-09T18:08:39Z

Continuing from the discussion in #139 (comment), I'm thinking of one possible way to extract Thai sentences and guarantee the 3-sentence limit.

A sentence splitter may work with JSON files inside wikiextractor/text (created by WikiExtractor.py).

The sentence splitter will read the text value from each JSON object inside those files and insert newline characters to assist the sentence extraction (done later by Common Voice's Sentence Extractor).

I will try to build a prototype of this. If it succeeds, it will work on top of the current pipeline:

1-Get dump -> 2-Extract dump -> 3-Extract sentences

and expand it to

1-Get dump -> 2-Extract dump -> 3-Split sentences -> 4-Extract sentences.
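The proposed "Split sentences" step could look roughly like the Python sketch below. It assumes the WikiExtractor output holds one JSON object per line with a "text" field; the function names are hypothetical, and `naive_space_splitter` is only a stand-in for a real Thai sentence splitter.

```python
import json


def split_wikiextractor_lines(json_lines, split_sentences):
    """Yield one sentence at a time from JSON-lines article dumps.

    Assumption: each non-blank input line is a JSON object whose "text"
    value holds the article body. split_sentences is any callable that
    maps a paragraph string to a list of sentence strings.
    """
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue
        text = json.loads(raw).get("text", "")
        for paragraph in text.split("\n"):
            for sentence in split_sentences(paragraph):
                sentence = sentence.strip()
                if sentence:
                    yield sentence


def naive_space_splitter(paragraph):
    # Placeholder only: splitting on spaces is NOT a real Thai splitter.
    return paragraph.split(" ")
```

Writing each yielded sentence on its own line would give the extractor shorter lines to work with in the final step.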

MichaelKohler · 2021-07-17T21:14:51Z

@bact I've created a proof of concept that uses a Python-based sentence splitting algorithm, to make sure that the Sentence Extractor can also be used for languages that rust-punkt does not support. I've created a PR and would like your input on whether it's clear from the README what you would need to do. Happy to hear your feedback on https://github.com/common-voice/cv-sentence-extractor/pull/150/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R233 :)

MichaelKohler · 2021-07-18T12:29:36Z

The segmenter PR has now been merged, check out https://github.com/common-voice/cv-sentence-extractor#using-a-different-segmenter-to-split-sentences for more info. Looking forward to hearing whether that helps with Thai :)

bact · 2021-08-08T11:41:20Z

Thank you @MichaelKohler. The new segmenter option is welcome. I think this will make the pipeline more standardized, even with different language-specific processors. I will take a closer look at this.

bact · 2021-08-08T11:55:56Z

I initially thought that crfcut might work for this, but after several tries and inspections of the split text, some of the output starts or ends with an ill-formed word, very likely because the text got segmented at an invalid point (like before a following vowel: ก|า ).

I'm currently trying to see whether I can write a wrapper to post-process the output from crfcut, or whether there is any other alternative.
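A post-processing wrapper of that kind might be sketched as follows in Python. It does not call crfcut itself; it only assumes the splitter's output is available as a list of strings. The name `merge_bad_splits` is hypothetical, and the character sets are one reading of which Thai characters cannot start or end a word.

```python
# Characters assumed unable to START a word: following vowels,
# above/below vowels, tone marks and other combining signs.
_NON_INITIAL = {chr(c) for c in range(0x0E30, 0x0E3B)}
_NON_INITIAL |= {chr(0x0E45)}
_NON_INITIAL |= {chr(c) for c in range(0x0E47, 0x0E4F)}

# Characters assumed unable to END a word: leading vowels (U+0E40..U+0E44).
_NON_FINAL = {chr(c) for c in range(0x0E40, 0x0E45)}


def merge_bad_splits(segments):
    """Re-join neighbouring segments that were cut at an invalid point."""
    merged = []
    for seg in segments:
        bad_start = bool(seg) and seg[0] in _NON_INITIAL
        bad_prev_end = bool(merged) and merged[-1][-1:] in _NON_FINAL
        if merged and seg and (bad_start or bad_prev_end):
            merged[-1] += seg  # undo the split
        else:
            merged.append(seg)
    return merged
```

This only repairs cuts that fall inside a character cluster; it does not judge whether the remaining boundaries are real sentence boundaries.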

bact added 5 commits December 12, 2020 07:48

Initial rule for Thai sentences

52604ac

Update th.toml

1cb7a92

Merge remote-tracking branch 'upstream/main' into main

a1056f3

Add/adjust more rules for Thai

736d4aa

Add more orphan word rules

decd939

bact mentioned this pull request Apr 15, 2021

Adjusting max chars of Thai sentence common-voice/sentence-collector#429

Merged

bact added 5 commits April 16, 2021 12:49

Deal with zero-width spaces

6b9030b

Merge remote-tracking branch 'upstream/main' into main

05d23b1

Allow question mark and exclamation mark

4e05035

Merge remote-tracking branch 'upstream/main' into main

3df252d

- Remove U+2063 (invisible separator) which occurs in Thai text cut & pasted from some text editors (like MS Word and iOS Notes) - Simplify rules to reflect the fact that `replacements` will run before other rules

37e113d

MichaelKohler mentioned this pull request May 25, 2021

Add Thai language #133

Closed

MichaelKohler marked this pull request as draft May 25, 2021 20:18

MichaelKohler added the punkt-issue label May 25, 2021

bact added 3 commits June 8, 2021 08:41

Adjust other_patterns and replacements

60956ec

Merge branch 'common-voice:main' into main

e90fa6d

deal with double vowels

9e5ec79

MichaelKohler removed the punkt-issue label Jul 18, 2021

MichaelKohler added the waiting on feedback label Oct 23, 2021

Merge branch 'common-voice:main' into main

b874da6
