Assem's Arabic Stemmer

This is an algorithm for Arabic stemming, written in the Snowball framework language. It offers light stemming and text normalization.

Requirements:

  • Download Snowball and the test data
    $ make download
  • Install the Python requirements
    $ sudo pip install -r requirements.txt

or manually by:

  • extracting snowball into the root folder {Root}/snowball
  • extracting snowball-data/arabic/voc.txt.gz into {Root}/test_data/voc.txt
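The manual extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_voc` is hypothetical, and where the snowball-data archive lives on disk is an assumption you will need to adjust.

```python
import gzip
import shutil
from pathlib import Path


def extract_voc(archive: Path, target: Path) -> Path:
    """Decompress a gzip-compressed vocabulary file to the given target path."""
    # Create the destination folder (e.g. test_data/) if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the decompressed bytes so large vocabularies are not held in memory.
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Run from the repository root, a call such as `extract_voc(Path("snowball-data/arabic/voc.txt.gz"), Path("test_data/voc.txt"))` would mirror the manual step, assuming snowball-data has been unpacked next to the project.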

Build:

  • light stemming
      $ make build
  • root-based stemming
      $ make build_root_based_stemmer

Run:

  • Light Stemmer
      $ make run
      الطالب
      طالب
  • Root-Based Stemmer
      $ make run_root
      الطالب
      طلب
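To give a feel for what normalization and light stemming do to the example word above, here is a toy sketch. It is not the algorithm implemented in this repository: the rules (removing diacritics and tatweel, unifying hamza-carrying alefs, stripping a leading definite article) and the names `normalize` and `toy_light_stem` are assumptions chosen for illustration only. The actual rules are defined in the Snowball sources.

```python
import re

# Hypothetical, simplified rules for illustration only.
# Tashkeel marks (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Map alef with hamza above, hamza below, and madda to a bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize(word: str) -> str:
    """Drop diacritics and tatweel, then unify alef variants."""
    return DIACRITICS.sub("", word).translate(ALEF_VARIANTS)


def toy_light_stem(word: str) -> str:
    """Normalize, then strip a leading definite article if enough letters remain."""
    word = normalize(word)
    if word.startswith("\u0627\u0644") and len(word) > 4:
        return word[2:]
    return word


print(toy_light_stem("الطالب"))
```

A root-based stemmer goes further than this and reduces the word to its consonantal root, which is why `make run_root` returns a shorter form for the same input.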

Test:

The tests are configured to run against the snowball-data Arabic sample.

  • time:
      $ make time
  • grouping effect:
      $ make grouping
  • all:
      $ make test
  • Test SAS against the golden Arabic corpus:
      $ make test_arabicstemmer
  • Test the ISRI stemmer against the golden Arabic corpus:
      $ make test_isri

Distributions:

  • Distribute the light stemmer to the available languages:
      $ make dist
  • Distribute the root-based stemmer to the available languages:
      $ make dist_rooter

Results:

Snowball Arabic (Stemmer & rooter) Results

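| Word | Stem | Root |
|------|------|------|
| طفل | طفل | طفل |
| اطفال | اطفال | طفل |
| الاطفال | اطفال | طفل |
| اطفالكم | اطفال | طفل |
| فأطفالكم | اطفال | طفل |
| اطفالهم | اطفال | طفل |
| والاطفال | اطفال | طفل |
| فاطفالهم | اطفال | طفل |
| وطفل | طفل | طفل |
| الطفولة | طفول | طفل |
| والطفلتين | طفل | طفل |
| طفلتان | طفل | طفل |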
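
The difference between the two stemmers shows up when the words in the table above are grouped by their output. The sketch below does that with the table rows copied verbatim; the helper `group_by` is hypothetical, and reading this as the "grouping effect" measured by `make grouping` is an assumption rather than something the project states.

```python
from collections import defaultdict

# (word, stem, root) rows copied from the results table.
ROWS = [
    ("طفل", "طفل", "طفل"),
    ("اطفال", "اطفال", "طفل"),
    ("الاطفال", "اطفال", "طفل"),
    ("اطفالكم", "اطفال", "طفل"),
    ("فأطفالكم", "اطفال", "طفل"),
    ("اطفالهم", "اطفال", "طفل"),
    ("والاطفال", "اطفال", "طفل"),
    ("فاطفالهم", "اطفال", "طفل"),
    ("وطفل", "طفل", "طفل"),
    ("الطفولة", "طفول", "طفل"),
    ("والطفلتين", "طفل", "طفل"),
    ("طفلتان", "طفل", "طفل"),
]


def group_by(rows, column):
    """Group the surface words by the value found in the given column index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[0])
    return dict(groups)


stem_groups = group_by(ROWS, 1)  # light stemmer output
root_groups = group_by(ROWS, 2)  # root-based stemmer output
print(len(ROWS), "words:", len(stem_groups), "stem groups,", len(root_groups), "root group(s)")
```

For these twelve words, the light stemmer keeps singular, plural, and abstract-noun forms in separate groups, while the root-based stemmer merges all of them under a single root.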