README

2010-11-21

This is an experimental version of the Aspell language toolkit
(temporary name).  It can be used to create dictionaries for both
Aspell 0.50 and 0.60.

**********************************************************************
                           Getting Started
**********************************************************************

Since Aspell is 8-bit internally you need to first decide on a charset
to use.  See the section "Provided Character Sets" for a list of
available character sets.  If none of the character sets are adequate
then you need to create a new one for your language.  If this is
necessary please email aspell-dict@gnu.org for help.

Now cd to the location of the aspell-lang package (the directory this
file is in) and run
  ./pre LANG CHARSET
where LANG is the ISO language code for your language and CHARSET is
the charset you decided on.

This will create a directory LANG with the following files in it:
  info
  LANG.dat
  LANG.wl
  Copyright
  proc (symbolic link)
and possibly
  CHARSET.cset
  CHARSET.cmap
  misc/CHARSET.txt

Edit the "info" and "Copyright" files as appropriate.  See the next
section for what the fields in the "info" file mean.

If you chose a charset which Aspell provides then the default
encoding will be that charset.  If you would rather use "utf-8" then
uncomment the line "data-encoding utf-8" in LANG.dat.

Replace LANG.wl with a small word list for your language.

Now to build the word list:
  ./proc
  ./configure
  make

And if all goes well you should have a very basic dictionary for your
language.  You can install it if you want using "make install".

If you want to make a dictionary package use "make dist".

Please see "Adding Support For Other Languages" in the Aspell manual
and the rest of this document for where to go from here.

When you have something ready to distribute check over the
requirements section in this file and once you are reasonably sure you
have something ready to upload send it to aspell-dict@gnu.org.

**********************************************************************
     Draft Documentation on the layout of Aspell dicts packages
**********************************************************************

The overall goal of Aspell dicts is to provide a uniform method to
distribute dictionaries for Aspell for any language that Aspell
supports.

This documentation is still in an early stage and rather incomplete.
It is meant to give you enough of an overview so you know what is
going on, but probably won't be enough information for you to actually
create a distribution.

Layout of the Distribution:

An Aspell Word List Package contains several types of files, many of
them generated by the proc script.  These must be provided:

info: the main file which contains most of the important information
*.wl: word list files
Copyright: the copyright notice
??.dat: The language data file

Several optional ones:

additional language data files (must be listed under data-file)
COPYING: The actual license agreement.  Automatically provided for some
  licenses
doc/*: additional documentation
misc/*: other files to include in distribution

and finally some automatically generated or provided ones:

configure: the configure script which finds the appropriate paths
  and generates the actual makefile.  This file needs to be
  copied from the aspell-gen package.
??.dat: the data file for the language.
*.multi: the dictionary files
Makefile.pre: the makefile which configure uses.

*** Format of the Info File

(Note: For a better idea of how this file is laid out see some of the
sample info files included)

The info file is the main file which contains most of the information.
It is expected to be in utf-8.  It has two types of entries: single
value settings and group settings.  Single value settings have the
form:
  <key> <value>
Group settings have the form:
  <group key>:
    <key> <value>
    <key> <value>
    ...
If there is ANY whitespace before a key it is assumed to belong to a
group entry.
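
The two entry types can be told apart with very little code.  The
following is only a sketch of the rules described above, not the
logic the proc script actually uses; the function name parse_info
and the sample values are made up for illustration.

```python
def parse_info(text):
    """Split info-file text into single value settings and group entries."""
    singles = []  # (key, value) pairs, in file order; keys may repeat
    groups = []   # (group key, [(key, value), ...]) pairs, in file order
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if line[0].isspace():
            # Any leading whitespace means the entry belongs to a group.
            if not groups:
                raise ValueError("indented entry before any group: %r" % line)
            groups[-1][1].append((key, value))
        elif key.endswith(":") and not value:
            # "<group key>:" on its own line opens a new group.
            groups.append((key[:-1], []))
        else:
            singles.append((key, value))
    return singles, groups


# Hypothetical sample; the names and values are made up for illustration.
SAMPLE = """\
name_english French
lang fr
author:
  name Jane Doe
  email jane at example org
dict:
  name fr-40
  add fr
"""

singles, groups = parse_info(SAMPLE)
print(singles)
print(groups)
```

Keeping the entries as ordered pairs rather than a dictionary is
deliberate here, since keys such as "alias" and "add" may be repeated.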

The following single value settings are mandatory:

name_english: The English name of the language
lang: The language code
copyright: The copyright, one of:
  LGPLv2.1
  LGPLv3
  GPLv2
  GPLv3
  FDLv1.1
  FDLv1.2
  Artistic
  Copyrighted (Copyright message must remain)
  Free Software (Meets FSF definition of free)
  Open Source (Meets OSI definition)
  Public Domain (ie none)
  Other
  Unknown
version: A version string
complete: "true" if the dictionary is reasonably complete, "almost"
  if it is close, "false" otherwise, or "unknown"
accurate: "true" if the dictionary is accurate (i.e. every word is
  valid), "false" otherwise, or "unknown"
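
As a rough illustration of the list above, a check for the mandatory
settings might look like the following.  This is a sketch only: the
function name check_settings is hypothetical, the values are assumed
to appear in the file without the quotes, and the copyright value is
not checked.  The proc script performs its own checks.

```python
MANDATORY = ("name_english", "lang", "copyright", "version",
             "complete", "accurate")

# Allowed values, taken from the descriptions above.
ALLOWED = {
    "complete": {"true", "almost", "false", "unknown"},
    "accurate": {"true", "false", "unknown"},
}


def check_settings(settings):
    """Return a list of problems found in a {key: value} mapping."""
    problems = []
    for key in MANDATORY:
        if key not in settings:
            problems.append("missing mandatory setting: " + key)
    for key, allowed in ALLOWED.items():
        if key in settings and settings[key] not in allowed:
            problems.append("bad value for %s: %s" % (key, settings[key]))
    return problems


# Hypothetical settings with "accurate" left out and a bad "complete".
example = {
    "name_english": "French",
    "lang": "fr",
    "copyright": "GPLv2",
    "version": "0.1-0",
    "complete": "mostly",
}
for problem in check_settings(example):
    print(problem)
```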


In addition there must be at least one of each of the following group
entries:

author:
  name: The name of the author written using the Latin script,
    preferably spelled in English.  Accents are allowed.
  name-native: The name of the author written in the native script
    and spelling.
  email: The email address of the author.  The email needs to
    be translated into an anti-spam version.  Each '.' is replaced
    with a space and '@' is replaced with ' at '.  For example
    "kevina@gnu.org" becomes "kevina at gnu org".
  maintainer: Set to 'true' if this person actively maintains the 
    Aspell version of the word list.  Set to 'false' or leave out
    otherwise.

Multiple author groups may be specified.
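
The anti-spam translation of the email field is a plain text
substitution.  A minimal sketch, with antispam as a made-up helper
name:

```python
def antispam(email):
    """Rewrite an email address in the anti-spam form used in the info file."""
    # Replace '@' first, then every '.', as described for the email field.
    return email.replace("@", " at ").replace(".", " ")


print(antispam("kevina@gnu.org"))  # kevina at gnu org
```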

dict:  The defining entry for a dictionary
  name: The name of this dict
  alias: An alternate name (may be repeated)
  add: A word list to add (may be repeated)

Multiple dictionaries may be defined.  If a particular dictionary
should not have an awli entry associated with it add "awli false".

The dictionary name should be of the form
  <code>[_<country>][-<jargon>][-<size>]

Where <country> is the two letter ISO 3166 country code which should
be in all upper case, <jargon> is any extra information to distinguish
the dictionary from other dictionaries, and <size> is the dictionary
size, a two digit number which should roughly follow these
guidelines:

10: tiny
20: really small
30: small
40: med-small
50: med
60: med-large (the default size)
70: large
80: huge
90: insane

See SCOWL (http://wordlist.sourceforge.net) for an example of how
these sizes are used.
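
To make the name form concrete, here is a sketch that splits a
dictionary name into its parts.  It assumes that a trailing two digit
component is always the size and that everything between the first
part and the size is the jargon; parse_dict_name is a made-up name
and the real tools may be stricter or looser than this.

```python
import re


def parse_dict_name(name):
    """Split <code>[_<country>][-<jargon>][-<size>] into its parts."""
    head, *rest = name.split("-")
    code, sep, country = head.partition("_")
    if not code:
        raise ValueError("missing language code: %r" % name)
    if sep and not re.fullmatch(r"[A-Z]{2}", country):
        raise ValueError("country must be two upper case letters: %r" % name)
    size = None
    if rest and re.fullmatch(r"[0-9]{2}", rest[-1]):
        # Assumption: a trailing two digit component is the size.
        size = int(rest.pop())
    jargon = "-".join(rest) if rest else None
    return {"code": code, "country": country or None,
            "jargon": jargon, "size": size}


print(parse_dict_name("en_GB-ize-60"))
print(parse_dict_name("fr-40"))
```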

Aliases for individual dictionaries can automatically be created if a
global alias line is defined.  Each global alias represents a part of
a dictionary name.  For example, given a dictionary named "fr-40", the
lines
  alias fr francais french
  alias 40 sml small
will cause the following aliases to automatically be generated:
  francais-40
  francais-sml
  francais-small
  french-40
  french-sml
  french-small
  fr-sml
  fr-small
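
The expansion can be pictured as taking every combination of the
alternatives for each part of the name and dropping the original
name.  The sketch below reproduces the list above for a dictionary
named "fr-40"; it assumes the parts are separated by "-", and the
function name expand_aliases is hypothetical.

```python
from itertools import product


def expand_aliases(dict_name, global_aliases):
    """Return the aliases generated for dict_name from the global aliases."""
    parts = dict_name.split("-")
    # For each part, the candidates are the part itself plus its aliases.
    choices = [[part] + global_aliases.get(part, []) for part in parts]
    names = ["-".join(combo) for combo in product(*choices)]
    # The dictionary's own name is not an alias.
    return [name for name in names if name != dict_name]


GLOBAL_ALIASES = {
    "fr": ["francais", "french"],
    "40": ["sml", "small"],
}

for alias in expand_aliases("fr-40", GLOBAL_ALIASES):
    print(alias)
```

Three alternatives for each of the two parts give nine names, and
leaving out "fr-40" itself gives the eight aliases listed.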

Aliases normally do not have awli entries associated with them.  If you
wish a particular alias to have an awli entry simply append ":awli" to
the alias.  For example:

  alias en_GB en:awli

If an alias has an awli entry associated with it the final alias must
be of the proper form.

In addition to the above the info file can also contain the following
optional entries:

data-file: Additional language data files to be installed.  May
  be given multiple times for more than one file.
readme-extra: A text file in the doc/ directory to be appended to the
  end of the README file.  If it is not in utf-8 then the encoding it
  is in should be specified after the file name (separated by a space).
doc-encoding: The encoding the documentation should be in
alt-encoding: Alternate encoding for documentation.  Each entry
  should have the form "<encoding> <ext>".
url: URL of the official version of the dictionary for Aspell
source_url: URL of the original word list
source_version: Version of the original word list used
name_ascii: The language name spelled in its own language using only
  ASCII characters
name_native: Like above but not limited to ASCII characters or the Latin
  script.
copyright_desc: A BRIEF description of the copyright if the copyright line
  doesn't adequately describe it
notes: A BRIEF description of any major problems with this dictionary,
  other than being incomplete or inaccurate, such as being too large.

mode: Controls if the dictionary package will be created for Aspell
  0.50 or 0.60.  Either "aspell5" or "aspell6".  The default is "aspell6".

And a bunch of other entries which I will document later.

*** The *.wl/*.cwl files

For each add entry in the dict entry there should in general be one
word list.  Each of these word lists will be compiled into a separate
hash file so you should keep the number to a minimum.  Each file name
is expected to have the following form:
  <code>[-...].wl
These files will be compressed for you with prezip-bin and renamed to
*.cwl.

*** Copyright file

The copyright file simply states the terms under which this word list
is available.  If the license is a standard one or is more than a
paragraph or so the actual license should be included in a separate
file "COPYING".  If you are using one of the GNU licenses the COPYING
file will automatically be generated for you.

*** Running proc

Once the info file is created you are ready to run the
proc script.  The proc script needs to be copied or linked into the
current directory for things to work correctly.  Once that is done,
simply type:
  perl proc create
and if there are no errors you should have the generated files listed
above.

To try building a word list run configure with
  ./configure

and then to build and install it
  make
  make install

To create a distribution run
  make dist


**********************************************************************
        Requirements in order to be uploaded to ftp.gnu.org
**********************************************************************

The number one requirement is that the dictionary package MUST be made
using "make dist" using the "proc" script as previously described.
This will check for a large number of things.

When building the dictionary there should, in general, not be any
warnings.

The version string must end in "-NUM" where NUM is generally 0.  This
is to allow for minor updates.  In addition there should not be any
other "-" in the version string.
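
The version string rule can be expressed as a single pattern.  This
is a sketch which assumes that NUM consists only of digits; the
function name valid_version is made up, and "make dist" does its own
checking.

```python
import re

# One "-" only, followed by a number, at the very end of the string.
VERSION_RE = re.compile(r"[^-]+-[0-9]+")


def valid_version(version):
    """Return True if version ends in "-NUM" and has no other "-"."""
    return VERSION_RE.fullmatch(version) is not None


for candidate in ("2010.11.21-0", "1.0", "1.0-rc1-0"):
    print(candidate, valid_version(candidate))
```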

"name_native" should be given a value if it is different from the
English spelling.

The "complete" and "accurate" fields should have a value other than
unknown.

If the dictionary package is based on another dictionary, then
"source-version" and probably "url" should be given a value.  Also,
the version string should be made to resemble the upstream version to
make the relationship clear.

If one of the authors plans to act as the maintainer for the
dictionary package, add the line "maintainer true" for that author.
There may be more than one maintainer.

The file Copyright should contain a clear Copyright notice, which
includes the owner of the Copyright.  It should be something like:

  Copyright (c) YEAR by SO AND SO under the WHAT.

The copyright must meet the FSF definition of free.  See
  http://www.gnu.org/licenses/license-list.html

**********************************************************************
       GNU Aspell mkchardata Perl script and Unicode data file
**********************************************************************

This version of mkchardata will only work for GNU Aspell 0.60 or
better.  It will not work for Aspell 0.50 or any of the Aspell
0.51/0.60 snapshots before 2004-03-02.

The mkchardata Perl script reads in one or more textual reference
tables and converts them into Aspell character data files.  Its usage is

  mkchardata <textual reference table(s)>

The files "unicode.txt" and "decomp.txt" are expected to be in the
current directory.

mkchardata will convert each textual reference table to an Aspell
character data file and normalization map file.  It expects the table
to be in the form

  0x?? 0x???? # ...

Where 0x?? is the 8-bit character value in hex and 0x???? is the
Unicode value.  Anything after the '#' is ignored.  Ranges can also be
specified in the form

  0x??..0x?? = 0x????..0x???? # ...
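
To illustrate the two "0x" forms, the sketch below turns one table
line into a list of (8-bit value, Unicode value) pairs.  It is not
how mkchardata itself is written; it does not enforce the number of
hex digits and does not handle any other kind of line, and the
function name parse_table_line is made up.

```python
import re

HEX = r"0x([0-9A-Fa-f]+)"
SINGLE = re.compile(HEX + r"\s+" + HEX)
RANGE = re.compile(HEX + r"\.\." + HEX + r"\s*=\s*" + HEX + r"\.\." + HEX)


def parse_table_line(line):
    """Return the (byte, code point) pairs described by one table line."""
    body = line.split("#", 1)[0].strip()  # anything after '#' is ignored
    if not body:
        return []
    m = RANGE.fullmatch(body)
    if m:
        b0, b1, u0, u1 = (int(g, 16) for g in m.groups())
        if b1 - b0 != u1 - u0:
            raise ValueError("range lengths differ: %r" % line)
        return [(b0 + i, u0 + i) for i in range(b1 - b0 + 1)]
    m = SINGLE.fullmatch(body)
    if m:
        return [(int(m.group(1), 16), int(m.group(2), 16))]
    raise ValueError("not a mapping line: %r" % line)


print(parse_table_line("0xE9 0x00E9 # LATIN SMALL LETTER E WITH ACUTE"))
print(parse_table_line("0x80..0x82 = 0x0980..0x0982 # start of a range"))
```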

The table may alternatively have the form:

  =?? U+???? ...


Another file can be included by using:

  include <file name>


The directive

  == <charset>

indicates that the _unicode mapping_ is the same for the current file as it
is in <charset>.  The only difference is the character properties.


The directives:

  no-latin
  letter <char>
  letters <char> <char> ...
  vowel <char>
  vowels <char> <char> ...
  case <upper> <lower> [<title>]

can be used to customize the character properties.  None of these affect
the actual mapping.

The "no-latin" line can be used to avoid marking Latin letters as part
of a word.  It is useful if the charset is based on an existing one
which maps the Latin letters but your language is not written using
the Latin script.

The "letter" or "letters" directives can be used to indicate that an
accented letter is really a unique letter and not a letter with an
accent.  Each <char> is a single pre-composed character in UTF-8 or a
Unicode code point of the form (U+)XXXX where XXXX is in hex.

The "vowel" or "vowels" directive can be used to identify the vowels
of a language.  If used it is necessary to list ALL vowels of the
language.  If not specified then the information is taken from the
Unicode data file.  Specifying a character here implies "letter".

The "case" directive can be used to identify special case rules which
are different from the Unicode default such as the rules involving
the dotless I for Turkish.

See the file l-tr.txt for an example of the "letter" and "case"
directives.


As of Aspell 0.60 the following characters may be remapped:

  01-0F (  1- 15) # Control characters
  11-1F ( 17- 31) # Control characters
  41-5A ( 65- 90) # Uppercase Latin alphabet
  61-7A ( 97-122) # Lowercase Latin alphabet
  80-FF (128-255)

Giving you a total of 210 characters to work with.
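
The total can be checked by adding up the sizes of the ranges listed
above.  A small sketch, with is_remappable as a made-up helper:

```python
# Inclusive (first, last) ranges of remappable 8-bit values, from the
# table above.
REMAPPABLE = [
    (0x01, 0x0F),  # control characters
    (0x11, 0x1F),  # control characters
    (0x41, 0x5A),  # uppercase Latin alphabet
    (0x61, 0x7A),  # lowercase Latin alphabet
    (0x80, 0xFF),
]


def is_remappable(byte):
    """Return True if the 8-bit value falls in one of the ranges."""
    return any(first <= byte <= last for first, last in REMAPPABLE)


total = sum(last - first + 1 for first, last in REMAPPABLE)
print(total)  # 15 + 15 + 26 + 26 + 128 = 210
```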


If your language uses characters not found in iso-8859-1 (code points
U+00 to U+FF) you might want to look over unicode.txt and make sure
everything is correct for your language.  If you find any errors
please send them to me at kevina@gnu.org.


**********************************************************************
                       Provided Character Sets
**********************************************************************

INCLUDED WITH ASPELL:

ISO-8859:
  iso-8859-1 - Latin1 (Western)
  iso-8859-2 - Latin2 (Central European)
  iso-8859-3 - Latin3 (South European)
  iso-8859-4 - Latin4 (Old Baltic)
  iso-8859-5 - Cyrillic
  iso-8859-6 - Arabic
  iso-8859-7 - Greek
  iso-8859-8 - Hebrew
  iso-8859-9 - Latin5 (Turkish)
  iso-8859-10 - Latin6 (Nordic)
  iso-8859-11 - Thai
  iso-8859-13 - Latin7 (Baltic)
  iso-8859-14 - Latin8 (Celtic)
  iso-8859-15 - Latin9 (New Western)
  iso-8859-16 - Latin10 (Romanian)

See http://aspell.net/charsets/iso8859.html

Microsoft Code Pages:
  cp1250 - Central European (Latin)
  cp1251 - Cyrillic
  cp1252 - Western (Latin)
  cp1253 - Greek
  cp1254 - Turkish (Latin)
  cp1255 - Hebrew
  cp1256 - Arabic
  cp1257 - Baltic (Latin)
  cp1258 - Vietnamese (Latin)

See http://aspell.net/charsets/codepages.html

Cyrillic:
  koi8-r
  koi8-u - Ukrainian
  iso-8859-5
  cp1251

See http://aspell.net/charsets/cyrillic.html

OTHERS:

These mappings are available under the maps/ directory.  If you use
one of them for your dictionary it should be included with the
tarball.  You can convert all of them to Aspell's charset files by using:
  perl mkchardata maps/*.txt

Since there is the possibility of two different dictionaries providing
the same charset file, DO NOT modify the mappings or the charset files.
If you wish to customize it for your language rename it to l-<lang>.cset.

These are like the base character set except that the C0 and C1
control areas were remapped to include any decomposed letter found in
the Unicode blocks "Latin-1 Supplement" and "Latin Extended-A" and any
combining marks used in any of the Latin Unicode blocks "Latin-1
Supplement", "Latin Extended-A", "Latin Extended-B", "Latin Extended
Additional".
  iso-8859-1-u
  iso-8859-2-u
  iso-8859-3-u
  iso-8859-4-u
  iso-8859-9-u
  iso-8859-10-u
  iso-8859-13-u
  iso-8859-14-u
  iso-8859-15-u
  iso-8859-16-u

These are identical to the base character set except that Latin
letters are not used so that Aspell won't flag words written using
the Latin script as incorrect.
  cp1251-nl
  cp1253-nl
  cp1255-nl
  cp1256-nl
  iso-8859-5-nl
  iso-8859-6-nl
  iso-8859-7-nl
  iso-8859-8-nl
  iso-8859-11-nl
  koi8-r-nl
  koi8-u-nl

Vietnamese:
  viscii
  tcvn3

Other standard mappings:
  iso-6438 - Extended African Latin Alphabet

Simple Unicode mappings:
  u-armn - Armenian   (U+0530..U+058F to 0xA0..0xFF)
  u-beng - Bengali    (U+0980..U+09FF to 0x80..0xFF)
  u-deva - Devanagari (U+0900..U+097F to 0x80..0xFF)
  u-geor - Georgian   (U+10A0..U+10FF to 0xA0..0xFF)
  u-gujr - Gujarati   (U+0A80..U+0AFF to 0x80..0xFF)
  u-guru - Gurmukhi   (U+0A00..U+0A7F to 0x80..0xFF)
  u-knda - Kannada    (U+0C80..U+0CFF to 0x80..0xFF)
  u-mong - Mongolian  (U+1800..U+187F to 0x80..0xFF)
  u-mymr - Myanmar    (U+1000..U+105F to 0xA0..0xFF)
  u-orya - Oriya      (U+0B00..U+0B7F to 0x80..0xFF)
  u-sinh - Sinhala    (U+0D80..U+0DFF to 0x80..0xFF)
  u-taml - Tamil      (U+0B80..U+0BFF to 0x80..0xFF)
  u-telu - Telugu     (U+0C01..U+0C7F to 0x80..0xFF)
  u-tglg - Tagalog    (U+1700..U+171F to 0xA0..0xBF)
  u-thaa - Thaana     (U+0780..U+07BF to 0xC0..0xFF)
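
Each of the simple mappings above is a constant offset between a
Unicode range and an 8-bit range.  The sketch below builds such a
mapping and prints it in the "0x?? 0x????" table form.  It is an
illustration only: the files under maps/ were not necessarily
generated this way and may differ in detail, and the function name
simple_mapping is made up.

```python
def simple_mapping(first_cp, last_cp, first_byte):
    """Map the code points first_cp..last_cp onto bytes from first_byte."""
    count = last_cp - first_cp + 1
    if first_byte + count - 1 > 0xFF:
        raise ValueError("range does not fit in 8 bits")
    return {first_byte + i: first_cp + i for i in range(count)}


# u-beng from the list above: U+0980..U+09FF to 0x80..0xFF
bengali = simple_mapping(0x0980, 0x09FF, 0x80)
for byte in sorted(bengali)[:3]:
    print("0x%02X 0x%04X" % (byte, bengali[byte]))
```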

Not so simple Unicode mappings:
  u-mlym - Malayalam
  u-hebr - Hebrew

Special mappings using private use characters:
  s-ethi - Ethiopic

The Latin letters are not used in any of the above Unicode mappings.

Language specific mappings.  Unlike the other mappings, it is
permissible to modify these.  However to avoid future problems,
please let me know about the changes at kevina@gnu.org.
  l-az - Azerbaijani
  l-fa - Persian
  l-ky - Kirghiz
  l-sr - Serbian (supports both the Cyrillic and Latin script)
  l-tg - Tajik
  l-tr - Turkish (iso-8859-9 with special case rules for dotless I)
  l-uz - Uzbek

Some other language specific mappings, which I created for various
people, are also available.  Most have not been used in an official
dictionary yet and might still be incomplete.