2010-11-21
This is an experimental version of the Aspell language toolkit
(temporary name). It can be used to create dictionaries for both
Aspell 0.50 and 0.60.
**********************************************************************
Getting Started
**********************************************************************
Since Aspell is 8-bit internally you need to first decide on a charset
to use. See the section "Provided Character Sets" for a list of
available character sets. If none of the character sets are adequate
then you need to create a new one for your language. If this is
necessary please email [email protected] for help.
Now cd to the location of the aspell-lang package (the directory this
file is in) and run
./pre LANG CHARSET
where LANG is the ISO language code for your language and CHARSET is
the charset you decided on.
This will create a directory LANG with the following files in it:
info
LANG.dat
LANG.wl
Copyright
proc (symbolic link)
and possibly
CHARSET.cset
CHARSET.cmap
misc/CHARSET.txt
Edit the "info" and "Copyright" files as appropriate. See the next
section for what these fields mean.
If you chose a charset which Aspell provides then the default
encoding will be that charset. If you would rather use "utf-8" then
uncomment the line "data-encoding utf-8" in LANG.dat.
Replace LANG.wl with a small word list for your language.
Now to build the word list:
./proc
./configure
make
And if all goes well you should have a very basic dictionary for your
language. You can install it if you want using "make install".
If you want to make a dictionary package use "make dist".
Please see "Adding Support For Other Languages" in the Aspell manual
and the rest of the document for where to go from here.
When you have something ready to distribute, check over the requirements
section in this file. Once you are reasonably sure it is ready to
upload, send it to [email protected].
**********************************************************************
Draft Documentation on the layout of Aspell dicts packages
**********************************************************************
The overall goal of Aspell dicts is to provide a uniform method to
distribute dictionaries for Aspell for any language that Aspell
supports.
This documentation is still in an early stage and rather incomplete.
It is meant to give you enough of an overview so you know what is
going on, but probably won't be enough information for you to actually
create a distribution.
Layout of the Distribution:
An Aspell Word List Package contains several types of files, many of
them generated by the proc script. These must be provided:
info: the main file which contains most of the information about the package
*.wl: word list files
Copyright: the copyright notice
??.dat: The language data file
Several optional ones:
additional language data files (must be listed under data-file)
COPYING: The actual license agreement. Automatically provided for some
licenses
doc/* additional documentation
misc/* other files to include in distribution
and finally some automatically generated or provided ones:
configure: the configure script which finds the appropriate paths
and generates the actual makefile. This file needs to be
copied from aspell-gen package.
??.dat: the data file for the language.
*.multi: the dictionary files
Makefile.pre: the makefile which configure uses.
*** Format of the Info File
(Note: For a better idea of how this file is laid out see some of the
sample info files included)
The info file is the main file which contains most of the information.
It is expected to be in utf-8. It has two types of entries. Single
value settings, and group settings. Single value settings have the
form:
<key> <value>
And group settings which have the form:
<group key>:
<key> <value>
<key> <value>
...
If there is ANY whitespace before a key it is assumed to belong to a
group entry.
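To make the two entry types concrete, here is a rough Python sketch of
a reader for this layout. It is not the parser used by the proc script;
the function name parse_info and the sample values are made up for
illustration, and it only applies the rules stated above (a line with
leading whitespace belongs to the most recent group).

```python
def parse_info(text):
    """Split info-file text into single value settings and group settings.

    Returns (settings, groups):
      settings -- dict mapping key -> list of values (a key may repeat)
      groups   -- list of (group_key, [(key, value), ...]) in file order
    """
    settings = {}
    groups = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        parts = raw.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if raw[0].isspace():
            # ANY leading whitespace means the line belongs to a group.
            if not groups:
                raise ValueError("indented line before any group: %r" % raw)
            groups[-1][1].append((key, value))
        elif len(parts) == 1 and key.endswith(":"):
            # "<group key>:" on a line of its own opens a new group.
            groups.append((key[:-1], []))
        else:
            settings.setdefault(key, []).append(value)
    return settings, groups


# Hypothetical sample; "xx" and the author details are placeholders.
SAMPLE = """\
name_english Example
lang xx
version 0.01-0
author:
  name Jane Doe
  email jane at example org
dict:
  name xx
  add xx
"""

settings, groups = parse_info(SAMPLE)
print(settings["lang"])
print(groups[0])
```

Values are kept in lists because some keys, such as data-file, may be
given more than once.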
The following Single value settings are mandatory:
name_english: The English name of the language
lang: The language code
copyright: The copyright, one of:
LGPLv2.1
LGPLv3
GPLv2
GPLv3
FDLv1.1
FDLv1.2
Artistic
Copyrighted (Copyright message must remain)
Free Software (Meets FSF definition of free)
Open Source (Meets OSI definition)
Public Domain (ie none)
Other
Unknown
version: A version string
complete: "true" if the dictionary is reasonably complete, "almost"
if it is close, "false" otherwise, or "unknown"
accurate: "true" if the dictionary is accurate (ie every word is
valid), "false" otherwise, or "unknown"
In addition there must be at least one of each of the following group
entries:
author:
name: The name of the author written using the Latin script,
preferably spelled in English. Accents are allowed.
name-native: The name of the author written in the native script
and spelling.
email: The email address of the author. The email needs to
be translated into an anti-spam version: each '.' is replaced with
a space and '@' is replaced with ' at '. For example
"kevina@gnu.org" becomes "kevina at gnu org".
maintainer: Set to 'true' if this person actively maintains the
Aspell version of the word list. Set to 'false' or leave out
otherwise.
Multiple author groups may be specified.
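The email translation described for the author entry is a plain text
substitution. A minimal sketch in Python follows; the function name
antispam is only illustrative and is not part of the toolkit.

```python
def antispam(email):
    """Replace '@' with ' at ' and every '.' with a space."""
    return email.replace("@", " at ").replace(".", " ")


print(antispam("kevina@gnu.org"))  # kevina at gnu org
```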
dict: The defining entry for a dictionary
name: The name of this dict
alias: An alternate name (may be repeated)
add: A word list to add (may be repeated)
Multiple dictionaries may be defined. If a particular dictionary
should not have an awli entry associated with it, add "awli false".
Dictionary name should be of the form
<code>[_<country>][-<jargon>][-<size>]
Where <country> is the two letter ISO 3166 country code, which should
be in all upper case, <jargon> is any extra information to distinguish
the dictionary from other dictionaries, and <size> is the dictionary
size, which should be a two digit number that roughly follows these
guidelines:
10: tiny
20: really small
30: small
40: med-small
50: med
60: med-large (the default size)
70: large
80: huge
90: insane
See SCOWL (http://wordlist.sourceforge.net) for an example of how
these sizes are used.
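As an illustration of the naming scheme, the following Python sketch
splits a dictionary name into its parts. It is an assumption-laden
example rather than toolkit code: parse_dict_name is a made-up helper,
a trailing two digit component is taken to be the size, and anything
between the code and the size is treated as jargon.

```python
import re

# Size guidelines copied from the list above.
SIZE_LABELS = {
    10: "tiny", 20: "really small", 30: "small", 40: "med-small",
    50: "med", 60: "med-large", 70: "large", 80: "huge", 90: "insane",
}


def parse_dict_name(name):
    """Split <code>[_<country>][-<jargon>][-<size>] into a dict."""
    head, *rest = name.split("-")
    code, _, country = head.partition("_")
    size = None
    # Assumption: a final two digit component is the size.
    if rest and re.fullmatch(r"[0-9]{2}", rest[-1]):
        size = int(rest.pop())
    return {
        "code": code,
        "country": country or None,
        "jargon": "-".join(rest) or None,
        "size": size,
    }


info = parse_dict_name("en_GB-60")
print(info, SIZE_LABELS.get(info["size"]))
```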
Aliases for individual dictionaries can automatically be created if a
global alias line is defined. Each global alias represents a part of
a dictionary name. For example:
alias fr francais french
alias 40 sml small
will cause the following aliases to be generated automatically:
francais-40
francais-sml
francais-small
french-40
french-sml
french-small
fr-sml
fr-small
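The expansion can be pictured as taking every combination of the
alternatives for each part of the name and dropping the original name.
The Python sketch below reproduces the list above under the assumption
that the dictionary is named "fr-40" and that parts are separated by
"-"; expand_aliases is a hypothetical helper, and the order in which
proc generates the aliases may differ.

```python
from itertools import product


def expand_aliases(name, global_aliases):
    """Return every alias of name implied by the global alias lines."""
    parts = name.split("-")
    # Each part may stay as it is or be replaced by one of its aliases.
    choices = [[part] + global_aliases.get(part, []) for part in parts]
    names = ["-".join(combo) for combo in product(*choices)]
    return [n for n in names if n != name]


# The two global alias lines from the example above.
global_aliases = {"fr": ["francais", "french"], "40": ["sml", "small"]}
for alias in expand_aliases("fr-40", global_aliases):
    print(alias)
```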
Aliases normally do not have awli entries associated with them. If you
wish a particular alias to have an awli entry simply tag ":awli" after
the alias. For example
alias en_GB en:awli
If an alias has an awli entry associated with it the final alias must
be of the proper form.
In addition to the above the info file can also contain the following
optional entries:
data-file: Additional language data files to be installed. May
be given multiple times for more than one file.
readme-extra: A text file in the doc/ directory to be appended to the
end of the README file. If it is not in utf-8 then the encoding it
is in should be specified after the file name (separated by a space).
doc-encoding: The encoding the documentation should be in
alt-encoding: Alternate encoding for documentation. Each entry
should have the form "<encoding> <ext>".
url: URL of the official version of the dictionary for Aspell
source_url: URL of the original word list
source_version: Version of the original word list used
name_ascii: The language name spelled in its own language in all
ASCII characters
name_native: Like above but not limited to ASCII characters or the Latin
script.
copyright_desc: A BRIEF description of the copyright if the copyright line
doesn't adequately describe it
notes: A BRIEF description of any major problems with this dictionary,
other than being incomplete or inaccurate, such as being too large.
mode: Controls if the dictionary package will be created for Aspell
0.50 or 0.60. Either "aspell5" or "aspell6". The default is "aspell6".
And a bunch of other entries which I will document later.
*** The *.wl/*.cwl
For each add entry in the dict entry there should in general be one
word list. Each of these word lists will be compiled into a separate
hash file so you should keep the number to a minimum. Each file name
is expected to have the following form:
<code>[-...].wl
These files will be compressed for you with prezip-bin and renamed to
*.cwl.
*** Copyright file
The copyright file simply states the terms under which this word list
is available. If the license is a standard one or is more than a
paragraph or so the actual license should be included in a separate
file "COPYING". If you are using one of the GNU licenses the COPYING
file will automatically be generated for you.
*** running proc
Once the info file is created you are ready to run the
proc script. The proc script needs to be copied or linked into the
current directory for things to work correctly. Once that is done,
simply type:
perl proc create
and if there are no errors you should have the above listed generated
files.
To try building a word list run configure with
./configure
and then to build and install it
make
make install
To create a distribution do a
make dist
**********************************************************************
Requirements in order to be uploaded to ftp.gnu.org
**********************************************************************
The number one requirement is that the dictionary package MUST be made
using "make dist" using the "proc" script as previously described.
This will check for a large number of things.
When building the dictionary there should, in general, not be any
warnings.
The version string must end in "-NUM" where NUM is generally 0. This
is to allow for minor updates. In addition there should not be any
other "-" in the version string.
"name_native" should be given a value if it is different from the
English spelling.
The "complete" and "accurate" fields should have a value other than
unknown.
If the dictionary package is based on another dictionary, then
"source_version" and probably "url" should be given a value. Also,
the version string should be made to resemble the upstream version to
make the relationship clear.
If one of the authors plans to act as the maintainer for the
dictionary package, add the line "maintainer true" for that author.
There may be more than one maintainer.
The file Copyright should contain a clear Copyright notice, which
includes the owner of the Copyright. It should be something like:
Copyright (c) YEAR by SO AND SO under the WHAT.
The copyright must meet the FSF definition of free. See
http://www.gnu.org/licenses/license-list.html
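The version string requirement in this section (a trailing "-NUM" and
no other "-") is easy to check before running "make dist". The Python
sketch below is only an illustration of that rule, not the check that
proc performs; the version strings shown are invented examples.

```python
import re

# One run of characters without "-", a single "-", then the number.
VERSION_RE = re.compile(r"[^-]+-[0-9]+")


def is_valid_version(version):
    """True if version ends in "-NUM" and has no other "-"."""
    return VERSION_RE.fullmatch(version) is not None


for version in ("0.60-0", "0.60", "0.60-beta-1"):
    print(version, is_valid_version(version))
```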
**********************************************************************
GNU Aspell mkchardata Perl script and Unicode data file
**********************************************************************
This version of mkchardata will only work for GNU Aspell 0.60 or
better. It will not work for Aspell 0.50 or any of the Aspell
0.51/0.60 snapshots before 2004-03-02.
The mkchardata Perl script will read in a textual reference table(s)
and convert them into Aspell character data file(s). Its usage is
mkchardata <textual reference table(s)>
The files "unicode.txt" and "decomp.txt" are expected to be in the
current directory.
mkchardata will convert each textual reference table to an Aspell
character data file and normalization map file. It expects the table
to be in the form
0x?? 0x???? # ...
Where 0x?? is the 8-bit character value in hex and 0x???? is the
Unicode value. Anything after the '#' is ignored. Ranges can also be
specified in the form
0x??..0x?? = 0x????..0x???? # ...
The table may alternatively have the form:
=?? U+???? ...
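For readers who want to inspect a table outside of mkchardata, the
three line forms above can be read with a few regular expressions. The
Python sketch below is an approximation, not a port of the Perl script:
parse_table is a made-up name, ranges are assumed to map element by
element, the sample lines are invented, and any other line (such as a
directive) is simply skipped.

```python
import re

# 0x?? 0x????
SINGLE = re.compile(r"^0x([0-9A-Fa-f]{2})\s+0x([0-9A-Fa-f]{4})\b")
# 0x??..0x?? = 0x????..0x????
RANGE = re.compile(
    r"^0x([0-9A-Fa-f]{2})\.\.0x([0-9A-Fa-f]{2})\s*=\s*"
    r"0x([0-9A-Fa-f]{4})\.\.0x([0-9A-Fa-f]{4})\b"
)
# =?? U+????
ALT = re.compile(r"^=([0-9A-Fa-f]{2})\s+U\+([0-9A-Fa-f]{4})\b")


def parse_table(text):
    """Return a dict mapping 8-bit values to Unicode code points."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop anything after '#'
        if not line:
            continue
        m = RANGE.match(line)
        if m:
            b_lo, b_hi, u_lo, u_hi = (int(g, 16) for g in m.groups())
            if b_hi - b_lo != u_hi - u_lo:
                raise ValueError("range lengths differ: %r" % raw)
            for i in range(b_hi - b_lo + 1):
                mapping[b_lo + i] = u_lo + i
            continue
        m = SINGLE.match(line) or ALT.match(line)
        if m:
            mapping[int(m.group(1), 16)] = int(m.group(2), 16)
        # Other lines are ignored in this sketch.
    return mapping


# Invented sample lines, one per form.
SAMPLE = """\
0xE9 0x00E9 # LATIN SMALL LETTER E WITH ACUTE
0xA0..0xA2 = 0x0530..0x0532 # a short range
=C0 U+0780 THAANA LETTER HAA
"""

for byte, code_point in sorted(parse_table(SAMPLE).items()):
    print("0x%02X -> U+%04X" % (byte, code_point))
```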
Another file can be included by using:
include <file name>
The directive
== <charset>
indicates that the _unicode mapping_ is the same for the current file as it
is in <charset>. The only difference is the character properties.
The directives:
no-latin
letter <char>
letters <char> <char> ...
vowel <char>
vowels <char> <char> ...
case <upper> <lower> [<title>]
can be used to customize the character properties. None of these affect
the actual mapping.
The "no-latin" line can be used to avoid marking Latin letters as part
of a word. It is useful if the charset is based on an existing one
which maps the Latin letters but your language is not written using
the Latin script.
The "letter" or "letters" directives can be used to indicate that an
accented letter is really a unique letter and not a letter with an
accent. Each <char> is a single pre-composed character in UTF-8 or a
Unicode code point of the form (U+)XXXX where XXXX is in hex.
The "vowel" or "vowels" directive can be used to identify the vowels
of a language. If used it is necessary to list ALL vowels of the
language. If not specified then the information is taken from the
unicode data file. Specifying a character here implies "letter".
The "case" directive can be used to identify special case rules which
are different from the Unicode default such as the rules involving
the dotless I for Turkish.
See the file l-tr.txt for an example of the "letter" and "case"
directives.
As of Aspell 0.60 the following characters may be remapped:
01-0F ( 1- 15) # Control characters
11-1F ( 17- 31) # Control characters
41-5A ( 65- 90) # Uppercase Latin alphabet
61-7A ( 97-122) # Lowercase Latin alphabet
80-FF (128-255)
Giving you a total of 210 characters to work with.
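The total can be confirmed by summing the sizes of the ranges listed
above. The short Python check below also shows how to test whether a
given 8-bit value falls in a remappable range; the names used are
illustrative only.

```python
# Inclusive ranges copied from the list above.
REMAPPABLE_RANGES = [
    (0x01, 0x0F),  # control characters
    (0x11, 0x1F),  # control characters
    (0x41, 0x5A),  # uppercase Latin alphabet
    (0x61, 0x7A),  # lowercase Latin alphabet
    (0x80, 0xFF),
]


def is_remappable(byte):
    """True if the 8-bit value lies in one of the ranges above."""
    return any(lo <= byte <= hi for lo, hi in REMAPPABLE_RANGES)


total = sum(hi - lo + 1 for lo, hi in REMAPPABLE_RANGES)
print(total)  # 210
```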
If your language uses characters not found in iso-8859-1 (code points
U+00 to U+FF) you might want to look over unicode.txt and make sure
everything is correct for your language. If you find any errors
please send them to me at [email protected].
**********************************************************************
Provided Character Sets
**********************************************************************
INCLUDED WITH ASPELL:
ISO-8859:
iso-8859-1 - Latin1 (Western)
iso-8859-2 - Latin2 (Central European)
iso-8859-3 - Latin3 (South European)
iso-8859-4 - Latin4 (Old Baltic)
iso-8859-5 - Cyrillic
iso-8859-6 - Arabic
iso-8859-7 - Greek
iso-8859-8 - Hebrew
iso-8859-9 - Latin5 (Turkish)
iso-8859-10 - Latin6 (Nordic)
iso-8859-11 - Thai
iso-8859-13 - Latin7 (Baltic)
iso-8859-14 - Latin8 (Celtic)
iso-8859-15 - Latin9 (New Western)
iso-8859-16 - Latin10 (Romanian)
See http://aspell.net/charsets/iso8859.html
Microsoft Code Pages:
cp1250 - Central European (Latin)
cp1251 - Cyrillic
cp1252 - Western (Latin)
cp1253 - Greek
cp1254 - Turkish (Latin)
cp1255 - Hebrew
cp1256 - Arabic
cp1257 - Baltic (Latin)
cp1258 - Vietnamese (Latin)
See http://aspell.net/charsets/codepages.html
Cyrillic:
koi8-r
koi8-u - Ukrainian
iso-8859-5
cp1251
See http://aspell.net/charsets/cyrillic.html
OTHERS:
These mappings are available under the maps/ directory. If you use
one of them for your dictionary it should be included with the
tarball. You can convert all of them to Aspell's charset files by using:
perl mkchardata maps/*.txt
Since there is the possibility of two different dictionaries providing
the same charset file, DO NOT modify the mappings or the charset files.
If you wish to customize it for your language rename it to l-<lang>.cset.
These are like the base character set except that the C0 and C1
control areas were remapped to include any decomposed letter found in
the Unicode blocks "Latin-1 Supplement" and "Latin Extended-A" and any
combining marks used in any of the Latin Unicode blocks "Latin-1
Supplement", "Latin Extended-A", "Latin Extended-B", "Latin Extended
Additional".
iso-8859-1-u
iso-8859-2-u
iso-8859-3-u
iso-8859-4-u
iso-8859-9-u
iso-8859-10-u
iso-8859-13-u
iso-8859-14-u
iso-8859-15-u
iso-8859-16-u
These are identical to the base character set except that Latin
letters are not used so that Aspell won't flag words written using
the Latin script as incorrect.
cp1251-nl
cp1253-nl
cp1255-nl
cp1256-nl
iso-8859-5-nl
iso-8859-6-nl
iso-8859-7-nl
iso-8859-8-nl
iso-8859-11-nl
koi8-r-nl
koi8-u-nl
Vietnamese:
viscii
tcvn3
Other standard mappings:
iso-6438 - Extended African Latin Alphabet
Simple Unicode mappings:
u-armn - Armenian (U+0530..U+058F to 0xA0..0xFF)
u-beng - Bengali (U+0980..U+09FF to 0x80..0xFF)
u-deva - Devanagari (U+0900..U+097F to 0x80..0xFF)
u-geor - Georgian (U+10A0..U+10FF to 0xA0..0xFF)
u-gujr - Gujarati (U+0A80..U+0AFF to 0x80..0xFF)
u-guru - Gurmukhi (U+0A00..U+0A7F to 0x80..0xFF)
u-knda - Kannada (U+0C80..U+0CFF to 0x80..0xFF)
u-mong - Mongolian (U+1800..U+187F to 0x80..0xFF)
u-mymr - Myanmar (U+1000..U+105F to 0xA0..0xFF)
u-orya - Oriya (U+0B00..U+0B7F to 0x80..0xFF)
u-sinh - Sinhala (U+0D80..U+0DFF to 0x80..0xFF)
u-taml - Tamil (U+0B80..U+0BFF to 0x80..0xFF)
u-telu - Telugu (U+0C01..U+0C7F to 0x80..0xFF)
u-tglg - Tagalog (U+1700..U+171F to 0xA0..0xBF)
u-thaa - Thaana (U+0780..U+07BF to 0xC0..0xFF)
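The ranges in this list suggest a straight offset: each code point in
the Unicode block is moved to the 8-bit value at the same distance from
the start of the target range. The Python sketch below works on that
assumption only; the actual files under maps/ are authoritative and may
differ in detail, and make_offset_map is a hypothetical helper.

```python
def make_offset_map(first_cp, last_cp, first_byte):
    """Build a function mapping one character to its 8-bit value,
    assuming a plain offset from first_cp to first_byte."""
    def to_byte(ch):
        cp = ord(ch)
        if not first_cp <= cp <= last_cp:
            raise ValueError("U+%04X is outside the mapped block" % cp)
        return first_byte + (cp - first_cp)
    return to_byte


# Ranges taken from the u-deva and u-armn lines above.
deva = make_offset_map(0x0900, 0x097F, 0x80)
armn = make_offset_map(0x0530, 0x058F, 0xA0)

print(hex(deva("\u0915")))  # DEVANAGARI LETTER KA
print(hex(armn("\u0531")))  # ARMENIAN CAPITAL LETTER AYB
```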
Not so simple Unicode mappings:
u-mlym - Malayalam
u-hebr - Hebrew
Special mappings using private use characters:
s-ethi - Ethiopic
The Latin letters are not used in any of the above Unicode mappings.
Language specific mappings. Unlike the other mappings, it is
permissible to modify these. However to avoid future problems,
please let me know about the changes at [email protected].
l-az - Azerbaijani
l-fa - Persian
l-ky - Kirghiz
l-sr - Serbian (supports both the Cyrillic and Latin script)
l-tg - Tajik
l-tr - Turkish (iso-8859-9 with special case rules for dotless I)
l-uz - Uzbek
Some other language specific mappings are also available which I
created for various people. Most have not been used in an official
dictionary yet and might still be incomplete.