- Model Info:
  - Tied embeddings: True
  - LM head uses bias: False
  - Embeddings shape: [65536, 2048]
- Tokenizer Info:
  - Vocab Size: 65536
  - Tokenizer Class: PreTrainedTokenizerFast
  - Tokenizer Type: BPE
  - Bytes handling: Byte Input
  - Token for verification prompt building: ArgumentException
  - Token id for verification prompt building: 52922
- Indicator summary:
  - Indicator for under-trained tokens: E_{out} Cosine Distance
  - Overall distribution: 0.873 +/- 0.052
- Detected Token Counts:
  - Number of tested under-trained tokens: 1297, 1294 non-special, 6 below p = 0.01 threshold, 11 below soft indicator threshold
  - Number of single byte tokens: 243, of which 0 below indicator threshold
  - Number of special tokens: 0, of which 0 below indicator threshold
  - Number of non-single-byte UTF-fragment tokens: 603, of which 5 below soft indicator threshold
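
The report names the indicator (E_{out} cosine distance) and its thresholds but not how the reference direction is chosen. As a rough illustration only, the sketch below assumes the reference is the mean output-embedding row of a few tokens known to be unused, and flags every token whose cosine distance to that reference falls below a threshold. The helper names (`cosine_distance`, `mean_vector`, `flag_below_threshold`) and the toy 2-dimensional embeddings are hypothetical; a real run would use the [65536, 2048] matrix listed above.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def flag_below_threshold(embeddings, reference_ids, threshold):
    """Compute a per-token indicator and list token ids below the threshold.

    ASSUMPTION: the reference direction is the mean embedding of
    `reference_ids` (tokens believed to be unused). The report does not
    state how the actual tool picks its reference.
    """
    reference = mean_vector([embeddings[i] for i in reference_ids])
    indicators = {
        token_id: cosine_distance(vec, reference)
        for token_id, vec in enumerate(embeddings)
    }
    # Smallest distance first, matching the ordering used in the tables.
    flagged = sorted(
        (tid for tid, dist in indicators.items() if dist < threshold),
        key=indicators.get,
    )
    return indicators, flagged

# Toy stand-in for the output embedding matrix (4 tokens, 2 dimensions).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
indicators, flagged = flag_below_threshold(
    toy_embeddings, reference_ids=[0], threshold=0.605
)
print(flagged)
```

With the toy data, tokens 0 and 1 point almost exactly along the reference direction and are flagged, while tokens 2 and 3 lie far from it. The 0.605 value is simply the threshold quoted in this report, reused here for illustration.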

11 entries below threshold of 0.605

token_id | token | indicator | max_prob | in_other_tokens |
---|---|---|---|---|
33234 | NdEx | 4.76837e-07 | 6.5e-10 | ▁iNdEx , iNdEx |
60031 | abogon | 0.256473 | 1.2e-05 | ▁gihabogon |
56780 | ▁talags | 0.259384 | 1.3e-07 | ▁talagsaon |
54187 | alakip | 0.312524 | 1 | ilalakip , ▁nahilalakip |
44733 | imutangan | 0.403777 | 1 | ▁nahimutangan |
54194 | ilalakip | 0.459803 | 1 | ▁nahilalakip |
25491 | ámci | 0.520261 | 0.99 | rámci , ▁rámci |
45088 | ahabog | 0.529495 | 0.99 | ahabogang , ▁kinahabogang |
35475 | inaugahan | 0.537871 | 1 | ▁kinaugahan |
55607 | ÜÜÜÜ | 0.593691 | 0.87 | |
54053 | ▁MILAWA | 0.603315 | 0.5 | |

5 entries below threshold of 0.605

token_id | token | indicator | in_other_tokens |
---|---|---|---|
42545 | <0xE1><0xB1> | 0.565722 | |
18385 | <0xEE><0xA2> | 0.569999 | \ue89e\ , ▁\ue89e , \ue89e , \ue8a0 |
22434 | <0xEA><0xA6> | 0.580088 | |
12157 | <0x99><0xA6> | 0.580312 | ♦♦ , ▁♦ , ♦ |
22401 | <0xA1><0xB0> | 0.598273 | 조 , ▁조 |

0 entries below threshold of 0.461

1 entry below threshold of 0.461

token_id | token | indicator | max_prob |
---|---|---|---|
1 | <\|begin_of_text\|> | 0.2422 | 1.9e-10 |