Report for PleIAs/Pleias-1.2b-Preview

Model info

  • Model Info:
    • Tied embeddings: True
    • LM head uses bias: False
    • Embeddings shape: [65536, 2048]
  • Tokenizer Info:
    • Vocab Size: 65536
    • Tokenizer Class: PreTrainedTokenizerFast
    • Tokenizer Type: BPE
    • Bytes handling: Byte Input
    • Token for verification prompt building: ArgumentException
    • Token id for verification prompt building: 52922
  • Indicator summary:
    • Indicator for under-trained tokens: E_{out} Cosine Distance
    • Overall distribution: 0.873 +/- 0.052
  • Detected Token Counts:
    • Number of tested under-trained tokens: 1297 (1294 non-special), of which 6 below p = 0.01 threshold and 11 below soft indicator threshold
    • Number of single byte tokens: 243, of which 0 below indicator threshold
    • Number of special tokens: 0, of which 0 below indicator threshold
    • Number of non-single-byte UTF-fragment tokens: 603, of which 5 below soft indicator threshold
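
The indicator named above is a cosine distance over rows of the output embedding matrix. The report does not say how the reference direction is chosen, so the following is only a minimal sketch under an assumption: the reference is the mean of a few rows selected by the hypothetical `reference_ids` argument. The function names and the toy matrix are illustrative and are not taken from the tool that produced this report.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def indicator_scores(embeddings, reference_ids):
    """Cosine distance from every embedding row to a reference vector.

    ASSUMPTION: the reference vector is the mean of the rows listed in
    `reference_ids` (a hypothetical parameter, not named in the report).
    """
    dim = len(embeddings[0])
    reference = [
        sum(embeddings[i][d] for i in reference_ids) / len(reference_ids)
        for d in range(dim)
    ]
    return [cosine_distance(row, reference) for row in embeddings]


# Toy 3 x 2 matrix standing in for the real [65536, 2048] embeddings.
toy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
scores = indicator_scores(toy, reference_ids=[0])
```

Under this assumption, a row that points in the same direction as the reference scores close to 0, which matches the report's convention of flagging tokens whose indicator falls below a threshold. A real run would vectorize this over the full matrix and would need to guard against zero-norm rows.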

Under-trained token indicators plot

[Figure: indicators scatter plots]

Verification plot

[Figure: verification plot]

Under-trained token verification results

11 entries below threshold of 0.605

| token_id | token | indicator | max_prob | in_other_tokens |
|---|---|---|---|---|
| 33234 | NdEx | 4.76837e-07 | 6.5e-10 | ▁iNdEx, iNdEx |
| 60031 | abogon | 0.256473 | 1.2e-05 | ▁gihabogon |
| 56780 | ▁talags | 0.259384 | 1.3e-07 | ▁talagsaon |
| 54187 | alakip | 0.312524 | 1 | ilalakip, ▁nahilalakip |
| 44733 | imutangan | 0.403777 | 1 | ▁nahimutangan |
| 54194 | ilalakip | 0.459803 | 1 | ▁nahilalakip |
| 25491 | ámci | 0.520261 | 0.99 | rámci, ▁rámci |
| 45088 | ahabog | 0.529495 | 0.99 | ahabogang, ▁kinahabogang |
| 35475 | inaugahan | 0.537871 | 1 | ▁kinaugahan |
| 55607 | ÜÜÜÜ | 0.593691 | 0.87 | |
| 54053 | ▁MILAWA | 0.603315 | 0.5 | |

Tokens with partial UTF-8 sequences

5 entries below threshold of 0.605

| token_id | token | indicator | in_other_tokens |
|---|---|---|---|
| 42545 | <0xE1><0xB1> | 0.565722 | |
| 18385 | <0xEE><0xA2> | 0.569999 | \ue89e\, ▁\ue89e, \ue89e, \ue8a0 |
| 22434 | <0xEA><0xA6> | 0.580088 | |
| 12157 | <0x99><0xA6> | 0.580312 | ♦♦, ▁♦, |
| 22401 | <0xA1><0xB0> | 0.598273 | , ▁조 |
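
The tokens in this table are byte sequences that do not form a complete UTF-8 character on their own. As a rough illustration of the two kinds that appear (the leading bytes of a character, and the trailing bytes of one), the sketch below classifies a byte string using Python's standard incremental UTF-8 decoder. The function name `classify_fragment` and its three labels are hypothetical and are not part of the tool that produced this report.

```python
import codecs


def classify_fragment(raw: bytes) -> str:
    """Label a byte string as 'complete', 'leading', or 'trailing-or-invalid'.

    'leading' means the bytes are a valid but unfinished start of UTF-8 text;
    'trailing-or-invalid' means they cannot start valid UTF-8 text at all,
    for example because they begin with a continuation byte (0x80-0xBF).
    """
    try:
        raw.decode("utf-8")
        return "complete"
    except UnicodeDecodeError:
        pass

    # With final=False the decoder buffers an unfinished sequence instead of
    # raising, so an exception here means the bytes are not a valid prefix.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(raw, final=False)
        return "leading"
    except UnicodeDecodeError:
        return "trailing-or-invalid"


first_row = classify_fragment(bytes([0xE1, 0xB1]))    # token 42545
diamond_tail = classify_fragment(bytes([0x99, 0xA6]))  # token 12157
```

Here `<0xE1><0xB1>` is the start of a three-byte character, while `<0x99><0xA6>` is the last two bytes of `♦` (U+2666, encoded as `E2 99 A6`), which is consistent with the `in_other_tokens` column for token 12157. Likewise `<0xA1><0xB0>` is the tail of `조` (encoded as `EC A1 B0`).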

Byte tokens

0 entries below threshold of 0.461

Special tokens

1 entry below threshold of 0.461

token_id token indicator max_prob
1 <|begin_of_text|> 0.2422 1.9e-10