Report for h2oai/h2o-danube2-1.8b-base
Model Info:
Tied embeddings: False
LM head uses bias: False
Embeddings shape: [32000, 2560]
Tokenizer Info:
Vocab Size: 32000
Tokenizer Class: LlamaTokenizer
Tokenizer Type: BPE
Bytes handling: Byte Fallback
Token for verification prompt building: includegraphics
Token id for verification prompt building: 7621
Indicator summary:
Indicator for under-trained tokens: E_{in} L2 Norm
Overall distribution: 0.659 +/- 0.083
Detected Token Counts:
Number of tested under-trained tokens: 637, 617 non-special, 62 below p = 0.01 threshold, 23 below soft indicator threshold
Number of single byte tokens: 380, of which 140 below indicator threshold
Number of special tokens: 0, of which 0 below indicator threshold
Under-trained token indicators plot
Under-trained token verification results
23 entries below threshold of 0.109
token_id
token
indicator
max_prob
in_other_tokens
26382
\+\_\
0.0101104
0.00044
31738
\uefc0
0.0107981
0.00034
25900
iNdEx
0.0151775
0.00051
22803
.^{[
0.0281363
1.5e-05
15739
▁beginnetje
0.0300512
0.00014
30929
᥀
0.0380323
0.00057
26831
▁febbra
0.0421692
2.4e-05
▁febbraio
25975
▁Населення
0.0422046
6.7e-05
19248
NdEx
0.0435411
8.3e-05
iNdEx
18927
ederbörd
0.0471718
4.5e-06
▁nederbörd
27916
▁Marcatori
0.0585935
0.00028
22350
▁насеље
0.0671451
0.00016
26718
eltemperaturen
0.0700269
3.2e-05
27265
▁SDValue
0.0727615
0.0058
12872
▁underarter
0.0770717
7e-05
28593
pgfscope
0.0791623
0.022
28373
▁nederbörd
0.0799257
0.00058
15641
▁uitgen
0.0806028
7e-06
▁uitgenodigd
18699
▁становника
0.0899587
0.0011
21160
▁Становништво
0.0922849
0.00016
3 additional entries below threshold
token_id
token
indicator
max_prob
in_other_tokens
14052
▁Jahrhund
0.0959119
0.00079
▁Jahrhundert
, ▁Jahrhunderts
12645
▁släktet
0.10091
6.5e-05
15500
itempty
0.108821
0.0026
omitempty
140 entries below threshold of 0.182
token_id
token
indicator
ord
hex
byte_type
reencoded
93
<0x5A>
3.48395e-08
90
0x5A
ascii
28828: Z
32
<0x1D>
3.48718e-08
29
0x1D
ascii
31236: \x1d
112
<0x6D>
3.50091e-08
109
0x6D
ascii
28719: m
17
<0x0E>
3.5068e-08
14
0x0E
ascii
30517: \x0e
19
<0x10>
3.5076e-08
16
0x10
ascii
30388: \x10
73
<0x46>
3.51179e-08
70
0x46
ascii
28765: F
27
<0x18>
3.51436e-08
24
0x18
ascii
30555: \x18
48
<0x2D>
3.51733e-08
45
0x2D
ascii
28733: -
6
<0x03>
3.52463e-08
3
0x03
ascii
30662: \x03
71
<0x44>
3.52539e-08
68
0x44
ascii
28757: D
102
<0x63>
3.52799e-08
99
0x63
ascii
28717: c
104
<0x65>
3.53477e-08
101
0x65
ascii
28706: e
42
<0x27>
3.53498e-08
39
0x27
ascii
28742: '
4
<0x01>
3.53514e-08
1
0x01
ascii
29534: \x01
118
<0x73>
3.53524e-08
115
0x73
ascii
28713: s
92
<0x59>
3.53692e-08
89
0x59
ascii
28802: Y
127
<0x7C>
3.54028e-08
124
0x7C
ascii
28766: |
72
<0x45>
3.54074e-08
69
0x45
ascii
28749: E
10
<0x07>
3.54211e-08
7
0x07
ascii
30963: \x07
256
<0xFD>
3.54266e-08
253
0xFD
unused_utf8
120 additional entries below threshold
token_id
token
indicator
ord
hex
byte_type
reencoded
41
<0x26>
3.54453e-08
38
0x26
ascii
28800: &
115
<0x70>
3.54526e-08
112
0x70
ascii
28720: p
47
<0x2C>
3.55137e-08
44
0x2C
ascii
28725: ,
257
<0xFE>
3.55179e-08
254
0xFE
unused_utf8
109
<0x6A>
3.55382e-08
106
0x6A
ascii
28768: j
58
<0x37>
3.5545e-08
55
0x37
ascii
28787: 7
57
<0x36>
3.55506e-08
54
0x36
ascii
28784: 6
119
<0x74>
3.55867e-08
116
0x74
ascii
28707: t
75
<0x48>
3.55958e-08
72
0x48
ascii
28769: H
251
<0xF8>
3.56107e-08
248
0xF8
unused_utf8
46
<0x2B>
3.5626e-08
43
0x2B
ascii
28806: +
7
<0x04>
3.56293e-08
4
0x04
ascii
30724: \x04
55
<0x34>
3.56711e-08
52
0x34
ascii
28781: 4
76
<0x49>
3.56851e-08
73
0x49
ascii
28737: I
111
<0x6C>
3.56981e-08
108
0x6C
ascii
28714: l
113
<0x6E>
3.5743e-08
110
0x6E
ascii
28711: n
107
<0x68>
3.57451e-08
104
0x68
ascii
28716: h
255
<0xFC>
3.57592e-08
252
0xFC
unused_utf8
123
<0x78>
3.57783e-08
120
0x78
ascii
28744: x
116
<0x71>
3.5781e-08
113
0x71
ascii
28775: q
8
<0x05>
3.57865e-08
5
0x05
ascii
30550: \x05
94
<0x5B>
3.57902e-08
91
0x5B
ascii
28792: [
20
<0x11>
3.57971e-08
17
0x11
ascii
30557: \x11
66
<0x3F>
3.58069e-08
63
0x3F
ascii
28804: ?
5
<0x02>
3.58153e-08
2
0x02
ascii
30551: \x02
254
<0xFB>
3.58166e-08
251
0xFB
unused_utf8
122
<0x77>
3.58311e-08
119
0x77
ascii
28727: w
24
<0x15>
3.58557e-08
21
0x15
ascii
30675: \x15
250
<0xF7>
3.58678e-08
247
0xF7
unused_utf8
126
<0x7B>
3.58801e-08
123
0x7B
ascii
28751: {
195
<0xC0>
3.58839e-08
192
0xC0
unused_utf8
38
<0x23>
3.58914e-08
35
0x23
ascii
28771: #
69
<0x42>
3.58949e-08
66
0x42
ascii
28760: B
248
<0xF5>
3.58961e-08
245
0xF5
unused_utf8
33
<0x1E>
3.59043e-08
30
0x1E
ascii
31150: \x1e
9
<0x06>
3.59044e-08
6
0x06
ascii
30314: \x06
31
<0x1C>
3.59153e-08
28
0x1C
ascii
31134: \x1c
30
<0x1B>
3.59261e-08
27
0x1B
ascii
30246: \x1b
84
<0x51>
3.59478e-08
81
0x51
ascii
28824: Q
128
<0x7D>
3.59607e-08
125
0x7D
ascii
28752: }
77
<0x4A>
3.59646e-08
74
0x4A
ascii
28798: J
95
<0x5C>
3.5978e-08
92
0x5C
ascii
28756: \
39
<0x24>
3.59826e-08
36
0x24
ascii
28776: $
65
<0x3E>
3.59866e-08
62
0x3E
ascii
28767: >
80
<0x4D>
3.59903e-08
77
0x4D
ascii
28755: M
83
<0x50>
3.60037e-08
80
0x50
ascii
28753: P
125
<0x7A>
3.60055e-08
122
0x7A
ascii
28764: z
121
<0x76>
3.60173e-08
118
0x76
ascii
28728: v
63
<0x3C>
3.6018e-08
60
0x3C
ascii
28789: <
61
<0x3A>
3.60263e-08
58
0x3A
ascii
28747: :
78
<0x4B>
3.60284e-08
75
0x4B
ascii
28796: K
56
<0x35>
3.60342e-08
53
0x35
ascii
28782: 5
114
<0x6F>
3.60457e-08
111
0x6F
ascii
28709: o
21
<0x12>
3.60554e-08
18
0x12
ascii
30298: \x12
49
<0x2E>
3.60665e-08
46
0x2E
ascii
28723: .
258
<0xFF>
3.60827e-08
255
0xFF
unused_utf8
70
<0x43>
3.60944e-08
67
0x43
ascii
28743: C
29
<0x1A>
3.61049e-08
26
0x1A
ascii
30759: \x1a
124
<0x79>
3.61051e-08
121
0x79
ascii
28724: y
87
<0x54>
3.61168e-08
84
0x54
ascii
28738: T
103
<0x64>
3.61285e-08
100
0x64
ascii
28715: d
26
<0x17>
3.61327e-08
23
0x17
ascii
30841: \x17
86
<0x53>
3.61473e-08
83
0x53
ascii
28735: S
198
<0xC3>
3.61504e-08
195
0xC3
utf8
101
<0x62>
3.61512e-08
98
0x62
ascii
28726: b
23
<0x14>
3.6152e-08
20
0x14
ascii
30721: \x14
62
<0x3B>
3.61538e-08
59
0x3B
ascii
28745: ;
16
<0x0D>
3.6177e-08
13
0x0D
ascii
28801: \r
253
<0xFA>
3.61815e-08
250
0xFA
unused_utf8
97
<0x5E>
3.61834e-08
94
0x5E
ascii
28815: ^
252
<0xF9>
3.62119e-08
249
0xF9
unused_utf8
89
<0x56>
3.62145e-08
86
0x56
ascii
28790: V
68
<0x41>
3.62149e-08
65
0x41
ascii
28741: A
196
<0xC1>
3.62408e-08
193
0xC1
unused_utf8
53
<0x32>
3.62499e-08
50
0x32
ascii
28750: 2
117
<0x72>
3.62567e-08
114
0x72
ascii
28712: r
40
<0x25>
3.62725e-08
37
0x25
ascii
28823: %
99
<0x60>
3.62907e-08
96
0x60
ascii
28832: `
120
<0x75>
3.63033e-08
117
0x75
ascii
28718: u
25
<0x16>
3.63129e-08
22
0x16
ascii
30935: \x16
67
<0x40>
3.63172e-08
64
0x40
ascii
28818: @
45
<0x2A>
3.63192e-08
42
0x2A
ascii
28736: *
43
<0x28>
3.63264e-08
40
0x28
ascii
28732: (
100
<0x61>
3.63328e-08
97
0x61
ascii
28708: a
14
<0x0B>
3.63343e-08
11
0x0B
ascii
30638: \x0b
52
<0x31>
3.6353e-08
49
0x31
ascii
28740: 1
90
<0x57>
3.63732e-08
87
0x57
ascii
28780: W
79
<0x4C>
3.63958e-08
76
0x4C
ascii
28758: L
249
<0xF6>
3.6409e-08
246
0xF6
unused_utf8
108
<0x69>
3.64194e-08
105
0x69
ascii
28710: i
105
<0x66>
3.64334e-08
102
0x66
ascii
28722: f
37
<0x22>
3.64461e-08
34
0x22
ascii
28739: "
110
<0x6B>
3.64492e-08
107
0x6B
ascii
28729: k
59
<0x38>
3.64785e-08
56
0x38
ascii
28783: 8
44
<0x29>
3.64788e-08
41
0x29
ascii
28731: )
18
<0x0F>
3.64868e-08
15
0x0F
ascii
30698: \x0f
35
<0x20>
3.65155e-08
32
0x20
ascii
28705: ▁
130
<0x7F>
3.6529e-08
127
0x7F
ascii
30982: \x7f
60
<0x39>
3.65306e-08
57
0x39
ascii
28774: 9
64
<0x3D>
3.65521e-08
61
0x3D
ascii
28746: =
88
<0x55>
3.65631e-08
85
0x55
ascii
28779: U
91
<0x58>
3.66015e-08
88
0x58
ascii
28814: X
98
<0x5F>
3.6638e-08
95
0x5F
ascii
28730: _
51
<0x30>
3.66819e-08
48
0x30
ascii
28734: 0
85
<0x52>
3.66943e-08
82
0x52
ascii
28754: R
11
<0x08>
3.67145e-08
8
0x08
ascii
31129: \x08
34
<0x1F>
3.67174e-08
31
0x1F
ascii
31217: \x1f
82
<0x4F>
3.67361e-08
79
0x4F
ascii
28762: O
36
<0x21>
3.67627e-08
33
0x21
ascii
28808: !
74
<0x47>
3.68409e-08
71
0x47
ascii
28777: G
81
<0x4E>
3.68774e-08
78
0x4E
ascii
28759: N
106
<0x67>
3.68857e-08
103
0x67
ascii
28721: g
50
<0x2F>
3.68978e-08
47
0x2F
ascii
28748: /
28
<0x19>
3.70187e-08
25
0x19
ascii
30969: \x19
96
<0x5D>
3.70761e-08
93
0x5D
ascii
28793: ]
54
<0x33>
3.71352e-08
51
0x33
ascii
28770: 3
22
<0x13>
3.72037e-08
19
0x13
ascii
30453: \x13
129
<0x7E>
3.72633e-08
126
0x7E
ascii
28845: ~
15
<0x0C>
0.000123783
12
0x0C
ascii
29683: \x0c
197
<0xC2>
0.000208018
194
0xC2
utf8
2 entries below threshold of 0.182
token_id
token
indicator
max_prob
1
<s>
3.60892e-08
0.00031
0
<unk>
3.64456e-08
0.00031