Merge pull request #3054 from vespa-engine/jobergum/document-colbert
Add colbert doc
Jo Kristian Bergum authored Jan 10, 2024
2 parents e82656b + 884d352 commit 21449de
Showing 2 changed files with 288 additions and 12 deletions.
164 changes: 152 additions & 12 deletions en/embedding.html
@@ -98,13 +98,13 @@ <h3 id="huggingface-embedder">Huggingface Embedder</h3>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="e5" type="hugging-face-embedder">
<transformer-model path="my-models/model.onnx"/>
<tokenizer-model url="https://huggingface.co/intfloat/e5-base-v2/raw/main/tokenizer.json"/>
<transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/>
<tokenizer-model url="https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"/>
</component>
...
</container>{% endhighlight %}</pre>
<p>See <a href="reference/embedding-reference.html#huggingface-embedder-reference-config">configuration reference</a> for all the parameters.
Normalization and pooling strategy (mean, cls) can also be configured for the Huggingface embedder.
Normalization and embedding pooling strategy (mean, cls) can be configured for the Huggingface embedder.
</p>
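<p>As an illustration (not the embedder's actual implementation), mean pooling averages the
token embeddings from the transformer output, while cls pooling uses the first token's embedding;
normalization scales the result to unit length:</p>
<pre>{% highlight python %}
import numpy as np

# token_embeddings: transformer output, shape (tokens, hidden_size)
token_embeddings = np.random.randn(7, 384).astype(np.float32)
attention_mask = np.array([1, 1, 1, 1, 1, 0, 0], dtype=np.float32)  # 0 = padding

# mean pooling: average over the non-padded tokens
mask = attention_mask[:, None]
mean_pooled = (token_embeddings * mask).sum(axis=0) / mask.sum()

# cls pooling: use the first ([CLS]) token's embedding
cls_pooled = token_embeddings[0]

# normalization: unit length, so dot product equals cosine similarity
normalized = mean_pooled / np.linalg.norm(mean_pooled)
print(normalized.shape)  # (384,)
{% endhighlight %}</pre>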

<h4 id="huggingface-embedder-models">Huggingface embedder models</h4>
@@ -117,7 +117,7 @@ <h4 id="huggingface-embedder-models">Huggingface embedder models</h4>
<ul>
<li><a href="https://huggingface.co/intfloat/e5-small-v2">intfloat/e5-small-v2</a> produces <code>tensor&lt;float&gt;(x[384])</code></li>
<li><a href="https://huggingface.co/intfloat/e5-base-v2">intfloat/e5-base-v2</a> produces <code>tensor&lt;float&gt;(x[768])</code></li>
<li><a href="https://huggingface.co/intfloat/e5-large-v2">intfloat/e5-large-v2 produces <code>tensor&lt;float&gt;(x[1024])</code></a></li>
<li><a href="https://huggingface.co/intfloat/e5-large-v2">intfloat/e5-large-v2</a> produces <code>tensor&lt;float&gt;(x[1024])</code></li>
<li><a href="https://huggingface.co/intfloat/multilingual-e5-base">intfloat/multilingual-e5-base</a> produces <code>tensor&lt;float&gt;(x[768])</code></li>
<li><a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">sentence-transformers/all-MiniLM-L6-v2</a> produces <code>tensor&lt;float&gt;(x[384])</code></li>
<li><a href="https://huggingface.co/sentence-transformers/all-mpnet-base-v2">sentence-transformers/all-mpnet-base-v2</a> produces <code>tensor&lt;float&gt;(x[768])</code></li>
@@ -152,15 +152,15 @@ <h3 id="bert-embedder">Bert embedder</h3>
<pre>{% highlight xml %}
<container version="1.0">
<component id="myBert" type="bert-embedder">
<transformer-model path="models/e5-small-v2.onnx"/>
<transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/>
<tokenizer-vocab url="https://huggingface.co/intfloat/e5-small-v2/raw/main/vocab.txt"/>
<max-tokens>128</max-tokens>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>transformer-model</code> specifies the embedding model in <a href="https://onnx.ai/">ONNX</a>.
The <code>transformer-model</code> specifies the embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="#onnx-export">exporting models to ONNX</a>
for how to export embedding models from Huggingface to a compatible <a href="https://onnx.ai/">ONNX</a> format.
</li>
@@ -174,15 +174,153 @@ <h3 id="bert-embedder">Bert embedder</h3>
Normalization and pooling strategy (mean, cls) can also be configured for the Bert embedder.
</p>

<h3 id="colbert-embedder">Colbert embedder</h3>
<p>
An embedder supporting <a href="https://github.com/stanford-futuredata/ColBERT">ColBERT</a> models. The
Vespa ColBERT embedder maps text to token embeddings, representing text as <strong>multiple</strong>
contextualized vector embeddings. </p>

<p>
This embedder distinguishes itself from the <a href="#bert-embedder">bert-embedder</a> and
<a href="#huggingface-embedder">hugging-face-embedder</a>, which use pooling operations to derive a single embedding representation for the entire text.
In contrast, ColBERT represents text using multiple vector representations, one for each token.
Retaining one vector per token preserves more of the contextualized representation of the text
than compressing all the tokens in the language model context window into a single vector.</p>

<p>
In Vespa, the ColBERT bag of token embeddings is represented as a
<a href="tensor-user-guide.html#tensor-concepts">mixed tensor</a>: <code>tensor&lt;float&gt;(t{}, x[dim])</code> where
<code>dim</code> is the vector dimensionality of the contextualized token embeddings. The <a href="https://huggingface.co/colbert-ir/colbertv2.0">colbert model checkpoint</a>
on Hugging Face hub uses 128 dimensions.
</p>
<p>
The embedder destination tensor is defined in the Vespa <a href="schemas.html">schema</a>, and
depending on the target <a href="reference/tensor.html#tensor-type-spec">tensor cell precision</a> definition,
the embedder might compress the representation.
</p>
<p>
If using <code>int8</code> as the target tensor cell type, the colbert embedder compresses the document-side token embeddings with binarization,
using 1 bit per dimension. This reduces the token embedding storage footprint by 32x
compared to using float with 4 bytes per dimension. The query representation is not compressed with binarization.
The following schema snippet demonstrates two ways to use the colbert embedder in
the document schema to <a href="#embedding-a-document-field">embed a document field</a>.
</p>

<pre>
schema doc {
    document doc {
        field text type string {..}
    }
    field colbert_tokens type tensor&lt;float&gt;(t{}, x[128]) {
        indexing: input text | embed colbert | attribute
    }
    field colbert_tokens_compressed type tensor&lt;int8&gt;(t{}, x[16]) {
        indexing: input text | embed colbert | attribute
    }
}
</pre>
<p>The first field, <code>colbert_tokens</code>, stores the original representation, as the destination tensor
cell type is <code>float</code>. The second field, <code>colbert_tokens_compressed</code>, is stored binarized.
When using <code>int8</code> tensor cell precision, the
dimensionality must be the original dimensionality divided by 8 (128/8 = 16).</p>
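<p>As an illustration, the following numpy sketch packs one set of 128-dimensional float token embeddings into
16 <code>int8</code> values per token, assuming a sign-based threshold at zero
(the embedder performs this compression internally; the exact scheme may differ):</p>
<pre>{% highlight python %}
import numpy as np

def binarize_token_embeddings(embeddings):
    """Pack float token embeddings (t, 128) into int8 (t, 16)."""
    bits = (embeddings > 0).astype(np.uint8)  # one bit per dimension
    packed = np.packbits(bits, axis=1)        # 8 bits per byte: (t, 16)
    return packed.astype(np.int8)             # int8 tensor cells

tokens = np.random.randn(4, 128).astype(np.float32)  # 4 token embeddings
print(binarize_token_embeddings(tokens).shape)       # (4, 16), 32x smaller
{% endhighlight %}</pre>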

<p>One can also use <code>bfloat16</code> instead of <code>float</code> to halve the storage footprint.</p>
<pre>
field colbert_tokens type tensor&lt;bfloat16&gt;(t{}, x[128]) {
    indexing: input text | embed colbert | attribute
}
</pre>
<p>Note that the above schema examples do not specify the <code>index</code> keyword for enabling <a href="approximate-nn-hnsw.html">HNSW</a>;
currently, the colbert representation is intended to be used as a ranking model, not for retrieval with Vespa's nearestNeighbor query operator.</p>

<p>
To reduce the in-memory footprint, the token embedding fields can be defined as <a href="#paged-attributes">paged attributes</a>.
</p>
<p>
The colbert embedder is configured in <a href="reference/services.html">services.xml</a>,
within the <code>container</code> tag. The following example points to the colbert checkpoint on Hugging Face Hub:
</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="colbert" type="colbert-embedder">
<transformer-model url="https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"/>
<tokenizer-model url="https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"/>
<max-query-tokens>32</max-query-tokens>
<max-document-tokens>128</max-document-tokens>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>transformer-model</code> specifies the colbert embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="#onnx-export">exporting models to ONNX</a>
for how to export embedding models from Huggingface to a compatible <a href="https://onnx.ai/">ONNX</a> format.
The <a href="https://huggingface.co/vespa-engine/col-minilm">vespa-engine/col-minilm</a> page on the HF
model hub has a detailed example of how to export a colbert checkpoint to ONNX format for accelerated inference.
</li>
<li>
The <code>tokenizer-model</code> specifies the Huggingface <code>tokenizer.json</code> formatted file.
See <a href="https://huggingface.co/transformers/v4.8.0/fast_tokenizers.html#loading-from-a-json-file">HF loading tokenizer from a JSON file</a>.
</li>
<li>
The <code>max-query-tokens</code> parameter controls the maximum number of query text tokens that are represented as vectors, and
similarly <code>max-document-tokens</code> controls the document side. These parameters
can be used to control resource usage; see the sketch after this list.
</li>
</ul>
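<p>Conceptually, queries are padded with mask tokens up to <code>max-query-tokens</code> (ColBERT query augmentation),
while documents are only truncated. A simplified sketch of this token preparation, assuming the default token ids listed in the
<a href="reference/embedding-reference.html#colbert-embedder-reference-config">reference configuration</a>
(the exact arithmetic inside the embedder may differ):</p>
<pre>{% highlight python %}
# Simplified sketch; token ids follow the reference defaults:
# CLS=101, SEP=102, MASK=103, query marker=1, document marker=2.
CLS, SEP, MASK = 101, 102, 103
QUERY_MARKER, DOC_MARKER = 1, 2

def prepare_query(token_ids, max_query_tokens=32):
    ids = [CLS, QUERY_MARKER] + token_ids[:max_query_tokens - 3] + [SEP]
    # Pad with mask tokens up to the fixed query length (query augmentation)
    return ids + [MASK] * (max_query_tokens - len(ids))

def prepare_document(token_ids, max_document_tokens=128):
    # Documents are truncated, never padded
    return [CLS, DOC_MARKER] + token_ids[:max_document_tokens - 3] + [SEP]

print(prepare_query([2054, 2003, 9530]))  # hypothetical WordPiece token ids
{% endhighlight %}</pre>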
<p>See <a href="reference/embedding-reference.html#colbert-embedder-reference-config">configuration reference</a> for all
configuration options and defaults.</p>

<h4 id="colbert-ranking">ColBert ranking</h4>
<p>As mentioned above, the Vespa colbert embedder is not intended for <em>retrieval</em> with
Vespa's <a href="approximate-nn-hnsw.html">approximate nearest neighbor search</a>, but for <em>ranking</em>.</p>

<p>
The following <a href="ranking.html">rank-profile</a> is an example of
using a <a href="reference/ranking-expressions.html#tensor-functions">Vespa tensor expression</a> to
express the ColBERT MaxSim operator between the query and document representations.
The example uses <a href="phased-ranking.html">phased ranking</a>, and
scores with the compressed document representation, unpacked with <a href="reference/ranking-expressions.html#unpack-bits">unpack_bits</a>.
Note that the query tensor is not compressed; this is an example
of asymmetric compression, as the query representation does not need to be compressed.
</p>
<pre>
rank-profile max-sim inherits default {
    inputs {
        query(qt) tensor&lt;float&gt;(qt{}, x[128])
    }
    function unpack() {
        expression: unpack_bits(attribute(colbert_tokens_compressed))
    }
    first-phase {
        expression: nativeRank(text) # example
    }
    second-phase {
        rerank-count: 1000
        expression {
            sum(
                reduce(
                    sum(
                        query(qt) * unpack(), x
                    ),
                    max, t
                ),
                qt
            )
        }
    }
}
</pre>
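<p>The expression computes the ColBERT MaxSim score: for each query token vector, take the maximum dot product over
all document token vectors, then sum these maxima over the query tokens. An equivalent numpy sketch (illustrative only):</p>
<pre>{% highlight python %}
import numpy as np

def max_sim(query_tokens, doc_tokens):
    """query_tokens (q, 128), doc_tokens (t, 128) -> scalar MaxSim score."""
    sim = query_tokens @ doc_tokens.T    # sum(query(qt) * unpack(), x)
    return float(sim.max(axis=1).sum())  # reduce(..., max, t), then sum over qt

q = np.random.randn(32, 128).astype(np.float32)  # query token embeddings
d = np.random.randn(96, 128).astype(np.float32)  # unpacked document embeddings
print(max_sim(q, d))
{% endhighlight %}</pre>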

<h2 id="embedding-a-query-text">Embedding a query text</h2>
<p>Where you would otherwise supply a tensor representing the vector point in a query,
<p>Where you would otherwise supply a tensor in a query request,
you can, with an embedder configured, instead supply any text enclosed in <code>embed()</code>, e.g.:</p>

<pre>
input.query(q)=<span class="pre-hilite">embed(myEmbedderId, "Hello%20world")</span>
</pre>
<p>If you have only configured a single embedder, you can skip the embedder id argument and optionally also the quotes. Prefer
to specify the embedder id as introducing more embedder models requires specifying the identifier.
to specify the embedder id, as introducing more embedder models requires the identifier to be explicit.
</p>
<p>Both single (') and double (") quotes are permitted.</p>
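<p>For example, a minimal query request using <code>embed</code> via the query API with the Python
<code>requests</code> library (the endpoint, schema, field, and embedder id are illustrative assumptions):</p>
<pre>{% highlight python %}
import requests

# Hypothetical local endpoint, 'doc' schema with an 'embedding' field,
# and an embedder with id 'e5' configured in services.xml
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding, q)",
        "input.query(q)": 'embed(e5, "Hello world")',
        "ranking": "semantic",
    },
)
print(response.json())
{% endhighlight %}</pre>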

@@ -196,7 +334,7 @@ <h2 id="embedding-a-query-text">Embedding a query text</h2>
</pre>
<p>Output from <code>embed</code> that cannot fit into the tensor dimensionality is truncated, retaining only the first values.</p>

<p>A single Vespa <a href="query-api.html#http">query</a> can use multiple embedders or embed multiple texts with the same embedder:</p>
<p>A single Vespa <a href="query-api.html#http">query</a> request can use multiple embedders or embed multiple texts with the same embedder:</p>
<pre>{% highlight json %}
{
"yql": "select id,title from paragraph where ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(embedding,q2)) or userQuery()",
@@ -216,7 +354,7 @@ <h2 id="embedding-a-query-text">Embedding a query text</h2>
"ranking": "semantic",
}{% endhighlight %}</pre>
<p>Using the same embedding tensor as input to two nearestNeighbor query operators, searching two different embedding fields. For this to
work, both <em>embedding</em> and <em>question_embedding</em> must have the same dimensionality.
work, both <em>embedding</em> and <em>question_embedding</em> fields must have the same dimensionality.
</p>


Expand Down Expand Up @@ -369,7 +507,7 @@ <h3 id="onnx-debug">Debugging ONNX models</h3>
default <a href="reference/embedding-reference.html#bert-embedder-reference-config">bert-embedder parameters</a>.
</p>
<p>
If loading models without the expected input and output parameter names, the Vespa Container node will not start
If loading models without the expected input and output parameter names, the container service will not start
(check <em>vespa.log</em> in the container running Vespa):
</p>
<pre>
@@ -388,13 +526,15 @@ <h3 id="onnx-debug">Debugging ONNX models</h3>
Waiting up to 5m0s for query service to become available ...
Error: service 'query' is unavailable: services have not converged
</pre>
<p>Embedders support changing the input and output names; consult the <a href="reference/embedding-reference.html">embedding reference</a>
documentation.</p>

<h2 id="embedder-performance">Embedder performance</h2>
<p>Embedding inference can be resource intensive for larger embedding models. Factors that impact performance:</p>

<ul>
<li>The embedding model parameters. Larger models are more expensive to evaluate than smaller models.</li>
<li>The sequence input length. Transformer type models scales quadratic with input length. Since queries
<li>The sequence input length. Transformer models scale quadratically with input length. Since queries
are typically shorter than documents, embedding queries is less resource intensive than embedding documents.
</li>
<li>
136 changes: 136 additions & 0 deletions en/reference/embedding-reference.html
@@ -225,6 +225,142 @@ <h3 id="bert-embedder-reference-config">Bert embedder reference config</h3>
</tbody>
</table>

<h2 id="colbert-embedder">colbert embedder</h2>
<p>
The colbert embedder is configured in <a href="services.html">services.xml</a>,
within the <code>container</code> tag:
</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="colbert" type="colbert-embedder">
<transformer-model path="models/colbertv2.onnx"/>
<tokenizer-model url="https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"/>
<max-query-tokens>32</max-query-tokens>
<max-document-tokens>256</max-document-tokens>
</component>
</container>
{% endhighlight %}</pre>
<p>The Vespa colbert implementation works with default configurations for transformer models that use WordPiece tokenization.
</p>

<h3 id="colbert-embedder-reference-config">colbert embedder reference config</h3>
<p>In addition to <a href="#embedder-onnx-reference-config">embedder ONNX parameters</a>:</p>
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Occurrence</th>
<th>Description</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>transformer-model</td>
<td>One</td>
<td>Use to point to the transformer ColBERT ONNX model file</td>
<td><a href="#model-config-reference">model-type</a></td>
<td>N/A</td>
</tr>
<tr>
<td>tokenizer-model</td>
<td>One</td>
<td>Use to point to the <code>tokenizer.json</code> Huggingface tokenizer configuration file</td>
<td><a href="#model-config-reference">model-type</a></td>
<td>N/A</td>
</tr>
<tr>
<td>max-tokens</td>
<td>One</td>
<td>Max length of token sequence the transformer-model can handle </td>
<td>numeric</td>
<td>512</td>
</tr>
<tr>
<td>max-query-tokens</td>
<td>One</td>
<td>The maximum number of ColBERT query token embeddings. Queries are padded to this length. Must be lower than max-tokens</td>
<td>numeric</td>
<td>32</td>
</tr>
<tr>
<td>max-document-tokens</td>
<td>One</td>
<td>The maximum number of ColBERT document token embeddings. Documents are not padded. Must be lower than max-tokens</td>
<td>numeric</td>
<td>512</td>
</tr>
<tr>
<td>transformer-input-ids</td>
<td>One</td>
<td>The name or identifier for the transformer input IDs</td>
<td>string</td>
<td>input_ids</td>
</tr>
<tr>
<td>transformer-attention-mask</td>
<td>One</td>
<td>The name or identifier for the transformer attention mask</td>
<td>string</td>
<td>attention_mask</td>
</tr>
<tr>
<td>transformer-mask-token</td>
<td>One</td>
<td>The mask token id used for ColBERT query padding</td>
<td>numeric</td>
<td>103</td>
</tr>
<tr>
<td>transformer-start-sequence-token</td>
<td>One</td>
<td>The start of sequence token id</td>
<td>numeric</td>
<td>101</td>
</tr>
<tr>
<td>transformer-end-sequence-token</td>
<td>One</td>
<td>The end of sequence token id</td>
<td>numeric</td>
<td>102</td>
</tr>
<tr>
<td>transformer-pad-token</td>
<td>One</td>
<td>The pad sequence token id</td>
<td>numeric</td>
<td>0</td>
</tr>
<tr>
<td>query-token-id</td>
<td>One</td>
<td>The colbert query token marker id</td>
<td>numeric</td>
<td>1</td>
</tr>
<tr>
<td>document-token-id</td>
<td>One</td>
<td>The colbert document token marker id</td>
<td>numeric</td>
<td>2</td>
</tr>
<tr>
<td>transformer-output</td>
<td>One</td>
<td>The name or identifier for the transformer output</td>
<td>string</td>
<td>contextual</td>
</tr>
</tbody>
</table>
<p>The Vespa colbert-embedder uses <code>[unused0]</code> (token id 1) as the <code>query-token-id</code> marker, and
<code>[unused1]</code> (token id 2) as the <code>document-token-id</code> marker. Document punctuation characters are
filtered (not configurable); the following characters are removed:
<code>!"#$%&amp;'()*+,-./:;&lt;=&gt;?@[\]^_`{|}~</code>.
</p>
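<p>Incidentally, the removed character set is exactly Python's <code>string.punctuation</code>, so the document-side
filtering can be sketched as follows (illustrative, not the embedder's implementation):</p>
<pre>{% highlight python %}
import string

def filter_document_text(text):
    """Remove the punctuation characters the colbert embedder filters."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(filter_document_text("Hello, world! (ColBERT)"))  # Hello world ColBERT
{% endhighlight %}</pre>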

<h2 id="huggingface-tokenizer-embedder">Huggingface tokenizer embedder</h2>
<p>
The Huggingface tokenizer embedder is configured in <a href="services.html">services.xml</a>,
