Merge pull request #3054 from vespa-engine/jobergum/document-colbert
Add colbert doc
Jo Kristian Bergum authored Jan 10, 2024
2 parents e82656b + 884d352 commit 21449de
Showing 2 changed files with 288 additions and 12 deletions.
164 changes: 152 additions & 12 deletions en/embedding.html
@@ -98,13 +98,13 @@ <h3 id="huggingface-embedder">Huggingface Embedder</h3>
<pre>{% highlight xml %}
<container id="default" version="1.0">
<component id="e5" type="hugging-face-embedder">
<transformer-model path="my-models/model.onnx"/>
<tokenizer-model url="https://huggingface.co/intfloat/e5-base-v2/raw/main/tokenizer.json"/>
<transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/>
<tokenizer-model url="https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"/>
</component>
...
</container>{% endhighlight %}</pre>
<p>See <a href="reference/embedding-reference.html#huggingface-embedder-reference-config">configuration reference</a> for all the parameters.
Normalization and pooling strategy (mean, cls) can also be configured for the Huggingface embedder.
Normalization and embedding pooling strategy (mean, cls) can be configured for the Huggingface embedder.
</p>
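<p>As an illustration (not the embedder's actual implementation), mean pooling averages the
token embeddings from the transformer output, while cls pooling uses the first token's embedding;
normalization scales the result to unit length:</p>
<pre>{% highlight python %}
import numpy as np

# token_embeddings: transformer output, shape (tokens, hidden_size)
token_embeddings = np.random.randn(7, 384).astype(np.float32)
attention_mask = np.array([1, 1, 1, 1, 1, 0, 0], dtype=np.float32)  # 0 = padding

# mean pooling: average over the non-padded tokens
mask = attention_mask[:, None]
mean_pooled = (token_embeddings * mask).sum(axis=0) / mask.sum()

# cls pooling: use the first ([CLS]) token's embedding
cls_pooled = token_embeddings[0]

# normalization: unit length, so dot product equals cosine similarity
normalized = mean_pooled / np.linalg.norm(mean_pooled)
print(normalized.shape)  # (384,)
{% endhighlight %}</pre>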

<h4 id="huggingface-embedder-models">Huggingface embedder models</h4>
@@ -117,7 +117,7 @@ <h4 id="huggingface-embedder-models">Huggingface embedder models</h4>
<ul>
<li><a href="https://huggingface.co/intfloat/e5-small-v2">intfloat/e5-small-v2</a> produces <code>tensor&lt;float&gt;(x[384])</code></li>
<li><a href="https://huggingface.co/intfloat/e5-base-v2">intfloat/e5-base-v2</a> produces <code>tensor&lt;float&gt;(x[768])</code></li>
<li><a href="https://huggingface.co/intfloat/e5-large-v2">intfloat/e5-large-v2 produces <code>tensor&lt;float&gt;(x[1024])</code></a></li>
<li><a href="https://huggingface.co/intfloat/e5-large-v2">intfloat/e5-large-v2</a> produces <code>tensor&lt;float&gt;(x[1024])</code></li>
<li><a href="https://huggingface.co/intfloat/multilingual-e5-base">intfloat/multilingual-e5-base</a> produces <code>tensor&lt;float&gt;(x[768])</code></li>
<li><a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">sentence-transformers/all-MiniLM-L6-v2</a> produces <code>tensor&lt;float&gt;(x[384])</code></li>
<li><a href="https://huggingface.co/sentence-transformers/all-mpnet-base-v2">sentence-transformers/all-mpnet-base-v2</a> produces <code>tensor&lt;float&gt;(x[768])</code></li>
@@ -152,15 +152,15 @@ <h3 id="bert-embedder">Bert embedder</h3>
<pre>{% highlight xml %}
<container version="1.0">
<component id="myBert" type="bert-embedder">
<transformer-model path="models/e5-small-v2.onnx"/>
<transformer-model url="https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"/>
<tokenizer-vocab url="https://huggingface.co/intfloat/e5-small-v2/raw/main/vocab.txt"/>
<max-tokens>128</max-tokens>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>transformer-model</code> specifies the embedding model in <a href="https://onnx.ai/">ONNX</a>.
The <code>transformer-model</code> specifies the embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="#onnx-export">exporting models to ONNX</a>
for how to export embedding models from Huggingface to a compatible <a href="https://onnx.ai/">ONNX</a> format.
</li>
@@ -174,15 +174,153 @@ <h3 id="bert-embedder">Bert embedder</h3>
Normalization and pooling strategy (mean, cls) can also be configured for the Bert embedder.
</p>

<h3 id="colbert-embedder">Colbert embedder</h3>
<p>
An embedder supporting <a href="https://github.com/stanford-futuredata/ColBERT">ColBERT</a> models. The
Vespa ColBERT embedder maps text to token embeddings, representing text as <strong>multiple</strong>
contextualized vector embeddings. </p>

<p>
This embedder distinguishes itself from the <a href="#bert-embedder">bert-embedder</a> and
<a href="#huggingface-embedder">hugging-face-embedder</a>, which use pooling operations to derive a single embedding representation for the entire text.
In contrast, ColBERT represents text using multiple vector representations, one for each token.
Retaining one vector per token preserves more of the contextualized representation of the text
than compressing all the tokens in the language model context window into a single vector.</p>

<p>
In Vespa, the ColBERT bag of token embeddings is represented as a
<a href="tensor-user-guide.html#tensor-concepts">mixed tensor</a>: <code>tensor&lt;float&gt;(t{}, x[dim])</code> where
<code>dim</code> is the vector dimensionality of the contextualized token embeddings. The <a href="https://huggingface.co/colbert-ir/colbertv2.0">colbert model checkpoint</a>
on Hugging Face hub uses 128 dimensions.
</p>
<p>
The embedder destination tensor is defined in the Vespa <a href="schemas.html">schema</a>, and
depending on the target <a href="reference/tensor.html#tensor-type-spec">tensor cell precision</a> definition,
the embedder might compress the representation.
</p>
<p>
If using <code>int8</code> as the target tensor cell type, the colbert embedder compresses the document-side token embeddings with binarization,
using 1 bit per dimension. This reduces the token embedding storage footprint by 32x
compared to using float with 4 bytes per dimension. The query representation is not compressed with binarization.
The following schema snippet demonstrates two ways to use the colbert embedder in
the document schema to <a href="#embedding-a-document-field">embed a document field</a>.
</p>

<pre>
schema doc {
    document doc {
        field text type string {..}
    }
    field colbert_tokens type tensor&lt;float&gt;(t{}, x[128]) {
        indexing: input text | embed colbert | attribute
    }
    field colbert_tokens_compressed type tensor&lt;int8&gt;(t{}, x[16]) {
        indexing: input text | embed colbert | attribute
    }
}
</pre>
<p>The first field, <code>colbert_tokens</code>, stores the original representation, as the destination tensor
cell type is <code>float</code>. The second field, <code>colbert_tokens_compressed</code>, is stored binarized.
When using <code>int8</code> tensor cell precision, the
dimensionality must be the original dimensionality divided by 8 (128/8 = 16).</p>
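<p>As an illustration, the following numpy sketch packs one set of 128-dimensional float token embeddings into
16 <code>int8</code> values per token, assuming a sign-based threshold at zero
(the embedder performs this compression internally; the exact scheme may differ):</p>
<pre>{% highlight python %}
import numpy as np

def binarize_token_embeddings(embeddings):
    """Pack float token embeddings (t, 128) into int8 (t, 16)."""
    bits = (embeddings > 0).astype(np.uint8)  # one bit per dimension
    packed = np.packbits(bits, axis=1)        # 8 bits per byte: (t, 16)
    return packed.astype(np.int8)             # int8 tensor cells

tokens = np.random.randn(4, 128).astype(np.float32)  # 4 token embeddings
print(binarize_token_embeddings(tokens).shape)       # (4, 16), 32x smaller
{% endhighlight %}</pre>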

<p>One can also use <code>bfloat16</code> instead of <code>float</code> to halve the storage footprint.</p>
<pre>
field colbert_tokens type tensor&lt;bfloat16&gt;(t{}, x[128]) {
    indexing: input text | embed colbert | attribute
}
</pre>
<p>Note that the above schema examples do not specify the <code>index</code> keyword for enabling <a href="approximate-nn-hnsw.html">HNSW</a>;
currently, the colbert representation is intended to be used as a ranking model, not for retrieval with Vespa's nearestNeighbor query operator.</p>

<p>
To reduce the in-memory footprint, the token embedding fields can be defined as <a href="#paged-attributes">paged attributes</a>.
</p>
<p>
The colbert embedder is configured in <a href="reference/services.html">services.xml</a>,
within the <code>container</code> tag. The following example points to the colbert checkpoint on Hugging Face Hub:
</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="colbert" type="colbert-embedder">
<transformer-model url="https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"/>
<tokenizer-model url="https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"/>
<max-query-tokens>32</max-query-tokens>
<max-document-tokens>128</max-document-tokens>
</component>
</container>
{% endhighlight %}</pre>
<ul>
<li>
The <code>transformer-model</code> specifies the colbert embedding model in <a href="https://onnx.ai/">ONNX</a> format.
See <a href="#onnx-export">exporting models to ONNX</a>
for how to export embedding models from Huggingface to a compatible <a href="https://onnx.ai/">ONNX</a> format.
The <a href="https://huggingface.co/vespa-engine/col-minilm">vespa-engine/col-minilm</a> page on the HF
model hub has a detailed example of how to export a colbert checkpoint to ONNX format for accelerated inference.
</li>
<li>
The <code>tokenizer-model</code> specifies the Huggingface <code>tokenizer.json</code> formatted file.
See <a href="https://huggingface.co/transformers/v4.8.0/fast_tokenizers.html#loading-from-a-json-file">HF loading tokenizer from a JSON file</a>.
</li>
<li>
The <code>max-query-tokens</code> parameter controls the maximum number of query text tokens that are represented as vectors, and
similarly <code>max-document-tokens</code> controls the document side. These parameters
can be used to control resource usage; see the sketch after this list.
</li>
</ul>
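<p>Conceptually, queries are padded with mask tokens up to <code>max-query-tokens</code> (ColBERT query augmentation),
while documents are only truncated. A simplified sketch of this token preparation, assuming the default token ids listed in the
<a href="reference/embedding-reference.html#colbert-embedder-reference-config">reference configuration</a>
(the exact arithmetic inside the embedder may differ):</p>
<pre>{% highlight python %}
# Simplified sketch; token ids follow the reference defaults:
# CLS=101, SEP=102, MASK=103, query marker=1, document marker=2.
CLS, SEP, MASK = 101, 102, 103
QUERY_MARKER, DOC_MARKER = 1, 2

def prepare_query(token_ids, max_query_tokens=32):
    ids = [CLS, QUERY_MARKER] + token_ids[:max_query_tokens - 3] + [SEP]
    # Pad with mask tokens up to the fixed query length (query augmentation)
    return ids + [MASK] * (max_query_tokens - len(ids))

def prepare_document(token_ids, max_document_tokens=128):
    # Documents are truncated, never padded
    return [CLS, DOC_MARKER] + token_ids[:max_document_tokens - 3] + [SEP]

print(prepare_query([2054, 2003, 9530]))  # hypothetical WordPiece token ids
{% endhighlight %}</pre>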
<p>See <a href="reference/embedding-reference.html#colbert-embedder-reference-config">configuration reference</a> for all
configuration options and defaults.</p>

<h4 id="colbert-ranking">ColBert ranking</h4>
<p>As mentioned above, the Vespa colbert embedder is not intended for <em>retrieval</em> with
Vespa's <a href="approximate-nn-hnsw.html">approximate nearest neighbor search</a>, but for <em>ranking</em>.</p>

<p>
The following <a href="ranking.html">rank-profile</a> is an example of
using a <a href="reference/ranking-expressions.html#tensor-functions">Vespa tensor expression</a> to
express the ColBERT MaxSim operator between the query and document representations.
The example uses <a href="phased-ranking.html">phased ranking</a>, and
scores with the compressed document representation, unpacked with <a href="reference/ranking-expressions.html#unpack-bits">unpack_bits</a>.
Note that the query tensor is not compressed; this is an example
of asymmetric compression, as the query representation does not need to be compressed.
</p>
<pre>
rank-profile max-sim inherits default {
    inputs {
        query(qt) tensor&lt;float&gt;(qt{}, x[128])
    }
    function unpack() {
        expression: unpack_bits(attribute(colbert_tokens_compressed))
    }
    first-phase {
        expression: nativeRank(text) # example
    }
    second-phase {
        rerank-count: 1000
        expression {
            sum(
                reduce(
                    sum(
                        query(qt) * unpack(), x
                    ),
                    max, t
                ),
                qt
            )
        }
    }
}
</pre>
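<p>The expression computes the ColBERT MaxSim score: for each query token vector, take the maximum dot product over
all document token vectors, then sum these maxima over the query tokens. An equivalent numpy sketch (illustrative only):</p>
<pre>{% highlight python %}
import numpy as np

def max_sim(query_tokens, doc_tokens):
    """query_tokens (q, 128), doc_tokens (t, 128) -> scalar MaxSim score."""
    sim = query_tokens @ doc_tokens.T    # sum(query(qt) * unpack(), x)
    return float(sim.max(axis=1).sum())  # reduce(..., max, t), then sum over qt

q = np.random.randn(32, 128).astype(np.float32)  # query token embeddings
d = np.random.randn(96, 128).astype(np.float32)  # unpacked document embeddings
print(max_sim(q, d))
{% endhighlight %}</pre>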

<h2 id="embedding-a-query-text">Embedding a query text</h2>
<p>Where you would otherwise supply a tensor representing the vector point in a query,
<p>Where you would otherwise supply a tensor in a query request,
you can, with an embedder configured, instead supply any text enclosed in <code>embed()</code>, e.g.:</p>

<pre>
input.query(q)=<span class="pre-hilite">embed(myEmbedderId, "Hello%20world")</span>
</pre>
<p>If you have only configured a single embedder, you can skip the embedder id argument and optionally also the quotes. Prefer
to specify the embedder id as introducing more embedder models requires specifying the identifier.
to specify the embedder id, as introducing more embedder models requires the identifier to be explicit.
</p>
<p>Both single (') and double (") quotes are permitted.</p>
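<p>For example, a minimal query request using <code>embed</code> via the query API with the Python
<code>requests</code> library (the endpoint, schema, field, and embedder id are illustrative assumptions):</p>
<pre>{% highlight python %}
import requests

# Hypothetical local endpoint, 'doc' schema with an 'embedding' field,
# and an embedder with id 'e5' configured in services.xml
response = requests.post(
    "http://localhost:8080/search/",
    json={
        "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding, q)",
        "input.query(q)": 'embed(e5, "Hello world")',
        "ranking": "semantic",
    },
)
print(response.json())
{% endhighlight %}</pre>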

@@ -196,7 +334,7 @@ <h2 id="embedding-a-query-text">Embedding a query text</h2>
</pre>
<p>Output from <code>embed</code> that cannot fit into the tensor dimensionality is truncated, retaining only the first values.</p>

<p>A single Vespa <a href="query-api.html#http">query</a> can use multiple embedders or embed multiple texts with the same embedder:</p>
<p>A single Vespa <a href="query-api.html#http">query</a> request can use multiple embedders or embed multiple texts with the same embedder:</p>
<pre>{% highlight json %}
{
"yql": "select id,title from paragraph where ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(embedding,q2)) or userQuery()",
@@ -216,7 +354,7 @@ <h2 id="embedding-a-query-text">Embedding a query text</h2>
"ranking": "semantic",
}{% endhighlight %}</pre>
<p>Using the same embedding tensor as input to two nearestNeighbor query operators, searching two different embedding fields. For this to
work, both <em>embedding</em> and <em>question_embedding</em> must have the same dimensionality.
work, both <em>embedding</em> and <em>question_embedding</em> fields must have the same dimensionality.
</p>


Expand Down Expand Up @@ -369,7 +507,7 @@ <h3 id="onnx-debug">Debugging ONNX models</h3>
default <a href="reference/embedding-reference.html#bert-embedder-reference-config">bert-embedder parameters</a>.
</p>
<p>
If loading models without the expected input and output parameter names, the Vespa Container node will not start
If loading models without the expected input and output parameter names, the container service will not start
(check <em>vespa.log</em> in the container running Vespa):
</p>
<pre>
@@ -388,13 +526,15 @@ <h3 id="onnx-debug">Debugging ONNX models</h3>
Waiting up to 5m0s for query service to become available ...
Error: service 'query' is unavailable: services have not converged
</pre>
<p>Embedders support changing the input and output names; consult the <a href="reference/embedding-reference.html">embedding reference</a>
documentation.</p>

<h2 id="embedder-performance">Embedder performance</h2>
<p>Embedding inference can be resource intensive for larger embedding models. Factors that impact performance:</p>

<ul>
<li>The embedding model parameters. Larger models are more expensive to evaluate than smaller models.</li>
<li>The sequence input length. Transformer type models scales quadratic with input length. Since queries
<li>The sequence input length. Transformer models scale quadratically with input length. Since queries
are typically shorter than documents, embedding queries is less resource intensive than embedding documents.
</li>
<li>
136 changes: 136 additions & 0 deletions en/reference/embedding-reference.html
@@ -225,6 +225,142 @@ <h3 id="bert-embedder-reference-config">Bert embedder reference config</h3>
</tbody>
</table>

<h2 id="colbert-embedder">colbert embedder</h2>
<p>
The colbert embedder is configured in <a href="services.html">services.xml</a>,
within the <code>container</code> tag:
</p>
<pre>{% highlight xml %}
<container version="1.0">
<component id="colbert" type="colbert-embedder">
<transformer-model path="models/colbertv2.onnx"/>
<tokenizer-model url="https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"/>
<max-query-tokens>32</max-query-tokens>
<max-document-tokens>256</max-document-tokens>
</component>
</container>
{% endhighlight %}</pre>
<p>The Vespa colbert implementation works with default configurations for transformer models that use WordPiece tokenization.
</p>

<h3 id="colbert-embedder-reference-config">colbert embedder reference config</h3>
<p>In addition to <a href="#embedder-onnx-reference-config">embedder ONNX parameters</a>:</p>
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Occurrence</th>
<th>Description</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>transformer-model</td>
<td>One</td>
<td>Use to point to the transformer ColBERT ONNX model file</td>
<td><a href="#model-config-reference">model-type</a></td>
<td>N/A</td>
</tr>
<tr>
<td>tokenizer-model</td>
<td>One</td>
<td>Use to point to the <code>tokenizer.json</code> Huggingface tokenizer configuration file</td>
<td><a href="#model-config-reference">model-type</a></td>
<td>N/A</td>
</tr>
<tr>
<td>max-tokens</td>
<td>One</td>
<td>Max length of token sequence the transformer-model can handle </td>
<td>numeric</td>
<td>512</td>
</tr>
<tr>
<td>max-query-tokens</td>
<td>One</td>
<td>The maximum number of ColBERT query token embeddings. Queries are padded to this length. Must be lower than max-tokens</td>
<td>numeric</td>
<td>32</td>
</tr>
<tr>
<td>max-document-tokens</td>
<td>One</td>
<td>The maximum number of ColBERT document token embeddings. Documents are not padded. Must be lower than max-tokens</td>
<td>numeric</td>
<td>512</td>
</tr>
<tr>
<td>transformer-input-ids</td>
<td>One</td>
<td>The name or identifier for the transformer input IDs</td>
<td>string</td>
<td>input_ids</td>
</tr>
<tr>
<td>transformer-attention-mask</td>
<td>One</td>
<td>The name or identifier for the transformer attention mask</td>
<td>string</td>
<td>attention_mask</td>
</tr>
<tr>
<td>transformer-mask-token</td>
<td>One</td>
<td>The mask token id used for ColBERT query padding</td>
<td>numeric</td>
<td>103</td>
</tr>
<tr>
<td>transformer-start-sequence-token</td>
<td>One</td>
<td>The start of sequence token id</td>
<td>numeric</td>
<td>101</td>
</tr>
<tr>
<td>transformer-end-sequence-token</td>
<td>One</td>
<td>The end of sequence token id</td>
<td>numeric</td>
<td>102</td>
</tr>
<tr>
<td>transformer-pad-token</td>
<td>One</td>
<td>The pad sequence token id</td>
<td>numeric</td>
<td>0</td>
</tr>
<tr>
<td>query-token-id</td>
<td>One</td>
<td>The colbert query token marker id</td>
<td>numeric</td>
<td>1</td>
</tr>
<tr>
<td>document-token-id</td>
<td>One</td>
<td>The colbert document token marker id</td>
<td>numeric</td>
<td>2</td>
</tr>
<tr>
<td>transformer-output</td>
<td>One</td>
<td>The name or identifier for the transformer output</td>
<td>string</td>
<td>contextual</td>
</tr>
</tbody>
</table>
<p>The Vespa colbert-embedder uses <code>[unused0]</code> (token id 1) as the <code>query-token-id</code> marker, and
<code>[unused1]</code> (token id 2) as the <code>document-token-id</code> marker. Document punctuation characters are
filtered (not configurable); the following characters are removed:
<code>!"#$%&amp;'()*+,-./:;&lt;=&gt;?@[\]^_`{|}~</code>.
</p>
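<p>Incidentally, the removed character set is exactly Python's <code>string.punctuation</code>, so the document-side
filtering can be sketched as follows (illustrative, not the embedder's implementation):</p>
<pre>{% highlight python %}
import string

def filter_document_text(text):
    """Remove the punctuation characters the colbert embedder filters."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(filter_document_text("Hello, world! (ColBERT)"))  # Hello world ColBERT
{% endhighlight %}</pre>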

<h2 id="huggingface-tokenizer-embedder">Huggingface tokenizer embedder</h2>
<p>
The Huggingface tokenizer embedder is configured in <a href="services.html">services.xml</a>,
