Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GODRIVER-3286 BSON Binary vector subtype support. #1919

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

qingyang-hu
Copy link
Collaborator

@qingyang-hu qingyang-hu commented Jan 10, 2025

GODRIVER-3286

Summary

BSON Binary vector subtype support.

Background & Motivation

specs: https://github.com/mongodb/specifications/blob/master/source/bson-binary-vector/bson-binary-vector.md
tests: https://github.com/mongodb/specifications/blob/master/source/bson-binary-vector/tests/README.md

The invalid cases in the spec tests are skipped because they are inappropriate for the implementation.

The motivation for using generic types is to reduce run-time errors by shifting the burden onto the compiler.

@mongodb-drivers-pr-bot mongodb-drivers-pr-bot bot added the priority-3-low Low Priority PR for Review label Jan 10, 2025
Copy link
Contributor

mongodb-drivers-pr-bot bot commented Jan 10, 2025

API Change Report

./v2/bson

compatible changes

Float32Vector: added
Int8Vector: added
NewPackedBitVector: added
NewVector: added
NewVectorFromBinary: added
PackedBitVector: added
TypeBinaryVector: added
Vector: added

@qingyang-hu qingyang-hu marked this pull request as ready for review January 10, 2025 23:32
bson/vector.go Outdated
// BitVector represents a binary quantized (PACKED_BIT) vector of 0s and 1s
// (bits). The Padding prescribes the number of bits to ignore in the final byte
// of the Data. It should be 0 for an empty Data and always less than 8.
type BitVector struct {
Copy link
Collaborator

@prestonvasquez prestonvasquez Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest merging BitVector into Vector, similar to python. Also, suggest not using generics since a function with a generic signature like NewVectorFromBinary[T ...] cannot dynamically decide at runtime which T to return and so must return an ambiguous type such as interface{} or any which is not ideal.

type VectorDType byte

// These constants are vector data types.
const (
	Int8Vector      VectorDType = 0x03
	Float32Vector   VectorDType = 0x27
	PackedBitVector VectorDType = 0x10
)

// Vector represents a densely packed array of numbers / bits.
type Vector struct {
	DType       VectorDType
	Int8Data    []int8
	Float32Data []float32
	BitData     []byte
	BitPadding  []uint8
}

Then the internal constructors would only return their associated type, rather than an entire struct:

func newInt8Vector(b []byte) ([]int8, error) {
	if len(b) == 0 {
		return nil, ErrInsufficientVectorData
	}
	if padding := b[0]; padding > 0 {
		return nil, ErrNonZeroVectorPadding
	}
	s := make([]int8, 0, len(b)-1)
	for i := 1; i < len(b); i++ {
		s = append(s, int8(b[i]))
	}
	return s, nil
}

Finally, the vector constructor, avoiding the necessity of returning interface{}/any:

func NewVectorFromBinary(b Binary) (Vector, error) {
	var vec Vector

	if b.Subtype != TypeBinaryVector {
		return vec, ErrNotVector
	}
	if len(b.Data) < 2 {
		return vec, ErrInsufficientVectorData
	}

	switch VectorDType(b.Data[0]) {
	case Int8Vector:
		data, err := newInt8Vector(b.Data[1:])
		if err != nil {
			return vec, err
		}

		vec.DType = Int8Vector
		vec.Int8Data = data

		return vec, nil
	default:
		return vec, fmt.Errorf("invalid Vector data type: %d", b.Data[1:])
	}
}

The con of this approach is that we would need to actual run invalid cases such as "Vector with float values PACKED_BIT", which we currently skip because it's not possible under the current design.

Copy link
Collaborator Author

@qingyang-hu qingyang-hu Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation of the current design is to reduce run-time errors by shifting the burden onto the compiler. Likewise, it intends to minimize user mistakes, e.g., assigning a wrong/invalid DType because Vector and its fields are directly accessible.

Returning a general interface from a constructor is not ideal because it lacks a syntactical clue about the type. However, the name of the method also suggests the scope. On the other hand, the user has to identify the vector type by either the DType or a type assertion anyway. The latter seems more instinctive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it intends to minimize user mistakes, e.g., assigning a wrong/invalid DType because Vector and its fields are directly accessible.

I have concerns about returning an empty interface from an exported function in stable API. Specifically "accept interfaces, return structs". While there are exceptions( io.LimitReader, etc), IMO the case described here should be avoided since we know what the function should return. The generic solution is fine as long as we return concrete types from exported constructors.

If we are concerned with users accessing types, we could privatize them:

// Vector represents a densely packed array of numbers / bits.
type Vector struct {
	dType       VectorDType
	int8Data    []int8
	float32Data []float32
	bitData     []byte
	bitPadding  []uint8
}

// NewVector creates a new Vector from the provided data and padding, based on the type of the data.
func NewVector[T int8 | float32 | byte](data []T, padding []uint8) (Vector, error) {
	switch any(data).(type) {
	case []int8:
		if len(padding) > 0 {
			return Vector{}, errors.New("padding should not be set for int8 data")
		}
		return Vector{
			dType:    Int8Vector,
			int8Data: any(data).([]int8), // Type assertion
		}, nil

	case []float32:
		if len(padding) > 0 {
			return Vector{}, errors.New("padding should not be set for float32 data")
		}
		return Vector{
			dType:       Float32Vector,
			float32Data: any(data).([]float32), // Type assertion
		}, nil

	case []byte:
		return Vector{
			dType:      PackedBitVector,
			bitData:    any(data).([]byte), // Type assertion
			bitPadding: padding,
		}, nil

	default:
		return Vector{}, errors.New("unsupported data type")
	}
}

// GetInt8Data retrieves the int8 data if the Vector is of Int8Type
func (v *Vector) Int8Vector() ([]int8, error) { // maybe ok instead of error?
	if v.dType != Int8Vector {
		return nil, errors.New("data is not of type int8")
	}
	return v.int8Data, nil
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Privatizing data fields sounds like a better approach to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthewdale What are your thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that making Vector a non-generic struct with unexported fields seems like the best choice. What's not clear is how to expose the vector data.

One option is to use a method like Decode:

func (v Vector) Decode(v any) error {
	// ...
}

In use, that would look like:

var v Vector
var floats []float32
err := v.Decode(&floats)
if err != nil {
// ...

Another option is to use an API similar to RawValue's conversion methods:

func (v Vector) Int8() []int8
func (v Vector) Int8OK() ([]int8, bool)
func (v Vector) Float32() []float32
func (v Vector) Float32OK() ([]float32, bool)
func (v Vector) PackedBits() (bits []byte, padding byte)
func (v Vector) PackedBitsOK() (bits []byte, padding byte, ok bool)

In use, that would look like:

var v Vector
floats, ok := v.Float32OK()
if !ok {
// ...

I lean toward the latter because it's more conventional based on the existing RawValue API.

For creating a vector, I recommend defining two methods:

func NewVector[T int8 | float32](data []T) Vector
func NewPackedBitsVector(bits []byte, padding byte) Vector

In use, that would look like:

vec := bson.NewVector([]int8{0, 1, 2, 3})

Additionally, I recommend making it possible to decode "int8" and "float32" vectors directly into []int8 and []float32, respectively. For example, the following should work:

var coll *mongo.Collection
d := bson.D{{
	"vec", bson.NewVector([]int8{0, 1, 2, 3}),
}}
coll.InsertOne(..., v)

var res struct {
	Vec []int8
}
coll.FindOne(...).Decode(&res)

Additional questions:

  • Should we also add a struct field annotation for []int8 and []float32 fields that instructs the Go Driver to store them as vectors instead of arrays? That would allow a user to encode and decode vector data with a single struct without any additional decoding/conversion steps.
    For example:
var coll *mongo.Collection
type MyType struct {
	Vec []int8 `bson:"vec,asvector"`
}
coll.InsertOne(... MyType{Vec: []int8{0, 1, 2, 3})

var res MyType
coll.FindOne(...).Decode(&res)
  • If we do the above, do we even need a Vector type? Maybe we only need a PackedBitVector type, and we can use []int8 and []float32 for everything else?
  • Some of these questions would be easier to answer if we had examples of how vector data might be inserted, queried, and manipulated in a Go application. Do we have any example use cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer a Vector type over the field annotation. Annotation lacks compile-time checks, which means a typo can result in a hard-to-find bug.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the Vector type is safer. In addition to what @qingyang-hu notes, it gives us a way to maintain state concerning the type in the unlikely case that is required. It also mirrors the python implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a non-generic Vector type with unexported fields sounds like the best plan.

Based on our separate conversation, supporting unmarshaling directly to []int8 or []float32 is lower priority than being able to easily insert vector data in those formats. I've created GODRIVER-3472 to track that improvement.

bson/vector.go Outdated
Comment on lines 25 to 28
ErrNotVector = errors.New("not a vector")
ErrInsufficientVectorData = errors.New("insufficient data")
ErrNonZeroVectorPadding = errors.New("padding must be 0")
ErrVectorPaddingTooLarge = errors.New("padding larger than 7")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should users need to assert against these error types? Consider un-exporting them until there is a clear use case for exporting

bson/vector.go Outdated Show resolved Hide resolved
bson/vector.go Outdated
}

// NewBinaryFromVector converts a Vector into a BSON Binary.
func NewBinaryFromVector[T BitVector | Vector[int8] | Vector[float32]](v T) (Binary, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a method on the Vector type instead?

E.g.

func (v Vector) Binary() Binary

@qingyang-hu qingyang-hu force-pushed the godriver3286 branch 4 times, most recently from 65ef156 to 9133d6c Compare January 28, 2025 15:53
@jtazin jtazin added priority-1-high High Priority PR for Review and removed priority-3-low Low Priority PR for Review labels Jan 29, 2025
bson/vector.go Outdated
Comment on lines 17 to 23
type VectorDType byte

// These constants are vector data types.
const (
Int8Vector VectorDType = 0x03
Float32Vector VectorDType = 0x27
PackedBitVector VectorDType = 0x10
Copy link
Collaborator

@matthewdale matthewdale Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important to have a custom type for the vector subtypes? The BSON binary subtypes don't use a custom Go type (although the BSON types do, so there's not a consistent pattern).

bson/vector.go Show resolved Hide resolved
bson/vector.go Outdated Show resolved Hide resolved
bson/vector.go Outdated Show resolved Hide resolved
bson/vector.go Outdated Show resolved Hide resolved
bson/vector.go Outdated Show resolved Hide resolved
bson/vector.go Outdated Show resolved Hide resolved
bson/vector.go Outdated Show resolved Hide resolved
CanonicalBson string `json:"canonical_bson"`
}

func Test_BsonBinaryVector(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would typically expect tests with _ to specify a method call

type A struct {}

func (*A) B() {}

func TestA_B(*testing.T) {}
Suggested change
func Test_BsonBinaryVector(t *testing.T) {
func TestBsonBinaryVectorSpec(t *testing.T) {

})
}

t.Run("Insufficient vector data FLOAT32", func(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest making non-spec tests their own test.

func TestInsufficientVectorDataFloat32(*testing.T)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, these should be included in the spec tests. I will submit a spec PR later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a pattern for naming specification tests Test{Focus}Spec. For example, TestConnStringSpec, TestURIOptionsSpec, TestPollingSRVRecordsSpec, TestServerSelectionRTTSpec, etc. We should call this test TestBsonBinaryVectorSpec and decouple it from non-spec tests. If we are going to make these subtests spec tests in the future, we should at least add a comment with the corresponding DRIVERS ticket so that they can be removed as duplicates after implementation.

bson/vector.go Outdated
Comment on lines 141 to 150
data := make([]byte, 2, len(v)+2)
copy(data, []byte{byte(Int8Vector), 0})
for _, e := range v {
data = append(data, byte(e))
}

return Binary{
Subtype: TypeBinaryVector,
Data: data,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly setting the size and filling it in reads more clearly IMO. Additionally, IIUC the following is slightly more efficient since we use a single allocation and avoid calling append in a loop:

	data := make([]byte, len(v)+2)

	data[0] = byte(Int8Vector)
	data[1] = 0

	for i, e := range v {
		data[i+2] = byte(e)
	}

	return Binary{Subtype: TypeBinaryVector, Data: data}

}

func runBsonBinaryVectorTest(t *testing.T, testKey string, test bsonBinaryVectorTestCase) {
if !test.Valid {
Copy link
Collaborator

@prestonvasquez prestonvasquez Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we've unified vector and packed bit vector, we should not skip invalid tests.

v := make([]T, len(s))
for i, e := range s {
f := math.NaN()
switch v := e.(type) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v shadows the outer v, suggest renaming it to val.

@qingyang-hu qingyang-hu force-pushed the godriver3286 branch 2 times, most recently from 332dbab to a099fc7 Compare January 31, 2025 22:11
matthewdale
matthewdale previously approved these changes Feb 1, 2025
Copy link
Collaborator

@matthewdale matthewdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 👍

bson/vector.go Outdated
"math"
)

// These constants are vector data types.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: This comment actually gets rendered in the docs. Consider a more descriptive comment.

E.g.

// BSON binary vector types as described in https://bsonspec.org/spec.html.

prestonvasquez
prestonvasquez previously approved these changes Feb 3, 2025
Copy link
Collaborator

@prestonvasquez prestonvasquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with consideration for the suggested updates.

})
}

t.Run("Insufficient vector data FLOAT32", func(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a pattern for naming specification tests Test{Focus}Spec. For example, TestConnStringSpec, TestURIOptionsSpec, TestPollingSRVRecordsSpec, TestServerSelectionRTTSpec, etc. We should call this test TestBsonBinaryVectorSpec and decouple it from non-spec tests. If we are going to make these subtests spec tests in the future, we should at least add a comment with the corresponding DRIVERS ticket so that they can be removed as duplicates after implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority-1-high High Priority PR for Review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants