
Cosine Similarity or Dot Product?

Google’s embedder uses the dot product between normalized vectors, which is computationally more efficient than, but mathematically equivalent to, cosine similarity.

How Googlers work and think internally typically aligns with their open-source code (Gemini -> Gemma), and Chrome is no exception. That’s why I look there for answers and clarity on Google’s machine learning approaches.

After examining the Chrome codebase, I found the following key evidence regarding the similarity method used:

Embedding::ScoreWith Method Implementation

The core similarity calculation is implemented in the ScoreWith method of the Embedding class in vector_database.cc:

float Embedding::ScoreWith(const Embedding& other_embedding) const {
  // This check is redundant since the database layers ensure embeddings
  // always have a fixed consistent size, but code can change with time,
  // and being sure directly before use may eventually catch a bug.
  CHECK_EQ(data_.size(), other_embedding.data_.size());

  float embedding_score = 0.0f;
  for (size_t i = 0; i < data_.size(); i++) {
    embedding_score += data_[i] * other_embedding.data_[i];
  }
  return embedding_score;
}

This implementation calculates the dot product of two embedding vectors.

Normalization of Embeddings

The code shows that embeddings are normalized to unit length:

void Embedding::Normalize() {
  float magnitude = Magnitude();
  CHECK_GT(magnitude, kEpsilon);
  for (float& s : data_) {
    s /= magnitude;
  }
}
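The `Magnitude()` helper isn’t shown in the excerpt. A minimal sketch consistent with how `Normalize()` uses it — this is my own illustration of the Euclidean (L2) norm, not Chrome’s actual code — would be:

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of the L2 norm that Normalize() divides by:
// the square root of the sum of squared components.
float Magnitude(const std::vector<float>& data) {
  float sum_of_squares = 0.0f;
  for (float s : data) {
    sum_of_squares += s * s;
  }
  return std::sqrt(sum_of_squares);
}
```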

And in the FindNearest method in VectorDatabase, there’s a check to ensure the query embedding has unit magnitude:

// Magnitudes are also assumed equal; they are provided normalized by design.
CHECK_LT(std::abs(query_embedding.Magnitude() - kUnitLength), kEpsilon);

There’s also a constant defined:

// Standard normalized magnitude for all embeddings.
constexpr float kUnitLength = 1.0f;

No Direct References to Cosine Similarity

There are no direct references to “cosine” or “cosine similarity” in the codebase.

Based on the evidence, the code is using dot product between normalized vectors for similarity calculation.

Potato – potahto.

It doesn’t really matter.

When vectors are normalized to unit length (magnitude = 1), the dot product is mathematically equivalent to cosine similarity. This is because:

Cosine similarity = (A·B) / (|A|·|B|)

When |A| = |B| = 1 (normalized vectors), this simplifies to:

Cosine similarity = A·B = dot product

Therefore, the code is effectively implementing cosine similarity by:

  1. Ensuring all vectors are normalized to unit length
  2. Computing the dot product between these normalized vectors

This approach is computationally more efficient than evaluating the full cosine similarity formula at query time, since the magnitude computations and the division are avoided, while producing the same result for normalized vectors.
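To make the equivalence concrete, here is a small standalone sketch (my own illustration, not Chrome code) with both paths: the full cosine formula, and a plain dot product over pre-normalized copies of the same vectors:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product of two equal-length vectors.
float Dot(const std::vector<float>& a, const std::vector<float>& b) {
  float s = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Full cosine similarity: (A·B) / (|A|·|B|).
float Cosine(const std::vector<float>& a, const std::vector<float>& b) {
  return Dot(a, b) / (std::sqrt(Dot(a, a)) * std::sqrt(Dot(b, b)));
}

// Scale a copy of v to unit length, like Embedding::Normalize().
std::vector<float> Normalized(std::vector<float> v) {
  float magnitude = std::sqrt(Dot(v, v));
  for (float& s : v) s /= magnitude;
  return v;
}
```

For any pair of non-zero vectors, `Cosine(a, b)` and `Dot(Normalized(a), Normalized(b))` agree up to floating-point error; Chrome simply pays the normalization cost once, at storage time, instead of on every comparison.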

The archive contains a Chromium component called history_embeddings that implements a service for embedding and searching browser history using vector embeddings.

The files can be categorized as follows:

  1. Build and Configuration Files:
    • BUILD.gn, DEPS, DIR_METADATA, OWNERS, README.md
  2. Core Service Files:
    • history_embeddings_service.h/cc: Main service implementation
    • history_embeddings_features.h/cc: Feature flags and parameters
    • passage_embeddings_service_controller.h/cc: Controller for embeddings service
  3. Embedding Generation Files:
    • embedder.h: Base interface for embedding text passages
    • ml_embedder.h/cc: ML-based implementation of embedder
    • scheduling_embedder.h/cc: Priority-based embedding scheduler
    • mock_embedder.h/cc: Mock implementation for testing
  4. Answer Generation Files:
    • answerer.h/cc: Interface for generating answers from embeddings
    • ml_answerer.h/cc: ML-based implementation of answerer
    • mock_answerer.h/cc: Mock implementation for testing
  5. Database Files:
    • vector_database.h/cc: Vector storage and similarity search
    • sql_database.h/cc: Persistent storage for embeddings
  6. Utility Files:
    • passages_util.h/cc: Utilities for text passage processing
  7. Core Subdirectory Files:
    • search_strings_update_listener.h/cc: Listener for search string updates
  8. Proto Files:
    • history_embeddings.proto: Defines storage format for passages and embeddings
    • passage_embeddings_model_metadata.proto: Defines model metadata
  9. Mock Service Files:
    • mock_history_embeddings_service.h/cc: Mock service for testing
  10. Test Files:
    • Various unit test files for each component

The component implements a semantic search system for browser history that:

  1. Extracts text passages from web pages
  2. Converts passages to vector embeddings using ML models
  3. Stores embeddings in a vector database
  4. Performs similarity search using dot product of normalized vectors
  5. Generates answers to user queries based on relevant passages
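Steps 3 and 4 above can be sketched as a brute-force nearest-neighbor search over stored embeddings — a simplified illustration of the idea, not Chrome’s actual `FindNearest`, and it assumes every stored embedding and the query are already unit length:

```cpp
#include <cstddef>
#include <vector>

// Simplified sketch of a vector store searched by dot product.
// Assumes all stored embeddings and the query are already normalized.
struct SimpleVectorStore {
  std::vector<std::vector<float>> embeddings;

  // Returns the index of the highest-scoring stored embedding,
  // or -1 if the store is empty.
  int FindNearest(const std::vector<float>& query) const {
    int best_index = -1;
    float best_score = -2.0f;  // Scores of unit vectors lie in [-1, 1].
    for (std::size_t i = 0; i < embeddings.size(); ++i) {
      float score = 0.0f;
      for (std::size_t j = 0; j < query.size(); ++j) {
        score += embeddings[i][j] * query[j];
      }
      if (score > best_score) {
        best_score = score;
        best_index = static_cast<int>(i);
      }
    }
    return best_index;
  }
};
```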

The approach is effective and computationally efficient. Sounds like Google to me.


Comments

2 responses to “Cosine Similarity or Dot Product?”

  1. Given the impact of vector magnitudes on the similarity score, are there scenarios where using the dot product might accidentally exaggerate similarity, like for vectors with large magnitudes?
    Or are web content/passage embeddings too few dimensions to worry about that?

    1. Absolutely. If direction is more important than intensity, use cosine similarity or normalize embeddings before computing the dot product.
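A quick standalone sketch of the failure mode from the question (my own example, not Chrome code): with unnormalized vectors, a long vector pointing away from the query can out-score a short vector pointing exactly along it, while cosine similarity ranks them the other way around:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

float Dot(const std::vector<float>& a, const std::vector<float>& b) {
  float s = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

float Cosine(const std::vector<float>& a, const std::vector<float>& b) {
  return Dot(a, b) / (std::sqrt(Dot(a, a)) * std::sqrt(Dot(b, b)));
}
```

With `query = {1, 0}`, `aligned = {1, 0}`, and `long_off_axis = {7, 7}`: the raw dot product scores `long_off_axis` higher (7 vs. 1) purely because of its magnitude, but cosine similarity scores `aligned` higher (1.0 vs. about 0.707) — which is why embeddings are normalized before the dot product is used as a similarity score.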
