Multimodal RAG

Jun 30, 2025 · 8 min read

[Image: a retro cassette tape. Caption: Remember Cassettes?]

Giving long-term memory to your agents and grounding them in your data doesn’t need to be complicated.

After building numerous knowledge bases, I'm convinced that both the data and its downstream representations should be stored in a single place.

They should be tightly coupled.

That's why we'll use SQL to build a multimodal RAG system: the vectors live where your data most likely already does.

Our focus will be on four common use cases:

  1. Semantic text search
  2. Semantic image search
  3. Cross-modal search
  4. Hybrid text search

For the database we'll use Postgres on Supabase with Drizzle ORM, but you can just as easily use any other relational DB that offers a vector extension.

Axioms

Before we get into the weeds, here are two strong opinions held lightly:

  1. You don't need a dedicated vector database
  2. Long context windows will not fully eliminate the need for RAG (or whatever form information retrieval takes in the future)

A Primer

When your app scales to millions of documents or images (say, as rich context for agents), you'll need to store their retrievable representations, i.e. vectors, somewhere.

At the risk of being reductive, here are the 3 main pieces in a typical retrieval system:

  1. Chunking: Breaking down text into semantically coherent units (e.g. sentences, paragraphs, etc.)
  2. Embedding: Representing each chunk as an array of numbers. These can be sparse (TF-IDF, BM25) or dense (deep neural net embeddings)
  3. Retrieval: Finding the most relevant chunks for a given query

Beyond this you can have rerankers, metadata filters, late interaction models, and more, but we won't cover those here just yet.

In the case of text, we'll assume you've already obtained chunks from your data. For images, ideally store them in an S3 or R2 bucket so we have access to the original files.

The idea is that we want to yield self-contained, atomic units of information - whether that's a paragraph of text or a single image.
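
If you're chunking from scratch, here's a minimal sketch of a naive paragraph-based chunker; real pipelines typically split by token count with some overlap, but the shape is the same:

typescript
// naive paragraph chunker (a sketch): packs paragraphs into chunks of at most maxChars
export function chunkByParagraph(text: string, maxChars = 1000): string[] {
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks: string[] = [];
  let current = '';
  for (const p of paragraphs) {
    // start a new chunk once adding this paragraph would exceed the budget
    if (current && current.length + p.length > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? `${current}\n\n${p}` : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}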

Any good vector store needs to satisfy 3 fundamental axioms from day one:

  • Multi-embedding: Ability to store embeddings of different dimensions (e.g. 1024 for text, 4096 for images)
  • Multi-modality: Ability to handle multiple modalities of data (text, images, audio, video, etc.)
  • Multi-tenancy: Clear separation and enforcement of data ownership

The Schema

First, enable the required PostgreSQL extensions:

sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

Then define the schema using Drizzle ORM like so:

typescript
import { pgTable, uuid, text, timestamp, jsonb, vector, pgEnum, check, index, pgPolicy } from 'drizzle-orm/pg-core';
import { sql } from 'drizzle-orm';
import { authenticatedRole } from 'drizzle-orm/supabase';


// we define row level security (RLS) policies for user owned data as a factory,
// since the owning column differs per table ('id' on users, 'user_id' elsewhere)
export const userOwnPolicies = (ownerColumn: string) => [
  pgPolicy('users_select_own', {
    for: 'select',
    to: authenticatedRole,
    using: sql`auth.uid() = ${sql.identifier(ownerColumn)}`,
  }),
  pgPolicy('users_insert_own', {
    for: 'insert',
    to: authenticatedRole,
    withCheck: sql`auth.uid() = ${sql.identifier(ownerColumn)}`,
  }),
  pgPolicy('users_update_own', {
    for: 'update',
    to: authenticatedRole,
    using: sql`auth.uid() = ${sql.identifier(ownerColumn)}`,
    withCheck: sql`auth.uid() = ${sql.identifier(ownerColumn)}`,
  }),
];

// sample users table
export const users = pgTable('users', {
  id: uuid('id').defaultRandom().primaryKey(),
  email: text('email').notNull().unique(),
  name: text('name').notNull(),
  createdAt: timestamp('created_at', { withTimezone: true }).defaultNow(),
  updatedAt: timestamp('updated_at', { withTimezone: true }).defaultNow(),
}, (table) => [
  index('idx_users_email').on(table.email),
  ...userOwnPolicies('id'),
]).enableRLS();

// we define an organization table for multi-tenancy
export const organizations = pgTable('organizations', {
  id: uuid('id').defaultRandom().primaryKey(),
  userId: uuid('user_id')
    .notNull()
    .references(() => users.id, { onDelete: 'cascade' }),
  name: text('name').notNull(),
  createdAt: timestamp('created_at', { withTimezone: true }).defaultNow(),
  updatedAt: timestamp('updated_at', { withTimezone: true }).defaultNow(),
}, (table) => [
  index('idx_organizations_user_id').on(table.userId),
  ...userOwnPolicies('user_id'),
]).enableRLS();

export const modalityEnum = pgEnum('modality', ['text', 'image']);

// and here we have our core vectors table, which references both the users and organizations tables
export const vectors = pgTable('vectors', {
  id: uuid('id').defaultRandom().primaryKey(),
  organizationId: uuid('organization_id')
    .notNull()
    .references(() => organizations.id, { onDelete: 'cascade' }),
  userId: uuid('user_id')
    .notNull()
    .references(() => users.id, { onDelete: 'cascade' }),
  content: text('content'),
  chunk: text('chunk'),
  embedding1024: vector('embedding_1024', { dimensions: 1024 }),
  imageUrl: text('image_url'),
  embedding4096: vector('embedding_4096', { dimensions: 4096 }),
  modality: modalityEnum('modality').notNull().default('text'),
  metadata: jsonb('metadata').notNull().default({}),
  createdAt: timestamp('created_at', { withTimezone: true }).defaultNow(),
}, (table) => [
  index('idx_vectors_org_user').on(table.organizationId, table.userId),
  index('idx_vectors_modality').on(table.modality),
  check('check_text_embedding', sql`
    (modality = 'text' AND embedding_1024 IS NOT NULL AND embedding_4096 IS NULL) OR
    (modality = 'image' AND embedding_4096 IS NOT NULL AND embedding_1024 IS NULL)
  `),
  // we use GIN index for keyword search
  index('vectors_keyword_idx').using(
    'gin',
    sql`to_tsvector('english', ${table.chunk})`
  ),
  // we use HNSW index for vector similarity search
  // NOTE: pgvector's HNSW index currently supports at most 2,000 dimensions, so we
  // only index the 1024-dim text embeddings; 4096-dim image searches fall back to a
  // sequential scan (consider reducing dimensions if you need an index there)
  index('vectors_embedding_1024_idx').using(
    'hnsw',
    table.embedding1024.op('vector_cosine_ops')
  ),
  ...userOwnPolicies('user_id'),
]).enableRLS();

Migrations

First, install the dependencies with pnpm i drizzle-orm postgres dotenv and pnpm i -D drizzle-kit, then configure Drizzle in drizzle.config.ts:

typescript
import { defineConfig } from 'drizzle-kit';

export default defineConfig({
  schema: './db/schema.ts',
  out: './migrations',
  dialect: 'postgresql',
  schemaFilter: ['public'],
  verbose: true,
  strict: true,
  dbCredentials: {
    url: process.env.DATABASE_URL!,
  },
});

Run these commands to generate the migration files and apply them to the database:

bash
npx drizzle-kit generate
npx drizzle-kit migrate

Usage
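
The snippets below assume a db client exported from db.ts; here's a minimal sketch using postgres-js (prepare: false is needed when connecting through Supabase's transaction-mode pooler):

typescript
// db.ts: a minimal client setup (a sketch; adjust pooling to your deployment)
import { drizzle } from 'drizzle-orm/postgres-js';
import postgres from 'postgres';

const client = postgres(process.env.DATABASE_URL!, { prepare: false });
export const db = drizzle(client);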

Inserting Vectors

typescript
import { db } from './db';
import { vectors, users, organizations } from './schema';
import { eq, and, sql, cosineDistance } from 'drizzle-orm';

// text document
await db.insert(vectors).values({
  organizationId: 'org-uuid',
  userId: 'user-uuid',
  content: 'Machine learning is a subset of AI...',
  chunk: 'Machine learning is a subset of AI...',
  embedding1024: Array(1024).fill(0), // replace with actual embedding
  modality: 'text',
  metadata: { source: 'ml_textbook.pdf', page: 15 },
});

// image
await db.insert(vectors).values({
  organizationId: 'org-uuid',
  userId: 'user-uuid',
  imageUrl: 'https://s3.amazonaws.com/bucket/image.png',
  embedding4096: Array(4096).fill(0), // replace with actual embedding
  modality: 'image',
  metadata: { alt_text: 'neural network diagram' },
});
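
The zero-filled arrays above are placeholders. To generate real embeddings, use whatever provider you like; here's a sketch with the Vercel AI SDK, assuming OpenAI's text-embedding-3-large truncated to 1024 dimensions to match the embedding_1024 column:

typescript
import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';

// assumption: we use text-embedding-3-large with its `dimensions` parameter
// set to 1024 so the output matches the schema's text embedding column
const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-large', { dimensions: 1024 }),
  value: 'Machine learning is a subset of AI...',
});
// `embedding` is a number[] you can pass as embedding1024 in the insert above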

Querying Vectors by Similarity

typescript
const queryEmbedding = Array(1024).fill(0); // replace with actual query embedding
const results = await db
  .select()
  .from(vectors)
  .where(
    and(
      eq(vectors.organizationId, 'org-uuid'),
      eq(vectors.userId, 'user-uuid'),
      eq(vectors.modality, 'text')
    )
  )
  .orderBy(cosineDistance(vectors.embedding1024, queryEmbedding))
  .limit(10);

Semantic image search works the same way, just against the 4096-dim column:

typescript
const imageQueryEmbedding = Array(4096).fill(0); // replace with actual query embedding
const imageResults = await db
  .select()
  .from(vectors)
  .where(
    and(
      eq(vectors.organizationId, 'org-uuid'),
      eq(vectors.userId, 'user-uuid'),
      eq(vectors.modality, 'image')
    )
  )
  .orderBy(cosineDistance(vectors.embedding4096, imageQueryEmbedding))
  .limit(10);

Cross-modal search lets a text query retrieve related images, provided both are embedded with a shared-space model:

typescript
// use a multimodal embedding model (e.g. jina-clip-v2, multimodalembedding@001 or nova-2-multimodal-embeddings-v1)
// that maps both text and images into the same embedding space
// NOTE: dimensionality varies by model, so adjust the check constraint in the schema accordingly
const crossModalQueryEmbedding = Array(4096).fill(0); // replace with actual query embedding
const relatedImages = await db
  .select({
    id: vectors.id,
    imageUrl: vectors.imageUrl,
    metadata: vectors.metadata,
  })
  .from(vectors)
  .where(
    and(
      eq(vectors.organizationId, 'org-uuid'),
      eq(vectors.modality, 'image')
    )
  )
  .orderBy(cosineDistance(vectors.embedding4096, crossModalQueryEmbedding))
  .limit(5);

Hybrid Text Search Using Reciprocal Rank Fusion (RRF)
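
The intuition behind RRF: each document's score is the sum of the reciprocals of its ranks in the keyword and semantic result lists, so documents that rank well in both rise to the top without any score normalization:

score(d) = full_text_weight * 1 / (rrf_k + rank_fulltext(d)) + semantic_weight * 1 / (rrf_k + rank_semantic(d))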

Create a SQL function on Supabase that uses RRF to combine keyword and semantic search:

sql
create or replace function hybrid_text_search(
  query_text text,
  query_embedding vector(1024),
  org_id uuid,
  uid uuid,
  match_count int,
  full_text_weight float = 1,
  semantic_weight float = 1,
  rrf_k int = 50
)
returns table (
  id uuid,
  chunk text,
  metadata jsonb,
  semantic_score float
)
language sql
as $$
with full_text as (
  select
    vectors.id,
    row_number() over(
      order by ts_rank_cd(to_tsvector('english', chunk), websearch_to_tsquery(query_text)) desc
    ) as rank_ix
  from vectors
  where organization_id = org_id
    and user_id = uid
    and modality = 'text'
    and to_tsvector('english', chunk) @@ websearch_to_tsquery(query_text)
  order by rank_ix
  limit least(match_count, 30) * 2
),
semantic as (
  select
    vectors.id,
    1 - (embedding_1024 <=> query_embedding) as cosine_similarity,
    row_number() over (order by embedding_1024 <=> query_embedding) as rank_ix
  from vectors
  where organization_id = org_id
    and user_id = uid
    and modality = 'text'
  order by rank_ix
  limit least(match_count, 30) * 2
)
select
  vectors.id,
  vectors.chunk,
  vectors.metadata,
  semantic.cosine_similarity as semantic_score
from full_text
full outer join semantic on full_text.id = semantic.id
join vectors on coalesce(full_text.id, semantic.id) = vectors.id
order by
  coalesce(1.0 / (rrf_k + full_text.rank_ix), 0.0) * full_text_weight +
  coalesce(1.0 / (rrf_k + semantic.rank_ix), 0.0) * semantic_weight desc
limit least(match_count, 30)
$$;

Call the function from your application:

typescript
const results = await db.execute(sql`
  select * from hybrid_text_search(
    ${query},
    ${JSON.stringify(queryEmbedding)}::vector,
    ${organizationId}::uuid,
    ${userId}::uuid,
    10
  )
`);

How to Pick a Model

  • Check the MTEB leaderboard to shortlist the best-performing models
    • For open-source models, factor in your compute budget and latency tolerance
  • Run custom evals on your specific domains and use-cases (see the sketch below)
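
For the custom evals, even a simple recall@k loop over labeled query/document pairs is a strong start; here's a sketch, where searchText is a hypothetical wrapper around the similarity query shown earlier that returns the top-k vector ids:

typescript
type EvalCase = { query: string; expectedId: string };

// fraction of queries whose expected document shows up in the top-k results
async function recallAtK(
  cases: EvalCase[],
  k: number,
  searchText: (query: string, k: number) => Promise<string[]>, // hypothetical helper
): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    const ids = await searchText(c.query, k);
    if (ids.includes(c.expectedId)) hits++;
  }
  return hits / cases.length;
}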

A Retrospective

It's been fascinating to see the unbundling and then subsequent bundling of vectors into traditional SQL databases.

Vector databases like Pinecone and Weaviate were the first wave, built on the assumption that embeddings would require their own layer, separate from the core application DB.

While not entirely without merit (especially at billions of vectors), I think that wave wildly overestimated the need for dedicated embedding infra and tooling.

Most projects are simply better off with a battle-tested SQL database.
