Core Concepts

The foundational ideas behind TasteBrain: a shared embedding space, lightweight per-user models, living data, and grounded language.

The embedding space

TasteBrain is a transformer-based, multimodal model derived from Qwen-Omni and trained at scale on ~20M curated product images across ~7M products. Fine-tuning targets the attention layers, and a custom projector maps per-token hidden states into two representations:

Multi-vector embedding

A structured 128-dimensional space where each dimension encodes an aesthetic axis — style, material, form factor, color palette, brand affinity. Per-token hidden states are projected individually, preserving fine-grained spatial and semantic detail.

Pooled embedding

A single vector summarizing the overall aesthetic signature. Used for fast approximate nearest-neighbor lookups, top-k candidate retrieval, and situations where a compact representation is sufficient.

This shared embedding space represents the global geometry of taste. Styles, brands, materials, form factors, and aesthetic signals form coherent regions. Proximity in this space means aesthetic similarity — regardless of whether the inputs are images, text, audio, or products.
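Retrieval over this space can be sketched as cosine-similarity nearest neighbors on pooled embeddings. This is a minimal illustration with random stand-in vectors; the production system uses an approximate nearest-neighbor index, and the embedding values here are assumptions.

```python
import numpy as np

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k pooled embeddings closest to the query.

    Proximity in the shared space means aesthetic similarity, so
    retrieval reduces to cosine similarity over pooled vectors.
    """
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k].tolist()

# Toy corpus of 128-dim pooled embeddings standing in for real products.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 128))
query = corpus[42] + 0.01 * rng.normal(size=128)  # near "product 42"
print(top_k(query, corpus)[0])  # 42 ranks first
```

Because similarity is computed geometrically, the same `top_k` call works whether `query` came from an image, a text prompt, or an audio clip.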

Example

A user uploads a photo of a cafe interior with exposed brick and warm lighting. TasteBrain projects this image into the embedding space. Products that land in the same region — industrial pendant lamps, reclaimed wood furniture, ceramic mugs with matte glazes — are returned as results. The search crosses product categories, retailers, and price points because similarity is computed in aesthetic space, not catalog metadata.

Cross-modal queries

Because every modality is projected into the same 128-dimensional space, you can combine inputs freely:

  • Image → products — "find me products that look like this"
  • Text → products — "minimalist Scandinavian desk lamp"
  • Image + text → products — "like this room, but in warm earth tones"
  • Product → products — "more like this" from a product handle
  • Audio → products — map sonic aesthetics to visual products

Multiple query inputs are combined in embedding space before retrieval, allowing nuanced multi-signal searches that no single modality could express alone.
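One plausible fusion rule is a weighted average of unit-normalized embeddings, renormalized before retrieval. This is a sketch; TasteBrain's actual combination method and weighting are not documented here.

```python
import numpy as np

def combine(embeddings: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Fuse several query embeddings (image, text, audio) into one.

    Each input is unit-normalized, averaged with its weight, and the
    result renormalized so downstream retrieval sees a single query.
    """
    unit = np.stack([e / np.linalg.norm(e) for e in embeddings])
    fused = np.average(unit, axis=0, weights=weights)
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
room_image = rng.normal(size=128)   # "like this room"
earth_tones = rng.normal(size=128)  # "but in warm earth tones"
query = combine([room_image, earth_tones], weights=[0.7, 0.3])
```

The weights let one signal dominate (the room's overall aesthetic) while another nudges the result (the color direction), which a single-modality query could not express.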

Per-user taste models

On top of the shared embedding space, TasteBrain learns lightweight per-user models that identify each individual's personal taste manifold. These models capture:

  • Attraction — regions of the embedding space the user gravitates toward
  • Aversion — regions the user consistently rejects
  • Drift — how the user's taste evolves over time

Crucially, per-user models do not retrain the backbone. They operate as learned offsets on top of the shared geometry, which means:

No fragmentation

Every user shares the same base model. The global geometry remains coherent regardless of how many users are active.

Real-time adaptation

Per-user models update from sentiment signals (likes, dislikes, saves) without offline retraining. Preferences take effect immediately.

Lightweight

Each user model is a small learned structure — not a copy of the backbone. This scales to millions of users.
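One way to picture a learned offset is as a small vector added to the query before scoring. This is an illustrative formulation, not the actual per-user model architecture.

```python
import numpy as np

def personalized_scores(query: np.ndarray, corpus: np.ndarray,
                        user_offset: np.ndarray) -> np.ndarray:
    """Score products against a query shifted by the user's learned offset.

    The backbone embeddings in `corpus` never change; only the small
    per-user offset vector does, which is why this scales to millions
    of users without fragmenting the shared geometry.
    """
    shifted = query + user_offset
    shifted = shifted / np.linalg.norm(shifted)
    unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return unit @ shifted

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 128))
query = rng.normal(size=128)
scores = personalized_scores(query, corpus, np.zeros(128))  # zero offset = shared space
```

With a zero offset the scores reduce to plain cosine similarity in the shared space, which mirrors the unpersonalized behavior described below.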

In the API, personalization is activated simply by passing a user_id to the Prism search endpoints. Without a user_id, results come from the shared space alone (equivalent to using the Shopkeep API).

Crawling infrastructure

The embedding space is only as useful as the data it covers. Bestomer operates proprietary crawling infrastructure with two key properties:

Weekly revisits

Every indexed retail domain is revisited weekly. This tracks catalog changes — new arrivals, discontinued items, price updates, seasonal rotations — so the product corpus stays current. Stale data degrades aesthetic relevance, so freshness is a first-class concern.

Cross-domain coverage

The system goes beyond retail products:

  • Products & brands — Core product search and brand matching across ~7M items
  • Restaurants — Map dining aesthetics to product preferences — atmosphere, plating, interior design
  • Music — Cross-modal mapping from sonic aesthetics to visual products
  • Reviews — Extract aesthetic language from natural-language opinions
  • Cultural sources — Art, architecture, fashion editorials, trend reports — the broader context that shapes taste

This breadth is what allows taste to be modeled as a living, evolving system rather than a static product catalog. When a new design trend emerges in architecture, TasteBrain can connect it to products, restaurants, and music that share the same aesthetic signal — before anyone writes a tag or a label.

Sentiments

Sentiments are user reactions to products that train the per-user taste model. Each sentiment is a data point that adjusts the user's personal manifold:

  • Like — Positive aesthetic preference. Pulls the user's manifold toward this region.
  • Dislike — Negative signal. Pushes the manifold away from this region.
  • Want — Strong positive signal. Weighted more heavily than a like.
  • Have — Strongest signal. The user owns this — definitive taste evidence.

Sentiments are recorded in the Level 2: Persona Services tier and fed back into the per-user model in real time. They're the primary training signal for personalization.
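A toy update rule makes the relative weighting concrete. The weights and learning rate here are illustrative assumptions, not documented values.

```python
import numpy as np

# Illustrative weights: want > like, have strongest, dislike negative.
SENTIMENT_WEIGHTS = {"like": 1.0, "want": 2.0, "have": 3.0, "dislike": -1.0}

def apply_sentiment(user_offset: np.ndarray, product: np.ndarray,
                    sentiment: str, lr: float = 0.1) -> np.ndarray:
    """Nudge the user's offset toward (or away from) a product's region.

    Positive sentiments pull the personal manifold toward the product;
    a dislike pushes it away. No backbone retraining is involved.
    """
    direction = product / np.linalg.norm(product)
    return user_offset + lr * SENTIMENT_WEIGHTS[sentiment] * direction

rng = np.random.default_rng(0)
vase = rng.normal(size=128)
offset = apply_sentiment(np.zeros(128), vase, "have")  # strong pull
offset = apply_sentiment(offset, vase, "dislike")      # partial pushback
```

Because each update is a single vector addition, it can run synchronously on every reaction, which is what makes real-time adaptation cheap.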

Captures

Captures are media inputs — images, screenshots, social media posts — that express aesthetic preference without words. When a user uploads or saves an image, it's projected into the embedding space and becomes a query anchor.

Sources

  • Device uploads — photos from camera roll (via iOS integration or web upload)
  • Social sync — saved posts from Instagram, Pinterest, TikTok
  • Product screenshots — images of specific products the user wants to find
  • Scene captures — interiors, environments, fashion looks, food presentations

At Level 1, captures are used as stateless query inputs. At Level 2, they're persisted and clustered to build richer user profiles over time.

Windows

Windows are personalized collections of products organized around an aesthetic theme. They exist at the Level 2 tier and can be:

  • Auto-generated — the system clusters a user's sentiments and captures to create themed collections
  • Manually curated — users add captures and refine with explicit sentiments
  • Scoped — filtered by brand, retailer, or recipient (for gift-giving)

Each window represents a coherent region of the user's taste manifold. As the user interacts, windows evolve — new products appear, less relevant ones fade, and entirely new windows can emerge when the system detects a shift in the user's aesthetic interests.
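Auto-generation can be pictured as clustering the user's saved embeddings, with each cluster becoming a candidate window. This is a toy k-means sketch under the assumption that sentiment and capture embeddings are already available; the real grouping method is not specified.

```python
import numpy as np

def cluster_windows(embeddings: np.ndarray, k: int, iters: int = 20):
    """Group a user's sentiment/capture embeddings into k aesthetic themes.

    Plain k-means with naive spread initialization: each resulting
    cluster is one candidate window, its centroid a coherent region
    of the user's taste manifold.
    """
    step = max(1, len(embeddings) // k)
    centroids = embeddings[::step][:k].copy()  # naive init: spread picks
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: a user's captures fall into two distinct visual themes.
rng = np.random.default_rng(0)
theme_a = rng.normal(0.0, 0.1, size=(20, 128))
theme_b = rng.normal(4.0, 0.1, size=(20, 128))
labels, centroids = cluster_windows(np.vstack([theme_a, theme_b]), k=2)
```

Re-running the clustering as new sentiments and captures arrive is one way a shift in aesthetic interest could surface as an entirely new window.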

Tasteboards

Tasteboards are visual summaries of a user's aesthetic DNA. Unlike windows (which are product collections), tasteboards are computed visualizations of the user's position in the embedding space — showing the aesthetic axes they occupy.

  • User-facing — help users understand and share their own taste profile
  • Brand-facing — show brands which aesthetic regions resonate with specific user segments
  • Collaborative — share taste profiles with friends, stylists, or gift-givers

Grounded language generation

The same multimodal backbone is being extended to generate natural-language explanations of why products match a query or a user's taste. Using VQA-style (Visual Question Answering) supervision, the system learns to produce grounded descriptions:

"This hand-thrown ceramic vase shares the same warm minimalism you're drawn to — the matte clay finish and asymmetric form echo the Scandinavian pieces you've saved, but with a Japanese wabi-sabi influence that's emerging in your recent captures."

Generated explanation grounded in both the user's taste manifold and the product's embedding coordinates.

This unifies representation, personalization, and explanation in a single system — the model doesn't just find relevant products, it can articulate the aesthetic reasoning.

How it all connects

  1. Input is projected into the 128-dim embedding space — Images, text, audio, or product handles are all mapped to the same shared geometry by the Qwen-Omni backbone.
  2. Nearest neighbors are retrieved from the product corpus — The pooled embedding enables fast top-k retrieval from ~7M products, refreshed weekly by the crawling infrastructure.
  3. The per-user taste model reranks results — If a user_id is provided, the lightweight personal model adjusts ranking based on attraction and aversion signals.
  4. The user reacts with sentiments — Likes, dislikes, wants, and haves update the per-user model in real time — no retraining of the backbone.
  5. Future queries are more personal — The feedback loop refines the user's taste manifold. Windows evolve, tasteboards update, and explanations become more precise.
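The whole loop can be sketched end to end as a runnable toy. All embeddings are random stand-ins, and the offset-plus-weight scheme is an illustrative assumption about how personalization composes with retrieval.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(500, 128))  # pooled product embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def search(query: np.ndarray, offset: np.ndarray, k: int = 5) -> list[int]:
    """Steps 1-3: embed (assumed done), retrieve, rerank via the offset."""
    q = query + offset
    q = q / np.linalg.norm(q)
    return np.argsort(-(corpus @ q))[:k].tolist()

def react(offset: np.ndarray, product_idx: int, weight: float,
          lr: float = 0.2) -> np.ndarray:
    """Step 4: one sentiment nudges the per-user offset in real time."""
    return offset + lr * weight * corpus[product_idx]

query = rng.normal(size=128)
offset = np.zeros(128)
before = search(query, offset)
offset = react(offset, before[0], weight=3.0)  # a "have" on the top hit
after = search(query, offset)                  # step 5: future queries shift
```

Nothing in the loop touches `corpus`: the shared geometry stays fixed while the user's offset accumulates feedback.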

Next steps