Core Concepts
The foundational ideas behind TasteBrain: a shared embedding space, lightweight per-user models, living data, and grounded language.
The embedding space
TasteBrain is a transformer-based, multimodal model derived from Qwen-Omni and trained at scale on ~20M curated product images across ~7M products. Fine-tuning targets the attention layers, and a custom projector maps per-token hidden states into two representations:
Multi-vector embedding
A structured 128-dimensional space where each dimension encodes an aesthetic axis — style, material, form factor, color palette, brand affinity. Per-token hidden states are projected individually, preserving fine-grained spatial and semantic detail.
Pooled embedding
A single vector summarizing the overall aesthetic signature. Used for fast approximate nearest-neighbor lookups, top-k candidate retrieval, and situations where a compact representation is sufficient.
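The pooled path can be pictured as plain cosine retrieval over unit-normalized 128-d vectors. A minimal sketch under that assumption, using random stand-in data for the corpus (all names here are illustrative, not part of the API):

```python
import numpy as np

DIM = 128  # dimensionality of the shared embedding space

def normalize(v: np.ndarray) -> np.ndarray:
    """Unit-normalize so that a dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest products by cosine similarity."""
    scores = corpus @ normalize(query)   # shape: (n_products,)
    return np.argsort(-scores)[:k]

# Toy corpus: 1,000 products with random stand-in pooled embeddings.
rng = np.random.default_rng(0)
corpus = np.stack([normalize(v) for v in rng.normal(size=(1000, DIM))])
query = rng.normal(size=DIM)
print(top_k(query, corpus))  # indices of the 5 aesthetically closest products
```

In production, exact scoring over ~7M products would be replaced by an approximate nearest-neighbor index; the geometry of the lookup is the same.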
This shared embedding space represents the global geometry of taste. Styles, brands, materials, form factors, and aesthetic signals form coherent regions. Proximity in this space means aesthetic similarity — regardless of whether the inputs are images, text, audio, or products.
Example
A user uploads a photo of a cafe interior with exposed brick and warm lighting. TasteBrain projects this image into the embedding space. Products that land in the same region — industrial pendant lamps, reclaimed wood furniture, ceramic mugs with matte glazes — are returned as results. The search crosses product categories, retailers, and price points because similarity is computed in aesthetic space, not catalog metadata.
Cross-modal queries
Because every modality is projected into the same 128-dimensional space, you can combine inputs freely:
- Image → products — "find me products that look like this"
- Text → products — "minimalist Scandinavian desk lamp"
- Image + text → products — "like this room, but in warm earth tones"
- Product → products — "more like this" from a product handle
- Audio → products — map sonic aesthetics to visual products
Multiple query inputs are combined in embedding space before retrieval, allowing nuanced multi-signal searches that no single modality could express alone.
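The combination step is not specified publicly; one plausible sketch is a weighted mean of the per-modality embeddings, re-normalized before retrieval. The weights and variable names below are assumptions for illustration:

```python
import numpy as np

def combine(embeddings, weights=None) -> np.ndarray:
    """Blend multiple query embeddings (image, text, ...) into one anchor.

    Each input is assumed to already live in the shared 128-d space; the
    blend is a weighted mean, re-normalized for cosine retrieval.
    """
    embeddings = np.stack(embeddings)
    if weights is None:
        weights = np.ones(len(embeddings))
    blended = np.average(embeddings, axis=0, weights=weights)
    return blended / np.linalg.norm(blended)

# "Like this room, but in warm earth tones": weight the text hint lower
# than the image (the 0.7/0.3 split is a made-up example).
room_img = np.random.default_rng(1).normal(size=128)
earth_tones_txt = np.random.default_rng(2).normal(size=128)
anchor = combine([room_img, earth_tones_txt], weights=[0.7, 0.3])
```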
Per-user taste models
On top of the shared embedding space, TasteBrain learns lightweight per-user models that locate each individual's personal taste manifold within the shared geometry. These models capture:
- Attraction — regions of the embedding space the user gravitates toward
- Aversion — regions the user consistently rejects
- Drift — how the user's taste evolves over time
Crucially, per-user models do not retrain the backbone. They operate as learned offsets on top of the shared geometry, which means:
No fragmentation
Every user shares the same base model. The global geometry remains coherent regardless of how many users are active.
Real-time adaptation
Per-user models update from sentiment signals (likes, dislikes, saves) without offline retraining. Preferences take effect immediately.
Lightweight
Each user model is a small learned structure — not a copy of the backbone. This scales to millions of users.
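One way to picture a learned offset, as a sketch only (the real per-user structure is not public): attraction and aversion as centroids in the shared space, blended into the base retrieval scores without touching the backbone.

```python
import numpy as np

class UserTasteModel:
    """Illustrative per-user offset over the shared 128-d space.

    'attraction' and 'aversion' are centroids; reranking blends them
    into the base scores while the shared embeddings stay untouched.
    """

    def __init__(self, dim: int = 128):
        self.attraction = np.zeros(dim)
        self.aversion = np.zeros(dim)

    def rerank(self, candidates: np.ndarray, base_scores: np.ndarray,
               alpha: float = 0.3) -> np.ndarray:
        """Return candidate indices ordered by personalized score."""
        personal = candidates @ (self.attraction - self.aversion)
        return np.argsort(-(base_scores + alpha * personal))

# A fresh user has zero offsets, so ranking falls back to the shared
# scores alone, mirroring a request made without a user_id.
model = UserTasteModel()
rng = np.random.default_rng(0)
cands = rng.normal(size=(20, 128))
base = rng.normal(size=20)
order = model.rerank(cands, base)
```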
In the API, personalization is activated simply by passing a user_id to the Prism search endpoints. Without a user_id, results come from the shared space alone (equivalent to using the Shopkeep API).
Crawling infrastructure
The embedding space is only as useful as the data it covers. Bestomer operates proprietary crawling infrastructure with two key properties:
Weekly revisits
Every indexed retail domain is revisited weekly. This tracks catalog changes — new arrivals, discontinued items, price updates, seasonal rotations — so the product corpus stays current. Stale data degrades aesthetic relevance, so freshness is a first-class concern.
Cross-domain coverage
The system goes beyond retail products:
| Domain | What it enables |
|---|---|
| Products & brands | Core product search and brand matching across ~7M items |
| Restaurants | Map dining aesthetics to product preferences — atmosphere, plating, interior design |
| Music | Cross-modal mapping from sonic aesthetics to visual products |
| Reviews | Extract aesthetic language from natural-language opinions |
| Cultural sources | Art, architecture, fashion editorials, trend reports — the broader context that shapes taste |
This breadth is what allows taste to be modeled as a living, evolving system rather than a static product catalog. When a new design trend emerges in architecture, TasteBrain can connect it to products, restaurants, and music that share the same aesthetic signal — before anyone writes a tag or a label.
Sentiments
Sentiments are user reactions to products that train the per-user taste model. Each sentiment is a data point that adjusts the user's personal manifold:
Like
Positive aesthetic preference. Pulls the user's manifold toward this region.
Dislike
Negative signal. Pushes the manifold away from this region.
Want
Strong positive signal. Weighted more heavily than a like.
Have
Strongest signal. The user owns this — definitive taste evidence.
Sentiments are recorded in the Level 2: Persona Services tier and fed back into the per-user model in real time. They're the primary training signal for personalization.
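The relative weighting described above can be sketched as a simple in-place update. The numeric weights and learning rate are illustrative assumptions; the actual weighting is internal to TasteBrain:

```python
import numpy as np

# Illustrative weights: 'have' outweighs 'want', which outweighs 'like';
# 'dislike' is negative.
SENTIMENT_WEIGHTS = {"like": 1.0, "want": 2.0, "have": 3.0, "dislike": -1.0}

def apply_sentiment(attraction: np.ndarray, aversion: np.ndarray,
                    product: np.ndarray, sentiment: str,
                    lr: float = 0.1) -> None:
    """Nudge the user's attraction/aversion centroids toward or away
    from the product's region of the embedding space (in place)."""
    w = SENTIMENT_WEIGHTS[sentiment]
    if w > 0:
        attraction += lr * w * product
    else:
        aversion += lr * abs(w) * product

rng = np.random.default_rng(0)
attraction, aversion = np.zeros(128), np.zeros(128)
mug, sofa = rng.normal(size=128), rng.normal(size=128)
apply_sentiment(attraction, aversion, mug, "want")      # strong positive pull
apply_sentiment(attraction, aversion, sofa, "dislike")  # pushes manifold away
```

Because each update is a cheap vector operation rather than a training run, preferences can take effect immediately, as the real-time adaptation property requires.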
Captures
Captures are media inputs — images, screenshots, social media posts — that express aesthetic preference without words. When a user uploads or saves an image, it's projected into the embedding space and becomes a query anchor.
Sources
- Device uploads — photos from camera roll (via iOS integration or web upload)
- Social sync — saved posts from Instagram, Pinterest, TikTok
- Product screenshots — images of specific products the user wants to find
- Scene captures — interiors, environments, fashion looks, food presentations
At Level 1, captures are used as stateless query inputs. At Level 2, they're persisted and clustered to build richer user profiles over time.
Windows
Windows are personalized collections of products organized around an aesthetic theme. They exist at the Level 2 tier and can be:
- Auto-generated — the system clusters a user's sentiments and captures to create themed collections
- Manually curated — users add captures and refine with explicit sentiments
- Scoped — filtered by brand, retailer, or recipient (for gift-giving)
Each window represents a coherent region of the user's taste manifold. As the user interacts, windows evolve — new products appear, less relevant ones fade, and entirely new windows can emerge when the system detects a shift in the user's aesthetic interests.
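Auto-generation can be pictured as clustering the user's saved embeddings, with each cluster becoming one window. A toy k-means sketch under that assumption (the real clustering method is not documented):

```python
import numpy as np

def cluster_windows(embeddings: np.ndarray, n_windows: int = 3,
                    iters: int = 20, seed: int = 0) -> np.ndarray:
    """Toy k-means over a user's saved embeddings; each cluster is
    roughly one themed window."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), n_windows,
                                      replace=False)]
    for _ in range(iters):
        # assign each saved item to its nearest centroid
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n_windows):
            if np.any(labels == j):
                centroids[j] = embeddings[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(4)
# 30 captures/sentiments drawn from three loose aesthetic regions
saved = np.concatenate([rng.normal(c, 0.1, size=(10, 128))
                        for c in (0.0, 1.0, 2.0)])
labels = cluster_windows(saved, n_windows=3)
```

Re-running the clustering as new sentiments and captures arrive is one way windows could evolve and new ones emerge.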
Tasteboards
Tasteboards are visual summaries of a user's aesthetic DNA. Unlike windows (which are product collections), tasteboards are computed visualizations of the user's position in the embedding space — showing the aesthetic axes they occupy.
- User-facing — help users understand and share their own taste profile
- Brand-facing — show brands which aesthetic regions resonate with specific user segments
- Collaborative — share taste profiles with friends, stylists, or gift-givers
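Since each dimension of the space encodes an aesthetic axis, a tasteboard could be computed by reading a user's centroid along named axes. The axis-to-dimension mapping below is entirely hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical names for a few of the 128 dimensions (illustrative only).
AXES = {0: "style", 1: "material", 2: "form factor",
        3: "color palette", 4: "brand affinity"}

def tasteboard(user_centroid: np.ndarray, top_n: int = 3):
    """Summarize the strongest aesthetic axes in a user's taste profile."""
    ranked = np.argsort(-np.abs(user_centroid))
    named = [i for i in ranked if i in AXES][:top_n]
    return [(AXES[i], float(user_centroid[i])) for i in named]

centroid = np.zeros(128)
centroid[[1, 3]] = [0.8, -0.5]   # strong material signal, cool palette
print(tasteboard(centroid))
```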
Grounded language generation
The same multimodal backbone is being extended to generate natural-language explanations of why products match a query or a user's taste. Using VQA-style (Visual Question Answering) supervision, the system learns to produce grounded descriptions:
"This hand-thrown ceramic vase shares the same warm minimalism you're drawn to — the matte clay finish and asymmetric form echo the Scandinavian pieces you've saved, but with a Japanese wabi-sabi influence that's emerging in your recent captures."
Generated explanation grounded in both the user's taste manifold and the product's embedding coordinates.
This unifies representation, personalization, and explanation in a single system — the model doesn't just find relevant products, it can articulate the aesthetic reasoning.
How it all connects
Input is projected into the 128-dim embedding space
Images, text, audio, or product handles are all mapped to the same shared geometry by the Qwen-Omni backbone.
Nearest neighbors are retrieved from the product corpus
The pooled embedding enables fast top-k retrieval from ~7M products, refreshed weekly by the crawling infrastructure.
Per-user taste model reranks results
If a user_id is provided, the lightweight personal model adjusts ranking based on attraction and aversion signals.
User reacts with sentiments
Likes, dislikes, wants, and haves update the per-user model in real time — no retraining of the backbone.
Future queries are more personal
The feedback loop refines the user's taste manifold. Windows evolve, tasteboards update, and explanations become more precise.
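The five steps above can be strung together in one sketch. Everything here is stand-in data and assumed structure; in practice the backbone produces the embeddings and an ANN index serves retrieval:

```python
import numpy as np

DIM = 128
rng = np.random.default_rng(0)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# 1. Project the input into the shared space (stand-in: a random vector;
#    in practice the Qwen-Omni backbone produces this embedding).
query = normalize(rng.normal(size=DIM))

# 2. Retrieve top-k nearest neighbors from a toy product corpus.
corpus = np.stack([normalize(v) for v in rng.normal(size=(500, DIM))])
base_scores = corpus @ query
candidates = np.argsort(-base_scores)[:10]

# 3. Rerank with the per-user model (offsets start at zero for a new user).
attraction, aversion = np.zeros(DIM), np.zeros(DIM)
personal = corpus[candidates] @ (attraction - aversion)
ranked = candidates[np.argsort(-(base_scores[candidates] + 0.3 * personal))]

# 4. The user likes the top result; the sentiment updates the offsets
#    immediately, with no retraining of the backbone.
attraction += 0.1 * corpus[ranked[0]]

# 5. Future queries rerank against the updated attraction vector, so the
#    feedback loop makes later results more personal.
```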
Next steps
- Level 0: The Model — architecture details for the Qwen-Omni backbone
- Prism API — start making personalized searches
- Shopkeep API — non-personalized search endpoints
- Level 2: Persona Services — Windows, Sentiments, Tasteboards, and B-Shop