👋 The Wake Up Call

Google dropped Gemma 4 this week.

Most people saw a model release. We saw a product: it sees images and understands speech. It reasons across long context. It runs on-device, meaning no server round trip, no latency, no cloud dependency. This unlocks one idea we have been watching for years.

An object-aware AI guide that can see what you see, hear what you ask, and answer like an expert instantly, in the physical world.

No app. No typing. No waiting.

And we know exactly where to build it first.

Here is what you can build.

You are standing in front of a 4th century BC sword.

It is behind glass. It has a name, a battle, a dynasty, a story. Someone died holding it.

The best we offer you is a wall label that says:

"Iron. 4th century BC. From the army of Alexander the Great."

That is it.

You pull out your phone. You open Google. You type something half-right. You get a Wikipedia page you won't finish. By the time you look up the moment is gone. Your kid is already in the next room. The wonder evaporated before you could catch it.

This happens 2.2 billion times a year.

That is how many museum visits happen globally. Every single one of them running on the same broken interface: tiny labels, static audio guides, and QR codes that open a PDF nobody reads.

Less than 8% of visitors use the existing audio guides. Not because people don't want to learn. Because the product makes learning feel like homework.

Kids lose interest in 30 seconds. Adults pretend to read. Nobody asks the question they actually want answered.

The problem is not the objects. The problem is the interface between wonder and knowledge.

Big tech built chat. They did not build context.

That is the opening.

🦄 The Idea Drop: ArtifactWhisperer

"Shazam for history. Point. Ask. Know."

The Problem:

Museums, heritage sites, temples, monuments, and galleries still run on dead interfaces.

The current stack forces the wrong workflow:

  • Open an app

  • Type a question

  • Upload a photo

  • Wait for an answer

  • Rephrase because it doesn't know this venue

  • Give up

That is not how curiosity works in the wild.

A tourist sees a sword, a mask, a coin, a painting. The thought is instant: "What is this? Why is it shaped like that? Was this actually used in battle?"

The current stack creates six steps between wonder and answer. Six steps kill curiosity every time.

The Solution:

Artifact Whisperer is a dead-simple mobile web experience.

Open one page. Camera turns on. Tap once. Ask out loud.

The product grabs the visual scene, the spoken question, the location context, and the venue's own knowledge base. Then it returns a tight, human explanation in the right tone for the moment.

No typing. No app download. No menus.

Here is how it works in four steps:

  1. Point - the camera sees the object. No photo upload required.

  2. Tap once - the mic opens. Ask anything, in any language.

  3. AI reasons - vision, voice, and venue catalog combined.

  4. Answer arrives - spoken and on-screen, in the mode you want.

Why This Is Different:

Most "AI for museums" tools give you a chatbot widget and call it innovation.

Artifact Whisperer is not a chatbot. It is a context engine.

  • Venue Memory - pulls from the museum's own catalog, so answers are specific to this object in this collection, not generic Wikipedia summaries

  • Story Modes - "Explain like I'm 10," "Expert mode," "Give me the 30-second version," "What's the myth behind this?" One tap to switch.

  • Any Language - ask in Hindi, answer in Hindi. Ask in Spanish, answer in Spanish. No language barrier, ever.

  • Zero Friction - browser-first, no app install, works on any phone already in your pocket

  • Follow-up questions feel natural because the AI holds context across the conversation, not just one question

The moat is not the model. The moat is venue partnerships.

The first team to sign 20 institutions owns the category. Generic AI cannot compete with an agent that knows this museum's exact catalog.

🚀 MVP Blueprint & Business Model

Before we dive in, remember: KriyaOS.com launches April 30th.

What Can You Build With KriyaOS?

A Restaurant Agent That Answers Every Table. With Zero Extra Staff.

A customer sits down. Scans the QR code on the table. WhatsApp opens. Your business agent is already there.

"What's your best seller?" answered instantly with your recommendation.

"Does the biryani have nuts?" pulled from the ingredient sheet you uploaded once.

No waiter running back to the kitchen. No allergy anxiety. No "let me check for you."

Back to the MVP blueprint for ArtifactWhisperer.

Will 35% of users ask more than one question in a session?

That is the signal. Not downloads. Not ratings. Repeat curiosity.

Here is the 4-week build:

Week 1 - Intelligence Layer

Frontend: Next.js or plain React PWA.

Camera via MediaStream API. Mic via Web Speech API. Both are native browser, no app required.

Model path: Start with a Gemma-family web-compatible model for the browser MVP. Treat full Gemma 4 as the upgrade unlock: it was built for advanced multimodal reasoning, long context, and agentic workflows, which makes this category suddenly feel buildable, not sci-fi.

Knowledge layer: venue catalog ingested as JSON or a lightweight vector store (Chroma, pgvector). RAG pulls the right exhibit context before the model generates the answer. No hallucinating exhibition dates.

// Core flow pseudocode
const frame    = captureFrame(videoStream)
const question = await recordVoice()           // Web Speech API
const context  = await retrieveVenueContext(venueId, frame)

const answer = await callModel({
  model:    'gemma-multimodal',
  image:    frame,
  question: question,
  context:  context,   // RAG-retrieved exhibit data
  mode:     'story'    // quick | story | kid | expert
})

speak(answer.text)     // TTS, hands-free
renderCard(answer)     // on-screen summary
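The retrieveVenueContext step in the pseudocode above can be sketched as a minimal retrieval function, assuming a simple JSON catalog. Keyword-overlap scoring stands in here for the real vector search (Chroma or pgvector embeddings), and the catalog field names are illustrative guesses, not a fixed schema:

```javascript
// Hypothetical venue catalog entries; a real one comes from the
// ingestion pipeline (PDFs, CSVs, exhibit notes).
const catalog = [
  { id: 'sword-01', title: 'Iron sword, 4th century BC',
    text: 'Iron sword attributed to the army of Alexander the Great.' },
  { id: 'coin-07', title: 'Silver tetradrachm',
    text: 'Silver coin minted under Alexander, struck in Amphipolis.' },
]

function tokenize(s) {
  return s.toLowerCase().match(/[a-z]+/g) || []
}

// Score each exhibit by how many query words appear in its title + text,
// then return the best matches as context for the model prompt.
function retrieveVenueContext(query, entries, topK = 1) {
  const qWords = new Set(tokenize(query))
  return entries
    .map(e => ({
      entry: e,
      score: tokenize(e.title + ' ' + e.text)
        .filter(w => qWords.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(s => s.entry)
}
```

Swapping this scorer for embedding similarity is the only change needed to graduate from prototype to the vector-store setup described above.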

Week 2 - Interaction Layer

The entire UI is camera + one button. Nothing else.

On tap: capture frame, start mic. On release: package image + transcribed audio + location → inference endpoint. TTS reads the response hands-free.

If typing appears anywhere in the flow, you have failed the interaction design test.
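The on-release packaging step can be sketched as one pure function. The field names and payload shape are assumptions for illustration, not a fixed API:

```javascript
// Bundle the captured frame, the transcript, and venue context into one
// request body for the inference endpoint. Rejecting empty transcripts
// keeps the "no typing fallback" rule honest.
function buildQueryPayload({ frameBase64, transcript, venueId, mode }) {
  if (!transcript || !transcript.trim()) {
    throw new Error('empty question: re-open the mic instead of sending')
  }
  return {
    venueId,
    mode: mode || 'story',   // quick | story | kid | expert
    question: transcript.trim(),
    image: frameBase64,      // JPEG frame from the canvas capture
    timestamp: Date.now(),
  }
}
```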

Week 3 - Venue Layer

Onboard one partner venue. Ingest their catalog — PDFs, CSVs, exhibit notes.

Build three response modes: Quick (30 sec), Story (narrative, 2 min), Kid (ages 6–10).
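The three modes can start life as nothing more than system-prompt prefixes. The wording below is illustrative, a sketch assuming prompt-level mode switching rather than separate models:

```javascript
// One prompt prefix per response mode; keys mirror the build plan.
const MODES = {
  quick: 'Answer in at most two sentences. Facts only.',
  story: 'Tell it as a 2-minute narrative with one vivid detail.',
  kid:   'Explain for a curious 8-year-old. Short words, one fun fact.',
}

// Assemble the final prompt: style instruction, then RAG context,
// then the visitor's spoken question. Unknown modes fall back to quick.
function buildPrompt(mode, exhibitContext, question) {
  const style = MODES[mode] || MODES.quick
  return `${style}\n\nExhibit context:\n${exhibitContext}\n\nVisitor asks: ${question}`
}
```

One tap to switch a mode then just swaps the prefix; no retraining, no separate endpoints.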

Run a 50-person pilot. Watch session recordings. This week is where the product lives or dies.

Week 4 - Validate

Track: session length, questions per session, mode switches, drop-off point.

The curiosity heatmap — which objects got the most questions — is also your first sales asset for the next venue. "Here is what 200 visitors actually wanted to know about your collection" is a conversation no museum director can ignore.
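The heatmap itself is a simple aggregation over session events. The event shape ({ exhibitId, question }) is an assumption about how the interaction layer logs queries:

```javascript
// Count questions per exhibit and sort descending, so the most-asked-about
// objects lead the report handed to the next venue.
function curiosityHeatmap(events) {
  const counts = {}
  for (const { exhibitId } of events) {
    counts[exhibitId] = (counts[exhibitId] || 0) + 1
  }
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .map(([exhibitId, questions]) => ({ exhibitId, questions }))
}
```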

Budget: Lean if you keep it narrow. Hosting is light. Main costs are model runtime and TTS. A bootstrapped team can reach the pilot in 4 weeks.

Success metric: 35% of users ask more than one question. That is the real signal. Everything else is vanity.
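The headline metric is one line of arithmetic over session logs. The session shape ({ questions: n }) is an assumption:

```javascript
// Share of sessions where the visitor asked more than one question.
// This is the repeat-curiosity rate the 35% target refers to.
function repeatCuriosityRate(sessions) {
  if (sessions.length === 0) return 0
  const repeat = sessions.filter(s => s.questions > 1).length
  return repeat / sessions.length
}
```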

💰 Business Model

Charge venues, not visitors. Visitors get the experience free. Venues pay for the platform.

Venue SaaS - $299/month per location. Unlimited visitor queries. Branded experience. Catalog ingestion. Up to 3 story modes. Analytics dashboard.

Premium Analytics - +$149/month. Curiosity heatmaps by exhibit. Most-asked questions by language. Dwell time vs engagement data. Export-ready reports for grant applications and board decks.

The analytics add-on is the sleeper product.

Museum directors need to prove their exhibits engage visitors - for funding, for grants, for board reports. You are not selling them an AI guide. You are selling them the evidence that their collection is worth expanding.

That is a completely different conversation.

10 venues × $299/month = $35,880 ARR before you write a single investor email.
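The revenue math above, spelled out as venues times monthly price times twelve:

```javascript
// Annual recurring revenue from flat monthly venue subscriptions.
function projectedARR(venues, monthlyPrice) {
  return venues * monthlyPrice * 12
}
// projectedARR(10, 299) → 35880
```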

Ask for the MVP @ NexTribes.

🧠 Founder Lesson: "Friction hides demand"

A lot of markets look small only because the current product is annoying.

People did not "want voice notes" until sending one was easier than typing.

People did not "want stories on maps" until travel apps got visual.

People do not hate museum learning. They hate the interface tax.

Artifact Whisperer wins if it removes the tax entirely.

It fails if it becomes another AI demo where users have to frame the object perfectly, repeat themselves, and read a wall of text for an answer.

The rule is simple: curiosity is instant, so the product has to be instant too.

Quick Tips

  • The Tool: MediaPipe + Google AI Edge docs for fast on-device prototyping before you need a backend

  • The Real Enemy: Not one app. The enemy is the museum audio guide industry - a $3B market with zero meaningful innovation since the Walkman

  • The Book: The Design of Everyday Things by Don Norman - because this product lives or dies on interaction design, not model quality

Bye Bye.
