9 min read

Idea Sketch #5: A Language-Learning Layer for the Web

Idea Sketch #5: A Language-Learning Layer for the Web

Motif began with a very small request.

My partner was learning French for her semester and asked whether there was a local text-to-speech model she could use for pronunciation practice. Around the same time, I happened to come across Supertonic, an open-weight text-to-speech model designed for fast, on-device inference. When I tried it on my MacBook, the speech generated almost immediately and sounded surprisingly natural compared with the default clunky TTS audio personas macOS has.

I had already been building Cue, a desktop tool for collecting context from the screen and acting on it without interrupting what I was doing. So I spent about ten minutes experimenting with a small addition: select some text, press Cue’s shortcut, and have it read the selection aloud instead of collecting it as context.

It was simple, but it felt good. When we were reading a French article, a blog post, or part of a book, we could select a sentence and hear it immediately. Cue could already collect text from the screen, and the local model made the rest fast enough to feel like a direct interaction rather than a request being sent somewhere else.


A feature or a separate tool?

I built the experiment in a separate pull request, but I never merged it into Cue.

The interaction fit Cue’s general model: collect something from the current screen, ask for an action, and stay in the flow. “Ask” did not necessarily have to mean opening a conversation. It could mean reading, translating, explaining, summarizing, saving, or performing some other small action that was useful in the moment. At the same time, text-to-speech felt slightly outside Cue’s center of gravity.

Supporting it properly would eventually require more than adding another command. Cue would need some kind of extension system, settings for voices and languages, model management, and interfaces specialized for listening and language learning. I was also working on a more substantial search, bookmark, and retrieval system for Cue, which felt more closely connected to the original idea of a context-aware desktop companion.

Rather than make Cue increasingly broad, I decided to pull this particular interaction out and see what it could become on its own.


Browser extension v. system-wide app

The first question was what shape the new project should take.

A native desktop app would give me control over local models and system-level shortcuts, but it would reproduce much of the infrastructure I had already built for Cue. A standalone website would be easier to distribute, but it would make users copy text away from the page they were reading.

A browser extension felt like the lightest useful form.

Most of the material my partner was already using—news articles, videos, podcasts, Instagram posts, and other French-language media—lived in the browser. Instead of building another destination for studying French, Motif could add a small learning layer to the pages she already visited.

The project started with one interaction: Select some French text and press a shortcut to hear it.

The audio plays in a compact overlay beside the selected passage. There is no need to copy the sentence into a translator, open another application, or leave the page.

Although I am initially building and testing Motif for Chrome, the extension format also leaves room for running it in other Chromium-based browsers where the required APIs are supported.


Details around the implementations

Following the words

Once the basic text-to-speech interaction worked, I wanted the text to respond to the audio.

When Motif reads a sentence, each word is highlighted as it is spoken. This makes the relationship between spelling and pronunciation easier to follow, especially when the learner is still developing an ear for how written French contracts and flows in speech.

I initially assumed the TTS model might return timing information for each word. It does not. Supertonic predicts the overall duration needed for synthesis, but Motif does not receive reliable word boundaries from the model. Instead, it estimates them from the resulting audio.

The process is roughly:

  1. Split the text into words.
  2. Compute a smoothed energy envelope from the synthesized waveform.
  3. Detect the beginning and end of speech by trimming silence.
  4. Divide the spoken region into an initial set of proportional segments.
  5. Move each boundary toward a nearby low-energy point in the waveform.
  6. assign each word an estimated start and end time.

During playback, Motif compares the current audio time with those precomputed intervals and updates the highlighted word.

This is not true forced alignment, and it is not a learned or probabilistic model. It is a deterministic heuristic: a proportional split refined using valleys in the audio signal. It works reasonably well for many short passages, but it becomes less accurate when words have very different durations, when punctuation introduces unusual pauses, or when the speaker’s rhythm does not correspond neatly to the written tokens.

That limitation has become one of the more interesting technical questions in the project. A future version could use phoneme information, a dedicated alignment model, or timing data exposed directly by a different synthesis system. For now, the heuristic is fast, local, and good enough to make the text feel connected to the sound.


Beyond selectable text

Selecting text works well on articles, but not everything on the web is represented as selectable HTML.

A French phrase might appear inside a photograph, a scanned page, a video frame, an illustration, or an Instagram post. In those cases, the original interaction breaks down: there is nothing for the browser selection API to return.

I added a capture mode using Tesseract. When no text is selected, the same shortcut lets the user draw a rectangle over part of the page. Motif captures that region, runs French OCR in the browser, and sends the recognized text into the same pronunciation interface.

The important part is that OCR does not become a separate workflow. Whether the text begins as selectable HTML or as pixels inside an image, it eventually reaches the same overlay: listen to the passage, follow its words, tap a word, translate it, or save it.

That became a useful design principle for Motif: different forms of input should converge on the same small set of interactions.


Listening as well as reading

The next expansion came from thinking about audio.

A learner might understand most of an article but lose track of a sentence in a YouTube video, a podcast, or a radio program. Copying text is no longer possible because the text does not exist yet. Motif needs to produce it.

I found stt-web, an in-browser speech-to-text project from Idle Intelligence. Its current French-and-English model runs through a Rust, WebAssembly, and WebGPU stack, using the Burn machine-learning framework underneath. (Hugging Face)

The French transcription was accurate enough in my experiments that I built a live transcription mode around it.

The user can start transcription while a tab is playing French audio. Motif captures the tab’s audio and displays a rolling transcript over the page. The transcript can be paused, edited, translated, and used like any other French text surface in the extension. Individual words can be tapped to hear their pronunciation or saved for later.

Editing turned out to be important. Even a strong transcription model will occasionally misunderstand a name, contraction, or noisy passage. Rather than treat the generated transcript as authoritative, Motif lets the learner correct it. The transcript is both a model output and a working surface.

This shifted the project from “select and speak” toward something broader: reading and listening tools that remain attached to the source material.


Translation without another service

Translation appears throughout Motif.

It can provide an English gloss for a selected passage, translate a word before it is saved, or show a translation beside a live transcript. I did not want each of those interactions to require a request to a separate server.

Chrome’s built-in Translator API made this possible. On supported versions of Chrome, the browser can download and manage its own translation model, and translation happens before the text leaves the device. (Chrome for Developers)

This is one of the parts of the project that made me rethink what a browser application can contain. Motif now combines text-to-speech, speech recognition, OCR, translation, waveform analysis, and persistent vocabulary storage without a Motif backend performing the inference.

The browser is no longer only rendering the interface for an intelligent service somewhere else. Increasingly, it can be the place where the models themselves run.


From hearing to remembering

Pronunciation was the original motivation, but hearing a word once is only one part of learning it.

Motif now lets users save a word or phrase along with its English translation and the context in which it appeared: the surrounding sentence, the page title, and its source URL. When the same word appears again, another context can be attached to the existing entry.

The distinction matters to me. A word encountered while reading an article is not merely a French–English pair. Its surrounding sentence contains tone, grammar, subject matter, and the reason the learner noticed it in the first place.

The current library supports browsing, searching, notes, and a weekly flashcard view. I am also experimenting with more active forms of recall, such as hearing a saved word and typing what was spoken. Spaced repetition could eventually be added on top of the same collection.

The evolving structure is:

  • Hear it → Save it → Remember it.

This is also how I arrived at the name Motif. A motif is something recognizable that returns: a sound, form, phrase, or idea that gains meaning through repetition. Vocabulary begins to behave similarly as it reappears across articles, conversations, and recordings.

The product document describes Motif as a quiet, editorial tool for learning from the open web rather than a gamified language course. That framing—and the details of its shipped and planned features—helped shape this sketch.


Designing the landing page

The visual direction for Motif grew alongside the interaction design.

For the landing page, I borrowed a palette from one of Francisco Goya’s paintings. I liked the contrast between its muted greens, brown-reds, cool grays, ochres, and a small area of bright orange. It felt warmer and more editorial than the blue gradients and glowing panels commonly used for AI products.

The page uses those colors as a horizontal palette near the hero section. The selected word in “Francisco Goya” is highlighted in orange, echoing the way Motif highlights a word while reading.

I also used ChatGPT to generate some of the feature illustrations in a Goya-influenced visual language. The intention was not to imitate a particular painting literally, but to give the interface and its surrounding imagery the feeling of a reading tool with a distinct cultural texture—not another generic browser utility.

This became a nice reversal: the application uses models to help a learner interpret media, while the website uses an older visual vocabulary to explain the models.


Reflections

What “local” is for

Motif currently emphasizes on-device processing. After the initial model downloads, pronunciation, transcription, and OCR can run locally, while vocabulary remains in browser storage. The full product and technical stack are described in more detail in the current product document.

Every language-learning interaction doesn't needs to be local simply because privacy sounds like a good product claim. In many cases, a cloud model could be faster to initialize, smaller for the user to install, or maybe more accurate.

But language learning often overlaps with material that people may not want to upload indiscriminately: private messages, unpublished writing, paid or copyrighted textbooks, workplace documents, or personal browsing history. Local processing changes what the learner can comfortably use as study material.

It also makes Motif an experiment in a more general technical direction. WebGPU, WebAssembly, ONNX Runtime Web, and browser-native AI APIs are making the browser a credible inference environment. Projects such as stt-web demonstrate how substantial speech models can be delivered through this stack, while frameworks such as Burn are making portable Rust-based model execution increasingly practical. (Hugging Face)

Cloud inference could still become part of Motif. My upcoming work at Modular will naturally expose me to a different side of model serving, and it may be interesting to explore how a product can move between local and hosted models depending on hardware, latency, model size, and the sensitivity of the material.

The more interesting question is not whether everything should run locally or everything should run in the cloud. It is how an application can choose the appropriate place for each interaction without making the user think about infrastructure.


Where it is now

Motif has grown considerably beyond the ten-minute Cue experiment.

It can pronounce selected text, recognize text inside images, transcribe audio from a tab, translate passages, let users interact with individual words, and save vocabulary with its original context. The extension is now under review for the Chrome Web Store, and I am excited to have friends begin using it.

What I found most valuable, though, was seeing how much of the system could live inside the browser. Instead of bringing the web into a language-learning app, Motif brings a small language-learning environment into the web.