BhashaSetu is an open-source AI for two-way Indian Sign Language translation — built with and for the Deaf and hard-of-hearing community of India. We're early, working in public, and looking for collaborators.
Where is the nearest hospital?
नज़दीकी अस्पताल कहाँ है?
India is home to roughly 63 million people in the Deaf and hard-of-hearing community — yet has fewer than 300 certified ISL interpreters. Education, healthcare, government services, employment: most stay out of reach for want of someone who can sign.
BhashaSetu is a small attempt at a big gap: an open-source model that anyone can run on a phone or laptop, two-way, in real time. Not a replacement for human interpreters — a complement, where there are none.
It's easy to read "63 million people, <300 interpreters" as an abstraction. Here's what that means for an actual Deaf Indian, week to week.
of Deaf children aged 6–13 are out of school entirely. Most schools that do enrol them use oralist methods — lip-reading and forced speech — not ISL. Even where teachers want to sign, no formal ISL curriculum exists for them.
DHH out-of-school study · 360info, 2014
of disabled Indians live in rural areas, where the <300 certified ISL interpreters basically don't exist. Imagine explaining symptoms to a doctor without a shared language. Real-time translation isn't a luxury; it's informed consent.
Census of India · 2011
of working-age Deaf adults are in marginal or informal work. When school is inaccessible, written communication is shaky, and an interview interpreter is unavailable, the disadvantage compounds across families and generations.
Census of India · 2011
ISL is a complete, grammatically distinct language, and the reason most "sign translators" fail in India is that they treat it as one-to-one word substitution. Four things any serious ISL model has to get right:
ISL fingerspells with both hands. ASL and most Western sign languages use one. Models trained on Western data don't transfer cleanly — the morphology is genuinely different.
"I rice eat," not "I eat rice." A word-for-word translation produces gibberish. ISL → text requires a real grammatical reordering step, not a lookup table.
Raised eyebrows mark yes/no questions. A head-shake makes a clause negative. Mouth-shape carries intensity. Ignore non-manuals and you lose half the meaning.
ISL is shared pan-India, but vocabulary varies state-to-state — Mumbai-ISL and Kolkata-ISL share grammar, differ in signs. Fine-tunes for regional dialects are not optional.
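To make the reordering point concrete, here is a toy sketch, not BhashaSetu's method. It assumes each gloss token already arrives tagged with a grammatical role (a hypothetical input format), and it only rearranges subject-object-verb into English subject-verb-object order. The function name `reorderSovToSvo` is illustrative; a real ISL-to-text system would need a learned model rather than a fixed rule.

```javascript
// Toy illustration only: real ISL-to-text needs a learned model.
// Each gloss token is assumed to arrive pre-tagged with a grammatical role.
const ENGLISH_ORDER = ["subject", "verb", "object"];

function reorderSovToSvo(taggedGlosses) {
  // Array.prototype.sort is stable, so tokens sharing a role keep their order.
  // Copy first so the signed order is left untouched.
  return [...taggedGlosses].sort(
    (a, b) => ENGLISH_ORDER.indexOf(a.role) - ENGLISH_ORDER.indexOf(b.role)
  );
}

// ISL order: "I rice eat"
const signed = [
  { gloss: "I", role: "subject" },
  { gloss: "RICE", role: "object" },
  { gloss: "EAT", role: "verb" },
];

const reordered = reorderSovToSvo(signed).map((t) => t.gloss.toLowerCase());
console.log(reordered.join(" ")); // i eat rice
```

Even this tiny case shows why a lookup table is not enough: the output depends on the role of each sign in the sentence, not only on the sign itself.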
Commercial sign-language tools today are ASL-focused, closed, and priced out of reach for Indian schools, clinics, and panchayat offices. They treat the model as a moat. That doesn't fit the shape of this problem: a country with one shared sign language, dozens of regional dialects, and a Deaf community that has been talked at for decades.
Camera data of a Deaf signer is among the most sensitive there is. It can't sit on a US-based server, behind a paywall, with terms of service no one read. The weights, the code, the data recordings, and the eval suite all have to be open — under permissive licenses, auditable, fork-able, and shaped by the people the system claims to serve.
A note on framing. "Nothing about us, without us," the disability-rights principle, is how we operate. Deaf signers, ISL educators, and accessibility researchers shape every release, starting before the first line of code.
A single model that handles continuous signing in one direction and drives a 3D avatar in the other — running fully on-device so the camera feed never leaves the phone. Here's the design we're working toward.
Most published ISL models classify one isolated sign at a time. We're working toward continuous-sentence recognition that preserves grammar, fingerspelling, and non-manual markers (eyebrows, mouth-shape, head-tilt), because that's how ISL actually works.
Type or speak in Hindi or English. A retargeted 3D avatar performs the corresponding ISL with natural transitions, so a hearing speaker and a Deaf signer can hold a real conversation.
The model targets browser inference via WebGPU and Android via NNAPI. Your camera feed stays on your device — nothing is uploaded, no account needed.
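As a rough sketch of how a page might decide whether the WebGPU path is usable, the helper below checks for `navigator.gpu` and asks for an adapter. The function name `pickBackend` and the `"wasm"` fallback label are assumptions for illustration, not confirmed SDK options, and the Android/NNAPI path is not covered here.

```javascript
// Hypothetical helper: pick an inference backend from what the browser exposes.
// `nav` is passed in (normally `navigator`) so the function can be exercised
// outside a browser as well.
async function pickBackend(nav) {
  // navigator.gpu is the WebGPU entry point; requestAdapter() resolves to
  // null when no suitable GPU adapter is available.
  if (nav && nav.gpu) {
    try {
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch (_) {
      // Adapter request failed; fall through to the fallback below.
    }
  }
  return "wasm"; // assumed fallback name, not a confirmed SDK option
}

// Usage: works in a browser, and falls back cleanly where `navigator` is absent.
pickBackend(typeof navigator !== "undefined" ? navigator : undefined)
  .then((backend) => console.log("backend:", backend));
```

Neither choice involves a network call, which matches the on-device goal: the decision is made entirely from local capabilities.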
Public ISL datasets are small — INCLUDE has 263 signs, CISLR has 4,765 with very few samples each. We start from these and grow a community-collected corpus, with consent, from real signers.
Designed alongside Deaf and hard-of-hearing collaborators — captions everywhere, haptic cues, high-contrast theming, large hit targets, full keyboard navigation. WCAG 2.2 AA is the floor, not the ceiling.
Permissive licenses across the board. Fine-tune for your school, your clinic, your state. Commercial use is fine. No CLAs.
Camera frames go in. Translated speech and a signing avatar come out. Every stage is swappable and benchmarked.
Hand, body and face landmarks per frame using MediaPipe Holistic: 21 per hand + 33 body + 468 face points, batched on GPU.
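The arithmetic behind the landmark stage can be sketched as follows. Counting both hands, the figures quoted above give 543 points per frame. The frame shape used here (`leftHand`, `rightHand`, `pose`, `face` arrays of `{x, y, z}`) and the choice to keep only three coordinates per point are assumptions for illustration, not the project's actual feature layout.

```javascript
// Landmark counts per frame, using the figures quoted for MediaPipe Holistic.
const LANDMARKS = { leftHand: 21, rightHand: 21, pose: 33, face: 468 };
const COORDS_PER_POINT = 3; // x, y, z (assumption: visibility scores are dropped)

const pointsPerFrame = Object.values(LANDMARKS).reduce((a, b) => a + b, 0);
const floatsPerFrame = pointsPerFrame * COORDS_PER_POINT;

// Flatten one frame into a fixed-length vector. Missing parts (for example an
// occluded hand) stay zero-filled so every frame has the same shape.
function flattenFrame(frame) {
  const out = new Float32Array(floatsPerFrame);
  let offset = 0;
  for (const [part, count] of Object.entries(LANDMARKS)) {
    const points = frame[part] || [];
    for (let i = 0; i < count; i++) {
      const p = points[i];
      if (p) {
        out[offset] = p.x;
        out[offset + 1] = p.y;
        out[offset + 2] = p.z;
      }
      offset += COORDS_PER_POINT;
    }
  }
  return out;
}

console.log(pointsPerFrame, floatsPerFrame); // 543 1629
```

A fixed-length vector per frame keeps the downstream stages simple, since a hand leaving the camera view does not change the input shape.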
mediapipe · 30 fps
An ST-GCN learns sign morphology over a sliding window, fusing manual + non-manual features (eyebrows, mouth).
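For readers unfamiliar with the sliding-window idea, a minimal sketch follows. It only shows how overlapping windows are cut from a frame sequence; the window size and stride are placeholder values, not BhashaSetu's settings, and the ST-GCN itself is not modelled.

```javascript
// Yield overlapping windows of frames. `size` and `stride` are illustrative
// values, not the project's actual settings.
function* slidingWindows(frames, size, stride) {
  if (size <= 0 || stride <= 0) {
    throw new RangeError("size and stride must be positive");
  }
  // Only full windows are produced; a trailing partial window is skipped.
  for (let start = 0; start + size <= frames.length; start += stride) {
    yield frames.slice(start, start + size);
  }
}

// Eight frames, windows of four frames, advancing two frames at a time.
const frames = [0, 1, 2, 3, 4, 5, 6, 7];
const windows = [...slidingWindows(frames, 4, 2)];
console.log(windows.length); // 3
```

Because consecutive windows overlap, a sign that straddles a window boundary is still seen whole in a neighbouring window.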
st-gcn
A transformer-CTC head emits gloss sequences; a small LM reorders ISL's SOV grammar into natural Hindi, English, or a regional language.
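The CTC part of this stage can be illustrated with standard greedy decoding: pick the best label per frame, merge consecutive repeats, then drop the blank symbol. This is a generic sketch of the technique, not the project's decoder. The vocabulary, the blank index of 0, and the function name `ctcGreedyDecode` are assumptions.

```javascript
// Greedy CTC decoding: take the best label per frame, merge consecutive
// repeats, then drop the blank symbol.
const BLANK = 0; // assumption: index 0 is the CTC blank

function ctcGreedyDecode(frameScores, vocab) {
  const glosses = [];
  let previous = BLANK;
  for (const scores of frameScores) {
    // Argmax over the label scores for this frame.
    let best = 0;
    for (let i = 1; i < scores.length; i++) {
      if (scores[i] > scores[best]) best = i;
    }
    // Emit only when the label is not blank and differs from the last frame.
    if (best !== BLANK && best !== previous) glosses.push(vocab[best]);
    previous = best;
  }
  return glosses;
}

// Hypothetical vocabulary and per-frame scores for "I RICE EAT".
const vocab = ["<blank>", "I", "RICE", "EAT"];
const oneHot = (i) => vocab.map((_, j) => (j === i ? 1 : 0));
const frameScores = [1, 1, 0, 2, 2, 0, 3].map(oneHot);

console.log(ctcGreedyDecode(frameScores, vocab).join(" ")); // I RICE EAT
```

The blank matters for repeated signs: a sequence of labels `I, blank, I` decodes to two separate glosses, whereas `I, I` collapses to one.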
transformer-ctc
Back-translation pairs gloss with multilingual text using a seq2seq fine-tuned on parallel ISL↔text data we're curating.
mT5 · small
Gloss sequences drive a rigged 3D signer with smoothed inverse kinematics and non-manual blending.
three.js · webgpu
When BhashaSetu ships, here's how you'll consume it: a thin JS SDK for the browser, a Python package for notebooks and servers, and an open model on Hugging Face. Install paths go live with v0.1.
Drop-in browser SDK with WebGPU runtime, a web component for the signing avatar, and an event-driven translation stream.
For notebooks, servers, and batch processing. Same API as the JS SDK, ONNX runtime under the hood, CPU and CUDA backends.
Weights, model card, eval results, and a Spaces demo. Apache-2.0 licensed, fully fine-tunable, no gating.
A preview of what the SDK will look like — fixed early so contributors can build against it.
Available via npm for web, and pip for Python notebooks & servers.
The SDK requests webcam permission once. Frames are processed entirely on-device.
Listen for onTranslation events — get gloss, text, and confidence per utterance.
Drop in the <setu-avatar> web component for two-way conversations.
// install: npm i @bhashasetu/web (coming soon)
import { Setu } from "@bhashasetu/web";

// load the model and choose the output language and runtime
const setu = await Setu.load({
  model: "setu-isl-base",
  target: "hi", // output language
  backend: "webgpu",
});

// sign → text: start reading frames from the camera element
await setu.start(document.getElementById("cam"));
setu.onTranslation(({ text, gloss, conf }) => {
  console.log(gloss, "→", text, `(${conf.toFixed(2)})`);
});

// the other direction: text → signing avatar
const avatar = setu.avatar("#stage");
await avatar.say("नमस्ते, आप कैसे हैं?"); // "Hello, how are you?"

Translation models can fail in ways that aren't obvious to hearing developers. We're partnering with Deaf signers, ISL educators, and accessibility orgs to review every release, starting from datasets and continuing through UI, error states, and the model card itself.
Where we are, and where we're going. Honest about the order — data first, model second, polish last.
Survey existing ISL datasets (INCLUDE, CISLR, the ISLRTC dictionary), define gloss vocabulary, draft model card & consent protocols.
Isolated sign recognition on existing public data. Browser SDK with the API above. Honest about what doesn't work yet.
Transformer-CTC head, sliding-window inference, real-time streaming translations.
The other direction. Rigged 3D signer driven by gloss sequences from a multilingual seq2seq.
Fine-tunes for state-level ISL variations, mobile-first runtime, classroom & clinic deployment kits.
We're looking for ML engineers, ISL signers, accessibility researchers, and anyone who wants to help. Repo and community channels go public with v0.1.