Giving Sensors a Voice: Multimodal JEPA for Time Series

At a glance

Problem: Sensor streams are abundant but their embeddings are opaque and modality-siloed, hard to interpret or query semantically.

Key idea: Combine masked latent prediction over sensor windows with alignment to a semantic modality, giving sensors a queryable, meaning-bearing embedding.

Modality: Multimodal sensor time series (aligned to semantics, e.g. text)

Target / masking: Temporal window per sensor channel; masked-segment and cross-modal latents under an EMA target.

Builds on: JEPA latent prediction; cross-modal semantic alignment.

Used for: Semantic retrieval, zero-shot recognition, cross-modal querying over sensor data.

Motivation

Sensor streams (accelerometer, physiological, and environmental signals) are abundant, but their learned embeddings are typically opaque and modality-siloed: a vector of numbers that means little on its own and cannot be queried in human terms. This work aims to give sensors a semantic representation by aligning their dynamics with meaning-bearing modalities such as text, so that the resulting embeddings are both predictive of sensor behavior and interpretable, effectively giving sensors a shared "voice" with language-like descriptions of activities and states.

How it works

Canonical JEPA schematic for time series. The input is split into a visible context and hidden targets (window-level block masking). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the distance between predicted and target latents. A local/generative loss runs alongside latent prediction (hybrid objective).

Building on the JEPA template, the model combines self-supervised prediction with cross-modal alignment.

A context encoder $f_\theta$ embeds windows of (possibly multimodal) sensor time series; the masking unit is a temporal window per sensor channel.
A predictor $g_\phi$ takes the context embedding and predicts the latents of the masked segments or of the paired semantic modality.
An EMA target encoder $\bar f_\theta$ supplies the targets, with gradients stopped.
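
The masking and EMA steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `sample_window_mask` and `ema_update`, the choice of one contiguous masked block per channel, and the momentum value `tau` are all assumptions made for the example.

```python
import numpy as np

def sample_window_mask(n_channels, n_windows, mask_len, rng):
    """Return a boolean mask of shape (n_channels, n_windows).

    True marks a hidden target window. One contiguous block of
    `mask_len` windows is hidden per channel (an assumption of this sketch).
    """
    mask = np.zeros((n_channels, n_windows), dtype=bool)
    # Pick a start index per channel so the block fits inside the sequence.
    starts = rng.integers(0, n_windows - mask_len + 1, size=n_channels)
    for c, s in enumerate(starts):
        mask[c, s:s + mask_len] = True
    return mask

def ema_update(target_params, online_params, tau=0.99):
    """EMA step for the target encoder: t <- tau * t + (1 - tau) * o.

    Parameters are plain dicts of arrays here; no gradient flows through
    the target copy because it is only ever updated by this rule.
    """
    return {k: tau * target_params[k] + (1.0 - tau) * online_params[k]
            for k in target_params}

rng = np.random.default_rng(0)
mask = sample_window_mask(n_channels=3, n_windows=10, mask_len=4, rng=rng)

# Toy parameters (hypothetical) to show a single EMA step.
online = {"W": np.ones((2, 2))}
target = {"W": np.zeros((2, 2))}
target = ema_update(target, online, tau=0.9)
```

In this sketch the context encoder would see only the windows where `mask` is False, while the target encoder's latents at the True positions serve as regression targets.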

The multimodal aspect aligns the sensor latent space with a semantic target space — e.g. language-derived embeddings of activities or states — so prediction targets carry semantic content rather than only raw temporal structure. The model thus learns both the dynamics of the signal and its mapping into meaning.

The objective

The objective combines a within-signal latent regression over masked windows with a cross-modal alignment term toward the semantic embedding $s$ of the segment:

$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}, m) - \operatorname{sg}\big[\bar f_\theta(x)_m\big]\,\big\rVert_2^2 \;+\; \lambda\,\big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}[\,s\,]\,\big\rVert_2^2,$$

where $\operatorname{sg}$ denotes the stop-gradient, $\bar f_\theta(x)_m$ are the EMA target latents at the masked positions $m$, and $\lambda$ balances prediction against semantic alignment. The first term captures sensor dynamics; the second grounds the latent space in a semantic modality, making it interpretable and queryable.
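
To make the two terms concrete, the following sketch evaluates the objective on precomputed arrays. It assumes the predictor outputs and targets are already available as NumPy arrays; the function name `jepa_semantic_loss`, the argument names, and the toy values are hypothetical. Plain NumPy has no autodiff, so the stop-gradient is implicit: the targets are simply treated as constants.

```python
import numpy as np

def jepa_semantic_loss(pred_masked, target_masked, pred_semantic, s, lam=0.5):
    """Masked latent regression plus lambda-weighted semantic alignment.

    pred_masked   : (n_masked, d) predictor output g_phi(z_ctx, m)
    target_masked : (n_masked, d) EMA target latents at the masked positions
    pred_semantic : (d,)          predictor output g_phi(z_ctx)
    s             : (d,)          semantic embedding of the segment
    In an autodiff framework, target_masked and s would be wrapped in a
    stop-gradient; here they are plain constants.
    """
    pred_term = np.sum((pred_masked - target_masked) ** 2)   # squared L2 norm
    align_term = np.sum((pred_semantic - s) ** 2)            # squared L2 norm
    return pred_term + lam * align_term, pred_term, align_term

# Toy inputs: 2 masked windows, latent dimension 3.
pred_masked = np.zeros((2, 3))
target_masked = np.ones((2, 3))
pred_semantic = np.array([1.0, 0.0, 0.0])
s = np.zeros(3)

total, pred_term, align_term = jepa_semantic_loss(
    pred_masked, target_masked, pred_semantic, s, lam=0.5
)
```

Returning the two terms separately makes it easy to monitor how $\lambda$ shifts the balance between them during training.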

Key results & what's novel

The method extends JEPA to multimodal sensor analytics. Combining masked latent prediction over sensor windows with alignment to a semantic modality yields embeddings that are simultaneously predictive of sensor dynamics and interpretable in semantic terms — sensors gain a shared "voice" with language-like descriptions. This enables semantic retrieval, zero-shot recognition of activities or states, and cross-modal querying over time-series data, illustrating how joint-embedding prediction can bridge low-level sensor signals and high-level semantics rather than producing opaque, task-specific features.
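
As an illustration of how zero-shot recognition could work once sensor and semantic embeddings share a space, the sketch below assigns each sensor segment the label whose embedding is nearest. Cosine similarity is an assumption here (the source does not specify the similarity measure), and the label names and 2-D embeddings are toy values invented for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def zero_shot_labels(sensor_emb, label_emb, label_names):
    """Assign each sensor embedding the label with the highest cosine similarity.

    sensor_emb : (n_segments, d) embeddings of sensor segments
    label_emb  : (n_labels, d)   embeddings of label descriptions
    """
    sims = l2_normalize(sensor_emb) @ l2_normalize(label_emb).T
    return [label_names[i] for i in sims.argmax(axis=1)], sims

# Hypothetical labels and toy embeddings.
label_names = ["walking", "sitting"]
label_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
sensor_emb = np.array([[0.9, 0.1], [0.2, 0.8]])

pred, sims = zero_shot_labels(sensor_emb, label_emb, label_names)
```

The same similarity matrix can serve semantic retrieval by ranking sensor segments per label (sorting columns) instead of taking the best label per segment.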

Strengths & limitations

+ Produces interpretable, queryable sensor embeddings grounded in a semantic modality.
+ Enables semantic retrieval, zero-shot recognition, and cross-modal querying.
+ Unifies within-signal prediction with cross-modal alignment in one objective.
− Requires paired sensor-semantic data (e.g. labeled or described segments) for the alignment term.
− Semantic quality is bounded by the chosen language/semantic embedding space.
− Balancing the prediction and alignment terms ($\lambda$) is a sensitive design choice.

Connections & references

Builds on: I-JEPA Joint Embeddings Temporal

Paper ↗