Motivation
Sensor streams (accelerometer, physiological, and environmental signals) are abundant, but their learned embeddings are typically opaque and modality-siloed: vectors that mean little on their own and cannot be queried in human terms. This work aims to give sensors a semantic representation by aligning their dynamics with meaning-bearing modalities such as text, so that the resulting embeddings are both predictive of sensor behavior and interpretable. In effect, sensors gain a shared "voice" with language-like descriptions of activities and states.
How it works
Building on the JEPA template, the model combines self-supervised prediction with cross-modal alignment.
- A context encoder $f_\theta$ embeds windows of (possibly multimodal) sensor time series; the masking unit is a temporal window per sensor channel.
- A predictor $g_\phi$ maps the context latent to the latents of masked segments, or to cross-modal (semantic) targets.
- An exponential-moving-average (EMA) target encoder $\bar f_\theta$ supplies the targets, with gradients stopped.
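The masking unit and the EMA target update described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mask_windows` and `ema_update`, the window length, the mask ratio, the decay value, and the choice to zero out masked samples are all hypothetical.

```python
import numpy as np

def mask_windows(x, window, mask_ratio, rng):
    """Boolean mask of shape (channels, time) hiding whole windows per channel.

    Assumes x has shape (channels, time) with time divisible by `window`.
    """
    channels, time = x.shape
    n_windows = time // window
    n_masked = max(1, int(round(mask_ratio * n_windows)))
    mask = np.zeros((channels, n_windows), dtype=bool)
    for c in range(channels):
        # Each channel gets its own independently drawn set of hidden windows.
        hidden = rng.choice(n_windows, size=n_masked, replace=False)
        mask[c, hidden] = True
    # Expand the window-level mask to one flag per sample.
    return np.repeat(mask, window, axis=1)

def ema_update(target_params, online_params, decay=0.99):
    """Move each target-encoder parameter toward its online counterpart."""
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 64))              # 3 sensor channels, 64 time steps
mask = mask_windows(x, window=8, mask_ratio=0.25, rng=rng)
context = np.where(mask, 0.0, x)          # simplification: masked samples zeroed

target = ema_update({"w": np.zeros(2)}, {"w": np.ones(2)}, decay=0.9)
```

Zeroing masked samples keeps the sketch short; an actual encoder might instead drop masked tokens or replace them with a learned mask embedding.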
The multimodal aspect aligns the sensor latent space with a semantic target space — e.g. language-derived embeddings of activities or states — so prediction targets carry semantic content rather than only raw temporal structure. The model thus learns both the dynamics of the signal and its mapping into meaning.
The objective
The objective combines a within-signal latent regression over masked windows with a cross-modal alignment term toward the semantic embedding $s$ of the segment:
$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}, m) - \operatorname{sg}\big[\bar f_\theta(x)_m\big]\,\big\rVert_2^2 \;+\; \lambda\,\big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}[\,s\,]\,\big\rVert_2^2,$$
where $\operatorname{sg}$ denotes the stop-gradient, $\bar f_\theta(x)_m$ are the EMA target encoder's latents at the masked positions $m$, and $\lambda$ balances prediction against semantic alignment. The first term captures sensor dynamics; the second grounds the latent space in a semantic modality, making it interpretable and queryable.
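As a numerical illustration of the two-term objective, the sketch below evaluates it for given predictor outputs and targets. It assumes the predictor outputs are already computed and passed in as arrays; the function name `jepa_semantic_loss` and the example values are hypothetical. NumPy has no autograd, so the stop-gradient is implicit: targets are simply treated as constants.

```python
import numpy as np

def jepa_semantic_loss(pred_masked, target_masked, pred_semantic, s, lam):
    """Squared-L2 prediction term plus lambda-weighted semantic alignment term.

    pred_masked   : predictor output for the masked positions, g(z_ctx, m)
    target_masked : EMA target-encoder latents at the masked positions
    pred_semantic : predictor output for the semantic target, g(z_ctx)
    s             : semantic embedding of the segment
    lam           : weight on the alignment term
    """
    prediction = np.sum((pred_masked - target_masked) ** 2)
    alignment = np.sum((pred_semantic - s) ** 2)
    return prediction + lam * alignment

loss = jepa_semantic_loss(
    pred_masked=np.array([1.0, 2.0]),
    target_masked=np.array([0.0, 0.0]),
    pred_semantic=np.array([1.0, 1.0]),
    s=np.array([1.0, 0.0]),
    lam=0.5,
)
# prediction term = 1 + 4 = 5, alignment term = 1, so loss = 5 + 0.5 * 1 = 5.5
```

In an autograd framework the same expression would wrap both targets in that framework's stop-gradient operation, so that only the context encoder and predictor receive gradients.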
Key results & what's novel
The method extends JEPA to multimodal sensor analytics. Combining masked latent prediction over sensor windows with alignment to a semantic modality yields embeddings that are both predictive of sensor dynamics and interpretable in semantic terms, so that sensors gain a shared "voice" with language-like descriptions. This enables semantic retrieval, zero-shot recognition of activities or states, and cross-modal querying over time-series data. It illustrates how joint-embedding prediction can bridge low-level sensor signals and high-level semantics rather than producing opaque, task-specific features.
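One way zero-shot recognition over an aligned latent space could work is nearest-neighbor lookup by cosine similarity between a sensor embedding and the semantic embeddings of candidate labels. The sketch below assumes that setup; the function name `zero_shot_label`, the label names, and the toy 2-D embeddings are hypothetical, and the source does not specify the retrieval rule.

```python
import numpy as np

def zero_shot_label(z, label_embeddings, labels):
    """Return the label whose embedding is most cosine-similar to z.

    z                : sensor embedding, shape (d,)
    label_embeddings : one semantic embedding per label, shape (n_labels, d)
    labels           : list of n_labels label strings
    """
    z_unit = z / np.linalg.norm(z)
    e_unit = label_embeddings / np.linalg.norm(
        label_embeddings, axis=1, keepdims=True
    )
    scores = e_unit @ z_unit              # cosine similarity per label
    return labels[int(np.argmax(scores))], scores

labels = ["walking", "sitting"]           # hypothetical activity labels
label_embeddings = np.array([[1.0, 0.0],  # toy stand-ins for text embeddings
                             [0.0, 1.0]])
z = np.array([0.9, 0.1])                  # toy sensor embedding

best, scores = zero_shot_label(z, label_embeddings, labels)
# best == "walking"
```

Semantic retrieval follows the same pattern with the roles reversed: embed a text query and rank stored sensor embeddings by similarity to it.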
Strengths & limitations
- + Produces interpretable, queryable sensor embeddings grounded in a semantic modality.
- + Enables semantic retrieval, zero-shot recognition, and cross-modal querying.
- + Unifies within-signal prediction with cross-modal alignment in one objective.
- − Requires paired sensor-semantic data (e.g. labeled or described segments) for the alignment term.
- − Semantic quality is bounded by the chosen language/semantic embedding space.
- − Balancing the prediction and alignment terms ($\lambda$) is a sensitive design choice.