Generalized Embedding Models for Industry 4.0 Applications

Christodoulos Constantinides; Shuxin Lin; Dhaval Patel

doi:10.48550/arXiv.2506.12607

EMNLP 2025

Conference paper

05 Nov 2025

Generalized Embedding Models for Industry 4.0 Applications

Download paper

Abstract

In this work, we present the first embedding model specifically designed for Industry 4.0 applications, targeting the semantics of industrial asset operations. Given natural language tasks related to specific assets, our model retrieves relevant items and generalizes to queries involving similar assets, such as identifying sensors relevant to an asset’s failure mode. We systematically construct nine asset-specific datasets using an expert-validated knowledge base reflecting real operational scenarios. To ensure contextually rich embeddings, we augment queries with Large Language Models, generating concise entity descriptions that capture domain-specific nuances. Across five embedding models ranging from BERT (110M) to gte-Qwen (7B), we observe substantial in-domain gains: $\textbf{HIT@1 +54.2\%, MAP@100 +50.1\%, NDCG@10 +54.7\%}$ on average. Ablation studies reveal that (a) LLM-based query augmentation significantly improves embedding quality; (b) contrastive objectives without in-batch negatives are more effective for tasks with many relevant items; and (c) balancing positives and negatives in batches is essential. We experimented out-of-domain tasks using Retrieval-Augmented Generation (RAG) pipeline. $\textbf{We open-source implementation and experiments.}$