Boosting entity recognition by leveraging cross-task domain models for weak supervision
2024
Entity Recognition (ER) is a common natural language processing task encountered in a number of real-world applications. For common domains and named entities such as places and organisations, there exists sufficient high quality annotated data and foundational models such as T5 and GPT-3.5 also provide highly accurate predictions. However, for niche domains such as e-commerce and medicine with specialized entity types, there is a paucity of labeled data since manual labeling of tokens is often time-consuming and expensive, which makes entity recognition challenging for such domains. Recent works such as NEEDLE [48] propose hybrid solutions to efficiently combine a small amount of strongly labeled (human-annotated) with a large amount of weakly labeled (distant supervision) data to yield superior performance relative to supervised training. The extensive noise in the weakly labeled data, however, remains a challenge. In this paper, we propose WeSDoM (Weak Supervision with Domain Models), which leverages pre-trained encoder models from the same domain but different tasks to create domain ontologies that can enable the creation of less noisy weakly labeled data. Experiments on internal e-commerce and public biomedical NER datasets demonstrate that WeSDoM outperforms existing SOTA baselines by a significant margin. We achieve new SOTA F1 scores on two popular Biomedical NER datasets, BC5CDR-chem 94.27, BC5CDR-disease 91.23.
Research areas