In a large-scale Spoken Language Understanding system, Natural Language Understanding (NLU) models are typically decoupled from the upstream Automatic Speech Recognition (ASR) system, i.e., trained and updated independently of it, even though that system provides the textual hypotheses for the user’s voice signal that serve as input to NLU. Such ASR hypotheses often contain errors, causing severe performance degradation as the downstream NLU models