End-to-end (E2E) automatic speech recognition (ASR) systems often rely on pre-trained hidden Markov model (HMM) systems for word timing estimation (WTE), because E2E models cannot predict word boundaries on their own. However, training an HMM is difficult for low-resource languages due to the lack of phonetic transcriptions, which creates a high demand for HMM-free WTE methods, particularly for multilingual ASR systems.