Information retrieval in multilingual systems is challenging, and the challenge is exacerbated when the component languages do not share a script and writing system. Such differences make indexing names across scripts extremely difficult or even impossible, and each additional language in the system compounds the problem and makes searches less reliable. This paper describes a new SOUNDEX-based approach, called IndicSOUNDEX, that attempts to alleviate these problems for an information retrieval system covering Hindi, Marathi, Telugu, Tamil, Malayalam, Punjabi, Bengali, Kannada, Gujarati, and English by collapsing spelling differences among phonetically related words. We examine IndicSOUNDEX’s strengths in handling three kinds of word pairs: pairs written in two different Indic scripts, pairs in which one word is written in an Indic script and the other in Latin script, and pairs that derive from different phonemic representations.
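To illustrate what "collapsing spelling differences in phonetically related words" means, the following is a minimal sketch of classic American Soundex, the family of algorithms that IndicSOUNDEX builds on. It is not the IndicSOUNDEX algorithm itself: it handles Latin letters only, and the function name `soundex` and the sample name pairs are assumptions chosen for illustration, not taken from the paper.

```python
# Classic American Soundex (Latin letters only). This is NOT IndicSOUNDEX;
# it only illustrates how a phonetic key collapses spelling variants.

# Consonants that sound alike share a digit; vowels, h, w, y have no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of `name` ('' if it has no a-z letters)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()            # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code                 # emit a digit only when the code changes
        prev = code                     # a vowel resets prev, allowing a repeat
    return (key + "000")[:4]            # pad with zeros, truncate to 4 characters


# Hypothetical Latin-script spelling variants that map to the same key.
print(soundex("Lakshmi"), soundex("Laxmi"))
print(soundex("Chaudhary"), soundex("Choudhury"))
```

Because both spellings in each pair produce the same key, an index built on the key rather than on the raw spelling would retrieve either variant. Extending this idea to Indic scripts requires script-aware character groupings, which this sketch does not attempt.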