Recent advances in generative AI, such as scaled Transformer large language models (LLMs) and diffusion decoders, have revolutionized speech synthesis. Because speech combines the complexity of natural language with the dimensionality of audio, many recent models rely on autoregressive modeling of quantized speech tokens. Such an approach limits speech synthesis to left-to-right generation, making