Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator






We propose Signs as Tokens (SOKE), a unified sign language generator that can produce multilingual signs from text inputs. (Left: American Sign Language; Middle: Chinese Sign Language; Right: German Sign Language.)



Abstract


Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task—sign language generation (SLG, text-to-sign)—remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method capable of predicting multiple tokens simultaneously. This approach improves inference efficiency while maintaining effective information fusion across different body parts. To further ease the generation process, we propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. Code, models, and data will be made publicly available.
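To illustrate the decoupling idea, the sketch below shows how continuous per-part motion features could be mapped to discrete token ids by nearest-neighbor lookup in a codebook, as in a VQ-VAE tokenizer. This is not the released code; the feature dimensions, codebook size, and part names are placeholder assumptions, and the codebooks are random stand-ins for learned ones.

```python
# Illustrative sketch (not the authors' implementation): vector-quantizing
# continuous per-part sign motion features into discrete token sequences.
import numpy as np

def quantize(features, codebook):
    """Map each frame's feature vector to the id of its nearest codebook entry."""
    # features: (T, D), codebook: (K, D) -> token ids: (T,)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
# Assumed feature dims per body part; the paper decouples upper body and both hands.
parts = {"body": 64, "left_hand": 48, "right_hand": 48}
codebooks = {p: rng.normal(size=(512, d)) for p, d in parts.items()}  # K=512 assumed
motion = {p: rng.normal(size=(30, d)) for p, d in parts.items()}      # 30 frames

# Each part gets its own token sequence, later merged into the LM vocabulary.
tokens = {p: quantize(motion[p], codebooks[p]) for p in parts}
for p, ids in tokens.items():
    print(p, ids[:5])
```

Keeping a separate codebook per part lets each part's token vocabulary specialize, which is the motivation the abstract gives for the decoupled tokenizer.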



Method Overview



An overview of our proposed method, Signs as Tokens (SOKE). We begin by training a VQ-VAE-based decoupled tokenizer to map continuous sign motions into discrete tokens for various body parts (upper body, left hand, and right hand). These sign motion tokens are then integrated into the text vocabulary of a pretrained language model, which serves as the backbone of our autoregressive multilingual generator (AMG). Given a text input, the AMG first retrieves word-level signs from external dictionaries, appends their motion tokens to the text tokens, and feeds them into the language model encoder. During decoding, our novel multi-head decoding strategy generates motion tokens for all body parts simultaneously at each time step. Finally, the derived motion tokens are used to reconstruct sign avatars.



Qualitative Evaluation



Qualitative comparisons of signs generated by our proposed method, SOKE, and the SOTA method, S-MotionGPT, on the test sets of How2Sign (left), CSL-Daily (middle), and Phoenix-2014T (right).



Video Demos

Citation


@article{zuo2025soke,
    title={Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator},
    author={Zuo, Ronglai and Potamias, Rolandos Alexandros and Ververas, Evangelos and Deng, Jiankang and Zafeiriou, Stefanos},
    journal={arXiv preprint arXiv:2411.17799},
    year={2025}
}



Feel free to contact Ronglai Zuo if you have any questions. The website template is adapted from AniPortraitGAN.