Action-GPT:
Leveraging Large-scale Language Models for Improved and Generalized Action Generation

Sai Shashank Kalakonda        Shubh Maheshwari        Ravi Kiran Sarvadevabhatla       

Paper        Code       

 

Sample text-conditioned action generations from our large language model based approach (Action-GPT-TEACH). By incorporating large language models, our approach yields noticeably improved generation quality for both seen and unseen categories.

Abstract

We introduce Action-GPT,

  • A plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models.
  • By carefully crafting prompts for LLMs, we generate richer and fine-grained descriptions of the action (see the prompt sketch after this list).
  • We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces.
  • Our experiments show qualitative and quantitative improvement in the quality of synthesized motions produced by recent text-to-motion models.
  • Code, pretrained models and sample videos will be made available.
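As a rough illustration of the prompt-crafting step, here is a minimal Python sketch. The prompt template, the helper names, and the GPT-3 settings (model, temperature, number of completions) are illustrative assumptions, not the exact configuration used in Action-GPT.

# Minimal sketch of prompt crafting + querying GPT-3 for detailed descriptions.
# The prompt wording and parameters below are hypothetical.
import openai  # legacy openai<1.0 completion API

def make_prompt(action_phrase: str) -> str:
    # Engineered prompt asking the LLM for fine-grained body-movement details.
    return (
        f"Describe in detail the body movements of a person "
        f"performing the action: {action_phrase}"
    )

def get_action_descriptions(action_phrase: str, num_descriptions: int = 4) -> list[str]:
    # Query GPT-3 once and request several alternative descriptions (n completions).
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=make_prompt(action_phrase),
        max_tokens=256,
        temperature=0.7,
        n=num_descriptions,
    )
    return [choice.text.strip() for choice in response.choices]

# Example usage (requires openai.api_key to be set):
# descriptions = get_action_descriptions("jump over an obstacle")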

Motivation

 

Action-GPT Motivation: We generate richer, fine-grained body movement descriptions of actions by carefully prompting large language models. Using these detailed descriptions instead of the original action phrases leads to better alignment of the text and motion spaces, resulting in enhanced quality of the generated motion sequences.

Overview

 

Action-GPT Overview: Given an action phrase, we first create a suitable prompt using an engineered prompt function. The result is passed to a large-scale language model (GPT-3) to obtain multiple action descriptions containing fine-grained body movement details. The corresponding deep text representations are obtained using the Description Embedder. An aggregated version of these embeddings is processed by the Text Encoder. During training, the action pose sequence is processed by a Motion Encoder. The encoders are paired with either a deterministic sampler (autoencoder) or a VAE-style generative model. During training (shown in black), the latent text embedding and the latent motion embedding are aligned. During inference (shown in green), the sampled text embedding is provided to the Motion Decoder, which outputs the generated action sequence.
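To make the text-side pipeline described above concrete, here is a minimal sketch. The specific embedder (a SentenceTransformer), the small MLP used as the Text Encoder, and mean-pooling aggregation are illustrative assumptions; in practice Action-GPT is plugged into existing text-to-motion models (e.g. Action-GPT-TEACH), whose own text encoders and alignment losses are reused.

# Sketch of the text branch: Description Embedder -> aggregation -> Text Encoder.
# Module choices and dimensions below are assumptions for illustration.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class TextBranch(nn.Module):
    def __init__(self, latent_dim: int = 256, text_dim: int = 384):
        super().__init__()
        # Description Embedder: maps each GPT-3 description to a vector (384-d here).
        self.description_embedder = SentenceTransformer("all-MiniLM-L6-v2")
        # Text Encoder: maps the aggregated embedding into the latent space
        # shared with the Motion Encoder.
        self.text_encoder = nn.Sequential(
            nn.Linear(text_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, descriptions: list[str]) -> torch.Tensor:
        # Embed each detailed description, then aggregate (mean pooling here).
        emb = self.description_embedder.encode(descriptions, convert_to_tensor=True)
        aggregated = emb.mean(dim=0, keepdim=True)
        return self.text_encoder(aggregated)

# During training, this latent text embedding is aligned with the Motion
# Encoder's latent; at inference it is passed to the Motion Decoder to
# generate the action sequence.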

Comparisons

 

Visual comparison of generated motion sequences across models trained with the Action-GPT framework on the BABEL dataset. Note that the generations using Action-GPT are well aligned with the semantic information of the action phrases. The last example shows latent space editing similar to MotionCLIP: Action-GPT is better able to transfer the "drink from mug" style from a standing to a sitting pose.

 

Citation

@InProceedings{Action-GPT,
  title={Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Action Generation},
  author={Kalakonda, Sai Shashank and Maheshwari, Shubh and Sarvadevabhatla, Ravi Kiran},
  booktitle={arXiv preprint https://arxiv.org/abs/2211.15603},
  year={2022}
}