<p dir="ltr">In recent years, Artificial Intelligence (AI) has made significant strides in solving real-world problems, encompassing both discriminative and generative tasks. A notable milestone in this field is the transition of AI models from supervised to zero-shot, enabling them to understand and generate various actions with limited prior knowledge. This transition is crucial for addressing human action-related tasks, where target actions exhibit considerable variability across different scenarios. </p><p dir="ltr">This thesis introduces a series of approaches that enable models to comprehend and generate previously unseen human actions (zero-shot). Previous supervised human action understanding models lack the robustness needed to comprehend learned actions with varying durations across different environments. We thus begin by developing two supervised prototypes: a supervised action recognition model and an action detection system. The works explore efficient ways to model spatio-temporal information and build solid baseline for later works. Despite their state-of-the-art performance on multiple benchmarks, these prototypes are limited to only detecting or recognizing the actions they have been trained on. To overcome this limitation, we introduce an innovative ”factorization” mech anism inspired by human learning processes. This mechanism decomposes com plicated actions into atomic components, which function as latent representations bridging seen and unseen domains and are then integrated with prior knowledge from extensive unsupervised training datasets. </p><p dir="ltr">Notably, the ”factorization” mechanism provides a unified solution for human action understanding and generation. Building on this foundation, we then de velop a generative model that enables users to generate sequences of complicated actions from unique sequences of atomic actions with independent attribute set tings and timings applied. 
However, this model is still constrained by the size of the learned atomic-action pool. To address this challenge, we created the largest open-source 3D human motion dataset, LaViMo, and an open-vocabulary text-conditioned 3D human action generation model, TMT. Our results show that combining a tokenizer pre-trained on large unsupervised datasets (prior knowledge) with Large Language Models (logical reasoning) significantly boosts the model’s open-vocabulary generalization capability. </p><p dir="ltr">Through this line of work, we advance the field of human action-related AI by providing novel methodologies and frameworks that enhance models’ generalization capabilities.</p>