<p dir="ltr">Understanding and generation are cornerstone capabilities of visual intelligence, enabling systems to interpret complex visual scenes and construct meaningful representations. Advanced understanding models exhibit remarkable proficiency in comprehending scenes, even under challenging conditions. Concurrently, generative models, such as diffusion models and autoregressive models, have demonstrated impressive zero-shot capabilities, generating photorealistic images for diverse and intricate scenarios. </p><p dir="ltr">Despite these advancements, the interplay between visual understanding and generation remains underexplored. Visual understanding extracts high-level semantics from raw RGB images, creating compact and meaningful representations of visual scenes. Conversely, visual generation decodes these compact representations back into realistic RGB images. Bridging these two domains presents an opportunity to foster a mutually beneficial relationship, leveraging their inherent complementarities. </p><p dir="ltr">This thesis seeks to bridge the gap between visual understanding and generation by exploring how these domains can complement and enhance one another. The work begins by analyzing and validating the individual effectiveness of understanding and generation models. It then focuses on integrating these domains, revealing the underlying relationships and synergies between them. The key contributions of this thesis are as follows: We present studies demonstrating how visual generation can benefit from visual understanding and vice versa, leveraging shared knowledge from learned repre?sentations. We explore integrating understanding and generation within a unified generative framework, enhancing performance by enriching the model’s latent space. We conduct comprehensive experiments in heterogeneous settings to evaluate the impact of architectural design choices, modalities, and training methodologies. This thesis provides valuable insights into intelligent multimedia analysis in the era of deep learning, with practical implications for multimodal forensic understanding and deduction. It aspires to inspire further research in related fields, advancing the frontiers of visual intelligence.</p>