<p dir="ltr">Generalization is a defining challenge of modern machine learning. Classical theory explains small supervised models but struggles with the surprising behavior of over-parameterized neural networks and with other paradigms such as reinforcement learning and large-scale pretraining. These developments point to the need for new theoretical and empirical machinery better suited to modern machine learning.</p><p dir="ltr">We begin with supervised learning, where we introduce an empirical phenomenon called the <i>Generalization Disagreement Equality</i> (GDE), which enables accurate estimation of the generalization error of deep neural networks using only unlabeled data. Building on this discovery, we develop a theoretical framework grounded in feature learning, the process by which neural networks extract meaningful representations from raw data. This framework not only explains the mathematical basis of the GDE but also successfully predicts the outcomes of novel experiments, providing a more complete understanding of how deep networks learn and generalize.</p><p dir="ltr">The second part explores the generalization properties of deep reinforcement learning (RL) algorithms through the lens of exploration. We establish a crucial relationship between exploration and generalization, and present a novel algorithm that significantly improves an agent's ability to generalize to unseen environments. We also demonstrate that RL agents built on large language models (LLMs) can be trained to explore efficiently at test time and solve novel decision-making problems.</p><p dir="ltr">The third part extends these insights to unsupervised pretraining. We demonstrate that the loss curve of a data subset can be accurately modeled with a scaling law, which can be further decomposed into different types of uncertainties.
These uncertainties serve as signals for dynamically adjusting the data composition during training, improving learning efficiency while incurring minimal computational overhead.</p><p dir="ltr">In the final part of this thesis, we argue that the surprising properties of over-parameterized networks, and the broader limits of classical information theory in explaining modern machine learning, arise because real learners are computationally bounded. To capture this, we introduce <i>epiplexity</i>, a measure of the structural information accessible to time-bounded observers. Epiplexity resolves paradoxes left open by classical theory, admits a practical estimator based on learning curves and scaling laws, and is validated across diverse domains, from cellular automata to language and vision. These results suggest that computational limits are not peripheral but fundamental to the phenomena of machine learning, offering a unified framework for generalization, data selection, and representation learning. They also establish new connections among machine learning, algorithmic information theory, complexity theory, and cryptography.</p>
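<p dir="ltr">As a minimal illustration of the estimator suggested by the GDE in the first part, the disagreement rate of two independently trained models on unlabeled data can be computed as follows. This is only a sketch with synthetic predictions: the ten-class setup, seeds, and corruption probabilities are illustrative assumptions, not the thesis's experimental configuration.</p>

```python
import numpy as np

def disagreement_rate(preds_a, preds_b):
    """Fraction of unlabeled inputs on which two independently
    trained models disagree; under the GDE this tracks test error."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Toy illustration: synthetic predictions standing in for the argmax
# outputs of two networks trained with different random seeds.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)  # hypothetical 10-class task

def noisy_copy(y, flip_prob):
    # Corrupt each prediction independently with probability flip_prob,
    # simulating a model that errs on a fraction of inputs.
    flips = rng.random(y.size) < flip_prob
    return np.where(flips, rng.integers(0, 10, size=y.size), y)

preds_a = noisy_copy(labels, 0.10)
preds_b = noisy_copy(labels, 0.10)

# No labels are consulted: the estimate uses only the two models'
# predictions on the same unlabeled pool.
print(disagreement_rate(preds_a, preds_b))
```

<p dir="ltr">Note that the estimate above requires no ground-truth labels, which is what makes the GDE practical for monitoring deployed models on unlabeled streams.</p>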