<p dir="ltr">Structural Health Monitoring (SHM) of infrastructure requires reliable detection of both surface-level and internal defects, a challenge not fully met by single-modality approaches. This study addresses that gap by integrating visual and vibration data in a multimodal machine learning framework. The proposed approach uses a two-branch deep neural network that processes, in parallel, images for surface damage and acceleration time-series signals for hidden internal damage. The extracted features are fused with an attention-based mechanism, allowing the model to capture complementary information from each modality. This multimodal design is intended to overcome the limitations of vision-only or vibration-only inspection, providing a more comprehensive and robust damage-identification system. </p><p dir="ltr">The proposed model is evaluated on a laboratory-scale cement panel instrumented with a camera and accelerometers under varied damage scenarios, including cracks visible on the surface and subsurface defects undetectable by vision alone. The multimodal model consistently outperforms single-modality baselines, confirming that fusing visual and vibration modalities yields significantly more accurate and reliable damage detection than either modality alone. This result highlights the value of multimodal learning in SHM: the integrated approach can identify a broader range of structural damage with greater reliability than traditional single-sensor methods.</p>
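<p dir="ltr">To make the attention-based fusion concrete, the sketch below shows one common way such a mechanism can combine per-modality embeddings: a learned gate produces a softmax weight for each modality, and the fused feature is the weighted sum of the two branch outputs. This is an illustrative NumPy sketch under assumed dimensions, not the authors' implementation; the function names (<code>attention_fuse</code>), embedding size, and weight shapes are all hypothetical.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(vis_feat, vib_feat, W, b):
    """Fuse vision and vibration embeddings with scalar attention weights.

    vis_feat, vib_feat: (batch, d) branch outputs.
    W: (2*d, 2), b: (2,) -- a small gate mapping the concatenated
    embeddings to one softmax weight per modality (hypothetical layout).
    """
    concat = np.concatenate([vis_feat, vib_feat], axis=-1)  # (batch, 2d)
    weights = softmax(concat @ W + b)                       # (batch, 2)
    # Weighted sum of the two modality embeddings.
    fused = weights[:, :1] * vis_feat + weights[:, 1:] * vib_feat
    return fused, weights

# Toy forward pass with random features standing in for branch outputs.
rng = np.random.default_rng(0)
d = 64
vis = rng.standard_normal((4, d))   # e.g. image-branch embeddings
vib = rng.standard_normal((4, d))   # e.g. acceleration-branch embeddings
W = rng.standard_normal((2 * d, 2)) * 0.01
b = np.zeros(2)
fused, w = attention_fuse(vis, vib, W, b)
print(fused.shape, w.shape)  # fused is (4, 64); per-sample weights sum to 1
```

<p dir="ltr">In this formulation, a sample whose vibration signature carries more diagnostic information (e.g., a subsurface defect invisible to the camera) can receive a larger vibration weight, which is the complementary behavior the abstract attributes to the fusion stage.</p>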