Disinformation presented in multiple modalities (textual, visual, and auditory; i.e., multimodal disinformation) has become a serious concern. This study examines whether disinformation presented in an image or video format is more powerful than text-only disinformation. In particular, we examine its impact on affective mechanisms, as well as the moderating role of perceived issue relevance. Results from an online experiment with three modality conditions and a control group (text-only disinformation vs image-plus-text disinformation vs video-plus-text disinformation vs control; N = 413) indicate that anxiety is a critical mechanism explaining the overall effects of disinformation on misperceptions, and that video-plus-text disinformation in particular increased misperceptions directly or indirectly through anxiety. Video-plus-text disinformation (vs control) also showed a significant interaction with perceived issue relevance; specifically, the difference in anxiety between those with low and high perceived issue relevance was smaller in the video-plus-text disinformation condition. Implications are discussed in light of the realism heuristic, the affect heuristic, and modality-biased processing as explanations for the emotional impact of multimodal disinformation.