eeVAHN
eeVAHN, short for energy-efficient Visual-Audio Hierarchical Network, is a family of multimodal neural networks designed for real-time processing of audio and visual streams on edge devices. It emphasizes data efficiency, low latency, and reduced energy consumption while maintaining competitive accuracy on standard benchmarks.
Architecture: The model uses separate lightweight encoders for visual and audio inputs, followed by a series
Training and evaluation: eeVAHN is trained on curated audiovisual datasets including VGGSound and AudioSet, and evaluated
Applications and limitations: Potential uses include mobile apps for content tagging, assistive devices, smart cameras, and
See also: Multimodal learning, Edge AI, Efficient neural networks.