Lip Reading Deep Network Exploiting Multi-Modal Spiking Visual and Auditory Sensors


This work presents a lip reading deep neural network that fuses the asynchronous spiking outputs of two bio-inspired silicon sensors: the Dynamic Vision Sensor (DVS) and the Dynamic Audio Sensor (DAS). The fusion network is tested on the GRID audio-visual lip reading dataset. Classification is carried out using event-based features generated from the spikes of the DVS and DAS. Networks are trained separately on each modality and also jointly on both modalities. When tested on DVS spike frames alone, the jointly trained network showed a relative increase in accuracy of around 23% over the network trained only on the DVS modality.
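The text above does not say how the DVS spike frames are built or how the relative increase is computed, so the following is only a minimal sketch under stated assumptions. It assumes each DVS event is an `(x, y, t_us)` tuple and that a frame is a per-pixel spike count over a fixed time bin; the function names `events_to_frames` and `relative_increase`, the tuple layout, and the `bin_us` parameter are all hypothetical and not taken from the original work.

```python
def events_to_frames(events, width, height, bin_us):
    """Accumulate events into per-pixel spike-count frames.

    Assumption: events is an iterable of (x, y, t_us) tuples and each
    frame covers a fixed window of bin_us microseconds. The original
    work may build its frames differently.
    """
    events = list(events)
    if not events:
        return []
    t0 = min(t for _, _, t in events)  # timestamp of the earliest event
    n_frames = max((t - t0) // bin_us for _, _, t in events) + 1
    # frames[k][y][x] holds the number of spikes at pixel (x, y) in bin k
    frames = [[[0] * width for _ in range(height)] for _ in range(n_frames)]
    for x, y, t in events:
        frames[(t - t0) // bin_us][y][x] += 1
    return frames


def relative_increase(baseline_acc, new_acc):
    """Relative (not absolute) accuracy gain of new_acc over baseline_acc."""
    return (new_acc - baseline_acc) / baseline_acc


# Toy events, made up purely for illustration.
toy_events = [(0, 0, 0), (1, 0, 5), (1, 0, 7), (0, 1, 12)]
toy_frames = events_to_frames(toy_events, width=2, height=2, bin_us=10)
```

With these toy events, the first three spikes fall into the first 10 µs bin and the last one into the second, giving two 2x2 count frames. A relative increase of around 23% means the ratio `(new - baseline) / baseline` is about 0.23, which is not the same as a gain of 23 percentage points.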



Software And Hardware

• Hardware:
  Processor: i3, i5
  RAM: 4 GB
  Hard disk: 16 GB
• Software:
  Operating system: Windows 2000/XP/7/8/10
  Anaconda, Jupyter, Spyder, Flask
  Frontend: Python
  Backend: MySQL