Listen to the Image

Di Hu, Dong Wang, Xuelong Li, Feiping Nie, Qi Wang
Northwestern Polytechnical University
CVPR 2019


Download Paper

Abstract

Visual-to-auditory sensory substitution devices can assist the blind in sensing the visual environment by translating visual information into a sound pattern. To improve the translation quality, the task performance of blind users is usually employed to evaluate different encoding schemes. In contrast to this toilsome human-based assessment, we argue that a machine model can also be developed for evaluation, and more efficiently. To this end, we first propose two distinct cross-modal perception models for the late-blind and congenitally-blind cases, which aim to generate concrete visual content from the translated sound. To validate the functionality of the proposed models, two novel optimization strategies over the primary encoding scheme are presented. Further, we conduct sets of human-based experiments and compare them with the corresponding machine-based assessments on the cross-modal generation task. Their highly consistent results across different encoding schemes indicate that using a machine model to accelerate scheme optimization and reduce experimental cost is feasible to some extent, which could dramatically promote the upgrading of encoding schemes and thereby help the blind improve their visual perception ability.


Sensory Substitution for the Blind

There are millions of blind people all over the world, and helping them to “re-see” the outside world is a significant but challenging task. According to the theory of cross-modal plasticity, it is possible to use other organs (e.g., the ears) as sensors to “visually” perceive the environment. Hence, over the past decades, several projects have attempted to help the disabled recover their lost senses via other sensory channels; the relevant equipment is usually referred to as Sensory Substitution (SS) devices. The most widely deployed one is the visual-to-auditory SS device vOICe. After 10-15 hours of training with it, regions of the visual cortex become active due to cross-modal plasticity.

Machine Assessment and Analysis

To evaluate the proposed Late-Blind Model (LBM) and Congenitally-Blind Model (CBM), we assess them on a visual generation task. Concretely, images are translated into sounds with different encoding schemes. The obtained audio is then embedded into hidden representations, which serve as the input to visual generation. After training the whole network, the realism of the generated images demonstrates the effectiveness of the proposed cross-modal perception models.
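For intuition, the sketch below mirrors this sound-embedding-to-image pipeline in Keras. The spectrogram input shape, layer sizes, and the plain reconstruction loss are our own illustrative assumptions rather than the released architecture, and the adversarial audio discriminator is omitted.

```python
# A minimal sketch of the machine-assessment pipeline, assuming log-mel
# spectrogram inputs of shape (128, 64, 1) and 32x32 grayscale outputs.
# Layer sizes are illustrative, not the released configuration.
from tensorflow.keras import layers, Model

def build_audio_encoder(embed_dim=128):
    """Embed the translated sound into a hidden representation."""
    spec = layers.Input(shape=(128, 64, 1), name="spectrogram")
    x = layers.Conv2D(32, 3, strides=2, activation="relu")(spec)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    z = layers.Dense(embed_dim, name="audio_embedding")(x)
    return Model(spec, z, name="audio_encoder")

def build_image_generator(embed_dim=128):
    """Generate visual content from the audio embedding."""
    z = layers.Input(shape=(embed_dim,), name="audio_embedding")
    x = layers.Dense(8 * 8 * 64, activation="relu")(z)
    x = layers.Reshape((8, 8, 64))(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    img = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(x)
    return Model(z, img, name="image_generator")

encoder, generator = build_audio_encoder(), build_image_generator()
model = Model(encoder.input, generator(encoder.output))
model.compile(optimizer="adam", loss="mse")  # adversarial losses omitted for brevity
```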

Generated visual examples using our late-blind model.

Generated visual examples using our congenitally-blind model.

Different Encoding Schemes

In a cross-modal generative model, the quality of the generated modality depends on the quality of the other, encoded modality. In this section, the proposed CBM is chosen as the evaluation reference: the encoding scheme directly impacts the quality of the translated sound fed to the audio discriminator, which in turn affects the performance of the visual generator, as shown below (we choose the Columbia Object Image Library (COIL-20) as the evaluation dataset).

Comparison among the generated image examples using our congenitally-blind model in terms of different encoding schemes.

Cognitive Assessment and Analysis

In the conventional assessment, the blind participants’ feedback or task performance usually serves as the indicator of the quality of an encoding scheme. Accordingly, extensive control experiments are performed in this section to evaluate the modified schemes. More importantly, the human-based assessment provides a reference for the proposed cross-modal perception models and the machine-based evaluation.

Different encoding schemes are compared in terms of human evaluation and Inception Score.
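The Inception Score follows its standard definition, exp(E_x[KL(p(y|x) || p(y))]). Below is a minimal NumPy sketch, assuming a matrix of per-image class probabilities obtained from a pretrained classifier; it is an illustration of the metric, not our released evaluation code.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Standard Inception Score: exp(mean KL(p(y|x) || p(y))).

    probs: (N, C) array of class probabilities for the N generated images,
    e.g. softmax outputs of a pretrained Inception classifier.
    """
    p_y = probs.mean(axis=0, keepdims=True)                     # marginal class distribution
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))      # per-image KL divergence
    return float(np.exp(kl.sum(axis=1).mean()))
```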

Comparison between the machine- and human-based assessments in terms of the correlation coefficient.
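A minimal sketch of how such agreement can be quantified with the Pearson correlation coefficient; the interface (one machine score and one human score per encoding scheme) is an assumption for illustration, not the released analysis code.

```python
import numpy as np

def assessment_agreement(machine_scores, human_scores):
    """Pearson correlation between machine- and human-based assessments.

    machine_scores: e.g. Inception Scores of the CBM generations, one per encoding scheme.
    human_scores:   e.g. participants' recognition accuracy, one per encoding scheme.
    """
    return float(np.corrcoef(machine_scores, human_scores)[0, 1])
```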

Examples Overview

We visualize several images and their translated sounds. Click an image below to hear its corresponding sound. Turn on your speakers! (A code sketch of these encoding schemes follows the examples.)

The primary encoding scheme

The modified scheme w.r.t. longer audio length

The modified scheme w.r.t. the tanh position-frequency function
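For a feel of how these schemes differ, the sketch below imitates a vOICe-style encoder: the image is scanned column by column from left to right, each pixel row is mapped to a sine frequency, and pixel brightness controls the amplitude. The scan duration, frequency range, and the exact tanh squashing are illustrative assumptions, not the released parameters.

```python
import numpy as np

def image_to_sound(img, duration=1.05, sr=22050,
                   f_lo=500.0, f_hi=5000.0, mapping="linear"):
    """vOICe-style encoding sketch: scan columns left to right, map each
    pixel row to a sine frequency and its brightness to the amplitude.
    A longer `duration` corresponds to the longer-audio-length variant;
    mapping="tanh" corresponds to the tanh position-frequency variant.
    """
    h, w = img.shape
    samples_per_col = int(duration * sr / w)
    t = np.arange(samples_per_col) / sr
    pos = np.linspace(0.0, 1.0, h)                      # 0 = bottom row, 1 = top row
    if mapping == "tanh":                               # modified position-frequency function
        pos = 0.5 * (np.tanh(4.0 * (pos - 0.5)) + 1.0)
    freqs = f_lo + (f_hi - f_lo) * pos
    cols = []
    for x in range(w):                                  # one short chord per image column
        amps = img[::-1, x] / 255.0                     # bottom row first, matching low frequency
        col = (amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
        cols.append(col / max(h, 1))
    return np.concatenate(cols)
```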

Code & Trained Models

The code and models are open source and available on GitHub. The implementation uses Keras with a TensorFlow backend.

Data

Three image datasets, each paired with its translated sounds, are considered for imitating the visual environment, ranging from complex backgrounds (CIFAR-10 and ImageNet) to a simple black background (COIL-20).

The sounds of the images in these datasets can be directly generated using the released encoding schemes above.
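As a usage illustration (assuming the hypothetical image_to_sound sketch above), the snippet below encodes one synthetic COIL-20-sized image and writes the result as a WAV file.

```python
import numpy as np
from scipy.io import wavfile

# Usage illustration for the sketch above: encode one synthetic 128x128 image
# (a bright square on a black background) and write it out as a WAV file.
img = np.zeros((128, 128), dtype=np.float64)
img[48:80, 48:80] = 255.0                          # simple object on a simple background
sound = image_to_sound(img, duration=1.05, mapping="linear")
wavfile.write("example.wav", 22050,
              (sound / np.abs(sound).max() * 32767).astype(np.int16))
```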

References

Machine cross-modal perception is an exciting research area that has only just set sail! Check out some related work in cognitive science below:

Acknowledgements

We thank our lab colleagues for conducting the sensory substitution tests. This work was supported in part by the National Natural Science Foundation of China under Grants 61772427, 61751202, U1864204, and 61773316, the Natural Science Foundation of Shaanxi Province under Grant 2018KJXX-024, and the Project of the Special Zone for National Defense Science and Technology Innovation.