Di Hu
Dong Wang
Xuelong Li
Feiping Nie
Qi Wang
Northwestern Polytechnical University
CVPR 2019
Abstract
Visual-to-auditory sensory substitution devices can assist the blind in sensing the visual environment by translating
the visual information into a sound pattern. To improve the translation quality, the task performance of the blind is usually employed to evaluate different encoding schemes. In contrast to this toilsome human-based assessment, we argue that a machine model can also be developed for the evaluation, and that it can be more efficient. To this end, we first propose two distinct cross-modal perception models for the late-blind and congenitally-blind cases, which aim to generate concrete visual contents based on the translated sound. To validate the functionality of the proposed models, two novel optimization strategies for the primary encoding scheme are presented. Further, we conduct sets of human-based experiments and compare their results with the machine-based assessments on the cross-modal generation task. The highly consistent results across different encoding schemes indicate that using a machine model to accelerate the evaluation of encoding schemes and reduce experimental cost is feasible to some extent, which could greatly speed up the upgrading of encoding schemes and thus help the blind improve their visual perception ability.
Sensory Substitution for the Blind
There are millions of blind people all over the world, and how to help them "re-see" the outside world is a significant but challenging task. According to the theory of cross-modal plasticity, it is possible to use other organs (e.g., the ears) as sensors to "visually" perceive the environment. Hence, over the past decades, several projects have attempted to help the disabled recover their lost senses via other sensory channels; the relevant devices are usually called Sensory Substitution (SS) devices. One of the most widely deployed is the visual-to-auditory SS device vOICe. After 10-15 hours of training, regions of the visual cortex become active due to cross-modal plasticity.
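For readers unfamiliar with how such a visual-to-auditory translation works, the sketch below illustrates a simplified vOICe-style encoder: the image is scanned column by column from left to right, each row is assigned a sine-wave frequency (top rows sound higher), and pixel brightness controls loudness. The sampling rate, frequency range, and scan duration are illustrative assumptions, not the exact settings of vOICe or of the paper's primary encoding scheme.

```python
import numpy as np

def voice_like_encode(image, duration=1.0, sr=22050, f_lo=500.0, f_hi=5000.0):
    """Translate a grayscale image (H x W, values in [0, 1]) into a sound.

    Simplified vOICe-style scheme (illustrative parameters):
      - columns are scanned left to right over `duration` seconds,
      - each row is assigned a fixed sine frequency (top rows -> higher pitch),
      - pixel brightness controls the loudness of that row's sine.
    """
    h, w = image.shape
    col_len = int(sr * duration / w)             # samples per scanned column
    t = np.arange(col_len) / sr                  # time axis for one column
    # Exponentially spaced frequencies, highest at the top of the image.
    freqs = f_hi * (f_lo / f_hi) ** (np.arange(h) / (h - 1))
    audio = []
    for c in range(w):
        column = image[:, c]                                      # brightness per row
        tones = np.sin(2 * np.pi * freqs[:, None] * t[None, :])   # (H, col_len)
        audio.append((column[:, None] * tones).sum(axis=0))       # superpose rows
    audio = np.concatenate(audio)
    return audio / (np.abs(audio).max() + 1e-8)                   # normalize amplitude
```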
Machine Assessment and Analysis
To evaluate the proposed Late-Blind Model (LBM) and Congenitally-Blind Model (CBM), we assess them via the visual generation task. Concretely, the images are translated into sounds with different encoding schemes. The obtained audios are then embedded into hidden representations, which serve as the input to the visual generator. After training the whole network, the realism of the generated images demonstrates the effectiveness of the proposed cross-modal perception models.
Generated visual examples using our late-blind model.
Generated visual examples using our congenitally-blind model.
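A minimal Keras sketch of the evaluation pipeline described in this section is given below: the translated sound (assumed here to be a log-mel spectrogram) is embedded by an audio encoder into a hidden representation, which a visual generator decodes into an image. The input shape, layer sizes, and pixel-wise loss are illustrative assumptions; the paper's models also involve a discriminator (see the next section), which this sketch omits.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed shapes: log-mel spectrogram of the translated sound and 32x32 RGB images.
AUDIO_SHAPE = (128, 64, 1)   # (mel bins, frames, channels) -- illustrative
LATENT_DIM = 128

# Audio encoder: embeds the translated sound into a hidden representation.
audio_in = keras.Input(shape=AUDIO_SHAPE)
x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(audio_in)
x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
x = layers.GlobalAveragePooling2D()(x)
latent = layers.Dense(LATENT_DIM, activation='relu')(x)

# Visual generator: decodes the hidden representation into an image.
x = layers.Dense(8 * 8 * 128, activation='relu')(latent)
x = layers.Reshape((8, 8, 128))(x)
x = layers.Conv2DTranspose(64, 4, strides=2, padding='same', activation='relu')(x)
x = layers.Conv2DTranspose(32, 4, strides=2, padding='same', activation='relu')(x)
image_out = layers.Conv2D(3, 3, padding='same', activation='sigmoid')(x)

sound_to_image = keras.Model(audio_in, image_out)
# A pixel-wise loss keeps the sketch self-contained; the actual models are
# trained with adversarial objectives as well.
sound_to_image.compile(optimizer='adam', loss='mse')
```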
Different Encoding Schemes
In a cross-modal generative model, the quality of the generated modality depends on the quality of the encoded input modality. In this section, the proposed CBM is chosen as the evaluation reference: the encoding scheme directly affects the quality of the translated sound seen by the audio discriminator, which in turn affects the performance of the visual generator, as shown below (we choose the Columbia Object Image Library (COIL-20) as the evaluation dataset).
Comparison of the image examples generated by our congenitally-blind model under different encoding schemes.
Cognitive Assessment and Analysis
In the conventional assessment, the blind participants' feedback or task performance usually serves as the indicator of the quality of an encoding scheme. Accordingly, extensive control experiments are performed in this section to evaluate the modified schemes. More importantly, the human-based assessment provides a reference for the proposed cross-modal perception models and the machine-based evaluation.
Different encoding schemes are compared in terms of human evaluation and inception score.
Comparison between the machine- and human-based assessments in terms of the correlation coefficient.
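To make this comparison concrete, the sketch below computes a Pearson correlation coefficient between human ratings and machine-based scores (e.g., inception scores of the generated images) across encoding schemes. The numbers are placeholders, not results from the paper.

```python
import numpy as np

# Placeholder scores for several encoding schemes (NOT the paper's results):
# index 0 = primary scheme, 1 = longer audio length, 2 = tanh position-frequency.
human_scores = np.array([3.1, 3.8, 4.2])     # e.g., mean recognition accuracy / rating
machine_scores = np.array([2.6, 3.0, 3.4])   # e.g., inception score of generated images

# Pearson correlation coefficient between the two assessments.
r = np.corrcoef(human_scores, machine_scores)[0, 1]
print(f"machine-human correlation: r = {r:.3f}")
```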
Examples Overview
We visualize several images and their translated sounds. Click an image below to hear its translated sound. Turn on your speakers!
The primary encoding scheme
The modified scheme with a longer audio length
The modified scheme with the tanh position-frequency function
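As an illustration of the second modification, the sketch below replaces the exponential position-frequency mapping of the first sketch with a tanh-shaped curve; the exact function and constants used in the paper are not reproduced here, so the curve shape, frequency range, and `sharpness` parameter are assumptions. The "longer audio length" variant simply corresponds to increasing the `duration` argument of the first sketch.

```python
import numpy as np

def position_to_frequency_tanh(rows, f_lo=500.0, f_hi=5000.0, sharpness=2.0):
    """Map row indices to frequencies with a tanh-shaped curve (hypothetical).

    Compared with an exponential mapping, a tanh curve spreads the middle rows
    of the image over a wider frequency band.
    """
    pos = np.linspace(1.0, -1.0, rows)                      # top of the image first
    curve = np.tanh(sharpness * pos) / np.tanh(sharpness)   # normalized to [-1, 1]
    return f_lo + (f_hi - f_lo) * (curve + 1.0) / 2.0       # map to [f_lo, f_hi]
```

These frequencies can be dropped in for the `freqs` array of the first sketch to obtain the modified translation.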
Code & Trained Models
The code and trained models are open source and available on GitHub. The implementation uses Keras with a TensorFlow backend.
Data
Three image datasets (paired with their translated sounds) are considered for imitating the visual environment, ranging from complex backgrounds (CIFAR-10 and ImageNet) to a simple black background (COIL-20).
The sounds of the images in these datasets can be directly generated by using the released encoding schemes above.
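As a rough illustration of this step, the snippet below encodes a few CIFAR-10 images with the simplified encoder from the first sketch (`voice_like_encode` is the hypothetical helper defined there); reproducing the paper's data requires the released encoding schemes instead.

```python
from tensorflow.keras.datasets import cifar10

(train_x, _), _ = cifar10.load_data()      # (50000, 32, 32, 3), uint8
gray = train_x.mean(axis=-1) / 255.0       # grayscale images in [0, 1]

# Encode the first few images with the simplified vOICe-style encoder above.
sounds = [voice_like_encode(img, duration=1.0) for img in gray[:10]]
print(len(sounds), sounds[0].shape)
```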
Reference
Machine cross-modal perception is an exciting area of research, but it has only just set sail! Check out some related work in cognitive science below:
Acknowledgements
We thank our lab colleagues for conducting the sensory substitution tests. This work was supported in part by the National Natural Science Foundation of China under Grants 61772427, 61751202, U1864204, and 61773316, by the Natural Science Foundation of Shaanxi Province under Grant 2018KJXX-024, and by the Project of the Special Zone for National Defense Science and Technology Innovation.