Return to Dave's Planet

I have tools that sample audio input and output files, line the input and output samples up until they overlap, select the region to extract, perform Fourier transforms on both sets of data, and store the results in the database as a training set for a neural network. The idea is to use your voice as the input and the target voice as the output, so that the neural net is trained for voice morphing.
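
As a rough illustration of the align, extract and transform steps described above, here is a minimal sketch in Python. It is not the actual tool: the function names (`best_lag`, `dft`, `make_training_pair`) and the brute-force cross-correlation search are assumptions made for the example, and a naive DFT stands in for a real FFT routine.

```python
import cmath
import math

def best_lag(reference, other, max_lag):
    """Return the shift (in samples) at which `other` best lines up with `reference`.

    A positive lag means the matching content in `other` starts `lag` samples later.
    """
    def score(lag):
        # Sum of products over the region where the two recordings overlap.
        total = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                total += r * other[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=score)

def dft(frame):
    """Naive O(N^2) discrete Fourier transform; adequate for a short demo frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def make_training_pair(source, target, max_lag, frame_len):
    """Align target to source, cut one overlapping frame from each, transform both."""
    lag = best_lag(source, target, max_lag)
    start = max(0, -lag)  # first source index that has a partner in target
    src_frame = source[start:start + frame_len]
    tgt_frame = target[start + lag:start + lag + frame_len]
    return lag, dft(src_frame), dft(tgt_frame)
```

The two returned spectra would then be stored as one input/output row of a training set; how they are laid out in the database is not shown here.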

I have tools to train the neural network using all the supplied training sets.

I have built an extremely flexible neural network whose structure is defined by database entries of nodes and connections.
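
To make the idea of a database-defined network concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table names (`nodes`, `connections`), their columns, the layer-by-layer evaluation order and the sigmoid activation are all assumptions for illustration; the actual schema is not described on this page.

```python
import math
import sqlite3

def open_demo_db():
    """Create an in-memory database with hypothetical node/connection tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, layer INTEGER NOT NULL);
        CREATE TABLE connections (src INTEGER, dst INTEGER, weight REAL);
    """)
    return db

def forward(db, inputs):
    """Evaluate the network; `inputs` maps each layer-0 node id to its value."""
    values = dict(inputs)
    # Visit non-input nodes in layer order so upstream values already exist.
    rows = db.execute(
        "SELECT id FROM nodes WHERE layer > 0 ORDER BY layer, id").fetchall()
    for (node_id,) in rows:
        incoming = db.execute(
            "SELECT src, weight FROM connections WHERE dst = ?", (node_id,))
        total = sum(values[src] * weight for src, weight in incoming)
        values[node_id] = 1.0 / (1.0 + math.exp(-total))  # sigmoid (assumed)
    return values
```

Because the structure lives entirely in rows, changing the network's shape means changing data rather than code.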

I have tools to automatically generate database scripts to build neural networks of whatever dimensions I like.
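
A script generator of this kind could look roughly like the sketch below. It assumes a fully connected, layered network and hypothetical `nodes` and `connections` tables; the real tools may use a different schema and different weight initialization.

```python
import random

def generate_network_sql(layer_sizes, seed=0):
    """Return an SQL script that builds a fully connected layered network.

    `layer_sizes` lists the node count per layer, e.g. [256, 64, 256].
    """
    rng = random.Random(seed)
    lines = [
        "CREATE TABLE nodes (id INTEGER PRIMARY KEY, layer INTEGER NOT NULL);",
        "CREATE TABLE connections (src INTEGER, dst INTEGER, weight REAL);",
    ]
    layers, next_id = [], 1
    for layer, size in enumerate(layer_sizes):
        ids = list(range(next_id, next_id + size))
        next_id += size
        layers.append(ids)
        for node_id in ids:
            lines.append(
                f"INSERT INTO nodes (id, layer) VALUES ({node_id}, {layer});")
    # Connect every node to every node in the following layer.
    for upstream, downstream in zip(layers, layers[1:]):
        for src in upstream:
            for dst in downstream:
                weight = rng.uniform(-0.5, 0.5)  # small random start (assumed)
                lines.append(
                    "INSERT INTO connections (src, dst, weight) "
                    f"VALUES ({src}, {dst}, {weight:.6f});")
    return "\n".join(lines)
```

Calling `generate_network_sql([256, 64, 256])` would produce a script for the 256-input, 256-output shape, with 64 chosen arbitrarily here as the intermediate layer size.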

I have trained neural networks that successfully morph my voice envelope into that of a target speaker.

I have tools to perform a Fourier synthesis of the output of the neural net and save the result as an uncompressed 8-bit mono wave file.
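
The synthesis and save step can be sketched with Python's standard `wave` module. The function names, the naive inverse DFT and the 8000 Hz default sample rate are assumptions for the example. One detail that is not an assumption: 8-bit WAV samples are stored unsigned, with 128 as the zero level.

```python
import cmath
import math
import wave

def inverse_dft(spectrum):
    """Naive O(N^2) inverse DFT; returns the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def to_unsigned_8bit(samples):
    """Map floats in [-1, 1] to the unsigned bytes used by 8-bit WAV files."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return bytes(int(round(s * 127.0)) + 128 for s in clipped)

def write_wav_8bit_mono(target, samples, sample_rate=8000):
    """Write samples as uncompressed 8-bit mono WAV to a path or file object."""
    with wave.open(target, "wb") as out:
        out.setnchannels(1)   # mono
        out.setsampwidth(1)   # 1 byte per sample = 8 bit
        out.setframerate(sample_rate)
        out.writeframes(to_unsigned_8bit(samples))
```

For example, `write_wav_8bit_mono("morphed.wav", inverse_dft(spectrum))` would write a single synthesized frame; a full utterance would concatenate many frames before writing.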

My weakness lies in my inability to create a synthesized FFT that sounds correct. The synthesis itself works; I can perfectly reconstruct inputs. I can take my voice, rip it apart into Fourier components, reconstruct those components, create the wav file and hear an absolutely perfect reproduction of me. The envelopes of the morphed voices on the screen look perfect, with my voice envelope becoming that of the target speaker.

The problem must lie in the way I am attempting to directly manipulate Fourier transforms. Say I used an FFT with 256 components: I'd have 256 input nodes, N intermediate nodes and 256 output nodes. It was my working assumption that this would be adequate to get an FFT out that could be reconstructed into a recognizable voice. I think the problem is that the FFT might not be the best representation of the data. It tends to put out large spikes instead of smooth spectra, and there are subtleties in the interrelatedness of different FFT components. Apparently you can't just manipulate them like this: if related real and imaginary components are manually modified, one does not get a smooth modification of the output signal; it turns to garbage rather quickly.

Either I need a better understanding of how to morph FFTs while retaining their audio characteristics, or I need to move on to a different form of voice sampling altogether.
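
One piece of the interrelatedness mentioned above can be shown directly. For a real-valued signal of length N, bin N-k of the transform is the complex conjugate of bin k, and each bin is more naturally described by a magnitude and a phase than by independent real and imaginary parts. The sketch below, with hypothetical function names, scales magnitudes while keeping phase and conjugate symmetry intact. It only illustrates the constraint; it is not offered as a fix for the morphing problem.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) forward DFT of a real or complex frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Naive O(N^2) inverse DFT; returns complex samples so any leak is visible."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

def scale_magnitudes(spectrum, gains):
    """Scale each bin's magnitude, keep its phase, preserve conjugate symmetry.

    `gains` needs one entry for each of bins 0 .. N//2; the upper bins are
    rebuilt as mirror-image conjugates so the inverse transform stays real.
    """
    n = len(spectrum)
    out = [0j] * n
    for k in range(n // 2 + 1):
        magnitude, phase = cmath.polar(spectrum[k])
        out[k] = cmath.rect(magnitude * gains[k], phase)
        if 0 < k < n - k:
            out[n - k] = out[k].conjugate()
    return out
```

By contrast, adding a value to the real part of a single bin without touching its mirror bin yields output samples with non-zero imaginary parts, which is one concrete way a hand-edited spectrum stops corresponding to a real audio signal.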

I put this web page up to ask for help. If you demonstrate a willingness to supply meaningful technical assistance that moves this process forward, then I in turn would be more than happy to share my work with you, giving you copies of all my tools and work.
