Re: [Numpy-discussion] Help: Convolution with binaural filters (HRTFs)

Hi,
I am trying to implement a real-time convolution module that is refreshed by the listener's head position (an angle from a reference point). Convolving with binaural filters (HRTFs) lets me spatialize a monophonic wav file. I am having trouble because my convolution does not seem to work properly: np.convolve() does not convolve the entire input signal.
-> trouble extracting numpy arrays from the wav audio filters and the monophonic input
-> trouble then with encapsulating the resulting array in a proper wav file... it is not read by Audacity
Do you have any idea how this could work, or do you have an implementation of stereo filtering by impulse response you could share with me? Thank you very much
Arthur de Conihout

2010/5/26 arthur de conihout arthurdeconihout@gmail.com:
I am trying to implement a real-time convolution module that is refreshed by the listener's head position (an angle from a reference point). Convolving with binaural filters (HRTFs) lets me spatialize a monophonic wav file.
I suspect no one not closely involved with your subject can understand this. From what you write later on, I guess the binaural filters are some LTI system?
I am having trouble because my convolution does not seem to work properly: np.convolve() does not convolve the entire input signal.
Hmm, http://docs.scipy.org/doc/numpy/reference/generated/numpy.convolve.html#nump... claims that the convolution is complete. Can you give an example of what you mean?
Furthermore, I think the note there about scipy.signal.fftconvolve may be of great use for you when you are going to convolve whole wav files.
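A quick way to check both points (that np.convolve returns the full-length result, and that fftconvolve agrees with it numerically) could look like this, with stand-in data:

import numpy as np
from scipy.signal import fftconvolve

x = np.random.randn(1000)   # stand-in for the input signal
h = np.random.randn(128)    # stand-in for the impulse response
y1 = np.convolve(x, h)      # 'full' mode: len(x) + len(h) - 1 samples
y2 = fftconvolve(x, h)      # same result, computed via the FFT
assert y1.shape == (len(x) + len(h) - 1,)
assert np.allclose(y1, y2)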
-> trouble extracting numpy arrays from the wav audio filters and the monophonic input -> trouble then with encapsulating the resulting array in a proper wav file... it is not read by Audacity
Hmm, I worked with wavs once using the wave module, which is a standard module. I didn't deal with storing wavs. I attach the reading module for you. It needs a module mesh2 to import, which I don't include to save traffic. I think the code is understandable without it, and the method .get_raw_by_frames() may already help solve your problem.
But I didn't really get what your aim is. As far as I understood, you want to do something called "spatialise" with the audio, based on the position of some person with respect to some reference point. What does "spatialise" mean in this case? I guess it's not simply a delay for creating a stereo impression? I guess it is something that also creates a room impression, something like "small or large room"?
Friedrich

Hi, thanks for your answer.
It's my first time getting involved in such a forum; I'm a raw recruit and I don't exactly know how to post properly. I'll try to make my project clearer:
""But I didn't really get what your aim is. As far as I understood, you want to do something called "spatialise" with the audio, based on the position of some person with respect to some reference point. What does "spatialise" mean in this case? I guess it's not simply a delay for creating a stereo impression? I guess it is something that also creates a room impression, something like "small or large room"?""
The project consists in binaural synthesis (that's the given name), which is the sound-spatialization approach closest to real-life listening. Binaural encoding of spatial information is fundamentally based on synthesizing the localisation cues, namely the ITD (interaural time difference), the ILD (interaural level difference) and the SC (spectral cues). One way to create an artificial sound scene is to use binaural filters. The binaural signals are then obtained by convolving a monophonic source signal with a pair of binaural filters that reproduce the transfer function of the acoustic path between the source location and the listener's ears. These transfer functions are referred to as Head Related Transfer Functions, or HRTFs (their time-domain equivalents are HRIRs, Head Related Impulse Responses).
These HRTFs can be obtained by measurement. The protocol consists in placing very small microphones in the ears of a listener and, for each direction in his acoustic sphere, sending white noise. We thus obtain the impulse response to white noise, which corresponds to the acoustic signature of a sound coming from the given direction (all this is done in an anechoic room, so as to have a neutral room without reverb). Then, to "spatialize" a sound, I convolve two of these IRs (left- and right-ear responses) with the monophonic sound, and the sound seems to come from the given direction. The decoding part uses headphones, which avoids problems with the room response by bringing the reproduction point close to the recording point of the HRTF (at the ear).
I give you the part of the code I use for the convolution, with all the wav handling:
import wave, struct, numpy, time

SAMPLE_RATE = 88200
# here I get trouble: if I set 44100 instead, the final wav is under-pitched,
# even though the original was at 44100?

def convo(foriginal, ffiltre):
    original = wave.open(foriginal, "r")
    filtre = wave.open(ffiltre, "r")
    # creating the file in which I will write the result of the convolution
    filtered = wave.open("/home/arthur/Desktop/NubiaFilteredFIRnoteR.wav", "w")
    filtered.setnchannels(1)
    filtered.setsampwidth(2)
    filtered.setframerate(SAMPLE_RATE)

    # when I unpack the monophonic file and the filter, I might be making a mistake
    # with the arguments that makes my convolution not be performed on the entire signal?
    # getting wav mono file info to unpack the data properly
    nframes = original.getnframes()
    nchannels = original.getnchannels()
    original = struct.unpack_from("%dh" % nframes*nchannels,
                                  original.readframes(nframes*nchannels))
    # I don't really understand the %dh and the s / 2.0**15, but it might be my problem
    original = [s / 2.0**15 for s in original]

    nframes = filtre.getnframes()
    nchannels = filtre.getnchannels()
    filtre = struct.unpack_from("%dh" % nframes*nchannels,
                                filtre.readframes(nframes*nchannels))
    filtre = [s / 2.0**15 for s in filtre]

    result = numpy.convolve(original, filtre)

    result = [sample * 2.0**15 for sample in result]
    filtered.writeframes(struct.pack('%dh' % len(result), *result))

    filtered.close()

convo(foriginal, ffiltre)
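Would something like this be a saner way to read the mono file into a normalized numpy array? Just a sketch with the wave module; the function name is made up, and it assumes a mono, 16-bit file:

import wave
import numpy

def read_mono_wav(path):
    # sketch: assumes a mono, 16-bit PCM file
    w = wave.open(path, "r")
    assert w.getnchannels() == 1
    assert w.getsampwidth() == 2
    rate = w.getframerate()
    data = numpy.frombuffer(w.readframes(w.getnframes()), dtype=numpy.int16)
    w.close()
    return data / 2.0**15, rate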
I had a look at what you sent me and I'm on my way to understanding it. Maybe your initialisation tests will let me tell the different wav formats apart? I want to be able to handle every format (16-bit unsigned, 32-bit). What precautions do I have to take in the filtering? Do the filter and the original have to be in the same format, or...? Thank you
AdeC

On Thu, May 27, 2010 at 8:00 PM, arthur de conihout arthurdeconihout@gmail.com wrote:
# I don't really understand the %dh and the s / 2.0**15, but it might be my problem
original = [s / 2.0**15 for s in original]
This is done because wav files are (usually, not always) in fixed point, with values in the signed 16-bit int range (~ [-32768, 32767]), but when you want to do processing in floating point (as numpy does), you want the values normalized (in the [-1, 1] range). 2**15 gives you the normalization factor. But audiolab does this for you automatically.
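A minimal sketch of that round trip on a few example values:

import numpy as np

samples = np.array([0, 1000, -32768, 32767], dtype=np.int16)   # example fixed-point data
x = samples / 2.0**15                                          # roughly in [-1, 1] for processing
out = np.clip(x * 2.0**15, -32768, 32767).astype(np.int16)     # back to fixed point for writing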
David

Thank you for the answer. I would love to try audiolab, but I get this error while importing it; I'm running Python 2.6:
import audiolab
/usr/local/lib/python2.6/dist-packages/audiolab-0.0.0-py2.6-linux-i686.egg/audiolab/soundio/play.py:48: UserWarning: Could not import alsa backend; most probably, you did not have alsa headers when building audiolab
import scikits.audiolab
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scikits/audiolab/__init__.py", line 25, in <module>
    from pysndfile import formatinfo, sndfile
  File "scikits/audiolab/pysndfile/__init__.py", line 1, in <module>
    from _sndfile import Sndfile, Format, available_file_formats, available_encodings
ImportError: No module named _sndfile
Thank you
AdeC

On Fri, May 28, 2010 at 1:58 AM, arthur de conihout arthurdeconihout@gmail.com wrote:
Thank you for the answer. I would love to try audiolab, but I get this error while importing it; I'm running Python 2.6:
import audiolab
/usr/local/lib/python2.6/dist-packages/audiolab-0.0.0-py2.6-linux-i686.egg/audiolab/soundio/play.py:48: UserWarning: Could not import alsa backend; most probably, you did not have alsa headers when building audiolab
How did you install it? Your installation looks bogus (if you built it yourself, please give the *exact* instructions you used).
David

2010/5/27 arthur de conihout arthurdeconihout@gmail.com:
I'll try to make my project clearer: [...]
I think I understand now. Thank you for the explanation.
original = [s / 2.0**15 for s in original]
nframes = filtre.getnframes()
nchannels = filtre.getnchannels()
filtre = struct.unpack_from("%dh" % nframes*nchannels,
                            filtre.readframes(nframes*nchannels))
filtre = [s / 2.0**15 for s in filtre]
result = numpy.convolve(original, filtre)
In addition to what David pointed out, it should also suffice to normalise only the impulse response channel.
Furthermore, I would spend some thought on what "normalised" actually means. I think it can only be understood in the Fourier domain. Taking conservation of energy into account, i.e. the conservation of the L2 norm of the impulse response under the Fourier transform, the impulse response can also be normalised in the time domain by setting its time-domain L2 norm to unity. (This is already quite different from normalising by the maximum.) Then the energy content of the signal is identical before and after processing; i.e., the filter emphasises some frequencies at the expense of others, which are diminished in volume.
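As a sketch, normalising a stand-in impulse response to unit L2 norm would just be:

import numpy as np

h = np.array([1.0, 0.5, 0.25, 0.125])    # stand-in impulse response
h = h / np.sqrt(np.sum(h ** 2))          # unit L2 norm, i.e. unit energy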
In my opinion a better approach might be to use the (already directly) determined transfer function. This can be normalised by the power of the input signal, which can e.g. be determined by a reference measurement without any obstruction at a defined distance, I would say.
result = [sample * 2.0**15 for sample in result]
filtered.writeframes(struct.pack('%dh' % len(result), *result))
To me too it's a mystery why you observe the octave jump you reported. I guess it's an octave because you can compensate by doubling the sampling rate. Can you check whether your playback program loads the output file as single or double channel? I also suspect some relation between your observation of "incomplete convolution" and this pitch change.
And now, I'm quite confused by the way you handle multiple channels in the input file. Do you actually load a possibly two-channel input file, extract *all* the data, i.e. the samples ch1, ch2, ch1, ch2, ..., and then convolve this? For single-channel input this is correct, but if your input is two-channel it would explain on the one hand why your program maybe doesn't work properly and on the other why you get the pitch halving (you double the length of each frame). I think it's possible that you didn't notice any stranger phenomena because the impulse response is applied to each sample individually.
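If the input really is two-channel, I would first split the interleaved frames into separate channels and convolve each one on its own, sketched here with stand-in data and stand-in filters:

import numpy as np

data = np.arange(8)                      # stand-in for interleaved samples: ch1, ch2, ch1, ch2, ...
stereo = data.reshape(-1, 2)             # one row per frame
hrir_left = np.array([1.0, 0.5])         # stand-in filters
hrir_right = np.array([0.5, 1.0])
left = np.convolve(stereo[:, 0], hrir_left)
right = np.convolve(stereo[:, 1], hrir_right)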
I had a look at what you sent me and I'm on my way to understanding it. Maybe your initialisation tests will let me tell the different wav formats apart?
Exactly! It should be able to decode at least most wavs, with different sample widths and channel counts. But I think in future I will myself stick to David's advice ...
I want to be able to handle every format (16-bit unsigned, 32-bit). What precautions do I have to take in the filtering? Do the filter and the original have to be in the same format, or...? Thank you
All these problems would go away if you used the transfer function directly. It should be available as some function, which can easily be interpolated onto the frequency points your FFT of the input signal yields. Interpolating the time-domain impulse response is not a good idea, since it already assumes band-limitedness, which is not necessarily fulfilled, I guess. It's also much more complicated.
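Roughly like this (magnitude only, which runs into the phase question below; the signal, sampling rate and measured transfer function are all stand-ins):

import numpy as np

fs = 44100.0
x = np.random.randn(4096)                       # stand-in input signal
f_meas = np.linspace(0.0, fs / 2.0, 64)         # measured frequency points (stand-in)
H_meas = np.ones(64)                            # measured transfer function values (stand-in)
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)     # FFT grid of the input
H = np.interp(freqs, f_meas, H_meas)            # transfer function on that grid
y = np.fft.irfft(np.fft.rfft(x) * H, len(x))    # filter in the Fourier domain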
When you measure a transfer function, how can you reconstruct a unique impulse response? Do you also measure phases? When you shine in white noise, take the autocorrelation of the result and Fourier transform it, the phases of the transfer function are lost; you only obtain (squared) amplitudes. Of course, multiplying in Fourier space with the transfer function is the same as convolving with its inverse Fourier transform, but I'm not yet convinced that doing it in the time domain is a good idea though ... Can you maybe give some hint?
This means you essentially convolve with the autocorrelation function itself - but that one is symmetric and therefore not causal. I think it's strange when the output starts before the sound starts in the input ... the impulse response must be causal. Anyway, this problem remains when you multiply in the Fourier domain with the plain real numbers that result from the Fourier transform of the autocorrelation function - it's still acausal. The only way around this I currently see is to also measure the phases of the transfer function - i.e. to do a frequency sweep; white-noise autocorrelation is then, as far as I can see, not sufficient.
Sorry, I feel it didn't come out very clearly ...
Friedrich

""Can you maybe give some hint?"" The most commonly used model for HRTF implementation is the one refered to as "minimum phase filter and pure delay".It is composed of: -a minimum-phase filter, which accounts for the magnitude spectrum of HRTF -and a pure delay , which represents the temporal information contained in the HRTF
If H(f) is the HRTF to be implemented, the corresponding minimum-phase filter Hmin(f) is given by:

sorry for the suspense...
|Hmin(f)| = |H(f)|
arg(Hmin(f)) = Im(Hilbert(-log(|H(f)|)))
The pure delay is computed by estimating the HRTF (or HRIR) delay. For convenience, a single delay is usually implemented, corresponding to the interaural delay difference, and it is applied to the contralateral ear (the one opposite the source).
This is the common implementation, but in my opinion you can also convolve the HRIRs directly with the monophonic sound: that already takes the delay into account, since the right and left measurements are synchronised, and when you plot them you can see the interaural delay. It doesn't account for the variation of the delay with frequency, which leads to some spectral coloration, but for my practical application I can neglect it.
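A rough sketch of the minimum-phase relation above with scipy, assuming H_mag is the HRTF magnitude sampled over the full (two-sided) FFT grid:

import numpy as np
from scipy.signal import hilbert

def minimum_phase(H_mag, eps=1e-12):
    log_mag = np.log(np.maximum(H_mag, eps))   # avoid log(0)
    phase = np.imag(hilbert(-log_mag))         # Im(Hilbert(-log|H(f)|))
    return H_mag * np.exp(1j * phase)          # |Hmin| = |H|, minimum phase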

On Wed, May 26, 2010 at 10:43 PM, arthur de conihout arthurdeconihout@gmail.com wrote:
For reading audio files into numpy, I suggest you use audiolab. It uses libsndfile underneath, which handles the various wav formats *really* well (and is most likely the library used by Audacity):
http://pypi.python.org/pypi/scikits.audiolab/
Concerning the filtering part, filtering in the time domain is way too expensive, because HRTF impulse responses are quite long, so you should use the FFT (with windowing of course, using overlap-add methods).
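A rough sketch of that block-wise FFT filtering (overlap-add), with an arbitrary block size:

import numpy as np

def overlap_add(x, h, block=4096):
    # convolve the signal x with the FIR h block by block via the FFT
    n_fft = 1
    while n_fft < block + len(h) - 1:
        n_fft *= 2
    H = np.fft.rfft(h, n_fft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        conv = np.fft.irfft(np.fft.rfft(seg, n_fft) * H, n_fft)
        n_out = len(seg) + len(h) - 1
        y[start:start + n_out] += conv[:n_out]   # add the overlapping tails
    return y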
David