Steered-response power

Article snapshot taken from[REDACTED] with creative commons attribution-sharealike license.

Steered-response power (SRP) is a family of acoustic source localization algorithms that can be interpreted as a beamforming-based approach that searches for the candidate position or direction that maximizes the output of a steered delay-and-sum beamformer.

Steered-response power with phase transform (SRP-PHAT) is a variant using a "phase transform" to make it more robust in adverse acoustic environments.

Algorithm

Steered-response power

Consider a system of $M$ microphones, where each microphone is denoted by a subindex $m \in \{1, \dots, M\}$. The discrete-time output signal from a microphone is $s_m(n)$. The (unweighted) steered-response power (SRP) at a spatial point $\mathbf{x} = [x, y, z]^T$ can be expressed as

$$P_0(\mathbf{x}) \triangleq \sum_{n \in \mathbb{Z}} \left\vert \sum_{m=1}^{M} s_m(n - \tau_m(\mathbf{x})) \right\vert^2,$$

where $\mathbb{Z}$ denotes the set of integers and $\tau_m(\mathbf{x})$ is the time lag due to propagation from a source located at $\mathbf{x}$ to the $m$-th microphone.

The (weighted) SRP can be rewritten as

$$P(\mathbf{x}) = \frac{1}{2\pi} \sum_{m_1=1}^{M} \sum_{m_2=1}^{M} \int_{-\pi}^{\pi} \Phi_{m_1,m_2}(e^{j\omega}) \, S_{m_1}(e^{j\omega}) \, S_{m_2}^{*}(e^{j\omega}) \, e^{j\omega\tau_{m_1,m_2}(\mathbf{x})} \, d\omega,$$

where $(\cdot)^{*}$ denotes complex conjugation, $S_m(e^{j\omega})$ represents the discrete-time Fourier transform of $s_m(n)$ and $\Phi_{m_1,m_2}(e^{j\omega})$ is a weighting function in the frequency domain (discussed below). The term $\tau_{m_1,m_2}(\mathbf{x})$ is the discrete time-difference of arrival (TDOA) of a signal emitted at position $\mathbf{x}$ to microphones $m_1$ and $m_2$, given by

$$\tau_{m_1,m_2}(\mathbf{x}) \triangleq \left\lfloor f_s \frac{\|\mathbf{x} - \mathbf{x}_{m_1}\| - \|\mathbf{x} - \mathbf{x}_{m_2}\|}{c} \right\rceil,$$

where $f_s$ is the sampling frequency of the system, $c$ is the sound propagation speed, $\mathbf{x}_m$ is the position of the $m$-th microphone, $\|\cdot\|$ is the 2-norm and $\lfloor \cdot \rceil$ denotes the rounding operator.
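The TDOA definition above can be sketched in a few lines of Python. This is only an illustration: the function name `tdoa_samples`, the default speed of sound of 343 m/s and the example geometry are assumptions, not part of the original text. Python's built-in `round` resolves exact half-sample ties to the nearest even integer, which may differ from the rounding convention of a particular implementation.

```python
import math

def tdoa_samples(x, mic1, mic2, fs, c=343.0):
    """Rounded TDOA (in samples) between mic1 and mic2 for a source at x.

    x, mic1, mic2 are (x, y, z) tuples in metres, fs is in Hz and
    c is the assumed speed of sound in m/s.
    """
    d1 = math.dist(x, mic1)   # ||x - x_m1||
    d2 = math.dist(x, mic2)   # ||x - x_m2||
    return round(fs * (d1 - d2) / c)

# Hypothetical geometry: two microphones 1 m apart on the x axis.
mic_a = (0.0, 0.0, 0.0)
mic_b = (1.0, 0.0, 0.0)
lag_endfire = tdoa_samples((-1.0, 0.0, 0.0), mic_a, mic_b, fs=16000)
lag_broadside = tdoa_samples((0.5, 1.0, 0.0), mic_a, mic_b, fs=16000)
```

A source on the array axis gives the largest magnitude of the lag, while a source equidistant from both microphones gives a lag of zero.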

Generalized cross-correlation

The above SRP objective function can be expressed as a sum of generalized cross-correlations (GCCs) for the different microphone pairs, each evaluated at the time lag corresponding to that pair's TDOA:

$$P(\mathbf{x}) = \sum_{m_1=1}^{M} \sum_{m_2=1}^{M} R_{m_1,m_2}(\tau_{m_1,m_2}(\mathbf{x})),$$

where the GCC for a microphone pair $(m_1, m_2)$ is defined as

$$R_{m_1,m_2}(\tau) \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{m_1,m_2}(e^{j\omega}) \, S_{m_1}(e^{j\omega}) \, S_{m_2}^{*}(e^{j\omega}) \, e^{j\omega\tau} \, d\omega.$$

The phase transform (PHAT) is an effective GCC weighting for time delay estimation in reverberant environments. It forces the GCC to consider only the phase information of the signals involved:

$$\Phi_{m_1,m_2}(e^{j\omega}) \triangleq \frac{1}{\left\vert S_{m_1}(e^{j\omega}) \, S_{m_2}^{*}(e^{j\omega}) \right\vert}.$$
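As a rough illustration of the PHAT-weighted GCC, the sketch below replaces the integral over $\omega$ with a sum over the bins of a zero-padded discrete Fourier transform. The helper names `dft` and `gcc_phat`, the small threshold `eps` and the toy signals are assumptions made for this example. The naive $O(N^2)$ transform keeps the sketch dependency-free; a practical implementation would normally use an FFT routine.

```python
import cmath
import random

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_phat(s1, s2, eps=1e-12):
    """Return {lag: R(lag)} for the PHAT-weighted GCC of two real signals."""
    n = 2 * max(len(s1), len(s2))   # zero-pad so circular lags act like linear lags
    S1 = dft(list(s1) + [0.0] * (n - len(s1)))
    S2 = dft(list(s2) + [0.0] * (n - len(s2)))
    cross = [a * b.conjugate() for a, b in zip(S1, S2)]
    # PHAT weighting: divide each bin by its magnitude so only the phase remains.
    # Bins with (near-)zero magnitude carry no usable phase and are set to zero.
    phase = [v / abs(v) if abs(v) > eps else 0.0 for v in cross]
    max_lag = n // 2 - 1
    return {
        lag: sum(p * cmath.exp(2j * cmath.pi * k * lag / n)
                 for k, p in enumerate(phase)).real / n
        for lag in range(-max_lag, max_lag + 1)
    }

# Toy check: s1 is a copy of s2 delayed by 3 samples.
rng = random.Random(0)
s2 = [rng.gauss(0.0, 1.0) for _ in range(16)] + [0.0] * 8
s1 = [0.0] * 3 + s2[:-3]
R = gcc_phat(s1, s2)
peak_lag = max(R, key=R.get)
```

For a pure delay the whitened cross-spectrum is a unit-magnitude complex exponential, so the GCC collapses to a single sharp peak at the lag by which the first signal trails the second. This matches the sign convention of the TDOA defined earlier.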

Estimation of source location

The SRP-PHAT algorithm consists of a grid-search procedure that evaluates the objective function $P(\mathbf{x})$ on a grid of candidate source locations $\mathcal{G}$ to estimate the spatial location of the sound source, $\mathbf{x}_s$, as the grid point that yields the maximum SRP:

$$\hat{\mathbf{x}}_s = \arg\max_{\mathbf{x} \in \mathcal{G}} P(\mathbf{x}).$$
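The grid search can be sketched as follows, assuming the GCC of every microphone pair has already been computed and stored as a dictionary from integer lag to value. The names `tdoa_samples`, `srp` and `locate`, the room geometry and the idealised single-peak GCCs are all assumptions made for illustration. The sketch sums over unordered pairs only. For real signals this changes the value of the double sum by a constant offset and a factor of two, which does not move the maximum.

```python
import math
from itertools import combinations, product

def tdoa_samples(x, m1, m2, fs, c=343.0):
    """Rounded TDOA in samples for a source at x and microphones m1, m2."""
    return round(fs * (math.dist(x, m1) - math.dist(x, m2)) / c)

def srp(x, mics, gcc, fs, c=343.0):
    """Sum the GCC of each microphone pair at the lag predicted for point x."""
    return sum(
        gcc[(i, j)].get(tdoa_samples(x, mics[i], mics[j], fs, c), 0.0)
        for i, j in combinations(range(len(mics)), 2)
    )

def locate(grid, mics, gcc, fs, c=343.0):
    """Return the grid point with the largest steered-response power."""
    return max(grid, key=lambda x: srp(x, mics, gcc, fs, c))

# Hypothetical setup: five microphones and a source on a 0.5 m grid.
fs = 16000
mics = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0),
        (4.0, 4.0, 0.0), (2.0, 2.0, 3.0)]
source = (1.0, 2.5, 1.5)

# Idealised GCCs: a single unit peak at the true TDOA of each pair.
gcc = {(i, j): {tdoa_samples(source, mics[i], mics[j], fs): 1.0}
       for i, j in combinations(range(len(mics)), 2)}

grid = [(0.5 * a, 0.5 * b, 0.5 * k)
        for a, b, k in product(range(9), range(9), range(7))]
estimate = locate(grid, mics, gcc, fs)
```

With measured signals the GCC dictionaries would come from a GCC-PHAT computation per pair, and the cost of the search grows with the number of grid points times the number of pairs.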

Modified SRP-PHAT

Modifications of the classical SRP-PHAT algorithm have been proposed to reduce the computational cost of the grid-search step and to increase the robustness of the method. In the classical SRP-PHAT, for each microphone pair and for each point of the grid, a unique integer TDOA value is selected as the acoustic delay corresponding to that grid point. This procedure does not guarantee that all TDOAs are associated with points on the grid, nor that the spatial grid is consistent, since some of the points may not correspond to an intersection of hyperboloids. The issue becomes more problematic with coarse grids: when the number of points is reduced, part of the TDOA information is lost because most delays are no longer associated with any point in the grid.

The modified SRP-PHAT collects and uses the TDOA information related to the volume surrounding each spatial point of the search grid by considering a modified objective function:

$$P'(\mathbf{x}) = \sum_{m_1=1}^{M} \sum_{m_2=1}^{M} \sum_{\tau = L_{m_1,m_2}^{l}(\mathbf{x})}^{L_{m_1,m_2}^{u}(\mathbf{x})} R_{m_1,m_2}(\tau),$$

where $L_{m_1,m_2}^{l}(\mathbf{x})$ and $L_{m_1,m_2}^{u}(\mathbf{x})$ are the lower and upper accumulation limits of GCC delays, which depend on the spatial location $\mathbf{x}$.
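A minimal sketch of the modified objective at a single candidate point is given below. It assumes the accumulation limits for that point are already known and that each GCC is stored as a dictionary from integer lag to value. The function names and the numbers in the example are hypothetical.

```python
def accumulate_pair(gcc_pair, lower, upper):
    """Sum R(tau) over the integer lags lower..upper (inclusive) for one pair."""
    return sum(gcc_pair.get(lag, 0.0) for lag in range(lower, upper + 1))

def modified_srp(limits, gcc):
    """Modified SRP at one grid point.

    limits maps a microphone pair to its (lower, upper) lag limits at that
    point; gcc maps the same pair to a {lag: value} dictionary.
    """
    return sum(accumulate_pair(gcc[pair], lo, hi)
               for pair, (lo, hi) in limits.items())

# Hypothetical GCC values and accumulation limits for two microphone pairs.
gcc = {
    (0, 1): {-2: 0.1, -1: 0.2, 0: 0.5, 1: 0.3},
    (0, 2): {3: 1.0, 4: 0.25},
}
limits = {(0, 1): (-1, 1), (0, 2): (4, 6)}
value = modified_srp(limits, gcc)
```

Setting the lower and upper limits of every pair to the same lag recovers the classical objective, which samples each GCC at a single delay.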

Accumulation limits

The accumulation limits can be calculated beforehand in an exact way by exploring the boundaries separating the regions corresponding to the points of the grid. Alternatively, they can be selected by considering the spatial gradient of the TDOA $\nabla_{\tau_{m_1,m_2}}(\mathbf{x}) = [\nabla_{x\tau_{m_1,m_2}}(\mathbf{x}), \nabla_{y\tau_{m_1,m_2}}(\mathbf{x}), \nabla_{z\tau_{m_1,m_2}}(\mathbf{x})]^T$, where each component $\gamma \in \{x, y, z\}$ of the gradient is:

$$\nabla_{\gamma\tau_{m_1,m_2}}(\mathbf{x}) = \frac{1}{c} \left( \frac{\gamma - \gamma_{m_1}}{\|\mathbf{x} - \mathbf{x}_{m_1}\|} - \frac{\gamma - \gamma_{m_2}}{\|\mathbf{x} - \mathbf{x}_{m_2}\|} \right).$$

For a rectangular grid where neighboring points are separated by a distance $r$, the lower and upper accumulation limits are given by:

$$L_{m_1,m_2}^{l}(\mathbf{x}) = \tau_{m_1,m_2}(\mathbf{x}) - \|\nabla_{\tau_{m_1,m_2}}(\mathbf{x})\| \cdot d$$

$$L_{m_1,m_2}^{u}(\mathbf{x}) = \tau_{m_1,m_2}(\mathbf{x}) + \|\nabla_{\tau_{m_1,m_2}}(\mathbf{x})\| \cdot d,$$

where $d = (r/2) \min\left( \frac{1}{\vert \sin(\theta)\cos(\phi) \vert}, \frac{1}{\vert \sin(\theta)\sin(\phi) \vert}, \frac{1}{\vert \cos(\theta) \vert} \right)$ and the gradient direction angles are given by

$$\theta = \cos^{-1}\left( \frac{\nabla_{z\tau_{m_1,m_2}}(\mathbf{x})}{\|\nabla_{\tau_{m_1,m_2}}(\mathbf{x})\|} \right),$$

$$\phi = \arctan_2\left( \nabla_{y\tau_{m_1,m_2}}(\mathbf{x}), \nabla_{x\tau_{m_1,m_2}}(\mathbf{x}) \right).$$
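The gradient-based limits can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the gradient, which has units of seconds per metre, is multiplied by $f_s$ so that the half-width is expressed in samples, and the resulting limits are rounded to integer lags. The function name, the default speed of sound and the example geometry are likewise hypothetical, and the candidate point is assumed not to coincide with a microphone.

```python
import math

def accumulation_limits(x, m1, m2, r, fs, c=343.0):
    """Lower and upper GCC lag limits (in samples) for grid point x."""
    d1, d2 = math.dist(x, m1), math.dist(x, m2)   # x must not sit on a microphone
    # Spatial gradient of the TDOA, one component per axis (seconds per metre).
    grad = [((x[k] - m1[k]) / d1 - (x[k] - m2[k]) / d2) / c for k in range(3)]
    norm = math.sqrt(sum(g * g for g in grad))
    tau = fs * (d1 - d2) / c                      # unrounded TDOA in samples
    if norm == 0.0:                               # flat TDOA field: single lag
        return round(tau), round(tau)
    theta = math.acos(max(-1.0, min(1.0, grad[2] / norm)))
    phi = math.atan2(grad[1], grad[0])
    direction = [abs(math.sin(theta) * math.cos(phi)),
                 abs(math.sin(theta) * math.sin(phi)),
                 abs(math.cos(theta))]
    # Distance from the cell centre to its boundary along the gradient.
    # Near-zero components correspond to 1/0 = infinity and never win the min.
    d = (r / 2) * min(1.0 / v for v in direction if v > 1e-12)
    half_width = fs * norm * d                    # assumption: f_s converts s to samples
    return round(tau - half_width), round(tau + half_width)

# Hypothetical example: a point equidistant from two microphones 1 m apart.
lower, upper = accumulation_limits(
    (0.5, 2.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), r=0.5, fs=16000)
```

In this example the TDOA at the grid point is zero and the gradient points along the array axis, so the limits are symmetric about lag zero. A finer grid (smaller `r`) narrows the accumulation range.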
