AFRL/HECB Grant Final Performance Report

 

 

Integration of Laser Vibrometry

with Infrared Video

for Multimedia Surveillance Display

 

Zhigang Zhu (Principal Investigator)

Weihong Li

Computer Science Department

City College of New York /CUNY

 

 

 

Prepared for

Air Force Research Laboratory

under Award No F33615-03-1-6383

 

December 2004

 

 

 

The City College of New York

  DEPARTMENT OF COMPUTER SCIENCE    CONVENT AVE & 138TH ST

  SCHOOL OF ENGINEERING                                 NEW YORK, NY 10031

 

 

 

Approved for public release; distribution is unlimited.


 

Abstract

The laser Doppler vibrometer (LDV) is a non-contact, remote, high-resolution voice detector: the vibration that a voice induces in nearby objects carries the voice signal itself. After enhancement with Gaussian bandpass filtering and adaptive volume scaling, the LDV voice signals were mostly intelligible from targets without retro-reflective finishes at short or medium distances (< 100 m). With retro-reflective tape, the distance could be as far as 300 meters. Infrared (IR) imaging for target selection and localization for LDV listening is also discussed. A system has been set up with three types of sensors (IR cameras, PTZ color cameras and LDVs) to integrate multimedia sensors for human signature detection. The basic idea is to provide an advanced augmented interface that gives users the best cognitive understanding of the environment, the sensors and the events. However, without retro-reflective tape treatment, the LDV voice signals were still very noisy from targets at medium and large distances. Therefore, even with state-of-the-art sensor technology, more advanced signal enhancement techniques are needed, and further sensor improvement is also necessary. In addition, automatic targeting and intelligent refocusing are technical issues that deserve research attention for long-range LDV listening.

 

 

Index Terms

laser vibrometry, clandestine listening, multimedia integration, audio signal enhancement, infrared video surveillance


 

Table of Contents

Acknowledgements

 

1. Introduction

2. Overview: A Multimedia Integration Approach

3. Multimedia Sensors

3.1 The LDV sensor

3.2. Infrared camera

3.3. PTZ camera

4. Laser Doppler Vibrometer: Principle and Applications

5. LDV Audio Signal Enhancement

5.1. The Gaussian bandpass filter

5.2. Volume selection and adaptation

6. Experiment Designs and Analysis

6.1. Real data collections

6.2.1. Experiments on long range LDV listening

6.2.2. Experiments on listening through walls/doors/windows

6.2.3. Experiments on talking inside cars

6.2.4. Experiments on types of surfaces

6.2.5. Experiments on surface directions

6.2. LDV performance analysis

7. Discussions on Sensor Improvements and System Integration

7.1. Further research issues in LDV acoustic detection

7.2. Multimodal integration and intelligent targeting and focusing

8. Conclusions

9. References

 


 

Acknowledgements

 

We are grateful to Lt. Jonathan Lee and Mr. Robert Lee at the Air Force Research Laboratory (AFRL) for their guidance and valuable discussions on many technical issues on laser Doppler vibrometers during the course of this work. Prof. George Wolberg at the City College has been involved in a collaboration effort with the PI on the multimodal sensor integration for human signature detection, and has also provided many insightful suggestions and discussions. Prof. Ning Xiang at Rensselaer Polytechnic Institute (RPI) has provided his consulting services on laser Doppler vibrometers that have led us to a better understanding of this new type of sensor. Prof. Esther Levin, with her expertise in speech technology, has provided valuable discussions on speech signal processing. We also thank Mr. Robert T. Hill at the City College for proofreading the document and for providing some valuable comments and suggestions.

This material is based on research sponsored by the Air Force Research Laboratory under agreement number F33615-03-1-6383. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. However, the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

 


1. Introduction

 

Recent improvements in laser vibrometry [1-6] and day/night IR imaging technology [15] have made it possible to build a long-range multimedia surveillance system that operates day and night. The IR video system would provide the video surveillance while allowing the operator to select the best target for picking up audio detectable by the laser vibrometer. This multimedia capability would greatly improve security force performance through clandestine listening to targets that are probing or penetrating a perimeter defense. The targets may be aware that they are observed, but most likely would not infer that they could be heard. This system could also provide the feeds for advanced face and voice recognition systems.

Laser Doppler vibrometers (LDVs) such as those manufactured by Polytec™ [2] and B&K Ometron [3] can effectively detect vibration within two hundred meters with a sensitivity on the order of 1 µm/s. These instruments are designed for use in laboratories (0-5 m working distance) and field work (5-200 m) [2-7]. For example, they have been used to measure the vibrations of civil structures such as high-rise buildings, bridges, and towers at distances of up to 200 m. However, for distances above 200 meters, it is necessary to treat the target surface with retro-reflective tape or paint to ensure sufficient retro-reflectivity. At distances beyond 200 m and under field conditions, the outgoing and reflected beams pass through air at different temperatures and thus with different refractive indices. Another difficulty is that such an instrument uses a front lens to focus the laser beam on the target surface in order to minimize the size of the measuring point. At 200 m the spot is 12 mm in diameter and very weak; at 1,000 m the spot diameter would be 63 mm and extremely weak. At distances above 200 m, the speckle pattern of the laser beam induces noise, and signal dropout becomes substantial [8]. Finally, the visible laser beam helps a human operator select a target, but it is not desirable for a clandestine surveillance application.

The overall goal of this project is an advanced multimedia interface that improves human effectiveness in using state-of-the-art sensing technologies for perimeter surveillance. We believe that for the foreseeable future of these technologies, human involvement in all three stages (sensors, alarm, and response) is still vital for a successful surveillance system. Meanwhile, we fully realize that the capabilities of the sensors in our study, namely infrared (IR) cameras, visible (EO) cameras, and laser Doppler vibrometers (LDVs), are critical to surveillance tasks. IR and EO cameras have been widely used for human and vehicle detection in traffic and surveillance applications. However, literature on remote acoustic detection using LDVs is rare. Therefore, in this one-year project we have mainly focused on the experimental study of LDV-based voice detection, which is the main focus of this report. We have also set up a system with all three types of sensors for integrating multimedia sensors in human signature detection. This report also briefly discusses how IR/EO imaging can be used for target selection and localization for LDV listening.

This report is organized as follows. First, we give an overall picture of our technical approach:   the human-centered technology paradigm for the integration of laser Doppler vibrometry and IR imaging for multimedia surveillance display. The basic idea is to provide an advanced virtualized-reality based interface of the site (e.g., air base) to give the operator the best cognitive understanding of the environment, the sensors, and the events. One of the important issues is how to use IR imaging to help the laser Doppler vibrometer to select the appropriate targets.

Then, we discuss various aspects of LDVs for voice detection: basic principles and problems, signal enhancement algorithms, and experimental designs. We focus on the study of human-centered technology for LDV sensor information enhancement and clandestine listening. We investigate the performance of the laser Doppler vibrometer on two types of targets: fixed facilities in the environment that vibrate with humans and/or vehicles nearby, and the human subjects themselves. We have designed a graphical human-computer interface for signal analysis, signal filtering, and signal synthesis. The graphical interface helps a user understand the relation between the signals and noises in terms of magnitudes and frequencies, and through signal synthesis (i.e., speech synthesis from the filtered laser Doppler vibrometry signals), the user can adaptively pick up the useful signals.

Speech enhancement algorithms are applied to improve the performance of recognizing a noisy voice detected by the LDV system. The detected speech signal may be corrupted by more than one noise source, such as laser photon noises, target movements, and background acoustic noises (wind, engine sound, etc.). Many speech enhancement algorithms have been proposed [9-12], but they have been mainly used for improving the performance of speech communication systems in noisy environments. Acoustic signals captured by laser vibrometers need special treatment.

The performance of the laser Doppler vibrometer depends strongly on the reflectance properties of the target surfaces. Important issues such as target surface properties, size and shape, distance from the sensors, sensor installation, and calibration strategy are studied through several sets of indoor and outdoor experiments. Through this study, we have gained a better understanding of LDV performance, which could guide us in improving the LDV sensors. We provide a brief discussion of future work in LDV sensor improvements and multimedia human signature detection.

We envision that the integration of IR imaging and laser Doppler vibrometry will provide the user with a multimedia display that offers a spatially coherent environment, enhanced video and audio presentation, and rapid target localization via the technologies of augmented reality, video-audio registration, information filtering/enhancement, and automatic target detection/listening. Ultimately, this goal could be achieved for kilometer-scale long-range surveillance. In this one-year project, the research provides a feasibility study of multimedia integration and visualization solutions with state-of-the-art sensors.

 

2. Overview: A Multimedia Integration Approach

There are three main components in our approach of multimedia human signature detection (Figure 1, Figure 2): the IR/EO imaging video surveillance component, the LDV audio surveillance component, and the human-computer interaction components. Both the IR/EO and LDV sensing components can support day and night operation even though it will be better to use a standard EO camera (coupled with the IR camera) to perform the surveillance task during daytime. The overall approach is the integration of the IR/EO imaging and LDV audio detection for a long-range surveillance task. The integration has the following three steps.

Step 1. Target detection, tracking, and selection via the IR/EO imaging module. The targets of interest could be humans or vehicles (driven by humans). This will be performed by motion detection and human/ vehicle segmentation methods.

 

Figure 1. System components of a multimodal human signature detection system.

 

Step 2. Audio targeting and detection by the LDV audio module. The audio signals could be human voices or vehicle engine sounds. We mainly consider the human voice detection. The main issue is to select the LDV targeting points provided by the IR/EO imaging module to detect the vibration caused by human voices.

Step 3. Best-view face/vehicle shot capture using feedback from audio detection. By using the audio feedback, the IR/EO imaging module can verify the existence of humans and capture the best face shots for face recognition. Together with the voice recognition module, the surveillance system could further perform human identification and event understanding.
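The report does not spell out the motion detection method used in Step 1. As a minimal sketch, assuming simple frame differencing on two co-registered IR frames (one common approach, not necessarily the one used in this project), a candidate region can be located as follows. The function names, the threshold, and the synthetic frames are illustrative assumptions; the 11 °C and 27 °C values only echo the readings quoted for Figure 5(2).

```python
import numpy as np

def motion_mask(prev_frame, curr_frame, threshold=10.0):
    """Mark pixels whose value changed by more than `threshold` between two frames."""
    diff = np.abs(curr_frame.astype(np.float64) - prev_frame.astype(np.float64))
    return diff > threshold

def bounding_box(mask):
    """Return (top, bottom, left, right) of the changed region, or None if nothing changed."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

# Synthetic 320 x 240 temperature frames (hypothetical data, in degrees Celsius).
before = np.full((240, 320), 11.0)
after = before.copy()
after[100:140, 150:165] = 27.0   # a warm, person-sized patch enters the scene

print(bounding_box(motion_mask(before, after)))
```

The resulting box is the kind of region from which LDV targeting points near the detected person could then be chosen in Step 2.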

 

An important concept is to design a human-computer interface for the human-centered multimedia surveillance. Human involvement in all the three stages – sensors, alarm, and response –  is vital for a successful surveillance system. Figure 2 shows the human-computer interaction (HCI) synopsis for human-in-the-loop surveillance operation with augmented reality (AR) visualization, target selection, signal extraction and enhancement, and human identification.

 

 

 

Figure 2. System Diagram. The Human-Computer Interaction (HCI) is important for sensor modeling/registration, video/audio detection, and recognition.

 

3.  Multimedia Sensors

To enable the study of multimedia sensor integration for human signature detection, we have acquired the following sensors: a Laser Doppler Vibrometer (LDV) OFV-505 from Polytec, a ThermoVision A40M infrared camera from FLIR, and a Canon color/near-IR pan/tilt/zoom (PTZ) camera. The FLIR ThermoVision A40M IR camera and the Canon PTZ camera VC-C50i were purchased with the funding of this project, and the Polytec LDV was purchased with matching funds from a CUNY Equipment Competition Award. We briefly list the main characteristics of each sensor in the following paragraphs.

3.1 The LDV sensor

The Laser Doppler Vibrometer from Polytec [2] includes a controller OFV-5000 with a digital velocity decoder card VD-06 and a sensor head OFV-505 (Figure 3). We also acquired a telescope VIB-A-P05 for accurate targeting at large distances.

The sensor head uses a particular helium neon red laser with wavelength of 633 nm and is equipped with a super long-range lens. It sends the interferometry signals to the controller, which is connected to the computer via an RS-232 port. The controller box includes a velocity decoder VD-06, which processes signals received from the sensor head. There are a number of output signal formats from the controller, including an S/P-DIF output and digital and analogue velocity signal outputs.

Figure 3. The Polytec™ LDV: (a) Controller OFV-5000; (b) Sensor head OFV-505; (c) Telescope VIB-A-P05.

 

To receive and to process the signal from the controller, we use a low-cost Audigy2 ZS audio card with built-in S/P-DIF I/O interface on the console of the computer. This audio card can receive the digital signals from the controller and play them back through the audio outputs on the console machine. It can also save the received signals as audio files, e.g., in MP3 or WAV format. The main features of the LDV sensor and the accessories are listed as follows:

 

• Sensor Head OFV-505
    - HeNe (helium-neon) laser, λ = 632.8 nm, power < 1 mW
    - OFV-SLR lens (f = 30 mm), 1.8 m to 200+ m, automatic focus
    - “Any” surface

• Controller OFV-5000
    - Low-pass (5, 20, 100 kHz), high-pass (100 Hz), and tracking filters
    - RS-232 interface for computer control

• Velocity Decoder VD-06
    - Ranges: 1, 2, 10 and 50 mm/s/V
    - Resolution 0.02 µm/s in the 1 mm/s/V range (2 mV/20 V)
    - 350 kHz bandwidth analog output
    - 24-bit, 96 kHz max. digital output on S/P-DIF interface

• Telescope VIB-A-P05
    - ±1° vertical tilt and ±1.5° horizontal tilt
    - HeNe interference filter gives improved visibility

 

We also developed a software system, called LDVProject (LDVProject.jar), to configure the controller and to process the received LDV digital signals for audio playback (Figure 4). This system communicates with the controller via the RS-232 interface, sending commands to change the device parameters and to monitor the status of the device. It also integrates some LDV signal processing and enhancement components, which are described in Section 5.

 

 

Figure 4. LDV control interface

3.2. Infrared camera

The FLIR ThermoVision A40M IR camera has the following features that make it suitable for human and vehicle detection:

• Temperature range of -20 °C to 500 °C, accuracy (% of reading) ± 2 °C or ± 2%

• 320 x 240 focal plane array with uncooled microbolometer detector, spectral range 7.5 to 13 µm

• 24° FOV lens, spatial resolution 1.3 mrad, with built-in focus motor

• FireWire output: IEEE-1394, 8/16-bit monochrome and 8-bit color

• Video output: RS170 EIA/NTSC or CCIR/PAL composite video for monitoring on a TV screen

• Keyboard interface for easy on-site control of the camera

• ThermoVision Systems Developers Kit (C++) for software development

 

Figure 5(1). A person sitting in a dark room can be clearly seen in the IR image, and the temperature can be accurately measured. The reading of the temperature at the cross (Sp1) on the face is 33.1 °C.

Figure 5(2). Two IR images before and after a person stands at a distance of about 200 feet. The reading of the temperature at the cross (Sp1) changes from 11 °C to 27 °C. The corresponding color images with the person in the scene are shown in Figure 6.

 

Figure 5(1) shows an example where a person sitting in a dark room can be clearly detected by the FLIR ThermoVision far-infrared camera. Furthermore, the accurate temperature measurements provide important information for discriminating human bodies from other hot or warm objects. After humans have been detected, the environment can be searched for objects, such as the doors or walls in this example, whose vibration with audio waves could reveal what the persons are saying. Note that the FLIR ThermoVision IR camera is a far-infrared thermal camera. It does not need active IR illumination, and it is suitable for detecting humans and vehicles at a distance (Figure 5(2)).
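To illustrate how calibrated temperature readings could help separate human bodies from other warm objects, the sketch below keeps only pixels inside a temperature band. The band limits (25 °C to 38 °C) and the function name are hypothetical choices made for this example, loosely bracketing the 27 °C and 33.1 °C readings quoted for Figure 5; they are not values used or validated in this project.

```python
import numpy as np

def human_candidate_mask(temps_c, low_c=25.0, high_c=38.0):
    """Return a boolean mask of pixels whose temperature lies in [low_c, high_c]."""
    temps_c = np.asarray(temps_c, dtype=np.float64)
    return (temps_c >= low_c) & (temps_c <= high_c)

# Synthetic 320 x 240 thermal frame (hypothetical data, in degrees Celsius).
frame = np.full((240, 320), 11.0)     # cool background
frame[60:100, 40:60] = 33.1           # skin-temperature region
frame[200:210, 300:310] = 80.0        # hot object, e.g. an engine part

mask = human_candidate_mask(frame)
print(int(mask.sum()), "candidate pixels")
```

A practical detector would combine such a band with shape and motion cues, since clothing and ambient temperature shift the apparent body temperature.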

3.3. PTZ camera

The state-of-the-art, computer controllable pan/tilt/zoom (PTZ) camera is also ideal for human and other target detection at a large distance. The Canon PTZ camera we acquired has the following properties:

• 26X optical zoom lens and 12X digital zoom

• 1/4" 340,000-pixel CCD

• Pan: ±100°, Tilt: +90°/-30°

• Minimum subject illumination 1 lux (1/30 second shutter speed)

• Motorized infrared (IR) cut filter on/off

• Built-in IR light (effective up to 9 feet)

• BNC video output

• RS-232 computer control interface

• Compact and lightweight at only 14.3 oz

 

Figure 6 shows two images of the same scene with two different camera zoom factors. The built-in IR light will not work at long distances. However, since the camera can sense near-IR light, the spot of an LDV with a near-IR laser can be seen by this kind of camera, which could be used for targeting with an IR-laser-based LDV.

 

 

 

Figure 6. Two images of a person at a distance of about 200 feet, captured by changing the zoom factors of the PTZ camera.

 

4. Laser Doppler Vibrometer: Principle and Applications

 

Laser Doppler vibrometers (LDVs) work according to the principles of laser interferometry. Measurements are made at the point where the laser beam strikes the structure under vibration. In the heterodyne interferometer (Figure 7), a coherent laser beam is divided into object and reference beams by a beam splitter BS1. The object beam strikes a point on the moving (vibrating) object, and light reflected from that point travels back to beam splitter BS2 and mixes (interferes) with the reference beam at beam splitter BS3. If the object is moving (vibrating), this mixing process produces an intensity fluctuation in the light. Whenever the object has moved by half the wavelength, λ/2, which is 0.3164 µm (or 12.46 microinches) for the HeNe laser, the intensity goes through a complete dark-bright-dark cycle. A detector converts this signal to a voltage fluctuation. The Doppler frequency fD of this sinusoidal cycle is proportional to the velocity v of the object according to the formula

$$f_D = \frac{2v}{\lambda} \qquad (1)$$

Instead of detecting the Doppler frequency, the velocity is directly obtained by a digital quadrature demodulation method [1, 2]. The Bragg cell, which is an acousto-optic modulator to shift the light frequency by 40 MHz, is used for identifying the sign of the velocity.
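As a quick numeric check of the scale involved, the sketch below evaluates the standard Doppler relation for light reflected back from a moving surface, f_D = 2v/λ, at the HeNe wavelength listed for the OFV-505. The function names and the sample velocities are illustrative assumptions, not part of the LDV software.

```python
HENE_WAVELENGTH_M = 632.8e-9  # HeNe laser wavelength in meters

def doppler_frequency_hz(velocity_m_per_s, wavelength_m=HENE_WAVELENGTH_M):
    """Doppler shift of light reflected back from a surface moving at the given velocity."""
    return 2.0 * velocity_m_per_s / wavelength_m

def fringe_displacement_m(wavelength_m=HENE_WAVELENGTH_M):
    """Surface displacement that produces one full dark-bright-dark cycle."""
    return wavelength_m / 2.0

for v in (1e-6, 1e-3, 50e-3):  # 1 um/s, 1 mm/s and 50 mm/s
    print(f"v = {v:.0e} m/s -> f_D = {doppler_frequency_hz(v):.2f} Hz")
print(f"one fringe = {fringe_displacement_m() * 1e9:.1f} nm")
```

Even at 50 mm/s the Doppler shift stays far below the 40 MHz offset introduced by the Bragg cell, which is consistent with using that offset to recover the sign of the velocity.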

 

Figure 7. The modules of the Laser Doppler Vibrometer (LDV)

 

Most objects vibrate when wave energy (including voice waves) is applied to them. Although the vibration caused by voice energy is very small compared with other vibrations, it can be detected by the LDV. The voice frequency f ranges from about 300 Hz to 3000 Hz. Velocity demodulation is better suited to detecting vibrations with higher frequencies because of the following relation between the velocity, frequency, and magnitude of the vibration:

$$v = 2\pi f m \qquad (2)$$

Note that the velocity v will be large at a high frequency f, even for a small magnitude m. The Polytec LDV sensor head OFV-505 and the controller OFV-5000 can be configured to detect vibrations in several velocity ranges: 1 mm/s/V, 2 mm/s/V, 10 mm/s/V, and 50 mm/s/V, where V stands for volts of the output signal. For voice vibration, we usually use the 1 mm/s/V velocity range. The best resolution is 0.02 µm/s in the 1 mm/s/V range, according to the manufacturer’s specification (with retro-tape treatment). Without retro-tape treatment, the LDV still has a sensitivity on the order of 1 µm/s. This indicates that the LDV can detect vibration (due to voice waves) at a magnitude as low as m = v/(2πf) = 1/(2 × 3.14 × 300) µm ≈ 0.5 nm. Note that voice waves are in a relatively low frequency range. The Polytec OFV-505 LDV sensor that we have is capable of detecting vibrations at much higher frequencies (up to 350 kHz).
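The arithmetic above can be repeated across the voice band. The sketch below assumes a sinusoidal vibration, so that the displacement amplitude is m = v/(2πf), and a velocity sensitivity of 1 µm/s as quoted in the Introduction; the function name is an illustrative choice.

```python
import math

def displacement_amplitude_m(velocity_amplitude_m_per_s, frequency_hz):
    """Displacement amplitude of a sinusoidal vibration with the given velocity amplitude."""
    return velocity_amplitude_m_per_s / (2.0 * math.pi * frequency_hz)

SENSITIVITY_M_PER_S = 1e-6  # assumed velocity sensitivity of 1 um/s

for f in (300.0, 1000.0, 3000.0):  # low end, middle and high end of the voice band
    m_nm = displacement_amplitude_m(SENSITIVITY_M_PER_S, f) * 1e9
    print(f"f = {f:6.0f} Hz -> smallest detectable displacement = {m_nm:.3f} nm")
```

For a fixed velocity sensitivity, the detectable displacement shrinks in proportion to frequency, which is why velocity demodulation favors the upper part of the voice band.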

There are two important issues to consider in order to use an LDV to detect the vibration of a target caused by human voices. First, the target vibrates with the voices. Second, points on the surface of the target where the laser beam hits reflect the laser beam back to the LDV. We call such points LDV targeting points, or simply LDV points. Therefore, the LDV points selected for audio detection could be the following three types of targets (Figure 8).

Figure 8.  Target selection and multimedia display. The laser Doppler vibrometer (LDV) can measure audio signals from tiny vibrations of the LDV points (indicated by the beams and the red dots onto the objects in the figure) that couple with the audio sources

 

(1) Points on a human body. For example, the throat is one of the most obvious places where the vibration caused by speech could be detected by the LDV. However, we have found it very challenging because such a target is “uncooperative”: (a) it is not easily targeted, especially when the human is moving; (b) it does not have a good reflective surface for the laser beam, so retro-reflective tape has to be used; (c) the vibration of the throat only includes the low-frequency parts of the voice. For these reasons, our experiments mainly focus on the remaining two types of targets.

(2) Points on a vehicle with humans inside. Human voice signals vibrate the body of a vehicle, and this vibration could be readily detected by the LDV. Even if the engine is on and the volume of the speech is low (e.g., whispering), we could still extract the human voice by signal decomposition, since the human voice and engine noise have different frequency ranges. However, we have found that even when the vehicle is stationary, the body of the vehicle does not reflect the HeNe laser well enough for our purposes without retro-reflective tape. With retro-tape, the LDV signal returns are excellent for targets (cars) at various distances (10 to 50 meters in our experiments) and over a large range of incident angles of the laser beam. It is even more challenging to detect the voice when the vehicle is moving.

(3) Points in the environment. For perimeter surveillance, we can use existing facilities or install special facilities for human audio signal detection. Facilities such as walls, pillars, lamp posts, large bulletin boards, and traffic signs vibrate very well with human voices, particularly during the relative silence of night. Note that an LDV has a sensitivity on the order of 1 µm/s and can therefore pick up very small vibrations. We have found that most objects vibrate with voices, and many types of surfaces reflect the LDV laser beam within some distance (about 10 meters). The response is even better if we treat certain points of the facilities with retro-reflective tape or paint; operating distances can then increase to 300 meters (1000 feet) or more.

 

5. LDV Audio Signal Enhancement

 

Before we describe our experiment designs and data collection, we will first introduce our algorithms for LDV voice signal enhancement since we will need to analyze and present the results of the collected data using the designed algorithms.

For the human voice, the frequency range is about 300 Hz to 3 kHz. However, the frequency response range of the LDV is much wider than that. Even with the on-board digital filters, we still get signals that include troublesome large, slowly varying components corresponding to the slow but significant background vibrations of the targets. The meaningful acoustic signals have relatively small magnitudes and ride on top of these low-frequency vibration signals, which makes the acoustic signals hard for human ears to understand. On the other hand, the inherent “speckle pattern” problem on a normal “rough” surface and the occlusion of the LDV laser beam (by passing objects) introduce noise with large, high-frequency components into the LDV measurements (Figure 9). This creates very loud, high-pitched noise when we listen to the acoustic signal directly. Therefore, we have applied a Gaussian bandpass filter to the vibration signals captured by the LDV. In addition, the volume of the voice signals may change dramatically as the vibration magnitude of the target changes with speech loudness (shouting, normal speaking, whispering) and with the distance of the human speakers from the target. Therefore, we have also designed an adaptive volume function to cope with this problem. Figure 9 shows two real examples of these two types of problems.

 

(a) “Hello…Hello”

 

(b) “I am whispering…(high frequency noise)… OK … Hello (high frequency noise)”

Figure 9.  Two real examples of LDV acoustic signals with both low and high frequency noises. The audio files can be played by clicking the corresponding speaker icons. (a) “Hello…Hello” on top of a low frequency background component from the air conditioning machine. (b)  “I am whispering…(high frequency noise)… OK … Hello (high frequency noise)” with both low and high frequency noises. The volumes of voice changed from whispering to normal to shouting. While the first audio clip is still audible, the second one is almost impossible to hear without enhancement.

5.1. The Gaussian bandpass filter

We can produce the Gaussian bandpass transfer function by expressing it as the difference of two Gaussians of different widths, as has been widely used in image processing [13], i.e.

$$H(s) = e^{-s^2/(2\alpha_2^2)} - e^{-s^2/(2\alpha_1^2)}, \qquad \alpha_2 > \alpha_1 \qquad (3)$$

Figure 10 shows the function. The impulse response of this filter is given by

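$$h(t) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-t^2/(2\sigma_2^2)} - \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-t^2/(2\sigma_1^2)}, \qquad \sigma_i = \frac{1}{2\pi\alpha_i},\ i = 1, 2 \qquad (4)$$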


Figure 10. The Gaussian bandpass filter transfer function


Figure 11. The Gaussian bandpass filter impulse response

 

Notice that the broader Gaussian in the frequency domain (Figure 10) creates a narrower Gaussian in the time domain (Figure 11), and vice versa. We want to reduce the signal magnitude outside the frequency range of human voices, i.e., below s₁ = 300 Hz and above s₂ = 3 kHz. The high-frequency reduction is mainly controlled by the width of the first (the broader) Gaussian function in Eq. (3), i.e., α₂, and the low-frequency reduction is mainly controlled by the width of the second Gaussian function, i.e., α₁. Since a Gaussian function drops significantly when |sᵢ| > 2αᵢ (i = 1, 2), as shown by a pair of ‘*’s and a pair of ‘+’s in Figure 10, respectively, we obtain the widths of the two Gaussian functions in the frequency domain as

$$\alpha_1 = \frac{s_1}{2}, \qquad \alpha_2 = \frac{s_2}{2} \qquad (5)$$

In practice, we process the waveform directly in the time domain, i.e., by convolving the waveform with the impulse response in Eq. (4). This leads to a real-time algorithm for LDV voice signal enhancement. For doing this, we need to calculate the variances of the two Gaussian functions in the time domain. Combining Eq. (4) and Eq. (5) we have

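$$\sigma_1 = \frac{1}{2\pi\alpha_1} = \frac{1}{\pi s_1}, \qquad \sigma_2 = \frac{1}{2\pi\alpha_2} = \frac{1}{\pi s_2} \qquad (6)$$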

For digital signals, we need to determine the size of the convolution kernel. Since the narrower Gaussian (with width α₁) in the frequency domain creates a broader Gaussian (with width σ₁) in the time domain, we use σ₁ to estimate the appropriate window size of the convolution. Again, we truncate the impulse response when |t| > 2σ₁. Therefore, the size of the Gaussian bandpass filter is calculated as

                                                                                        (7)

where m is the sampling rate of the digital signal. Typically, we use m = 48k samples/second with the S/P-DIF format. Therefore, the size of the window will be W₁ = 210. The size of the convolution kernel is marked by a pair of ‘*’s in Figure 11.
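The time-domain filtering described above can be sketched as a difference-of-Gaussians kernel applied by convolution. This is a minimal sketch under stated assumptions, not the LDVProject implementation: it assumes frequency-domain widths of half the cutoff frequencies, time-domain widths σ = 1/(π × cutoff), truncation at ±2σ of the broader Gaussian, and it normalizes each truncated Gaussian to unit sum so that the DC gain is exactly zero (a choice made here for the example). The function names are hypothetical. With these rounding choices the kernel has a little over 200 taps at 48k samples/second, the same order as the W₁ = 210 quoted above.

```python
import numpy as np

def gaussian_bandpass_kernel(low_hz=300.0, high_hz=3000.0, rate=48000):
    """Difference-of-Gaussians impulse response sampled at `rate` samples/second."""
    sigma_low = 1.0 / (np.pi * low_hz)     # broader time-domain Gaussian (seconds)
    sigma_high = 1.0 / (np.pi * high_hz)   # narrower time-domain Gaussian (seconds)
    half = int(round(2.0 * sigma_low * rate))        # truncate at +/- 2 sigma_low
    t = np.arange(-half, half + 1) / rate
    broad = np.exp(-t**2 / (2.0 * sigma_low**2))
    narrow = np.exp(-t**2 / (2.0 * sigma_high**2))
    # Assumption of this sketch: unit-sum Gaussians, so a constant input maps to zero.
    return narrow / narrow.sum() - broad / broad.sum()

def gain_at(kernel, freq_hz, rate=48000):
    """Magnitude of the kernel's frequency response at one frequency."""
    n = np.arange(len(kernel))
    return float(abs(np.sum(kernel * np.exp(-2j * np.pi * freq_hz * n / rate))))

def enhance(signal, kernel):
    """Filter a waveform directly in the time domain."""
    return np.convolve(signal, kernel, mode="same")

kernel = gaussian_bandpass_kernel()
print("kernel length:", len(kernel))
for f in (50.0, 1000.0, 10000.0):
    print(f"gain at {f:7.0f} Hz: {gain_at(kernel, f):.3f}")
```

The printed gains show the intended behavior: slow background vibration and high-frequency bursts are strongly attenuated, while the middle of the voice band passes with moderate loss.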

As noted in [10], most speech enhancement systems improve the quality of the signal (i.e., reduce the noise level) at the expense of reducing its intelligibility. By listening carefully, listeners can usually extract more information from the noisy signal than from the enhanced signal, since filtering also reduces some of the useful acoustic signal components. We first look at the high-frequency reduction issue. In some cases, reducing frequencies above s₂ = 3 kHz will obviously reduce the level of high-frequency noise, particularly the very short bursts caused by a vehicle or a person passing in front of the LDV, which generate very high-frequency “screaming” noise (e.g., Figure 9b). Without high-frequency reduction, listeners experience fatigue over extended listening sessions, which reduces the intelligibility of the noisy signal. This can be demonstrated by listening to a pair of corresponding audio clips before and after high-frequency reduction in Figure 15. In some other cases, high-frequency reduction makes the speech sound “heavy”, and sometimes introduces other high-frequency noise due to the digital processing of the filtering (e.g., truncation of the Gaussian window). An example is shown in Figure 14.

 

Now let us look at the low-frequency reduction problem. Unlike in common speech communication systems, our speech signals are captured by a vibrometer, and in many cases they have a very low signal-to-noise ratio: the low-frequency background vibration “noise” due to wind, the engine of a vehicle, or an air conditioning machine has very high magnitudes, while the vibration measurements of speech have relatively low magnitudes due to the inherent qualities of speech (Figure 9). In some cases, the listener cannot perceive the vibration signals as speech at all (Figure 9b). However, in cases where the reflection of the LDV laser beam is very good and the low-frequency vibration magnitudes are low, or the voice components are comparable to the background vibration (Figure 9a), the intelligibility is better without low-frequency reduction, and the filtering computation is much less expensive (as will be shown below).

 

 

Figure 12. LDV voice signal processing interface.

 

Therefore, in our current implementation, the user controls the high-frequency reduction and/or the low-frequency reduction by enabling or disabling the low-pass (LP) filter and the high-pass (HP) filter, and by selecting an appropriate frequency range (Figure 12). We are also working on algorithms that automatically analyze the original LDV signals and then determine the appropriate range of the bandpass filter. In practice, with only one of the two filters on, we could simplify the computation. Without high-frequency reduction, α₂ approaches infinity, so the narrower Gaussian in the time domain narrows down to an impulse and the filter has the form shown in Figure 13. In this case, the impulse response becomes

$$h(t) = \delta(t) - \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-t^2/(2\sigma_1^2)} \qquad (8)$$

The processed signal is simply the original signal minus the output of the Gaussian low-pass filter with width σ₁. However, the window size in the convolution is still W₁ in Eq. (7), and therefore there is no significant benefit in terms of computational cost.
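The low-stop case can be sketched directly from this description: smooth the waveform with a Gaussian low-pass kernel and subtract the result from the original. As before, this is an illustrative sketch rather than the project code; the width σ = 1/(π × cutoff), the truncation at ±2σ, the unit-sum normalization, and the function names are assumptions made for the example.

```python
import numpy as np

def gaussian_lowpass_kernel(cutoff_hz, rate=48000):
    """Unit-sum Gaussian smoothing kernel, truncated at +/- 2 sigma."""
    sigma = 1.0 / (np.pi * cutoff_hz)          # assumed time-domain width (seconds)
    half = int(round(2.0 * sigma * rate))
    t = np.arange(-half, half + 1) / rate
    kernel = np.exp(-t**2 / (2.0 * sigma**2))
    return kernel / kernel.sum()

def low_stop(signal, cutoff_hz=300.0, rate=48000):
    """Remove slow components: original signal minus its Gaussian-smoothed version."""
    smooth = np.convolve(signal, gaussian_lowpass_kernel(cutoff_hz, rate), mode="same")
    return signal - smooth

# Synthetic test signal: a large constant offset plus a unit-amplitude 1 kHz tone.
rate = 48000
t = np.arange(4800) / rate
noisy = 5.0 + np.sin(2.0 * np.pi * 1000.0 * t)
cleaned = low_stop(noisy, 300.0, rate)
interior = cleaned[300:-300]                   # ignore edge effects of the convolution
print(f"mean = {interior.mean():.4f}, peak = {np.abs(interior).max():.3f}")
```

Away from the edges, the offset is removed while the 1 kHz tone keeps close to its original amplitude.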

 


Figure 13. The Gaussian low-stop filter

 

On the other hand, without low-frequency reduction the bandpass filter becomes a Gaussian low-pass (LP) filter with width σ₂. In this case the window size in the convolution becomes

                                                                                      (9)

which is much narrower and more computationally efficient. For example, when m = 48k samples/second and s₂ = 3 kHz, we have W₂ = 21. The size of the window is marked by a pair of ‘+’s in Figure 11.
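To make the cost difference concrete, the sketch below counts the multiply-accumulate operations per second needed for direct time-domain convolution with the two window sizes quoted in the text (210 and 21 taps at 48k samples/second). The function name is an illustrative choice, and the count assumes one multiply-add per tap per output sample with no FFT-based shortcuts.

```python
def macs_per_second(window_size, rate=48000):
    """Multiply-accumulate operations per second for direct convolution."""
    return window_size * rate

W1, W2 = 210, 21   # window sizes stated in the text
cost_bandpass = macs_per_second(W1)
cost_lowpass = macs_per_second(W2)
print(f"W1 = {W1}: {cost_bandpass:,} MACs/s")
print(f"W2 = {W2}: {cost_lowpass:,} MACs/s")
print(f"ratio: {cost_bandpass / cost_lowpass:.0f}x")
```

The tenfold ratio mirrors the ratio between the two cutoff frequencies, since the window size scales inversely with the cutoff.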

 

A real example of Gaussian bandpass filtering is shown in Figure 14, with different combinations of the two filters (low-reduction and high-reduction). The corresponding audio clips can be played by following the links in the sub-captions.