A DIGITAL VLSI LOW POWER INTEGRATED CIRCUIT ARCHITECTURE FOR DELAY ESTIMATION

A. Chacón-Rodríguez*, F. Martin-Pirchio°, P. Julián°, P. Mandolesi°

* Laboratorio de Componentes Electrónicos, Universidad Nacional de Mar del Plata, Argentina, on leave from Instituto Tecnológico de Costa Rica.
° Departamento de Ingeniería Eléctrica y Computadores, Universidad Nacional del Sur, Argentina

achacon@fi.mdp.edu.ar, fmartinpirchio@uns.edu.ar, pjulian@uns.edu.ar, pmandolesi@uns.edu.ar

ABSTRACT

The design of a low power digital VLSI CMOS integrated circuit for the measurement of signals in the range [10, 300] Hz is presented. The architecture performs a delay calculation in order to determine the bearing angle of a sound source. The goal is to reduce power dissipation with respect to a previous implementation while maintaining computing accuracy. A preliminary Verilog RTL implementation is tested on a Xilinx® FPGA in order to assess the performance of the calculation algorithm and to tune the digital structure.

Keywords: Verilog, Hardware Description Language, FPGA, low power, digital VLSI.

1. INTRODUCTION

Methods for the detection of sound sources have been widely studied, including the use of complex techniques such as Independent Component Analysis, cross-correlation analysis [1], [2], [3], Gradient Flow techniques [4], and the emulation of the human cochlea [5]-[7], some of which have been successfully implemented in analog and digital VLSI circuits [6]-[10]. One of these methods, based on a cross-correlation derivative algorithm, was proposed in [5] and successfully implemented in [11]. That integrated circuit (IC) allows bearing angle detection with an error of less than one degree while dissipating less than 500 µW. An alternative structure for the estimation of this angle is proposed here in order to reduce power dissipation still further, while keeping the calculation performance established in [5], [11]. The problem is to determine the direction of a sound source picked up by an array of microphones such as the one in Fig. 1. The latest version of this circuit, described in another paper presented at this congress [12], dissipates only 62 µW, of which 45.6 µW is dissipated in the circuit itself and the rest is divided between the pads and internal reset generation. The objective is to significantly reduce this power dissipation.

Fig. 1. Microphone array to measure bearing angle from a sound source

In order to achieve this, a single counter for determining the delay is proposed instead of the 92 counters used in the former implementation. The counter is intended to run at a speed not much higher than the frequency of the signals being measured. The delay registers and intermediate flip-flops (FFs) needed to provide data for the calculation, which must be clocked at a higher frequency and would therefore account for most of the dynamic power requirements, are to be designed using C²MOS dynamic techniques in order to reduce the number of transistors required and, consequently, the power needs. Section 2 of this paper describes the Verilog HDL front-end implementation of the proposed structure. Section 3 analyzes simulations and tests executed on a Xilinx® Spartan3 FPGA, which provide data for comparison against the results previously obtained in [5]. Section 4 describes the back-end implementation of the circuit, together with the preliminary SPICE simulation results and the power requirements calculated and compared with those in [12]; the chip was at the final stages of design verification at the time of submission of this paper.

2. STRUCTURAL DESIGN (FRONT-END)

The first problem is the testing of the new algorithm in order to determine its accuracy with respect to the version previously used. Any significant loss of such accuracy would of course render useless any power dissipation improvements.

The front-end design was coded in Verilog HDL using the Xilinx® Integrated Software Environment (ISE) and implemented on a Spartan3 Digilent Inc. prototyping board. Simulations were run in the Mentor Graphics® ModelSim® HDL simulator.

Fig. 2 depicts the basic structure, composed of a block that captures and stores the signals being measured, and a second block which calculates the delay. The output has a tri-state control (oe_L) to allow its interfacing to a general data bus, with two extra signals providing information about the state of the unit, i.e., if it is out of its measurement range (out_range) and if data is available (data_rdy).

2.1. Delay chain

The first block, shown in Fig. 3, captures the signals at a 200 kHz rate. This rate is chosen in order to attain an estimation accuracy of one degree for angles in the range $\theta = [0^\circ, 50^\circ] \cup [130^\circ, 180^\circ]$ for signals in the range $[20 \text{ Hz}, 200 \text{ Hz}]$, as stated in [5]. Data is stored in two serial-in parallel-out (SIPO) registers that serve as delay chains.

Considering such speeds, the proposed circuit would allow for measurements of up to $\pm620 \mu\text{s}$ of delay, thus increasing the range of the system implemented in [11].

For reference’s sake, it is always assumed that signal $X_1$ leads $X_2$. The first bit of one of the chains (by convention $X_2$) is used as the base pointer, while the other chain is swept in search of transitions by the index provided by the delay-calculation part of the circuit ($\text{tao\_index}$). This index is an 8-bit signed integer in 2’s complement format. The sign bit controls the multiplexers that handle the case in which $X_1$ is actually lagging $X_2$ instead of leading it. In this situation, the base pointer is switched to $X_1[0]$ and the magnitude of the index is used to sweep $X_2$ instead. To allow for this without using another decoder, the enable signals of $X_2$ are wired backwards to the 7-to-128 decoder so that, for instance, a value of minus 1 (FFH in 2’s complement) activates the decoder’s output 128, which goes to $X_2[0]$, and so on.

An error of minus one tap ($-5\mu\text{s}$ at a sampling rate of 200 kHz) is introduced by this scheme, because the base is actually displaced by minus one bit as a result of the base switching. This error is neglected for simplicity and, in any case, can easily be corrected by the software of the system receiving the final data.
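
The index decoding described above can be sketched in software as follows. This is only a behavioural illustration under stated assumptions: the function name `decode_tao_index`, the tuple it returns and the zero-based tap numbering are hypothetical and are not taken from the Verilog source, and the reversed wiring of the $X_2$ enables is modelled simply as tap = 127 minus the decoder input.

```python
def decode_tao_index(tao_index: int, taps: int = 128):
    """Map an 8-bit two's complement index to (base pointer, swept chain, tap).

    Hypothetical behavioural model, not the authors' RTL. The sign bit
    selects which chain supplies the base pointer, and the 7 low bits
    drive a single 7-to-128 decoder. For negative indices the enables
    of X2 are assumed to be wired to the decoder in reverse order.
    """
    if not 0 <= tao_index <= 0xFF:
        raise ValueError("tao_index must fit in 8 bits")
    sign = (tao_index >> 7) & 1
    decoder_input = tao_index & 0x7F
    if sign == 0:
        # X1 leads: base pointer on X2[0], X1 is swept directly
        return ("X2[0]", "X1", decoder_input)
    # X1 lags: base pointer on X1[0], X2 is swept through reversed wiring
    return ("X1[0]", "X2", (taps - 1) - decoder_input)


print(decode_tao_index(0x03))  # index +3
print(decode_tao_index(0xFF))  # index -1
print(decode_tao_index(0xFD))  # index -3
```

In this model an index of minus 3 (FDH) selects tap 2 of $X_2$, one tap short of its magnitude, which corresponds to the one-tap offset discussed above.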

As a side note, in the case of the FPGA implementation, and due to the lack of internal tri-state buffers on the Spartan3, the synthesizer was allowed to perform wired logic substitution in order to create the 128 bit buffers. This will not be the case in the ASIC implementation.

2.2. Calculation unit

The calculation unit must detect valid transitions in the input signals in order to increase or decrease the index counter (see Fig. 5 for an example of the validation of such transitions). The index moves upwards or downwards depending on such transitions and on the index sign bit. Repeated application of the calculation produces a monotonic estimation of the target delay.

Since the circuit is designed to increase or decrease its count by one on each valid transition, the convergence time is determined as:

$$T_{convergence} = \frac{1}{2} \cdot \frac{1}{f_{signal}} \cdot \text{Signal delay} \cdot f_{CLK} \quad (1)$$

This convergence time, in its worst case (maximum delay of 400 µs for a 200 Hz signal), is still well within the proposed estimation period of one second. An out_range signal is provided to indicate the saturation of the index counter, which in the future can be used as an auxiliary signal to allow for the adaptive measurement of faster or slower signals via the modification of the clock speed.
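
As a quick numerical check, (1) can be evaluated directly. The sketch below is illustrative only: the function name `convergence_time` and its parameter names are hypothetical, and the comment that reads the factor 1/2 as one count per half signal period is an interpretation of (1) rather than a statement from the design.

```python
def convergence_time(f_signal_hz: float, delay_s: float,
                     f_clk_hz: float = 200e3) -> float:
    """Evaluate (1): time in seconds for the index to reach the delay."""
    # Delay expressed in clock ticks, i.e. the number of counts needed
    counts_needed = delay_s * f_clk_hz
    # Interpretation: one count per valid transition, two per signal period
    return 0.5 * counts_needed / f_signal_hz


# Case quoted in the text: 400 us of delay on a 200 Hz signal
worst_case = convergence_time(f_signal_hz=200.0, delay_s=400e-6)
print(f"{worst_case * 1e3:.1f} ms")  # 200.0 ms
```

The result of 200 ms is consistent with the statement that this case stays well within the one-second estimation period.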

For the validation of the transitions, the signals are fed through two FFs to produce the signals illustrated in Fig. 4. Taking into account the order of arrival of these signals, the decision logic determines whether to increase, decrease or leave the counter unchanged. This logic is registered so that it acts as a pipeline stage that introduces one clock tick of latency between the detection of the transition and the variation of the index. This eliminates the chance of falsely locking the circuit onto the same transition over and over, which would produce a run-up of the counter.
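
The two-FF arrangement can be illustrated with a small software model. This is a sketch under assumptions: it supposes that the two FFs hold the current and the previous sample of one input, so that a transition shows up as a difference between them. The function name `detect_edges` and the edge labels are hypothetical, and the decision logic that compares the order of arrival on both inputs is not modelled.

```python
def detect_edges(samples):
    """Return (tick, kind) for every transition in a sampled 1-bit signal.

    Assumed model: ff_new holds the current sample and ff_old the
    previous one, as two cascaded flip-flops would on each clock tick.
    """
    edges = []
    if not samples:
        return edges
    ff_new = ff_old = samples[0]
    for tick, bit in enumerate(samples):
        # Shift the two-stage register by one position
        ff_old, ff_new = ff_new, bit
        if ff_new and not ff_old:
            edges.append((tick, "rising"))
        elif ff_old and not ff_new:
            edges.append((tick, "falling"))
    return edges


print(detect_edges([0, 0, 1, 1, 0]))  # [(2, 'rising'), (4, 'falling')]
```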

3. FUNCTIONAL TESTING AND RESULTS ANALYSIS OF FRONT-END IMPLEMENTATION

Simulations were run at the RTL level and the gate level (post place & route) and the results were fed into Matlab® for a preliminary check of the accuracy of the algorithm. A set of files with test signals was created in Matlab® to feed the simulator, and simulations were also performed using real signals taken from previous experiments on the same system used in [12]. This allowed for the tuning of the Verilog code and therefore of the digital structure, and gave an approximate idea of the accuracy of the circuit.

The final tests were executed on a Spartan3 Digilent Inc. board, with the input signals fed from a programmable delay generator written in VHDL and implemented on another Spartan3 board. The outputs were fed to the computer through a PMD-1608FS Measurement Computing® acquisition board. The data was fed to Matlab® to produce an analysis of the delay standard deviation and mean.

The results were compared with theoretical data, with data obtained through simulation and with the data obtained in [12]. Fig. 6 shows an example of such calculations, in which the output evolves from one steady state to another after a sudden change in the delay being measured. For a clock frequency of 200 kHz and an input signal of 92 Hz with a delay of 325 µs, the convergence time is 353.3 ms according to (1). The delay output value was sampled every 5 ms, yielding a measured convergence time of 355 ms, as shown in Fig. 6.
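
The figures quoted for this example can be reproduced by substituting the values into (1). The second expression assumes that the measured value is simply the first 5 ms sampling instant at or after the theoretical convergence time; that reading is an interpretation, not something stated with the measurements.

$$T_{convergence} = \frac{1}{2} \cdot \frac{1}{92 \text{ Hz}} \cdot 325 \mu\text{s} \cdot 200 \text{ kHz} = \frac{65}{2 \cdot 92 \text{ Hz}} \approx 353.3 \text{ ms}, \qquad \left\lceil \frac{353.3 \text{ ms}}{5 \text{ ms}} \right\rceil \cdot 5 \text{ ms} = 355 \text{ ms}$$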

4. BACK-END IMPLEMENTATION, POWER CONSIDERATIONS AND PRELIMINARY SIMULATION RESULTS

Once a functional validation of the structure proposed was obtained and its results showed a performance similar to the previous design, the next step was the back-end implementation of the system in a low power ASIC using 0.5 µm technology.

Due to the lack of standard cells specifically designed for low power purposes, the utilities for integrating Verilog code into the design flow were not used at this time. Instead, based on the logical design already tested, a schematic design was drawn in Tanner® S-Edit, incorporating all the constraints regarding power consumption while abiding as closely as possible by the logical data flow and control implemented in Verilog.

Based on this schematic, a layout of the circuit was drawn using Tanner® L-Edit, from which diffusion and parasitic capacitances were extracted for SPICE analog timing and power simulations.

4.1. Delay chains

As already shown, this unit basically consists of two large SIPO registers, two large multiplexers and a decoder for the selection of data, plus some extra logic for the switching of the reference. This unit is to operate at 200 kHz and, considering its size and operating speed, it will be responsible for the largest share of the power dissipation. In order to reduce this dissipation, the SIPO delay chains were built using C²MOS registers, such as the one shown in Fig. 7. This master-slave edge-triggered register works in a very similar way to its static master-slave transmission-gate counterpart, but without the need for feedback, as the data is stored in the internal node capacitances. Another important feature of this circuit is its lower clock fan-in (two transistors for each clock phase); in addition, it needs only eight transistors instead of the 18 required for a transmission-gate-based static register [13].

SPICE simulations with diffusion capacitance parameters extracted from the layout were run with the whole unit of 256 registers connected, plus the selection logic, all supplied with 3.3 V, to obtain a preliminary estimate of the power dissipation (3.3 V was chosen for the whole structure this time in order to interface the chip easily with the rest of the current working test system).

The current drawn and the RMS power values calculated are shown in Fig. 8 and Table I, compared against data from a previous implementation by Julián et al. [11].

TABLE I. CROSS-CORRELATOR IC POWER CONSUMPTION

<table>
<thead>
<tr>
<th>Description</th>
<th>Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cross-correlator [5]</td>
<td>45.7 µW</td>
</tr>
<tr>
<td>Adaptive</td>
<td>2.075 µW</td>
</tr>
<tr>
<td>Power reduction</td>
<td>$\approx 20$</td>
</tr>
</tbody>
</table>

a. measured at $V_{cc}=3.3$V, with clock signal and without signal activity
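
The entries of Table I can be cross-checked with a few lines of arithmetic. The sketch below is illustrative: the constant and function names are hypothetical, and the per-stage figure assumes that a stage corresponds to one of the 256 registers of the simulated delay-chain unit.

```python
PREVIOUS_POWER_W = 45.7e-6   # cross-correlator entry in Table I
PROPOSED_POWER_W = 2.075e-6  # second entry in Table I
N_REGISTERS = 256            # registers in the simulated delay-chain unit


def power_summary(previous_w: float, proposed_w: float, stages: int):
    """Return (reduction factor, power per stage in watts)."""
    reduction = previous_w / proposed_w
    # Assumption: one stage is one register of the delay chains
    per_stage_w = proposed_w / stages
    return reduction, per_stage_w


reduction, per_stage_w = power_summary(
    PREVIOUS_POWER_W, PROPOSED_POWER_W, N_REGISTERS)
print(f"reduction: {reduction:.1f}x")             # reduction: 22.0x
print(f"per stage: {per_stage_w * 1e9:.1f} nW")   # per stage: 8.1 nW
```

Under that assumption the ratio comes out at about 22, in line with the rounded factor of 20 given in the table.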

4.2. Logical decision and delay calculation

The Boolean equations for the control of the delay counter, $DN_{UP}$ (2) and $CLK_{CNT}$ (3), were obtained using Berkeley’s Espresso minimization algorithm and were pre-tested on the FPGA by introducing them directly into the RTL code in place of the high-level decision statements used in the front-end implementation. A performance similar to that in Section 3 was obtained. Some problems regarding setup times were detected and were corrected by tuning the decision logic structure.

$$DN_{UP} = \overline{SGN} \cdot (A + B) + \overline{SGN} \cdot \overline{OVN} \cdot (C + D) \quad (2)$$

$$CLK_{CNT} = SGN \cdot OVP \cdot (C + D) + SGN \cdot OVN \cdot (C + D) + (A + B) \quad (3)$$

$$A + B = \overline{Y}_1 \cdot Y_{1,1} \cdot \overline{Y}_2 \cdot Y_{2,1} + Y_{1,1} \cdot \overline{Y}_2 \cdot Y_{2,1} \quad (4)$$

$$C + D = \overline{Y}_1 \cdot Y_{1,1} \cdot Y_2 \cdot Y_{2,1} + Y_{1,1} \cdot \overline{Y}_2 \cdot Y_{2,1} \quad (5)$$

This part of the circuit is not as critical regarding power dissipation as the delay chains. In the worst case condition (maximum delay between $X_1$ and $X_2$), the counter operates at a maximum speed equal to twice the input signal frequency, and the decision logic switches at twice this speed, as it evaluates the data in both the first and the second delay chain. Nonetheless, C²MOS logic was also used for the pipelining registers, as they are clocked at 200 kHz. Only 8 standard master-slave static registers were needed for the counter, since its switching speed is too slow to allow for dynamic techniques without the use of a data refreshing unit.

At the time of submitting this paper, the chip was at the final layout stage, undergoing DRC and LVS checks.

5. CONCLUSIONS

An implementation of a low power VLSI CMOS architecture using a 0.5 µm technology was presented. Results of analog simulation show a significant reduction in power dissipation, of about 20 times, over previous implementations. Moreover, if one compares the power dissipation per stage, this design features 8.1 nW per stage as compared with the 770 nW of the previous design (both at 3.3 V). The efficiency of C²MOS dynamic techniques is thus corroborated, with further improvements still possible by reducing the supply voltage in the critical stages. In addition, an improvement in the resolution of the circuit allows for the measurement of delays of up to ±640 µs with a sampling speed of 200 kHz, a feature that hints at the feasibility of a future adaptive system.

6. ACKNOWLEDGEMENTS

The authors thank Martín Di Federico at Universidad Nacional del Sur for his help with the VHDL programmable delay generator.

P. Julián is also with CONICET.

Work partially funded by “Desarrollo de tecnología de redes de sensores para aplicaciones en el medio social y productivo”, PICT 2003 No. 14628, Agencia Nacional de Promoción Científica y Técnica; “Redes de Sensores” PGI 24/ZK12, Universidad Nacional del Sur; “Desarrollo de Microdispositivos para Redes de Sensores Acústicos”, # 5048, PIP 2005-2006, CONICET.

A. Chacón-Rodríguez is on a scholarship funded by the Organization of American States, and the Instituto Tecnológico de Costa Rica.

7. REFERENCES


