A Twofold Lookup Table Architecture for Efficient Approximation of Activation Functions

Yusheng Xie®, Alex Noel Joseph Raj®, Zhendong Hu®, Shaohaohan Huang®, Zhun Fan®, Senior Member, IEEE, and Miroslav Joler®, Senior Member, IEEE

Abstract—In this article, we propose a novel approach to reduce hardware resource consumption when neural networks (NNs) are deployed on field-programmable gate array (FPGA) boards. Rather than using a classical approach with lookup tables (LUTs) to approximate the activation functions of an NN, the proposed solution is based on a twofold LUT (t-LUT) architecture, which comprises an error-LUT (e-LUT) and a data-LUT (d-LUT), in order to achieve high precision and speed as well as low hardware resource consumption. The efficiency of the proposed approach was tested against multiple earlier approaches. Our solution showed that the compressibility of the previously referenced works, which were based on single LUTs, could be improved by up to 94.44% and those that were based on a range addressable LUT (RALUT) by up to 6.35% in the examined case of a hyperbolic tangent (tanh) activation function. Moreover, when RALUT and our architecture were combined, it improved the compressibility of the RALUT-based result by up to additional 10.21% for a tanh activation function. The designed architecture had an initial latency of 39.721 ns, when tested with a 50-MHz clock, to simultaneously retrieve data from the d-LUT and t-LUTs.

Index Terms—Activation functions, field-programmable gate array (FPGA), twofold lookup table (t-LUT).

I. INTRODUCTION

WITH the rapid development of artificial intelligence, neural networks (NN) play a vital role in many fields, such as computer vision, natural language processing, and automatic driving. Generally speaking, most of these NNs are implemented using a combination of a CPU and graphics processing unit (GPU), which results in high costs and high power consumption. Recently, NNs have been implemented on a field-programmable gate array (FPGA) with low-cost and low-power architectures. With that being done, a computing activation function of an NN remains a crucial process.

Due to the hardware limitations in FPGA, activation functions are all based on the approximation rather than being calculated on the FPGA. Researchers implement the activation functions by the following three typical approaches: 1) piecewise linear approximation and piecewise nonlinear approximation [1], [2]; 2) lookup tables (LUTs) [3]; and 3) hybrid methods using a combination of the first two methods [4]–[7]. Among these, LUT is the fastest method when compared with the other different hardware implementations as it requires fewer computations to compute the activation function.

Recently, LUT methods that are characterized by implementations with a smaller area, shorter latency, and higher accuracy were presented. These methods either combined the LUTs with linear interpolation procedure [8] or used range addressable LUT (RALUT) [9]–[12] mapping unique outputs to an address range. However, with a high precision [9]–[11], RALUT also needed a number of comparators, which then involves a high area complexity. In [12], a customized logic was used to design an address range decoder, where a simplification of a Boolean expression for necessary comparisons was arduous. Some other proposed methods enabled the use of the same LUTs [13] to reduce the memory space or employ compressibility within the coding process to optimize the table size [14]. In [15], Taylor series expansion [16]–[19] was used to decompose and compress the signal and the middle value was used in the tables as the reference point to represent the consecutive values in the original LUT, leading to differences that are either positive or negative, while no precondition was set for the compression. Linear LUTs present a simple architecture where the keywords are directly mapped to an address, but this results in a heavy hardware usage. Reduction in hardware usage can be achieved by finding a suitable transformation function that maps the keywords to an address, but, again, designing such an architecture is not trivial.

In this article, we propose an LUT architecture that is based on a direct mapping method, the minimum value (rather than the middle value) and two LUTs that reduce the complexity of storage and computation, as it will be explained. Although the transformation of double-precision floating-point values to fixed-point values (Fig. 1) results in many duplicate values, we circumvent that by forming bands of neighboring values grouped together where from each band a minimum value is chosen to be stored in an LUT referred to as the data-LUT (d-LUT). Next, we calculate the difference between the original value and the minimum value stored in the d-LUT and store the error value in another LUT called the error-LUT (e-LUT). Thus, the original LUT values are obtained by adding the minimum value and the error that are

Authorized licensed use limited to: Shantou University. Downloaded on August 20,2020 at 00:08:02 UTC from IEEE Xplore. Restrictions apply.
simultaneously stored in the d-LUT and e-LUT. By choosing the minimum value to represent a band of consecutive values, we ensure that the differences are all only positive values, which simplifies the computation by not having to work with the sign bit. Moreover, we defined a precondition, which enabled us to compute the compressibility mathematically and present the optimal compressibility for various widths of LUTs and band sizes. Our experiments with different activation functions, such as sigmoid, tanh, softplus, and more, proved that the activation functions approximated based on the proposed twofold LUT (t-LUT) structure require only half of the total number of bits compared to the original LUT.

The rest of this article is organized as follows. Section II explains the t-LUT architecture mathematically. Subsequently, the procedure of designing our LUT is shown in Section III. The implementation details and the hardware utilization are presented in Section IV. In Section V, we discuss the limitations and the advantages of the proposed method and provide comparisons with other designs. Finally, Section VI concludes this article.

II. PRINCIPLE OF T-LUT

In this section, the idea behind the proposed t-LUT is described, followed by the theoretical discussion of compressibility between the original LUT and t-LUT.

A. t-LUT Architecture

When we approximate an activation function using an LUT (referred to as the single LUT structure) in an FPGA, the fixed point representation is usually a preferred choice to represent the data. Unfortunately, due to the quantization effects, the approach presents many duplicate values, which causes a huge loss of valuable resources (see Fig. 1). A higher compressibility can be achieved by widening the sampling intervals, but this results in an increased error.

To design the t-LUT architecture, we assume that the difference between two consecutive values in the original LUT should be either 0 or ±1 (we will later discuss in Section V the situation where data do not meet this precondition). We form a band by grouping B neighboring data together (here, we limit B = 2^k (k ∈ Z^+) for computational simplicity) and choose the minimum value M = min(v_1, v_2, v_3, ..., v_i, ..., v_B), v_i ∈ band, i = 1, 2, 3, ..., B as the representation value of this band. We store all M's in a d-LUT and employ another LUT referred to as e-LUT to store all the differences between original LUT values and M's. While retrieving them, values M's and error values are simultaneously read from d-LUT and e-LUT, followed by the sign-extension added in addition to M's. Thus, the original values are intact while extracting from the t-LUT. Fig. 2 shows the proposed t-LUT architecture.

B. Bit Width Requirements of t-LUT and Compressibility

Assume that we have built an LUT with size T_0 given by

\[ T_0 = W_0 \times D_0 \text{ (bits)} \] (1)

where \( W_0 \) is the width in bits and \( D_0 \) is the depth of the LUT (\( D_0 \) depends on the designed techniques). Let \( B \) be the band formed with \( 2^k (k ∈ Z^+) \) values and \( E \) be the number of errors in the band. After grouping, the depth of the d-LUT is determined by

\[ D_d = \text{ceil} \left( \frac{D_0}{B} \right) \] (2)

where \( \text{ceil}(\cdot) \) means round up to an integer and the width of d-LUT \( W_d \) remains the same as \( W_0 \). The depth of the e-LUT is determined by

\[ D_e = D_0. \] (3)

In theory, the band size \( B \geq E \), and then, the width of e-LUT \( W_e \) can be determined by

\[ W_e = \text{ceil}(\log_2 E) \leq \log_2 B. \] (4)

For a given \( W_e \) and \( W_0 \), the compressibility \( C \) is defined as

\[ C = 1 - \frac{T_e + T_d}{T_e} = 1 - \frac{W_e \times D_0 + W_0 \times \text{ceil} \left( \frac{D_0}{B} \right)}{W_0 \times D_0} \] (5)

where

\[ T_e = W_e \times D_e = W_e \times D_0 \text{ (bits)} \] (6)

and

\[ T_d = W_d \times D_d = W_0 \times \text{ceil} \left( \frac{D_0}{B} \right) \text{ (bits)} \] (7)

are the size of e-LUT and d-LUT, respectively.
When \( D_0 \gg B \), (5) can be approximated as
\[
C \approx 1 - \frac{W_e \times D_0 + W_0 \times \frac{D_0}{B}}{W_0} = 1 - \frac{W_e + W_0}{W_0} \tag{8}
\]
where the compressibility \( C \) is independent of the depth of table \( D_0 \).

Since \( B \geq E \), for the worst case scenario, we substitute \( W_e = \log_2 B \) to find the relationship between the band size \( B \), width of LUT \( W_0 \), and compressibility \( C \). Thus, (8) can be approximated as
\[
C \approx 1 - \frac{\log_2 B + \frac{W_0}{B}}{W_0}. \tag{9}
\]

Fig. 3 shows the relationship between \( B \), \( W_0 \), and \( C \) for the worst case. It can be inferred that for a fixed \( W_0 \), as \( B \) is increasing, \( C \) will reach a maximum. Also, irrespective of \( B \), larger the size of \( W_0 \), the better the compressibility performance will be.

Once a proper band size is fixed, we can further improve the compressibility by finding a smaller \( W_e = \text{ceil}(\log_2 E) \) (if it exists). For most arbitrary LUTs, the number of continuous values is nonlinear and nonmonotonic. Thus, although it is a rather complex task to design a generic function that will determine \( E \), the actual \( E \) can be computed through an exhaustive method (see Appendix A). In general, if \( \text{ceil}(\log_2 E) < \log_2 B \), the compressibility will be greater.

III. DESIGN OF t-LUT

In this section, we will demonstrate the process of designing a t-LUT for a sigmoid function.

A. Simplification of the Activation Function

Due to the symmetry of the sigmoid function along the zero, only a portion of the whole function is required to be stored in the LUT. Specifically, if the positive values are computed, the corresponding output can be obtained by \( 1 - \text{sigmoid}(-x) \), where \( x \) is the corresponding negative input. Thus, the sigmoid function \( F(x) \) can be represented as
\[
F(x) = \begin{cases} 
\text{sigmoid}(x), & x \geq 0 \\
1 - \text{sigmoid}(-x), & x < 0 
\end{cases} \tag{10}
\]
where
\[
\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \quad x \in R.
\]

Moreover, once the input exceeds the threshold, the sigmoid function saturates and the output can be approximated to 1. Consequently, for a symmetrical and convergent function, only an output equivalent to \( 0 \leq x \leq \text{Threshold} \) is required to be stored in LUT (see Fig. 4).

Since the LUT output data precision is identical to the NN weights precision, the output data may have a wider range than the sigmoid function. In other words, considering that the NN weights have \( m \) integer bits and \( n \) fractional bits (since the sigmoid function is mapped between 0 and 1), the information between 0 and 1 is required to be stored. Therefore, only \( n \) fractional bits are used to represent the sigmoid function and, thus, by storing only \( n \) fractional bits for each output, we save an \( m \)-bit space per output.

Accordingly, we propose three general strategies for simplification of the target function \( f(x) \).

1) If function \( f(x) \) is symmetrical, only one half of the function is required to be stored in the LUT. The remaining function can be obtained by the symmetrical calculation.

2) If function \( f(x) \) is convergent to a constant \( C \), then \( f(x) \) can be approximated based on the high (HT) and low (LT) threshold conditions
\[
f(x) = \begin{cases} 
\text{LUT}(x), & \text{LT} \leq x \leq \text{HT} \\
C, & \text{else} \end{cases} \tag{11}
\]
where LUT\((x)\) means making inquiry from LUT.

3) If the range of function \( f(x) \) is much narrower than the range of the whole data structure, we abandon the unused bits in the LUT. For example, for a sigmoid function of range 0–1, we only need to store the fractional bits to represent all the values and, thus, one can abandon the sign bit as well as the integer bits.
Based on these conditions, in Section III-B, we discuss the width and depth of t-LUT.

### B. Data Precision

Before we design the t-LUT, we first determine the bit width of \( f(x) \) and \( x \), i.e., how many integer and fractional bits should be used to represent \( f(x) \) and \( x \), respectively. Since hardware designs are constrained, due to memory limitations, we initially presented our bit width analysis and attainable compressibility for a block memory size of 18k bits. Later, we relaxed these memory limitations (see Table X) and presented the achievable compressibility for different widths and band sizes.

Now, the number of integer bits of \( x \) can be easily determined from the range (from LT to HT). In the case of the sigmoid function, it is between 0 and 6. Therefore, the number of integer bits of \( x \) should be 4, including the sign bit. Also, as we explained earlier, the sigmoid function requires only the fractional bits to be stored. Therefore, the number of integer bits of \( F(x) \) is 0, while we set the number of fractional bits of \( F(x) \) and \( x \) to be equal, which can make the LUT meet our precondition more easily (more details are discussed in Section V-A). The number of the fractional bits of \( F(x) \) and \( x \) determines the precision and the sampling interval, respectively. Here

\[
\text{sampling interval} = 2^{-j}, \quad j \leq f \quad (12)
\]

where \( j \) is the number of the fractional bits of \( x \) and \( f \) is the number of the fractional bits of \( F(x) \). Here, we limit \( j \leq f \) because when \( j > f \), it does not reduce the mean-squared error (MSE) but exponentially increases the necessary hardware, resulting in huge memory utilization. An experimental analysis shown in Appendix B for \( f = 9 \) and \( j = 8, 9, 10, 11, 12 \) clearly illustrates it. Referring to Table I, when \( f \) increases from 4 to 8, \( j \) becomes equal to \( f \), presenting more samples with higher precision. Moreover, when \( f \) increased from 8 to 12, \( j \) is lower than \( f \) to meet the memory limitation.

Table I also shows the maximum absolute error, average absolute error, and MSE between the original \( F(x) \) values in double precision and the respective approximated values computed for different data widths of LUTs. When \( f \) is increased from 4 to 12 bits, the maximum absolute errors and average absolute errors will decrease significantly. However, when \( f \) increases from 8 to 12 bits, the memory usage is larger, yet the compressibility does not increase (how the t-LUT is built will be discussed later). Therefore, we establish a tradeoff between the precision, compressibility, and utilization and fix the precision of both \( F(x) \) and \( x \) to eight fractional bits for the sigmoid case, for which we can consequently build two LUTs—a single LUT structure and the proposed t-LUT structure—and determine the required depth. For a given sampling interval, the depth of the single LUT is given by

\[
D_0 = \frac{\text{HT} - \text{LT}}{\text{sampling interval}} + 1 \quad (13)
\]

where HT and LT refer to the upper and lower thresholds of the sigmoid function, respectively. We used MATLAB to compute the threshold values LT and HT, and they were: \( \text{LT} = 0 \) and \( \text{HT} = 6.234375 \). Based on these thresholds, we designed a single LUT for the sigmoid function with fractional bit \( W_0 = 8 \) and the input value \( x \) ranging from LT to HT for the sampling interval given as \( j = f = 8 \). Using (13) and (1), the depth and the total size of the single LUT were \( D_0 = 1597 \) bits and \( T_0 = 12776 \) bits, respectively.

To design the t-LUT, recall (9) and Fig. 3. When \( B = 4 \) or \( B = 8 \), the compressibility is maximal, and subsequently, we use the exhaustive method (see Appendix A) to calculate the actual \( E \). For the sigmoid function, \( E = 2 \) when \( B = 4 \) and \( E = 3 \) when \( B = 8 \) and since ceil\((\log_2 E)\) is less than \( \log_2 B \) and the compressibility can be improved from \( 50\% \) to approximately \( 62.45\% \) with a reduction of 7979 bits when \( B = 4 \) and \( E = 2 \) or from \( 50\% \) to approximately \( 62.48\% \) with a reduction of 7982 bits when \( B = 8 \) and \( E = 3 \), according to (5). The difference between \( (B = 4, E = 2) \) and \( (B = 8, E = 3) \) is caused by the approximation from (5) to (9). Therefore, if there are different \( B \)'s, they make the compressibility maximal according to (9) and Fig. 3, while an error may exist because of the approximation from (5) to (9).

For a comparison, we chose \( B = 8 \) and \( E = 3 \) and, thus, grouped eight neighboring data together and stored the minimum value across the neighbors in the d-LUT. Thus, the width of d-LUT is equal to \( W_0 \) and the depth of d-LUT \( D_d \) is 200, based on (2). Subsequently, we calculate a
TABLE II
LUT SIZE OF DIFFERENT FUNCTIONS

<table>
<thead>
<tr>
<th>Function</th>
<th>Size of Single LUT $W_0 \times D_0$ (bits)</th>
<th>Size of t-LUT $W_1 \times D_1 + W_2 \times D_2$ (bits)</th>
<th>Compressibility</th>
<th>Range</th>
<th>Maximum Absolute Error</th>
<th>Mean Squared Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>sigmoid</td>
<td>$8 \times 1597 = 12776$</td>
<td>ceil(log$_2 E$) $\times 1597 + 8 \times$ ceil($\frac{1597}{4}$) = 4794</td>
<td>62.48%</td>
<td>[0, 6.234375]</td>
<td>0.001953</td>
<td>1.226473e-06</td>
</tr>
<tr>
<td>hyperbolic tangent</td>
<td>$8 \times 888 = 7104$</td>
<td>ceil(log$_2 E$) $\times 888 + 8 \times$ ceil($\frac{888}{8}$) = 3552</td>
<td>50.00%</td>
<td>[0, 3.64684375]</td>
<td>0.001952</td>
<td>6.521651e-07</td>
</tr>
<tr>
<td>ELU</td>
<td>$8 \times 1597 = 12776$</td>
<td>ceil(log$_2 E$) $\times 1597 + 8 \times$ ceil($\frac{1597}{4}$) = 6391</td>
<td>49.98%</td>
<td>[-6.23828125, -0.00590625]</td>
<td>0.001953</td>
<td>1.281475e-06</td>
</tr>
<tr>
<td>softsign</td>
<td>$8 \times 2048 = 16384$</td>
<td>ceil(log$_2 E$) $\times 2048 + 8 \times$ ceil($\frac{2048}{4}$) = 8192</td>
<td>50.00%</td>
<td>[0, 7.99609375]</td>
<td>0.001951</td>
<td>1.271225e-06</td>
</tr>
<tr>
<td>softplus</td>
<td>$12 \times 2048 = 24576$</td>
<td>ceil(log$_2 E$) $\times 2048 + 12 \times$ ceil($\frac{2048}{4}$) = 9216</td>
<td>62.50%</td>
<td>[-4, 3.99609375]</td>
<td>0.001952</td>
<td>1.263616e-06</td>
</tr>
<tr>
<td>$\frac{1}{\sqrt{1+x^2}}$</td>
<td>$8 \times 2048 = 16384$</td>
<td>ceil(log$_2 E$) $\times 2048 + 8 \times$ ceil($\frac{2048}{4}$) = 8192</td>
<td>50.00%</td>
<td>[0, 7.99609375]</td>
<td>0.001953</td>
<td>1.244180e-06</td>
</tr>
<tr>
<td>erf$(\frac{2}{\sqrt{\pi}}x)$</td>
<td>$8 \times 633 = 5064$</td>
<td>ceil(log$_2 E$) $\times 633 + 8 \times$ ceil($\frac{633}{8}$) = 2538</td>
<td>49.88%</td>
<td>[0, 2.46875]</td>
<td>0.001950</td>
<td>4.323394e-07</td>
</tr>
<tr>
<td>$\frac{1}{\arctan{\frac{x}{2}}}$</td>
<td>$8 \times 2048 = 16384$</td>
<td>ceil(log$_2 E$) $\times 2048 + 8 \times$ ceil($\frac{2048}{4}$) = 8192</td>
<td>50.00%</td>
<td>[0, 7.99609375]</td>
<td>0.001952</td>
<td>1.258111e-06</td>
</tr>
<tr>
<td>GELU</td>
<td>$12 \times 2048 = 24576$</td>
<td>(ceil(log$_2 E$) $+ 1) \times 2048 + 12 \times$ ceil($\frac{2048}{4}$) = 11264</td>
<td>34.17%</td>
<td>[-4, 3.99609375]</td>
<td>0.001952</td>
<td>1.185720e-06</td>
</tr>
<tr>
<td>exp$(x)$</td>
<td>$15 \times 1242 = 18630$</td>
<td>failed</td>
<td></td>
<td></td>
<td>0.00000</td>
<td>1.340687e-06</td>
</tr>
</tbody>
</table>

Table III
BIT WIDTH OF T-LUT

<table>
<thead>
<tr>
<th>Function</th>
<th>Address of d-LUT</th>
<th>Address of e-LUT</th>
<th>Data Width of d-LUT</th>
<th>Data Width of e-LUT</th>
</tr>
</thead>
<tbody>
<tr>
<td>sigmoid</td>
<td>8</td>
<td>11</td>
<td>8</td>
<td>2</td>
</tr>
<tr>
<td>tanh</td>
<td>8</td>
<td>10</td>
<td>8</td>
<td>2</td>
</tr>
<tr>
<td>softsign</td>
<td>9</td>
<td>11</td>
<td>8</td>
<td>2</td>
</tr>
</tbody>
</table>

Fig. 5. Sigmoid-case compressibility function with the best 62.48\% compressibility when $E = 3$ and $B = 8$.

difference between the original value and the stored minimum value and use ceil(log$_2 E$) = 2 bits to store the error in the e-LUT. Therefore, the width of e-LUT is 2 and the depth of e-LUT is 1597 according to (4) and (3). Fig. 5 shows the compressibility of sigmoid function. In Table II, more activation functions were approximated using the t-LUT structure and then compared with a single structure (GELU and exp$(x)$ are discussed in Section V).

IV. HARDWARE IMPLEMENTATION AND UTILIZATION ANALYSIS

To establish the utilized hardware resources, we implemented the proposed LUT with t-LUT structure and compared it with the single LUT structure. The implementations were done on a Xilinx Spartan-7 XC7S50 FPGA device with 18k-bits Block RAM. Fig. 6 shows the structure of our t-LUT.

A. Sketch of the Hardware Architecture

The proposed LUT with t-LUT architecture consists of four multiplexers, one comparator, three adders/subtractors units, two core LUTs (d-LUT and e-LUT), and several buffers for synchronization. Assuming that the LUTs are preloaded, the following steps are employed to retrieve the stored values. When queried with a 12-bit input $[x(10:0)]$, for a value greater than 0 [$x(11) = 0$] (refer to Fig. 6), the absolute value of the input $[x(10:0)]$ is compared (using the comparator CMP) to verify the conditions mentioned in (10). Now, we analyze two possible cases.

Case 1: The input is within HT, i.e., $0 \leq x \leq HT$. Now, the comparator enables the LUTs (ena = 1 in Fig. 6) and the input is presented as an address to retrieve the preloaded LUT data from both d-LUT and e-LUT. The depth and width of the LUTs differ for different activation functions. For the sigmoid function, d-LUT and e-LUT require an 8-bit and an 11-bit addressing and the output value to have 8 and 2 bits, respectively. The address width and the data width for a few activation functions, such as tanh and softsign are presented in Table III. Once the data are retrieved from the LUTs, to satisfy the generalized bit-width requirements, the retrieved values are concatenated with zeros and summed and presented as 12-bit outputs.

Case 2: The input exceeds HT ($x > HT$). The comparator disables d-LUT and e-LUT and employs the multiplexers (MUX2 and MUX3) to present a 12-bit constant value $C$ as the output. Also, when a negative input is presented ($x(11) = 1$), the input is 2’s complemented and the absolute value is presented to comparator through MUX1, which retrieves the 12-bit preloaded data based on the scenarios presented earlier (cases 1 and 2). Furthermore, to compensate for the symmetry, the 12-bit output is subtracted from the contact value $C$ and presented as the output through MUX4, as shown in Fig. 6.

B. Utilization

Tables IV–VI present the utilization information for different activation functions, based on the proposed t-LUT method.
Fig. 6. Designed circuit for sigmoid, tanh, softsign, \((x/(1 + x^2)^{1/2})\), \(\text{erf}((\sqrt{\pi}/2)x)\), and \((2/\pi)\text{arctan}(\pi/2x)\) functions using t-LUT in FPGA. Note: * stands for the corresponding bit width of various functions. ** stands for the corresponding depth of d-LUT and e-LUT. Table III shows address width and the data width for sigmoid, tanh, and softsign.

TABLE IV

<table>
<thead>
<tr>
<th>Activation Function</th>
<th>sigmoid</th>
<th>tanh</th>
<th>softsign</th>
</tr>
</thead>
<tbody>
<tr>
<td>Architecture</td>
<td>t-LUT Used</td>
<td>Single Used</td>
<td>t-LUT Used</td>
</tr>
<tr>
<td>Slice LUTs</td>
<td>39 23</td>
<td>38 22</td>
<td>40 23</td>
</tr>
<tr>
<td>Used as logic</td>
<td>39 23</td>
<td>38 22</td>
<td>40 23</td>
</tr>
<tr>
<td>Occupied Slices</td>
<td>14 7</td>
<td>15 7</td>
<td>15 7</td>
</tr>
<tr>
<td>Carry4</td>
<td>5 2</td>
<td>5 2</td>
<td>5 2</td>
</tr>
<tr>
<td>18k BRAMs (bits)</td>
<td>+1,600</td>
<td>+1,776</td>
<td>+4096</td>
</tr>
</tbody>
</table>

TABLE V

<table>
<thead>
<tr>
<th>Activation Function</th>
<th>(\frac{x}{\sqrt{1 + x^2}})</th>
<th>(\text{erf}((\sqrt{\pi}/2)x))</th>
<th>(\frac{2}{\pi}\text{arctan}(\pi/2x))</th>
</tr>
</thead>
<tbody>
<tr>
<td>Architecture</td>
<td>t-LUT Used</td>
<td>Single Used</td>
<td>t-LUT Used</td>
</tr>
<tr>
<td>Slice LUTs</td>
<td>38 22</td>
<td>37 21</td>
<td>39 23</td>
</tr>
<tr>
<td>Used as logic</td>
<td>38 22</td>
<td>37 21</td>
<td>39 23</td>
</tr>
<tr>
<td>Occupied Slices</td>
<td>13 8</td>
<td>14 8</td>
<td>14 8</td>
</tr>
<tr>
<td>Carry4</td>
<td>5 2</td>
<td>5 2</td>
<td>5 2</td>
</tr>
<tr>
<td>18k BRAMs (bits)</td>
<td>+4,096</td>
<td>+5,064</td>
<td>+4,096</td>
</tr>
</tbody>
</table>

Comparing with a single LUT structure to approximate a sigmoid function, the t-LUT consumes more logic gates for the design adders and the comparator, whereas in terms of Block RAM usage, the single LUT uses 12,776 bits, compared with the t-LUT structure requiring only 4794 bits, presenting a compressibility 62.48% with a reduction of 7982 bits.

V. DISCUSSION

In this section, we will analyze the limitation and advantages of our proposed t-LUT and further present comparisons in terms of the hardware utilization with respect to other existing LUT design techniques.

A. Limitation

Our previous results were based on the precondition that the difference between the two consecutive values in the original LUT should be either 0 or \(\pm 1\) [see Fig. 7(a) and (c)]. When the precondition is not met [see Fig. 7(b) and (d)], the e-LUT exhibits an overflow, and hence, more bits are required to avoid the overflow, thus directly reducing the compressibility. The following cases illustrate the preconditions.

Case 1: The difference between the two consecutive values of \(f(x)\) is neither 0 nor \(\pm 1\). For example, while approximating the GELU function, we store the approximated function in an LUT for an input range \(-4 \leq x \leq 3.99609375\), according to (11), before the function saturates. Also, a 12-bit fixed point structure with four signed integer bits and eight fractional bits was used to approximate the values of the function. In theory, the maximum compressibility \(C = 62.50\%\) can be achieved with \(B = 8\) and \(W_c = 3\) (see Fig. 3), but as the difference between the consecutive values ranges from \(-1\) to 2 [see Fig. 7(b)], it causes the e-LUT to overflow. To avoid the overflow, we have to increase the width of the e-LUT to store the whole error value, leading to a reduction in compressibility from 62.5% to roughly 54.17%. Table II shows that if the difference between the consecutive values is larger (as an example, the derivative of exponent function is also an exponential function where \((d(e^x)/dx) = e^x\), the width of the e-LUT will increase, which leads to a reduced compressibility or even uncompressed scenarios (refer to GELU and \(\exp(x)\) in Table II).
Case 2: A large sampling interval [referring to (12)] presents a difference between the consecutive values of $f(x)$ as being neither 0 nor $\pm 1$. As shown in Table I (with the analysis based on an 18k-bit memory), as $f$ increases from 4 to 8, $j$ becomes equal to $f$, and the compressibility increases from 24.09% to 62.48% since the consecutive values are either 0 or $\pm 1$ [see Fig. 7(c)]. The same is valid when $f = 9$ and $j = 8$. However, when $f$ increases from 10 to 12, keeping $j$ fixed to 7, the differences between the consecutive values of $f(x)$...
TABLE X

<table>
<thead>
<tr>
<th>Width</th>
<th>2</th>
<th>4</th>
<th>8</th>
<th>16</th>
<th>32</th>
<th>64</th>
<th>128</th>
<th>256</th>
<th>512</th>
<th>1024</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>0.00%</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>3</td>
<td>16.67%</td>
<td>8.33%</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>4</td>
<td>25.00%</td>
<td>25.00%</td>
<td>12.50%</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>5</td>
<td>30.00%</td>
<td>30.00%</td>
<td>27.50%</td>
<td>13.75%</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>6</td>
<td>33.33%</td>
<td>41.67%</td>
<td>37.50%</td>
<td>27.08%</td>
<td>13.54%</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>7</td>
<td>35.71%</td>
<td>46.43%</td>
<td>44.64%</td>
<td>36.61%</td>
<td>25.45%</td>
<td>12.72%</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>8</td>
<td>37.50%</td>
<td>50.00%</td>
<td>50.00%</td>
<td>43.75%</td>
<td>34.38%</td>
<td>23.44%</td>
<td>11.72%</td>
<td>&lt;0</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>9</td>
<td>38.89%</td>
<td>52.78%</td>
<td>54.17%</td>
<td>49.31%</td>
<td>41.32%</td>
<td>31.77%</td>
<td>21.44%</td>
<td>10.72%</td>
<td>&lt;0</td>
<td>&lt;0</td>
</tr>
<tr>
<td>10</td>
<td>40.00%</td>
<td>55.00%</td>
<td>57.50%</td>
<td>53.75%</td>
<td>46.88%</td>
<td>38.44%</td>
<td>29.22%</td>
<td>19.61%</td>
<td>9.80%</td>
<td>&lt;0</td>
</tr>
<tr>
<td>11</td>
<td>40.91%</td>
<td>56.82%</td>
<td>60.23%</td>
<td>57.39%</td>
<td>51.42%</td>
<td>43.89%</td>
<td>35.58%</td>
<td>26.88%</td>
<td>17.99%</td>
<td>8.99%</td>
</tr>
<tr>
<td>12</td>
<td>41.67%</td>
<td>58.33%</td>
<td>62.50%</td>
<td>60.42%</td>
<td>55.21%</td>
<td>48.44%</td>
<td>40.89%</td>
<td>32.94%</td>
<td>24.80%</td>
<td>16.57%</td>
</tr>
<tr>
<td>13</td>
<td>42.31%</td>
<td>59.62%</td>
<td>64.42%</td>
<td>62.98%</td>
<td>58.41%</td>
<td>52.28%</td>
<td>45.37%</td>
<td>38.07%</td>
<td>30.57%</td>
<td>22.98%</td>
</tr>
<tr>
<td>14</td>
<td>42.86%</td>
<td>60.71%</td>
<td>66.07%</td>
<td>65.18%</td>
<td>61.16%</td>
<td>55.58%</td>
<td>49.22%</td>
<td>42.47%</td>
<td>35.52%</td>
<td>28.47%</td>
</tr>
<tr>
<td>15</td>
<td>43.33%</td>
<td>61.67%</td>
<td>67.50%</td>
<td>67.08%</td>
<td>63.54%</td>
<td>58.44%</td>
<td>52.55%</td>
<td>46.28%</td>
<td>39.80%</td>
<td>33.24%</td>
</tr>
<tr>
<td>16</td>
<td>43.75%</td>
<td>62.50%</td>
<td>68.75%</td>
<td>68.75%</td>
<td>65.63%</td>
<td>60.94%</td>
<td>55.47%</td>
<td>49.61%</td>
<td>43.55%</td>
<td>37.40%</td>
</tr>
<tr>
<td>32</td>
<td>46.88%</td>
<td>68.75%</td>
<td>78.13%</td>
<td>81.25%</td>
<td>81.25%</td>
<td>79.69%</td>
<td>77.34%</td>
<td>74.61%</td>
<td>71.68%</td>
<td>68.65%</td>
</tr>
<tr>
<td>64</td>
<td>48.44%</td>
<td>71.88%</td>
<td>82.81%</td>
<td>87.50%</td>
<td>89.06%</td>
<td>89.06%</td>
<td>88.28%</td>
<td>87.11%</td>
<td>85.74%</td>
<td>84.28%</td>
</tr>
</tbody>
</table>

B. Advantages

1) We used t-LUT to approximate different activation functions, as shown in Table II. In the respective literature, there are eight popular activation functions [from sigmoid to \((2/\pi) \arctan((\pi/2)x)\)] that meet our precondition and reach the theoretic compressibility. Among these results, the softsign reaches the maximum compressibility of 62.5% when compared with the use of a single LUT.

2) Since the compressibility is independent of the LUT depth [as shown in (8)], an increase in the number of entries in the t-LUT will proportionally increase the size of the LUTs (d-LUT and e-LUT) with no additional logic added to the existing circuitry. In contrast, the hardware resources in the implementations in [9], [10], and [12], such as logic gates and comparators, increase proportionally with the number of additional entries, on top of the increase in the required memory resources.

3) The proposed design employs an e-LUT to avoid any error and thus compresses the information without any loss, thereby approximating the activation function with high-precision and less memory resources.

C. Comparison

In order to provide realistic comparisons, we chose to approximate a hyperbolic tangent function, tanh, with a 3.91% relative error and a 3.070168e-05 MSE, as shown in Fig. 8. Furthermore, we also set the LUT architecture designed in [11] as our benchmark and calculate the compressibility and the resources utilization in terms of 18k bits of Block RAM for different architectures. The results are presented in Tables VII and VIII for the LUT depth size of 512 and 1024, respectively. RALUT in [11] utilized 549 bits (2.98%) and 1270 bits (6.89%) of the Block RAM and presented the...
Fig. 10. Designed circuit for ELU, softplus, and GELU activation functions using the t-LUT in an FPGA. Note: * stands for the corresponding bit width of various functions. ** stands for the corresponding depth of d-LUT and e-LUT. Table in Fig. 10 shows the address width and the data width for ELU, softplus, and GELU.

TABLE XI
ERROR ANALYSIS OF DIFFERENCE SAMPLING PRECISIONS

<table>
<thead>
<tr>
<th>Fractional Bits</th>
<th>Sampling</th>
<th>Size of Table (bits)</th>
<th>Maximal Error (bit)</th>
<th>Average Error (bit)</th>
<th>Mean Error (bit)</th>
</tr>
</thead>
<tbody>
<tr>
<td>9</td>
<td>8</td>
<td>39.0625000</td>
<td>159.75</td>
<td>9.765613</td>
<td>5.04665</td>
</tr>
<tr>
<td>9</td>
<td>9</td>
<td>3.9312500</td>
<td>321.94</td>
<td>9.765613</td>
<td>4.925943</td>
</tr>
<tr>
<td>9</td>
<td>10</td>
<td>9.76562500</td>
<td>638.73</td>
<td>9.765613</td>
<td>4.924733</td>
</tr>
<tr>
<td>9</td>
<td>11</td>
<td>9.76562500</td>
<td>1277.66</td>
<td>9.765613</td>
<td>4.925514</td>
</tr>
<tr>
<td>9</td>
<td>12</td>
<td>2.44140625</td>
<td>2554.92</td>
<td>9.765613</td>
<td>4.925671</td>
</tr>
</tbody>
</table>

Algorithm 1 Exhaustive Method Finding E in Original LUT

Input: LUT array lut, Band size B
Output: the maximal different number in one band

\[ \text{dnun} \_\text{max} \]

for i = every indexes of lut do

\[ k = i + B \]

if \[ i \leq \text{length(lut)} - B \] then

\[ \text{temp} = \text{lut}(i : k) \]

if \[ \text{mod}(i, B) == 0 \] then

\[ \text{dnun}(\text{end} + 1) = \text{length(unique} \text{(temp)}) \]

end if

else

\[ \text{temp} = \text{the rest part of lut} \]

\[ \text{dnun}(\text{end} + 1) = \text{length(unique} \text{(temp)}) \]

break

end if

end for

\[ \text{dnun} \_\text{max} = \max(\text{dnun}) \]

replaced with t-LUT, it used 303 bits (1.64%) and reported a compressibility of 93.42%. Also, the combined architecture of t-LUT and RALUT cost 144 bits (0.78%) and reached an improved compressibility of 96.86%.

Overall, when the performance of the conventional LUT is compared against the proposed t-LUT in terms of the BRAM utilization column data in Tables VII–IX, the t-LUT outperforms the LUT by the factor greater than 16.

VI. CONCLUSION

LUT implementations on FPGAs are commonly used in NNs to approximate the activation functions. In this article, we proposed a t-LUT architecture, which employs two parallel LUTs to approximate different activation functions. In our deployments, implementation of the t-LUT requires less resources and has a better performance compared with...
previously implemented LUT structures. The proposed design can be efficiently used in NN implementations on an FPGA.

Furthermore, any existing LUT architectures, satisfying the described preconditions, can be replaced by the proposed t-LUT, consequently achieving higher compressibility rates. We believe that the method can be applied to other function approximations, e.g., square-root, logarithm, and trigonometric functions in digital signal processors (DSPs) or GPUs.

### APPENDIX A

#### PSEUDO CODE

See Algorithm 1.

### APPENDIX B

#### SIZE OF TABLE AND MSE

See Fig. 9 and Table XI.

### APPENDIX C

#### CIRCUIT FIGURE

See Fig. 10.

### REFERENCES


Zhun Fan (Senior Member, IEEE) received the B.S. and M.S. degrees in control engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1995 and 2000, respectively, and the Ph.D. degree in electrical and computer engineering from Michigan State University, Lansing, MI, USA, in 2004. He is currently a Full Professor with Shantou University (STU), Shantou, China. He also serves as the Head of the Department of Electrical Engineering and the Director of the Guangdong Provincial Key Laboratory of Digital Signal and Image Processing. Before joining STU, he was an Associate Professor with the Technical University of Denmark (DTU), Kongens Lyngby, Denmark, from 2007 to 2011, where he was first with the Department of Mechanical Engineering and then with the Department of Management Engineering, and as an Assistant Professor with the Department of Mechanical Engineering from 2004 to 2007. He has been a Principle Investigator of a number of projects from the Danish Research Agency of Science Technology and Innovation and the National Natural Science Foundation of China. His major research interests include intelligent control and robotic systems, robot vision and cognition, MEMS, computational intelligence, design automation, optimization of mechatronic systems, machine learning, and image processing.

Miroslav Joler (Senior Member, IEEE) received the B.S. degree in electrical engineering from the University of Zagreb, Zagreb, Croatia, in 1996, and the M.S. and Ph.D. degrees in electrical engineering from The University of New Mexico, Albuquerque, NM, USA, in 2001 and 2006, respectively. In 2006, he was a Postdoctoral Research Associate at Portland State University, Portland, OR, USA. In 2007, 2013, and 2018, he was elevated to Assistant Professor, Associate Professor, and Full Professor at the Faculty of Engineering, University of Rijeka, Rijeka, Croatia, respectively, where he also worked as an Adjunct Assistant Professor with the Faculty of Maritime Studies in 2008. Since 2008, he has been the Laboratory Director, the Head of the Communications Systems Group, the Graduation Committee Chair, and the Department Chair. He has been a Researcher in multiple scientific projects in the USA and Croatia, while his industry experience includes working as an RF Engineer from 1996 to 1999. He has coauthored articles in distinguished scientific journals and conferences. His research interests include antennas, smart clothing, wearable devices, self-recoverable systems, wireless power transfer, and biomedical applications of electromagnetics. Dr. Joler served as a reviewer, a technical program committee member, and the session (co-)chair and held invited conference talks. He has also served as a Research Proposal and Annual Report Evaluator, an Editorial (Advisory) Board Member, and an Associate Editor. In 2001, he received the University of New Mexico Graduate Office’s Research, Proposal, and Travel Award. In 2017, he was admitted to the Associate Level of the Croatian Academy of Engineering.