VHDL Optimized Model of a Multiplier in Finite Fields 1

Introduction : This article presents a finite field multiplier (GF) model, studying the generalized architecture of the LFSR component (linear regression displacement records), in order to generate a concurrent description. Concepts of structural analysis, description of parameterized components, and mathematical treatment of signals have been applied. Method : The design was performed by the tabulation of the terms in the variable time function and the position in the circuit, the components of the modular reduction, thus creating an array of combined operations. This model was described in VHDL, for testing behavior and optimization of hardware. Results : The research established the equations for the implementation of the VHDL model in its generic expression with operator concatenation for the hardware configuration. It is estimated that the hardware resources, a level of logical operators, obtained a 7.89% savings in the energy consumption associated with the signal in the multiplier design by the proposed optimization technique. Conclusions : The model simplified the description of parallel circuits, high performance from a mathematical model approach to hardware description. The proposed method contributes to the field of optimization in the efficient modeling of advanced logic systems, which can be extrapolated to more complex components.


Introduction
The study of the RS coder [1] for efficient implementation found that they use Galois GF (2 m ) finite field arithmetic to produce fixed length results and that their operations can be implemented with relatively simple circuitry [2].Some of these operations are more complex, which led researchers to focus on the optimization of multiplication algorithms [3].It should be borne in mind that the computational efficiency of arithmetic operations in finite fields is closely related to the particular way in which field elements are presented [4].Therefore, we have detected the need to study solution alternatives for the implementation of GF arithmetic operations (2 m ) based on the mathematical foundations of finite fields.
The Finite Fields of Galois constitute a specific area of mathematics developed by E. Galois.The field is specified by a prime element p (base of the field) and a positive integer m, the length of the field element p m corresponds to the number of elements of the field, and the arithmetic operations on the finite field result in an element that belongs to it [5].These properties are used in applications of coding, Reed-Solomon decoding [6], and cryptography.A widely used presentation is the polynomial form, in which a field-generating polynomial, known as an irreducible polynomial p(x), is defined such that it will operate with the results of operations to bring the result to the fixed length defined for field.
A model that is configurable in a device that supports concurrent implementation that allows for the incorporation of Reed Solomon encoders is proposed.These multipliers are basic components that require implementation in a highly efficient way for speed, and consumption of resources and power by the fractal nature of these circuits [7], as well as the concatenation of complex codes [8].

Finite field multipliers
The multiplier in finite fields is generally more complex than a conventional multiplier [9].These have been a topic of investigation; therefore, the relevance of interpreting the circuit support logic for these modules and establishing an efficient model for their design on FPGA hardware.The arithmetic multipliers of Galois finite fields are based on the multiplication of two elements of the finite body and the reduction of the result by an irreducible polynomial p(x) of degree m.These can be implemented by an algorithmic model, which reproduces the behavior of the multiplier sequentially [10] or with parallel models, depending on the need for these arithmetic modules to operate at high frequencies and occupy the smallest possible area, which demands a balance between these factors.This commitment has led to the search for efficient algorithms and architectures, as indicated by research in the area [11].
Among the models available, the multiplier proposed by Karatsuba-Ofman [12] consists of a modular sub-division technique of the multiplier component, combinational models [13], through tables [14], and table-algorithm combinations [15].The latter presents a contribution in reducing resources based on the original implementation of 255×255 tables, equivalent to 65025 8-bit symbols, to 256 8-bit symbols for the multiplier; however, this reduction still demands 512 LUTs of 4 inputs when making the expansion to cover the table.
From the study of previously developed GF multiplier models, the GF [16] Mastrovito, Paar-Rosner, and Pipelined Combinatorial multipliers were found to have the lowest consumption and highest speed compared to alternative multipliers.For this reason, they have been considered as reference for the evaluation of the results, although some of these multipliers present optimization for a particular approach to the architecture and have been an important contribution for the interpretation of optimization alternatives.From the models studied, the LFSR-based multiplier is considered as the base for the optimization.It presents a latency of m clock pulses, which can be parallelized under a temporal analysis of the circuit.
This analysis is based on the study of the mathematical model of polynomial representation, in which it is stated: If p(x) is the irreducible polynomial, then the multiplication of two elements of the field, represented as the polynomials A(x) and B(x) is the algebraic product of the two polynomials, and the modulus operation of the polynomial p(x), also known as modular reduction, is that shown in equation 1.
The multiplication of polynomials is associative, commutative, and distributive with respect to the addition whereby we obtain equation 2.
Where A(x) and B(x) correspond to the polynomial representation of the operands.In the case of the RS encoder, the multiplicand A(x) corresponds to a coefficient of the code generator polynomial, the multiplier B(x) to the data input, and p(x) is the irreducible polynomial of the Galois field.For the calculation of the modular reduction, the concept of the division of polynomials on finite fields is used, whose mathematical expression is presented in equation 3.
Where r(x) = A(x) mod p(x) corresponds to the remainder of the division between the operand A(x) of the multiplication and the irreducible polynomial of the finite field GF(2 m ).For the circuital implementation of the equations, the stages of the multiplier in finite fields must be identified.First, the model proposed by [9], which includes a modular reduction step in the results, is divided into four levels of operation on the data.In Figure 1, the stages are presented based on the initial model, where the m-bit operands and the displacement of the partial products that give rise to an accumulation of length m + m − 1 are observed.This generates a resultant element that does not belong to the finite field and a conversion must be incorporated through a final reduction so that the product corresponds to an element of the field.
The circuit diagram of the multiplier GF(2 m ) comprises three stages: the operation circuit module p(x) implemented by an LFSR circuit, obtaining the modular reduction of the polynomial A(x), the partial product module with the element Bi implemented by a set of parallel AND gates.The accumulator of partial products, implemented by XOR gates in case of being sequential are fed back using flip-flop memory storage devices.The case of the polynomial generator of the finite field is P(x) = x 8 + x 4 + x 3 + x 2 + 1.
Source: Author own elaboration.
The architecture of the modular reduction stage is based on generators of Galois sequences; its properties are studied in [18].From there it is observed that it was originally implemented by registers to obtain the n vectors corresponding to the partial residues.This is by means of a bit shift over the structure in m clock cycles, where m is equal to the number of bits of the word, in the case of a GF field (2 8 ), with 256 field elements of 8 bits in length.The model of a configurable LFSR is of great interest for the synthesis process [19].Based on this, we propose the analysis of the behavior of the temporal results generated by the LFSR structure in order to construct a mathematical model, oriented to the description in VHDL.
From the circuit model and the construction of terms as a function of the instant and the spatial position, the behavior of the polynomial divisor module is determined to obtain the vectors a1 ... a8, the result of A(x) mod P(x) using concatenation instead of sequential processing.The implementation is reduced to the use of AND gates between each coefficient of p(x), and the use of XOR gates that operate with partial coefficients results, all in the same clock pulse.In this way, the configuration syntax for a parallelized version of this multiplier in VHDL is created, where the sequential components have been eliminated and replaced by variables that depend on combinational functions.
Starting from the analysis of multiplier architectures [20], we can observe a combinational multiplication step composed of a set of ANDs and XORs and a sequential modular reduction step, which implements a division of polynomials between a finite field element A(x) ∈ GF and the irreducible polynomial that defines the field p(x).If it is possible to define the LFSR model through a concurrent implementation [21], the area and power consumed are similar and even less compared to the sequential version [22].For this, we analyzed optimizations on the circuit [18] and a mathematical model for the generation of the sub-sequences, considering that applications of high performance require parallel implementations [23].

Method of description of multiplier GF in VHDL
The modules for data processing in finite field algebra GF(2 m ) can be programmed in software through algorithms for sequential processing or hardware solutions for concurrent models, the latter being the model that is proposed.An alternative for the implementation of this model is based on search tables, allowing changing the representation of the elements of the field GF (2 m ), or through a circuit model where the design of the module is based on the mathematical model and behavior interpretation of the circuit.For this purpose, the architecture was analyzed, using modular reduction based on the LFSR.
For the interpretation and analysis of the LFSR device with sequential structure, we proceeded to describe each sequence generated at an instant of time t, thus obtaining the associated states, and the mathematical representation of the sequential LFSR.At this point, we proceeded to describe the behavior of the circuit for each clock pulse, using the position indexes corresponding to each memory element and the logical operations that are performed for the construction of the terms for which the descriptions are given in Table 1.From the description in VHDL, it is possible to reproduce the LFSR in parallel.These data allow us to generalize the equations.At time t = 1 the resulting vector will be the data input, D_date, at time t = 2 we have the partial vector of the circuit operation, shown in equation 4.

Table 1. Description of the concurrent modular reduction circuit
a t =a t-1 (6) and a t-1 (7) xor p (7),… , a t-1 (0) xor a t-1 (7) and p(0), a t-1 (7) and p(8) (4) Where operations are defined by the AND and XOR logic operations, given that the operands are binary elements, corresponding to the coefficients of the polynomials A(x) and p(x).An analysis of the time outputs of the circuit during its operation yields an array of equations, which depend on the clock cycle and the position of the vector, and allow the calculation of the terms generated by the LFSR circuit.The mathematical-logical relationship for each of the elements generated by the LFSR component is presented in Table 2, as an array of combinational operations that describe the parallel multiplier model.
In Table 2 we can observe the relationship of the terms when applying the operation of each branch of the LFSR circuit on the polynomial A(x).Each branch is represented by the coefficient of the generator polynomial of the field p(x) in position i, corresponding to p(i), which will be fixed in time.Since the modular reduction operation generates a set of m partial residues, according to which its behavior can be described, it is proposed to parameterize the generating function of each element.In this way, the method used is the association of the sub-indices with the relation of the parameters given by the position i and the time instant t.
Based on this modeling technique, we establish the coefficients of the residual polynomial as a function of time as a t (i); the coefficient of the polynomial in the most significant position as at-1(m-1), representing the most significant bit fed back into the circuit, at a time instant t-1 and a t-1 (i-1), which corresponds to the preceding bit of the calculated element.From this procedure the equations that describe the behavior of the LFSR circuit for its parallel description were developed, in order to achieve the implementation of the concurrent structure for VHDL.

Hardware description optimization techniques
The simplification of the circuit is proposed, defining this as a function of the characteristic polynomial.The first step is the selection of the irreducible or primitive polynomial for the generation of the field; these polynomials are characteristic of the length m of the field GF (2 m ) and are presented in detail in [24].The criterion that must be met is that they divide the elements of the finite field and cannot be written as the product of two field polynomials [25].
A polynomial satisfying such a condition is the polynomial p(x) = 285, which has been selected because it is a polynomial GF field generator widely used in standard RS encoders [26].At this point the steps of the calculation of the multiplication in finite fields were studied to describe them concurrently in VHDL language in order to obtain the model for the hardware implementation of the multiplier.Using the polynomial generator of the field in its binary form, p(x) = 100011101, which is a polynomial of degree m = 8, we obtain the particularized model shown in Figure 2.This simplified circuit of the modular reduction component presents a saving in the resource consumption of the device.The VHDL description of the concurrent LFSR circuit model used by the multiplier is generated from a simplification of Table 2 by the direct cancellation of the branches whose coefficients are "0" in the selected irreducible polynomial, resulting in the VHDL code of the simplified structure, presented in Table 3.The products must be implemented with AND gates between the elements of the partial outputs of the LFSR, which are at, i, a vector of m elements with the data element bt, the bit being in the position t of the input B. In bitwise parallel, in order to obtain the resulting vectors bt, ci, j, corresponding directly to at, i, in case the bit in the respective position of bt is equal to '1', in case otherwise it will be annulled.A vector Bt is defined as the concatenation of the element bt, m times, the partial results operate through XOR gates for the results of the final product, as described in Table 4.

Results
The test benches of the components were created to observe the response of the designed modules.To validate the GF multiplier, based on LFSR, we considered the parameter m = 8, with a field generator polynomial p(x) = 285, and the coefficients 104 and 13.The multiplication of the first 16 elements of the finite field GF (256), generated the expected results, in parallel (see Figure 3).
The operation of the RS encoder (255, k) was validated for the cases studied when the multiplier components to be used in the encoder were tested.For this purpose, the parameter k variable was established, which was assigned the values 223, 239, and 247, using the polynomial generator of RS code for each case.The coefficients of the polynomial G(x) are assigned in the step of the encoder redundancy generator.For its operation with the data D(x), the data entry uses values from 1 to k increasing, set in the test_bench.In each case, the 255-k corresponding redundancy symbols are obtained.The simulation was performed using the ModelSim XE III 6.3c tool, incorporated in the ISE of Xilinx.The coded word generated corresponded with the result expected from the theoretical behavior.Source: Author own elaboration.
At the resource level of the FPGA hardware device a logical complexity of 3 stages is obtained (modular reduction), demonstrating that the depth of the design does not interfere in the propagation speed of the signals, as shown in Figure 4.
To implement the circuit, 44 reported LUTs were required, the Slice number is associated with the number of LUTs of the FPGA as mentioned before, the device XC5VLX30 presents 4 LUTs per Slice.The number of inputs required by LUTs for the design is 6, 4, and 3, requiring 17 Slices, as shown in Table 5, which is a significant reduction of FPGA components, compared to previous reports.

Optimization for signal allocation
After the theoretical analysis, the power consumption in mW of the designed modules was obtained, using Xilinx's XPower Analyzer tool, as shown in Table 6, considering as a comparison factor the order of allocation of the signals for the multiplier circuit [29].The power consumption associated with the logic of the P_Logic circuit in both cases was 0.04 mW, while the P_Signal presented a variation according to the order of the operands, consuming a power associated with the signals of 0.38 mW in the case of B(x) mod p(x).A(x).A power consumption of 0.35 mW was seen in the case of A(x) mod p(x).B(x); this optimization was achieved through the signal reordering technique; in this case the results of the multiplier were tested using the commutative property of the operations.There is no consumption associated with the clock signal P_clk, since the multiplier is concurrent and there is no consumption of P_IO because it is an internal component and the signals of the multiplier are not implemented in the external pins of the FPGA.The order of the inputs was found to only have an effect on the P_Signal that corresponds to a technique of decrease of the consumption that will be used for the design of the modules of the encoder.Based on the power ratio of the optimized design and the original version, a 7.89% savings in the power consumption associated with the signal in the multiplier design was obtained.
The model was developed starting from the expression of the polynomials generated as partial residues in the multiplication operation in finite fields, which was expressed as a function of the x i displacements in the multiplication operation with the corresponding B i coefficient.
The matrix expression for the product of symbols in Galois finite fields can be expressed as: After obtaining the multiplier equations in VHDL, a mathematical-logical model for reconfigurable hardware was obtained from the study of the equations of each of the elements of the LFSR, presented in Table 2.In this way, a generalization of the behavior of the elements a t (i) was obtained in equation 7.
a t (i)= a t-1 (i-1) xor (a t-1 (m-1) and p(i)) (7) a t is an element generated at a time t to apply the modular reduction A(x) mod p(x), resulting from the operation and between p(i), the polynomial coefficient of the feedback function at a position i, with the element a t-1 in the most significant position of the previous symbol, this term is in XOR with a t-1 , in position i.It is important to note that the optimization by signal organization has been considered, in which the order of multiplication factors affects the efficiency of the design, the most efficient distribution of signals being presented in the model.
Next, the general equation of the residual vector of the modular division or reduction was defined for each time instant, using the variable t as an operation reference with the elements of B(x).Using the concatenation operation, substitute the clock signal to generate the sequence of the bits given by equation 7, from which equation 8 is obtained.
Where the '&' symbol is used to indicate the concatenation operation in order to match the description in VHDL.This is optimized by parallelizing the circuit by concatenation of signals for concurrent processing.To implement the combinational step of the product of A(x) xi mod p(x).B(x), which corresponds to the AND operation of each element of B(x) by each other at and the summation XOR of these, equation 9 is derived.
) and b t (10) This equation has the variable i for nonzero coefficients of the polynomial p(x), which achieves an optimization by the simplification of the equations, as mentioned in the efficiency analysis.This VHDL model allows for the resource consumption analysis based on the multiplier architecture, which has been parallelized through the concurrent description of the modular reduction stage.This represents an important advance in the proposed multiplier model with the concurrent LFSR component.The calculation of the consumption of gates was established by equation 11, for the calculation of AND required.
#AND mult = m 2 (11) And for the calculation of the consumption of XOR gates we used equation 12.
#XOR mult = m 2 +p.m-4.m-p+3(12) where m corresponds to the number of bits of each field word and p corresponds to the number of non-zero bits of the irreducible polynomial p(x); in this case the optimization corresponds to the simplification of operations, as a function of the number of coefficients p(x).When analyzing the structure of the VHDL model, the associated direct operands are obtained and an estimation of hardware resources can be made at the level of gates.

Conclusion
A model of the studied circuit was found where the description in VHDL was defined for the behavior of the LFSR function, recognizing function generating terms, which maintain relation with the temporal sequence, modeled concurrently.This proposal is innovative because it considers the temporary displacements as concurrent concatenation operations of elements of the circuit, thus optimizing the processing speed and minimum energy consumption, which is of great interest at the moment.In this way, a model for implementation in hardware is achieved, with updated and highly efficient technology, useful for other developments that reuse GF multipliers as basic elements.
From the model, a generalized hardware description is obtained as a novel contribution since a model has not been found that allows the reproduction of results from a set of basic equations and obtains the results discussed here.This allows for a dynamic adjustment with great flexibility in its configuration; the adaptive proposal supports hardware development from the parameterized equations and manages to reuse the components, oriented to concurrent implementation, giving a new approach to the design.Similarly, the behavioral principles of the logical systems are the basis for design and optimization, recognizing the associated patterns in order to achieve the design of a descriptive model at the mathematical -logical level for the generation of the configuration code.
Considering the algorithmic relationship of the hardware functions, it has been proposed to advance this method of modeling, for the optimization of the VHDL designs, with parallel processing capacity, using the least number of resources of the FPGA devices, and test it with a selected study case.We obtained the equations, which describe the GF multiplier, with adaptable parameters, since it was determined that the parameters once assigned allow optimizing the design since the results will be given in a function of the coefficients that defines the characteristic polynomial of the circuit.This is achieved by a highly efficient design and a novel model, which provides a method of hardware configuration, taking advantage of the correspondence with the VHDL code of the components of the multiplier element, for adaptation to new designs.

c = t=1 m a t and b t ( 9 )
Substituting equation 8 into equation 9, gives the model of the multiplier as equation 10. c

Table 5 .
Resources used by the multiplier GF(2 m ) with m = 8

Table 6 .
Power Consumption Comparison of Multipliers