Fast and furious convolutions

Quantitative Investments

Read 30 min

Nov 22 2024

Introduction

Artificial intelligence (AI) is radically changing the way we develop quantitative strategies. Among other things, it is changing where we spend our time and efforts. While we spent most of our time formulating macro-economically sound investment rules in the past, we now often find ourselves seeking algorithmic efficiency in our code base. When optimizing algorithms, gaining speed pays off. Consider Sam Altman, who was quoted saying that training ChatGPT cost USD 100 mn, a significant portion of which was spent to buy GPU time.

Many AI networks involve matrix multiplication between large matrices. If, for example, we found a way to multiply matrices faster, resulting in a 10% improvement in training speed, we could claim to have generated USD 10 mn of value via shorter training times, using the ChatGPT example. While figuring out the fastest way to multiply two matrices may not stimulate everybody, compared to trying to figure out the master equation to predict markets, it certainly does add value. Besides, after years of disappointment in the true value-add of complicated models conceived by humans, we argue that we should turn back to more down-to-earth endeavors, and leave it to AI to figure out what model is working for markets at any given point in time.

It is in the spirit of said considerations that we are conducting a deep dive into the convolution operation, a classical building block in AI architectures. While convolutions are a “must-have” piece for image recognition via AI, they hold promise also when forecasting financial markets.

In this article series, we review the basics of the convolution operation and discuss three methods on how they can be expressed as matrix multiplications. We will show that one of methods lends itself to be executed via the FFT (Fast Fourier Transform). The FFT was deemed to be among the top 10 most influential algorithms of the 20th century¹, and significantly abates computational complexity.

The main contribution of this article is to collate information from disparate sources and organize it into a consistent and logical sequence, one that any reader with undergraduate mathematical knowledge can follow. In this respect, nothing of what’s written here shall will provide a truly original contribution to the field of science. We have been mostly repackaging results that are sprawled across a series of media out there, from peer-reviewed articles through to YouTube tutorials (some of them are really good).

In addition to that, this article experiments with a novel way of exposing things, one where we build mathematical propositions bottom-up, starting from a specific instance and generalizing later. In doing so, we will quote and refer to theorems that we need on the go, as opposed to invoking a priori without context. We find that the latter way of doing obscures the true purpose of the theorem to the reader.

We think that mathematics is made unnecessarily hard because it is taught in terms of generalized statements, as opposed to narrated, and discovered from small and simple problems. Choosing to narrate from the specific to the general will certainly lead to more verbosity, but we hope to more clarity also.

Convolution basics

Convolution is a large topic, spanning both continuous and discrete time. In the following, we will deal with discrete, 2D convolution operations, which form the basis for image processing in AI architectures. The extension of the concepts explained herein to higher dimensions is relatively straightforward, as dimensions can be treated independently. We will provide readers with the basics needed to follow the discussion. For a more comprehensive guide on convolution arithmetic, which includes excellent visualizations, we refer readers to the work of V. Dumoulin and F. Visin². In the following, we illustrate the basic idea of a 2D convolution with a numerical example.

A numerical example

Let \(X\) be a 5 by 4 matrix (\(X \in \mathbb{R}^{5 \times 4}\)) and \(K\) a 3 by 2 convolution filter (\(K \in \mathbb{R}^{3 \times 2}\)). Think of \(X\) as an image, and \(K\) as a filter that extracts features from the image. Let us define them as follows:\[X= \left( \begin{array}{rrrr} 1 & 2 & 1 & 2 \\ 1 & 1 & 0 & 1 \\ 0 & 2 & 1 & 2 \\ 1 & 2 & 2 & 0 \\ 1 & 0 & 1 & 1 \end{array}, \right), K = \left( \begin{array}{rr} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{array} \right). \tag{1} \]

\[F = X \otimes K = \left( \begin{array}{rrr} 4 & 5 & 5 \\ 5 & 4 & 5 \\ 5 & 5 & 4 \end{array} \right). \tag{2} \]

As discussed, we invite readers to consult more comprehensive tutorials to understand what happened in Eq.(2). For a glimpse of what went on, let us consider the upper left element in the resulting matrix, i.e., \(F_{1,1}\). It is obtained by superimposing the filter \(K\) onto the image \(X\). We align their top-left corners, perform an element-wise multiplication, and sum the results. Spelling out the element-wise multiplication leads to: \[X_{1:3, 1:2} \odot K = \left( \begin{array}{rrr} 1 & 2 \\ 1 & 1 \\ 0 & 2 \end{array} \right) \odot \left( \begin{array}{rrr} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{array} \right)= \left( \begin{array}{rrr} 1 & 2\\ 0 & 1 \\ 0 & 0 \end{array} \right). \tag{3} \]

In the above, \(X_{1:3,1:2}\) denotes the matrix obtained by extracting the first 3 rows and the first 2 columns of \(X\), and \(\odot\) represents the element-wise (or Hadamard) product between matrices. By summing the elements of the resulting matrix we obtain 4, which corresponds to \(F_{1,1}\) in Eq.(2), as we wanted to prove. To calculate \(F_{1,2}\), we simply slide \(K\) by one to the right, and repeat the operation.

The output size

Let us consider the generic case where \(X \in \mathbb{R}^{m \times n}\) and \(K \in \mathbb{R}^{p \times q}\). Typically, \(m\) and \(n\) are much larger than both \(p\) and \(q\). In image processing, this is equivalent to saying that we typically convolve large images with small filters. By continuing the iterative process described in the section above, i.e., by sliding the filter horizontally and vertically, it can be shown that \(F \in \mathbb{R}^{h \times k}\) where:

\[\begin{aligned} h &= m - p + 1, \\ k &= n - q + 1. \end{aligned}\tag{4}\]

There is an intuitive way to make sense of the above. Take Eq.(4) for example. It states that the number of columns of the convoluted image \(k\) is equal to the number of steps by which you have to slide the filter \(K\) to the right, starting from the first position, until you hit the right border of \(X\). Then you add 1 because of course you need to take into account the first element.

Additional parameters

There are a few more options on how to do convolutions. Specifically, you could decide to move the filter \(K\) by \(s\) steps at the time. In the former discussion, we assumed \(s=1\). This parameter is called stride, and it allows us to ’jump’ over columns, which is especially useful when the input does not change as much as we move from left to right.

One notices that the element \(X_{3,2}\) contributes 6 times to the final output. That is, when sliding the filter \(K\) over \(X\), the element \(X_{3,2}\) will be ‘picked up’ 6 times. This is in stark contrast with the element \(X_{1,1}\), which will be picked up only once. In other words, if we do not adjust anything in the above operations, elements at the border matter less than elements in the middle. We may not want that, especially when data at the edge contain important information.

To correct for this, we introduce padding, whereby we add rows and columns around the image before convolving with the filter. Padding by 1 means that we add one column to the left, one to the right, one row to the top, and one row to the bottom of the original image \(X\). In the more general case, padding by \(p\) transforms \(X\) into a matrix of size \(m+2p\) by \(n+2p\).

In the presence of striding and padding, Eq.(4) transforms into: \[k = \left\lfloor \frac{n + 2p - q}{s} \right\rfloor + 1. \tag{5} \]

In Eq.(5) the symbols \(\lfloor\) and \(\rfloor\) denote rounding to the lowest integer. This is because we typically do not allow the filter to cover an area outside the original image. In other words, we need to stop one step before that happens. By the way, this is purely a choice, but one that the image-processing community has commonly accepted. In the following, we will deal with the simpler case where \(s=1\) and \(p=0\), as in the original example discussed above.

Convolution as matrix multiplication

From a computational perspective, multiplications are expensive. Hence, any numerical approach that performs convolutions in a way that minimizes the number of multiplications needed is a winner. In this section, we will explore transforming convolutions into matrix multiplications, an operation greatly optimized for speed. In doing so, we will present three approaches. The first two won’t yield any benefit, but the last one will (at the margin).

Before we start, let us calculate how expensive the convolution operation, as defined earlier, really is. We now know that the final output is a matrix of dimensions \(m-p+1\) by \(n-q+1\). Calculating each element in that matrix involves performing a series of additions and multiplications. Since sums are not computationally expensive, we can neglect them in the calculation of the order of complexity. Once we know the number of multiplications needed to arrive at each element of the matrix \(F\), we simply multiply by \(m-p+1\) and \(n-q+1\) to arrive at the overall complexity.

Let us find out how many multiplications are needed by taking \(F_{1,1}\) as an example, which we already tackled in Eq.(3). The answer is that we need to perform \(p \cdot q\) multiplications. Putting it all together we have that the order of complexity, defined as the maximum number of multiplications involved is:

\[ \mathcal{O} \left[ \left( m-p+1 \right) \left( n-q+1 \right) \left(p q \right) \right] = \mathcal{O}\left(m n p q \right) \tag{6} \]

where we use the fact that, since typically \(m>p\) and \(n>q\), the order of magnitude for computational complexity can be conservatively estimated by neglecting the terms \(-p+1\) and \(-q+1\).

The basic idea behind transforming convolutions into matrix multiplications has to do with the fact that the element-wise multiplication of two vectors and the resulting sum (as outlined in Eq.(3)) ‘smells’ of a scalar product among vectors. Since matrix multiplications involve a series of scalar products (each row multiplied by each column), we can see how the idea came about.

Transforming convolutions into matrix multiplications thus requires re-arranging the image \(X\) or the filter \(K\) into matrices such that, when multiplied together, we obtain the convolution output \(F\).

To strike an optimal compromise between being abstract enough (to ensure we can generalize results) and simple enough (to facilitate mathematical manipulation), we will use the semi-generic image and filters as follows:

\[X= \left( \begin{array}{rrrr} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \\ x_{31} & x_{32} & x_{33} & x_{34} \\ x_{41} & x_{42} & x_{43} & x_{44} \\ x_{51} & x_{52} & x_{53} & x_{54} \end{array} \right), K = \left( \begin{array}{rr} k_{11} & k_{12} \\ k_{21} & k_{22} \\ k_{31} & k_{32} \end{array} \right). \tag{7} \]

Method 1 - unrolling filters

The idea behind this method consists in the unrolling of the filter \(K\) into a row vector as follows:

\[ \widetilde{K} = \left[k_{11}, \ldots , k_{1q}, k_{21}, \dots , k_{2q}, \ldots , k_{p1}, \ldots k_{pq} \right]. \tag{8} \]

In the above, we simply obtain \(\widetilde{K}\) by taking all rows of \(K\) and appending them, one at the time, to the first one. By denoting the \(i\)-th row of \(K\) as \(k_{i*}\), Eq.(8) can be rewritten as:

\[ \widetilde{K} = \left[k_{1\bullet}, k_{2\bullet}, \ldots , k_{p\bullet} \right]. \tag{9} \]

We now need to ask ourselves how to repackage \(X\) into an \(\widetilde{X}\) such that, when multiplying \(\widetilde{K} \cdot \widetilde{X}\), we obtain the convolution results. Since \(\widetilde{K}\) has been made a row vector, it’s natural to conclude that the first column of \(\widetilde{X}\) needs to be made of the elements of \(X\) that are ‘covered’ by the filter \(K\) during the first element-wise multiplication. Also, they need to be ordered such that the sequence matches the multiplication needed by the convolution. We thus have:

\[ \widetilde{X}_{\bullet 1} = \left[x_{11}, x_{12}, x_{21}, x_{22}, x_{31}, x_{32} \right]^T. \tag{10} \]

By iterating on the concept above we have:

\[\widetilde{X} = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \ldots & x_{33} \\ x_{12} & x_{13} & x_{14} & \ldots & x_{34} \\ x_{21} & x_{22} & x_{23} & \ldots & x_{43} \\ x_{22} & x_{23} & x_{24} & \ldots & x_{44} \\ x_{31} & x_{32} & x_{33} & \ldots & x_{53} \\ x_{32} & x_{33} & x_{34} & \ldots & x_{54} \end{pmatrix} \tag{11} \]

We note that the number of rows of \(\widetilde{X}\) corresponds to the number of elements in the filter \(K\), i.e., \(pq\), while the number of columns corresponds to the number of elements in the final convolution matrix, i.e., \((m-p+1)(n-q+1)\). Thus, \(\widetilde{X} \in \mathbb{R}^{pq \times (m-p+1)(n-q+1)}\).

We are now in a situation where the matrix multiplication \(\widetilde{K}\cdot\widetilde{X}\) (whose result we denote as \(\widetilde{F}\)) has the same elements of the \(F = X \otimes K\) convolution, which is exactly what we wanted. Note that we are not saying that the outputs are identical, because \(F\) and \(\widetilde{F}\) are sized differently. Since \(\widetilde{K} \in \mathbb{R}^{1 \times pq}\) and \(\widetilde{X} \in \mathbb{R}^{pq \times (m-p+1)(n-q+1)}\), \(\widetilde{F} \in \mathbb{R}^{1 \times (m-p+1)(n-q+1)}\). To recast \(\widetilde{F}\) into \(F\), we simply have to reshape the former into the latter by ‘chopping’ the row vector \(\widetilde{F}\) every \(n-q+1\) elements.

So far, so good. But have we saved computational time in recasting things via the above? As it turns out, not on the surface, but yes if you go deep. But to prove that, we first need to understand how many multiplications are involved in the act of multiplying two matrices.

Lemma 1. Let \(A \in \mathbb{R}^{m \times n}\), \(B \in \mathbb{R}^{n \times p}\) and \(C = A \cdot B\). The number of multiplications needed to arrive at \(C\) is \(\mathcal{O}(mnp)\).

Proof. Each element of \(C\) involves the dot product between a row vector of \(A\) and a column vector \(B\), both of which contain \(n\) elements. Doing so involves \(n\) multiplications. Since we are performing said operation \(mp\) times (which is the size of the output matrix), we derive that multiplying two matrices involves \(\mathcal{O}(mnp)\). ◻

We now use Lemma 1 to determine the order of complexity of our original problem. Since we are multiplying \(\widetilde{K} \in \mathbb{R}^{1 \times pq}\) by \(\widetilde{X} \in \mathbb{R}^{pq \times (m-p+1)(n-q+1)}\), the order of complexity is:

\[ \mathcal{O}\left(pq \left(m-p+1 \right)\left(n-q+1 \right) \right) = \mathcal{O}(mnpq). \tag{12} \]

As anticipated, we seem to have gained nothing.

In reality, recasting convolutions as we did allows leveraging efficient libraries to perform matrix multiplications. By doing so, computational speed can increase by a factor between 3\(\times\) and 4\(\times\). One of the problems of the approach we just tried is that the resulting matrices have no apparent structure. Also, they are ‘full’, in that there are no zero elements necessarily. Let’s try something else.

Method 2 - unrolling images

The next, most intuitive attempt would be to do the opposite of what we tried in the previous section. That is, we unroll the image and figure out how we must recast the filter. Let us unroll \(X\) into a long row vector, in a similar fashion to how we unrolled \(K\) earlier.

\[ \widetilde{X} = \left[x_{1\bullet}, x_{2\bullet}, \ldots , x_{m\bullet} \right]. \tag{13} \]

The next step here is to figure out how to recast the filter \(K\) into a series of column vectors such that we obtain the output of the original convolution. There is not much intuition or genius here needed, just patience. Perhaps the only thing worth noticing at this stage before we embark on the math, which represents a slight difference compared to the previous approach, is that we need to artificially insert zeros here and there, in order for not all elements of the image to be multiplied by numbers in the filter. After some tribulation we obtain:

\[\widetilde{K} = \left( \begin{array}{ccc|ccc|ccc} k_{11} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ k_{12} & k_{11} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & k_{12} & k_{11} & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & k_{12} & 0 & 0 & 0 & 0 & 0 & 0 \\ \hline k_{21} & 0 & 0 & k_{11} & 0 & 0 & 0 & 0 & 0 \\ k_{22} & k_{21} & 0 & k_{12} & k_{11} & 0 & 0 & 0 & 0 \\ 0 & k_{22} & k_{21} & 0 & k_{12} & k_{11} & 0 & 0 & 0 \\ 0 & 0 & k_{22} & 0 & 0 & k_{12} & 0 & 0 & 0 \\ \hline k_{31} & 0 & 0 & k_{21} & 0 & 0 & k_{11} & 0 & 0 \\ k_{32} & k_{31} & 0 & k_{22} & k_{21} & 0 & k_{12} & k_{11} & 0 \\ 0 & k_{32} & k_{31} & 0 & k_{22} & k_{21} & 0 & k_{12} & k_{11} \\ 0 & 0 & k_{32} & 0 & 0 & k_{22} & 0 & 0 & k_{12} \\ \hline 0 & 0 & 0 & k_{31} & 0 & 0 & k_{21} & 0 & 0 \\ 0 & 0 & 0 & k_{32} & k_{31} & 0 & k_{22} & k_{21} & 0 \\ 0 & 0 & 0 & 0 & k_{32} & k_{31} & 0 & k_{22} & k_{21} \\ 0 & 0 & 0 & 0 & 0 & k_{32} & 0 & 0 & k_{22} \\ \hline 0 & 0 & 0 & 0 & 0 & 0 & k_{31} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & k_{32} & k_{31} & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & k_{32} & k_{31} \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & k_{32} \\ \end{array} \right) \tag{14} \]

where we inserted vertical and horizontal lines to highlight the matrix’s structure. We can isolate the non-zero blocks of matrix \(\widetilde{K}\) in Eq.(14) in block matrices. Specifically, by defining:

\[\widetilde{K}_{1} = \left( \begin{array}{ccc} k_{11} & 0 & 0 \\ k_{12} & k_{11} & 0 \\ 0 & k_{12} & k_{11} \\ 0 & 0 & k_{12} \end{array} \right) \tag{15} \]

and \(\widetilde{K}_{2}\) and \(\widetilde{K}_{3}\) in similar ways, we can recast \(\widetilde{K}\) as:

\[\widetilde{K}= \left( \begin{array}{ccc} \widetilde{K}_{1} & 0 & 0 \\ \widetilde{K}_{2} & \widetilde{K}_{1} & 0 \\ \widetilde{K}_{3} & \widetilde{K}_{2} & \widetilde{K}_{1}\\ 0 & \widetilde{K}_{3} & \widetilde{K}_{2} \\ 0 & 0 & \widetilde{K}_{3} \\ \end{array} \right) \tag{16} \]

where \(0\) in the above indicates a \(4 \times 3\) matrix of zeros. Matrices \(\widetilde{K}_{i}\) for \(i=1,2,3\) are called Toeplitz matrices and enjoy special properties. At this juncture, we recommend readers to study Appendix A, where Toeplitz matrices and their close relatives, called circulant matrices, are properly introduced, along with their most significant properties and relationships. Since we’ll draw upon those properties in what follows, it is recommended that readers go there now.

Matrix \(\widetilde{K}\) is ‘only’ block Toeplitz, but not Toeplitz, which is not as good. But before we discuss this point in detail, let’s see if recasting our original convolution \(F = X \otimes K\) in terms of \(\widetilde{X}\cdot\widetilde{K}\), where \(\widetilde{X}\) and \(\widetilde{K}\) are as in Eqs (16) and (13) respectively, brought savings in terms of computational complexity. It’s easy to see that \(\widetilde{X}\) is a row vector with \(mn\) elements, since we simply re-organized the elements of \(X\) as a row vector.

To calculate the size of \(\widetilde{K}\), consider how we built the first four elements in the first column vector, i.e., \([k_{11}, k_{12},0,0]^T\). The first two elements correspond to the number of columns in the filter \(K\), i.e., \(q\) elements. We then padded with zeros such that only the first \(q\) elements of the first row in \(X\) yield non-zero results. Since all rows of \(X\) are made of \(n\) elements, it follows that we had to pad with \(n-q\) zeros. More simply, we have that the number of rows in \(\widetilde{K}_1\) is equal to \(n\). We now need to calculate how many blocks like \(\widetilde{K}_{1}\) we need, stacked upon each other. The answer is \(m\), because we sized \(\widetilde{K}_{1}\) based on the first row of \(X\), and there are \(m\) such rows. To understand why the last two blocks, when descending from \(\widetilde{K}_{1}\) downwards, are made of zeros, consider that we need a way to ‘block’ the values in the last two rows of \(X\) from entering the calculation of the first elements in the convolution.

To calculate how many columns are in \(\widetilde{K}\), consider how many ‘blocks’ (following the horizontal and vertical delimiters in Eq. (14) we have. The number we’re looking for is equal to the number of horizontal passes of the convolution, which is equal to \(m-p+1\). Each block has dimensions \(n-q+1\), which is equal to the number of horizontal passes of the filter. Hence, \(\widetilde{K}\) has \((m-p+1)(n-q+1)\) columns. Since \(\widetilde{X} \in \mathbb{R}^{1 \times mn}\) and \(\widetilde{K} \in \mathbb{R}^{mn \times (m-p+1)(n-q+1)}\), it follows that \(\widetilde{F} = \widetilde{X} \cdot \widetilde{K}\) is so that \(\widetilde{F} \in \mathbb{R}^{1 \times (m-p+1)(n-q+1)}\). Like we noted when discussing the previous convolution method, we need to chop \(\widetilde{F}\) every \(n-q+1\) elements and start a new row in order to arrive at an output that equals the output of the original convolution operation. But that is mere re-arranging, which does not consume much computational resources.

So how expensive is the new matrix multiplication? Looking at the dimensions of \(\widetilde{X}\) and \(\widetilde{K}\), the multiplication has an order of complexity that is given by:

\[ \mathcal{O} \left[ \left(mn \right) \left(m-p+1\right) \left(n-q+1\right) \right] = \mathcal{O}(m^2 n^2), \tag{17} \]

Since \(mn>pq\) Eq. (17) depicts a worse picture than Eq. (6). So it seems that we’ve made matters worse. In reality, if one counted the non-zero element of \(\widetilde{K}\), one of the \(mn\) factors would need to be substituted with \(pq\), which equals the order of complexity of our first method. So at least we have not made matters worse. Can we do better? Yes, we can. Shall we throw away the work done here? Not at all, since the answer on how to do better lies in the possibilities offered by the Toeplitz matrix structure emerging in the definition of \(\widetilde{K}_{i}\), something we shall discuss in the next section.

Method 3 - partial unrolling

This section is almost entirely drawn from the work of M. Gnacik and K. Lapa ³. This is the most complex section of all, in the sense that we will need to draw from other results in linear algebra to arrive at something of value. However, from a logical perspective, the line of argumentation is pretty simple and can be summarized in the following two steps:

First, we will take another pass at transforming the input \(X\) and the kernel \(K\) into new matrices \(\widetilde{X}\) and \(\widetilde{K}\); this time though, we will aim for \(\widetilde{K}\) to be Toeplitz (not just block Toeplitz, like we had before); like before, this allows us to transform a convolution into a matrix multiplication, i.e., \(F = X \otimes K = \widetilde{X} \cdot \widetilde{K}\)⁴
Second, we will extend \(\widetilde{K}\) into a circulant matrix \(C_{\widetilde{K}}\) by adding elements to it, and rearranging things around in the matrix multiplication s.t. we still obtain \(F\).

The idea behind the second step is that Toeplitz matrices can always be extended to circulant matrices, and that multiplications involving circulant matrices can be performed much faster thanks to the Fast Fourier Transform algorithm. As you see, we’re borrowing results from other disciplines like linear algebra (matrix multiplications involving Toeplitz and circulant matrices) and signal processing (Fourier transforms) to obtain something meaningful.

Since the objective here is to have at least one of the two matrices as Toeplitz, it’s natural to ask which one is best suited, \(X\) or \(K\). Looking at Eq. (14), it’s clear that our best candidate is \(K\). As we noted earlier, \(\widetilde{K}\) in Eq. (14) is almost Toeplitz. In fact, what prevents \(\widetilde{K}\) from being Toeplitz is the fact what we had to add zero-block matrices to “neutralize” the rows of \(X\) that are not involved when moving \(K\) through \(X\). And that is because we unrolled the entire \(X\) into a single row vector, which was perhaps excessive.

Armed with this insight, let us build each row of \(\widetilde{X}\) such that only the rows involved when sliding a filter horizontally are captured. Consider for example the first row of \(F\). It is made of elements obtained by performing element-wise matrix multiplications between the first \(p\) rows of \(X\) and the filter \(K\). That is, let us just unwrap the first \(p\) row vectors of \(X\) into the first row of \(\widetilde{X}\). Repeating similar arguments to discover the other rows of \(\widetilde{X}\) we obtain:

\[ \widetilde{X} = \left( \begin{array}{ccc} x_{1\bullet} & x_{2 \bullet} & x_{3\bullet} \\ x_{2\bullet} & x_{3 \bullet} & x_{4\bullet} \\ x_{3\bullet} & x_{4 \bullet} & x_{5\bullet} \\ \end{array} \right), \tag{18} \]

We can now rearrange the elements of \(K\) into \(\widetilde{K}\) in a way that \(\widetilde{X} \cdot \widetilde{K}\) yields the desired result. We obtain:

\[ \widetilde{K} = \left( \begin{array}{ccc} k_{11} & 0 & 0 \\ k_{12} & k_{11} & 0 \\ 0 & k_{12} & k_{11} \\ 0 & 0 & k_{12} \\ \hline k_{21} & 0 & 0 \\ k_{22} & k_{21} & 0 \\ 0 & k_{22} & k_{21} \\ 0 & 0 & k_{22} \\ \hline k_{31} & 0 & 0 \\ k_{32} & k_{31} & 0 \\ 0 & k_{32} & k_{31} \\ 0 & 0 & k_{32} \\ \end{array} \right), \tag{19} \]

The transformations performed in Eqs (18) and (19) yielded no advantages from a computational efficiency perspective. While this may be expected, let’s prove it. We have that \(\widetilde{X} \in \mathbb{R}^{(m-p+1) \times (pn)}\) while \(\widetilde{K} \in \mathbb{R}^{pn \times (n-q+1)}\), and thus the order of complexity is:

\[ \mathcal{O} \left[ \left( m-p+1 \right) \left( n p \right) \left(n-q+1 \right) \right] = \mathcal{O}\left(m n^2 q \right), \tag{20} \]

Compared to Eq. (6) we seem to have made matters worse. In reality, when counting the number of non-zero elements in \(\widetilde{K}\) (thus substituting one of the \(n\) with \(q\)) we obtain we obtain \(\mathcal{O}(mnpq)\), which is the same as before.

We note that \(\widetilde{K}\) in Eq. (19) is Toeplitz. In fact, it is a special case of a Toeplitz matrix, one that is solely defined by its first column vector. Following the notation introduced in Appendix A, we have that \(t_{-1}=t_{-2}=\ldots=t_{-(n-1)}=0\).

In Appendix A we showed how a Toeplitz matrix can be extended and made part of a broader circulant matrix. We did so because multiplications involving circulant matrices can be performed much faster thanks to the Fast Fourier Transform algorithm. We now need to transform the original multiplication \(F = \widetilde{X} \cdot \widetilde{K}\) (where \(\widetilde{K}\) is Toeplitz) into an equivalent multiplication where we use its extended circulant matrix. We perform the operations, step by step, in Appendix B for a generic Toeplitz matrix. In what follows, we adapt the notation used in Appendix B to the case at hand involving \(\widetilde{X}\) and \(\widetilde{K}\), and their respective dimensions.

In Appendix B we dealt with the case where a Toeplitz matrix \(T\) multiplies a generic vector \(v\) and showed that, by extending the matrix \(T\) into its circulant cousin, we can recast the multiplication in a way that’s more operationally efficient. To use these results in our context, we have to calculate \(F^T = \widetilde{K}^T \cdot \widetilde{X}^T\), so that \(\widetilde{K}\) stays to the left of the multiplication sign.

By using similar notation to the one in Appendix B, we can recast \(F^T\) as follows:

\[ F^T = \widetilde{K}^T \cdot \widetilde{X}^T = S_{\widetilde{K}^T} \cdot C_{\widetilde{K}^T} \cdot \begin{pmatrix} \widetilde{X}^T \\ 0_{n-q,1} \end{pmatrix}, \tag{22} \]

where \(C_{\widetilde{K}^T}\) is the circulant cousin of \(\widetilde{K}^T\), \(S_{\widetilde{K}^T}\) is the selection matrix. It can be seen that \(S_{\widetilde{K}^T} \in \mathbb{R}^{(n-q+1) \times (n-q+pn)}\), \(C_{\widetilde{K}^T} \in \mathbb{R}^{(n-q+pn) \times (n-q+pn)}\), and \(0_{n-q,1}\) is an \(n-q\) column vector of zeros.

The order of complexity implied by Eq. (22) that is associated with the Fourier matrix is:

\[ \mathcal{O}\left[ \left(n-q+np \right) \log \left(n-q+np \right) \right]=\mathcal{O}\left[ \left(n+np \right) \log \left(n+np \right) \right], \tag{23} \]

where in the last step we omitted \(-q\) to stay on the conservative side. By using the fact that \(\mathcal{O}(p+1)=\mathcal{O}(p)\), we can further simplify Eq. (23) into:

\[ \mathcal{O}\left[np \log(np) \right]. \tag{24} \]

The final step involves the multiplication by the extended \(\widetilde{X}^T\) matrix (last factor in Eq. (22)), which adds \(m-p+1\) as a factor to the original complexity. Thus, we have a final complexity of:

\[ \mathcal{O}\left[ (np)(m-p+1) \log(np) \right] = \mathcal{O}\left[ (mnp) \log(np) \right]. \tag{25} \]

If we only used the non-zero elements of \(\widetilde{K}\) to determine the final complexity, we would use \(\log(pq)\) instead of \(\log(np)\) in Eq.(25). Then, to ease the comparison with the original complexity spelled at the beginning, which was:

\[ \mathcal{O}(mnpq), \tag{26} \]

we could assume \(p=q\) and rewrite:

\[ \mathcal{O}\left[ mnp \log(q^2) \right] = \mathcal{O}\left[ 2mnp \log(q) \right] = \mathcal{O}\left[ mnp \log(q) \right]. \tag{27} \]

To conclude, we abated complexity from \(\mathcal{O}(mnpq)\) to \(\mathcal{O}\left[ mnp \log(q) \right]\), which is not much. Further aggravating the problem is the fact that \(p\) and \(q\) - at least for applications in image processing - are typically much lower than \(m\) and \(n\). Hence, the gap between \(q\) and \(\log(q)\) is not very significant.

Conclusion

In this paper we took a closer look at the convolution operation, a key element in AI architectures, and attempted to simplify computational complexity by borrowing results from linear algebra and signal processing. While the simplification obtained is minimal, at least for the size of convolution filters adopted in signal processing, the findings would simplify matters notably for large convolution filters.

Perhaps more importantly, this paper demonstrates how research in asset management at times shifts its attention from classical macro-economics to matters which have less to do with classical finance, but are no less important.

Appendix A: Toeplitz and circulant matrices

In this Appendix we shall provide a formal definition of what a Toeplitz matrix is. Also, we will introduce their close relatives called circulant matrices. We will show that a circulant matrix is a special case of a Toeplitz matrix and that you can always extend a Toeplitz matrix into a circulant matrix. Let’s begin by defining what a Toeplitz matrix is.

Definition 1. A generic matrix \(T \in \mathbb{R}^{m \times n}\) is said to be Toeplitz if it possesses the following structure:

\[ T = \left( \begin{array}{cccccc} t_0 & t_{1} & t_{2} & \cdots & t_{(n-2)} & t_{(n-1)} \\ t_{-1} & t_0 & t_{1} & \cdots & t_{(n-3)} & t_{(n-2)} \\ t_{-2} & t_{-1} & t_0 & \cdots & t_{(n-4)} & t_{(n-3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ t_{-(m-2)} & t_{-(m-3)} & t_{-(m-4)} & \cdots & t_{n-m} & t_{n-m+1} \\ t_{-(m-1)} & t_{-(m-2)} & t_{-(m-3)} & \cdots & t_{n-m-1} & t_{n-m} \end{array} \right) \tag{28} \]

As one can see, the matrix is completely defined by its first row and column vector. The remaining entries in the matrix are obtained by ‘dragging’ the elements of the first column and the first row along the diagonals. To be absolutely picky, the formulation “defined by its first row and its first column” is sloppy, since the first row and the first column of any matrix have the element (1,1) in common. However, we’ll let it pass, so long as it’s clear that the \((m \times n)\) matrix is defined by the vector of \((m+n-1)\) elements:

\[ t = \left[t_{-(m-1)}, t_{-(m-2)}, \ldots, t_{-1}, t_0, t_1, \ldots, t_{n-1} \right]. \tag{29} \]

Circulant matrices are entirely defined by their first column vector. Let’s properly introduce them by the definition below.

Definition 2. A generic (square) matrix \(C \in \mathbb{R}^{r \times r}\) is said to be circulant if it possesses the following structure:

\[ C = \left( \begin{array}{cccccc} c_0 & c_{r-1} & c_{r-2} & \cdots & c_2 & c_1 \\ c_1 & c_0 & c_{r-1} & \cdots & c_3 & c_2 \\ c_2 & c_1 & c_0 & \cdots & c_4 & c_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ c_{r-2} & c_{r-3} & c_{r-4} & \cdots & c_0 & c_{r-1} \\ c_{r-1} & c_{r-2} & c_{r-3} & \cdots & c_1 & c_0 \end{array} \right). \tag{30} \]

Starting from the first column, one can see that the second column is built by ‘shifting’ the first column down by one. The last element, which would be ‘pushed out’ by the shift, is appended at the top of the second column. The process is repeated \(r-1\) times. A circulant matrix is a special case of a Toeplitz matrix.

We now show how a Toeplitz matrix can always be extended into a circulant matrix by padding elements or the original Toeplitz matrix above it and to its right. Before we formalize a statement for the generic case, let us look at a concrete example. For that purpose, let us use a Toeplitz matrix \(T \in \mathbb{R}^{5,3}\) defined as follows:

\[ T = \left( \begin{array}{ccc} t_0 & t_1 & t_2 \\ t_{-1} & t_0 & t_1 \\ t_{-2} & t_{-1} & t_0 \\ t_{-3} & t_{-2} & t_{-1} \\ t_{-4} & t_{-3} & t_{-2} \end{array} \right). \tag{31} \]

The second column of \(T\) is almost the first one shifted downwards, except that the elements (1,2) and (1,3) (\(t_{1}\) and \(t_{2}\) respectively) do not correspond to the elements (5,1) and (4,1) (i.e., \(t_{3}\) and \(t_{4}\) respectively). Thus, we need to make them ‘come from above’ by padding \(T\) upwards as in the following:

\[ T_p = \begin{bmatrix} \textcolor{gray}{t_2} & \bigcirc & \bigcirc \\ \textcolor{lightgray}{t_1} & \textcolor{gray}{t_2} & \bigcirc \\ \hdashline t_0 & \textcolor{lightgray}{t_1} & \textcolor{gray}{t_2} \\ t_{-1} & t_0 & \textcolor{lightgray}{t_1} \\ t_{-2} & t_{-1} & t_0 \\ t_{-3} & t_{-2} & t_{-1} \\ t_{-4} & t_{-3} & t_{-2} \end{bmatrix}. \tag{32} \]

where we grey-shaded the padded elements and denoted as \(\bigcirc\) the positions to still figure out. For the latter, we can simply shift the first column downwards, and attach the element that was eliminated by the shift on top of the second vector. By performing the process we obtain:

\[ T_p = \begin{bmatrix} \textcolor{gray}{t_2} & \textcolor{darkgray}{t_{-4}} & \textcolor{dimgrey}{t_{-3}} \\ \textcolor{lightgray}{t_1} & \textcolor{gray}{t_2} & \textcolor{darkgray}{t_{-4}} \\ \hdashline t_0 & \textcolor{lightgray}{t_1} & \textcolor{gray}{t_2} \\ t_{-1} & t_0 & \textcolor{lightgray}{t_1} \\ t_{-2} & t_{-1} & t_0 \\ t_{-3} & t_{-2} & t_{-1} \\ t_{-4} & t_{-3} & t_{-2} \end{bmatrix}. \tag{33} \]

To arrive at a circulant matrix, we can now keep shifting the last column vector downwards by one until we obtain 7 columns, i.e., one step before we obtain the first column again. We obtain:

\[ C_T = \begin{bmatrix} t_{2} & t_{-4} & t_{-3} & t_{-2} & t_{-1} & t_{0} & t_{1} \\ t_{1} & t_{2} & t_{-4} & t_{-3} & t_{-2} & t_{-1} & t_{0} \\ \hdashline t_{0} & t_{1} & t_{2} & t_{-4} & t_{-3} & t_{-2} & t_{-1} \\ t_{-1} & t_{0} & t_{1} & t_{2} & t_{-4} & t_{-3} & t_{-2} \\ t_{-2} & t_{-1} & t_{0} & t_{1} & t_{2} & t_{-4} & t_{-3} \\ t_{-3} & t_{-2} & t_{-1} & t_{0} & t_{1} & t_{2} & t_{-4} \\ t_{-4} & t_{-3} & t_{-2} & t_{-1} & t_{0} & t_{1} & t_{2} \end{bmatrix}. \tag{34} \]

We added the subscript \(T\) to the circulant matrix \(C\) obtained in Eq. (34) to specify that it has been obtained by padding the original Toeplitz matrix \(T\). We now rewrite Eq. (32) in the block-matrix form:

\[ T_p = \left( \begin{array}{c} \widetilde{T} \\ \hline T \end{array} \right) \tag{35} \]

In preparation for the generalized statement of what we’ve been doing up until now.

Proposition 1. A generic Toeplitz matrix \(T \in \mathbb{R}^{m \times n}\) (as defined like in Eq. (28)) can always be transformed into a circulant matrix \(C_T \in \mathbb{R}^{(m+n-1) \times (m+n-1)}\) with structure:

\[ C_T = \left( \begin{array}{ccc|ccc} t_{n-1} & \ldots & t_{-m+n-1} & t_{n-m} & \ldots & t_{n-2} \\ t_{n-2} & \ldots & t_{-m+n-2} & t_{n-m-1} & \ldots & t_{n-3} \\ t_{n-3} & \ldots & t_{-m+n-3} & t_{n-m-2} & \ldots & t_{n-4} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ t_{1} & \ldots & t_{-m+1} & t_{-m+2} & \ldots & t_{0} \\ \hline t_{0} & \ldots & t_{n-1} & t_{-m+1} & \ldots & t_{-1} \\ t_{-1} & \ldots & t_{n-2} & t_{n-1} & \ldots & t_{-2} \\ t_{-2} & \ldots & t_{n-3} & t_{n-2} & \ldots & t_{-3} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ t_{-m+1} & \ldots & t_{n-m} & t_{n-m+1} & \ldots & t_{n-1} \end{array} \right). \tag{36} \]

The proof follows from the construction explained by manipulating the illustrative Toeplitz matrix in Eq. (31). Note that \(C_T\) in Eq. (36) can be recast in terms of its block-matrix components as:

\[ C_T = \left( \begin{array}{c|c} \widetilde{T} & T^* \\ \hline T & T^{\dagger} \\ \end{array} \right) \tag{37} \]

where \(T\) and \(\widetilde{T}\) are defined as illustrated in Eq. (35), and \(T^*\) and \(T^{\dagger}\) are defined as in Eq. (36).

Note that \(T^*\) is obtained by removing the last row and the last column of \(T\), while \(T^{\dagger}\) is obtained by removing the first column of \(\widetilde{T}\) and adding the first row of \(T\) at its bottom (excluding \(t_0\)).

Whether one is using the arguments made just now, or simply inspects the construction in Eq. (36), it is easy to see that \(\widetilde{T} \in \mathbb{R}^{(n-1) \times n}\), \(T^* \in \mathbb{R}^{(n-1) \times (m-1)}\) and \(T^{\dagger} \in \mathbb{R}^{m \times (m-1)}\). We see again that \(C_T \in \mathbb{R}^{(m+n-1) \times (m+n-1)}\).

After having gained familiarity with how Toeplitz matrices are extended into their circulant cousin, let us introduce a shortcut that would achieve everything we’ve discussed in this Appendix, just faster. To obtain the circulant cousin \(C_T\) of a given matrix \(T\) as defined in Eq. (28), consider rotating the first row of \(T\) counterclockwise around \(t_0\) by 90 degrees. It then suffices to shift the column vector thus obtained \(c = \left[ t_{n-1}, t_{n-2}, \ldots, t_{0}, t_{-1}, \ldots, t_{-(m-1)} \right]^T\) downwards (always one step at the time) \(m+n-2\) times, always adding the shifted vector to the right of the matrix \(T\). We often denote \(c\) as the circulant vector.

Appendix B: Circulant matrix multiplications

Consider the matrix-vector multiplication \(T \cdot v\) where \(T \in \mathbb{R}^{m \times n}\) and \(v \in \mathbb{R}^{n \times 1}\). Since we want to express the multiplication in terms of the extended circulant matrix \(C_T\) as in Appendix A, we need to recast \(v\) into a certain \(\hat{v}\) and find a matrix \(S\) such that the following holds:

\[ T \cdot v = S \cdot C_T \cdot \hat{v}. \tag{38} \]

Let us recall that \(C_T\) is a block matrix like in Eq. (37). Looking at the \(C_T \cdot \hat{v}\) part of Eq. (38), we can see that we need to append \(v\) with as many zeros as needed to neutralize the impact of the right side of \(C_T\), the one right of the vertical lines separating the blocks. Since the number of columns right of the vertical line is \(m-1\), we have that \(\hat{v} = \left[ v^T, 0_{1 \times m-1} \right]^T\). We can now see that:

\[ C_T \cdot \hat{v} = \left( \begin{array}{c|c} \widetilde{T} & T^* \\ \hline T & T^{\dagger} \\ \end{array} \right) \cdot \left( \begin{array}{c} v \\ 0_{m-1 \times 1} \end{array} \right) = \left( \begin{array}{c} \widetilde{T} \cdot v \\ \hline T \cdot v \\ \end{array} \right) \tag{39} \]

The last step is to select \(S\) such that, when multiplying by \(C_T \cdot \hat{v}\) as in Eq. (39), we obtain \(T \cdot v\). From Eq. (39) we see that \(C_T \cdot \hat{v} \in \mathbb{R}^{(m+n-1) \times 1}\). We thus know that \(S \in \mathbb{R}^{m \times (m+n-1)}\). The first \(m\) columns of \(S\) must be made of zeros, in order to neutralize the upper part of \(C_T \cdot \hat{v}\), that is \(\widetilde{T} \cdot v\). The remaining \(n-1\) columns of \(S\) must ‘pick up’ \(T \cdot v\). We thus have:

\[ S = \left( \begin{array}{c|c} 0_{m \times m} & S^d_{m \times (n-1)} \end{array} \right) \tag{40} \]

where \(S^d_{m \times (n-1)}\) is a matrix of zeros except for the main diagonal, which is made of ones.

We went through the pain of Eq. (38) because the spectral decomposition of circulant matrices involves Fourier matrices. That is, if \(C_T\) is circulant, we have:

\[ C_T = \mathcal{F}^{-1} \cdot \Lambda \cdot \mathcal{F} \tag{41} \]

where \(\mathcal{F}\) is a Fourier matrix, and \(\Lambda\) contains the eigenvalues of \(C_T\). By substituting Eq. (41) into Eq. (38) we have:

\[ T \cdot v = S \cdot \mathcal{F}^{-1} \cdot \Lambda \cdot \mathcal{F} \cdot \hat{v}. \tag{42} \]

The left side of Eq. (42) is \(\mathcal{O}(mn)\). The right side involves multiplications of Fourier matrices of size \(m+n-1\). That is, we have an order of complexity that is given by:

\[ \mathcal{O}\left[\left(m+n-1\right) \log(m+n-1)\right]=\mathcal{O}\left[\left(m+n\right) \log(m+n)\right]. \tag{43} \]

Let us see if having recast our original multiplication \(T \cdot v\) via Eq. (38) resulted in lower operational complexity. We know that \(T \cdot v\) is \(\mathcal{O}(mn)\). It’s easy to see that:

\[ \mathcal{O}(mn) > \mathcal{O}\left[\left(m+n \right) \log(m+n)\right]. \tag{44} \]

To see how Eq. (44) holds, consider the simpler case where \(m=n\). We have that \(\mathcal{O}(n^2) > \mathcal{O}\left[ 2n \log(2n)\right]=\mathcal{O}\left[ n \log(n)\right]\), which is true. We have shifted \(n\) with \(\log(n)\), thus decreasing the order of complexity.