Matrix Multiplication

In high school, I was introduced to these operations on matrices:

  • Add matrices.
  • Multiply matrices.
  • Compute the determinant of a matrix.
  • Invert a matrix.
  • Solve a system of linear equations with Cramer's rule.

There were a lot of mysteries about the why and the how that simply never got explained. Years passed and I entered college, yet those mysteries remained unexplained despite the frequent use of those operations. Here's a list of some mysteries that occupied my mind for quite some time:

  • Why is matrix multiplication defined that way?
  • Why is matrix multiplication associative? (This means $A(BC) = (AB)C$.)
  • Why is the determinant defined that way?
  • Why does Cramer's rule work?
  • Why is the matrix of cofactors defined that way?

I eventually learned the answers, but I wish someone had told me sooner, because a single simple explanation suffices to answer all of those questions. (Some of you might disagree with the use of "simple" when you see the length of this post, but well... 😆)

I did a bit of informal data collection by asking my friends and my students (back when I was a TA) about these questions. Surprisingly, many of them could still perform those operations but could not answer those questions. Some of them were even familiar with eigenvalues and eigenvectors!

It is most definitely not the students' fault. It looks more like math education is flawed, at least when it comes to this topic: too little is explained in an introductory class, yet you're expected to already know it well by the time you reach a more advanced class.

I'm writing this post as an effort to help rectify that. I'm only going to address the first two questions in this post, though, because otherwise it would be too long. (It's pretty long already.) I'll save the other questions for later posts.

Why is matrix multiplication defined the way it is?

By viewing a matrix as the representation of a linear transformation with respect to a chosen ordered basis, this question can be answered almost immediately. For those who aren't so familiar with this concept, let me explain a bit more.

Suppose $X$ and $Y$ are vector spaces with chosen ordered bases $(x_1, \ldots, x_\ell)$ and $(y_1, \ldots, y_m)$, respectively. If you prefer to think of a vector as a list of numbers, you can go ahead and assume that

\begin{align*} x_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}\quad x_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}\quad x_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}\quad \ldots \end{align*}

(In fact, there is no loss of generality at all in assuming this.)

Any vector $\xi \in X$ can be represented uniquely by coordinates $\xi^i$ with respect to the chosen basis of $X$, so that

\begin{equation*} \xi = \sum_{i=1}^\ell \xi^i x_i = \begin{pmatrix} \xi^1 \\ \xi^2 \\ \vdots \\ \xi^\ell \end{pmatrix}_X. \end{equation*}

(Note: the superscript is NOT raising to a power. It's just an index.)

That means the list of numbers $(\xi^1, \xi^2, \ldots, \xi^\ell)$ carries the same information as the original vector $\xi$. Sometimes I might even write $\xi = (\xi^1, \xi^2, \ldots, \xi^\ell)$, but I'll keep the explicit subscript $_X$ in this post to emphasize that the coordinates were defined based on the chosen ordered basis of $X$. Even though this notation doesn't mention each $x_i$ explicitly, it should be understood that the $x_i$'s are there and they must not change. (It's like they are attached to $X$.) We'll adopt the same notation for $Y$ and its ordered basis $(y_1, y_2, \ldots, y_m)$.
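(If you like to see this numerically, here is a quick numpy sketch with a made-up non-standard basis of $\mathbb R^3$. The names B, xi and coords are mine, not part of the discussion above; the only point is that the coordinates $\xi^i$ are the unique solution of a small linear system built from the basis vectors.)

```python
import numpy as np

# A made-up (non-standard) ordered basis of R^3: the columns of B are
# x_1, x_2, x_3. Any invertible B would do; this one is just for illustration.
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
xi = np.array([2.0, 3.0, 5.0])

# The coordinates xi^i are the unique solution of B @ coords = xi.
coords = np.linalg.solve(B, xi)

# Sanity check: xi is recovered as the linear combination sum_i coords[i] * x_i.
assert np.allclose(B @ coords, xi)
print(coords)   # [ 4. -2.  5.]
```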

Next, suppose $T: X \to Y$ is a linear transformation. By linearity,

\begin{gather} T(\xi) = T\left(\sum_{i=1}^\ell \xi^i x_i\right) = \sum_{i=1}^\ell \xi^i T(x_i). \label{eq:linearity} \end{gather}

This means $T(\xi)$, for any $\xi \in X$, can be computed if we know $T(x_i)$ for each $i \in \{1, \ldots, \ell\}$. In other words, we can represent $T$ uniquely by a list of vectors in $Y$: $(T(x_1), T(x_2), \ldots, T(x_\ell))$.

We can go a bit further: each vector in $Y$ can be represented by a list of coordinates based on $(y_1, \ldots, y_m)$. So if in writing $(T(x_1), T(x_2), \ldots, T(x_\ell))$, we replace each $T(x_i)$ with a column vector of its coordinates in $Y$, we get a 2-dimensional array, i.e., a matrix that looks like this:

\begin{align*} T & = \begin{pmatrix} \mid & \mid & & \mid \\ T(x_1) & T(x_2) & \cdots & T(x_\ell) \\ \mid & \mid & & \mid \end{pmatrix}^X \\ & = \begin{pmatrix} T_1^1 & T_2^1 & \cdots & T_{\ell}^1 \\ T_1^2 & T_2^2 & \cdots & T_{\ell}^2 \\ \vdots & \vdots & \ddots & \vdots \\ T_1^m & T_2^m & \cdots & T_{\ell}^m \end{pmatrix}_Y^X \end{align*}

The subscript $_Y$ reminds us that the columns are coordinates in $Y$, and the superscript $^X$ reminds us that the $i$-th column interacts with the $i$-th coordinate of $X$. The $T^j_i$'s are the components of the matrix representation of $T$. They are defined as the numbers that make the following hold:

\begin{align} \text{For $i \in \{1, \ldots, \ell\}$}, T(x_i) & = \sum_{j=1}^m T^j_i y_j. \label{eq:matrix_comps} \end{align}

To recap, we can find the matrix of $T$ by evaluating each $T(x_i)$ and writing out its components based on the ordered basis $(y_1, \ldots, y_m)$.
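If you like to see the recipe as code, here is a minimal numpy sketch. The particular map T and the dimensions are made up for illustration; the point is only that the matrix is assembled column by column from the values $T(x_i)$.

```python
import numpy as np

ell = 3  # dim X = 3; the made-up map below lands in R^2, so m = 2

def T(v):
    # A made-up linear map R^3 -> R^2, defined without writing any matrix.
    return np.array([v[0] + 2.0 * v[1],
                     3.0 * v[1] - v[2]])

# The i-th column of the matrix is (the coordinates of) T(x_i),
# where x_1, ..., x_ell is the standard basis of R^3.
M = np.column_stack([T(np.eye(ell)[:, i]) for i in range(ell)])
print(M)
# [[ 1.  2.  0.]
#  [ 0.  3. -1.]]
```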

Matrix-vector multiplication

So far we have not talked about how to manipulate the coordinate-based representation. Let us start by going back to the calculation of $T(\xi)$ (continuing from \eqref{eq:linearity}):

\begin{align*} T(\xi) & = T\left(\sum_{i=1}^\ell \xi^i x_i\right) \\ & = \sum_{i=1}^\ell \xi^i T(x_i) & & \text{; by linearity}\\ & = \sum_{i=1}^\ell \xi^i \sum_{j=1}^m T_i^j y_j & & \text{; from \eqref{eq:matrix_comps}}\\ & = \sum_{j=1}^m \left(\sum_{i=1}^\ell \xi^i T_i^j\right) y_j \\ & = \begin{pmatrix} \sum_{i=1}^\ell \xi^i T_i^1 \\ \sum_{i=1}^\ell \xi^i T_i^2 \\ \vdots \\ \sum_{i=1}^\ell \xi^i T_i^m \end{pmatrix}_Y \end{align*} In other words, the $j$-th coordinate of $T(\xi)$ is \begin{gather*} (T(\xi))^j = \sum_{i=1}^\ell \xi^i T_i^j. \nonumber \end{gather*}

Let me write out the expanded notation to facilitate visualization:

\begin{align*} T(\xi) & = \begin{pmatrix} T_1^1 & T_2^1 & \cdots & T_{\ell}^1 \\ T_1^2 & T_2^2 & \cdots & T_{\ell}^2 \\ \vdots & \vdots & \ddots & \vdots \\ T_1^m & T_2^m & \cdots & T_{\ell}^m \end{pmatrix}_Y^X \begin{pmatrix} \xi^1 \\ \xi^2 \\ \vdots \\ \xi^\ell \end{pmatrix}_X \\ & = \begin{pmatrix} T_1^1 \xi^1 + T_2^1 \xi^2 + \ldots + T_{\ell}^1 \xi^\ell \\ T_1^2 \xi^1 + T_2^2 \xi^2 + \ldots + T_{\ell}^2 \xi^\ell \\ \vdots \\ T_1^m \xi^1 + T_2^m \xi^2 + \ldots + T_{\ell}^m \xi^\ell \end{pmatrix}_Y \end{align*}

So here it is, the derivation of the matrix-vector multiplication formula. It is the way to apply a linear transformation to a vector when both are given in terms of coordinates, yielding a result also in terms of coordinates. The notation we have introduced makes it look as if the subscript $_X$ and the superscript $^X$ cancel out. This is not magic, as we deliberately defined it that way, but it's kind of neat that we can. (Those familiar with Einstein notation might find this boring. Well, why are you even reading this? 😛)
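To double-check the formula numerically, here is a tiny numpy sketch, reusing the same made-up T as in the earlier snippet: applying the assembled matrix to the coordinates of $\xi$ reproduces $T(\xi)$.

```python
import numpy as np

def T(v):
    # The same made-up linear map R^3 -> R^2 as in the earlier sketch.
    return np.array([v[0] + 2.0 * v[1],
                     3.0 * v[1] - v[2]])

# Matrix of T, built column by column from the standard basis vectors.
M = np.column_stack([T(np.eye(3)[:, i]) for i in range(3)])

xi = np.array([4.0, -1.0, 2.0])      # coordinates of some vector in X
assert np.allclose(M @ xi, T(xi))    # matrix-vector product reproduces T(xi)
```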

It is worth noting that the matrix representation of $T$ carries the same information as $T$ itself, as we just showed that $T(\xi)$ can be computed solely from the components $T^j_i$. This fact will be needed when we try to show that matrix multiplication is associative.

How do we extend to matrix-matrix multiplication? What is it supposed to mean anyway? When do we get 2 matrices that can be multiplied?

Matrix-matrix multiplication

Well, 2 matrices mean 2 linear functions, right? The most basic algebraic operation between 2 functions that can be carried out is probably composition, and that is exactly what corresponds to matrix multiplication.

Recall that if $T: X \to Y$ and $S: Y \to Z$ are two functions, then the function composition $ST$ (or sometimes written $S \circ T$) is defined by

\begin{gather} (ST)(x) = S(T(x)) \quad \text{ for $x \in X$}. \label{eq:composition} \end{gather}

This definition is generic, as in $S$ and $T$ do not have to be linear, nor do $X$, $Y$ and $Z$ have to be vector spaces.

But if $S$ and $T$ are linear functions, then $ST$ will also be linear, and we already know that $ST: X \to Z$ can be represented by a matrix of the form $(\ldots)^X_Z$. (Let's assume that $(z_1, \ldots, z_n)$ is a fixed ordered basis for $Z$.) So what would be the components $(ST)^k_i$ of $ST$? Well, we simply calculate $(ST)(x_i)$ for each $i$, put the results together as columns, and we should be good. Let's do it:

\begin{align*} (ST)(x_i) & = S(T(x_i)) & & \text{; from \eqref{eq:composition}}\\ & = S\left(\sum_{j=1}^m T^j_i y_j\right) & & \text{; by \eqref{eq:matrix_comps}}\\ & = \sum_{j=1}^m T^j_i S(y_j) & & \text{; by linearity} \\ & = \sum_{j=1}^m T^j_i \sum_{k=1}^n S^k_j z_k & & \text{; by \eqref{eq:matrix_comps}}\\ & = \sum_{k=1}^n \left(\sum_{j=1}^m S^k_j T^j_i \right) z_k \\ \therefore (ST)^k_i & = \sum_{j=1}^m S^k_j T^j_i & & \text{; by \eqref{eq:matrix_comps}} \end{align*}

You can check pretty easily that this is exactly the usual matrix-matrix multiplication formula. (Recall that in our notation, the subscript is the column index and the superscript is the row index.) We derived it from a more basic intuition: composition of linear maps. Here's a visualization aid (fix $i = 2$ and $k = 4$):

\begin{gather*} \begin{pmatrix} * & * & * & * \\ * & * & * & * \\ * & * & * & * \\ S^4_1 & S^4_2 & S^4_3 & S^4_4 \\ * & * & * & * \end{pmatrix}^Y_Z \begin{pmatrix} * & T^1_2 & * \\ * & T^2_2 & * \\ * & T^3_2 & * \\ * & T^4_2 & * \\ \end{pmatrix}^X_Y \\ = \begin{pmatrix} * & * & * \\ * & * & * \\ * & * & * \\ * & (ST)^4_2 & * \\ * & * & * \end{pmatrix}^X_Z\\ \end{gather*} \begin{gather*} (ST)^4_2 = S^4_1 T^1_2 + S^4_2 T^2_2 + S^4_3 T^3_2 + S^4_4 T^4_2 \end{gather*}
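Here is a quick numerical sanity check of that derivation, with randomly generated (made-up) matrices standing in for $S$ and $T$: the matrix of the composition $S \circ T$, built column by column, coincides with the matrix product of the two individual matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, m, n = 3, 4, 5                   # dim X, dim Y, dim Z (arbitrary)
MT = rng.standard_normal((m, ell))    # matrix of some T : X -> Y
MS = rng.standard_normal((n, m))      # matrix of some S : Y -> Z

def T(v):
    return MT @ v

def S(v):
    return MS @ v

# Matrix of the composition S∘T, built column by column from (S∘T)(x_i).
M_ST = np.column_stack([S(T(np.eye(ell)[:, i])) for i in range(ell)])

# It coincides with the matrix product of the two matrices.
assert np.allclose(M_ST, MS @ MT)
```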

Why is matrix multiplication associative?

I remember that when I was first told that matrix multiplication was associative, I was also given an exercise to "prove" it, but only for 2-by-2 matrices. You can imagine how tedious it would be to prove this for arbitrary matrices.

Fortunately, we get associativity almost for free with the interpretation that a matrix comes from a linear transformation. Of course, there's still some work to do, which is proving that function composition is associative. It may sound pretty trivial, and I'd agree that the proof itself is extremely straightforward, but from an algebraic point of view, this is somewhat interesting as there are quite a lot of seemingly-natural binary operations that aren't associative. (For example, the cross product is not associative.) Anyway, let's just prove associativity of function composition real quick: suppose $U: W \to X$, $T: X \to Y$, $S: Y \to Z$, and $w \in W$, then

\begin{align*} ((ST)U)(w) & = (ST)(U(w)) & & \text{; $ST$ and $U$}\\ & = S(T(U(w))) & & \text{; $S$ and $T$} \\ & = S((TU)(w)) & & \text{; $T$ and $U$} \\ & = (S(TU))(w) & & \text{; $S$ and $TU$} \\ \therefore (ST)U & = S(TU). \end{align*}

The annotation on the right of each line above refers to the application of the definition of function composition \eqref{eq:composition}.

The consequence of associativity is that we don't need parentheses anymore: we can just write $STU$ and it is unambiguous. Now add back linearity. We showed above that the matrix of a composition is the product of the matrices, so if $A$, $B$, $C$ are the matrices of $S$, $T$, $U$, then $(AB)C$ is the matrix of $(ST)U$ and $A(BC)$ is the matrix of $S(TU)$. Since $(ST)U$ and $S(TU)$ are the same linear map, they have the same matrix, and therefore $(AB)C = A(BC)$. This proves that matrix multiplication is associative.
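If you want a quick sanity check in addition to the proof, here is a throwaway numpy snippet with random matrices. (It proves nothing, of course; it just confirms the claim numerically, up to floating-point error.)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 2))

# Floating-point arithmetic is not exactly associative, hence allclose
# rather than an exact equality check.
assert np.allclose((A @ B) @ C, A @ (B @ C))
```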

Addendum: How deep is this result?

If you're a student of computer science, you may recall that finding a way to parenthesize a sequence of (naive) matrix multiplications that minimizes the number of scalar operations is a classic dynamic programming example, which suggests that being associative does introduce some computational nuance.
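For reference, here is a minimal sketch of that classic dynamic program, under the usual convention that matrix $i$ in the chain has shape dims[i] × dims[i+1]. (The function name and the example dimensions are mine, just for illustration.)

```python
def matrix_chain_min_ops(dims):
    """Minimum number of scalar multiplications needed to multiply a chain
    of matrices, where matrix i has shape dims[i] x dims[i+1]."""
    n = len(dims) - 1                      # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]     # cost[i][j]: best cost for A_i..A_j
    for length in range(2, n + 1):         # chain lengths 2, 3, ..., n
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]

# (10x30) @ (30x5) @ (5x60): doing the left product first costs 4500 scalar
# multiplications, versus 27000 the other way around.
print(matrix_chain_min_ops([10, 30, 5, 60]))   # 4500
```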

To make the matter more interesting, that dynamic programming solution is in fact far from being optimal. You can explore more about this problem on this Wikipedia page. Essentially, there exists an algorithm that is asymptotically better than the popular dynamic programming solution. This highlights the fact that associativity is indeed an important, non-trivial property.

Addendum 2: Is matrix-vector multiplication a special case of matrix-matrix multiplication?

Indeed it is. I really didn't have to write about the matrix-vector multiplication separately, but I did it just to make this post seem more approachable. (If that's not the case, I apologize.) But here we are, in the second addendum, and you are still reading this post, so let me babble away.

A column coordinate vector with $\ell$ components looks exactly like an $\ell$-by-$1$ matrix, so we can treat it as such. That's it. We now know how to multiply a matrix on the left of a column vector simply by viewing the vector as an $\ell$-by-$1$ matrix.
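In numpy terms this is just bookkeeping: reshaping the coordinate vector into an $\ell$-by-$1$ matrix and doing a matrix-matrix multiplication gives exactly the same numbers as the matrix-vector product. (The dimensions below are made up.)

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))       # matrix of some T : X -> Y
xi = rng.standard_normal(3)           # coordinates of a vector in X

as_vector = M @ xi                    # matrix-vector product
as_matrix = M @ xi.reshape(3, 1)      # same vector viewed as a 3-by-1 matrix

assert np.allclose(as_matrix.ravel(), as_vector)
```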

If you still feel itchy...

Recall that each matrix represents a linear transformation, and the multiplication corresponds exactly to composition. Now we know how to multiply a column vector, but what linear transformation does it represent?

The answer is simple. We can multiply an $\ell$-by-$1$ matrix on the left of a $1$-by-$1$ matrix, which is just a scalar. The result is the column vector scaled by that scalar. This is indeed the linear transformation we're looking for: multiplying a fixed vector by the given scalar.

If you're feeling even more pedantic...

Here's a more proper algebraic treatment of this matter. For each $\xi \in X$, we can define the following linear transformation: $$ \sigma(\xi): \lambda \mapsto \lambda\xi $$ for $\lambda \in F$, where $F$ is the scalar field. (If you're not familiar with the term field, the scalar field is just the set of possible values that a component of a matrix can take. You can choose $F$ to be the set of real numbers $\mathbb R$, the set of complex numbers $\mathbb C$, the set of rational numbers $\mathbb Q$, or any other field, depending on what you want to use it for.)

$F$ is naturally a 1-dimensional vector space. If we pick $1$ as the basis vector of $F$, we can find the matrix representation of $\sigma(\xi): F \to X$ by evaluating it at $1$: \begin{align*} \sigma(\xi)(1) & = \xi = \sum_{i=1}^\ell \xi^i x_i. \end{align*} From the definition of matrix components \eqref{eq:matrix_comps}, we see that the $\xi^i$'s are the components of the first and only column of the matrix of $\sigma(\xi)$. Incidentally, the $\xi^i$'s are also the coordinates of $\xi$.

This motivates the following abuse of notation: identify $\xi \in X$ with $\sigma(\xi): F \to X$. This has the nice consequence of allowing us to add and drop the superscript $F$ at will: $$ \begin{pmatrix} \xi^1 \\ \xi^2 \\ \vdots \\ \xi^\ell \end{pmatrix}_X = \xi = \sigma(\xi) = \begin{pmatrix} \xi^1 \\ \xi^2 \\ \vdots \\ \xi^\ell \end{pmatrix}^F_X. $$ (Remember that this is contingent on the fact that we picked $1$ as the basis vector of $F$. If we had picked a different basis vector, the matrix components would not have matched the vector coordinates.)
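And one last sanity check with made-up numbers: identifying $\xi$ with the $\ell$-by-$1$ matrix of $\sigma(\xi)$, applying it to a scalar $\lambda$ written as a $1$-by-$1$ matrix does indeed give $\lambda\xi$.

```python
import numpy as np

xi = np.array([2.0, -1.0, 3.0])       # coordinates of xi in X (made up)
sigma_xi = xi.reshape(3, 1)           # matrix of sigma(xi) : F -> X

lam = 7.0
# Apply sigma(xi) to the scalar lam, written as a 1-by-1 matrix:
result = sigma_xi @ np.array([[lam]])

assert np.allclose(result.ravel(), lam * xi)   # sigma(xi)(lam) == lam * xi
```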
