If you are reading this, you probably already know that data pre-processing is the 90% perspiration of machine learning. You might love it or you might dread it, but you probably don’t think of it as the part of ML where the most interesting mathematics lives. Let me challenge that view a bit with an interesting pre-processing problem I encountered recently.

#### A pre-processing problem

Suppose you have an $n \times m$ feature matrix $A$. The columns of $A$ are features and the rows are observations. Can you transform $A$ so that the features have zero mean and the rows have unit $L^2$ norm?

Why would a problem like this arise? Consider a retrieval task where you’d like to find the nearest neighbors (rows) in $A$ to a query point $q$. You may want to remove the “common” information from each feature (zero mean condition) and you may want to remove “length” information from each row (unit norm condition) before conducting your search.
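One reason unit-norm rows are convenient for retrieval (this snippet is my own illustration, not part of the pipeline below): once the rows have unit $L^2$ norm, ranking neighbors by Euclidean distance is the same as ranking by dot product, since $\|a - q\|^2 = 2 - 2\, a \cdot q$ for unit vectors.

```
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
A = A / np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm rows
q = rng.normal(size=5)
q = q / np.linalg.norm(q)                         # unit-norm query

# Ranking by Euclidean distance (ascending) matches ranking by dot product (descending)
by_distance = np.argsort(np.linalg.norm(A - q, axis=1))
by_dot = np.argsort(-A @ q)
print(np.array_equal(by_distance, by_dot))  # True
```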

#### Remove the mean then normalize — fail

Let’s see what happens if we first remove the mean from each column and then normalize each row. First let’s set up a couple of building blocks. We’re going to work with Python and numpy for our matrix calculations.

Here are a normalize transform and a de-mean transform:

```
import numpy as np

def normalize(M, axis):
    '''
    Normalize M to unit L2 norm along the axis
    '''
    if axis == 1:
        return M / np.sqrt((M * M).sum(axis=axis)).reshape((-1, 1))
    elif axis == 0:
        return M / np.sqrt((M * M).sum(axis=axis)).reshape((1, -1))
    else:
        raise RuntimeError("Invalid axis: %d" % axis)

def demean(M, axis):
    '''
    Remove the mean along the axis from M
    '''
    if axis == 0:
        return M - M.mean(axis=axis)
    elif axis == 1:
        return M - M.mean(axis=axis).reshape((-1, 1))
    else:
        raise RuntimeError("Invalid axis: %d" % axis)
```

Following numpy conventions, “axis” in these transforms is the direction along which we reduce. So “axis=0” means reduce down the rows, producing one value per column, and “axis=1” means reduce across the columns, producing one value per row.
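A tiny sanity check of that convention (my own toy example):

```
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(M.mean(axis=0))  # per-column means: [2. 3.]
print(M.mean(axis=1))  # per-row means:    [1.5 3.5]
```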

Let’s define a couple of “checks”:

```
def norm(v):
    '''
    Compute the squared L2 norm of the vector v
    '''
    return v.transpose().dot(v)

def is_normalized(M, axis):
    '''
    Check that the matrix M has unit norm along the axis
    '''
    return np.all(np.isclose(np.sqrt((M * M).sum(axis=axis)), 1.0))

def is_demeaned(M, axis):
    '''
    Check that the mean along the axis of the matrix M is 0
    '''
    return np.isclose(norm(M.mean(axis=axis)), 0.0)
```

Now let’s see what happens when we de-mean the columns and then normalize the rows of a (random) matrix M:

```
np.random.seed(2)
M = np.random.uniform(size=(10,5))
M = demean(M, 0)
print("Is demeaned? %s" % is_demeaned(M, 0))
# Is demeaned? True
M = normalize(M, 1)
print("Is normalized? %s" % is_normalized(M, 1))
# Is normalized? True
print("Is demeaned? %s" % is_demeaned(M, 0))
# Is demeaned? False
print("Norm of mean vector: %.3f" % norm(M.mean(axis=0)))
# Norm of mean vector: 0.018 (for this seed)
```

So normalizing the rows after a de-mean operation reintroduces a non-zero mean. You’ll get a similar result if you first normalize and then de-mean, if you interchange the axes, or if you try other matrix shapes.
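To see why, note that row normalization scales each row by a different factor, so the column sums become unequally weighted sums of the rows and are no longer zero in general. A small demonstration of this (my own sketch):

```
import numpy as np

np.random.seed(2)
M = np.random.uniform(size=(4, 3))
M = M - M.mean(axis=0)                       # columns now have zero mean
scales = 1.0 / np.sqrt((M * M).sum(axis=1))  # per-row scale factors
print(scales)                                # generally all different

# Scaling each row by a different factor shifts the column means away from zero
col_means_after = (M * scales.reshape(-1, 1)).mean(axis=0)
print(col_means_after)                       # generally non-zero
```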

#### A simple solution: raking

Now here’s a very simple solution to this problem. We repeatedly apply the de-mean and normalize transforms until the matrix satisfies both the zero-mean and unit-norm conditions. Remarkably, this approach appears to always converge, and often quite quickly!

```
from warnings import warn

def rake(M, max_iter=300, tol=1e-8):
    err = 1e0
    iter_count = 0
    while err > tol and iter_count < max_iter:
        M = demean(M, 0)
        M = normalize(M, 1)
        err = norm(M.mean(axis=0))
        iter_count += 1
        print("Error at iteration %d: %.5f" % (iter_count, err))
    if iter_count >= max_iter:
        warn("Raking did not converge after %d iterations." % iter_count)
    else:
        print("Raking converged after %d iterations." % iter_count)
    return M, err, iter_count
```

Let’s apply it to another random matrix:

```
np.random.seed(2)
M = np.random.uniform(size=(10, 50))
M, _, _ = rake(M)
# Error at iteration 1: 0.00070
# Error at iteration 2: 0.00002
# Error at iteration 3: 0.00000
# Error at iteration 4: 0.00000
# Error at iteration 5: 0.00000
# Raking converged after 5 iterations.
print("Is normalized? %s" % is_normalized(M, 1))
# Is normalized? True
print("Is demeaned? %s" % is_demeaned(M, 0))
# Is demeaned? True
```

Let’s provide some numerical evidence of the convergence claims.

```
import matplotlib.pyplot as plt

np.random.seed(2)
iter_counts = []
for i in range(10000):
    M = np.random.uniform(size=(10, 10))
    M, _, iter_count = rake(M)
    iter_counts.append(iter_count)
p = plt.hist(iter_counts, bins=10, alpha=0.5)
p = plt.xlabel("Number of iterations to convergence")
p = plt.ylabel("Count")
plt.savefig("raking.png")
```

So why might this idea work? To my eye, it’s quite similar to an often-rediscovered algorithm known variously as “raking”, “iterative proportional fitting”, and a few other names.

Let me describe this algorithm in the survey context. Suppose you’ve sampled from a population in a non-representative way but you want to weight the survey so that it is representative of the population. Maybe your true population has genders “M”, “F”, “non-binary” and age groups “young” and “old”. You need to choose weights so that both gender and age groups align with the population proportions. So you start by choosing weights so that your genders balance. But that throws off the age ratio. So then you balance your ages. But that throws off the genders (though a bit less than the first time). Do this enough times and the algorithm will converge to a set of weights that balance your sample along both dimensions!
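That alternating loop can be sketched on a made-up 2×2 table of counts (a simplified version of the example above, with two genders by rows and two age groups by columns; all numbers are hypothetical):

```
import numpy as np

# Hypothetical observed sample counts: rows are genders, columns are age groups
table = np.array([[30.0, 10.0],
                  [20.0, 40.0]])
row_targets = np.array([50.0, 50.0])  # desired gender margins in the population
col_targets = np.array([60.0, 40.0])  # desired age margins in the population

for _ in range(100):
    # Balance the gender margins...
    table *= (row_targets / table.sum(axis=1)).reshape(-1, 1)
    # ...which perturbs the age margins, so balance those too
    table *= col_targets / table.sum(axis=0)

print(table.sum(axis=1))  # ~ row_targets
print(table.sum(axis=0))  # ~ col_targets
```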

Proving mathematically that our matrix feature processing trick works seems pretty non-trivial to me. Indeed the literature on the raking / IPF problem seems to suggest that the proofs are highly technical. If you can see a path to a convergence proof, please let me know!