Data Manipulation With Numpy
The examples mostly come from the area of machine learning, but they will be useful if you're doing number
crunching in Python.
In [1]: from __future__ import print_function # for python 2 & python 3 compatibility
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt  # plt is used for the histograms below
1. numpy.argsort ¶
Sorting values of one array according to the other ¶
Say we want to order people according to their age, reordering their heights in the same way.
print(ages)
print(heights)
[49 45 44 52 44 57 46 49 31 50]
[209 183 202 188 205 179 209 187 156 209]
[31 44 44 45 46 49 49 50 52 57]
[156 202 205 183 209 209 187 209 188 179]
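The reordering above is done by computing a single permutation with numpy.argsort and applying it to both arrays (a sketch; sorter is an illustrative name):
sorter = np.argsort(ages)
print(ages[sorter])
print(heights[sorter])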
Once you have computed the permutation, you can apply it many times; this is fast.
To solve this problem, people frequently use sorted(zip(ages, heights)), which is much slower (10-20
times slower on large arrays).
print(permutation)
print(original)
print(original[permutation])
[1 7 4 9 2 8 0 5 6 3]
['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j']
['b' 'h' 'e' 'j' 'c' 'i' 'a' 'f' 'g' 'd']
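The inverse permutation used below can be obtained with argsort (one possible way; the cells that follow check it):
inverse_permutation = np.argsort(permutation)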
print(original[permutation][inverse_permutation])
['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j']
Note that chained fancy indexing composes: a[b][c] = a[b[c]], so applying a permutation and then its inverse returns the original array.
In [6]: permutation[inverse_permutation]
In [7]: print(np.argsort(permutation))
print(inverse_permutation)
[6 0 4 9 2 7 8 1 5 3]
[6 0 4 9 2 7 8 1 5 3]
In other words, for each element in the array we want to find the number of elements smaller than it.
print(data)
print(np.argsort(np.argsort(data)))
NB: there is a scipy function which does the same, but it is more general and faster, so prefer using it:
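The scipy function in question is presumably scipy.stats.rankdata; subtracting 1 turns its 1-based ranks into the 0-based order computed above:
from scipy.stats import rankdata
rankdata(data) - 1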
Out[9]: array([ 8., 6., 2., 9., 0., 3., 5., 4., 7., 1.])
This method is useful for comparing distributions or for working with distributions that have heavy tails or a strange
shape.
Now we can see that IronTransform was indeed able to turn the signal distribution into a uniform one:
plt.figure()
plt.hist(iron.transform(sig_pred), bins=30, alpha=0.5)
plt.hist(iron.transform(bck_pred), bins=30, alpha=0.5)
plt.show()
In other words, for each point x we should compute the part of the distribution to the left of it. This was done by
summing all weights up to that point using numpy.cumsum (we also had to normalize the weights).
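A minimal sketch of such a flattening transform built from numpy.cumsum and numpy.interp (the class name and attribute names are illustrative, not the exact IronTransform implementation; inputs are assumed to be numpy arrays):
class FlatteningTransform:
    def fit(self, x, sample_weight):
        # sort the training values and accumulate normalized weights: a weighted CDF
        sorter = np.argsort(x)
        self.x = x[sorter]
        self.cdf = np.cumsum(sample_weight[sorter]) / np.sum(sample_weight)
        return self

    def transform(self, x_new):
        # linear interpolation of the learned CDF maps values into [0, 1]
        return np.interp(x_new, self.x, self.cdf)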
To use the learned mapping, linear interpolation (numpy.interp) was applied. We can visualize this
mapping:
Also worth mentioning: to fight very small / large numbers, use numpy.clip. Simple and very handy:
In [18]: x = np.arange(-5, 5)
print(x)
print(x.clip(0))
[-5 -4 -3 -2 -1 0 1 2 3 4]
[0 0 0 0 0 0 1 2 3 4]
2. Broadcasting, numpy.newaxis ¶
Broadcasting is a very useful trick which simplifies many operations.
Weighted covariance matrix ¶
numpy has a cov function, but it doesn't support sample weights. Let's write our own covariance
matrix.
Let $X_{ij}$ be the original data, where $i$ indexes samples and $j$ indexes features. The weighted covariance (of centered data) is $COV_{jk} = \sum_i X_{ij} w_i X_{ik}$.
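A sketch of such a weighted covariance using broadcasting (data is assumed to be centered and weights to hold one weight per sample; numpy.newaxis makes the weights broadcast along the feature axis):
def weighted_covariance(data, weights):
    weights = weights / weights.sum()
    # weights[:, np.newaxis] has shape (n_samples, 1) and scales every feature of each sample
    return data.T.dot(weights[:, np.newaxis] * data)
With equal weights this roughly matches np.cov(data.T) below (np.cov subtracts the mean itself and uses an n - 1 denominator).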
In [21]: np.cov(data.T)
Pearson correlation ¶
One more example of broadcasting: computation of the Pearson correlation coefficient.
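A sketch of my_corrcoef, the weighted version used in the comparison below, built on the weighted_covariance sketch from the previous section:
def my_corrcoef(data, weights):
    # subtract the weighted mean of each feature, then normalize the covariance
    data = data - np.average(data, axis=0, weights=weights)
    covariance = weighted_covariance(data, weights)
    sigmas = np.sqrt(np.diag(covariance))
    return covariance / sigmas[:, np.newaxis] / sigmas[np.newaxis, :]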
In [25]: np.corrcoef(data.T)
my_corrcoef(data, np.ones(len(data)))
... but keep in mind that there is sklearn.metrics.pairwise, which does this for you and has different options.
3. numpy.unique, numpy.searchsorted ¶
['ab' 'ac' 'ad' 'ae' 'af' 'ag' 'ah' 'ai' 'aj' 'ak' 'al' 'am' 'an' 'bc' 'bd'
'be' 'bf' 'bg' 'bh' 'bi' 'bj' 'bk' 'bl' 'bm' 'bn' 'cd' 'ce' 'cf' 'cg' 'ch'
'ci' 'cj' 'ck' 'cl' 'cm' 'cn' 'de' 'df' 'dg' 'dh' 'di' 'dj' 'dk' 'dl' 'dm'
'dn' 'ef' 'eg' 'eh' 'ei' 'ej' 'ek' 'el' 'em' 'en' 'fg' 'fh' 'fi' 'fj' 'fk'
'fl' 'fm' 'fn' 'gh' 'gi' 'gj' 'gk' 'gl' 'gm' 'gn' 'hi' 'hj' 'hk' 'hl' 'hm'
'hn' 'ij' 'ik' 'il' 'im' 'in' 'jk' 'jl' 'jm' 'jn' 'kl' 'km' 'kn' 'lm' 'ln'
'mn']
[25 19 5 ..., 69 77 85]
In [33]: print(categories)
print(unique_categories[new_categories])
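The integer encoding above is presumably produced by numpy.unique with return_inverse (a sketch; test_categories is a hypothetical array of new data):
unique_categories, new_categories = np.unique(categories, return_inverse=True)
# unique_categories is sorted, so the same mapping can be applied to new data with binary search;
# values absent from unique_categories would silently get a neighbour's code, hence the options below
test_codes = np.searchsorted(unique_categories, test_categories)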
If the data later contains categories that were not seen during training, there are two typical options:
- map them randomly to one of the present categories (this, e.g., is what the hashing trick does)
- make a special category 'unknown' to which all new values will be mapped; then a special value may
be chosen for it. This approach is more general, but requires more effort.
4. numpy.percentile, numpy.bincount ¶
Splitting data into bins of equal width usually works poorly for distributions with long tails (or a strange shape) and requires an additional
transformation to avoid empty or overly large bins.
However, in most cases creating bins with an equal number of events gives more convenient and
stable results. Selecting the edges of such bins can be done with numpy.percentile.
In [38]: x = np.random.exponential(size=20000)
_ = plt.hist(x, bins=40)
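A sketch of equal-frequency binning with numpy.percentile (edges is the name reused by the searchsorted cell below; 10 bins chosen arbitrarily):
edges = np.percentile(x, np.linspace(0, 100, 11))  # 11 edges -> 10 bins with roughly equal counts
_ = plt.hist(x, bins=edges)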
numpy.searchsorted assumes that the first array is sorted and uses binary search, so it is efficient even
for a large number of bins.
In [40]: np.searchsorted(edges, x)
Counter ¶
Need to know the number of occurrences of each category? Use numpy.bincount to compute the
statistics:
In [41]: new_categories
Out[41]: array([25, 19, 5, ..., 69, 77, 85])
This will compute how many times each category is present in the data:
In [42]: np.bincount(new_categories)
Out[42]: array([112, 105, 127, 112, 118, 107, 129, 94, 118, 105, 109, 103, 97,
108, 110, 97, 103, 111, 103, 102, 108, 115, 121, 98, 121, 115,
105, 113, 109, 97, 91, 109, 109, 107, 114, 123, 99, 98, 92,
117, 103, 94, 111, 113, 125, 98, 110, 122, 108, 117, 106, 100,
111, 108, 106, 130, 116, 97, 112, 106, 103, 103, 109, 107, 106,
121, 107, 117, 115, 113, 109, 107, 122, 123, 117, 104, 99, 125,
110, 122, 93, 105, 112, 118, 116, 105, 110, 112, 123, 116, 127])
One of the tricks is to replace, in each event, the category (which cannot be processed by most algorithms) with the
number of times it appears in the data.
In [44]: np.bincount(new_categories)[new_categories]
The per-category variance can then be obtained from the identity $\mathbb{V}x = \mathbb{E}x^2 - (\mathbb{E}x)^2$:
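The per-category means used in the next cell can be collected with the same bincount trick (a sketch; n is assumed to stand for the number of events in each category):
n = np.bincount(new_categories)
means_over_category = np.bincount(new_categories, weights=predictions) / n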
In [47]: means_of_squares_over_category = np.bincount(new_categories, weights=predictions**2) / n
var_over_category = means_of_squares_over_category - means_over_category ** 2
var_over_category[new_categories]
You can also assign weights to events and compute the weighted mean / weighted variance over each category using the same
functions.
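For instance, a weighted mean over each category only requires weighting both the numerator and the denominator (a sketch; weights is assumed to hold per-event weights):
weighted_means = (np.bincount(new_categories, weights=weights * predictions)
                  / np.bincount(new_categories, weights=weights))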
5. numpy.ufunc.at ¶
To proceed further, we need to know that there is a class of binary (and unary) functions in numpy: numpy.ufuncs.
In the following example we cook a new column with the maximal prediction over each category using ufunc.at:
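A sketch of how this statistic could be collected (n_categories is an assumed variable holding the number of distinct categories):
max_over_category = np.full(n_categories, -np.inf)
# ufunc.at applies the operation in place at the given indices, correctly handling repeated indices
np.maximum.at(max_over_category, new_categories, predictions)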
print(max_over_category[new_categories])
Important note: numpy.add.at and numpy.bincount are very similar, but the latter is dozens of times
faster. Always prefer it to numpy.add.at.
I prefer the first approach; the code below is a demonstration of the second approach and its benefits.
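A sketch of collecting such two-dimensional statistics with numpy.add.at, assuming two integer-encoded columns first_category and second_category:
counters = np.zeros([first_category.max() + 1, second_category.max() + 1])
# count occurrences of each (first, second) pair; repeated pairs are accumulated correctly
np.add.at(counters, (first_category, second_category), 1)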
Now for some useful things we can do with the collected statistics. The first is to create a feature, for which we use
fancy indexing:
In [54]: # for each event: the largest number of times its second-category value co-occurs with any single first-category value
counters.max(axis=0)[second_category]
You are not limited to numpy.add; remember there are other ufuncs.
Not enough space to keep the whole matrix? Use scipy.sparse if there are too many zero elements.
Alternatively, you can use matrix decompositions to keep an approximation of the matrix and compute
the result on the fly.
The ROC curve takes three arrays: labels (binary, 0 or 1), predictions (real-valued), and weights.
In [57]: np.random.seed(1)
size = 1000
labels = (np.random.random(size) > 0.5) * 1
predictions = np.random.normal(size=size) + labels
weights = np.random.exponential(size=size)
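A sketch of how such a weighted ROC curve could be assembled with numpy (the function name and the tie handling are simplified; this is not the exact code whose output is shown below):
def weighted_roc_curve(labels, predictions, sample_weight):
    # sort events by prediction, highest first
    order = np.argsort(predictions)[::-1]
    labels, sample_weight = labels[order], sample_weight[order]
    # accumulate signal / background weight as the threshold is lowered
    tpr = np.cumsum(sample_weight * (labels == 1))
    fpr = np.cumsum(sample_weight * (labels == 0))
    return fpr / fpr[-1], tpr / tpr[-1]
The area under such a curve can then be taken with, e.g., np.trapz(tpr, fpr).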
0.416
0.416
An important point is that the function written here supports weights, while the scipy version doesn't.
Conclusion ¶
Many things can be written in numpy in an efficient way; moreover, the resulting code will be short and (surprise!)
more readable.
But this skill requires much practice.
See also ¶
Second part of the post: numpy tips and tricks 2
I used an example from histogram equalization with numpy
Numpy log-likelihood benchmark
This post was written in IPython. You can download the notebook from the repository.