Late last year, I took University of Michigan’s Coursera class on machine learning in Python (great class, by the way). In one of the suggested readings, ML researcher Pedro Domingos covers the folk knowledge used in machine learning. Domingos mentions how human intuition fails in high dimensions, giving the example that the mass of a high-dimensional multivariate Gaussian is not concentrated near its center, but rather in an increasingly distant shell around it. He follows this statement with “most of the volume of a high-dimensional orange is in the skin, not the pulp.”
This statement was surely just meant to introduce one difficulty with building ML algorithms, but it’s evocative in a way that got me thinking. Not satisfied with knowing the mathematical explanation behind this phenomenon, I decided to investigate with Gaussians of dimensionalities I can visualize: 3D and 2D. Though three dimensions is minuscule compared to most feature spaces, I figured I could at least see the beginning of a trend.
To begin with, I plotted some random 3D Gaussian data. The Gaussian in each dimension had the same parameters: standard deviation of 1, normalization of 1, and center at the origin. The result looks as expected: a clumpy blob that spreads thin at the edges. There is not much information to glean from this, but it reassured me that the data was generated correctly.
Satisfied there was nothing fishy going on in my dataset, I then plotted a histogram of the points’ distance from the origin. It might be useful to think of the 3D data as an onion, rather than an orange, such that each bin in the histogram represents a layer of the onion.
Fantastic! Even at a lowly three dimensions, it’s clear that the majority of the data points are in a shell at some distance from the center. In this particular case, it looks like the most popular radius is between 1 and 2 units, although changing the parameters of our Gaussians would affect this.
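For anyone who wants the shape of that histogram on paper, here is a short sketch of where the peak comes from. It assumes what the setup above describes: d independent Gaussian coordinates centered at the origin with a common standard deviation σ (the symbols d, σ, r and r* are my own notation, not from Domingos). Under that assumption the distance r from the origin follows a chi distribution, whose density is a growing power of r (the size of the shell) multiplied by the Gaussian falloff:

```latex
\[
  p_d(r) \;\propto\; r^{\,d-1}\, e^{-r^{2}/(2\sigma^{2})}, \qquad r \ge 0
\]
\[
  \frac{d}{dr}\,\ln p_d(r) \;=\; \frac{d-1}{r} - \frac{r}{\sigma^{2}} \;=\; 0
  \quad\Longrightarrow\quad
  r^{*} \;=\; \sigma\sqrt{d-1}
\]
```

With d = 3 and σ = 1 this puts the most popular radius at √2 ≈ 1.41, which sits between 1 and 2 as the histogram suggests. Changing σ scales the peak proportionally.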
Since the whole point of this exercise was to build intuition, I decided to reduce to two dimensions and see if I could create a visual that really hammered this concept into my subconscious. For the sake of comparison, I plotted another scatter plot and histogram:
The peak of the histogram has crept closer to zero, as expected. I then stripped down the scatter plot, removed half the points, and added a series of rings—each of width 0.5, going up to a maximum outer radius of 3.
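To put numbers on those rings, here is a minimal sketch that counts the points falling in each ring of width 0.5 and compares the counts with what a 2D standard Gaussian predicts. The seed and the variable names are arbitrary choices of mine, so the counts will not match the figure exactly. The expected values come from the Rayleigh CDF, 1 - exp(-r²/2), which applies when both coordinates have standard deviation 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not the one used for the figures
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)
r = np.hypot(x, y)              # distance of each point from the origin

# Ring boundaries: 0, 0.5, 1.0, ..., 3.0 (six rings of width 0.5)
edges = np.arange(0, 3.5, 0.5)
counts, _ = np.histogram(r, bins=edges)

# Expected fraction of points per ring for a 2D standard Gaussian,
# from the Rayleigh CDF: P(R <= r) = 1 - exp(-r^2 / 2)
cdf = 1 - np.exp(-edges**2 / 2)
expected = np.diff(cdf)

for lo, hi, c, e in zip(edges[:-1], edges[1:], counts, expected):
    print(f"{lo:.1f} <= r < {hi:.1f}: {c:3d} points (expected about {e * n:.1f})")
```

In expectation, the center circle (r < 0.5) holds about 11.8% of the points, while the first ring (0.5 to 1.0) holds about 27.6%, more than twice as many.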
You can see fairly clearly that the innermost dark blue ring sweeps up more points than the light blue center circle. If you like, you might then imagine these rings getting integrated around the z axis, and then around the fourth dimension, and so on (I’m pretty sure I’ve seen it in a Nolan film…). Eventually, that first dark blue ring is a multi-dimensional layer with a crazy amount of volume, and the center circle is just a pip in comparison.
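Since the higher-dimensional rings can't be drawn, a rough numerical check can stand in for the picture. The sketch below is my own illustration rather than part of the original figures: it draws standard Gaussian points in several dimensions (the list of dimensions, the sample size, and the seed are arbitrary) and reports the mean distance from the origin along with the fraction of points within one unit of the center.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 10_000                      # more points than the post's 200, to reduce noise


def radii(d, n=n):
    """Distances from the origin for n points from a d-dimensional standard Gaussian."""
    pts = rng.normal(size=(n, d))
    return np.linalg.norm(pts, axis=1)


for d in (1, 2, 3, 10, 100):
    r = radii(d)
    inside = np.mean(r < 1)     # fraction of points within one unit of the center
    print(f"d={d:3d}  mean radius={r.mean():6.2f}  fraction with r<1: {inside:.3f}")
```

The mean radius grows roughly like the square root of the dimension, and the fraction of points near the center drops quickly, which is the shell effect from the quote.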
For the heck of it, I plotted a one-dimensional Gaussian as well. In the histogram, we see evidence of the variation that arises from randomly selected data. The most populated bin should be the one nearest 0, but in this particular set there are a few more points around 0.5.
To conclude, this example of Gaussian-distributed data shows that the unintuitive behaviors of high-dimensional data can begin to emerge at low dimensions. In some cases, exploring data in these lower dimensions might expose trends that become significant in higher dimensions; however, I’m not sure how one could do that if all the features of a dataset were from different distributions or even different data types. I’m excited to learn more about the world of ML, and I hope this was interesting to someone other than myself!
Here is my code to create all the figures in this post (no credit is necessary):
    %matplotlib notebook
    # The above magic allows for interactive figures in a Jupyter notebook.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.patches import Wedge
    from matplotlib.collections import PatchCollection

    # Matplotlib 3.6 renamed the seaborn styles; use whichever name is available.
    style = 'seaborn-v0_8-colorblind'
    if style not in plt.style.available:
        style = 'seaborn-colorblind'
    plt.style.use(style)

    # 200 points with independent standard-normal coordinates
    np.random.seed(2956)
    x = np.random.normal(size=200)
    y = np.random.normal(size=200)
    z = np.random.normal(size=200)

    # Shared histogram settings
    bins = 10
    xtix = np.arange(0, 5, 1)
    ytix = np.arange(10, 60, 10)

    def dist(x, y):
        return np.sqrt(x**2 + y**2)

    pt_radii2 = dist(x, y)          # distance from the origin in the xy-plane
    pt_radii3 = dist(pt_radii2, z)  # distance from the origin in 3D

    # 3D scatter plot
    ax3 = plt.figure().add_subplot(projection='3d')
    ax3.scatter(x, y, z, s=10)
    ax3.set_title('3D Gaussian with n=200')
    ax3.set_xlim(-3, 3)
    ax3.set_ylim(-3, 3)
    ax3.set_zlim(-3, 3)

    # Histogram for 3D plot
    plt.figure()
    plt.title('3D Gaussian with n=200: Distance from Origin')
    plt.hist(pt_radii3, bins=bins, color='lightblue')
    plt.tick_params(left=False, bottom=False)
    plt.xticks(xtix, alpha=0.5)
    plt.yticks(ytix, alpha=0.5)
    for i in ['top', 'bottom', 'right', 'left']:
        plt.gca().spines[i].set_visible(False)

    # 2D scatter plot & histogram
    fig = plt.figure()
    fig.set_figheight(4)
    fig.set_figwidth(8)
    fig.suptitle('2D Gaussian with n=200')

    ax2 = fig.add_subplot(121)
    points = ax2.scatter(x, y, s=10)
    ax2.set_title('Position', size=10)
    ax2.set_xlim(-4, 4)
    ax2.set_ylim(-4, 4)
    ax2.set_aspect('equal')
    for spine in ['right', 'top']:
        ax2.spines[spine].set_visible(False)
    ax2.tick_params(left=False, bottom=False)
    ax2.set_yticks(np.arange(-4, 6, 2))

    ax2h = fig.add_subplot(122)
    ax2h.hist(pt_radii2, bins=bins, color='lightblue')
    ax2h.set_title('Distance from Origin', size=10)
    for i in ['top', 'bottom', 'right', 'left']:
        ax2h.spines[i].set_visible(False)
    plt.tick_params(left=False, bottom=False)
    plt.xticks(xtix, alpha=0.5)
    plt.yticks(ytix, alpha=0.5)
    fig.tight_layout(pad=1)

    # 2D target-like plot: alternating light and dark rings of width 0.5
    max_r = 3
    width = 0.5
    patches2, radii2 = [], np.arange(0.5, max_r + width, width * 2)  # outer radii of light rings
    patches3, radii3 = [], np.arange(1, max_r + width, width * 2)    # outer radii of dark rings
    for r in radii2:
        patches2.append(Wedge((0, 0), r, 0, 360, width=width))
    for r in radii3:
        patches3.append(Wedge((0, 0), r, 0, 360, width=width))
    p2 = PatchCollection(patches2, alpha=0.2)
    p3 = PatchCollection(patches3, alpha=0.4)

    fig = plt.figure()
    ax2t = fig.add_subplot()
    ax2t.add_collection(p2)
    ax2t.add_collection(p3)
    ax2t.set_xlim(-max_r - 0.1, max_r + 0.1)
    ax2t.set_ylim(-max_r - 0.1, max_r + 0.1)
    ax2t.set_aspect('equal')
    plt.tick_params(left=False, bottom=False)
    points = ax2t.scatter(x[:100], y[:100], s=10)  # only half the points, for clarity
    for spine in ['left', 'right', 'top', 'bottom']:
        ax2t.spines[spine].set_visible(False)
    ax2t.get_xaxis().set_visible(False)
    ax2t.get_yaxis().set_visible(False)

    # 1D scatter plot & histogram
    fig = plt.figure()
    fig.set_figheight(4)
    fig.set_figwidth(8)
    fig.suptitle('1D Gaussian with n=200')

    ax1 = fig.add_subplot(121)
    points = ax1.scatter(x, np.zeros(200), s=10, alpha=0.1)
    ax1.set_title('Position', size=10)
    ax1.set_xlim(-3, 3)
    for spine in ['left', 'right', 'top', 'bottom']:
        ax1.spines[spine].set_visible(False)
    ax1.tick_params(bottom=False)
    ax1.get_yaxis().set_visible(False)

    ax1h = fig.add_subplot(122)
    ax1h.hist(np.abs(x), bins=bins, color='lightblue')
    ax1h.set_title('Distance from Origin', size=10)
    for i in ['top', 'bottom', 'right', 'left']:
        ax1h.spines[i].set_visible(False)
    plt.tick_params(left=False, bottom=False)
    plt.xticks(xtix, alpha=0.5)
    plt.yticks(ytix, alpha=0.5)
    fig.tight_layout(pad=1)