Thursday, 15 December 2016

Semantically Ordering Word Lists in Python

I am sure most people who read this blog are familiar with word embedding. If not, word embedding is a feature learning technique in natural language processing (NLP) where words or phrases are mapped into vectors of real numbers. The technique was popularised by Tomas Mikolov's word2vec toolkit in 2013.
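To make that concrete, here is a toy sketch - the words and the tiny three-dimensional vectors below are made up purely for illustration, real embeddings have far more dimensions - showing how semantic similarity falls out of simple vector arithmetic:

import numpy as np

# Hypothetical 3-dimensional embeddings, invented purely for illustration.
vectors = {
    'cat': np.array([-0.5, -2.8, -2.5]),
    'dog': np.array([-0.5, -2.2, -3.7]),
    'river': np.array([3.3, 2.1, -1.4]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors['cat'], vectors['dog']))    # high
print(cosine_similarity(vectors['cat'], vectors['river']))  # low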

Today, word embedding finds its way into many real-world applications, including the recent update to Google Translate. Google now uses a neural machine translation engine, which translates whole sentences at a time rather than piece by piece. These sentences are represented by vectors.

Normally, word vectors consist of hundreds, if not thousands, of real numbers, which can capture the intricate detail of a rich language. But I wondered: what would happen if we crushed each word's representation down to a single number?

Surely, no meaningful information would remain.

I exported a selection of words and their associated word vectors, generated by Athena - my port of word2vec - into a CSV file (a sketch of that export step follows the listing below). Here's a truncated view of the 128-dimensional word vectors:

albert_einstein, -2.480522, -1.91355, 0.4266494, 1.927253, -0.5318278, -3.722191, ...
audi, -0.8185719, -1.691721, -0.3929027, 1.698154, -2.953124, 0.9475167, ...
bill_gates, -1.673416, -1.601884, 1.130291, 2.139339, 0.5832655, -2.355634, ...
bmw, -1.027373, -1.668206, -1.728756, 2.338698, -4.249786, 0.4357562, ...
cat, -0.5071716, -2.760615, -2.546596, 0.9999355, -0.1860456, 0.2416906, ...
climb, 0.7150535, 0.1190424, 1.583062, -0.3858005, -3.991093, 1.382508, ...
dog, -0.4773144, -2.224563, -3.67255, 0.5424166, 0.6331441, 1.222993, ...
hotel, 0.3524887,-4.38588, 1.197459, 2.595855, -0.3414235, -0.4427964, ...
house, 0.5532117, -2.279454, -0.2512704, 0.4140477, 2.676775, 0.05087801, ...
monkey, -0.623855, -3.508944, -0.931955, -0.4193407, -0.9044554, 0.347873, ...
mountain, 2.207317, 0.5984201, -1.398792, -0.5220098, -1.344777, 0.3062904, ...
porsche, -0.316146, -1.779519, -0.8431134, 2.44296, -3.680713, 0.874707, ...
river, 3.286445, 2.139661, -1.43799, 2.606198, -2.337485, -0.4348237, ...
school, -3.210236, -3.298275, 3.333953, 0.9878215, 1.926927, -0.1040714, ...
steve_jobs, -2.178778, -2.492632, 1.083596, 1.491468, 0.5440083, -3.330186, ...
swim, 0.8094505, -0.911125, -1.189181, 1.908399, -4.087798, 1.79775, ...
valley, 1.044242, 1.814712, 0.1396747, 0.6305104, -1.227837, -0.389852, ...
walk, 0.5212327, 0.03666721, 0.6227544, 0.6157224, -2.084322, 0.6642563, ...
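Exporting embeddings in this shape is straightforward. The following is only a minimal sketch, assuming the vectors are already in memory as a dict mapping each word to a NumPy array (Athena's actual export path may differ):

import csv
import numpy as np

# Hypothetical in-memory embeddings: word -> 128-dimensional vector.
embeddings = {
    'cat': np.random.randn(128),
    'dog': np.random.randn(128),
}

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for word, vector in embeddings.items():
        # One row per word: the label followed by its vector components.
        writer.writerow([word] + [float(x) for x in vector])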

Next, I crushed each vector down to a single dimension using Principal Component Analysis (PCA). It's really easy to do this in Python with the scikit-learn library. Here is my code:

from sklearn import decomposition
import pandas as pd

# Read data from file (no header row: column 0 is the word, the rest are vector components).
data = pd.read_csv('data.csv', header=None)

# Reduce the word embeddings to one dimension with PCA.
embeddings = data.iloc[:, 1:]
pca = decomposition.PCA(n_components=1)
pca.fit(embeddings)
values = pca.transform(embeddings)

# Build result columns.
labels = pd.Series(data.iloc[:, 0], name='labels')
values = pd.Series(values[:, 0], name='values')

# Build the results and sort by the single PCA value.
result = pd.concat([labels, values], axis=1)
result = result.sort_values('values', ascending=True)

# Output to console.
print(result)
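If you are curious how much of the original variance that single component actually retains, scikit-learn exposes it directly once the PCA has been fitted:

# Fraction of the total variance captured by the first principal component.
print(pca.explained_variance_ratio_)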

The results are astonishing!

bmw              -12.042180
audi             -11.357731
porsche          -11.349108
steve_jobs        -8.577104
bill_gates        -7.390602
albert_einstein   -4.910876
monkey            -1.285317
cat               -1.197481
dog               -0.925163
house              0.517183
school             0.732882
hotel              1.319551
swim               6.056411
climb              6.943114
walk               7.069848
mountain           9.365861
valley            11.752656
river             15.278058

Even with just a single value representing each word, enough information is retained so that clusters have clearly formed - we have car manufacturers, people, animals, buildings, verbs and geographical features...

What does this mean? No idea..! But, if we can semantically sort words, we can probably do the same for sentences. Would sorting sentences in a document make the flow of ideas more coherent? Maybe..?
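If you wanted to experiment with that, one crude approach - purely speculative on my part - is to represent each sentence as the mean of its word vectors and reuse the same PCA trick. The random lookup below is just a stand-in so the sketch runs on its own; real embeddings would go in its place:

import numpy as np
from sklearn import decomposition

# Hypothetical lookup: word -> vector (random here purely to keep the sketch self-contained).
rng = np.random.default_rng(0)
embedding_lookup = {w: rng.standard_normal(128)
                    for w in 'the cat sat dog ran river flowed'.split()}

# Represent each sentence as the mean of its word vectors.
sentences = ['the cat sat', 'the dog ran', 'the river flowed']
sentence_vectors = np.array([
    np.mean([embedding_lookup[w] for w in s.split()], axis=0)
    for s in sentences
])

# Project onto the first principal component and print the sentences in that order.
pca = decomposition.PCA(n_components=1)
order = pca.fit_transform(sentence_vectors)[:, 0].argsort()
for i in order:
    print(sentences[i])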

Anyway, as always, the code and data for this article are on GitHub.

Follow me on Twitter for more updates like this.