Measuring the similarity of books using Doc2vec and TensorFlow

In this week’s machine learning adventure we shall be measuring the similarity of books using TF-IDF, doc2vec and TensorFlow. The full source code is available to download in my Gist.

Specifically, we want to use machine learning to map books onto a two-dimensional plane so we can explore a library and potentially find other titles we may be interested in reading.

In this toy example, we’ll be downloading twenty books from Project Gutenberg, including such classics as The War of the Worlds, Pride and Prejudice and Treasure Island.

In my first attempt I used a pure doc2vec solution, but found I was not achieving good separation between titles. It’s easy to see why: language is often incredibly generic. Consider the short phrase “she smiled sweetly”, which could appear in pretty much any book.

Many natural language processing solutions remove so-called stop words, the most frequently occurring terms. I decided to go a step further and remove all words that contribute little towards the essence of the story.

By calculating TF-IDF scores for each word in each book, I assembled a condensed global vocabulary from the 2,500 highest-scoring words in each book. Twenty books could contribute up to 50,000 distinct words, yet the condensed vocabulary contained only about 25,000, so clearly we still retained a good overlap between books even after removing much of the noise.

Top tokens in: The War of the Worlds
> martians 0.04507
> martian 0.02557
> woking 0.01715

Top tokens in: Through the Looking-Glass
> alice 0.18587
> humpty 0.0377
> dumpty 0.0377
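
As a rough illustration of the ranking step, here is a minimal pure-Python sketch. It is not the code from the Gist: the tokenizer, the function names and the plain tf × log(N/df) weighting are all my assumptions, and a library implementation with different smoothing or normalisation would produce different scores from the ones listed above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


def top_tfidf_tokens(books, top_k):
    """books maps title -> raw text; returns title -> [(token, score), ...]."""
    counts = {title: Counter(tokenize(text)) for title, text in books.items()}
    n_books = len(counts)

    # Document frequency: the number of books each token appears in.
    df = Counter()
    for book_counts in counts.values():
        df.update(book_counts.keys())

    result = {}
    for title, book_counts in counts.items():
        total = sum(book_counts.values())
        scores = {
            tok: (n / total) * math.log(n_books / df[tok])
            for tok, n in book_counts.items()
        }
        # Highest score first; ties broken alphabetically for reproducibility.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[title] = ranked[:top_k]
    return result


def condensed_vocabulary(top_tokens):
    # The global vocabulary is the union of every book's top tokens.
    return {tok for ranked in top_tokens.values() for tok, _ in ranked}


# Two made-up snippets stand in for full books.
books = {
    "war": "the martians came to woking and the martians left",
    "alice": "alice met humpty dumpty and alice smiled at the cat",
}
top = top_tfidf_tokens(books, top_k=2)
vocab = condensed_vocabulary(top)
```

Words that occur in every book, such as “the”, receive a score of zero under this weighting, which is why they never reach the condensed vocabulary.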

Stripping out any words that do not appear in the condensed vocabulary produces a much more concentrated form of the story – gibberish to humans, but machines love it!

For instance, Alice’s Adventures in Wonderland would now start like this: alice tired sitting sister bank twice peeped book sister reading pictures conversations book thought alice pictures conversations considering feel sleepy stupid pleasure making daisy chain worth getting picking daisies suddenly rabbit pink ran close remarkable nor alice rabbit itself.
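
The filtering itself can be sketched in a few lines. The `condense` helper and the five-word vocabulary below are hypothetical, chosen only to reproduce the first few words of the example above; the real vocabulary would come from the TF-IDF step.

```python
import re


def condense(text, vocabulary):
    # Keep only the tokens that survive in the condensed vocabulary,
    # preserving their original order.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok in vocabulary]


# Hand-picked toy vocabulary (an assumption for illustration).
vocabulary = {"alice", "tired", "sitting", "sister", "bank"}
opening = "Alice was beginning to get very tired of sitting by her sister on the bank"

condensed = condense(opening, vocabulary)
print(" ".join(condensed))  # alice tired sitting sister bank
```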

The condensed versions of the books were encoded as integer arrays and used to train a TensorFlow model. The model was structured as a feed-forward classifier using two sets of embeddings: one for the word tokens and one for the books.
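
To make the encoding step concrete, here is a small sketch of how the inputs to such a classifier might be prepared. How the original model forms its inputs and targets is not described here, so the sliding window below (book id plus a few preceding words as input, the next word as the target) is my assumption, borrowed from the usual doc2vec set-up; all names are hypothetical.

```python
def build_index(vocabulary):
    # Sort the vocabulary so the token -> id mapping is reproducible.
    return {tok: i for i, tok in enumerate(sorted(vocabulary))}


def encode(tokens, index):
    # Replace each token with its integer id, skipping unknown tokens.
    return [index[tok] for tok in tokens if tok in index]


def training_examples(book_id, token_ids, window=2):
    """Yield (book_id, context_ids, target_id) triples.

    The context is the `window` tokens immediately before the target.
    """
    for i in range(window, len(token_ids)):
        yield book_id, tuple(token_ids[i - window:i]), token_ids[i]


index = build_index({"alice", "tired", "sitting", "sister", "bank"})
ids = encode(["alice", "tired", "sitting", "sister", "bank"], index)
examples = list(training_examples(book_id=0, token_ids=ids, window=2))
```

In a set-up like this, the book id would look up a row in the book embedding table and the context ids would look up rows in the word embedding table; the classifier then tries to predict the target word from the two combined.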

If you’ve ever heard about embeddings you’ve probably heard about word2vec. This method represents words as high-dimensional vectors, such that words that are semantically similar will have similar vectors. Doc2vec extends the idea by learning a vector for each document alongside the word vectors.

After training, the books occupy an embedding space, which allows us to measure how close any two of them are using the cosine similarity of their vectors.

Closest to: Alice’s Adventures in Wonderland
> 0.312 Through the Looking-Glass
> 0.295 The Jungle Book
> 0.239 The Time Machine
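
A ranking like the one above can be computed with a short helper. This is a sketch only: the two-dimensional vectors are made up for illustration, whereas the real book embeddings would be read from the trained model.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest(title, embeddings, top_n=3):
    # Score every other book against the query and return the best matches.
    query = embeddings[title]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in embeddings.items()
        if other != title
    ]
    return sorted(scored, reverse=True)[:top_n]


# Made-up embeddings, not values from the trained model.
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.8, 0.6],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
nearest = closest("A", embeddings, top_n=2)
```

Cosine similarity ranges from -1 to 1 and ignores vector length, so two books are compared by the direction of their embeddings alone.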

To achieve our overall goal, we can use t-SNE to project the high-dimensional embedding space down onto a two-dimensional plane. t-SNE is a dimensionality reduction technique that is particularly well suited to the visualization of high-dimensional datasets.
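
One way to run this projection is with scikit-learn, sketched below. The library choice, the embedding width of 64 and every parameter value are my assumptions, and random vectors stand in for the trained book embeddings. Note that the perplexity must be smaller than the number of samples, so with only twenty books it needs to be set well below the default of 30.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the trained embeddings: 20 books, 64 dimensions.
rng = np.random.default_rng(0)
book_vectors = rng.normal(size=(20, 64))

tsne = TSNE(
    n_components=2,    # project onto a two-dimensional plane
    perplexity=5,      # must be less than the number of books
    metric="cosine",   # the same measure used to compare books
    init="random",
    random_state=0,
)
coords = tsne.fit_transform(book_vectors)  # one (x, y) pair per book
```

Each row of `coords` can then be plotted and labelled with its book title to produce the map.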

The result looks like this – pretty cool…

We can clearly see that science fiction dominates the left side of the map, whilst the traditional classics live over to the right. Strange fantasy stuff like Alice in Wonderland and The Wizard of Oz sit in the middle…

Although this is just a toy example using only twenty books, it could very easily scale up to thousands of books or other document types. The query would simply be “find me more like this…”

If you’d like to run this experiment yourself the full source code is available to download in my Gist.

Contact us if you would like to learn more about using AI and machine learning in your business.
