Later that year, the C source code for word2vec that accompanies the paper was open-sourced, and it is now available on Google Code.
The word2vec tool takes a very large text corpus as input and learns vector representations (or embeddings) of the words in it. The resulting word vectors can be used as features in natural language processing or machine learning applications.
The word2vec code has been hugely popular and, as a result, has been ported to other languages including Python and Java. However, to the best of my knowledge there has not been a lightweight C# port of word2vec – so I decided to make one!
For my own purposes I chose to implement the Continuous Bag of Words (CBOW) model rather than Skip-gram; CBOW works just fine for my needs.
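To make the difference between the two models concrete, here is a minimal sketch in Python (not the C# code from this post). CBOW predicts a target word from its surrounding context words, whereas Skip-gram predicts each context word from the target. The function names and the fixed `window` parameter are my own for illustration; the original C tool also shrinks the window randomly per position, which is left out here.

```python
def cbow_examples(tokens, window=2):
    """Build (context_words, target_word) pairs, as used by a CBOW-style model."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        context = left + right
        if context:                               # skip words with no neighbours
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Build (target_word, context_word) pairs, as used by a Skip-gram-style model."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        for word in context:
            pairs.append((target, word))          # one pair per context word
    return pairs


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    for context, target in cbow_examples(sentence, window=1):
        print(context, "->", target)
```

CBOW produces one training example per position (the context vectors are averaged or summed), while Skip-gram produces one per context word, which is part of why CBOW tends to train faster.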
The code below contains three classes:
- Word2vec.cs – this is where the vector representations are learned.
- Model.cs – a simple class showing how to query the word vectors.
- Program.cs – a console application tying it all together.
I trained my model on a 100 MB extract from Wikipedia, which gives really nice results.
A simple way to visualise the learned representations is to list the closest words for a user input. The console application provided displays the closest words and their cosine similarity to the user input.
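The closest-words lookup can be sketched as follows. This is an illustrative Python version, not the Model.cs code itself: the `nearest_words` helper and the three-dimensional toy vectors are made up for the example, and real word2vec vectors typically have a hundred or more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                      # treat a zero vector as unrelated
    return dot / (norm_a * norm_b)


def nearest_words(vectors, query, top_n=5):
    """Return the top_n (word, similarity) pairs closest to `query`."""
    if query not in vectors:
        return []                       # unknown word: nothing to compare
    q = vectors[query]
    scored = [(word, cosine_similarity(q, vec))
              for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for this example; they are NOT learned embeddings.
toy_vectors = {
    "france": [0.9, 0.1, 0.0],
    "spain":  [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for word, score in nearest_words(toy_vectors, "france", top_n=2):
        print(f"{word}\t{score:.4f}")
```

Cosine similarity ignores vector length and compares direction only, so a frequent word with a large vector does not dominate the ranking.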
For example, if you enter 'france', you should see an output similar to this:
Once words are represented as vectors it’s easy to perform standard vector operations, such as addition and subtraction. Research has shown that word vectors capture many linguistic regularities. A couple of famous examples often cited are:
vector('paris') - vector('france') + vector('italy') is close to vector('rome').
vector('king') - vector('man') + vector('woman') is close to vector('queen').
I’ve not included this in the code, but it’s really easy to implement, and I leave it to my readers to do so should they wish.
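For anyone who wants a starting point, here is a minimal sketch of the analogy query in Python. It assumes the vectors are held in a dictionary mapping each word to a list of floats; the `analogy` helper and the two-dimensional toy vectors are invented for illustration and were chosen by hand so that the arithmetic works out exactly, which learned embeddings will only do approximately.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c, top_n=1):
    """Rank words by closeness to vector(b) - vector(a) + vector(c).

    Example: a='man', b='king', c='woman' asks "man is to king as woman is to ?".
    The three input words are excluded from the candidates.
    """
    va, vb, vc = vectors[a], vectors[b], vectors[c]
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    scored = [(word, cosine_similarity(target, vec))
              for word, vec in vectors.items() if word not in (a, b, c)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hand-made toy vectors (NOT learned embeddings): the first axis loosely
# stands for "royalty", the second for "gender".
toy_vectors = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

if __name__ == "__main__":
    print(analogy(toy_vectors, "man", "king", "woman"))
```

Excluding the three input words matters in practice: with real embeddings the nearest neighbour of the result is often one of the inputs themselves.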
I hope you find this interesting – if you have any questions, please post on my Google+ page.