Measuring the similarity of books using TF-IDF, Doc2vec and TensorFlow

In this week’s machine learning adventure we shall be measuring the similarity of books using TF-IDF, doc2vec and TensorFlow. The full source code is available to download in my Gist.

Specifically, we want to use machine learning to map books onto a two-dimensional plane so we can explore a library and potentially find other titles we may be interested in reading.

In this toy example, we’ll be downloading twenty books from Project Gutenberg, including such classics as The War of the Worlds, Pride and Prejudice and Treasure Island.

In my first attempt I used a pure doc2vec solution, but found I was not achieving good separation between titles. It’s easy to see why this may be the case as language can often be incredibly generic – consider the short phrase "she smiled sweetly" - it could appear in pretty much any book.

Many natural language processing solutions remove so called stop words or frequently occurring terms. I decided to go a step further and remove all words that contribute little towards the essence of the story.

By calculating TF-IDF scores for each word in each book, I assembled a condensed global vocabulary - concatenating the top 2500 most important words from each book. The overall condensed vocabulary contained about 25,000 words, so clearly we still retained a good overlap between books even after removing much of the noise.

Top tokens in: The War of the Worlds
> martians 0.04507
> martian 0.02557
> woking 0.01715

Top tokens in: Through the Looking-Glass
> alice 0.18587
> humpty 0.0377
> dumpty 0.0377

Stripping out any words that do not appear in the condensed vocabulary produces a much more concentrated form of the story - gibberish to humans, but machines love it!

For instance, Alice's Adventures in Wonderland would now start like this: alice tired sitting sister bank twice peeped book sister reading pictures conversations book thought alice pictures conversations considering feel sleepy stupid pleasure making daisy chain worth getting picking daisies suddenly rabbit pink ran close remarkable nor alice rabbit itself.
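Producing such a condensed text is just a filter over the vocabulary - a sketch, where the tiny `vocab` set stands in for the real ~25,000-word condensed vocabulary:

```python
import re

# Stand-in for the real ~25,000-word condensed vocabulary.
vocab = {"alice", "tired", "sitting", "sister", "bank", "rabbit"}

def condense(text):
    # Lower-case, tokenise, and keep only words in the condensed vocabulary.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t in vocab]

opening = "Alice was beginning to get very tired of sitting by her sister on the bank"
print(" ".join(condense(opening)))  # -> alice tired sitting sister bank
```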

The condensed versions of the books were encoded to integer arrays and used to train a TensorFlow model. The model was structured as a feed-forward classifier using two sets of embeddings: one for the word tokens and one for the books.
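As a rough illustration of that architecture, here is a minimal TF2/Keras sketch of a two-embedding, feed-forward classifier in the doc2vec (PV-DM) spirit - the layer sizes are illustrative assumptions, not the original post's exact model:

```python
import tensorflow as tf

vocab_size, num_books, dim = 25000, 20, 64

# Two embedding tables: one for word tokens, one for book identities.
word_in = tf.keras.Input(shape=(1,), dtype="int32")
book_in = tf.keras.Input(shape=(1,), dtype="int32")
word_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(vocab_size, dim)(word_in))
book_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(num_books, dim)(book_in))

# Feed-forward classifier over the concatenated embeddings,
# predicting the next word token given (context word, book).
x = tf.keras.layers.Concatenate()([word_vec, book_vec])
logits = tf.keras.layers.Dense(vocab_size)(x)
model = tf.keras.Model([word_in, book_in], logits)
```

After training, the weights of the book `Embedding` layer give one 64-dimensional vector per book - the embedding space we measure distances in below.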

If you’ve ever heard about embeddings you’ve probably heard about word2vec. This method represents words as high-dimensional vectors, such that words that are semantically similar will have similar vectors. Doc2vec is a derivative of this.

After training, the books occupy an embedding space, thus allowing us to measure the distance between them using the cosine similarity of their vectors.

Closest to: Alice's Adventures in Wonderland
> 0.312 Through the Looking-Glass
> 0.295 The Jungle Book
> 0.239 The Time Machine
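A nearest-neighbour list like the one above boils down to cosine similarity over the learned vectors. A sketch with NumPy - the random matrix and generic titles stand in for the real trained book embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
book_emb = rng.standard_normal((20, 64))      # stand-in for trained book embeddings
titles = [f"book_{i}" for i in range(20)]     # hypothetical titles

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = 0  # index of the query title
neighbours = sorted(
    ((cosine(book_emb[query], book_emb[i]), titles[i]) for i in range(20) if i != query),
    reverse=True,
)[:3]
```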

To achieve our overall goal, we can use t-SNE to project the high-dimensional embedding space down onto a two-dimensional plane. t-SNE is a dimensionality reduction technique particularly well suited to visualising high-dimensional datasets.
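With scikit-learn, the projection is a couple of lines - again the random matrix stands in for the trained embeddings, and the perplexity value is an assumption suited to only twenty points:

```python
import numpy as np
from sklearn.manifold import TSNE

book_emb = np.random.default_rng(0).standard_normal((20, 64))  # stand-in embeddings
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(book_emb)
# coords has shape (20, 2): one x/y point per book, ready to scatter-plot with titles.
```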

The result looks like this – pretty cool...

We can clearly see that science fiction dominates the left side of the map, whilst the traditional classics live over to the right. Strange fantasy stuff like Alice in Wonderland and The Wizard of Oz sit in the middle...

Although this is just a toy example, using only twenty books, it could very easily scale up to thousands of books or other document types - the query would simply be "find me more like this..."

If you'd like to run this experiment yourself the full source code is available to download in my Gist.

Contact us if you would like to learn more about using AI and machine learning in your business.

Or, follow us on Twitter for more updates like this.

Celebrity Style Transfer

It’s the weekend and time for more machine learning fun!

This time, we’re going to do some celebrity style transfer - neural style transfer is a machine learning technique for recomposing images in the style of other images.

Take a reference style image, such as an artwork by a famous painter. Then find a suitable content photograph that we will transfer the style onto using some deep learning magic - it's that simple.

So, we are going to need some celebrities – it turns out I’m not really up on celebrity culture. A quick trawl of IMDB tells me Johnny Depp is the world’s most famous celeb. Who else? Kim Kardashian, she’s everywhere... How about Donald Trump and, while we’re at it, Theresa May? If we’re doing the UK government, then we must include the photogenic Michael Gove. Who else? Lewis Hamilton, he’s cool... and, erm, Batman - he's really cool!

Now for style – how about Van Gogh, Leonardo da Vinci (Mona Lisa) and someone modern... Banksy?

Great, now we’ve got that out of the way, let’s train a network. No need to reinvent the wheel - a chap called Anish Athalye from MIT has published a nice open source TensorFlow neural style transfer library which we can use.

Download a few copyright free images and after a couple of hours we have the following results...

Styles, left to right: Van Gogh, Da Vinci, Banksy.

I really like the Banksy Batman, but the overall winner has to be Michael Gove – I said he was photogenic!

Contact us if you would like to learn more about using AI and machine learning in your business.

Or, follow us on Twitter for more updates like this.


Horse Racing and the Wisdom of Crowds

I am a firm believer in the Wisdom of Crowds.

Sometimes if I’m running an event, I like to pass a jar of sweets through the audience and have everybody guess how many sweets it contains. Plotting the results on a graph nearly always produces a normal distribution with its midpoint centred on the true value – uncanny!

So, as a bit of fun, I thought how would this apply to horse racing odds?

There are a wide variety of factors that can affect odds on a horse race, and odds compilers spend time gathering all available information before setting an early price. This price may then fluctuate up to the start of the race, when the 'starting price' becomes fixed. One of the major factors that moves the 'starting price' is how much money has been taken against each horse from punters – bookmakers don’t want to be over exposed should a horse win, so balance the odds accordingly – an example of the wisdom of crowds.

Let’s gather some data and see what it tells us…

Betfair is an online gambling company which operates the world's largest online betting exchange. Betfair use decimal odds, which are a simpler way of expressing a price than traditional fractional odds.

If the decimal odds are 2.2 and you place a back bet of £10 and win, your total return is £10 x 2.2 = £22. This is equivalent to a traditional price of 6/5.
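In code, the decimal-odds arithmetic is trivial - a quick sketch:

```python
def back_return(stake, decimal_odds):
    # Total returned on a winning back bet, stake included.
    return stake * decimal_odds

def implied_probability(decimal_odds):
    # The win probability a decimal price implies.
    return 1.0 / decimal_odds

back_return(10, 2.2)      # 22.0 total return on a £10 bet
implied_probability(2.2)  # ~0.4545, i.e. about a 45% implied chance of winning
```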

Even better, Betfair have a ton of historic pricing going back over ten years, ripe for downloading.

Let’s fire up Python and write a quick script… You can replicate this experiment using the source code contained in my Gist.

The script gathers a list of csv files from Betfair containing the results of the UK Horse Racing Win market over the last ten years. It then counts up the number of wins for each range of 'starting prices' and compares them against the implied probability.
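The counting step might look something like this with pandas - the column names, toy rows and price bands here are illustrative, not Betfair's actual CSV schema:

```python
import pandas as pd

# Illustrative rows: one per runner, with Betfair starting price and outcome.
runners = pd.DataFrame({
    "starting_price": [2.0, 2.1, 2.2, 4.0, 4.1, 4.2, 10.0, 11.0],
    "won":            [1,   0,   1,   0,   0,   1,   0,    0],
})

runners["implied"] = 1.0 / runners["starting_price"]

# Bucket starting prices into ranges, then compare actual win rate
# against the average implied probability in each band.
bands = pd.cut(runners["starting_price"], bins=[1, 3, 6, 20])
summary = runners.groupby(bands, observed=True).agg(
    actual_win_rate=("won", "mean"),
    implied_probability=("implied", "mean"),
)
```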

And the results are as follows...

Probability of Win

An almost perfect match of the averaged actual outcomes against predicted probabilities.

The wisdom of crowds wins again... Well, almost... In most gambling markets there is something called the 'overround', which is the bookmaker's profit margin. If you sum the implied probabilities of every horse in a race, the total comes to more than 100% - that excess is why the bookies are the only long-term winners - don't waste your money!

Contact us if you would like to learn more about using AI and machine learning in your business.

Or, follow us on Twitter for more updates like this.

AI Augmented Creativity

With the rate of technological change faster than ever, companies cannot leave research and development to chance.

The increase in publishing scientists and the number of academic journals has caused an explosion in scientific research and information.

There are now 2.5 million new scientific papers published each year and 90 percent of the world’s information was created in just the last two years.

The vast amount of information available means it is difficult for research and development teams to access and analyse all the information relevant to their project.

Furthermore, it is becoming impossible to have deep expertise spanning a significant number of subject areas.

Fuelled by machine learning, AI enables the rapid processing of data from all industry sectors. This could lead to the discovery of creative and innovative new technologies by identifying patterns not immediately visible to the human eye, instead of relying on an inventor to have the next big idea or, as in the case of penicillin and the microwave, a lucky accident. AI could be a significant aid to product design and development, with nothing left to chance.

In the pharmaceutical industry, AI has been a powerful tool used to identify targets for drug development, sifting through biological data to locate suitable proteins to target. This process would have previously been slow and laborious, but can now be done more efficiently, helping new drugs be discovered and come to market more quickly.

AI techniques, from machine learning to pattern recognition, have already proved helpful in virtually every industry. Healthcare, finance and retail are just a few that are reaping the benefits of advanced cognition capabilities. There is no doubt the boundaries of AI’s role in creative endeavours will be pushed. And while it will never replace the human soul of creativity, AI can certainly offer many benefits serving as a smart, efficient and inspirational assistant.

In 2016, the IBM Watson platform was used for the first AI created film trailer for 20th Century Fox’s horror film, Morgan. During the project Watson analysed the visuals, sound and composition of hundreds of existing horror film trailers. Watson then selected scenes from the completed Morgan film for editors to stitch together into the trailer, ultimately reducing what could have been a process of many weeks to one day.

Machine learning is typically deep rather than broad, so the trick is combining these skills effectively. As a creative toolkit, it can augment an existing creative idea or technique, or make mundane tasks more efficient.

The demand for creativity will also increase. There is no limit to the need for creative content, but the quality of it is limited by people, by time, and by how creative they can be. So, as AI enables these things to become more spontaneous, we will have a fresh army of people developing creative work, and thus creating more demand.

As well as supporting traditional product design - the process of generating as many ideas as possible before narrowing down to a single one - AI can be used to invent entirely new products. Information from a range of sectors can be fed into the AI and processed into a product, converging knowledge into interdisciplinary ideas - much like a polymath would have done during the Renaissance.

AI is already great at performing specific tasks, such as facial and voice recognition, object tracking, or even transposing your face onto someone else’s body, and due to advances in deep learning, computers are starting to learn and frame reality visually in the same way as humans perceive it.

Using techniques such as deep learning has enabled tremendous progress, but AI remains relegated to an assistant role, for now.

Creativity may be the ultimate mission for AI. Already AI has helped write pop ballads, mimicked the styles of great painters and informed creative decisions in filmmaking. Experts wonder, however, how far AI can or should go in the creative process.

We are on the cusp of a revolution in what we can achieve in the field of amazing, immersive, personalised experiences. In the future, the intelligence may be artificial and the reality virtual, but the impact on creativity is very real indeed.

There are lots of similarities between now and a theory about what triggered the Cambrian Explosion. According to that theory, once creatures developed vision that worked, there was a huge acceleration of evolution and the emergence of a wide variety of new forms and behaviours. Today, AI is driving a new kind of evolution, especially with advancements in visual perception and its application to everything from drones to artistic creativity.

AI can distil the wisdom of the crowd and express it as a useful tool. Essentially, it could help us create new art and discover new knowledge as we learn from each other. It is easy to see the time we can save and the productivity we can gain from these applications. It is equally as easy to see how this could keep creative types from getting tied down in tedious tasks so they can have more time for inspiration and doing the thoughtful work of design.

When everyone can attain a certain standard of creative productivity, it will force those who are truly gifted to strive for even higher standards of creativity and originality. The world will become even more beautiful and entertaining as a result.

It is easy for AI to come up with something novel just randomly. But it is very hard to come up with something that is novel and unexpected and useful. Can we take what humans think is beautiful and creative and try to put that into an algorithm? Possibly sooner than you think…

Contact us if you would like to learn more about using AI and machine learning in your business.

Or, follow us on Twitter for more updates like this.

Building Scalable AI Solutions

When designing Artificial Intelligence enabled software to work at scale there are several themes that should be considered. Some apply specifically to AI systems, but many are just good software development practices...

Already AI has countless applications in all sectors, spanning financial services, engineering, healthcare, marketing and legal. It supports an ecosystem poised to transform industry in the next few years, bringing superior medical diagnosis, unbiased brand analysis, broad investment insights and robust fraud detection.

Our own human imagination is now the only limiting factor!


We’ve all seen the amazing feats performed by research AIs: DeepMind’s AlphaGo, Georgia Tech’s Shimon and Google’s Quick, Draw!, but that’s just part of the story – how do we plug this intelligence into real-world applications securely and reliably?

At Robosoup, given our experience building AI enabled systems, we believe development should be approached using the following prioritisation.

Utility

This is the essence of the solution – its raison d'être. The software should add value: it must do something that can’t be done by a human alone, or do it at superior scale, accuracy and/or speed. Our goal is not always to completely automate away human activity, but more generally to create an environment which augments human activity to provide super-human levels of performance. AI is the catalyst that gives humans more time to do what they do best - think creatively.

Usability

Software usability can be described as how effectively new users can use, learn or control the system. At its heart, we’re attempting to provide positive answers to questions like these:

  • Are the most common operations streamlined to be performed quickly?
  • Can new users intuitively learn to use the software without help?
  • Do validation and error messages make sense?

Scalability

Scalability is the ability of software to gracefully meet the demands of increased usage. To achieve this, we follow two guiding principles.

Scale horizontally in the cloud - there is a limit to how large a single server can be, for both physical and virtual machines. There are limits to how well a system can scale horizontally too, but that ceiling is continually being pushed higher. We always target cloud-based virtual servers and databases wherever possible, providing maximum flexibility.

Asynchronous rather than synchronous - we all understand asynchronous communication in the physical world. We send a letter in the mail and sometime later it arrives. Until it does, we are happy in the knowledge it is underway, oblivious to the complexity of the postal system. A similar approach should be taken with our applications. Did a user just hit submit? Tell the user that the submission went well and then process it in the background. Perhaps show the update as if it is already completely done in the meantime.
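The submit-then-process-in-the-background pattern can be sketched with Python's standard library - a fire-and-forget toy; in production this queue would typically be a message broker rather than an in-process `queue.Queue`:

```python
import queue
import threading

jobs = queue.Queue()
processed = []

def worker():
    # Background consumer: drains the queue off the request path.
    while True:
        job = jobs.get()
        processed.append(job)  # stand-in for the heavy processing
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_submit(payload):
    jobs.put(payload)              # enqueue and return immediately
    return "Submission received"   # the user sees success straight away
```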

Efficiency

This is related to, but slightly different from, scalability. What we’re trying to ensure here is that each process, even when run in isolation, makes maximum use of the available resources. Are we using parallel processes where we can? Are we using caches effectively? Caches are essentially stores of precomputed results that we use to avoid computing the same results over and over again.
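In Python, a cache of precomputed results can be as simple as a decorator - a sketch, where the hypothetical `churn_score` stands in for an expensive model inference:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def churn_score(customer_id):
    # Stand-in for an expensive model inference; repeat calls
    # for the same customer are answered from the cache.
    return sum(ord(c) for c in customer_id) % 100

churn_score("acme")
churn_score("acme")  # second call is a cache hit, the model is never touched
```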

Reliability

Avoid the single point of failure. We try never to have just one of anything; we always assume and design for having at least two of everything. This adds cost in terms of additional operational effort and complexity, but we gain tremendously in availability and performance under load. It also forces us into a distributed-first mindset. ‘If you can’t split it, you can’t scale it’, as various people have said, and it’s very true.

You’ve made an investment in machine learning, but how do you get the most from it? Think API first! In addition to pushing work to clients, we view your application as a service API. Clients these days can be an ever-changing mix of smartphones, web sites, line of business systems and desktop applications. The API does not make assumptions about which clients will connect to it, it will be able to serve all of them. And furthermore, you open your service up for future automation.

Security

Given the world we live in, this should be pretty obvious – security means the system’s ability to resist unauthorised attempts at usage or behaviour modification, while still providing service to legitimate users. From an administration perspective this could also mean:

  • Does the system require user or role-based security?
  • Does code access or multi-factor authentication need to occur?
  • What operations need to be secured?
  • Should traffic be encrypted?

These are the edited highlights – situations can and do vary – as solution builders it's very important for us to work closely with clients to establish the correct mix of priorities.

Contact us if you would like to learn more about using machine learning in your business or follow us on Twitter for more updates like this.


CRM: 5 Ways Machine Learning Creates Value

CRM can be a huge expense - implementation, updates and training all add up - does your CRM generate optimal return on investment?

With an expanse of data originating from sales, marketing, customer support and product development, your CRM has the potential to deliver huge business value - but only if you can make sense of all that data. With machine learning, making sense of that data - becoming more efficient and, most importantly, pleasing customers - can now be done at scale.


An intelligent layer, sitting on top of an existing CRM system, can extract insight from your entire dataset and tell the complete story. It comprises three stages:

  • Analyse the past to understand what actions led to great outcomes, such as high customer satisfaction
  • Interpret each new customer interaction and make recommendations to influence a successful outcome
  • Continually update learning based on the most recent set of outcomes, thus remaining relevant without the need for manual changes and inputs

Here are five areas where machine learning can help extend the value of your CRM investment - driving efficiencies without losing the personal touch.

Gain Future Insight
CRM systems are focused on aggregating historical data. However, one of the greatest strengths of machine learning is providing a future-facing predictive view. Machine learning looks at every interaction and makes recommendations on how to next engage with a customer and achieve the best outcome.

Continually Update Process
The world does not stand still and your entire dataset will shift – with new product releases, staff turnover and customer life cycle changes, machine learning will evolve alongside. By automatically interpreting past actions, machine learning eliminates the need to manually set up and maintain rules, continually learning and making recommendations beyond the static analysis typical of a CRM.

Discover “Why”
CRM can help gather all your data into a single pot for a unified view, but it still lacks the insight into why interactions happen in the first place. Even if the CRM flags a high-risk customer, you still need to spend time researching the underlying reasons. Machine learning can help decipher the prediction – by understanding fully why a particular prediction was made, a support agent is more likely to use the information to take the correct action and drive better outcomes.

Customer Level Prediction
CRM is beneficial at reporting on the general health of all customers. However, it starts to fall apart at the individual customer level, where there may be multiple people associated with the customer. Machine learning treats each component of any interaction as a separate data point and that is its power. It can render much richer customer engagement patterns and recommend the right message, for the right person, at the right time, delivering extreme personalisation.

Analysing Unstructured Data
CRM excels at handling structured data like revenue or customer categories, but that is only one piece of the customer jigsaw. Understanding the nuances of unstructured qualitative data, such as email, response templates or meeting notes can be the key to competitive advantage. Machine learning can convert unstructured text into solid data - adding new value to an otherwise elusive email conversation between a customer and support agent. Together with the structured data already captured in the CRM this additional unstructured data becomes a powerful data element, thus driving better outcomes.

With machine learning, you have an opportunity to transform your CRM into a predictive system of intelligence that improves productivity and helps create happy and loyal customers, all the while driving more return on investment from a system you already own.

We specialise in Microsoft Dynamics 365, but are happy to provide advice on other CRM systems.

Please contact us if you would like to learn more about introducing machine learning into your business.

Follow us on Twitter for more updates like this.

Building Word2vec in TensorFlow

The rise of TensorFlow over the past year has been amazing. It is now one of the most popular open source projects on GitHub and certainly the fastest growing deep learning library available. At the time of writing, it has amassed more GitHub stars than Linux, with 42,769 and 40,828 respectively.

It is also incredibly portable, running on a multitude of platforms, ranging from Raspberry Pi, Android and Apple mobile devices through to 64-bit desktop and server systems. Furthermore, in May 2016, Google announced the creation of its tensor processing unit (TPU), a custom ASIC built specifically for machine learning and tailored for TensorFlow, which now operates in its data centres. So the long-term investment and support is there.

The other superstar in the machine-learning world is the word2vec algorithm released by Tomas Mikolov and a team from Google in January 2013. This was based on their paper, “Efficient Estimation of Word Representations in Vector Space”. I have written before about the incredible properties of word embeddings created by this algorithm.

Word2vec and TensorFlow seem like a perfect match, both emerging from Google, the machine-learning equivalent of a supercouple. However, the few implementations I have seen so far have been disappointing, so I decided to write my own.

Some of the key things I wanted to achieve were:

  • a robust method to clean and tokenise text
  • the ability to process very large text files
  • make full use of TensorFlow GPU support for fast training
  • use TensorFlow FIFO queues to eliminate I/O latency
  • simple code that could be used by people learning TensorFlow
  • a way to demonstrate the model once trained
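On the first point, robust cleaning and tokenisation can be a single regular-expression pass - a sketch of the idea (the actual repository code may differ):

```python
import re

def tokenise(text):
    # Lower-case everything and keep alphabetic runs,
    # preserving word-internal apostrophes ("they'd").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokenise("The Martians had landed; surely they'd reached Woking?")
```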

I am very pleased with the result. Even with a basic Nvidia GTX 750 Ti we can process an entire Wikipedia training epoch in less than 4 hours.

As always, the code is on GitHub.

Follow me on Twitter for more updates like this.

Semantically Ordering Word Lists in Python

I am sure most people who read this blog are familiar with word embedding. If not, word embedding is a feature learning technique in natural language processing (NLP) where words or phrases are mapped into vectors of real numbers. The technique was popularised by Tomas Mikolov's word2vec toolkit in 2013.

Today, word embedding finds its way into many real-world applications, including the recent update of Google Translate. Google now use a neural machine translation engine, which translates whole sentences at a time rather than just piece by piece. These sentences are represented by vectors.

Normally, word vectors consist of hundreds, if not thousands, of real numbers, which can capture the intricate detail of a rich language. However, I thought, what would happen if we crush each word's representation down to a single number?

Surely, no meaningful information would remain.

I exported into a CSV file a selection of words and their associated word vectors generated by Athena - my port of word2vec. Here's a truncated view of the 128-dimensional word vectors:

albert_einstein, -2.480522, -1.91355, 0.4266494, 1.927253, -0.5318278, -3.722191, ...
audi, -0.8185719, -1.691721, -0.3929027, 1.698154, -2.953124, 0.9475167, ...
bill_gates, -1.673416, -1.601884, 1.130291, 2.139339, 0.5832655, -2.355634, ...
bmw, -1.027373, -1.668206, -1.728756, 2.338698, -4.249786, 0.4357562, ...
cat, -0.5071716, -2.760615, -2.546596, 0.9999355, -0.1860456, 0.2416906, ...
climb, 0.7150535, 0.1190424, 1.583062, -0.3858005, -3.991093, 1.382508, ...
dog, -0.4773144, -2.224563, -3.67255, 0.5424166, 0.6331441, 1.222993, ...
hotel, 0.3524887,-4.38588, 1.197459, 2.595855, -0.3414235, -0.4427964, ...
house, 0.5532117, -2.279454, -0.2512704, 0.4140477, 2.676775, 0.05087801, ...
monkey, -0.623855, -3.508944, -0.931955, -0.4193407, -0.9044554, 0.347873, ...
mountain, 2.207317, 0.5984201, -1.398792, -0.5220098, -1.344777, 0.3062904, ...
porsche, -0.316146, -1.779519, -0.8431134, 2.44296, -3.680713, 0.874707, ...
river, 3.286445, 2.139661, -1.43799, 2.606198, -2.337485, -0.4348237, ...
school, -3.210236, -3.298275, 3.333953, 0.9878215, 1.926927, -0.1040714, ...
steve_jobs, -2.178778, -2.492632, 1.083596, 1.491468, 0.5440083, -3.330186, ...
swim, 0.8094505, -0.911125, -1.189181, 1.908399, -4.087798, 1.79775, ...
valley, 1.044242, 1.814712, 0.1396747, 0.6305104, -1.227837, -0.389852, ...
walk, 0.5212327, 0.03666721, 0.6227544, 0.6157224, -2.084322, 0.6642563, ...

Next, I crushed each vector down to just a single dimension using Principal Component Analysis (PCA). It's really easy to do this in Python using the scikit-learn library. Here is my code:

from sklearn import decomposition
import pandas as pd

# Read data from file.
data = pd.read_csv('data.csv', header=None)

# Reduce word embeddings to one dimension.
embeddings = data.iloc[:, 1:]
pca = decomposition.PCA(n_components=1)
values = pca.fit_transform(embeddings)

# Build result columns.
labels = pd.Series(data.iloc[:, 0], name='labels')
values = pd.Series(values[:, 0], name='values')

# Build results and sort.
result = pd.concat([labels, values], axis=1)
result = result.sort_values('values')

# Output to console.
print(result.to_string(index=False, header=False))

The results are astonishing!

bmw -12.042180
audi -11.357731
porsche -11.349108
steve_jobs -8.577104
bill_gates -7.390602
albert_einstein -4.910876
monkey -1.285317
house 0.517183
school 0.732882
hotel 1.319551
swim 6.056411
climb 6.943114
walk 7.069848
mountain 9.365861
valley 11.752656
river 15.278058

Even with just a single value representing each word, enough information is retained so that clusters have clearly formed - we have car manufacturers, people, animals, buildings, verbs and geographical features...

What does this mean? No idea..! But, if we can semantically sort words, we can probably do the same for sentences. Would sorting sentences in a document make the flow of ideas more coherent? Maybe..?

Anyway, as always, the code and data for this article are on GitHub.

Follow me on Twitter for more updates like this.

Sentiment Analysis with Python

In this post, I will demonstrate how quick and easy it is to run sentiment analysis on text data - inspiration for this post came from Sirajology - many thanks for your awesome videos!

Sentiment analysis is a process which can determine the emotional tone behind a series of words, used to gain an understanding of attitudes, opinions and emotions expressed within. Sentiment analysis is extremely useful in social media monitoring as it allows us to gain an overview of the wider public opinion behind certain topics.

For this example, we will be using Twitter as a text source - specifically searching for opinions about the latest Star Wars film, Rogue One. We can do this using only twenty lines of Python code, which will execute on Windows or Linux!

First, we must install a couple of Python libraries, if not already present, using PIP:

  • Tweepy for accessing the Twitter API
  • TextBlob for natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more

These are the required PIP commands:

> pip install tweepy
> pip install textblob
> python -m textblob.download_corpora

Should we wish, we can run TextBlob from the Python command line:

> python

>>> from textblob import TextBlob

>>> blob = TextBlob("DarkMind is possibly the worst company I've ever known")

>>> blob.tags
[('DarkMind', 'NNP'), ('is', 'VBZ'), ('possibly', 'RB'), ('the', 'DT'), ('worst', 'JJS'), ('company', 'NN'), ('I', 'PRP'), ("'ve", 'VBP'), ('ever', 'RB'), ('known', 'VBN')]

>>> blob.sentiment
Sentiment(polarity=-0.5, subjectivity=1.0)

>>> blob = TextBlob("Robosoup is one of the most inspirational organisations of our time")

>>> blob.tags
[('Robosoup', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('inspirational', 'JJ'), ('organisations', 'NNS'), ('of', 'IN'), ('our', 'PRP$'), ('time', 'NN')]

>>> blob.sentiment
Sentiment(polarity=0.5, subjectivity=0.75)

As you can see, TextBlob has accurately parsed the two sentences and calculated a sentiment score for each: the first phrase has a negative score of -0.5, the second a positive score of +0.5.

Next, let us incorporate this into a simple Python program. To use the Twitter API, you will first need to generate some authorisation keys with Twitter Apps, which is quick and free.

import tweepy
from textblob import TextBlob

consumer_key = '...YOUR_KEY_HERE...'
consumer_secret = '...YOUR_KEY_HERE...'
access_token = '...YOUR_KEY_HERE...'
access_token_secret = '...YOUR_KEY_HERE...'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

tweets = api.search("#RogueOne OR #StarWars -filter:links -filter:media", lang="en")

for tweet in tweets:
    print(tweet.text)
    analysis = TextBlob(tweet.text)
    print(analysis.sentiment)

Here are the results:

RT @GodfreyElfwick: For 39 years #StarWars hasn't had a single disabled genderqueer black feminist female jedi. This franchise is so out of…
Sentiment(polarity=-0.10952380952380952, subjectivity=0.2785714285714286)

I woke up to find #dumpstarwars was trending and just assumed it must mean Jar Jar Binks appears in #RogueOne
Sentiment(polarity=-0.3125, subjectivity=0.6875)

This time next week, we will all be riding a high of enjoyment after seeing #RogueOne
Sentiment(polarity=0.08, subjectivity=0.26999999999999996)

RT @shaunduke: That's one thing TFA made clear by injecting much needed diversity into the #StarWars universe: this is a story for people w…
Sentiment(polarity=0.15000000000000002, subjectivity=0.29166666666666663)

May the force be with you… but don't force too hard. Or you'll find the droids you're looking for. #StarWars #RogueOne #JediCouncil
Sentiment(polarity=-0.2916666666666667, subjectivity=0.5416666666666666)

RT @chrisjallan: Genuinely so glad none of these #DumpStarWars lot will be at any #RogueOne screenings like. They seem like an absolute nig…
Sentiment(polarity=0.35, subjectivity=0.95)

#DumpStarWars lead female character, evil government, futuristic ideas, unity against oppression #rogueone full of hate from Trumpsters
Sentiment(polarity=-0.36250000000000004, subjectivity=0.6541666666666667)

I'm literally so excited for @swidentities exhibition! This should keep me going til next week! Ahhhhhhhhh #rogueone #StarWarsIdentities
Sentiment(polarity=0.234375, subjectivity=0.375)

RT @ryan_mceachern: I've seen #RogueOne & there aren't any allusions to Trump but this new character Cheetodust McDaughtergroper was pretty…
Sentiment(polarity=0.018181818181818174, subjectivity=0.2772727272727273)

Considering this is only twenty lines of code, the results are pretty impressive. Of course, this was a bit of fun to show how quickly a rough-and-ready sentiment analysis system can be set up. In production, the sentiment system could be trained specifically on your domain, which would yield greater accuracy. The text data may be sourced from emails, product reviews, or pretty much anything else you can think of.
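To illustrate what domain-specific training might look like, here is a minimal sketch of a Naive Bayes sentiment classifier built from scratch in pure Python. The training phrases and labels below are invented for illustration; a real system would use labelled examples drawn from your own emails or product reviews.

```python
# Minimal sketch of a domain-trained sentiment classifier: a Naive
# Bayes model built from the standard library only. The training
# examples are invented placeholders, not real data.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label) pairs. Returns a model tuple."""
    counts = {}          # label -> Counter of word frequencies
    totals = Counter()   # label -> number of training documents
    for text, label in examples:
        counts.setdefault(label, Counter()).update(tokenize(text))
        totals[label] += 1
    vocab = set(w for c in counts.values() for w in c)
    return counts, totals, vocab

def classify(model, text):
    counts, totals, vocab = model
    n_docs = sum(totals.values())
    scores = {}
    for label, word_counts in counts.items():
        # log prior plus log likelihood with add-one smoothing
        score = math.log(totals[label] / n_docs)
        denom = sum(word_counts.values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

examples = [
    ("great service fast delivery", "pos"),
    ("love this product works perfectly", "pos"),
    ("terrible support never again", "neg"),
    ("broken on arrival waste of money", "neg"),
]
model = train(examples)
print(classify(model, "fast delivery love it"))   # pos
print(classify(model, "support was terrible"))    # neg
```

Because the model only ever sees your own vocabulary and labels, it picks up domain phrasing ("broken on arrival") that a general-purpose sentiment lexicon would miss.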

With a little digging, you will find there are a huge number of open source libraries available to accelerate your business with machine learning.

Follow me on Twitter for more updates like this.

Machine Learning: 4 Steps to Improve Business

In this article, I provide a simple, four-step process to improve business profitability using machine learning.

If you have not read my previous articles on the subject, these will provide useful context:

AI Automation Will Save The UK Economy Billions

Grow Your Business with Machine Learning

There is nothing magical about machine learning. Generally, it involves automatically fitting a model to data, with the goal of making useful predictions or decisions. You have been doing this your whole life, ever since you learned at school to fit a straight line through data points on a graph. What is different these days is the sheer scale of the data, both in terms of the number of data points and their dimensionality, and the variety of forms it can take - numerical, text, audio, images, video, etc.
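That straight-line fit is itself the simplest machine learning model. As a sketch, here is ordinary least squares for a line y = a*x + b in plain Python, using invented sample points:

```python
# Least-squares fit of a straight line y = a*x + b to sample points -
# the simplest case of "fitting a model to data".
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    # Intercept: line must pass through the mean point
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)                # 2.0 1.0
```

Everything that follows in modern machine learning is this same idea scaled up: more parameters, more data, and objective functions far richer than a straight line.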

When people talk about machine learning or deep learning, they are talking about the ability of computers to learn objective functions and make decisions based on data. This is hugely important and already impacting practically every industry. Whilst machine learning still has a long way to go before it reaches full potential, we can already achieve an enormous amount with what we have today. This is where forward-thinking business leaders should be focusing.

Over the next five to ten years, the biggest business gains will likely stem from getting the right information to the right people at the right time. Building upon the business intelligence revolution of the past years, machine learning will boost existing pattern-finding abilities and automate value extraction in many areas. So how can your business incorporate it into daily decision-making and long-term planning?

  • First - catalogue your business processes. Look for procedures and decisions that are made routinely and consistently, like routing a customer support query to an agent. Make sure you collect as much data as possible about how the decision was made, along with the data used to make it - this is the kind of information that will be used to train a machine learning algorithm.
  • Second - to begin with, focus on well defined problems. Automation and machine learning work well where the problem is well defined and well understood, and where the available data fully characterises the information necessary to make a decision.
  • Third - drawing on Occam's razor, do not use machine learning where standard business logic will suffice. Machine learning really comes into its own when the underlying business rules, although well defined, follow complex or non-linear patterns.
  • Fourth - if a process is very complicated, use machine learning to create decision support systems. If the objective is too unclear to define, try to create intermediate way-points that will help your teams become more effective in stages. By thinking of machine learning as part of the hierarchical decision-making process, it will drive a better understanding of the problem in the future.

The point is, there is so much that can be done without digging too deep. For now, the majority of your workforce will continue to have a job, and so you can help them to be more productive, working on more interesting and demanding tasks, by automating away the repetitive parts of your business.

It can sometimes be difficult to rethink your business processes, especially if they are "the way we've always done things", but it does not hurt to try. Be patient, as the transformation will not happen overnight. If you get stuck, it may be worth talking to a third party for a fresh perspective. However, once you have dipped a toe in the water and are reaping the benefits of your first successful project, you will be equipped to tackle far more complex problems with machine learning.

Please contact me if you would like to learn more about introducing machine learning into your business.

Follow me on Twitter for more updates like this.