Wednesday, 1 January 2014

Semantic Maps & Automated Text Generation

For the past few years I've been developing something I call Athena. Put simply, Athena reads natural language text documents and builds a semantic map from the information they contain; I call this the dehydration phase. Athena can also reverse the process and automatically generate (near) natural language text from a semantic map; this is the rehydration phase.

To see why this is useful, imagine the following. In the real world, a cartographer surveys a natural landscape, taking measurements between landmarks and other key features, and compresses this information into a two-dimensional representation (commonly known as a map). With this map you, as an explorer, are able to build a mental picture of that landscape without ever having been there. Sure, some information is lost during the mapping process, such as the exact species of flora and fauna or the weather you may encounter, but by generalising you can infer what's likely to be there from previous experience of similar surroundings.

Athena does something analogous to a real-world cartographer. However, rather than creating a flat, two-dimensional map, Athena stores its knowledge in a large, high-dimensional vector array. Real-world landmarks are replaced with key words, phrases and concepts. And unlike the real-world map, where the explorer has to supply prior knowledge of flora and fauna, Athena holds its own generalised set of context information, which is what makes the rehydration phase possible.

Athena combines information from thousands of web pages and documents to form a single semantic map. As each new document is added, the existing map is modified to accommodate the new information. During this process the map is compressed to form the most efficient representation of knowledge. Similar and related concepts are squeezed together, and duplicate and redundant information is squeezed out. And this is where the magic happens: reducing the footprint of Athena's semantic map causes new knowledge to be inferred. Athena will create links between concepts that it has never seen linked in any source document.
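To give a feel for how links can appear between concepts that never shared a document, here is a small Python sketch. It is not Athena's code (which is C# on SQL Server) and it does not use Athena's compression step, which isn't described in detail here. Instead it swaps in a much simpler stand-in: count which terms appear together, then link two terms that never co-occurred if their neighbourhoods look alike. The names add_document and inferred_links, and the 0.5 threshold, are made up for this illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def add_document(cooccurrence, terms):
    """Fold one document's terms into the shared co-occurrence map."""
    for a, b in combinations(sorted(set(terms)), 2):
        cooccurrence[a][b] += 1
        cooccurrence[b][a] += 1


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def inferred_links(cooccurrence, threshold=0.5):
    """Pairs never seen together whose neighbourhoods still look alike."""
    links = []
    for a, b in combinations(sorted(cooccurrence), 2):
        if b in cooccurrence[a]:
            continue  # already linked directly by some document
        score = cosine(cooccurrence[a], cooccurrence[b])
        if score >= threshold:
            links.append((a, b, round(score, 2)))
    return links


# Toy input: two tiny "documents", already reduced to key terms.
semantic_map = defaultdict(lambda: defaultdict(int))
add_document(semantic_map, ["atlas", "robot", "boston"])
add_document(semantic_map, ["wildcat", "robot", "boston"])
print(inferred_links(semantic_map))
```

Here "atlas" and "wildcat" never appear in the same document, yet they end up linked because they keep the same company. A real compression step would be doing something considerably more sophisticated, but the effect is of the same flavour.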

Behind the scenes I’m using SQL Server 2012. The full text engine has been very useful as it provides a stemming algorithm and a list of stop words, as well as the obvious full text search capability. Most of what I need from it is exposed through the sys.dm_fts_parser function. On top of SQL, there’s plenty of hand-written C#, which implements the semantic map generation. Semantic concepts are represented as vectors, and I use cosine similarity to measure how close two vectors are. To establish which terms are important, I combine two signals: isolating title-case nouns and calculating tf–idf (term frequency–inverse document frequency).
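For anyone unfamiliar with tf-idf and cosine similarity, the sketch below shows the textbook versions in Python. It is only an illustration of the two measures, not Athena's C# implementation: it assumes documents have already been tokenised, stemmed and stripped of stop words, and the function names are mine.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse {term: weight} vector per tokenised document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus, already tokenised.
docs = [
    ["robot", "challenge", "google"],
    ["robot", "rescue", "darpa"],
    ["rss", "feed", "reader"],
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # share "robot"
print(cosine_similarity(vectors[0], vectors[2]))  # share nothing
```

Terms that appear in every document get a weight of zero, which is why tf-idf is a handy way of pushing common words into the background and letting the distinctive ones stand out.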

Retrieving information from Athena is as simple as using a search engine. Just type in a word or phrase and Athena will display the relevant portion of its semantic map. Back in the early days of development I used to play a weird game of two-word association with Athena. One day I typed in "sky tank" and Athena replied with "apache gunship", and I knew I was onto something. This is one of the side effects of knowledge compression: Athena performs inference between concepts.
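The word association game can be pictured as a nearest-neighbour lookup. The Python sketch below is a guess at the general shape, not Athena's actual query code: it adds together the vectors for the query words and returns the stored concept with the highest cosine similarity. The concept vectors and their feature names are invented by hand purely so that the example runs; Athena's real vectors come out of the map-building process.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(query, concept_vectors, top_n=1):
    """Return the stored concepts closest to the combined query words."""
    words = query.lower().split()
    combined = {}
    for word in words:
        # Unknown words are simply ignored.
        for key, weight in concept_vectors.get(word, {}).items():
            combined[key] = combined.get(key, 0.0) + weight
    if not combined:
        return []
    ranked = sorted(
        ((cosine(combined, vector), name)
         for name, vector in concept_vectors.items()
         if name not in words),
        reverse=True,
    )
    return [name for _, name in ranked[:top_n]]


# Hand-made toy vectors (hypothetical features and weights).
toy_map = {
    "sky": {"flies": 1.0, "weather": 0.5},
    "tank": {"armoured": 1.0, "weapon": 0.8, "ground": 0.6},
    "apache gunship": {"flies": 0.9, "armoured": 0.7, "weapon": 0.9},
    "weather balloon": {"flies": 0.8, "weather": 0.9},
    "bulldozer": {"ground": 0.9, "armoured": 0.3},
}
print(associate("sky tank", toy_map))
```

With these toy vectors the top match for "sky tank" is "apache gunship", because it is the only concept that overlaps with both query words at once.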

The next step was to scale up. I'm now able to load in complete documents, which Athena will summarise, and then append relevant information that it already holds in its semantic map. In a slight twist on the above, another way to use Athena is to type in a "news headline", real or imagined, and Athena will prepare an appropriate news article.

In the example below, I typed in "Google DARPA robotics challenge", as I knew that Google had entered the recent DARPA Robotics Challenge. The result is a 2D overview graph representing Athena's knowledge.

[Figure: Athena Semantic Map]

Alongside this, Athena generates a text file from which I was able to create a news article. The generated text is not perfect. It's largely a collection of loosely ordered words that have been compressed into Athena's dialect of English. However, it's not much trouble to convert this back into something more human readable.

This is the result after cleaning up manually for a couple of minutes.

  • Google acquires Redwood Robotics and Boston Dynamics.
  • Boston Dynamics makes the Atlas, Wildcat and Big Dog (LS3) robots.
  • Boston Dynamics employs scientists from Carnegie Mellon University.
  • DARPA is the Pentagon’s research and development unit.
  • The DARPA robotics challenge tests machines built to work in disaster areas like Fukushima.
  • The DARPA robotics challenge was held in Miami.
  • Google's Schaft robot wins DARPA challenge.
  • IHMC takes second place in the DARPA challenge.
  • Google Talk is an instant messaging service.
  • Google Reader was an RSS feed aggregator.

There's still a lot of work to do to make Athena into a viable product, but much of the hard work has already been done.