Monday, 6 January 2014

Extracting Plain Text from Web Page HTML in C#

Natural language processing solutions, like Athena, require a good supply of high-quality text.

As well as loading in ad-hoc documents, I’ve given Athena free rein to browse the Internet as required. Its two main sources of information are Wikipedia and BBC News.

Wikipedia is great for providing domain knowledge and key facts, whilst the BBC News site is an excellent source of up-to-the-minute current affairs.

Anybody who has attempted to extract plain text from real-world HTML will know that what should be a simple task can quickly snowball into a mammoth project.

There have been many debates on sites like Stack Overflow on how best to do this. Most people start their journey by using regular expressions (regex) - but this is really only viable with well-formed and simple HTML. Madness soon follows...
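To make that concrete, here is a minimal sketch of the naive regex approach. The names NaiveStripper and StripTags are my own, chosen purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class NaiveStripper
{
    // Removes anything that looks like a tag: '<', then any run of
    // characters that are not '>', then '>'.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", "");
    }
}
```

On tidy markup this behaves: NaiveStripper.StripTags("<p>Hello <b>world</b></p>") gives "Hello world". Feed it "<script>var x = 1;</script><p>News</p>" and you get "var x = 1;News", because the regex has no idea that script content is not article text. Adverts, menus and broken tags fail in similar ways.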

In the real world, HTML is not always well formed and in practice you will also want to ignore such things as adverts, menus and page navigation. To overcome this, you may consider creating a hybrid regex / imperative code parser. Suddenly, this is getting serious...

Luckily, if you’re using C#, you already have the perfect solution in your toolbox - the WebBrowser control in Windows Forms. This control already knows how to render web pages into text and is incredibly tolerant of badly formed HTML.

Using the Document property of the WebBrowser control, which returns an HtmlDocument, you can easily navigate the page to find exactly the clean text portions you’re looking for. And, of course, just because this control sits in the System.Windows.Forms namespace doesn't mean you can’t use it in other types of application - just be sure to add a reference to the System.Windows.Forms assembly. One complication is that the WebBrowser control must be created and used on a single-threaded apartment (STA) thread, which in a console application means giving it its own thread (this is easy to work around).
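If you need to do this in more than one place, the thread handling can be wrapped in a small helper. This is only a sketch: StaRunner and Run are hypothetical names of my own, and it assumes you are running on Windows, where STA threads are supported.

```csharp
using System;
using System.Threading;

static class StaRunner
{
    // Runs 'work' on a new STA thread, waits for it to finish and
    // returns its result. Any exception is captured and rethrown
    // on the calling thread.
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        Thread thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        });

        // Must be set before the thread is started.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();

        if (error != null)
            throw new InvalidOperationException("Work on the STA thread failed.", error);
        return result;
    }
}
```

Any code that creates a WebBrowser can then be passed in as a lambda, for example string text = StaRunner.Run(() => SomeMethodThatUsesWebBrowser()); where SomeMethodThatUsesWebBrowser is a placeholder for your own method.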

In the simple example below, I have created a console application that lets you type in a search phrase. The phrase is sent to Google, the first result that links to the BBC News website is loaded, and the article is returned as relevant, clean, plain text.

Sites like BBC News are very well structured, thanks to their content management system. Therefore, by reading the CSS classname associated with HTML tags, you can easily isolate the information you require.
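One wrinkle worth knowing about: a class attribute can hold several space-separated names, so comparing it to a single name with == only matches when it is the sole class on the element. The helper below is a sketch of a more forgiving check; CssClass and HasClass are hypothetical names of my own.

```csharp
using System;

static class CssClass
{
    // Returns true if 'name' is one of the whitespace-separated
    // class names in 'classAttribute'. The comparison is case-sensitive.
    public static bool HasClass(string classAttribute, string name)
    {
        if (string.IsNullOrEmpty(classAttribute)) return false;

        string[] names = classAttribute.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        return Array.IndexOf(names, name) >= 0;
    }
}
```

With a WebBrowser document you would call it as CssClass.HasClass(div.GetAttribute("classname"), "story-body").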

    using System;
    using System.Text;
    using System.Threading;
    using System.Windows.Forms;

    class Program
    {
        private string _plainText;

        static void Main(string[] args)
        {
            new Program();
        }

        private Program()
        {
            while (true)
            {
                Console.Write("> ");
                string phrase = Console.ReadLine();
                if (phrase == null) break; // End of input: stop instead of looping forever.
                if (phrase.Length > 0)
                {
                    // The WebBrowser control must live on an STA thread.
                    Thread thread = new Thread(new ParameterizedThreadStart(GetPlainText));
                    thread.SetApartmentState(ApartmentState.STA);
                    thread.Start(phrase);
                    thread.Join();
                    Console.WriteLine();
                    Console.WriteLine(_plainText);
                    Console.WriteLine();
                }
            }
        }

        private void GetPlainText(object phrase)
        {
            _plainText = "";
            string uri = null;

            // Search Google, restricted to the BBC News site.
            using (WebBrowser searchBrowser = new WebBrowser())
            {
                // Escape the phrase so spaces and symbols survive in the query string.
                searchBrowser.Url = new Uri(string.Format(@"http://www.google.com/search?as_q={0}&as_sitesearch=www.bbc.co.uk/news", Uri.EscapeDataString((string)phrase)));
                while (searchBrowser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();

                // Take the first link that points at a BBC News page.
                foreach (HtmlElement a in searchBrowser.Document.GetElementsByTagName("A"))
                {
                    string href = a.GetAttribute("href");
                    if (href.StartsWith("http://www.bbc.co.uk/news"))
                    {
                        uri = href;
                        break;
                    }
                }
            }

            // Without this check, a search with no BBC link would load the wrong page or throw.
            if (uri == null)
            {
                _plainText = "No BBC News link found.";
                return;
            }

            StringBuilder sb = new StringBuilder();
            using (WebBrowser webBrowser = new WebBrowser())
            {
                webBrowser.Url = new Uri(uri);
                while (webBrowser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();

                // Pick out the main heading.
                foreach (HtmlElement h1 in webBrowser.Document.GetElementsByTagName("H1"))
                    sb.Append(h1.InnerText + ". ");

                // Select only the article text, ignoring everything else.
                foreach (HtmlElement div in webBrowser.Document.GetElementsByTagName("DIV"))
                    if (div.GetAttribute("classname") == "story-body")
                        foreach (HtmlElement p in div.GetElementsByTagName("P"))
                        {
                            // Keep the introduction and unstyled body paragraphs only.
                            string classname = p.GetAttribute("classname");
                            if (classname == "introduction" || classname == "")
                                sb.Append(p.InnerText + " ");
                        }
            }

            _plainText = sb.ToString();
        }
    }

This is what the result looks like after searching for British Airways...


Happy screen scraping!
John