Wednesday, 4 August 2010

Extract Data From HTML: Reading HTML/XHTML with XPath

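The XPath query below selects the title node in the XHTML, and the line after it assigns that node's text to the title string. XHTML elements belong to the http://www.w3.org/1999/xhtml namespace, so make sure to define a namespace prefix and use it on every element name in the XPath query.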

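// WebRequest.Create needs an absolute URI, including the scheme
String page = HttpGet("http://www.myXhtmlPage.biz");
StringReader sReader = new StringReader(page);

XmlDocument doc = FromHtml(sReader);
// The prefix name is arbitrary; only the namespace URI has to match
XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
nsmgr.AddNamespace("games", "http://www.w3.org/1999/xhtml");

XmlNodeList list = doc.SelectNodes("//games:html/games:head/games:title", nsmgr);
// Guard against pages that have no title element
String title = list.Count > 0 ? list[0].InnerText : String.Empty;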
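
To see why the namespace prefix matters, here is a minimal, self-contained sketch. The inline XHTML string, the class name and the "x" prefix are made up for illustration; a real page would come from HttpGet and FromHtml. It assumes only the standard System.Xml classes.

```csharp
using System;
using System.Xml;

class NamespaceXPathDemo
{
    static void Main()
    {
        // Hypothetical XHTML sample standing in for a downloaded page
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<head><title>Example page</title></head><body/></html>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Unprefixed names in XPath 1.0 match only elements in no namespace,
        // so this query finds nothing in a namespaced XHTML document
        XmlNodeList unprefixed = doc.SelectNodes("/html/head/title");
        Console.WriteLine(unprefixed.Count); // prints 0

        // Bind a prefix (any name works) to the XHTML namespace URI
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        XmlNode title = doc.SelectSingleNode("/x:html/x:head/x:title", nsmgr);
        Console.WriteLine(title != null ? title.InnerText : "(no title)"); // prints Example page
    }
}
```

The first query returns no nodes even though a title element is clearly present; the prefixed query finds it.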



To convert HTML to XHTML I recommend the excellent SgmlReader. This is the code that uses SgmlReader to convert HTML to XHTML (taken from the SgmlReader main page).

XmlDocument FromHtml(TextReader reader) {
    // setup SgmlReader
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}


This is the code to fetch a web page as a string, otherwise known as an HTTP GET.

public static string HttpGet(string URI)
{
    System.Net.WebRequest req = System.Net.WebRequest.Create(URI);
    //req.Proxy = new System.Net.WebProxy("myproxy", true); // true = bypass the proxy for local addresses
    // Dispose the response and reader so the connection is released
    using (System.Net.WebResponse resp = req.GetResponse())
    using (System.IO.StreamReader sr = new System.IO.StreamReader(resp.GetResponseStream()))
    {
        return sr.ReadToEnd().Trim();
    }
}