HTQL - the Hyper-Text Query Language

What is HTQL

Hyper-Text Query Language (HTQL) is a language for querying and transforming HTML, XML, and plain text documents. HTQL is implemented in C++ with fast and efficient data extraction algorithms. It provides COM and Python interfaces for use in JavaScript, Visual Basic, .NET, ASP, and Python applications. HTQL can be used to:
  1. Extract data from HTML Web pages
  2. Retrieve HTML pages over HTTP
  3. Modify HTML pages from applications

Manual

Installation

Python Installation
  1. Windows binaries: Download the htql.zip for your Python version and extract "htql.pyd" into Python's DLLs directory, such as 'C:\Python27\DLLs\' or 'C:\Python32\DLLs\'.
    Python 2.6 htql.zip
    Python 2.7 htql.zip
    Python 3.2 htql.zip

  2. Linux binaries: Copy the htql.so for your platform to your library path.
    Python 2.4, GNU Linux x86.64 (htql.so)
    Python 2.6, GNU Linux x86.64 (htql.so)
    Python 2.6, Debian Linux i686.64 (htql.so)
    Python 2.6, RedHat Linux x86.64 (htql.so)
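
After copying the binary, it can help to confirm that Python will actually find it. The sketch below uses only the standard library and assumes that "library path" here means a directory on Python's module search path (sys.path). The helper name module_location is hypothetical and is not part of HTQL.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    location = module_location("htql")
    if location is None:
        # Nothing named htql is importable; list where Python looked.
        print("htql not found; directories searched:")
        for directory in sys.path:
            print("  ", directory)
    else:
        print("htql will be loaded from", location)
```

If the module is reported as not found, copying htql.pyd or htql.so into one of the listed directories is the likely fix.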

COM Installation

  1. Download the HtqlCom.dll into a local directory, such as 'C:\htql\'.
  2. Register "HtqlCom.dll" by running the following command (on Vista and Windows 7, run cmd.exe as administrator):
    C:\htql\> regsvr32 HtqlCom.dll

Examples

Python Example

A simple example that extracts the URL and text of each link:

import htql

page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
# For every <a> tag, return its href attribute and its text (tx)
query = "<a>:href,tx"

for url, text in htql.HTQL(page, query):
    print(url, text)
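
For comparison, the following is a rough standard-library sketch of what that query extracts: one (href, text) pair per <a> tag. It is not how HTQL works internally, and the names LinkCollector and extract_links are hypothetical; it is only meant to make the shape of the result concrete.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect an (href, text) pair for every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> that is currently open
        self._text = None   # text pieces inside it; None while outside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._text is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href, self._text = None, None


def extract_links(page):
    """Return the list of (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links


page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
for url, text in extract_links(page):
    print(url, text)
```

The HTQL version expresses the same extraction in a single query string, which is the main convenience the language offers.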

An example using htql.Browser:

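import htql

a = htql.Browser()
# Load the start page
b = a.goUrl("http://www.bing.com/")
# Submit the form selected by "<form>1" with the field q set to "test"
c = a.goForm("<form>1", {"q": "test"})
# Print the links on the result page whose text matches '%test%'
for d in htql.HTQL(c[0], "<a (tx like '%test%')>"):
    print(d)
# Follow the link selected by the query
e = a.click("<a (tx like '%test%' and not (href like '/search%'))>1")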
If you have installed IRobotSoft Web Scraper, you can browse the web visually with:

a = htql.Browser(2)

JavaScript Example

The following example shows how to use HTQL from JavaScript in an HTML page. The JavaScript code in this page retrieves the first <a> tag from http://www.ncbi.nlm.nih.gov/ and shows it in the HTML body.

<!--- test.html -->
<html> <base href="http://www.ncbi.nlm.nih.gov/">
<body>
<script language=JavaScript>
	var a= new ActiveXObject("HtqlCom.HtqlControl");
	a.setUrl("http://www.ncbi.nlm.nih.gov/");
	a.setQuery("<a>");
	document.write(a.getValueByIndex(1));
</script>
</body>
</html>

Visual Basic Example

The following Visual Basic example does the same thing and shows the result in a message box:

' VB example
Dim a As Object
Set a = CreateObject("HtqlCom.HtqlControl")
i = a.setUrl("http://www.ncbi.nlm.nih.gov/")
i = a.setQuery("<a>")
MsgBox (a.getValueByIndex(1))
