HTQL - the Hyper-Text Query Language

What is HTQL

Hyper-Text Query Language (HTQL) is a language for querying and transforming HTML, XML, and plain-text documents. HTQL is implemented in C++ with fast and efficient data extraction algorithms. It provides COM and Python interfaces for use in JavaScript, Visual Basic, .NET, ASP, and Python applications. HTQL can be used to:
  1. Extract data from HTML Web pages
  2. Retrieve HTML pages over HTTP
  3. Modify HTML pages from applications
In addition, the HTQL Python package includes several text search libraries that are implemented in C++ and optimized for searching large volumes of data, including regular expression search, dictionary search, similar string search, and text clustering. Please refer to the Python Interface manual for their usage.

Manual

Installation

Python Installation
  1. Windows binaries: Download the htql.zip for your Python version and extract "htql.pyd" to Python's DLLs directory, such as 'C:\Python27\DLLs\' or 'C:\Python32\DLLs\'.
    Python 2.6 htql.zip
    Python 2.7 htql.zip
    Python 3.2 htql.zip
    Python 3.3 htql.zip

  2. Linux binaries: Copy the htql.so built for your Python version and platform to your library path.
    Python 2.4, GNU Linux x86_64 (htql.so)
    Python 2.6, GNU Linux x86_64 (htql.so)
    Python 2.7, GNU Linux x86_64 (htql.so)
    Python 3.3, GNU Linux x86_64 (htql.cpython-33m.so)
    Python 2.6, Debian Linux i686.64 (htql.so)
    Python 2.6, RedHat Linux x86_64 (htql.so)
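
After copying the binary, it can help to confirm that the interpreter actually finds it. The following is a minimal sketch, not part of HTQL: the helper name module_available is hypothetical, and the check only tries a plain import, which works the same way on Python 2 and Python 3.

```python
import sys


def module_available(name):
    """Return True if the module `name` can be imported by this interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# Report the interpreter version, since the binary must match it.
print("Python %d.%d" % (sys.version_info[0], sys.version_info[1]))

if module_available("htql"):
    print("htql is importable")
else:
    # List the directories that were searched, to help place the file.
    print("htql was not found in any of these directories:")
    for entry in sys.path:
        print("  " + entry)
```

If htql is reported as missing, the printed directories are the places where this interpreter looks for extension modules.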

COM Installation

  1. Download the HtqlCom.dll into a local directory, such as 'C:\htql\'.
  2. Register "HtqlCom.dll" by running the command below (on Windows Vista and Windows 7, run cmd.exe as administrator):
    C:\htql\> regsvr32 HtqlCom.dll

Examples

Python Example

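A simple example that extracts the URL and text from links:

import htql

# Three links with unquoted href attributes.
page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
# Select every <a> tag and return its href attribute and its text (tx).
query = "<a>:href,tx"

for url, text in htql.HTQL(page, query):
    print(url, text)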
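
For readers who want to see what this query selects without installing HTQL, here is a rough standard-library sketch of the same extraction. It is not how HTQL works internally; the names LinkCollector and extract_links are hypothetical, and the sketch assumes the query returns one (href, text) pair per <a> tag, as the loop above suggests.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect an (href, text) pair for every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_link = False  # True while inside an open <a> tag
        self._href = None      # href of the currently open <a>
        self._text = []        # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text)))
            self._in_link = False


def extract_links(page):
    """Return a list of (href, text) pairs found in `page`."""
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links


page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
links = extract_links(page)
for url, text in links:
    print(url, text)
# prints:
# a.html 1
# b.html 2
# c.html 3
```

The HTQL version expresses the same selection in a single query string, which is the main convenience of the query language.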

An example using htql.Browser:

import htql
a = htql.Browser()
b = a.goUrl("http://www.bing.com/")
c = a.goForm("<form>1", {"q": "test"})
for d in htql.HTQL(c[0], "<a (tx like '%test%')>"):
    print(d)

e = a.click("<a (tx like '%test%' and not (href like '/search%'))>1")

If you have installed IRobotSoft Web Scraper, you can browse the web visually with:

a = htql.Browser(2)

An example that parses the state and ZIP code from a US address using an HTQL regular expression:

import htql
address = '88-21 64th st , Rego Park , New York 11374'
states=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 
	'Delaware', 'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 
	'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 
	'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 
	'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 
	'Oregon', 'PALAU', 'Pennsylvania', 'PUERTO RICO', 'Rhode Island', 'South Carolina', 'South Dakota', 
	'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 
	'Wyoming']

a = htql.RegEx()
a.setNameSet('states', states)

# Raw strings keep \s and \d intact and avoid invalid-escape warnings.
state_zip1 = a.reSearchStr(address, r"&[s:states][,\s]+\d{5}", case=False)[0]
# state_zip1 = 'New York 11374'

state_zip2 = a.reSearchList(address.split(), r"&[ws:states]<,>?<\d{5}>", case=False)[0]
# state_zip2 = ['New', 'York', '11374']
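
For comparison, the string search can be approximated with Python's built-in re module by expanding the name set into an alternation. This is only a sketch of the idea, assuming that &[s:states] matches any one member of the 'states' set; the helper build_state_zip_pattern is hypothetical and the state list is shortened for brevity.

```python
import re


def build_state_zip_pattern(states):
    """Compile a pattern matching '<state name><commas/spaces><5-digit zip>'."""
    # Try longer names first so 'West Virginia' is preferred over 'Virginia'.
    ordered = sorted(states, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:%s)[,\s]+\d{5}\b" % alternatives, re.IGNORECASE)


# Shortened list for illustration; the full list appears in the example above.
states = ['New Jersey', 'New Mexico', 'New York', 'Virginia', 'West Virginia']
address = '88-21 64th st , Rego Park , New York 11374'

pattern = build_state_zip_pattern(states)
match = pattern.search(address)
state_zip = match.group(0) if match else None
print(state_zip)
# prints: New York 11374
```

With the full list of state names the alternation grows long, which is where a dedicated name-set search such as the one in htql.RegEx is more convenient.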

JavaScript Example

The following example shows the use of HTQL in an HTML page with JavaScript. The JavaScript code in this HTML page retrieves the first <a> tag from http://www.ncbi.nlm.nih.gov/ and shows it in the HTML body.

<!--- test.html -->
<html> <base href="http://www.ncbi.nlm.nih.gov/">
<body>
<script language=JavaScript>
	var a= new ActiveXObject("HtqlCom.HtqlControl");
	a.setUrl("http://www.ncbi.nlm.nih.gov/");
	a.setQuery("<a>");
	document.write(a.getValueByIndex(1));
</script>
</body>
</html>

Visual Basic Example

The following Visual Basic example does the same thing and shows the result in a message box:

' VB example
Dim a As Object
Set a = CreateObject("HtqlCom.HtqlControl")
i = a.setUrl("http://www.ncbi.nlm.nih.gov/")
i = a.setQuery("<a>")
MsgBox (a.getValueByIndex(1))
