José Roca Forum
Author Topic: A Very Small Fast Parser using PB Inline Assembler  (Read 1832 times)
Charles Pegge
Global Moderator
Hero Member
*****

Karma: 19
Online Online

Posts: 648



WWW
« on: May 11, 2007, 10:35:31 AM »

04. August 2006, 11:59:45

This parser is suitable for use in a script engine, or for parsing any data strings
that use symbols as well as words. Call it repeatedly to step through the successive words
and symbols in a string of text.

I have written it to accelerate my scripting language; the machine code is only 251 bytes
long and on the demo test my PC pushes out about 19 million words per second.

The code can be used inline or in its own subroutine or function. It interfaces with PowerBasic
code through just 5 variables.
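
To illustrate the "call it repeatedly" pattern, here is a minimal sketch in Python rather than PB assembler. It is not the posted routine: the names next_token and tokens are hypothetical, and the rules (skip anything at or below ASCII 32, treat runs of letters and digits as words, return every other character as a one-character symbol) are assumptions chosen for the example.

```python
def next_token(text, pos):
    """Return (start, end) of the next word or symbol at or after pos.

    Returns None when only whitespace (or nothing) remains.
    Hypothetical helper: the rules below are assumptions, not the posted code.
    """
    n = len(text)
    # Skip spaces and control characters (anything at or below ASCII 32).
    while pos < n and ord(text[pos]) <= 32:
        pos += 1
    if pos >= n:
        return None
    start = pos
    if text[pos].isalnum():
        # A word: consume the whole run of letters and digits.
        while pos < n and text[pos].isalnum():
            pos += 1
    else:
        # A symbol: returned one character at a time.
        pos += 1
    return start, pos


def tokens(text):
    """Call next_token repeatedly, as a script engine's main loop might."""
    out = []
    pos = 0
    while True:
        span = next_token(text, pos)
        if span is None:
            return out
        start, pos = span          # the end of this token is the next start point
        out.append(text[start:pos])
```

The caller only has to carry the current position from one call to the next, which is what keeps the interface so small.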

For further interest:

Once you understand the basics of how a Pentium works, the Assembler is surprisingly easy.
The rules are very strict but relatively simple compared to high level languages.
With careful incremental coding and testing, you can directly tap into the formidable processing
power of the x86 processors and outperform any compiled code.

Intel documentation for download from
 
http://www.intel.com

Intel Architecture Software Developer’s Manual Volumes 1 to 3
The most useful reference is volume 2 chapter 3.

Also, the PowerBasic Help topic on the inline assembler discusses how PowerBasic variables can
be used with assembler.

Charles Pegge
« Reply #1 on: May 11, 2007, 10:40:26 AM »

05. August 2006, 11:04:07

A Slightly Larger but Faster Parser (PB Assembler)

The machine code is only 124 bytes long, but it also uses a lookup table of 256 bytes. By reducing the
branching logic, this runs about 10% faster than the previous non-tabular version (in excess of 21
million words/sec on my Athlon 3200).

The table gives added ease of configuration and flexibility. The upper half of the table
(entries 128 to 255) may be dispensed with if your source text uses 7-bit ASCII only.
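
The table idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the posted assembler: the class codes SPACE, WORD and SYMBOL and the functions build_table and split_tokens are hypothetical names, and the classification rules are invented for the example.

```python
import string

# Hypothetical class codes; the posted table uses its own values.
SPACE, WORD, SYMBOL = 0, 1, 2


def build_table(extra_word_chars=""):
    """Build a 256-entry table mapping each byte value to a character class."""
    table = bytearray([SYMBOL]) * 256          # default: every byte is a symbol
    for code in range(33):                     # control characters and space
        table[code] = SPACE
    for code in (string.ascii_letters + string.digits).encode("ascii"):
        table[code] = WORD
    for ch in extra_word_chars:                # reconfigure without touching the loop
        table[ord(ch)] = WORD
    return bytes(table)


def split_tokens(data, table):
    """Split a bytes object into words and single symbols using table lookups only."""
    out = []
    pos, n = 0, len(data)
    while pos < n:
        kind = table[data[pos]]                # one lookup replaces a chain of comparisons
        if kind == SPACE:
            pos += 1
            continue
        start = pos
        pos += 1
        if kind == WORD:
            while pos < n and table[data[pos]] == WORD:
                pos += 1
        out.append(data[start:pos])
    return out
```

With the default table, b"my_var+1" splits into my, _, var, + and 1. Calling build_table("_") marks the underscore as a word character, so the same loop returns my_var, + and 1. For 7-bit ASCII input, the first 128 entries (table[:128]) are enough.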

My thanks to all for feedback and Paul Dixon for comments and suggesting the table approach.

discussion:
http://www.powerbasic.com/support/forums/Forum8/HTML/003574.html

Charles Pegge
« Reply #2 on: May 11, 2007, 10:45:10 AM »

 08. August 2006, 15:51:23

A Fast Configurable Parser (PB Assembler)

Taking the concept further with more efficient code and additional flexibility.
You can change the parsing rules by tweaking the codings table.

With the demo code, it can read the entire Bible in about 50-70 milliseconds
and return counts of various word types.
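
As a rough picture of what "counts of various word types" means, here is a small Python sketch. The categories (capitalised, lowercase, number, symbol) and the function name count_token_types are assumptions made for illustration; the demo's actual categories are defined in its own source.

```python
from collections import Counter


def count_token_types(text):
    """Count tokens by a simple, assumed classification."""
    counts = Counter()
    pos, n = 0, len(text)
    while pos < n:
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        start = pos
        if ch.isalpha():
            # A word: classify it by the case of its first letter.
            while pos < n and text[pos].isalpha():
                pos += 1
            kind = "capitalised" if text[start].isupper() else "lowercase"
        elif ch.isdigit():
            while pos < n and text[pos].isdigit():
                pos += 1
            kind = "number"
        else:
            # Any other character counts as a one-character symbol.
            pos += 1
            kind = "symbol"
        counts[kind] += 1
    return counts
```

For example, count_token_types("In the beginning God created the heaven and the earth.") counts 2 capitalised words, 8 lowercase words and 1 symbol.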

The text can optionally be downloaded from:
http://www.talsystems.com/contest12/contest12.zip
It will still perform its default demo without the Bible text.


The parsing routine is now wrapped in a Function, accessible using
conventional PowerBasic, though with a 30% performance overhead.

discussion:
http://www.powerbasic.com/support/forums/Forum8/HTML/003574.html

There are problems in getting an accurate reading of performance, even with
the process set to RealTime priority. However, you get the overall picture.


Update:

Bug identified and fixed: 23 Aug 2006

Affecting quoted text: for ASCII 34 (the double-quote character) to be interpreted as normal punctuation, it should have a tokentype of 35, not 33.
This does not affect this demo but becomes apparent when reading scripts other than the bible12.txt
Updated zip below:
 
 
Charles Pegge
« Reply #3 on: May 11, 2007, 10:48:57 AM »

19. August 2006, 20:19:22

Fast Parser with Word List (PB Assembler)

This code adds further functionality by enabling the use of a Token word list
in addition to the TokenType table. So you can get specific tokens returned
for words specified in the list.
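
A minimal Python sketch of the word-list idea, assuming a plain dictionary stands in for the list. The names WORD_TOKENS, GENERIC_WORD, token_for and tag_words, and the token numbers themselves, are hypothetical and not taken from the posted code.

```python
# Hypothetical word list: each listed word gets its own token number.
WORD_TOKENS = {"if": 1, "then": 2, "else": 3}

# Hypothetical token returned for any word that is not in the list.
GENERIC_WORD = 100


def token_for(word, word_tokens=WORD_TOKENS, default=GENERIC_WORD):
    """Return the specific token for a listed word, else the generic word token."""
    return word_tokens.get(word, default)


def tag_words(words):
    """Pair each word with its token, as a script engine might after parsing."""
    return [(w, token_for(w)) for w in words]
```

For example, tag_words("if x then y".split()) pairs if with 1 and then with 2, while x and y both receive the generic token 100.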

As before, it can read the entire Bible in about 50-70 milliseconds
and return counts of various word types, but also provides counts
on the specified words.

The text can be downloaded from:
http://www.talsystems.com/contest12/contest12.zip

The source code is extensively annotated with a view to easy adaptation
to your own needs. For instance, you may require partial word matching
instead of exact matching, so that "document" will also accept
"documents" or "documentation", returning the same token.
This requires the substitution of only one instruction.
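
The difference between exact and partial matching can be sketched in Python. This assumes that "partial" means the list entry is a prefix of the word in the text; the function names and the token value 7 are hypothetical, and the sketch says nothing about which instruction changes in the assembler.

```python
def match_exact(word, entry):
    """The word must equal the list entry."""
    return word == entry


def match_prefix(word, entry):
    """The word only has to begin with the list entry (assumed meaning of 'partial')."""
    return word.startswith(entry)


def lookup(word, word_list, matches):
    """Return the token of the first list entry that matches, or None."""
    for entry, token in word_list:
        if matches(word, entry):
            return token
    return None


# Hypothetical one-entry word list.
WORD_LIST = [("document", 7)]
```

With match_exact, only "document" returns 7. With match_prefix, "documents" and "documentation" return 7 as well, while "doc" still returns None. Swapping the comparison function is the only change.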

Update:

Bug identified and fixed: 23 Aug 2006
Affecting quoted text: for ASCII 34 (the double-quote character) to be interpreted as normal punctuation, it should have a tokentype of 35, not 33.
This does not affect prior demos but becomes apparent when reading scripts other than the bible12.txt
Updated zip below
Mike Trader
« Reply #4 on: June 22, 2007, 08:09:31 PM »

Charles,
Thank you for this.
Amazingly fast!
I think I can convert it for searching my UDT array. If it ignores all the binary chars, it can search for the text.
I can put the whole UDT array into a dynamic string and search that.
I can figure out which record the text is in from the offset
(each record is a fixed length).
In this way I can search "multiple documents" for a single word.
The question is, how do I search for phrases?
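
The offset-to-record arithmetic described above can be sketched in Python. This is only an illustration of the idea: find_records is a hypothetical name, the records are assumed to be packed back to back with no header, and matches that straddle two records are discarded.

```python
def find_records(blob, record_len, needle):
    """Return the indices of fixed-length records that contain needle.

    blob is all records concatenated into one bytes object.
    """
    hits = []
    pos = blob.find(needle)
    while pos != -1:
        first = pos // record_len                       # record holding the first byte
        last = (pos + len(needle) - 1) // record_len    # record holding the last byte
        # Keep the hit only if it lies inside one record, and list each record once.
        if first == last and (not hits or hits[-1] != first):
            hits.append(first)
        pos = blob.find(needle, pos + 1)
    return hits
```

With three 10-byte records b"alpha     ", b"beta gamma" and b"delta beta" joined together, searching for b"beta" returns [1, 2]. Searching for b"gammadelta" returns an empty list, because that match crosses a record boundary.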
Charles Pegge
« Reply #5 on: June 23, 2007, 09:39:45 AM »

Mike, are you looking for an exact match when searching for a phrase?