### Author Topic: Asm: Hash Code Algorithm  (Read 14471 times)

0 Members and 1 Guest are viewing this topic.

#### Charles Pegge ##### Asm: Hash Code Algorithm
« on: July 29, 2008, 02:18:17 PM »
This function is used to generate a unique code from a word. Part of the hash can then be used to obtain the base indexes of linked-lists for rapid retrieval of data. I use this algorithm in Asmosphere to lookup the opcodes for the instruction set keywords.

The base index is formed from the lower 9 bits of the hash code, giving 512 base locations for a linked list. The full 32 bit hash code is used to get a precice match - traversing the linked list.

Very few steps are required to locate the data. In the case of 300 keywords, most are located within two steps.

PowerBasic
Code: [Select]
`#COMPILE EXE#DIM ALLFUNCTION hashcode(s AS STRING) AS LONG    DIM le AS LONG    DIM p  AS LONG    le=LEN(s):p=STRPTR(s)    ! xor eax,eax    ! mov ecx, le      ' length    ! mov edx, p       ' name pointer    ! mov ah,  cl      ' length    ! mov al,  [edx]   ' 1st letter    ! shl eax, 8    ! mov al,  [edx+1] ' 2nd letter    !  shl eax,8    nextchar:        ! rol eax,1        ! xor al, [edx]        ! inc edx        ! dec ecx        ! jg nextchar    ! mov function,eaxEND FUNCTIONFUNCTION PBMAIN () AS LONGMSGBOX "movups "+HEX\$(hashcode("movups"))END FUNCTION                   `
FreeBasic
Code: [Select]
`function hashcode(s as string) as long    dim as long le    dim as any ptr p    le=len(s):p=strptr(s)    asm    xor eax,eax    mov ecx, [le]    mov edx, [p]     ' name pointer    mov ah,  cl       ' length    mov al,  [edx]   ' 1st letter    shl eax, 8    mov al,  [edx+1] ' 2nd letter    shl eax,8    nextchar:        rol eax,1        xor al, [edx]        inc edx        dec ecx    jg nextchar    mov [function],eax    end asm end function`
« Last Edit: July 31, 2008, 02:28:22 PM by Charles Pegge »

#### Eros Olmi ##### Re: Asm: Hash Code Algorithm
« Reply #1 on: July 29, 2008, 02:36:27 PM »
Charles,

I'm very interested in this function.
As you know I'm using something very similar in thinBasic to maintain and use different Hash Tables for Keywords, variables, ...
Your function is very efficient compared to mine. One problem of mine version is that I need that the hash key is a number between 1 and MaxRange where MaxRange can vary depending on the hash table (every hash table has its own max number of buckets where every bucket is a pointer to a linked list).

The following is my function compare to yours. Do you have any idea on how to improve perfomances (of mine of course yours is already very fast ).

Thanks a lot
Eros

PB Code:
Code: [Select]
`#COMPILE EXE#Dim AllFunction hashcode(s As String) As Long    Dim le As Long    Dim p  As Long    le=Len(s):p=StrPtr(s)    ! mov ecx, le      ' length    ! mov edx, p       ' name pointer    ! mov ah,  cl      ' length    ! mov al,  [edx]   ' 1st letter    ! shl eax, 8    ! mov al,  [edx+1] ' 2nd letter    !  shl eax,8    nextchar:        ! rol eax,1        ! Xor al, [edx]        ! inc edx        ! dec ecx        ! jg nextchar    ! mov Function,eaxEnd FunctionFunction HashTable_Hash(ByVal ptrKeyBuffer As Byte Ptr, ByVal MaxRange As Long) As Dword  Static gdwN   As Dword  gdwN = 0???  Do While @ptrKeyBuffer     gdwN = 31??? * gdwN + @ptrKeyBuffer     Incr ptrKeyBuffer  Loop  Function = gdwN Mod MaxRange + 1&End Function                            Function PBMain () As Long  Dim Counter   As Long  Dim MaxLoops  As Long  Dim T1        As Double  Dim T2        As Double    Dim sKey      As String  Dim pKey      As Byte Ptr      sKey = "movups"  pKey = StrPtr(sKey)  MaxLoops = 1000000      T1 = Timer  For counter = 1 To MaxLoops    hashcode(sKey)  Next  T2 = Timer    MsgBox "Time to perform " & Str\$(MaxLoops) & " loops: " & Format\$(T2 - T1, "#0.000")     T1 = Timer  For counter = 1 To MaxLoops    HashTable_Hash(pKey, 8000)  Next  T2 = Timer    MsgBox "Time to perform " & Str\$(MaxLoops) & " loops: " & Format\$(T2 - T1, "#0.000") End Function`
« Last Edit: July 29, 2008, 02:48:50 PM by Eros Olmi »
thinBasic Script Interpreter - www.thinbasic.com | www.thinbasic.com/community
Win7Pro 64bit - 8GB Ram - Intel i7 M620 2.67GHz - NVIDIA Quadro FX1800M 1GB

#### Charles Pegge ##### Re: Asm: Hash Code Algorithm
« Reply #2 on: July 29, 2008, 04:37:39 PM »
Hi Eros,

Ideally the max number of buckets should be a power of 2 then you simply grab the required number of bits off the lower end of the hash. eg hash and 63 (6 bits) to use as the bucket index.

bucket=hash and 63

which is the same as:

!  mov eax,hash
!  and eax,63
!  mov bucket,eax

Taking a modulus for the address is best avoided because division is expensive (40 clocks), but supposing you needed an index for 96 buckets then you could take the first 6 bits, and add the next 5 bits. This sounds cumbersome but it is about 6 times faster than a modulus.

!  mov edx,hash
!  mov eax,edx
!  and eax,63
!  shr edx,6
!  and edx,31
!  mov bucket,eax

bucket  now contains a value in the range 0..95.

Im am off-PC this afternoon but I hope this gives you a few hints.

#### Eros Olmi ##### Re: Asm: Hash Code Algorithm
« Reply #3 on: July 29, 2008, 04:58:06 PM »
Thanks a lot Charles for the help.

Having the max number of buckets a power of 2 is not a problem at all.
I normally use Hash Tables with a pre-loaded 4000 to 10000 buckets in order to sparse enough keys hash values and avoid collisions (eventually managed by linked lists)

Thanks
Eros
« Last Edit: July 29, 2008, 05:02:16 PM by Eros Olmi »
thinBasic Script Interpreter - www.thinbasic.com | www.thinbasic.com/community
Win7Pro 64bit - 8GB Ram - Intel i7 M620 2.67GHz - NVIDIA Quadro FX1800M 1GB

#### Eros Olmi ##### Re: Asm: Hash Code Algorithm
« Reply #4 on: July 30, 2008, 11:44:38 AM »
I found some publications stating that multiple of 2 hash tables very often result into using the same bucks over and over creating a bad hash distribution.
The best situation seems using a prime number as table size.

In my search almost every hash functions generating a hash value between <1> (or zero) to <HashTableSize> use MOD.

Still searching ...

thinBasic Script Interpreter - www.thinbasic.com | www.thinbasic.com/community
Win7Pro 64bit - 8GB Ram - Intel i7 M620 2.67GHz - NVIDIA Quadro FX1800M 1GB

#### Charles Pegge ##### Re: Asm: Hash Code Algorithm
« Reply #5 on: July 30, 2008, 12:14:58 PM »

I think that depends on the hash algorithm being used. If the low order bits, used for the bucket index  are thoroughly jumbled, the distribution should be okay. I tried a number of algorithms before arriving at this one. However with a prime number modulus system you may be less sensitive to the algorithm itself.

It is quite easy to monitor the performance of the hash table by tallying  the maximum link count.

#### Charles Pegge ##### Re: Asm: Hash Code Algorithm
« Reply #6 on: July 30, 2008, 01:26:09 PM »

I've just done a performance check on the current Asmosphere:

There are 512 buckets and 262 ops to go in them.
only 63 of the ops required links and the maximum number of links for any bucket was 2.

When I doubled the bucket count to 1024 then only 36 ops required links with a max of 2

Applying a modulus to this hashing algorithm makes it perform very badly - it is optimised for taking the lower bits.

#### Charles Pegge ##### Re: Asm: Hash Code Algorithm
« Reply #7 on: July 30, 2008, 09:00:24 PM »
Best hash so far

for min 512 buckets:
262 words of which 44 require 1 link

Code: [Select]
`FUNCTION hashcode(s AS STRING) AS LONG    DIM le AS LONG    DIM p  AS BYTE PTR    le=LEN(s):p=STRPTR(s)    ! xor eax,eax    ! mov ecx, le    ! mov edx, p       ' name pointer    ! mov ah,  cl      ' length    ! mov al,  [edx]   ' 1st letter    ! cmp cl,1    ! jz xhash    ! shl eax,3    ! xor al,  [edx+1] ' 2nd letter    ! shl eax,3    nexh:        ! rol eax,1        ! xor al, [edx]        ! inc edx        ! dec ecx    ! jg nexh    xhash:    ! mov function,eaxEND FUNCTION`
« Last Edit: July 31, 2008, 02:26:04 PM by Charles Pegge »

#### Eros Olmi ##### Re: Asm: Hash Code Algorithm
« Reply #8 on: July 30, 2008, 11:25:05 PM »
Charles,

how do you use the hash value returned by the function in order to be an index of?

Thanks
Eros
thinBasic Script Interpreter - www.thinbasic.com | www.thinbasic.com/community
Win7Pro 64bit - 8GB Ram - Intel i7 M620 2.67GHz - NVIDIA Quadro FX1800M 1GB

#### Charles Pegge ##### Re: Asm: Hash Code Algorithm
« Reply #9 on: July 31, 2008, 12:13:41 AM »
i=hashcode(wd)
v=i and 511

where wd is the keyword i is the hash code and v is the bucket index

This is the instruction lookup function

FreeBasic
Code: [Select]
`function lookupc(wd as string) as long    dim as long v,i    i=hashcode(wd)    v=i and 511    do        if opd(v,0)=i then ' match#            if wd=opn(v) then exit do ' confirm perfect match        end if        v=opd(v,1)        if v=-1 then ert=9:ers="Unidentified instruction: "+wd:exit do ' not found    loop    function=vend function`
« Last Edit: July 31, 2008, 12:16:26 AM by Charles Pegge »

#### Eros Olmi ##### Re: Asm: Hash Code Algorithm
« Reply #10 on: July 31, 2008, 12:43:36 AM »
Ok, I got it, thanks.

You use overflow of keys (same hash values) on the same array where position 0 is the first bucket and position 1 is a link to the next bucket (and so on).
Mine is a much more complex structure with dynamic allocation of buckets and linked lists.
Even if quite good in speed, it is quite complex. Maybe I need to review and simplify it for better maintainability.

Thanks again.
Eros
thinBasic Script Interpreter - www.thinbasic.com | www.thinbasic.com/community
Win7Pro 64bit - 8GB Ram - Intel i7 M620 2.67GHz - NVIDIA Quadro FX1800M 1GB

#### Charles Pegge ##### Re: Asm: Hash Code Algorithm
« Reply #11 on: July 31, 2008, 07:23:43 AM »
Yes Eros, this is a very simple scheme and I get away with it because it is a relatively small fixed set of keywords for the instruction set. But for storing macros &  variables with - scoping, allocation and deallocation, it is a bit more complicated. I am still working out the details for Asmosphere which currently has a very simple backwards searching scheme - not even an alphabetic index. This is fine for assembling small scripts. The Opengl demo is the largest so far at 38K and the compile time is barely noticeable. But when dealing with larger programs with many system calls an efficient hashing scheme will be necessary.

This is what I am planning - using a three part array instead of two. The function returns an index to the global macro list (or -1).
It does not matter if the hashes are not unique - the keyword match is always confirmed.

« Last Edit: July 31, 2008, 02:39:52 PM by Charles Pegge »

#### Charles Pegge ##### Re: Asm: Hash Code Algorithm
« Reply #12 on: July 31, 2008, 02:55:25 PM »

This snippet of code, now operational, shows the hash coder, the lookup function and the hashtable/linked list builder. These are retro-fitted to accelerate a number of search loops in the assembler.

Code: [Select]
`type hashref  h as long   ' hash code  l  as long   ' linked list index  m as long  ' macro list indexend typedim shared mach(4095) as hashrefdim shared as long   mle=512 ' macro link list end boundaryfunction hashcode(s as string) as long    dim as long le    dim as any ptr p    le=len(s):p=strptr(s)    asm    xor eax,eax    mov ecx, [le]    mov edx, [p]     ' name pointer    mov ah,  cl      ' length    mov al,  [edx]   ' 1st letter    cmp cl,1    jz xhash    shl eax,3    xor al,  [edx+1] ' 2nd letter    shl eax,3    nexh:        rol eax,1        xor al, [edx]        inc edx        dec ecx    jnz nexh    xhash:    mov [function],eax    end asm end functionfunction lookupm(wd as string) as long    dim as long f,v,i,k    i=hashcode(wd)    v=i and 511    f=-1 ' find last match in chain    do        if mach(v).h=i then ' match#            k=mach(v).m            if k>=me then mach(v).h=0:mach(v).l=0  ' out of scope so break the chain            if (k<me)and(wd=macs(k,0)) then f=k ' match candidate        end if        v=mach(v).l        if v<512 then exit do ' no more in linked list    loop    function=fend functionsub addm(byval m as long)    dim as long i,j,k,t    i=hashcode(macs(m,0)): if i=0 then ers="zero hash failure: "+macs(m,0):ert=13:exit sub    j=i and 511    do        t+=1        if (mach(j).l<=0)or(mach(j).l>=mle)or(mach(j).m>=me) then            if mle>=4095 then ers="Too many entities to include "+macs(m,0):ert=13:exit do            if (j>=512)or(mach(j).h<>0) then mach(j).l=mle:j=mle:mle+=1            mach(j).h=i:mach(j).l=0:mach(j).m=m: exit do ' added to chain        end if        j=mach(j).l ' next in chain    loop    if t>lmax then lmax=tend sub`