Bay Six Software Forum Index Bay Six Software
Beyond the Basics
 
 FAQFAQ   SearchSearch   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

[Long] Optimal human-readable enumerations

 
Post new topic   Reply to topic    Bay Six Software Forum Index -> Code Optimization
View previous topic :: View next topic  
Author Message
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 797

PostPosted: Jul 17th, 2007, 6:48am    Post subject: [Long] Optimal human-readable enumerations Reply with quote

My struggles with LB's lack of an enumeration type have been epic. However, I think I have the best solution for making something easy to expand, readable in code, and compact to solve the problem of implementing a list of named constants to make code more legible and provide a means of showing the user the name of a constant for error messages.

What is an enumeration?

(If you know, you can skip to the next section.) First you need to know what a constant is. A constant is a named value that is not intended to be changed. LB has a good number of Windows constants predefined for us to use, like _NULL and _SW_SHOW. However, LB doesn't allow us to define out own true constants. We are fprced to use variables, like LVS.REPORT. That's not too bad unless you're using SUBs and FUNCTIONS, then you have to either defines them all GLOBAL or remember to define them inside each SUB or FUNCTION that uses one.

Sometimes you need a bunch of constants. You may not really care what each constant's value is; you just want each one to have a unique value. Many languages have enumeration types.
Code:
(* Pascal *)
TYPE token = ( END_OF_LINE, FOR_STMT, WHILE_STMT );
VAR T: token;
BEGIN
    T := FOR_STMT;
...
// C and C++
enum token { END_OF_LINE, FOR_STMT, WHILE_STMT };
token T = FOR_STMT;
...
' Visual Basic
Enum token
    END_OF_LINE
    FOR_STMT
    WHILE_STMT
End Enum
Dim T As token
...
T = FOR_STMT
In all these instances END_OF_LINE is zero, FOR_STMT is one, and WHILE_STMT is two by default.

What I used in LB

When I wrote my VFScript language in LB a few years back I used STRUCTs as a substitte for enumerated types. It looks something like the following:
Code:
Struct token, EndOfLine As Word, ForStmt As Word, WhileStmt As Word
Compared to the cleanness of the other languages above, this isn't too hot. But that's just the declaration; all of the members of the STRUCT are zero. Now comes the ugliness. I wrote an Enum$ function that gets passed a starting number and the size of the STRUCT as returned by LEN. The function builds and returns a string of WORD-sized integers in the usual little-endian format. That string is then copied to the STRUCT using the RtlMoveMemory API call.

That sure is complicated, but after all of that is done we have ourselves a bunch of constants that can be used anywhere in the program because STRUCTs are global.
Code:
T = token.ForStmt.struct
That doesn't look too bad. It's a little long, but we can live with that, right? An added bonus is that if we mistype the name of a STRUCT element, LB will complain, albeit at runtime. All of the languages shown above would complain about a mistype constant at compile-time (VB too, if you add an Option Explicit).

Besides being difficult to read and edit the declaration, and difficult to understand for almost any person, STRUCTs are really slow. See here for further explanation and code examples.

In addition to parsing tokens, VFScript needed error reporting. I needed to associate some text with each token so when the VFScript programmer did something wrong, VFScript could tell him or her what was wrong. (I suppose I could have made VFScript act like LB itself and just say "Syntax Error," but if you can make a program friendlier, why not do it?) I used an array initialized from DATA statements: yet a bigger nightmare to add new tokens. And now I have to keep two sections of code perfectly synchronized, otherwise the user receives an erroneous error message. Not only is that potentially confusing, it's also ironic. A better way of doing this--and the only way in many languages--is explicit assignments.
Code:
TName$(token.EndOfLine.struct) = "End of line"
TName$(token.ForStmt.struct) = "FOR"
TName$(token.WhileStmt.struct) = "WHILE"
It works, but it seems a bit wasteful to make an array and fill it with potentially hundreds of values when it's rarely even used and just used once per compile. There are ways to do this without using an array, but we still have two sections of code that require some synching.

Some non-solutions

Obviously the fastest expression for LB to evaluate is a literal number (e.g. 12). However, a number in isolation has little meaning, so we want a way to give a pseudo-enumeration some names.

My first thought was to use code comments. I would make a list of comments in a block like so:
Code:
' 5_'End of line
' 10_'FOR
' 15_'WHILE
That way I can be sure that the numbers are unique, and the gaps are for future expansion--a trick any old-time BASIC user should know well. I can then type type my code like this:
Code:
If token = 10_'FOR
Then ...
Or better yet:
Code:
If token = /for/ ...
Then I could go back with Find/Replace and change all of my shorthand to enumerated form, no matter the letter case. What problems could this have? Well, consider that tokens are not always used to test in IF statements. The token variable had to be assigned a value by the tokenizer first and it does it with plain old =. We have to change the form of our enumeration constant.
Code:
token = 10'FOR
This method isn't making anything easier. Let's try something else.

The problem with the preceding method is that LB's (and every BASIC I know) comments are a one-per-line deal. If only LB had an inline form of comment like C (/* */) or Pascal ((* *) or { }). Well, how about this?
Code:
token = val("10 FOR")
...
If token = val("10 FOR") Then
The VAL function gives us a nice inline comment because it will stop parsing a number when it hits the space character and ignore everything after that in the string. What could be the downside? It's slow! In a benchmark comparing VAL to a literal is a magnitude slower. But LB has more than one way to parse a number. How about HEXDEC?
Code:
token = hexdec("A FOR")
...
If token = hexdec("A FOR") Then
Almost amazingly, HEXDEC is more than twice as fast as VAL. Well, it's probably not that amazing just because parsing hex is simpler that parsing decimal, especially given that VAL has to deal with things like -1.23e+45. So now we have it down to about 4.5 times slower than a literal. Is that good enough? I don't think so. Strings are just too slow. We need to avoid them at all cost if possible.

Anyway, by now I was getting pretty annoyed because I realized that none of these methods have yet addressed how to display the text given an enumeration constant. I needed to rethink this.

A viable solution

So now I'm re-evaluating my goals for this algorithm. If strings are too slow, how can I use something numeric to comment on something numeric? Ohh, I have numberic variables I can use. But how to use them?

I had the sneaking suspicion that DATAs were going to be involved. So what come naturally with DATA? Well, they can be separated by commas. Okay, if I put on each DATA line a number and some text that just happens to follow the rules for forming numeric variable names in LB, I might search it with relative ease. But what about the "constant" side of the problem? I'd need some expression with a comma in it--a function call--but what's a function I can pass a number and a numeric variable that will always be zero and get back the number? MAX is the answer.

So then the question is, "Is MAX up to the task?." As it turns out, MAX is only about 1.2 times slower than a literal number in my benchmarks. I think that's plenty fast for what it gives us. We still have to Find/Replace and due to the naming restrictions the messages the user gets won't be as "friendly."

Here's some example code.
Code:
Function TokenName$( Token )
    Do
        Read t, n$
    Loop Until t = Token
    TokenName$ = n$

    Data 5, END.OF.LINE
    Data 10, FOR.STMT
    Data 15, WHILE.STMT
End Function

T = max(10, FOR.STMT)

It's certainly not a perfect solution. Do you think you can do better? If so, post it here and now.

_________________
Brent
Back to top
View user's profile Send private message Send e-mail
Kcproductionz
Guest





PostPosted: Jul 18th, 2007, 3:32am    Post subject: Re: [Long] Optimal human-readable enumerations Reply with quote

What I was thinking about with the RESTORE X thing was

Code:

function Token$(Token)

restore Token
read Token$
exit function

5 data "End of line"
10 data "For"
15 data "While"

end function
Back to top
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 797

PostPosted: Jul 18th, 2007, 4:25am    Post subject: Re: [Long] Optimal human-readable enumerations Reply with quote

(Just so no one gets confused, Dan is referring to my LB 4.03 bug report here.)

Dan, I did try that method. I was sure it wouldn't work, but I felt I should try it anyway. I didn't mention it before because it would have lengthened an already long post--unnecessarily IMO. As I mentioned, the function doesn't need to be fast. I'm sure it wouldn't bother anyone to get their error message in 1/100s instead of 1/10000s.

_________________
Brent
Back to top
View user's profile Send private message Send e-mail
Kcproductionz
Guest





PostPosted: Jul 18th, 2007, 6:03am    Post subject: Re: [Long] Optimal human-readable enumerations Reply with quote

Yah, but this could also go to other applications to where speed would be important, but all well.
Back to top
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 797

PostPosted: Jul 22nd, 2007, 5:43am    Post subject: Re: [Long] Optimal human-readable enumerations Reply with quote

I just want to let you all know what happened when I changed VFScript from using STRUCTs to the above method. I did this just for the runtime, eliminating the VFSOp STRUCT. Below are the demos I used and their relative speed differences.
Quote:
Code:
for i=1 to 1e5
next
1.47:1
Quote:
Code:
for i=1 to 1e4
x=int(1/i)+1
next
1.60:1
Quote:
Code:
i=2
do
x=(1000*rnd(0)) mod i
i=i+1
loop until i=10000
1.67:1
Quote:
Code:
for i=1 to 1e4
a$=a$;chr$(int(224*rnd(0))+32)
next
1.80:1
Quote:
Code:
for i=1 to 2e3
x=1+2+3+4+5+6+7+8+9
y=99-8-7-6-5-4-3-2-1
z=2*3*4*5*6*7*8*9
next
1.92:1

It's certainly not the 2-3 times increase I had expected. If I average these numbers, I have a speed increase close to 1.7 times the old version. Hpwever, judging from the results, I think I can safely conclude that, as the complexity of expressions increases, the new version's advantage also increases. Given a sufficiently complex program, I believe the advantage could surpass 2:1.

_________________
Brent
Back to top
View user's profile Send private message Send e-mail
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 797

PostPosted: Jul 23rd, 2007, 5:34am    Post subject: Re: [Long] Optimal human-readable enumerations Reply with quote

Here's the final (abridged) function.
Code:
Function VFS.TokenName$( Token )
    Restore
    Do : Read t, n$ : Loop Until t = Token
    k = InStr(n$, ".KEYWORD")
    If k Then n$ = Left$(n$, k - 1)
    VFS.TokenName$ = n$
    Exit Function

    Data   5, END.OF.LINE
    Data  10, DO.KEYWORD
    Data  15, FOR.KEYWORD
    Data  20, IF.KEYWORD
    Data  25, CONFIRM.KEYWORD
    '...
    Data 355, ASTERISK.CHARACTER
    Data 360, SLASH.CHARACTER
    Data 365, BACKSLASH.CHARACTER
    Data 370, MOD.KEYWORD
    Data 999, NO.TOKEN
End Function

It removes ".KEYWORD" from a name, if that name contains it, thus avoiding a "FOR.KEYWORD without NEXT.KEYWORD" type of error message.

The other addition may be less obvious in its necessity. During testing I discovered something about DATA inside a SUB or FUNCTION: The data "pointer" that keeps track of rhe last item read is "static," meaning that the pointer is not reset when the SUB or FUNCTION ends. Here's a little demo:
Code:
    for i=1 to 3
        call mysub
    next
    end
sub mysub
    read x$
    print x$
    exit sub
    data one,two,three
end sub

I couldn't find any mention of this behavior in the help file. Did anyone already know this? Would you consider this to be a bug or a feature?

_________________
Brent
Back to top
View user's profile Send private message Send e-mail
Display posts from previous:   
Post new topic   Reply to topic    Bay Six Software Forum Index -> Code Optimization All times are GMT
Page 1 of 1
Jump to:  
Quick Reply
Username:
Message:
   Shortcut keys: Alt+Q to activate, Alt+P to preview, Alt+S to submit
You can post new topics in this forum
You can reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You cannot download files in this forum



Lo-Fi Version
Powered by phpBB © 2001, 2005 phpBB Group