25 May 2017
New Group/Mailinglist
This mailinglist/group is provided by Google and thus requires a Gmail-account. After you have signed in with Google using your Gmail-account you can join the list. Although it is a group, it works much in the same way as the previous mailinglist. Questions and answers are posted through e-mail.
12 May 2017
ANSI, UNICODE, BSTR and converting
For some reason the WStr() routine contained a stupid bug (that I now have fixed).
More info: The number of bytes to read from a BSTR-address was wrong. GFA-BASIC always uses the SysAllocStringLen(Null, lenbytes) when allocating COM String memory. The BSTR returned is preceded by a 32-bits value specifying the BSTR's number of bytes, not the number of characters! This is exactly the value needed when reading the BSTR-bytes into a String datatype using StrPeek(). So, the function should have been: StrPeek(BSTR, {BSTR-4}), see the updated function below.
Another point of confusion was about the number of terminating null-bytes that WStr() returned. The StrPeek() function in WStr() only copies the UNICODE characters from the BSTR to a String, without the two null-bytes that secretly follow a BSTR string. As a result, the UNICODE characters copied to the String datatype are followed by only one (1) null byte; the terminating null byte that each String secretly gets.
When a String of UNICODE characters is to be passed to a Wide API function, two null-bytes must be added 'manually'.
w$ = WStr("GFABASIC") + #0#0 ' assign two nullbytes
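The byte layout described above can be modelled outside GB. The following is a minimal Python sketch, not GB code: the helper names make_bstr_block and str_peek are hypothetical, and offsets into a bytes buffer stand in for real memory addresses. It shows why the 32-bits value in front of the BSTR (a byte count) is exactly the length that has to be read, and that the two terminating null-bytes are not part of it.

```python
import struct

def make_bstr_block(text: str) -> bytes:
    """Model a BSTR allocation: 4-byte byte count, UTF-16LE data, 2 null bytes."""
    data = text.encode("utf-16-le")
    return struct.pack("<I", len(data)) + data + b"\x00\x00"

def str_peek(block: bytes, bstr: int) -> bytes:
    """Mimic StrPeek(BSTR, {BSTR-4}): the byte count sits just before the data."""
    (nbytes,) = struct.unpack_from("<I", block, bstr - 4)
    return block[bstr:bstr + nbytes]

block = make_bstr_block("GFABASIC")
bstr = 4                              # the "address" points at the first character
wide = str_peek(block, bstr)          # 16 bytes, terminator not copied
for_wide_api = wide + b"\x00\x00"     # add the two null-bytes 'manually'
```

Reading the length as a character count (8) instead of a byte count (16) would return only half of the string, which matches the bug described above.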
The post as it was:
In the previous post I discussed UNICODE versus ANSI in the ANSI-based GFA-BASIC. Basically, GB doesn’t support UNICODE because it expects 1-byte characters where strings are used. In UNICODE each character occupies 2 bytes, which allows for more than 256 characters. Conversion from ANSI to UNICODE is ok, but conversion from UNICODE to ANSI might lead to a loss of characters with a value above 255. But there is more: Variants and BSTRs.
The introduction of COM in GB required the provision of a new data type, the Variant. The Variant is a 16-byte data type that holds data and a value that specifies the type of that data (LONG, CARD, DOUBLE, etc.). A Variant can also be used to store (safe-) arrays, a specific COM array type, and BSTRs, special UNICODE strings. So to understand the String and BSTR/Variant in detail ….
How a String is stored
Because a BSTR is much like a GFA-BASIC String data type, I’ll first tell how a GB String is stored. You could skip this part if you already know.
Declaring (Dim) a String-variable introduces a name for a location. The String-variable itself requires four bytes to store a pointer to dynamically allocated memory for the characters. The declaration and assigning a location is handled by the compiler, the rest happens at runtime: assigning or initializing. When the String-variable is initialized a call to malloc() reserves memory for all its characters with an additional 5 bytes. The first 4-bytes are reserved to store the length of the string and the last byte for the null-byte (not included in the length value). After allocating and copying the characters, the address of the first character of the string is stored at the variable’s location, a 32-bits address or pointer.
Global a$ ' 32-bits location(=0) in data or stack
a$ = "GFABASIC" ' assign pointer (address) to location
l = Len(a$) ' address <> 0 return length {address-4}
Clr a$ : a$= "" ' free memory, set locations to 0
- String in memory: [xxxx|cccccc…c|0]
- Initially, the variable is a null pointer, the contents of the variable’s location is 0.
- The String variable points to the address of the first character c.
- Length is stored in position address – 4, and does not include the terminating zero.
Obtaining the string’s length is a 2-step process. First the variable is tested for a non-null pointer and then the value of the preceding 4 bytes (string-address – 4) is returned.
- Clearing a string (or assigning an empty string “”) will free the allocated memory and reset the variable’s contents to 0.
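As an illustration of the layout in the list above, here is a small Python model (not GB code). The names gb_string_block and gb_len are hypothetical, and an offset into a bytes buffer plays the role of the address stored in the String variable.

```python
import struct

def gb_string_block(chars: bytes) -> bytes:
    """[xxxx|cccccc...c|0]: 4-byte length, the characters, one null byte."""
    return struct.pack("<I", len(chars)) + chars + b"\x00"

def gb_len(memory: bytes, address: int) -> int:
    """Two-step length lookup as described in the text."""
    # Step 1: a null pointer means the string was never assigned (or was cleared).
    if address == 0:
        return 0
    # Step 2: the length is stored in the 4 bytes before the first character.
    return struct.unpack_from("<I", memory, address - 4)[0]

memory = gb_string_block(b"GFABASIC")
a = 4    # the variable holds the "address" of the first character
```

The block is 5 bytes larger than the text itself: 4 bytes for the length and 1 for the null-byte, which is not counted in the length.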
BSTR in GB
GB does not provide a data type BSTR, but it provides limited support of hidden BSTRs to pass and obtain BSTR-strings to and from COM objects. GB handles the conversion and memory allocation for BSTRs, but it does not provide string-manipulation functions for BSTRs, or even BSTRs in Variants. More on this below.
BSTR variables are always temporary, hidden local variables used to communicate with COM properties/methods that take or return BSTR arguments. These hidden BSTR variables are always destroyed when leaving a subroutine. Even the Naked attribute won’t prevent the inclusion of the termination code.
BSTR strings are COM based strings. They are allocated from COM-memory and consequently the memory can be managed by both the provider of the COM-object and the client. That is the first difference. Next, a BSTR contains UTF-16 coded wide characters, which I discussed in ANSI and UNICODE. The way COM stores a BSTR is much the same as the way GB stores a String variable. In fact, a BSTR is a 32-bits location that stores a pointer to dynamically allocated memory with UNICODE formatted characters. The length of the BSTR is stored in front of the BSTR, again like GB’s String data type.
Use Variant for BSTR
Although GB provides hidden support for BSTRs, the only way to get access to a BSTR is by using a Variant. The following example assigns a GB-String to a Variant. At runtime the code allocates a BSTR by calling SysAllocStringLen(0, Len(GB-String)) followed by copying the converted GB-String to the returned address. The address of the BSTR together with its data type is stored in the Variant. When the Variant variable goes out of scope, the BSTR from the Variant is released through a call to SysFreeString(address).
Dim vnt1 = "Hello"
Now it gets interesting. After GB has invoked the SysAllocStringLen() COM API, it converts the ANSI string to UNICODE using a private conversion routine that intersperses zeros between the characters (see ANSI and UNICODE). GB does not turn to the MultiByte*() APIs Windows provides, because GB supports ANSI characters only. In the conversion process to UNICODE no characters will be lost and the private function is extremely fast.
An optimized UNICODE conversion function
This knowledge makes it possible to obtain a UNICODE-string (not a BSTR) from a String argument through our own optimized conversion routine. Note:
- A UNICODE string is required if you want to use the Wide version APIs.
- A UNICODE string does not have a length field in front of it. It is not a BSTR. It only specifies how many bytes a character occupies (2).
- Its memory is managed by the program through malloc() – no COM memory - and it ends with two null-bytes (although it seems 1 is ok as well).
- The converted ANSI argument is placed in a String only because it is a convenient data type to store consecutive data.
(The $Export is there because it comes from a .lg32 file).
Function WStr(vnt As Variant) As String Naked ' Return UNICODEd string
  $Export Function WStr "(AnsiString) As String-UNICODE Naked"
  Dim BSTR As Register Long
  BSTR = {V:vnt + 8} ' BSTR address at offset 8
  Return StrPeek(BSTR, {BSTR - 4}) ' updated 28-06-2017
EndFunc
1. A function very well suited for the Naked attribute, because it does not contain local variables that contain dynamically allocated memory that would otherwise require explicit release code.
2. The argument of the function is ByVal As Variant. This forces the caller (calling code) to create a Variant and then pass it by value by pushing 16 bytes (4 DWords) on the stack. Whether the Variant is passed by value or by reference, the calling subroutine is responsible for freeing the BSTR stored in the Variant. However, ByVal is interesting because …
3. The GFABASIC-compiler provides a hidden optimization when you pass a literal string to a ByVal As Variant. A ByVal Variant requires 16 bytes to push on the stack, but the UNICODE characters the Variant points to are already converted at compile time. Therefore the following call is extremely efficient:
Dim t$ = WStr("GFABASIC")
The GFA-BASIC compiler stores the literal string “GFABASIC” as a UNICODE sequence of bytes (2 per character) and does not need to allocate (COM) memory and convert at runtime. This also relieves the caller from releasing the BSTR-COM-memory, so the calling function doesn’t need to execute Variant destruction code.
Assigning a UNICODE formatted string this way, is almost as efficient as initializing a String with an ANSI literal string. It only takes a few cycles to call and execute the WStr() function.
4. The caller provides the String variable to store the return value of the function. That is, the function’s ‘local variable’ WStr is silently declared in the calling subroutine. The hidden string is passed as a ByRef variable to the function. The return value (String) is directly assigned to the hidden variable. If an exception occurs in function WStr(), the termination code of the caller will release the hidden WStr string variable. (Therefore Naked is perfect for this function: it does not need to provide explicit release code.)
5. Inside the function you can see two more optimizations. First, the local Long variable that stores the address of the BSTR is a register variable; no stack memory and copying required. The other optimization, in the earlier version of the function, was the Shl 1 expression that multiplied the length of the BSTR by 2. That results in an integer asm add eax, eax instruction, rather than a floating point multiplication, which is also a significant optimization. The updated function reads the byte count directly from {BSTR - 4}, so the multiplication is no longer needed.
6. Other arithmetic operations like V:vnt+8 and BSTR-4 are relative address operations and are properly compiled into indirect addressing instructions. So, no chance here to optimize.
I went into some detail to explain the function, hoping you’ll find it useful. I hope to tell more about the way the compiler constructs subroutines and performs optimizations.
10 May 2017
Error free using a library
BUG - Runtime errors
When you run a project which includes a library it may generate strange, seemingly unrelated error messages. In particular the error "Hash Internal Error 1/2 (Version?)" pops up regularly. The reason for runtime errors inside the code of a library is a bug(!) in applying the setting for Branch Optimizations.
For a lg32 file, GFA-BASIC wants to apply the Full Optimization for Exe setting on the compiling process. However, it is never applied at all, because the code applies this setting in the wrong place, after the code is compiled ;). Consequently, the compiler switches to the trackbar/slider setting from Branch Optimizations.
This is a bug from a long time ago and it was simply never tested properly.
In general, object code generated for a lg32 file is position independent; it differs from code generated for EXE (and GLL files). Therefore, the lg32-generated code for the jump-tables for Switch/Case statements and On n GoSub/Call statements is wrong (this is also true for a GLL, for which I always use the default settings).
The only setting that works flawlessly is the None setting of the slider in 'Branch Optimizations' with the 'Full Optimization' check box unchecked.
A lg32-file has to be compiled using the default settings for Branch Optimizations. The slider must be set to the first position (None) and the checkbox Full Optimization for Exe must be unchecked.
Note - The slider setting is applied to compiling code in memory, independent of the required output file type (EXE, LG32, or GLL). The rightmost position (Full) is exactly the same as checking the Full Optimization for Exe box. This way you can test fully optimized code inside the IDE.
Note - The branch optimizations of the compiler do not lead to remarkable performance results. These days, with fast CPUs and large caches, a performance increase is hard to provide; the only real performance increase is accomplished by using Naked procedures. Remember however, Naked procedures do not include termination code and do not allow exception handlers.
The $Library statement
The $Library statement loads a lg32 file into memory. But sometimes it cannot locate the lg32 file. The IDE code to find a lg32 file is a bit complicated. In some conditions you may omit the extension and in others you cannot. It depends on the inclusion of a path in the $Library statement. For instance, you may include a relative path (relative to the current directory, mostly the g32-file directory, but not necessarily), but then the extension must be provided. It's all a bit incoherent. But there is a solution that always works correctly, that is, the library is always located properly.
Solution for load errors
This solution adds more functionality to the $Library statement and so it complements the current functionality. You must add a (new) registry entry to the GFA/BASIC key in the HKEY_CURRENT_USER/Software setting. The key must be named "lg32path" and the value can contain multiple full paths separated by commas. (The value uses the same syntax as the PATH environment variable).
New key: "lg32path", REG_SZ
Value: "C:\GFA\Include, D:\GFA\MyLibs"
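To make the idea of a search path concrete, here is a hedged Python sketch of how such a comma-separated value could be used to locate a library. It is not the IDE's actual lookup code, which is not shown in this post: the function name find_lg32 and the rule of appending a missing .lg32 extension are assumptions for illustration only.

```python
import os
from typing import Optional

def find_lg32(name: str, lg32path: str) -> Optional[str]:
    """Return the first existing file 'name' found in the listed folders."""
    # Assumption: a missing extension is completed with .lg32.
    if not name.lower().endswith(".lg32"):
        name += ".lg32"
    # The value holds full paths separated by commas; spaces around them are ignored.
    for folder in lg32path.split(","):
        candidate = os.path.join(folder.strip(), name)
        if os.path.isfile(candidate):
            return candidate
    return None

# Example call with the value shown above:
# find_lg32("mylib", r"C:\GFA\Include, D:\GFA\MyLibs")
```

The folders are tried in the order in which they are listed, so the first match wins.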
Have fun with lg32 files.
ANSI and UNICODE
Updated 21-05-2017: Sample code at the end of the post.
GFA-BASIC 32 only supports ANSI strings, not UNICODE… What exactly does that mean?
ANSI-strings consist of a sequence of bytes – the characters of a string – where each byte represents a character. This allows for 256 different characters because a byte can contain a value between 0 and 255. Restricting strings to bytes limits the number of possible characters, which is mostly a problem for non-Western languages. To allow for more characters each character in a string must somehow occupy more than one byte. In Windows, each Unicode character is encoded using UTF-16 (where UTF is an acronym for Unicode Transformation Format). UTF-16 encodes each character as 2 bytes (or 16 bits). UTF-16 is a compromise between saving space and providing ease of coding. It is used throughout Windows, including .NET and COM.
In UNICODE the lower 256 values represent the same characters as in ANSI, but they are stored as a sequence of 16-bits integers, so each character requires 2 bytes of storage. Additional characters are represented with values above 255. When you convert an ANSI string to UNICODE it becomes twice the size of the ANSI string.
Let’s see what this means from a GFA-BASIC perspective.
ANSI in a GB String
When you store a literal string like “GFABASIC” in a String (ANSI, 1-byte representation), the string is filled with 8 bytes of (hexadecimal) values 47 46 41 42 41 53 49 43.
a_t$ = "GFABASIC" ' 47 46 41 42 41 53 49 43
The same string can be created by using Chr$() and populate these byte values. (A more general approach would be to use the Mk1$() function):
a_t$ = Chr($47, $46, $41, $42, $41, $53, $49, $43)
a_t$ = Mk1($47, $46, $41, $42, $41, $53, $49, $43)
GFA-BASIC’s string functions expect ANSI strings only, and by default GB only communicates with the ANSI version of the Windows API functions. With a little knowledge you can do more.
Windows APIs are UNICODE
Windows is a UNICODE system. When a Windows API takes a string as a parameter, Windows always provides two versions of the same API: one for ANSI strings and one for UNICODE strings. To differentiate between ANSI and UNICODE, the name of the API function ends either with an uppercase A, for ANSI parameters, or with an uppercase W, for the version that accepts or expects UNICODE. A typical example would be the SetWindowText() API which comes in two flavors, SetWindowTextA() and SetWindowTextW().
GFA-BASIC’s built-in APIs are the ones that map to the functions that end with A. So the GB function SetWindowText() maps to the SetWindowTextA() function.
UNICODE in a GB String
By default, when you declare a literal string in your source code, the compiler turns the string's characters into an array of 8-bit data types, the String. You can not – in the same way - declare a literal UNICODE string. To assign a sequence of 2-byte characters you’ll need to use different methods. For instance by populating a String by hand. In the example above it only takes one change to create a UNICODE array of characters. Simply change the Mk1() function to Mk2():
u_t$ = Mk2($47, $46, $41, $42, $41, $53, $49, $43) + #0
Now each character occupies 2 bytes and has become UNICODE formatted, because it encodes each character using UTF-16, interspersing zero bytes between every ASCII character, like so:
u_t$ = Chr($47,0, $46,0, $41,0, $42,0, $41,0, $53,0, $49,0, $43,0) + #0
A GB String data type always adds a null-byte (only one) to zero-terminate the sequence of characters. Since the above assignments are GB controlled, the strings end with only one null-byte. UNICODE should end with two null-bytes. You should explicitly add an additional null at the end of the string to properly create a UNICODE string.
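The effect of interspersing zero bytes can be checked with a short Python sketch (an illustration outside GB; the helper name widen is hypothetical). For plain ASCII text such as "GFABASIC" the result is identical to the UTF-16 little-endian encoding.

```python
def widen(ansi: bytes) -> bytes:
    """Put a zero byte after every ANSI byte (16-bit units, low byte first)."""
    out = bytearray()
    for b in ansi:
        out += bytes((b, 0))
    return bytes(out)

u_t = widen(b"GFABASIC")          # 47 00 46 00 41 00 ...
terminated = u_t + b"\x00\x00"    # a wide string ends with two null-bytes
```

The converted text is twice the size of the ANSI text, and the two terminating null-bytes together form one 16-bit zero character.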
UNICODE is not BSTR
Note that we simply created a piece of memory to store characters in 2-bytes rather than in 1-byte. The String memory is allocated from the program’s global heap and this memory is only guarded by GB. Although the string contains UNICODE it is not a BSTR. A BSTR is a COM defined string type and is allocated from COM-memory. Both the client (a GB-program) and the provider/server have access to the same COM-memory.
When a string is assigned to a Variant, which supports BSTRs only, GB allocates COM string memory and converts the ANSI string to UNICODE.
Using pure UNICODE
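The GFA-BASIC string-functions use a 1-byte character indexing system. However, you can overcome this limitation for 2-byte formatted strings and apply GB String-functions when you multiply the length parameters by 2 and convert a character position to a byte position. Because positions start at 1, the character at position ipos starts at byte ipos * 2 - 1. For instance:
u_t$ = Left(u_t$, ipos * 2) + #0 ' first ipos characters
u_t$ = Mid(u_t$, ipos * 2 - 1, nBytes * 2) + #0 ' nBytes characters starting at character ipos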
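The same index arithmetic can be tried out in Python on a buffer of UTF-16 bytes. This is only a sketch of the idea: w_left and w_mid are hypothetical helpers, and Python slices are 0-based, so the start offset here is simply the character index times 2.

```python
def w_left(w: bytes, nchars: int) -> bytes:
    """First nchars 2-byte characters."""
    return w[: nchars * 2]

def w_mid(w: bytes, start: int, nchars: int) -> bytes:
    """nchars 2-byte characters, beginning at the 0-based character index start."""
    return w[start * 2 : (start + nchars) * 2]

w = "GFABASIC".encode("utf-16-le")
left3 = w_left(w, 3).decode("utf-16-le")
mid5 = w_mid(w, 3, 5).decode("utf-16-le")
```

As long as every offset and length is an even number of bytes, the slices always fall on character boundaries.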
You can pass these UNICODE formatted strings to APIs that end with uppercase W. To introduce the wide character APIs to your code you must Declare them explicitly. For instance, this code displays u_t$ in the client area of a window.
Declare Function TextOutW Lib "gdi32.dll" Alias "TextOutW" ( _
  ByVal hdc As Handle, // handle to DC _
  ByVal nXStart As Int, // x-coordinate of starting position _
  ByVal nYStart As Int, // y-coordinate of starting position _
  ByVal lpwString As Long, // character string _
  ByVal cbString As Long // number of characters _
  ) As Long
Form frm1
TextOutW(frm1.hDC, 1, 1, V:u_t$, Len(u_t$) / 2)
Remember one thing. Windows uses UNICODE only, including fonts. Whether you use TextOutW or TextOutA (as Text does), all output is performed using UNICODE fonts. TextOutA first converts the text to UNICODE and then invokes TextOutW. Providing a UNICODE formatted string to a W-version API only skips the conversion from ANSI. See below for an example.
Obtaining UNICODE text from Windows APIs
Since XP, all Windows APIs taking or returning a string parameter are implemented in UNICODE only. The ANSI versions of these functions translate (or convert) the ANSI strings to and from UNICODE format. Well, GB only handles ANSI strings; it passes and retrieves ANSI strings to and from Windows APIs. What is the consequence of this restriction?
When an ANSI string is passed to an A-version of an API, the Windows API will convert the string to UNICODE and then invoke the W-version of that API. There is no loss of information in this conversion. All ANSI characters are converted to UNICODE by expanding the string with zeros as explained above. The string-size is doubled, but that’s all.
The other way around is more problematic. A Windows API may return or provide a UNICODE formatted string containing non-ANSI characters, characters with a 2-byte value above 255. When the A-version of the API is used to retrieve text, Windows will do the UNICODE-to-ANSI conversion on behalf of the A-version of that API and the characters with a value above 255 will be lost.
This won’t be a problem if the ANSI-based GFA-BASIC program is used only with languages that use the Latin (English) alphabet. In other languages the Windows system accepts more characters and the text won’t be properly returned to the GFA-BASIC String data type.
When your program needs UNICODE input or uses UNICODE strings you should explicitly declare all the required wide APIs. In addition, you might also need W replacements for the GDI text-out functions. To use the GB string functions, you should remember to multiply or divide all integer arguments by 2.
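The loss of characters can be demonstrated with a small Python sketch. It assumes the Western code page 1252 as the ANSI code page and substitutes a question mark for every character that has no ANSI equivalent; the substitution that Windows itself performs in an A-version API may differ in detail.

```python
text = "GFA \u03a9 BASIC"                 # contains the Greek capital Omega (U+03A9)

wide = text.encode("utf-16-le")           # every character kept, 2 bytes each
ansi = text.encode("cp1252", errors="replace")   # Omega has no code page 1252 byte
round_trip = ansi.decode("cp1252")        # the Omega does not come back
```

The UNICODE version keeps all 11 characters in 22 bytes, while the ANSI version is 11 bytes long and has permanently replaced the Omega.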
Displaying UNICODE glyph characters (updated 21-05-2017)
Windows 10 includes and uses a new graphical font: Segoe MDL2 Assets. This sample shows how to obtain the glyphs from the font icons for use in GB.
In the accessory Special Characters select Segoe MDL2 Assets and then select a graphical character. Write down the 16-bit value from the box at the bottom and assign it to a String. Here the value for the picture for saving is 0xE105.
Form frm1
ScaleMode = basPixels ' by default
SetFont "Segoe MDL2 Assets"
' Display UNICODE string "GFABASIC"
Dim u_t$ = Mk2($47, $46, $41, $42, $41, $53, $49, $43) + #0
TextW 1, 1, u_t$
' Get a Picture Object from a glyph.
' Char-value from 'Special Characters' Accessory
Dim hBmp As Handle, p As Picture
Dim size As SIZE
u_t$ = Mk2(0xE105) ' the Save-glyph
TextW 1, 31, u_t$ ' show it
TextSizeW(Me.hDC, V:u_t$, Len(u_t$) / 2, size)
Get 1, 31, 1 + size.cx, 31 + size.cy, hBmp ' a GDI-handle
Put 50, 1, hBmp ' and test it
Set p = CreatePicture(hBmp, True) ' into a Picture
PaintPicture p, 70, 1 ' and test it
Do
  Sleep
Until Me Is Nothing
Proc TextW(x As Int, y As Int, wstr As String)
  ' Assume Scalemode = basPixels, ScaleLeft=0, and ScaleTop=0
  TextOutW(Me.hDC, x, y, V:wstr, Len(wstr) / 2)
  ' If AutoRedraw == True draw on bitmap.
  If Me.hDC2 Then TextOutW(Me.hDC2, x, y, V:wstr, Len(wstr) / 2)
EndProc
Declare Function TextOutW Lib "gdi32.dll" Alias "TextOutW" ( _
  ByVal hdc As Handle, // handle to DC _
  ByVal nXStart As Int, // x-coordinate of starting position _
  ByVal nYStart As Int, // y-coordinate of starting position _
  ByVal lpwString As Long, // character string _
  ByVal cbString As Long // number of characters _
  ) As Long
Declare Function TextSizeW Lib "gdi32.dll" Alias "GetTextExtentPoint32W" ( _
  ByVal hdc As Handle, // handle to DC _
  ByVal lpString As Long, // text string _
  ByVal cbString As Int, // characters in string _
  ByRef lpSize As SIZE // string size _
  ) As Long
Type SIZE
  - Long cx, cy
EndType
A few notes about this sample (compared to the previous version).
- The Segoe MDL2 Assets font is not a fixed-sized font (the LOGFONT member lfPitchAndFamily is not FIXED_PITCH). However, the glyphs in the font all have the same format. To obtain the size of a glyph-character we cannot use the ANSI GB functions TextWidth() and TextHeight(), since they cannot return the size of a 2-byte character. Hence the inclusion of the TextSizeW() function.
- To conform to GB’s scaling the TextOutW function should take coordinates in the current ScaleMode and the text-output should obey the ScaleLeft and ScaleTop settings. In this sample TextW simply draws on a pixel resolution scale and relative to (0,0), located at the top-left of the client area. Note however that Get and Put actually use the current scaling. Be sure to use the same ScaleMode for both the GB commands and the API functions. The sample works as long as ScaleMode = basPixels (the default scaling in GFA-BASIC; VB uses twips, do not confuse the two).
Finally, the return values of ScaleLeft and ScaleTop are wrong (all versions below Build 1200). I hope to update the GfaWin23.ocx as soon as possible.