12 May 2017

ANSI, UNICODE, BSTR and converting

Update 28-06-2017 - Conversion from Ansi to Unicode: WStr() function.
For some reason the WStr() routine contained a stupid bug (that I now have fixed).

More info: The number of bytes to read from a BSTR-address was wrong. GFA-BASIC always uses SysAllocStringLen(Null, lenbytes) when allocating COM String memory. The BSTR returned is preceded by a 32-bit value specifying the BSTR's number of bytes, not the number of characters! This is exactly the value needed when reading the BSTR bytes into a String datatype using StrPeek(). So the function should have been: StrPeek(BSTR, {BSTR-4}); see the updated function below.

Another point of confusion was about the number of terminating null-bytes that WStr() returned. The StrPeek() function in WStr() only copies the UNICODE characters from the BSTR to a String, without the two null-bytes that secretly follow a BSTR string. As a result, the UNICODE characters copied to the String datatype are followed by only one (1) null byte; the terminating null byte that each String secretly gets.
When a String of UNICODE characters is to be passed to a Wide API function, two null-bytes must be added 'manually'.
w$ = WStr("GFABASIC") + #0#0 ' assign two nullbytes
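The byte-count prefix and the terminators can be modelled outside GFA-BASIC. The following is a minimal Python sketch, not GB code: make_bstr_block() and str_peek() are hypothetical helpers that only imitate the layout described above (4-byte byte count, UTF-16 characters, two null bytes), and the "address" is simply an offset into a bytes object.

```python
import struct

def make_bstr_block(text):
    """Model of a BSTR allocation: 4-byte byte count, UTF-16 data, 2 null bytes."""
    data = text.encode("utf-16-le")
    return struct.pack("<I", len(data)) + data + b"\x00\x00"

def str_peek(block, start, count):
    """Stand-in for StrPeek(): copy `count` bytes starting at `start`."""
    return block[start:start + count]

block = make_bstr_block("GFABASIC")
bstr = 4  # the BSTR "address" points just past the length prefix

# {BSTR - 4}: the prefix holds the number of bytes, not characters
byte_count = struct.unpack_from("<I", block, bstr - 4)[0]
payload = str_peek(block, bstr, byte_count)

# StrPeek() copies the characters only; a Wide API needs a 16-bit null,
# so two null bytes are appended, as in WStr("GFABASIC") + #0#0
for_wide_api = payload + b"\x00\x00"

print(byte_count, len(payload), len(for_wide_api))
```

For the 8-character literal the prefix reads 16, which is why no further multiplication by 2 is needed when copying.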

The post as it was:
In the previous post I discussed UNICODE versus ANSI in the ANSI-based GFA-BASIC. Basically, GB doesn’t support UNICODE because it expects 1-byte characters where strings are used. In UNICODE each character occupies 2 bytes, which allows more than 256 characters. Conversion from ANSI to UNICODE is safe, but conversion from UNICODE to ANSI might lose characters with a value above 255. But there is more: Variants and BSTRs.
The introduction of COM in GB required the provision of a new data type, the Variant. The Variant is a 16-byte data type that holds data and a value that specifies the type of that data (LONG, CARD, DOUBLE, etc.). A Variant can also be used to store (safe-)arrays, a specific COM array type, and BSTRs, special UNICODE strings. So to understand the String and BSTR/Variant in detail …

How a String is stored
Because a BSTR is much like a GFA-BASIC String data type, I’ll first tell how a GB String is stored. You could skip this part if you already know.
Declaring (Dim) a String-variable introduces a name for a location. The String-variable itself requires four bytes to store a pointer to dynamically allocated memory for the characters. The declaration and the assignment of a location are handled by the compiler; the rest happens at runtime: assigning or initializing. When the String-variable is initialized, a call to malloc() reserves memory for all its characters plus an additional 5 bytes. The first 4 bytes store the length of the string and the last byte holds the null-byte (not included in the length value). After allocating and copying the characters, the address of the first character of the string is stored at the variable’s location as a 32-bit address or pointer.
Global a$       ' 32-bits location(=0) in data or stack
a$ = "GFABASIC" ' assign pointer (address) to location
l = Len(a$)     ' address <> 0 return length {address-4}
Clr a$ : a$= "" ' free memory, set locations to 0
- String in memory: [xxxx|cccccc…c|0]
- Initially, the variable is a null pointer, the contents of the variable’s location is 0.
- The String variable points to the address of the first character c.
- Length is stored in position address – 4, and does not include the terminating zero.

Obtaining the string’s length is a 2-step process. First the variable is tested for a non-null pointer and then the value of the preceding 4 bytes (string-address – 4) is returned.
- Clearing a string (or assigning an empty string “”) will free the allocated memory and reset the variable’s contents to 0.
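The layout [xxxx|cccccc…c|0] and the 2-step length lookup can be sketched as follows. This is an illustrative Python model, not GB runtime code: gb_string_block() and gb_len() are hypothetical names, and the "address" is an offset into a bytes object rather than a real pointer.

```python
import struct

def gb_string_block(text):
    """Model of the malloc'ed block: [4-byte length | characters | 0]."""
    chars = text.encode("latin-1")  # one byte per character
    return struct.pack("<I", len(chars)) + chars + b"\x00"

def gb_len(block, address):
    """Model of Len(): a null pointer gives 0, otherwise read {address - 4}."""
    if address == 0:
        return 0
    return struct.unpack_from("<I", block, address - 4)[0]

block = gb_string_block("GFABASIC")
address = 4  # the variable holds the address of the first character

print(len(block))              # characters + 5 bytes of overhead
print(gb_len(block, address))  # length, excluding the terminating zero
print(gb_len(block, 0))        # an uninitialized (null) String
```

The block for "GFABASIC" is 13 bytes long (8 characters plus 5), while the stored length is 8.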

BSTR in GB
GB does not provide a data type BSTR, but it provides limited support of hidden BSTRs to pass and obtain BSTR-strings to and from COM objects. GB handles the conversion and memory allocation for BSTRs, but it does not provide string-manipulation functions for BSTRs, or even BSTRs in Variants. More on this below.
BSTR variables are always temporary, hidden local variables used to communicate with COM properties/methods that take or return BSTR arguments. These hidden BSTR variables are always destroyed when leaving a subroutine. Even the Naked attribute won’t prevent the inclusion of the termination code.
BSTR strings are COM-based strings. They are allocated from COM memory and consequently the memory can be managed by both the provider of the COM object and the client. That is the first difference. Next, a BSTR contains UTF-16 coded wide characters, which I discussed in ANSI and UNICODE. The way COM stores a BSTR is much the same as the way GB stores a String variable. In fact, a BSTR is a 32-bit location that stores a pointer to dynamically allocated memory with UNICODE formatted characters. The length of the BSTR is stored in front of the BSTR, again like GB’s String data type.

Use Variant for BSTR
Although GB provides hidden support for BSTRs, the only way to get access to a BSTR is by using a Variant. The following example assigns a GB-String to a Variant. At runtime the code allocates a BSTR by calling SysAllocStringLen(0, Len(GB-String)) and then copies the converted GB-String to the returned address. The address of the BSTR together with its data type is stored in the Variant. When the Variant variable goes out of scope, the BSTR from the Variant is released through a call to SysFreeString(address).
Dim vnt1 = "Hello"
Now it gets interesting. After GB has invoked the SysAllocStringLen() COM API, it converts the ANSI string to UNICODE using a private conversion routine that intersperses zeros between the characters (see ANSI and UNICODE). GB does not turn to the MultiByte*() APIs Windows provides, because GB supports ANSI characters only. In the conversion process to UNICODE no characters will be lost and the private function is extremely fast.
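The idea of interspersing zeros can be illustrated with a short Python sketch. It is not GB's private routine, only a model of the technique: widen() is a hypothetical name, and the result matches a real UTF-16 encoding for characters whose code equals their UNICODE value (plain ASCII and the Latin-1 range), which is the assumption made here.

```python
def widen(ansi_bytes):
    """Insert a zero byte after every input byte (little-endian 16-bit units)."""
    out = bytearray(len(ansi_bytes) * 2)  # all bytes start as zero
    out[0::2] = ansi_bytes                # low bytes: the original characters
    return bytes(out)                     # high bytes stay zero

wide = widen(b"Hello")
print(wide)
print(wide == "Hello".encode("utf-16-le"))
```

No lookup table or API call is involved, which is consistent with the conversion being very fast.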
An optimized UNICODE conversion function
This knowledge makes it possible to obtain a UNICODE-string (not a BSTR) from a String argument through our own optimized conversion routine. Note:
  • A UNICODE string is required if you want to use the Wide version APIs.
  • A UNICODE string does not have a length field in front of it; it is not a BSTR. UNICODE only specifies how many bytes a character occupies (2).
  • Its memory is managed by the program through malloc(), not COM memory, and it ends with two null-bytes (although it seems 1 is ok as well).
  • The converted ANSI argument is placed in a String only because it is a convenient data type to store consecutive data.
The function makes use of the BSTR allocation and conversion functionality of the Variant.
(The $Export is there because it comes from a .lg32 file).
Function WStr(vnt As Variant) As String Naked ' Return UNICODEd string
  $Export Function WStr "(AnsiString) As String-UNICODE Naked"
  Dim BSTR As Register Long
  BSTR = {V:vnt + 8} ' BSTR address at offset 8
  Return StrPeek(BSTR, {BSTR - 4}) ' <- updated 28-06-2017
EndFunc
1. A function very well suited for the Naked attribute, because it does not contain local variables that contain dynamically allocated memory that would otherwise require explicit release code.
2. The argument of the function is ByVal As Variant. This forces the caller (calling code) to create a Variant and then pass it by value by pushing 16 bytes (4 DWords) on the stack. Whether the Variant is passed by value or by reference, the calling subroutine is responsible for freeing the BSTR stored in the Variant. However, ByVal is interesting because …
3. The GFABASIC-compiler provides a hidden optimization when you pass a literal string to a ByVal As Variant. A ByVal Variant requires 16 bytes to be pushed on the stack, but the UNICODE characters the Variant points to are already converted at compile time. Therefore the following call is extremely efficient:
Dim t$ = WStr("GFABASIC")
The GFA-BASIC compiler stores the literal string “GFABASIC” as a UNICODE sequence of bytes (2 per character) and does not need to allocate (COM) memory and convert at runtime. This also relieves the caller from releasing the BSTR-COM-memory, so the calling function doesn’t need to execute Variant destruction code.
Assigning a UNICODE formatted string this way is almost as efficient as initializing a String with an ANSI literal string. It only takes a few cycles to call and execute the WStr() function.
4. The caller provides the String variable to store the return value of the function. That is, the function’s ‘local variable’ WStr is silently declared in the calling subroutine. The hidden string is passed as a ByRef variable to the function. The return value (String) is directly assigned to the hidden variable. If an exception occurs in function WStr(), the termination code of the caller will release the hidden WStr string variable. (Therefore Naked is perfect for this function: it does not need to provide explicit release code.)
5. The function involves two more optimizations. First, the local Long variable that stores the address of the BSTR is a register variable; no stack memory and no copying are required. The other optimization belonged to the original version of the function, which used a Shl 1 expression to multiply the length of the BSTR by 2. This results in an integer asm add eax, eax instruction, rather than a floating point multiplication, which is also a significant optimization. The updated function no longer needs it, because the value at {BSTR - 4} is already a byte count.
6. Other arithmetic operations like V:vnt+8 and BSTR-4 are relative address operations and are properly compiled into indirect addressing instructions. So, no chance here to optimize.
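To make the address arithmetic in WStr() concrete, here is a small Python model of the three steps: read the BSTR pointer at offset 8 of the Variant, read the byte count at BSTR - 4, and copy that many bytes. It is a sketch under stated assumptions, not the compiled GB code: memory is a flat bytes object, build_memory(), peek_long() and wstr() are hypothetical helpers, and the Variant is laid out as a 2-byte type tag, 6 reserved bytes and 8 data bytes.

```python
import struct

VT_BSTR = 8  # COM type tag for a BSTR stored in a Variant

def build_memory(text):
    """Fake address space: a BSTR block followed by a 16-byte Variant."""
    data = text.encode("utf-16-le")
    mem = bytearray(16)                      # keep address 0 unused
    mem += struct.pack("<I", len(data)) + data + b"\x00\x00"
    bstr_addr = 16 + 4                       # address of the first character
    while len(mem) % 16:                     # align the Variant
        mem += b"\x00"
    vnt_addr = len(mem)
    # type tag, three reserved words, then the BSTR pointer at offset 8
    mem += struct.pack("<HHHHII", VT_BSTR, 0, 0, 0, bstr_addr, 0)
    return bytes(mem), vnt_addr

def peek_long(mem, addr):
    """Read a 32-bit value, like the { } operator in the GB source."""
    return struct.unpack_from("<I", mem, addr)[0]

def wstr(mem, vnt_addr):
    bstr = peek_long(mem, vnt_addr + 8)      # BSTR = {V:vnt + 8}
    nbytes = peek_long(mem, bstr - 4)        # {BSTR - 4}
    return mem[bstr:bstr + nbytes]           # StrPeek(BSTR, {BSTR - 4})

mem, vnt_addr = build_memory("GFABASIC")
result = wstr(mem, vnt_addr)
print(len(result), result.decode("utf-16-le"))
```

The returned bytes contain the 16 UNICODE bytes only; the two null bytes that follow the BSTR are not copied.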
I went in some detail to explain the function hoping you’ll find it useful. I hope to tell more about the way the compiler constructs subroutines and performs optimizations.

1 comment:

  1. in the function that you wrote the returned unicode string is NULL terminated? I ask this because i have to pass an unicode string to another library that expect to receive a null terminated unicode string. In unicode a string is #0#0 terminated? or a single #0?
