10 May 2017

ANSI and UNICODE

Updated 21-05-2017: Sample code at the end of the post.

GFA-BASIC 32 only supports ANSI strings, not UNICODE… What exactly does that mean?

ANSI-strings consist of a sequence of bytes – the characters of a string – where each byte represents a character. This allows for 256 different characters because a byte can contain a value between 0 and 255. Restricting strings to bytes limits the number of possible – mostly for not western languages – characters. To allow for more characters each character in a string must somehow occupy more than one byte. In Windows, each Unicode character is encoded using UTF-16 (where UTF is an acronym for Unicode Transformation Format). UTF-16 encodes each character as 2 bytes (or 16 bits). UTF-16 is a compromise between saving space and providing ease of coding. It is used throughout Windows, including .NET and COM.

In UNICODE the lower 256 values represent the same characters as in ANSI, but they are stored as a sequence of 16-bits integers. Additional characters are represented with higher values above 256. In UNICODE the first 256 characters have the same value as in ANSI, but each character requires 2-bytes of storage. When you convert an ANSI string to UNICODE it becomes twice the size of the ANSI string.
Let’s see what this means from a GFA-BASIC perspective.

ANSI in a GB String
When you store a literal string like “GFABASIC” in a String (ANSI, 1-byte representation), the string is filled with 8 bytes of (hexadecimal) values 47 46 41 42 41 53 49 43.

a_t$ = "GFABASIC"   ' 47 46 41 42 41 53 49 43

The same string can be created by using Chr$() and populate these byte values. (A more general approach would be to use the Mk1$() function):

a_t$ = Chr($47, $46, $41, $42, $41, $53, $49, $43)
a_t$ = Mk1($47, $46, $41, $42, $41, $53, $49, $43)

GFA-BASIC’s string functions expect ANSI strings only, and by default GB only communicates with the ANSI version of the Windows API functions. With a little knowledge you can do more.

Windows APIs are UNICODE
Windows is an UNICODE system. When a Windows API takes a string as an parameter, Windows always provides two versions of the same API. It provides an API for ANSI stings and an API for UNICODE strings. To differentiate between ANSI and UNICODE respectively, the names of the API function either ends with uppercase A - for ANSI parameters - and uppercase W for the version that accepts or expects UNICODE. A typical example would be the  SetWindowText() API which comes in two flavors SetWindowTextA() and SetWindowTextW().
The GFA-BASIC’s built-in APIs are the ones that map to the functions that end with A. So the GB function SetWindowText() maps to the SetWindowTextA() function.

UNICODE in a GB String
By default, when you declare a literal string in your source code, the compiler turns the string's characters into an array of 8-bit data types, the String. You can not – in the same way - declare a literal UNICODE string. To assign a sequence of 2-byte characters you’ll need to use different methods. For instance by populating a String by hand. In the example above it only takes one change to create a UNICODE array of characters. Simply change the Mk1() function to Mk2():

u_t$ = Mk2($47, $46, $41, $42, $41, $53, $49, $43) + #0

Now each character occupies 2 bytes and has become UNICODE formatted, because it encodes each character using UTF16, interspersing zero bytes between every ASCII character, like so

u_t$ = Chr($47,0, $46,0, $41,0, $42,0, $41,0, $53,0, $49,0, $43,0) + #0

A GB String data type always adds a null-byte (only one) to zero-terminate the sequence of characters. Since the above assignments are GB controlled, the strings end with only one null-byte. UNICODE should end with two null-bytes. You should explicitly add an additional null at the end of the string to properly create a UNICODE string.

UNICODE is not BSTR
Note that we simply created a piece of memory to store characters in 2-bytes rather than in 1-byte. The String memory is allocated from the program’s global heap and this memory is only guarded by GB. Although the string contains UNICODE it is not a BSTR. A BSTR is a COM defined string type and is allocated from COM-memory. Both the client (a GB-program) and the provider/server have access to the same COM-memory.
When a string is assigned to a Variant, which supports BSTRs only, GB allocates COM string memory and converts the ANSI string to UNICODE.

Using pure UNICODE
The GFA-BASIC string-functions use a 1-byte character indexing system. However, you can overcome this limitation for 2-byte formatted strings and apply GB String-functions when you multiply the index and length parameters by 2. For instance:

u_t$ = Left(u_t$, ipos * 2) + #0
u_t$ = Mid(u_t$, ipos * 2, nBytes * 2) + #0

You can pass these UNICODE formatted strings to APIs that end with uppercase W. To introduce the wide character APIs to your code you must Declare them explicitly. For instance, this code displays u_t$ in the client area of a window.

Declare Function TextOutW Lib "gdi32.dll" Alias "TextOutW" ( _
  ByVal hdc As Handle,        // handle to DC _
  ByVal nXStart As Int,       // x-coordinate of starting position _
  ByVal nYStart As Int,       // y-coordinate of starting position _
  ByVal lpwString As Long,    // character string _
  ByVal cbString As Long      // number of characters _
  ) As Long

Form frm1
TextOutW(frm1.hDC, 1, 1, V:u_t$, Len(u_t$) / 2)

Remember one thing. Windows uses UNICODE only, including fonts. Whether you use TextOutW or TextOutA (as Text does), all output is performed using UNICODE fonts. The TextOutA first converts the text to UNICODE and than invokes TextOutW. By providing a UNICODE formatted to a W-version API only skips the conversion from ANSI. See below for an example.

Obtaining UNICODE text from Windows APIs
Since XP, all Windows APIs taking or returning a string parameter are implemented in UNICODE only. The ANSI version of these functions translate (or convert) the ANSI strings to and from UNICODE format. Well, GB only handles ANSI strings; it passes and retrieves ANSI strings to and from Windows APIs. What is the consequence of this restriction?

When an ANSI string is passed to an A – version of an API, the Windows API will convert the string to UNICODE and than invoke the W-version of that API. There is no loss of information in this conversion. All ANSI characters are converted to UNICODE by expanding the string with zero’s as explained above. The string-size is doubled, but that’s all.

The other way around is more problematic. A Windows API may return or provide a UNICODE formatted string containing non-ANSI characters, characters with a 2-byte value above 256 … When the A-version of the API is used to retrieve text, Windows will do the UNICODE-to-ANSI conversion on behalf of the A-version of that API and the characters with a higher value of 256 will be lost.
This won’t be a problem if the ANSI-based GFA-BASIC program is used in languages no other than Latin (English) alphabets. In other languages the Windows system accepts more characters and the text won’t be properly returned to the GFA-BASIC String data type.
When your program needs UNICODE input or use UNICODE strings you should explicitly declare all the required wide APIs. In addition, you might also need W replacements for the GDI text-out functions. To use the GB string functions, you should remember to multiply or divide all integer arguments with 2.

Displaying UNICODE glyph characters (updated 21-05-2017)
Windows 10 includes and uses a new graphical font: Segoe MDL2 Assets. This sample shows how to obtain the glyphs form the font icons for use in GB.
In the accessory Special Characters select Segoe MDL2 Assets and than select a graphical character. Write down the 16-bit value from the box at the bottom and assign it to a String. Here the value for the picture for saving is 0xE105.

image

Form frm1
ScaleMode = basPixels   ' by default
SetFont "Segoe MDL2 Assets"
' Display UNICODE string "GFABASIC"
Dim u_t$ = Mk2($47, $46, $41, $42, $41, $53, $49, $43) + #0
TextW 1, 1, u_t$

' Get a Picture Object from a glyph.
' Char-value from 'Special Characters' Accesorry
Dim hBmp As Handle, p As Picture
Dim size As SIZE
u_t$ = Mk2(0xE105)          ' the Save-glyph
TextW 1, 31, u_t$           ' show it
TextSizeW(Me.hDC, V:u_t$, Len(u_t$) / 2, size)
Get 1, 31, 1 + size.cx, 31 + size.cy, hBmp   ' a GDI-handle
Put 50, 1, hBmp             ' and test it
Set p = CreatePicture(hBmp, True)  ' into a Picture
PaintPicture p, 70, 1       ' and test it
Do
  Sleep
Until Me Is Nothing

Proc TextW(x As Int, y As Int, wstr As String)
  ' Assume Scalemode = basPixels, ScaleLeft=0, and ScaleTop=0
  TextOutW(Me.hDC, x, y, V:wstr, Len(wstr) / 2)
  ' If AutoRedraw == True draw on bitmap.
  If Me.hDC2 Then TextOutW(Me.hDC2, x, y, V:wstr, Len(wstr) / 2)
EndProc

Declare Function TextOutW Lib "gdi32.dll" Alias "TextOutW" ( _
  ByVal hdc As Handle,         // handle to DC _
  ByVal nXStart As Int,        // x-coordinate of starting position _
  ByVal nYStart As Int,        // y-coordinate of starting position _
  ByVal lpwString As Long,     // character string _
  ByVal cbString As Long       // number of characters _
  ) As Long
Declare Function TextSizeW Lib "gdi32.dll" Alias "GetTextExtentPoint32W" ( _
  ByVal hdc As Handle,        // handle to DC _
  ByVal lpString As Long,     // text string _
  ByVal cbString As Int,      // characters in string _
  ByRef lpSize As SIZE        // string size _
  ) As Long
Type SIZE
  - Long cx, cy
EndType

A few notes about this sample (compared to the previous version).

  1. The Segoe MDL2 Assets font is not a fixed-sized font (the LOGFONT member lfpitchAndFamily is not FIXED_PITCH). However, the glyphs in the font all have the same format. To obtain the size of a glyph-character we cannot use the ANSI GB functions TextWidth() and TextHeight(), since they cannot return the size of a 2-byte character. Therefor the inclusion of the TextSizeW() function.
  2. To conform to GB’s scaling the TextOutW function should take coordinates in the current ScaleMode and the text-output should obey the ScaleLeft and ScaleTop settings. In this sample TextW simply draws on a pixel resolution scale and relative to (0,0), located at the top-left of the client area. Note however that Get and Put actually use the current scaling. Be sure to use the same ScaleMode for both GB commands as API functions. (As long as B= basPixels (default scaling in GFA-BASIC, VB uses twips, do not confuse the both).

Finally, the return values of ScaleLeft and ScaleTop are wrong (al versions below Build 1200). Hope to update the GfaWin23.ocx as soon as possible).

No comments:

Post a Comment