Publication 6 May 2016, written 15 December 2008 or before. Updated 28 September 2019

Sources belong to this explanation.

Source code

/* Copyright 2008 by R. Harmsen. But I won't sue anyone who uses or adapts the code. 28 September 2019: Added function CP1252_to_scalar() */ /*********************************************************************** This function examines the first byte of a UTF8 sequence, and returns that sequence's length (including the first byte itself), based on the value of that first byte. This is a theoretical value, so the outcome doesn't say anything as to whether enough valid UTF8 bytes actually follow, nor as to whether any bytes follow at all. **********************************************************************/ int utf8len (int firstbyte); /*********************************************************************** This function checks if buf points to a sequence of at least two valid UTF8 bytes which represent a single Unicode scalar value. The function then returns non-zero, else zero. Note: the buffer MUST be at least seven bytes long! Note that simple Ascii and UTF8 follow-up chars are also considered non-valid for this function. Only the start of a UTF8 sequence, including its follow-ups, is considered valid. The character AFTER a valid UTF8, in a stream of UTF8 text, could be either ASCII (including a null byte) or the start of a new UTF8 char. It cannot be a UTF8 follow-up byte. However, because we want to be able to test also mixed text, possibly containing UTF8 and ISO-8859, this is not tested here! This is intended behaviour! **********************************************************************/ int utf8valid (unsigned char *start); /*********************************************************************** This function expects that `start' points to a sequence of valid UTF8 bytes which represent a single Unicode scalar value. It does not itself check validity. Note: the buffer MUST be at least seven bytes long! Valid Utf8 in this case does include plain ASCII (7 bits values). This differs from the assumption by function utf8valid. The return value is the Unicode scalar value that the UTF8 sequence represents. **********************************************************************/ long utf2scalar (unsigned char *start); /*********************************************************************** Returns a pointer to a null-terminated byte string in a static buffer. Scalar is a Unicode scalarvalue. The buffer will contain the representation in UTF8. **********************************************************************/ unsigned char *scalar2utf8 (long scalar); /*********************************************************************** This function tests if a Unicode scalar value can be represented as a valid ISO-8859-1 code. This means it must be ASCII (7 bits, less than 0x80), or be between 0xa0 and 0xff (both including). Returns non-zero for valid, zero means invalid. **********************************************************************/ int isconvertibleISO8859_1 (long scalar); /*********************************************************************** This function is like isconvertibleISO8859_1 except that it also accepts scalar Unicode values that map to the Windows 1252 byte codes in the range 0x80 thru 0x9f. If p_converted is NULL, this argument is ignored, but if a valid pointer is passed, the location it points to will be set to the corresponding Windows 1252 value (or -1 if invalid). **********************************************************************/ int isconvertibleCP1252 (long scalar, int *p_converted); /*********************************************************************** This function expects a Unicode scalar that maps to a valid Windows 1252 code in the range 0x80 thru 0x9f. It then returns that value, else -1. Examples: Unicode scalar value hex 2026 is the ellipis (three dots). This maps to code position hex 85 in Windows code page 1252. Unicode scalar value hex 201d is the closing curly double quote (99 shaped, top of the line). This maps to code position hex 94 in Windows code page 1252. **********************************************************************/ int convert2CP1252 (long scalar); /*********************************************************************** This function expects a CP1252 code byte, and returns its Unicode scalar, or -1 if unconvertible. If CP1252 is also ASCII, or in the valid range of ISO-8859-1, the function returns its input. Otherwise, the scalar is looked up in a table. **********************************************************************/ long CP1252_to_scalar (int CP1252);

Colours: Neutral Weird No preference Reload screen