Author Topic: Character sets and encoding (Read 3302 times)

martink · « **on:** February 03, 2020, 05:47:04 AM »

Hi,
I wonder if you could help me with the tricky issue of character sets and encoding.
I have a sentence in a caption which includes this phrase:
Trust’s ‘Words
There are legacy systems which do not recognise the particular apostrophe (as in Trust’s) or the open single quote (as at the start of ‘Words) and raise an error on them, as they will for a close single quote.
This problem crops up most frequently after a copy and paste into the caption from a Word document.
What encoding should I use so that within the caption metadata those words are forced to look like
Trust's 'Words
where the apostrophe, the open single quote, and the close single quote are all the same.
Many thanks for your help,
Martin

Kirk Baker · « **Reply #1 on:** February 03, 2020, 09:01:03 AM »

Martin,

There is no encoding that will force one letter to look like another. UTF-8 can represent both forms just fine. Do these legacy systems not understand Unicode?
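
To illustrate, here is a minimal Python sketch. The sample string is the phrase from the question; that the legacy systems accept only plain ASCII is an assumption on my part, not something confirmed in this thread. It shows that UTF-8 encodes both curly quotes without trouble, while a strict ASCII encoder rejects them.

```python
# The phrase from the question, with the curly quotes written as escapes
# so the source file itself stays plain ASCII.
caption = "Trust\u2019s \u2018Words"  # Trust’s ‘Words

# UTF-8 represents each curly quote as a three-byte sequence.
utf8_bytes = caption.encode("utf-8")
print(utf8_bytes)


def is_ascii_safe(text: str) -> bool:
    """Return True if every character in text fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Assumption: an ASCII-only consumer would choke on the first form
# and accept the second.
print(is_ascii_safe(caption))
print(is_ascii_safe("Trust's 'Words"))
```

A check like `is_ascii_safe` (a hypothetical helper name) could be a quick way to find captions that need attention before they reach the downstream system.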

-Kirk

Odd Skjaeveland · « **Reply #2 on:** February 04, 2020, 12:18:44 AM »

Quote from: martink on February 03, 2020, 05:47:04 AM

...What encoding should I use so that within the caption metadata those words are forced to look like Trust's 'Words where the apostrophe, the open single quote, and the close single quote are all the same.

I think that is more about changing characters, less about character sets. It looks like you want to replace character ’ with ' and also replace character ‘ with '. Try search/replace in a source document before you copy/paste from it.

The following digs a little deeper and hopefully explains my answer.

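The wanted ' apostrophe or single quote is ASCII character 27 Hex (39 decimal).
The unwanted ‘ is the left single quotation mark, Unicode U+2018. It is not an ASCII character; it is easily mistaken for the grave accent `, which is ASCII character 60 Hex (96 decimal).
The unwanted ’ is the right single quotation mark, Unicode U+2019. It is not part of the 7 bit ASCII set either, and it is not in ISO 8859-1 (Latin-1); Windows-1252 has ‘ at 91 Hex and ’ at 92 Hex. It is easily mistaken for the acute accent ´, which is also outside 7 bit ASCII but is commonly included in 8 bit "extended" ASCII sets. In ISO 8859-1 the acute accent is B4 Hex (180 decimal); Unicode uses the same value as its code point, U+00B4, but UTF-8 stores it as the two bytes C2 B4.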

Various character codes are in use for the acute accent. In Code Page 850 that character is at EF Hex (239 decimal) while in Code Page 863 the acute accent is at A1 Hex (161 decimal) and the acute accent is probably not in Code Page 437 at all.

Changing encoding, as per your question, is normally about keeping the character as such and switching the code as required to get that character. Switching character sets from, say, CP863 to CP850, you need to change occurrences of character code A1 Hex to EF Hex so that the acute accent is an acute accent in both texts. Going from CP863 to ISO 8859-1 means changing from A1 to B4 Hex, but both will be, and are supposed to be, the acute accent. There are quite a number of tools that can change the character set for a given text, but they should not change a character into a different character (like X into Y).
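
The code page switch described above can be sketched in Python, whose standard library ships codecs named `cp863`, `cp850`, `cp437` and `latin-1`. The `transcode` and `has_char` helpers are names I made up for illustration. The sketch keeps the character (the acute accent) fixed and only changes the byte that stands for it.

```python
ACUTE = "\u00b4"  # the acute accent as a Unicode character

# The same character, expressed in different character sets.
cp863_byte = ACUTE.encode("cp863")
cp850_byte = ACUTE.encode("cp850")
latin1_byte = ACUTE.encode("latin-1")
utf8_bytes = ACUTE.encode("utf-8")


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from one character set to another.

    The characters stay the same; only the byte values change.
    """
    return data.decode(source).encode(target)


def has_char(char: str, encoding: str) -> bool:
    """Return True if the character set can represent char at all."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# CP863 byte A1 becomes CP850 byte EF; both still mean the acute accent.
converted = transcode(b"\xa1", "cp863", "cp850")
print(converted.hex(), converted.decode("cp850") == ACUTE)

# Check whether Code Page 437 has the acute accent.
print(has_char(ACUTE, "cp437"))
```

Note that nothing here turns one character into another; that is a separate step from transcoding.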

If I understand your question correctly, you want to change both the left single quotation mark ‘ and the right single quotation mark ’ into the apostrophe/single quote character ', which is a different character. That is different from changing the character set while keeping the curly quotes as they are. The stated problem seems parallel to changing occurrences of characters "A" and "B" into character "C", which most text editors can solve with search/replace.
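
If a text editor is not convenient, the same search/replace can be sketched in a few lines of Python. This assumes the pasted characters are the Word "smart quotes" U+2018 and U+2019; `straighten_quotes` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
# Map each curly single quote to the plain ASCII apostrophe (27 Hex).
STRAIGHTEN = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace curly single quotes with the ASCII apostrophe."""
    return text.translate(STRAIGHTEN)


caption = "Trust\u2019s \u2018Words"
print(straighten_quotes(caption))
```

Other look-alike characters could be added to the table in the same way if they turn up in the captions.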

martink · « **Reply #3 on:** February 12, 2020, 05:50:30 AM »

Hi Kirk and Odd,
Very many thanks for your thoughts on this ...
I am now talking to my colleagues who look after the system downstream to see whether it understands Unicode.
I do now know what to ask!
Best wishes,
Martin

Camera Bits Forums
