utf8 in QrCode

May 27, 2013 at 6:45 AM
Want to know if I am doing it right.
string text = "Meniň adym Begenç Amanow";
var qrEncoder = new QrEncoder(ErrorCorrectionLevel.H);
QrCode qrCode;
qrEncoder.TryEncode(text, out qrCode);
var renderer = new GraphicsRenderer(new FixedCodeSize(200, QuietZoneModules.Two));
var ms = new System.IO.MemoryStream();
renderer.WriteToStream(qrCode.Matrix, System.Drawing.Imaging.ImageFormat.Png, ms);
It wasn't properly encoding the Unicode characters.
I used the solution provided in this link: StackOverflow
i.e. I converted the UTF-16 string to a UTF-8 string:
string text = Utf16ToUtf8("Meniň adym Begenç Amanow");
and it worked like a charm :)
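The Utf16ToUtf8 helper from that answer isn't shown here, but I believe it boils down to re-interpreting the UTF-8 bytes as Latin-1 code points, so that a later ISO-8859-1 pass writes out the original UTF-8 bytes unchanged. A rough Python sketch of that idea (the helper's exact mechanics are my assumption):

```python
def utf16_to_utf8(text):
    # Encode the string to UTF-8 bytes, then reinterpret each byte as a
    # Latin-1 code point. Every char in the result is < U+0100, so a
    # later ISO-8859-1 encoding pass reproduces the UTF-8 bytes exactly.
    return text.encode("utf-8").decode("iso-8859-1")

mangled = utf16_to_utf8("Meniň adym Begenç Amanow")
# Round trip: ISO-8859-1 bytes of the mangled string are valid UTF-8.
assert mangled.encode("iso-8859-1").decode("utf-8") == "Meniň adym Begenç Amanow"
```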
Is there any problem with my approach, or is there a proper way to do this?

Thank you for this piece of art.
Begench Amanov.
Coordinator
May 27, 2013 at 10:50 AM
Somehow it is encoded with ISO 8859-1. I will check the code later and see if I can fix it.

Thanks for report.
May 29, 2013 at 8:37 PM
Thank you for your concern. Will wait for your answer.
Coordinator
Jun 2, 2013 at 4:41 AM
I have fixed it on my side. The problem is not related to the UTF-16 to UTF-8 conversion, but to ISO-8859-1 encoding.

What happens is: when we try to convert "ň" to a byte array using ISO-8859-1, it returns 110, which is "n". The char "ň" doesn't exist in ISO-8859-1, so it should normally return a question mark instead. That's where it all went wrong: every char you provided turned into a plain English letter. I don't know what's behind MSFT's magic best-fit encoding, but it makes char recognition really hard.

The way I have resolved it is to convert the bytes back to a string after the initial encoding and check whether both strings are the same; that introduces one more step of work. On the other hand, I have removed the majority of the encoding-table checks from the ECI set, which should help balance everything out. Since we released our library, more and more cases are showing that the complete ECI set might cause problems in one way or another. So keeping only the default encoding and UTF-8 is probably the best way for now.
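The round-trip check looks roughly like this (a Python sketch of the idea; note that Python's codec replaces unmappable chars with "?" rather than best-fitting them to "n" the way .NET does, but the detection principle is the same):

```python
def fits_iso_8859_1(text):
    # Encode with replacement, decode back, and compare: if any char
    # was replaced (or best-fit mapped), the round trip no longer
    # matches the original string.
    round_trip = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
    return round_trip == text

print(fits_iso_8859_1("Begenç"))  # True: ç exists in ISO-8859-1
print(fits_iso_8859_1("Meniň"))   # False: ň does not
```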

I will do some more work before committing my fix. This post is just to let you know what exactly happened.
Jun 2, 2013 at 5:04 AM
Thank you.
May I suggest something?
Wouldn't it simplify encoding if you used only UTF-8 and removed everything else?
Coordinator
Jun 2, 2013 at 11:54 AM
That will never be an option. The reason is that a pure UTF-8 QrCode will be huge; it should be avoided whenever possible. Also, UTF-8 is not in the ISO QrCode specification. The current UTF-8 implementation is something all decoders and encoders have agreed on under the table: it's not in the char-table set, nor do we use any char-table indication. The only thing that can tell a reader it is UTF-8 is the BOM char.

Let's say we have "Hello World". If we use ISO 8859-1 to encode it, the code will be around version 2~3 (my guess); if we use UTF-8, it will be larger. So what happens if we want to print the QrCode on a business card? Each white and black square will be so small that it creates huge density, which makes the code hard to read and easily damaged by a scratch.

The priority of which mode to use should always be the following: numeric > alphanumeric > kanji (Japanese) > 8859-1 (or Shift-JIS for the Japanese specification) > other 8-bit byte char sets > UTF-8.

In other words, if I can use 2 bytes to represent 4~6 chars, that's far better than 2 bytes displaying only 1 char. Also, with fewer bytes the code has far fewer modules, so at the same printed size people can scan it from farther away. That's the power of using a better char set.
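To put rough numbers on it, here is a sketch of the data-bit cost of "HELLO WORLD" in two modes (ignoring the mode indicator and char-count field for simplicity):

```python
def alphanumeric_bits(n_chars):
    # Alphanumeric mode packs 2 chars into 11 bits;
    # a leftover odd char costs 6 bits.
    return (n_chars // 2) * 11 + (n_chars % 2) * 6

def byte_mode_bits(n_chars):
    # Byte mode (e.g. ISO-8859-1) spends a full 8 bits per char.
    return n_chars * 8

n = len("HELLO WORLD")
print(alphanumeric_bits(n), byte_mode_bits(n))  # 61 vs 88 bits
```

UTF-8 would be worse still for non-ASCII text, since accented chars cost 2+ bytes each before the 8-bits-per-byte cost even applies.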
Jun 2, 2013 at 4:16 PM
Ok, now I understand. So you only use UTF-8 if the other encodings don't have the character.
Very good decision. But can you compare UTF-8 and the other encodings? I don't think
it will be so much larger.
Coordinator
Jun 2, 2013 at 10:12 PM
Very large. The other encodings have special processing to make them smaller.

For a version 1 QrCode, the smallest normal QrCode: it can contain 41 numeric chars, or 25 alphanumeric chars, or 17 bytes, or 10 kanji. And for bytes, 8859-1 means 17 chars, while UTF-8 gives only about 8.
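A quick sanity check of that byte-mode arithmetic (a Python sketch; "é" stands in here for any accented char):

```python
# A version 1 QR code holds 17 bytes in byte mode. In ISO-8859-1 that
# is 17 chars, but any non-ASCII char costs at least 2 bytes in UTF-8,
# so for accented text the budget drops to about 8 chars.
text = "é" * 8
print(len(text.encode("iso-8859-1")))  # 8 bytes: fits easily
print(len(text.encode("utf-8")))       # 16 bytes: nearly full
```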

For the full table, please check ISO/IEC 18004:2006(E), the QR code specification, page 33.

You can download that specification from our first page.
Coordinator
Jun 2, 2013 at 10:14 PM
As for kanji, we normally check for it before anything else. Kanji mode is only for Japanese word chars, which all belong to the Unicode range, but kanji mode has special processing to make them smaller than Unicode: each char takes 13 bits instead of a full 2 bytes. A little smaller, not hugely.
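The kanji-mode packing can be sketched like this, following the spec's algorithm: subtract an offset from the Shift-JIS double-byte value, then fold the two bytes into a single 13-bit number:

```python
def kanji_13bit(sjis_pair):
    # Pack one Shift-JIS double-byte char into 13 bits, per the QR spec:
    # subtract 0x8140 for the 0x8140-0x9FFC range (0xC140 for the upper
    # range), then combine as high_byte * 0xC0 + low_byte.
    v = int.from_bytes(sjis_pair, "big")
    v -= 0x8140 if v <= 0x9FFC else 0xC140
    return (v >> 8) * 0xC0 + (v & 0xFF)

print(hex(kanji_13bit("点".encode("shift_jis"))))  # 0xd9f, fits in 13 bits
```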
Jun 3, 2013 at 6:16 AM
I downloaded the pdf on the first page and read page 33. Now I understand. Thank you for your explanation.