Issues with encoding Unicode strings in the Recognise function

description

The logic in your Recognise function has flaws. I suggest this sequence:
Numeric --> AlphaNumeric --> Kanji --> UTF-8

The purpose is to reduce the encoded size of Matrix.

However, your logic always returns UTF-8 for Unicode input (except Kanji).

e.g. "123支花朵" is Chinese. Logically, if we don't assign an encoding name, the program should pick the encoding name "GB2312" or "GB18030" with ECI value 29 and 8-bit byte mode instead of UTF-8.
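
For reference, the size difference behind this request can be checked quickly. This is a minimal sketch in Python (not the library's C#), assuming Python's built-in `gb2312` codec is a fair stand-in for code page 936:

```python
# Compare the byte length of the example string under UTF-8 and GB2312.
text = "123支花朵"

utf8_bytes = text.encode("utf-8")   # 1 byte per ASCII digit, 3 bytes per Chinese character
gb_bytes = text.encode("gb2312")    # 1 byte per ASCII digit, 2 bytes per Chinese character

print(len(utf8_bytes), len(gb_bytes))
```

The GB payload is shorter, but a QR symbol using it also has to carry an ECI header, and the decoder has to support that ECI value.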

comments

Silverlancer wrote Jul 3, 2014 at 8:30 PM

I have re-read my code and there is nothing wrong with it. The one you have proposed is wrong. You only looked at the high-level order of my methods and didn't actually look into the methods themselves.

I check Kanji first, and based on the result I separate the input into 3 categories. The first is Kanji success. The second is a detected mix of alphanumeric and Kanji, which automatically goes to 8-bit byte encoding. The last is a high probability of alphanumeric, which then goes through the numeric -> alphanumeric loop.

Checking Kanji first is the way to detect a mix. It might be slow for pure numeric input, but overall it's probably fast for mixed input, like your example.
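
The order described above can be sketched roughly as follows. This is an illustrative Python sketch, not the library's C# code: `recognise`, `is_qr_kanji` and the returned mode names are hypothetical, and the Kanji ranges (0x8140-0x9FFC and 0xE040-0xEBBF) are assumed from the QR Kanji mode definition.

```python
ALPHANUMERIC = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")


def is_qr_kanji(ch):
    """True if `ch` is a double-byte Shift JIS character inside the assumed QR Kanji ranges."""
    try:
        raw = ch.encode("shift_jis")
    except UnicodeEncodeError:
        return False
    if len(raw) != 2:
        return False
    code = (raw[0] << 8) | raw[1]
    return 0x8140 <= code <= 0x9FFC or 0xE040 <= code <= 0xEBBF


def recognise(text):
    """Pick a mode name: Kanji check first, then numeric -> alphanumeric, else 8-bit byte."""
    if not text:
        raise ValueError("empty input")
    kanji_count = sum(1 for ch in text if is_qr_kanji(ch))
    if kanji_count == len(text):
        return "Kanji"            # category 1: pure Kanji
    if kanji_count > 0:
        return "EightBitByte"     # category 2: Kanji mixed with other characters
    if all(ch in "0123456789" for ch in text):
        return "Numeric"          # category 3: try the smallest table first
    if all(ch in ALPHANUMERIC for ch in text):
        return "Alphanumeric"
    return "EightBitByte"


print(recognise("12345"), recognise("HELLO 123"), recognise("123点"))
```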

Why don't we use GB2312? It's not in the code table, and the majority of decoders don't even support it. I did my research: I had the full table from the barcode specification listing which encoding tables are allowed. UTF-8 is not in the table but is widely used, and people use the UTF-8 BOM to identify it.

Your example is clearly in the UTF-8 category, unless you use the Chinese QR code specification. I do have it, but it's not supported. This library focuses on ISO 18004:2006 and ISO 18004:2000, either international or Japanese. The majority of decoders support 18004:2000, the latest decoders mainly support 2006, and the good ones support both.

Silverlancer wrote Jul 3, 2014 at 8:34 PM

Also, Kanji is Japanese, not Chinese. Your case will most likely be UTF-8 99% of the time. Avoiding Simplified Chinese might help. Traditional Chinese often falls into the Kanji category, which results in a smaller size. But the Kanji table cannot be mixed with alphanumeric. If you check the Kanji table's rules you will find it requires pure Kanji.

Silverlancer wrote Jul 3, 2014 at 8:37 PM

https://qrcodenet.codeplex.com/discussions/274343

If you take a look at the research I did way back then, you might find something useful. I dug through the whole Kanji table and the related Japanese specification. QR Code was originally designed by the Japanese for their product labels.

Silverlancer wrote Jul 3, 2014 at 9:19 PM

The table of encodings supported in 8-bit byte mode is called the ECI table.

https://qrcodenet.codeplex.com/SourceControl/latest#Gma.QrCodeNet/Gma.QrCodeNet.Encoder/DataEncodation/ECISet.cs

It's under here, and I also have a link to the resource. The official ISO document for the ECI table costs money, which I don't have. Chinese is listed there at the bottom; I'm not sure how many decoders support it. You can simply add it to that dictionary list and give it a try. The main thing is to match the encoding table with that number. GB is 29. You also have to find the encoding table's name in .NET.

jonney3099 wrote Jul 4, 2014 at 12:27 AM

You wrote: "Chinese listed there at bottom, I'm not sure how many have been supporting it. You can simply add to that dictionary list and give it a try. Main thing is match encode table with that number. GB is 29. Also you have to find encode table name from .net."
Since GB (value 29) is in the ECI table, we should use code page 936 instead of 65001. I know most scanners have problems reading it, but the theory is right. It's just like the BOM you put in for 65001, which the ZXing Android application failed to read.

In function TryEncodeEightBitByte:

else if (length > 1)
    return index;
is supposed to be:
else if (length > 1 && index == contentLength - 1)
    return -1;

Silverlancer wrote Jul 4, 2014 at 2:34 AM

I don't think you understand what I did there, but it's ok.

What I did there reduces execution time: it can find the correct solution in O(n). Your proposal is O(n^2).

I have sorted the encoding tables so that the smallest table comes first, followed by the wider ones. If I have checked up to index 10, I'm not going to check the previous 10 chars again for a larger table; I continue from the 10th char onward. Once I find a table that can run to the end, I run that table again over the whole string. So in theory it's O(2n), but since constant factors are dropped in Big-O, we normally count it as O(n). If my algorithm course didn't fail me. :P
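
A rough sketch of that resume-from-failure search, written in Python rather than the library's C#. The function names and the candidate list are hypothetical, and this only illustrates the idea, not the actual TryEncodeEightBitByte code:

```python
def first_failure(text, codec, start):
    """Index of the first character at or after `start` that `codec` cannot encode, else len(text)."""
    for i in range(start, len(text)):
        try:
            text[i].encode(codec)
        except UnicodeEncodeError:
            return i
    return len(text)


def pick_table(text, tables=("iso-8859-1", "iso-8859-2", "utf-8")):
    """Try tables from narrowest to widest, resuming where the previous table failed."""
    index = 0
    for codec in tables:
        # Do not re-check characters that an earlier, narrower table already passed.
        index = first_failure(text, codec, index)
        # A table that reaches the end gets one confirming pass over the whole string.
        if index == len(text) and first_failure(text, codec, 0) == len(text):
            return codec
    raise ValueError("no table can encode the whole string")


print(pick_table("abc"), pick_table("abcł"), pick_table("ñł"))
```

The confirming pass matters for input such as "ñł": the second table reaches the end from index 1 but cannot encode the skipped "ñ", so the search falls through to UTF-8.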

I'm not going to add 29. It has a high failure rate: if 80~90% of decoders don't support it, there is no point in outputting it.

My guess is that most QR code usage nowadays is URIs/URLs. A URL shouldn't include any special chars, and you should always use a URL-shortening service to shorten it. This is best practice. So if you are doing anything with a URL that has special chars, you are doing it wrong.

Otherwise you just have to add that entry to the list and recompile. Since you have gone this far, you should know what's going on there.

Silverlancer wrote Jul 4, 2014 at 2:44 AM

Way back then, the ZXing encoder only supported UTF-8 and ISO 8859-1. I'm not sure about now. We had more tables, but I was thinking of removing them at some stage, since not many decoders support them. Not to mention Chinese GB.

For UTF-8, I remember testing against the online ZXing decoder before.
http://zxing.org/w/decode.jspx

I might do some tests later.

jonney3099 wrote Jul 4, 2014 at 3:44 AM

The BOM signature is &HEF, &HBB and &HBF. After I reversed those 3 bytes (&HBF, &HBB and &HEF), ZXing could read my QR code. ZXing can read the UTF-8 QR code generated by your .NET program.
Sorry for the confusion.

I am wondering why the ECI table has GB. Since the ECI header only has the ECI mode indicator and the ECI assignment number, without specifying a code page, how does the decoder know which code page the encoder used? This is the same stupid mistake as MS ANSI text files.

OK, I agree with you. Just ignore GB and use UTF-8.

Silverlancer wrote Jul 4, 2014 at 5:35 AM

The decoder uses the ECI indicator and that number to decide which encoding to use. For GB, for example, the indicator is followed by 29. So when the decoder sees 29 and knows it's an ECI value, it uses the matching character encoding from its own programming language to decode.
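
As a sketch of what the decoder sees, assuming the usual QR layout of a 4-bit ECI mode indicator (0111) followed by an assignment number in 8, 16 or 24 bits. The function name and the small lookup table are illustrative, and the numbers in the table are recalled from commonly cited ECI assignments rather than taken from the library:

```python
# Illustrative subset of ECI assignment numbers mapped to Python codec names (an assumption).
ECI_CODECS = {3: "iso-8859-1", 20: "shift_jis", 26: "utf-8", 29: "gb18030"}


def eci_header_bits(assignment_number):
    """Return the ECI mode indicator plus the encoded assignment number as a bit string."""
    if not 0 <= assignment_number <= 999999:
        raise ValueError("ECI assignment number out of range")
    if assignment_number < 128:
        body = format(assignment_number, "08b")            # leading 0 + 7 value bits
    elif assignment_number < 16384:
        body = "10" + format(assignment_number, "014b")    # 10 + 14 value bits
    else:
        body = "110" + format(assignment_number, "021b")   # 110 + 21 value bits
    return "0111" + body


print(eci_header_bits(29), ECI_CODECS[29])
```

A decoder reading 0111 then 00011101 knows the following byte-mode data is in whatever character set it maps to 29, which is why no separate code page field is needed.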

jonney3099 wrote Jul 4, 2014 at 7:16 AM

Thanks for your explanation. I know better now.

jonney3099 wrote Jul 5, 2014 at 1:51 AM

//in TryEncodeKanji
else if((mostSignificantByte < 0x81 || mostSignificantByte > 0x9F) && (mostSignificantByte < 0xE0 || mostSignificantByte > 0xEB))
Is there any documentation supporting those boundaries?
I didn't see those hex values in the Kanji code table.

http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml

jonney3099 wrote Jul 5, 2014 at 4:47 AM

According to the Shift-JIS Kanji table, the first byte falls in 0x81-0x9F and 0xE0-0xEE, not the 0xE0-0xEB range you specified.

Maybe I am wrong again.

jonney3099 wrote Jul 5, 2014 at 10:08 AM

I found a document on how to add Chinese GB2312 support without using UTF-8. In the GB/T 18284-2000 specification (I can't find the original official PDF, but I got a photocopied one in Chinese), there is a new mode called Hanzi = 13 (binary 1101). We should add the GB sub-character-set indicator 0001 for GB2312 after the mode indicator 1101.

a. Put bits "1101", bit count is 4
b. Put bits "0001", bit count is 4
c. Character count indicator
d. Content encoded using code page 936, which is GB2312

Content encoding is similar to Kanji, with slight differences:

1. First byte in range 0xA1~0xAA and second byte in range 0xA1~0xFE:
a) (firstByte - 0xA1) * 0x60 + (secondByte - 0xA1)
b) write the result in 13-bit format

2. First byte in range 0xB0~0xFA and second byte in range 0xA1~0xFE:
a) (firstByte - 0xA6) * 0x60 + (secondByte - 0xA1)
b) write the result in 13-bit format

You can write a TryEncodeHanzi similar to TryEncodeKanji.
(Read http://ash.jp/code/cn/gb2312tbl.htm and http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml and you will understand how to check the GB pattern.)
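
The steps above can be sketched as follows, in Python rather than the library's C#. The function names are hypothetical, Python's `gb2312` codec stands in for code page 936, and the width of the character count indicator is left as a parameter because the post does not state it (the default of 8 bits is an assumption):

```python
def hanzi_value(first, second):
    """Map a GB2312 byte pair to the value described in the two rules above."""
    if not 0xA1 <= second <= 0xFE:
        raise ValueError("second byte out of range")
    if 0xA1 <= first <= 0xAA:
        return (first - 0xA1) * 0x60 + (second - 0xA1)
    if 0xB0 <= first <= 0xFA:
        return (first - 0xA6) * 0x60 + (second - 0xA1)
    raise ValueError("first byte out of range")


def encode_hanzi(text, count_bits=8):
    """Build the Hanzi-mode bit string: mode 1101, subset 0001, count, then 13 bits per character."""
    values = []
    for ch in text:
        raw = ch.encode("gb2312")
        if len(raw) != 2:
            raise ValueError("not a double-byte GB2312 character: " + repr(ch))
        values.append(hanzi_value(raw[0], raw[1]))
    header = "1101" + "0001" + format(len(text), "0{}b".format(count_bits))
    return header + "".join(format(v, "013b") for v in values)


print(encode_hanzi("啊"))
```

Like Kanji mode, this only works when every character is double-byte, so a string such as "123支花朵" would still need mixing with another mode or a fallback.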

ZXing on Android can scan QR codes encoded in GB2312.

Edited: I still think your Recognise function can be simplified. (tryEncodePos = -1, -2, 0 or index makes the checking complicated; TryEncodeEightBitByte has the problem I mentioned earlier.)

Silverlancer wrote Jul 6, 2014 at 8:51 PM

That's the specification for China, one of the ISO 18004:200x family. I do have that specification as a PDF; it is in the same category as Kanji. I'm not sure how decoders deal with it, or how you identify whether the data is Kanji or GB.

As for the Kanji table, that if statement is the failure case: the byte is not in range. A return value of -1 means success. The check works but might not be perfect; I forget how I wrote it there. It should be changed to the following.

if(mostSignificantByte < 0x81 || (mostSignificantByte > 0x9F && mostSignificantByte < 0xE0) || mostSignificantByte > 0xEB)

{ return 0 or -2 }
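
For comparison, here is a small Python sketch (not the library's C#) of the same lead-byte rejection together with the 13-bit Kanji value. The helper names are hypothetical, and the ranges 0x8140-0x9FFC and 0xE040-0xEBBF are taken from memory of the QR Kanji mode definition, which would explain the 0xEB upper bound even though Shift JIS itself defines lead bytes beyond it:

```python
def lead_byte_rejected(msb):
    """Mirror of the condition above: True when the lead byte cannot start a QR Kanji character."""
    return msb < 0x81 or (0x9F < msb < 0xE0) or msb > 0xEB


def kanji_value(msb, lsb):
    """Return the 13-bit QR Kanji value for a Shift JIS byte pair, or None if out of range."""
    code = (msb << 8) | lsb
    if 0x8140 <= code <= 0x9FFC:
        code -= 0x8140
    elif 0xE040 <= code <= 0xEBBF:
        code -= 0xC140
    else:
        return None
    # Most significant byte times 0xC0, plus the least significant byte.
    return (code >> 8) * 0xC0 + (code & 0xFF)


print(format(kanji_value(0x93, 0x5F), "013b"))
```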

I know the return values -1, -2 and 0 aren't very clear. It would be nicer to use an enum or some other way of identifying the result, but that's what I came up with way back then.

If you can give an example where the 8-bit byte try-encode fails, please let me know. The simple answer is that I don't want to encode the same string 20+ times to find out which table is correct. The same goes for the final QR code mask evaluation; both use the same principle. Otherwise it would be as slow as ZXing's encoder. I know that from a normal coding point of view it looks weird. It has more to do with probability and pattern diagnosis.

Write a lot of different inputs from different tables and see whether it comes up with the correct result, then either try to understand why I did it that way or let me know where I failed. At least for now I cannot find a string that fails that 8-bit byte check.