8BitByte Encode and Test

Coordinator
Sep 30, 2011 at 8:09 AM

I have found two sources for mapping the 8BitByte JIS8 table to UTF-16, which is C#'s standard string encoding.

http://charset.7jp.net/jis0201.html

http://en.wikipedia.org/wiki/JIS_X_0201

The encoding for JIS8 should be "euc-jp" if we use C#'s GetEncoding("euc-jp").GetBytes(content). EUC-JP is used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212 and JIS X 0201.

I don't really understand ZXing's encoding check. If it doesn't restrict the encoding to specific values, the 8BitByte encoder might not work as intended. 8BitByte only supports values from 0x00 to 0xFF, and the bit count for each char is limited to 8.

 

ZXing's encoding string check:

System.String encoding = hints == null ? null : (System.String) hints[EncodeHintType.CHARACTER_SET];

Coordinator
Sep 30, 2011 at 8:27 AM

I would suggest you write a test against the reference implementation and play around with the proposed options.
We should not stick to the ZXing implementation, since it was ported from an old C implementation and doesn't use any of the out-of-the-box encoding functionality already available in modern Java and .NET. But we still need to remain 100% compatible with that proven implementation.

Coordinator
Sep 30, 2011 at 9:22 PM

I will try to write a test against the reference implementation later.

Also, I might have found the reason why ZXing uses ISO-8859-1 as the default encoding for 8BitByte.

https://www.kssn.net/StdForeign/corrDown.asp?category=ISO&fileName=CPDF043655E.pdf&stdNumber=ISOIEC%2018004_2006Cor%201_2009.pdf

See page 28, table 6. Where ISO/IEC 18004:2000 table 6 uses JIS8, this document uses ISO/IEC 8859-1 and calls the mode "Byte mode" instead of "8Bit-Byte mode", the same way ZXing represents its 8BitByte mode as Mode.byte. The problem is that the notes under that table still refer to JIS8, not ISO/IEC 8859-1, and the document carries no watermark.

I have also downloaded ISO/IEC 18004:2006 Cor 1:2009 from the official ISO website. Every fix to ISO/IEC 18004:2006 is listed at the correct page location, but the fixes start at page 55. There is nothing about page 28.

Coordinator
Oct 3, 2011 at 1:35 AM
Edited Oct 3, 2011 at 2:16 AM

I have just implemented the reference test for EightBitByteEncoder. I will explain the results below.

First, a change to the legacy EncoderInternal.cs.

EncoderInternal.append8BitBytes takes an "encoding" parameter, while EncoderTestCaseFactory.EncodeUsingReferenceImplementation simply passes "null" for it.

So I changed the code in the append8BitBytes method to suit each test.

Case 1. Auto-generated chars between "" and ""  [EncoderInternal.append8BitBytes Encoding = "EUC-JP"]

\xFF8B => JIS8 203 -  1100 1011  
Gma.QrCodeNet.Encoding.Tests.DataEncodation.EightBitByteEncoderTest.Test_against_reference_implementation("\xFF8B",1,0100000000011000111011001011):  Expected: equivalent to < False, True, False, False, False, False, False, False, False, False... >  But was:  < False, True, False, False, False, False, False, False, False, False... >

0100[8BitByte] 0000 0001[Num of char=1] 1000 1110[???] 1100 1011[Char \xFF8B]

\xFF96 => JIS8 214 - 1101 0110
Gma.QrCodeNet.Encoding.Tests.DataEncodation.EightBitByteEncoderTest.Test_against_reference_implementation("\xFF96",1,0100000000011000111011010110):  Expected: equivalent to < False, True, False, False, False, False, False, False, False, False... >  But was:  < False, True, False, False, False, False, False, False, False, False... >

0100[8BitByte] 0000 0001[Num of Char=1] 1000 1110[???] 1101 0110[Char \xFF96]

I really don't know why ZXing's code puts 1000 1110 there. Maybe it is related to [bytes = SupportClass.ToSByteArray(System.Text.Encoding.GetEncoding("EUC-JP").GetBytes(content));]
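One way to check whether the extra byte comes from the EUC-JP conversion itself is to rebuild the bit string by hand. The following is a minimal Python sketch, not ZXing's code: the function name byte_mode_bits is hypothetical, and the count field is assumed to be the number of chars in 8 bits, which is what the test output quoted above shows.

```python
def byte_mode_bits(content: str, encoding: str) -> str:
    """Build mode indicator + character count + data bytes as a bit string."""
    data = content.encode(encoding)
    mode = "0100"                        # 8BitByte mode indicator
    count = format(len(content), "08b")  # assumption: count of chars, 8 bits wide
    payload = "".join(format(b, "08b") for b in data)
    return mode + count + payload


if __name__ == "__main__":
    # Same char and encoding as the first failing test in Case 1.
    print(byte_mode_bits("\uff8b", "euc_jp"))
```

If the printed string matches the expected value from the failing test, the 1000 1110 group is produced by the encoding step, before any QR-specific logic runs.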

Case 2. Auto-generated chars between "A" and "Z" [EncoderInternal.append8BitBytes Encoding = "EUC-JP"]

No error. All passed

Case 3. Auto-generated chars between "A" and "z" [EncoderInternal.append8BitBytes Encoding = "EUC-JP"]

Errors occur when the auto-generated content contains the char "\", because JIS8 doesn't have "\" and uses "¥" [yen sign] in its place. As long as the generated content doesn't contain "\", there are no errors.
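The "\" gap is easier to see with the JIS8 table written out. Below is a minimal Python sketch of a char-to-byte table following the JIS X 0201 layout from the links in the first post. The function name build_jis8_table is hypothetical, and control codes 0x00 to 0x1F are assumed to match ASCII.

```python
def build_jis8_table() -> dict:
    """Map Unicode chars to JIS X 0201 byte values (sketch)."""
    table = {chr(code): code for code in range(0x00, 0x80)}
    # JIS X 0201 replaces two ASCII positions with other symbols.
    del table["\\"]
    table["\u00a5"] = 0x5C   # yen sign sits where ASCII has the backslash
    del table["~"]
    table["\u203e"] = 0x7E   # overline sits where ASCII has the tilde
    # Half-width katakana block: 0xA1..0xDF maps to U+FF61..U+FF9F.
    for code in range(0xA1, 0xE0):
        table[chr(0xFF61 + (code - 0xA1))] = code
    return table


if __name__ == "__main__":
    jis8 = build_jis8_table()
    print("\\" in jis8)               # backslash has no entry in this table
    print(hex(jis8["\u00a5"]))        # byte used for the yen sign
```

With a table like this, any content containing "\" cannot be encoded, which matches the Case 3 failures.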

Case 4. Auto-generated chars between "À" and "ÿ" [EncoderInternal.append8BitBytes Encoding = "ISO-8859-1"]

[Table for ISO-8859-1: http://en.wikipedia.org/wiki/ISO/IEC_8859-1]

\x00F1 241 -  1111 0001
Gma.QrCodeNet.Encoding.Tests.DataEncodation.EightBitByteEncoderTest.Test_against_reference_implementation("\x00F1",1,01000000000111110001):  Expected: equivalent to < False, True, False, False, False, False, False, False, False, False... >  But was:  < False, True, False, False, False, False, False, False, False, False... >
0100[8BitByte] 0000 0001[Num of char = 1] 1111 0001[\x00F1]  There is no strange "1000 1110" byte here.

My EightBitByteEncoder follows ISO/IEC 18004, first edition (2000-06-15). It uses the JIS8 table instead of ISO-8859-1. That's why Case 4 is full of errors.

If I don't change the encoding in EncoderInternal.append8BitBytes and keep the default [bytes = SupportClass.ToSByteArray(System.Text.Encoding.GetEncoding(encoding).GetBytes(content));], it throws an exception, because encoding cannot be "null".
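A guard against the missing encoding could look like the Python sketch below. Every name here (DEFAULT_BYTE_MODE_ENCODING, resolve_encoding, to_bytes, the "CHARACTER_SET" key) is hypothetical, and falling back to ISO-8859-1 is an assumption based on the ZXing default mentioned earlier in this thread.

```python
# Assumption: fall back to ISO-8859-1 when the caller gives no encoding.
DEFAULT_BYTE_MODE_ENCODING = "iso-8859-1"


def resolve_encoding(hints) -> str:
    """Return the encoding named in hints, or the default when absent."""
    if hints is None:
        return DEFAULT_BYTE_MODE_ENCODING
    return hints.get("CHARACTER_SET") or DEFAULT_BYTE_MODE_ENCODING


def to_bytes(content: str, hints=None) -> bytes:
    """Encode content with the resolved encoding instead of failing on None."""
    return content.encode(resolve_encoding(hints))


if __name__ == "__main__":
    print(to_bytes("\u00f1"))                                  # no hints given
    print(to_bytes("\uff8b", {"CHARACTER_SET": "shift_jis"}))  # explicit hint
```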

That's my testing so far. Currently ZXing's code cannot be used to test the JIS8 table because of that strange "1000 1110" byte.

Also, if the document (ISO/IEC 18004:2006, second edition) I found earlier is genuine, then using the ISO-8859-1 table for 8BitByte is reasonable. But we cannot tell the two apart during decoding, because 8BitByte mode doesn't indicate which char table it uses. So our 8BitByte code will be for either the second edition only or the first edition only.
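The decoding ambiguity can be shown with a single byte. This is a small Python sketch using the standard codecs, with Shift_JIS standing in for the JIS8 table (an assumption, since it shares the same single-byte katakana values).

```python
payload = bytes([0xCB])  # one data byte as it would sit in an 8BitByte segment

as_latin1 = payload.decode("iso-8859-1")  # second-edition reading of the byte
as_jis8 = payload.decode("shift_jis")     # first-edition (JIS8) reading

if __name__ == "__main__":
    # Same byte, two different characters depending on the assumed table.
    print(hex(ord(as_latin1)), hex(ord(as_jis8)))
```

Nothing in the segment header says which of the two readings the encoder intended.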

Coordinator
Oct 3, 2011 at 2:14 AM
Edited Oct 3, 2011 at 2:20 AM

I have found the reason behind the "1000 1110" byte. That byte is 0x8E, as explained at the link below.

http://www.sljfaq.org/afaq/encodings.html#encodings-EUC-JP

EUC-JP is a combination of JIS X 0201, JIS X 0208 and JIS X 0212.

JIS X 0201: Two-byte encoding. 1st byte: 0x8E. 2nd byte: raw JIS X 0201 byte.
JIS X 0208: Two-byte encoding. Just take the raw JIS X 0208 two-byte code and set the top bit of each byte.
JIS X 0212: Three-byte encoding. 1st byte: 0x8F. 2nd and 3rd bytes: take the raw JIS X 0212 code and set the top bit of each byte.

Our JIS8 table comes from JIS X 0201, so 0x8E is placed before each char.

Looking further into the encoding tables, I found that Shift_JIS can be used for JIS X 0201, so I changed the encoding in EncoderInternal.append8BitBytes accordingly. The reference tests are now all green for auto-generated chars between "" and "".
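The difference between the two encodings can be checked directly. This is a minimal Python sketch using the standard euc_jp and shift_jis codecs; the range U+FF61 to U+FF9F is the half-width katakana block that JIS X 0201 places at 0xA1 to 0xDF, and the helper name encode_both is hypothetical.

```python
def encode_both(ch: str):
    """Return the EUC-JP and Shift_JIS byte sequences for one char."""
    return ch.encode("euc_jp"), ch.encode("shift_jis")


if __name__ == "__main__":
    for cp in (0xFF8B, 0xFF96):  # the two chars from the failing tests
        euc, sjis = encode_both(chr(cp))
        print(hex(cp), euc.hex(), sjis.hex())
```

EUC-JP yields two bytes starting with 0x8E, while Shift_JIS yields the raw JIS X 0201 byte alone, which is what an 8-bit-per-char mode needs.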

Coordinator
Oct 3, 2011 at 4:34 AM
Edited Oct 3, 2011 at 9:19 AM

Hi gmamaladze. I was curious how QR Code could apply only to English and Japanese characters, so I did some research into QR Code's encoding. Here is one of the documents I found.

http://qrbcn.com/imatgesbloc/Three_QR_Code.pdf

See page 063, "Efficiently encoding of Kanji and Kana characters". According to that document, different countries can have their own rules for 8BitByte encoding and Kanji encoding. So for most European countries, ISO-8859-1 should be a better choice for 8BitByte encoding than JIS8, since it covers most European languages. Thus we need to add an encoding parameter to the encoder method.

The rule for 8BitByte encoding and decoding should be: as long as the encoding represents each char in one byte, we accept it. The application, not the library, should choose which encoding to use for 8BitByte.
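The one-byte-per-char rule could be checked roughly as in the Python sketch below. The function name fits_8bit_byte is hypothetical, and treating an unencodable char as a rejection is an assumption about how the library would want to behave.

```python
def fits_8bit_byte(content: str, encoding: str) -> bool:
    """True when every char of content encodes to exactly one byte."""
    try:
        return all(len(ch.encode(encoding)) == 1 for ch in content)
    except UnicodeEncodeError:
        # The chosen encoding has no representation for some char.
        return False


if __name__ == "__main__":
    print(fits_8bit_byte("\u00f1", "iso-8859-1"))  # Latin-1 char, Latin-1 table
    print(fits_8bit_byte("\uff8b", "shift_jis"))   # half-width katakana, Shift_JIS
    print(fits_8bit_byte("\uff8b", "euc_jp"))      # same char, EUC-JP adds 0x8E
```

The application picks the encoding, and the library only verifies that the content fits in one byte per char under that choice.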

Edit: I got the Kanji part wrong. I will try to finish the Kanji encoder tomorrow.

Coordinator
Oct 3, 2011 at 11:26 AM

Great document. I will refer to it on the start page.