Friday, December 4, 2009

Trimming JSON

JSON is a fairly succinct data-interchange format. For instance, say I have two people, Bob and Alice, and some very private records on them:

{
  "Bob": {
    "SSN": "502-80-2012",
    "CCNumber": "3842-4234-1023",
    "Mother's Maiden Name": "Swanson"
  },
  "Alice": {
    "SSN": "312-45-1231",
    "CCNumber": "8481-3231-2234",
    "Mother's Maiden Name": "Swathmore"
  }
}

Now say I wanted to put this imaginary pseudo structure in "JSON format". Conveniently enough, it already is. No really. I'm not kidding. That's it. We are actually done. Or, in a more compact form:

{"Bob":{"SSN": "502-80-2012","CCNumber": "3842-4234-1023","Mother's Maiden Name": "Swanson"},"Alice":{"SSN": "312-45-1231","CCNumber": "8481-3231-2234","Mother's Maiden Name": "Swathmore"}}
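Going between the pretty form and the compact form is exactly what the built-in JSON functions do (native in newer browsers; older ones need Crockford's json2.js). A quick sketch:

```javascript
// The pretty form and the compact form are the same data.
// JSON.stringify emits the compact form with no insignificant whitespace.
var people = {
  "Bob":   { "SSN": "502-80-2012", "CCNumber": "3842-4234-1023", "Mother's Maiden Name": "Swanson" },
  "Alice": { "SSN": "312-45-1231", "CCNumber": "8481-3231-2234", "Mother's Maiden Name": "Swathmore" }
};
var compact = JSON.stringify(people); // one long line, like the string above
var back = JSON.parse(compact);       // and it round-trips losslessly
```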

Unnecessary Syntax
Now that's all gravy and really useful, but wait: did you know that JavaScript has only a few reserved words, and that almost anything else can safely be used as an unquoted label, as long as it doesn't contain a space or some other reserved character? The stuff above could then become:

Before: (188b)
{"Bob":{"SSN": "502-80-2012","CCNumber": "3842-4234-1023","Mother's Maiden Name": "Swanson"},"Alice":{"SSN": "312-45-1231","CCNumber": "8481-3231-2234","Mother's Maiden Name": "Swathmore"}}

After: (173b)
{Bob:{SSN:"502-80-2012",CCNumber:"3842-4234-1023","Mother's Maiden Name": "Swanson"},Alice:{SSN:"312-45-1231",CCNumber:"8481-3231-2234","Mother's Maiden Name": "Swathmore"}}

Those extra bytes don't need to be transferred, really. So don't send them. (One caveat: with unquoted labels this is no longer strict JSON but a JavaScript object literal, so it has to be eval'd rather than handed to a strict JSON parser.)
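The server-side pass for this can be as small as one (admittedly naive) regex. A sketch, with its big assumption spelled out: it presumes no quoted *value* ever contains text of the form `"word":`, which is exactly the kind of code/data mixing that would break it.

```javascript
// Naive sketch: strip quotes from keys that are already valid identifiers.
// Keys with spaces or other reserved characters (e.g. "Mother's Maiden Name")
// are left quoted. Assumes no quoted VALUE contains "word": -style text.
function unquoteLabels(json) {
  return json.replace(/"([A-Za-z_$][A-Za-z0-9_$]*)"\s*:/g, "$1:");
}
```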

Zero-information data
Some of the information above carries zero information. I could, for example, specify that the fields will always arrive in this order:
  1. SSN
  2. CCNumber
  3. Mother's Maiden Name
As I did above. Given that, the labels carry zero information and ought to be eliminated if you care about space. The data structures could then be converted into arrays as follows:

Before: (173b)
{Bob:{SSN:"502-80-2012",CCNumber:"3842-4234-1023","Mother's Maiden Name": "Swanson"},Alice:{SSN:"312-45-1231",CCNumber:"8481-3231-2234","Mother's Maiden Name": "Swathmore"}}

After: (101b)
{Bob:["502-80-2012","3842-4234-1023","Swanson"],Alice:["312-45-1231","8481-3231-2234","Swathmore"]}

There! Removing Zero-information data is quite a space saver - as long as you are competent enough to know how to keep your code and your data separate.
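The agreed field order can live in one small array shared by both ends. A sketch of the pack/unpack pair it implies:

```javascript
// Fix the field order once, on both ends, and the labels vanish from the wire.
var FIELDS = ["SSN", "CCNumber", "Mother's Maiden Name"];

// server side: {SSN: ..., CCNumber: ..., ...} -> ["...", "...", "..."]
function pack(record) {
  var out = [];
  for (var i = 0; i < FIELDS.length; i++) out.push(record[FIELDS[i]]);
  return out;
}

// client side: ["...", "...", "..."] -> {SSN: ..., CCNumber: ..., ...}
function unpack(row) {
  var rec = {};
  for (var i = 0; i < FIELDS.length; i++) rec[FIELDS[i]] = row[i];
  return rec;
}
```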

Conceptual Run-time Analysis

The above optimizations will cost you a few regular expressions on the server side before emitting the payload, but that cost is recouped on the client side a few ways:
  1. Less data on the wire
  2. Smaller string to parse
  3. Array indexes are faster than hashmap lookups --- they just are.
So it's a trade between cumulative (server) and (client) time:
extra regex work vs. less data transfer, less parsing, and less hash overhead.
And seeing as the server-side environment is much more diverse and gives you many more options than the client side (where it's JavaScript), there's probably more reason to believe this is a feasible net gain than not.

Questionable Optimizations
There are further steps you can take for byte-count reduction --- but it is fairly questionable whether they yield a cumulative performance gain. I've broken them down into a few methods:

Data Structure Flattening
Some background
In C, when you declare a multi-dimensional array, say 50x50, you can either index the array as you defined it, using something like [4][20]; or you can ignore the declared shape and treat the data structure as it exists in memory: a flat sequence of 2500 units. Cast it to a generic pointer type and index it at [4 * 50 + 20] --- sometimes that's easier, sometimes it isn't.
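The same index arithmetic translates directly to JavaScript. A sketch (ROWS and COLS are whatever your data's shape happens to be):

```javascript
// Treat a flat array as a ROWS x COLS grid: [row][col] becomes [row * COLS + col].
var ROWS = 50, COLS = 50;
var grid = new Array(ROWS * COLS);
function get(row, col)    { return grid[row * COLS + col]; }
function set(row, col, v) { grid[row * COLS + col] = v; }
```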

Flattening JSON
Let's revisit our previous datastructure and the principles of Zero Information. If we know that the structure will be something like this:

{ name: [3 fields], name: [3 fields], ... name: [3 fields] }


Then we can use a trick very similar to the C one above: just make the data structure one gigantic array:

Before: (101b)
{Bob:["502-80-2012","3842-4234-1023","Swanson"],Alice:["312-45-1231","8481-3231-2234","Swathmore"]}

After: (99b)
["Bob","502-80-2012","3842-4234-1023","Swanson","Alice",
"312-45-1231","8481-3231-2234","Swathmore"]

There are a few important things we have to give up. First, we no longer have labels, so the quotes have to go back in, adding a few bytes; in exchange the syntax gets a bit simpler. Iterating through this can be done in many ways, and the optimal one depends largely on how the data is used. It is questionable, however, whether the additional index arithmetic is worth the couple of bytes. But there is one more optimization we can do along these lines.
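One way to iterate the flattened form, as a sketch: walk the array in strides of four (the name plus the three known fields).

```javascript
// Each record occupies 4 consecutive slots: name, SSN, CCNumber, maiden name.
var flat = ["Bob", "502-80-2012", "3842-4234-1023", "Swanson",
            "Alice", "312-45-1231", "8481-3231-2234", "Swathmore"];

function eachPerson(arr, fn) {
  for (var i = 0; i < arr.length; i += 4) {
    fn(arr[i], arr[i + 1], arr[i + 2], arr[i + 3]);
  }
}
```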

Stringifying
If we are willing to take the math overhead in stride, why not go all the way and simply make the data CSV, where you only need to quote fields that contain reserved characters:

Before: (99b)
["Bob","502-80-2012","3842-4234-1023","Swanson","Alice",
"312-45-1231","8481-3231-2234","Swathmore"]


After: (83b)
"Bob,502-80-2012,3842-4234-1023,Swanson,Alice,312-45-1231,8481-3231-2234,Swathmore"

Getting that string back into a form usable from JavaScript is the debatable part. You can't just split on ',' since you need to account for quoted fields that contain commas (and for escaped characters).

So you write yourself a nice regex to do the split and end up with an array. But this is foolish. What you really ought to do is use the exec method and work your way through the string. That is, of course, quite a bit more work than a plain split.
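An exec-based walk might look like this --- a sketch that assumes fields are either bare (no commas) or double-quoted with no embedded quotes, which is nowhere near full CSV:

```javascript
// Tokenize one CSV-ish row by repeatedly exec-ing a global regex.
// Alternative 1 grabs a double-quoted field; alternative 2 grabs a bare one.
function parseRow(s) {
  var fields = [];
  var re = /"([^"]*)"|([^,]+)/g;
  var m;
  while ((m = re.exec(s)) !== null) {
    fields.push(m[1] !== undefined ? m[1] : m[2]);
  }
  return fields;
}
```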

More work means more JavaScript code: more lines to execute, more lines to compile, and so on. Is it worth it? Eh, I don't know ... really, I don't.

Tables
Another thing we can do is use substitution tables. Let's say that after you've applied the zero-information removal techniques above, you analyze your data structure and still see a lot of redundancy --- simply because one user is very active, or something else of that nature. You can then build a static lookup table, implemented both in JS and in your server-side language of choice, and swap entries out during JSON generation. For instance, pretend we have this data set:

{Alice:["Marina Del Rey","California"],Bob:["Marina Del Rey","California"],Eve:["Playa Del Rey", "California"],Doug:["Playa Del Rey","California"]}


This would be likely if, say, you are a local company with clients concentrated in these areas. You already have fancy gzip compression and decompression on the server and the client, so you think you're covered --- but not really:
  1. gzip still needs to pass over the strings on both sides: parse them, build its tables, allocate memory, etc. That's real CPU time you don't need to spend.
  2. JavaScript needs to do the same sort of work.
So you just make a simple table:

var C = {
  MDR: "Marina Del Rey",
  CA:  "California",
  PDR: "Playa Del Rey"
};

Expose this on both ends and put it in your JSON encoder on the server side:

Before: (149b)
{Alice:["Marina Del Rey","California"],Bob:["Marina Del Rey","California"],Eve:["Playa Del Rey", "California"],Doug:["Playa Del Rey","California"]}

After: (74b)
{Alice:[C.MDR,C.CA],Bob:[C.MDR,C.CA],Eve:[C.PDR,C.CA],Doug:[C.PDR,C.CA]}

Not only have you saved a significant number of bytes, but those entries are now just references to pre-existing strings in the table, so JS doesn't have to allocate new memory for them.
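The server-side swap can be sketched as a single regex pass over the emitted JSON. Note the assumption baked in: any quoted string that exactly matches a table entry gets replaced, so the table keys had better never collide with data you want left alone.

```javascript
// Hypothetical encoder pass: swap known string literals for table references.
// Maps the full quoted value to an (unquoted) C.XXX identifier.
var TABLE = {
  "Marina Del Rey": "C.MDR",
  "California":     "C.CA",
  "Playa Del Rey":  "C.PDR"
};

function substitute(json) {
  return json.replace(/"([^"]*)"/g, function (match, s) {
    return TABLE.hasOwnProperty(s) ? TABLE[s] : match;
  });
}
```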

Dynamic Tables

If you really enjoy hitting your server CPU hard, you could create dynamic tables:

Before: (149b)
{Alice:["Marina Del Rey","California"],Bob:["Marina Del Rey","California"],Eve:["Playa Del Rey", "California"],Doug:["Playa Del Rey","California"]}

After: (136b)
{C:{MDR:"Marina Del Rey",CA:"California",PDR:"Playa Del Rey"},
Alice:[C.MDR,C.CA],Bob:[C.MDR,C.CA],Eve:[C.PDR,C.CA],Doug:[C.PDR,C.CA]}

But this is not recommended, because it abandons the zero-information principle, which is to remove, not just rearrange. All you have done is complicate things for what may work out to be no byte gain at all, and it will certainly be more CPU-intensive to generate and parse on both the client and the server.
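For completeness: the "After" payload above can't even be evaluated as a single object literal, since C isn't in scope while the object defining it is still being constructed. It would have to be emitted as (at least) two statements, something like:

```javascript
// The dynamic table must be defined before anything can reference it,
// so the payload becomes two statements rather than one literal.
var C = { MDR: "Marina Del Rey", CA: "California", PDR: "Playa Del Rey" };
var data = {
  Alice: [C.MDR, C.CA], Bob:  [C.MDR, C.CA],
  Eve:   [C.PDR, C.CA], Doug: [C.PDR, C.CA]
};
```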

2 comments:

  1. > Array indexes are faster than hashmap lookups --- they just are.

    JavaScript's Array is NOT a linear allocation of memory where indexes are just memory offsets. Even numeric indexes are object properties. JavaScript has no real arrays in the classical sense.

    This article is nonsense; ignore it.

  2. I must be dreaming, but it seems like you're advocating not using gzip!? Manually replacing string segments on both the server and client side is never going to be faster than gzip. Never mind the fact that you have to maintain this ridiculous conversion table. Wtf.

    I agree with Timothy. These suggestions are ridiculous.

