Thursday, December 17, 2009

JS Array to object

So I wanted to be able to use an object in JSON but transfer it as an array.

The issue is that the el.name syntax is nice, but {name:"joe"} is longer than just ["joe"].

The way around it is very simple actually.

We'll start with a normal object
{
  firstname: "Joe",
  lastname: "Shmoe",
  address: "123 Twiddledeedoo Lane"
}

Now we do this in javascript:

var keys = ["firstname", "lastname", "address"], obj = {};

And we transfer ["Joe","Shmoe","123 Twiddledeedoo Lane"] as the payload.

Now the code is fairly straightforward. Assuming:

var values = ["Joe","Shmoe","123 Twiddledeedoo Lane"];

To efficiently get back to the object we do the following:

do {
  obj[keys.pop()] = values.pop();
} while (keys.length);

And that's it. It's so straightforward ... I don't know why it was a mystery for so long...
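Putting the pieces together, a minimal runnable sketch (I've swapped the do/while for a plain while so an empty key list is harmless; note that pop() empties both arrays as it goes):

```javascript
// The key order is agreed upon by both ends; only the values go on the wire.
var keys   = ["firstname", "lastname", "address"];
var values = ["Joe", "Shmoe", "123 Twiddledeedoo Lane"];
var obj    = {};

// Consume both arrays from the back; pop() is cheap since nothing shifts.
while (keys.length) {
  obj[keys.pop()] = values.pop();
}

// obj is now {firstname: "Joe", lastname: "Shmoe",
//             address: "123 Twiddledeedoo Lane"}
```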

Tuesday, December 15, 2009

Friday, December 11, 2009

Stupid ssh script

I'm usually too lazy to go through the dance of sending over an auth key to the host. So I wrote a dumb script to help me, here it is: http://qaa.ath.cx/authadd.txt

There's a little bit of scary logic in there to avoid duplicates... You may want to take it out.

Journeys through PHP

Over the years I've used a number of server-side languages: C, Perl, C Shell; I even wrote my own web server for kicks once. Sounds very 1997, right?

Back to 2009...

So I had to make this website for this company, a big idea company, from scratch. I was a systems programmer who wanted to become familiar with web stuff.

The feasible options in my subjective mind were:

Java

Java was a pain to use in college, and I never really understood Tomcat. Also, I just personally feel like Java requires more thoughtful design and more work. I hate doing both those things.

Scala

Too young still. No really, I don't believe it has proven itself yet. I'm sure it's full of crazy bugs. Would you consider twitter stable? It uses Scala.

Perl

Too old. No I mean ... this would have been ok, I guess. Slashdot is probably still in Perl ... as is, I dunno. It's a good language - probably wouldn't have been a bad choice ... oh well. So much for that.

Ruby

I don't know what is up with this language. People love it one day then hate it the next. I don't think I've heard of anyone at like, the big 50 sites talking about it; so until that happens, until I see an article entitled something like
How we reduced latency and increased load on ebay general search with a distributed Ruby based approach
Until that day, it's really not worth looking into.

Python

Oh, a language where print had to be redesigned to be more 'pythonic'. These people actually get bent out of shape over some imaginary philosophy that constitutes "Good Programming Language Design". They are like the Libertarians of the programming world. Maybe there is some manifesto nobody has shown me or some magazine I'm not subscribed to.

It's a fine language, but when basic things require a runaround (like a switch statement), I'm told, "But this; this is 'proper'," the language purists argue, referencing some invisible playbook from which all their assertions come.

I have to give it some credit, however: when you want to do non-trivial, advanced things, Python handles them almost as well as Perl. Only I guess using Python instead of Perl will make me more friends amongst the wine-sipping, dining philosophers of the programming world.

PHP

Ah, the 'bottom rung', 'used dishrag', 'tragically maldesigned' accident of computer science. Why would anyone ever engage themselves in such a huge waste of time?

Well,
  • It's easy
  • It's usually fast
  • It's relatively bug free
  • It's philosophy-free
  • It can do almost anything that takes less than a page of math to explain
  • It integrates nicely with webpages
That sounds like a decent language. But for some reason, it's totally uncool.



Maybe because it's such an easy choice, the large amount of bad code from amateur programmers in PHP has given the language itself a bad reputation --- you know, like what happened with Java. Sorry Sun. :-(

Some of the distaste for PHP however, is well-founded.

The Bad Parts of PHP

Ok, so PHP is a fine language. There. I said it. Go unfriend me on facebook if you want. But really, it grabs things from a database and can easily emit html. That usually covers it.

The truth is, most of the time, if you are doing webpages, you can get by with any language that supports the following:
  • Conditionals (if/then/else)
  • Variables that vary (haha - take that Haskell)
  • Data structures (array)
  • Data resource connections (like pg_query)
  • Easy to use I/O routines (sorry Erlang)
  • Iterators (for, foreach, while, do)
And that's it. You can totally make a huge ass website if you just have those things in your language. This isn't asking much, and when you don't ask for much, PHP delivers!



Alas, there are some problems:

Concurrency Support

Concurrency isn't something that can be an after-thought. You can't just slap it onto a language like you can with SHA1 support. Well, you can, but then you get a mess. Ideally, you need a language that was carefully thought out (read: Python) or something designed for this mess (read: Erlang).

Overhead

Object memory overhead in PHP is obscene. Just ridiculous. Most web objects are small, on the order of kilobytes --- which PHP happily translates into megabytes. But say you want to do some Data Warehousing and you already have a huge framework in PHP, so PHP for the DW sounds like a reasonable choice. Oh, my friend, be prepared for a surprise.
  • PHP's hash tables don't scale well. Amortized O(1) my ass --- yeah, if 1 means 1 second.
  • PHP's memory limit is a 32-bit signed int --- it wraps around and does some ABS or something on itself --- so if you set it to, say, 2049M, you end up with a 1MB limit. You can try it yourself if you like.
  • Object-orientedness was tacked on ... once again, if it's going to be there, it has to be a core language design feature, not an after-thought. When you try to retrofit a traditional language with object-oriented syntax, you usually end up with something much more complex than it needs to be, i.e. C++.
As a result, doing anything more than C-style software engineering in PHP is just going to lead to bad news. In PHP, you need to embrace globals, carefully name your variables, and sort your code like it's a C program from back in the day. Then, when you hit the scaling point of multiple databases spanning multiple servers, you realize a problem:

No serious backend software is in PHP

Why does this matter? Well let's look at the DW again. If you are using a Java based system like say, Solr (Lucene), HBase (Hadoop), or Cassandra, then you have two options:
  1. Use a restful API, or in the case of Cassandra, Thrift.
  2. Program in Java
The problem with number 1 is that unless you have scaled your application over sooo many systems, so poorly, that your data is far away from your processing, you are duplicating data in memory. Sure it's fast, but it's also stupid. What you are saying is this:
As a matter of convenience and stubbornness and my decision to program in language X when using a library in language Y, I will now translate all the data from language Y to X through some crazy interface before I do my mapping and reducing steps.
And that, my dear, is horribly irresponsible.

Despite all this, I'm guilty of all the above and use PHP extensively. :-p

Wednesday, December 9, 2009

adding exclusionary search to craigslist

So I'm looking for a car, but craigslist doesn't support the - sign in search

Try doing -salvage in the car search ... doesn't work.

So I made a filter bar to exclude things from a craigslist search:

Get it here

Use it to type in obvious things...
Screen shot

Tuesday, December 8, 2009

Simplifying libpcap filter creation

Making a capture filter for tcpdump from wireshark has always been a pain in my mind. Maybe there is a nice tool out there to do it for me --- but I really don't know. So the useful syntax for tcpdump is basically:
(network protocol)[offset] == (decimal value)
For instance:
ether[100] == 123 and ether[102] == 124
I mean sure, some wankers probably want to do more with their packets but I'm not one of them. I simply have a typical packet like this:
 0000  02 00 00 00 45 00 02 8a  db 46 40 00 40 06 b0 6c   ....E... .F@.@..l
0010 43 7f e9 5a 4a 7d 35 64 fa 4c 00 50 a0 b6 33 f4 C..ZJ}5d .L.P..3.
0020 6b 36 b9 97 80 18 20 8a ca 1e 00 00 01 01 08 0a k6.... . ........
0030 5b 11 94 89 ca d6 48 ed 47 45 54 20 2f 20 48 54 [.....H. GET / HT
0040 54 50 2f 31 2e 31 0d 0a 48 6f 73 74 3a 20 67 6f TP/1.1.. Host: go
0050 6f 67 6c 65 2e 63 6f 6d 0d 0a 55 73 65 72 2d 41 ogle.com ..User-A
0060 67 65 6e 74 3a 20 4c 69 6e 6b 73 20 28 32 2e 31 gent: Li nks (2.1
0070 70 72 65 33 37 3b 20 46 72 65 65 42 53 44 20 37 pre37; F reeBSD 7
0080 2e 30 2d 52 45 4c 45 41 53 45 20 69 33 38 36 3b .0-RELEA SE i386;
0090 20 38 30 78 32 34 29 0d 0a 41 63 63 65 70 74 3a 80x24). .Accept:
00a0 20 2a 2f 2a 0d 0a 41 63 63 65 70 74 2d 45 6e 63 */*..Ac cept-Enc
00b0 6f 64 69 6e 67 3a 20 67 7a 69 70 2c 20 64 65 66 oding: g zip, def
00c0 6c 61 74 65 2c 20 62 7a 69 70 32 0d 0a 41 63 63 late, bz ip2..Acc
00d0 65 70 74 2d 43 68 61 72 73 65 74 3a 20 75 73 2d ept-Char set: us-
00e0 61 73 63 69 69 2c 20 49 53 4f 2d 38 38 35 39 2d ascii, I SO-8859-
00f0 31 2c 20 49 53 4f 2d 38 38 35 39 2d 32 2c 20 49 1, ISO-8 859-2, I
0100 53 4f 2d 38 38 35 39 2d 33 2c 20 49 53 4f 2d 38 SO-8859- 3, ISO-8
0110 38 35 39 2d 34 2c 20 49 53 4f 2d 38 38 35 39 2d 859-4, I SO-8859-
0120 35 2c 20 49 53 4f 2d 38 38 35 39 2d 36 2c 20 49 5, ISO-8 859-6, I
0130 53 4f 2d 38 38 35 39 2d 37 2c 20 49 53 4f 2d 38 SO-8859- 7, ISO-8
0140 38 35 39 2d 38 2c 20 49 53 4f 2d 38 38 35 39 2d 859-8, I SO-8859-
0150 39 2c 20 49 53 4f 2d 38 38 35 39 2d 31 30 2c 20 9, ISO-8 859-10,
0160 49 53 4f 2d 38 38 35 39 2d 31 33 2c 20 49 53 4f ISO-8859 -13, ISO
0170 2d 38 38 35 39 2d 31 34 2c 20 49 53 4f 2d 38 38 -8859-14 , ISO-88
0180 35 39 2d 31 35 2c 20 49 53 4f 2d 38 38 35 39 2d 59-15, I SO-8859-
0190 31 36 2c 20 77 69 6e 64 6f 77 73 2d 31 32 35 30 16, wind ows-1250
01a0 2c 20 77 69 6e 64 6f 77 73 2d 31 32 35 31 2c 20 , window s-1251,
01b0 77 69 6e 64 6f 77 73 2d 31 32 35 32 2c 20 77 69 windows- 1252, wi
01c0 6e 64 6f 77 73 2d 31 32 35 36 2c 20 77 69 6e 64 ndows-12 56, wind
01d0 6f 77 73 2d 31 32 35 37 2c 20 63 70 34 33 37 2c ows-1257 , cp437,
01e0 20 63 70 37 33 37 2c 20 63 70 38 35 30 2c 20 63 cp737, cp850, c
01f0 70 38 35 32 2c 20 63 70 38 36 36 2c 20 78 2d 63 p852, cp 866, x-c
0200 70 38 36 36 2d 75 2c 20 78 2d 6d 61 63 2c 20 78 p866-u, x-mac, x
0210 2d 6d 61 63 2d 63 65 2c 20 78 2d 6b 61 6d 2d 63 -mac-ce, x-kam-c
0220 73 2c 20 6b 6f 69 38 2d 72 2c 20 6b 6f 69 38 2d s, koi8- r, koi8-
0230 75 2c 20 6b 6f 69 38 2d 72 75 2c 20 54 43 56 4e u, koi8- ru, TCVN
0240 2d 35 37 31 32 2c 20 56 49 53 43 49 49 2c 20 75 -5712, V ISCII, u
0250 74 66 2d 38 0d 0a 41 63 63 65 70 74 2d 4c 61 6e tf-8..Ac cept-Lan
0260 67 75 61 67 65 3a 20 65 6e 2c 20 2a 3b 71 3d 30 guage: e n, *;q=0
0270 2e 31 0d 0a 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 .1..Conn ection:
0280 4b 65 65 70 2d 41 6c 69 76 65 0d 0a 0d 0a Keep-Ali ve....

And now I want to like, create a libpcap filter on the "GET" to detect whether it's a GET request.

Now I know what the wankers are saying:
"Well, what if this was some crazy ass packet over here with the boundary of one layer of the network stack just happening to translate into the ASCII character 'G' and then say, the signature or magic number of the next layer down translating into 'ET'. THEN WHAT? Then what?"

Um, then, I get those packets too; all zero of them. Really. I mean, get a life. That shit doesn't happen.

So let's go back to the real problem. I want to use wireshark and figure out how to write this ...

So I fire up wireshark and then go to the three pane view. I expand the "Hypertext Transfer Protocol" node, then the "GET" node and right click on "Request Method: GET". I go up to apply as filter then click "selected":

This seems quite reasonable. After clicking on it I do indeed get something in the filter syntax:

Absolutely stunning. I get this:
http.request.method == "GET"
That's not libpcap. Totally not it. You totally lose. I cannot pass that string to tcpdump. Luckily, the wireshark people have a command line capture form of wireshark, called tshark. Fantastic, let's see how to use it! :-)

$ tshark -h
TShark 1.0.3
Dump and analyze network traffic.
See http://www.wireshark.org for more information.

Copyright 1998-2008 Gerald Combs (email withheld by me) and contributors.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Usage: tshark [options] ...

Capture interface:
-i <interface> name or idx of interface (def: first non-loopback)
-f <capture filter> packet filter in libpcap filter syntax
-s <snaplen> packet snapshot length (def: 65535)
-p don't capture in promiscuous mode
-y <link type> link layer type (def: first appropriate)
-D print list of interfaces and exit
-L print list of link-layer types of iface and exit


Great, just great. You can't even accept your own syntax as a capture filter, just as a display one. I'm trying to reduce file size here...

What's the point?


So anyway, to make life easy I made a nice javascript app, as I love to do sooooo much, and you can use it here.

So we go back to my initial packet and copy and paste it in the gargantuan textarea and click the only button there 'process'.

After you click on the obvious blue links, something magical starts to happen:


The filter syntax shows up for you. You don't have to play CSI Miami and count the bytes with VB by hand or something any more. Welcome to the 1980s ... you can use a mouse and construct a filter with ease. Enjoy.

It's worth mentioning that there is actually no protocol level analysis being done. If you have a radiotap header, the silly javascript app won't know. If you are listening on a device without an ethernet layer, such as tun0 or tun1, it will also not know. You basically need to be listening on wired ethernet. But that isn't hard now, is it!

Sunday, December 6, 2009

JS Challenge #1

Self Evaluating JS
Here is an example:
<div id=toexec>
"Hello World<!--".replace("Hello World",'"Goodbye World"');//-->"
</div>
<button onclick=exec()>Exec</button>
<script>

function exec(){
var n = document.getElementById('toexec');
n.innerHTML = eval(n.innerHTML);
}
</script>

The Challenge
Actually there isn't one. I just really like that code.

http://qaa.ath.cx/pexec.html

CSS Challenge #3

No diagram for this.

The challenge is to define some CSS so that you can do basic math operations. I have a nice working example hidden away.

The specs are

class="number operator number"

where number is one of [one, two, three, four, five, six, seven, eight, nine]

and operator is either [plus, minus]

So that if you do
<div class='three plus eight' />

you'll see say, 11 squares or some other kind of unit.

Extra points if you can make negatives work too.

You ought to be able to do it in 12 definitions or less...

good luck!

CSS Challenge #2

Same HTML as #1, but the CSS must be within this:

div { /* You probably need at least this */ }
body{ /* and you can use this too */ }

CSS Challenge #1

So I give you the HTML, you give me the CSS:


<div style=border-color:lime />
<div style=border-color:yellow />
<div style=border-color:red />

Programming Challenge #2

If you can solve 1, you can solve 2

#include<stdio.h>
main(){
char ByModifyingThis[]="What happened?";
char printThis[]="You won!";

// your space

printf("%s",ByModifyingThis);
}

Programming Challenge #1

Ok, first to post a correct 1 line solution wins...

#include<stdio.h>
void gettohere()
{
printf("You win!");
}

int main(void)
{
// From Here
// Without calling gettohere();

return 0;
}

Friday, December 4, 2009

Trimming JSON

JSON is a fairly succinct markup language. For instance, say I have two people Bob and Alice, and I have very private records on them:

{
"Bob":
{
"SSN": "502-80-2012",
"CCNumber": "3842-4234-1023",
"Mother's Maiden Name": "Swanson"
},

"Alice":
{
"SSN": "312-45-1231",
"CCNumber": "8481-3231-2234",
"Mother's Maiden Name": "Swathmore"
}
}

Now say I wanted to put this imaginary pseudo structure in "JSON format". Conveniently enough, it already is. No really. I'm not kidding. That's it. We are actually done. Or, in a more compact form:

{"Bob":{"SSN": "502-80-2012","CCNumber": "3842-4234-1023","Mother's Maiden Name": "Swanson"},"Alice":{"SSN": "312-45-1231","CCNumber": "8481-3231-2234","Mother's Maiden Name": "Swathmore"}}

Unnecessary Syntax
Now that's all gravy and really useful, but wait: did you know that there are only a few reserved words in Javascript, and that everything else can safely be used as a label, as long as it doesn't contain a space or some other reserved character? The stuff above then could be:

Before: (188b)
{"Bob":{"SSN": "502-80-2012","CCNumber": "3842-4234-1023","Mother's Maiden Name": "Swanson"},"Alice":{"SSN": "312-45-1231","CCNumber": "8481-3231-2234","Mother's Maiden Name": "Swathmore"}}

After: (173b)
{Bob:{SSN:"502-80-2012",CCNumber:"3842-4234-1023","Mother's Maiden Name": "Swanson"},Alice:{SSN:"312-45-1231",CCNumber:"8481-3231-2234","Mother's Maiden Name": "Swathmore"}}

Those extra bytes don't need to be transferred, really. So don't do it.
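A rough sketch of how you might strip those quotes mechanically on the server side (the regex is my own and assumes keys never contain escaped quotes and don't collide with Javascript reserved words; a real encoder should be more careful):

```javascript
// Strip quotes from keys that are plain identifiers; quoted keys with
// spaces or other reserved characters (like "Mother's Maiden Name") are
// left alone because they don't match the identifier pattern.
function slimKeys(json) {
  return json.replace(/"([A-Za-z_$][A-Za-z0-9_$]*)"\s*:/g, "$1:");
}

console.log(slimKeys('{"Bob":{"SSN":"502-80-2012"}}'));
// {Bob:{SSN:"502-80-2012"}}
```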

Zero-information data
Some of the data above carries zero information. I could, for example, specify that I will always send the fields in this order:
  1. SSN
  2. CCNumber
  3. Mother's Maiden Name
As I did above. Given this, those labels contain zero information and ought to be eliminated if you care about space. The data structures, then, could be converted into arrays as follows:

Before: (173b)
{Bob:{SSN:"502-80-2012",CCNumber:"3842-4234-1023","Mother's Maiden Name": "Swanson"},Alice:{SSN:"312-45-1231",CCNumber:"8481-3231-2234","Mother's Maiden Name": "Swathmore"}}

After: (101b)
{Bob:["502-80-2012","3842-4234-1023","Swanson"],Alice:["312-45-1231","8481-3231-2234","Swathmore"]}

There! Removing zero-information data is quite a space saver --- as long as you are competent enough to keep your code and your data separate.
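On the emitting side, the conversion could look something like this (FIELDS is the agreed-upon order; the function name is mine):

```javascript
// Field order agreed upon by server and client; the labels carry
// zero information, so we drop them and keep only the values.
var FIELDS = ["SSN", "CCNumber", "Mother's Maiden Name"];

function toArrays(records) {
  var out = {};
  for (var name in records) {
    // Pull each field out in the fixed order.
    out[name] = FIELDS.map(function (f) { return records[name][f]; });
  }
  return out;
}

var slim = toArrays({Bob: {SSN: "502-80-2012",
                           CCNumber: "3842-4234-1023",
                           "Mother's Maiden Name": "Swanson"}});
console.log(slim.Bob); // ["502-80-2012", "3842-4234-1023", "Swanson"]
```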

Conceptual Run-time Analysis

The above optimizations will cost you a few regular expressions on the server side before emitting to the client, but they pay for themselves on the client side in a few ways:
  1. Less data on the wire
  2. Smaller string to parse
  3. Array indexes are faster than hashmap lookups --- they just are.
So it's the cumulative time of (server) and (client):
extra regex versus (data transfer + parsing + hash overhead)
And seeing as the server-side environment is much more diverse and gives you many more options than the client side (where it's javascript) --- there's probably more reason to believe that this is a feasible net gain than not.

Questionable Optimizations
There are further steps that you can do for byte-count reduction --- but it is fairly questionable whether there will be a cumulative performance gain. I've broken the further steps down to a few methods:

Data Structure Flattening
Some background
In C, when you declare a multi-dimensional array, say 50x50, you can either index the array as you defined it, say using something like [4][20]; or you could just ignore that and treat the data structure as it exists in memory: a sequence of 2500 units. You can cast the data structure to the generic pointer type and then index it at [4 * 50 + 20] --- sometimes it's easier, sometimes it isn't.

Flattening JSON
Let's revisit our previous datastructure and the principles of Zero Information. If we know that the structure will be something like this:

{ name: [3 fields], name: [3 fields], ... name: [3 fields] }


Then we could utilize a trick very similar to the C one discussed above. That is to say, just make the datastructure one gigantic array:

Before: (101b)
{Bob:["502-80-2012","3842-4234-1023","Swanson"],Alice:["312-45-1231","8481-3231-2234","Swathmore"]}

After: (99b)
["Bob","502-80-2012","3842-4234-1023","Swanson","Alice",
"312-45-1231","8481-3231-2234","Swathmore"]

There are a few important things we have to give up: first, we no longer have 'labels', so the quotes have to go back in, adding a few bytes. But as a result, we can simplify the syntax a bit. Iterating through this can be done in many ways, and the optimal one depends largely on the context of the data usage. It is questionable, however, whether the additional math overhead is worth the 2 bytes saved. But there is one more optimization we can do in this vein.
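The client-side index math this implies can be sketched like so (a stride of 4: one name plus 3 fields per record; the helper name is mine):

```javascript
// Reconstruct {name: [3 fields]} records from one flat array.
// Records are laid out back-to-back: name, field, field, field, name, ...
function unflatten(flat) {
  var obj = {};
  for (var i = 0; i < flat.length; i += 4) {
    obj[flat[i]] = flat.slice(i + 1, i + 4); // the 3 fields after the name
  }
  return obj;
}

var flat = ["Bob", "502-80-2012", "3842-4234-1023", "Swanson",
            "Alice", "312-45-1231", "8481-3231-2234", "Swathmore"];
console.log(unflatten(flat).Alice);
// ["312-45-1231", "8481-3231-2234", "Swathmore"]
```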

Stringifying
If we are willing to take the math overhead in stride, why don't we just go all the way and simply make the data CSV --- wherein you only need to quote things that contain the reserved characters:

Before: (99b)
["Bob","502-80-2012","3842-4234-1023","Swanson","Alice",
"312-45-1231","8481-3231-2234","Swathmore"]


After: (83b)
"Bob,502-80-2012,3842-4234-1023,Swanson,Alice,312-45-1231,8481-3231-2234,Swathmore"

Transferring the output into a form usable by Javascript is debatable. You can't just split on the ',' since you need to accommodate escaped characters.

So you could make yourself a nice regex to do the split and come up with an array at the end. But this is foolish. What you really ought to do is use the exec method and work your way through the string. This is, of course, quite a bit more work than the code above.

This more work means more javascript code, more lines to execute, more lines to compile, etc. Is it worth it? Eh, don't know ... really, I don't.
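A sketch of that exec() walk, under the assumption that literal commas in values are backslash-escaped (my convention, not anything standard); note it also drops empty fields:

```javascript
// Grab runs of "escaped char or anything that isn't a comma/backslash",
// then strip the escape backslashes out of each field.
function splitEscaped(s) {
  var re = /(?:\\.|[^,\\])+/g, out = [], m;
  while ((m = re.exec(s)) !== null) {
    out.push(m[0].replace(/\\(.)/g, "$1")); // unescape
  }
  return out;
}

console.log(splitEscaped("Bob,502-80-2012,Swanson\\, Jr."));
// ["Bob", "502-80-2012", "Swanson, Jr."]
```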

Tables
Another thing we could do is to have substitution tables. For instance, let's say that when you analyze your data-structure after you've done the Zero-information removal techniques above, you still see a lot of redundancy; simply because there is a user that is very active, or something else of that nature. You can then make a static lookup table and implement it in JS and your server side scripting language of choice. During the JSON generation you simply swap things out. For instance, pretend we have this data set:

{Alice:["Marina Del Rey","California"],Bob:["Marina Del Rey","California"],Eve:["Playa Del Rey", "California"],Doug:["Playa Del Rey","California"]}


This would be likely if, say, you are a local company and have local clients in these areas. You already have fancy gzip compression and decompression on the server and client side, so you think you are good there --- but not really:
  1. gzip still needs to pass over the strings on both sides; parse them, create the tables, allocate the memory - etc. It's real CPU time that you don't need to spend
  2. Javascript needs to do the same stuff.
So you just make a simple table:

var C = {
MDR:"Marina Del Rey",
CA:"California",
PDR:"Playa Del Rey"
};

Expose this on both ends and put it in your JSON encoder on the server side:

Before: (149b)
{Alice:["Marina Del Rey","California"],Bob:["Marina Del Rey","California"],Eve:["Playa Del Rey", "California"],Doug:["Playa Del Rey","California"]}

After: (74b)
{Alice:[C.MDR,C.CA],Bob:[C.MDR,C.CA],Eve:[C.PDR,C.CA],Doug:[C.PDR,C.CA]}

Not only have you saved significant bytes, but now those entries are just pointers to pre-existing strings in a table, so JS doesn't have to allocate new memory for it.
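A sketch of the client side of that scheme: with C defined in code on both ends, a plain eval (which is how JSON payloads were commonly decoded anyway) resolves the references to the shared strings. The parentheses around the payload are the usual trick to make eval treat it as an object literal:

```javascript
// The table ships with the code on both ends; only short refs travel.
var C = {
  MDR: "Marina Del Rey",
  CA:  "California",
  PDR: "Playa Del Rey"
};

// Eval the payload with C in scope; every C.xxx resolves to the same
// pre-existing string, so repeats cost a reference, not a fresh copy.
var payload = "({Alice:[C.MDR,C.CA],Eve:[C.PDR,C.CA]})";
var data = eval(payload);
console.log(data.Eve[0]); // "Playa Del Rey"
```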

Dynamic Tables

If you really enjoy hitting your server CPU hard, you could create dynamic tables:

Before: (149b)
{Alice:["Marina Del Rey","California"],Bob:["Marina Del Rey","California"],Eve:["Playa Del Rey", "California"],Doug:["Playa Del Rey","California"]}

After: (136b)
{C:{MDR:"Marina Del Rey",CA:"California",PDR:"Playa Del Rey"}
,
Alice:[C.MDR,C.CA],Bob:[C.MDR,C.CA],Eve:[C.PDR,C.CA],Doug:[C.PDR,C.CA]}

But this is not recommended, because you aren't applying the zero-information principle, which is to remove, not just rearrange. What you have done above is just complicate things for what may actually work out to be no byte gain at all, and it will certainly be more CPU-intensive to generate and parse on both the client and the server side.

Geoiplookup script

I found a syntax error in one of our javascript files for the insertion of Google Analytics. This of course meant that the results were flatlined:


On December 4th, however, we officially 'launched'. There are now two problems:
  1. How much traffic did we get on Dec 2 - 4?
  2. Where did this traffic come from?
The first question is relatively easy to answer. Since we use nginx as a proxy for apache, I went over to /var/log/nginx and gunzip'd a few log files... nothing fancy.

How much traffic did we get on Dec 2 - 4?
To find out how many unique visitors there were over the span of four log files we do the following pseudo-code:

print logs | show only the ip address for each line | find the unique ones | give me the count

This becomes

$ cat www.izuu.com.access.log www.izuu.com.access.log.[1-3] | awk ' { print $1 } ' | sort | uniq | wc -l
315

Not bad. To do it date based, we just add a grep line in the stack:

December 2nd:
$ cat www.izuu.com.access.log www.izuu.com.access.log.[1-3] | grep 02\/Dec | awk ' { print $1 } ' | sort | uniq | wc -l
52


December 3rd:

$ cat www.izuu.com.access.log www.izuu.com.access.log.[1-3] | grep 03\/Dec | awk ' { print $1 } ' | sort | uniq | wc -l
59


December 4th:
$ cat www.izuu.com.access.log www.izuu.com.access.log.[1-3] | grep 04\/Dec | awk ' { print $1 } ' | sort | uniq | wc -l
204


Nice healthy jump there. But wait, there's more.

Where did this traffic come from?

In order to do this I found a tool called geoiplookup - which is available in the apt repositories.

$ apt-cache search geo | grep IP
libgeoip-dev - Development files for the GeoIP library
libgeoip1 - A non-DNS IP-to-country resolver library
python-geoip - python bindings for the GeoIP IP-to-country resolver library
python-geoip-dbg - python bindings for the GeoIP IP-to-country resolver library (debug extension)
geoip-bin - IP lookup command line tools that use the GeoIP library
kipi-plugins - image manipulation/handling plugins for KIPI aware programs
libapache2-mod-geoip - GeoIP support for apache2
libgeo-ip-perl - Perl bindings for GeoIP library
php5-geoip - GeoIP module for php5
tclgeoip - Tcl extension implementing GeoIP lookup functions
tor-geoipdb - geoIP database for Tor

$ sudo apt-get install geoip-bin

The geoiplookup tool only comes with a database that narrows the IP address to a specific country --- which is not very interesting. However, using the magic oracle, I discovered a much more specific city-based database at http://geolite.maxmind.com/download/geoip/database/

so then I did a nice:
$ cd
$ wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz

Thought a few seconds... then gave up:

$ man geoiplookup

OPTIONS
-f Specify a custom path to a single GeoIP datafile.

-d Specify a custom directory containing GeoIP datafile(s). By default geoiplookup looks in /usr/share/GeoIP

Wow, that's damn confusing. Is this right?:

$ mkdir geo
$ mv GeoLiteCity.dat.gz geo
$ cd geo
$ gunzip GeoLiteCity.dat.gz
$ geoiplookup -f ~/geo/ 4.2.2.4
Error Traversing Database for ipnum = 67240452 - Perhaps database is corrupt?
Segmentation fault
$

Lovely. So the next step

$ strace !! |& less
...
brk(0) = 0x603000
brk(0x624000) = 0x624000
open("/home/chris/geo/", O_RDONLY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f866a21f000
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
...

It's rather pathetic when strace gives you better documentation than the man page. It clearly wants the full path.

$ geoiplookup -f ~/geo/GeoLiteCity.dat 4.2.2.4
GeoIP City Edition, Rev 1: US, (null), (null), (null), 38.000000, -97.000000, 0, 0

Final Packaging
Ok, that's the ticket. Now we just do a little bit of xargs:
$ cat www.izuu.com.access.log www.izuu.com.access.log.[1-3] | grep 04\/Dec | awk ' { print $1 } ' | sort | uniq | xargs -n 1 geoiplookup -f ~/geo/GeoLiteCity.dat

That does the trick, but now we have this huge ass "GeoIP City Edition, Rev 1:" in front of everything. That's ok, I know sed ... let's make it ordered by country too:

$ cat www.izuu.com.access.log www.izuu.com.access.log.[1-3] | grep 04\/Dec | awk ' { print $1 } ' | sort | uniq | xargs -n 1 geoiplookup -f ~/geo/GeoLiteCity.dat | sort | sed s/^.\*:\ //g


Ah, almost complete. Now we just need to mail it off so I can forward it to the boss

$ cat www.izuu.com.access.log www.izuu.com.access.log.[1-3] | grep 04\/Dec | awk ' { print $1 } ' | sort | uniq | xargs -n 1 geoiplookup -f ~/geo/GeoLiteCity.dat | sort | sed s/^.\*:\ //g | mail cmckenzie


And there. Then I have a nice list of stuff to give to the boss that, albeit not graphical, is still digestible and better than a line at 0... sw33t.


