Splunk: Analytics done right
With Bodega and Centrix.ca, I’m always looking at analytics data. I try to get as much data as humanly possible (within reason) to be able to make good business decisions. That usually means a mix of SQL queries against stuff that I’ve logged to a database, AWStats, and Google Analytics. They each serve a different need, but often I’ve hit questions that they can’t answer, and I’m left doing gross stuff like grepping through log files and using tools like sed and friends to start making sense of the data.
Google Analytics is pretty cool, but it has a huge flaw that I can’t ignore: you have to tell it everything beforehand. I have to tell it when a user clicks “download” on my site. It’s painful to do that stuff. Then when you get into multi-domain tracking, it gets even hairier. I’m not sure why; it seems to me that cross-domain tracking shouldn’t be rocket science… but apparently it is. Their UI manages to make me feel like an idiot, which really doesn’t help anyone. Another huge flaw is that if someone decides to link directly to your download instead of your page, that’ll never be counted by GA, ’cause no JavaScript told it about that event. That shit happens all the time.
AWStats… well, that project isn’t going anywhere fast, but it’s got one huge strength: it looks at my Apache log files directly and shows me the most relevant info. It’s not super detailed and it’s extremely limited, but for the easy cases, it works amazingly well.
I had done some research, trying to find better web analytics tools, but everyone’s systems seem to be based around the same model as Google’s: using JavaScript to tell a server about events. It’s just such a broken model to me. I have the data. I have months and years’ worth of data right here, in these log files.
I had a bunch of questions I needed answers to, and no good way to get them. Google Analytics wasn’t playing ball with the simplest of these, never mind the fun stuff. I knew the data was there; I could manually do some of my queries by grepping through the log files like an animal. This led me to start developing a little tool to help me with it: a Cocoa app that would import Apache log files and let me start seeing some of the relationships I wanted. (Core Data isn’t super fast at importing huge sets of data, apparently.) I got to a proof-of-concept stage, but we needed something more now-ish, not “When Rick gets around to finishing it”-ish. On top of that, using Cocoa for such a tool is really a stupid idea; I was just too stubborn to start writing a web app.
Here’s a simple stat I couldn’t get to easily: Bodega download counts. AWStats wasn’t counting them because we redirect to Amazon S3, and I was getting underreported values from GA due to the direct download issue I mentioned earlier. Amazon’s reporting mechanisms really aren’t suited for that, either. Sure, this one was easy to grep for, but grouping the results by day wasn’t exactly fun.
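For the curious, here’s roughly what that “grep, then group by day” chore looks like when scripted. This is only a sketch, not what I actually ran: it assumes Apache’s common/combined log format, and the `bodega_…zip` URL pattern and the function name are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Apache common/combined log line (assumed format):
# host ident user [time] "METHOD path PROTO" status size ...
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

# Hypothetical download URL pattern; adjust to your real file names.
DOWNLOAD_PATH = re.compile(r'/bodega_[^/]*zip$')


def downloads_per_day(lines):
    """Count successful or redirected GET requests for download files, per day."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        path = m.group("path").split("?", 1)[0]  # drop any query string
        if m.group("method") != "GET" or not DOWNLOAD_PATH.search(path):
            continue
        if int(m.group("status")) >= 400:
            continue  # ignore errors; 3xx stays in (e.g. a redirect to S3)
        day = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z").date()
        counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))
```

You’d call it with something like `downloads_per_day(open("access.log"))`. It works, but every new question means another script like this one, which is exactly the treadmill I wanted off of.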
Phil’s uncle’s brother’s fiancé or something worked at Splunk, and Phil remembered their claim to be able to index anything. Anytime someone makes a claim like that, I go in assuming that they’re lying liars. After a bit of prodding by Phil, I installed it on my Mac to test it out and fed it a day’s worth of logs. I wanted to see how fast Splunk could index that, since my app took a couple of minutes. Seconds later… it said it was ready. Umm… alright. It passed the “can it tell me how many Bodega downloads we got?” test with flying colors, and a pretty graph to boot. I never told it the format of my log file or anything like that; it just figured it out, like software should.
Splunk’s interface is really really simple, basically just a query textfield, a graph area, and a place to see a tabular version of your results. It does all of its searching and displaying of results via AJAX so it feels really fast.
We installed it onto our server and fed it the many GB of archived log files we had for Apache. It chewed through these with insane speed (granted, that box is no slouch); the only really painful part was doing all of the clicking to feed it the files one at a time. I then configured it to monitor the directories where Apache stores its log files before rotation, so it picks up incoming data, and voilà: damn-near real-time results.
Turns out Splunk can’t index just anything, but it does understand a really impressive list of log formats, basically anything common.
Its query language takes some getting used to. It feels a little like the land of Unix commands, where they decided every byte matters, so things should be unnecessarily shortened. Seeing a graph of the updates done via Sparkle over time is pretty simple: “bodega_*zip | timechart span=1d count by uri_path”
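If the query reads like line noise, here’s my mental model of what it computes, sketched in Python: keep the events matching the wildcard, bucket them by day, and count per `uri_path`. This is an approximation for illustration only, not how Splunk works internally. The function name is made up, the events are assumed to already be parsed into (timestamp, uri_path) pairs, and I match the wildcard against the path only, where Splunk searches the whole event.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def timechart_daily_count_by_path(events, pattern="*bodega_*zip"):
    """Rough equivalent of: bodega_*zip | timechart span=1d count by uri_path

    events: iterable of (datetime, uri_path) pairs (assumed pre-parsed).
    Returns {day: {uri_path: count}} with days in ascending order.
    """
    table = defaultdict(lambda: defaultdict(int))
    for timestamp, uri_path in events:
        # Simplification: wildcard match on the path alone, case-sensitive.
        if fnmatchcase(uri_path, pattern):
            day = timestamp.date().isoformat()  # span=1d -> one bucket per day
            table[day][uri_path] += 1           # count by uri_path
    return {day: dict(paths) for day, paths in sorted(table.items())}
```

Each day becomes a row and each distinct path becomes its own series, which is what ends up as the separate lines on the graph.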

I still have some pretty intense queries that I have yet to get Splunk to do, though the more I look at their documentation, the more I think they’re all possible. It’s just a matter of figuring out the different commands. I haven’t had a ton of time to put into this yet.
Splunk is now an integral part of how I get analytics data. When I get a question about whether we can get some type of data, and it’s not something we designed Bodega to do natively, my first instinct is to check out Splunk.
The best part of it is… it’s free. Well, not really, but if you aren’t indexing a massive amount of new data on a daily basis, you’re fine with the free version. A real license costs $5000, and by the time we get to the point of needing one, I’m pretty sure I’ll be happy to pay that.