Wednesday, March 27, 2013

Failing Gracefully

Recently ran into two problems, one in-game and the other in Prosper.  I'm going to try and merge both thoughts into a single blog post.  Might be reading too much of Chella's Low-Sec Lifestyle!

In-Game: The Bad Investment

I've stressed mea culpa on my bad investments before, like the Nomad.  Recently, the ball strapped to my ankle has been a perfect-storm with Redeemers.  First, I repeated the same mistake I made with the Nomad, hopping in as the bubble was crashing.  Second, I managed to buy-in when materials were spiking.  So I'm stuck with a double whammy absolutely murdering my margin.

I cleared my buy-in schedule to try and ride out the turmoil, but it looks like I will be selling out near-cost.  With T1 minerals rapidly deflating, global production volume growing, and sales volume really not keeping pace, black-ops have settled to similar margins as everything else.  If you don't derp the buy-in, there's still money to make... but it's a lot of effort and capital to tie up.  Call me greedy, but if I'm going to plunge in 2-3B on a buy-in, I want to see really strong returns for the trouble.  Otherwise there are plenty of smaller bets that are paying off well enough.  Also, big buy-ins just make me edgy when it's time to ship kits to the factory... still waiting for Miniluv to gank one of my freighters.

The funny part about all this was I flew in a Gal FW fleet on my main a couple weeks ago, and one of my old friends from Aideron Robotics was jabbing me about my recent industry fails.  Except that I haven't lost money on any of these fails.  It's similar to the hard time I give my cube-mate in real life.  He sold off his company stock 6 weeks earlier than he should have... whenever I give him a friendly jab about it, he reminds me he still took home money... what's there to complain about?

Without risk, the game isn't fun.  I actually get excited for days as I ramp up to a big risky project.  Maintaining a status quo is boring.  In fact, I might be taking on a shiny new big project very soon.  I hope to blog about it as soon as details are hammered out.

Out-of-Game: Shaky API

March has flown by at mach speeds just like Feb before it.  I've found very little time to commit to actually getting any personal work done.  

Seems Poetic Stanziel over at Poetic Discourse has started a data farming project too.  Their post highlighted an issue I was about to have with my own Toaster project.  Though I have pared down my zKB calls to allow for rather extensible operation, I made no accounting for failing midstream.

I set up my eve-central parser to take bites by-day (because that's the way the dumps are).  If I fail to pull a day down, that date is not written into the result DB.  When I start the eve-central parser, it first figures out what days are missing before starting to query eve-central.  If I crash for whatever reason in the eve-central parsing, all I have to do is run the script again and it will pick up where it left off (or do some small amount of DB admin to clean up FUBAR'd data).
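The "figure out what days are missing" step can be sketched roughly like this.  This is only an illustration: the function name and the idea of handing in the dates already in the DB as a plain list are assumptions, not the actual parser code.

```python
import datetime


def missing_days(start, end, days_in_db):
    """Return every date from start to end (inclusive) that is not in the DB yet."""
    have = set(days_in_db)
    span = (end - start).days + 1
    all_days = (start + datetime.timedelta(days=i) for i in range(span))
    return [day for day in all_days if day not in have]


# Hypothetical example: two of four days were already pulled down.
done = [datetime.date(2013, 3, 1), datetime.date(2013, 3, 3)]
todo = missing_days(datetime.date(2013, 3, 1), datetime.date(2013, 3, 4), done)
```

Because a day only lands in the DB after it was pulled successfully, rerunning the script after a crash naturally yields the days that still need fetching.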

With zKB, my bites have to be much finer.  Though I start with the same date filtering as eve-central, I then cut the queries even finer to by-group.  zKB then needs one more cut, since it will only return 200 results per query.  So, in a single day, I could require 100+ queries.  If I start dropping calls catastrophically, I could have big chunks of half-finished data in the ToasterDB.  This would be compounded at restart, leaving me with big chunks of unreliable data.
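To show why one day turns into so many calls, here is a rough sketch of paging through a single (date, group) slice.  The fetch_page callable is a hypothetical stand-in for the real zKB request, and numbering pages from 1 is an assumption; only the 200-result cap comes from the paragraph above.

```python
PAGE_SIZE = 200  # per-query result cap mentioned above


def fetch_all(fetch_page, date, group_id, page_size=PAGE_SIZE):
    """Collect every result for one (date, group) slice, one page at a time.

    fetch_page(date, group_id, page) is a hypothetical callable that returns
    a list of at most page_size results for the requested page.
    Returns the combined results and the number of queries it took.
    """
    results = []
    page = 1  # assumption: pages are numbered from 1
    while True:
        batch = fetch_page(date, group_id, page)
        results.extend(batch)
        if len(batch) < page_size:
            # A short (or empty) page means this slice is exhausted.
            return results, page
        page += 1
```

Multiply the page count by the number of groups and it is easy to see how a single day passes 100 queries, each of which is a chance to fail.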

The solution took several layers:

Try-Try-Again

First, I needed to add a simple try/retry loop.  It's reasonable to expect that at least one query in a batch of 100+ will fail, even on my eve-central parser.  So, the first stage was to add a try-wait-try-fail routine to the actual fetch operations.
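    import time
    import urllib2

    request = urllib2.Request(dump_url)
    request.add_header('Accept-encoding', 'gzip')
    for tries in range(0, max_tries):
        try:
            opener = urllib2.build_opener()
            response = opener.open(request)  # the actual fetch
        except urllib2.HTTPError as e:
            # server answered with an error status: wait, then retry
            time.sleep(sleep_time)
            continue
        except urllib2.URLError as er:
            # could not reach the server at all: wait, then retry
            time.sleep(sleep_time)
            continue
        else:
            break
    else:
        # for/else: only runs when every try failed
        print "Could not fetch %s" % (query)
        # TODO: place query back in queue
        # TODO: fail mode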
Pretty simple copy-pasta code from Stack Overflow.  It tries to use the opener (after setting request encoding), and retries until it either succeeds or runs out of tries.

3-Strikes

It would be easy enough to fail the retries with a "Try again later, champ" message.  Unfortunately, this is a very bad idea for my zKB tool.  If the connection is going to be shaky, but not totally crash, I need to be able to handle failures gracefully.

This means I have to do the following:
  1. Place the failed query back into the work queue, or have another way to keep track of reruns
  2. Count these failures and only crash the program when things have gone critically wrong
To handle this, I built a class:
    class strikes(object):
        def __init__(self, what):
            self.strike = 0
            # config values usually come back as strings; cast so the
            # comparison in strike_out() is numeric
            self.max_strikes = int(config.get("GLOBALS", "strikes"))
            self.what = what

        def increment(self):
            self.strike += 1
            self.strike_out()

        def decrement(self):
            # cool-down: forgive one strike, never dropping below zero
            self.strike = max(self.strike - 1, 0)

        def strike_out(self):
            if self.strike > self.max_strikes:
                print "Exceeded retry fail limit for %s" % self.what
                fail_gracefully()
This isn't the final revision but it gives you an idea.  Each part of DB_builder.py will have its own strike counter.  If too many increment() calls happen, the whole program will fail.  The decrement() call also allows a cool-down to walk the strikes back if the connection is just wavy.
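For illustration, here is one way the two requirements (requeue the failed query, count the failures) could fit together around a work queue.  This is a sketch under assumptions, not the Toaster code: StrikeCounter is a stripped-down, self-contained stand-in for the class above, it raises a hypothetical TooManyStrikes exception instead of calling fail_gracefully(), and fetch is any callable that raises IOError on a bad fetch (urllib2's URLError is an IOError subclass).

```python
from collections import deque


class TooManyStrikes(Exception):
    """Raised when a worker has failed more often than it is allowed to."""


class StrikeCounter(object):
    """Self-contained stand-in for the strikes class, limit passed in directly."""

    def __init__(self, what, max_strikes):
        self.what = what
        self.max_strikes = max_strikes
        self.strike = 0

    def increment(self):
        self.strike += 1
        if self.strike > self.max_strikes:
            raise TooManyStrikes("Exceeded retry fail limit for %s" % self.what)

    def decrement(self):
        self.strike = max(self.strike - 1, 0)


def drain(queue, fetch, counter):
    """Work through the queue; requeue failures, cool down on successes."""
    results = []
    while queue:
        query = queue.popleft()
        try:
            results.append(fetch(query))
        except IOError:
            queue.append(query)   # 1. put the failed query back for a rerun
            counter.increment()   # 2. count it; raises once past the limit
        else:
            counter.decrement()   # a good fetch forgives one strike
    return results
```

Requeueing before counting means the queue still holds the failed query when the counter finally gives up, which is exactly the state a graceful exit needs to dump.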

Die-Gracefully

It's bad enough you failed the query (repeatedly), and this happened so often it's just time to die... but now I have to die in a manner that is useful for retry.  Pardon the pure pseudo code, I haven't written this part yet.

//fetch ALL the current work queues//
//generate fail_<script>.json files//

There are several parts to this.  First, I intend to make the program multi-threaded/parallel on final release.  This means any sub-script inside the DB_builder.py wrapper could fail.  When an abort is called, I need to get the data on where all the children were in their queues and dump out the progress.  I picked JSON because I can basically write/read the objects as they appear inside the program without having to make human translations.

This will mean I also have to change the program init stage.  Where I was originally just asking the DB what is missing, I can also check against fail_*.json logs to restart the program.
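Since this part isn't written yet, the following is only a guess at its shape: dump each sub-script's pending queue to a fail_<script>.json file on abort, then read any such files back at init.  The function names, the dict-of-queues layout, and the out_dir parameter are all assumptions made for the sketch.

```python
import glob
import json
import os


def dump_queues(queues, out_dir="."):
    """Write each script's pending work to fail_<script>.json in out_dir."""
    for script, pending in queues.items():
        path = os.path.join(out_dir, "fail_%s.json" % script)
        with open(path, "w") as handle:
            json.dump(list(pending), handle)


def load_failed(out_dir="."):
    """Read back any fail_*.json files left behind by a previous run."""
    recovered = {}
    for path in glob.glob(os.path.join(out_dir, "fail_*.json")):
        # Strip the "fail_" prefix and ".json" suffix to recover the script name.
        script = os.path.basename(path)[len("fail_"):-len(".json")]
        with open(path) as handle:
            recovered[script] = json.load(handle)
    return recovered
```

One catch with the JSON round trip: tuples come back as lists, so queue items are easier to compare if they are stored as lists (or plain strings) in the first place.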

Given Infinite Time...

The idea with the retry scheme was to mimic some of what TCP does to manage throughput on a busy connection.  Though I am not worried at all about the actual bandwidth delivering me my data, employing an intelligent fetcher will make all parties run smoother.  

To do this, I would need to make my strikes/sleeps/retries more dynamic.  This would mean every time a certain operation failed, it would save new data about the issue.  I'm not a classical CS guy, but it would be possible to keep a running tally of "time since last fail" or other metrics to categorize each fail until a dynamic steady state was found.

... but that won't get me my data... and is over-engineering.  Perhaps when I have more time than sense I'll consider putting in some dynamic back-off, but the retry-wait-strikes-fail routine I have already should be adequate to build the DB I'm after... even if that takes several retries.
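For the curious, a dynamic back-off of the kind described above could look roughly like this.  It is only loosely inspired by how TCP backs off under congestion, it is not TCP's actual algorithm, and the class name and default numbers are made up for the sketch.

```python
class Backoff(object):
    """Multiplicative back-off on failure, gradual recovery on success."""

    def __init__(self, base=1.0, ceiling=60.0, factor=2.0):
        self.base = base        # shortest sleep, in seconds
        self.ceiling = ceiling  # longest sleep, in seconds
        self.factor = factor    # how hard to back off per failure
        self.delay = base

    def failed(self):
        """Call after a failed fetch; returns how long to sleep this time."""
        wait = self.delay
        self.delay = min(self.delay * self.factor, self.ceiling)
        return wait

    def succeeded(self):
        """Call after a good fetch; eases the delay back toward base."""
        self.delay = max(self.delay / self.factor, self.base)
```

In the retry loop, time.sleep(sleep_time) would become time.sleep(backoff.failed()), with backoff.succeeded() called after each good fetch so the sleeps shrink again once the connection settles.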