Prosper: An EVE Online Tool Development Blog: Failing Gracefully

Wednesday, March 27, 2013

Failing Gracefully

Recently ran into two problems, one in-game and the other in Prosper. I'm going to try and merge both thoughts into a single blog post. Might be reading too much of Chella's Low-Sec Lifestyle!

In-Game: The Bad Investment

I've stressed mea culpa on my bad investments before, like the Nomad. Recently, the ball strapped to my ankle has been a perfect storm with Redeemers. First, I repeated the same mistake I made with the Nomad, hopping in as the bubble was crashing. Second, I managed to buy in when materials were spiking. So I'm stuck with a double whammy absolutely murdering my margin.

I cleared my buy-in schedule to try and ride out the turmoil, but it looks like I will be selling out near cost. With T1 minerals rapidly deflating, global production volume growing, and sales volume really not keeping pace, black-ops have settled to similar margins as everything else. If you don't derp the buy-in, there's still money to make... but it's a lot of effort and capital to tie up. Call me greedy, but if I'm going to plunge 2-3B into a buy-in, I want to see really strong returns for the trouble. Otherwise there are plenty of smaller bets that are paying off well enough. Also, big buy-ins just make me edgy when it's time to ship kits to the factory... still waiting for Miniluv to gank one of my freighters.

The funny part about all this was I flew in a Gal FW fleet on my main a couple weeks ago, and one of my old friends from Aideron Robotics was jabbing me about my recent industry fails. Except that I haven't lost money on any of these fails. It's similar to the hard time I give my cube-mate in real life. He sold off his company stock 6 weeks earlier than he should have... whenever I give him a friendly jab about it, he reminds me he still took home money... what's there to complain about?

Without risk, the game isn't fun. I actually get excited for days as I ramp up to a big risky project. Maintaining a status quo is boring. In fact, I might be taking on a shiny new big project very soon. I hope to blog about it as soon as details are hammered out.

Out-of-Game: Shaky API

March has flown by at Mach speed, just like Feb before it. I've found very little time to commit to actually getting any personal work done.

Seems Poetic Stanziel over at Poetic Discourse has started a data farming project too. Their post highlighted an issue I was about to have with my own Toaster project. Though I have pared down my zKB calls to allow for rather extensible operation, I made no accounting for failing midstream.

My eve-central parser is set up to take bites by day (because that's the way the dumps are). If I fail to pull a day down, that date is not written into the result DB. When I start the parser, it first figures out which days are missing before it starts querying eve-central. If I crash for whatever reason during eve-central parsing, all I have to do is run the script again and it will pick up where it left off (or do some small amount of DB admin to clean up FUBAR'd data).
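The "figure out what days are missing" step could be sketched roughly like this (Python 3). The function name missing_days and the idea of handing it a set of already-stored dates are assumptions made for illustration, not the actual parser code.

```python
from datetime import date, timedelta

def missing_days(start, end, days_in_db):
    """Return every date in [start, end] that is not yet in the result DB.

    days_in_db is a set of datetime.date objects. In a real script it would
    come from a query against the result DB (assumed, not shown here).
    """
    todo = []
    day = start
    while day <= end:
        if day not in days_in_db:
            todo.append(day)
        day += timedelta(days=1)
    return todo

# Hypothetical example: two of five days were written before a crash.
done = {date(2013, 3, 1), date(2013, 3, 2)}
queue = missing_days(date(2013, 3, 1), date(2013, 3, 5), done)
```

Because only fully pulled days are ever written, rerunning the script just rebuilds this list and carries on.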

With zKB, my bites have to be much finer. Though I start with the same date filtering as eve-central, I then cut the queries even finer to by-group. zKB then needs one more cut, since it will only return 200 results per query. So, in a single day, I could require 100+ queries. If I start dropping calls catastrophically, I could have big chunks of half-finished data in the ToasterDB. This would be compounded at restart, leaving me with big chunks of unreliable data.
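As a rough sketch of that last cut (Python 3), the loop below keeps asking for the next page of a day/group until a page comes back with fewer than 200 results. fetch_page is a stand-in for the real zKB call; its signature, the page numbering, and the group ID used in the demo are all assumptions made only for this example.

```python
PAGE_SIZE = 200  # zKB returns at most this many results per query

def fetch_all_pages(day, group_id, fetch_page):
    """Collect every result for one day/group by walking the pages.

    fetch_page(day, group_id, page) is a hypothetical callable returning a
    list of kills for that page (pages counted from 1).
    """
    results = []
    page = 1
    while True:
        batch = fetch_page(day, group_id, page)
        results.extend(batch)
        if len(batch) < PAGE_SIZE:
            break  # short page: nothing left for this day/group
        page += 1
    return results

# Fake fetcher for demonstration only: pretends 450 kills exist.
def fake_fetch(day, group_id, page):
    kills = list(range(450))
    start = (page - 1) * PAGE_SIZE
    return kills[start:start + PAGE_SIZE]

# 25 is an arbitrary placeholder group ID.
kills = fetch_all_pages("2013-03-27", 25, fake_fetch)
```

Each (day, group, page) triple is one unit of work, which is why a crash partway through a day can leave that day half finished.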

The solution took several layers:

Try-Try-Again

First, I needed to add a simple try/retry loop. It's reasonable to expect that at least one query in a batch of 100+ will fail, even on my eve-central parser. So, the first stage was to add a try-wait-try-fail routine to the actual fetch operations.

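import time
import urllib2

# dump_url, query, max_tries and sleep_time are set elsewhere in the script
request = urllib2.Request(dump_url)
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
for tries in range(max_tries):
    try:
        response = opener.open(request)
    except urllib2.HTTPError as e:
        # server answered with an error status: wait, then retry
        time.sleep(sleep_time)
        continue
    except urllib2.URLError as e:
        # could not reach the server at all: wait, then retry
        time.sleep(sleep_time)
        continue
    else:
        break  # fetch succeeded, stop retrying
else:
    # only reached when every attempt failed
    print "Could not fetch %s" % (query)
    # TODO: place query back in queue
    # TODO: fail mode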

Pretty simple copy-pasta code from Stack Overflow. It tries to use the opener (after setting the request encoding), and retries until the fetch succeeds or it runs out of tries.

3-Strikes

It would be easy enough to fail the retries with a "Try again later, champ" message. Unfortunately, this is a very bad idea for my zKB tool. If the connection is going to be shaky, but not totally crash, I need to be able to handle failures gracefully.

This means I have to do the following:

Place the failed query back into the work queue, or have another way to keep track of reruns
Count these failures and only crash the program when things have gone critically wrong

To handle this, I built a class:

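class strikes:
    """Counts failures for one part of the program."""
    def __init__(self, what):
        self.strike = 0
        # getint (assuming config is a ConfigParser): get() returns a string,
        # and comparing an int against it would never trip the limit
        self.max_strikes = config.getint("GLOBALS", "strikes")
        self.what = what
    def increment(self):
        self.strike += 1
        self.strike_out()
    def decrement(self):
        # cool-down: forgive one strike, never dropping below zero
        self.strike = max(self.strike - 1, 0)
    def strike_out(self):
        if self.strike > self.max_strikes:
            print "Exceeded retry fail limit for %s" % self.what
            fail_gracefully()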

This isn't the final revision, but it gives you an idea. Each part of DB_builder.py will have its own strike counter. If too many increment() calls pile up, the whole program will fail. The decrement() call also works as a cool-down that walks the strikes back down if the connection is just wavy.
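To show how the two requirements (put the failed query back in the queue, count the failures) might fit together, here is a self-contained Python 3 sketch. StrikeCounter, StruckOut, drain and flaky_fetch are names invented for this example, and raising an exception stands in for the fail_gracefully() call, which hasn't been written yet.

```python
from collections import deque

class StruckOut(Exception):
    """Raised when one part of the program has failed too often."""

class StrikeCounter:
    def __init__(self, what, max_strikes):
        self.what = what
        self.max_strikes = max_strikes
        self.strike = 0
    def increment(self):
        self.strike += 1
        if self.strike > self.max_strikes:
            raise StruckOut(self.what)
    def decrement(self):
        self.strike = max(self.strike - 1, 0)

def drain(queue, fetch, counter):
    """Run every query in the queue; failed ones go to the back for a rerun."""
    done = []
    while queue:
        query = queue.popleft()
        try:
            done.append(fetch(query))
        except IOError:
            queue.append(query)   # keep track of the rerun
            counter.increment()   # may raise StruckOut
        else:
            counter.decrement()   # cool-down after a success
    return done

# Demonstration with a fake fetcher that fails the first time it sees "b".
seen = set()
def flaky_fetch(query):
    if query == "b" and query not in seen:
        seen.add(query)
        raise IOError("simulated dropped call")
    return query.upper()

results = drain(deque(["a", "b", "c"]), flaky_fetch, StrikeCounter("zKB", 3))
```

One dropped call just shuffles "b" to the back of the line. A fetcher that never recovers runs the counter past its limit and stops the loop.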

Die-Gracefully

It's bad enough you failed the query (repeatedly), and this happened so often it's just time to die... but now I have to die in a manner that is useful for retry. Pardon the pure pseudo code, I haven't written this part yet.

//fetch ALL the current work queues//
//generate fail_<script>.json files//

There are several parts to this. First, I intend to make the program multi-threaded/parallel on final release. This means any sub-script inside the DB_builder.py wrapper could fail. When an abort is called, I need to get the data on where all the children were in the queue and dump out the progress. JSON is just because I can basically write/read the objects as they appear inside the program without having to translate them into something human-readable.

This will mean I also have to change the program init stage. Where I was originally just asking the DB what is missing, I can also check against fail_*.json logs to restart the program.
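Since this part isn't written yet, the following is only a guess at the shape it could take (Python 3). dump_queues, load_queues, the [day, group, page] entry layout and the script names are all hypothetical.

```python
import glob
import json
import os
import tempfile

def dump_queues(queues, directory):
    """Write one fail_<script>.json file per work queue."""
    for script, pending in queues.items():
        path = os.path.join(directory, "fail_%s.json" % script)
        with open(path, "w") as handle:
            json.dump(pending, handle)

def load_queues(directory):
    """Read back any fail_*.json files left by an earlier abort."""
    queues = {}
    for path in glob.glob(os.path.join(directory, "fail_*.json")):
        script = os.path.basename(path)[len("fail_"):-len(".json")]
        with open(path) as handle:
            queues[script] = json.load(handle)
    return queues

# Hypothetical state at abort time: each zkb entry is [day, group, page].
pending = {
    "zkb": [["2013-03-26", 25, 3], ["2013-03-27", 25, 1]],
    "evecentral": [["2013-03-27"]],
}
workdir = tempfile.mkdtemp()
dump_queues(pending, workdir)
restored = load_queues(workdir)
```

At init, anything found by load_queues would be merged with the "what is missing" answer from the DB before work starts.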

Given Infinite Time...

The idea with the retry scheme was to mimic some of what TCP does to manage throughput on a busy connection. Though I am not worried at all about the actual bandwidth delivering me my data, employing an intelligent fetcher will make all parties run smoother.

To do this, I would need to make my strikes/sleeps/retries more dynamic. This would mean every time a certain operation failed, it would save new data about the issue. I'm not a classical CS guy, but it would be possible to keep a running tally of "time since last fail" or other metrics to categorize each fail until a dynamic steady state was found.
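For illustration only, a dynamic sleep could be as small as the Python 3 sketch below: back off hard after a failure, then creep back down after each success, loosely borrowing the way TCP reacts to congestion. AdaptiveDelay and its parameters are invented for this example and are not part of the tool.

```python
class AdaptiveDelay:
    """Grow the wait after a failure, shrink it slowly after a success."""
    def __init__(self, base=1.0, ceiling=60.0):
        self.base = base
        self.ceiling = ceiling
        self.delay = base
    def failed(self):
        # back off hard: double the wait, up to the ceiling
        self.delay = min(self.delay * 2, self.ceiling)
        return self.delay
    def succeeded(self):
        # recover gently: step back down toward the base wait
        self.delay = max(self.delay - self.base, self.base)
        return self.delay

delay = AdaptiveDelay(base=1.0, ceiling=8.0)
history = [delay.failed(), delay.failed(), delay.failed(), delay.failed(),
           delay.succeeded(), delay.succeeded()]
```

The value returned by failed() or succeeded() would replace the fixed sleep between tries.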

... but that won't get me my data... and is over-engineering. Perhaps when I have more time than sense I'll consider putting in some dynamic back-off, but the retry-wait-strikes-fail routine I have already should be adequate to build the DB I'm after... even if that takes several retries.

12 comments:

Anonymous said...: I've thought long and hard about how to use zKill and have it safe from mid-stream failures. Originally I was going to use the startTime and endTime attributes, but I found that zKill often timed-out when using those.

Instead I'm going to make use of the afterKillID attribute. And I'm doing it on a per region basis. I poll each region, one at a time, using the most recent KillID as the value I use with the afterKillID attribute.

I've yet to finish the code ... but I'm feeling confident that this is the route to take.; March 27, 2013 at 5:56 PM
Unknown said...: We have different goals. Since I want historic data boiled down, the easiest way to sort that load is Day:group:page. But I am concerned about failing the group of queries collecting the frigate data.

If I can be smart about my delay/retry scheme, it should be reasonably trivial to fix (how many frigates or pods can really die in a day). And checking off the collection by Day:group gives an easy-to-manage work queue if the program has to be retried.; March 27, 2013 at 9:12 PM
Anonymous said...: This comment has been removed by the author.; March 27, 2013 at 9:59 PM
Anonymous said...: I'm not sure what groupID actually represents.

But use beforeKillID, along with groupID, if you're looking to go back in time and sort desc. Use the oldest/smallest KillID in the database before each query. Don't worry about paging ... that will cause you the most headaches with zKill, since API calls can and will fail regularly.; March 27, 2013 at 10:00 PM
Unknown said...: Well, I only use the hull group ID's... that lets me return "all the frigates" for a day, for instance.

And I am using my scheme because when I init the program fresh (without an error log), I won't have "latest kill ID", I'll have a list of itemID's (maybe group IDs), and dates. Also, when I add threading to the program, a to-do queue is more efficient. Lastly, it matches my scheme for eve-central fetching... and recycling code for great justice.

I agree your scheme is great for straight brute force. I will keep it in my back pocket when I say "fuck doing it the 'right' way"; March 27, 2013 at 10:06 PM
Anonymous said...: The right way?

I agree on having an initial killID as a seed. I want to start my data at 2013-01-01 00:00:00 ... so I found a kill ID for 2012-12-28 and will use that as my initial seed value once I get the program up and running.

I don't want a lot of historical data though. I'm picking a particular date to start from, and will poll forwards. It will work great once I get to the current datetime. I'll be polling each region once every two hours, and using the killID in the database (for a particular region) as my afterKillID value. For my purposes, that's not the wrong way of doing it ... it's probably the best way to do it, if I don't want to miss killmails.

I'm using zKill for two reasons. To scrape new characters and corporations. And to record PvP activity per character. I want to basically track how active characters are, and where they are active.; March 27, 2013 at 10:19 PM
Sugar Kyle said...: I'm not even sure what I am being blamed for! :P; March 27, 2013 at 10:23 PM
Unknown said...: The mongoose flies at midnight; March 27, 2013 at 10:29 PM
Unknown said...: I deal with so much spaghetti code at my IRL job, I get a little obsessed with "extensible code".

We're after totally different data though. I agree your method is wise for your use. I frankly am throwing away a lot of the data I'm pulling, since all I care about is a tally of what is destroyed day-to-day. My goal is to match destruction data next to sales volume data. Ultimate graph-porn.

I'd love a chance to peek over your code though. Are you posting it on github? Also, what language are you using?; March 27, 2013 at 10:33 PM
Anonymous said...: I have a root API class, and then a class for each different type of API call I make. Five classes total.

public class API {
   protected String _url;
   protected String _raw;
   protected String _formatted;
   protected String _error;

   protected API() { }
   public virtual bool RequestData() { return true; }
   public virtual bool ProcessData(SQLiteDatabase db) { return true; }
   public virtual void Parse() { }
   public virtual String GetRaw() { return _raw; }
   public virtual String GetFormatted() { return _formatted; }
   public virtual String GetError() { return _error; }
   public virtual void Clear() {
      _raw = "";
      _formatted = "";
      _error = "";
   }
}

public class EVEOnlineAllianceAPI : API {}
public class EVEOnlineCorporationAPI : API {}
public class EVEOnlineCharacterAPI : API {}
public class EVEKillCharacterActivityAPI : API {}
public class EVEWhoCorporationAPI : API {}

I'm using C# and .NET 4.5.; March 27, 2013 at 10:47 PM
Anonymous said...: And no. I won't be releasing this code. I'm not sure I want to be responsible for several hundred people (potentially) all hammering on the APIs and gathering duplicate data.

I'll just sell the data for ISK to other players.; March 27, 2013 at 10:48 PM