Thursday, October 31, 2013

Let's Make a Deal

Since industry is mostly about market PVP and making ISK, there is a lot of wheeling and dealing.  Sometimes those deals are bad (and they should feel bad), sometimes those deals are great, and sometimes things shake out generally neutral.  In general, I avoid direct/bulk sales, but I do try to be helpful to friends when I get a chance.

I recently left Paxton Industries and came back to Aideron Robotics.  Half of the reasons are business/play related; the other half is a rant about nullsec that I'll have to indulge in later.  I am returning to Aideron Robotics because I know the people, the density of code/tool oriented people is the highest I've ever found, and Marcel Deveroux extended a deal I couldn't refuse: help the corp generate income through a POS reaction farm.  Also, FW is more my pace for ops, even if the ships are smaller than I usually like flying.

Any negotiator will tell you the secret to successful bargaining is to make sure both parties feel like they walk away from the table with a win.  Though there's a small segment of scams that prey on this rule by fooling the victim, you can't fool all the people all the time.  For the vast majority of business, you have to strike a bargain for both parties; very rarely do you get the fortunate position of being a dictator.

Making Corp/Alliance/Coalition Programs

The big pet peeve I had against Paxton was that every one of their "deals" looked to screw the other party.  Their capital program for allies was far more expensive than the open market or any other internal service, and their industry and POS programs were all designed to take 100% of proceeds with only a "service guarantees citizenship" nod in return.  Frankly, these practices are not sustainable, and they will cripple growth.  Your volumes will be difficult to maintain if you're expecting nearly scam-level returns.  Your personnel churn will be high without rewarding the staff directly.  Worst of all, it leads to complacency among the management and membership, because "why do better?"

The truly great programs benefit everyone.  You provide a service that customers need, and you treat your employees well.  The GSOL presentation at EVE Vegas (the best presentation of the convention IMO) showcased the backbone of GSF/CFC's military might: their ability to deliver the hardware where it's needed, when it's needed.  GSOL membership is paid, often in PLEX, for their efforts.  They are constantly iterating their tools and practices.  They have a focus on getting the right staff, and on doing their damnedest to mitigate burnout.  The CFC can absolutely turn the tide of battle before the first shot is fired, all because GSOL is ready to deliver.



The proposal I brought to Aideron Robotics was "I will manage x towers for 1/3rd of the profits".  In the proposal, I outlined a means to expand and include more members, and the expected limits of our reach.  I'm even fronting half of the set-up cost (even though I probably shouldn't) as a means to support my friends.  In this deal, all parties walk away with a win.  I probably would have been more cutthroat if I didn't already have a prototype tool made, but things fell into place pretty easily considering.  These are my friends, both in and out of game, and I want to be a force multiplier for them to achieve their goals.

People in Glass Houses

I am not innocent of being a dick when it comes to deals.  I actively run away from direct deals, and I dissuade people from selling directly to the corp if we don't have a significant need.  So much of this is because Jita represents the gold standard for prices.  I center my projects around Jita, so it would require less-than-Jita prices to make a substitution worthwhile.  Either way, one party ends up screwed in the relationship.

I would rather see my friends get full price for their work than prey on their generosity.  The only time it makes sense to me is when someone is already chasing Jita prices locally, so we both avoid a shipping step.  If we are not BOTH making money, I am effectively robbing Peter to pay Paul.  By lowering line-member income, I am causing them to demand more from the corp... and programs may not be strong enough to allow that relationship.

The one position I'm missing here is the communist/socialist WH corp.  This relationship is different, since many of your members will live and die by the corp's stream of goods and equipment in and out of the WH, and the logistics of keeping proceeds individual are just too painful.  Also, it helps that in a WH, your wallet balance does you a fat lot of good without access to a market.  But still, this relationship is a win-win for both parties: members get the equipment they need to live and thrive in the WH, while the corp gets what it needs to enable that.  Without a feed down to line-members, the WH operation shrivels and dies.

Monday, October 28, 2013

Treading a Fine Line

I had an excellent convo with @EVE_WOLFPACKED and @ChiralityT the other day about the broader topics of delivering an app to the EVE community.  Pair that with the recent Somer Blink scandals, and there has been a lot of noise in the 3rd party service sphere.

I've been sitting on this post for a week now, and may regret publishing it.  But I feel that the backlash from this SomerBlink scandal is quickly reaching a fever pitch, and a lot of players are losing sight of an important pillar of EVE: Making a profit isn't evil.  Though I do not condone ISK-->IRL conversion in any path, the service-->ISK angle has been a long-standing feature of the 3rd party sphere.  And as nice as it would be to be a household name like EVE-Central or zKillboard, I'm going to have a very hard time justifying making a public portal like those if it will put me in the poor house, or I lose all my free time to SEO or ad management.

Open Sources Thanks to Open Sources

Let me be 100% clear: my work is enabled only because others have provided APIs.  I would not have kill data if it weren't for zKillBoard.  I would not have in-game history data if it weren't for EVE-Marketdata.  I would not have order histories if it weren't for EVE-Central providing raw daily dumps.

After I talked with the contributors from eve-kill/zKillboard about how best to handle their API for my data, they made it clear that I could not in turn make that data secret.  They state in their TOS:
Using the zKillboard database and API for the purpose of datamining, in an attempt to gain an unfair advantage over corporations and/or alliances, is not allowed
And Squizz was extremely explicit in noting they enforce that rule.  EVEwho exists only because someone had tried to make a private spy network with that data, and Squizz instead published a public version and put him out of business.  

So, for the short term, I am in the uncomfortable position that I have collected this data, but am still lacking a distribution means.  I have tried to be open to those that ask questions about the data, but short of maintaining a SQL dump by hand, I don't have a better method than telling people to scrape their own version.

Gotta Make A Living

If things go according to plan, I will be shooting my own margins in the foot when I release an open version of my tool.  Also, I need to figure out a way to pay for service overhead, and I'd prefer not to plaster my site in ads.  Lastly, if I expect to get people to hand me some sort of compensation, I have to be able to prove my service is worth paying for.

Public Features

I'd like to be able to publish access to all the charts I've been making.  I'd like it if users could browse through items like Eve-Markets, and see:
  • Market candlestick
  • Total volume
  • Buy orders/day
  • Sell orders/day
  • Destruction (and access to by-location binning: HS/LS/NS/WH)
  • Build costs
--also--
  • Personal S&I tracking
    • Job Tracking
    • Kit Building
    • Accounting
This should give enough of a view to return the favor for the data I've been given.  I'd like to be able to let people do their own market research on my site, and have some look under the hood at how the planning/accounting would work for their own projects.  I have no problem providing data outlays for the big hubs, but the finer the data gets, the more trouble it is to serve for a marginal increase in value.

Paywall Features

Let me be 100% clear.  I am in this business to make a buck.  I'm fine with getting paid in ISK, but there are a ton of manhours to be spent in development, and server space ain't free.  In an effort to subsidize the work and server space, I'd like to offer the full suite of tools for ISK at two tiers: a "corp level" and an "alliance level".  Since the intended audience is organizations and not individuals, I'd like to leave the beefier features behind a paywall:
  • Price predictions
  • Automated production planner (at least suggestions)
  • Org-level accounting: paying contributors, tracking stockpiles
  • Localized market prediction actions:
    • "A lot of people died in this area, market activity should increase"
    • "Ship x item to y for increased margin"
  • Localized reports outside major hubs
Where the "corp" option allows a flat fee for a limited number of active builders tracked, the "alliance level" would be a % of profits (as they are tracked through the system, no fair taxing un-fulfilled plans) without limits.  I would love to provide tastes of these features to the personal account option, but we'll have to see when we get there.  I seriously doubt my prediction tools will be strong enough to provide a "1wk taste, 4wk 'feature'".

The entire strategy here is to share what was fetched from public sources, but protect the unique tools.  I will keep most of the code environment open, but intend to only protect a very small set of features:
  • Machine learning modules
    • Open inputs, but protect trained black-box
  • Automated decision making tools
    • Core feature is to free up S&I managers/staff to play EVE
    • Data streams = open, automated decision making = closed
My hope is that only these two modules would remain "sekret", and though the corp implementation would be behind the paywall, the base code behind accounting, kit building, and job tracking would remain open.

Monetization is my last priority, and it is the very last feature to be implemented.  The goal is to let the 5% that need the horsepower subsidize the 95% of casual users.  If I can provide an ad-free clearinghouse for all the data any industry or market player could want, everybody wins.

What If Paywall Is Verboten?

CCP has not done a good job in making clear what kinds of services are okay and which are not.  With the recent SomerBlink scandal and API EULA drama over the last year, my plans are very close to (or even over) the line of what is allowed.  Furthermore, I am at the mercy of other API providers, and if I run afoul of them, my entire tool goes dark.  So, it is clear I need to have a Plan-B.

First, providing a front-end that distributes the data I'm collecting needs to be a priority.  I may need to rely on ad revenue to pay for upkeep, but I would love to be able to provide that service to the general EVE playerbase.  

If I can't sell the automated/corp S&I tools, then they will remain private.  If the headache of trying to get my team paid for their work becomes too much, or I get embroiled in a scandal for trying to provide this service to the community, the simplest answer is to lock it down.  I'd prefer not to do that; more users would provide extremely valuable feedback for designing better optimization options, and an incentive to really develop a top-grade tool... but it's just a game, and I'd rather be largely unknown than infamous and derided.
The goal is only to "make the best industry corp in EVE" or "win at the market"; I'm not looking to pay for a new car/house/whatever and get IRL rich.  My primary goal is to help as many people in my own organization play the game for free, while crushing competitors under my heel in the truest and most extreme version of market PVP.  It also wouldn't hurt to unseat Mynnna as the richest bastard in EVE.

Tuesday, October 22, 2013

A Little Less Talk: EMD scraper v2

In my fervor to get at one subset of data, I wrote myself into a corner.  So, I spent this last weekend ripping out the inner workings of my pricefetch script and bringing it in line with the style/stability of my zkb scraper.

Code at Github

This exercise was painful because I had to essentially start over and rework the entire tool from top to bottom.  This did give me the chance to clean up a lot of errors (data backfill was bugged all along), and now things are pretty and fast.  I still have the issue of "fast as you please, there's still xGB's to parse", but I think I've worked the tool down into a sweet spot for effort/speed.

I owe a lot of the recent progress to Valkrr and Lukas Rox.  Seeing as I am so painfully green with databases, they've been exceptionally helpful in cleaning up some of the pitfalls I've run into.

What Changed?

Where pricefetch was designed to grab everything from one region, EMD_scraper is designed to grab everything from everywhere.  To accomplish this I put in two modes for scraping:
  • --regionfast
  • --itemfast
These handles help define the method of scraping.  --regionfast will attempt to pull as many regions as possible, resulting in a one-item-per-call return.  --itemfast does the opposite, trying to pull as many items as possible, one region at a time.  Also, unlike zKB_scraper, which goes in dictionary order, regions have been placed in a "most relevant" order in this release: big hubs first, then HS, LS, and nullsec.  It still accepts smaller lists, and you can modify the lookup.json values to your heart's content as well.
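
As a rough sketch (with hypothetical helper names, not the actual EMD_scraper internals), the difference between the two modes is really just which axis gets batched per call:

    # Illustrative sketch only: how the two modes decide what goes in each call.
    def chunks(seq, size):
        """Yield successive slices of seq, each at most size long."""
        for start in range(0, len(seq), size):
            yield seq[start:start + size]

    def plan_calls(regions, items, mode, batch_size=50):
        """Yield (regions, items) pairs, one pair per outgoing EMD call."""
        if mode == "regionfast":
            # As many regions per call as possible, one item at a time
            for type_id in items:
                for region_batch in chunks(regions, batch_size):
                    yield region_batch, [type_id]
        else:  # "itemfast"
            # As many items per call as possible, one region at a time
            for region_id in regions:
                for item_batch in chunks(items, batch_size):
                    yield [region_id], item_batch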

This also necessitated some updates to the crash handler.  Crashes now dump the entire progress so far (region,item) and the script modifies the outgoing calls to skip region/item combinations already run.  I'd really like a more efficient crash/fetch routine, trying to get the full 10k returns each query... but I can't know the limits ahead of time with the current layouts.  I'll take 10k max with 5-7k avg returns rather than try to dynamically update the query.  EMD isn't designed to crawl like zKB.
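
The resume half is nothing fancy; a minimal sketch of the idea (the file name and layout here are illustrative, not the actual crash handler format):

    # Illustrative resume sketch: remember which (region, item) pairs finished,
    # and filter them out of the next run after a crash.
    import json
    import os

    PROGRESS_FILE = "emd_progress.json"

    def load_finished():
        if os.path.exists(PROGRESS_FILE):
            with open(PROGRESS_FILE) as handle:
                return set(tuple(pair) for pair in json.load(handle))
        return set()

    def mark_finished(finished, region_id, type_id):
        finished.add((region_id, type_id))
        with open(PROGRESS_FILE, "w") as handle:
            json.dump(sorted(finished), handle)

    def remaining(regions, items, finished):
        for region_id in regions:
            for type_id in items:
                if (region_id, type_id) not in finished:
                    yield region_id, type_id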

I'm not wholly pleased with how --itemfast runs.  I may have to rewrite it to crawl through all items in one region before moving on to the next.  It currently blasts through a large number of items and then increments the region.

Beautification

Coding on my own, I have this habit of scrawling down code/files willy-nilly until I can get a stable working midpoint.  Since my professional code habits stem from more time spent repairing code or tacking features onto an existing project, I lack a lot of intuition on building foundations.

Repository Maintenance

When I first created the Prosper repository (about a year ago now) I spent a good deal of time trying to create a monolithic DB scraper/builder.  With this second try, I wanted to split the tasks into finer pieces and make the code more independent.  If I could adopt a "First: make it run" mentality, I could at least get to a manageable midpoint with data, rather than burning a bunch of effort on crafting expert code.  This resulted in a lot of duplicated work, and I figured that since the paradigm shifted so far, I might as well gut the original code and promote the new scripts to "DB_builder" status.

I am banking all of my examples to a scraps directory, but I need to make sure I am adding them all to the repository.  Thankfully, I find myself ransacking those samples to help move the project forward.  Much of the zKB urllib2 code was previously written.  Also, many of the item lookup JSONs were pre-existing.

One tack-on to the TODO list, though, is to add more sample data dumps into the SQL portion of the repository.  I was avoiding tracking these to keep the repo from getting too large, but as Valkrr pointed out, at least keeping the SQL scripts of common queries would be useful as examples.

Death to Global Variables

I had a good Samaritan swing by my code and point out that I should de-commit some globals, like db_username/db_password, and replace them with configuration scripts.  After a little back-and-forth, he was so gracious as to add the .ini handlers for me into the zkb script.

I figured it was a good time to add some extra functionality and roped those changes into a more complete set.  Now zKB and EMD scrapers both pull from the same .ini; as will any other outgoing scraper (EVE-Central, eveoffline?).  I'd like to compartmentalize internal and external scrapers to use different .ini files, but we'll see how long that continues.
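
For the curious, the shape of the change is roughly this (section and option names are illustrative, not the exact .ini the scrapers now share):

    # Illustrative sketch: pull DB credentials from a shared .ini instead of
    # hard-coded globals.  A scraper.ini might look like:
    #
    #   [database]
    #   host     = localhost
    #   username = prosper
    #   password = secret
    #   schema   = prosper_db
    #
    import ConfigParser  # "configparser" on Python 3

    config = ConfigParser.ConfigParser()
    config.read("scraper.ini")

    db_host     = config.get("database", "host")
    db_username = config.get("database", "username")
    db_password = config.get("database", "password")
    db_schema   = config.get("database", "schema")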

Cleaner, Clearer Code

If you look at the previous version of the EMD_scraper, you'll see a lot of commented code around working code.  I left a lot of the trial-and-error in the first version.  I have since cleaned a lot of that out, leaving only some quick handles in there for debug printing.

I would like to take another pass at these scripts down the line to make very-pretty output, instead of the progress dumping to the command line.  This is purely cosmetic though, so expect the priority to be extremely low.

SQL-Fu

I seriously underestimated how much trouble data warehousing would be.  I have spent a lot of time over the last week trying to understand where I am going wrong and what steps I am missing.

Steps so far:
  • Reduce DB size by reducing strings
    • Removed itemname from priceDB
  • Design the DB to have the data, use queries to make the form
    • Abandoned "binning" directly from zKB data
    • Instead save by system, binning can be handled in a second-pass method
  • OPTIMIZE TABLE is your friend
  • Custom INDEXes for common queries: added some, need to read more (see the sketch after this list)
  • CCP and NVIDIA are sloppy with their previous patch cleanup:
    • Check C:\Program Files\CCP\EVE\
    • Check C:\nvidia\ 
  • mySQL is a hog
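
To make the index point concrete, here is the flavor of thing I mean, assuming the MySQL-python driver and placeholder table/column names (not the exact Prosper schema):

    # Illustrative sketch: a composite index for the most common lookup
    # (one item's history by date), plus an OPTIMIZE pass afterwards.
    import MySQLdb  # MySQL-python driver; credentials below are placeholders

    connection = MySQLdb.connect(host="localhost", user="prosper",
                                 passwd="secret", db="prosper_db")
    cursor = connection.cursor()

    # Lets "WHERE typeID = X ORDER BY price_date" hit an index
    # instead of scanning the whole table
    cursor.execute("""
        ALTER TABLE priceDB
        ADD INDEX idx_type_date (typeID, price_date)
    """)

    # Reclaim space and defragment after big deletes/changes
    cursor.execute("OPTIMIZE TABLE priceDB")
    connection.close()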

JOIN and SUM(IF(..)): two flavors that don't go well together

One bug I mentioned is that some of my queries are returning hilariously high values.  On my Neutron Blaster Cannon II experiment, the raw numbers were 10x what was in the DB.  When Powers, from the #tweetfleet, asked for freighter data, I was returning something like 138x.  It seems I have been confused about order of operations in SQL.
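
My working theory (illustrated below with made-up table and column names) is the classic join fan-out: the JOIN runs before the aggregate, so if an item has N kill rows and M price rows for the same day, the joined set has N*M rows and the SUM happily counts everything many times over.  The IF() isn't the culprit; the row multiplication is.  Collapsing each side to one row per (typeID, date) before joining keeps the counts honest:

    # Illustrative only -- not the real Prosper schema.

    # Inflated: SUM runs over the fanned-out join product
    BAD_QUERY = """
        SELECT k.typeID, SUM(k.destroyed) AS killed, SUM(p.volume) AS sold
        FROM killDB  k
        JOIN priceDB p
          ON p.typeID = k.typeID AND p.price_date = k.kill_date
        GROUP BY k.typeID
    """

    # Honest: aggregate each table first, join the pre-summed rows,
    # then roll up per item
    GOOD_QUERY = """
        SELECT k.typeID, SUM(k.killed) AS killed, SUM(p.sold) AS sold
        FROM (SELECT typeID, kill_date, SUM(destroyed) AS killed
              FROM killDB GROUP BY typeID, kill_date) AS k
        JOIN (SELECT typeID, price_date, SUM(volume) AS sold
              FROM priceDB GROUP BY typeID, price_date) AS p
          ON p.typeID = k.typeID AND p.price_date = k.kill_date
        GROUP BY k.typeID
    """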

This is why I really want to get the "bridge" scripts done so I can just splice together the tables I want to have all the data I need.  Since the data is local, rescraping should be mostly trivial, and it would give me data stores in the shapes I need to move onto the next step of the machine.

Thursday, October 17, 2013

Rubicon - Siphon Unit Statistics and Operation Announced

Dev Blog announcing new Siphon Unit

This is the biggest change I've been waiting for from the original Rubicon announcement.  Personally, I've been waiting to move forward on my own ambitions to deploy a reaction farm until this feature was better explained.

TL;DR

With Rubicon, there is only one new "space yurt" in the proposed Siphon group: Small Siphon Unit.  It's very small (20m3 in cargo), very cheap (~10M), and can be dropped anywhere within 50km of a control tower to interrupt the moon mining supply chain.  Also, it's important to mention that this module will only interfere with the "Raw Material" and "Processed Material", and can only steal from the final step in the chain.

As of this announcement, there are some important things to note:
  • Rumor: Tower will not notify owner a siphon has been placed (hope this changes)
  • POS will not automatically aggress the siphon
  • Siphon has a waste factor
    • some of its yield is destroyed in transit from POS --> Siphon unit
  • Anyone can access the cargo of the siphon unit
  • Any number of siphons can be placed outside a tower
    • Each siphon steals (in dropped order) from the final yield
  • Only steals from "last step" in the chains it can steal from
  • Siphon steals from the "link" not the silo.  Once it's in a silo, it's safe
What isn't 100% clear is whether siphon units will steal from Biochemical Reactors (drugs) or Polymer Reactors (T3 reactions).  I think the answer is no, but I would rather it be stated explicitly.

What Is Good

I preface that "what is good" may not be good for a particular subset, but healthy for the game in general.  The short story is the siphon waste factor should constrict moon material supplies.  As of today, most of these materials are in oversupply.  Up to now, the only place that slack could be drawn out with was T2 consumption (which has been in steady decline up to now thanks to tiericide).  Also, it breaks down the ivory tower big moon holders have enjoyed and allows pilots to interfere at a small scale.  Lastly, this forces reaction holders to invest more active time into managing their lines, lest someone steal from you.

I think this module is a great add and though I personally fear for my own projects, I think it's a great middle ground to push industry and PVP closer together.  Also, I really like the idea of more moon goo slipping out of the supply chain to waste.

What Is Bad

I think the balance on this module is still a little off.  Personally, I'd like to see a few things changed:
  • Notify owners when a Siphon goes up
  • Swap stolen quantities.  I think the raw material steal rate is a bit too high
  • Add stacking penalty.  Either increase waste, or lower yields with more siphons dropped
  • Up cost OR up m3 requirement.  I'd prefer if frigates couldn't carry them, or if they cost ~75M
The big losers in this change aren't the blocs or corps, but the logistics guys who already have to deal with a crappy POS system.  I think the major rallying point should be owners MUST be notified about siphons being deployed on their equipment.  Otherwise, prioritizing defense will become far too cumbersome.  I personally was hoping that POS would auto-engage the siphons, but I can understand why CCP didn't do this.  

I also think the linear scaling is a problem.  I would rather see diminishing returns, since a single player could very easily drop a dozen of these siphon units around a tower and milk the system dry.  I understand why it might be difficult to implement/explain, but since small siphons are so small and cheap, it's not well balanced against the "What about Goons?" test.

Tuesday, October 15, 2013

Objective Complete: zKB Data Get

3.75M Kills parsed (2013 so far)
17.5M Entries
40hr estimated parsing time

Frigate 905,329
Cruiser 315,493
Battleship 77,617
Industrial 81,642
Capsule 929,041
Titan 25
Shuttle 41,814
Rookie ship 246,308
Assault Frigate 110,147
Heavy Assault Cruiser 30,583
Deep Space Transport 2,421
Combat Battlecruiser 170,480
Destroyer 332,165
Mining Barge 54,804
Dreadnought 2,218
Freighter 1,960
Command Ship 6,340
Interdictor 32,956
Exhumer 24,032
Carrier 4,873
Supercarrier 113
Covert Ops 39,401
Interceptor 52,546
Logistics 15,082
Force Recon Ship 24,132
Stealth Bomber 97,226
Capital Industrial Ship 247
Electronic Attack Ship 5,940
Heavy Interdiction Cruiser 3,961
Black Ops 1,129
Marauder 1,375
Jump Freighter 861
Combat Recon Ship 8,368
Industrial Command Ship 2,986
Strategic Cruiser 32,309
Prototype Exploration Ship 265
Attack Battlecruiser 82,652
Blockade Runner 11,583


Remaining To-Do

  1. Investigate count bug
    • Initial dump is 10x expected values on items?
  2. Finish "prettying" for release
  3. Update pricefetch to scrape all regions for full market picture
  4. Find a way to maintain/release .sql dump of data generated
  5. mySQL optimization and "bridge" scripts for smaller passes

Progress So Far

I have to thank a bunch of people for helping me get to this point where I have at least a passable crawler and data set to munch on.  I would like to get EVE-Central's dumps processed before moving onto the data science step, but we will see what happens.

Extra special thanks to:
I still have a lot of work to go between "working" and "good", but being able to stand upright and get my hands on this data is exceptionally awesome.

Finally, I can put together data like this:

Monday, October 14, 2013

Fool's Errand

I found today's Nobel prize in Economics interesting, especially since it's partially related to my project.

The prize winners, all vastly more qualified than me, state through their research that you can't know short term price fluctuations, but should be able to map longer term trends.  I might be in trouble, since my project is looking to do the opposite: chart with decreasing certainty a small number of weeks into the future.

The end product here is that I may not be able to do what I want with all this data.  But if I don't try, because it's "impossible", then I will never know.  I'd like to take this moment to talk through some of my dissenters' opinions.

Imperfect Data

This is the most common dissent I hear when people hear what I am trying to do: "But the out-of-game feeds are imperfect.  How could you possibly know EXACTLY [pick your metric]?"  I always end up countering with a classical engineering retort: "But I can get close enough"

If I may extend the metaphor, imagine you couldn't possibly see something with the naked eye (ISS flying overhead, for example).  If I could get a telescope to take a half-decent black-and-white picture of it, would that not be "close enough" for practical purposes to show you that it was there and what it kinda looked like?  I may not be able to provide the stunning HD pictures NASA can, but something beats nothing.

Exploring the frontier is all about using what you can to get what you need.  I may not be able to tell you EXACTLY how many noob frigates died this year, but I can tell you it's on the order of ~250K and probably under 300K.  Just because I can't know the EXACT number doesn't mean a good estimate has no value.  

Understanding Limitations

It's important to know the relative accuracy of the data you're collecting, and what your blind spots may be.  As far as kill data goes, these are the assumptions I am using:
  • PVP-kill quality
    • 95% quality.  API-only kills should provide extremely good coverage
    • HS kills will be less thorough.  But gaps should be very small
  • Other kill quality (NPC kills, CONCORD kills, self-destructs)
    • No way to view these kills.  zKB filters NPC-only kills before adding them to DB
    • These kills should account for a very low percentage of destruction data
The thinking goes: if something dies in PVP, it should get to zKB somehow.  It only takes one key, from either the victim or the killer (or their corps), to get the data.  Now, it is possible to have kills unaccounted for, where the killer (killing blow) or victim or their corps don't have a key in zKB/eve-kill's records.  But losing sleep over the last ~10% that I can't know is not worth derailing the 90% I'm already getting.

The things that worry me that I'm not seeing:
  • PVE deaths: BS/BC/T2 losses to rats
  • Suicide bombers: Attacker km's are ignored
  • Self destruct data: small segment of pod data not being tracked
  • NPC corp data that might be missed because killer doesn't have correct keys
The hope is these groups account for a very small fraction of the data out there.  

Prediction Quality

I expect to get a decent idea of the future price of something (trend up or down, by how much?) and network all those predictions together to feed to a machine that will automatically task out my manufacturing lines.  If the tuning is strong enough, getting a leg up on the shipping margin economy is a second avenue for activity.  

I catch flak when I describe this because people get mired in the fine details.  I might predict a 20% rise in price on a weapon, but only see a 10% rise.  That's still enough to pocket profit, and I'm better off having some numbers-based prediction than spending a ton of my time scouring the numbers and playing by "gut feeling".  Large repetitive math is EXACTLY what computers are for, and if I can tune the machine to have even a sliver of intuition, then I am ahead of my competition.

Today, I am using today's-cost and today's-profit to say that when I do get to market, I will be somewhere close to that prediction.  I also watch market order volume to make sure what I bring to market is a suitably small percentage of actual sales, so as not to be the downward force.  This is fine in products that swing slowly (most modules) but can be extremely troublesome in ships where bubbles are constantly forming/popping with fickle player tastes.  

My preliminary data doesn't show kills as a predictive metric.  But with kill data being extremely spiky (weekend warriors), I may not be looking at the groupings correctly yet.  So far, only pure-market numbers look like the trend setters.  This is probably because the kill data I am scraping isn't as public and easy to access as eve-central data.  But there has to be some amount of weight to put on "replacement" behavior, rather than just purely buying and selling commodities without any other basis in reality.

I am wondering if I should get in touch with Chribba or Ripard Teg for their PCU numbers too, since player participation is pretty directly related to profitability.

Progress Update

In the end, you won't know unless you try.  Even if the data is purely scientific it's been extremely interesting to get a look at what is destroyed.  Raw dumps for those interested.

As interesting as that is, you get a very similar picture with sales data

If you overlay the two charts, the levels and spikes line up pretty similarly.
As for data parsed so far: total ~3.15M mails parsed
Frigate 902227
Cruiser 314795
Battleship 77331
Industrial 81221
Capsule 929040
Shuttle 3153
Rookie ship 245810
Assault Frigate 109897
Heavy Assault Cruiser 164
Deep Space Transport 30
Combat Battlecruiser 387
Destroyer 314605
Mining Barge 658
Command Ship 6280
Interdictor 32810
Exhumer 23516
Carrier 4873
Covert Ops 537
Interceptor 276
Logistics 14973
Force Recon Ship 177
Stealth Bomber 96750
Capital Industrial Ship 247
Electronic Attack Ship 32
Heavy Interdiction Cruiser 137
Black Ops 19
Marauder 27
Combat Recon Ship 53
Strategic Cruiser 377
Prototype Exploration Ship 236
Attack Battlecruiser 192
Blockade Runner 164

Thursday, October 10, 2013

A Little Less Talk: Part 2 - zKB cooking with fire

Like Frankenstein's monster, the parts are coming together.  zKB crawling is almost ready for the first full time passes.  Figured it would be worth blogging some of the work.

Throughput is a Bitch

After 2 days of running the "pre-alpha" full-flow version of my binner, the progress was as follows:
Frigate 319,689 (to Aug 1)
Rookie ship 241,695
Logistics 14,736
Capital Industrial Ship 245
Prototype Exploration Ship 186
That's a lot of mails parsed, but my rate was something like 400 kills/minute.  This is abysmally slow, and means I would have needed several days to have any hope of getting the whole destruction picture.  

Thankfully, the dudes behind zKB just added some keys to better communicate server status; my dry run was pulling 2,600 kills/minute and stands to run stably at up to 3,800 kills/minute.  Still pretty slow compared to the market data (10,000 entries/minute), but I'll take the sizable improvement.

For those playing along, 3 throughput keys were added to the HTTP header:
X-Bin-Attempts-Allowed
X-Bin-Requests
X-Bin-Seconds-Between-Request
Leveraging these keys lets me set the between-call waits on the fly.  As the budget changes, I am able to adapt and pull "as fast as possible" according to the rules.  I would like to implement a more dynamic back-off routine that keeps a steadier stream, but so far that is not yielding better throughput.
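
Implementation-wise it's nothing fancy; a trimmed-down sketch of the fetch-and-wait step (the header names are zKB's, but how I turn them into a sleep is my own guess):

    # Illustrative sketch: read zKB's X-Bin-* headers and pace the next call.
    import json
    import time
    import urllib2

    def fetch_page(url, default_wait=10):
        request = urllib2.Request(url, headers={"User-Agent": "Prosper scraper"})
        response = urllib2.urlopen(request)
        kills = json.loads(response.read())

        info = response.info()
        allowed = int(info.getheader("X-Bin-Attempts-Allowed", "0"))
        used = int(info.getheader("X-Bin-Requests", "0"))
        wait = float(info.getheader("X-Bin-Seconds-Between-Request",
                                    str(default_wait)))

        # Back off harder as the request budget runs dry
        if allowed and used >= allowed:
            wait *= 2
        time.sleep(wait)
        return kills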

Still a Database Scrub

Originally, I had the scraper set up dynamic "bins" from a file and push those into a table.  The output can be found on my gdoc dump.  While that is practical for serving from SQL to the user, it is not efficient or elegant.  By relying on the data dump for translation, I'm now only storing the required information:
  • Date destroyed
  • Week destroyed (because I don't want to do the date->week conversion)
  • typeID
  • typeGroup (also for easy grouping)
  • systemID
  • destroyed count
I could stand to lose Week/typeGroup from the DB, but I like to have the quicker grouping handy... and being numbers instead of strings means they are much smaller to deal with.
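
For reference, the table boils down to something like the sketch below (names, types, and the key choice are my shorthand, not the exact schema):

    # Illustrative sketch of the destruction bin table, plus the upsert the
    # binner uses so each parsed kill lands in its (date, type, system) bin.
    CREATE_KILL_TABLE = """
        CREATE TABLE IF NOT EXISTS killDB (
            kill_date  DATE     NOT NULL,  -- date destroyed
            kill_week  SMALLINT NOT NULL,  -- pre-computed week number
            typeID     INT      NOT NULL,  -- what was destroyed
            typeGroup  INT      NOT NULL,  -- groupID, kept for quick binning
            systemID   INT      NOT NULL,  -- where it died
            destroyed  INT      NOT NULL,  -- how many
            PRIMARY KEY (kill_date, typeID, systemID)
        )
    """

    ADD_KILLS = """
        INSERT INTO killDB
            (kill_date, kill_week, typeID, typeGroup, systemID, destroyed)
        VALUES (%s, %s, %s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE destroyed = destroyed + VALUES(destroyed)
    """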

Results

Frigate 27
Cruiser 1
Industrial 51
Capsule 1
Shuttle 5
Rookie ship 50349
Assault Frigate 3
Heavy Assault Cruiser 1
Deep Space Transport 9
Combat Battlecruiser 4
Destroyer 4
Mining Barge 19
Interdictor 2
Exhumer 63
Covert Ops 2
Force Recon Ship 2
Stealth Bomber 5
Capital Industrial Ship 247
Prototype Exploration Ship 189
Blockade Runner 39
In ~30mins, I was able to crunch nearly 50,000 kills.  The numbers aren't as tidy as before (should add bool value for ships killed vs cargo destroyed), but this is leaps and bounds better than before.  Odds are good that by Monday I'll have a nearly complete picture of destruction statistics in EVE.

To Do

  • Add something to watch for repeated killID's
  • Clean up home PC so I can parse this data at home
  • Better test "polite snooze" routine
The zKB devs have asked me to contribute this feature to them so they can serve the data themselves.  I would be more than happy to open up that data to the world through them, but seeing as kill data is such a small segment of my project, I would rather focus on my goals for the time being.  If I can get to the point where I am able to hire contributors, then I might be able to loop back and contribute to them.

Also, zKB has a service like EMDR that throws live data to listeners.  If I can get most of the parts I'd like stable on the "cron" data, then I'd be more than happy to switch feeds over.  Unfortunately, since I have no reliable web space to catch these live feeds, I am not able to get the qualities I need from them at this time.

Tuesday, October 8, 2013

Kill Data Progress

Short and sweet on this one.  Just want to share some of the graphpr0n I've managed to get out of the work so far.  I've pushed the raw data dumps to gdoc if you want to play along.

Stacked Area

(three stacked-area charts of the parsed kill data)

Progress

Frigate 28000
Rookie ship 241695
Logistics 14736
Capital Industrial Ship 245
Prototype Exploration Ship 186

As of 2013-10-08 @ 17:45 MDT

Notes

The main guy I am bouncing ideas off of with this project is Valkrr.  Since he's doing a far more comprehensive version of my table, we keep swapping tips and tricks.  Some quick progress stuff:
  • Features yet to add
    • Clean up progress printing
    • Better tune between-call sleeps
    • More information in the crash handler (group progress numbers, "globals")
    • Repeated kill protection
    • Remove extra id/strings from DB
  • Might swap crawling style from groupID to regionID (as per Valkrr's advice)
    • better quality data faster
    • current schema needs WHOLE SCRIPT to run before module data is useful
    • not as easy to clean up overwrites?
  • IT TAKES FOREVER TO GET THIS DATA!
    • ~400 kills/minute, 264k kills so far...
    • still need cruisers, frigates, shuttles, and pods

Coup de Grace


Monday, October 7, 2013

A Little Less Talk: Part 1.5

I talked a few weeks ago about switching to a "Make it go first, then make it pretty" approach to building new tools.  With some more slump time at work and my manic side back again, I am finally attacking zKB data again.

Follow code progress on GitHub

I have two goals for parsing zKB data.  First, and foremost, is to be able to incorporate destruction counts into my market tools.  Second, which is WAY harder, is to build a box that will track "big fights" in EVE.  zkb_binner.py is only an answer to the first problem, but I have been talking with Valkrr about how to organize a local DB to efficiently parse the data presented in the zKB API.

A small taste of the data I'm generating:

Short Goals for Great Justice

I am finding that, as much as I hem and haw about writing good code at work, I only have so much coding drive running solo.  Though I am recycling a lot of the foundation from my earlier burn-out, my approach this time is much more direct.

Work flow was something like:
  1. Make initializations (DB, connections, bare-bones parseargs)
  2. Improve original HTTP fetch code: more direct and useful in crash
  3. Really flesh out robust fetch/crunch code for easy iteration (even if not pretty, should be class-based)
  4. Incorporate crash tools from pricefetch
  5. Tune up all parts until lined up with goals
Unfortunately, by taking a quick-and-dirty approach, I've cut off my toes a little when it comes to writing well-formed code.  In all honesty, my Python looks more like Perl or basic C, which is reinforcing bad habits... but I lack a project of the size and scale that really requires that kind of work.  If I'd buck up and start working on UI stuff, or get out of monolithic projects like these scrapers, I am sure I could find the need for fancier structures.

Databases Are Hell

SQL continues to be the sticking point in a lot of this work.  I love the final result, being able to slice-n-dice off a local copy rather than having to scrape custom values, but there are so many little quirks, and SQL is so finicky.  I also have a problem in that I'm trying to mimic the EVE DBs, which are all about joining the data artfully and spreading the load out over several tables, but I just have the hardest time thinking that way.  Again, Valkrr has been invaluable in helping me understand and think about the problem better.

For instance, pricefetch generates a 1.7M row table, and most queries take 30s-90s to return.  This is fine when I'm fiddling, but it's absolutely atrocious for hosting live anywhere.  If I want to expand that table to include multiple regions, and add eve-central dump data, I need to find a more efficient layout.  Furthermore, I'm having enough trouble on my home computer that it's time to do a fresh sweep or finally invest in some more disk storage.  If each of these tables is going to hover near 1GB or more, some new hardware might be in order.

Furthermore, I keep running into weird corner cases.  For instance, while trying to run the script last night, I caused a BSOD that nearly scrapped my SSD.  It seemed to be more a problem with the crash reporter, which updates the progress file after every successfully parsed page, and not the actual DB itself.  I won't lie, I'm a little hesitant to keep up the work on my home machine (or keep doing 1yr of data).

Error Handling.  Error Handling Everywhere.

As I opined earlier, this project is more about writing error resistant code than the actual work of slicing and dicing data.  So often in my other projects, I keep the scope so narrow that the number of handled errors is small, but using someone else's API means you have to suit yourself to their methods.  This is most apparent when you contrast the two return sizes between EVE-Marketdata and zKillboard.  EVE-Marketdata returns up to 10,000 entries (I split my calls to 8,000 to avoid time out or overflow); zKillboard only returns 200.  I was up to 50,000 kills parsed in the noob-frigate group, and wasn't out of August.  So, getting a whole year's worth of data in one place might prove to be slightly challenging and time consuming.
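
To give a sense of what that 200-kill page size means in practice, the zKB side of the fetch loop ends up looking something like this (URL shape and pacing are illustrative):

    # Illustrative sketch: walk zKB's paged API 200 kills at a time.
    import json
    import time
    import urllib2

    ZKB_URL = "https://zkillboard.com/api/kills/groupID/{group}/page/{page}/"

    def fetch_group(group_id, wait_seconds=15):
        page = 1
        while True:
            url = ZKB_URL.format(group=group_id, page=page)
            kills = json.loads(urllib2.urlopen(url).read())
            if not kills:
                break
            for kill in kills:
                yield kill
            if len(kills) < 200:  # a short page means we hit the end
                break
            page += 1
            time.sleep(wait_seconds)  # stay polite between calls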

After chatting with some friends with more sense than me, I am thinking I may need to split the databases into two versions.  A "data project" version, which is large and unwieldy, and a "live production" version, that caps the stored data to ~90 or ~120 days.  Also, I am thinking of building a "merge buddy" script that helps combine all the different data sources into one monolithic source, or at least helps with the merging of individual items for data projects.  I could probably make it work if I had better SQL fu, but my merge attempts were taking north of 5min to return, and the grouping was not working well.

To Do List

  1. Make progress updates pretty for scraping scripts
    • Currently dumps a lot of debug info.  Would like to instead have running "processing x" line(s)
  2. Work on scraping EVE-Central raw dump files for actual market order info
    • Want to combine history/kills/orders data so I can present a candlestick + sales volume + kill volume + build cost
  3. Work on a build cost script
    • Might work 50/50 between making build cost db and updating my kludgy perl kitbuilder
    • Would rather have pretty kitbuilder than basic merge script
    • Thanks to Valkrr for letting me steal his RapidAssembly tool
  4. Work on a work/sales accounting tool.  
    • Want to be able to open an industry corp, but lacking work accounting
    • Would like to pay better/differently than my Aideron Technologies buddies
  5. Finally get to making predictive robot?
The further down that to-do list we get, the fuzzier the future gets.  I've been having a lot of luck recently taking small steps.  I already have trouble finding the motivation to code, and if I make the project too big I tend to lose focus and cannot resume where I left off.  Also, I have some desire to start working toward a presentable frontend, and people keep raving about Bootstrap, so I may have to change the plan to suit that interest.