Tuesday, October 22, 2013

A Little Less Talk: EMD scraper v2

In my fervor to get at one subset of data, I wrote myself into a corner.  So, I spent this last weekend ripping out the inner workings of my pricefetch script and bringing it line with the style/stability of my zkb scraper.

Code at Github

This exercise was painful because I had to essentially start over and rework the entire tool from top to bottom.  This did give me the chance to clean up a lot of errors (data backfill was bugged all along), and now things are pretty and fast.  I still have the issue of "fast as you please, there's still xGB's to parse", but I think I've worked the tool down into a sweet spot for effort/speed.

I owe a lot of thanks to the recent progress to Valkrr and Lukas Rox.  Seeing as I am so painfully green with databases, they've been exceptionally helpful in cleaning up some of the pitfalls I've run into.

What Changed?

Where pricefetch was designed to grab everything from one region, EMD_scraper is designed to grab everything from everywhere.  To accomplish this I put in two modes for scraping:
  • --regionfast
  • --itemfast
These handles help define the method of scraping.  --regionfast will attempt to pull as many regions as possible, resulting in a one-item-per-call return.  --itemfast does the opposite, trying to pull as many items as possible, one region at a time.  Also, unlike zKB_scraper which goes in dictionary-order, regions have been placed in a "most relevant" configuration on this release.  Namely big hubs first, then HS, LS, Nullsec.  It still accepts smaller lists, and you can modify the lookup.json values to your heart's content as well.

This also necessitated some updates to the crash handler.  Crashes now dump the entire progress so far (region,item) and the script modifies the outgoing calls to skip region/item combinations already run.  I'd really like a more efficient crash/fetch routine, trying to get the full 10k returns each query... but I can't know the limits ahead of time with the current layouts.  I'll take 10k max with 5-7k avg returns rather than try to dynamically update the query.  EMD isn't designed to crawl like zKB.

I'm not wholly pleased about how --itemfast runs.  I may have to rewrite to crawl through all items in one region before moving onto the next.  It's currently blasting through a large number of items and increments region.  

Beautification

Coding on my own, I have this habit of scrawling down code/files willy-nilly until I can get a stable working midpoint.  Since my professional code habits stem from more time spent repairing code or tacking features onto an existing project, I lack a lot of intuition on building foundations.

Repository Maintenance

When I first created the Prosper repository (about a year ago now) I spent a good deal of time trying to create a monolithic DB scraper/builder.  With this second try, I wanted to split the tasks into finer pieces and make the code more independent.  If I could adopt a "First: make it run" mentality, I could at least get to a manageable midpoint with data, rather than burning a bunch of effort in crafting expert code.  This resulted in a lot of duplicated work, and I figured since the paradigm shifted so far, I might as well gut that original code and promote the new scripts to "DB_builder" status

I am banking all of my examples to a scraps directory, but I need to make sure I am adding them all to the repository.  Thankfully, I find myself ransacking those samples to help move the project forward.  Much of the zKB urllib2 code was previously written.  Also, many of the item lookup JSONs were pre-existing.

A tack on the TODO list though is to add more sample data dumps into the SQL portion of the repository.  I was avoiding tracking these to avoid making the repo too large, but as Valkrr pointed out, at least keeping the SQL scripts of common queries would be useful as examples.

Death to Global Variables

I had a good Samaritan swing by my code and point out that I should de-commit some globals, like db_username/db_password, and replace them with configuration scripts.  After a little back-and-forth, he was so gracious as to add the .ini handlers for me into the zkb script.

I figured it was a good time to add some extra functionality and roped those changes into a more complete set.  Now zKB and EMD scrapers both pull from the same .ini; as will any other outgoing scraper (EVE-Central, eveoffline?).  I'd like to compartmentalize internal and external scrapers to use different .ini files, but we'll see how long that continues.

Cleaner, Clearer Code

If you look at the previous version of the EMD_scraper, you'll see a lot of commented code around working code.  I left a lot of the trial-and-error in the first version.  I have since cleaned a lot of that out, leaving only some quick handles in there for debug printing.

I would like to take another pass at these scripts down the line to make very-pretty output, instead of the progress dumping to the command line.  This is purely cosmetic though, so expect the priority to be extremely low.

SQL-Fu

I seriously underestimated how much trouble data warehousing would be.  I have spent a lot of time over the last week trying to understand where I am going wrong and what steps I am missing.

Steps so far:
  • Reduce DB size by reducing strings
    • Removed itemname from priceDB
  • Design the DB to have the data, use queries to make the form
    • Abandoned "binning" directly from zKB data
    • Instead save by system, binning can be handled in a second-pass method
  • OPTIMIZE TABLE is your friend
  • CUSTOM INDEX's for common queries: added some, need to read more
  • CCP and NVIDIA are sloppy with their previous patch cleanup:
    • Check C:\Program Files\CCP\EVE\
    • Check C:\nvidia\ 
  • mySQL is a hog

JOIN and SUM(IF(..)): two flavors that don't go well together

One bug I mentioned is that some of my queries are returning hilariously high values.  On my Neutron Blaster Cannon II experiment, the raw numbers were 10x what were in the DB.  When Powers, from the #tweetfleet asked for freighter data, I was returning something like 138x.  It seems I have been confused about order of operations in SQL.  

This is why I really want to get the "bridge" scripts done so I can just splice together the tables I want to have all the data I need.  Since the data is local, rescraping should be mostly trivial, and it would give me data stores in the shapes I need to move onto the next step of the machine.