Follow code progress on GitHub
I have two goals for parsing zKB data. First, and foremost, is to be able to incorporate destruction counts into my market tools. Second, which is WAY harder, is to build a box that will track "big fights" in EVE. zkb_binner.py is only an answer to the first problem, but I have been talking with Valkrr about how to organize a local DB to efficiently parse the data presented in the zKB API.
A small taste of the data I'm generating:
Short Goals for Great Justice
I am finding, as much as I hem and haw about writing good code at work, I only have so much code drive running solo. Though I am recycling a lot of the foundation from my earlier burn-out, my approach this time is much more direct.
Workflow was something like this (a rough skeleton of it is sketched after the list):
- Make initializations (DB, connections, bare-bones argparse)
- Improve the original HTTP fetch code: more direct, and more useful when it crashes
- Really flesh out robust fetch/crunch code for easy iteration (even if not pretty, should be class-based)
- Incorporate crash tools from pricefetch
- Tune up all parts until lined up with goals
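For the curious, here is a minimal sketch of that skeleton as I picture it. The class name, table layout, endpoint path, and JSON field names are placeholders for illustration, not the actual zkb_binner.py internals:

```python
# Rough skeleton of the workflow above. The class name, table layout,
# endpoint path, and JSON field names are placeholders, not the real
# zkb_binner.py internals.
import argparse
import sqlite3

import requests


def init_db(path):
    """Create the local DB and a bare-bones kills table if it's missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kills ("
        "kill_id INTEGER PRIMARY KEY, ship_type_id INTEGER, "
        "solar_system_id INTEGER, kill_time TEXT)")
    conn.commit()
    return conn


class KillFetcher:
    """Class-based fetch/crunch pair so each half can be iterated on alone."""

    def __init__(self, conn, base_url):
        self.conn = conn
        self.base_url = base_url.rstrip("/")

    def fetch_page(self, page):
        # Keep the HTTP call in one place so crash info is easy to report.
        resp = requests.get("{}/page/{}/".format(self.base_url, page), timeout=30)
        resp.raise_for_status()
        return resp.json()

    def crunch(self, kills):
        rows = [(k["killID"], k["victim"]["shipTypeID"],
                 k["solarSystemID"], k["killTime"]) for k in kills]
        self.conn.executemany(
            "INSERT OR IGNORE INTO kills VALUES (?, ?, ?, ?)", rows)
        self.conn.commit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--db", default="zkb_local.db")
    parser.add_argument("--pages", type=int, default=10)
    args = parser.parse_args()

    # groupID 25 (frigates) is just an example filter.
    fetcher = KillFetcher(init_db(args.db),
                          "https://zkillboard.com/api/kills/groupID/25")
    for page in range(1, args.pages + 1):
        fetcher.crunch(fetcher.fetch_page(page))
```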
Unfortunately, by taking a quick-and-dirty approach, I've cut off my toes a little when it comes to writing well-formed code. In all honesty, my Python looks more like Perl or basic C, which is reinforcing bad habits... but I lack a project with the size and scale to really require that kind of work. If I'd buck up and start working on UI stuff, or get away from monolithic projects like these scrapers, I'm sure I could find the need for fancier structures.
Databases Are Hell
SQL continues to be the sticking point in a lot of this work. I love the final result, being able to slice-n-dice a local copy rather than having to scrape custom values, but there are so many little quirks, and SQL is so finicky. Another problem is that I'm trying to mimic the EVE DBs, which are all about joining the data artfully and spreading the load out over several tables, and I just have the hardest time thinking that way. Again, Valkrr has been invaluable in helping me understand and think about the problem better.
For instance, pricefetch generates a 1.7M-row table, and most queries take 30-90 seconds to return. This is fine when I'm fiddling, but it's absolutely atrocious for hosting live anywhere. If I want to expand that table to include multiple regions and add eve-central dump data, I need to find a more efficient layout. Furthermore, I'm running into enough trouble on my home computer that it's time to do a fresh sweep or finally invest in some more disk storage. If each of these tables is going to hover near 1GB or more, some new hardware might be in order.
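One cheap thing I still need to try before blaming the hardware: indexes on the columns the common queries filter by. A hedged sketch, assuming a price_history table with type_id/region_id/price_date columns; the actual pricefetch schema differs:

```python
# Hedged sketch: add indexes for the common query filters. The DB name,
# table name, and column names are assumptions, not the real pricefetch schema.
import sqlite3

conn = sqlite3.connect("pricefetch.db")  # placeholder DB name
conn.executescript("""
    CREATE INDEX IF NOT EXISTS idx_hist_type_region
        ON price_history (type_id, region_id);
    CREATE INDEX IF NOT EXISTS idx_hist_date
        ON price_history (price_date);
""")
conn.commit()
conn.close()
```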
On top of that, I keep running into weird corner cases. For instance, while trying to run the script last night, I caused a BSOD that nearly scrapped my SSD. It seemed to be more a problem with the crash reporter, which updates the progress file after every successfully parsed page, than with the actual DB itself. I won't lie, I'm a little hesitant to keep up the work on my home machine (or keep pulling a full year of data).
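One mitigation I'm considering for the crash reporter (an assumption about a safer pattern, not what the script currently does): write the progress file to a temp path and rename it over the real one, so a crash mid-write can't leave a half-written file behind.

```python
# Sketch of an atomic progress-file update: write to a temp file in the same
# directory, then rename it over the real file (Python 3's os.replace).
# The file layout and field names here are placeholders.
import json
import os
import tempfile


def save_progress(path, page, last_kill_id):
    state = {"page": page, "last_kill_id": last_kill_id}
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        os.remove(tmp_path)
        raise
```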
Error Handling. Error Handling Everywhere.
As I opined earlier, this project is more about writing error-resistant code than about the actual work of slicing and dicing data. So often in my other projects I keep the scope narrow enough that the number of handled errors stays small, but using someone else's API means you have to suit yourself to their methods. This is most apparent when you contrast the return sizes of EVE-Marketdata and zKillboard: EVE-Marketdata returns up to 10,000 entries (I split my calls at 8,000 to avoid timeouts or overflow), while zKillboard only returns 200. I was up to 50,000 kills parsed in the noob-frigate group and wasn't even out of August, so getting a whole year's worth of data in one place might prove to be slightly challenging and time consuming.
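At 200 kills per page, that 50,000 is already 250 requests, so the fetch loop has to be comfortable grinding through pages for a long time. A hedged sketch of the pagination pattern; the delay and stop condition are my own assumptions about being polite, not documented zKB requirements:

```python
# Hedged sketch of a page loop for a 200-kills-per-page API. fetch_page is
# any callable that returns the parsed kills for a page (e.g. the
# KillFetcher.fetch_page sketch above).
import time


def fetch_all(fetch_page, per_page=200, delay=1.0):
    page = 1
    while True:
        kills = fetch_page(page)
        for kill in kills:
            yield kill
        if len(kills) < per_page:   # a short page means we've hit the end
            break
        page += 1
        time.sleep(delay)           # assumed politeness delay, not an API rule
```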
After chatting with some friends with more sense than me, I'm thinking I may need to split the databases into two versions: a "data project" version, which is large and unwieldy, and a "live production" version, which caps the stored data at ~90-120 days. I'm also thinking of building a "merge buddy" script that helps combine all the different data sources into one monolithic source, or at least helps with merging individual items for data projects. I could probably make it work if I had better SQL-fu, but my merge attempts were taking north of five minutes to return, and the grouping was not working well.
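The "live production" cap is at least easy to prototype. A hedged sketch, reusing the placeholder kills table from earlier; none of these names are my real schema:

```python
# Hedged sketch of the "live production" retention cap: prune anything older
# than ~120 days. Assumes kill_time is stored as an ISO 8601 string.
import sqlite3

RETENTION_DAYS = 120

conn = sqlite3.connect("zkb_live.db")  # placeholder DB name
conn.execute(
    "DELETE FROM kills WHERE kill_time < date('now', ?)",
    ("-{} days".format(RETENTION_DAYS),))
conn.commit()
conn.execute("VACUUM")  # reclaim the freed space on disk
conn.close()
```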
To Do List
- Make progress updates pretty for scraping scripts
- Currently dumps a lot of debug info; would rather have a running "processing x" line (or lines)
- Work on scraping EVE-Central raw dump files for actual market order info
- Want to combine history/kills/orders data so I can present candlesticks + sales volume + kill volume + build cost (rough join sketch after this list)
- Work on a build cost script
- Might work 50/50 between making the build cost DB and updating my kludgy Perl kitbuilder
- Would rather have pretty kitbuilder than basic merge script
- Thanks to Valkrr for letting me steal his RapidAssembly tool
- Work on a work/sales accounting tool
- Want to be able to open an industry corp, but lacking work accounting
- Would like to pay better/different than my Aideron Technologies buddies
- Finally get to making predictive robot?
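For the candlestick item above, the combined view I have in mind is roughly a daily join of price history against kill counts. A hedged sketch; every table name, column name, and the example type/region IDs are placeholders rather than my real schema:

```python
# Hedged sketch of the history + kills join behind the candlestick idea.
# Kills here are only filtered by ship type; mapping solar systems to regions
# would need another join against the EVE static data.
import sqlite3

QUERY = """
SELECT h.price_date,
       h.open_price, h.high_price, h.low_price, h.close_price,
       h.volume         AS sales_volume,
       COUNT(k.kill_id) AS kill_volume
FROM price_history AS h
LEFT JOIN kills AS k
       ON date(k.kill_time) = h.price_date
      AND k.ship_type_id    = h.type_id
WHERE h.type_id = ? AND h.region_id = ?
GROUP BY h.price_date
ORDER BY h.price_date
"""

conn = sqlite3.connect("combined.db")  # placeholder DB name
rows = conn.execute(QUERY, (587, 10000002)).fetchall()  # e.g. Rifter in The Forge
```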
The further down that to-do list we get, the fuzzier the future gets. I've been having a lot of luck recently taking small steps. I already have trouble finding the motivation to code, and if I make a project too big I tend to lose focus and can't resume where I left off. Also, I have some desire to start working toward a presentable frontend, and people keep raving about Bootstrap, so I may have to change the plan to suit that interest.
2 comments:
What if you ran the DB on a RAM drive instead of an SSD or platter?
http://memory.dataram.com/products-and-services/software/ramdisk/
The base version without registration creates a drive up to 4GB.
I've been working on an SSIS package that takes a base eve-central csv file and cleanses / calculates trade data for any or all regions. It's very much a side-project at the moment, but drop me an e-mail if you're interested in the SQL as I've not really had the time to turn it into my one stop inter-hub ISK making tool yet!