Friday, September 23, 2016

ProsperWarehouse - Building Less-Bad Python Code

EVE Prosper is first and foremost a data science project.  And though hack-and-slash has got us this far, we need to consider a proper design/environment if we want to actually expand coverage rather than just chase R/CREST/SQL bugs.

There has been some work moving Prosper to a v2 codebase (follow the new github projects here) but ProsperWarehouse is a big step toward that design.  This interface should allow us to open up a whole new field of projects, so it's critical to nail this design on the first pass before moving on.

What The Hell Is This Even
Building a Database Abstraction Layer (DAL).  

Up to now we have used ODBC, but cross-platform deployment and database-specific weirdness have caused problems, such as painful ARM and macOS support.  Furthermore, relying only on ODBC means we can't integrate non-SQL sources like MongoDB or Influx into our stack without rewriting huge chunks of code.  Lastly, we have relied on raw SQL and string hacks sprinkled all over the original codebase, making updates a nightmare.

There are two goals of this project:
  1. Reduce complexity for other apps by giving standardized get/put methods.
  2. Allow easier conversion of datastore technologies: change the connection without changing behavior.
By adopting Pandas as the actual data transporter, everything can talk the same talk and move data around with very little effort.  Though some complexity comes from cramming noSQL-style data into traditional dataframes, that complexity can be hidden under the hood so the same structures always come back when prompted.
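To make the idea concrete, here's a minimal sketch of what "everything talks DataFrames" could look like.  The class and method names (`DummySource`, `get_data`, `put_data`) are illustrative assumptions, not the actual ProsperWarehouse API:

```python
import pandas as pd

class DummySource:
    """In-memory stand-in for any backend (SQL, Mongo, Influx)."""

    def __init__(self):
        self._frame = pd.DataFrame(columns=["item_id", "price"])

    def put_data(self, payload):
        # Whatever the backend, the caller only ever hands over a DataFrame
        self._frame = pd.concat([self._frame, payload], ignore_index=True)

    def get_data(self, **filters):
        # ...and always gets a DataFrame back, no matter the technology
        result = self._frame
        for column, value in filters.items():
            result = result[result[column] == value]
        return result.reset_index(drop=True)

source = DummySource()
source.put_data(pd.DataFrame({"item_id": [34, 35], "price": [5.5, 10.2]}))
print(source.get_data(item_id=34))
```

The point is that callers never see cursors, BSON, or line protocol; swapping the backend only changes what happens inside the class.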

How Does It Work?
The Magic of Abstract Methods

I've never been a great object-oriented developer, and I've been especially weak with parent/child relationships.  Recent projects at work have taught me some better tenets of API design and implementation, and I wanted to apply those lessons somewhere personal.

Database Layer

Holds generic information about the connection; essentially the bulk of the API skeleton.  Whatever Database() defines will need to be filled in by its children.  This container doesn't do much work, but it acts as the structure for the whole project under the hood.
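A rough sketch of that skeleton, assuming Python's abc module for the abstract-method magic.  The method names here are guesses at the shape of the API, not the project's real signatures:

```python
from abc import ABCMeta, abstractmethod

class Database(metaclass=ABCMeta):
    """Parent skeleton: children MUST fill in every abstract method."""

    def __init__(self, datasource_name):
        self.datasource_name = datasource_name
        self.connection = None

    @abstractmethod
    def get_connection(self):
        """Open the technology-specific connection."""

    @abstractmethod
    def test_table(self):
        """Validate that the resource looks the way we expect."""

    @abstractmethod
    def get_data(self, **kwargs):
        """Fetch records as a pandas DataFrame."""

    @abstractmethod
    def put_data(self, payload):
        """Write a pandas DataFrame back to the store."""

# The parent refuses to be instantiated directly -- that's the whole trick
try:
    Database("test")
except TypeError as err:
    print("abstract guard works:", err)
```

This is what keeps children honest: a new datasource that forgets to implement `get_data()` blows up at construction time, not deep inside a cron job.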

Technology Layer

Right now that's only SQLTable(), but this layer is designed to hold/init all the technology-specific weirdness: connections, query lingo, test infrastructure, configurations.  It's meant to be interchangeable, so you could pull out SQLTable and replace it with a MongoDB- or Influx-specific structure.  This isn't 100% foolproof given how some of the test hooks are built in right now, but with standardized input/output, conversion shouldn't be a catastrophe.
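As a standalone sketch of what lives in this layer, here's a SQL-flavored class that owns connection handling and query lingo.  sqlite3 stands in for the real backend, and the names are assumptions rather than ProsperWarehouse's actual API:

```python
import sqlite3

import pandas as pd

class SQLTable:
    """Technology layer sketch: all the SQL-specific weirdness lives here."""

    def __init__(self, table_name):
        self.table_name = table_name
        self.connection = None

    def get_connection(self):
        # Technology-specific setup; a MongoDB/Influx sibling would differ here
        self.connection = sqlite3.connect(":memory:")
        self.connection.execute(
            "CREATE TABLE {} (item_id INTEGER, price REAL)".format(self.table_name)
        )

    def put_data(self, payload):
        # pandas handles the SQL dialect details for us
        payload.to_sql(
            self.table_name, self.connection, if_exists="append", index=False
        )

    def get_data(self):
        return pd.read_sql(
            "SELECT * FROM {}".format(self.table_name), self.connection
        )

table = SQLTable("market_snapshot")
table.get_connection()
table.put_data(pd.DataFrame({"item_id": [34], "price": [5.5]}))
print(table.get_data())
```

Note that `put_data`/`get_data` still speak DataFrames; only the guts of this class would change if the technology changed.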

Datasource Layer

A connection-per-resource is the goal going forward.  This means we give up JOIN functionality inside SQL, but gain an easier-to-manage resource that can be abstracted.  All of the validation, connection setup/testing, and any special-snowflake modifications go in this layer.  Also, because these have been broken out into their own .py files, debug tests can be built into __main__ so a human can actually fix problems without relying on shoddy debug/logging.

This adds a lot of overhead when initializing a new datasource.  In return for that effort, we get the ability to test/use/change those connections as needed, rather than going up a layer and fixing everything that connected to that source.  It's not free, but the cost should pay off in faster development down the line.

Importlib Magic

The real heavy lifter for the project isn't just the API object design, but a helper that turns an ugly set of imports/inits into a far simpler fetch_data_source() call.  I would really like to dedicate a blog post to this, but TL;DR: importlib lets us interact with modules more like function pointers.  This was useful for a work project because we could execute modules by string rather than importing/executing every module in sequence.  It should make it so you just have to import one module and get all the dependent structure automagically.

Without importlib, every datasource would have to be imported and initialized by hand, one import plus one constructor call per source.

With importlib, all of that wiring collapses into a single fetch_data_source() call.
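A minimal sketch of the pattern, with a stdlib module standing in for a real table_config module so the example actually runs; the helper's name follows the post, but its details here are assumptions:

```python
import importlib

def fetch_data_source(source_name):
    """Resolve a datasource module from a plain string, function-pointer style."""
    # Real code would prepend the table_configs package to source_name
    return importlib.import_module(source_name)

source = fetch_data_source("json")  # one call instead of N hand-written imports
print(source.dumps({"item_id": 34}))
```

Because the lookup is driven by a string, the list of available datasources can live in config rather than in an ever-growing import block.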

A small change, but it should clean up overhead and allow more sources to be loaded more easily.  It also means you could fork the repo and build your own table_config path without going crazy trying to path everything.

A Lot Of Work For What Exactly?

The point is to simplify access into the databases.  With a unified design there, we can very easily lay the groundwork for a Quandl-like REST API.  Also, with the query logic simplified/unified, writing direct apps to fetch/process the data goes from 100+ lines of SQL to 2-3 lines of connection code.

By abstracting a painful piece of the puzzle, this should make collaboration easier.  It also buys us the ability to use local-only dummy sources for testing without production data, so collaborators can run in a "headless mode".  Though I doubt I will get much assistance updating the Warehouse code, it's a price worth paying: tedious chores like new cron scripts or REST API design come with far less arduous SQL-injection risk and testing.