Thursday, February 4, 2016

How I Data Science - Hunting For Trends

Longtime friend Blake has been pumping out some really exciting work around NPC-kill rates.  He recently poked me about taking that raw data forward to something with more teeth, and after futzing with it for about an hour, I came out with some interesting results.

His inquiry reminded me of a common question I receive: How do I get into data science?  So, this blog is going to be a bit longer and more technical than the recent fare.  We're going to walk step-by-step through the investigation process and I'll try to illustrate what I see as we go along.  The readouts were generated using JMP, only because it's faster to use than R.  This entire process can be done in R, and I can revisit with more specific R samples if it's requested.

Let's take a look:

Getting Started

Raw Data

I like the Forecaster's Toolbox as a jumping-off point.  We're looking for a few things when we start:

  • Basic visualizations: look for obvious trends
    • Simple time-series graph shows no obvious correlation
    • Simple scatter plots in case there's something obvious
  • Check scales: linear vs log vs sqrt
    • SUM_factionKills does not have much variance
    • log/sqrt price doesn't seem like a good idea
  • Skew data
    • +/- a few days to look for lead/lag 
  • Other useful indicators

This gives us some basic trends to start comparing.  Nothing is jumping out from the data at this point yet.
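
For anyone following along without JMP, the lead/lag check from the list above can be sketched in a few lines of plain Python.  This is a minimal illustration, not the readout I actually used: the function names (`pearson`, `lag_scan`) and the numbers are made up, and it only looks at positive lags (kills today vs price a few days later).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def lag_scan(kills, prices, max_lag=5):
    """Correlate kills on day t with price on day t + lag, for lag = 0..max_lag."""
    results = {}
    for lag in range(max_lag + 1):
        x = kills[: len(kills) - lag]   # drop the last `lag` days of kills
        y = prices[lag:]                # drop the first `lag` days of prices
        results[lag] = pearson(x, y)
    return results

# Made-up numbers, purely to show the mechanics:
# here price is built to react to kills exactly two days later.
kills = [5, 9, 4, 8, 3, 7, 6, 2, 9, 5, 4, 8]
prices = [100, 100] + [110 - k for k in kills[:-2]]

scan = lag_scan(kills, prices)
best = min(scan, key=scan.get)  # lag with the most negative correlation
```

On real data, the interesting part is whether any lag stands out from its neighbors at all.  If every lag gives roughly the same weak correlation, shifting the series isn't buying anything.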

Scatter Plot

What we really want is an NPC-kills vs price correlation.  With that kind of relationship, we can basically automate investments off a single number.

At first crack, we're really not seeing anything.  The vertical spread, especially around the median factionKills value (~780k/day), shows no viable trend to predict with.  At best, we're seeing that variation/volatility is highest on median days, but that doesn't give us anything meaningful to work with in finding a trend to leverage.  It only predicts which days might complete orders, not what price we can expect to get.

Looking at the skewed data (+1 to +5 days), there is some better clustering, but not better trending.  There's still a very obvious failure of the vertical line test, which makes it very hard to find an X->Y trend.  Also, some of the price outliers that were obvious in the first time-series graph are really showing to be problematic in this clustering view.

Back to the Drawing Board

As I have said in previous dev blogs, I really like it when we can find a normal-shaped trend.  Even if we can find a correlation, it will be really hard to use effectively if it's not either linear OR normal.  Working outside those bounds gets difficult fast, so let's try to get back to the sorts of things we know well.

Thankfully, SUM_factionKills is reasonably normal.  As we have discussed previously, prices really aren't, but deviation and volatility are normal-shaped trends.  This is starting to look like something we can at least statistically flag on, even if a linear relationship might be out of reach.
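
A quick way to put a number on "reasonably normal" is to look at skewness and excess kurtosis, both of which sit near 0 for a normal distribution.  The sketch below is one rough way to do that, not what JMP does under the hood: `shape_stats` is a hypothetical helper, it uses simple population moments, and the sample values are invented.

```python
def shape_stats(values):
    """Return (skewness, excess kurtosis) using population moments.

    Both are close to 0 for normally distributed data.  Positive skew
    means a long right tail, which is typical of raw prices.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

# Invented examples: a symmetric series and one with a single high-flier.
symmetric_skew, _ = shape_stats([1, 2, 3, 4, 5])
flier_skew, _ = shape_stats([1, 1, 1, 1, 10])
```

With a sample this small the numbers are noisy, so treat them as a sanity check next to the histogram rather than a formal normality test.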

Now that we've clipped out the high-flier, and zoomed in on just the Machariel, things are looking a lot more useful.  Though the price/5d avg trends are essentially random, the deviation trend is looking far more linear.  This is extremely promising.
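
For reference, here is roughly how the deviation series and a straight-line fit could be built by hand.  Treat it as a sketch under assumptions: I'm assuming a trailing 5-day mean that includes the current day and a relative (fractional) deviation, and the helper names and numbers are hypothetical rather than taken from the actual data set.

```python
def rolling_mean(values, window=5):
    """Trailing mean over `window` days; None until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

def relative_deviation(prices, window=5):
    """Fractional deviation of each price from its trailing average."""
    avgs = rolling_mean(prices, window)
    return [None if a is None else (p - a) / a for p, a in zip(prices, avgs)]

def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented prices and kill counts, only to exercise the functions.
prices = [10, 10, 10, 10, 10, 12]
kills = [800, 790, 810, 805, 795, 700]
dev = relative_deviation(prices)

# Keep only days where the 5-day average is defined, then fit kills -> deviation.
pairs = [(k, d) for k, d in zip(kills, dev) if d is not None]
slope, intercept = linear_fit([k for k, _ in pairs], [d for _, d in pairs])
```

Clipping a high-flier is then just a filter on `pairs` before the fit.  Which points count as fliers is a judgment call, so it's worth running the fit both with and without them.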

What I See

This preliminary result confirms a baseline assumption:
Higher ratting counts will lead to more NPC drops hitting the market and increased supply will drive down the price.
Even with this little bit of data, we can see a pretty strong correlation between deviation from the 5-day trend and total NPC kills in Angel space.  Now, that isn't to say we've "solved" the system yet.  There are still a lot of troubling points:

  • The sample size is just barely big enough to work with.  
    • Don't like declaring trends without at least 60d of data to back them up
  • There are still some troubling fliers
    • Though the low and high ends of the graph are telling, there are some points around 800k-850k that make me slightly worried.
  • Deviation/Volatility should be 0-centered.
    • We're in a local period of decline.  Without positive swings, it's hard to confirm the "less ratting = higher prices" part of the equation
  • Flavor of the Month (FOTM)
    • Though the Machariel has been traditionally popular, it's easy to miss the forest for the trees with other indicators such as total sell volumes and other activity metrics
This is an extremely interesting first result out of the data at hand.  Though there are still plenty of points to be cautious about, this is enough confirmation to keep digging and collecting data.  Also, this being a derivative trend, I worry about leveraging it directly without a second signal to back it up.

Also, just to show the entire picture, we might need to include a fit-quality metric as a go/no-go boundary.  Where the Machariel/Dramiel are traditionally popular, the Cynabal isn't as strong.
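
One possible shape for that go/no-go check is an R-squared threshold plus a sign check on the slope, so a ship whose trend runs the wrong way gets flagged instead of traded on.  This is only a sketch: the 0.5 cutoff is an arbitrary placeholder, not a value I've validated, and `r_squared` and `usable_signal` are hypothetical names.

```python
def r_squared(xs, ys):
    """R^2 of a simple linear fit (equal to squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a flat series carries no usable trend
    return (sxy * sxy) / (sxx * syy)

def usable_signal(kills, deviation, min_r2=0.5):
    """Go/no-go: the fit must be tight enough AND slope the expected way.

    The expected direction is negative: more kills, lower price deviation.
    min_r2=0.5 is a placeholder cutoff, not a validated threshold.
    """
    n = len(kills)
    mx, my = sum(kills) / n, sum(deviation) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(kills, deviation))
    return sxy < 0 and r_squared(kills, deviation) >= min_r2

# Invented examples of the three cases worth separating.
clean_negative = usable_signal([1, 2, 3, 4], [8, 6, 4, 2])    # tight, right direction
reversed_trend = usable_signal([1, 2, 3, 4], [2, 4, 6, 8])    # tight, wrong direction
noisy = usable_signal([1, 2, 3, 4], [1, -1, 1, -1])           # right direction, loose fit
```

The sign check matters as much as the fit quality.  A tight fit in the wrong direction is a hint that something other than supply is driving the price.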

Specifically troubling is the Dramiel graph, which shows the reverse of the correlation we'd expect.  This could be a signal that demand is driving the price more than supply is.  Again, the best approach will probably be multi-factor, but this is a very interesting step toward something.  Paired with a market-side predictor, this could be a very useful second source to validate against, or a means to seed forecasts for items that aren't directly manufactured.

Also, I try very hard to test both positive and negative cases.  It's easy to accept when a model shows promise, and hard to accept where it might fail.  The second thing I always do in these kinds of searches is try to find a case that breaks the tool, and understand why.  This is why I'm not particularly a fan of things like MACD, where it feels like a 50/50 shot on whether the signal is true or not.  Even more so with candlestick reading.

Regardless, the NPC Kill rates are a very interesting trendline that I look forward to messing with more.  At the absolute least, there are still interesting things to be said about where players are spending their time, and there are still a lot of trends left to pick out of this data set.