Yeah, I'm not sure what Mike's agenda is here. He's supposed to be a numbers man, yet it seems a little off to me, and has done since his announcement just after Descenders launched. Is he trying to renegotiate with Valve or court the EGS Habidashers? I'm a little cynical at the moment regarding these things, and it's always about money, but I'd like to see a data-science person chime in on his methodology etc.
Not a data-science person per se, but without going into too much detail my grad focus involved quantitative data analysis, so plenty of looking at spreadsheets, datasets, regressions, charts and so forth. I've also been out of the game for a while, unfortunately. Knowing the limitations of my atrophying expertise, I'll try to be as modest as possible interpreting the slides.
Removing the top and bottom to get rid of outliers is OK, especially if they're polluting the data as he says. It can still be useful to see the full spread, just to get an idea of which revenue range the majority of games fall in, especially if you want to contrast that against what a 'typical' game might expect. That kind of contextualization is nice, even when you have wild outliers. From there, you can narrow the scope of the data again to zoom in on a specific subset or issue. The data was modified in three ways: 1. removing all games with fewer than 10 reviews, 2. removing AAA and 3. removing the top and bottom 5%. Without seeing the data, I don't know if that last step was necessary. I would've expected the first two limitations to be sufficient, but there could be reasons for it. I've always removed outliers manually rather than indiscriminately cutting a fixed percentage. If they are strongly affecting the outcome, I would assume it'd be easy to pick out the ones that aren't representative.
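To make those three steps concrete, here's a rough sketch of how I read them. The records are completely made up and the function name and field names are mine; I have no idea how the author actually implemented the filtering.

```python
# Hypothetical sketch of the three filtering steps described above.
# The data is invented purely for illustration; it is NOT the author's dataset.

def apply_filters(games, min_reviews=10, trim=0.05):
    """Drop low-review games and AAA titles, then trim both revenue tails."""
    # Steps 1 and 2: at least `min_reviews` reviews, and not AAA.
    kept = [g for g in games if g["reviews"] >= min_reviews and not g["aaa"]]
    # Step 3: sort by revenue and cut a fixed percentage off each end.
    kept.sort(key=lambda g: g["revenue"])
    k = int(len(kept) * trim)
    return kept[k:len(kept) - k] if k else kept

# 30 fake games with steadily growing review counts and revenue; one is "AAA".
games = [
    {"reviews": 5 + i, "aaa": i == 29, "revenue": 1000 * (i + 1) ** 2}
    for i in range(30)
]

result = apply_filters(games)
print(len(games), "games before,", len(result), "after filtering")
```

Note that the percentage trim removes the same number of games from each end whether or not they are actually unrepresentative, which is my reservation about it.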
It would've been good to have visuals. Data analysis software can produce really informative visualizations. A basic box-and-whisker plot would've done wonders. Data often doesn't translate well to text, and the way some of these statements and conclusions are worded is open to interpretation.
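For anyone unfamiliar, a box-and-whisker plot is basically a drawing of five numbers: minimum, first quartile, median, third quartile and maximum. A minimal sketch with invented revenue figures (not from the slides), using Python's standard library:

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Invented revenues in USD, chosen only to show a long right tail.
revenues = [500, 1200, 3000, 4500, 8000, 16000, 40000, 250000, 1200000]

labels = ("min", "Q1", "median", "Q3", "max")
for label, value in zip(labels, five_number_summary(revenues)):
    print(f"{label:>6}: {value:,.0f}")
```

Even as plain text, this already shows how far the top end sits from the middle, which a single "average" figure hides.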
Statements like "the average game on Steam" are hard to parse and open to misinterpretation. I guess by average he means median, i.e. the literal middle of the pack (I assume this because the numbers for the "average game" line up with the median numbers). This is confusing because "average" usually refers to the mean. Anyway, contrast the information about the median game against the means, and you see that sales and revenue are very unequally distributed: the means sit well above the medians, so a relatively small group of games at the top did far better than the rest. That's interesting in and of itself, next to the year-over-year numbers.
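A toy example of why the mean/median distinction matters so much here. The copies-sold figures are invented, with one hit game in the list, and are not taken from the slides:

```python
from statistics import mean, median

# Invented copies-sold figures: mostly small sellers plus one breakout hit.
copies = [50, 120, 300, 800, 1500, 2200, 4000, 9000, 150000]

mid = median(copies)   # the literal middle of the pack
avg = mean(copies)     # pulled upward by the single hit

print(f"median: {mid:,.0f} copies")
print(f"mean:   {avg:,.0f} copies")
print(f"mean is {avg / mid:.1f}x the median")
```

When the mean is many times the median like this, calling either one "the average game" without saying which gives very different impressions.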
One thing I didn't quite understand is how the median sells 1500 copies at $10 but somehow makes $16,000 in revenue. 1500 copies at $10 is only $15,000 gross, so the stated figure is higher than the gross before anything is deducted. Typo? He does say that this is after all the cuts are taken out of the equation, but at that point I would expect that number to be much lower. At the very least there's that fat 30% cut that Valve takes, so how come it's that high? The 2018 numbers add up differently and would be closer to what I expect after taxes, distributor cut, etc. Then he says he's talking about "the average game on Steam [in 2018/2019]", but his sample covers one month, and then there's the bit about "in the first year on sale". I thought it was one month. Did he project that one-month snapshot over the rest of the year? This slide is generally confusing to me.
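Spelling out the back-of-the-envelope check, under my own simplifying assumptions: every copy sold at the full $10, no refunds or regional pricing, and only the 30% store cut deducted. The variable names are just mine.

```python
# Back-of-the-envelope check of the median figures quoted from the slide.
# Assumptions (mine): every copy sells at full price, only the store cut applies.
copies_sold = 1500
price_usd = 10
store_cut_pct = 30

gross_usd = copies_sold * price_usd                    # revenue before any cut
net_usd = gross_usd * (100 - store_cut_pct) // 100     # after the store's share

reported_net_usd = 16000                               # figure stated on the slide

print(f"gross:            ${gross_usd:,}")
print(f"net after cut:    ${net_usd:,}")
print(f"reported 'net':   ${reported_net_usd:,}")
print(f"reported exceeds gross by ${reported_net_usd - gross_usd:,}")
```

So either one of the three numbers is a typo, or copies, price and revenue are each medians of separate distributions and were never meant to multiply out. The slide doesn't say which.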
When it comes to the part about game pricing, I don't really understand how that conclusion was reached. The relationship was not evident to me from the slides. Maybe the author has more information on purchase behavior and the price relationship / (in)elasticity, and that is coloring the conclusion. As it stands, I'm guessing at how it was reached. I guess the implication is that because the median of copies sold in higher pricing brackets is as high as or higher than in lower pricing brackets, it follows that price does not have a negative relationship with copies sold. I personally find that a somewhat tenuous suggestion, given that many other variables are associated with higher-priced games (production values, marketing, established audience, for example), and there is no information on whether these factors were accounted for.
In the end, all I'm seeing is means and medians. Price and copies sold are tricky measures given the popularity of GAAS, F2P and MTX. Were these types of games included in the results? Why/why not? There are probably more interesting/valuable ways to look at the data. 170 games is also a small sample from which to draw solid conclusions about the market at large. I would rather emphasize that this is how this specific sample fared in this specific month compared to last year. But, again, at some point he referred to performance over a year? So I'm not quite sure where that's from. IIRC the author did hedge some of his conclusions in the Twitter posts about these slides. But I think it was to be expected that people would run with these numbers and draw an even more simplified picture.
As an aside, it would've been interesting to see a plot of review positivity/negativity against copies sold. See if that tells us anything. Is the bottom half of 2018/2019 June releases generally negatively or positively reviewed? How do the two years compare? How strong is the relationship between review positivity/negativity and copies sold? The author already checked the presence of reviews after all. Could've been a nice extra data point. I guess it's tricky because more sales give you more reviews and review positivity is just a ratio of positive and negative reviews. So games with few sales don't have many reviews to work with. A bigger sample size could help with that. Could also make a variable that's some kind of factor of reviews versus copies. I'm sure smart people have found a way to operationalize this.
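One existing way to deal with the few-reviews problem is the lower bound of the Wilson score interval, which discounts a positivity ratio when it rests on few reviews. To be clear, this is a technique I'm bringing in myself, not something from the slides, and the review counts below are invented:

```python
from math import sqrt

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a positive-review ratio.

    z=1.96 corresponds to roughly 95% confidence. Returns 0.0 with no reviews.
    """
    if total == 0:
        return 0.0
    p = positive / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / denom

# Two invented games, both "90% positive", with very different review counts.
small = wilson_lower_bound(9, 10)
large = wilson_lower_bound(900, 1000)

print(f"9/10 positive:     {small:.3f}")
print(f"900/1000 positive: {large:.3f}")
```

Both games have the same raw ratio, but the one with 10 reviews gets a much lower score, so plotting this against copies sold would be less dominated by noise from low-selling games.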
My main takeaway is that I wanted some combination of more numbers, more unprocessed numbers, more variables, more discussion, more responsible/clearer reporting and visualizations. Even if this isn't a dissertation.