Originally published at: https://esportsone.com/blog/building-strong-foundations-in-esports-data-science/
Data is the lifeblood of Esports One. We’re passionate about our vision to use data to power every second of esports gameplay, to do amazing things like improve engagement for esports viewers and fans, to help teams and players refine their gameplay and training, or to identify up-and-coming esports stars or promising streamers.
We’re excited about using groundbreaking approaches in big data, machine learning, and artificial intelligence to make these things happen, but before we can let that excitement carry us away on our grand adventure, we need to make sure we’re building on rock-solid foundations. That means we need to understand:
- Where does our data come from?
- Exactly what does each data point mean?
- What’s the most effective way to gather and store this data?
- Are our analyses and outputs based on sound statistical principles and game knowledge?
Let’s explore what strong data science foundations actually look like.
Gathering and preparing data is often said to be 80% of the work of data science, so strong engineering for data ingestion, cleaning, storage, and distribution is crucial. At Esports One, we have a great engineering team that works with a variety of tools to handle our data pipelines, takes advantage of newer technologies, and ensures everything we build is efficient, flexible, and able to scale.
A few examples of the tools we use include:
There are a lot of great tools out there, but I strongly believe that the best tool is the one that gets the job done. The most important thing is to have a team of smart, adaptable engineers who can wield those tools effectively!
Beyond the technical considerations, solid data foundations begin with selecting the correct data to ingest. A default mindset for many data scientists is “collect everything!” Our basic urge is to gather every possible piece of available data, from every identifiable source. After all, you never know what might prove useful later…
The data-hoarding mentality is natural, and reasonable from a certain perspective, but it also becomes impractical and inefficient very quickly. If you identify what you actually need, and what you want to do with it, you can not only save time and money on the collection and storage of the data, but also improve performance once you begin manipulating it.
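As a rough illustration of that idea, here is a minimal Python sketch that keeps only a whitelist of fields from a raw record before storing it. The field list and the sample record are hypothetical, chosen just to show the pattern; they are not a description of our actual pipeline.

```python
# Hypothetical whitelist: the fields we have decided we actually need.
REQUIRED_FIELDS = ("gameId", "teamId", "goldEarned", "timeCCingOthers")


def project(record, fields=REQUIRED_FIELDS):
    """Return a copy of `record` containing only the whitelisted fields.

    Raises KeyError if a required field is absent, so gaps in the source
    data surface at ingestion time rather than during analysis.
    """
    missing = [name for name in fields if name not in record]
    if missing:
        raise KeyError(f"record is missing required fields: {missing}")
    return {name: record[name] for name in fields}


# Made-up raw record with more fields than we intend to store.
raw_record = {
    "gameId": 1,
    "teamId": 100,
    "goldEarned": 12000,
    "timeCCingOthers": 20,
    "unusedFieldA": "x",
    "unusedFieldB": "y",
}
slim_record = project(raw_record)
```

Failing loudly on a missing field is a design choice: it forces a decision about each data point up front instead of letting silent gaps reach the analysis.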
Selecting the right data means that your data science team deeply understands the data and can readily identify the value of each data point—or its lack of value! It’s dangerous to take data at “face value” and jump into analysis too quickly.
One great example in League of Legends (LoL for short) comes from its tracking of “crowd control” (CC), meaning effects that limit the actions a target can take. LoL’s official post-game stats report two data points for CC: one is called totalTimeCrowdControlDealt, and the other is timeCCingOthers. If you don’t know the differences between these two (which are substantial, but not well documented by Riot), you’re in danger of using them incorrectly, and your resulting analysis will be built on weak foundations. (If you’re interested in knowing the actual differences, shoot me a note on Twitter and I’ll be happy to explain in excruciating detail!)
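A simple first check before trusting either field is to compare the two across many records and see whether they behave like the same measurement. The Python sketch below does that for any pair of fields. The numbers in the sample are invented placeholders, not real match data, and the helper name compare_fields is our own for this example.

```python
def compare_fields(records, field_a, field_b):
    """Summarize how two similarly named stats relate across records."""
    pairs = [
        (rec[field_a], rec[field_b])
        for rec in records
        if field_a in rec and field_b in rec
    ]
    if not pairs:
        raise ValueError("no record contains both fields")
    identical = sum(1 for a, b in pairs if a == b)
    # Ratio a / b, skipping records where b is zero.
    ratios = [a / b for a, b in pairs if b != 0]
    return {
        "records_compared": len(pairs),
        "share_identical": identical / len(pairs),
        "min_ratio": min(ratios) if ratios else None,
        "max_ratio": max(ratios) if ratios else None,
    }


# Placeholder values for illustration only; not taken from real games.
sample = [
    {"totalTimeCrowdControlDealt": 412, "timeCCingOthers": 18},
    {"totalTimeCrowdControlDealt": 95, "timeCCingOthers": 31},
    {"totalTimeCrowdControlDealt": 640, "timeCCingOthers": 12},
]
summary = compare_fields(
    sample, "totalTimeCrowdControlDealt", "timeCCingOthers"
)
```

If the two fields are rarely identical and their ratio swings widely from record to record, they are not interchangeable, and it is time to dig into what each one really measures.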
Real data analysis only begins once we’re confident in the foundations we’ve built for our data collection and interpretation. Through sound data selection and clear understanding of each data point, we can begin preparing the complex models that got us so excited in the first place. These models require proper foundations of their own, starting with data selection and interpretation but extending into an understanding of the statistical or machine learning methods themselves.
Take a League of Legends win probability model as an example. We can select a statistical method (let’s say a random forest classifier), and then immediately begin inserting all kinds of variables that we think might be useful in predicting which team is going to win a game based on the current game state. Gold difference, towers, dragons, Baron Nashor… These all seem natural to include.
But if we don’t take the time to test, evaluate, and diagnose each component of our model, and apply our game expertise, we might miss the fact that Baron Nashor is highly collinear with gold and towers. The result is a model that is not only open to statistical criticism for its multicollinearity, but that also imperfectly represents the actual game.
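One common way to diagnose this kind of problem is the variance inflation factor (VIF), which measures how well each input can be predicted from the others. The Python sketch below is an illustration under stated assumptions, not our production code: the data is synthetic, the feature names are hypothetical, and the baron_signal column is deliberately built from gold and towers so that the collinearity shows up. A widely used rule of thumb treats a VIF above roughly 10 as a warning sign.

```python
import random
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for equal-length numeric columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        standardized.append([(x - mu) / sigma for x in col])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
        for ci in standardized
    ]


def invert(matrix):
    """Invert a small square matrix (Gauss-Jordan, partial pivoting)."""
    k = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        row[:] + [1.0 if i == j else 0.0 for j in range(k)]
        for i, row in enumerate(matrix)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: perfectly collinear inputs")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def variance_inflation_factors(features):
    """VIF per feature: the diagonal of the inverse correlation matrix."""
    names = list(features)
    inverse = invert(correlation_matrix([features[name] for name in names]))
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Synthetic game states (assumption: invented for this example only).
rng = random.Random(0)
n_samples = 500
gold = [rng.gauss(0, 3000) for _ in range(n_samples)]
towers = [g / 1500 + rng.gauss(0, 1) for g in gold]
dragons = [rng.gauss(0, 1) for _ in range(n_samples)]
# Built from gold and towers on purpose, plus a little noise.
baron = [g / 3000 + 0.5 * t + rng.gauss(0, 0.3) for g, t in zip(gold, towers)]

vif = variance_inflation_factors({
    "gold_diff": gold,
    "tower_diff": towers,
    "dragon_diff": dragons,
    "baron_signal": baron,
})
```

In this constructed data set, baron_signal gets a large VIF because it is almost entirely explained by gold and towers, while the independent dragon_diff stays close to 1. On real data, a result like that would be the cue to bring game knowledge in and decide which variables the model really needs.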
In other words, even if the math appears to check out, and the outputs seem reasonable on the surface, we still have to lay proper foundations and ensure that our models are authentic, that our statistical assumptions are not being violated, and that our outputs are actually sound and reliable.
One Step at a Time
There’s obviously a lot to consider here; data science is a complex discipline, and each esport represents a complex landscape of data, interpretation, and analysis. That’s why it’s so important to work one step at a time, putting in the necessary work at each stage so that you won’t need to retrace your steps once you’ve gone further down the road, or (if you’ll forgive the mixed metaphor) get caught off guard when a weak link in your chain suddenly breaks.
At Esports One, we’re focused on putting those steps together in the right order, and with the right due diligence, to keep us moving forward at a running pace!