Data science is often described as the overlap of domain knowledge, computer science, mathematics and statistics.
Domain knowledge is discussed far less frequently than programming skills, algorithms, or statistics. In large part this is for good reason: domain knowledge is, by its nature, specific to one domain, and generalises to a lesser degree than programming skills or statistical knowledge.
Domain knowledge as a source of differentiation
In the world of data science, domain knowledge is a common way of differentiating yourself. Of course, it is not the only way: if you are among the world’s top experts on deep neural networks, that is certainly a source of differentiation, but there are much easier ways to stand out. As my Physics professor once told me:
it’s easier to get half the marks in all of the questions, than all of the marks in half
I think this logic extends beyond exam grades. We all have finite resources, and finding a balance in where we invest those resources for best effect is critical. It is what I call a fractal heuristic: a rule of thumb that holds true at all scales, be it the large scale of health, wealth, and relationships, or the smaller scale of data science: maths, programming, and domain knowledge!
Investing resources in developing new skills and knowledge
Developing domain knowledge, like anything worth having, often requires a significant personal investment in terms of time, and potentially financial resources.
But sometimes the thought of investing time and effort to gain domain knowledge can feel limiting. When we see a selection of opportunities which demand certain skills, our temptation is to build skills and knowledge that would help us exploit as much of this demand as possible. Through this lens, focusing on one specific subset of opportunities within a particular domain feels limiting. However, this ignores the supply side of the equation:
For instance, suppose there are 100 Java jobs and 10 Python jobs. A demand-side argument would say to learn Java, as there are more Java opportunities. But this completely ignores the supply side: if there are 200 Java applicants and 5 Python applicants, the equation changes completely! Java now has two applicants for every job, while Python has two jobs for every applicant.
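The arithmetic behind this example can be sketched in a few lines of Python. The job and applicant counts are the ones quoted above; the "applicants per job" ratio is just one crude, assumed way of putting supply and demand on the same scale, and the names `market` and `applicants_per_job` are made up for illustration.

```python
# Job and applicant counts from the example in the text.
market = {
    "Java": {"jobs": 100, "applicants": 200},
    "Python": {"jobs": 10, "applicants": 5},
}


def applicants_per_job(jobs, applicants):
    """Crude competition measure: how many applicants chase each job."""
    return applicants / jobs


ratios = {skill: applicants_per_job(**counts) for skill, counts in market.items()}

for skill, ratio in ratios.items():
    print(f"{skill}: {ratio:.1f} applicants per job")
# Java: 2.0 applicants per job
# Python: 0.5 applicants per job
```

On demand alone Java looks ten times more attractive, but once supply is included the smaller Python market is the less crowded one.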
In an economic sense, and your time and resources are very much an economy, you need to balance supply and demand to identify good opportunities, and invest your time and effort to capitalise on those opportunities. Again this is a useful heuristic for making many decisions.
Viewed from this perspective, domain specialisation is a tool to assist you in exploiting opportunities. Specialisation can therefore be a positive or a negative, depending on whether it brings you closer to a good or a bad opportunity.
Is domain knowledge even necessary?
If you have enough data and the right skills, is domain knowledge really even necessary? It’s an argument I’ve heard from a number of people. In the context of machine learning competitions, perhaps this is a fair point. In competition the question being asked of the data is already formed and the data is what it is. You can apply feature engineering and transformations, but ultimately data scientists in competition are working within a predefined boundary. As a colleague of mine told me, these competitions are more “extreme machine learning” than data science, and I would be inclined to agree.
In the wild, a data scientist generally needs to form their own problems and hypotheses; they need to understand the wider world to ask the right questions of the data. However, there are schools of thought which hold that, given enough data, you don’t even need to generate hypotheses up front.
The idea of being able to carry out “data science” without domain knowledge is certainly interesting, but ultimately it is predicated on a fairly narrow idea of what a data scientist is or does. If a data scientist is someone who takes data and uses machine learning techniques to extract useful insights, then maybe domain knowledge is not necessary. In fact, maybe the data scientist is unnecessary too: why not just automate the whole process?
Personally, I see data science techniques as a tool, simply another technology to be used where appropriate. Viewed this way, the data scientist becomes more of a problem solver, and a problem solver needs more than one tool at their disposal. The old saying that when your only tool is a hammer, every problem is a nail, is so true. If you can come at a problem with a variety of skills and techniques, both hard and soft, ultimately you will achieve better results. Building domain knowledge and skills is an obvious way to achieve this.
The benefits of domain knowledge from a technical perspective
So from a “real world” technical perspective domain knowledge helps you to (among other things):
- understand the data, and the data generation process.
- define new business problems.
- engineer new features.
- identify information that is not being collected, but should be.
Outside of machine learning competitions there is also a greater need to understand the wider implications of the work, particularly in terms of legality and ethics. This is a particular challenge in heavily regulated industries like insurance, finance, or health care.
The benefits of domain knowledge from a non-technical perspective
From a non-technical perspective, domain knowledge is also important because it brings colour, texture, and context to the work you do on a daily basis. A passion for the domain that you work in will keep you motivated and focused; it will help you work through the difficult times.
Interaction with non-data scientists
Contrary to the hype, the world does not yet revolve around data science and data scientists. As a data science professional, if you want to work on innovative and interesting problems, you need to interact with a variety of stakeholders. Good domain knowledge is particularly beneficial when you find yourself interacting with non-data scientists.
In the world of insurance this could include subject matter experts such as underwriters, engineers, brokers, clients, senior management etc. Domain knowledge allows you to frame your ideas in a manner that the other person can easily understand and relate to. Basic domain knowledge is one of the bridges between you and other experts in the business, and when it comes to discussing business problems it is essential to forming a shared understanding. Communication and trust are key elements in getting buy-in from other members of the business; you can develop a great machine learning model, but if people do not trust the model, do not trust you, or do not understand the technology, they will not buy in to your ideas!
Whilst good domain knowledge can work to your advantage, a lack of domain expertise can be incredibly dangerous. Let’s take a concrete example:
… suppose that you are a new data scientist at an insurance company. You are given some claims data to analyse, and your boss wants you to devise a new model to predict claim frequency. You analyse the data and find that the frequency of claims is falling over time, so you advise your boss that, all else being equal, the company can afford to lower premiums. This seems logical, and from your perspective the analysis is technically sound.
But … claims have a lag time between occurrence and reporting. The claims data set you are working with is effectively a snapshot in time, and you are still waiting for the most recent policies to “mature”. Actuaries refer to this as claims development, and to the “missing” claims as incurred but not reported (IBNR). The value of known claims may also shift over time: some claims may deteriorate and others may improve. This movement is referred to as IBNER (incurred but not enough reserved/reported). These time effects mean that what looks like a downward trend in claim activity could in fact be due to how the data is generated: all those “missing” claims are still on their way! There are many, many examples of traps like this. Developing solid domain knowledge can help you navigate these gotchas, and provides a valuable sense check.
Certainly, you could write off these sorts of gotchas as simply bad data science technique, not understanding sampling bias etc., but bear in mind that it’s a lot easier to identify these issues with hindsight!
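To make the reporting lag concrete, here is a minimal Python sketch under entirely made-up assumptions: the true number of claims is flat at 100 per year, and a fixed share of each year's claims has been reported by a given age. The reporting pattern, the years, and the function names are all hypothetical, and the "gross up" step is a deliberately simplified stand-in for what an actuary would do, not a real reserving method.

```python
# All figures are hypothetical and chosen for illustration only.
TRUE_CLAIMS_PER_YEAR = 100  # assumption: true claim count is flat every year

# Assumed share (%) of a year's claims reported once the year is N years old.
REPORTED_PCT_BY_AGE = {1: 50, 2: 80, 3: 95, 4: 100}

VALUATION_YEAR = 2023  # assumed date the data snapshot was taken
ACCIDENT_YEARS = [2019, 2020, 2021, 2022]


def reported_pct(accident_year):
    """Share of claims reported so far, given the age of the accident year."""
    age = VALUATION_YEAR - accident_year
    return REPORTED_PCT_BY_AGE[min(age, 4)]


def reported_claims(accident_year):
    """What the naive analyst sees in the snapshot."""
    return TRUE_CLAIMS_PER_YEAR * reported_pct(accident_year) // 100


def developed_claims(accident_year):
    """Gross the reported count up by the assumed reporting pattern."""
    return reported_claims(accident_year) * 100 / reported_pct(accident_year)


for year in ACCIDENT_YEARS:
    print(year, reported_claims(year), developed_claims(year))
# 2019 100 100.0
# 2020 95 100.0
# 2021 80 100.0
# 2022 50 100.0
```

The reported counts fall from 100 to 50 even though nothing about the underlying risk has changed; the apparent trend is produced entirely by the age of each year's data. In practice the reporting pattern is itself unknown and has to be estimated, which is where the domain expertise comes in.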
Practical tips for developing domain knowledge
- Find a domain expert and ask questions
- Read widely
- Understand what other people do
- Develop your own communication skills
Domain knowledge is a highly valuable skill to develop: your domain knowledge, or lack of it, will frame the opportunities you can effectively exploit. Domain expertise can be a source of inspiration or exasperation, so choose your domain carefully. Domain knowledge can also help you to navigate domain-specific gotchas and problems.