Leonardo Mazzone - 28 December 2023

Keeping up with data science without going mad

Continuous professional development is key to success in any career. This is especially true for knowledge workers, and can feel exceptionally burdensome for data scientists. Data science as a discipline is hard to define, and often those who try will resort to describing a Venn diagram with circles for programming skills, statistics, and domain expertise, identifying the data science practitioner as the superhuman whose abilities sit at the intersection of the three circles. The breadth of scope is compounded by the technical depth that each of these circles demands, as well as the remarkable velocity of theoretical and technological advancements in the field.

Hence, the goal of being satisfied with one’s upskilling in data science can sometimes feel elusive. This post is my attempt to gather my thoughts on what constitutes an effective and sustainable learning and development framework. It’s a personal manifesto, aligned with my interests, needs and capabilities. For example, I am a slow reader and thinker, but I don’t have dependants, and I have a strong sense of discipline. As such, I know I can make slow but systematic progress. Some of these considerations won’t be relevant to everyone, but I hope that my partial success, obtained after several iterations of trial and error and introspection, will inspire others to break down the problem in a way that feels relevant to them and makes them feel, all in all, less overwhelmed.

Welcome to the first and only post I’ll ever write with disconcerting self-help vibes.

Managing investments

The cornerstone of my approach is being intentional with the areas in which I want to seriously invest. I chose them based on relevance to my current role as well as speculation about what will be most impactful in the medium term, and I took the time to define them as precisely as I could to keep myself honest. They are:

  1. Causal and Bayesian inference: one of the most exciting trends in data science is the realisation that the prediction-centred framework of machine learning is very limiting, and often inadequate for unlocking some core opportunities: reasoning about uncertainty; combining data-driven models with subject-matter expertise; enriching the understanding of experts; supporting decision-making in complex settings; and adjusting model outputs to reason about the impact of our actions (see the first sketch after this list).
  2. Machine Learning Operations (MLOps): this covers anything from streamlining model selection and reproducibility to optimising deployment infrastructure, monitoring performance and downstream impact, and managing the interface between the model and products or organisational processes. In a nutshell, the abyss between whatever clever thing you’d make in a Jupyter notebook and its ability to have a profound impact on the real world.
  3. Programming and software engineering: keeping up with the evolution of Python and its ecosystem of libraries to write more beautiful and effective code - and hopefully, less of it too. Discovering ways to improve efficiency, whether by interacting with more performant or scalable execution environments, such as distributed systems, or by using alternative compilers and interpreters (see the second sketch below). Turning models and algorithms into resilient and maintainable software.
  4. Extraction of insights from unstructured data: applying Natural Language Processing (NLP) to extract information from text so that it can be queried, analysed, or used for downstream modelling tasks. This covers techniques such as named entity recognition, relation extraction, and the interaction between knowledge graphs and machine learning models (see the third sketch below).
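
To make the first of these areas concrete, here is a minimal sketch of what “combining data-driven models with subject-matter expertise” can look like in its simplest form: a conjugate Beta-Binomial update, where an expert’s prior belief and the observed data jointly produce a full posterior distribution rather than a point estimate. All the numbers are hypothetical, and scipy is just one convenient way to express it.

```python
from scipy import stats

# Hypothetical expert belief: a conversion rate of roughly 5%,
# encoded as a Beta(2, 38) prior (prior mean = 2 / (2 + 38) = 0.05).
prior_alpha, prior_beta = 2, 38

# Hypothetical observations: 14 conversions out of 200 trials.
conversions, trials = 14, 200

# Conjugate update: the posterior is Beta(alpha + successes, beta + failures).
posterior = stats.beta(prior_alpha + conversions,
                       prior_beta + trials - conversions)

# Reason about uncertainty directly, rather than reporting a point estimate.
low, high = posterior.ppf([0.025, 0.975])
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: [{low:.3f}, {high:.3f}]")
```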
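
For the third area, a small illustration of the “alternative compilers” idea: Numba JIT-compiles a plain Python loop to machine code. The kernel is a made-up example; in practice I would profile first and reach for tools like this only where the bottleneck actually is.

```python
import numpy as np
from numba import njit


@njit
def running_total(x):
    # A plain Python loop: slow in CPython, fast once JIT-compiled.
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
    return total


values = np.random.rand(1_000_000)
# The first call triggers compilation; subsequent calls run at native speed.
print(running_total(values))
```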
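
And for the fourth area, a minimal named entity recognition sketch using spaCy. The sentence is illustrative, and the small English model is assumed to have been installed beforehand (python -m spacy download en_core_web_sm).

```python
import spacy

# Load a small pretrained English pipeline (assumed to be installed).
nlp = spacy.load("en_core_web_sm")

# An illustrative sentence to extract entities from.
doc = nlp("The Royal Statistical Society was founded in London in 1834.")

# Each entity carries a label (ORG, GPE, DATE, ...) that downstream tasks -
# querying, analysis, knowledge graph construction - can build on.
for ent in doc.ents:
    print(ent.text, ent.label_)
```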

For each area, I keep a backlog of resources I want to look at, and I balance the time I spend across them.

Optimising aggregators

How do I fill my learning backlog? I find that social media is too noisy and distracting. I’ve grown to really like newsletters, because they come to me, to my inbox, and my inbox is a serious place where work gets done (unlike any social network). However, if we are to maximise relevance, we need to cast a wide net, and thus the best strategy is not to select a handful of individual voices, but instead to look for the best aggregators. One of my favourites is the Data Science + AI newsletter from the Royal Statistical Society. It includes tutorials, new papers, industry applications and more, with snippets of text and graphics extracted to help you determine whether any of its abundant links are worth clicking. Other popular ones are The Batch by DeepLearning.AI and Nathan Benaich’s.

Communities of practice are another great aggregator of resources, with the added benefit of bidirectional exchanges and networking opportunities. There is a thriving community for Data Science across the UK Government. Its nascent MLOps interest group has been invaluable to me, thanks to the quality of resources emerging from collective intelligence as well as rapid feedback from colleagues facing similar challenges. Speaking of communities, nothing has provided extrinsic motivation quite like joining a data science book club: the kind that only a big dollop of peer pressure can induce.

Finding inner peace

L&D FOMO (Fear Of Missing Out) is my enemy. I have had to come to terms with the fact that I can’t afford to spend meaningful time pursuing directions other than the four above, or I would be diluting my efforts excessively. If there are very significant developments in other areas, I will learn about them by simply engaging with my colleagues and the rest of the community.

If the backlog is too full, I must resist the temptation to add more to it. If I’m in that situation and a newsletter comes in, I need to bin it without even reading the email subject. Because of this, it’s crucial to keep the quality of the backlog very high by being extremely selective, to the point that it feels ruthless. Some things are easy to exclude: if an article is trying to exploit my FOMO, or worse, make me feel like an impostor (e.g. “10 things that will give you away as a data charlatan”), straight to the bin it goes. In some cases it’s not as clear-cut, and I have to make an educated guess. Would I still want to read this article next year, once the hype subsides?

I focus on having a good system and trust that if I stick to it things will be alright. If it becomes clear it’s not working out, I need to improve the system, rather than pretending I can work harder.

Choosing the right lane

It’s not enough to choose areas to focus on. I also need to decide the scope for each. I have a small bias for breadth over depth of research.

Another trade-off to resolve is whether I should concentrate on well-established theory and techniques that I’m not familiar with, or try to keep up with the latest and greatest. There is an interesting reflection on this in this excellent overview of the current state of LLM engineering; the author polled their LinkedIn connections, and the result is a lot of heterogeneity in the approaches of fellow data practitioners, ranging from “ignoring the hype to trying out all the tools” (cit.). It might not surprise you that I maintain a healthy level of scepticism towards the “cutting edge”. If I invest excessively and prematurely in novel tools or ideas, many of those investments might not pan out. It’s often the case that more recent approaches are difficult to contextualise without the solid, cheaper baselines established in previous work. However, I need to keep up to the extent that I can form a mental framework to understand highly significant trends and engage with the most important debates. As with my approach to depth, I am much more prone to making riskier bets when they’re relevant to what I’m working on at the time.

Choosing the right time

This is trivial advice to give. It’s also daunting to stick to. The time and intellectual space need to be created, and fiercely protected, to ensure I can follow through and attack the learning backlog.

The most helpful thing is being realistic about when is not the right time. If I’m commuting to the office, I won’t have the energy to do any L&D. During work hours, something more urgent will always come up. After work, my eyes will be tired and all I’ll want to do is go for a jog or play the piano. On a Sunday, I will want to maximise the time I spend sleeping in or enjoying the company of my partner, family and friends.

Hence, I take 30 minutes in the morning, four days a week, after breakfast and before starting work, i.e. when my emotional and intellectual batteries are at full capacity. I try to align what I do with what I find exciting on that particular day as much as possible, striving to minimise resistance from my brain. I accept that not all days are going to be as propitious or successful, and I try to make the best of it.

Finding an excuse

It can be difficult at times to feel motivated enough to complete a piece of learning when it’s speculative, i.e. when I struggle to visualise a circumstance in which I will be able to apply it in the short term. Also, I can be too quick to feel like I’ve absorbed a new idea, only to discover a few months later that I had only a superficial understanding, or that I’ve already forgotten whatever bit of intuition I had acquired. Thus, whenever possible, I pair the consumption of some content with the production of an artifact - even better if it’s a public-facing one, because of the pressure from real or perceived scrutiny, as well as the feeling of giving back to the community.

If I’ve read about a methodology, I will apply it to some data and share it in a Kaggle notebook. If I’m exploring a new library, I’ll have a look at the open issues on the repository and try to contribute a pull request (even if it’s a tiny one). If I’ve attended a conference or read a paper, I will try to share my reflections on social media - or this blog :)