The problem-framing gap for new data scientists

“If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask … for once I know the proper question, I could solve the problem in less than five minutes.” —Albert Einstein

Learning from failure

One of the first projects I worked on as a data science consultant was building a customer churn model for a software company. I was relatively new to the field and had a technical understanding of how to build machine learning models in R and Python, but I had yet to translate that into anything tangible outside of examples from classes I had taken.

We met with the client and thought we had a good understanding of the problem, identifying customers that were likely to churn in the next month based on recent behavior. There was, however, a key question we didn’t ask: What was the typical customer contract like? What we learned later was that customers signed two- to three-year contracts. “Churning” every month wasn’t an important way to view the customer relationship.

Instead of asking, “Will this client churn in the next month?” we should have been asking, “Will this client renew or extend their contract?” That was the difference between a model being used and one that ended up on the scrap heap. Although it’s a bit embarrassing to describe this now, it was a vital learning moment for me about the importance of asking the right question in framing data science work.

What is problem framing?

The above is a good example of the importance of problem framing in data science work. (I use the term “data science” to broadly include machine learning, AI, data analysis, etc.)

Problem framing is setting up your business problem in a way that can be addressed with data. It’s the process of taking an abstract goal like “we want to know which customers are likely to churn” and translating it into what data will be used, how the data will look, and what modeling approaches might be applicable. It’s a combination of understanding the underlying mechanics of your problem and matching your problem to techniques and approaches that fit.

Problem framing is enacted in the way that you compile your data for analysis (what are your potential features? How are you defining your target? What is the appropriate level of granularity of your analysis?) and the techniques used to address your problem (many modeling approaches require specific data configurations).

Problem framing should be the first thing a data scientist does when working on a new project. The process of problem framing involves asking questions about the system you’re trying to model. It typically includes:

Defining the dependent variable

What are you trying to predict or model? What outcome are you trying to model against?
Customer churn example: How does the company define churn? Are they interested in downgrades vs. complete churn? What time window of no activity defines churn?
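
As a rough illustration of why these questions matter, the sketch below labels churn from a "no activity" window. The account names, dates, and the `label_churn` helper are all hypothetical and not taken from any real project; the point is only that the same customers get different labels depending on the window you agree on with the business.

```python
from datetime import date, timedelta

def label_churn(last_activity, as_of, window_days=30):
    """Label an account as churned if it has had no activity inside the window.

    Assumption: "churn" here means "inactive for more than window_days",
    which is only one of several possible definitions.
    """
    cutoff = as_of - timedelta(days=window_days)
    return {account: last_seen < cutoff for account, last_seen in last_activity.items()}

# Hypothetical last-activity dates per account.
last_activity = {
    "acct_a": date(2024, 5, 28),
    "acct_b": date(2024, 3, 15),
    "acct_c": date(2024, 4, 20),
}
as_of = date(2024, 6, 1)

labels_30 = label_churn(last_activity, as_of, window_days=30)
labels_60 = label_churn(last_activity, as_of, window_days=60)
```

With these made-up dates, `acct_c` counts as churned under the 30-day window but not under the 60-day one, so the target itself changes before any model is trained.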

Defining the level of granularity of your analysis

What represents a single record in your data? At what level are you making predictions?
Customer churn example: Are you interested in users or customer accounts that could contain many users? Are you interested in evaluating individual teams using a product or entire accounts churning?
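
A minimal sketch of the granularity choice, assuming a hypothetical user-level table with `account`, `user`, and `logins` fields: if the decision is made per account, user-level rows have to be rolled up so that one record equals one account.

```python
from collections import defaultdict

# Hypothetical user-level rows; field names are illustrative only.
user_rows = [
    {"account": "acme", "user": "u1", "logins": 12},
    {"account": "acme", "user": "u2", "logins": 0},
    {"account": "globex", "user": "u3", "logins": 3},
]

def to_account_level(rows):
    """Aggregate user-level rows so that each account becomes a single record."""
    totals = defaultdict(lambda: {"users": 0, "active_users": 0, "logins": 0})
    for row in rows:
        account = totals[row["account"]]
        account["users"] += 1
        account["active_users"] += 1 if row["logins"] > 0 else 0
        account["logins"] += row["logins"]
    return dict(totals)

account_rows = to_account_level(user_rows)
```

The aggregates chosen here (user count, active users, total logins) are assumptions; which summaries make sense depends on what the business treats as the unit that churns.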

Assessing data availability

What do you think affects the dependent variable you’re trying to predict? How are you tracking these impacts? What data can help you model the system you’re evaluating? Is the data accessible and available in a timely fashion? How can you securely and compliantly access this data? Can this data be responsibly used?
Customer churn example: Is there customer behavior data that you believe influences churn? Is there customer demographics data that you believe separates customer behavior? Is this data available, and when does it become available? Can this customer data be used responsibly, in a way that protects customer privacy? Is this data at the right level of granularity?
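
The timing question can be made concrete with a small point-in-time filter. This is only a sketch under assumed names: the event log, the `recorded_on` field, and the `known_at` helper are hypothetical. It keeps just the records that would already have been available on the day the prediction is made.

```python
from datetime import date

# Hypothetical event log: each record carries the date it became available.
events = [
    {"account": "acme", "kind": "login", "recorded_on": date(2024, 5, 20)},
    {"account": "acme", "kind": "support_ticket", "recorded_on": date(2024, 5, 30)},
    {"account": "acme", "kind": "cancellation_notice", "recorded_on": date(2024, 6, 10)},
]

def known_at(rows, prediction_date):
    """Keep only the records that were recorded before the prediction date."""
    return [row for row in rows if row["recorded_on"] < prediction_date]

usable = known_at(events, prediction_date=date(2024, 6, 1))
usable_kinds = [row["kind"] for row in usable]
```

Here the cancellation notice is dropped because it arrives after the prediction date; including it would let the model "see" the outcome it is supposed to predict.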

Defining potential features

How can you separate the available data into things that represent unique attributes of the system you’re evaluating? What do you think has an impact on the thing you’re trying to predict? Can you create data columns to represent the impact of this data?
Customer churn example: What customer demographic data might make churn more or less likely? What customer behaviors are being captured that might make churn more or less likely? How are customers using your products or interacting with your company?

Defining potential modeling approaches

What approaches provide a reasonable framework for evaluating the system? What are the best ways to model the dependent variable? What does your data set need to look like for these approaches?
Customer churn example: What classification or regression modeling approaches might be appropriate? Are you predicting whether churn happens, time until churn, or a numeric value like the dollar impact of churn? Are you focused on accurate predictions of churn or the accurate ordering of customers likely to churn?

The problem-framing gap for new data scientists

Problem framing isn’t explicitly discussed in most data science university programs, bootcamps, or online data science classes, even though it’s a key part of being an effective data scientist.

Data science academic programs and bootcamps tend to focus on technical skills like classification or regression modeling approaches, only providing glimpses into real-world problems through pre-built data sets that are already formulated for machine learning. Real-world problems require the communication skills to work with both technical and nontechnical users to understand the system or process you’re trying to evaluate, and the creativity to map the problem to appropriate data and machine learning models.

There’s a large gap between these academic programs and what’s expected of data scientists when they move into industry roles, and this disconnect between learning about algorithms versus learning about application is a part of that.

Given these expectations, data scientists often need additional training. Data scientists used to collaboration in academic programs and bootcamps often find themselves embedded in functional teams with limited opportunity to work directly and exchange ideas with other data scientists. Companies that pair new data scientists with more experienced data scientists for collaboration and mentorship are able to more quickly build problem-framing skills for new data scientists. Deep collaboration—meaning more than just brown-bag lunches and occasional review sessions—is the best way to improve problem-framing output and more quickly benefit from the skills of new data scientists.

One recommendation for data scientists looking for more problem-framing experience is to get exposure to as many different types of data science problems as possible. Take to heart the real-life examples you’re exposed to in classes and training and try to internalize how these examples are being solved and what the possible data sources could have been.

What data transformations may have been needed to address the problem? What other data sources could have been added to make the model better? Are other industries or organizations experiencing a similar problem? Does this type of modeling approach apply to other problem sets?

Often, we see really interesting developments and results by transferring approaches from other industries to problems they weren’t designed for. We’ve seen computer vision approaches used to predict protein structure and survival analysis models (used heavily in healthcare) used to identify vehicles likely to break down. Exposure to many different approaches and understanding how they were used enable you to draw new connections.

In data science, we’re usually interested in how our models will generalize to new data. The exciting thing is problem approaches can generalize to new industries and problem sets in the same way.

Six additional tips for problem framing

Don’t be afraid to ask simple or “dumb” questions. These are necessary to get a good understanding of the system you’re modeling.
Research the problem! There are typically blog posts, research papers, or instructional videos about what you’re trying to model. Use them and pull from existing knowledge bases.
Reach out to others you think can help and collaborate. Problem framing can be a great project step for data scientists to collaborate and learn from one another.
Consider the timing of your data. Know when it’s available and what will be known at the time of prediction when you’re creating your data set.
Simplify, simplify, simplify. We often see data scientists relying on overly complex models when a simpler option might be the better choice. Don’t assume complex is better. The simple solution might be the right one.
Know what information is important for your problem and define what success looks like before you move to problem framing. To use the customer churn example, do you want an accurate prediction of whether each customer will churn or do you want a list of the 100 customers most likely to churn so you can target them with some sort of promotion? This will inform the output you’re looking for from your model.
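
To illustrate the last tip, the sketch below scores the same hypothetical predictions two ways: overall accuracy at a threshold, and precision among the top-k highest-scored customers. The scores and outcomes are invented for illustration, and both helper functions are assumptions rather than a prescribed evaluation method.

```python
def accuracy(scores, actual, threshold=0.5):
    """Share of customers whose thresholded prediction matches the outcome."""
    predictions = [score >= threshold for score in scores]
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)

def precision_at_k(scores, actual, k):
    """Share of true churners among the k customers with the highest scores."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    return sum(outcome for _, outcome in ranked[:k]) / k

# Hypothetical model scores and observed churn outcomes.
scores = [0.9, 0.8, 0.45, 0.4, 0.2, 0.1]
actual = [True, False, True, False, False, False]

overall_accuracy = accuracy(scores, actual)
top_2_precision = precision_at_k(scores, actual, k=2)
top_3_precision = precision_at_k(scores, actual, k=3)
```

If the business only plans to contact a fixed number of customers, the ranking metric is closer to what success means than overall accuracy, and the two can point to different models.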

Problem framing is a vital step in data science projects that is sometimes overlooked. Often, issues with problem framing only reveal themselves much later in the data science process, and team collaboration can help prevent some common pitfalls. With the tips above, you’ll be able to improve your problem-framing output and better enable data science impact.