Data Science is growing.

It’s been called the “sexiest job of the 21st century”, and is attracting a flood of new entrants.

Recent reports indicate there are 11,400 data scientists, who have collectively held 60,200 data-related roles. And that count has grown 200% over the last four years, across the Internet, Education, Financial Services, and Marketing industries.

And yet amidst a field growing so fast, you can observe a bit of confused exuberance. It’s not uncommon for a company to hire a data scientist just after product launch, or after Series A. To some, data science has become the magic bullet for achieving scale or their next inflection point.

But what does a data scientist do? And does your company actually need one?

Data Scientists and Analysts

At its core, data science helps your company make decisions on product and operating metrics. It does this via data products and decision science: improving product performance, building prediction models, affinity maps, and cluster analysis.

But data science is just one tool. Business intelligence and analyst functions can also help with operating metrics, albeit with more basic toolsets like SQL and Excel. Whether you use one or the other depends on your company’s data infrastructure and event volume – and hiring data scientists too early can be like trying to crack a nut with a sledgehammer. Except that this particular sledgehammer will feel understimulated and underappreciated, and will probably end up quitting.

Not recognizing the distinction can lead to premature adoption of a data science team at high resource costs, hurting your business and limiting your data scientists. I’ve worked on several teams that started, grew, and eventually churned out their data science org – all for what seemed like good reasons, but nevertheless leading to unfortunate outcomes.

It stands to reason that not every company needs a data scientist.

Companies of a certain data maturity are best positioned to leverage data science teams, while others can fulfill their data needs with BI and analyst functions. The criteria below outline scenarios where a data science team may not be necessary, and can help individual data scientists identify companies that lack the infrastructure to support them.

Low Event Volume

Not enough data is every data scientist’s nightmare.

Data scientists thrive when they are able to work with larger data sets and event volumes. But more importantly, the specialized techniques data scientists employ – linear regression, Bayesian modeling, and so on – simply don’t work on smaller data sets.

Low event volumes can affect the statistical and explanatory power of your dataset. If you have 100K users doing only a couple of actions a day, you likely have a lot of statistical power but little explanatory power. By contrast, if you have 1,000 users doing thousands of events per day, you have little statistical power but lots of explanatory power.

Small or sparse data sets can make it impossible to draw statistically meaningful conclusions via correlations or propensity scores. And defining cohorts or segmenting your users only shrinks the samples within already-small event volumes. The machine learning techniques data scientists rely on simply aren’t viable in such situations.
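
To make the sample-size intuition concrete, here is a minimal sketch in plain Python (the 5% baseline conversion rate is an assumed, illustrative number) showing how the 95% confidence interval around a metric widens as event volume shrinks:

```python
import math

def ci_half_width(p, n, z=1.96):
    """Approximate 95% confidence-interval half-width for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

baseline = 0.05  # assumed conversion rate, purely illustrative
for n in (500, 5_000, 500_000):
    print(f"n={n:>7,}: {baseline:.1%} +/- {ci_half_width(baseline, n):.2%}")
```

At a few hundred events, the uncertainty (roughly ±1.9 points) swamps any realistic lift you’d want to detect; at hundreds of thousands of events, it becomes negligible.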

If your company has a low volume of events, traditional business intelligence and analyst roles may be a more resource- and cost-effective solution for the majority of your analytics needs.

Not Enough Historical Data

Data science at its core is about looking at the past to make predictions about the future.

One of the most common and recurring problems I’ve encountered in data science, however, is simply not having enough historical data.

Time and again, I’d start a cohort analysis or look at aggregate counts of users. But to get true insight from a metric, you really need to put it into historical context. What was it in the past month, or the past year? Month-over-month (M/M) or year-over-year (Y/Y) trends give better context as to whether an observed behavior is an anomaly or part of a seasonal pattern.
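
As a minimal illustration (pandas, with made-up monthly active-user counts), the trend math itself is trivial – it’s the history behind it that’s the hard part:

```python
import pandas as pd

# Hypothetical monthly active-user counts; real figures would come from your warehouse.
mau = pd.Series(
    [12_000, 12_400, 13_100, 12_900, 13_800, 14_200],
    index=pd.period_range("2023-01", periods=6, freq="M"),
)

trends = pd.DataFrame({
    "mau": mau,
    "m_over_m": mau.pct_change(1),   # month-over-month growth
    "y_over_y": mau.pct_change(12),  # year-over-year growth: needs at least 13 months of history
})
print(trends)
```

With only six months of data, the Y/Y column comes back entirely empty – which is exactly the point: without enough history, there’s nothing to compare against.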

And with predictive modeling, you need historical data to build a training set for the features you want to assess. Without enough historical data, you can’t train a model to predict a specific signal in the future.

Even with large event volumes, you may not have enough historical reference data to compare current data against. This could be because your company is simply too new – it’s hard to do Y/Y growth analysis if you’re a seed-stage company. Or it could be that your company only recently instrumented your product with a client-side analytics tool or event tagging.

Not having enough historical data makes it difficult for a data scientist to actually shine at their core strength – finding historical trends and making predictions for the future.

High Latency in Your Signal

Even with enough historical data, there are circumstances where a company’s business doesn’t lend itself to predictive modeling.

One such circumstance is high latency in your signal.

Again, data science is about making predictions or probabilistic assessments of some future behavior based on past behavior. Different companies have different signal sets that they are trying to optimize for – signups, churn, sales, etc.

But depending on your company, those signals may have different levels of latency. For a gaming company, churn of active users can be detected within days, if not hours. But for a SaaS or B2B company, churn plays out on the order of months, if not years, due to the long-term nature of contracts.

As a consequence, building predictive models around behaviors where the gap between input and signal is measured in years is extremely difficult. The number of externalities that can come into play makes it hard for any one feature to have high predictive power, and can make building meaningful ROC curves all but impossible.
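
As a rough illustration of how latency eats into your training data, here’s a minimal sketch (hypothetical accounts, pandas) that keeps only the accounts whose churn outcome is already observable:

```python
import pandas as pd

# Hypothetical B2B accounts; churn is only observable once the contract term has ended.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4, 5],
    "contract_start": pd.to_datetime(
        ["2021-03-01", "2022-07-01", "2023-01-15", "2023-06-01", "2023-11-20"]
    ),
    "term_months": [12, 24, 12, 12, 12],
})

today = pd.Timestamp("2024-01-01")
accounts["label_available"] = accounts.apply(
    lambda row: row["contract_start"] + pd.DateOffset(months=int(row["term_months"])),
    axis=1,
)
labeled = accounts[accounts["label_available"] <= today]
print(f"{len(labeled)} of {len(accounts)} accounts have an observable churn label")
```

The longer the contract terms, the smaller the labeled slice of your data – and the older the behavior your model has to learn from.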

Of course, extremely high volumes of data can compensate for high latency in signal. But such conditions can be tough to find or optimize for. If you are a data scientist, be cautious of businesses or industries where high latency in signal is expected, as it will make your job quite difficult.

Low Signal to Noise Ratio

A circumstance where high event volumes may not compensate is when your signal-to-noise ratio is low.

Regardless of how large your event volumes are, if the segment of data that actually carries the signal you’re trying to optimize for is minuscule, then most modeling will be difficult.

For example, if you collect millions of user-action events a day, but only a tiny percentage of users ever performs an action core to your business, it’s unlikely you’re going to find meaningful insights into what inputs drive adoption of your product.
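
To see why, here’s a back-of-the-envelope sketch (every number here is assumed, purely for illustration) of what a tiny base rate does to a classifier’s precision:

```python
# All numbers are assumed, purely for illustration.
base_rate = 0.001           # share of users who ever perform the core action
true_positive_rate = 0.90   # recall of a hypothetical model
false_positive_rate = 0.05  # how often it flags users who never convert

# Bayes' rule: P(user converts | model flags them)
precision = (true_positive_rate * base_rate) / (
    true_positive_rate * base_rate + false_positive_rate * (1 - base_rate)
)
print(f"Precision: {precision:.1%}")  # about 1.8% – most flagged users are noise
```

Even a model with respectable recall and a low false-positive rate ends up mostly wrong when the behavior it’s chasing is that rare.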

Low signal-to-noise in your data sets probably calls for larger discussions about your product, but it’s unlikely that a data science project could answer the questions that need answering in such circumstances.

When You Need a Data Scientist

There are many reasons you should hire a data scientist.

A data scientist is a powerful asset in both data products and decision science – they are the aforementioned data sledgehammer. They can build new recommendation engines for your product, and assess affinity maps and other data to inform product direction. They can help guide your company’s operating metrics around the KPIs essential to your business.

But leveraging a data science team appropriately requires a certain data maturity and infrastructure. You need a basic volume of events and enough historical data for a data science team to provide meaningful insights about the future. Ideally, your business operates on a model with low latency in signal and a high signal-to-noise ratio.

Without these elements in place, you’ll have a sports car with no fuel. Ask yourself if more traditional roles like data analysts and business intelligence may suffice.

Sincerest thanks to Charles Pensig, data scientist at Optimizely and Jawbone, for his feedback on this essay.