A Data Science Approach to Maximizing Data Scientist Salary

I recently had the honor of speaking to the New York chapter of Women in Machine Learning and Data Science about the field of data science and about how we leverage statistical modeling at untapt to predict the success of job applications. In this post I detail the former, leaving coverage of our in-house algorithm development for a series of later entries.

In the year and a half that has passed since my previous post on data scientist skills and salaries, the occupation has become markedly more mainstream. The Google Trends utility, for example, indicates that the term “data science” draws twice as much interest today, and Glassdoor has recently declared it the career with the best work-life balance as well as the best job in the U.S. for 2016 outright.


Top Salaries

At untapt, we surveyed proprietary and external sources and found that data scientists are indeed in demand, with salary ranges on par with software developers. The most recent O’Reilly data science salary survey suggests that this relationship to software engineer pay is attributable in part to an increasing overlap between the two in-demand careers.

Since the first O’Reilly survey in 2013, the best-paying technical expertise has solidified in favor of open-source, designed-for-scale, “cloud” computing-friendly approaches. These tools cluster near each other because they are often used together by data scientists, with key nodes within the network emerging as Python, Spark, PostgreSQL, JavaScript, MongoDB, and Hive. See the bottom-right corner of this cluster graphic: 



R’s Rapid Migration

The most dramatic change in the couple of years that have passed is the shift of R from clustering amongst the maximally-coveted open-source techniques toward lower-paying, often point-and-click commercial tools like Tableau, SPSS, SAS, and Visual Basic. As a scripting language, R can be ideally suited to ad hoc number crunching within a single machine’s RAM constraints, but it is less frequently applied within live, production applications or within the remote, multicore solutions increasingly required for terabyte-scale analytics.

Revolution Analytics, one of the largest commercial R firms, focused on improving this state of affairs by developing R to become a leading parallel computing tool. Its acquisition by Microsoft last year was intended by the Redmond giant to bolster its parallelized machine learning credentials. It may, however, have had the unintended consequence of consolidating cynical open-source hackers toward the Python community, which has concurrently developed a substantial proportion of the functionality previously unique to R. For the moment, in any event, both languages provide tremendous value to bilingual users, as detailed here.


Beyond tool choice, some of the additional key insights from the O’Reilly survey are:

  • data scientists in the finance and software industries command some of the highest salaries relative to those in other industries (chart above; go FinTech!)
  • larger companies tend to pay higher salaries (though small firms tend to feature compensation by other means, e.g., equity options; chart included below)
  • time spent each day on data extraction, transformation, and loading correlates positively with pay, while time spent on data visualization correlates negatively


Using responses to their survey questions as predictor variables, authors of the report built a linear regression model that estimates the salary of data professionals. They were able to explain an impressive 43% of the variance in pay with their model. 
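To make "variance explained" concrete, here is a minimal sketch of fitting a line by least squares and computing R². It assumes a single predictor and uses made-up numbers (the names `years_experience` and `salary_k` are purely illustrative), whereas the report's model used many survey responses at once. It is not a reconstruction of the authors' analysis.

```python
# Toy illustration only: the data below are invented, not survey responses.

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(ys, predictions):
    """Share of the variance in ys that the predictions account for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical predictor and outcome (salary in thousands of dollars).
years_experience = [1, 2, 3, 5, 8, 10, 12, 15]
salary_k = [60, 85, 70, 110, 95, 130, 100, 140]

intercept, slope = fit_line(years_experience, salary_k)
predicted = [intercept + slope * x for x in years_experience]
r2 = r_squared(salary_k, predicted)
print(f"R^2 = {r2:.2f}")
```

An R² of 0.43, as in the report, would mean the model's predictions account for 43% of the spread in salaries around the mean, with the remainder left unexplained.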

They affirmed that, holding all other variables constant, expertise with the open-source tool clusters (Spark, D3.js, Amazon Elastic MapReduce, and their highly-correlated buddies) is associated with higher pay.

Indeed, familiarity with Apache Spark corresponds to a $12k salary bump, while spending years of your life on a doctorate yields a mere $9k. Meanwhile, familiarity with Microsoft Visual Basic or related commercial software is associated with a $3k salary reduction. Below is the full predictor list selected by LASSO regression, with a y-intercept of $26,393. This technique avoids selecting multiple highly-correlated predictors, so the inclusion of, for example, D3 (+$7,894) also implies the inclusion of nearby technologies in the cluster diagram shown earlier (e.g., JavaScript, MongoDB).

[Figures: full list of salary predictors selected by LASSO regression, in two parts]
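Because the model is linear, an estimate is simply the intercept plus the coefficient of every trait that applies. The sketch below shows that arithmetic using only the figures quoted in this post. The $12k, $9k, and $3k values are rounded as stated in the text, the trait names and the function `partial_salary_estimate` are hypothetical, and the full model contains many more terms, so the result is a partial sum and not a realistic salary.

```python
INTERCEPT = 26_393  # y-intercept quoted in the post, in dollars

# Only coefficients mentioned in the text. Rounded figures are approximations;
# the complete predictor list is not reproduced here.
COEFFICIENTS = {
    "spark": 12_000,         # "a $12k salary bump" (rounded)
    "doctorate": 9_000,      # "a mere $9k" (rounded)
    "d3": 7_894,             # quoted exactly
    "visual_basic": -3_000,  # "a $3k salary reduction" (rounded)
}


def partial_salary_estimate(traits):
    """Intercept plus the quoted coefficients for the given traits."""
    unknown = set(traits) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"no quoted coefficient for: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[t] for t in traits)


# Hypothetical respondent who knows Spark and D3.
estimate = partial_salary_estimate(["spark", "d3"])
print(f"Partial estimate: ${estimate:,}")
```

Each coefficient is read "holding all other variables constant", which is why the terms can be added independently in this way.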

After adjusting for all of these work-relevant attributes, one data scientist trait retained by the predictive model stands out for its lack of reasonable justification: a nearly $3k drop associated with being female. Given that women code as well as, perhaps even better than, men, this discrepancy appears to be attributable to bias, whether intentional or not.

At untapt, we believe that when data are leveraged to drive decision-making, bias can be reduced, akin to how data-focused equity crowdfunding site CircleUp has reduced gender bias in the consumer start-up space. For guidance on alleviating bias in the tech team-hiring process in particular, here’s an actionable list that we put together.

If you’re keen to dive more deeply into topics discussed in this post, I maintain a list of data scientist technologies, resources, and open data sources here.

Many thanks to Reshama Shaikh and Sinziana Eckner for hosting me. Thank you to Desi Domo for video production, and to Donald Whalen and Kate Brodock for editing the post.


Jon is the Chief Data Scientist at untapt. Previously he worked as Data Scientist at Omnicom in New York and as an algorithmic trader in Singapore. As a Wellcome Trust Scholar, Jon obtained a doctorate in neuroscience from Oxford University. He enjoys writing on scientific and technological advances, particularly in statistics and machine learning, with an emphasis on the societal impacts of these developments.