I recently had the honor of speaking to the New York chapter of Women in Machine Learning and Data Science about the field of data science and about how we leverage statistical modeling at untapt to predict the success of job applications. In this post I detail the former, leaving coverage of our in-house algorithm development for a series of later entries.
In the year and a half that has passed since my previous post on data scientist skills and salaries, the occupation has become markedly more mainstream. The Google Trends utility, for example, indicates that the term "data science" attracts twice as much interest today, and Glassdoor has recently declared it both the career with the best work-life balance and the best job in the U.S. for 2016 outright.
At untapt, we surveyed proprietary and external sources and estimate that data scientists are indeed in demand, with salary ranges on par with those of software developers. The most recent O’Reilly data science salary survey suggests that this parity with software engineer pay is attributable in part to an increasing overlap between the two in-demand careers.
R’s Rapid Migration
The most dramatic change in the couple of years that have passed is the shift of R away from clustering amongst the maximally-coveted open-source techniques and toward lower-paying, often point-and-click commercial tools like Tableau, SPSS, SAS, and Visual Basic. As a scripting language, R can be ideally suited to ad hoc number crunching within a single machine’s RAM constraints, but it is less frequently applied within live, production applications or within the remote, multicore solutions increasingly required for terabyte-scale analytics.
Revolution Analytics, one of the largest commercial R firms, focused on improving this state of affairs by developing R to become a leading parallel computing tool. Its acquisition by Microsoft last year was intended by the Redmond giant to bolster its parallelized machine learning credentials. It may, however, have had the unintended consequence of consolidating cynical open-source hackers toward the Python community, which has concurrently developed a substantial proportion of the functionality previously unique to R. For the moment, in any event, both languages provide tremendous value to bilingual users, as detailed here.
Beyond tool choice, some of the additional key insights from the O’Reilly survey are:
- data scientists in the finance and software industries command some of the highest salaries relative to those in other industries (chart above; go FinTech!)
- larger companies tend to pay higher salaries (though small firms tend to feature compensation by other means, e.g., equity options; chart included below)
- more time spent each day on data extraction, transformation and loading correlates positively with pay, while more time spent on data visualization correlates negatively
Using responses to their survey questions as predictor variables, authors of the report built a linear regression model that estimates the salary of data professionals. They were able to explain an impressive 43% of the variance in pay with their model.
They affirmed that, holding all other variables constant, expertise with the open-source tool clusters (Spark, D3.js, Amazon Elastic MapReduce, and their highly-correlated buddies) is associated with higher pay.
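For readers who would like to see what "explaining a share of the variance" and "holding all other variables constant" mean in practice, here is a minimal sketch in Python. It is not the report's model: the data are synthetic, the predictor names (years_experience, uses_tool_cluster) and the helper fit_ols are hypothetical, and the numbers it prints say nothing about real salaries. It simply fits an ordinary least squares regression, computes R² (the proportion of variance explained), and reads off the coefficient on a yes/no predictor.

```python
import numpy as np


def fit_ols(X, y):
    """Fit ordinary least squares with an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the model can learn an intercept.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)           # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())     # total variation
    return coef, 1.0 - ss_res / ss_tot


# Synthetic, illustrative data only (NOT survey data).
rng = np.random.default_rng(0)
n = 500
years_experience = rng.uniform(0, 20, n)      # hypothetical predictor
uses_tool_cluster = rng.integers(0, 2, n)     # hypothetical 0/1 predictor
noise = rng.normal(0, 20, n)
salary_k = 60 + 2.0 * years_experience + 10.0 * uses_tool_cluster + noise

predictors = np.column_stack([years_experience, uses_tool_cluster])
coef, r2 = fit_ols(predictors, salary_k)

print(f"R^2 (share of variance explained): {r2:.2f}")
# The coefficient on a 0/1 predictor is the estimated pay difference between
# the two groups when the other predictors are held at the same values.
print(f"Estimated effect of uses_tool_cluster: {coef[2]:.1f}k")
```

An R² of 0.43, as in the report, would mean that 43% of the variation in pay across respondents is accounted for by the model's predictors, with the remainder left unexplained.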
After adjusting for all of these work-relevant attributes, one data scientist trait retained by the predictive model stands out for its lack of reasonable justification: a nearly $3k drop associated with being female. Given that women code as well as, and perhaps even better than, men, this discrepancy appears to be attributable to bias, whether intentional or not.
At untapt, we believe that when data are leveraged to drive decision-making, bias can be reduced, akin to how data-focused equity crowdfunding site CircleUp has reduced gender bias in the consumer start-up space. For guidance on alleviating bias in the tech team-hiring process in particular, here’s an actionable list that we put together.
If you’re keen to dive more deeply into topics discussed in this post, I maintain a list of data scientist technologies, resources, and open data sources here.