Analyzing the 2019 Data Science Industry

What is it like working in Data Science? What are the different languages used by data engineers and data scientists? More importantly, what is the pay difference? Do I need an advanced degree to break in?

As someone new to Data Science, my head buzzed with these questions. And what better way to answer them than by analyzing the raw data itself? If you would like to read the full notebook, see here.

What is the breakdown of degree type for each profession?

Degree Breakdown for Each Occupation

From the above plot, we can observe that PhDs tend to gravitate towards the Statistician or Scientist roles. Bachelor's degree holders usually occupy the Software Engineer, Data Engineer or Analyst roles. Master's degree holders appear to be evenly spread and make up the majority of each occupation.
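
A breakdown like this can be computed with a normalized cross-tabulation. The sketch below is a minimal example on a small made-up table; the column names `occupation` and `degree` and the rows are assumptions for illustration, not the survey's actual fields or data.

```python
import pandas as pd

# Toy stand-in for the survey responses (made-up rows, hypothetical column names)
responses = pd.DataFrame({
    "occupation": ["Scientist", "Scientist", "Scientist", "Analyst", "Analyst", "Statistician"],
    "degree": ["Master's", "PhD", "Master's", "Bachelor's", "Master's", "PhD"],
})

# normalize="index" turns each row (one occupation) into shares that sum to 1
degree_share = pd.crosstab(responses["occupation"], responses["degree"], normalize="index")

# Show the shares as percentages
print((degree_share * 100).round(1))
```

Normalizing by row rather than over the whole table keeps occupations with few respondents comparable to those with many.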

What are the most popular learning platforms among the different occupations?

Overall Learning Platform Breakdown
Learning Platform Breakdown for each Profession

For all the professions, Coursera comes out as the clear winner.

For Statisticians, the second most common source is their own university courses, and the same holds for Scientists. Perhaps the two jobs call for overlapping skill sets.

Udacity makes a significant appearance among Software Engineers, probably because its courses are usually geared towards software development. Otherwise, Udacity makes up a small percentage for each occupation, with the exception of Statisticians.

It is also interesting to note that Managers and Analysts did not get their knowledge from their university education. Perhaps this shows that the respondents in these roles came from unrelated disciplines.

Kaggle courses rank highly among all 5 occupations. There are 2 takeaways from this. Firstly, Kaggle could be seen as the “go-to” platform for all Data Science related competitions and datasets. Secondly, there could be survey bias as the survey was conducted on the Kaggle platform after all.
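
Platform shares like these can be tallied from multi-select answers. The sketch below assumes the answers are stored one column per platform, holding the platform name when ticked and a missing value otherwise; the `platform_*` column names and the rows are hypothetical.

```python
import pandas as pd

# Toy multi-select answers: one column per platform, holding the platform name
# if ticked and None otherwise (hypothetical column names, made-up rows)
responses = pd.DataFrame({
    "occupation": ["Scientist", "Scientist", "Analyst", "Analyst"],
    "platform_coursera": ["Coursera", "Coursera", "Coursera", None],
    "platform_kaggle": ["Kaggle Courses", None, "Kaggle Courses", "Kaggle Courses"],
    "platform_udacity": [None, "Udacity", None, None],
})

platform_cols = [c for c in responses.columns if c.startswith("platform_")]

# Reshape to one row per (respondent, ticked platform), dropping unticked boxes
ticks = responses.melt(
    id_vars="occupation", value_vars=platform_cols, value_name="platform"
).dropna(subset=["platform"])

# Share of each platform among all ticks made by an occupation
platform_share = pd.crosstab(ticks["occupation"], ticks["platform"], normalize="index")
print(platform_share.round(2))
```

Because respondents can tick several platforms, the shares here are fractions of all ticks within an occupation, not fractions of respondents.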

What software is popular among the different occupations?

Overall Software Breakdown
Common Software Breakdown for each Profession

From the above plots, we can observe that a large portion of Scientists and Statisticians primarily use a Local Development Environment. The difference between the two is that Statisticians are more focused on modelling, whereas Scientists also have to be exposed to other software, such as cloud computing.

Another interesting observation is that the majority of Managers use software (basic statistical software, BI software, cloud etc.) instead of development environments. The same can be said for the Data Analyst roles. Hence, these are the roles that someone new to programming and entering ML/DS should aim for.

For Data Engineers, Software Engineers, Scientists and Statisticians, a large portion of the work involves heavy coding (high usage of Development Environments).

Another interesting point is that a high 23.2% of Software Engineers primarily use Basic Statistical Software. Perhaps these users are in a hybrid software engineer / manager role, such as Scrum master?

What programming languages are recommended for newcomers?

Recommended Languages Breakdown by Profession

Across all the occupations, the recommended languages in decreasing order of importance are Python, SQL and R. It is interesting to note that R carries a high weighting among Statisticians. Perhaps this shows that a majority of the statistical libraries are still implemented in R and have not been ported to Python yet.

Another peculiar observation is that the Software Engineers also recommend the same 3 languages, despite Software Engineers using Java or C++ for their work (see below).

What are the current languages used for each occupation?

Language Breakdown for each Profession

Currently, Python is the most popular language except among Statisticians, where the choice is R.

For the other professions (Analysts, Software Engineers, Data Engineers, Scientists, Managers), the 2nd language is SQL. This shows how important it is to learn a database language. It makes sense, as most data is stored in structured relational databases. Hence, database knowledge is a critical skill that anyone in ML/DS should have.

Lastly, Software Engineers have Java and JavaScript as their 3rd and 4th most used languages. Interestingly, these languages also appear among the top 5 used by Managers and Data Engineers. This makes sense, as roles closer to production would perhaps need to interface more with their product APIs.

What are the salaries for each occupation?

Salary Box Plot (Excluding top outliers)
Analyst Median: 17499.5 with sample size: 1868
Software Eng Median: 17499.5 with sample size: 2108
Scientist Median: 34999.5 with sample size: 4613
Statistician Median: 22499.5 with sample size: 248
Manager Median: 44999.5 with sample size: 587
Data Eng Median: 27499.5 with sample size: 604
Salary-occupation data was missing for 9689 entries

Looking at the medians, salaries for Analysts and Software Engineers are roughly the same. Interestingly, Data Engineers have a higher median than Statisticians. Scientists have the 2nd highest median, while Managers have the highest.

Of course, another analysis the reader could do is to investigate the relationship between years of experience and salary. No doubt, some of the managers were former engineers / scientists who had climbed the ladder.

The sample size for each category is noted with the box plots. The reader should note that for the Statistician, Manager and Data Engineer roles, the sample size is rather small compared to Analysts, Software Engineers and Scientists.
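
The medians ending in .5 suggest that salaries were reported as ranges and converted to range midpoints before plotting; that is an inference from the numbers, not something stated here. A minimal sketch of such a conversion, with hypothetical column names and made-up rows:

```python
import pandas as pd

def range_midpoint(salary_range: str) -> float:
    """Convert a range such as '15,000-19,999' to its midpoint."""
    low, high = salary_range.replace(",", "").split("-")
    return (float(low) + float(high)) / 2

# Toy stand-in for the survey responses (made-up rows, hypothetical column names)
responses = pd.DataFrame({
    "occupation": ["Analyst", "Analyst", "Analyst", "Manager", "Manager"],
    "salary_range": ["10,000-14,999", "15,000-19,999", "20,000-24,999",
                     "40,000-49,999", "50,000-59,999"],
})
responses["salary"] = responses["salary_range"].map(range_midpoint)

# Median salary and sample size per occupation
summary = responses.groupby("occupation")["salary"].agg(median="median", sample_size="count")
print(summary)
```

Reporting the sample size next to each median makes it easier to see which medians rest on only a few responses.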

Using a simple linear model, what variables are the most significant in predicting salary?

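from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# X (features) and y (salary) are defined in the full notebook; hold out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# normalize=True was removed in scikit-learn 1.2; scale features beforehand if needed
lm_model = LinearRegression()
lm_model.fit(X_train, y_train)

# Score predictions on the held-out test set with R^2
y_test_preds = lm_model.predict(X_test)
model_score = r2_score(y_test, y_test_preds)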

After fitting a model, the significant factors are: Country, Education Level, Gender, and Company Investment in Data Science Capabilities.
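
How the factors were ranked is not shown here. One common approach, sketched below on synthetic data, is to standardize the features and compare the absolute size of the fitted coefficients. The feature names and data are made up for illustration, and this is not necessarily the method used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: two informative features and one pure-noise feature
X = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = 5.0 * X["feature_a"] - 2.0 * X["feature_b"] + rng.normal(scale=0.1, size=500)

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# Rank features by absolute coefficient size, largest first
ranking = pd.Series(model.coef_, index=X.columns).abs().sort_values(ascending=False)
print(ranking)
```

Coefficient size is only a rough guide. Categorical inputs such as country are spread over many one-hot columns, and a formal significance test would need standard errors, which scikit-learn's `LinearRegression` does not report.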

Conclusion

To break into data science, one should learn Python and SQL via Coursera. From the salary charts, most data science occupations pay relatively well. However, to earn the big $$$, one should move into managerial roles after building up enough technical knowledge.

A safer route to success is to go back to school to get a degree. Unlike software engineering, there are not many non-degree holders in this field.

So what are you waiting for? The journey may be tough, but the end will be worth it.
