
Workplace attitudes towards mental health and their impact on employee productivity

A Tutorial and Introduction to the Data Science Pipeline by Gabrielle Baniqued

What is the Data Science Pipeline?

Our walkthrough of the Data Science Pipeline will consist of the following:

  • Data Collection
  • Data Management and Representation
  • Exploratory Data Analysis
  • Analysis, Hypothesis Testing, and Machine Learning
  • Insight and Discussion

    Although this tutorial will follow these steps in order, something important to remember is that this 'pipeline' is not necessarily a step-by-step, one-and-done process. Often we will be required to circle back, repeat steps, rethink methods, and maybe even start from square one.

    Introduction - Why Track Attitudes towards Mental Illness?

    In 2012, the Centers for Disease Control and Prevention (CDC) released a report, Attitudes Toward Mental Illness, highlighting the value of tracking attitudes towards mental health. Some of their conclusions include:

  • Beliefs and attitudes about mental illness might predict whether an individual discloses symptoms or seeks treatment/support
  • Individuals are less likely to disclose mental illness/seek support if there is perceived stigma around it in their community
  • The probability of adults receiving mental health treatment increased when states provided more funding/made it more accessible
  • Statewide, positive attitudes towards receiving treatment for mental disorders were associated with higher per capita expenditure on state mental health agencies
  • On a state level, the CDC found that a lack of accessible mental health resources and negative stigma around the topic were correlated with fewer individuals seeking mental health treatment. On the other hand, states where attitudes towards mental health were positive (i.e., believing treatment is effective and important) and resources were easily accessible saw an increase in the number of individuals benefiting from treatment for mental disorders.

    The following project takes the reasoning from the aforementioned CDC study and applies it to a different environment.

    We will be exploring the following inquiry: do workplace attitudes about mental health influence employee productivity?

    1. Data Collection

    To answer our inquiry, we will be using OSMI (Open Sourcing Mental Illness, LTD) survey data from 2014 and 2016 regarding mental health in the tech workplace.

    Sometimes, data will have to be collected (e.g., via web scraping), parsed, and organized before we can actually start using it. In this case, however, the data has been collected for us.

    The survey data is publicly available and can be downloaded as CSV files from Kaggle:

  • OSMI Mental Health in Tech Survey (2014)
  • OSMI Mental Health in Tech Survey (2016)

    Both of these surveys aimed to measure attitudes towards mental health and the frequency of mental health disorders among tech workers.

    The code below will load the data into pandas DataFrames. For the majority of this project, we will be using pandas DataFrame methods. The documentation can be found here.
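
    As a minimal sketch (the filenames below are assumed local paths for the Kaggle downloads; adjust them as needed):

```python
import pandas as pd

# Load each survey into its own DataFrame. The filenames are assumed
# names for the downloaded Kaggle CSVs.
survey_2014 = pd.read_csv("survey_2014.csv")
survey_2016 = pd.read_csv("survey_2016.csv")

print(survey_2014.shape)  # expect (1259, 27)
print(survey_2016.shape)  # expect (1433, 63)
```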

    Summary of the 2014 dataset: The 2014 survey dataset (survey_2014) consists of 1259 valid responses (rows) with 27 attributes (columns). The target audience for this survey was employees in the tech workplace.

    Summary of the 2016 dataset: The 2016 survey dataset (survey_2016) consists of 1433 valid responses (rows) with 63 attributes (columns). The target audience for this survey remained the same as the 2014 survey (employees in the tech workplace).

    Here, you can see that the column names are different from the 2014 data; there are actually new columns in the 2016 data! In order to combine these two DataFrames, we are going to rename the 2016 columns, drop any columns that do not exist in the 2014 survey data, and then concatenate the two DataFrames.
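
    A sketch of that combination step might look like the following; the rename_map entries are illustrative, since the actual 2016 survey uses full question text as column headers:

```python
# Illustrative mapping from 2016 column names to their 2014 equivalents.
rename_map = {
    "What is your age?": "Age",
    "What is your gender?": "Gender",
    # ...one entry per shared question...
}
survey_2016 = survey_2016.rename(columns=rename_map)

# Keep only the columns the two surveys share, then stack the rows.
shared_cols = survey_2014.columns.intersection(survey_2016.columns)
survey = pd.concat(
    [survey_2014[shared_cols], survey_2016[shared_cols]],
    ignore_index=True,
)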

    2. Data Management and Representation

    If we wanted to use all of our survey data, it would be valuable for us to clean and discuss all of it. In this case, we are looking ahead and letting the later parts of the project (machine learning, modeling, etc.) inform which data we should tidy.

    Preparing for later modeling - target and predictor values

    Now that our two surveys are combined into one DataFrame, we can decide on our predictors and target for our modeling later on.

    Out of all of the columns (in our modified DataFrame), two measures of mental wellness can be found in treatment and work_interfere. treatment measures whether a respondent has ever received treatment for any mental disorders, so a response of 'Yes' implies a personal history of mental illness. work_interfere measures whether a respondent's mental state/disorder interrupts their work. Our specific inquiry, the effect of mental health attitudes on productivity, is best measured by work_interfere. We will choose work_interfere as our target.

    Thinking ahead: Tidying Gender, Age, family_history

    Intuitively, I can identify some 'obvious' predictors: gender, age, and family_history. These predictors have to do with the employees themselves. We will explore whether my intuition is reasonable in the Exploratory Data Analysis (EDA) phase.

    Gender: This question in the survey was free response, leaving us a lot of data to clean up. Gender is also increasingly being understood as non-binary, i.e., not just Male/Female. We will clean up the responses, group them into three categories, and then assign each category a numerical value (Male: 0, Female: 1, Nonbinary: 2) to use for Logistic Regression later on.
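
    A sketch of that cleanup, with illustrative (not exhaustive) keyword groups:

```python
import numpy as np

# Group free-response gender answers into three numeric categories.
# The keyword sets are illustrative; the real responses include many
# more spellings and variants that would need to be handled.
def clean_gender(response):
    if not isinstance(response, str):
        return np.nan
    r = response.strip().lower()
    if r in {"male", "m", "man", "cis male"}:
        return 0  # Male
    if r in {"female", "f", "woman", "cis female"}:
        return 1  # Female
    if r in {"non-binary", "nonbinary", "genderqueer", "agender"}:
        return 2  # Nonbinary
    return np.nan  # unrecognized responses become missing values

survey["Gender"] = survey["Gender"].apply(clean_gender)
```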

    Age: After getting rid of any invalid ages, we will keep age as a continuous variable (vs. making it a categorical variable).

    family_history: This question is answered by 'Yes', 'No', or 'I don't know'. 'I don't know' will be processed as a NaN value, and 'Yes' and 'No' will also be assigned numerical values, 0 for No, 1 for Yes.
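
    A sketch of both steps (the 18-100 age bounds are an assumption about what counts as 'invalid'):

```python
import numpy as np

# Age: mark implausible working ages as missing. The 18-100 bounds are
# an assumption about which ages count as invalid.
survey.loc[(survey["Age"] < 18) | (survey["Age"] > 100), "Age"] = np.nan

# family_history: 'I don't know' becomes NaN, then Yes/No become 1/0.
survey["family_history"] = (
    survey["family_history"]
    .replace("I don't know", np.nan)
    .map({"No": 0, "Yes": 1})
)
```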

    However, for my specific inquiry, I would also like to identify predictors on the workplace's side (workplace size, culture, and attitudes towards mental health) and determine their significance. In addition to predictors related to the employee, we will also focus on predictors regarding the workplace.

    Thinking ahead: Tidying no_employees and measuring workplace culture/attitude

    no_employees: The survey already provides us with grouped company sizes: 1-5, 6-25, 26-100, 100-500, 500-1000, More than 1000. We will stick with these groupings.

    Measuring workplace culture/attitude: 'Workplace culture' or 'attitudes', however, are more difficult to measure based on the data we are given. Our plan is to score each respondent's answers to the workplace culture/attitude questions (which are generally answered in a 'Yes, Maybe, No' fashion) and assign a workplace_score to each employee/row. For simplicity, we will weight each question equally.

    However, we can't just naively assign Yes as +1 and No as -1, since the meaning of the answer depends on the question being asked. So, we will determine the intention of each question, classify it accordingly, and then score 'positive answers' by incrementing the score by 1, 'negative answers' by decrementing it by 1, and 'neutral answers' with a 0, leaving the score unchanged.

    A common way to handle categorical variables is to use sklearn's LabelEncoder to transform our variables into numerical values. The documentation can be found here. We use this in the code below.

    Below, we will translate the categorical value for no_employees into a numerical value. This will assist with our Logistic Regression later on.
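
    A sketch of that encoding step. One caveat worth noting: LabelEncoder assigns integers in alphabetical order, so the resulting codes are nominal labels rather than following the bracket sizes.

```python
from sklearn.preprocessing import LabelEncoder

# Encode the company-size brackets as integers. Casting to str first
# avoids errors on missing values (which become their own 'nan' label).
le = LabelEncoder()
survey["no_employees"] = le.fit_transform(survey["no_employees"].astype(str))

# le.classes_ shows which bracket each integer stands for.
print(le.classes_)
```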

    Handling work_interfere differences between 2014 and 2016

    One of the changes between the 2014 and 2016 surveys that presents a problem for our specific inquiry is that the 2016 survey asks for level of work interference due to mental health with or without effective treatment. The 2014 survey asks for work interference without the added conditions.

    The code below consolidates the two columns from 2016 into the one work_interfere column based on the following conditions:

  • If one of the two columns is a NaN value, the non-NaN value is used.
  • If both are NaN, no action is taken, since the work_interfere column will either have a value or NaN value in the cell by default.
  • If both are not NaN values, we use random.choice() to randomly choose between the two options. This is a form of Hot Deck Imputation, which handles missing data by replacing it with an observed response from a similar record. You can read about it more extensively here.
  • Afterwards, we drop the two 2016 work_interfere_with/notreat columns (we don't need them anymore!) All non-NaN values should be under the one work_interfere column now.
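
    A sketch of that consolidation (the 2016 column names below are placeholders; the real survey phrases them as full questions):

```python
import random
import pandas as pd

# Placeholder names for the two 2016 columns (treated vs. untreated).
WITH_TREAT = "work_interfere_with_treatment"
NO_TREAT = "work_interfere_no_treatment"

def consolidate(row):
    with_t, no_t = row[WITH_TREAT], row[NO_TREAT]
    if pd.isna(with_t) and pd.isna(no_t):
        return row["work_interfere"]      # keep whatever is already there
    if pd.isna(with_t):
        return no_t
    if pd.isna(no_t):
        return with_t
    return random.choice([with_t, no_t])  # hot deck: pick one observed answer

survey["work_interfere"] = survey.apply(consolidate, axis=1)
survey = survey.drop(columns=[WITH_TREAT, NO_TREAT])
```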

    After we've handled the differences between the 2014 and 2016 work_interfere responses, we can group them into a binary classification as described below:

  • No, Never, and Rarely -> Little to No Interference
  • Sometimes, Yes, Often -> Moderate to Significant Interference

    This binary classification is done in preparation for our Logistic Regression, to be completed later under Hypothesis Testing and Machine Learning.
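
    A sketch of the grouping:

```python
# Collapse the raw answers into two classes:
# 0 = Little to No Interference, 1 = Moderate to Significant Interference.
interfere_map = {
    "No": 0, "Never": 0, "Rarely": 0,
    "Sometimes": 1, "Yes": 1, "Often": 1,
}
survey["work_interfere"] = survey["work_interfere"].map(interfere_map)
```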

    As we discussed earlier, we will be creating our own scoring to determine a company's workplace culture and attitudes. We will call this workplace_score. The following questions/values will be used to determine the score.

  • benefits: Does your employer provide mental health benefits?
  • care_options: Do you know the options for mental health care your employer provides?
  • wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
  • seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
  • anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
  • leave: How easy is it for you to take medical leave for a mental health condition?
  • mentalhealthconsequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
  • physhealthconsequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
  • coworkers: Would you be willing to discuss a mental health issue with your coworkers?
  • supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
  • mentalvsphysical: Do you feel that your employer takes mental health as seriously as physical health?
  • obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

    We will intuitively classify the questions as a 'positive question' or a 'negative question'. A positive question is one where 'yes' will increase the score, and 'no' will decrease the score. A 'negative question' will work in the opposite way. Any other questions that don't fit in this binary will be classified and scored individually. Let's get familiar with what the scores mean:

  • A high workplace_score indicates a workplace where mental health resources are accessible and common knowledge, there are little to no consequences for being open about one's mental illness(es), and individuals are comfortable discussing the topic.
  • A low workplace_score indicates a workplace where mental health resources are not accessible and/or not well-communicated, there are observed consequences for being open about one's mental illness(es), and individuals are not comfortable discussing the topic.

    The calculation of the scores is reflected in the code below:
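
    Here is a minimal sketch of that scoring; the positive/negative groupings below are one plausible reading of the questions, and leave (an ease scale rather than Yes/No) would be scored individually:

```python
# Questions where 'Yes' suggests a healthier workplace culture...
POSITIVE = ["benefits", "care_options", "wellness_program", "seek_help",
            "anonymity", "coworkers", "supervisor", "mentalvsphysical"]
# ...and questions where 'Yes' suggests the opposite.
NEGATIVE = ["mentalhealthconsequence", "physhealthconsequence",
            "obs_consequence"]

# 'Maybe', 'Don't know', and missing answers all contribute 0.
ANSWER_VALUE = {"Yes": 1, "No": -1}

def score_row(row):
    score = 0
    for col in POSITIVE:
        score += ANSWER_VALUE.get(row[col], 0)
    for col in NEGATIVE:
        score -= ANSWER_VALUE.get(row[col], 0)
    # 'leave' ('Very easy' ... 'Very difficult') would be scored
    # individually, as described above.
    return score

survey["workplace_score"] = survey.apply(score_row, axis=1)
```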

    3. Exploratory Data Analysis

    Inspecting our data

    Now that we've tidied up our data and organized it according to our inquiry, it will be helpful for us to understand the data we're using. The code below outputs distributions and countplots for us to make observations about our dataset and then revisit any hypotheses or thoughts we may have had in the earlier phases. The observations we make will inform how we will approach making our prediction model.
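
    A sketch of that inspection step:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Summary statistics for the continuous predictors.
print(survey[["Age", "workplace_score"]].describe())

# Count plots for the categorical columns of interest.
for col in ["Gender", "family_history", "no_employees", "work_interfere"]:
    sns.countplot(x=col, data=survey)
    plt.title(f"Counts of {col}")
    plt.show()
```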

    Predictions

    Our inquiry seeks to determine connections between mental health attitudes/culture in the tech workplace (workplace_score) and employee productivity (levels of work interference due to personal mental health, work_interfere). I would expect a high workplace_score, indicating a workplace culture where mental wellness is prioritized and openly discussed without consequence, to be associated with less work interference (i.e., falling under the 'Little to No Interference' class).

    In order to evaluate the strength of the correlation between work_interfere and workplace_score, I also want to evaluate how other predictors (Age, Gender, family_history, and no_employees) interact with work_interfere, and then compare.

    I predict that, out of the four predictors, family_history and Gender will be the most significant.

    The code below for the Exploratory Data Analysis (EDA) section will utilize seaborn and matplotlib to produce plots. The relevant documentation can be found here:

  • FacetGrid with seaborn
  • histplot with seaborn
  • countplot with seaborn
  • Plotting with matplotlib

    Age

    The distribution for age has a right skew, with most tech employees falling between their mid-20s and late 30s.

    Family History of Mental Illness (family_history)

    The counts below show that approximately 46% of respondents, just below half, reported having a family history of mental illness.

    Gender

    Again, it is not surprising that a survey about the tech workplace, a male-dominated industry, is itself male-dominated. Approximately 77% of respondents identify as male, 21% identify as female, and less than 1% identify as nonbinary.

    Number of Employees (no_employees)

    Using the grouping determined by the survey, most respondents work in a company of sizes 6-25 or 26-100, and many respondents also work in a significantly larger company of more than 1000 employees.

    Exploring how our possible predictors interact with our target value

    Using FacetGrid from seaborn, we can see the distribution of certain variables (our predictors) based on another value (our target, work_interfere). Each of the two levels of interference displays its own distribution based on the predictor values.
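
    A sketch of the FacetGrid pattern used for these plots (shown here for Age; the same call works for the other predictors):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One histogram of Age per work_interfere class, side by side.
g = sns.FacetGrid(survey, col="work_interfere")
g.map(sns.histplot, "Age")
plt.show()
```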

    Distribution of Age by Levels of Work Interference

    Differences in sample size aside, the distributions both display a right skew. This appears to be more indicative of the ages alone, rather than any significant correlation.

    Distribution of Gender by Levels of Work Interference

    Here, we see that females (coded by 1) experience more work interference due to mental health issues than males (coded by 0). This may imply that Gender is a strong predictor value. The sample size for individuals identifying as non-binary is not large enough to make meaningful observations.

    Distribution of Number of Employees by Levels of Work Interference

    Just visually, I would not make any inferences about the distribution here. Often, naively plotting values may not be helpful in exploring data. One option is to reconsider how you represent your data (using different ranges, converting it into a continuous variable, etc.).

    Distribution of Workplace Culture/Attitudes by Levels of Work Interference

    My hypothesis, that a positive workplace culture regarding mental health would reduce work interference, is not supported by the plots below. Both of them demonstrate a concentration of moderate workplace_score values, with more employees reporting some sort of work interference (moderate to significant). This may imply that the workplace_score value is not as significant as I initially theorized.

    Counts of Family History of Mental Illness by Levels of Work Interference

    Respondents who indicate having a family history of mental illness also report more incidences of work interference due to their own mental health. This may imply that family_history is a significant predictor.

    Our exploratory data analysis has given us some information and predictions going into our Hypothesis Testing and Machine Learning step. Now, we will create a model, interpret its outputs, and discuss its performance. It would be nice to come out on the other end of this step with 100% accuracy...but that won't happen. So, in the next step, we will also eventually discuss possible problem areas in our data collection, management, and analysis.

    Remember: The Data Science Pipeline isn't always straightforward! More often than not, it will require that we repeat steps and fine-tune methods.

    4. Hypothesis Testing and Machine Learning

    Introduction to Logistic Regression

    As mentioned in earlier steps, we are going to use Logistic Regression to try to predict levels of interference for employees in the tech workplace.

    When to use Logistic Regression?

    We should use logistic regression when we want to measure the relationship between a categorical variable (in our case, work_interfere, which expresses two categories of work interference due to mental health) and one or more independent variables (Age, Gender, workplace_score, and family_history). For our specific inquiry and the source of our data (a survey), we primarily worked with categorical variables. Logistic Regression is a classification algorithm, making it a suitable choice for our dataset.

    Binary Logistic Regression

    Earlier in our code, we translated work_interfere into a variable with only two possible outcomes: Little to No Interference or Moderate to Significant Interference. There are other types of Logistic Regression that can be used to handle target variables having more than two classifications (multinomial, ordinal), but for our purposes, we will use binary logistic regression.

    The code below shows how to build a predictive model in Scikit-learn. The documentation can be found here.
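
    A minimal sketch of the modeling step (the 75/25 split and random_state are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Keep only complete rows for the predictors and the target.
cols = ["Age", "Gender", "family_history", "no_employees",
        "workplace_score", "work_interfere"]
model_df = survey[cols].dropna()

X = model_df.drop(columns="work_interfere")
y = model_df["work_interfere"].astype(int)

# Hold out 25% of the data for testing; the split ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
```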

    Let's put this in a DataFrame, just for clarity. Then we'll discuss what these coefficients indicate.
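
    One way to do that (reusing X and logreg from the sketch above):

```python
import pandas as pd

# Pair each predictor with its fitted coefficient for easier reading.
coef_df = pd.DataFrame({
    "predictor": X.columns,
    "coefficient": logreg.coef_[0],
})
print(coef_df)
```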

    Interpreting the coefficients

  • Gender has the lowest coefficient of our predictors; as it refers to the 'Male' classification, this suggests male respondents are less likely to report work interference due to their mental health.
  • Positive coefficients indicate a positive correlation. Based on this, we can say that respondents who do not report a family history of mental illness are less likely to experience work interference.
  • workplace_score, which is most relevant to our inquiry, has a negative sign, indicating that it has a negative correlation with having moderate to significant work interference. However, this value's significance can be debated.
  • This doesn't quite align with the earlier predictions. As mentioned earlier, we can't depend on our model to perform with 100% accuracy. Our next step will be to evaluate how our predictive model performed.

    Evaluating the performance of our model with a Confusion Matrix

    A confusion matrix, also known as an error matrix, can further our understanding by visualizing the number of correct and incorrect predictions for our classifications.
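
    It can be computed with scikit-learn (reusing y_test and y_pred from above):

```python
from sklearn.metrics import confusion_matrix

# Rows correspond to actual classes, columns to predicted classes.
cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)
```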

    We can go even further and create a heatmap from our cnf_matrix, as seen below. This does not change our prediction/error counts; it simply provides a visual where a darker, more concentrated color indicates a higher count for that classification. Our heatmap shows a significant concentration of incorrectly predicted values (specifically, Type 1 Errors, which are generally considered the worse of the two), which should prompt us to reevaluate our model's performance.

    We will use seaborn to create our heatmap. The documentation can be found here.
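
    A sketch of the heatmap call:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Annotated heatmap of the confusion matrix from above.
sns.heatmap(cnf_matrix, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.show()
```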

    Interpreting the classification_report
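
    The report itself comes from scikit-learn (again reusing y_test and y_pred):

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
```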

  • precision: Of all the respondents the model assigned to a class, the fraction that actually belong to that class.
  • recall: Of all the respondents that actually belong to a class, the fraction the model correctly identified.
  • f1-score: The harmonic mean of precision and recall.
  • support: The count of occurrences for each classification.

    What does this mean? The logistic regression created a model based on our training and test sets. recall shows us the product of our fitted model: it correctly identified 33% of respondents having little to no work interference due to mental health, and 82% of respondents reporting some sort of work interference. precision then tells us how accurate those predictions are, compared to our testing set (the actual results).

    Our model has an overall accuracy of 0.62, which...isn't great.

    So, what do we do now? One option is to improve our model's accuracy by going back and handling our data differently: this may include normalizing/standardizing values, looking for class imbalances in our dataset, or reworking how we classify our predictors and targets. Important: When doing this, tune your model against the training set (or a held-out validation set), not the testing set, to avoid overfitting to the test data.

    However, for our purposes, we won't be going back and making these changes; instead, we are going to discuss the likely causes of our lack of accuracy and other methods we could use to fine-tune our model.

    5. Insight and Discussion

    Walking back through the Data Science Pipeline

    Now that we've done one pass through the data science pipeline, our next step is going to be walking through it again, this time to review our methods, identify things we could have done differently, and discuss potential problem areas.

    1. Data Collection

  • Handling work_interfere: Instead of manipulating the 2016 dataset to fit the less informative 2014 format, it may have increased accuracy to use multiple imputation or other imputation methods.
  • Choice of datasets: The chosen datasets may not have been best suited to answer the inquiry about productivity. Although work_interfere describes productivity in the sense of 'least interference due to mental health', it may have been a stretch, and the question could potentially be answered better with another dataset. Additionally, 2014 and 2016 could be considered outdated, especially when it comes to mental wellness attitudes, which have significantly evolved and shifted over the past few years.

    2. Data Management and Representation and 3. Exploratory Data Analysis

  • Handling of workplace_score: Fine-tuning the way this score was calculated by differently weighting questions is something I would be interested in seeing the effects of. The way I chose to calculate it, giving positive marks and negative marks based on an employee's perception of their workplace, possibly communicated arbitrary information, especially when compared to stronger predictors, such as family_history and Gender.
  • Lack of normalization/standardization of predictor variables: Earlier in the code, we naively translated the categorical variables gender and family_history into numerical ones. The other predictor variables, age and workplace_score, are continuous. It is possible certain variables are dominating the model. Normalizing the data and representing it differently may result in a better-balanced training set.

    4. Discussing Predictive Model Performance (Hypothesis Testing, Machine Learning)

  • Our target variable lacked balance: In order to demonstrate a binary logistic regression, we manipulated our target variable, work_interfere, to only produce two results. 'Little to No Interference' consisted of survey responses of 'No', 'Never', and 'Rarely', and 'Moderate to Significant Interference' consisted of 'Sometimes', 'Yes', and 'Often'. However, the more moderate category of 'Sometimes' had the most responses, and likely produced an imbalance in favor of 'Moderate to Significant Interference'. A less statistics-related observation: I likely let this slide because I wanted to see a strong and obvious correlation between my predictor values and mental illness's impact on work_interfere. It's important to also check your personal biases!
  • General handling of missing values/NaN and sample size: Outside of work_interfere, we did not do any meaningful imputation in dealing with our missing values. This resulted in a significant reduction of our dataset. Similar to a lot of the discussions above, more fine-tuning could have possibly resulted in different model behavior.

    Final Thoughts

    So...what about the initial inquiry? Do workplace attitudes about mental health influence productivity?

    Based on the model we have: they do not significantly reduce an individual's likelihood of experiencing work interference. However, the problem areas we discussed indicate numerous ways this inquiry could be re-explored, whether with a whole new dataset, a different treatment of our variables, or a different ML model.

    As you can see, walking through the data science pipeline is not necessarily straightforward. Though the way we collect, manage, and analyze the data is important, there is great value in the discussion aspect of the data science pipeline: understanding our methods, figuring out how we can improve, and trying the same steps again with new perspectives.