Analyzing Smart Device Usage: Bellabeat Case Study

What is the Bellabeat case study?

This is the Bellabeat case study project from the Google Data Analytics Specialization, and its goal is to demonstrate the process of Ask, Prepare, Process, Analyze, Share, Act. These six phases are very important for data analysis: they ensure a consistent, repeatable process each time.

I have previously talked about my experience with and opinion of the Google Data Analytics Specialization.

In this project, I work through the Bellabeat case study by exploring trends in smart device usage and coming up with new insights for the marketing team. Smart devices have been growing in popularity in recent years, and many firms want to better capture the smart device market. Through the Bellabeat case study, we hope to identify trends in smart device usage.

Bellabeat Case Study: Scenario


Suppose that I am a junior data analyst working at Bellabeat, tasked with analyzing smart device fitness data to gain insight into new growth opportunities. Bellabeat is a successful small company, and Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that insights from analyzing smart device data could reveal new opportunities for the company.

Questions this Case Study Wants to Answer

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat’s marketing strategy?

Understanding Our Data for the Bellabeat Case Study


Gathering Necessary Data

In order to analyze smart device usage, the data ideally needs to track user activity throughout the day. Some physical factors include how many steps a user has taken, how far they have walked, how many calories they burned through physical activity, and how fast they were moving. Others are biological factors such as heart rate, weight, and BMI.

The Fitbit Fitness Tracker Data fits all the criteria for the type of data we need. Thirty Fitbit users gave consent to submit their personal smart device tracking data, including minute-level output for physical activity, heart rate, and sleep monitoring (the downloaded data actually contains 33 distinct user ids, as we will see later). These data are especially useful since they let us look at how different levels of physical activity affect calories burned and heart rate.

The data includes measurements such as steps taken, activity intensity, calories burned, and METs (Metabolic Equivalents), the amount of energy used compared to sitting at rest. Other measurements included with the data are sleep, weight, and BMI. For the sleep measurement, I am unsure what it is actually measuring.

Limitations of the Dataset

Header row of the sleep measurement data.

The header of the CSV just says value, and the column ranges from 1 to 3. Perhaps Fitbit has a feature that scores sleep quality, and this is what the value in the CSV header means. To be on the safe side, it is better to exclude this column, because a misunderstanding of the data can lead to a wrong analysis and, ultimately, a wrong conclusion.

The weight and BMI data would have been nice to have, but unfortunately only a few people entered their weight and BMI for the duration of the survey. There are not enough samples to use this data, so it will not be used in this analysis.

One of my concerns about this data is whether it includes enough diversity of users for us to extrapolate trends. Since the survey is mostly anonymous, we do not know factors beyond the Fitbit measurements, such as age, diet, etc. This is a problem because the 30 people could have been selected from a similar group. Suppose 60-70% of the 30 participants engage in high physical activity; because the sample is skewed toward one group of people, the results of the analysis could be biased.

The data is available on Kaggle.

Data Processing/Cleaning


Since the data is separated into different CSV files, using a DBMS (Database Management System) can be beneficial. SQL (Structured Query Language) allows the user to create and modify databases. Its benefits include querying multiple tables, inserting or modifying existing data, creating new tables, and so on.

Another benefit of SQL is query speed. When the dataset is small, with only a few hundred thousand rows, something like Excel will do just fine. However, SQL excels when the dataset is large: with millions or even hundreds of millions of rows, it can query data much faster than Excel.

In this case study, I will use MySQL. Each CSV file can be represented as a table in the database, so instead of having several separate files, the data can be managed under a single database.

First, create a database named fitbit.
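A minimal sketch of the statement:

-- Create the database and switch to it
CREATE DATABASE fitbit;
USE fitbit;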

Next, we create the tables needed to store our data. In this case, I am using 4 datasets, so 4 tables will be created to store them; a sketch of one CREATE TABLE statement follows the column descriptions below.

Creating Tables in MySQL
  • id: Corresponds to the person doing the survey
  • date_time: This is a string containing the date and time
  • steps/mets/intensity/calories: The target measurement
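A minimal sketch of one such statement, with column types as my assumptions (the mets, intensity, and calories tables follow the same pattern):

-- Sketch of the steps table; column types are assumptions
CREATE TABLE steps (
    id BIGINT,
    date_time VARCHAR(30),
    steps INT
);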

Next, the CSV files are loaded into MySQL using LOAD DATA LOCAL INFILE.

The path to each CSV is an absolute path, in the form C:/Users/<username>/”Path to File”.

Loading data from CSV into MySQL
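A sketch of the load statement, with the file name as an assumption (local_infile must be enabled for LOAD DATA LOCAL INFILE to work):

-- Load the minute-level steps CSV into the steps table
-- (the file name here is an assumption)
LOAD DATA LOCAL INFILE 'C:/Users/<username>/minuteStepsNarrow_merged.csv'
INTO TABLE steps
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;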

This is a snapshot of the steps table.

Snapshot of one of the datasets

Everything looks fine except the date format, which needs to be changed to 24-hour format. To convert the date_time column from 12-hour format into 24-hour format, we can use the STR_TO_DATE function in MySQL.

-- Update the date_time column for each of the tables.
-- The condition Id = Id is always true, so every row is updated;
-- it is only there to provide a WHERE clause.
UPDATE steps
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;

UPDATE mets
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;

UPDATE intensity
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;

UPDATE calories
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;

Before

After

Now we change the data type of the date_time column from VARCHAR to DATETIME.

-- Change the date_time column from VARCHAR to DATETIME
ALTER TABLE steps MODIFY COLUMN date_time DATETIME;

ALTER TABLE mets MODIFY COLUMN date_time DATETIME;

ALTER TABLE intensity MODIFY COLUMN date_time DATETIME;

ALTER TABLE calories MODIFY COLUMN date_time DATETIME;

Since the measurements are in different tables, we can use LEFT JOIN to merge them together.

-- Merging the tables into one
SELECT steps.*, mets.mets, intensity.intensity, calories.calories
FROM steps
LEFT JOIN mets ON steps.id=mets.id 
	AND steps.date_time=mets.date_time
LEFT JOIN intensity ON steps.id=intensity.id 
	AND steps.date_time=intensity.date_time
LEFT JOIN calories ON steps.id=calories.id 
	AND steps.date_time=calories.date_time;

This is the resulting table, which we then export to a CSV file.

Currently, our data is measured every minute, which can make it difficult to see any trends, because 1-minute data can differ significantly between intervals. Say a person suddenly runs 100 steps in one minute, and the next minute the number of steps drops to 0. By aggregating the data into 30-minute, 1-hour, and daily intervals, we can see smoother changes, though we lose the sensitivity that 1-minute data provides.

-- Aggregate the data into hourly intervals by id and time.
-- Grouping on the formatted timestamp truncates each row to its hour,
-- so the group key matches the displayed time_stamp.
SELECT id,
	DATE_FORMAT(date_time, '%Y-%m-%d %H:00:00') AS time_stamp,
	SUM(steps) AS hourly_steps,
	AVG(mets) AS avg_hourly_mets,
	AVG(intensity) AS avg_hourly_intensity,
	SUM(calories) AS hourly_calories_burned
FROM minute_merged
GROUP BY id, DATE_FORMAT(date_time, '%Y-%m-%d %H:00:00');


-- Aggregate the data into daily intervals by id and date
SELECT id,
	DATE(date_time) AS time_stamp,
	SUM(steps) AS daily_steps,
	AVG(mets) AS avg_daily_mets,
	AVG(intensity) AS avg_daily_intensity,
	SUM(calories) AS daily_calories_burned
FROM minute_merged
GROUP BY id, DATE(date_time);

id          time_stamp           hourly_steps  avg_hourly_mets  avg_hourly_intensity  hourly_calories_burned
1503960366  2016-04-12 00:00:00           373          17.2333                0.3333                 81.3241
1503960366  2016-04-12 01:00:00           160          12.8333                0.1333                 60.5605
1503960366  2016-04-12 02:00:00           151          12.4667                0.1167                 58.8302
1503960366  2016-04-12 03:00:00             0          10.0000                0.0000                 47.1900
1503960366  2016-04-12 04:00:00             0          10.0667                0.0000                 47.5046
1503960366  2016-04-12 05:00:00             0          10.0667                0.0000                 47.5046
1503960366  2016-04-12 06:00:00             0          10.0667                0.0000                 47.5046
1503960366  2016-04-12 07:00:00             0          10.0333                0.0000                 47.3473
1503960366  2016-04-12 08:00:00           250          14.4333                0.2167                 68.1109
1503960366  2016-04-12 09:00:00          1864          29.8500                0.5000                140.8621

Sample data from the aggregated hourly data (calories rounded for display).

Data Analysis for the Bellabeat Case Study


Now that we have the data in minute, hourly, and daily formats for our Bellabeat case study, it is time to start the analysis process. In the Google Data Analytics course, the R programming language is used for the analysis. However, I will use Python, which gets the same results with a slightly different method.

This is the planned structure of the Python project folder:

Project Folder

  \ data
    \\ data needed for analysis

  \ visualizations
    \\ graphs

  \ linear regression
    \\ regression models

  \ misc
    \\ helper functions

  \ analysis.py
  \ data_visualization.py
  • data folder: contains the 3 aggregated data CSVs
  • visualizations folder: contains all the graphs needed for showing trends
  • linear regression folder: contains the regression models
  • misc: includes any miscellaneous helper functions
  • analysis.py: includes the regression model
  • data_visualization.py: plots all the data needed

First, let’s import the data from the CSVs into Python. For this, I will use the pandas library to read each CSV into a data frame.

# Loading the required libraries
import pandas as pd
import numpy as np

# Import the data from csv to dataframe
minute_data = pd.read_csv('data/minute_measurement_merged.csv')
hourly_data = pd.read_csv('data/hourly_measurement_merged.csv')
daily_data = pd.read_csv('data/daily_measurement_merged.csv')

Now that the data is loaded into Python, we can generate some summary statistics to see what the data is telling us. Since the first column of the data is the id, it is not useful for summary statistics and can be excluded.

# Generate summary statistics of our data (id column excluded)
minute_data.iloc[:, 1:].describe()
hourly_data.iloc[:, 1:].describe()
daily_data.iloc[:, 1:].describe()

Summary statistics for the minute measurement data.

              steps          mets     intensity      calories
count  1.325580e+06  1.325580e+06  1.325580e+06  1.325580e+06
mean   5.336192e+00  1.469001e+01  2.005937e-01  1.623130e+00
std    1.812830e+01  1.205541e+01  5.190227e-01  1.410447e+00
min    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
25%    0.000000e+00  1.000000e+01  0.000000e+00  9.357000e-01
50%    0.000000e+00  1.000000e+01  0.000000e+00  1.217600e+00
75%    0.000000e+00  1.100000e+01  0.000000e+00  1.432700e+00
max    2.200000e+02  1.570000e+02  3.000000e+00  1.974990e+01

Note: These summary statistics include ALL 33 of the survey participants, but that does not mean every person has the same statistics. On average, across the whole sample, participants walk or run 5.33 steps/minute. Does that imply everyone walks or runs an average of 5.33 steps/min? Not necessarily: some walk more than 5.33 steps/min and some walk less. It is the average across all participants that is 5.33 steps/min.

Let’s take a look at the data at the individual level, starting with the minute measurement data.

minute_data.groupby('id', as_index=False).agg({'steps': 'mean', 
                                               'mets': 'mean', 
                                               'intensity': 'mean', 
                                               'calories': 'mean'})
            id      steps       mets  intensity  calories
0   1503960366   8.706323  16.668550   0.269503  1.308458
1   1624580081   4.025136  12.522011   0.133990  1.040579
2   1644430081   5.130108  14.107674   0.175330  1.982551
3   1844505072   1.822663  11.877975   0.083698  1.111422
4   1927972279   0.643116  10.648279   0.030956  1.524636
5   2022484408   7.960249  16.955374   0.283628  1.759459
6   2026352035   3.896467  13.936843   0.180208  1.080245
7   2320127002   3.311451  13.162993   0.145714  1.210995
8   2347167796   6.897625  15.720491   0.242029  1.478669
9   2873212765   5.301857  15.285077   0.251698  1.338597
10  3372868164   4.839583  14.573023   0.256321  1.363015
11  3977333714   7.860249  15.451461   0.253807  1.085774
12  4020332650   1.557354  12.198224   0.072632  1.677744
13  4057192912   2.898674  12.053030   0.081629  1.486862
14  4319703577   4.821915  14.082896   0.188513  1.423378
15  4388161847   7.250907  16.375669   0.238526  2.135878
16  4445114986   3.373010  13.387290   0.163302  1.535782
17  4558609924   5.387432  15.350226   0.240829  1.426141
18  4702921684   6.054697  14.958869   0.215527  2.095438
19  5553957443   6.093744  14.507443   0.214064  1.327167
20  5577150313   5.850259  18.742255   0.331591  2.373143
21  6117666160   4.688712  14.730025   0.209015  1.530892
22  6290855005   4.102206  13.192632   0.176667  1.887752
23  6775888955   1.786093  11.841831   0.072896  1.513978
24  6962181067   6.913046  15.590255   0.247928  1.399325
25  7007744171   8.164420  16.953910   0.293012  1.833057
26  7086361926   6.537131  15.938904   0.226057  1.803806
27  8053475328  10.373311  17.020317   0.299161  2.070390
28  8253242879   4.750116  13.226721   0.151779  1.311297
29  8378563200   6.121487  16.076762   0.247999  2.414050
30  8583815059   4.090065  13.150534   0.152182  1.844362
31  8792009665   1.333085  12.026265   0.073909  1.410320
32  8877689391  11.243551  19.667984   0.318256  2.399803

Here we can see that not everyone has the same averages for the number of steps taken, METs, intensity, and calories burned. Some of the participants have more sedentary lifestyles, while others are far more active than the sample mean. This potentially means we have a fairly well-distributed sample: both users that are very physically active and users that are less so.

Summary statistics for the hourly measurement data

        hourly_steps  avg_hourly_mets  avg_hourly_intensity  hourly_calories_burned
count  22093.000000     22093.000000          22093.000000             22093.000000
mean     320.171502        14.690011              0.200595                97.387771
std      690.467256         8.342430              0.352260                60.700980
min        0.000000        10.000000              0.000000                42.162001
25%        0.000000        10.000000              0.000000                63.172731
50%       40.000000        11.216700              0.050000                82.523997
75%      357.000000        16.250000              0.266700                108.088200
max    10554.000000       129.300000              3.000000                948.493082

Summary statistics for the daily measurement data

        daily_steps  avg_daily_mets  avg_daily_intensity  daily_calories_burned
count    934.000000      934.000000           934.000000             934.000000 
mean    7573.392934       14.663957             0.199417            2303.627430 
std     5100.939182        2.903262             0.115348             693.567613 
min        0.000000       10.000000             0.000000              64.872000 
25%     3709.000000       12.710225             0.122375            1828.246126 
50%     7362.000000       14.696200             0.210400            2124.219920 
75%    10692.250000       16.406750             0.277100            2781.955238 
max    36019.000000       25.775700             0.627800            4552.352890 

Finding Relationships Between Variables

Now that we have some insight into what the data looks like and how it is distributed, we can look at the effects the variables have on each other. Suppose we are interested in how the number of steps taken, METs, and intensity affect the number of calories burned. Since the response variable in this case is continuous, we can run a simple linear regression to explore these relationships.

# Import the stats model library for linear regression
import statsmodels
import statsmodels.api as sm

# Create a function that fits the linear regression model
def lm_model(response_variable, explanatory_variable):
    """
    Takes in data from data frame and fit a linear regression model
    :param explanatory_variable: The explanatory variables that try to explain the response
    :param response_variable: The response variable being explained
    :return: Linear regression model
    """

    explanatory_variable = sm.add_constant(explanatory_variable)

    # Fit the linear regression model
    lm = sm.OLS(response_variable, explanatory_variable).fit()

    return lm


# Create the response and explanatory variables
response = hourly_data['hourly_calories_burned']
# Columns 2-4 are hourly_steps, avg_hourly_mets, and avg_hourly_intensity
explanatory = hourly_data.iloc[:, [2, 3, 4]]

# Create a linear regression model using grouped hourly data
lm_hourly = lm_model(response, explanatory)
lm_hourly.summary()

Let’s take a look at the result from the regression model.

>>> lm_hourly.summary()

<class 'statsmodels.iolib.summary.Summary'>

                              OLS Regression Results                              
==================================================================================
Dep. Variable:     hourly_calories_burned   R-squared:                       0.875
Model:                                OLS   Adj. R-squared:                  0.875
Method:                     Least Squares   F-statistic:                 5.158e+04
Date:                    Tue, 27 Dec 2022   Prob (F-statistic):               0.00
Time:                            12:21:01   Log-Likelihood:                -99084.
No. Observations:                   22093   AIC:                         1.982e+05
Df Residuals:                       22089   BIC:                         1.982e+05
Df Model:                               3                                         
Covariance Type:                nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  -27.4842      0.860    -31.976      0.000     -29.169     -25.799
hourly_steps            -0.0122      0.000    -24.590      0.000      -0.013      -0.011
avg_hourly_mets          9.2980      0.083    111.533      0.000       9.135       9.461
avg_hourly_intensity   -38.9427      1.893    -20.572      0.000     -42.653     -35.232
==============================================================================

Independent variables (explanatory variables): hourly steps taken, average hourly METs, average hourly intensity
Dependent variable (response variable): hourly calories burned

The adjusted R-squared in the summary suggests that our model explains 87.5% of the variance in calories burned, which means the model fits the data fairly well. The P>|t| column shows that all of the independent variables are significant at the 0.05 (5%) level, meaning the predictors have a statistically meaningful association with calories burned. There are other common significance levels, but I believe the 5% level is sufficient for this case study.

Now let’s look at the coefficients the model suggests and the effect each variable has on calories burned. The model says that as the number of steps taken per hour increases, calories burned decreases. This does not make sense: we burn energy to walk or run, so how could walking more burn less energy? We expect a positive correlation between steps and calories burned, but our model suggests a negative one.

Uh oh, something is wrong with our model. Let’s look at the correlation between our variables to see what is going on.

Heatmap showing high correlation among variables.
Correlation matrix plot between all the variables in the hourly data
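For reference, a minimal sketch of how a correlation matrix like this can be computed and plotted, assuming the hourly_data frame loaded earlier (seaborn and matplotlib are assumed to be installed):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the measurement columns (id and time_stamp excluded)
corr = hourly_data.iloc[:, 2:].corr()

# Annotated heatmap of the pairwise correlations
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm')
plt.title('Correlation matrix of hourly measurements')
plt.show()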

As it turns out, we have a slight problem: all our variables are highly correlated with one another. This suggests that our linear regression model might be affected by multicollinearity, which happens when the independent variables (in our case, hourly steps, hourly METs, and hourly intensity) are highly correlated with each other.

We can check whether there actually is multicollinearity in our model with the VIF (Variance Inflation Factor), which measures how much the variance of a coefficient estimate is inflated compared to a model with no multicollinearity.

from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif(lm):
    """
    Takes in a linear regression model, and outputs the Variance Inflated Factor
    """

    # Saves the linear regression variables in the model
    lm_var = lm.model.exog

    # For each linear regression variables in the model, calculates their VIF values
    variance_factor = [variance_inflation_factor(lm_var, i) for i in range(lm_var.shape[1])]

    return variance_factor


# Check the VIF values from the regression model for hourly data
vif_lm_hourly = vif(lm_hourly)

>>> vif_lm_hourly
[35.455352022352976, 5.6262326176546145, 23.211170850113135, 21.339431987542408]

Generally, the numbers mean:

  • VIF = 1: No correlation
  • VIF = 5: Moderate correlation
  • VIF >= 10: Serious correlation

In the result of vif_lm_hourly, the first number is the VIF of the constant, so we can ignore it. The second number is the VIF of hourly_steps, which is more or less moderately correlated. avg_hourly_mets and avg_hourly_intensity have VIFs of 23.21 and 21.34 respectively; per the list above, a VIF >= 10 indicates serious correlation between variables.

To address this problem, we can do something simple like removing one of the variables with a high VIF. Let’s try removing avg_hourly_mets, since it has the highest VIF.

# Exclude avg_hourly_mets from the explanatory variables
explanatory = hourly_data.iloc[:, [2, 4]]

# Create a linear regression model using grouped hourly data
lm_hourly = lm_model(response, explanatory)
lm_hourly.summary()
>>> lm_hourly.summary()
<class 'statsmodels.iolib.summary.Summary'>

                              OLS Regression Results                              
==================================================================================
Dep. Variable:     hourly_calories_burned   R-squared:                       0.805
Model:                                OLS   Adj. R-squared:                  0.805
Method:                     Least Squares   F-statistic:                 4.552e+04
Date:                    Tue, 27 Dec 2022   Prob (F-statistic):               0.00
Time:                            19:30:29   Log-Likelihood:            -1.0402e+05
No. Observations:                   22093   AIC:                         2.080e+05
Df Residuals:                       22090   BIC:                         2.081e+05
Df Model:                               2                                         
Covariance Type:                nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   66.5594      0.209    319.165      0.000      66.151      66.968
hourly_steps             0.0052      0.001      8.809      0.000       0.004       0.006
avg_hourly_intensity   145.4090      1.154    126.037      0.000     143.148     147.670
==============================================================================

It seems the new model without the avg_hourly_mets variable has an adjusted R-squared 0.07 lower. However, the coefficients of the variables now make more sense: when a person moves more, they burn more energy, and an increase in activity intensity also increases calories burned. While dropping a variable is one way to handle multicollinearity, it can remove potentially useful information.

Another way to handle multicollinearity is ridge regression. It is similar to linear regression but adds L2 regularization: instead of minimizing only the squared error, it minimizes ||y - Xb||^2 + alpha * ||b||^2, which shrinks the coefficients and makes them more stable when predictors are correlated.

def lm_regularized(response_variable, explanatory_variable, l1, alpha):
    """
    Linear regression model with regularization
    """

    explanatory_variable = sm.add_constant(explanatory_variable)

    # Create linear regression model
    lm = sm.OLS(response_variable, explanatory_variable)
    lm_norm = lm.fit()

    # Fit the regression model with regularization (L1_wt=0 gives ridge)
    lm_reg = lm.fit_regularized(L1_wt=l1, alpha=alpha, start_params=lm_norm.params)

    # Wrap the regularized coefficients in an OLSResults object so that
    # summary() works; the reported standard errors are only approximate
    lm_reg = sm.regression.linear_model.OLSResults(lm, lm_reg.params, lm.normalized_cov_params)

    return lm_reg


# Ridge regression
lm_reg = lm_regularized(response, explanatory, l1=0, alpha=5)
lm_reg.summary()

>>> lm_reg.summary()
<class 'statsmodels.iolib.summary.Summary'>

                              OLS Regression Results                              
==================================================================================
Dep. Variable:     hourly_calories_burned   R-squared:                       0.866
Model:                                OLS   Adj. R-squared:                  0.866
Method:                     Least Squares   F-statistic:                 4.760e+04
Date:                    Tue, 27 Dec 2022   Prob (F-statistic):               0.00
Time:                            21:57:47   Log-Likelihood:                -99855.
No. Observations:                   22093   AIC:                         1.997e+05
Df Residuals:                       22089   BIC:                         1.998e+05
Df Model:                               3                                         
Covariance Type:                nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2806      0.890      0.315      0.753      -1.464       2.025
hourly_steps             0.0018      0.001      3.432      0.001       0.001       0.003
avg_hourly_mets          6.4752      0.086     75.007      0.000       6.306       6.644
avg_hourly_intensity     0.0981      1.960      0.050      0.960      -3.744       3.940
==============================================================================

Here we can see that the adjusted R-squared is better than in the model that dropped one of the highly correlated variables. We now see positive coefficients for hourly_steps and avg_hourly_intensity. However, avg_hourly_intensity is statistically insignificant, so we are uncertain of the effect it has on hourly_calories_burned.

What does this result tell us? It suggests a few things. First, on an hourly basis, every step a participant takes is associated with burning an additional 0.0018 calories. Second, every unit increase in average METs is associated with burning an additional 6.4752 calories per hour. Finally, every unit increase in activity intensity is associated with burning an additional 0.0981 calories per hour; however, since this effect is not statistically significant, we are uncertain of its true effect on calorie burn.
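As a rough worked example using the coefficients above (the input values are made up for illustration), an hour with 1,000 steps, average METs of 15, and average intensity of 0.25 would be predicted to burn about 99 calories:

# Hypothetical hour: 1000 steps, average METs of 15, average intensity of 0.25
predicted = 0.2806 + 0.0018 * 1000 + 6.4752 * 15 + 0.0981 * 0.25
print(predicted)  # roughly 99.23 calories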

Insights from the Data Through Visualization

In this section, we will take a look at our data through data visualization. We have seen the numbers, but what do they mean?

Here we have a plot of the aggregate data for the total number of steps taken during the survey.

Total Number of Steps Taken for the Duration of the Survey

Total number of steps taken by each participant during the survey.
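A minimal sketch of how this plot can be produced with pandas and matplotlib, using the daily_data frame loaded earlier:

import matplotlib.pyplot as plt

# Total steps per participant over the whole survey
total_steps = daily_data.groupby('id')['daily_steps'].sum()

# One bar per participant
total_steps.plot(kind='bar')
plt.xlabel('Participant id')
plt.ylabel('Total steps')
plt.title('Total steps taken during the survey')
plt.show()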

Average Calories Burned Per Step

Here we can take a look at how many calories are burned per step for each user.

Average calories burned per step for each participant.
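One plausible way to compute this metric from the daily data (note that this includes calories burned at rest, so it is only a rough proxy):

# Total calories and steps per participant, then calories per step
per_user = daily_data.groupby('id')[['daily_calories_burned', 'daily_steps']].sum()
cal_per_step = per_user['daily_calories_burned'] / per_user['daily_steps']
print(cal_per_step.sort_values(ascending=False))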

Average METS Per Minute

We can see from the graph that some users are more physically active than others. Let’s take a look at some of the visualizations using the minute interval data.

Average METS for all participants per minute

Average Physical Intensity Per Minute

Average physical intensity for all participants per minute.

Average Calories Burned Per Minute

Average calories burned for all participants per minute

Minimum Calories Burned at Rest

There is an interesting situation here: some participants have higher average METs and activity intensity levels than other participants, yet their calorie burn is lower than that of participants with lower average activity intensity. What is going on here?

As it turns out, some of the participants burn more calories at rest.

Minimum calories burned per minute at rest, by participant
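A sketch of how this resting figure can be approximated from the minute data, taking each participant's lowest nonzero per-minute calorie burn (some rows record 0 calories, as noted in the remarks at the end):

# Lowest nonzero per-minute calorie burn for each participant,
# used as a rough proxy for calorie burn at rest
nonzero = minute_data[minute_data['calories'] > 0]
min_rest = nonzero.groupby('id')['calories'].min()
print(min_rest.sort_values(ascending=False))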

Fitbit Usage

How come some Fitbit users have significantly fewer steps taken?

Total Fitbit time use over 30 days

Over the duration of 30 days, users 9, 11, 14, and 29 used their Fitbit less than the other users did. This suggests that our data is not balanced, because not all users have a similar amount of use time. We have to be careful when interpreting some of these results, since they do not show the full picture.
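Usage time can be approximated by counting how many minute-level records each participant has; a sketch, assuming the device only logs minutes while in use:

# Recorded minutes per participant, converted to days of use
usage_days = minute_data.groupby('id').size() / (60 * 24)
print(usage_days.sort_values())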

Percentage of No Activity over 30 Days

Now that we have some idea of our users, let’s check the amount of time each user has an activity intensity of 0 and METs of 0, which together indicate no physical activity.

Percentage of No Activity of Every User over 30 days

We can see that only a few participants are very active, spending on average around 50% of their time sedentary. Most of the users show no activity around 60-70% of the time, and a few users are above 80%.
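A minimal sketch of the underlying calculation on the minute data:

# A minute counts as no activity when both intensity and METs are 0
inactive = (minute_data['intensity'] == 0) & (minute_data['mets'] == 0)

# Percentage of inactive minutes for each participant
pct_inactive = inactive.groupby(minute_data['id']).mean() * 100
print(pct_inactive.round(1))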

Data Visualization using Power BI

Besides using Python to produce visualizations, I also used Power BI to create some new ones. Let’s take a look at the relationship between the number of steps taken and the number of calories burned, grouped by time of day, using a stacked column chart.
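The Power BI charts themselves cannot be reproduced in text, but a rough pandas equivalent of the grouping behind them might look like this (using the hourly_data frame and pandas import from earlier):

# Average steps by hour of day for each participant
hourly = hourly_data.copy()
hourly['hour'] = pd.to_datetime(hourly['time_stamp']).dt.hour
avg_by_hour = hourly.groupby(['id', 'hour'])['hourly_steps'].mean().unstack()

# For example, the 12 AM to 6 AM window
print(avg_by_hour.loc[:, 0:6])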

Average Steps Taken and Calories burned between 12 AM to 6 AM

From the two images above, it seems that a few users like to do some sort of morning activity around 5 AM and 6 AM; their number of steps and calories burned is much greater than other users’. It is also interesting that many people are still awake between around 12 AM and 1 AM. One user, in particular, seems to be awake around 4 AM quite a bit.

Average Steps Taken and Calories burned between 7 AM to 10 AM

The reason I chose this time range is that it is usually the morning rush: people working day shifts typically start commuting to work around this time, so this range lets us see how people commute. As we can see from the figure above, people are more physically active between 7 AM and 8 AM. However, a few users do not seem to have taken many steps. There could be a few reasons for this: they may travel by car, work from home, be retired, start work later in the day, or have a job that does not require much walking.

Average Steps Taken and Calories burned between 11 AM to 5 PM

Those who are not very physically active earlier in the day do not seem to be very physically active in the afternoon either.

Average Steps Taken and Calories burned between 6 PM to 11 PM

As expected, most users seem to move around less in the evening, but a few users move a lot more during this time. One reason could be that they work an afternoon shift at their job.

Putting Everything Together

Here are some of the things we discovered during the Bellabeat case study:

  • Many participants live sedentary lifestyles
  • Some participants burn more calories while resting
  • Some participants burn more energy per step while walking/running
  • There is a positive correlation between physical activity and calories burned

What Are Some Trends in Smart Device Usage?

One trend from the list above is that many people with smart devices still live sedentary lifestyles. Over 30 days, most of the participants spent 60-70% of their time sitting or resting; some had even higher inactivity rates, and only a few participants were notably physically active. Another trend is that some participants burn more calories for every step taken; one factor that can cause this is higher body weight.

How Could These Trends Apply to Bellabeat Customers?

Since most people in the survey data live a sedentary lifestyle, we can remind customers to walk around a bit if they have been sitting for a certain amount of time. Hopefully, this introduces some form of physical activity when the customer is not able to get it through other means. The regression model showed a positive correlation between physical activity and calories burned, meaning that an increase in physical activity should increase calorie burn. Helping customers understand these correlations can help them make better-informed, healthier decisions.

How Could These Trends Help Influence Bellabeat’s Marketing Strategy?

If we know many people might be living a sedentary lifestyle, then awareness campaigns, combined with the Bellabeat app, can help more people become aware of their lifestyles. Many might know they live a sedentary lifestyle, but few know the quantifiable amount. If we show people that they are sitting or resting 60-70% of the time, then perhaps they will decide to make a change.

Remarks

This marks the end of the Bellabeat case study, but there are a few things I wanted to address.

Here are some of the things I’ve noticed:

  • There are rows where calories burned is 0 while the person is at rest
  • There are rows with high calorie burn but little to no recorded physical activity
  • Not every participant used their Fitbit for the same or a similar amount of time
  • It is not documented how METs and intensity are calculated or recorded
  • There is a lot of incomplete data, or data without a clear indication of what it measures

Full code available here: https://github.com/thecodingmango/case_study/tree/main/Google_Data_Analytics