What is the Bellabeat case study?
This is the Bellabeat case study project from the Google Data Analytics Specialization. The goal of this case study is to demonstrate the process of Ask, Prepare, Process, Analyze, Share, Act. These six phases are very important for data analysis because they ensure a consistent, repeatable process each time.
I have previously shared my experience with and opinion of the Google Data Analytics Specialization.
In this project, I am doing the Bellabeat case study by exploring trends in smart device usage and coming up with new insights for the marketing team. Smart devices have been growing in popularity in recent years, and many firms want to better capture the smart device market. Through the Bellabeat case study, we hope to identify trends in smart device usage.
Bellabeat Case Study: Scenario
Suppose that I was a junior data analyst working at Bellabeat, tasked with analyzing smart device fitness data to gain insight into new growth opportunities. Bellabeat is a successful small company, and Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that insights from analyzing smart device data could reveal new opportunities for the company.
Questions this Case Study Wants to Answer
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?
Understanding Our Data for the Bellabeat Case Study
Gathering Necessary Data
In order to analyze smart device usage, the data ideally needs to track user activities throughout the day. Some physical factors include: how many steps a user has walked, how far they have walked, how many calories were used by the physical activity, and how fast the user was traveling. Others are biological factors such as heart rate, weight, and BMI.
The Fitbit Fitness Tracker Data fits all the criteria for the type of data we need. Thirty Fitbit users gave consent to submit their personal smart device tracking data, including minute-level output for physical activity, heart rate, and sleep monitoring. These data are especially useful because they let us look at how different levels of physical activity affect calories burned and heart rate.
The data includes measurements such as steps taken, activity intensity, calories burned, and METS (Metabolic Equivalents), the amount of energy used compared to sitting at rest. Other measurements included with the data are sleep, weight, and BMI. For the sleep measurement, I am unsure what it is actually measuring.
Limitations of the Dataset
The CSV header is simply labeled value, and the column ranges from 1 to 3. Perhaps Fitbit has a feature that calculates sleep quality, and this may be what the value represents. To be on the safe side, it is better to exclude this data, since a misunderstanding of the data can lead to a wrong analysis and, ultimately, the wrong conclusion.
The weight and BMI data would have been nice to have, but unfortunately only a few people recorded their weight and BMI for the duration of the survey. There are not enough samples to use this data, so it will not be used in this analysis.
One of my concerns about this data is whether it includes enough diversity of users for us to extrapolate trends. Since the survey is mostly anonymous, other than the measurements from Fitbit we do not know factors such as age, diet, etc. This is a problem because the 30 people could have been selected from a similar group. Suppose 60-70% of the 30 participants engage in high levels of physical activity; because the sample is skewed toward one group of people, the results of the analysis could be biased.
The data is available on Kaggle.
Data Processing/Cleaning
Since the data is separated into different CSV files, using a DBMS (Database Management System) can be beneficial. SQL (Structured Query Language) allows the user to create and modify databases. The benefits of SQL are that it can query multiple tables in a database, insert or modify existing data, create new tables, and so on.
Another benefit of SQL is the speed at which it can query data. When the dataset is small, with only a few hundred thousand rows, something like Excel will do just fine. However, SQL excels when the dataset is large: with millions or even hundreds of millions of rows, it can query data much faster than Excel.
In this case study, I will use MySQL. Each CSV file can be represented as a table in the database, so instead of working with separate files, the data can be managed under a single database.
First, create a database named fitbit
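A minimal sketch of that step (only the database name fitbit comes from the text; the rest is standard MySQL):
-- Create the database that will hold the Fitbit tables, then switch to it
CREATE DATABASE fitbit;
USE fitbit;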
Next, we create the tables needed to store our data. In this case, I am using 4 datasets, so 4 tables will be created, each with the columns described below (a sketch of one table definition follows the list).
- id: Corresponds to the person doing the survey
- date_time: This is a string containing the date and time
- steps/mets/intensity/calories: The target measurement
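A minimal sketch of what one of these table definitions could look like, based on the columns listed above. The column types are assumptions, although the text later confirms that date_time starts out as a VARCHAR:
-- Example table definition for the steps data; the mets, intensity, and calories tables follow the same pattern
CREATE TABLE steps (
    id BIGINT,
    date_time VARCHAR(30),
    steps INT
);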
Here the CSV files are loaded into MySQL using LOAD DATA LOCAL INFILE.
The path to each CSV is an absolute path, in the form C:/Users/<username>/<path to file>.
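A minimal sketch of what loading one of the files could look like; the file name and path are placeholders, and LOCAL loading requires local_infile to be enabled on both the client and the server:
-- Load one CSV into its table; repeat for the mets, intensity, and calories files
-- (CSVs exported on Windows may need LINES TERMINATED BY '\r\n')
LOAD DATA LOCAL INFILE 'C:/Users/<username>/path/to/steps.csv'
INTO TABLE steps
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(id, date_time, steps);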
This is a snapshot of the steps table.
Everything looks fine except that the date format needs to be changed to 24-hour format. To convert the date_time column from 12-hour format to 24-hour format, we can use the STR_TO_DATE function in MySQL.
-- Update the date_time column for each of the tables
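-- Note: WHERE Id = Id matches every row with a non-NULL Id, so each UPDATE below rewrites the whole table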
UPDATE steps
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;
UPDATE mets
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;
UPDATE intensity
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;
UPDATE calories
SET date_time = str_to_date(date_time, '%m/%d/%Y %h:%i:%s %p')
WHERE Id = Id;
Before
After
Now we change the data type of the date_time column from VARCHAR to datetime.
-- Change the date_time column from VARCHAR to datetime datatype
ALTER TABLE steps modify column date_time datetime;
ALTER TABLE mets modify column date_time datetime;
ALTER TABLE intensity modify column date_time datetime;
ALTER TABLE calories modify column date_time datetime;
Since the measurements are in different tables, we can use LEFT JOIN to merge the tables together.
-- Merging the tables into one
SELECT steps.*, mets.mets, intensity.intensity, calories.calories
FROM steps
LEFT JOIN mets ON steps.id=mets.id
AND steps.date_time=mets.date_time
LEFT JOIN intensity ON steps.id=intensity.id
AND steps.date_time=intensity.date_time
LEFT JOIN calories ON steps.id=calories.id
AND steps.date_time=calories.date_time;
This is the resulting table, and I export this result to a CSV file.
Currently, our data is measured every minute, which can make it difficult to see any trends. By aggregating the data by 30 minutes, 1 hour, and daily, the data can show a fuller picture, since 1-minute data can differ significantly between intervals. Say a person suddenly runs 100 steps in one minute, and the next minute the number of steps drops to 0. By aggregating our data, we see smoother changes, but we lose the sensitivity that 1-minute data can show.
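The aggregation queries below read from a table called minute_merged, so one way to persist this merged result (a sketch; the original project may instead have exported the CSV directly from the MySQL client) is to materialize the join as a table:
-- Materialize the merged minute-level data as a table for the aggregation queries below
CREATE TABLE minute_merged AS
SELECT steps.*, mets.mets, intensity.intensity, calories.calories
FROM steps
LEFT JOIN mets ON steps.id = mets.id AND steps.date_time = mets.date_time
LEFT JOIN intensity ON steps.id = intensity.id AND steps.date_time = intensity.date_time
LEFT JOIN calories ON steps.id = calories.id AND steps.date_time = calories.date_time;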
-- Aggregate the data into hourly intervals by id and time
SELECT ANY_VALUE(id) AS id,
DATE_FORMAT(ANY_VALUE(date_time), '%Y-%m-%d %H:%00:00') AS time_stamp,
SUM(steps) AS hourly_steps,
AVG(mets) AS avg_hourly_mets,
AVG(intensity) AS avg_hourly_intensity,
SUM(calories) AS hourly_calories_burned
FROM minute_merged
GROUP BY id, ROUND(unix_timestamp(time_stamp)/(60*60));
-- Aggregate the data into daily intervals by id and time
SELECT ANY_VALUE(id) AS id, DATE_FORMAT(ANY_VALUE(date_time), '%Y-%m-%d') AS time_stamp,
SUM(steps) AS daily_steps,
AVG(mets) AS avg_daily_mets,
AVG(intensity) AS avg_daily_intensity,
SUM(calories) AS daily_calories_burned
FROM minute_merged
GROUP BY id, ROUND(unix_timestamp(time_stamp)/(24*60*60));
id | time_stamp | hourly_steps | avg_hourly_mets | avg_hourly_intensity | hourly_calories_burned |
---|---|---|---|---|---|
1503960366 | 2016-04-12 00:00:00 | 373 | 17.2333 | 0.3333 | 81.32409763336182 |
1503960366 | 2016-04-12 01:00:00 | 160 | 12.8333 | 0.1333 | 60.56049823760986 |
1503960366 | 2016-04-12 02:00:00 | 151 | 12.4667 | 0.1167 | 58.83019828796387 |
1503960366 | 2016-04-12 03:00:00 | 0 | 10.0000 | 0.0000 | 47.189998626708984 |
1503960366 | 2016-04-12 04:00:00 | 0 | 10.0667 | 0.0000 | 47.50459861755371 |
1503960366 | 2016-04-12 05:00:00 | 0 | 10.0667 | 0.0000 | 47.50459861755371 |
1503960366 | 2016-04-12 06:00:00 | 0 | 10.0667 | 0.0000 | 47.50459861755371 |
1503960366 | 2016-04-12 07:00:00 | 0 | 10.0333 | 0.0000 | 47.34729862213135 |
1503960366 | 2016-04-12 08:00:00 | 250 | 14.4333 | 0.2167 | 68.1108980178833 |
1503960366 | 2016-04-12 09:00:00 | 1864 | 29.8500 | 0.5000 | 140.86214590072632 |
Data Analysis for the Bellabeat Case Study
Now that we have the data in minute, hourly, and daily formats for our Bellabeat case study, it is time to start the analysis process. In the Google Data Analytics course, the analysis was done in the R programming language. However, I will use Python, since it arrives at the same results with a slightly different approach.
This is the planned structure of the Python project folder:
Project Folder
    data/                   (data needed for analysis)
    visualizations/         (graphs)
    linear regression/      (regression models)
    misc/                   (helper functions)
    analysis.py
    data_visualization.py
- data folder: contains the 3 aggregated data CSVs
- visualizations folder: contains all the graphs needed for showing trends
- linear regression folder: contains the regression model
- misc: includes any miscellaneous helper functions
- analysis.py: includes the Regression model
- data_visualization.py: plots all the data needed
First, let's import the data from the CSVs into Python. For this, I will use the pandas library to read each CSV into a data frame.
# Loading the required libraries
import pandas as pd
import numpy as np
# Import the data from csv to dataframe
minute_data = pd.read_csv('data/minute_measurement_merged.csv')
hourly_data = pd.read_csv('data/hourly_measurement_merged.csv')
daily_data = pd.read_csv('data/daily_measurement_merged.csv')
Now that the data is loaded into Python, we can generate some summary statistics to see what the data is telling us. Since the first column of the data is the id, it is not very useful for summary statistics, so it can be excluded.
# Generate summary statistic of our data
minute_data.iloc[:, 1:].describe()
hourly_data.iloc[:, 1:].describe()
daily_data.iloc[:, 1:].describe()
Summary statistics for the minute measurement data.
steps mets intensity calories
count 1.325580e+06 1.325580e+06 1.325580e+06 1.325580e+06
mean 5.336192e+00 1.469001e+01 2.005937e-01 1.623130e+00
std 1.812830e+01 1.205541e+01 5.190227e-01 1.410447e+00
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 0.000000e+00 1.000000e+01 0.000000e+00 9.357000e-01
50% 0.000000e+00 1.000000e+01 0.000000e+00 1.217600e+00
75% 0.000000e+00 1.100000e+01 0.000000e+00 1.432700e+00
max 2.200000e+02 1.570000e+02 3.000000e+00 1.974990e+01
Note: These are summary statistics that include ALL 33 of the survey participants, but that does not necessarily mean every person has the same statistics. On average, across the whole sample, participants walk or run 5.33 steps per minute. Does that imply everyone walks or runs an average of 5.33 steps per minute? Not necessarily: some could walk more than 5.33 steps/min and some less. It is the average across all participants that comes out to 5.33 steps/min.
Let's take a look at the data at the individual level, starting with the minute measurements.
minute_data.groupby('id', as_index=False).agg({'steps': 'mean',
'mets': 'mean',
'intensity': 'mean',
'calories': 'mean'})
id steps mets intensity calories
0 1503960366 8.706323 16.668550 0.269503 1.308458
1 1624580081 4.025136 12.522011 0.133990 1.040579
2 1644430081 5.130108 14.107674 0.175330 1.982551
3 1844505072 1.822663 11.877975 0.083698 1.111422
4 1927972279 0.643116 10.648279 0.030956 1.524636
5 2022484408 7.960249 16.955374 0.283628 1.759459
6 2026352035 3.896467 13.936843 0.180208 1.080245
7 2320127002 3.311451 13.162993 0.145714 1.210995
8 2347167796 6.897625 15.720491 0.242029 1.478669
9 2873212765 5.301857 15.285077 0.251698 1.338597
10 3372868164 4.839583 14.573023 0.256321 1.363015
11 3977333714 7.860249 15.451461 0.253807 1.085774
12 4020332650 1.557354 12.198224 0.072632 1.677744
13 4057192912 2.898674 12.053030 0.081629 1.486862
14 4319703577 4.821915 14.082896 0.188513 1.423378
15 4388161847 7.250907 16.375669 0.238526 2.135878
16 4445114986 3.373010 13.387290 0.163302 1.535782
17 4558609924 5.387432 15.350226 0.240829 1.426141
18 4702921684 6.054697 14.958869 0.215527 2.095438
19 5553957443 6.093744 14.507443 0.214064 1.327167
20 5577150313 5.850259 18.742255 0.331591 2.373143
21 6117666160 4.688712 14.730025 0.209015 1.530892
22 6290855005 4.102206 13.192632 0.176667 1.887752
23 6775888955 1.786093 11.841831 0.072896 1.513978
24 6962181067 6.913046 15.590255 0.247928 1.399325
25 7007744171 8.164420 16.953910 0.293012 1.833057
26 7086361926 6.537131 15.938904 0.226057 1.803806
27 8053475328 10.373311 17.020317 0.299161 2.070390
28 8253242879 4.750116 13.226721 0.151779 1.311297
29 8378563200 6.121487 16.076762 0.247999 2.414050
30 8583815059 4.090065 13.150534 0.152182 1.844362
31 8792009665 1.333085 12.026265 0.073909 1.410320
32 8877689391 11.243551 19.667984 0.318256 2.399803
Here we can see that not everyone has the same average number of steps taken, METS, intensity, and calories burned. We can also see that some of the participants have more sedentary lifestyles, while other participants are much more active than the sample mean. This potentially means we have a fairly well-distributed sample of users: users that are very physically active and users that are less physically active.
Summary statistics for the hourly measurement data
hourly_steps avg_hourly_mets avg_hourly_intensity hourly_calories_burned
count 22093.000000 22093.000000 22093.000000 22093.000000
mean 320.171502 14.690011 0.200595 97.387771
std 690.467256 8.342430 0.352260 60.700980
min 0.000000 10.000000 0.000000 42.162001
25% 0.000000 10.000000 0.000000 63.172731
50% 40.000000 11.216700 0.050000 82.523997
75% 357.000000 16.250000 0.266700 108.088200
max 10554.000000 129.300000 3.000000 948.493082
Summary statistics for the daily measurement data
daily_steps avg_daily_mets avg_daily_intensity daily_calories_burned
count 934.000000 934.000000 934.000000 934.000000
mean 7573.392934 14.663957 0.199417 2303.627430
std 5100.939182 2.903262 0.115348 693.567613
min 0.000000 10.000000 0.000000 64.872000
25% 3709.000000 12.710225 0.122375 1828.246126
50% 7362.000000 14.696200 0.210400 2124.219920
75% 10692.250000 16.406750 0.277100 2781.955238
max 36019.000000 25.775700 0.627800 4552.352890
Finding Relationships Between Variables
Now that we have some insight into what the data looks like and how it is distributed, we can take a look at the effects the variables have on each other. Suppose we are interested in how the number of steps taken, METS, and intensity affect the number of calories consumed. Since the response variable in this case is continuous, we can run a simple linear regression to explore these relationships.
# Import the stats model library for linear regression
import statsmodels
import statsmodels.api as sm
# Create a function that fits the linear regression model
def lm_model(response_variable, explanatory_variable):
    """
    Takes in data from a data frame and fits a linear regression model
    :param explanatory_variable: All the explanatory variables used to explain the result
    :param response_variable: The response data that results from the explanatory data
    :return: Fitted linear regression model
    """
    # Add an intercept term to the explanatory variables
    explanatory_variable = sm.add_constant(explanatory_variable)
    # Fit the linear regression model
    lm = sm.OLS(response_variable, explanatory_variable).fit()
    return lm
# Create a response and explanatory variable
response = hourly_data['hourly_calories_burned']
explanatory = hourly_data.iloc[:, [2, 3, 4]]
# Create a linear regression model using grouped hourly data
lm_hourly = lm_model(response, explanatory)
lm_hourly.summary()
Let’s take a look at the result from the regression model.
>>> lm_hourly.summary()
<class 'statsmodels.iolib.summary.Summary'>
OLS Regression Results
==================================================================================
Dep. Variable: hourly_calories_burned R-squared: 0.875
Model: OLS Adj. R-squared: 0.875
Method: Least Squares F-statistic: 5.158e+04
Date: Tue, 27 Dec 2022 Prob (F-statistic): 0.00
Time: 12:21:01 Log-Likelihood: -99084.
No. Observations: 22093 AIC: 1.982e+05
Df Residuals: 22089 BIC: 1.982e+05
Df Model: 3
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -27.4842 0.860 -31.976 0.000 -29.169 -25.799
hourly_steps -0.0122 0.000 -24.590 0.000 -0.013 -0.011
avg_hourly_mets 9.2980 0.083 111.533 0.000 9.135 9.461
avg_hourly_intensity -38.9427 1.893 -20.572 0.000 -42.653 -35.232
==============================================================================
Independent Variables (Explanatory Variables): hourly steps taken, average hourly METS, average hourly intensity
Dependent Variable (Response Variable): calories consumed per hour
The adjusted R-squared in the summary suggests that our model explains 87.5% of the variation in hourly calories consumed, which suggests the model fits the data fairly well. The P>|t| column shows that the independent variables are all below the 0.05 (5%) significance level, meaning the predictors have a meaningful impact on the number of calories consumed. There are other common significance levels, but I believe a 5% significance level is sufficient for this case study.
Now let's look at the coefficients our model is suggesting and the effect each variable has on the number of calories consumed. The coefficient on hourly steps is negative, implying that the more steps taken every hour, the fewer calories consumed. However, this does not make sense: we know we use energy to walk or run, so how could walking or running consume less energy? We expect a positive correlation between steps and calories consumed, but our model is suggesting a negative one.
Uh oh, there is something wrong with our model. Let's take a look at the correlations between our variables to see what is going on.
As it turns out, the problem is that all our explanatory variables are highly correlated with each other. This suggests that our linear regression model might be affected by multicollinearity. Multicollinearity happens when the independent variables, in our case the number of steps taken per hour, hourly METS, and hourly intensity, are highly correlated with each other.
We can check whether there actually is multicollinearity in our model with the VIF (Variance Inflation Factor). For each predictor, the VIF measures how much the variance of its coefficient estimate is inflated compared to a model with no multicollinearity; it is computed as 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors.
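A minimal sketch of that correlation check with pandas, assuming the hourly_data frame loaded earlier:
# Correlation matrix of the explanatory variables and the response
corr_matrix = hourly_data[['hourly_steps', 'avg_hourly_mets',
                           'avg_hourly_intensity', 'hourly_calories_burned']].corr()
print(corr_matrix)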
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif(lm):
    """
    Takes in a fitted linear regression model and outputs the Variance Inflation Factor of each variable
    """
    # The design matrix (explanatory variables, including the constant) used in the model
    lm_var = lm.model.exog
    # For each variable in the model, calculate its VIF value
    variance_factor = [variance_inflation_factor(lm_var, i) for i in range(lm_var.shape[1])]
    return variance_factor
# Check the VIF values from the regression model for hourly data
vif_lm_hourly = vif(lm_hourly)
vif_lm_hourly
[35.455352022352976, 5.6262326176546145, 23.211170850113135, 21.339431987542408]
Generally, what the numbers mean is:
- VIF = 1: no correlation
- VIF around 5: moderate correlation
- VIF >= 10: serious correlation
From the result of vif_lm_hourly, the first number is the VIF of the constant, so we can ignore it. The second number is the VIF of hourly_steps, which is more or less moderately correlated. avg_hourly_mets and avg_hourly_intensity have VIFs of 23.21 and 21.34 respectively; from the list above, a VIF >= 10 indicates serious correlation between variables.
To solve this problem, we can do something simple like removing one of the variables with a high VIF. Let's try removing avg_hourly_mets, since it has the highest VIF.
# Excludes the avg_hourly_mets in the explanatory variables
explanatory = hourly_data.iloc[:, [2, 4]]
# Create a linear regression model using grouped hourly data
lm_hourly = lm_model(response, explanatory)
lm_hourly.summary()
>>> lm_hourly.summary()
<class 'statsmodels.iolib.summary.Summary'>
OLS Regression Results
==================================================================================
Dep. Variable: hourly_calories_burned R-squared: 0.805
Model: OLS Adj. R-squared: 0.805
Method: Least Squares F-statistic: 4.552e+04
Date: Tue, 27 Dec 2022 Prob (F-statistic): 0.00
Time: 19:30:29 Log-Likelihood: -1.0402e+05
No. Observations: 22093 AIC: 2.080e+05
Df Residuals: 22090 BIC: 2.081e+05
Df Model: 2
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 66.5594 0.209 319.165 0.000 66.151 66.968
hourly_steps 0.0052 0.001 8.809 0.000 0.004 0.006
avg_hourly_intensity 145.4090 1.154 126.037 0.000 143.148 147.670
==============================================================================
The new model without the avg_hourly_mets variable has an adjusted R-squared that is lower by 0.07. However, the coefficients of the variables now make more sense: when a person travels more, they should consume more energy, and an increase in activity intensity also increases calories consumed. While dropping a variable is one way to handle multicollinearity, it can remove potentially useful information.
Another way to handle multicollinearity is ridge regression, which is similar to ordinary linear regression but adds an L2 penalty (regularization) on the size of the coefficients.
def lm_regularized(response_variable, explanatory_variable, l1, alpha):
    """
    Linear regression model with regularization
    """
    explanatory_variable = sm.add_constant(explanatory_variable)
    # Create the linear regression model
    lm = sm.OLS(response_variable, explanatory_variable)
    lm_norm = lm.fit()
    # Fit the regression model with regularization (l1=0 gives ridge regression)
    lm_reg = lm.fit_regularized(L1_wt=l1, alpha=alpha, start_params=lm_norm.params)
    # Wrap the regularized parameters in an OLSResults object so .summary() works
    lm_reg = sm.regression.linear_model.OLSResults(lm, lm_reg.params, lm.normalized_cov_params)
    return lm_reg
# Ridge regression, using all three explanatory variables again
explanatory = hourly_data.iloc[:, [2, 3, 4]]
lm_reg = lm_regularized(response, explanatory, l1=0, alpha=5)
lm_reg.summary()
>>> lm_reg.summary()
<class 'statsmodels.iolib.summary.Summary'>
OLS Regression Results
==================================================================================
Dep. Variable: hourly_calories_burned R-squared: 0.866
Model: OLS Adj. R-squared: 0.866
Method: Least Squares F-statistic: 4.760e+04
Date: Tue, 27 Dec 2022 Prob (F-statistic): 0.00
Time: 21:57:47 Log-Likelihood: -99855.
No. Observations: 22093 AIC: 1.997e+05
Df Residuals: 22089 BIC: 1.998e+05
Df Model: 3
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 0.2806 0.890 0.315 0.753 -1.464 2.025
hourly_steps 0.0018 0.001 3.432 0.001 0.001 0.003
avg_hourly_mets 6.4752 0.086 75.007 0.000 6.306 6.644
avg_hourly_intensity 0.0981 1.960 0.050 0.960 -3.744 3.940
==============================================================================
Here we can see that the adjusted R-squared is better than in the model that dropped one of the highly correlated variables. We also see positive coefficients for hourly_steps and avg_hourly_intensity. However, avg_hourly_intensity is now statistically insignificant, so we are uncertain of the effect it has on hourly_calories_burned.
What does this result tell us? It suggests a few things. First, on an hourly basis, every step a participant takes appears to consume about 0.0018 calories. Second, every one-unit increase in METS is associated with the participant consuming about 6.4752 more calories per hour. Finally, every one-unit increase in activity intensity is associated with about 0.0981 more calories per hour; however, since this effect is not statistically significant, we are uncertain of the exact effect it has on calorie consumption.
Insights from the Data Through Visualization
In this section, we will take a look at our data through visualizations. We have seen the numbers, but what do they mean?
Here we have a plot of the aggregated data for the total number of steps taken during the survey.
Total Number of Steps Taken for the Duration of the Survey
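The plots in this section were produced by data_visualization.py; as a rough illustration, a per-user bar chart like the one above could be generated with matplotlib along these lines (matplotlib and the exact styling are assumptions, since the original plotting code is not shown here):
import matplotlib.pyplot as plt

# Total steps taken by each participant over the duration of the survey
total_steps = daily_data.groupby('id')['daily_steps'].sum()

plt.figure(figsize=(12, 6))
plt.bar(range(len(total_steps)), total_steps.values)
plt.xticks(range(len(total_steps)), total_steps.index, rotation=90)
plt.xlabel('Participant id')
plt.ylabel('Total steps taken')
plt.title('Total Number of Steps Taken for the Duration of the Survey')
plt.tight_layout()
# Save into the visualizations folder from the project structure above
plt.savefig('visualizations/total_steps_per_user.png')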
Average Calories Consumed Per Step
Here we can take a look at how many calories are consumed per step by each user.
Average METS Per Minute
We can see from the graph that some users are more physically active than others. Let's take a look at some of the visualizations using minute-interval data.
Average Physical Intensity Per Minute
Average Calories Consumed Per Minute
Minimum Calories Consumed at Rest
There is an interesting situation here: some participants have higher average METS and activity intensity levels compared to others, yet their calorie consumption is lower than that of participants with lower average activity intensities. What is going on here?
As it turns out, some of the participants have higher calorie consumption at rest.
Fitbit Usage
How come some Fitbit users have taken significantly fewer steps?
Over the 30 days, users 9, 11, 14, and 29 used their Fitbit less than the other users. This suggests that our data is not balanced, because not all users have a similar amount of usage time. We have to be careful when interpreting some of these results because they do not show the full picture.
Percentage of No Activity over 30 Days
Now that we have some idea of our users, let's check the amount of time each user has an activity intensity of 0 and METS of 0. An activity intensity of 0 and METS of 0 indicates no physical activity.
We can see that only a few participants are very active, spending about 50% of their time sedentary. Most of the users show no activity around 60%-70% of the time, and a few users are above 80%.
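A minimal sketch of how this percentage could be computed from the minute-level data, following the definition above and assuming the minute_data frame loaded earlier:
# Flag minutes with no recorded activity, per the definition above (intensity and METS both 0)
inactive = (minute_data['intensity'] == 0) & (minute_data['mets'] == 0)

# Share of inactive minutes per participant, expressed as a percentage
pct_no_activity = inactive.groupby(minute_data['id']).mean() * 100
print(pct_no_activity.round(1))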
Data Visualization using Power BI
Besides using Python to produce these visualizations, I also used Power BI to create some additional ones. Let's take a look at the relationship between the number of steps taken and the number of calories burned, grouped by time of day, using a stacked column chart.
Average Steps Taken and Calories burned between 12 AM to 6 AM
From the two images above, it seems like a few users do some sort of morning activity around 5 AM and 6 AM; their number of steps and calories consumed is much greater than that of other users. It is also interesting to note that there are still many people awake from around 12 AM to 1 AM. One user, in particular, seems to be awake around 4 AM quite a bit.
Average Steps Taken and Calories burned between 7 AM to 10 AM
The reason I chose this time range is that it is usually the morning rush; people working day shifts usually start commuting to work around this time, so this range lets us see how people commute. As we can see from the figure above, people are more physically active between 7 AM and 8 AM. However, a few users do not seem to have taken many steps. There could be a few reasons for this: they may be traveling by car, working from home, retired, or starting work later in the day. It could also mean that their job does not require a lot of walking.
Average Steps Taken and Calories burned between 11 AM to 5 PM
Those who are not very physically active earlier in the day do not seem to be very physically active in the afternoon either.
Average Steps Taken and Calories burned between 6 PM to 11 PM
As expected, most users seem to move around less in the evening, but there are a few users that seem to move a lot more in the evening. One of the reasons could be that they are working the afternoon shift for their job.
Putting Everything Together
Here are some of the things we have discovered during the Bellabeat case study:
- Many participants are living sedentary lifestyles
- Some participants consume more calories while resting
- Some participants consume more energy while walking/running
- Positive correlation between physical activity and calories consumed
What Are Some Trends in Smart Device Usage?
One trend from the list above is that many people with smart devices still live sedentary lifestyles. Over the 30 days, most of the participants spent 60-70% of their time sitting or resting, some even had higher inactivity rates, and only a few participants were notably more physically active. Another observation is that some participants consume more calories for every step taken; one factor that can cause this is a higher body weight.
How Could These Trends Apply to Bellabeat Customers?
Since the survey data shows that most people live a sedentary lifestyle, we can remind customers to walk around a bit if they have been sitting for a certain amount of time. Hopefully, this can introduce some form of physical activity when the customer is not able to get it through other means. The regression model showed a positive correlation between physical activity and calorie consumption, meaning that an increase in physical activity should increase calories consumed. Helping customers understand these correlations can help them make healthier, better-informed decisions.
How Could These Trends Help Influence Bellabeat Marketing Strategy?
If we know many people might be living a sedentary lifestyle, then awareness campaigns, combined with the Bellabeat app, can help more people become aware of their lifestyles. Many might know they live a sedentary lifestyle, but they do not know the quantifiable amount. If we show people that they are sitting or resting 60-70% of the time, then perhaps they will decide to make a change.
Remarks
This marks the end of the Bellabeat case study, but there are a few things I want to address.
Here are some of the things I’ve noticed:
- There are rows where calories consumed is 0 while the person is at rest
- There are rows with high calorie consumption but little to no physical activity recorded
- Not every participant used their Fitbit for the same or a similar amount of time
- It is not clear how METS and intensity are calculated or how they are recorded
- There is a lot of incomplete data, or data without a clear indication of what it is measuring
Full code available here: https://github.com/thecodingmango/case_study/tree/main/Google_Data_Analytics