This is one of the project reports in my portfolio on data analysis with R.
The aim of the project is to find the best markets in which to advertise e-learning products.
The skills covered are data extraction, data cleaning, data visualization, data analysis, and strategy proposal.
In this project, we’ll aim to find the two best markets to advertise our product in — we’re working for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc.
To avoid spending money on organizing a survey, we’ll first try to make use of existing data to determine whether we can reach any reliable result.
One good candidate for our purpose is freeCodeCamp’s 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in this GitHub repository. Below, we’ll do a quick exploration of the 2017-fCC-New-Coders-Survey-Data.csv file stored in the clean-data folder of the repository we just mentioned.
We’ll read in the file with read_csv() from the readr package; the call below assumes a copy of the CSV in the working directory.
library(readr)
fcc <- read_csv("2017-fCC-New-Coders-Survey-Data.csv")
## Rows: 18175 Columns: 136
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (27): BootcampName, CityPopulation, CodeEventOther, CommuteTime, Count...
## dbl (105): Age, AttendedBootcamp, BootcampFinish, BootcampLoanYesNo, Bootca...
## dttm (4): Part1EndTime, Part1StartTime, Part2EndTime, Part2StartTime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(fcc)
## [1] 18175 136
head(fcc, 5)
## # A tibble: 5 × 136
## Age Attend…¹ Bootc…² Bootc…³ Bootc…⁴ Bootc…⁵ Child…⁶ CityP…⁷ CodeE…⁸ CodeE…⁹
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 27 0 NA NA <NA> NA NA more t… NA NA
## 2 34 0 NA NA <NA> NA NA less t… NA NA
## 3 21 0 NA NA <NA> NA NA more t… NA NA
## 4 26 0 NA NA <NA> NA NA betwee… NA NA
## 5 20 0 NA NA <NA> NA NA betwee… NA NA
## # … with 126 more variables: CodeEventFCC <dbl>, CodeEventGameJam <dbl>,
## # CodeEventGirlDev <dbl>, CodeEventHackathons <dbl>, CodeEventMeetup <dbl>,
## # CodeEventNodeSchool <dbl>, CodeEventNone <dbl>, CodeEventOther <chr>,
## # CodeEventRailsBridge <dbl>, CodeEventRailsGirls <dbl>,
## # CodeEventStartUpWknd <dbl>, CodeEventWkdBootcamps <dbl>,
## # CodeEventWomenCode <dbl>, CodeEventWorkshops <dbl>, CommuteTime <chr>,
## # CountryCitizen <chr>, CountryLive <chr>, EmploymentField <chr>, …
As we mentioned in the introduction, most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. For the purpose of our analysis, we want to answer questions about a population of new coders that are interested in the subjects we teach. We’d like to know where these new coders are located, which locations have the greatest number of new coders, and how much money new coders are willing to spend on learning.
So we first need to clarify whether the data set has the right categories of people for our purpose. The JobRoleInterest column describes for every participant the role(s) they’d be interested in working in. If a participant is interested in working in a certain domain, it means that they’re also interested in learning about that domain. So let’s take a look at the frequency distribution table of this column [1] and determine whether the data we have is relevant.
#split-and-combine workflow
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fcc %>%
group_by(JobRoleInterest) %>%
summarise(freq = n()*100/nrow(fcc)) %>%
arrange(desc(freq))
## # A tibble: 3,212 × 2
## JobRoleInterest freq
## <chr> <dbl>
## 1 <NA> 61.5
## 2 Full-Stack Web Developer 4.53
## 3 Front-End Web Developer 2.48
## 4 Data Scientist 0.836
## 5 Back-End Web Developer 0.781
## 6 Mobile Developer 0.644
## 7 Game Developer 0.627
## 8 Information Security 0.506
## 9 Full-Stack Web Developer, Front-End Web Developer 0.352
## 10 Front-End Web Developer, Full-Stack Web Developer 0.308
## # … with 3,202 more rows
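The information in the table above is quite granular, but from a quick scan it looks like web development roles (full-stack, front-end, and back-end) attract the most interest, while mobile development and domains such as data science, game development, and information security each account for a much smaller share.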
It’s also interesting to note that many respondents are interested in more than one subject. It’d be useful to get a better picture of how many people are interested in a single subject and how many have mixed interests. Consequently, in the next code block, we’ll drop the missing values [2], split each string in the JobRoleInterest column to find the number of options for each participant [3], and generate a frequency table for the number of options.
# Split each string in the 'JobRoleInterest' column
splitted_interests <- fcc %>%
  select(JobRoleInterest) %>%
  tidyr::drop_na() %>%
  # dplyr verbs operate on whole columns by default; rowwise() makes mutate() work one row at a time
  rowwise() %>%
  mutate(opts = length(stringr::str_split(JobRoleInterest, ",")[[1]]))
# Alternative implementation that avoids rowwise(), so ungroup() is not needed later:
# mutate(opts = unlist(purrr::map(JobRoleInterest, function(x) length(stringr::str_split(x, ",")[[1]]))))
# Frequency table for the variable describing the number of options
n_of_options <- splitted_interests %>%
  ungroup() %>% # needed because we used rowwise() above
  group_by(opts) %>%
  summarize(freq = n() * 100 / nrow(splitted_interests))
n_of_options
## # A tibble: 13 × 2
## opts freq
## <int> <dbl>
## 1 1 31.7
## 2 2 10.9
## 3 3 15.9
## 4 4 15.2
## 5 5 12.0
## 6 6 6.72
## 7 7 3.86
## 8 8 1.76
## 9 9 0.987
## 10 10 0.472
## 11 11 0.186
## 12 12 0.300
## 13 13 0.0286
It turns out that only 31.65% of the participants have a clear idea about what programming niche they’d like to work in, while most respondents have mixed interests. But given that we offer courses on various subjects, the fact that new coders have mixed interests might actually be good for us.
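The option count can also be computed without rowwise(), because splitting and counting are both vectorised in base R. The sketch below is only an illustration on a small made-up vector (`interests` is hypothetical, not survey data); it splits on commas in the same way as the code above.

```r
# Hypothetical responses in the same comma-separated format as JobRoleInterest
interests <- c(
  "Full-Stack Web Developer",
  "Front-End Web Developer, Mobile Developer",
  "Data Scientist, Game Developer, Back-End Web Developer",
  NA
)

# Drop missing answers, then count the comma-separated options in each string
answered <- interests[!is.na(interests)]
opts <- lengths(strsplit(answered, ",", fixed = TRUE))

# Percentage of respondents for each number of options
opts_freq <- prop.table(table(opts)) * 100
opts_freq
```

On the real data, the same two steps applied to fcc$JobRoleInterest should give the same kind of table as n_of_options.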
The focus of our courses is on web and mobile development, so let’s find out how many respondents chose at least one of these two options.
# Frequency table of respondents who mention web or mobile development
# (table() drops NA by default, so the percentages are out of those who answered)
web_or_mobile <- stringr::str_detect(fcc$JobRoleInterest, "Web Developer|Mobile Developer")
freq_table <- table(web_or_mobile)
freq_table <- freq_table * 100 / sum(freq_table)
freq_table
## web_or_mobile
## FALSE TRUE
## 13.75858 86.24142
# Graph for the frequency table above
df <- tibble::tibble(x = c("Other Subject", "Web or Mobile Development"),
                     y = as.numeric(freq_table))
library(ggplot2)
ggplot(data = df, aes(x = x, y = y, fill = x)) +
  geom_col()
It turns out that most people in this survey (roughly 86%) are interested in either web or mobile development. These figures offer us a strong reason to consider this sample representative for our population of interest. We want to advertise our courses to people interested in all sorts of programming niches but mostly web and mobile development.
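One detail is worth spelling out: stringr::str_detect() returns NA for a missing answer, and table() drops NA by default, so the 86% is a share of the respondents who answered the question, not of everyone in the survey. Here is a minimal illustration on made-up values (the vector `roles` is hypothetical), using base R's grepl() as a stand-in for str_detect():

```r
# Hypothetical answers; NA marks respondents who skipped the question
roles <- c("Full-Stack Web Developer", "Data Scientist", NA,
           "Mobile Developer, Game Developer", NA)

# grepl() returns FALSE for NA input, so restore the NAs explicitly
# to mirror the behaviour of stringr::str_detect()
matches <- grepl("Web Developer|Mobile Developer", roles)
matches[is.na(roles)] <- NA

# Share among those who answered (2 of 3) versus share among everyone (2 of 5)
share_of_answered <- mean(matches, na.rm = TRUE) * 100
share_of_all <- sum(matches, na.rm = TRUE) / length(roles) * 100
c(answered = share_of_answered, all = share_of_all)
```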
Now we need to figure out what are the best markets to invest money in for advertising our courses. We’d like to know:
One indicator of a good market is the number of potential customers — the more potential customers in a market, the better. If our ads manage to convince 10% of the 5,000 potential customers in market A to buy our product, then this is better than convincing 100% of the 30 potential customers in market B.
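The arithmetic behind that comparison is simple enough to check directly. The variable names below are made up for illustration; the figures are the ones quoted in the paragraph above.

```r
# Figures from the example in the text
customers_a <- 5000
conversion_a <- 0.10
customers_b <- 30
conversion_b <- 1.00

# Expected number of buyers in each market
buyers_a <- customers_a * conversion_a
buyers_b <- customers_b * conversion_b
c(market_A = buyers_a, market_B = buyers_b)
```

Market A yields 500 buyers against 30 in market B, even though its conversion rate is ten times lower.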
Let’s begin by finding out where these new coders are located and what the density (the number of new coders) is for each location. This should be a good start for finding the two best markets to run our ad campaign in.
The data set provides information about the location of each participant at a country level. We can think of each country as an individual market, so we can frame our goal as finding the two best countries to advertise in.
We can start by examining the frequency distribution table of the CountryLive variable, which describes what country each participant lives in (not their origin country). We’ll only consider those participants who answered what role(s) they’re interested in, to make sure we work with a representative sample.
# Isolate the participants that answered what role they'd be interested in
fcc_good <- fcc %>%
tidyr::drop_na(JobRoleInterest)
# Frequency tables with absolute and relative frequencies
# Display the frequency tables in a more readable format
fcc_good %>%
group_by(CountryLive) %>%
summarise(`Absolute frequency` = n(),
`Percentage` = n() * 100 / nrow(fcc_good) ) %>%
arrange(desc(Percentage))
## # A tibble: 138 × 3
## CountryLive `Absolute frequency` Percentage
## <chr> <int> <dbl>
## 1 United States of America 3125 44.7
## 2 India 528 7.55
## 3 United Kingdom 315 4.51
## 4 Canada 260 3.72
## 5 <NA> 154 2.20
## 6 Poland 131 1.87
## 7 Brazil 129 1.84
## 8 Germany 125 1.79
## 9 Australia 112 1.60
## 10 Russia 102 1.46
## # … with 128 more rows
44.7% of our potential customers are located in the US, and this definitely seems like the most interesting market. India has the second-highest customer density, but it’s just 7.55%, which is not too far from the United Kingdom (4.51%) or Canada (3.72%).
This is useful information, but we need to go more in depth than this and figure out how much money people are actually willing to spend on learning. Advertising in high-density markets where most people are only willing to learn for free is extremely unlikely to be profitable for us.
The MoneyForLearning column describes in American dollars the amount of money spent by participants from the moment they started coding until the moment they completed the survey. Our company sells subscriptions at a price of $59 per month, and for this reason we’re interested in finding out how much money each student spends per month.
We’ll narrow down our analysis to only four countries: the US, India, the United Kingdom, and Canada. We do this for two reasons:
Let’s start by creating a new column that describes the amount of money a student has spent per month so far. To do that, we’ll need to divide the MoneyForLearning column by the MonthsProgramming column. The problem is that some students answered that they have been learning to code for 0 months (it might be that they have just started). To avoid dividing by 0, we’ll replace 0 with 1 in the MonthsProgramming column.
# Replace 0s with 1s to avoid division by 0
fcc_good <- fcc_good %>%
mutate(MonthsProgramming = replace(MonthsProgramming, MonthsProgramming == 0, 1) )
# New column for the amount of money each student spends each month
fcc_good <- fcc_good %>%
mutate(money_per_month = MoneyForLearning/MonthsProgramming)
fcc_good %>%
summarise(na_count = sum(is.na(money_per_month)) ) %>%
pull(na_count)
## [1] 675
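It may help to see why the zero replacement matters. R does not raise an error on division by zero; it returns Inf (or NaN for 0/0), and a single such value would distort any mean computed later. The vectors below are hypothetical values, not survey rows.

```r
# Hypothetical values, not taken from the survey
money <- c(100, 0, 300, NA)
months <- c(0, 0, 6, 2)

# Without the guard: 100/0 is Inf, 0/0 is NaN, 300/6 is 50, NA/2 is NA
naive <- money / months

# With the guard used above: treat 0 months as 1 month
months_fixed <- replace(months, months == 0, 1)
guarded <- money / months_fixed
guarded
```

The remaining NA comes from a missing MoneyForLearning value, which is the kind of row counted in the NA check above.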
Let’s keep only the rows that don’t have NA values for the money_per_month column.
# Keep only the rows with non-NAs in the `money_per_month` column
fcc_good <- fcc_good %>% tidyr::drop_na(money_per_month)
We want to group the data by country, and then measure the average amount of money that students spend per month in each country. First, let’s remove the rows having NA values for the CountryLive column, and check out if we still have enough data for the four countries that interest us.
# Remove the rows with NA values in 'CountryLive'
fcc_good <- fcc_good %>% tidyr::drop_na(CountryLive)
# Frequency table to check if we still have enough data
fcc_good %>% group_by(CountryLive) %>%
summarise(freq = n() ) %>%
arrange(desc(freq)) %>%
head()
## # A tibble: 6 × 2
## CountryLive freq
## <chr> <int>
## 1 United States of America 2933
## 2 India 463
## 3 United Kingdom 279
## 4 Canada 240
## 5 Poland 122
## 6 Germany 114
This should be enough, so let’s compute the average value spent per month in each country by a student. We’ll compute the average using the mean.
# Mean sum of money spent by students each month
countries_mean <- fcc_good %>%
filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
group_by(CountryLive) %>%
summarize(mean = mean(money_per_month)) %>%
arrange(desc(mean))
countries_mean
## # A tibble: 4 × 2
## CountryLive mean
## <chr> <dbl>
## 1 United States of America 228.
## 2 India 135.
## 3 Canada 114.
## 4 United Kingdom 45.5
The results for the United Kingdom and Canada are a bit surprising relative to the values we see for India. If we considered a few socio-economical metrics (like GDP per capita), we’d intuitively expect people in the UK and Canada to spend more on learning than people in India.
It might be that we don’t have enough representative data for the United Kingdom and Canada, or we have some outliers (maybe coming from wrong survey answers) making the mean too large for India, or too low for the UK and Canada. Or it might be that the results are correct.
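To see how strongly a single outlier can pull the mean, consider a small made-up sample (the `spend` vector is hypothetical, not survey data). The median is shown only for contrast; the analysis here keeps using the mean.

```r
# Hypothetical monthly spending for five students; the last value is an outlier
spend <- c(0, 10, 20, 30, 5000)

mean_all <- mean(spend)                       # 1012, dominated by the outlier
median_all <- median(spend)                   # 20, barely affected
mean_trimmed <- mean(spend[spend < 1000])     # 15 once the outlier is excluded
c(mean = mean_all, median = median_all, mean_without_outlier = mean_trimmed)
```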
Let’s use box plots to visualize the distribution of the money_per_month variable for each country.
# Isolate only the countries of interest
only_4 <- fcc_good %>%
filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada')
# We may remove rows later, so we add an index column holding each row's number.
# This lets us refer to specific rows when we filter outliers out.
only_4 <- only_4 %>%
mutate(index = row_number())
# Box plots to visualize distributions
ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
geom_boxplot() +
ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
xlab("Country") +
ylab("Money per month (US dollars)") +
theme_bw()
It’s hard to see on the plot above if there’s anything wrong with the data for the United Kingdom, India, or Canada, but we can see immediately that there’s something really off for the US: two people spend $50,000 or more per month on learning. This is not impossible, but it seems extremely unlikely, so we’ll remove every value that goes over $20,000 per month.
# Isolate only those participants who spend less than 20,000 per month
fcc_good <- fcc_good %>%
filter(money_per_month < 20000)
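The $20,000 cutoff above was chosen by eye. For comparison, here is a sketch of a rule-based alternative, which is not what this analysis uses: base R's boxplot.stats() flags values beyond 1.5 times the interquartile range from the hinges, the same whisker rule box plots rely on. The `spend` vector is hypothetical.

```r
# Hypothetical monthly spending values; the last one is an obvious outlier
spend <- c(0, 10, 20, 30, 40, 50, 5000)

# boxplot.stats() returns the points lying beyond the 1.5 * IQR whiskers in $out
flagged <- boxplot.stats(spend)$out

# Keep only the values that were not flagged
kept <- spend[!spend %in% flagged]
flagged
```

On heavily right-skewed data such as spending, this rule tends to flag many legitimate values, which is one reason to inspect the extreme rows by hand as done in the following steps.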
Now let’s recompute the mean values and plot the box plots again.
# Mean sum of money spent by students each month
countries_mean <- fcc_good %>%
filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
group_by(CountryLive) %>%
summarize(mean = mean(money_per_month)) %>%
arrange(desc(mean))
countries_mean
## # A tibble: 4 × 2
## CountryLive mean
## <chr> <dbl>
## 1 United States of America 184.
## 2 India 135.
## 3 Canada 114.
## 4 United Kingdom 45.5
# Isolate only the countries of interest
only_4 <- fcc_good %>%
filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
mutate(index = row_number())
# Box plots to visualize distributions
ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
geom_boxplot() +
ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
xlab("Country") +
ylab("Money per month (US dollars)") +
theme_bw()
We can see a few extreme outliers for India (values over $2,500 per month), but it’s unclear whether this is good data or not. Maybe these persons attended several bootcamps, which tend to be very expensive. Let’s examine these data points to see if we can find anything relevant.
# Inspect the extreme outliers for India
india_outliers <- only_4 %>%
filter(CountryLive == 'India' &
money_per_month >= 2500)
india_outliers
## # A tibble: 6 × 138
## Age Attend…¹ Bootc…² Bootc…³ Bootc…⁴ Bootc…⁵ Child…⁶ CityP…⁷ CodeE…⁸ CodeE…⁹
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 24 0 NA NA <NA> NA NA betwee… NA NA
## 2 20 0 NA NA <NA> NA NA more t… NA NA
## 3 28 0 NA NA <NA> NA NA betwee… 1 NA
## 4 22 0 NA NA <NA> NA NA more t… NA NA
## 5 19 0 NA NA <NA> NA NA more t… NA NA
## 6 27 0 NA NA <NA> NA NA more t… NA NA
## # … with 128 more variables: CodeEventFCC <dbl>, CodeEventGameJam <dbl>,
## # CodeEventGirlDev <dbl>, CodeEventHackathons <dbl>, CodeEventMeetup <dbl>,
## # CodeEventNodeSchool <dbl>, CodeEventNone <dbl>, CodeEventOther <chr>,
## # CodeEventRailsBridge <dbl>, CodeEventRailsGirls <dbl>,
## # CodeEventStartUpWknd <dbl>, CodeEventWkdBootcamps <dbl>,
## # CodeEventWomenCode <dbl>, CodeEventWorkshops <dbl>, CommuteTime <chr>,
## # CountryCitizen <chr>, CountryLive <chr>, EmploymentField <chr>, …
It seems that none of these participants attended a bootcamp. Overall, it’s really hard to figure out from the data whether these persons really spent that much money on learning. The actual question of the survey was “Aside from university tuition, about how much money have you spent on learning to code so far (in US dollars)?”, so they might have misunderstood and thought university tuition is included. It seems safer to remove these six rows.
# Remove the outliers for India
only_4 <- only_4 %>%
filter(!(index %in% india_outliers$index))
Looking back at the box plot above, we can also see more extreme outliers for the US (values over $6,000 per month). Let’s examine these participants in more detail.
# Examine the extreme outliers for the US
us_outliers <- only_4 %>%
filter(CountryLive == 'United States of America' &
money_per_month >= 6000)
us_outliers
## # A tibble: 11 × 138
## Age Atten…¹ Bootc…² Bootc…³ Bootc…⁴ Bootc…⁵ Child…⁶ CityP…⁷ CodeE…⁸ CodeE…⁹
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 26 1 0 0 The Co… 1 NA more t… 1 NA
## 2 32 1 0 0 The Ir… 1 NA betwee… NA NA
## 3 34 1 1 0 We Can… 1 NA more t… NA NA
## 4 31 0 NA NA <NA> NA NA betwee… NA NA
## 5 46 1 1 1 Sabio.… 0 NA betwee… NA NA
## 6 32 0 NA NA <NA> NA NA more t… 1 NA
## 7 26 1 0 1 Codeup 0 NA more t… NA NA
## 8 33 1 0 1 Grand … 1 NA betwee… NA NA
## 9 29 0 NA NA <NA> NA 2 more t… NA NA
## 10 27 0 NA NA <NA> NA 1 more t… NA NA
## 11 50 0 NA NA <NA> NA 2 less t… NA NA
## # … with 128 more variables: CodeEventFCC <dbl>, CodeEventGameJam <dbl>,
## # CodeEventGirlDev <dbl>, CodeEventHackathons <dbl>, CodeEventMeetup <dbl>,
## # CodeEventNodeSchool <dbl>, CodeEventNone <dbl>, CodeEventOther <chr>,
## # CodeEventRailsBridge <dbl>, CodeEventRailsGirls <dbl>,
## # CodeEventStartUpWknd <dbl>, CodeEventWkdBootcamps <dbl>,
## # CodeEventWomenCode <dbl>, CodeEventWorkshops <dbl>, CommuteTime <chr>,
## # CountryCitizen <chr>, CountryLive <chr>, EmploymentField <chr>, …
only_4 <- only_4 %>%
filter(!(index %in% us_outliers$index))
Out of these 11 extreme outliers, six people attended bootcamps, which justifies the large sums of money spent on learning. For the other five, it’s hard to figure out from the data where they could have spent that much money on learning. Consequently, we’ll remove those rows where participants reported that they spend $6,000 or more each month but have never attended a bootcamp.
Also, the data shows that eight respondents had been programming for no more than three months when they completed the survey. They most likely paid a large sum of money for a bootcamp that was going to last for several months, so the amount of money spent per month is unrealistic and should be significantly lower (because they probably didn’t spend anything for the next couple of months after the survey). As a consequence, we’ll remove these eight outliers.
In the next code block, we’ll remove US respondents who reported spending $6,000 or more per month and who either didn’t attend a bootcamp or had been programming for three months or less when they completed the survey.
# Remove the respondents who didn't attend a bootcamp
no_bootcamp <- only_4 %>%
  filter(CountryLive == 'United States of America' &
           money_per_month >= 6000 &
           AttendedBootcamp == 0)
only_4 <- only_4 %>%
  filter(!(index %in% no_bootcamp$index))
# Remove the respondents who had been programming for 3 months or less
less_than_3_months <- only_4 %>%
  filter(CountryLive == 'United States of America' &
           money_per_month >= 6000 &
           MonthsProgramming <= 3)
only_4 <- only_4 %>%
  filter(!(index %in% less_than_3_months$index))
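The two removal rules can also be expressed as one logical condition. The sketch below uses base R on a made-up data frame (`toy` is hypothetical and only borrows the survey's column names) to show which rows the combined rule would drop.

```r
# Hypothetical rows with the same column names as the survey data
toy <- data.frame(
  CountryLive = c("United States of America", "United States of America",
                  "United States of America", "India"),
  money_per_month = c(8000, 9000, 7000, 8000),
  AttendedBootcamp = c(0, 1, 1, 1),
  MonthsProgramming = c(12, 2, 10, 1)
)

# Flag US respondents at or above $6,000 per month who either skipped
# bootcamps or had been programming for three months or less
drop <- toy$CountryLive == "United States of America" &
  toy$money_per_month >= 6000 &
  (toy$AttendedBootcamp == 0 | toy$MonthsProgramming <= 3)

kept <- toy[!drop, ]
kept
```

Rows 1 and 2 are dropped (no bootcamp, and two months of programming respectively); the long-running US bootcamp attendee and the Indian respondent are kept. Note that, unlike dplyr's filter(), base subsetting with a logical vector does not silently drop rows where the condition is NA, so missing values would need separate handling on real data.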
Looking again at the last box plot above, we can also see an extreme outlier for Canada — a person who spends roughly $5,000 per month. Let’s examine this person in more depth.
# Examine the extreme outliers for Canada
canada_outliers <- only_4 %>%
filter(CountryLive == 'Canada' &
money_per_month >= 4500 &
MonthsProgramming <= 3)
canada_outliers
## # A tibble: 1 × 138
## Age Attend…¹ Bootc…² Bootc…³ Bootc…⁴ Bootc…⁵ Child…⁶ CityP…⁷ CodeE…⁸ CodeE…⁹
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 24 1 0 0 Bloc.io 1 NA more t… 1 NA
## # … with 128 more variables: CodeEventFCC <dbl>, CodeEventGameJam <dbl>,
## # CodeEventGirlDev <dbl>, CodeEventHackathons <dbl>, CodeEventMeetup <dbl>,
## # CodeEventNodeSchool <dbl>, CodeEventNone <dbl>, CodeEventOther <chr>,
## # CodeEventRailsBridge <dbl>, CodeEventRailsGirls <dbl>,
## # CodeEventStartUpWknd <dbl>, CodeEventWkdBootcamps <dbl>,
## # CodeEventWomenCode <dbl>, CodeEventWorkshops <dbl>, CommuteTime <chr>,
## # CountryCitizen <chr>, CountryLive <chr>, EmploymentField <chr>, …
Here, the situation is similar to some of the US respondents — this participant had been programming for no more than two months when he completed the survey. He seems to have paid a large sum of money in the beginning to enroll in a bootcamp, and then he probably didn’t spend anything for the next couple of months after the survey. We’ll take the same approach here as for the US and remove this outlier.
# Remove the extreme outliers for Canada
only_4 <- only_4 %>%
filter(!(index %in% canada_outliers$index))
Let’s recompute the mean values and generate the final box plots.
# Mean sum of money spent by students each month
countries_mean <- only_4 %>%
group_by(CountryLive) %>%
summarize(mean = mean(money_per_month)) %>%
arrange(desc(mean))
countries_mean
## # A tibble: 4 × 2
## CountryLive mean
## <chr> <dbl>
## 1 United States of America 143.
## 2 Canada 93.1
## 3 India 65.8
## 4 United Kingdom 45.5
# Box plots to visualize distributions
ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
geom_boxplot() +
ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
xlab("Country") +
ylab("Money per month (US dollars)") +
theme_bw()
# Choosing the Two Best Markets
Obviously, one country we should advertise in is the US. Lots of new coders live there and they are willing to pay a good amount of money each month (roughly $143).
We sell subscriptions at a price of $59 per month, and Canada seems to be the best second choice because people there are willing to pay roughly $93 per month, compared to India ($66) and the United Kingdom ($45).
The data suggests strongly that we shouldn’t advertise in the UK, but let’s take a second look at India before deciding to choose Canada as our second best choice:
# Frequency table for the 'CountryLive' column
only_4 %>% group_by(CountryLive) %>%
summarise(freq = n() * 100 / nrow(only_4) ) %>%
arrange(desc(freq)) %>%
head()
## # A tibble: 4 × 2
## CountryLive freq
## <chr> <dbl>
## 1 United States of America 75.0
## 2 India 11.7
## 3 United Kingdom 7.16
## 4 Canada 6.14
# Frequency table to check if we still have enough data
only_4 %>% group_by(CountryLive) %>%
summarise(freq = n() ) %>%
arrange(desc(freq)) %>%
head()
## # A tibble: 4 × 2
## CountryLive freq
## <chr> <int>
## 1 United States of America 2920
## 2 India 457
## 3 United Kingdom 279
## 4 Canada 239
So it’s not crystal clear what to choose between Canada and India. Although it seems more tempting to choose Canada, there’s a good chance that India might actually be a better choice because of the large number of potential customers.
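One rough way to weigh market size against spending is to multiply the two. The sketch below copies the respondent counts and rounded means from the tables above; the product (`total_spend`) is a proxy introduced here for illustration, not a metric from the survey, and the counts reflect who answered the survey rather than the true size of each market.

```r
# Respondent counts and mean monthly spend copied from the tables above
markets <- data.frame(
  country = c("United States of America", "India", "United Kingdom", "Canada"),
  n = c(2920, 457, 279, 239),
  mean_spend = c(143, 65.8, 45.5, 93.1)
)

# Total monthly spend reported by the sampled respondents in each country
markets$total_spend <- markets$n * markets$mean_spend
markets[order(-markets$total_spend), ]
```

Under this proxy India comes out ahead of Canada (about $30,000 against about $22,000 per month across the sample), which is consistent with the hesitation expressed above.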
At this point, it seems that we have several options:
Given this uncertainty, it’s probably best to send our analysis to the marketing team and let them use their domain knowledge to decide. They might want to do some extra surveys in India and Canada and then get back to us so we can analyze the new survey data.
In this project, we analyzed survey data from new coders to find the best two markets to advertise in. The only solid conclusion we reached is that the US would be a good market to advertise in.
For the second best market, it wasn’t clear-cut what to choose between India and Canada. We decided to send the results to the marketing team so they can use their domain knowledge to make the best decision.
[1] We can use the split-and-combine workflow.
[2] We can use the drop_na() function.
[3] We can use the stringr::str_split() function.