Parallelization in R: Testing your PC

Sourav Bikash
Oct 29, 2021

View the gains your multi-core PC is giving you

By installing R (version 4.1.1, 2021-08-10) on your Windows PC, you can observe the parallelization gains on your PC/laptop using some simple calculations that involve intensive CPU activity.

Intel i7: 8-core vs single-core

Here, I use a location data set covering most of the cities in the world to calculate the average inter-city distance over a sample. We will see that the number of city pairs explodes combinatorially as the sample grows, which puts your CPU to work.

I am using a Lenovo laptop with an 11th Gen Intel i7 (8 logical cores), which gives me a comparison between single-core and 8-core processing times. As a first step, we need to check the number of cores in our system:

> library(parallel)
> detectCores() #get the number of cores in your system
[1] 8

I have taken the cities data set from Kaggle: https://www.kaggle.com/max-mind/world-cities-database. Let's get the data and read it in:

> set.seed(1)
> all.cities.locs <- read.csv("~/DASCA_WORK/Global_Data/worldcitiespop.csv",
                              stringsAsFactors=FALSE)
> head(all.cities.locs)
Country City AccentCity Region Population Latitude Longitude
1 ad aixas Aixàs 06 NA 42.48333 1.466667
2 ad aixirivali Aixirivali 06 NA 42.46667 1.500000
3 ad aixirivall Aixirivall 06 NA 42.46667 1.500000
4 ad aixirvall Aixirvall 06 NA 42.46667 1.500000
5 ad aixovall Aixovall 06 NA 42.46667 1.483333
6 ad andorra Andorra 07 NA 42.50000 1.516667
>

We need to find the distance between two latitude/longitude pairs; an implementation of the haversine formula should suffice.
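In symbols, where $\phi_1, \phi_2$ are the two latitudes, $\Delta\phi$ and $\Delta\lambda$ the differences in latitude and longitude (all in radians), and $r$ the Earth's radius, the code below implements:

$$a = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right), \qquad d = 2r \cdot \mathrm{atan2}\left(\sqrt{a},\ \sqrt{1-a}\right)$$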

#degrees to radians
to.radians <- function(degrees){
  degrees * pi / 180
}

#the haversine function to calculate the distance between 2 geo-locations
haversine <- function(lat1, long1, lat2, long2, unit="km"){
  radius <- 6378 # radius of Earth in kilometers
  delta.phi <- to.radians(lat2 - lat1)
  delta.lambda <- to.radians(long2 - long1)
  phi1 <- to.radians(lat1)
  phi2 <- to.radians(lat2)
  term1 <- sin(delta.phi/2) ^ 2
  term2 <- cos(phi1) * cos(phi2) * sin(delta.lambda/2) ^ 2
  the.terms <- term1 + term2
  delta.sigma <- 2 * atan2(sqrt(the.terms), sqrt(1-the.terms))
  distance <- radius * delta.sigma
  if(unit=="km") return(distance)
  if(unit=="miles") return(0.621371*distance)
}
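A quick sanity check with a pair of well-known cities (the coordinates are illustrative, not taken from the data set):

# New York (40.7128, -74.0060) to London (51.5074, -0.1278)
haversine(40.7128, -74.0060, 51.5074, -0.1278) # roughly 5,577 km with this Earth radius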

Now we need to set up our functions for the simulation. We are going to calculate the pair-wise distances between the cities in a sample and take their average. The work grows quickly, because a sample of n cities yields n(n-1)/2 pairs.
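You can see how fast the pair count explodes directly in R:

> choose(c(1000, 5000, 10000), 2) # unique pairs for each sample size
[1]   499500 12497500 49995000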

#---------------------- single core ----------------------
single.core <- function(cities.locs){
  running.sum <- 0
  for(i in 1:(nrow(cities.locs)-1)){
    for(j in (i+1):nrow(cities.locs)){
      # i is the row of the first lat/long pair
      # j is the row of the second lat/long pair
      this.dist <- haversine(cities.locs[i, 6], # column 6 holds the latitude
                             cities.locs[i, 7], # column 7 holds the longitude
                             cities.locs[j, 6],
                             cities.locs[j, 7])
      running.sum <- running.sum + this.dist
    }
  }
  # Now we have to divide by the number of
  # distances we took, which is n(n-1)/2
  return(running.sum /
           ((nrow(cities.locs)*(nrow(cities.locs)-1))/2))
}

# draw a random sample of cities before timing (the full data set is far too large)
random.sample <- sample(1:nrow(all.cities.locs), 1000)
cities.locs <- all.cities.locs[random.sample, ]
row.names(cities.locs) <- NULL

system.time(ave.dist <- single.core(cities.locs))
print(ave.dist)
#---------------------- multi core ----------------------
cl <- makeCluster(detectCores()) # spin up one worker per core
clusterExport(cl, c("haversine", "to.radians")) # the workers need these functions
multi.core <- function(cities.locs){
  all.combs <- combn(1:nrow(cities.locs), 2) # every pair of row indices
  numcombs <- ncol(all.combs)
  results <- parLapply(cl, 1:numcombs, function(x){
    lat1 <- cities.locs[all.combs[1, x], 6]
    long1 <- cities.locs[all.combs[1, x], 7]
    lat2 <- cities.locs[all.combs[2, x], 6]
    long2 <- cities.locs[all.combs[2, x], 7]
    return(haversine(lat1, long1, lat2, long2))
  })
  return(sum(unlist(results)) / numcombs)
}
system.time(ave.dist <- multi.core(cities.locs))
print(ave.dist)
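One housekeeping note: makeCluster() spawns worker R processes that keep running until you stop them, so call stopCluster(cl) once you are done. As a sketch (this variant is mine, not part of the original script), the function can also manage its own cluster and clean up after itself:

multi.core.managed <- function(cities.locs, n.cores = detectCores()){
  cl <- makeCluster(n.cores)
  on.exit(stopCluster(cl)) # release the workers even if an error occurs
  clusterExport(cl, c("haversine", "to.radians"))
  all.combs <- combn(1:nrow(cities.locs), 2)
  numcombs <- ncol(all.combs)
  results <- parLapply(cl, 1:numcombs, function(x){
    haversine(cities.locs[all.combs[1, x], 6], cities.locs[all.combs[1, x], 7],
              cities.locs[all.combs[2, x], 6], cities.locs[all.combs[2, x], 7])
  })
  sum(unlist(results)) / numcombs
}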

With these functions we are ready to set up our simulation. The data set has over 3 million entries, so a random sample gives us a quick result. I have decided to take 10 sample sizes, ranging from 1,000 to 10,000 in steps of 1,000, for each of the two experiments, and to store the results in a data frame that we can analyze and plot later. Below are two loops, one each for the multi-core and single-core tests, that save the elapsed system time per iteration.

#---------------------- multi core test ----------------------
# declare an empty data frame with 3 columns and null entries
df <- data.frame(matrix(
  vector(), 0, 3, dimnames=list(c(), c("Sample_Size","Cores","Time"))),
  stringsAsFactors=F)

i <- 8 # set to the number of cores in the system
cl <- makeCluster(i) # create the cluster once, outside the loop
clusterExport(cl, c("haversine", "to.radians"))
for(j in seq(1000, 10000, 1000)){
  # choose a random sample of cities
  smp.size <- j # assign sample size
  random.sample <- sample(1:nrow(all.cities.locs), smp.size)
  cities.locs <- all.cities.locs[random.sample, ]
  row.names(cities.locs) <- NULL
  # get the simulation results
  df <- rbind(df, c(j, i, system.time(ave.dist <- multi.core(cities.locs))[3]))
}
stopCluster(cl)
colnames(df) <- c("Sample_Size","Cores","Time")
write.csv(df,
          "~/DASCA_WORK/Global_Data/runtime_test.csv",
          row.names = TRUE)
#---------------------- single core test ----------------------
i <- 1
for(j in seq(1000, 10000, 1000)){
  # choose a random sample of cities
  smp.size <- j # assign sample size
  random.sample <- sample(1:nrow(all.cities.locs), smp.size)
  cities.locs <- all.cities.locs[random.sample, ]
  row.names(cities.locs) <- NULL
  # get the simulation results
  df <- rbind(df, c(j, i, system.time(ave.dist <- single.core(cities.locs))[3]))
}
colnames(df) <- c("Sample_Size","Cores","Time")
write.csv(df,
          "~/DASCA_WORK/Global_Data/runtime_test_single_core.csv",
          row.names = TRUE)

These simulations will take some time to run, so do not wait for them to finish, and keep your laptop/PC connected to a power source. The results looked somewhat like this on my system (the table entries are in minutes):

Table: Simulation results
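To view the gains visually, you can read the saved timings back and plot the two series. This is a minimal sketch, assuming the runtime_test_single_core.csv file written above (which by that point holds both the multi-core and the single-core rows; the Time column is elapsed seconds as recorded by system.time()):

df <- read.csv("~/DASCA_WORK/Global_Data/runtime_test_single_core.csv")
single <- subset(df, Cores == 1)
multi <- subset(df, Cores == 8)
plot(single$Sample_Size, single$Time, type = "b", col = "red",
     ylim = range(df$Time), xlab = "Sample size",
     ylab = "Elapsed time (seconds)", main = "Single-core vs 8-core runtime")
lines(multi$Sample_Size, multi$Time, type = "b", col = "blue")
legend("topleft", legend = c("1 core", "8 cores"), col = c("red", "blue"), lty = 1)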
  1. The full code/script can be found here: https://github.com/souravoo7/Marketing_Analytics
  2. The source of the data: https://www.kaggle.com/max-mind/world-cities-database
