Parallelization in R: Testing your PC

Sourav Bikash
Oct 29, 2021

View the gains your multi-core PC is giving you

By installing R (version 4.1.1, 2021-08-10) on your Windows PC, you can observe the parallelization gains on your PC/laptop using some simple calculations that involve intensive CPU activity.

Intel i7: 8-core vs single-core

Here, I use a location data set covering most of the cities in the world to calculate the average inter-city distance over a sample. We will see that the number of city pairs explodes combinatorially as the sample grows, which puts your CPU to work.

I am using a Lenovo laptop with an 11th Gen Intel i7 (8 logical cores), which gives me a comparison between single-core and 8-core processing times. As a first step, we need to check the number of cores in our system:

> library(parallel)
> detectCores() #get the number of cores in your system
[1] 8

I have taken the cities data set from Kaggle: https://www.kaggle.com/max-mind/world-cities-database. Let's get the data and read it in:

> set.seed(1)
> all.cities.locs <- read.csv("~/DASCA_WORK/Global_Data/worldcitiespop.csv",
                              stringsAsFactors=FALSE)
> head(all.cities.locs)
Country City AccentCity Region Population Latitude Longitude
1 ad aixas Aixàs 06 NA 42.48333 1.466667
2 ad aixirivali Aixirivali 06 NA 42.46667 1.500000
3 ad aixirivall Aixirivall 06 NA 42.46667 1.500000
4 ad aixirvall Aixirvall 06 NA 42.46667 1.500000
5 ad aixovall Aixovall 06 NA 42.46667 1.483333
6 ad andorra Andorra 07 NA 42.50000 1.516667
>

We need to find the distance between two latitude/longitude pairs; an implementation of the haversine formula should suffice.
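In symbols, where $\phi_1, \phi_2$ are the two latitudes, $\Delta\phi$ and $\Delta\lambda$ the differences in latitude and longitude (all in radians), and $r$ the Earth's radius, the code below implements:

$$a = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right), \qquad d = 2r \cdot \mathrm{atan2}\left(\sqrt{a},\ \sqrt{1-a}\right)$$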

#degrees to radians
to.radians <- function(degrees){
  degrees * pi / 180
}

#the haversine function to calculate the distance between 2 geo-locations
haversine <- function(lat1, long1, lat2, long2, unit="km"){
  radius <- 6378 # radius of Earth in kilometers
  delta.phi <- to.radians(lat2 - lat1)
  delta.lambda <- to.radians(long2 - long1)
  phi1 <- to.radians(lat1)
  phi2 <- to.radians(lat2)
  term1 <- sin(delta.phi/2) ^ 2
  term2 <- cos(phi1) * cos(phi2) * sin(delta.lambda/2) ^ 2
  the.terms <- term1 + term2
  delta.sigma <- 2 * atan2(sqrt(the.terms), sqrt(1-the.terms))
  distance <- radius * delta.sigma
  if(unit=="km") return(distance)
  if(unit=="miles") return(0.621371*distance)
}
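A quick sanity check with a pair of well-known cities (the coordinates are illustrative, not taken from the data set):

# New York (40.7128, -74.0060) to London (51.5074, -0.1278)
haversine(40.7128, -74.0060, 51.5074, -0.1278) # roughly 5,577 km with this Earth radius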

Now we need to set up our functions for the simulation. We are going to calculate the pair-wise distances between the cities in a sample and take their average. The work grows quickly, because a sample of n cities yields n(n-1)/2 pairs.
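You can see how fast the pair count explodes directly in R:

> choose(c(1000, 5000, 10000), 2) # unique pairs for each sample size
[1]   499500 12497500 49995000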

#---------------------- single core ----------------------
single.core <- function(cities.locs){
  running.sum <- 0
  for(i in 1:(nrow(cities.locs)-1)){
    for(j in (i+1):nrow(cities.locs)){
      # i is the row of the first lat/long pair
      # j is the row of the second lat/long pair
      this.dist <- haversine(cities.locs[i, 6], # column 6 holds the latitude
                             cities.locs[i, 7], # column 7 holds the longitude
                             cities.locs[j, 6],
                             cities.locs[j, 7])
      running.sum <- running.sum + this.dist
    }
  }
  # Now we have to divide by the number of
  # distances we took, which is n(n-1)/2
  return(running.sum /
           ((nrow(cities.locs)*(nrow(cities.locs)-1))/2))
}

# draw a random sample of cities before timing (the full data set is far too large)
random.sample <- sample(1:nrow(all.cities.locs), 1000)
cities.locs <- all.cities.locs[random.sample, ]
row.names(cities.locs) <- NULL

system.time(ave.dist <- single.core(cities.locs))
print(ave.dist)
#---------------------- multi core ----------------------
cl <- makeCluster(detectCores()) # spin up one worker per core
clusterExport(cl, c("haversine", "to.radians")) # the workers need these functions
multi.core <- function(cities.locs){
  all.combs <- combn(1:nrow(cities.locs), 2) # every pair of row indices
  numcombs <- ncol(all.combs)
  results <- parLapply(cl, 1:numcombs, function(x){
    lat1 <- cities.locs[all.combs[1, x], 6]
    long1 <- cities.locs[all.combs[1, x], 7]
    lat2 <- cities.locs[all.combs[2, x], 6]
    long2 <- cities.locs[all.combs[2, x], 7]
    return(haversine(lat1, long1, lat2, long2))
  })
  return(sum(unlist(results)) / numcombs)
}
system.time(ave.dist <- multi.core(cities.locs))
print(ave.dist)
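One housekeeping note: makeCluster() spawns worker R processes that keep running until you stop them, so call stopCluster(cl) once you are done. As a sketch (this variant is mine, not part of the original script), the function can also manage its own cluster and clean up after itself:

multi.core.managed <- function(cities.locs, n.cores = detectCores()){
  cl <- makeCluster(n.cores)
  on.exit(stopCluster(cl)) # release the workers even if an error occurs
  clusterExport(cl, c("haversine", "to.radians"))
  all.combs <- combn(1:nrow(cities.locs), 2)
  numcombs <- ncol(all.combs)
  results <- parLapply(cl, 1:numcombs, function(x){
    haversine(cities.locs[all.combs[1, x], 6], cities.locs[all.combs[1, x], 7],
              cities.locs[all.combs[2, x], 6], cities.locs[all.combs[2, x], 7])
  })
  sum(unlist(results)) / numcombs
}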

With these functions we are ready to set up our simulation. The data set has over 3 million entries, so a random sample gives us a quick result. I have decided to take 10 sample sizes, ranging from 1,000 to 10,000 in steps of 1,000, for each of the two experiments, and to store the results in a data frame that we can analyze and plot later. Below are two loops, one each for the multi-core and single-core tests, that save the elapsed system time per iteration.

#---------------------- multi core test ----------------------
# declare an empty data frame with 3 columns and null entries
df <- data.frame(matrix(
  vector(), 0, 3, dimnames=list(c(), c("Sample_Size","Cores","Time"))),
  stringsAsFactors=F)

i <- 8 # set to the number of cores in the system
cl <- makeCluster(i) # create the cluster once, outside the loop
clusterExport(cl, c("haversine", "to.radians"))
for(j in seq(1000, 10000, 1000)){
  # choose a random sample of cities
  smp.size <- j # assign sample size
  random.sample <- sample(1:nrow(all.cities.locs), smp.size)
  cities.locs <- all.cities.locs[random.sample, ]
  row.names(cities.locs) <- NULL
  # get the simulation results
  df <- rbind(df, c(j, i, system.time(ave.dist <- multi.core(cities.locs))[3]))
}
stopCluster(cl)
colnames(df) <- c("Sample_Size","Cores","Time")
write.csv(df,
          "~/DASCA_WORK/Global_Data/runtime_test.csv",
          row.names = TRUE)
#---------------------- single core test ----------------------
i <- 1
for(j in seq(1000, 10000, 1000)){
  # choose a random sample of cities
  smp.size <- j # assign sample size
  random.sample <- sample(1:nrow(all.cities.locs), smp.size)
  cities.locs <- all.cities.locs[random.sample, ]
  row.names(cities.locs) <- NULL
  # get the simulation results
  df <- rbind(df, c(j, i, system.time(ave.dist <- single.core(cities.locs))[3]))
}
colnames(df) <- c("Sample_Size","Cores","Time")
write.csv(df,
          "~/DASCA_WORK/Global_Data/runtime_test_single_core.csv",
          row.names = TRUE)

These simulations will take some time to run, so do not wait for them to finish, and keep your laptop/PC connected to a power source. The results looked somewhat like this on my system (the table entries are in minutes):

Table: Simulation results
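To view the gains visually, you can read the saved timings back and plot the two series. This is a minimal sketch, assuming the runtime_test_single_core.csv file written above (which by that point holds both the multi-core and the single-core rows; the Time column is elapsed seconds as recorded by system.time()):

df <- read.csv("~/DASCA_WORK/Global_Data/runtime_test_single_core.csv")
single <- subset(df, Cores == 1)
multi <- subset(df, Cores == 8)
plot(single$Sample_Size, single$Time, type = "b", col = "red",
     ylim = range(df$Time), xlab = "Sample size",
     ylab = "Elapsed time (seconds)", main = "Single-core vs 8-core runtime")
lines(multi$Sample_Size, multi$Time, type = "b", col = "blue")
legend("topleft", legend = c("1 core", "8 cores"), col = c("red", "blue"), lty = 1)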
  1. The full code/script can be found here: https://github.com/souravoo7/Marketing_Analytics
  2. The source of the data: https://www.kaggle.com/max-mind/world-cities-database
