Count pairs whose value is equal to given sum using Spark

Aarfah Ahmad
3 min read · May 25, 2021

Hola!

This is a continuation of my previous article, All About Big Data and Hadoop.

Here, I will discuss a problem and how we can use Spark to solve it.

The title might seem straightforward, but the solution is not as obvious as it appears. It is one of the most frequently asked interview questions.

Problem Statement

You are given a list of numbers and a target sum; you need to find the pairs whose addition equals the given sum.

Consider the list of numbers x = 1, 4, 2, 5 and the sum 6.

So, there are 2 pairs that give the required result: (4, 2) and (1, 5).

The basic logic would be to loop over the array with two nested loops and check the sum of every pair.

But that results in O(n²) time complexity since you need 2 loops. You can refer to this link to know more about the approach.
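For reference, here is a minimal brute-force sketch in plain Python (my illustration, not part of the original article) showing the nested-loop idea on the example above:

%python
# Naive O(n^2) approach: check every pair with two nested loops
def count_pairs_brute_force(nums, target):
    pairs = []
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                pairs.append((nums[i], nums[j]))
    return pairs

print(count_pairs_brute_force([1, 4, 2, 5], 6))  # [(1, 5), (4, 2)]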

Also, this is fine when you have a small dataset, but with a big dataset it is not feasible to use explicit loops in Spark.

A more promising approach is, instead of summing every pair, to use subtraction: for each number, check whether the value that would complete the sum is present.

So, compute ‘Sum-x’; if that value is present in x, you have a required pair. Also make sure you take only distinct pairs: 1,2 and 2,1 should be counted as one pair.

In simple terms, create a column ‘Sum-x’, then check whether that value is available in x; if yes, return the pair (x, Sum-x) as a desired pair.
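Before moving to Spark, here is a minimal single-machine sketch of this complement idea (my illustration, assuming plain Python). A set gives O(1) membership checks, and frozenset collapses (a, b) and (b, a) into one pair, the same trick the Spark code below uses:

%python
# Complement approach: for each x, check whether target - x exists
def find_pairs(nums, target):
    present = set(nums)
    seen = set()
    for x in nums:
        complement = target - x
        # frozenset is unordered, so {x, complement} == {complement, x};
        # skip pairing a number with itself
        if complement in present and complement != x:
            seen.add(frozenset([x, complement]))
    return [tuple(sorted(p)) for p in seen]

print(find_pairs([1, 4, 2, 5], 6))  # [(1, 5), (2, 4)] (order may vary)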

Here, I have taken a dataset in CSV format with around 72k records.

Below is the PySpark code implementing the logic discussed above.

%python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

# Read the input numbers (~72k records) into a DataFrame
input_df = spark.read.csv("/FileStore/tables/numbers.csv").toDF("number_")

# Target sum (renamed from `sum` so the Python builtin is not shadowed)
target_sum = 98123

# Cast the raw string column to double and compute the complement Sum-x
final = input_df.withColumn("number", input_df["number_"].cast(DoubleType()))
final = final.withColumn("Sub", target_sum - col("number"))

# Collect the distinct numbers to the driver (fine at this scale) and keep
# only the rows whose complement actually occurs in the input
list_num = final.rdd.map(lambda x: x.number).distinct().collect()
final = final.filter(col("Sub").isin(list_num))

# To get distinct pairs -> a,b and b,a => a,b: hash the unordered pair
# {number, Sub} with frozenset so both orderings produce the same value
pandas_df = final.toPandas()
pandas_df["val"] = pandas_df.apply(lambda x: hash(frozenset([x["number"], x["Sub"]])), axis=1)

mySchema = StructType([StructField("number_", StringType(), True),
                       StructField("number", DoubleType(), True),
                       StructField("Sub", DoubleType(), True),
                       StructField("val", LongType(), True)])
df = spark.createDataFrame(pandas_df, schema=mySchema)

# Keep exactly one row per unordered pair
df = df.withColumn("row_num", row_number().over(
    Window.partitionBy("val").orderBy(col("number").desc())))
df = df.filter(df.row_num == 1)
df = df[["number_", "number", "Sub"]]
print(df.count())
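As a side note, the toPandas() step pulls all matching rows to the driver, which works here but will not scale indefinitely. A pure-Spark alternative (my sketch, not from the article) deduplicates pairs with a self-join plus least/greatest instead of hashing in pandas; it reuses final and target_sum from the block above:

%python
from pyspark.sql.functions import col, least, greatest

# Self-join the distinct numbers on the complement condition, then
# normalize each pair so (a, b) and (b, a) collapse to one row
nums = final.select("number").distinct()
a = nums.withColumnRenamed("number", "x")
b = nums.withColumnRenamed("number", "y")
pairs = (a.join(b, a["x"] == target_sum - b["y"])
          .select(least("x", "y").alias("number"), greatest("x", "y").alias("Sub"))
          .distinct())
print(pairs.count())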

(Screenshots of the output are omitted here.)

Hope this was helpful!

Thanks for reading! Please do share the article if you liked it. Any comments or suggestions are welcome!
