Newest 'rdd' Questions - Stack Overflow

0 votes

1 answer

103 views

PySpark CONTEXT_ONLY_VALID_ON_DRIVER error when calling class method inside RDD transformation (AWS Glue)

I am running a Glue job where I call a framework Python file stored in S3. I download and import the framework like this: os.system(f'aws s3 cp s3://s3_bucket/Common/ABC/XML_TO_PARQUE_FRAMEWORK.py ./ -...

KReEd

reputation score 378

asked Jun 10 at 15:35

0 votes

0 answers

64 views

BroadcastExchange breaks RDD's lineage

I'm writing this post so hopefully someone can verify my findings and add some missing pieces. After calling cleanShuffleDependencies on an RDD after a long series of transformations, I noticed that ...

Dzeri96

reputation score 461

asked May 19 at 9:48

-1 votes

2 answers

85 views

How to Join two RDDs in pyspark with nested tuples

I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested, while the other is flat. I tried different things, but nothing seemed to work. Is there any ...

Ahmed Sohail Aslam PhDCS 2025

reputation score 1

asked Dec 1, 2025 at 3:46

1 vote

1 answer

130 views

How to properly recalculate Spark DataFrame statistics after checkpoint?

Here is a minimal example using default data in Databricks (Spark 3.4): import org.apache.spark.sql.functions.col import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ sc....

Igor Railean

reputation score 11

asked May 15, 2025 at 20:43

0 votes

3 answers

184 views

Pyspark mapPartition evaluates the function more times than expected

I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...

sebenitezg

reputation score 1

asked Dec 31, 2024 at 11:19

0 votes

1 answer

39 views

Find common data among two RDD in spark execution

I have RDD1 with columns (col1, col2) and rows (A, x123), (B, y123), (C, z123), and RDD2 with column col1 and rows A, C. I want to run an intersection of the two RDDs and find the common elements, i.e. for the items that are in RDD2, what is the data of ...

Sachin Shrivastava

reputation score 1

asked Oct 8, 2024 at 17:47

0 votes

1 answer

5k views

RDD is not implemented error on pyspark.sql.connect.dataframe.Dataframe

I have a DataFrame on Databricks on which I would like to use the RDD API. The type of the DataFrame is pyspark.sql.connect.dataframe.Dataframe after reading from the catalog. I found out that this ...

imawful

reputation score 135

asked Sep 25, 2024 at 8:16

0 votes

1 answer

72 views

unpacking nested tuples after Spark RDD join

The resources for this are scarce and I'm not sure that there's a solution to this issue. Suppose you have 3 simple RDDs, or more specifically 3 PairRDDs. val rdd1: RDD[(Int, Int)] = sc.parallelize(...

Nizar

reputation score 763

asked Sep 12, 2024 at 12:19

0 votes

0 answers

168 views

While in Jupyter notebook, while using pyspark, get Py4JJavaError when using simple .count

While using the following code: import pyspark from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import SparkSession from pyspark.sql.types import Row from datetime ...

aemilius89

reputation score 17

asked Aug 30, 2024 at 20:18

-1 votes

1 answer

405 views

pySpark RDD whitelisted Class issues

I used the code below in an Azure Databricks notebook before Unity Catalog was enabled on the cluster, but after changing to a cluster enabled for shared users, I am not able to use the logic below. How should we achieve ...

Developer Rajinikanth

reputation score 382

asked Aug 27, 2024 at 11:53

1 vote

1 answer

163 views

avg() over a whole dataframe causing different output

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window over the whole column (not using any partition), I see different results based on which column I use with orderBy. ...

anurag86

reputation score 1707

asked Jul 26, 2024 at 7:39

3 votes

1 answer

106 views

Casting RDD to a different type (from float64 to double)

I have code like the following, which uses PySpark. test_truth_value = RDD. test_predictor_rdd = RDD. valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...

Inkyu Kim

reputation score 175

asked Jul 3, 2024 at 12:09

1 vote

1 answer

71 views

Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors

I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using pyspark's HashingTF and IDF implementations. I tried to ...

Caden

reputation score 65

asked Jun 25, 2024 at 19:15

1 vote

1 answer

689 views

Why is my PySpark row_number column messed up when applying a schema?

I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...

stats_guy

reputation score 717

asked Jun 24, 2024 at 12:35

0 votes

0 answers

47 views

PySpark with RDD - How to calculate and compare averages?

I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...

Yoel Ha

reputation score 1

asked Jun 14, 2024 at 11:38
