0 votes
1 answer
103 views

I am running a Glue job where I call a framework Python file stored in S3. I download and import the framework like this: os.system(f'aws s3 cp s3://s3_bucket/Common/ABC/XML_TO_PARQUE_FRAMEWORK.py ./ -...
0 votes
0 answers
64 views

I'm writing this post so hopefully someone can verify my findings and add some missing pieces. After calling cleanShuffleDependencies on an RDD after a long series of transformations, I noticed that ...
-1 votes
2 answers
85 views

I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested, while the other is flat. I tried different things, but nothing seemed to work. Is there any ...
1 vote
1 answer
130 views

Here is a minimal example using default data in Databricks (Spark 3.4): import org.apache.spark.sql.functions.col import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ sc....
0 votes
3 answers
184 views

I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...
0 votes
1 answer
39 views

I have RDD1 with columns (col1, col2): (A, x123), (B, y123), (C, z123), and RDD2 with column (col1): A, C. I want to run an intersection of the two RDDs and find the common elements, i.e. items that are in RDD2 what is the data of ...
0 votes
1 answer
5k views

I have a DataFrame on Databricks on which I would like to use the RDD API. The type of the DataFrame is pyspark.sql.connect.dataframe.DataFrame after reading from the catalog. I found out that this ...
0 votes
1 answer
72 views

The resources for this are scarce and I'm not sure that there's a solution to this issue. Suppose you have 3 simple RDDs or, more specifically, 3 PairRDDs. val rdd1: RDD[(Int, Int)] = sc.parallelize(...
0 votes
0 answers
168 views

While using the following code: import pyspark from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import SparkSession from pyspark.sql.types import Row from datetime ...
-1 votes
1 answer
405 views

I used the code below before enabling a Unity Catalog cluster in an Azure Databricks notebook, but after switching to a shared-users-enabled cluster I am not able to use the logic below. How should we achieve ...
1 vote
1 answer
163 views

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy. ...
3 votes
1 answer
106 views

I have code like the below, which uses PySpark. test_truth_value = RDD. test_predictor_rdd = RDD. valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
1 vote
1 answer
71 views

I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using pyspark's HashingTF and IDF implementations. I tried to ...
1 vote
1 answer
689 views

I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
0 votes
0 answers
47 views

I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
