Newest 'apache-spark-sql' Questions

-5 votes

0 answers

47 views

How do I format currency fields without commas (2 decimals) and date columns as ‘MM/DD/YYYY’? For example, for ‘2026-10-01’ I want it to display as 10/02/2026 [closed]

If my output is a pipe-delimited text file, does the type of the fields matter, or is everything written as text, including numeric / currency fields and dates? For example, if I want to format my date ...

Hanna Ambaye

reputation score 1

asked 2 days ago
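A delimited text file stores only characters, so any currency or date formatting has to be applied before the write. The block below is a minimal plain-Python sketch of that idea, not Spark code; the `format_row` helper and the sample values are hypothetical. In Spark SQL the analogous built-ins would likely be `date_format(col, 'MM/dd/yyyy')` and `format_string('%.2f', col)`.

```python
from datetime import date
from decimal import Decimal


def format_row(amount: Decimal, day: date, sep: str = "|") -> str:
    """Render one record as delimited text (hypothetical helper).

    The amount gets two decimals and no thousands separator;
    the date is written as MM/DD/YYYY.
    """
    money = f"{amount:.2f}"          # fixed-point, no grouping comma
    when = day.strftime("%m/%d/%Y")  # zero-padded month and day, 4-digit year
    return sep.join([money, when])


# Sample values are made up for illustration.
line = format_row(Decimal("1234567.5"), date(2026, 10, 1))
print(line)
```

Once the values are strings, the original column types no longer affect what lands in the file.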

2 votes

1 answer

74 views

How can a nested JSON field cause AMBIGUOUS_REFERENCE_TO_FIELDS in Spark?

I'm consuming two Binance streams: a trade stream and a kline (candlestick) stream. These are the schemas I'm using in my Spark job: ...

Vikas Malakar

reputation score 21

asked Jun 25 at 13:24

-1 votes

1 answer

101 views

SQL Query for Unpivot compatible with SQLGLOT parser

I have imported the data from the attached Excel file. The dataset currently has the following structure: ISO, Name, 1993, 1994, 1995, …, 2023 Each year is represented as a separate column, and new ...

Gourav Joshi

reputation score 109

asked May 1 at 9:18
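One parser-friendly way to reshape one-column-per-year data is a plain `UNION ALL` query, which uses only basic SQL. The sketch below generates such a query in Python; the `unpivot_sql` helper and the table name `indicators` are hypothetical, the backtick quoting of numeric column names assumes a Spark/Hive-style dialect, and compatibility with SQLGlot has not been verified here.

```python
def unpivot_sql(table: str, id_cols: list[str], year_cols: list[str]) -> str:
    """Build a UNION ALL query that turns year columns into (year, value) rows.

    Hypothetical helper: one SELECT is emitted per year column, so new
    years only need to be appended to `year_cols`.
    """
    ids = ", ".join(id_cols)
    selects = [
        # Backticks quote the numeric column name (Spark/Hive-style assumption).
        f"SELECT {ids}, {int(y)} AS year, `{y}` AS value FROM {table}"
        for y in year_cols
    ]
    return "\nUNION ALL\n".join(selects)


# Table and column names are illustrative only.
query = unpivot_sql("indicators", ["ISO", "Name"], ["1993", "1994", "1995"])
print(query)
```

The generated query scans the table once per year column, so it trades efficiency for portability. Spark also provides a `stack` generator function that can do the same reshaping in a single pass.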

Best practices

0 votes

4 replies

107 views

Databricks run SQL query in python without spark

I am trying to run a big Databricks query with a lot of CTEs, etc. but I do not really want to run it in spark. Some parts of the query that work on the normal SQL warehouse do not work on spark. I am ...

max

reputation score 105

asked Mar 16 at 15:01

0 votes

0 answers

98 views

Can't SELECT anything in a AWS Glue Data Catalog view due to invalid view text: <REDACTED VIEW TEXT>

I created a Glue view through a Glue job like this: CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score SECURITY DEFINER AS [query ...

Paloma Raissa

reputation score 1

asked Jan 14 at 20:37

0 votes

1 answer

106 views

Spark job fails with UnsafeExternalSorter OOM when using groupBy + collect_list + sort – how to optimize?

How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL? I have a Spark (Java) batch job that processes large telecom event data. The job is failing with `...

Thịnh Nguyễn

reputation score 1

asked Jan 13 at 9:16

0 votes

1 answer

75 views

Why 2 tables bucketed by col1 and joined by (col1, col2) are shuffled?

// Enable all bucketing optimizations spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false") spark.conf.set("spark.sql.sources.bucketing.enabled"...

user2417458

reputation score 31

asked Dec 25, 2025 at 12:03

1 vote

0 answers

60 views

How to optimize special array_intersect in hive sql executed by spark engine?

buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...

Dong Ye

reputation score 11

asked Nov 22, 2025 at 17:27

Best practices

0 votes

5 replies

130 views

Pushing down filters in RDBMS with Java Spark

I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source. Now somewhere ...

Parth Sarthi Roy

reputation score 1

asked Nov 14, 2025 at 6:13

Advice

0 votes

6 replies

190 views

Pyspark SQL: How to do GROUP BY with specific WHERE condition

I am doing some SQL aggregation transformations on a dataset, and there is a certain condition that I would like to apply, but I'm not sure how. Here is a basic code block: le_test = spark.sql("""...

BeaverFever

reputation score 21

asked Nov 2, 2025 at 6:39

0 votes

0 answers

115 views

How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...

shiva

reputation score 2801

asked Oct 25, 2025 at 15:11

2 votes

1 answer

251 views

How to collect multiple metrics with observe in PySpark without triggering multiple actions

I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....

עומר אמזלג

reputation score 31

asked Oct 22, 2025 at 15:17

0 votes

1 answer

262 views

Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries

I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior). When I ...

shiva

reputation score 2801

asked Oct 21, 2025 at 20:58

0 votes

0 answers

65 views

Spark: VSAM File read issue with special character

We have a scenario where we read a VSAM file directly, along with a copybook to understand the column lengths, and we were using the COBRIX library as part of the Spark read. However, we could the same is not properly ...

Rocky1989

reputation score 409

asked Oct 15, 2025 at 7:06

0 votes

0 answers

64 views

How to link Spark event log stages to PySpark code or query?

I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage. However, ...

Carol C

reputation score 1

asked Oct 9, 2025 at 19:40
