1. 21 Aug 2021 (1 commit)
  2. 10 May 2021 (1 commit)
  3. 08 May 2021 (1 commit)
    • [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match · 7733510d
      Liang-Chi Hsieh committed
      ### What changes were proposed in this pull request?
      
      This patch proposes to make `StaticInvoke` able to find a method with the given name even if the parameter types do not exactly match the argument classes.
      
      ### Why are the changes needed?
      
      Unlike `Invoke`, `StaticInvoke` only tries to get the method with the exact argument classes. If the parameter types of the method being called do not exactly match the argument classes, `StaticInvoke` cannot find the method.
      
      `StaticInvoke` should be able to find the method in such cases too.
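
      Conceptually, the lookup changes from an exact-signature `getMethod` call to a search over methods whose parameter types can accept the argument classes. A minimal, hypothetical sketch of that idea (not the actual `StaticInvoke` code):

      ```scala
      import java.lang.reflect.Method

      def findMethod(cls: Class[_], name: String, argClasses: Seq[Class[_]]): Option[Method] = {
        // Exact match first (the old behaviour could only do this).
        val exact =
          try Some(cls.getMethod(name, argClasses: _*))
          catch { case _: NoSuchMethodException => None }
        // Otherwise accept any public method whose parameter types are assignable
        // from the given argument classes.
        exact.orElse {
          cls.getMethods.find { m =>
            m.getName == name &&
              m.getParameterTypes.length == argClasses.length &&
              m.getParameterTypes.zip(argClasses).forall { case (p, a) => p.isAssignableFrom(a) }
          }
        }
      }
      ```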
      
      ### Does this PR introduce _any_ user-facing change?
      
      Yes. `StaticInvoke` can find a method even if the argument classes do not match exactly.
      
      ### How was this patch tested?
      
      Unit test.
      
      Closes #32413 from viirya/static-invoke.
      Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
      Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
      (cherry picked from commit 33fbf564)
      Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
      7733510d
  4. 02 May 2021 (2 commits)
  5. 28 Apr 2021 (1 commit)
  6. 27 Apr 2021 (2 commits)
  7. 25 Apr 2021 (1 commit)
  8. 23 Apr 2021 (1 commit)
  9. 21 Apr 2021 (1 commit)
  10. 20 Apr 2021 (1 commit)
    • [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when... · c438f5fa
      allisonwang-db committed
      [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated
      
      This PR updates the `foundNonEqualCorrelatedPred` logic for correlated subqueries in `CheckAnalysis` to only allow correlated equality predicates that guarantee a one-to-one mapping between inner and outer attributes, instead of all equality predicates.
      
      To fix correctness bugs. Before this fix, Spark could give wrong results for certain correlated subqueries that pass `CheckAnalysis` (a sketch of the stricter check follows the examples below):
      Example 1:
      ```sql
      create or replace view t1(c) as values ('a'), ('b')
      create or replace view t2(c) as values ('ab'), ('abc'), ('bc')
      
      select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1
      ```
      Correct results: [(a, 2), (b, 1)]
      Spark results:
      ```
      +---+-----------------+
      |c  |scalarsubquery(c)|
      +---+-----------------+
      |a  |1                |
      |a  |1                |
      |b  |1                |
      +---+-----------------+
      ```
      Example 2:
      ```sql
      create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3);
      create or replace view t2(c) as values (6);
      
      select c, (select count(*) from t1 where a + b = c) from t2;
      ```
      Correct results: [(6, 4)]
      Spark results:
      ```
      +---+-----------------+
      |c  |scalarsubquery(c)|
      +---+-----------------+
      |6  |1                |
      |6  |1                |
      |6  |1                |
      |6  |1                |
      +---+-----------------+
      ```
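
      A hypothetical sketch of the classification idea described above (not the actual `CheckAnalysis` code): a correlated equality predicate is only safe when one side is a bare inner attribute and the other side references only outer attributes, so each outer row maps to exactly one inner value.

      ```scala
      import org.apache.spark.sql.catalyst.expressions._

      def isSafeCorrelatedEquality(e: Expression, outerAttrs: AttributeSet): Boolean = e match {
        // inner attribute = expression over outer attributes (e.g. t2a = t1a)
        case EqualTo(a: Attribute, rhs) if !outerAttrs.contains(a) =>
          rhs.references.subsetOf(outerAttrs)
        // symmetric case: expression over outer attributes = inner attribute
        case EqualTo(lhs, a: Attribute) if !outerAttrs.contains(a) =>
          lhs.references.subsetOf(outerAttrs)
        // anything else (e.g. t1.c = substring(t2.c, 1, 1) or a + b = c) is rejected
        case _ => false
      }
      ```
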
      Yes. Users will not be able to run queries that contain unsupported correlated equality predicates.
      
      Added unit tests.
      
      Closes #32179 from allisonwang-db/spark-35080-subquery-bug.
      Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
      Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      (cherry picked from commit bad4b6f0)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      c438f5fa
  11. 15 Apr 2021 (1 commit)
  12. 13 Apr 2021 (1 commit)
  13. 12 Apr 2021 (1 commit)
  14. 10 Apr 2021 (1 commit)
    • [SPARK-34963][SQL][2.4] Fix nested column pruning for extracting... · ae5568e9
      Liang-Chi Hsieh committed
      [SPARK-34963][SQL][2.4] Fix nested column pruning for extracting case-insensitive struct field from array of struct
      
      ### What changes were proposed in this pull request?
      
      This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.
      
      This is the backport of #32059 to branch-2.4.
      
      ### Why are the changes needed?
      
      Under case-insensitive mode, the nested column pruning rule cannot correctly push down the extractor of a struct field from an array of struct, e.g.:
      
      ```scala
      val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
      ```
      
      Error stack:
      ```
      [info]   java.lang.IllegalArgumentException: Field "First" does not exist.
      [info] Available fields:
      [info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
      [info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
      [info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
      [info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
      [info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
      [info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
      [info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
      ```
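
      A hypothetical sketch of the case-insensitive field resolution involved here (not the actual pruning-rule code): the requested name (`First`) may differ in case from the physical field name in the file schema (`first`), so the lookup has to honour the analyzer's case sensitivity setting.

      ```scala
      import org.apache.spark.sql.types.{StructField, StructType}

      def findField(schema: StructType, name: String, caseSensitive: Boolean): Option[StructField] = {
        if (caseSensitive) {
          schema.fields.find(_.name == name)
        } else {
          // Compare case-insensitively so "First" resolves against "first".
          schema.fields.find(_.name.equalsIgnoreCase(name))
        }
      }
      ```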
      
      ### Does this PR introduce _any_ user-facing change?
      
      No
      
      ### How was this patch tested?
      
      Unit test
      
      Closes #32112 from viirya/fix-array-nested-pruning-2.4.
      Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
      Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
      ae5568e9
  15. 09 Apr 2021 (1 commit)
    • [SPARK-34994][BUILD][2.4] Fix git error when pushing the tag after release script succeeds · b4d9d4a8
      Liang-Chi Hsieh committed
      ### What changes were proposed in this pull request?
      
      This patch proposes to fix an error when running the release script on the 2.4 branch.
      
      ### Why are the changes needed?
      
      When I ran the release script for cutting 2.4.8 RC1, either in a dry run or a normal run, I encountered the following error at the last step, "push the tag after success":
      
      ```
      fatal: Not a git repository (or any parent up to mount parent ....)
      Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
      ```
      
      ### Does this PR introduce _any_ user-facing change?
      
      No, dev only.
      
      ### How was this patch tested?
      
      Manual test.
      
      Closes #32100 from viirya/fix-release-script-2.
      Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
      Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
      b4d9d4a8
  16. 08 Apr 2021 (1 commit)
  17. 07 Apr 2021 (2 commits)
  18. 04 Apr 2021 (1 commit)
    • [SPARK-34939][CORE][2.4] Throw fetch failure exception when unable to... · 30436b54
      Liang-Chi Hsieh committed
      [SPARK-34939][CORE][2.4] Throw fetch failure exception when unable to deserialize broadcasted map statuses
      
      ### What changes were proposed in this pull request?
      
      This patch catches the `IOException` that may be thrown when map statuses cannot be deserialized (e.g., because the broadcasted value has been destroyed). Once the `IOException` is caught, a `MetadataFetchFailedException` is thrown so that Spark can handle it.
      
      This is a backport of #32033 to branch-2.4.
      
      ### Why are the changes needed?
      
      One customer encountered an application error. From the log, it was caused by accessing a non-existent broadcasted value; the broadcasted value holds the map statuses. E.g.:
      
      ```
      [info]   Cause: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
      [info]   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1410)
      [info]   at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:226)
      [info]   at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
      [info]   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
      [info]   at org.apache.spark.MapOutputTracker$.$anonfun$deserializeMapStatuses$3(MapOutputTracker.scala:967)
      [info]   at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
      [info]   at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
      [info]   at org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:887)
      [info]   at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:967)
      ```
      
      There is a race condition. After the map statuses are broadcasted, the executors obtain the serialized broadcasted map statuses. If any fetch failure happens afterwards, the Spark scheduler invalidates the cached map statuses and destroys the broadcasted value. Then, for any executor trying to deserialize the serialized broadcasted map statuses and access the broadcasted value, an `IOException` is thrown. Currently we don't catch it in `MapOutputTrackerWorker`, and the above exception fails the application.
      
      Normally we should throw a fetch failure exception in such a case, and the Spark scheduler will handle it.
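
      A hypothetical sketch of that handling (not the actual `MapOutputTracker` code): translate a deserialization `IOException` into a fetch-failure-style exception so the scheduler recomputes the map outputs instead of failing the application.

      ```scala
      import java.io.IOException

      // Inside Spark this would throw MetadataFetchFailedException so the scheduler
      // treats it as a fetch failure; here a plain RuntimeException stands in for it.
      def deserializeOrFail(
          bytes: Array[Byte],
          deserialize: Array[Byte] => Array[AnyRef]): Array[AnyRef] = {
        try {
          deserialize(bytes)
        } catch {
          case e: IOException =>
            // The broadcasted map statuses may have been destroyed concurrently.
            throw new RuntimeException("Unable to deserialize broadcasted map statuses", e)
        }
      }
      ```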
      
      ### Does this PR introduce _any_ user-facing change?
      
      No
      
      ### How was this patch tested?
      
      Unit test.
      
      Closes #32045 from viirya/fix-broadcast.
      Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
      Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
      30436b54
  19. 02 Apr 2021 (1 commit)
    • [SPARK-24931][INFRA][2.4] Fix the GA failure related to R linter · 04485fe3
      Kousuke Saruta committed
      ### What changes were proposed in this pull request?
      
      This PR backports the change of #32028 .
      
      This PR fixes the GA failure related to the R linter which happens on some PRs (e.g. #32023, #32025).
      The reason seems to be that `Rscript -e "devtools::install_github('jimhester/lintr@v2.0.0')"` fails to download `lintr@v2.0.0`.
      I don't know why, but I confirmed we can download `v2.0.1`.
      
      ### Why are the changes needed?
      
      To keep GA healthy.
      
      ### Does this PR introduce _any_ user-facing change?
      
      No.
      
      ### How was this patch tested?
      
      GA itself.
      
      Closes #32029 from sarutak/backport-SPARK-24931.
      Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
      Signed-off-by: Sean Owen <srowen@gmail.com>
      04485fe3
  20. 01 Apr 2021 (1 commit)
  21. 31 Mar 2021 (1 commit)
    • [SPARK-34909][SQL] Fix conversion of negative to unsigned in conv() · f2ddbab4
      Tim Armstrong committed
      Use `java.lang.Long.divideUnsigned()` to do integer division in `NumberConverter` to avoid a bug in `unsignedLongDiv` that produced invalid results.
      
      The previous results were incorrect; the result of the query below should be 45012021522523134134555:
      ```
      scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
      +-----------------------+
      |       conv(-10, 11, 7)|
      +-----------------------+
      |4501202152252313413456|
      +-----------------------+
      scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
      +----------------------------------------------+
      |                         hex(conv(-10, 11, 7))|
      +----------------------------------------------+
      |3435303132303231353232353233313334313334353600|
      +----------------------------------------------+
      ```
      
      `conv()` will produce different results because the bug is fixed.
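
      A standalone illustration of the difference between signed and unsigned 64-bit division that the fix relies on (not the `NumberConverter` code itself): a negative long reinterpreted as unsigned is a huge value, so signed division computes the wrong quotient.

      ```scala
      val v = -10L                                       // 2^64 - 10 when read as unsigned
      val signedQuotient    = v / 7L                     // -1, wrong for the unsigned interpretation
      val unsignedQuotient  = java.lang.Long.divideUnsigned(v, 7L)
      val unsignedRemainder = java.lang.Long.remainderUnsigned(v, 7L)
      println(s"$signedQuotient $unsignedQuotient $unsignedRemainder")
      ```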
      
      Added a simple unit test.
      
      Closes #32006 from timarmstrong/conv-unsigned.
      Authored-by: Tim Armstrong <tim.armstrong@databricks.com>
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      (cherry picked from commit 13b255fe)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      f2ddbab4
  22. 29 Mar 2021 (4 commits)
    • [SPARK-34876][SQL][2.4] Fill defaultResult of non-nullable aggregates · 38238d5e
      Tanel Kiis committed
      ### What changes were proposed in this pull request?
      
      Filled the `defaultResult` field on non-nullable aggregates
      
      ### Why are the changes needed?
      
      The `defaultResult` defaults to `None`, and in some situations (like correlated scalar subqueries) it is used as the value of the aggregation; a sketch of the idea follows the query output below.
      
      The UT result before the fix:
      ```
      -- !query
      SELECT t1a,
         (SELECT count(t2d) FROM t2 WHERE t2a = t1a) count_t2,
         (SELECT approx_count_distinct(t2d) FROM t2 WHERE t2a = t1a) approx_count_distinct_t2,
         (SELECT collect_list(t2d) FROM t2 WHERE t2a = t1a) collect_list_t2,
         (SELECT collect_set(t2d) FROM t2 WHERE t2a = t1a) collect_set_t2,
          (SELECT hex(count_min_sketch(t2d, 0.5d, 0.5d, 1)) FROM t2 WHERE t2a = t1a) collect_set_t2
      FROM t1
      -- !query schema
      struct<t1a:string,count_t2:bigint,approx_count_distinct_t2:bigint,collect_list_t2:array<bigint>,collect_set_t2:array<bigint>,collect_set_t2:string>
      -- !query output
      val1a	0	NULL	NULL	NULL	NULL
      val1a	0	NULL	NULL	NULL	NULL
      val1a	0	NULL	NULL	NULL	NULL
      val1a	0	NULL	NULL	NULL	NULL
      val1b	6	3	[19,119,319,19,19,19]	[19,119,319]	0000000100000000000000060000000100000004000000005D8D6AB90000000000000000000000000000000400000000000000010000000000000001
      val1c	2	2	[219,19]	[219,19]	0000000100000000000000020000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000001
      val1d	0	NULL	NULL	NULL	NULL
      val1d	0	NULL	NULL	NULL	NULL
      val1d	0	NULL	NULL	NULL	NULL
      val1e	1	1	[19]	[19]	0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000
      val1e	1	1	[19]	[19]	0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000
      val1e	1	1	[19]	[19]	0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000
      ```
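
      A hypothetical sketch of the pattern (not the actual aggregate implementations): a non-nullable aggregate exposes its empty-input value through `defaultResult`, so that rewrites such as correlated scalar subqueries can substitute that literal instead of `NULL` when no rows match.

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Literal
      import org.apache.spark.sql.types.{ArrayType, LongType}

      trait EmptyInputDefault {
        // None means "no safe default"; non-nullable aggregates should fill this in.
        def defaultResult: Option[Literal] = None
      }

      object CollectListLikeDefault extends EmptyInputDefault {
        // A collect_list-like aggregate over no rows yields an empty array, not NULL.
        override def defaultResult: Option[Literal] =
          Option(Literal.create(Seq.empty[Long], ArrayType(LongType)))
      }
      ```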
      
      ### Does this PR introduce _any_ user-facing change?
      
      Bugfix
      
      ### How was this patch tested?
      
      UT
      
      Closes #31991 from tanelk/SPARK-34876_non_nullable_agg_subquery_2.4.
      Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
      Signed-off-by: HyukjinKwon <gurwls223@apache.org>
      38238d5e
    • [SPARK-34855][CORE] Avoid local lazy variable in SparkContext.getCallSite · 3e65ba93
      Liang-Chi Hsieh committed
      ### What changes were proposed in this pull request?
      
      `SparkContext.getCallSite` uses a local lazy variable. In Scala 2.11, a local lazy val requires synchronization, so for a large number of job submissions in the same context it becomes a bottleneck. This is only for branch-2.4, as Scala 2.11 support was dropped in SPARK-26132.
      
      ### Why are the changes needed?
      
      To avoid a possible bottleneck for a large number of job submissions in the same context.
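
      A minimal illustration of the issue (assumed shape, not the actual `getCallSite` body): a method-local lazy val in Scala 2.11 is initialized under synchronization, so the hot path pays for locking, whereas a plain local val does not.

      ```scala
      // Scala 2.11: the lazy val below is guarded by synchronization.
      def getCallSiteWithLazy(compute: () => String): String = {
        lazy val callSite = compute()
        Option(callSite).getOrElse("unknown")
      }

      // The fix avoids the local lazy wrapper and evaluates the value directly.
      def getCallSiteEager(compute: () => String): String = {
        val callSite = compute()
        Option(callSite).getOrElse("unknown")
      }
      ```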
      
      ### Does this PR introduce _any_ user-facing change?
      
      No
      
      ### How was this patch tested?
      
      Existing tests.
      
      Closes #31988 from viirya/SPARK-34855.
      Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
      Signed-off-by: HyukjinKwon <gurwls223@apache.org>
      3e65ba93
    • Revert "[SPARK-34876][SQL] Fill defaultResult of non-nullable aggregates" · 102b7239
      HyukjinKwon committed
      This reverts commit b83ab633.
      102b7239
    • [SPARK-34876][SQL] Fill defaultResult of non-nullable aggregates · b83ab633
      Tanel Kiis committed
      ### What changes were proposed in this pull request?
      
      Filled the `defaultResult` field on non-nullable aggregates
      
      ### Why are the changes needed?
      
      The `defaultResult` defaults to `None`, and in some situations (like correlated scalar subqueries) it is used as the value of the aggregation.
      
      The UT result before the fix:
      ```
      -- !query
      SELECT t1a,
         (SELECT count(t2d) FROM t2 WHERE t2a = t1a) count_t2,
         (SELECT count_if(t2d > 0) FROM t2 WHERE t2a = t1a) count_if_t2,
         (SELECT approx_count_distinct(t2d) FROM t2 WHERE t2a = t1a) approx_count_distinct_t2,
         (SELECT collect_list(t2d) FROM t2 WHERE t2a = t1a) collect_list_t2,
         (SELECT collect_set(t2d) FROM t2 WHERE t2a = t1a) collect_set_t2,
          (SELECT hex(count_min_sketch(t2d, 0.5d, 0.5d, 1)) FROM t2 WHERE t2a = t1a) collect_set_t2
      FROM t1
      -- !query schema
      struct<t1a:string,count_t2:bigint,count_if_t2:bigint,approx_count_distinct_t2:bigint,collect_list_t2:array<bigint>,collect_set_t2:array<bigint>,collect_set_t2:string>
      -- !query output
      val1a	0	0	NULL	NULL	NULL	NULL
      val1a	0	0	NULL	NULL	NULL	NULL
      val1a	0	0	NULL	NULL	NULL	NULL
      val1a	0	0	NULL	NULL	NULL	NULL
      val1b	6	6	3	[19,119,319,19,19,19]	[19,119,319]	0000000100000000000000060000000100000004000000005D8D6AB90000000000000000000000000000000400000000000000010000000000000001
      val1c	2	2	2	[219,19]	[219,19]	0000000100000000000000020000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000001
      val1d	0	0	NULL	NULL	NULL	NULL
      val1d	0	0	NULL	NULL	NULL	NULL
      val1d	0	0	NULL	NULL	NULL	NULL
      val1e	1	1	1	[19]	[19]	0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000
      val1e	1	1	1	[19]	[19]	0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000
      val1e	1	1	1	[19]	[19]	0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000
      ```
      
      ### Does this PR introduce _any_ user-facing change?
      
      Bugfix
      
      ### How was this patch tested?
      
      UT
      
      Closes #31973 from tanelk/SPARK-34876_non_nullable_agg_subquery.
      Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
      Signed-off-by: HyukjinKwon <gurwls223@apache.org>
      (cherry picked from commit 4b9e94c4)
      Signed-off-by: HyukjinKwon <gurwls223@apache.org>
      b83ab633
  23. 26 Mar 2021 (2 commits)
    • [SPARK-34874][INFRA] Recover test reports for failed GA builds · 8062ab0a
      HyukjinKwon committed
      ### What changes were proposed in this pull request?
      
      There was a behaviour change in the download-artifact plugin (https://github.com/dawidd6/action-download-artifact/commit/621becc6d7c440318382ce6f4cb776f27dd3fef3#r48726074), and it disabled the test reporting in failed builds.
      
      This PR recovers it by explicitly setting the conclusion from the workflow runs to search for the artifacts to download.
      
      ### Why are the changes needed?
      
      In order to properly report the failed test cases.
      
      ### Does this PR introduce _any_ user-facing change?
      
      No, it's dev only.
      
      ### How was this patch tested?
      
      Manually tested at https://github.com/HyukjinKwon/spark/pull/30
      
      Before:
      
      ![Screen Shot 2021-03-26 at 10 54 48 AM](https://user-images.githubusercontent.com/6477701/112566110-b7951d80-8e21-11eb-8fad-f637db9314d5.png)
      
      After:
      
      ![Screen Shot 2021-03-26 at 5 04 01 PM](https://user-images.githubusercontent.com/6477701/112606215-7588cd80-8e5b-11eb-8fdd-3afebd629f4f.png)
      
      Closes #31970 from HyukjinKwon/SPARK-34874.
      Authored-by: HyukjinKwon <gurwls223@apache.org>
      Signed-off-by: HyukjinKwon <gurwls223@apache.org>
      (cherry picked from commit c8233f1b)
      Signed-off-by: HyukjinKwon <gurwls223@apache.org>
      8062ab0a
    • [SPARK-34607][SQL][2.4] Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u · 615dbe1c
      Takeshi Yamamuro committed
      ### What changes were proposed in this pull request?
      
      This PR intends to fix a bug in `objects.NewInstance` that occurs when a user runs Spark on jdk8u and the given `cls` in `NewInstance` is a deeply-nested inner class, e.g.:
      ```
        object OuterLevelWithVeryVeryVeryLongClassName1 {
          object OuterLevelWithVeryVeryVeryLongClassName2 {
            object OuterLevelWithVeryVeryVeryLongClassName3 {
              object OuterLevelWithVeryVeryVeryLongClassName4 {
                object OuterLevelWithVeryVeryVeryLongClassName5 {
                  object OuterLevelWithVeryVeryVeryLongClassName6 {
                    object OuterLevelWithVeryVeryVeryLongClassName7 {
                      object OuterLevelWithVeryVeryVeryLongClassName8 {
                        object OuterLevelWithVeryVeryVeryLongClassName9 {
                          object OuterLevelWithVeryVeryVeryLongClassName10 {
                            object OuterLevelWithVeryVeryVeryLongClassName11 {
                              object OuterLevelWithVeryVeryVeryLongClassName12 {
                                object OuterLevelWithVeryVeryVeryLongClassName13 {
                                  object OuterLevelWithVeryVeryVeryLongClassName14 {
                                    object OuterLevelWithVeryVeryVeryLongClassName15 {
                                      object OuterLevelWithVeryVeryVeryLongClassName16 {
                                        object OuterLevelWithVeryVeryVeryLongClassName17 {
                                          object OuterLevelWithVeryVeryVeryLongClassName18 {
                                            object OuterLevelWithVeryVeryVeryLongClassName19 {
                                              object OuterLevelWithVeryVeryVeryLongClassName20 {
                                                case class MalformedNameExample2(x: Int)
                                              }}}}}}}}}}}}}}}}}}}}
      ```
      
      The root cause that Kris (rednaxelafx) investigated is as follows (kudos to Kris):
      
      The reason why the test case above is so convoluted lies in the way Scala generates the class name for nested classes. In general, Scala generates a class name for a nested class by inserting the dollar sign ( `$` ) between each level of class nesting. The problem is that this format can concatenate into a very long string that goes beyond certain limits, so Scala changes the class name format beyond a certain length threshold.
      
      For the example above, we can see that the first two levels of class nesting have class names that look like this:
      ```
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$
      ```
      If we leave out the fact that Scala uses a dollar-sign ( `$` ) suffix for the class name of the companion object, `OuterLevelWithVeryVeryVeryLongClassName1`'s full name is a prefix (substring) of `OuterLevelWithVeryVeryVeryLongClassName2`.
      
      But if we keep going deeper into the levels of nesting, you'll find names that look like:
      ```
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$2a1321b953c615695d7442b2adb1$$$$ryVeryLongClassName8$OuterLevelWithVeryVeryVeryLongClassName9$OuterLevelWithVeryVeryVeryLongClassName10$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$2a1321b953c615695d7442b2adb1$$$$ryVeryLongClassName8$OuterLevelWithVeryVeryVeryLongClassName9$OuterLevelWithVeryVeryVeryLongClassName10$OuterLevelWithVeryVeryVeryLongClassName11$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$85f068777e7ecf112afcbe997d461b$$$$VeryLongClassName11$OuterLevelWithVeryVeryVeryLongClassName12$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$85f068777e7ecf112afcbe997d461b$$$$VeryLongClassName11$OuterLevelWithVeryVeryVeryLongClassName12$OuterLevelWithVeryVeryVeryLongClassName13$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$85f068777e7ecf112afcbe997d461b$$$$VeryLongClassName11$OuterLevelWithVeryVeryVeryLongClassName12$OuterLevelWithVeryVeryVeryLongClassName13$OuterLevelWithVeryVeryVeryLongClassName14$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$5f7ad51804cb1be53938ea804699fa$$$$VeryLongClassName14$OuterLevelWithVeryVeryVeryLongClassName15$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$5f7ad51804cb1be53938ea804699fa$$$$VeryLongClassName14$OuterLevelWithVeryVeryVeryLongClassName15$OuterLevelWithVeryVeryVeryLongClassName16$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$5f7ad51804cb1be53938ea804699fa$$$$VeryLongClassName14$OuterLevelWithVeryVeryVeryLongClassName15$OuterLevelWithVeryVeryVeryLongClassName16$OuterLevelWithVeryVeryVeryLongClassName17$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$69b54f16b1965a31e88968df1a58d8$$$$VeryLongClassName17$OuterLevelWithVeryVeryVeryLongClassName18$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$69b54f16b1965a31e88968df1a58d8$$$$VeryLongClassName17$OuterLevelWithVeryVeryVeryLongClassName18$OuterLevelWithVeryVeryVeryLongClassName19$
      org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassNam$$$$69b54f16b1965a31e88968df1a58d8$$$$VeryLongClassName17$OuterLevelWithVeryVeryVeryLongClassName18$OuterLevelWithVeryVeryVeryLongClassName19$OuterLevelWithVeryVeryVeryLongClassName20$
      ```
      with a hash code in the middle and various levels of nesting omitted.
      
      The `java.lang.Class.isMemberClass` method is implemented in JDK8u as:
      http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/tip/src/share/classes/java/lang/Class.java#l1425
      ```
          /**
           * Returns {@code true} if and only if the underlying class
           * is a member class.
           *
           * @return {@code true} if and only if this class is a member class.
           * @since 1.5
           */
          public boolean isMemberClass() {
              return getSimpleBinaryName() != null && !isLocalOrAnonymousClass();
          }
      
          /**
           * Returns the "simple binary name" of the underlying class, i.e.,
           * the binary name without the leading enclosing class name.
           * Returns {@code null} if the underlying class is a top level
           * class.
           */
          private String getSimpleBinaryName() {
              Class<?> enclosingClass = getEnclosingClass();
              if (enclosingClass == null) // top level class
                  return null;
              // Otherwise, strip the enclosing class' name
              try {
                  return getName().substring(enclosingClass.getName().length());
              } catch (IndexOutOfBoundsException ex) {
                  throw new InternalError("Malformed class name", ex);
              }
          }
      ```
      and the problematic code is `getName().substring(enclosingClass.getName().length())` -- if a class's enclosing class's full name is *longer* than the nested class's full name, this logic would end up going out of bounds.
      
      The bug has been fixed in JDK9 by https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919 , but still exists in the latest JDK8u release. So from the Spark side we'd need to do something to avoid hitting this problem.
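
      A hypothetical sketch of a defensive helper in the spirit of the `Utils.isMemberClass` added here (not necessarily the actual implementation): avoid letting the JDK8u `InternalError` propagate and fall back to enclosing-class information, which does not depend on the buggy simple-binary-name substring logic.

      ```scala
      def isMemberClassSafe(cls: Class[_]): Boolean = {
        try {
          cls.isMemberClass
        } catch {
          case _: InternalError =>
            // Approximate "member class": has an enclosing class and is neither
            // local nor anonymous (both of which have an enclosing method/constructor).
            cls.getEnclosingClass != null &&
              cls.getEnclosingMethod == null &&
              cls.getEnclosingConstructor == null
        }
      }
      ```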
      
      This is the backport of #31733.
      
      ### Why are the changes needed?
      
      Bugfix on jdk8u.
      
      ### Does this PR introduce _any_ user-facing change?
      
      No.
      
      ### How was this patch tested?
      
      Added tests.
      
      Closes #31747 from maropu/SPARK34607-BRANCH2.4.
      Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
      Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
      615dbe1c
  24. 25 Mar 2021 (1 commit)
    • [SPARK-34596][SQL][2.4] Use Utils.getSimpleName to avoid hitting Malformed... · 6ee1c08a
      Takeshi Yamamuro committed
      [SPARK-34596][SQL][2.4] Use Utils.getSimpleName to avoid hitting Malformed class name in NewInstance.doGenCode
      
      ### What changes were proposed in this pull request?
      
      Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `NewInstance.doGenCode`.
      
      NOTE: branch-2.4 does not have the interpreted implementation of `SafeProjection`, so it does not fall back into the interpreted mode if the compilation fails. Therefore, the test in this PR just checks that the compilation error happens instead of checking that the interpreted mode works well.
      
      This is the backport PR of #31709, and the credit should go to rednaxelafx.
      
      ### Why are the changes needed?
      
      On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw a `java.lang.InternalError: Malformed class name` error.
      In this particular case, creating an `ExpressionEncoder` on such a nested Scala class would create a `NewInstance` expression under the hood, which will trigger the problem during codegen.
      
      Similar to https://github.com/apache/spark/pull/29050, we should use  Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.
      
      There are two other occurrences of `java.lang.Class.getSimpleName` in the same file, but they're safe because they're guaranteed to be used only on Java classes, which don't have this problem, e.g.:
      ```scala
          // Make a copy of the data if it's unsafe-backed
          def makeCopyIfInstanceOf(clazz: Class[_ <: Any], value: String) =
            s"$value instanceof ${clazz.getSimpleName}? ${value}.copy() : $value"
          val genFunctionValue: String = lambdaFunction.dataType match {
            case StructType(_) => makeCopyIfInstanceOf(classOf[UnsafeRow], genFunction.value)
            case ArrayType(_, _) => makeCopyIfInstanceOf(classOf[UnsafeArrayData], genFunction.value)
            case MapType(_, _, _) => makeCopyIfInstanceOf(classOf[UnsafeMapData], genFunction.value)
            case _ => genFunction.value
          }
      ```
      The Unsafe-* family of types are all Java types, so they're okay.
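
      A hypothetical sketch of a safe simple-name helper in the spirit of `Utils.getSimpleName` (not the exact implementation): fall back to stripping the package and enclosing-class prefixes manually when the JDK call throws.

      ```scala
      def safeSimpleName(cls: Class[_]): String = {
        try {
          cls.getSimpleName
        } catch {
          case _: InternalError =>
            // Keep everything after the last '.' and the last '$', which is enough
            // for codegen identifiers even for deeply nested Scala classes.
            val fullName = cls.getName
            val afterPackage = fullName.substring(fullName.lastIndexOf('.') + 1)
            afterPackage.substring(afterPackage.lastIndexOf('$') + 1)
        }
      }
      ```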
      
      ### Does this PR introduce _any_ user-facing change?
      
      Fixes a bug that throws an error when using `ExpressionEncoder` on some nested Scala types, otherwise no changes.
      
      ### How was this patch tested?
      
      Added a test case to `org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite`. It'll fail on JDK8u before the fix, and pass after the fix.
      
      Closes #31888 from maropu/SPARK-34596-BRANCH2.4.
      Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
      Co-authored-by: Kris Mok <kris.mok@databricks.com>
      Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
      6ee1c08a
  25. 23 Mar 2021 (2 commits)
    • [MINOR][DOCS] Updating the link for Azure Data Lake Gen 2 in docs · e756130b
      Lena committed
      The current link for `Azure Blob Storage and Azure Datalake Gen 2` leads to AWS information. Replace the link to point to the right page.
      
      So that users can access the correct link.
      
      Yes, it fixes the link correctly.
      
      N/A
      
      Closes #31938 from lenadroid/patch-1.
      Authored-by: Lena <alehall@microsoft.com>
      Signed-off-by: Max Gekk <max.gekk@gmail.com>
      (cherry picked from commit d32bb4e5)
      Signed-off-by: Max Gekk <max.gekk@gmail.com>
      e756130b
    • [SPARK-34726][SQL][2.4] Fix collectToPython timeouts · 5685d845
      Peter Toth committed
      ### What changes were proposed in this pull request?
      
      One of our customers frequently encounters `"serve-DataFrame" java.net.SocketTimeoutException: Accept timed` errors in PySpark because `DataSet.collectToPython()` in Spark 2.4 does the following:
      1. Collects the results
      2. Opens up a socket server that is then listening to the connection from Python side
      3. Runs the event listeners as part of `withAction` on the same thread, as SPARK-25680 is not available in Spark 2.4
      4. Returns the address of the socket server to Python
      5. The Python side connects to the socket server and fetches the data
      
      As the customer has a custom, long-running event listener, the time between steps 2 and 5 is frequently longer than the default connection timeout, and increasing the connect timeout is not a good solution as we don't know how long the listeners can take.
      
      ### Why are the changes needed?
      
      This PR simply moves the socket server creation (2.) after running the listeners (3.). I think this approach has a minor side effect in that errors in socket server creation are not reported as `onFailure` events, but currently errors happening while opening the connection from the Python side or during data transfer from the JVM to Python are also not reported as events, so IMO this is not a big change.
      
      ### Does this PR introduce _any_ user-facing change?
      No.
      
      ### How was this patch tested?
      Added new UT + manual test.
      
      Closes #31818 from peter-toth/SPARK-34726-fix-collectToPython-timeouts-2.4.
      Lead-authored-by: Peter Toth <ptoth@cloudera.com>
      Co-authored-by: Peter Toth <peter.toth@gmail.com>
      Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
      5685d845
  26. 22 Mar 2021 (2 commits)
    • [SPARK-34719][SQL][2.4] Correctly resolve the view query with duplicated column names · ce58e057
      Wenchen Fan committed
      backport https://github.com/apache/spark/pull/31811 to 2.4
      
      For permanent views (and the new SQL temp view in Spark 3.1), we store the view SQL text and re-parse/analyze the view SQL text when reading the view. In the case of `SELECT * FROM ...`, we want to avoid view schema change (e.g. the referenced table changes its schema) and will record the view query output column names when creating the view, so that when reading the view we can add a `SELECT recorded_column_names FROM ...` to retain the original view query schema.
      
      In Spark 3.1 and before, the final SELECT is added after the analysis phase: https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala#L67
      
      If the view query has duplicated output column names, we always pick the first column when reading a view. A simple repro:
      ```
      scala> sql("create view c(x, y) as select 1 a, 2 a")
      res0: org.apache.spark.sql.DataFrame = []
      
      scala> sql("select * from c").show
      +---+---+
      |  x|  y|
      +---+---+
      |  1|  1|
      +---+---+
      ```
      
      In the master branch, we fail at view reading time due to https://github.com/apache/spark/commit/b891862fb6b740b103d5a09530626ee4e0e8f6e3, which adds the final SELECT during analysis, so the query fails with `Reference 'a' is ambiguous`.
      
      This PR proposes to resolve the view query output column names from the matching attributes by ordinal.
      
      For example, with `create view c(x, y) as select 1 a, 2 a`, the view query output column names are `[a, a]`. When we read the view, there are 2 matching attributes (e.g. `[a#1, a#2]`) and we can simply match them by ordinal.
      
      A negative example is
      ```
      create table t(a int)
      create view v as select *, 1 as col from t
      replace table t(a int, col int)
      ```
      When reading the view, the view query output column names are `[a, col]`, there are two attributes matching `col`, and we should fail the query. See the tests for details.
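
      A hypothetical sketch of the ordinal-matching idea (not the actual view resolution code): for each recorded output column name, collect the positions of matching attributes in the re-analyzed view query and pick them by position, failing when the occurrence counts no longer line up one-to-one.

      ```scala
      def resolveViewOutput(recordedNames: Seq[String], queryOutput: Seq[String]): Seq[Int] = {
        val recordedCounts  = recordedNames.groupBy(_.toLowerCase).mapValues(_.size)
        val positionsByName = queryOutput.zipWithIndex.groupBy(_._1.toLowerCase).mapValues(_.map(_._2))
        val consumed = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
        recordedNames.map { name =>
          val key = name.toLowerCase
          val candidates = positionsByName.getOrElse(key, Seq.empty)
          // If the view query now produces a different number of columns with this
          // name than were recorded (e.g. the underlying table gained a "col" column),
          // ordinal matching would be ambiguous, so fail instead of picking one silently.
          if (candidates.length != recordedCounts(key)) {
            throw new IllegalStateException(s"Cannot resolve view column '$name' unambiguously")
          }
          val idx = consumed(key)
          consumed(key) = idx + 1
          candidates(idx)
        }
      }
      ```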
      
      bug fix
      
      yes
      
      new test
      
      Closes #31894 from cloud-fan/backport.
      Authored-by: Wenchen Fan <wenchen@databricks.com>
      Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
      ce58e057
    • [SPARK-34811][CORE] Redact fs.s3a.access.key like secret and token · 29b981b3
      Dongjoon Hyun committed
      ### What changes were proposed in this pull request?
      
      Like we redact secrets and tokens, this PR aims to redact the access key as well.
      
      ### Why are the changes needed?
      
      The access key is also worth hiding.
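
      A hypothetical illustration of regex-based redaction of Spark properties (the pattern and replacement text below are assumptions for the example, not quoted from the source): any key matching the redaction regex has its value masked.

      ```scala
      val redactionPattern = "(?i)secret|password|token|access[.]key".r

      def redact(kvs: Seq[(String, String)]): Seq[(String, String)] =
        kvs.map { case (k, v) =>
          if (redactionPattern.findFirstIn(k).isDefined) (k, "*********(redacted)") else (k, v)
        }

      // e.g. redact(Seq("spark.hadoop.fs.s3a.access.key" -> "AKIA...")) masks the value.
      ```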
      
      ### Does this PR introduce _any_ user-facing change?
      
      This will hide the information in the Spark UI (`Spark Properties` and `Hadoop Properties`) and in logs.
      
      ### How was this patch tested?
      
      Pass the newly updated UT.
      
      Closes #31912 from dongjoon-hyun/SPARK-34811.
      Authored-by: Dongjoon Hyun <dhyun@apple.com>
      Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
      (cherry picked from commit 3c32b54a)
      Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
      29b981b3
  27. 21 Mar 2021 (1 commit)
    • [SPARK-26625] Add oauthToken to spark.redaction.regex · 7879a0ca
      Vinoo Ganesh committed
      ## What changes were proposed in this pull request?
      
      The regex (`spark.redaction.regex`) that is used to decide which config properties or environment settings are sensitive should also include `oauthToken`, to match `spark.kubernetes.authenticate.submission.oauthToken`.
      
      ## How was this patch tested?
      
      Simple regex addition - happy to add a test if needed.
      
      Author: Vinoo Ganesh <vganesh@palantir.com>
      
      Closes #23555 from vinooganesh/vinooganesh/SPARK-26625.
      
      (cherry picked from commit 01301d09)
      Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
      7879a0ca
  28. 20 Mar 2021 (2 commits)
    • [SPARK-34776][SQL][3.0][2.4] Window class should override producedAttributes · 59e4ae41
      Liang-Chi Hsieh committed
      ### What changes were proposed in this pull request?
      
      This patch proposes to override `producedAttributes` of the `Window` class.
      
      ### Why are the changes needed?
      
      This is a backport of #31897 to branch-3.0/2.4. Unlike the original PR, nested column pruning does not allow pushing through `Window` in branch-3.0/2.4 yet. But `Window` doesn't override `producedAttributes`, which is wrong and could cause potential issues, so only the `Window`-related change is backported.
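
      A hypothetical sketch of the kind of override involved (not the exact patch): a window operator generates the window expression outputs itself, so they should be reported through `producedAttributes` rather than treated as attributes coming from the child.

      ```scala
      import org.apache.spark.sql.catalyst.expressions.{AttributeSet, NamedExpression}

      trait WindowLike {
        def windowExpressions: Seq[NamedExpression]

        // Attributes generated by this node, not produced by its children.
        def producedAttributes: AttributeSet =
          AttributeSet(windowExpressions.map(_.toAttribute))
      }
      ```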
      
      ### Does this PR introduce _any_ user-facing change?
      
      No
      
      ### How was this patch tested?
      
      Existing tests.
      
      Closes #31904 from viirya/SPARK-34776-3.0.
      Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
      Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
      (cherry picked from commit 828cf76b)
      Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
      59e4ae41
    • [SPARK-34774][BUILD][2.4] Ensure change-scala-version.sh update scala.version... · c5d81cbe
      yangjie01 committed
      [SPARK-34774][BUILD][2.4] Ensure change-scala-version.sh update scala.version in parent POM correctly
      
      ### What changes were proposed in this pull request?
      After SPARK-34507, executing the `change-scala-version.sh` script updates `scala.version` in the parent POM, but if we execute the following commands in order:
      
      ```
      dev/change-scala-version.sh 2.12
      dev/change-scala-version.sh 2.11
      git status
      ```
      
      the following git diff is generated:
      
      ```
      diff --git a/pom.xml b/pom.xml
      index f4a50dc5c1..89fd7d88af 100644
      --- a/pom.xml
      +++ b/pom.xml
       @@ -155,7 +155,7 @@
           <commons.math3.version>3.4.1</commons.math3.version>
      
           <commons.collections.version>3.2.2</commons.collections.version>
      -    <scala.version>2.11.12</scala.version>
      +    <scala.version>2.12.10</scala.version>
           <scala.binary.version>2.11</scala.binary.version>
           <codehaus.jackson.version>1.9.13</codehaus.jackson.version>
           <fasterxml.jackson.version>2.6.7</fasterxml.jackson.version>
      ```
      
      It seems the 'scala.version' property was not updated correctly.
      
      So this PR adds an extra 'scala.version' property to the scala-2.11 profile to ensure change-scala-version.sh can update the public `scala.version` property correctly.
      
      ### Why are the changes needed?
      Bug fix.
      
      ### Does this PR introduce _any_ user-facing change?
      No
      
      ### How was this patch tested?
      **Manual test**
      
      Execute the following commands in order:
      
      ```
      dev/change-scala-version.sh 2.12
      dev/change-scala-version.sh 2.11
      git status
      ```
      
      **Before**
      
      ```
      diff --git a/pom.xml b/pom.xml
      index f4a50dc5c1..89fd7d88af 100644
      --- a/pom.xml
      +++ b/pom.xml
       @@ -155,7 +155,7 @@
           <commons.math3.version>3.4.1</commons.math3.version>
      
           <commons.collections.version>3.2.2</commons.collections.version>
      -    <scala.version>2.11.12</scala.version>
      +    <scala.version>2.12.10</scala.version>
           <scala.binary.version>2.11</scala.binary.version>
           <codehaus.jackson.version>1.9.13</codehaus.jackson.version>
           <fasterxml.jackson.version>2.6.7</fasterxml.jackson.version>
      ```
      
      **After**
      
      No git diff.
      
      Closes #31893 from LuciferYang/SPARK-34774-24.
      Authored-by: yangjie01 <yangjie01@baidu.com>
      Signed-off-by: Sean Owen <srowen@gmail.com>
      c5d81cbe
  29. 16 Mar 2021 (1 commit)
  30. 15 Mar 2021 (1 commit)
    • [SPARK-34743][SQL][TESTS] ExpressionEncoderSuite should use deepEquals when we... · 7b7a8fe3
      Dongjoon Hyun committed
      [SPARK-34743][SQL][TESTS] ExpressionEncoderSuite should use deepEquals when we expect `array of array`
      
      ### What changes were proposed in this pull request?
      
      This PR aims to make `ExpressionEncoderSuite` to use `deepEquals` instead of `equals` when `input` is `array of array`.
      
      This comparison code itself was added by SPARK-11727 in Apache Spark 1.6.0.
      
      ### Why are the changes needed?
      
      Currently, the interpreted mode fails for `array of array` because the following line is used.
      ```
      Arrays.equals(b1.asInstanceOf[Array[AnyRef]], b2.asInstanceOf[Array[AnyRef]])
      ```
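
      A standalone illustration of why the helper needs `deepEquals` for nested arrays (the values here are just for the example):

      ```scala
      val a = Array(Array(1, 2), Array(3))
      val b = Array(Array(1, 2), Array(3))

      // Shallow comparison checks element references, so distinct inner arrays differ.
      val shallow = java.util.Arrays.equals(a.asInstanceOf[Array[AnyRef]], b.asInstanceOf[Array[AnyRef]])
      // Deep comparison recurses into nested arrays and compares their contents.
      val deep = java.util.Arrays.deepEquals(a.asInstanceOf[Array[AnyRef]], b.asInstanceOf[Array[AnyRef]])

      println(s"shallow = $shallow, deep = $deep")  // shallow = false, deep = true
      ```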
      
      ### Does this PR introduce _any_ user-facing change?
      
      No. This is a test-only PR.
      
      ### How was this patch tested?
      
      Pass the existing CIs.
      
      Closes #31837 from dongjoon-hyun/SPARK-34743.
      Authored-by: Dongjoon Hyun <dhyun@apple.com>
      Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
      (cherry picked from commit 363a7f07)
      Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
      7b7a8fe3