    Merge with PostgreSQL version 12 (up to a point between beta2 and beta3).
    Committed by Heikki Linnakangas
    This is the point where PostgreSQL REL_12_STABLE was branched off the
    master branch and v13 development started.
    
    See the PostgreSQL release notes for v10, v11 and v12 for information
    on the upstream changes included in this merge:
    
    https://www.postgresql.org/docs/release/10.0/
    https://www.postgresql.org/docs/release/11.0/
    https://www.postgresql.org/docs/release/12.0/
    
    The two most notable upstream features that had a big impact on GPDB
    code are:
    
    1. Partitioning
    2. Table AM API.
    
    The old GPDB partitioning support was completely ripped out and
    replaced with the upstream code, and new glue code was written to keep
    the old syntax working on top of the new implementation. Similarly,
    the AO and AOCO table code was refactored to be table access methods,
    working under the new Table AM API.
    
    In addition to those big-ticket items, there are many, many smaller
    changes, detailed in the sections below. This isn't a comprehensive
    list of all upstream changes; I have only noted items that had a
    special impact on GPDB, because they work differently from upstream
    in some way, or because they affected existing GPDB code or tests. Also,
    there are many little things marked with GPDB_12_MERGE_FIXME comments
    in the code that will need to be addressed after the merge.
    
    Partitioning
    ============
    
    All the GPDB-specific code to deal with partitioning was ripped out
    and replaced with the new PostgreSQL partitioning code. Compatibility
    for the old GPDB partitioning syntax was reimplemented on top of the
    upstream code, in the parse analysis phase.
    
    Partitioning features that were removed or substantially changed
    ----------------------------------------------------------------
    
    - Multi-column list partitioning is no longer supported. As a
      work-around, you can create a composite type and use that as the
      partitioning key, like this:
    
        create type foo_partkey as (b int, c int);
        create table foo (a int, b int, c int)
          partition by list ((row(b, c)::foo_partkey));
    
      Partition pruning might not work in all the cases that it used to with
      that work-around though.
    
    - PostgreSQL range partitions don't support START EXCLUSIVE or END
      INCLUSIVE.  The start boundary is always inclusive and end
      exclusive. To support the old syntax, we convert EXCLUSIVE to
      INCLUSIVE by doing "+ 1" on the boundary value. That only works for
      datatypes that have a suitable "+" operator, like 'integer' and
      'timestamp', but not 'float' or 'text'.
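
      For example, here is a rough sketch of the conversion for an integer
      partitioning key (the table and partition names are made up):

        create table meas (id int, year int)
          distributed by (id)
          partition by range (year)
          (partition p1 start (2000) exclusive end (2004) inclusive);

        -- the bounds are converted to the upstream inclusive-start /
        -- exclusive-end convention, roughly equivalent to:
        --   FOR VALUES FROM (2001) TO (2005)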
    
    - Addressing a partition by RANK, in "ALTER TABLE ALTER/DROP PARTITION
      FOR (RANK (1))", is no longer supported.
    
    - The gp_enable_exchange_default_partition GUC was removed. With the
      upstream partitioning code, exchanging (essentially attaching a new)
      default partition carries no risk of wrong data: attaching a
      default partition validates that the data does not violate the
      partition constraints, so blocking the exchange of the default
      partition behind a GUC is no longer required. It's the same as
      exchanging any other partition.
    
    - The old GPDB implementation had a concept of "partition name" that
      was separate from "relation name". The new upstream implementation
      does not. To keep old ALTER TABLE DROP PARTITION etc. commands that
      used partition name working as much as possible, they now search for
      a partition with a rel name that's constructed using the same rules
      that the legacy CREATE TABLE syntax compatibility layer uses. So it
      mostly works, assuming you consistently use the legacy syntax. It is
      recommended to convert new applications to the new upstream syntax,
      however, to avoid ambiguity. The rules for forming relation names
      for partitions using the legacy CREATE TABLE syntax are mostly the
      same as in earlier GPDB versions, but not necessarily exactly the
      same in all cases.
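
      As an illustrative sketch (the table and partition names are made
      up), a partition created with the legacy syntax typically gets a
      relation name derived from the parent name and the partition name:

        create table sales (id int, month int)
          distributed by (id)
          partition by list (month)
          (partition jan values (1), partition feb values (2));

        -- the children are typically named sales_1_prt_jan and
        -- sales_1_prt_feb, so
        alter table sales drop partition jan;
        -- resolves the name "jan" against a child named sales_1_prt_jan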
    
    - The old partitioning-related catalog tables, pg_partition and
      pg_partition_rule are gone, as well as the pg_partition_def()
      function. Partitioning information is now stored in the upstream
      pg_partitioned_table catalog, and in pg_class in relispartition and
      relpartbound fields. PostgreSQL also has three helper functions:
      pg_partition_ancestors(rel), pg_partition_root(rel) and
      pg_partition_tree(rel).
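
      For example, to list all relations in a partition hierarchy (the
      table name is made up):

        select relid, parentrelid, isleaf, level
        from pg_partition_tree('sales');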
    
    - Partition boundaries are no longer represented as CHECK
      constraints. Partition constraints are a completely separate concept
      now.
    
    - Partitioned tables have no access method. They used to be heap or AO
      or AOCO, even though they contained no data, but not anymore.
    
    - A "partition template" is created if you use the legacy CREATE
      TABLE with SUBPARTITION TEMPLATE syntax. The partition template
      is used if you add a partition with ALTER TABLE ADD PARTITION, but
      not with the upstream CREATE TABLE PARTITION OF syntax. The templates
      are stored in the new gp_partition_template catalog (previously, in
      pg_partition_rule). The representation stored in the catalog is the "raw"
      parser output, before resolving type names etc. Partition templates are
      not currently dumped by pg_dump, so they will be lost on dump & restore
      or pg_upgrade.
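
      A rough sketch of the legacy syntax that creates a template (the
      table, subpartition and column names are made up):

        create table sales (id int, year int, region text)
          distributed by (id)
          partition by range (year)
            subpartition by list (region)
            subpartition template
            (subpartition usa values ('usa'),
             subpartition asia values ('asia'))
          (start (2020) end (2022) every (1));

        -- the template is applied when adding a partition with the
        -- legacy ALTER TABLE syntax:
        alter table sales add partition start (2022) end (2023);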
    
    Other notable partitioning changes
    ----------------------------------
    
    - pg_dump always uses the upstream syntax now. That fixes a lot of the
      issues we've had with dump & restore of complex partitioning
      cases. See github issues #2647, #2648, #3598, #5003, #6455. (I
      haven't verified that all of those issues are really fixed by this,
      but they should be).
    
    - In PostgreSQL, GRANT/REVOKE on a partitioned table only affects the
      partitioned table itself, not the partitions. Permission on the
      partitioned table is enough to query the partitioned table and all
      its children, but it makes a difference if you query the partition
      directly, or if the partition is detached. We kept the old GPDB
      behavior where GRANT/REVOKE recurses on the child partitions.
      (There is no ONLY syntax on GRANT/REVOKE that could be used to
      control it.)
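
      For example (the table and role names are made up):

        grant select on sales to report_role;
        -- in GPDB this also grants SELECT on every existing partition of
        -- 'sales', so report_role can also query the partitions directly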
    
    - Similarly, when you create a new partition with CREATE TABLE
      PARTITION OF, the permissions of the new parent table are copied to
      the new partition.
    
    - Updating the partitioning key is now allowed, and the row is moved
      to the correct partition. This used to throw an error.
    
    - Unlogged tables are now allowed alongside regular tables in a
      partition hierarchy.  Temporary and permanent tables still cannot be
      mixed.
    
    - Inserting into an intermediate partition is now allowed. The row
      will be routed to the correct partition, like in PostgreSQL. It used
      to be forbidden in GPDB.
    
    - Unlike PostgreSQL, we allow COPY TO on a partitioned table. That
      worked in GPDB before, so we needed to keep it working. It was
      re-implemented by automatically transforming the "COPY <table> TO
      ..." command to "COPY (SELECT * FROM <table>) TO ...". The hint in
      the upstream error message suggests doing that, but we now do it
      automatically. There is already code to do that transformation, for
      COPY TO when Row-Level Security is enabled, so this was a one line
      change.
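
      For example, a command like this on a partitioned table (the name is
      made up):

        COPY sales TO STDOUT CSV;

      is now executed as if it had been written:

        COPY (SELECT * FROM sales) TO STDOUT CSV;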
    
    - REINDEX TABLE on a partitioned table recurses to the partitions, and
      reindexes all indexes on all partitions. This is different from
      upstream, where REINDEX TABLE on a partitioned table emits a warning and
      does nothing. But it is the same behavior as on previous versions of
      GPDB.
    
    - REINDEX INDEX on a partitioned index is a no-op, as a partitioned
      table doesn't contain any data. In PostgreSQL, it throws an error,
      but we have allowed it historically in GPDB.
    
    - There are behavioral changes in how triggers are fired with
      partitioned tables; we now follow the upstream rules. Triggers on a
      partitioned table are fired when you insert/delete/update it. If you
      update a row from one partition to another, delete+insert triggers
      are fired. The 'triggers_gp' regression test captures the different
      combinations; take a look at the changes to the expected output.
      (Statement triggers are still not supported in GPDB.)
    
    - PARTITION BY is now accepted on either side of the WITH clause in
      CREATE TABLE. The old GPDB syntax had it after the WITH clause, the
      upstream syntax has it before, so both placements are allowed.
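
      A rough sketch of the two placements (the options and names here are
      only illustrative):

        -- legacy GPDB placement, PARTITION BY after the WITH clause:
        create table t1 (a int, b int)
          with (appendonly=true)
          distributed by (a)
          partition by range (b) (start (1) end (4) every (1));

        -- the upstream-style placement, with PARTITION BY before the WITH
        -- clause, is now accepted as well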
    
    - Lots of error messages related to partitioning are now different,
      since we rely on upstream code.
    
    - Removed a lot of NOTICEs when partitions are created or dropped.
    
    Partition Selectors
    -------------------
    
    GPDB-specific Partition Selector nodes are still created, for
    partition pruning on joins, but the planner code to create them was
    largely rewritten, to leverage the upstream partition pruning code as
    much as possible:
    
    - In the executor, Partition Selectors now use the upstream partition
      pruning infrastructure to compute which partitions can contain
      tuples that pass the join qual. The partitions are recorded in a
      Bitmapset, like the upstream partition pruning code does, and the
      Bitmapset is passed from the PartitionSelector to the Append node
      using a special executor Param.  The old hash table that contained
      the partition OIDs is gone, as is the PartSelected expression. The
      Append node now directly skips the pruned partitions.
    
    - In planner, use the upstream code to construct the
      PartitionPruneInfo steps to do the pruning. Thanks to this, join
      pruning now also works for quals with <, > operators, and for tables
      partitioned on expressions rather than plain columns.
    
    - In planner, refactor where the decision to do join pruning is
      made. It's now decided in create_append_plan(), where run-time
      pruning is decided, too. To pass the partition pruning info to the
      Partition Selector above the Append, maintain a stack of joins that
      are candidates for Partition Selectors as we recurse in
      create_plan().
    
    ORCA only supports very simple static partition pruning currently, and
    falls back for most cases involving partitioned tables. FIXME comments
    are added to the code and/or tests where that happens now, and work is
    scheduled to re-integrate more partitioning support later.
    
    Append-optimized tables and the new Table AM API
    ================================================
    
    PostgreSQL v12 introduced a new API for table access methods that
    allows replacing the usual heap storage with something else. The AO
    and AOCO table code was refactored to sit behind the new API. There
    are now two extra table AMs in GPDB in addition to the 'heap':
    'ao_row' and 'ao_column'. The old syntax, "WITH (appendonly=true)", is
    now mapped to "CREATE TABLE ... USING ao_row", and similarly for
    column-oriented AOCO tables.
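
    For example, these pairs of statements now create the same kind of
    table (the table names are made up):

        create table t_ao  (a int, b int) with (appendonly=true);
        create table t_ao2 (a int, b int) using ao_row;

        create table t_co  (a int, b int) with (appendonly=true, orientation=column);
        create table t_co2 (a int, b int) using ao_column;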
    
    The 'appendonly' and 'orientation' options are no longer accepted in
    the 'gp_default_storage_options' GUC. Its interactions with
    'default_table_access_method' and options given in the DDL command
    became too unwieldy, so it was deemed better to stop supporting it.
    You can use the upstream 'default_table_access_method' GUC instead.
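
    For example, instead of setting 'appendonly' and 'orientation' in
    gp_default_storage_options, you can now do:

        set default_table_access_method = ao_column;
        create table t (a int, b int);  -- created as an ao_column table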
    
    There are some other small differences in what table options are
    accepted in what contexts, where the old behavior was too difficult to
    maintain exactly. For example, if you specify 'compresslevel' and
    'compresstype' in the CREATE TABLE WITH clause, but some of the
    partitions use heap storage, you now get an error. It used to be
    accepted, but the options only affected the AO partitions.
    
    Table inheritance is now allowed for AOCO tables. It was previously
    disallowed for no discernible reason, and it seems to just work now.
    
    The table AM API doesn't cover all the places where we previously had
    to modify upstream code. There are still some
    RelationIsAppendOptimized() calls sprinkled in the backend code, but
    far fewer than before. One big difference is that updates to AO
    tables require a call to appendonly_dml_init() or aoco_dml_init()
    before modifying the table, and appendonly_dml_finish() or
    aoco_dml_finish() afterwards.
    
    The access method is now represented in the catalogs like in upstream,
    in the pg_class.relam field. The old GPDB-specific pg_class.relstorage
    column has been removed.
    
    Planner changes
    ===============
    
    - Big changes to partition planning and pruning, as we merged the upstream
      partitioning code. Most GPDB partitioning code has been replaced with
      the corresponding upstream code. See section "Partitioning" for details.
    
    - PostgreSQL v12 planner learned to inline side-effect free CTEs. That
      is, if you do "WITH foo AS (SELECT ...) SELECT * FROM foo, ...", the
      'foo' subquery is inlined into the main query and planned as
      one. Previously, the CTE was always planned separately, and
      materialized in full. GPDB has always supported inlining CTEs, so
      this isn't new to GPDB, but we adopted the code changes related to
      that as much as possible. In PostgreSQL, the feature can be
      controlled by specifying "WITH foo AS MATERIALIZED (...)"  or NOT
      MATERIALIZED, and that now works in GPDB too.
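
      For example (the table and column names are made up):

        -- force the CTE to be planned separately and materialized once:
        with foo as materialized (select * from t where x > 0)
        select * from foo a join foo b on a.id = b.id;

        -- explicitly allow inlining into the outer query:
        with foo as not materialized (select * from t where x > 0)
        select * from foo where id = 1;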
    
      In GPDB, the inlining could previously be controlled with the
      "gp_cte_sharing" GUC. It still works, but if MATERIALIZED or NOT
      MATERIALIZED was specified in the query, it takes precedence. Like in
      previous releases, even if gp_cte_sharing is set to 'on', the
      subquery is inlined if there is only one reference to it in the
      query.
    
      When gp_cte_sharing = 'off' (the default), GPDB inlines the
      subqueries even if it's not strictly legal to do so. See
      https://github.com/greenplum-db/gpdb/issues/1349. This merge doesn't
      change that.
    
    - Partitionwise joins and aggregates were introduced in PostgreSQL
      v11. They work in GPDB too. The main benefit of partitionwise
      aggregation is that if the partition hierarchy contains a foreign
      table, the aggregation can be pushed down to the foreign server.
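
      As in upstream, these plan shapes can be enabled with the
      partitionwise GUCs:

        set enable_partitionwise_join = on;
        set enable_partitionwise_aggregate = on;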
    
    - GROUPING SETS can now be implemented by hashing, or a mix of hashing
      and sorting. It shows up as a MixedAggregate node in EXPLAIN. However,
      the GPDB-specific multi-stage aggregate planning code
      (in cdbgroupingpaths.c) has not been updated to take advantage of that,
      so you only get such plans with one-stage aggregates, when the grouping
      keys happen to coincide with the table's distribution keys. This is a
      TODO.
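
      For example, a query like this (the table is made up, and assumed to
      be distributed by column 'a') can now be planned with a
      MixedAggregate:

        select a, b, count(*)
        from t
        group by grouping sets ((a), (a, b));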
    
    Other changes
    =============
    
    - The default logging directory, 'pg_log', was renamed to just 'log'
      in PostgreSQL v10. That might affect tools that would try to look
      into the logs. (The 'pg_xlog' and 'pg_clog' directories were also
      renamed to 'pg_wal' and 'pg_xact', respectively, but external tools
      shouldn't be peeking into those directories directly.)
    
    - JIT compilation, introduced in PostgreSQL v11, works. We haven't
      done anything extra for JIT compilation of GPDB-specific code, but
      everything that evaluates expressions benefits from the upstream JIT
      support, also in GPDB-specific node types.
    
    - Hash indexes are enabled now. We had disabled them completely in
      GPDB, because they were not WAL-logged. Since PostgreSQL v10 they
      are, so we can enable them.
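
      For example, this is now accepted and the index is WAL-logged (the
      table name is made up):

        create index on t using hash (a);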
    
    - "MK sort", in tuplesort_mk.c, is gone. The "gp_enable_mk_sort" GUC
      is gone too. We always use the upstream sort code now. There have
      been performance improvements to it in the upstream releases, but
      this is still a performance regression in cases where the MK sort
      was particularly good, i.e. when you have multiple ORDER BY
      columns with lots of duplicates. We might still reintroduce MK sort
      later in some form, as a separate PR, but that work should be done
      on pgsql-hackers for PostgreSQL first.
    
    - The GPDB-specific "Hybrid Hash Agg" code is gone. Its purpose was to
      support spilling large hash aggregates to disk, instead of running
      out of memory. That functionality has been replaced by backporting
      the functionally similar hash agg spilling feature from PostgreSQL
      v13. That consisted of a number of upstream commits:
    
        a10701c0e3 Allow simplehash to use already-calculated hash values.
        81080df32b Add transparent block-level memory accounting
        87b6b57191 Fix edge case leading to agg transitions skipping ExecAggTransReparent() calls.
        f76e8f8507 Refactor hash_agg_entry_size().
        4ac4bd07fa Introduce TupleHashTableHash() and LookupTupleHashEntryHash().
        5dddc85d78 Change signature of TupleHashTableHash().
        d238458f6d Minor refactor of nodeAgg.c.
        a9e9e9bc8e Fixup for nodeAgg.c refactor.
        b27eec5ea6 Extend ExecBuildAggTrans() to support a NULL pointer check.
        0a390a8e8b Introduce LogicalTapeSetExtend().
        5d9fd3ae3d Disk-based Hash Aggregation.
        5471822147 Fixes for Disk-based Hash Aggregation.
        8b6151eae7 Avoid allocating unnecessary zero-sized array.
        04fced3688 Fix costing for disk-based hash aggregation.
        e4ee460db3 Include chunk overhead in hash table entry size estimate.
        255871265e Create memory context for HashAgg with a reasonable maxBlockSize.
        02f3d5670a Fix HashAgg regression from choosing too many initial buckets.
        f12207ca4e Logical Tape Set: use min heap for freelist.
    
      As a result, the nodeAgg.c code is more similar to the v13 version
      than v12. TODO: backpatch any followup commits from REL_13_STABLE
      before release.
    
    - Parallelism is still disabled in GPDB. That includes the new
      parallel CREATE INDEX code.
    
    - Fix behavior with UPDATE WHERE CURRENT OF. As you can see in the
      changes to expected output of the 'portals_updatable' test, some of
      the test results changed. As far as I can see, the old behaviour
      memorized in the expected output was incorrect. PostgreSQL has
      always returned the result we're now seeing on the merge branch,
      which is different from the result you get on master.  The test had
      not been modified since 2012, when the test was added along with
      support for WHERE CURRENT OF. I could not find any explanation in
      git history or in test comments. I believe it was simply a bug when
      WHERE CURRENT OF support was added, and no one noticed that the test
      result was wrong.
    
    - The GPDB implementation of "RANGE offset PRECEDING/FOLLOWING" in
      window functions was replaced with the upstream implementation that
      was introduced in version 11. The upstream implementation is more
      strict about the datatypes than the code in GPDB master. However,
      looking at old JIRA ticket MPP-5246 that was mentioned in one of the
      tests, it had a test case like this attached to it:
    
        explain select cn, count(*) over (order by dt range between '2 day'::interval preceding and 2 preceding) from mpp5246_sale;
        ERROR:  can't coerce leading frame bound to type of trailing frame bound
        HINT:  specify the leading and trailing frame bounds as the same type
    
      So apparently we were strict about the types back in 2009, and you
      got that error up to 5X_STABLE, but it got relaxed in 6X_STABLE. And
      now it becomes more strict with the upstream code again. I think
      that's fine, although we probably shouldn't have relaxed it in
      6X_STABLE. Too late to change that.
    
    - Remove the undocumented 'gp_indexcheck_vacuum' GUC. A new contrib
      module, contrib/amcheck, was added in PostgreSQL v10, with similar
      functionality.
    
    - Logical replication is still disabled in GPDB.
    
    - The default of 'hot_standby' was changed to 'on' in upstream, but
      hot standby is still not supported in GPDB, and the default in GPDB
      remains 'off'. It is now possible to turn it 'on' manually in
      mirrors, though. That allows tools like pg_basebackup to work, and
      it allows running the upstream replication tests in
      src/test/recovery. However, there's no user-visible, supported, new
      functionality here.
    
    - pg_basebackup got a feature to verify checksums, but that has been
      disabled in the wrapper python scripts. So checksums are not
      verified by the management utilities that use pg_basebackup under
      the hood (gpexpand, gpinitstandby, gpconfigurenewsegment). Likewise,
      tests that use pg_basebackup were modified with the
      --no-verify-checksums flag. The reason for this is that the checksum
      verification is broken for AO/AOCO tables, which don't have page
      checksum like other relations do, and pg_basebackup cannot
      distinguish them. We ought to fix this somehow, but for now it's a
      FIXME.
    
    - pg_ctl -w now waits for distributed transaction recovery.
    
      When starting the master in distributed mode, pg_ctl -w in GPDB
      should wait for distributed transaction recovery to complete, as
      connections are only allowed after that.

      A new state, "dtmready", is recorded in the PID file to mark the
      completion of distributed transaction recovery. When the dtx
      background worker finishes distributed transaction recovery, the
      postmaster writes this state to the file. pg_ctl waits for this
      "dtmready" state instead of the "ready" state when starting the
      master in distributed mode, which is conveyed using the "-E"
      postmaster option.

      This change should meet the previous expectation that after
      successful completion of gpstart, distributed transactions can be
      made.
    
    - If a client tries to connect while DTM recovery is in progress, you
      now get an error. Client connections used to wait for the dtx
      recovery process to complete distributed transaction recovery, and
      depending on the situation that wait could be infinite. To avoid
      hung connections in situations where dtx recovery can't complete
      the distributed transactions, it is better to error out and let the
      client retry, similar to local crash recovery.
      (Discussion:
      https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/dgU9il8l24A/NTAz2cl5BQAJ)
    
    Source code and test changes
    ============================
    
    - PostgreSQL v11 replaced the way the initial catalog contents are
      stored in the source code. Instead of the "DATA(insert ..." lines in
      the .h files, the initial contents are stored as Perl data
      structures in separate .dat files. The new format is much more
      convenient to edit, and makes it easier to carry diffs against them, so
      we don't need the GPDB-specific perl scripts to provide defaults for
      GPDB-added catalog columns anymore, nor the "catullus.pl" script
      that was used to generate pg_proc.h contents for GPDB-added
      functions. That's all gone, and all GPDB-specific changes are made
      to the .dat files directly.
    
    - PostgreSQL v12 removed support for OIDs in user tables, and changed
      all catalog tables with OIDs so that the OID is a regular column. A
      lot of GPDB specific code and tests to deal with OIDs was removed
      with that.
    
    - A new Kerberos test suite was added to PostgreSQL v11. We had one in
      GPDB already, under src/test/gpkerberos. It's not clear how much
      overlap there is between the two; we could possibly get rid of the
      GPDB test suite now, but for now, keep them both.
    
    - Support for passing around MemTuples in TupleTableSlots was
      removed. MemTuple is still the native format of AO (row) tables, but
      is deformed and stored in a virtual tuple slot as soon as it's read
      from disk. MemTuple is no longer used as the on-wire format in
      Motions either, nor as the on-disk format for temp files when
      sorting; we now use upstream MinimalTuple for those.
    
    - HeapTuple->t_tableOid has been resurrected. We had removed it in
      previous GPDB versions for performance reasons, but I don't believe
      that the overhead is really noticeable, at least not in recent
      PostgreSQL versions.
    
    - The pg_rewind regression tests were previously implemented in GPDB
      as a shell script. It was done that way when we backported the
      upstream tests but didn't have the TAP test framework yet. Now we
      do, so replace the shell implementation with the upstream TAP tests.
    
    - The way ALTER TABLE commands are dispatched was refactored. Some
      ALTER TABLE subcommands need to dispatch some extra information from
      QD to QEs.  Previously, the ATExec* functions modified the
      sub-commands in the QD, and the modified subcommands were dispatched
      to the QEs. The problem with this approach is that the QEs try to
      run the ATPrep* phase on the already-prepped commands, so we had to
      hack the ATPrep* functions to cope with that. That was causing merge
      conflicts, which could be fixed, but I was not happy with the diff
      footprint of that approach. Instead, dispatch the AlteredTableInfo
      structs that were constructed in the QD's prep phase. That way, the
      execute phase can modify the sub-commands and those modifications
      are dispatched.
    
    - Upstream isolation tests (src/test/isolation) are now run in utility
      mode. Instead of disabling or heavily modifying the upstream
      isolation tests, it's better to run them in utility mode (a kind of
      single-node postgres) and get the code coverage. This still doesn't
      test GPDB clustered behavior; the framework needs to be enhanced to
      understand locking across segments and to work with the global
      deadlock detector.
    
    - SRFs in target lists. We adopted the new upstream node type,
      ProjectSet, to deal with SRFs in target lists. As a performance
      optimization, there were some GPDB changes in the planner earlier,
      to always use a Result node if there were SRFs in the target list,
      and the SRF-in-targetlist support was removed from other executor
      nodes. That was pretty much the same changes that were made in
      upstream, except that upstream uses ProjectSet instead of Result,
      and the behavior was changed to be more sane when you had multiple
      SRFs. Replace all the GPDB changes with upstream. Due to the
      behavior changes with multiple SRFs, some INSERT statements in GPDB
      tests had to be modified so that they generate the same test data as
      before.
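
      For example, with multiple SRFs in the target list:

        select generate_series(1, 2), generate_series(1, 3);

        -- with the upstream ProjectSet behavior this returns 3 rows, the
        -- shorter SRF padded with NULLs; the old behavior (as in pre-v10
        -- PostgreSQL) produced the least common multiple of the row
        -- counts, 6 rows here
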
    Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
    Co-authored-by: Adam Lee <adlee@vmware.com>
    Co-authored-by: Alexandra Wang <walexandra@vmware.com>
    Co-authored-by: Andrey Borodin <amborodin@acm.org>
    Co-authored-by: Ashwin Agrawal <aashwin@vmware.com>
    Co-authored-by: Asim Praveen <pasim@vmware.com>
    Co-authored-by: Chris Hajas <chajas@vmware.com>
    Co-authored-by: Daniel Gustafsson <dgustafsson@pivotal.io>
    Co-authored-by: David Kimura <dkimura@pivotal.io>
    Co-authored-by: Georgios Kokolatos <gkokolatos@pivotal.io>
    Co-authored-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
    Co-authored-by: Hubert Zhang <hzhang@pivotal.io>
    Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
    Co-authored-by: Ning Yu <nyu@pivotal.io>
    Co-authored-by: Pengzhou Tang <ptang@pivotal.io>
    Co-authored-by: Soumyadeep Chakraborty <sochakraborty@pivotal.io>
    Co-authored-by: Weinan Wang <wewang@pivotal.io>
    Co-authored-by: Zhenghua Lyu <kainwen@gmail.com>