    Merge with PostgreSQL version 12 (up to a point between beta2 and beta3).
    Committed by Heikki Linnakangas
    This is the point where PostgreSQL REL_12_STABLE was branched off the
    master branch and v13 development started.
    
    See the PostgreSQL release notes for v10, v11 and v12 for information
    on the upstream changes included in this merge:
    
    https://www.postgresql.org/docs/release/10.0/
    https://www.postgresql.org/docs/release/11.0/
    https://www.postgresql.org/docs/release/12.0/
    
    The two most notable upstream features that had a big impact on GPDB
    code are:
    
    1. Partitioning
    2. Table AM API.
    
    The old GPDB partitioning support was completely ripped out and
    replaced with the upstream code, and new glue code was written to keep
    the old syntax working on top of the new implementation. Similarly,
    the AO and AOCO table code was refactored to be table access methods,
    working under the new Table AM API.
    
    In addition to those big-ticket items, there are many, many smaller
    changes, detailed in the sections below. This isn't a comprehensive
    list of all upstream changes; I have only noted items that had a
    special impact on GPDB, because they work differently from upstream
    in some way, or because they affected existing GPDB code or tests. Also,
    there are many little things marked with GPDB_12_MERGE_FIXME comments
    in the code that will need to be addressed after the merge.
    
    Partitioning
    ============
    
    All the GPDB-specific code to deal with partitioning was ripped out
    and replaced with the new PostgreSQL partitioning code. Compatibility
    for the old GPDB partitioning syntax was reimplemented on top of the
    upstream code, in the parse analysis phase.
    
    Partitioning features that were removed or substantially changed
    ----------------------------------------------------------------
    
    - Multi-column list partitioning is no longer supported. As a
      work-around, you can create a composite type and use that as the
      partitioning key, like this:
    
        create type foo_partkey as (b int, c int);
        create table foo (a int, b int, c int)
          partition by list ((row(b, c)::foo_partkey));
    
      Partition pruning might not work in all the cases that it used to with
      that work-around though.
    
    - PostgreSQL range partitions don't support START EXCLUSIVE or END
      INCLUSIVE.  The start boundary is always inclusive and end
      exclusive. To support the old syntax, we convert EXCLUSIVE to
      INCLUSIVE by doing "+ 1" on the boundary value. That only works for
      datatypes that have a suitable "+" operator, like 'integer' and
      'timestamp', but not 'float' or 'text'.
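
      For example, here is a rough sketch of the conversion for an integer
      partitioning key (the table and partition names are made up):

        create table meas (id int, year int)
          distributed by (id)
          partition by range (year)
          (partition p1 start (2000) exclusive end (2004) inclusive);

        -- the bounds are converted to the upstream inclusive-start /
        -- exclusive-end convention, roughly equivalent to:
        --   FOR VALUES FROM (2001) TO (2005)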
    
    - Addressing a partition by RANK, in "ALTER TABLE ALTER/DROP PARTITION
      FOR (RANK (1))", is no longer supported.
    
    - The gp_enable_exchange_default_partition GUC was removed. With the
      upstream partitioning code, exchanging (essentially attaching a new)
      default partition carries no risk of wrong data: attaching a
      default partition validates that the data does not violate the
      partition constraints, so blocking the exchange of the default
      partition behind a GUC is no longer required. It's the same as
      exchanging any other partition.
    
    - The old GPDB implementation had a concept of "partition name" that
      was separate from "relation name". The new upstream implementation
      does not. To keep old ALTER TABLE DROP PARTITION etc. commands that
      used partition name working as much as possible, they now search for
      a partition with a rel name that's constructed using the same rules
      that the legacy CREATE TABLE syntax compatibility layer uses. So it
      mostly works, assuming you consistently use the legacy syntax. It is
      recommended to convert new applications to the new upstream syntax,
      however, to avoid ambiguity. The rules for forming relation names
      for partitions using the legacy CREATE TABLE syntax are mostly the
      same as in earlier GPDB versions, but not necessarily exactly the
      same in all cases.
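
      As an illustrative sketch (the table and partition names are made
      up), a partition created with the legacy syntax typically gets a
      relation name derived from the parent name and the partition name:

        create table sales (id int, month int)
          distributed by (id)
          partition by list (month)
          (partition jan values (1), partition feb values (2));

        -- the children are typically named sales_1_prt_jan and
        -- sales_1_prt_feb, so
        alter table sales drop partition jan;
        -- resolves the name "jan" against a child named sales_1_prt_jan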
    
    - The old partitioning-related catalog tables, pg_partition and
      pg_partition_rule are gone, as well as the pg_partition_def()
      function. Partitioning information is now stored in the upstream
      pg_partitioned_table catalog, and in pg_class in relispartition and
      relpartbound fields. PostgreSQL also has three helper functions:
      pg_partition_ancestors(rel), pg_partition_root(rel) and
      pg_partition_tree(rel).
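
      For example, to list all relations in a partition hierarchy (the
      table name is made up):

        select relid, parentrelid, isleaf, level
        from pg_partition_tree('sales');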
    
    - Partition boundaries are no longer represented as CHECK
      constraints. Partition constraints are a completely separate concept
      now.
    
    - Partitioned tables have no access method. They used to be heap or AO
      or AOCO, even though they contained no data, but not anymore.
    
    - A "partition template" is created if you use the legacy CREATE
      TABLE with SUBPARTITION TEMPLATE syntax. The partition template
      is used if you add a partition with ALTER TABLE ADD PARTITION, but
      not with the upstream CREATE TABLE PARTITION OF syntax. The templates
      are stored in the new gp_partition_template catalog (previously, in
      pg_partition_rule). The representation stored in the catalog is the "raw"
      parser output, before resolving type names etc. Partition templates are
      not currently dumped by pg_dump, so they will be lost on dump & restore
      or pg_upgrade.
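
      A rough sketch of the legacy syntax that creates a template (the
      table, subpartition and column names are made up):

        create table sales (id int, year int, region text)
          distributed by (id)
          partition by range (year)
            subpartition by list (region)
            subpartition template
            (subpartition usa values ('usa'),
             subpartition asia values ('asia'))
          (start (2020) end (2022) every (1));

        -- the template is applied when adding a partition with the
        -- legacy ALTER TABLE syntax:
        alter table sales add partition start (2022) end (2023);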
    
    Other notable partitioning changes
    ----------------------------------
    
    - pg_dump always uses the upstream syntax now. That fixes a lot of the
      issues we've had with dump & restore of complex partitioning
      cases. See github issues #2647, #2648, #3598, #5003, #6455. (I
      haven't verified that all of those issues are really fixed by this,
      but they should be).
    
    - In PostgreSQL, GRANT/REVOKE on a partitioned table only affects the
      partitioned table itself, not the partitions. Permission on the
      partitioned table is enough to query the partitioned table and all
      its children, but it makes a difference if you query the partition
      directly, or if the partition is detached. We kept the old GPDB
      behavior where GRANT/REVOKE recurses on the child partitions.
      (There is no ONLY syntax on GRANT/REVOKE that could be used to
      control it.)
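
      For example (the table and role names are made up):

        grant select on sales to report_role;
        -- in GPDB this also grants SELECT on every existing partition of
        -- 'sales', so report_role can also query the partitions directly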
    
    - Similarly, when you create a new partition with CREATE TABLE
      PARTITION OF, the permissions of the new parent table are copied to
      the new partition.
    
    - Updating the partitioning key is now allowed, and the row is moved
      to the correct partition. This used to throw an error.
    
    - Unlogged tables are now allowed alongside regular tables in a
      partition hierarchy.  Temporary and permanent tables still cannot be
      mixed.
    
    - Inserting into an intermediate partition is now allowed. The row
      will be routed to the correct partition, like in PostgreSQL. It used
      to be forbidden in GPDB.
    
    - Unlike PostgreSQL, we allow COPY TO on a partitioned table. That
      worked in GPDB before, so we needed to keep it working. It was
      re-implemented by automatically transforming the "COPY <table> TO
      ..." command to "COPY (SELECT * FROM <table>) TO ...". The hint in
      the upstream error message suggests doing that, but we now do it
      automatically. There is already code to do that transformation, for
      COPY TO when Row-Level Security is enabled, so this was a one line
      change.
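
      For example, a command like this on a partitioned table (the name is
      made up):

        COPY sales TO STDOUT CSV;

      is now executed as if it had been written:

        COPY (SELECT * FROM sales) TO STDOUT CSV;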
    
    - REINDEX TABLE on a partitioned table recurses to the partitions, and
      reindexes all indexes on all partitions. This is different from
      upstream, where REINDEX TABLE on a partitioned table emits a warning and
      does nothing. But it is the same behavior as on previous versions of
      GPDB.
    
    - REINDEX INDEX on a partitioned index is a no-op, as a partitioned
      table doesn't contain any data. In PostgreSQL, it throws an error,
      but we have allowed it historically in GPDB.
    
    - There are behavioral changes in how triggers are fired with
      partitioned tables; we now follow the upstream rules. Triggers on a
      partitioned table are fired when you insert/delete/update it. If you
      update a row from one partition to another, delete+insert triggers
      are fired. The 'triggers_gp' regression test captures the different
      combinations; take a look at the changes to the expected output.
      (Statement triggers are still not supported in GPDB.)
    
    - PARTITION BY is now accepted on either side of the WITH clause in
      CREATE TABLE. The old GPDB syntax had it after the WITH clause, the
      upstream syntax has it before, so both placements are allowed.
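
      A rough sketch of the two placements (the options and names here are
      only illustrative):

        -- legacy GPDB placement, PARTITION BY after the WITH clause:
        create table t1 (a int, b int)
          with (appendonly=true)
          distributed by (a)
          partition by range (b) (start (1) end (4) every (1));

        -- the upstream-style placement, with PARTITION BY before the WITH
        -- clause, is now accepted as well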
    
    - Lots of error messages related to partitioning are now different,
      since we rely on upstream code.
    
    - Removed a lot of NOTICEs when partitions are created or dropped.
    
    Partition Selectors
    -------------------
    
    GPDB-specific Partition Selector nodes are still created, for
    partition pruning on joins, but the planner code to create them was
    largely rewritten, to leverage the upstream partition pruning code as
    much as possible:
    
    - In the executor, Partition Selectors now use the upstream partition
      pruning infrastructure to compute which partitions can contain
      tuples that pass the join qual. The partitions are recorded in a
      Bitmapset, like the upstream partition pruning code does, and the
      Bitmapset is passed from the PartitionSelector to the Append node
      using a special executor Param.  The old hash table that contained
      the partition OIDs is gone, as is the PartSelected expression. The
      Append node now directly skips the pruned partitions.
    
    - In planner, use the upstream code to construct the
      PartitionPruneInfo steps to do the pruning. Thanks to this, join
      pruning now also works for quals with <, > operators, and for tables
      partitioned on expressions rather than plain columns.
    
    - In planner, refactor where the decision to do join pruning is
      made. It's now decided in create_append_plan(), where run-time
      pruning is decided, too. To pass the partition pruning info to the
      Partition Selector above the Append, maintain a stack of joins that
      are candidates for Partition Selectors as we recurse in
      create_plan().
    
    ORCA only supports very simple static partition pruning currently, and
    falls back for most cases involving partitioned tables. FIXME comments
    are added to the code and/or tests where that happens now, and work is
    scheduled to re-integrate more partitioning support later.
    
    Append-optimized tables and the new Table AM API
    ================================================
    
    PostgreSQL v12 introduced a new API for table access methods that
    allows replacing the usual heap storage with something else. The AO
    and AOCO table code was refactored to sit behind the new API. There
    are now two extra table AMs in GPDB in addition to the 'heap':
    'ao_row' and 'ao_column'. The old syntax, "WITH (appendonly=true)", is
    now mapped to "CREATE TABLE ... USING ao_row", and similarly for
    column-oriented AOCO tables.
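
    For example, these pairs of statements now create the same kind of
    table (the table names are made up):

        create table t_ao  (a int, b int) with (appendonly=true);
        create table t_ao2 (a int, b int) using ao_row;

        create table t_co  (a int, b int) with (appendonly=true, orientation=column);
        create table t_co2 (a int, b int) using ao_column;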
    
    The 'appendonly' and 'orientation' options are no longer accepted in
    the 'gp_default_storage_options' GUC. Its interactions with
    'default_table_access_method' and options given in the DDL command
    became too unwieldy, so it was deemed better to stop supporting it.
    You can use the upstream 'default_table_access_method' GUC instead.
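
    For example, instead of setting 'appendonly' and 'orientation' in
    gp_default_storage_options, you can now do:

        set default_table_access_method = ao_column;
        create table t (a int, b int);  -- created as an ao_column table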
    
    There are some other small differences in what table options are
    accepted in what contexts, where the old behavior was too difficult to
    maintain exactly. For example, if you specify 'compresslevel' and
    'compresstype' in the CREATE TABLE WITH clause, but some of the
    partitions use heap storage, you now get an error. It used to be
    accepted, but the options only affected the AO partitions.
    
    Table inheritance is now allowed for AOCO tables. It was previously
    disallowed for no discernible reason, and it seems to just work now.
    
    The table AM API doesn't cover all the places where we previously had
    to modify upstream code. There are still some
    RelationIsAppendOptimized() calls sprinkled in the backend code, but
    far fewer than before. One big difference is that updates to AO
    tables require a call to appendonly_dml_init() or aoco_dml_init()
    before modifying the table, and appendonly_dml_finish() or
    aoco_dml_finish() afterwards.
    
    The access method is now represented in the catalogs like in upstream,
    in the pg_class.relam field. The old GPDB-specific pg_class.relstorage
    column has been removed.
    
    Planner changes
    ===============
    
    - Big changes to partition planning and pruning, as we merged the upstream
      partitioning code. Most GPDB partitioning code has been replaced with
      the corresponding upstream code. See section "Partitioning" for details.
    
    - PostgreSQL v12 planner learned to inline side-effect free CTEs. That
      is, if you do "WITH foo AS (SELECT ...) SELECT * FROM foo, ...", the
      'foo' subquery is inlined into the main query and planned as
      one. Previously, the CTE was always planned separately, and
      materialized in full. GPDB has always supported inlining CTEs, so
      this isn't new to GPDB, but we adopted the code changes related to
      that as much as possible. In PostgreSQL, the feature can be
      controlled by specifying "WITH foo AS MATERIALIZED (...)"  or NOT
      MATERIALIZED, and that now works in GPDB too.
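
      For example (the table and column names are made up):

        -- force the CTE to be planned separately and materialized once:
        with foo as materialized (select * from t where x > 0)
        select * from foo a join foo b on a.id = b.id;

        -- explicitly allow inlining into the outer query:
        with foo as not materialized (select * from t where x > 0)
        select * from foo where id = 1;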
    
      In GPDB, the inlining could previously be controlled with the
      "gp_cte_sharing" GUC. It still works, but if MATERIALIZED or NOT
      MATERIALIZED was specified in the query, it takes precedence. Like in
      previous releases, even if gp_cte_sharing is set to 'on', the
      subquery is inlined if there is only one reference to it in the
      query.
    
      When gp_cte_sharing = 'off' (the default), GPDB inlines the
      subqueries even if it's not strictly legal to do so. See
      https://github.com/greenplum-db/gpdb/issues/1349. This merge doesn't
      change that.
    
    - Partitionwise joins and aggregates were introduced in PostgreSQL
      v11. They work in GPDB too. The main benefit of partitionwise
      aggregation is that if the partition hierarchy contains a foreign
      table, the aggregation can be pushed down to the foreign server.
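
      As in upstream, these plan shapes can be enabled with the
      partitionwise GUCs:

        set enable_partitionwise_join = on;
        set enable_partitionwise_aggregate = on;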
    
    - GROUPING SETS can now be implemented by hashing, or a mix of hashing
      and sorting. It shows up as a MixedAggregate node in EXPLAIN. However,
      the GPDB-specific multi-stage aggregate planning code
      (in cdbgroupingpaths.c) has not been updated to take advantage of that,
      so you only get such plans with one-stage aggregates, when the grouping
      keys happen to coincide with the table's distribution keys. This is a
      TODO.
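
      For example, a query like this (the table is made up, and assumed to
      be distributed by column 'a') can now be planned with a
      MixedAggregate:

        select a, b, count(*)
        from t
        group by grouping sets ((a), (a, b));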
    
    Other changes
    =============
    
    - The default logging directory, 'pg_log', was renamed to just 'log'
      in PostgreSQL v10. That might affect tools that would try to look
      into the logs. (The 'pg_xlog' and 'pg_clog' directories were also
      renamed to 'pg_wal' and 'pg_xact', respectively, but external tools
      shouldn't be peeking into those directories directly.)
    
    - JIT compilation, introduced in PostgreSQL v11, works. We haven't
      done anything extra for JIT compilation of GPDB-specific code, but
      everything that evaluates expressions benefits from the upstream JIT
      support, also in GPDB-specific node types.
    
    - Hash indexes are enabled now. We had disabled them completely in
      GPDB, because they were not WAL-logged. Since PostgreSQL v10 they
      are, so we can enable them.
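
      For example, this is now accepted and the index is WAL-logged (the
      table name is made up):

        create index on t using hash (a);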
    
    - "MK sort", in tuplesort_mk.c, is gone. The "gp_enable_mk_sort" GUC
      is gone too. We always use the upstream sort code now. There have
      been performance improvements to it in the upstream releases, but
      this is still a performance regression in cases where the MK sort
      was particularly good, i.e. when you have multiple ORDER BY
      columns with lots of duplicates. We might still reintroduce MK sort
      later in some form, as a separate PR, but that work should be done
      on pgsql-hackers for PostgreSQL first.
    
    - The GPDB-specific "Hybrid Hash Agg" code is gone. Its purpose was to
      support spilling large hash aggregates to disk, instead of running
      out of memory. That functionality has been replaced by backporting
      the functionally similar hash agg spilling feature from PostgreSQL
      v13. That consisted of a number of upstream commits:
    
        a10701c0e3 Allow simplehash to use already-calculated hash values.
        81080df32b Add transparent block-level memory accounting
        87b6b57191 Fix edge case leading to agg transitions skipping ExecAggTransReparent() calls.
        f76e8f8507 Refactor hash_agg_entry_size().
        4ac4bd07fa Introduce TupleHashTableHash() and LookupTupleHashEntryHash().
        5dddc85d78 Change signature of TupleHashTableHash().
        d238458f6d Minor refactor of nodeAgg.c.
        a9e9e9bc8e Fixup for nodeAgg.c refactor.
        b27eec5ea6 Extend ExecBuildAggTrans() to support a NULL pointer check.
        0a390a8e8b Introduce LogicalTapeSetExtend().
        5d9fd3ae3d Disk-based Hash Aggregation.
        5471822147 Fixes for Disk-based Hash Aggregation.
        8b6151eae7 Avoid allocating unnecessary zero-sized array.
        04fced3688 Fix costing for disk-based hash aggregation.
        e4ee460db3 Include chunk overhead in hash table entry size estimate.
        255871265e Create memory context for HashAgg with a reasonable maxBlockSize.
        02f3d5670a Fix HashAgg regression from choosing too many initial buckets.
        f12207ca4e Logical Tape Set: use min heap for freelist.
    
      As a result, the nodeAgg.c code is more similar to the v13 version
      than v12. TODO: backpatch any followup commits from REL_13_STABLE
      before release.
    
    - Parallelism is still disabled in GPDB. That includes the new
      parallel CREATE INDEX code.
    
    - Fix behavior with UPDATE WHERE CURRENT OF. As you can see in the
      changes to expected output of the 'portals_updatable' test, some of
      the test results changed. As far as I can see, the old behaviour
      memorized in the expected output was incorrect. PostgreSQL has
      always returned the result we're now seeing on the merge branch,
      which is different from the result you get on master.  The test had
      not been modified since 2012, when the test was added along with
      support for WHERE CURRENT OF. I could not find any explanation in
      git history or in test comments. I believe it was simply a bug when
      WHERE CURRENT OF support was added, and no one noticed that the test
      result was wrong.
    
    - The GPDB implementation of "RANGE offset PRECEDING/FOLLOWING" in
      window functions was replaced with the upstream implementation that
      was introduced in version 11. The upstream implementation is more
      strict about the datatypes than the code in GPDB master. However,
      looking at old JIRA ticket MPP-5246 that was mentioned in one of the
      tests, it had a test case like this attached to it:
    
        explain select cn, count(*) over (order by dt range between '2 day'::interval preceding and 2 preceding) from mpp5246_sale;
        ERROR:  can't coerce leading frame bound to type of trailing frame bound
        HINT:  specify the leading and trailing frame bounds as the same type
    
      So apparently we were strict about the types back in 2009, and you
      got that error up to 5X_STABLE, but it got relaxed in 6X_STABLE. And
      now it becomes more strict with the upstream code again. I think
      that's fine, although we probably shouldn't have relaxed it in
      6X_STABLE. Too late to change that.
    
    - Remove the undocumented 'gp_indexcheck_vacuum' GUC. A new contrib
      module, contrib/amcheck, was added in PostgreSQL v10, with similar
      functionality.
    
    - Logical replication is still disabled in GPDB.
    
    - The default of 'hot_standby' was changed to 'on' in upstream, but
      hot standby is still not supported in GPDB, and the default in GPDB
      remains 'off'. It is now possible to turn it 'on' manually in
      mirrors, though. That allows tools like pg_basebackup to work, and
      it allows running the upstream replication tests in
      src/test/recovery. However, there's no user-visible, supported, new
      functionality here.
    
    - pg_basebackup got a feature to verify checksums, but that has been
      disabled in the wrapper python scripts. So checksums are not
      verified by the management utilities that use pg_basebackup under
      the hood (gpexpand, gpinitstandby, gpconfigurenewsegment). Likewise,
      tests that use pg_basebackup were modified with the
      --no-verify-checksums flag. The reason for this is that the checksum
      verification is broken for AO/AOCO tables, which don't have page
      checksum like other relations do, and pg_basebackup cannot
      distinguish them. We ought to fix this somehow, but for now it's a
      FIXME.
    
    - pg_ctl -w now waits for distributed transaction recovery.
    
      When starting the master in distributed mode, pg_ctl -w in GPDB
      should wait for distributed transaction recovery to complete, as
      connections are only allowed after that.

      A new state, "dtmready", is recorded in the PID file to mark the
      completion of distributed transaction recovery. When the dtx
      background worker finishes distributed transaction recovery, the
      postmaster writes this state to the file. pg_ctl waits for this
      "dtmready" state instead of the "ready" state when starting the
      master in distributed mode, which is conveyed using the "-E"
      postmaster option.

      This change should meet the previous expectation that after
      successful completion of gpstart, distributed transactions can be
      made.
    
    - If a client tries to connect while DTM recovery is in progress, you
      now get an error. Client connections used to wait for the dtx
      recovery process to complete distributed transaction recovery, and
      depending on the situation that wait could be infinite. To avoid
      hung connections in situations where dtx recovery can't complete
      the distributed transactions, it is better to error out and let the
      client retry, similar to local crash recovery.
      (Discussion:
      https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/dgU9il8l24A/NTAz2cl5BQAJ)
    
    Source code and test changes
    ============================
    
    - PostgreSQL v11 replaced the way the initial catalog contents are
      stored in the source code. Instead of the "DATA(insert ..." lines in
      the .h files, the initial contents are stored as Perl data
      structures in separate .dat files. The new format is much more
      convenient to edit, and makes it easier to carry diffs against them, so
      we don't need the GPDB-specific perl scripts to provide defaults for
      GPDB-added catalog columns anymore, nor the "catullus.pl" script
      that was used to generate pg_proc.h contents for GPDB-added
      functions. That's all gone, and all GPDB-specific changes are made
      to the .dat files directly.
    
    - PostgreSQL v12 removed support for OIDs in user tables, and changed
      all catalog tables with OIDs so that the OID is a regular column. A
      lot of GPDB specific code and tests to deal with OIDs was removed
      with that.
    
    - A new Kerberos test suite was added to PostgreSQL v11. We had one in
      GPDB already, under src/test/gpkerberos. It's not clear how much
      overlap there is between the two; we could possibly get rid of the
      GPDB test suite now, but for now, keep them both.
    
    - Support for passing around MemTuples in TupleTableSlots was
      removed. MemTuple is still the native format of AO (row) tables, but
      is deformed and stored in a virtual tuple slot as soon as it's read
      from disk. MemTuple is no longer used as the on-wire format in
      Motions either, nor as the on-disk format for temp files when
      sorting; we now use upstream MinimalTuple for those.
    
    - HeapTuple->t_tableOid has been resurrected. We had removed it in
      previous GPDB versions for performance reasons, but I don't believe
      that the overhead is really noticeable, at least not in recent
      PostgreSQL versions.
    
    - The pg_rewind regression tests were previously implemented in GPDB
      as a shell script. It was done that way when we backported the
      upstream tests but didn't have the TAP test framework yet. Now we
      do, so replace the shell implementation with the upstream TAP tests.
    
    - The way ALTER TABLE commands are dispatched was refactored. Some
      ALTER TABLE subcommands need to dispatch some extra information from
      QD to QEs.  Previously, the ATExec* functions modified the
      sub-commands in the QD, and the modified subcommands were dispatched
      to the QEs. The problem with this approach is that the QEs try to
      run the ATPrep* phase on the already-prepped commands, so we had to
      hack the ATPrep* functions to cope with that. That was causing merge
      conflicts, which could be fixed, but I was not happy with the diff
      footprint of that approach. Instead, dispatch the AlteredTableInfo
      structs that were constructed in the QD's prep phase. That way, the
      execute phase can modify the sub-commands and those modifications
      are dispatched.
    
    - Upstream isolation tests (src/test/isolation) are now run in utility
      mode. Instead of disabling or heavily modifying the upstream
      isolation tests, it's better to run them in utility mode (a kind of
      single-node postgres) and get the code coverage. This still doesn't
      test GPDB clustered behavior; the framework needs to be enhanced to
      understand locking across segments and to work with the global
      deadlock detector.
    
    - SRFs in target lists. We adopted the new upstream node type,
      ProjectSet, to deal with SRFs in target lists. As a performance
      optimization, there were some GPDB changes in the planner earlier,
      to always use a Result node if there were SRFs in the target list,
      and the SRF-in-targetlist support was removed from other executor
      nodes. That was pretty much the same changes that were made in
      upstream, except that upstream uses ProjectSet instead of Result,
      and the behavior was changed to be more sane when you had multiple
      SRFs. Replace all the GPDB changes with upstream. Due to the
      behavior changes with multiple SRFs, some INSERT statements in GPDB
      tests had to be modified so that they generate the same test data as
      before.
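
      For example, with multiple SRFs in the target list:

        select generate_series(1, 2), generate_series(1, 3);

        -- with the upstream ProjectSet behavior this returns 3 rows, the
        -- shorter SRF padded with NULLs; the old behavior (as in pre-v10
        -- PostgreSQL) produced the least common multiple of the row
        -- counts, 6 rows here
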
    Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
    Co-authored-by: Adam Lee <adlee@vmware.com>
    Co-authored-by: Alexandra Wang <walexandra@vmware.com>
    Co-authored-by: Andrey Borodin <amborodin@acm.org>
    Co-authored-by: Ashwin Agrawal <aashwin@vmware.com>
    Co-authored-by: Asim Praveen <pasim@vmware.com>
    Co-authored-by: Chris Hajas <chajas@vmware.com>
    Co-authored-by: Daniel Gustafsson <dgustafsson@pivotal.io>
    Co-authored-by: David Kimura <dkimura@pivotal.io>
    Co-authored-by: Georgios Kokolatos <gkokolatos@pivotal.io>
    Co-authored-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
    Co-authored-by: Hubert Zhang <hzhang@pivotal.io>
    Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
    Co-authored-by: Ning Yu <nyu@pivotal.io>
    Co-authored-by: Pengzhou Tang <ptang@pivotal.io>
    Co-authored-by: Soumyadeep Chakraborty <sochakraborty@pivotal.io>
    Co-authored-by: Weinan Wang <wewang@pivotal.io>
    Co-authored-by: Zhenghua Lyu <kainwen@gmail.com>