Use postgres database for pg_rewind clean shutdown execution to avoid potential pg_rewind hang.

During testing, I encountered an incremental gprecoverseg hang. Incremental gprecoverseg is based on pg_rewind. If the target postgres instance was not cleanly shut down, pg_rewind launches a single-user mode postgres process, which quits after crash recovery; this ensures that the instance is in a consistent state before incremental recovery. I found that the single-user mode postgres process hangs with the stack below.

#1 0x00000000008cf2d6 in PGSemaphoreLock (sema=0x7f238274a4b0, interruptOK=1 '\001') at pg_sema.c:422
#2 0x00000000009614ed in ProcSleep (locallock=0x2c783c0, lockMethodTable=0xddb140 <default_lockmethod>) at proc.c:1347
#3 0x000000000095a0c1 in WaitOnLock (locallock=0x2c783c0, owner=0x2cbf950) at lock.c:1853
#4 0x0000000000958e3a in LockAcquireExtended (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001', locallockp=0x0) at lock.c:1155
#5 0x0000000000957e64 in LockAcquire (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:700
#6 0x000000000095728c in LockSharedObject (classid=1262, objid=1, objsubid=0, lockmode=3) at lmgr.c:939
#7 0x0000000000b0152b in InitPostgres (in_dbname=0x2c769f0 "template1", dboid=0, username=0x2c59340 "gpadmin", out_dbname=0x0) at postinit.c:1019
#8 0x000000000097b970 in PostgresMain (argc=5, argv=0x2c51990, dbname=0x2c769f0 "template1", username=0x2c59340 "gpadmin") at postgres.c:4820
#9 0x00000000007dc432 in main (argc=5, argv=0x2c51990) at main.c:241

It tries to acquire the lock for template1 on pg_database with lockmode 3, but that conflicts with the lockmode 5 lock held by a dtx transaction that was recovered in startup by RecoverPreparedTransactions(). Typically the dtx transaction comes from "create database" (by default the template database is template1).

Fix this by using the postgres database for the single-user mode postgres execution.
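
The mode 3 versus mode 5 conflict can be checked against the lock conflict table. The sketch below is illustrative only: it assumes the mode numbering and conflict sets of upstream PostgreSQL's table-level lock modes (gpdb is assumed not to change them), and the names LOCK_MODES, CONFLICTS and conflicts() are hypothetical helpers, not part of this patch.

```python
# Lock mode numbers and names, assumed to match upstream PostgreSQL.
LOCK_MODES = {
    1: "AccessShareLock",
    2: "RowShareLock",
    3: "RowExclusiveLock",
    4: "ShareUpdateExclusiveLock",
    5: "ShareLock",
    6: "ShareRowExclusiveLock",
    7: "ExclusiveLock",
    8: "AccessExclusiveLock",
}

# For each requested mode, the set of held modes that block it
# (assumed to match the upstream PostgreSQL conflict table).
CONFLICTS = {
    1: {8},
    2: {7, 8},
    3: {5, 6, 7, 8},
    4: {4, 5, 6, 7, 8},
    5: {3, 4, 6, 7, 8},
    6: {3, 4, 5, 6, 7, 8},
    7: {2, 3, 4, 5, 6, 7, 8},
    8: {1, 2, 3, 4, 5, 6, 7, 8},
}


def conflicts(requested, held):
    """Return True if a request for mode `requested` must wait on mode `held`."""
    return held in CONFLICTS[requested]


if __name__ == "__main__":
    # InitPostgres() requests mode 3 while the recovered dtx holds mode 5.
    print(LOCK_MODES[3], "vs", LOCK_MODES[5], "->", conflicts(3, 5))
    # Mode 1 requests are only blocked by a mode 8 holder.
    print(LOCK_MODES[1], "vs", LOCK_MODES[5], "->", conflicts(1, 5))
```

Under these assumptions, only a mode 8 holder blocks a mode 1 request, which is why the remaining mode 1 catalog locks taken in InitPostgres() are considered low risk.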

The postgres database is commonly used by many background worker backends such as dtx recovery, gdd and ftsprobe. With this change, we do not need to worry about commands like "create database" with template postgres, since they won't succeed, so the lock conflict is avoided.

We may be able to fix this in InitPostgres() by bypassing the locking code in single mode, but the current fix seems safer. Note that InitPostgres() also locks/unlocks some other catalog tables, but almost all of them use lock mode 1 (except pg_resqueuecapability with mode 3, per debugging output). It seems unusual in real scenarios to have a dtx transaction that locks a catalog with mode 8, which conflicts with mode 1. If we encounter this later, we will need to work out a better (possibly non-trivial) solution. For now, let's fix the issue we actually encountered.

Note that the code fixes in buildMirrorSegments.py and twophase.c in this patch are not related to this issue. They do not seem to be strict bugs, but we'd better fix them to avoid potential issues in the future.

Reviewed-by: N Ashwin Agrawal <aashwin@vmware.com>
Reviewed-by: N Asim R P <pasim@vmware.com>

288908f3 · Paul Guo · f6c59503
7 changed files
--- a/gpMgmt/bin/gppylib/operations/buildMirrorSegments.py
+++ b/gpMgmt/bin/gppylib/operations/buildMirrorSegments.py
@@ -363,6 +363,7 @@ class GpMirrorListToBuild:
                                 dbname='template1')
            conn = dbconn.connect(dburl, utility=True)
            dbconn.execSQL(conn, "CHECKPOINT")
+            conn.close()

            # If the postmaster.pid still exists and another process
            # is actively using that pid, pg_rewind will fail when it

--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -432,6 +432,7 @@ MarkAsPreparing(TransactionId xid,
 	proc->backendId = InvalidBackendId;
 	proc->databaseId = databaseid;
 	proc->roleId = owner;
+	proc->mppSessionId = gp_session_id;
 	proc->lwWaiting = false;
 	proc->lwWaitMode = 0;
 	proc->waitLock = NULL;

--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -811,8 +811,28 @@ ensureCleanShutdown(const char *argv0)
 		return;

 	/* finally run postgres single-user mode */
-	snprintf(cmd, MAXCMDLEN, "\"%s\" --single -D \"%s\" template1 < %s",
-			 exec_path, datadir_target, DEVNULL);
+	/*
+	 * gpdb: use postgres instead of template1, else the below postgres
+	 * instance might hang in the below scenario:
+	 *
+	 * 1. There was a prepared but not finished "create database " dtx
+	 *    transaction which was recovered during crash recovery in the startup
+	 *    process and thus it holds the lock of database template1 since
+	 *    by default template1 is the template for database creation.
+	 *
+	 * 2. Single mode postgres process will execute the below code in
+	 * InitPostgres() after finishing crash recovery (i.e. calling
+	 * startupXLOG()) and then hang due to lock conflict.
+	 *
+	 *    LockSharedObject(DatabaseRelationId, ...);
+	 *
+	 * DB_FOR_COMMON_ACCESS is used in fts probe, dtx recovery, gdd so it's
+	 * hard to have the above kind of dtx transaction on DB_FOR_COMMON_ACCESS
+	 * since the commands (e.g. create database with template
+	 * DB_FOR_COMMON_ACCESS) would fail.
+	 */
+	snprintf(cmd, MAXCMDLEN, "\"%s\" --single -D \"%s\" %s < %s",
+			 exec_path, datadir_target, DB_FOR_COMMON_ACCESS, DEVNULL);

 	if (system(cmd) != 0)
 		pg_fatal("postgres single-user mode of target instance failed for command: %s\n", cmd);

--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -364,14 +364,14 @@ extern int	trace_recovery_messages;
 extern int	trace_recovery(int trace_level);

 /*
- * database which is used by startup, gdd, fts, etc for catalog access.
+ * database which is used by dtx recovery, gdd, fts, etc for catalog access.
 * We are not using template1 since it seems that users would like to recreate
 * the template1 database for customization sometimes. That means template1
- * could be dropped and then recreated and thus that will break
- * startup, gdd, fts, etc. Also template1 is the default template for the
- * 'create database' command. Using template1 will make that command fail:
- * "ERROR:  source database "template1" is being accessed by other users"
- * "DETAIL:  There are 2 other sessions using the database."
+ * could be dropped and then recreated and thus that will break dtx recovery,
+ * gdd, fts, etc. Also template1 is the default template for the 'create
+ * database' command. Using template1 will make that command fail: "ERROR:
+ * source database "template1" is being accessed by other users" "DETAIL:
+ * There are 2 other sessions using the database."
 */
 #define DB_FOR_COMMON_ACCESS	"postgres"


--- a/src/test/isolation2/expected/prepared_xact_deadlock_pg_rewind.out
+++ b/src/test/isolation2/expected/prepared_xact_deadlock_pg_rewind.out
+-- Test a recovered (in startup) prepared transaction does not block
+-- pg_rewind due to lock conflict of database template1 when it runs the single
+-- mode instance to ensure clean shutdown on the target postgres instance.
+include: helpers/server_helpers.sql;
+CREATE
+
+-- set GUCs to speed-up the test
+1: alter system set gp_fts_probe_retries to 2;
+ALTER
+1: alter system set gp_fts_probe_timeout to 5;
+ALTER
+1: select pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t              
+(1 row)
+
+1: select gp_inject_fault('after_xlog_xact_prepare_flushed', 'suspend', dbid) from gp_segment_configuration where role='p' and content = 0;
+ gp_inject_fault 
+-----------------
+ Success:        
+(1 row)
+2&: create database db_orphan_prepare;  <waiting ...>
+1: select gp_wait_until_triggered_fault('after_xlog_xact_prepare_flushed', 1, dbid) from gp_segment_configuration where role='p' and content = 0;
+ gp_wait_until_triggered_fault 
+-------------------------------
+ Success:                      
+(1 row)
+
+-- immediate shutdown the primary and then promote the mirror.
+1: select pg_ctl((select datadir from gp_segment_configuration c where c.role='p' and c.content=0), 'stop');
+ pg_ctl 
+--------
+ OK     
+(1 row)
+1: select gp_request_fts_probe_scan();
+ gp_request_fts_probe_scan 
+---------------------------
+ t                         
+(1 row)
+1: select content, preferred_role, role, status, mode from gp_segment_configuration where content = 0;
+ content | preferred_role | role | status | mode 
+---------+----------------+------+--------+------
+ 0       | p              | m    | d      | n    
+ 0       | m              | p    | u      | n    
+(2 rows)
+
+-- wait until promote is finished.
+0U: select 1;
+ ?column? 
+----------
+ 1        
+(1 row)
+0Uq: ... <quitting>
+2<:  <... completed>
+ERROR:  Error on receive from seg0 10.152.8.141:7002 pid=20285: server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+
+-- restore the cluster. Previously there is a bug the incremental recovery
+-- hangs in pg_rewind due to lock conflict. pg_rewinds runs a single-mode
+-- postgres to ensure clean shutdown of the postgres. That will recover the
+-- unhandled prepared transactions into memory which will hold locks. For
+-- example, "create database" will hold the lock of template1 on pg_database
+-- with mode 5, but that conflicts with the mode 3 lock which is needed during
+-- postgres starting in InitPostgres() and thus pg_rewind hangs forever.
+!\retcode gprecoverseg -a;
+(exited with code 0)
+!\retcode gprecoverseg -ar;
+(exited with code 0)
+
+-- reset fts GUCs.
+3: alter system reset gp_fts_probe_retries;
+ALTER
+3: alter system reset gp_fts_probe_timeout;
+ALTER
+3: select pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t              
+(1 row)
--- a/src/test/isolation2/isolation2_schedule
+++ b/src/test/isolation2/isolation2_schedule
@@ -4,6 +4,7 @@ test: lockmodes
 # Put test prepare_limit near to test lockmodes since both of them reboot the
 # cluster during testing. Usually the 2nd reboot should be faster.
 test: prepare_limit
+test: prepared_xact_deadlock_pg_rewind
 test: ao_partition_lock
 test: dml_on_root_locks_all_parts


--- a/src/test/isolation2/sql/prepared_xact_deadlock_pg_rewind.sql
+++ b/src/test/isolation2/sql/prepared_xact_deadlock_pg_rewind.sql
+-- Test a recovered (in startup) prepared transaction does not block
+-- pg_rewind due to lock conflict of database template1 when it runs the single
+-- mode instance to ensure clean shutdown on the target postgres instance.
+include: helpers/server_helpers.sql;
+
+-- set GUCs to speed-up the test
+1: alter system set gp_fts_probe_retries to 2;
+1: alter system set gp_fts_probe_timeout to 5;
+1: select pg_reload_conf();
+
+1: select gp_inject_fault('after_xlog_xact_prepare_flushed', 'suspend', dbid) from gp_segment_configuration where role='p' and content = 0;
+2&: create database db_orphan_prepare;
+1: select gp_wait_until_triggered_fault('after_xlog_xact_prepare_flushed', 1, dbid) from gp_segment_configuration where role='p' and content = 0;
+
+-- immediate shutdown the primary and then promote the mirror.
+1: select pg_ctl((select datadir from gp_segment_configuration c where c.role='p' and c.content=0), 'stop');
+1: select gp_request_fts_probe_scan();
+1: select content, preferred_role, role, status, mode from gp_segment_configuration where content = 0;
+
+-- wait until promote is finished.
+0U: select 1;
+0Uq:
+2<:
+
+-- restore the cluster. Previously there is a bug the incremental recovery
+-- hangs in pg_rewind due to lock conflict. pg_rewinds runs a single-mode
+-- postgres to ensure clean shutdown of the postgres. That will recover the
+-- unhandled prepared transactions into memory which will hold locks. For
+-- example, "create database" will hold the lock of template1 on pg_database
+-- with mode 5, but that conflicts with the mode 3 lock which is needed during
+-- postgres starting in InitPostgres() and thus pg_rewind hangs forever.
+!\retcode gprecoverseg -a;
+!\retcode gprecoverseg -ar;
+
+-- reset fts GUCs.
+3: alter system reset gp_fts_probe_retries;
+3: alter system reset gp_fts_probe_timeout;
+3: select pg_reload_conf();