Fix a hang caused by gp_interconnect_id disorder

This issue was exposed by an experiment that removed the special "eval_stable_functions" handling in evaluate_function(): the qp_functions_in_* test cases would sometimes get stuck, and the cause turned out to be a gp_interconnect_id disorder.

Under the UDPIFC interconnect, gp_interconnect_id is used to distinguish the executions of MPP-fied plans within the same session. On the receiver side, packets with a smaller gp_interconnect_id are treated as 'past' packets, and the receiver tells the sender to stop sending them.

The root cause of the hang is:

1. The QD calls InitSliceTable() to advance gp_interconnect_id and stores it in the slice table.
2. In CdbDispatchPlan->exec_make_plan_constant(), the QD finds a stable function that needs to be simplified to a const, so it executes that function first.
3. The function contains SQL, so the QD initializes another slice table and advances gp_interconnect_id again, then dispatches the new plan and executes it.
4. After the function is simplified to a const, the QD continues to dispatch the previous plan. However, the gp_interconnect_id stored for it is now the older one.

When a packet arrives and the receiver hasn't set up the interconnect yet, the packet is handled by handleMismatch(). It is treated as a 'past' packet, and the receiver stops the senders early. When the receiver later finishes setting up the interconnect, it cannot get any packets from the senders and gets stuck.

To resolve this, we advance gp_interconnect_id only when a plan is actually dispatched. Plans are dispatched sequentially, so a later dispatched plan always has a higher gp_interconnect_id.

Also limit the usage of gp_interconnect_id to the rx thread of UDPIFC; in the main thread we prefer to use sliceTable->ic_instance_id.

Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
Reviewed-by: Asim R P <apraveen@pivotal.io>
Reviewed-by: Hubert Zhang <hzhang@pivotal.io>
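
For illustration only, the ordering problem described above can be modeled with a small standalone C sketch. It is not the Greenplum code: Plan, ic_counter, next_ic_id(), is_past_packet(), dispatch() and outer_plan_looks_past() are all hypothetical names, and the receiver is reduced to the assumption that it remembers the highest id it has accepted and treats anything smaller as 'past'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the session-wide gp_interconnect_id counter. */
static uint32_t ic_counter = 0;

/* Hypothetical plan: only the interconnect id it carries matters here. */
typedef struct
{
    uint32_t ic_id;     /* 0 means "not assigned yet" */
} Plan;

static uint32_t next_ic_id(void)
{
    return ++ic_counter;
}

/* Simplified receiver rule (assumption): a smaller id is a 'past' packet. */
static bool is_past_packet(uint32_t packet_ic_id, uint32_t receiver_latest)
{
    return packet_ic_id < receiver_latest;
}

/*
 * Dispatch one plan. When assign_at_dispatch is true the id is taken here,
 * at dispatch time; otherwise the plan keeps the id it got at init time.
 * Returns true if the receiver would classify the plan's packets as 'past'.
 */
static bool dispatch(Plan *plan, bool assign_at_dispatch,
                     uint32_t *receiver_latest)
{
    if (assign_at_dispatch)
    {
        assert(plan->ic_id == 0);   /* an id must not be assigned twice */
        plan->ic_id = next_ic_id();
    }

    bool past = is_past_packet(plan->ic_id, *receiver_latest);
    if (!past)
        *receiver_latest = plan->ic_id;
    return past;
}

/*
 * Replay steps 1-4: the outer plan is initialized first, a nested plan
 * (from the stable function) is initialized and dispatched, and only then
 * is the outer plan dispatched.
 */
static bool outer_plan_looks_past(bool assign_at_dispatch)
{
    Plan     outer = {0};
    Plan     nested = {0};
    uint32_t receiver_latest = 0;

    ic_counter = 0;

    if (!assign_at_dispatch)
        outer.ic_id = next_ic_id();     /* step 1: id taken at init */
    if (!assign_at_dispatch)
        nested.ic_id = next_ic_id();    /* step 3: nested init */

    dispatch(&nested, assign_at_dispatch, &receiver_latest);        /* step 3 */
    return dispatch(&outer, assign_at_dispatch, &receiver_latest);  /* step 4 */
}
```

In this model, outer_plan_looks_past(false) reproduces the disorder (the outer plan carries the smaller id), while outer_plan_looks_past(true) hands out ids in dispatch order, so the outer plan is not misclassified.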