Ticket #683 (closed defect: fixed)

Opened 3 weeks ago

Last modified 10 days ago

Cannot restart message, but restart OK.

Reported by: stefan
Owned by: alexander
Priority: major
Milestone: Linux
Component: startstop
Version: 7.10
Keywords:
Cc:

Description

20190131_UTC_03:02:12 startstop: successfully restarted <wave_serverV>
20190131_UTC_03:02:12 startstop: Cannot restart pid=8167; it is not my child!

Need to remove the false "cannot restart" message. This is the latest svn as of today, on CentOS 7.4.

Change History

comment:1 Changed 2 weeks ago by paulf

We probably need a way for the Restart Thread to tell the main thread that a restart for a given pid is in progress. Then this message can be "restart in progress for pid X" instead of "cannot restart because the pid X isn't found in my table anymore".

Remember, the reason we have a RestartThread for a module is that, in the past, a restart that took a long time locked up the main thread and kept it from servicing status requests and other tasks.

comment:2 Changed 2 weeks ago by paulf

This is probably easiest to reproduce using wave_serverV, which takes some time to exit when a restart command is issued.

One would need to issue repeated restart commands for this PID from the command line.

comment:3 Changed 2 weeks ago by alexander

  • Owner changed from somebody to alexander
  • Status changed from new to assigned

comment:4 Changed 2 weeks ago by alexander

This is an error with copystatus, not startstop:

copystatus copies messages of type 'TYPE_STOP' and 'TYPE_RESTART'. In the main listener loop of startstop, we loop over all rings and check for messages.

A 'restart' message will therefore be present in two rings: the target ring of the module and the status ring being copied to by copystatus.

I recommend that we DON'T have copystatus copy messages of these types, as it will only result in one message succeeding and the other failing in quick succession.
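The failure mode described in this comment can be modelled in a few lines. This is a toy Python sketch, not Earthworm code (startstop itself is written in C); the ring names, the `handle_restart` helper, the child table layout, and the new pid 9001 are all made up for illustration.

```python
# Toy model: one restart request that is visible in two rings.
# Every name here is illustrative; none is taken from the Earthworm source.

def handle_restart(children, pid, next_pid):
    """Restart the child with the given pid, or report that it is unknown."""
    if pid not in children:
        return f"Cannot restart pid={pid}; it is not my child!"
    name = children.pop(pid)      # the old pid leaves the table
    children[next_pid] = name     # the child comes back under a new pid
    return f"successfully restarted <{name}>"


children = {8167: "wave_serverV"}

# copystatus has copied the TYPE_RESTART message, so the listener loop
# finds the same request once per ring.
rings = {
    "TARGET_RING": [("TYPE_RESTART", 8167)],
    "STATUS_RING": [("TYPE_RESTART", 8167)],
}

log = []
for ring_name, messages in rings.items():
    for msg_type, pid in messages:
        if msg_type == "TYPE_RESTART":
            log.append(handle_restart(children, pid, next_pid=9001))

for line in log:
    print(line)
```

The first copy of the request succeeds and replaces the pid, so the second copy refers to a pid that is no longer in the table. That matches the pair of log lines in the description.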

comment:5 Changed 2 weeks ago by paulf

I just realized I replied by email and forgot to record the issue here. Here are the same comments as in my email to ewdev:

lex/Larry,

The reason TYPE_RESTART and TYPE_STOP are in copystatus is, I think, for the stopmodule EW module's purpose.

They are there for statmgr, so that it does not issue a restart if a TYPE_STOP has been sent.

We need to fix the behavior in startstop so it can ignore dups. Not sure how we can ID that, though.

Paul

comment:6 Changed 2 weeks ago by paulf

Okay, I say we dump the warning message.

We should not add new baggage to detect duplicate messages, as that is an unnecessary complication for startstop.

We should also introduce a new state, "restarting", to the status message and child structure to indicate that a restart is in progress for a given PID. We should also double-check that all of the mutexes surrounding the structure that holds the PIDs for each process controlled by startstop are solid, so that we don't have 2 threads beating on the same PID at once.
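A rough sketch of what a "restarting" state guarded by a single lock could look like. This is Python for illustration only, with `threading.Lock` standing in for the C mutexes; the `Child` and `ChildTable` classes, their method names, and the state strings are assumptions, not the actual startstop structures.

```python
import threading
from dataclasses import dataclass


@dataclass
class Child:
    name: str
    pid: int
    state: str = "alive"  # hypothetical states: "alive", "restarting"


class ChildTable:
    """Child table in which every read and write of a pid entry holds one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_pid = {}

    def add(self, child):
        with self._lock:
            self._by_pid[child.pid] = child

    def begin_restart(self, pid):
        """Claim a pid for restart: 'started', 'in progress', or 'unknown'."""
        with self._lock:
            child = self._by_pid.get(pid)
            if child is None:
                return "unknown"
            if child.state == "restarting":
                return "in progress"  # another thread already owns this pid
            child.state = "restarting"
            return "started"

    def finish_restart(self, old_pid, new_pid):
        """Re-register the child under its new pid and mark it alive again."""
        with self._lock:
            child = self._by_pid.pop(old_pid)
            child.pid = new_pid
            child.state = "alive"
            self._by_pid[new_pid] = child

    def state_of(self, pid):
        with self._lock:
            child = self._by_pid.get(pid)
            return None if child is None else child.state


table = ChildTable()
table.add(Child("wave_serverV", 8167))
first = table.begin_restart(8167)    # the restart thread claims the pid
second = table.begin_restart(8167)   # a repeated request while it is busy
table.finish_restart(8167, 9001)     # 9001 is a made-up new pid
```

Because the check and the state change happen under the same lock, only one caller can move a given pid into "restarting"; a second request sees "in progress" rather than a missing table entry.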

comment:7 Changed 12 days ago by alexander

  • Status changed from assigned to closed
  • Resolution set to fixed

Revision 7781 fixes this

comment:8 Changed 10 days ago by alexander

  • Status changed from closed to reopened
  • Resolution fixed deleted

Needs rework. Statmgr *does* need to be aware of "restart" messages, but "copystatus" creates the duplicate messages we wish to avoid.

Duplicates will be avoided by modifying "restart" to copy to a status ring directly (keeping statmgr in the loop) provided that a status ring exists, or else default to the current behavior of sending the restart message to the first available ring.

comment:9 Changed 10 days ago by alexander

  • Status changed from reopened to closed
  • Resolution set to fixed

NOW this ticket is done (revision 7786)

Restart now looks up statmgr's ring; if it doesn't find it, we revert to using the first ring listed in startstop (as was the case before).
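The ring selection described here amounts to a lookup with a fallback. A minimal Python sketch follows, assuming the ring list is available as an ordered list of names; `pick_restart_ring` and the ring names are hypothetical and do not reflect the code in revision 7786.

```python
def pick_restart_ring(rings, statmgr_ring=None):
    """Return the ring a restart message should be written to.

    rings:        ring names in the order startstop lists them
    statmgr_ring: the ring statmgr listens on, or None if it is not known
    """
    if statmgr_ring is not None and statmgr_ring in rings:
        return statmgr_ring       # keep statmgr in the loop
    if not rings:
        raise ValueError("no rings configured")
    return rings[0]               # previous behavior: first ring listed


# Example ring names only.
with_status = pick_restart_ring(["WAVE_RING", "STATUS_RING"], "STATUS_RING")
without_status = pick_restart_ring(["WAVE_RING", "PICK_RING"])
print(with_status, without_status)
```

Writing the message to a single ring, instead of writing it to one ring and having copystatus copy it to another, is what removes the duplicate.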

The following test passes:

1) stopmodule <pid>

Stops the module referenced by pid and notifies statmgr not to monitor the stopped module.

2) restart <pid>

Restarts the module referenced by pid (which will then be referenced by new_pid) and tells statmgr to resume monitoring the module. Also: no duplicate "restart" messages, as copystatus no longer copies restart messages.

3) kill <new_pid>

Simulates module 'death'. Statmgr is monitoring the module again and will attempt to restart it once it has failed to receive a heartbeat for the configured length of time. Again, no duplicate "restart" messages will be issued.
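The monitoring behavior that the three steps exercise can be sketched as a small state machine. This is a toy Python model of the behavior described in this ticket, not statmgr's implementation; the `ToyStatmgr` class, its methods, the 30-second timeout, and the timestamps are all assumptions.

```python
class ToyStatmgr:
    """Tracks which modules should be restarted after a missed heartbeat."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.monitored = {}  # module name -> time of last heartbeat

    def watch(self, module, now):
        self.monitored[module] = now

    def on_message(self, msg_type, module, now):
        if msg_type == "TYPE_STOP":
            # A deliberate stop: do not try to restart this module.
            self.monitored.pop(module, None)
        elif msg_type == "TYPE_RESTART":
            # The module is coming back: resume monitoring.
            self.monitored[module] = now
        elif msg_type == "TYPE_HEARTBEAT" and module in self.monitored:
            self.monitored[module] = now

    def overdue(self, now):
        """Modules whose last heartbeat is older than the timeout."""
        return [m for m, last in self.monitored.items()
                if now - last > self.timeout]


mgr = ToyStatmgr(timeout=30)
mgr.watch("wave_serverV", now=0)

mgr.on_message("TYPE_STOP", "wave_serverV", now=10)      # 1) stopmodule <pid>
after_stop = mgr.overdue(now=100)

mgr.on_message("TYPE_RESTART", "wave_serverV", now=110)  # 2) restart <pid>
after_restart = mgr.overdue(now=120)

after_kill = mgr.overdue(now=200)                        # 3) kill <new_pid>
```

In this model the stopped module is never reported as overdue, while the restarted and then killed module is reported once its heartbeat is older than the timeout.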
