Ticket #298 (new defect)

Opened 8 years ago

Last modified 6 years ago

Startstop Not Responding

Reported by: stefan
Owned by: somebody
Priority: major
Milestone:
Component: startstop_service
Version:
Keywords:
Cc:

Description

I thought the Windows startstop was multi-threaded now, and that 'status' would always return a status even if startstop was busy. But I just got the following when trying to do a 'reconfigure'. (I hadn't added anything new to startstop_nt.d; I only wanted to restart statmgr from the command line, and reconfigure should be an easy way to do that.)

In my example, I do reconfigure: I get a status, but then not the second reconfigured status. Then I try the status command; it fails. Then I try another reconfigure and it works, as does status again now.

c:\earthworm\run_NF\params>reconfigure
using default config file startstop_nt.d

NOTE: IF (and only if) this command fails on the next line,'tport_attach'
  failed, usually because Earthworm is not running, or your user doesn't
 have permissions to access the Earthworm System.

tport_attach succeded!
****   Initial status: ****

                    EARTHWORM SYSTEM STATUS

        Start time (UTC):       Wed Jul 24 19:42:22 2013
        Current time (UTC):     Wed Jul 24 20:37:53 2013
        Ring  1 name/key/size:  WAVE_RING / 1000 / 3096 kb
        Ring  2 name/key/size:  PICK_RING / 1005 / 1024 kb
        Ring  3 name/key/size:  HYPO_RING / 1015 / 1024 kb
        Ring  4 name/key/size:  FILTERPICK_RING / 1037 / 1024 kb
        Ring  5 name/key/size:  PICK_TA_RING / 1043 / 1024 kb
        Ring  6 name/key/size:  HYPO_TA_RING / 1041 / 1024 kb
        Ring  7 name/key/size:  FILTERPICK_TA_RING / 1042 / 1024 kb
        Startstop's Config File:    startstop_nt.d
        Startstop's Params Dir:     (null)
        Startstop's Bin Dir:        (null)
        Startstop's Priority Class: Normal
        Startstop Version:          v7.7 2012-05-20

         Process  Process               Class/
          Name      Id     Status      Priority      Console  Argument
         -------  -------  ------      --------      -------  --------
         statmgr     668   Alive     Normal/Normal    New    statmgr.d
       heli_ewII    3436   Alive     Normal/Normal    Minim  heli_ewII.d
      copystatus    3556   Alive     Normal/Normal    NoNew  FILTERPICK_RING >
      copystatus    2136   Alive     Normal/Normal    NoNew  WAVE_RING <PO_RING
      copystatus     352   Alive     Normal/Normal    NoNew  PICK_RING <PO_RING
    wave_serverV    1032   Alive     Normal/Normal    Minim  wave_serverV.d
     carlstatrig    2728   Alive     Normal/Normal    NoNew  carlstatrig.d
     carlsubtrig    2860   Alive     Normal/Normal    NoNew  carlsubtrig.d
       trig2disk    3804   Alive     Normal/Normal    NoNew  trig2disk.d
      import_ack    2116   Alive     Normal/Normal    New    import_ack.d
         pick_ew    2608   Alive     Normal/Normal    New    pick_ew.d
        pkfilter    3292   Alive     Normal/Normal    New    pkfilter.d
       binder_ew    2056   Alive     Normal/Normal    New    binder_ew.d
          eqproc    2560   Alive     Normal/Normal    NoNew  eqproc.d
       arc2cubic    1996   Alive     Normal/Normal    NoNew  arc2cubic.d
         ew2file    1380   Alive     Normal/Normal    NoNew  ew2file.d
         ew2file    1908   Alive     Normal/Normal    NoNew  ew2fileMAG.d
  export_generic    1532   Alive     Normal/Normal    NoNew  export_hypos.d
            gmew    3140   Alive     Normal/Normal    NoNew  gmew.d
  export_generic    2408   Alive     Normal/Normal    NoNew  export_winston.d
        localmag    2492   Alive     Normal/Normal    NoNew  localmag.d
            java     712   Alive     Normal/Normal    New    <on.in.ew.ImportEW
            java    2124   Alive     Normal/Normal    New    <inston.server.WWS

****   Sent reconfigure directive; refreshed status will appear below... ****

Earthworm may be hung; no response from startstop in 10 seconds.

c:\earthworm\run_NF\params>status
using default config file startstop_nt.d
NOTE: If next line reads "ERROR: tport_attach...", Earthworm is not running.
      Sent request for status; waiting for response...

Earthworm may be hung; no response from startstop in 10 seconds.

c:\earthworm\run_NF\params>reconfigure
using default config file startstop_nt.d

NOTE: IF (and only if) this command fails on the next line,'tport_attach'
  failed, usually because Earthworm is not running, or your user doesn't
 have permissions to access the Earthworm System.

tport_attach succeded!
****   Initial status: ****

                    EARTHWORM SYSTEM STATUS

        Start time (UTC):       Wed Jul 24 19:42:22 2013
        Current time (UTC):     Wed Jul 24 20:39:27 2013
        Ring  1 name/key/size:  WAVE_RING / 1000 / 3096 kb
        Ring  2 name/key/size:  PICK_RING / 1005 / 1024 kb
        Ring  3 name/key/size:  HYPO_RING / 1015 / 1024 kb
        Ring  4 name/key/size:  FILTERPICK_RING / 1037 / 1024 kb
        Ring  5 name/key/size:  PICK_TA_RING / 1043 / 1024 kb
        Ring  6 name/key/size:  HYPO_TA_RING / 1041 / 1024 kb
        Ring  7 name/key/size:  FILTERPICK_TA_RING / 1042 / 1024 kb
        Startstop's Config File:    startstop_nt.d
        Startstop's Params Dir:     (null)
        Startstop's Bin Dir:        (null)
        Startstop's Priority Class: Normal
        Startstop Version:          v7.7 2012-05-20

         Process  Process               Class/
          Name      Id     Status      Priority      Console  Argument
         -------  -------  ------      --------      -------  --------
         statmgr    3944   Alive     Normal/Normal    New    statmgr.d
       heli_ewII    3436   Alive     Normal/Normal    Minim  heli_ewII.d
      copystatus    3556   Alive     Normal/Normal    NoNew  FILTERPICK_RING >
      copystatus    2136   Alive     Normal/Normal    NoNew  WAVE_RING <PO_RING
      copystatus     352   Alive     Normal/Normal    NoNew  PICK_RING <PO_RING
    wave_serverV    1032   Alive     Normal/Normal    Minim  wave_serverV.d
     carlstatrig    2728   Alive     Normal/Normal    NoNew  carlstatrig.d
     carlsubtrig    2860   Alive     Normal/Normal    NoNew  carlsubtrig.d
       trig2disk    3804   Alive     Normal/Normal    NoNew  trig2disk.d
      import_ack    2116   Alive     Normal/Normal    New    import_ack.d
         pick_ew    2608   Alive     Normal/Normal    New    pick_ew.d
        pkfilter    3292   Alive     Normal/Normal    New    pkfilter.d
       binder_ew    2056   Alive     Normal/Normal    New    binder_ew.d
          eqproc    2560   Alive     Normal/Normal    NoNew  eqproc.d
       arc2cubic    1996   Alive     Normal/Normal    NoNew  arc2cubic.d
         ew2file    1380   Alive     Normal/Normal    NoNew  ew2file.d
         ew2file    1908   Alive     Normal/Normal    NoNew  ew2fileMAG.d
  export_generic    1532   Alive     Normal/Normal    NoNew  export_hypos.d
            gmew    3140   Alive     Normal/Normal    NoNew  gmew.d
  export_generic    2408   Alive     Normal/Normal    NoNew  export_winston.d
        localmag    2492   Alive     Normal/Normal    NoNew  localmag.d
            java     712   Alive     Normal/Normal    New    <on.in.ew.ImportEW
            java    2124   Alive     Normal/Normal    New    <inston.server.WWS

****   Sent reconfigure directive; refreshed status will appear below... ****

                    EARTHWORM SYSTEM STATUS

        Start time (UTC):       Wed Jul 24 19:42:22 2013
        Current time (UTC):     Wed Jul 24 20:39:33 2013
        Ring  1 name/key/size:  WAVE_RING / 1000 / 3096 kb
        Ring  2 name/key/size:  PICK_RING / 1005 / 1024 kb
        Ring  3 name/key/size:  HYPO_RING / 1015 / 1024 kb
        Ring  4 name/key/size:  FILTERPICK_RING / 1037 / 1024 kb
        Ring  5 name/key/size:  PICK_TA_RING / 1043 / 1024 kb
        Ring  6 name/key/size:  HYPO_TA_RING / 1041 / 1024 kb
        Ring  7 name/key/size:  FILTERPICK_TA_RING / 1042 / 1024 kb
        Startstop's Config File:    startstop_nt.d
        Startstop's Params Dir:     (null)
        Startstop's Bin Dir:        (null)
        Startstop's Priority Class: Normal
        Startstop Version:          v7.7 2012-05-20

         Process  Process               Class/
          Name      Id     Status      Priority      Console  Argument
         -------  -------  ------      --------      -------  --------
         statmgr    1524   Alive     Normal/Normal    New    statmgr.d
       heli_ewII    3436   Alive     Normal/Normal    Minim  heli_ewII.d
      copystatus    3556   Alive     Normal/Normal    NoNew  FILTERPICK_RING >
      copystatus    2136   Alive     Normal/Normal    NoNew  WAVE_RING <PO_RING
      copystatus     352   Alive     Normal/Normal    NoNew  PICK_RING <PO_RING
    wave_serverV    1032   Alive     Normal/Normal    Minim  wave_serverV.d
     carlstatrig    2728   Alive     Normal/Normal    NoNew  carlstatrig.d
     carlsubtrig    2860   Alive     Normal/Normal    NoNew  carlsubtrig.d
       trig2disk    3804   Alive     Normal/Normal    NoNew  trig2disk.d
      import_ack    2116   Alive     Normal/Normal    New    import_ack.d
         pick_ew    2608   Alive     Normal/Normal    New    pick_ew.d
        pkfilter    3292   Alive     Normal/Normal    New    pkfilter.d
       binder_ew    2056   Alive     Normal/Normal    New    binder_ew.d
          eqproc    2560   Alive     Normal/Normal    NoNew  eqproc.d
       arc2cubic    1996   Alive     Normal/Normal    NoNew  arc2cubic.d
         ew2file    1380   Alive     Normal/Normal    NoNew  ew2file.d
         ew2file    1908   Alive     Normal/Normal    NoNew  ew2fileMAG.d
  export_generic    1532   Alive     Normal/Normal    NoNew  export_hypos.d
            gmew    3140   Alive     Normal/Normal    NoNew  gmew.d
  export_generic    2408   Alive     Normal/Normal    NoNew  export_winston.d
        localmag    2492   Alive     Normal/Normal    NoNew  localmag.d
            java     712   Alive     Normal/Normal    New    <on.in.ew.ImportEW
            java    2124   Alive     Normal/Normal    New    <inston.server.WWS

...

Change History

comment:1 Changed 8 years ago by paulf

I would argue that, despite startstop being multithreaded, if the host on which Earthworm is running has a single-core CPU, there could still be delays if there is a lot going on on the box and startstop takes longer to push out a TYPE_STATUS message than status is willing to wait. status is configured to wait just 10 seconds.

Now, why reconfigure got a status message back and the status module did not is a real puzzler. Did status work after the reconfigure command was run? What was the load on the system at this time, as viewed from Windows Task Manager? I have not seen startstop hang on any UNIX system in recent memory, but they may not be as loaded as Windows seems to get. We really need to be able to reproduce this consistently before we can say there is a problem. Can we close this one for now? I don't know how to even begin to debug this, given that reconfigure worked subsequently (note that reconfigure uses the exact same code as status).
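For illustration, the 10-second wait mentioned above can be sketched as a polling loop with a deadline. This is a Python sketch, not Earthworm's C code: `wait_for_status`, `poll`, and the polling interval are hypothetical, and only the 10-second limit comes from this ticket.

```python
import time


def wait_for_status(poll, timeout_s=10.0, interval_s=0.2,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll for a reply until timeout_s elapses; return the reply or None.

    poll is a hypothetical callable that returns a message, or None if
    nothing has arrived yet (a stand-in for reading the ring).
    """
    deadline = clock() + timeout_s
    while True:
        msg = poll()
        if msg is not None:
            return msg
        if clock() >= deadline:
            return None  # caller reports "no response from startstop"
        sleep(interval_s)
```

With a client like this, a reply that takes longer than the timeout to arrive looks the same as no reply at all, which is why a merely slow startstop would be reported as "may be hung".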

comment:2 Changed 8 years ago by saurel@…

Hello,

Reading Paul's answer reminded me that I actually had this problem on my Linux test computer, which is in fact a virtual machine. I experienced startstop not responding a couple of times while all the modules were running, and if I remember correctly, the system load was not heavy at the time. I let the system run for a couple of days before testing "status" again, and in the meantime it worked normally: picks and locations were produced and so on.

Since this happened on a test computer, I didn't care too much and just rebooted the computer or killed all the modules, since even "pau" was not responding. I didn't try "reconfigure".

The virtual machine has one CPU, runs CentOS-5.4 i686, and I use the latest sources (or those from a couple of months ago) from the SVN.

I hope this will help you. Regards.

Jean-Marie SAUREL.

comment:3 Changed 8 years ago by paulf

Thanks Jean-Marie! The more info the better, so please report all that you see so that we can fix it and make it better.

Was this the 7.6 version of startstop or a recent one from the SVN repo? Please check if you have a moment.

So status never came back and pau failed to terminate everything. It seems like the thread that responded to request messages died somehow.

comment:4 Changed 8 years ago by saurel@…

Paul, I think it's the 7.6 version of startstop. I just patched binder with the one from SVN (with the P-S-ratio command added by you).

Here is what it says when asking for status.

      Sent request for status; waiting for response...

                    EARTHWORM SYSTEM STATUS

        Hostname-OS:            ewtrait.ovmp.martinique.univ-ag.fr - Linux 2.6.18-164.el5xen
        Start time (UTC):       Thu Aug  1 12:28:58 2013
        Current time (UTC):     Thu Aug  8 13:20:38 2013
        Disk space avail:       3252660 kb
        Ring  1 name/key/size:  PROD_WAVES / 1000 / 16384 kb
        Ring  2 name/key/size:  TRIGGER_RING / 1001 / 16384 kb
        Ring  3 name/key/size:  RSSAM_RING / 1002 / 16384 kb
        Ring  4 name/key/size:  PICK_WAVES / 1003 / 16384 kb
        Ring  5 name/key/size:  PICK_RING / 1004 / 16384 kb
        Ring  6 name/key/size:  HYPO_RING / 1007 / 16384 kb
        Ring  7 name/key/size:  RAW_PICKS / 1008 / 16384 kb
        Ring  8 name/key/size:  FP_PICKS / 1010 / 16384 kb
        Ring  9 name/key/size:  HYPO_OUT_RING / 1009 / 16384 kb
        Ring 10 name/key/size:  RSAM_RING / 1013 / 16384 kb
        Ring 11 name/key/size:  FP_PICKS_FILT / 1014 / 16384 kb
        Ring 12 name/key/size:  ALARM_RING / 1015 / 16384 kb
        Startstop's Log Dir:    /home/ew/run_prod/logs/
        Startstop's Params Dir: /home/ew/run_prod/params/
        Startstop's Bin Dir:    /home/ew/v7.6/bin
        Startstop Version:      v7.6 2012-11-10

Is there anywhere else in the source where I could find better info on the version?

comment:5 Changed 8 years ago by paulf

Upon looking through the startstop_unix_generic.c code, I see that if there is no response to status, then this is handled in the main thread (the RunEarthworm() function), which does the interrogation of the rings and calls the functions that act upon the STOP/RECONFIG/REQSTATUS/RESTART messages. If this thread should die, then startstop should be dead. Jean-Marie, did you notice if the startstop process was still running and, if so, were there any messages in the startstop log that indicated any issues? It looks like we need to try to attach to a startstop process that exhibits this behavior with gdb and see what we can find out. The main thread is pretty straightforward code.

Jean-Marie: the version is echoed back in the startstop version or in status. In the SVN version for 7.7 we have a new file in the earthworm/include dir called startstop_version.h that will be used by both UNIX/Windows to signify the version number of the running startstop.

The one thing I noticed is that the REQSTATUS is handled sequentially and could gum up the entire works. That is, when a TYPE_REQSTATUS comes in, SendStatus(int iring) is called with the index of the ring to respond to the status request. SendStatus() is just a function and is not multithreaded (maybe it should be). All that this function does is call EncodeStatus() to build the status message, followed by tport_putmsg() to send it to the ring indicated. If this hangs up in any way, then startstop's main thread will go into a coma waiting for SendStatus() to return.

So looking at EncodeStatus() I see that there is a lot going on, some of which could hang things up if there are problems. It would be good to be able to find one of these cases and attach to the process using gdb and see just where things are hung up.
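To make the concern concrete, here is a small Python model of a single dispatch thread that runs each handler to completion before looking at the next request. It is a sketch, not the startstop code: `run_main_loop` and the handler durations are hypothetical, chosen only to show how one slow handler delays everything queued behind it.

```python
def run_main_loop(requests, handler_cost):
    """Serve requests one at a time, as a single dispatch thread would.

    requests: list of (arrival_time_s, msg_type) tuples.
    handler_cost: dict mapping msg_type to how long its handler runs.
    Returns a list of (msg_type, arrival_time_s, completion_time_s).
    """
    now = 0
    done = []
    for arrival, msg_type in sorted(requests):
        now = max(now, arrival)        # idle until the request arrives
        now += handler_cost[msg_type]  # nothing else is served meanwhile
        done.append((msg_type, arrival, now))
    return done


# Hypothetical numbers: a reconfigure that stalls for 90 s, and a status
# request that arrives 5 s into the stall.
timeline = run_main_loop([(0, "RECONFIG"), (5, "REQSTATUS")],
                         {"RECONFIG": 90, "REQSTATUS": 1})
for msg_type, arrival, completed in timeline:
    print(msg_type, "waited", completed - arrival, "s")
```

In this model the status request is answered 86 seconds after it was sent, long after a client with a 10-second timeout has given up, even though the status handler itself takes only one second.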

comment:6 Changed 8 years ago by saurel@…

OK, I don't see startstop_version.h, so I assume I'm running the 7.6 version.

I don't remember if startstop was still running or not. By looking into the startstop log file, I can see plenty of "cannot restart pid=xxx; it's not my child" during a couple of hours before the crash. Maybe this has overloaded startstop.

Maybe I should mention that on that test computer, Claudio Satriano and I are trying to debug pick_FP. One thing we found is that at Earthworm startup, pick_FP, among other modules, is quite CPU hungry. This seems to make some processes respond too slowly, so they are restarted, so they are again CPU hungry at their start, so other modules stop responding, and so on for a couple of hours. If I remember correctly, one of the startstop crashes occurred during this "stabilization" phase, but I'm pretty sure another one occurred unexpectedly during normal Earthworm operation (i.e., not at startup).

Unfortunately, I cannot tell you more about this, since I didn't keep traces of that crash, so I wouldn't even know which day to search for clues in the log files.

comment:7 Changed 8 years ago by stefan

> I would argue that despite startstop being multithreaded if the host on which earthworm is running is a single core CPU, then there could be delays still if there is a lot going on on the box and startstop takes longer to push out a TYPE_STATUS message than status is willing to wait. status is configured to wait just 10 seconds.

This was a multi-core box in my case.

> Now why reconfigure got a status message back and the status module did not is a real puzzler. Did status work after the reconfigure command was run?

Yes. There was a problem with startstop briefly, and then it cleared up and everything acted like normal after that. So anything like status would have worked fine after the outage.

> what was the load on the system at this time as viewed from Windows Task Manager.

My recollection was that the box wasn't particularly busy, but I didn't take a task manager reading at the time.

> I have not seen startstop hang on any UNIX system in recent memory, but they may not be as loaded as Windows seems to get. We really need to be able to consistently debug this one to say there is a problem. Can we close this one for now as I don't know how to even begin to debug this based on reconfigure working subsequently (note reconfigure uses the exact same code as does status....)????

We definitely shouldn't close this, but if we can't reproduce it, it's obviously very hard to fix. So it may stay in the trac queue for a while before we figure out a way to address this.

Basically in my case: everything was fine. Something briefly wedged startstop (even multi-threaded) then it got unwedged and everything was perfectly fine again.

comment:8 Changed 8 years ago by paulf

I was thinking about the option to make SendStatus() a threaded function, but this would present a problem for the case of the reconfigure command which issues two TYPE_REQSTATUS requests to startstop, one before and one after reconfiguration to show what the result was. If the first thread call to SendStatus() takes longer to finish than the second one, you would get out of order results.

That said, you wouldn't have the main startstop thread that handles requests hang up spectacularly, as was observed.

But then again, I am proposing a fix before we truly know what the cause is.
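One possible way to get both properties (a main thread that never waits, and replies that stay in order) is to hand status jobs to a single worker thread through a FIFO queue. The Python sketch below only illustrates that idea and is not a proposed patch: `start_status_worker`, `send`, and `build` are hypothetical stand-ins for tport_putmsg() and EncodeStatus().

```python
import queue
import threading
import time


def start_status_worker(send):
    """Start one worker thread that sends queued status replies in FIFO order."""
    jobs = queue.Queue()

    def worker():
        while True:
            job = jobs.get()
            if job is None:      # sentinel: shut the worker down
                break
            build, ring = job
            send(ring, build())  # may be slow; only this worker waits

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return jobs, thread


def slow_before():
    time.sleep(0.05)             # pretend the first status is slow to build
    return "before"


sent = []
jobs, thread = start_status_worker(lambda ring, text: sent.append((ring, text)))
jobs.put((slow_before, 0))       # the main thread returns immediately
jobs.put((lambda: "after", 0))
jobs.put(None)
thread.join()
print(sent)                      # [(0, 'before'), (0, 'after')]
```

The trade-off is that a status build that hangs would still stall every later status reply, but in this arrangement the thread that handles STOP and RESTART requests would keep running.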

comment:9 Changed 8 years ago by stefan

I reproduced this just now on the NIOSH .141 Windows XP machine. Reconfigure restarts statmgr. CPU usage low, RAM usage low. Reconfigure did the same thing: it showed the first status, and then failed to show the second. I repeatedly hit 'status' and got no connection. THEN statmgr's new shell window popped up, and only THEN could I get a response back from the status command in the startstop console.

comment:10 Changed 8 years ago by saurel@…

Hello,

I had this problem again this weekend. The system is still in this weird state.

Here are all the modules running.

[ew@ewtrait0 logs]$ ps -ef | grep ew
ew        5749  5923  0 Oct18 ?        00:00:00 export_generic export_bindFWIpickFP.d
ew        5778  5923  0 Oct18 ?        00:00:00 hypAssoc hypAssoc.d
ew        5788  5923  0 Oct18 ?        00:00:00 export_generic export_hyp2piccard.d
ew        5791  5923  0 Oct18 ?        00:00:00 export_generic export_bindMQpickFP.d
ew        5923     1  0 Oct08 ?        00:00:00 startstop
ew        5924  5923  0 Oct08 ?        00:00:00 statmgr statmgr.d
ew        5929  5923  0 Oct08 ?        00:00:25 coda_aav coda_aav.d
ew        5930  5923  0 Oct08 ?        00:00:32 coda_dur coda_dur.d
ew        5931  5923  0 Oct08 ?        00:00:03 eqassemble eqassemble_FWI_FP.d
ew        5932  5931  0 Oct08 ?        00:00:00 eqbuf eqbuf_FWI_FP.d
ew        5933  5923  0 Oct08 ?        00:00:00 eqassemble eqassemble_MQ_FP.d
ew        5934  5932  0 Oct08 ?        00:00:00 eqcoda eqcoda_FWI_FP.d
ew        5935  5934  0 Oct08 ?        00:00:00 hyp71_mgr hyp71_FWI_FP.d
ew        5937  5933  0 Oct08 ?        00:00:00 eqbuf eqbuf_MQ_FP.d
ew        5938  5923  0 Oct08 ?        00:00:01 ew2rsam ew2rsam.d
ew        5939  5923  0 Oct08 ?        00:00:00 ew_rsamalarm ew_rsamalarm.d
ew        5940  5923  0 Oct08 ?        00:00:00 ew2file rsam_alarm2file.d
ew        5941  5923  0 Oct08 ?        00:00:00 export_generic export_hyp2WO.d
ew        5942  5923  0 Oct08 ?        00:00:00 export_generic export_hyp2MC.d
ew        5943  5937  0 Oct08 ?        00:00:00 eqcoda eqcoda_MQ_FP.d
ew        5953  5923  0 Oct08 ?        00:00:00 scnl2scn scnl2scn_sam.d
ew        5954  5923  0 Oct08 ?        00:00:00 scn2scnl scn2scnl.d
ew        5955  5923  0 Oct08 ?        00:00:14 slink2ew slink_srv.d
ew        5956  5923  0 Oct08 ?        00:00:00 trig2arc trig2arc.d
ew        5957  5923  0 Oct08 ?        00:01:01 wave_serverV wave_serverV_BB.d
ew        5962  5943  0 Oct08 ?        00:00:50 hyp71_mgr hyp71_MQ_FP.d
ew        5965  5923  0 Oct08 ?        00:00:02 wave_serverV wave_serverV_SP.d
ew        6112  5923  0 Oct08 ?        00:00:01 carlstatrig carlstatrig.d
ew        6113  5923  0 Oct08 ?        00:00:00 carlsubtrig carlsubtrig.d
ew        6497  5923  0 Oct18 ?        00:00:03 binder_ew binder_MQ_FP.d
ew        6515  5923  0 Oct18 ?        00:00:03 pick_FP pick_FP.d

Status request is hung.

[ew@ewtrait0 ~]$ status
using default config file startstop_unix.d
NOTE: If next line reads "ERROR: tport_attach...", Earthworm is not running.

You will notice that some processes date back to the 18th, while most date back to the 8th. I think the system has been down since the 18th, when we had quite a significant earthquake (origin time 04:02:15) that seems to have overloaded the system. Here are the last lines of the startstop log file.

{{{
20131018_UTC_04:43:56 startstop: Cannot restart pid=6453; it is not my child!
20131018_UTC_04:43:57 startstop: Cannot restart pid=6454; it is not my child!
20131018_UTC_04:44:01 startstop: Cannot restart pid=6476; it is not my child!
20131018_UTC_04:44:02 startstop: Cannot restart pid=6453; it is not my child!
20131018_UTC_04:44:03 startstop: Cannot restart pid=6454; it is not my child!
}}}

And the last lines of statmgr log file.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ SKI.HHZ.TR.00 - Not in station list.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ MAGL.HHN.WI.00 - Not in station list.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ SEUS.HHN.NA.-- - Not in station list.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ SEUS.HHN.NA.-- - Not in station list.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ SEUS.HHE.NA.-- - Not in station list.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ SEUS.HHZ.NA.-- - Not in station list.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ SEUS.HHZ.NA.-- - Not in station list.

UTC_Fri Oct 18 04:45:32 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_FWI (ag.fr) module dead
20131018_UTC_04:45:32 Statmgr: sent restart request for binder_FWI pid 6496
UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_MQ LPM.HHE.MQ.00 - Not in station list.

UTC_Fri Oct 18 04:43:58 2013  ewtrait0.ovmp.martinique.univ-ag.fr/binder_FWI SLBI.HHE.WI.00 - Not in station list.

The system is a 64-bit CentOS 5.4 with Earthworm v7.6.

Note that the same Earthworm system (v7.6) is running on a 32-bit CentOS 5.4 and didn't have any trouble during the quake; it still runs without problems.

If you need more information, just ask; I will keep the system in its hung state for a couple more hours.

Jean-Marie.

comment:11 Changed 6 years ago by et

I took a look at this issue, but was not able to reproduce the problem. (I tried setting 'statmgrDelay' to a larger value like 10 seconds, but this does not seem to cause similar symptoms.) If anyone can describe or post configuration files for a reproducible test case, please do.

--ET

comment:12 Changed 6 years ago by stefan

Hm, I think I've seen behavior like the following on EZW, but can't confirm it happens with Earthworm. Try this: start startstop from a cmd prompt in Windows. Then run 'status' in another cmd prompt. We assume it works properly. Then do a "mark" of text in the startstop cmd prompt window, and leave the mark selecting something. Go back to the other window and try and type 'status'. Does it return? Or does it hang because the "mark" is freezing up the interactive console of startstop?

comment:13 Changed 6 years ago by et

Stefan, I tried the test of leaving marked text in the 'startstop' console (on Windows 7). It looks like, while text is being marked, calls that attempt to do console output are blocked. The 'startstop' module doesn't tend to do console output unless the user enters a command, so it doesn't get blocked. The 'status' command via a different console always seems to work OK.

When a module has its console output blocked it still shows as 'alive' on the status screen. If 'statmgr' is monitoring the module it tends to stop seeing heartbeats and tries to restart the module.

Issuing a 'reconfigure' will sort of do what is described in this ticket -- the first status works, but the second never arrives. However, unlike what is described in this ticket, the system never recovers unless the text-marking is finished. After the text-marking is finished the system seems to recover fine and run OK.

--ET

comment:14 Changed 6 years ago by stefan

It's a Windows-specific problem; however, I think it is worth fixing (perhaps with some threading solution), so that if someone accidentally marks text in a window, the rest of the system still works properly. (This has happened to me on more than one occasion...)

Until we get a reproducible case in UNIX, I don't think we can begin to address that.

comment:15 Changed 6 years ago by et

Well, I guess we'd have to queue all outgoing 'stdout' and 'stderr' data and use a separate thread to send it. Is it worth that much effort?

--ET
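A rough sketch of the queue-plus-writer-thread idea described above, in Python rather than Earthworm's C. `NonBlockingWriter`, its queue size, and the drop-when-full policy are all assumptions made for illustration; the point is only that the caller's `write()` returns immediately while a separate thread makes the call that can block on a frozen console.

```python
import queue
import sys
import threading


class NonBlockingWriter:
    """Queue text for a background thread so callers never block on the console."""

    def __init__(self, stream=None, max_pending=1000):
        self._stream = stream if stream is not None else sys.stdout
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, text):
        """Enqueue text without blocking; count and drop it if the queue is full."""
        try:
            self._pending.put_nowait(text)
        except queue.Full:
            self.dropped += 1

    def close(self):
        """Write out what is queued, then stop the writer thread."""
        self._pending.put(None)    # sentinel; waits if the queue is full
        self._thread.join()

    def _drain(self):
        while True:
            text = self._pending.get()
            if text is None:
                break
            self._stream.write(text)   # the only call that can block
            self._stream.flush()
```

A module would call `out = NonBlockingWriter()` once and use `out.write(...)` in place of direct console writes. Whether dropping output when the queue fills is acceptable, and how much work it would be to route every existing print through such a layer, is the effort question raised above.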

comment:16 Changed 6 years ago by stefan

An example of a Windows user getting caught by this:

On Mon, Jun 22, 2015 at 10:57 AM, James P Davis <jpdavis009@…> wrote:

Hi all,

We've been dealing with this issue for some time with our WinXP and Win7 Earthworm boxes and maybe one of you knows what's going on.

The problem is that, at random, startstop on Windows will go to sleep. When this happens, all of EW is stuck in a pause-state. Nothing happens and there are no errors until EW is "poked" by pressing 'enter' on startstop's console window. When poked like this, hundreds of missed-message and clock errors will spew into the console until EW eventually catches up. Once EW catches up, all is groovy and normal operation resumes.

Like I said, this happens at random times and, that we know of at CERI, there's no certain way to trigger or reproduce the problem. We're using EW 7.7 and autostart_nt_ew.bat to launch EW at startup. The issue affects both WinXP and Win7. It happens often enough that we've given it a name - "windoze".

Any ideas?

---

And here is Lynn's reply, indicating that she has gone to some effort to work around this issue. So I think it may be worth the trouble to fix Windows startstop in Earthworm 7.9.

---

I am guessing that if you look at your startstop console window's "properties", you will find one or both of the 'edit options' checked.

If this is the case, and someone accidentally clicks inside the startstop window and highlights some text, Windoze will pause the running process waiting for more human input. Typing 'enter' in the window unhighlights stuff and lets the process resume.

To avoid this hung window 'feature', we make sure that both edit options are *unchecked* for all of our Earthworm modules that have console windows on Windoze, like so:

Properties dialog box
  Options tab
    Edit Options
      Quick Edit Mode
      Insert Mode

I think on XP we had to run Earthworm manually (not as a service) the first time to set up the properties; choosing "save for future windows with the same title" then made the property stick when it's run as a service.

Property dialog boxes for service-level consoles are not visible to a 'normal' user.
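The same two edit options can also be cleared programmatically through the Win32 console API (GetConsoleMode / SetConsoleMode), which may help where the Properties dialog is not reachable. The sketch below is in Python via ctypes purely for illustration; the helper names are hypothetical, and whether startstop or the individual modules should make this call at startup is an assumption, not something decided in this ticket. It affects only the console of the calling process, and it does not stop a user from freezing output by explicitly choosing Edit > Mark.

```python
import ctypes
import sys

# Console input mode flags from the Windows console API.
ENABLE_INSERT_MODE = 0x0020
ENABLE_QUICK_EDIT_MODE = 0x0040
ENABLE_EXTENDED_FLAGS = 0x0080
STD_INPUT_HANDLE = -10


def mode_without_edit_options(mode):
    """Return a console input mode with Quick Edit and Insert Mode cleared."""
    mode &= ~(ENABLE_QUICK_EDIT_MODE | ENABLE_INSERT_MODE)
    # ENABLE_EXTENDED_FLAGS tells Windows to honour the state of those two bits.
    return mode | ENABLE_EXTENDED_FLAGS


def disable_edit_options():
    """Apply the change to this process's console; return True on success."""
    if sys.platform != "win32":
        return False  # no Windows console to configure
    from ctypes import wintypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.GetConsoleMode.argtypes = [wintypes.HANDLE,
                                        ctypes.POINTER(wintypes.DWORD)]
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    kernel32.SetConsoleMode.argtypes = [wintypes.HANDLE, wintypes.DWORD]
    kernel32.SetConsoleMode.restype = wintypes.BOOL
    handle = kernel32.GetStdHandle(STD_INPUT_HANDLE & 0xFFFFFFFF)
    mode = wintypes.DWORD()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False  # stdin is not a console (for example, redirected)
    new_mode = mode_without_edit_options(mode.value)
    return bool(kernel32.SetConsoleMode(handle, new_mode))
```

In C the equivalent would be the same three Win32 calls made on the module's own standard input handle.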
