Ticket #726 (closed defect: fixed)

Opened 4 weeks ago

Last modified 4 weeks ago

reconfigure command did not restart statmgr

Reported by: paulf
Owned by: somebody
Priority: major
Milestone: OSX
Component: reconfigure
Version:
Keywords:
Cc:

Description

This happened during an EW class on a Mac OS X system (10.15): a reconfigure did not seem to restart the statmgr process. This was with v7.10.

A quick look at the code shows that the logic tries to do this, but we need to debug it because we clearly saw a case where it did not.

Matteo can confirm this as well; he saw the same thing.

Ilya also pointed out that if you don't reset nRing properly, reconfigure doesn't warn you that it didn't work, but this will come back to bite you later when you start EW from scratch.

Change History

comment:1 Changed 4 weeks ago by quintiliani

Paul,

I checked both issues (statmgr and nRing) on Linux running Earthworm binaries built from Subversion revision r8155.

The behavior is the same.

I attach some log messages.

        Hostname-OS:            dcc793b9b262 - Linux 4.19.76-linuxkit
        Start time (UTC):       Wed Jun 10 16:30:50 2020
        Current time (UTC):     Wed Jun 10 16:31:11 2020
        Disk space avail:       44658764 kb
        Ring  1 name/key/size:  WAVE_RING / 1000 / 8192 kb
        Ring  2 name/key/size:  IN_RING / 1037 / 8192 kb
        Ring  3 name/key/size:  DUP_WAVE_RING / 1041 / 8192 kb
        Ring  4 name/key/size:  STATUS_RING / 1040 / 512 kb
        Startstop's Log Dir:    /opt/ew_env/log/
        Startstop's Params Dir: /opt/ew_env/params/
        Startstop's Bin Dir:    /opt/earthworm/bin
        Startstop Version:      v7.10 2019-08-13 (64 bit)

         Process  Process               Class/    CPU
          Name      Id      Status     Priority   Used    Argument
         -------  -------   ------     --------   ----    --------
       startstop      97    Alive         ??/ 0 00:00:00  -
      copystatus      98    Alive         ??/ 0 00:00:00  WAVE_RING STATUS_RING
      copystatus      99    Alive         ??/ 0 00:00:00  IN_RING STATUS_RING
      copystatus     100    Alive         ??/ 0 00:00:00  DUP_WAVE_RING <US_RING
         statmgr     101    Alive         ??/ 0 00:00:00  statmgr.d
  export_generic     102    Alive         ??/ 0 00:00:00  export_generic.d
  export_generic     103    Alive         ??/ 0 00:00:00  export_generic_wws.d
        slink2ew     104    Alive         ??/ 0 00:00:00  slink2ew.d
  import_generic     105    Dead                          import_generic.d

   Press return to print Earthworm status, or
   type restart nnn where nnn is proc id or name to restart, or
   type quit<cr> to stop Earthworm.

configure

   Press return to print Earthworm status, or
   type restart nnn where nnn is proc id or name to restart, or
   type quit<cr> to stop Earthworm.

20200610_UTC_16:31:12 startstop: Reconfigure: Re-reading command file <startstop_unix.d>
20200610_UTC_16:31:12 startstop: Adding child, oldNChild=8, nChild=9, statMgrLoc=-1
startstop: requested OTHER priority (15) not between 0 and 0: reset to 0
20200610_UTC_16:31:12 startstop: process <ringdup_generic> <130> started.
ringdup_generic ringdup_generic ringdup_generic.d
20200610_UTC_16:31:12 ringdup_generic(MOD_RINGDUP_GENERIC): Read command file <ringdup_generic.d>
 Version 0.0.9 2016-09-01
UTC_Wed Jun 10 16:31:14 2020  dcc793b9b262/statmgr msg from unknown module  (statmgr doesn't have a .desc file for this one) inst:73 mod:115 typ:3
UTC_Wed Jun 10 16:31:43 2020  dcc793b9b262/statmgr msg from unknown module  (statmgr doesn't have a .desc file for this one) inst:73 mod:115 typ:3
UTC_Wed Jun 10 16:32:14 2020  dcc793b9b262/statmgr msg from unknown module  (statmgr doesn't have a .desc file for this one) inst:73 mod:115 typ:3
UTC_Wed Jun 10 16:32:46 2020  dcc793b9b262/statmgr msg from unknown module  (statmgr doesn't have a .desc file for this one) inst:73 mod:115 typ:3
                   EARTHWORM-64 SYSTEM STATUS

        Hostname-OS:            dcc793b9b262 - Linux 4.19.76-linuxkit
        Start time (UTC):       Wed Jun 10 16:30:50 2020
        Current time (UTC):     Wed Jun 10 16:32:52 2020
        Disk space avail:       44647300 kb
        Ring  1 name/key/size:  WAVE_RING / 1000 / 8192 kb
        Ring  2 name/key/size:  IN_RING / 1037 / 8192 kb
        Ring  3 name/key/size:  DUP_WAVE_RING / 1041 / 8192 kb
        Ring  4 name/key/size:  STATUS_RING / 1040 / 512 kb
        Startstop's Log Dir:    /opt/ew_env/log/
        Startstop's Params Dir: /opt/ew_env/params/
        Startstop's Bin Dir:    /opt/earthworm/bin
        Startstop Version:      v7.10 2019-08-13 (64 bit)

         Process  Process               Class/    CPU
          Name      Id      Status     Priority   Used    Argument
         -------  -------   ------     --------   ----    --------
       startstop      97    Alive         ??/ 0 00:00:00  -
      copystatus      98    Alive         ??/ 0 00:00:00  WAVE_RING STATUS_RING
      copystatus      99    Alive         ??/ 0 00:00:00  IN_RING STATUS_RING
      copystatus     100    Alive         ??/ 0 00:00:00  DUP_WAVE_RING <US_RING
         statmgr     101    Alive         ??/ 0 00:00:00  statmgr.d
  export_generic     102    Alive         ??/ 0 00:00:00  export_generic.d
  export_generic     103    Alive         ??/ 0 00:00:00  export_generic_wws.d
        slink2ew     104    Alive         ??/ 0 00:00:00  slink2ew.d
  import_generic     105    Dead                          import_generic.d
 ringdup_generic     130    Alive         ??/ 0 00:00:00  ringdup_generic.d

comment:2 Changed 4 weeks ago by quintiliani

I also ran some other tests on Earthworm Linux r8155 regarding nRing and reconfigure.

I arbitrarily changed the value of the nRing parameter in startstop_unix.d several times, running reconfigure after each change.

The value of nRing is completely ignored by reconfigure.

comment:3 Changed 4 weeks ago by quintiliani

The problem is that metaring.statmgr_location is set only while a new module is being added, which in practice means only when Earthworm starts.

The check at http://earthworm.isti.com/trac/earthworm/browser/trunk/src/libsrc/util/startstop_unix_generic.c#L796

evaluates only the nChild-th process (the new one):

if (strcmp( child[nChild].processName, "statmgr") == 0)

During a reconfigure that adds a different module, this check never matches statmgr, so statMgrLoc stays at -1 (as in the log above) and the restart step is skipped. If metaring.statmgr_location is instead set by the following quick-and-dirty test code, statmgr is properly restarted.

dcc793b9b262:/opt/earthworm/src/libsrc [ew:isti_course_1] $ svn di
Index: util/startstop_unix_generic.c
===================================================================
--- util/startstop_unix_generic.c       (revision 8155)
+++ util/startstop_unix_generic.c       (working copy)
@@ -426,6 +426,18 @@
                         metaring.ringKey[j] );
                     metaring.nRing ++;
                 }
+
+               int j_on_Child = 0;
+               for ( j_on_Child = 0; j_on_Child < nChild; j_on_Child++ ) {
+                       // logit( "et" , "%s\n",  child[j_on_Child].processName);
+                       if (strcmp( child[j_on_Child].processName, "statmgr") == 0)
+                       {
+                               /* Store statmgr's location in the child array so that we can start it first */
+                               metaring.statmgr_location = j_on_Child;
+                       }
+               }
+
+
                 if (nChild > oldNChild ) {
                     logit( "et" ,
                         "startstop: Adding child, oldNChild=%d, nChild=%d, statMgrLoc=%d\n",
@@ -436,6 +448,9 @@
                              oldNChild, nChild, metaring.statmgr_location );
                 }
                 SpawnChildren();
+
+
+
                 if (metaring.statmgr_location != -1) {
                     logit( "et" , "startstop: Final reconfigure step: Restart statmgr\n" );
                     sprintf (msg, "%d", child[metaring.statmgr_location].pid);
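
The idea behind the patch above can be sketched in isolation as follows. This is only a minimal illustration, not the code that was committed: the ChildInfo struct, the NO_STATMGR constant and the find_statmgr() helper are hypothetical names made up for this example, and the real child table in startstop has more fields. Unlike the loop in the patch, which keeps the last match, this sketch returns the first match; with a single statmgr entry the result is the same.

```c
#include <string.h>

/* Hypothetical sentinel, mirroring the -1 seen in the statMgrLoc log output. */
#define NO_STATMGR (-1)

/* Hypothetical, cut-down stand-in for one entry of startstop's child table. */
typedef struct {
    char processName[64];
    int  pid;
} ChildInfo;

/*
 * Scan the whole child table instead of looking only at the most
 * recently added entry. Returns the index of the first child whose
 * process name is exactly "statmgr", or NO_STATMGR if there is none.
 */
static int find_statmgr(const ChildInfo *child, int nChild)
{
    int i;

    for (i = 0; i < nChild; i++) {
        if (strcmp(child[i].processName, "statmgr") == 0) {
            return i;
        }
    }
    return NO_STATMGR;
}
```

With a table such as { copystatus, statmgr, slink2ew }, the helper returns 1 no matter which module was added last, which is the property the reconfigure path needs before deciding whether to restart statmgr.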

Here is the output after reconfigure:

                   EARTHWORM-64 SYSTEM STATUS

        Hostname-OS:            dcc793b9b262 - Linux 4.19.76-linuxkit
        Start time (UTC):       Wed Jun 10 19:22:06 2020
        Current time (UTC):     Wed Jun 10 19:22:46 2020
        Disk space avail:       43660460 kb
        Ring  1 name/key/size:  WAVE_RING / 1000 / 8192 kb
        Ring  2 name/key/size:  IN_RING / 1037 / 8192 kb
        Ring  3 name/key/size:  DUP_WAVE_RING / 1041 / 8192 kb
        Ring  4 name/key/size:  STATUS_RING / 1040 / 512 kb
        Startstop's Log Dir:    /opt/ew_env/log/
        Startstop's Params Dir: /opt/ew_env/params/
        Startstop's Bin Dir:    /opt/earthworm/bin
        Startstop Version:      v7.10 2019-08-13 (64 bit)

         Process  Process               Class/    CPU
          Name      Id      Status     Priority   Used    Argument
         -------  -------   ------     --------   ----    --------
       startstop    2096    Alive         ??/ 0 00:00:00  -
      copystatus    2097    Alive         ??/ 0 00:00:00  WAVE_RING STATUS_RING
      copystatus    2098    Alive         ??/ 0 00:00:00  IN_RING STATUS_RING
      copystatus    2099    Alive         ??/ 0 00:00:00  DUP_WAVE_RING <US_RING
         statmgr    2100    Alive         ??/ 0 00:00:00  statmgr.d
  export_generic    2101    Alive         ??/ 0 00:00:00  export_generic.d
  export_generic    2102    Alive         ??/ 0 00:00:00  export_generic_wws.d
        slink2ew    2103    Alive         ??/ 0 00:00:00  slink2ew.d
  import_generic    2104    Dead                          import_generic.d

   Press return to print Earthworm status, or
   type restart nnn where nnn is proc id or name to restart, or
   type quit<cr> to stop Earthworm.

reconfigure

   Press return to print Earthworm status, or
   type restart nnn where nnn is proc id or name to restart, or
   type quit<cr> to stop Earthworm.

20200610_UTC_19:22:51 startstop: Reconfigure: Re-reading command file <startstop_unix.d>
20200610_UTC_19:22:51 startstop: Adding child, oldNChild=8, nChild=9, statMgrLoc=3
startstop: requested OTHER priority (15) not between 0 and 0: reset to 0
20200610_UTC_19:22:51 startstop: process <ringdup_generic> <2153> started.
ringdup_generic ringdup_generic ringdup_generic.d
20200610_UTC_19:22:51 startstop: Final reconfigure step: Restart statmgr
20200610_UTC_19:22:51 ringdup_generic(MOD_RINGDUP_GENERIC): Read command file <ringdup_generic.d>
 Version 0.0.9 2016-09-01
UTC_Wed Jun 10 19:22:52 2020  dcc793b9b262/statmgr Program stopping. Page sent.
startstop: requested OTHER priority (15) not between 0 and 0: reset to 0
20200610_UTC_19:22:52 startstop: successfully restarted <statmgr> <pid: 2156>
statmgr statmgr statmgr.d
20200610_UTC_19:22:52 LoadTableEnvVariable(): nfiles 2
UTC_Wed Jun 10 19:22:53 2020  dcc793b9b262/statmgr Program starting. Page sent.

And then the next status output:

                    EARTHWORM-64 SYSTEM STATUS

        Hostname-OS:            dcc793b9b262 - Linux 4.19.76-linuxkit
        Start time (UTC):       Wed Jun 10 19:22:06 2020
        Current time (UTC):     Wed Jun 10 19:23:12 2020
        Disk space avail:       43660296 kb
        Ring  1 name/key/size:  WAVE_RING / 1000 / 8192 kb
        Ring  2 name/key/size:  IN_RING / 1037 / 8192 kb
        Ring  3 name/key/size:  DUP_WAVE_RING / 1041 / 8192 kb
        Ring  4 name/key/size:  STATUS_RING / 1040 / 512 kb
        Startstop's Log Dir:    /opt/ew_env/log/
        Startstop's Params Dir: /opt/ew_env/params/
        Startstop's Bin Dir:    /opt/earthworm/bin
        Startstop Version:      v7.10 2019-08-13 (64 bit)

         Process  Process               Class/    CPU
          Name      Id      Status     Priority   Used    Argument
         -------  -------   ------     --------   ----    --------
       startstop    2096    Alive         ??/ 0 00:00:00  -
      copystatus    2097    Alive         ??/ 0 00:00:00  WAVE_RING STATUS_RING
      copystatus    2098    Alive         ??/ 0 00:00:00  IN_RING STATUS_RING
      copystatus    2099    Alive         ??/ 0 00:00:00  DUP_WAVE_RING <US_RING
         statmgr    2156    Alive         ??/ 0 00:00:00  statmgr.d
  export_generic    2101    Alive         ??/ 0 00:00:00  export_generic.d
  export_generic    2102    Alive         ??/ 0 00:00:00  export_generic_wws.d
        slink2ew    2103    Alive         ??/ 0 00:00:00  slink2ew.d
  import_generic    2104    Dead                          import_generic.d
 ringdup_generic    2153    Alive         ??/ 0 00:00:00  ringdup_generic.d

   Press return to print Earthworm status, or
   type restart nnn where nnn is proc id or name to restart, or
   type quit<cr> to stop Earthworm.

If that's okay with you, tomorrow I'll tidy up the code and submit it to the EW repository.

comment:4 Changed 4 weeks ago by paulf

Thanks, Matteo. You beat me to the fix. Please check in your version and change the date on the Startstop Version (done in the include directory).

comment:5 Changed 4 weeks ago by quintiliani

Paul, I will only be able to do it tomorrow, and since you have a better overall view than I do, perhaps it is better that you proceed.

comment:6 Changed 4 weeks ago by paulf

  • Status changed from new to closed
  • Resolution set to fixed

Fixed in r8158, thanks Matteo.
