Ticket #564 (closed defect: fixed)

Opened 2 years ago

Last modified 4 weeks ago

archman problems on windows

Reported by: paulf Owned by:
Priority: major Milestone: Windows
Component: archman Version: 8.0
Keywords: Cc:

Description

Elmar Rothert of Q-con reports:

Hi Paul, I just discovered a potential bug in the EW archman module: I observed that sometimes (on Windows) archman is not able to open the temporary file which is generated by “tmpnam”. The module then does not abort but gets stuck in an endless loop and has to be killed and restarted manually. Inserting “exit (1);” behind lines 1182 and 1184 in the .c code will fix this problem. When archman aborts, statmgr should be able to restart the module automatically. However, statmgr can not restart archman because of an invalid PID. It seems that archman does not send the correct heartbeat message into the ring. Changing line 874 sprintf (buffer, "%ld %d\n", bin_time, config->pid); -> sprintf (buffer, "%ld %ld\n", (long) bin_time, (long) config->pid); will fix this. I hope this is helpful for improvements of the module in the next version. Best regards, Elmar

Change History

comment:1 Changed 2 years ago by scott

Random thoughts:

  • This is not a Windows-specific issue; the exact same behavior (but following a different code path for the temp file) is present for non-Windows platforms.
  • Is there a "safer" directory for the module to try and create a temporary file? (This is only an issue in Windows.)
  • With the heartbeat fix, couldn't we end up with a tug of war between archman killing itself & statmgr resurrecting it?
  • Since most of the config file used by waveman2disk comes from archman's config file, might not waveman2disk be modified so it can understand archman's config file?

comment:2 Changed 2 years ago by baker

The proposed fix makes assumptions about the match between the format items and the data. C99 (7.23.1.3) defines the time_t type as a size_t. That means it is unsigned, not signed. There is also a format modifier for size_t data, the z modifier (7.19.6.1.7). I believe the proper format item for a time_t is %zu. I am not sure whether a cast of a time_t is required at all.

pid_t's are more problematic. They are not language constructs, but operating system constructs. The POSIX standard recommends treating pid_t's as "magic cookies" (B.2.6), with no particular meaning ascribed to the number, other than the sign bit, which is used to flag certain operations. The only specification it makes for pid_t is that is must be defined in <sys/types.h>. I would try a %zd format item and hope that no cast is required for a pid_t either.

In the past I have done Earthworm builds using compiler flags to catch format item mismatches. I remember having to make some (throw-away) source changes, though, to test all the format items used in calls to logit. There's lots of them! This is definitely one area where you would want help from the compiler to catch such errors.

comment:3 Changed 4 weeks ago by paulf

  • Status changed from new to closed
  • Resolution set to fixed

In r7975, I made the program declare a "Fatal Error" if the temp name cannot fails or the file cannot be opened.

I also cast pid_t to (long) and used the %ld which we know works on Windows and elsewhere.

Note: See TracTickets for help on using tickets.