Ticket #484 (new enhancement)

Opened 5 years ago

Last modified 5 years ago

statmgr's detection time for module heartbeats lost can be increased automagically

Reported by: paulf Owned by: somebody
Priority: major Milestone: All Platforms
Component: statmgr Version: 7.8
Keywords: Cc:

Description

Many times users use the same check interval as they do heartbeat, not knowing that setting them identical is not optimal and there are cases where if you check every N seconds for a heartbeat, the heartbeat may come at N+1 resulting in statmgr sending a restart request unnecessarily for a module.

My new idea is simply to add a default number of seconds, say 5, to whatever the check interval is...to buffer against this most common of Earthworm mistakes. It would greatly reduce headaches and issues many see.

If I don't hear any objections, I will likely make this change during the next EW class next week.

Attachments

crotwell_statmgr.patch (7.5 KB) - added by philip 5 years ago.
patch to allow statmgr to read corresponding .d files looking for heartbeatint
ValidateHeartbeatToTimeout.sh (3.3 KB) - added by baker 5 years ago.
Shell script to validate a module's heartbeat rate

Change History

comment:1 Changed 5 years ago by baker

Paul,

I'm not so sure it isn't better to have failures show up so people figure out what to fix. I don't like the idea of software not doing what I told it to do, especially if it does it silently. I am definitely sympathetic to the dilemma. But, What if 5 seconds really isn't always good enough? Don't we recommend statmgr timeout = 2X module heartbeat rate? With the current situation, isn't it much more likely that timeouts will actually happen due to the misconfiguration you describe? That in and of itself is educational. If the 5 seconds of padding makes the system work most of the time, what will happen when that was not quite enough? That might be an even harder problem to diagnose. And, the people you are talking about doing this for are probably not going to understand the true cause of the failure if they haven't learned how to properly set up the heartbeats in the first place.

Is there an alternative? What if instead statmgr looked to see what the module heartbeat interval is and, say, doubled that for its timeout? That's essentially what I did when I implemented K2 SDS Block Mode 2 data streaming in k2ew. I looked at the value the K2 had for the SDS Timeout count (the same role played by the statmgr timeout) and I set the rate that k2ew sent its CONTINUE command packets (the same role played by the module heartbeat interval) to half that.

It seems to me the real problem is that the heartbeat interval/timeout should really only be specified in one place -- they are intimately related. I think the module heartbeat interval is the right place, and statmgr should then figure out from that what a reasonable timout should be.

What about doing that?

Larry Baker

comment:2 Changed 5 years ago by philip

Might be worth noting ticket number 3, (and yes I am sad it isn't ticket number 1 or 2) http://earthworm.isti.com/trac/earthworm/ticket/3

Also, I think it would be good to have a default heartbeat, maybe 30 seconds. In the vast majority of the cases the actual heartbeat doesn't matter that much, unless it is over a long haul high latency connection, and users would probably be best served with a 30 second heartbeat that was standard and not specified in the config. The problem comes in the case of someone that doesn't understand picking bad values, so the philosophy should be, I think, that users shouldn't have to configure things that a) don't matter and b) are easy to screw up. Only in the case of something special would it be specified, and then the user would be expect to know what they were doing.

$0.02, you millage may vary!

comment:3 Changed 5 years ago by baker

Phil,

Precisely! You were quite prescient.

Perhaps this ticket should be categorized as a duplicate of ticket #3, which might also deserve upgrading from wishlist to enhancement -- and even scheduled for a fix (after 5 years!).

Larry Baker

comment:4 Changed 5 years ago by baker

Is the problem at the moment because the timeout in a module's .desc file is too short for the heartbeat in the same module's .d file? Concurring with Phil's comment

users shouldn't have to configure things

, shouldn't Earthworm at least deliver .d and .desc files that follow recommended practice? I.e., shouldn't the short-term fix be to reconcile the default heartbeat rates with the corresponding timeouts?

Larry Baker

comment:5 Changed 5 years ago by philip

I hate to do this, but http://earthworm.isti.com/trac/earthworm/ticket/2

Another idea would be for statmgr to put out warnings if the heartbeats it is receiving are > 95% of the time interval it expects. Not that people look at the logs, but at least there is a chance of finding it before it blows up.

Philip

comment:6 Changed 5 years ago by paulf

Thanks for the feedback Philip and Larry:

There are a lot of .d/.desc files to go through, but we can make a stab at putting in optimal default settings for all of them for the 7.8 build; that is if the USGS (our funding source) makes this a priority....I think it should be, as it would cut down on common errors.

And Yes, agreed, pulling the information from one place would be fine, but that is just not the design that was laid out by the original developers and I don't want to rock the boat too much since there are a lot of stable systems out there..BACKWARD compat is a huge issue..and making that backward compatible and "easily done" is not so simple since each developer used different config settings for Heartbeat interval specification....if either of you have time to go through every module and find which syntax was used in each program that would be great, but until the funding source powers say that is a priority over say converting to 64 bit, then we cannot do it under that umbrella. It's not as simple as getting the HB info from the module .d file, IMHO, there are many sides to this issue that need to be thought out.

My suggestion was to provide a safety net against operator error and it would be simple to do in one place and do it quickly...that was my reasoning.

comment:7 Changed 5 years ago by baker

I'm not sure what has been concluded about the proposed change to statmgr. I think software should do what it is told to do. When we get to the point that statmgr is not told what to use for timeout, but has to figure out the best choice -- hopefully, following Philip's suggestions -- then it should calculate the timeout if one is not provided.

As far as the other issues Paul raises, I see no particular benefit in pursuing a 64-bit Earthworm at this time; a 32-bit address space is plenty. And, I see no virtue in preserving a poor choice made by the original developers just for the sake of backward compatibility. Everyone had to bite the bullet when TRACEBUFs became TRACEBUF2s. I don't think there are many sides to this issue. I do think that the work will have to be carefully done, which could be time consuming. For example, I think there are modules with .d file parsers that insist on the order of the options. Most, I think, don't play well when there are options they don't recognize. Paul's right about the possibly different ways the heartbeat is specified. Those would have to be rationalized to use some consistent choice. Both .d and .desc files specify the module ID, not always using the same name for the parameter. That should also be rationalized to use consistent terminology. Rationalizing common use cases to be consistent is a good idea.

Philip made good suggestions five years ago that would have prevented the confusion Paul's students are running into now. I think his ideas are worth implementing, whether USGS needs them or not. This is, after all, a community project, not a USGS project.

comment:8 Changed 5 years ago by philip

The snarky jerk in me wants to make a snide remark, but I will refrain (mostly).

Attached is a patch that does mostly what I think we want. The change does:

1) trys to read the corresponding .d file to the .desc and extract the sending heartbeat, using one of 5, yes 5, variations on "HeartBeatInt?". My look at the .d files seems to say that this covers the vast majority of modules. An exercise for the reader to deal with the few leftovers.

2) If the .desc file does not have a tsec in it, and the .d does then it uses the .d value *3/2. I think a 50% buffer seems reasonable, but that number can be changed.

3) If the .desc does not have tsec and the .d does not have one of the 5 heartbeatints, then it sets it to a default value of 42 *3/2. The 42, being the answer to life, the universe and everything, seems a good default, and so is now in the main .h file. Would be worth an effort to slowly change each module so that if heartbeatint is not in the .d, then the module uses the DEFAULT_HEARTBEAT default value, but that is for another day.

4) If the .desc file does have tsec and the .d file does have one of the 5 (did I mention 5 freaking variations on heartbeatint?), then statmgr checks to see that the receiving interval is strictly greater than the sending. If not, it gives a warning and then adds 5 to the sending value as its expected value.

5) All of this is logged in statmgr's log file.

Here is an example log file. In this case,

pick_ew.desc does not have tsec in .desc or heartbeatint in .d tankplayer has values in both, but are both 30, leading to the issue here in bug 484 binder_ew has value in .d and are valid, ie sending < receiving so all ok heli_ewII does not have tsec in .desc, but does have it in .d

I think these 4 cover most cases, and I think statmgr now does the right thing. Here is the output:


-------------------------------------------------------
statmgr: startup at UTC_20140601_14:52:45
This program is using the non-MT-Safe version of logit.
-------------------------------------------------------
20140601_UTC_14:52:45 Heartbeat setting for statmgr.desc: send: -2 receive: 0
20140601_UTC_14:52:45 Heartbeat setting for startstop.desc: send: -2 receive: 60
20140601_UTC_14:52:45 Heartbeat tsec not in <pick_ew.desc> and can't find in <pick_ew.d>, using default 42 *3/2 = 63
20140601_UTC_14:52:45 Heartbeat setting for pick_ew.desc: send: -1 receive: 63
WARNING: Heartbeat in .d file <tankplayer.d> (30) is >= heartbeat (30) in .desc file <tankplayer.desc>, this will cause statmgr to restart unnecessarily. Setting receive tsec to 30+5
20140601_UTC_14:52:45 Heartbeat setting for tankplayer.desc: send: 30 receive: 35
20140601_UTC_14:52:45 Heartbeat setting for binder_ew.desc: send: 30 receive: 120
20140601_UTC_14:52:45 Heartbeat tsec not in <heli_ewII.desc> but found in <heli_ewII.d>, using 10 *3/2 = 15
20140601_UTC_14:52:45 Heartbeat setting for heli_ewII.desc: send: 10 receive: 15
statmgr: Read command file <statmgr.d>
20140601_UTC_14:52:45 statmgr: startup of statmgr version: 2.0.11 - 2014-06-01
       *****  Statmgr Configuration file  *****

Lastly, I will say that somebody who has a clue about C programming should review this. My skills have atrophied to the point that I wasted an hour because I didn't remember that you have to put the function def in the .h file, so the chance that I have done something stupid is really high. That said, I think this is a worthwhile addition to earthworm and so I hope someone will review, clean up and get this into svn soon.

Have a nice day.... Philip

Changed 5 years ago by philip

patch to allow statmgr to read corresponding .d files looking for heartbeatint

comment:9 Changed 5 years ago by paulf

Awesome. Thanks Philip. Got nothing to do this weekend?? Thanks for reserving the snarky comment on your side.

This should do the trick, however we haven't done away with .desc files entirely yet, but maybe on our way....

We can review this patch during our EW class this week and test it out too.

Cheers,

Paul

comment:10 Changed 5 years ago by baker

While I was thinking some more about ways to deal with this, Philip's suggestions arrived. After reading his solution, here's my proposal.

  1. Embracing Paul's advocacy of not disturbing existing installations, do not change any behavior for those systems. It is likely they have set the heartbeat intervals and statmgr timeouts the way they want them. This argues against Paul's original proposal.
  1. The problem affects new Earthworm users that are not so familiar with the nuances of heartbeats and statmgr timeouts. The root of the problem is that the template configuration files supplied by Earthworm do not follow recommended guidelines for the relationship between a module's heartbeat interval and the stamgr's idea of when it should be killed. I think the first order of business should be to fix the inconsistencies in the distribution. That would solve the problem Paul initially brought up with new users.

I can write a parser that compares the .d heartbeat intervals with the .desc timeouts, borrowing from Philip's investigations already of the several ways the heartbeat intervals are named. Would it be sufficient to parse the template configuration files in the distribution params directory? I could also make it work for the entire svn/src tree.

  1. On the way to a single configuration file per module covering both the module parameters and the statmgr information, I like Philip's idea of having statmgr read the .d files. An asymmetrical unified configuration file is a good start: statmgr sees all, but the module sees only it's configuration. It would be way too much work to merge the .desc files into the .d files for every module. But, merging the .d files into the .desc files involves only changes to statmgr.

I don't think it is a good idea to hard-code the .d, though, for the module parameter file that matches the .desc file. .d and .desc are only conventions; there's no guarantee that is what the module actually used in the startstop Process line.

There are two possible solutions: use the existing @ kom file mechanism or add a new parameter to the .desc files. The @ mechanism does not require any coding in statmgr for the "include", but statmgr would have to beware of parameter names that conflicted with its own. If earlier or later seen names might occur, then it matters whether the @ is at the top, middle, or end of the .desc file. And, statmgr can't tell what params came from where -- it's strictly a syntactic text merge without any semantic information.

Use of a new parameter in a .desc file is probably a better way to go. That can also have semantic meaning. Like "USE" in Fortran or "Import" in Java, which can be qualified to refine what is being included. E.g., USE MODULE in Fortran brings in the entire module; USE MODULE COMPONENT brings in only the requested component. Java Import is like that too. Import jave.util.Date imports only the java.util Date class, while Import java.util.* imports all the classes from the java.util package.

I propose something like "Use" or "Import", rather than "Include". "Include" to me indicates that the contents of the file argument are to be unconditionally inserted into the text at the point of the Include, like CPP #include. That's not what Philip is suggesting. He's saying statmgr only needs to import the .d file if it needs something from it.

  1. Add the logic Philip supplied, but warn only if the sanity check fails. I still think programs should do what they're told. People are smarter than computers. If someone has a reason for choosing a value, that is what should be used. I think we've all encountered programs that "Won't let me do what I want to do!" Hopefully, novices will see the warnings and try to understand what to do.
  1. Once .desc files are modified to import .d files, we can gradually reconcile the different ways the same thing is specified. For each usage, say "HeartbeatInterval?", we can settle on a standard term. statmgr can warn that it is finding a non-standard usage when it has to import the value. The module can add the standard term as a synonym (checking it is not specified both ways!). If the old spelling is found, the module can print a warning that the use is DEPRECATED and should be replaced with the new spelling. That should last at least one full release cycle. The next release cycle, the module should print an error that the old spelling is OBSOLETE and quit. This should also be done for other param names that should be standardized (e.g., so statmgr can share them), such as the term for the module ID. When param names become OBSOLETE, statmgr can remove the code that Philip wrote to handle all the synonyms.
  1. Figure out how to finish the merge of .desc files into .d files. The extra stuff that statmgr needs is provided by the module for stammer, but the module would have to skip those items without complaining. On the other side, we'd have to make sure statmgr's param names don't conflict with any param names used by modules. When that is done, a single .conf file for each module should work.

Separate from this issue, I have been thinking for quite a while about enhancing parameter key/value lookups. I think the object-oriented principle of inheritance could make parameter management much easier. For example, at the most refined level (think leafs in a tree), each module has to have its own ID. However, multiple modules that are duplicates could share a lot of the rest. For example, the Earthworm system I built has three wave_serverV's. If the module configuration files were leafs in an inheritance tree, the next higher level in the tree could have a configuration file with shared values. This is what XML files look like. A tree-structured file system would do. I like that the 2-dimensional structure is visible in a file system. Plus, that could be completely compatible with existing .d files.

comment:11 Changed 5 years ago by philip

For 4) tsec <= heartbeat is an error condition, statmgr should either fix it (eg set tsec = heartbeat+5) or quit. Continuing with only a warning is wrong.

comment:12 Changed 5 years ago by baker

Philip,

Yes, it is true that sec < heartbeat is a misconfiguration under normal circumstances. But, is it worth doing anything about it? Won't it fix itself? Users will see that statmgr is warning about something and the module will likely be killed and restarted perpetually. Except that, for novice users, this causes the questions Paul gives as the reason for this proposed change. But, this is before your changes to issue the warning message.

How will such misconfigurations occur? 1) A user configures their Earthworm system using the template parameter files. Out-of-the box, this error occurs. The cause is the faulty template parameter files. We should not be delivering faulty template parameter files. We should fix them in the svn. My parser will help find them. This will take care of the novices. 2) A user has changed tsec or heartbeat. When the user restarts Earthworm, hopefully they will be wise enough to look for the consequences of their changes. This would not likely be a novice. They should see the warning message and correct their edits. They might also run the parameter file checker -- best right after making the changes.

Let me provide a counter example when you would not want statmgr to change the tsec that is specified. Suppose a module is hanging and statmgr is not killing and restarting it. To troubleshoot this problem, you want to trip statmgr's timeout and watch what happens. How will you do that if statmgr keeps thwarting your attempts to set a tsec to cause a timeout?

Or, Paul wants to demonstrate this feature to his class? Or wants to show them the consequence of misconfiguring tsec or heartbeat? How shall he do that without modifying code?

Or, a developer making changes to statmgr wants to exercise the code paths in the timeout and restart logic.

These are situations where you actually do want sec < heartbeat.

I think your original suggestion many years ago to have to a single .conf file was correct. These misconfigurations are possible because there are two independent ways of specifying what amounts to the same thing. When statmgr can infer what tsec should be, we can distribute template parameter files without tsec at all. Then, only when a user -- for whatever reason -- wants to specify their own tsec will statmgr not compute the tsec to use. If statmgr doesn't like a tsec override value it should complain. But it should not change it.

Larry Baker

comment:13 Changed 5 years ago by baker

Attached is a shell script, ValidateHeartbeatToTimeout?.sh, that will validate a module's Heartbeat Interval with its statmgr Timeout. There are 184 .d files in earthworm-7.7-6103/params (my svn export of rev 6103), and 116 .desc files. Clearly not all the .d files have a .desc file. There are also cases like k2ew, which have one .desc file for three possible .d files:

earthworm-7.7-6103/params/k2ew.desc
earthworm-7.7-6103/params/k2ew_com.d
earthworm-7.7-6103/params/k2ew_tcp.d
earthworm-7.7-6103/params/k2ew_tty.d

When I ran ValidateHeartbeatToTimeout?.sh on all the .d files in earthworm-7.7-6103/params, there were 71 without a matching .desc file, 9 with an invalid Heartbeat Interval (>Timeout/2), and 5 with no Heartbeat Interval at all.

savaii:Earthworm baker$ find earthworm-7.7-6103/params -name \*.d -exec ./ValidateHeartbeatToTimeout.sh {} ';'
earthworm-7.7-6103/params/activated_scripts.desc: no such file
earthworm-7.7-6103/params/ak135_mod.desc: no such file
earthworm-7.7-6103/params/archman.desc: no such file
earthworm-7.7-6103/params/carlstatrig.d: Heartbeat Interval (=0) must be > 0
earthworm-7.7-6103/params/cleandir.desc: no such file
D_FILE:    earthworm-7.7-6103/params/compress_UA.d
HeartBeat: 15
DESC_FILE: earthworm-7.7-6103/params/compress_UA.desc
Timeout: 20
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 10
earthworm-7.7-6103/params/cont_id.desc: no such file
D_FILE:    earthworm-7.7-6103/params/decompress_UA.d
HeartBeat: 30
DESC_FILE: earthworm-7.7-6103/params/decompress_UA.desc
Timeout: 40
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 20
earthworm-7.7-6103/params/diskmgr.d: no Heartbeat Interval found
earthworm-7.7-6103/params/dsclient.desc: no such file
earthworm-7.7-6103/params/eqbuf.desc: no such file
earthworm-7.7-6103/params/eqbuf_nll.desc: no such file
earthworm-7.7-6103/params/eqcoda.desc: no such file
earthworm-7.7-6103/params/eqcoda_nll.desc: no such file
D_FILE:    earthworm-7.7-6103/params/eqfilter.d
HeartBeat: 30
DESC_FILE: earthworm-7.7-6103/params/eqfilter.desc
Timeout: 40
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 20
earthworm-7.7-6103/params/eqprelim_nll.desc: no such file
earthworm-7.7-6103/params/eqverify.desc: no such file
earthworm-7.7-6103/params/eqverify_assemble.desc: no such file
earthworm-7.7-6103/params/eqverify_nll.desc: no such file
earthworm-7.7-6103/params/evanstrig_params.desc: no such file
earthworm-7.7-6103/params/ew2mseed.desc: no such file
earthworm-7.7-6103/params/ew2seisvole.d: no Heartbeat Interval found
earthworm-7.7-6103/params/ew_rsamalarm.desc: no such file
earthworm-7.7-6103/params/ewaccel.desc: no such file
earthworm-7.7-6103/params/ewhtmlreport.desc: no such file
earthworm-7.7-6103/params/ewhttpd.desc: no such file
earthworm-7.7-6103/params/ewmseedarchiver.desc: no such file
earthworm-7.7-6103/params/ewnotify.desc: no such file
earthworm-7.7-6103/params/ewshear.desc: no such file
earthworm-7.7-6103/params/ewspectra.d: no Heartbeat Interval found
earthworm-7.7-6103/params/ewthresh.desc: no such file
earthworm-7.7-6103/params/getfile_cl.desc: no such file
earthworm-7.7-6103/params/getfile_ew.desc: no such file
earthworm-7.7-6103/params/getfileII.desc: no such file
earthworm-7.7-6103/params/getstation.desc: no such file
D_FILE:    earthworm-7.7-6103/params/glass.d
HeartBeat: 30
DESC_FILE: earthworm-7.7-6103/params/glass.desc
Timeout: 45
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 22
earthworm-7.7-6103/params/glass_assoc.desc: no such file
earthworm-7.7-6103/params/glass_earth.desc: no such file
earthworm-7.7-6103/params/glass_pub_params.desc: no such file
earthworm-7.7-6103/params/grf2ew.d: Heartbeat Interval (=0) must be > 0
D_FILE:    earthworm-7.7-6103/params/heli_ewII.d
HeartBeat: 10
DESC_FILE: earthworm-7.7-6103/params/heli_ewII.desc
Timeout: 10
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 5
earthworm-7.7-6103/params/hyp2000_mgr.desc: no such file
earthworm-7.7-6103/params/hyp71_mgr.desc: no such file
D_FILE:    earthworm-7.7-6103/params/import_ack.d
HeartBeat: 90
DESC_FILE: earthworm-7.7-6103/params/import_ack.desc
Timeout: 120
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 60
D_FILE:    earthworm-7.7-6103/params/import_generic.d
HeartBeat: 120
DESC_FILE: earthworm-7.7-6103/params/import_generic.desc
Timeout: 120
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 60
earthworm-7.7-6103/params/k2ew_com.desc: no such file
earthworm-7.7-6103/params/k2ew_tcp.desc: no such file
earthworm-7.7-6103/params/k2ew_tty.desc: no such file
earthworm-7.7-6103/params/loc_wcatwc.desc: no such file
earthworm-7.7-6103/params/localmag-uw.desc: no such file
earthworm-7.7-6103/params/localmag.d: no Heartbeat Interval found
earthworm-7.7-6103/params/localmag.scnl_par.desc: no such file
earthworm-7.7-6103/params/makehbfile.desc: no such file
earthworm-7.7-6103/params/makeTrace.desc: no such file
earthworm-7.7-6103/params/MmNew.desc: no such file
earthworm-7.7-6103/params/moleserv.desc: no such file
earthworm-7.7-6103/params/Ms.desc: no such file
earthworm-7.7-6103/params/mseed2tbuf.desc: no such file
earthworm-7.7-6103/params/mseedarchplayer.desc: no such file
earthworm-7.7-6103/params/nll_mgr.desc: no such file
earthworm-7.7-6103/params/nsmp2ew.desc: no such file
earthworm-7.7-6103/params/palert_svr.desc: no such file
earthworm-7.7-6103/params/pick_wcatwc.desc: no such file
earthworm-7.7-6103/params/probes.desc: no such file
earthworm-7.7-6103/params/quake_id.desc: no such file
earthworm-7.7-6103/params/regions.desc: no such file
earthworm-7.7-6103/params/ringvtanks.desc: no such file
earthworm-7.7-6103/params/samtac2ew_com.desc: no such file
earthworm-7.7-6103/params/samtac2ew_tcp.desc: no such file
earthworm-7.7-6103/params/SCNL_MAP.desc: no such file
earthworm-7.7-6103/params/sendfile_srv.desc: no such file
earthworm-7.7-6103/params/server_template.desc: no such file
D_FILE:    earthworm-7.7-6103/params/sgram.d
HeartBeat: 600
DESC_FILE: earthworm-7.7-6103/params/sgram.desc
Timeout: 960
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 480
earthworm-7.7-6103/params/simple_mod_base.desc: no such file
earthworm-7.7-6103/params/snwclient.desc: no such file
earthworm-7.7-6103/params/startstop_nt.desc: no such file
earthworm-7.7-6103/params/startstop_sol.desc: no such file
earthworm-7.7-6103/params/startstop_unix.desc: no such file
earthworm-7.7-6103/params/statmgr.d: no Heartbeat Interval found
D_FILE:    earthworm-7.7-6103/params/tankplayer.d
HeartBeat: 30
DESC_FILE: earthworm-7.7-6103/params/tankplayer.desc
Timeout: 30
The module's Heartbeat Interval is too long for statmgr's Timeout
The recommended Heartbeat Interval is no more than 15
earthworm-7.7-6103/params/tbuf2mseed.desc: no such file
earthworm-7.7-6103/params/testtr.desc: no such file
earthworm-7.7-6103/params/theta.desc: no such file
earthworm-7.7-6103/params/trig_id.desc: no such file
earthworm-7.7-6103/params/tt_tables.desc: no such file
earthworm-7.7-6103/params/wave_viewer.desc: no such file
earthworm-7.7-6103/params/waveman2disk.desc: no such file

Changed 5 years ago by baker

Shell script to validate a module's heartbeat rate

comment:14 Changed 5 years ago by baker

Phil,

I found one more way Heartbeat Interval was specified on top of the five you provided, plus glass.d uses its own syntax that is unique. See the logic in the attachment.

Larry Baker

comment:15 Changed 5 years ago by philip

Comment for anyone finding this and using Larry's shell script, many linux dists use "dash" as /bin/sh which is a fast loading but limited shell mainly for startup scripts and the typeset command is not included. Change /bin/sh to /bin/bash on the first line and then it should work.

To run this over all of your .d files you can do something like:

ls *.d | groovy -p -e "println './ValidateHeartbeatToTimeout.sh '+line" | /bin/bash

It is normal to see:

earthworm.desc: no such file
earthworm_global.desc: no such file
startstop_unix.desc: no such file
statmgr.d: no Heartbeat Interval found

Any other output probably warrants investigation.

comment:16 Changed 5 years ago by paulf

I agree, the .desc files info can be merged into .d, and probably should, but its a matter of priorities and time. If someone has time to do this coding and validate it all, be my guest.

We are at the mercy of funding sources and this has not been made a high prio by the powers above as there are other more burning issues.

comment:17 follow-up: ↓ 19 Changed 5 years ago by philip

Committed my patch to svn as 6129.

comment:18 Changed 5 years ago by philip

Oops, forgot .h, so also svn 6130

comment:19 in reply to: ↑ 17 Changed 5 years ago by baker

Replying to philip:

Committed my patch to svn as 6129.

This patch hard-codes .desc and .d file name suffixes. This is a mistake. Earthworm does not require .desc and .d; those are conventions. Each program when it is started is given the name of its options file as a command-line argument. statmgr is given the names of the descriptor files for each program it is supposed to watch in its own options file. The file names are arbitrary. This is a significant, potentially incompatible change. I recommend backing out this patch for now.

comment:20 Changed 5 years ago by philip

I do not think you are correct about this. It does "hard code" the xyz.desc => xyz.d file mapping convention, but if the .d file does not exist just prints to the log file, so no harm. Obviously if someone chooses not to follow the .desc => .d convention, this patch doesn't help, but that doesn't seem to be sufficient to reject the patch imho.

My C skills are rusty, so please check to see if I am wrong, but I don't understand the issue you are seeing.

comment:21 Changed 5 years ago by baker

Philip,

I don't see that this patch is necessary. Without it, statmgr requires tsec in a .desc file. There is no default -- a value is required. If the svn .desc files provided proper values, everything would work fine out-of-the box. Isn't this the actual problem originally reported? I.e., that the distributed files were wrong and novices didn't know that was the reason their setups failed? The problem is the tsec settings in the .desc files, not some deficient behavior of statmgr. Assuming tsec values work fine out-of-the box, the only way they stop working is when a user changes them. Users surely have the common sense to make sure their changes cause the system to behave as expected.

The warning about potential problems with the tsec value is good. But, it only works when one of the (not standardized, as you say) heartbeat options is recognized by statmgr_checkheartbeat(). The shell script I wrote does the same thing -- using your hack to match one of the heartbeat option names currently used, plus an oddball one I found in glass.d.

This new behavior where stamgr calculates a default tsec should be part of a strategy to merge .d and .desc files. Or, have .desc files "import" their .d files, as I described previously. As part of doing that, I think the name of the heartbeat option should also be standardized. At that point, computing a default tsec makes sense, since it can happen for every case where the .desc file (which can have any arbitrary name) imports its .d file (which can also have any arbitrary name). You don't end up with two different behaviors from statmgr that (silently) happen to depend on whether it can find files with identical paths and base names and .d and .desc suffixes.

I've made my case. Other developers should weigh in. If this change stays, then there should be documentation in the code and the statmgr .d file explaining that statmgr behaves differently if there is not an path/basename.d file when statmgr is handed a file named path/basename.desc. There should also be comments in startstop_* about the implicit link between path/basename.d in startstop and path/basename.desc in statmgr. And, hopefully a promise that some day the two files will become one.

Larry Baker

comment:22 follow-up: ↓ 23 Changed 5 years ago by philip

Here lies the problem:

Users surely have the common sense to make sure their changes cause the system to behave as expected.

I say expecting users to change heartbeats in two places is redundant and error prone. They should not set it at all (have a default) or if they want a custom value, set it in one place (the .d file). This patch allows that without breaking existing systems.

Next step is that tsec should be commented out from all the example .desc files as a first step towards eliminating .desc files. Assuming this patch is not rolled back, I will work on that.

I do not agree that .desc files should include the .d files, I think they should not exist at all and the information should be moved to default values and/or the .d files. I fell this patch is a small step in that direction.

Just my $0.02

comment:23 in reply to: ↑ 22 Changed 5 years ago by baker

Replying to philip:

I agree with most of your comments.

Here lies the problem:

Users surely have the common sense to make sure their changes cause the system to behave as expected.

I say expecting users to change heartbeats in two places is redundant and error prone.

Evidence that this was a poor choice in the first place.

I do not agree that .desc files should include the .d files

Take a look at my previous explanation for the "include". In the long run, you and I both agree that there should not be separate .d and .desc files. However, merging them into one means every module has to be modified to ignore the parts that are intended for statmgr. That would be an incompatible change and a lot of work (necessary, however, in the long run). Including the .d file from a .desc file lets statmgr see everything about a module. That is sufficient to solve the problem being discussed and makes headway on unifying .d and .desc files.

Other advantages of adding a syntax for a .desc fie to import a .d file: no more hard-coding of magic file names and it's completely backwards compatible. If I can find some time, I'll try to give this a shot.

In the mean time, how about I submit a feature request to standardize on the name of the heartbeat option in the .d files? It would be pretty easy to do a global search for the existing text strings. I think the right way to do that would be to support the old and new names with a warning from the modules that the old name is deprecated. Your thoughts?

Note: See TracTickets for help on using tickets.