Ticket #695 (closed defect: fixed)

Opened 2 years ago

Last modified 2 years ago

wave_serverV compiled as 64 bit fails to work on Solaris 10

Reported by: paulf Owned by: alexander
Priority: major Milestone: Solaris
Component: wave_serverV Version: 7.10
Keywords: Cc:

Description

After the r7806 changes, wave_serverV finally compiles on Solaris once the #define _XOPEN_SOURCE 500 line is commented out, but the program now fails at startup with a Bus Error when running the memphis test.

See below; this is a 64-bit compilation:

{paulf@niobite:params} !wave
wave_serverV wave_serverV.d
20190215_UTC_20:15:08 LoadTableEnvVariable(): nfiles 2
20190215_UTC_20:15:08 Unsuccessful read of 16 bytes, at offset 0 in tnk/structp1000-2.str
20190215_UTC_20:15:08 TANK File read failed, File was probably empty
20190215_UTC_20:15:08 Unsuccessful read of 16 bytes, at offset 0 in tnk/structp1000-1.str
20190215_UTC_20:15:08 TANK File read failed, File was probably empty
Bus Error (core dumped)
{paulf@niobite:params} file core
core:           ELF 64-bit MSB core file SPARCV9 Version 1, from 'wave_serverV'
{paulf@niobite:params} dbx `which wave_serverV` core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading wave_serverV
core file header read successfully
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
Reading libdl.so.1
Reading libnsl.so.1
Reading libsocket.so.1
Reading librt.so.1
Reading libpthread.so.1
Reading libthread.so.1
Reading libaio.so.1
Reading libmd.so.1
Reading libc_psr.so.1
t@1 (l@1) program terminated by signal BUS (invalid address alignment)
0x0000000100019738: WriteLIndex+0x01c8: stx      %l0, [%l3]
(dbx) backtrace                                                              
backtrace: not found
(dbx) where   
current thread: t@1
=>[1] WriteLIndex(0x100260300, 0x0, 0x5c671dcc, 0x14, 0x0, 0x10025c300), at 0x100019738 
  [2] main(0x2, 0xffffffff7ffff078, 0x3e8, 0xffffffff7ec4c9c0, 0xffffffff7e200140, 0xffffffff7dd00200), at 0x10000a224 
(dbx) 

Change History

comment:1 Changed 2 years ago by paulf

Okay, more clues as to the bus error:

Running: wave_serverV wave_serverV.d 
(process id 29064)
Reading libc_psr.so.1
20190215_UTC_20:23:21 LoadTableEnvVariable(): nfiles 2
20190215_UTC_20:23:21 Unsuccessful read of 16 bytes, at offset 0 in tnk/structp1000-2.str
20190215_UTC_20:23:21 TANK File read failed, File was probably empty
20190215_UTC_20:23:21 Unsuccessful read of 16 bytes, at offset 0 in tnk/structp1000-1.str
20190215_UTC_20:23:21 TANK File read failed, File was probably empty
t@1 (l@1) signal BUS (invalid address alignment) in WriteLIndex at line 833 in file "index_util.c"
  833     *pTStamp2=*pTStamp1;
(dbx) 

comment:2 Changed 2 years ago by paulf

I just confirmed that this is a 64-bit issue on Solaris 10 only; the memphis test works fine with a 32-bit build.

comment:3 Changed 2 years ago by alexander

  • Owner changed from somebody to alexander

I've been working on fixing the alignment issues, so I'll take over this ticket.

comment:4 Changed 2 years ago by alexander

  • Status changed from new to closed
  • Resolution set to fixed

Final fix pushed in r7826.

comment:5 Changed 2 years ago by baker

  • Status changed from closed to reopened
  • Resolution fixed deleted

r7819 breaks localmag on 64 bit SPARC Solaris 10:

getWsSCNLList: wsAppendMenu returned unknown error -2
getAmpFromWS():  WARNING!  No channels successfully processed for Event 235
20190301_UTC_20:29:40 localmag: No local magnitude available, eventId=235, magType=ML, magTypeIdx=1

wave_serverV shows the connection from localmag was improperly closed:

20190301_UTC_20:29:40 wave_serverV: Connection accepted from IP address 127.0.0.1
20190301_UTC_20:29:40 Wave_serverV: Client at IP address 127.0.0.1 closed socket (improperly).

comment:6 Changed 2 years ago by baker

  • Status changed from reopened to closed
  • Resolution set to fixed

r7819 looks to be a global substitution from int to long in all the wave_serverV code. There does not appear to have been any analysis of the actual cause of the invalid address alignment trap.

The trap occurs when a time_t variable stored inside a char buffer is accessed directly: the time_t is not properly aligned within the buffer.

Wave tanks contain packed blocks of data or indices stored as:

(time_t) TimeStamp1 (int) Count <Data> (time_t) TimeStamp2

time_t is an 8-byte data type in a 64-bit Solaris build. <Data> are arrays of structs, which have the appropriate alignment for the architecture. The trouble is caused by the (int) Count. An int is a 4-byte data type, and the variables and arrays are packed with no padding between them. Thus, (time_t) TimeStamp2 is aligned on a 4-byte boundary, not on an 8-byte boundary. On a 32-bit SPARC CPU, TimeStamp2 is accessed using 32-bit instructions, so the misalignment is not a problem. On a 64-bit SPARC CPU, TimeStamp2 is accessed using 64-bit instructions (the stx in the dbx output above), for which the misalignment is illegal and causes an invalid address alignment trap.
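
The offset arithmetic behind this can be sketched in a few lines of C. This is only an illustration, not code from wave_serverV: the function names are hypothetical, and the 16-byte element size is an assumption standing in for any <Data> struct whose size is a multiple of 8.

```c
#include <stddef.h>

/* Assumed sizes for a 64-bit build. ELEM_SIZE is hypothetical: it stands in
 * for any <Data> struct whose size is a multiple of 8 bytes. */
enum { TS_SIZE = 8, COUNT_SIZE = 4, ELEM_SIZE = 16 };

/* Byte offset of TimeStamp2 in a packed block that holds n data elements:
 * TimeStamp1, then Count, then n elements, with no padding. */
size_t timestamp2_offset(size_t n)
{
    return TS_SIZE + COUNT_SIZE + n * ELEM_SIZE;
}

/* Nonzero when offset is a multiple of 8, the alignment an 8-byte store needs. */
int is_8byte_aligned(size_t offset)
{
    return offset % 8 == 0;
}
```

Under these assumptions, timestamp2_offset(n) is always 4 more than a multiple of 8, so is_8byte_aligned() is false for every n, even when the buffer itself starts on an 8-byte boundary.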

The fix is to serialize/deserialize program data, à la C++, when writing/reading the wave tank files, rather than dereferencing pointers into the packed buffer.
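
One common way to do this in C is to copy each field in and out of the buffer with memcpy instead of assigning through a cast pointer (as in *pTStamp2=*pTStamp1). The sketch below is a minimal illustration of that technique, not the code committed in r7841; put_time and get_time are hypothetical helper names.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Store a time_t at any byte offset in buf. memcpy places no alignment
 * requirement on the destination address, so no trap can occur. */
void put_time(unsigned char *buf, size_t offset, time_t value)
{
    memcpy(buf + offset, &value, sizeof value);
}

/* Load a time_t from any byte offset in buf into an aligned local. */
time_t get_time(const unsigned char *buf, size_t offset)
{
    time_t value;
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

The bytes are copied in host byte order and at the host's sizeof(time_t), so this addresses the alignment trap only; it does not by itself make a tank file portable across byte orders or time_t sizes.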

I have defined a new fixed-size uint32_t data type, tank_count_t, for the Count variable:

(time_t) TimeStamp1 (tank_count_t) Count <Data> (time_t) TimeStamp2

As long as the time_t data type and the <Data> arrays are the same size, the wave tanks should be interchangeable between 32-bit and 64-bit Earthworm. That is my expectation, but I do not have the experience to verify that.
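
To make that condition concrete, here is a rough sketch of the block-size arithmetic. Only tank_count_t and its uint32_t base type come from this ticket; block_size, layouts_match, and the sizes passed to them are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tank_count_t;  /* fixed 4-byte Count, as described above */

/* Size in bytes of one packed block:
 * TimeStamp1, Count, n data elements, TimeStamp2. */
size_t block_size(size_t time_size, size_t elem_size, size_t n)
{
    return time_size + sizeof(tank_count_t) + n * elem_size + time_size;
}

/* Two builds lay blocks out identically only if both the timestamp size
 * and the data element size agree. */
int layouts_match(size_t time_a, size_t elem_a, size_t time_b, size_t elem_b)
{
    return time_a == time_b && elem_a == elem_b;
}
```

For example, with two 16-byte elements, block_size(8, 16, 2) is 52 bytes while block_size(4, 16, 2) is 44 bytes, so a build with a 4-byte time_t and one with an 8-byte time_t would not agree on where each field sits.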

• Revert r7819
• Fix invalid address alignment trap on 64-bit SPARC Solaris
• Reapply r7821, r7823, r7824

Fixed in r7841.
