[Orca-users] Orcallator - Segmentation Fault

Cockcroft, Adrian acockcroft at ebay.com
Thu Sep 7 11:32:08 PDT 2006


OK, so it's failing while walking the directory tree. I can see that the
renew is already in place a line or so earlier.

It's dereferencing a directory structure that isn't there, so a test
needs to be added to skip this if readdir returns something bad. It's
already testing for null, so there is something bad happening between
the null test and the actual usage of the dirp.

http://docs.sun.com/app/docs/doc/819-2243/6n4i099g0?q=readdir&a=view

I'm not sure how to fix this; maybe add a second test for null
immediately before it's dereferenced?
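
A note on why a null test alone may not be enough: in the debug output
quoted below, ld is non-zero on the iteration that crashes, and the dbx
trace shows the fault inside a memcpy that fills d_name. One possible
reading is that the pointer is fine but the copy of a fixed-size
structure runs past the end of readdir's buffer. The Python model below
is only a sketch of that idea, not SE code and not a confirmed diagnosis.
The record layout (4-byte d_ino, 4-byte d_off, 2-byte d_reclen, name at
offset 10), the 32-byte record length, and the disk names are
assumptions chosen to resemble the trace; whether SE can express a
length-bounded copy is not something this sketch establishes.

```python
import struct

# Assumed layout (hypothetical, modelled on the trace): 4-byte d_ino,
# 4-byte d_off, 2-byte d_reclen, then a NUL-terminated d_name.
HEADER = struct.Struct(">IIH")
NAME_OFFSET = HEADER.size      # 10 bytes of header before d_name
DECLARED_NAME_SIZE = 256       # fixed d_name size a whole-struct copy assumes


def pack_entry(ino, name, reclen=32):
    """Build one variable-length directory record padded to reclen bytes."""
    raw = HEADER.pack(ino, 0, reclen) + name.encode() + b"\0"
    return raw.ljust(reclen, b"\0")


def fixed_copy_fits(buf, pos):
    """True if copying the full fixed-size structure stays inside buf."""
    return pos + NAME_OFFSET + DECLARED_NAME_SIZE <= len(buf)


def read_name(buf, pos):
    """Read d_name using only the bytes covered by the record's d_reclen."""
    _ino, _off, reclen = HEADER.unpack_from(buf, pos)
    field = buf[pos + NAME_OFFSET:pos + reclen]
    return field.split(b"\0", 1)[0].decode(), reclen


# Sixteen hypothetical entries packed back to back, as in a readdir buffer.
buf = b"".join(pack_entry(i, f"c3t8d{i}s0") for i in range(16))

names = []
pos = 0
while pos < len(buf):
    name, reclen = read_name(buf, pos)
    names.append(name)
    print(name, "fixed-size copy fits:", fixed_copy_fits(buf, pos))
    pos += reclen
```

In this model every record can be read safely through its own d_reclen,
while the fixed-size copy stops fitting once a record sits within 266
bytes of the end of the buffer, even though the pointer to it is valid.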

Adrian

-----Original Message-----
From: Brian Poole [mailto:pooleb at gmail.com] 
Sent: Thursday, September 07, 2006 10:39 AM
To: Cockcroft, Adrian
Cc: Dmitry Berezin; Biju Joseph; orca-users at orcaware.com
Subject: Re: [Orca-users] Orcallator - Segmentation Fault

Here is all of the information I've been able to gather on the crash
(SE Toolkit 3.4 on Solaris 10). I compiled it fresh using Forte with
debugging enabled. I took a quick look at trying to find where the
problem actually lies but was unable to come up with anything useful.

Here is the output of running disks.se with debugging enabled:

# /opt/RICHPse/bin/se.sparcv9 -d /opt/RICHPse/examples/disks.se
if (count<31> == GLOBAL_diskinfo_size<101>)
dp = *((dirent_t *) ld<4281687704>)
if (dp.d_name<c3t8d24s3> == <.> || dp.d_name<c3t8d24s3> == <..>)
if (!(dp.d_name<c3t8d24s3> =~ <s0$>))
ld = readdir(dirp<4281664128>)
if (count<31> == GLOBAL_diskinfo_size<101>)
dp = *((dirent_t *) ld<4281687736>)
if (dp.d_name<c3t8d24s4> == <.> || dp.d_name<c3t8d24s4> == <..>)
if (!(dp.d_name<c3t8d24s4> =~ <s0$>))
ld = readdir(dirp<4281664128>)
if (count<31> == GLOBAL_diskinfo_size<101>)
dp = *((dirent_t *) ld<4281687768>)
if (dp.d_name<c3t8d24s5> == <.> || dp.d_name<c3t8d24s5> == <..>)
if (!(dp.d_name<c3t8d24s5> =~ <s0$>))
ld = readdir(dirp<4281664128>)
if (count<31> == GLOBAL_diskinfo_size<101>)
dp = *((dirent_t *) ld<4281687800>)
Segmentation Fault (core dumped)
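
For readers following the trace, the per-entry tests it shows can be
summarised in a few lines. This is a Python sketch reconstructed from
the debug output only, not the diskinfo.se source; the helper name
wanted_entries and the sample name c3t8d24s0 are made up for
illustration.

```python
import re

# Pattern taken from the trace: only names ending in "s0" are processed.
SLICE0 = re.compile(r"s0$")


def wanted_entries(names):
    """Return the entries the traced loop would go on to process."""
    kept = []
    for name in names:
        if name in (".", ".."):
            continue  # skip the self and parent directory links
        if not SLICE0.search(name):
            continue  # the trace moves straight to the next readdir here
        kept.append(name)
    return kept


# c3t8d24s0 is a hypothetical entry; the others appear in the trace.
sample = [".", "..", "c3t8d24s0", "c3t8d24s3", "c3t8d24s4", "c3t8d24s5"]
print(wanted_entries(sample))
```

Every name in the trace above fails the s0 test, so the loop only calls
readdir again; the crash comes from the structure copy at the top of the
next iteration, before any of these tests run.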

So tracking that back shows the segfault occurs on line 215 of
include/diskinfo.se:

    for (ld = readdir(dirp); ld != 0; ld = readdir(dirp)) {
      // grow the array if needed
      if (count == GLOBAL_diskinfo_size) {
        GLOBAL_diskinfo_size += 4;
        GLOBAL_disk_info = renew GLOBAL_disk_info[GLOBAL_diskinfo_size];
      }
      dp = *((dirent_t *) ld);     <---------
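
The growth step in that loop can be modelled as follows. This is a
Python sketch of what "renew" is described as doing in this thread (a
bigger array holding the same items), not SE syntax; the renew function
here is a stand-in written for illustration, and refusing to shrink is
an assumption of the sketch.

```python
GROW_BY = 4  # step used in the quoted loop


def renew(array, new_size, fill=None):
    """Return a new list of new_size slots with the old items first."""
    if new_size < len(array):
        raise ValueError("this sketch only grows the array")
    return list(array) + [fill] * (new_size - len(array))


disk_info = [None] * 101  # GLOBAL_diskinfo_size<101> in the debug output
count = 101               # hypothetical: pretend the array is full

if count == len(disk_info):
    disk_info = renew(disk_info, len(disk_info) + GROW_BY)

print(len(disk_info))
```

In the debug output count is 31 and the size is 101, so this branch is
never taken before the crash, which suggests the array size is not what
triggers the fault.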

Also the truss output:

# truss -fo /tmp/truss.log /opt/RICHPse/bin/se.sparcv9 /opt/RICHPse/examples/disks.se
# tail -15 /tmp/truss.log
5967:   ioctl(4, KSTAT_IOC_READ, "sd3547,err")          = 701015
5967:   ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000)        = 701015
5967:   ioctl(4, KSTAT_IOC_READ, "sd2146,err")          = 701015
5967:   ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000)        = 701015
5967:   ioctl(4, KSTAT_IOC_READ, "sd2177,err")          = 701015
5967:   ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000)        = 701015
5967:   ioctl(4, KSTAT_IOC_READ, "sd3935,err")          = 701015
5967:   ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000)        = 701015
5967:   ioctl(4, KSTAT_IOC_READ, "sd1971,err")          = 701015
5967:   ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000)        = 701015
5967:   ioctl(4, KSTAT_IOC_READ, "sd1972,err")          = 701015
5967:       Incurred fault #6, FLTBOUNDS  %pc = 0xFF2E08EC
5967:         siginfo: SIGSEGV SEGV_MAPERR addr=0xFF356000
5967:       Received signal #11, SIGSEGV [default]
5967:         siginfo: SIGSEGV SEGV_MAPERR addr=0xFF356000

And, perhaps more indicative, the dbx stack trace:

# /opt/SUNWspro/bin/dbx /opt/RICHPse/bin/se.sparcv9 core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading se.sparcv9
core file header read successfully
Reading ld.so.1
Reading libkvm.so.1
Reading libkstat.so.1
Reading libdl.so.1
Reading libelf.so.1
Reading libgen.so.1
Reading libm.so.2
Reading libsocket.so.1
Reading libnsl.so.1
Reading libc.so.1
Reading libc_psr.so.1
Reading libmp.so.2
Reading libmd5.so.1
Reading libscf.so.1
Reading libdoor.so.1
Reading libuutil.so.1
Reading librt.so.1
Reading libaio.so.1
program terminated by signal SEGV (no mapping at the fault address)
0xff2e08ec: _memcpy+0x042c:     ldd      [%o1], %c2
Current function is member_fill
dbx: warning: can't find file "/tmp/se-src/run.c"
dbx: warning: see `help finding-files'
(dbx) where
  [1] _memcpy(0x129938, 0xff356000, 0x8, 0xfffffffa, 0x4, 0x1), at 0xff2e08ec
=>[2] member_fill(vp = 0x1297f0, area = 0xff355ef8 "", bias = 0), line 994 in "run.c"
  [3] struct_fill(vp = 0x1296b0, area = 0xff355ef8 "", bias = 0), line 1043 in "run.c"
  [4] run_indirection(sp = 0xffbfc4b8), line 1308 in "run.c"
  [5] run_call(sp = 0xffbfc4b8), line 1608 in "run.c"
  [6] resolve_expression(vp = 0xffbfcae0, ep = 0x129620, runit = 1), line 2892 in "run.c"
  [7] run_assign(sp = 0x127530), line 1675 in "run.c"
  [8] run_statement_list(lp = 0x127510), line 513 in "run.c"
  [9] run_for(sp = 0x12c078), line 2538 in "run.c"
  [10] run_statement_list(lp = 0x127330), line 513 in "run.c"
  [11] run_for(sp = 0x12c0b8), line 2538 in "run.c"
  [12] run_statement_list(lp = 0x121208), line 513 in "run.c"
  [13] run_block(bp = 0x133288), line 402 in "run.c"
  [14] run_call(sp = 0xffbfcec8), line 1625 in "run.c"
  [15] resolve_expression(vp = 0xffbfd450, ep = 0x13cd80, runit = 1), line 2892 in "run.c"
  [16] resolve_l_expression(ep = 0x13ae18), line 2659 in "run.c"
  [17] run_if(sp = 0x13cf88), line 523 in "run.c"
  [18] run_statement_list(lp = 0x13cf88), line 513 in "run.c"
  [19] run_block(bp = 0x1426f8), line 402 in "run.c"
  [20] se_run(argc = 1, argv = 0x74b88), line 366 in "run.c"
  [21] main(argc = 2, argv = 0xffbffcc4), line 542 in "main.c"
*vp = {
    var_flags      = VF_MEMBER
    var_special    = 0
    var_type       = VAR_CHAR
    var_struct     = (nil)
    var_name       = 0xc44f0 "d_name"
    var_qname      = (nil)
    var_attach_lib = (nil)
    var_address    = (nil)
    var_initial    = (nil)
    var_un         = {
        var_string  = 0x129840 "c3t8d24s6"
        var_digit   = 1218624
        var_udigit  = 1218624U
        var_ldigit  = 5233950226120704LL
        var_uldigit = 5233950226120704ULL
        var_rdigit  = 2.5859149987693e-308
        var_user    = 0x129840
        var_array   = 0x129840
    }
    var_dimension  = 256
    var_subscript  = (nil)
    var_instances  = (nil)
    var_offset     = 10
    var_parent     = 0xffbfd588
    var_next       = (nil)
}
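
The numbers in the three outputs above can be cross-checked with a
little arithmetic. The sketch below is Python used as a calculator; it
assumes, without having seen run.c, that member_fill copies
var_dimension bytes starting at area + var_offset. All values are copied
from the debug output, the truss log, and the dbx trace.

```python
ld = 4281687800          # last ld value printed before the crash
area = 0xFF355EF8        # 'area' argument of member_fill and struct_fill
var_offset = 10          # offset of d_name within the structure, from *vp
var_dimension = 256      # declared size of d_name, from *vp
fault_addr = 0xFF356000  # SEGV_MAPERR address reported by truss

# Assumption: the copy reads var_dimension bytes from area + var_offset.
src_start = area + var_offset
src_end = src_start + var_dimension  # one past the last byte read

print("ld equals area:", ld == area)
print("copy would read", hex(src_start), "up to", hex(src_end))
print("fault address inside that range:", src_start <= fault_addr < src_end)
print("bytes past the fault address:", src_end - fault_addr)
```

Under that assumption the copy would reach 2 bytes past 0xFF356000, the
address truss reports as unmapped, so the entry pointer itself looks
valid and the fixed 256-byte read of d_name is what runs off the end.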

I would be more than happy to provide any additional information on
the problem you might need. Feel free to contact me directly on this
issue.

Thank you,

Brian

On 9/7/06, Cockcroft, Adrian <acockcroft at ebay.com> wrote:
> It should still be possible to avoid the crash by checking for a null
> at the right point.
>
> Is it crashing in kstat read of the iostats, or the devinfo name
> mapping at startup?
>
> Adrian
>
> -----Original Message-----
> From: Dmitry Berezin [mailto:dberezin at surfside.rutgers.edu]
> Sent: Thursday, September 07, 2006 8:43 AM
> To: Cockcroft, Adrian; 'Biju Joseph'; orca-users at orcaware.com
> Subject: RE: [Orca-users] Orcallator - Segmentation Fault
>
> Adrian,
>
> I believe that the actual problem is not with the array sizes, but has
> to do with the "stale" disk devices. SE "segfaults" when it tries to
> access a device that is not currently present on the system. That is
> why the problem is usually seen on clustered systems with shared
> storage or systems with BCV devices that frequently change their state
> to offline. A number of people had previously reported that rebuilding
> the device tree fixed the problem.
>
> I have not had time to look at the code, so I do not know whether this
> could be solved by changing scripts or whether SE itself has to be
> patched.
>
>   -Dmitry.
>
>
> > -----Original Message-----
> > From: orca-users-bounces+dberezin=acs.rutgers.edu at orcaware.com
> > [mailto:orca-users-bounces+dberezin=acs.rutgers.edu at orcaware.com] On
> > Behalf Of Cockcroft, Adrian
> > Sent: Thursday, September 07, 2006 11:13 AM
> > To: Biju Joseph; orca-users at orcaware.com
> > Subject: Re: [Orca-users] Orcallator - Segmentation Fault
> >
> > Years ago I fixed the code that looks at disks to resize the array
> > dynamically. I guess that this code got overwritten at some point,
> > but it's a simple fix; it just doesn't look much like C code...
> >
> > You can use the "renew" keyword to make a new array that is bigger
> > and contains the same items, so figure out where it's indexing into
> > the disk array, check the index, and renew the array to be size+10 or
> > something. There's example code in the generic SE disk class, which
> > for some reason orcallator doesn't seem to use?
> >
> > I'm not currently working on a Solaris box, so it will take me a
> > while to get a setup I could test this fix on, probably a few weeks
> > when I get back from a business trip.
> >
> > Adrian
> >
> > -----Original Message-----
> > From: orca-users-bounces+acockcroft=ebay.com at orcaware.com on behalf of
> > Biju Joseph
> > Sent: Thu 9/7/2006 7:28 AM
> > To: orca-users at orcaware.com
> > Subject: [Orca-users] Orcallator - Segmentation Fault
> >
> > Hello All,
> >
> > I am trying to start orcallator on two nodes of a VCS cluster (4.1)
> > with VxVM 4.1. The database is on EMC disks. Orcallator is giving a
> > segmentation fault.
> >
> > The RICHPse version is 3.4 (03:59 PM 01/05/05). I tried using
> > orcallator.se 1.36 and 1.37. Both give the same problem.
> >
> > The same combination is working on non-clustered systems. All systems
> > are Solaris 10.
> >
> > Can any of you help?
> >
> > Appreciate your help.
> >
> > Regards
> > Biju K Joseph
> > +91-9866116298
> >
> > _______________________________________________
> > Orca-users mailing list
> > Orca-users at orcaware.com
> > http://www.orcaware.com/mailman/listinfo/orca-users



