Software update caused loss of generators in separate redundant groups

Case narrative

A DP2 platform supply vessel (PSV) was set up on auto DP whilst conducting trials in open water after extensive upgrades.  The redundancy concept was based on two redundant groups with power generation, thruster supplies, and all DP consumers equally split between them.  Bow thruster 1 was fed from a dual drive that could be supplied from either redundant group.  Each redundant group had generators of dissimilar size, as shown in the graphic below.

The vessel is designed to operate with open bus ties at all power levels.  At the time of the incident, the main bus tie was open, separating the two redundant groups.  All generators were online.  The weather was fair and sea conditions were light, with a wind speed of 1 knot, a current of 2 knots and a wave height of 1 m.

The vessel was performing a turn when the DG2 breaker unexpectedly tripped, followed by DG4 breaker tripping a few seconds later.  The DP operator received alarms indicating loss of the generators and engineers observed alarms for loss of generator protections in the engine control room.  The vessel was able to maintain position due to the light sea conditions, but power limitations became active for thrusters in the port redundant group.

Subsequent investigation identified that the DG2 breaker had tripped due to a defective Paralleling and Protection Unit (PPU).  The PPU is a microprocessor-based protection unit containing all of the functions for protecting the generator.  The PPU is set up to trip the generator in the case of over-current, under-voltage, short circuit, reverse power and excitation problems.  It is important to note that a generator breaker will also trip if its PPU loses power or reboots.

As well as generator protection, the PPUs were configured for load sharing, where the PPU will maintain the rated speed of the engine following an increase or decrease in generator load.  The interface between the PPUs for active load sharing was via a ‘daisy chained’ CANBUS network as shown in the graphic below.

Further tests demonstrated that switching the power supply of DG2 PPU on and off would cause the screens of other PPUs to flash and reboot.  This was observed to occur approximately fifty percent of the time.  The flashing and reboot would cause the respective generator breaker to trip.  It was noted that cycling the power of DG2 PPU could cause any other generator to trip, regardless of redundant group.

Analysis of the faulty DG2 PPU found that it had a defective power supply and a software error.  An intermittent power supply fault had caused the DG2 PPU to reboot during operation.  A ‘bug’ in the software caused the PPU to boot up with the wrong communication settings, so that data was sent at the wrong speed to the other PPUs over the CANBUS.  Receiving data at the wrong speed caused errors in the other units on the CANBUS, which in turn caused their breakers to trip.  The root cause was a coding error in the boot loader software for the PPUs.
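
The effect of a speed mismatch on a shared bus can be illustrated with a deliberately simplified sketch.  It assumes a receiver that samples once in the middle of each of its own bit periods, and it ignores the resynchronisation, bit stuffing and error handling of a real CAN controller.  The bit rates, the bit pattern and the function name sample_bits are illustrative assumptions, not values taken from the incident.

```python
def sample_bits(tx_bits, tx_bitrate, rx_bitrate):
    """Bits a receiver reads when sampling mid-bit at its own rate.

    Rates are integers in bit/s. Simplified model: no resynchronisation,
    no bit stuffing, no error frames.
    """
    received = []
    n = 0
    while True:
        # Index of the transmitted bit present on the wire at the
        # receiver's n-th sample point (integer maths avoids rounding).
        index = ((2 * n + 1) * tx_bitrate) // (2 * rx_bitrate)
        if index >= len(tx_bits):
            return received
        received.append(tx_bits[index])
        n += 1


frame = [0, 1, 0, 1, 1, 0]  # hypothetical bit pattern

matched = sample_bits(frame, 250_000, 250_000)     # same speed on both sides
mismatched = sample_bits(frame, 125_000, 250_000)  # sender at half speed

print(matched == frame)     # True
print(mismatched == frame)  # False
```

With matched rates the receiver recovers the pattern that was sent.  With the sender at half the receiver's rate, every bit is read twice, so the receiver sees a different pattern, which a real controller would be expected to reject as an error.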

The defective software version had been installed by switchboard vendors during recent vessel upgrade work.  The software updates were produced by the PPU manufacturer and downloaded from the internet.  The switchboard vendors assumed that using the latest software version was safe, and it was their company policy to update the software any time work was done on the units.

The lessons

  • Initial fault finding was hampered because the vendors had left no record of installing different software versions. A fleetwide review showed that the faulty software had been installed on other vessels.  After the incident, software versions were recorded and documented in the FMEA.
  • Control of software, and of the vendors who work on it, needs to be managed as part of a formal management of change (MoC) process, and the need for and benefit of each new version should be fully understood before it is accepted.
  • Although testing of the PPU had been undertaken during FMEA proving trials, the hidden failure was only introduced during the software upgrade. The design of the CANBUS is being further analysed to determine the need for all PPUs to be connected together, rather than just those within each redundant group.  The results of this analysis may allow the splitting of the CANBUS and the removal of any potential hidden failure in this system in the future.
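
The version recording described in the lessons above could be supported by a simple comparison between what is found on board and an approved baseline.  The following is a minimal sketch only: the function name find_unapproved, the unit labels and the version strings are hypothetical and do not come from the vessel's FMEA.

```python
def find_unapproved(installed, approved):
    """Return {unit: (installed_version, approved_version)} for each mismatch.

    A unit missing from the approved baseline is reported with None as
    its approved version.
    """
    deviations = {}
    for unit, version in sorted(installed.items()):
        expected = approved.get(unit)
        if version != expected:
            deviations[unit] = (version, expected)
    return deviations


# Hypothetical baseline and survey result, for illustration only.
approved_baseline = {"DG2 PPU": "1.0", "DG4 PPU": "1.0"}
found_on_board = {"DG2 PPU": "1.1", "DG4 PPU": "1.0"}

print(find_unapproved(found_on_board, approved_baseline))
# {'DG2 PPU': ('1.1', '1.0')}
```

Any entry in the result marks a unit whose software differs from the accepted version and would therefore need to pass through the MoC process.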

This case study demonstrates the importance of software MoC on vessels.  It is accepted that even the best MoC process may not capture issues such as a bug left in the software by its producer.  In this case, however, having such a process would have significantly aided the identification of the issue.

The best form of prevention of such a fault is not having unnecessary network links across redundant groups in the first place, so that any maloperation of software cannot propagate from one redundant group to another.
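
One way to look for such links is to list every network connection and flag those whose two ends sit in different redundant groups.  The sketch below assumes a generic four-unit layout with groups labelled A and B.  The unit names, group assignment and link lists are hypothetical and are not the vessel's actual topology.

```python
def cross_group_links(group_of, links):
    """Return the links whose two endpoints are in different redundant groups."""
    return [(a, b) for a, b in links if group_of[a] != group_of[b]]


# Hypothetical assignment of units to redundant groups.
group_of = {"PPU A1": "A", "PPU A2": "A", "PPU B1": "B", "PPU B2": "B"}

# One chain through all units versus one separate bus per group.
daisy_chain = [("PPU A1", "PPU A2"), ("PPU A2", "PPU B1"), ("PPU B1", "PPU B2")]
split_bus = [("PPU A1", "PPU A2"), ("PPU B1", "PPU B2")]

print(cross_group_links(group_of, daisy_chain))  # [('PPU A2', 'PPU B1')]
print(cross_group_links(group_of, split_bus))    # []
```

An empty result means that no listed link joins the two groups.  It does not prove independence on its own, because links that were never listed are not checked.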