Thomas Moore, Kristopher Kozak, and Klaus Brun, Southwest Research Institute, San Antonio, Texas;

and Alfredo Ramos, CIATEQ, Querétaro, Mexico

Risk is inherent to all industries, but it is particularly evident in the gas compression industry. In terms of economic impact, reliability, and safety, risks must be regularly assessed and mitigated; how effectively they are addressed determines the efficiency, profitability, and safety of a particular operation. Risk is broadly defined here to include economic, reliability, and safety factors.

Economic risks include the costs associated with transferring risk through insurance, operational efficiency, and future planning. Risk associated with reliability is closely related to economic risk, in that an unreliable system will not operate efficiently and can therefore impact the profitability of the operation.

Several factors affect reliability, including system design, maintenance, logistics, and training. Reliability is a balance between the efficiency or profitability of an operation and the cost of maintaining a particular level of efficiency. Failing to allocate the resources needed to properly maintain a system can have grave consequences, not only in economic terms but in terms of safety as well.

Providing a safe work environment for employees is of utmost importance in most organizations, and failure to provide a safe environment or adequate training can have tragic consequences. The costs associated with inadequate safety directly impact profitability, in terms of lost revenue and degraded efficiency, as well as exacting a human toll. All of these risks are clearly interdependent, and they significantly impact compressor and pipeline operations.

To manage risk, one must first understand the risks involved and have some method of evaluating their relative significance. Over the past forty years, the nuclear power and aerospace industries have developed methods and tools for quantitatively evaluating risk. These same methods and tools have been adapted for use within the oil and gas pipeline industry and are discussed below. Such analysis is crucial to understanding the significant factors in complex, interrelated systems.

Theoretical background

Many of the tools and methods used to evaluate risk were developed as quality control measures. A number of methodologies are available for performing risk, reliability, or cause-and-effect analysis, including the well-known Six Sigma process, which relies on consecutive levels of analysis. Various methodologies are available within Six Sigma, all of which include a definition of the process improvement goals and some form of process improvement. For an existing system, there will be a definition of goals, measurement of relevant data, analysis of cause and effect, improvement, and control to detect problems; this is the DMAIC methodology. The methodology described herein relies on the same levels of analysis but is adapted to industrial processing rather than manufacturing.

One of the key differences of this approach is the level of detail included in the analysis. A typical risk analysis would not simply consider a component failure, such as a valve, but might examine at the micro level how the component failed internally. Details about the functioning of Line Replaceable Units (LRUs) are often used to develop a statistical model of the component, which is then used to provide failure modes and rates for that component. The analysis proposed here takes a macroscopic approach, such that the level of detail terminates with an LRU. Failure rates are obtained from historical failure data provided by the operators or manufacturers, or in some cases through statistical analysis when a failure history is unavailable. Additionally, the pipeline station is divided into a small number of system categories or groups.

As previously stated, there are five levels or phases in the DMAIC process. The risk, reliability, and failure mode analysis utilizes only the first three: define, measure, and analyze. The last two are the quality control phases of improvement and control. In the definition phase, Ishikawa or fishbone diagrams are used to define the system groups, top-level undesired events, and possible modes of failure that could lead to an undesired event.

Data collection constitutes the measurement phase of the process. Data typically collected at a station includes piping and instrumentation diagrams (P&IDs), electrical system schematics, fire protection and turbomachinery system documentation, and historical failure and maintenance data. After the goals or undesired events have been defined and the data collected, the analysis phase begins. The collected data is used to construct functional block diagrams for each system. A fault tree is then developed to provide quantitative measures of risk and reliability. Data obtained from the fault tree analysis is then statistically evaluated to identify potential issues to be addressed in the improvement phase.
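
As a minimal sketch of how a fault tree produces such quantitative measures, the fragment below evaluates a hypothetical top event through OR and AND gates, assuming independent basic events; the component names and failure probabilities are illustrative only, not data from the stations discussed here:

```python
# Minimal fault tree evaluation: OR gates combine the probability that
# at least one input occurs; AND gates multiply. Independence assumed.
# All component names and probabilities are hypothetical.

def or_gate(*probs):
    """Probability that at least one input event occurs."""
    p = 1.0
    for q in probs:
        p *= (1.0 - q)
    return 1.0 - p

def and_gate(*probs):
    """Probability that all input events occur."""
    p = 1.0
    for q in probs:
        p *= q
    return p

# Basic events: annual failure probabilities of LRUs (hypothetical).
valve_fails  = 0.02
sensor_fails = 0.05
backup_fails = 0.10

# Top event: station trip if the valve fails, or if the sensor
# and its backup both fail.
top = or_gate(valve_fails, and_gate(sensor_fails, backup_fails))
print(f"Top-event probability: {top:.4f}")   # ~0.0249
```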

Fishbone diagrams

The Ishikawa or fishbone diagrams are part of the first steps of the definition phase. As a practical matter, this phase will often occur concurrently with the data collection phase for an operating compressor station. These diagrams have, at their root, a single top-level event that is related to certain causes. The causes are grouped into several broadly defined categories, and the resulting structure resembles a fishbone, as illustrated in Figure 1. These diagrams are conceptually simple but can become rather complex to analyze for large systems. The basic concept of a fishbone diagram is to generate a simple schematic that sets all the causes of a problem against its effects. The diagrams provide a basis from which to begin the detailed analysis. Operationally, the fishbone diagrams may be used to help station operators understand the causes and effects of the interrelated systems. Additionally, these diagrams should be regularly reviewed and revised to incorporate any process changes and improve relational understanding.
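
For review and revision purposes, a fishbone can be captured as a simple nested structure mapping the top-level undesired event to cause categories and then to specific causes. The sketch below uses hypothetical entries, not actual station findings:

```python
# A fishbone diagram as a nested mapping: top-level undesired event ->
# cause categories -> specific causes. All entries are hypothetical.
fishbone = {
    "loss of compression": {
        "human factors": ["incorrect valve lineup", "missed inspection"],
        "equipment":     ["anti-surge valve sticks", "turbine trip"],
        "maintenance":   ["overdue filter change"],
        "design":        ["single-point power feed"],
    }
}

for category, causes in fishbone["loss of compression"].items():
    print(f"{category}: {', '.join(causes)}")
```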

Data collection

On-site data is gathered that relates to the configuration of each station and its installed equipment. This data includes the function and operation of the station, the general condition of the station, and engineering system drawings, including process flow diagrams and P&IDs. Additionally, empirical, anecdotal, and actual maintenance records of systems and system components are collected.

The maintenance history data collected on-site is then categorized into the previously defined system groups. This data is statistically reduced to yield failure rates for each LRU, which may then be used in the fault tree analysis. If the failure rate data collected on-site is inadequate for use in the analysis, data may be obtained from industry failure databases, from manufacturers, or from a statistical model of the component. The failure rates depend on the age of the component, the environment, and how the unit is being used. This data will often follow a bathtub-like curve, with a higher number of failures at the beginning (infant mortality) and near end-of-life (wear-out), as shown in Figure 2. This type of curve can often be modeled with a Weibull distribution for each of the three regions. The intrinsic failure period is the period over which a component is nominally expected to perform and will suffer random intrinsic failures at a constant rate over the ensemble of components.
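
The sketch below shows one way the three regions of the bathtub curve might be superposed from Weibull hazard functions; the shape and scale parameters are illustrative assumptions, not fitted values. A shape parameter below one gives the decreasing infant-mortality rate, one gives the constant intrinsic rate, and above one gives the increasing wear-out rate:

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard (instantaneous failure) rate at time t.
    beta: shape parameter (dimensionless); eta: scale parameter (hours)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def bathtub_hazard(t):
    """Superpose three Weibull regions to approximate a bathtub curve.
    Parameters are illustrative, not fitted to real component data."""
    return (weibull_hazard(t, beta=0.5, eta=2000.0)      # infant mortality (decreasing)
            + weibull_hazard(t, beta=1.0, eta=10000.0)   # intrinsic (constant, 1e-4/h)
            + weibull_hazard(t, beta=3.0, eta=50000.0))  # wear-out (increasing)

for hours in (10, 1000, 50000, 150000):
    print(f"t = {hours:>6} h: hazard = {bathtub_hazard(hours):.2e} per hour")
```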

Associated with a component failure is the time necessary to make a repair. The Mean-Time-To-Repair (MTTR) is the average time it takes to make a particular repair. To develop a comprehensive reliability model, MTTR data must also be obtained; this data provides the capability to forecast reliability from the present to any later point in time. Failure events caused by human interaction with the system must also be modeled. Human factors data, such as the frequency of involvement with a component, verification or oversight, and the probability of error, must also be collected. Verification and oversight refer to how often a system is checked for proper operation or configuration; this oversight may include an automated system. Finally, what is the probability that personnel interacting with the system could cause an inadvertent failure? Intentional acts of sabotage are not considered here, but it is possible to include terrorist acts and security in the analysis. The human factor is one of the most challenging components to model.
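
As a simple illustration of how MTTR enters a reliability model, the steady-state availability of a repairable component can be estimated as MTTF/(MTTF + MTTR); the figures below are hypothetical:

```python
# Steady-state availability from mean time to failure (MTTF) and mean
# time to repair (MTTR): A = MTTF / (MTTF + MTTR).
# Values are hypothetical, not from the stations in this article.
def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

mttf = 8760.0   # one failure per year on average
mttr = 24.0     # one day to repair
a = availability(mttf, mttr)
print(f"Availability: {a:.4f}  Unavailability: {1 - a:.4f}")
```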

Statistical analysis

The analyses performed on the stations are based on the fault trees developed specifically for each station. In each case, several types of analysis are performed. The highest-level analysis simply estimates the failure statistics of the top-level systems as calculated directly from the fault trees, which are populated with estimated component failure rates. These top-level failure statistics include unavailability, frequency, and Mean-Time-To-Failure (MTTF), with unavailability calculated on a one-year baseline. These statistics do not include higher-order measures, such as standard deviation, or other reliability information, such as MTTR. They are provided for all of the top-level gates for each station and can be viewed for all of the lower-level gates as desired.
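
For a gate or component in the intrinsic (constant-rate) region, these top-level statistics relate in a simple way, sketched below with a hypothetical failure rate: MTTF is the reciprocal of the rate, and the probability of at least one failure over the one-year baseline follows from the exponential distribution.

```python
import math

# For a constant failure rate lam (per hour): MTTF = 1 / lam, the
# expected failure frequency over period T is lam * T, and the
# probability of at least one failure in T is 1 - exp(-lam * T).
# The rate below is a hypothetical top-event rate, not station data.
lam = 2.5e-4                      # failures per hour (hypothetical)
T = 8760.0                        # one-year baseline, hours

mttf = 1.0 / lam
p_one_year = 1.0 - math.exp(-lam * T)
print(f"MTTF: {mttf:.0f} h, expected failures/yr: {lam * T:.2f}, "
      f"P(>=1 failure in a year): {p_one_year:.3f}")
```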

Beyond the top-level failure statistics, a more detailed analysis is performed that examines the contributions of lower-level subsystems or combinations of components as they affect the overall failure rates of the top-level systems. This analysis is broken down into three categories:

  • Cut-set analysis
  • Risk analysis
  • Pareto analysis.

The cut-set analysis, which is also the basis of the risk and Pareto analyses, identifies the components or subsystems that have the strongest effect on the overall failure rate of the systems. The cut-sets are generated automatically and adaptively within the fault tree software, and thus sometimes terminate at an event and sometimes at a gate. Therefore, the strongest influences on system reliability do not always correspond to the same types of system components. The cut-sets are generated for a nominal set of set-generation parameters and represent a single useful, but by no means the only, interpretation of the system.
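
To make the cut-set idea concrete, the sketch below enumerates minimal cut-sets by brute force for the same hypothetical three-component tree used earlier; production tools generate cut-sets adaptively, as noted above:

```python
from itertools import combinations

# Brute-force minimal cut sets for a small fault tree (illustrative only).
# A minimal cut set is a smallest combination of basic events whose joint
# occurrence causes the top event.
EVENTS = ["valve", "sensor", "backup"]

def top_event(failed):
    """Hypothetical top event: valve fails OR (sensor AND backup fail)."""
    return "valve" in failed or {"sensor", "backup"} <= failed

cut_sets = []
for r in range(1, len(EVENTS) + 1):          # grow subsets by size
    for combo in combinations(EVENTS, r):
        s = set(combo)
        # Keep s only if it triggers the top event and contains no
        # smaller cut set already found (minimality).
        if top_event(s) and not any(c <= s for c in cut_sets):
            cut_sets.append(s)

print(cut_sets)   # [{'valve'}, {'sensor', 'backup'}] (element order may vary)
```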

The risk analysis provides a means of quantifying the cost of specific types of failures. The cut-sets are used to assess the unavailability of the system as affected by its individual components or subsystems. This method of analysis helps prioritize which problems should be addressed first. Beyond the basic concept of a cut-set, there are several ways of using cut-sets to analyze the relative reliability and risk of specific subsystems and components. The two approaches used here are Pareto analysis and monetary risk analysis, which are closely related.

The Pareto analysis shows the relative importance of each of the cut-sets on a normalized scale (Figure 3). Pareto analysis is based on Pareto's law, which suggests that 80% of the effects are generated by only 20% of the causes. Unfortunately, as with both the base cut-set and risk analyses, the influence of each cut-set on the overall failure rates may overlap with that of other cut-sets, so there is a chance that fixing the faults that elevate a particular cut-set to a position of high influence may not actually have a significant effect on the overall failure rates of the system. Therefore, even the Pareto analysis must be viewed as a tool rather than a solution.

By decomposing the fault tree into cut-sets, the cut-sets can be ordered according to relative significance. Unfortunately, there is no unique way to decompose a system into cut-sets, so some significant failure drivers may not readily appear in a single cut-set decomposition. This is particularly true when using some commercial fault tree software packages. The problem can often be minimized by varying the set-generation parameters, but it nevertheless remains a challenge. Once the desired cut-sets are ordered, the subsystems and components can be compared quantitatively, and this type of analysis can thus serve as a driver for planning system redesigns or component replacements.
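
A minimal sketch of this ordering, using hypothetical cut-set probabilities: the cut-sets are ranked by their approximate contribution to the top-event probability, and the cumulative share is accumulated, which also exposes the 80/20 pattern Pareto's law predicts.

```python
# Pareto ranking of cut-sets by estimated contribution to the top-event
# probability (rare-event approximation: a cut-set's contribution is the
# product of its event probabilities). All values are hypothetical.
cut_set_probs = {
    "valve":            0.020,
    "sensor & backup":  0.005,
    "power feed":       0.012,
    "control software": 0.003,
}

total = sum(cut_set_probs.values())
cumulative = 0.0
for name, p in sorted(cut_set_probs.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += p / total
    print(f"{name:<18} {p:.3f}  cumulative share: {cumulative:5.1%}")
```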

To delve further into the relative significance of component faults, a monetary risk analysis can be performed. This analysis uses the results of the cut-set decomposition and/or the Pareto analysis to put a value on a subsystem or component failure. The most straightforward calculation of monetary risk is simply the probability of failure multiplied by the cost of a failure; similarly, in some cases, the calculation is unavailability multiplied by the cost of unavailability. This type of analysis is useful when a single maintenance budget must service several important systems and spending must be prioritized.
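
A sketch of that calculation with hypothetical figures: each subsystem's expected annual cost is its failure probability multiplied by its consequence cost, and the resulting ranking drives budget priority.

```python
# Monetary risk ranking: expected annual cost = failure probability (per
# year) x consequence cost, or unavailability x cost of downtime.
# All figures are hypothetical placeholders.
subsystems = [
    ("rotating machinery", 0.15, 500_000),   # (name, P(fail)/yr, cost USD)
    ("gas conditioning",   0.30,  50_000),
    ("fire protection",    0.02, 800_000),
]

for name, p_fail, cost in sorted(subsystems, key=lambda s: s[1] * s[2], reverse=True):
    print(f"{name:<20} expected annual risk: ${p_fail * cost:>10,.0f}")
```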

Future state analysis

To reach future performance goals, modifications can be made to the station to improve reliability. Each candidate modification is represented as a change to the fault tree (e.g., a component change or a process change) that improves the overall reliability of the system. The fault trees provide a great benefit here, in that it is immediately clear whether a change improves the overall reliability of the system and by how much.
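
A sketch of this before-and-after comparison, reusing the hypothetical gate logic from earlier: the modification modeled here is adding a redundant sensor, and its effect on the top event is read off directly.

```python
# Future-state comparison: re-evaluate the top event after a hypothetical
# modification (adding a redundant sensor), holding everything else fixed.
def or_p(a, b):
    """P(A or B) for independent events."""
    return 1.0 - (1.0 - a) * (1.0 - b)

valve, sensor = 0.02, 0.05            # hypothetical failure probabilities

baseline = or_p(valve, sensor)            # single sensor
modified = or_p(valve, sensor * sensor)   # redundant (parallel) sensors

print(f"Baseline top-event probability: {baseline:.4f}")
print(f"With redundant sensor:          {modified:.4f}")
print(f"Improvement: {100 * (baseline - modified) / baseline:.1f}%")
```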

Case studies

A detailed risk and reliability analysis, a metric of station integrity, has been performed for several compressor and pump stations. Field data was collected at each station, including functional and operational assessments, verification of engineering system drawings, and historical maintenance records. Based on data provided by the customer, as well as data collected in the field, process diagrams were created and a detailed reliability analysis was performed. The results of this analysis were statistically reduced and compared to historical failure data collected from each station.

The purpose of compression station No. 7 is to recompress natural gas received between stations (Figure 4). The natural gas enters the station through the 48-in. station suction valves POV-041 or POV-058. The natural gas subsequently passes through a filtration process before passing through one of the LM2500-driven turbocompressors. Two 30,000 HP GE LM2500+ turbines power the compressors. An anti-surge valve is used to recycle the natural gas through the compressor when insufficient flow exists. Natural gas exits the station through the 48-in. station discharge valves POV-058, POV-070, or POV-072 after passing through the flow metering system PA01.

The process begins with the definition phase where the system categories are defined and the fishbone diagrams are developed. The systems are grouped into the following seven categories:

  • Process pipeline
  • Gas conditioning
  • Rotating machinery
  • Communication and control
  • Blowdown and flare
  • Electric power
  • Fire protection.

The fishbone causes are grouped into the following four categories:

  • Human factors
  • Equipment
  • Maintenance
  • Design.

The fishbone diagrams are developed on-site with input from the station operators. Each top-level undesired event is listed for each of the seven systems, and possible causes are explored in detail.

During the measurement phase, data is collected and validated on-site. This includes a subjective assessment of the station: evaluating the general condition of the station equipment, pipeline, and operating environment, along with anecdotal information obtained from station personnel. In addition, station geography is assessed; the locations of fire hydrants are noted, as is the layout of gas conditioning and other systems that may affect the reliability of a station. The subjective and geographical data does not readily enter into the quantitative analysis, but it does have an impact on station risk and reliability. This data is particularly important when comparing two similar stations.

Following the definition and measurement phases, the analysis phase begins with the development of functional block diagrams. These diagrams incorporate several sources of data: P&IDs, photographs, electrical and network schematics, maintenance manuals, and so forth. They include only components directly related to system operation; smaller items, such as the plumbing and valving to pressure or temperature sensors, are ignored. The pertinent components are identified on the P&IDs, and the block diagrams are developed.

Fault trees are developed from the functional block diagrams and fishbones using the Fault Tree+ v11.0 software. Component failure rates are input into the software from the collected data or obtained from the Nonelectronic Parts Reliability Data (NPRD) database. In rare instances, statistical models of the component are developed. The Fault Tree+ software automatically computes the unavailability and the MTTF. Cut-sets are also generated within Fault Tree+, and a rank-ordered risk chart is developed based on the cut-sets. This risk chart provides management with the value of economic risk that exists for each system and for the station. The reliability analysis provides a quantitative basis for the allocation of maintenance resources, a baseline for future planning and improvements, and cause-and-effect information that can be used by station operators. A comparison of four years of failures from the compressor station maintenance records with the predicted values is shown in Figure 5.

The comparison shows good agreement between the predicted and measured values, given the standard deviation. The communication and control prediction was high, as it was for the pump station, and the same was true of the gas conditioning system. Many of the values indicate fewer than five failures in a given year and do not show a significant deviation from the predicted values.

Pump station

The purpose of Pump Station No. 6 is to take Liquefied Petroleum Gas (LPG) product received from Pump Station No. 5 and pump it to Pump Station No. 7. The LPG enters the station at a temperature of 83°F and a pressure of 344 psig through the station suction valve. The LPG subsequently passes through a filtration process, consisting of filters FV-61 and FV-62, before passing through one of the Byron-Jackson centrifugal pumps, BC-61 or BC-62. Two 5,000 HP Ruston turbines, TB-61 and TB-62, power pumps BC-61 and BC-62, respectively. The temperature and pressure at the discharge are increased to 87°F and 830 psig, respectively. A Yarway valve is used to recycle the LPG through the pump when insufficient pump pressure exists; the valve provides a mechanically automatic means of controlling the recirculation of LPG and will divert LPG to be recycled based on the output pressure. LPG discharged past the Yarway valve exits the station through the station discharge valve to Pump Station No. 7.

The analysis for the pump station is similar to that of the compressor station. Methodologically, the process is the same whether for a compressor or pump station. As with the compressor station, the reliability analysis provides a quantitative basis for the allocation of maintenance resources, a baseline for future planning and improvements, and cause and effect information which can be utilized by station operators.

Again, there is reasonably good agreement between the failure data collected at the station and the predicted values, considering the standard deviation of the measured data. There are two notable exceptions: communication and control, and process piping. Over-prediction of the communication and control system failures may be explained by the assumption, made in the analysis, that any failure in the system would create a loss of production. This assumption is probably overly conservative; while small failures are expected within the control software or communications, many of them would not impact the pumping process. The value predicted for the process piping is high compared to the collected data. It is possible that over a longer data collection period these values would show better agreement. However, as with communication and control, the definition of failure may be somewhat conservative, including failures that would not interrupt the processing of LPG.

Acknowledgment

Based on a paper presented at the Gas Machinery Conference, held in Dallas, Texas, October 1-3, 2007.