OpenSAF


OpenSAF is an open-source service-orchestration system for automating computer application deployment, scaling, and management. OpenSAF is consistent with, and expands upon, Service Availability Forum and SCOPE Alliance standards.
It was originally designed by Motorola ECC and is maintained by the OpenSAF Project. OpenSAF is the most complete implementation of the SAF AIS specifications, providing a platform for automating the deployment, scaling, and operation of application services across clusters of hosts. It works with a range of virtualization tools and runs services in a cluster, often integrating with JVM, Vagrant, or Docker runtimes. OpenSAF originally exposed standard C application programming interfaces, and has since added Java and Python bindings.
OpenSAF is focused on service availability, which goes beyond high-availability requirements. Although little formal research has been published on improving high-availability and fault-tolerance techniques for containers and the cloud, research groups are actively exploring these challenges with OpenSAF.

History

OpenSAF was founded by an industry consortium including Ericsson, HP, and Nokia Siemens Networks, and was first announced on February 28, 2007 by Motorola ECC (later acquired by Emerson Network Power). The OpenSAF Foundation was officially launched on January 22, 2008. Membership evolved to include Emerson Network Power, Sun Microsystems, ENEA, Wind River, Huawei, IP Infusion, Tail-f, Aricent, GoAhead Software, and Rancore Technologies. GoAhead Software joined OpenSAF in 2010, before being acquired by Oracle. OpenSAF's development and design are heavily influenced by mission-critical system requirements, including Carrier Grade Linux, the SA Forum specifications, ATCA, and the Hardware Platform Interface. OpenSAF was a milestone in accelerating the adoption of Linux in telecommunications and embedded systems.
The goal of the Foundation was to accelerate the adoption of OpenSAF in commercial products. The OpenSAF community held conferences between 2008 and 2010: the first hosted by Nokia Siemens Networks in Munich, the second by Huawei in Shenzhen, and the third by HP in Palo Alto. In February 2010, the first commercial deployment of OpenSAF in carrier networks was announced. Academic and industry groups have independently published books describing OpenSAF-based solutions. A growing body of research in service availability is accelerating the development of OpenSAF features that support mission-critical cloud and microservice deployments and service orchestration.
OpenSAF 1.0 was released on January 22, 2008. It comprised the NetPlane Core Service codebase contributed by Motorola ECC, and the OpenSAF Foundation was launched alongside the release. OpenSAF 2.0, released on August 12, 2008, was the first release developed by the OpenSAF community; it added the Log service and 64-bit support. OpenSAF 3.0, released on June 17, 2009, added platform management, usability improvements, and Java API support.
OpenSAF 4.0, released in July 2010, was a milestone release. Nicknamed the "Architecture release", it introduced significant changes, including closing functional gaps, settling the internal architecture, enabling in-service upgrades, clarifying APIs, and improving modularity. Reflecting significant interest from industry and academia, OpenSAF held two community conferences in 2011, one hosted by MIT in Boston, Massachusetts, and a second hosted by Ericsson in Stockholm.

Concepts

OpenSAF defines a set of building blocks that collectively provide a mechanism to manage the service availability of applications based on resource-capability models. Availability is the probability that a service is up at a random point in time; mission-critical systems require at least 99.999% availability. High availability (HA) and service availability (SA) are closely related, but SA goes further, also covering service continuity so that state and ongoing sessions are preserved across failovers and upgrades. OpenSAF is designed for loosely coupled systems with fast interconnects between nodes and is extensible to different workloads; its components may communicate with one another using any protocol. This extensibility is provided in large part by the IMM API, which is used by internal components and core services. The platform exerts control over compute and storage resources by defining them as objects, to be managed as instances and/or with node constraints.
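To make the 99.999% ("five nines") figure concrete, the tolerated accumulated downtime works out to roughly five minutes per year:

  (1 − 0.99999) × 365.25 × 24 × 60 ≈ 5.26 minutes per year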
OpenSAF software is distributed in nature, following the primary/replica architecture. An OpenSAF cluster contains two types of nodes: system controller nodes, which form the control plane, and payload (worker) nodes, which each manage an individual node's workload. One system controller runs in "active" mode and another in "standby" mode, while any remaining system controllers are spares, ready to take over the active or standby role in case of a fault. Nodes can also run headless, without the control plane, which adds cloud resilience.

System Model

The OpenSAF system model is the key enabling API, allowing OpenSAF to process and validate requests and to update the state of objects in the AMF model, so that directors can schedule workloads and service groups across worker (payload) nodes. AMF behavior is changed via a configuration object. Services can use the ‘No Redundancy’, 2N, N+M, N-way, and N-way Active redundancy models. OpenSAF still lacks obvious modeling toolchains that simplify the design and generation of AMF configuration models; ongoing research into this gap needs to deliver ecosystem tools that better support modeling and automation of carrier-grade and Cloud Native Computing Foundation use cases.
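As a sketch of how such a configuration object can be changed programmatically, the following C fragment uses the SA Forum IMM Object Management API shipped with OpenSAF to replace one attribute of a service group inside an atomic configuration change bundle (CCB). The DN safSg=Demo,safApp=Demo and the chosen value are hypothetical, and error handling is omitted; a real client must check every SaAisErrorT return value.

  #include <stdio.h>
  #include <saImmOm.h>

  int main(void)
  {
      SaImmHandleT imm;
      SaImmAdminOwnerHandleT owner;
      SaImmCcbHandleT ccb;
      SaVersionT ver = { 'A', 2, 1 };

      /* Hypothetical service group whose configuration is being changed */
      SaNameT sg;
      sg.length = snprintf((char *)sg.value, SA_MAX_NAME_LENGTH,
                           "safSg=Demo,safApp=Demo");

      /* Replace the preferred number of in-service SUs with the value 3 */
      SaUint32T pref = 3;
      SaImmAttrValueT values[] = { &pref };
      SaImmAttrModificationT_2 mod = {
          .modType = SA_IMM_ATTR_VALUES_REPLACE,
          .modAttr = { "saAmfSGNumPrefInserviceSUs",
                       SA_IMM_ATTR_SAUINT32T, 1, values }
      };
      const SaImmAttrModificationT_2 *mods[] = { &mod, NULL };
      const SaNameT *objects[] = { &sg, NULL };

      /* Become admin owner of the object, then apply the change as one CCB */
      saImmOmInitialize(&imm, NULL, &ver);
      saImmOmAdminOwnerInitialize(imm, "demo-owner", SA_TRUE, &owner);
      saImmOmAdminOwnerSet(owner, objects, SA_IMM_ONE);
      saImmOmCcbInitialize(owner, 0, &ccb);
      saImmOmCcbObjectModify_2(ccb, &sg, mods);
      saImmOmCcbApply(ccb);    /* the change commits atomically, or not at all */
      saImmOmFinalize(imm);
      return 0;
  }

The CCB is what gives configuration changes their transactional, all-or-nothing character; several object creates, deletes, and modifies can be bundled into one CCB and applied together.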

Control Plane

The OpenSAF system controller is the main controlling unit of the cluster, managing its workload and directing communication across the system. The OpenSAF control plane consists of several components, each running as its own process, which can run either on a single system controller (SC) node or across multiple SC nodes to support high-availability clusters and service availability. The components of the OpenSAF control plane are as follows:
  • Information Model Manager (IMM) is a persistent data store that reliably stores the configuration data of the cluster, representing the overall state of the cluster at any given time. It provides a means to define and manage middleware and application configuration and state information in the form of managed objects and their attributes. IMM is implemented as an in-memory database that replicates its data on all nodes and can use SQLite as a persistent back end. Like Apache ZooKeeper, IMM favors transaction-level consistency of configuration data over availability and performance. The IMM service follows the three-tier OpenSAF "service director" framework, comprising the IMM Director (IMMD), the IMM Node Director (IMMND), and the IMM Agent library. IMMD is implemented as a daemon on the controllers using a 2N redundancy model: the instance on the active controller holds the primary replica, while the instance on the standby controller is kept up to date by a message-based checkpointing service. IMMD tracks cluster membership and provides data-store access control and an administrative interface for all OpenSAF services. A sketch of reading an object from IMM appears after this list.
  • Availability Management Framework (AMF) is the high-availability and workload management framework, with robust support for the full fault-management lifecycle. AMF follows the same three-tier "service director" framework, comprising a director (AmfD), node directors (AmfND), and agent libraries, together with an internal watchdog that protects AmfND. The active AmfD service is responsible for realizing the service configuration, persisted in IMM, across the whole cluster; node directors perform the same function for the components within their scope. AMF keeps the state models in agreement by acting as the main information and API bridge across all components: it monitors the IMM state and either applies configuration changes or restores any divergence back to the wanted configuration, using fault-management escalation policies to schedule the creation of the wanted deployment.
  • AMF directors are schedulers that decide which nodes an unscheduled service group runs on. The decision is based on the current versus desired availability and capability models, the service redundancy models, and constraints such as quality of service and affinity/anti-affinity. AMF directors match resource "supply" to workload "demand", and their behavior can be tuned through an IMM system object.
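As mentioned above, IMM acts as the cluster's data store. The following sketch, written against the SA Forum IMM OM accessor API used by OpenSAF, reads and lists the attributes of a single object. The DN safAmfCluster=myAmfCluster is illustrative only, and error handling is largely omitted.

  #include <stdio.h>
  #include <saImmOm.h>

  int main(void)
  {
      SaImmHandleT imm;
      SaImmAccessorHandleT acc;
      SaVersionT ver = { 'A', 2, 1 };
      SaImmAttrValuesT_2 **attrs;

      /* Hypothetical object to inspect */
      SaNameT obj;
      obj.length = snprintf((char *)obj.value, SA_MAX_NAME_LENGTH,
                            "safAmfCluster=myAmfCluster");

      saImmOmInitialize(&imm, NULL, &ver);      /* error handling omitted */
      saImmOmAccessorInitialize(imm, &acc);

      /* A NULL attribute-name list requests all attributes of the object */
      if (saImmOmAccessorGet_2(acc, &obj, NULL, &attrs) == SA_AIS_OK) {
          for (SaUint32T i = 0; attrs[i] != NULL; i++)
              printf("%s: %u value(s)\n",
                     attrs[i]->attrName, attrs[i]->attrValuesNumber);
      }

      saImmOmAccessorFinalize(acc);
      saImmOmFinalize(imm);
      return 0;
  }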

Component

The component is a logical entity of the AMF system model, representing a normalized view of a computing resource such as a process, a driver, or storage. Components are grouped into logical service units (SUs) according to fault interdependencies, and each SU is associated with a node. The SU is an instantiable unit of workload controlled by an AMF redundancy model, and at any time it is active, standby, or in a failed state. SUs of the same type are grouped into service groups (SGs), which exhibit particular redundancy-modeling characteristics. The SUs within an SG are assigned to service instances (SIs) and given an availability state of active or standby. SIs are scalable, redundant logical services protected by AMF.
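The following minimal sketch shows how an SA-aware component typically interacts with AMF through the C API: it registers itself and then receives active or standby assignments through the CSI-set callback. The function names are local to this example, which is based on the AMF B.01.01 C API that OpenSAF implements; a real component would also handle healthchecks and check every return code.

  #include <stdio.h>
  #include <poll.h>
  #include <saAmf.h>

  static SaAmfHandleT amf;

  /* Invoked by AMF to hand this component a workload assignment */
  static void csi_set_cb(SaInvocationT inv, const SaNameT *comp,
                         SaAmfHAStateT ha, SaAmfCSIDescriptorT csi)
  {
      printf("HA state for %s: %s\n", comp->value,
             ha == SA_AMF_HA_ACTIVE ? "ACTIVE" : "STANDBY/OTHER");
      saAmfResponse(amf, inv, SA_AIS_OK);   /* acknowledge the assignment */
  }

  static void csi_remove_cb(SaInvocationT inv, const SaNameT *comp,
                            const SaNameT *csi, SaAmfCSIFlagsT flags)
  {
      saAmfResponse(amf, inv, SA_AIS_OK);
  }

  static void terminate_cb(SaInvocationT inv, const SaNameT *comp)
  {
      saAmfResponse(amf, inv, SA_AIS_OK);
  }

  int main(void)
  {
      SaAmfCallbacksT cb = { 0 };
      SaVersionT ver = { 'B', 1, 1 };
      SaNameT comp_name;
      SaSelectionObjectT sel;

      cb.saAmfCSISetCallback = csi_set_cb;
      cb.saAmfCSIRemoveCallback = csi_remove_cb;
      cb.saAmfComponentTerminateCallback = terminate_cb;

      saAmfInitialize(&amf, &cb, &ver);       /* error handling omitted */
      saAmfComponentNameGet(amf, &comp_name); /* name is supplied by AMF */
      saAmfComponentRegister(amf, &comp_name, NULL);

      /* Wait for AMF callbacks and dispatch them */
      saAmfSelectionObjectGet(amf, &sel);
      for (;;) {
          struct pollfd fd = { .fd = sel, .events = POLLIN };
          if (poll(&fd, 1, -1) > 0)
              saAmfDispatch(amf, SA_DISPATCH_ALL);
      }
  }

The selection-object pattern lets the component keep ownership of its main loop: AMF signals pending callbacks on a file descriptor, and the component decides when to dispatch them.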

Node

A node is a compute instance on which service instances are deployed. The set of nodes belonging to the same communication subnet comprises the logical cluster. Every node in the cluster must run an execution environment for services, as well as the OpenSAF services listed below:
  • Node director: the AmfND is responsible for the running state of its node, ensuring that all active SUs on the node are healthy. It takes care of starting, stopping, and maintaining CSIs and/or SUs, organized into SGs, as directed by the control plane, and it enforces the desired AMF configuration, persisted in IMM, on the node. When a node failure is detected, the director observes the state change and launches the affected service units on another eligible healthy node. A sketch of the component healthchecking that supports this appears after this list.
  • Non-SA-aware component: OpenSAF can provide HA for instantiable components originating from the cloud computing, containerization, virtualization, and JVM domains by modeling the component and its service lifecycle commands in the AMF model.
  • Container-contained: an AMF container-contained component can reside inside an SU. The container-contained component is the lowest level of runtime that can be instantiated. The SA-aware container-contained component currently targets a Java virtual machine, per JSR139.
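To illustrate the health monitoring behind AmfND's supervision, the sketch below starts an AMF-invoked healthcheck for the calling component and acknowledges each probe in the callback. The key "demoHc" is hypothetical and must match a healthcheck key configured in the information model, and the callback must be registered in the SaAmfCallbacksT structure passed to saAmfInitialize, as in the earlier component sketch.

  #include <string.h>
  #include <saAmf.h>

  static SaAmfHandleT amf;   /* initialized elsewhere, as in the earlier sketch */

  /* AMF invokes this callback periodically; replying SA_AIS_OK
     tells AMF that the component is still healthy. */
  static void healthcheck_cb(SaInvocationT inv, const SaNameT *comp,
                             SaAmfHealthcheckKeyT *key)
  {
      saAmfResponse(amf, inv, SA_AIS_OK);
  }

  /* Ask AMF to start driving the healthcheck for this component.
     On repeated failures AMF applies the recommended recovery,
     here a component restart. */
  static SaAisErrorT start_healthcheck(const SaNameT *comp_name)
  {
      SaAmfHealthcheckKeyT key;            /* "demoHc" is hypothetical */
      memcpy(key.key, "demoHc", 6);
      key.keyLen = 6;

      return saAmfHealthcheckStart(amf, comp_name, &key,
                                   SA_AMF_HEALTHCHECK_AMF_INVOKED,
                                   SA_AMF_COMPONENT_RESTART);
  }

If a probe is not acknowledged in time, or the component reports an error, AMF escalates according to the configured recovery policy, from component restart up to SU or node-level failover.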