SIP Thor Platform
This document describes the SIP Thor platform.
SIP Thor provides scalability, load sharing and resilience for the Multimedia Service Platform. The software is mature and stable, with more than 10 years in production environments and a good track record. Based on previous experience, it takes between 6 and 12 weeks to put into service a SIP infrastructure based on it.
The SIP Thor platform uses the same software components as the Multimedia Service Platform for the interfaces with end-user devices, namely the SIP proxy, web server, media relay and XCAP server, but it implements a different system architecture for them by using peer-to-peer self-organizing concepts.
To implement its functions, SIP Thor introduces several new components to the Multimedia Service Platform. SIP Thor creates a self-organizing peer-to-peer overlay of several logical network entities called roles installed on multiple physical machines called nodes.
Each node can be configured to run one or multiple roles. Typical examples of such roles are sip_proxy and media_relay. All platform functions, such as XCAP, presence, provisioning and rating, are scaled up in the same way. Nodes that advertise their role capabilities handle the load associated with those roles and inherit the built-in resilience and load distribution provided by the SIP Thor design.
SIP Thor operates at IP layer 3 and the nodes can be installed at different locations in multiple data centers, cities or countries. The sum of all nodes provides a single consolidated logical platform.
The platform provides a fail-proof NAT traversal solution that imposes no requirements on the SIP and Web clients, by using a reverse-outbound technique for SIP signaling and a geographically distributed relay function for RTP media streams. Based on the policy configured in the nodes, ICE is supported in the end-points and the selection of a media relay can take into consideration the geographical location of the calling party.
The closest reference to a standard related to what SIP Thor implements is the "Self-organizing SIP Proxy farm" described in 2007 in the original P2P use cases draft produced by the IETF P2PSIP Working Group. Development of SIP Thor started in early 2005; for this reason the software uses a slight variation of the terminology later adopted by the P2PSIP Working Group.
SIP Thor's particular design and implementation have been explored in several white papers and conference presentations:
- Addressing survivability and scalability of SIP networks by using Peer-to-Peer protocols published by SIP Center in September 2005
- Building scalable SIP networks presented by Adrian Georgescu at VON Conference held in Stockholm in May 2006
- Solving IMS problems using P2P technology presented by Adrian Georgescu at Telecom Signalling World held in London in October 2006
- Overview of P2P SIP Principles and Technologies presented by Dan Pascu at International SIP Conference held in Paris in January 2007
- P2PSIP and the IMS: Can they complement each other? published by IMS forum, June 2008
SIP Thor is designed around the concept of a peer-to-peer overlay with equal peers. The overlay is a flat logical network that handles multiple roles. Peers are dedicated servers with good IP connectivity and a low churn rate, and are part of an infrastructure managed by a service provider. The software design and implementation have been fine-tuned for this scope and differ to some degree from classic implementations of P2P overlays, which are typically run by transient end-points.
The nodes interface with native SIP clients that are unaware of the overlay logic employed by the servers. Internally to the SIP Thor network, the lookup of a resource (a node that handles a certain subscriber for a given role at the moment of the query) is a one step lookup in a hash table.
The hash table is an address space of integers arranged on a circle; nodes and SIP addresses map to integers in this space. This concept can be found in classic DHT implementations like Chord. Join and leave primitives take care of the addition and removal of nodes in the overlay in a self-organizing fashion.
Communication between SIP Thor nodes is encrypted using Transport Layer Security (TLS). Each node that is part of the SIP Thor network is uniquely identified by an X.509 certificate. The certificates are signed by a Certificate Authority managed by the service provider and can be revoked as necessary, for example when a node has been compromised.
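The circular address space and the join/leave primitives can be illustrated with a minimal sketch. The hash function (SHA-1), the 32-bit ring size, the "first node clockwise" ownership rule and the class and method names are assumptions made for illustration; the source does not state which ones SIP Thor uses.

```python
import bisect
import hashlib

RING_BITS = 32  # assumed size of the identifier space


def ring_position(key):
    """Map a node name or a SIP address to an integer on the circle."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class Overlay:
    """Hypothetical overlay view: every peer knows all node positions."""

    def __init__(self):
        self._positions = []  # sorted ring positions of the nodes
        self._nodes = {}      # ring position -> node name

    def join(self, node):
        pos = ring_position(node)
        if pos not in self._nodes:
            bisect.insort(self._positions, pos)
            self._nodes[pos] = node

    def leave(self, node):
        pos = ring_position(node)
        if self._nodes.get(pos) == node:
            self._positions.remove(pos)
            del self._nodes[pos]

    def lookup(self, sip_address):
        """Return the first node found clockwise from the address."""
        if not self._positions:
            raise LookupError("overlay has no nodes")
        index = bisect.bisect_left(self._positions, ring_position(sip_address))
        # Wrap around to the first node when we pass the end of the circle.
        return self._nodes[self._positions[index % len(self._positions)]]
```

Because each peer holds the complete list of positions, a lookup is a local search with no network round trip, and when a node leaves, only the addresses it owned move to its clockwise neighbour.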
The X.509 certificate and its attributes are used for authentication and authorization of the nodes when they exchange overlay messages over the SIP Thor network.
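As an illustration of certificate-based mutual authentication between nodes, the sketch below builds a TLS context with the Python standard library `ssl` module. The function name, its parameters and the file paths passed to it are hypothetical; how SIP Thor itself configures TLS and maps certificate attributes to authorization is not described here.

```python
import ssl


def overlay_tls_context(ca_file=None, cert_file=None, key_file=None,
                        server_side=True):
    """Build a TLS context that requires a certificate from the peer.

    ca_file   -- CA bundle of the operator's Certificate Authority (assumed path)
    cert_file -- this node's X.509 certificate (assumed path)
    key_file  -- this node's private key (assumed path)
    """
    protocol = ssl.PROTOCOL_TLS_SERVER if server_side else ssl.PROTOCOL_TLS_CLIENT
    ctx = ssl.SSLContext(protocol)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Reject any peer that does not present a certificate signed by the CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Revocation checking is not enabled in this sketch; with the `ssl` module it would require loading a CRL and setting `ssl.VERIFY_CRL_CHECK_LEAF` in `ctx.verify_flags`.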
Because, by design, the number of peers in the overlay is fairly limited (tens to hundreds of nodes in practice), there is no need for a Chord-like finger table or for iterative or recursive queries. The overlay lookup takes one hop, referred to as O(1) in classic P2P terminology, and SIP Thor's implementation handles up to half a million queries per second on a typical server processor, which is several orders of magnitude higher than what is expected in normal operations.
SIP call flows over the SIP Thor overlay involve at a minimum one node (when both subscribers are served by the same node) and at a maximum four nodes (when the subscribers are on two different nodes and use edge proxies different from their home nodes). The routing logic does not change regardless of the number of nodes, subscribers or devices handled by the SIP Thor network. Should SIP devices be 'SIP Thor aware' and able to perform lookups in the overlay themselves, the overall efficiency of the system could improve greatly, as less SIP traffic and fewer queries would be generated inside the SIP Thor network. Should the operator desire such integration, each node exposes a publicly reachable lookup interface over a TCP socket, using a simple query syntax.
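The one-to-four node bound can be checked with a small sketch. The function name and the hop order (caller's edge proxy, caller's home node, callee's home node, callee's edge proxy) are assumptions made for illustration.

```python
def nodes_in_call(caller_edge, caller_home, callee_home, callee_edge):
    """Return the distinct nodes a call traverses, in order of first use."""
    hops = (caller_edge, caller_home, callee_home, callee_edge)
    # dict.fromkeys keeps insertion order and drops repeated nodes.
    return list(dict.fromkeys(hops))


# Both subscribers served by the same node, which is also their edge proxy.
same_node = nodes_in_call("node1", "node1", "node1", "node1")

# Subscribers on different home nodes, each behind a different edge proxy.
worst_case = nodes_in_call("node1", "node2", "node3", "node4")
```

The result always contains between one and four entries, independent of how many nodes the overlay has.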
The current implementation allows SIP Thor to grow to accommodate thousands of physical nodes, which can handle the traffic of a real-time communication service of any size deployable in the real world today. For example, if the SIP server node implementation can handle one hundred thousand subscribers, then 100 nodes (roughly the equivalent of three 19-inch racks of data center equipment) are required to handle a base of 10 million subscribers.
The service scalability is in reality limited by the performance of the accounting sub-system used by the operator or by the presence of centralized functions like prepaid, which require querying a central point. If the accounting functions are performed outside SIP Thor, for instance in external gateway systems, there is no hard limit on how far the network can scale.
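The sizing arithmetic in the example above can be written as a short helper. The function name is hypothetical, and the calculation ignores any spare capacity an operator would normally add for failover.

```python
def nodes_required(subscribers, per_node_capacity):
    """Smallest number of nodes whose combined capacity covers the base."""
    if subscribers < 0 or per_node_capacity <= 0:
        raise ValueError("subscribers must be >= 0 and capacity > 0")
    # Integer ceiling division, avoiding floating point rounding.
    return -(-subscribers // per_node_capacity)


# The example from the text: 10 million subscribers, 100,000 per node.
example = nodes_required(10_000_000, 100_000)
```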
SIP Thor is designed to share the traffic equally between all available nodes. This is done by returning to SIP clients that use standard RFC 3263 style lookups a random and limited slice of the DNS records that point to live nodes performing the SIP server role. DNS records are managed internally by a special role, thor-dns, on multiple nodes assigned as DNS servers in the network. This simple DNS query/response mechanism achieves a near perfect distribution without introducing any intermediate load balancer or latency. Internally to SIP Thor, a similar principle is used for load balancing internal functions like XCAP queries or SOAP/XML provisioning requests.
For functions driven internally by SIP Thor, for instance the reservation of a media relay for a SIP session, other selection techniques could potentially be applied, for instance selecting a candidate based on geographic proximity to the calling party to minimize round trip time. Though captured in the initial design, such techniques have not been implemented because no customers demanded them.
By using a virtualization technique, the peer-to-peer network is able to function with a minimum number of nodes while still achieving a fair, equal and random distribution of load when using at least three physical servers.
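The "random and limited slice" idea can be sketched as follows. The function name, the default slice size of three records and the example addresses (taken from the 192.0.2.0/24 documentation range) are assumptions; the actual slice size used by thor-dns is not stated.

```python
import random


def dns_answer(live_nodes, max_records=3, rng=None):
    """Pick a random, limited subset of the live nodes for one DNS response."""
    chooser = rng if rng is not None else random
    candidates = list(live_nodes)
    count = min(max_records, len(candidates))
    # sample() never repeats a node within a single answer.
    return chooser.sample(candidates, count)


live_sip_proxies = ["192.0.2.%d" % i for i in range(1, 6)]
answer = dns_answer(live_sip_proxies)
```

Since each query gets an independent random subset, every live node appears in roughly the same fraction of answers, which spreads clients across nodes without a load balancer in the path.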
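One common way to get an even distribution from very few servers is to place each physical server at many virtual positions on the ring. The sketch below assumes that this is what the virtualization technique refers to; the number of virtual positions, the hash function and all names are illustrative assumptions.

```python
import bisect
import hashlib
from collections import Counter


def position(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


def build_ring(servers, virtual_per_server=200):
    """Place every server at several virtual positions on the ring."""
    points = sorted(
        (position("%s#%d" % (server, i)), server)
        for server in servers
        for i in range(virtual_per_server)
    )
    return [p for p, _ in points], [s for _, s in points]


def owner(ring, key):
    """Return the server owning the first virtual position clockwise of key."""
    positions, owners = ring
    index = bisect.bisect_left(positions, position(key)) % len(positions)
    return owners[index]


def load_share(servers, keys, virtual_per_server=200):
    """Count how many keys each physical server ends up serving."""
    ring = build_ring(servers, virtual_per_server)
    return Counter(owner(ring, key) for key in keys)
```

With a single position per server, three servers can end up with very uneven arcs of the circle; with a few hundred virtual positions each, the shares converge towards one third.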
There is no need to configure anything in the SIP Thor network for supporting the addition of a new node besides starting it with the right X.509 certificate.
SIP Thor is designed to recover automatically from disasters like network connectivity loss, server failures or denial of service attacks. On node failure, all requests handled by the faulty node are automatically distributed to the surviving nodes without any human intervention. When the failed node becomes available again, it takes back its place in the network without any manual intervention.
The logic of all active and signaling active components inherits this failover property from SIP Thor.
thor-eventserver is an event server, which is the core of the messaging system used by the SIP Thor network to implement communication between the network members. The messaging system is based on publish/subscribe messages that are exchanged between network members. Each entity in the network publishes its own capabilities and status for whoever is interested in the information. At the same time, each entity may subscribe to certain types of information published by the other network members, based on the entity's functionality in the network.
Multiple event servers can be run as part of a SIP Thor network (on different systems, that are preferably in different hosting facilities), which will improve the redundancy of the SIP Thor network and its resilience in the face of network/system failures, at the expense of linearly increasing the messaging traffic with the number of the network members. It is recommended to run at least 3 event servers in a given SIP Thor network.
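The publish/subscribe pattern described for thor-eventserver can be illustrated with an in-memory sketch. The class, the topic name and the message fields are hypothetical; the real event server exchanges messages over the network and its wire format is not described here.

```python
class EventBus:
    """Minimal in-process publish/subscribe dispatcher."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of handler callables

    def subscribe(self, topic, handler):
        """Register a callable that receives every message on the topic."""
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        """Deliver the message to all subscribers; return how many got it."""
        handlers = list(self._subscribers.get(topic, ()))
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = EventBus()
seen_by_manager = []
bus.subscribe("node.status", seen_by_manager.append)

# A node announces its role and state to whoever is interested.
delivered = bus.publish(
    "node.status", {"node": "node1", "role": "sip_proxy", "state": "online"}
)
```

Publishers do not need to know who consumes their status, which is what lets members be added or removed without reconfiguring the others.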
thor-manager is the SIP Thor network manager, which has the role of maintaining the consistency of the SIP Thor network as members join and leave the network. The manager will publish the SIP Thor network status regularly, or as events occur to inform all network members of the current network status, allowing them to adjust their internal state as the network changes.
Multiple managers can be run as part of a SIP Thor network (on different systems, that are preferably in different hosting facilities), which will improve the redundancy of the SIP Thor network and its resilience in the face of network/system failures, at the expense of a slight increase in the messaging traffic with each new manager that is added. If multiple managers are run, they will automatically elect one of them as the active one and the others will be idle until the active manager stops working or leaves the network. Then a new manager is elected and becomes the active manager. It is recommended to run at least 3 managers in a given SIP Thor network preferably in separate hosting facilities.
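The election behaviour can be sketched as follows. The source does not describe the election algorithm, so the rule used here (keep the current active manager while it is alive, otherwise pick the lowest identifier among the live managers) is purely an assumption, as are the class and identifier names.

```python
class ManagerGroup:
    """Tracks live managers and which one of them is currently active."""

    def __init__(self):
        self.live = set()
        self.active = None

    def join(self, manager_id):
        self.live.add(manager_id)
        self._elect()

    def leave(self, manager_id):
        self.live.discard(manager_id)
        self._elect()

    def _elect(self):
        # Only re-elect when the active manager is gone (assumed rule);
        # the others stay idle while the active one keeps working.
        if self.active not in self.live:
            self.active = min(self.live) if self.live else None


group = ManagerGroup()
for manager_id in ("manager-b", "manager-a", "manager-c"):
    group.join(manager_id)
first_active = group.active      # the first manager to join stays active
group.leave(first_active)
second_active = group.active     # a new manager is elected after the loss
```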
thor-database is a component of the SIP Thor network that runs on the central database(s) used by the SIP Thor network. Its purpose is to publish the location of the provisioning database in the network, so that other SIP Thor network members know where to find the central database if they need to access information from it.
thor-dns is a component of the SIP Thor network that runs on the authoritative name servers for the SIP Thor domain. Its purpose is to keep the DNS entries for the SIP Thor network in sync with the network members that are currently online. Each authoritative name-server needs to run a copy of the DNS manager in combination with a DNS server. The SIP Thor DNS manager will update the DNS backend database with the appropriate records as nodes join/leave the SIP Thor network, making it reflect the network status in realtime.
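Keeping the DNS backend in sync with the online members amounts to a reconciliation step, sketched below. The function name and the use of plain address strings as records are assumptions; the actual DNS backend and record types are not specified here.

```python
def reconcile(current_records, online_nodes):
    """Return (records_to_add, records_to_remove) for the DNS backend."""
    current = set(current_records)
    online = set(online_nodes)
    return online - current, current - online


# node3 left the network and node4 joined since the last update.
to_add, to_remove = reconcile(
    current_records={"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    online_nodes={"192.0.2.1", "192.0.2.2", "192.0.2.4"},
)
```

Running this step on every join or leave event keeps the published records equal to the set of nodes that are actually online.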
thor-node is to be run on a system that wishes to become a SIP Thor network member. By running this program, the system will join the SIP Thor network and become part of it, sharing its resources and announcing its capabilities to the other SIP Thor network members.
The network can accommodate one or more nodes with this role; SIP Thor automatically takes care of the addition and removal of each instance. The currently supported roles are sip_proxy in combination with OpenSIPS, voicemail_server in combination with Asterisk and rating_server in combination with CDRTool. Other roles are built directly into MediaProxy (media_relay), NGNPro (provisioning_server) and OpenXCAP (xcap_server); for these resources no standalone thor-node component is required.
thor-monitor is a utility that shows the SIP Thor network state in a terminal. It can be used to monitor the SIP Thor network status and events.
The NGNPro component performs the enrollment and provisioning server role. It saves all changes persistently in the bootstrap database and caches the data on the responsible node at the moment of the change. The network can accommodate multiple nodes with this role; SIP Thor automatically takes care of the addition and removal of each instance.
NGNPro exposes a SOAP/XML interface to the outside world and bridges the SOAP/XML queries with the distributed data structures employed by SIP Thor nodes.
NGNPro is also the component used to harvest usage statistics and provide status information from the SIP Thor nodes.
New roles can be added to the system programmatically by following the SIP Thor API; the details depend on how the component to be integrated into the SIP Thor network works.
The following integration steps must be taken to add a new role to the system in the form of third-party software:
- The third-party software must implement a component that publishes its availability in the network. This can also be programmed outside of the specific software by adding it to the generic *thor_node* configuration and logic
- The third-party software must be able to lookup resources in the SIP Thor network and use the returned results in its own application logic
- Depending on the inner workings of the application performed by the new role, other roles may need to be updated in order to serve it (e.g. adding specific entries into the DNS or moving provisioning data to it)
While the software is designed to be self-organizing, it can only do so if it is deployed in a way that avoids correlated failures related to Internet connectivity. If the DNS, central database and Thor manager functions are all down at the same time, no self-organizing software is of much use. The following measures can improve self-recovery in the case of both complete connectivity failures and unstable connectivity with high packet loss:
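The first two integration steps (publishing availability and looking up resources) can be sketched against an in-memory stand-in for the overlay. Nothing below is the SIP Thor API: `OverlayStub`, `ThirdPartyRole`, their methods and the `conference_server` role name are all hypothetical and only show the shape such an adapter might take.

```python
import hashlib

RING = 1 << 32  # assumed size of the identifier space


def _position(key):
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING


class OverlayStub:
    """In-memory stand-in for the overlay; NOT the SIP Thor API."""

    def __init__(self):
        self._roles = {}  # role name -> set of node names

    def publish(self, role, node):
        self._roles.setdefault(role, set()).add(node)

    def withdraw(self, role, node):
        self._roles.get(role, set()).discard(node)

    def lookup(self, role, key):
        nodes = self._roles.get(role)
        if not nodes:
            raise LookupError("no node online for role %r" % role)
        target = _position(key)
        # First node clockwise from the key's position on the circle.
        return min(nodes, key=lambda node: (_position(node) - target) % RING)


class ThirdPartyRole:
    """Adapter a third-party component could run next to its own logic."""

    def __init__(self, overlay, role, node):
        self.overlay, self.role, self.node = overlay, role, node

    def start(self):
        # Step 1: publish availability in the network.
        self.overlay.publish(self.role, self.node)

    def stop(self):
        self.overlay.withdraw(self.role, self.node)

    def find(self, role, key):
        # Step 2: look up a resource and use the result in application logic.
        return self.overlay.lookup(role, key)
```

The third step, updating other roles such as DNS or provisioning, depends on the specific application and is not covered by this sketch.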
- Host the DNS servers and SIP Thor manager/event servers in data centers (DC1, DC2, DC3) different from those of the Thor nodes used for signaling and media
- Host all Thor nodes for signaling and media in three other data centers (DC4, DC5, DC6), separate from the ones used above
- Host the central database in DC1 with active slaves in DC2 and DC3
- The SIP Thor DNS zone must be run by other DNS servers (typically an external DNS registrar)
With such a setup, most connectivity failures are handled automatically:
- In case of complete Thor node failures or data center failures (DC4, DC5, DC6), the network automatically takes out the DNS records of the faulty components and the routing logic is adjusted automatically
- In case of partial connectivity loss of any data center (A sees B, B sees C but A does not see C), the network will pick the best visible candidates automatically using an arbitration logic
- In case of intermittent packet loss (flip-flop of network connectivity causing continuous re-organization), Thor nodes can be shut down administratively
In the worst case scenario, when the location of the central database is completely down, the network can automatically fall back to a secondary database for read usage, with the exception of the prepaid functionality. Accounting records are synced back from the Thor nodes to the central database at a later time, when connectivity is restored. In case the main data center does not come back online, a manual failover to another data center can be done by changing the DNS records of the Thor domain in the parent zone to point to another data center where the data has been previously replicated. This allows accounting and provisioning to continue in the new data center.
In practice, such a setup is not cost efficient; there is always a high price to pay for handling failures related to IP connectivity automatically.