Boutique Tech Conference · June 4–6 in Rostock

Five Nines in an open source PBX

in English by Jim Van Meggelen at AMOOCON 2009

Abstract

Using Sangoma Wanpipe and Linux Heartbeat it is now possible to build a PBX with full hot-standby—including the PRI circuit. This means that using inexpensive hardware, it is possible to bring a higher level of redundancy to the enterprise.

Even though this talk will use Asterisk as an example, the reality is that any software that can use Wanpipe for its circuits can take advantage of this capability.

The slides

There are 23 slides.

Transcript

Jim Van Meggelen: Alright, I’m going to get started now. My name is Jim Van Meggelen and I’ve been involved in traditional telecom, PBX telecom, for about 20 years now. I work with a lot of traditional PBX technologies in Canada and discovered Asterisk, I guess, about five or six years ago now.

I helped write O’Reilly’s Asterisk book so I’m one of the folks involved in that. And yes, we are working on a third edition, finally.

I wanted to talk to you about a technology – I’m not here to pitch you on some new cool thing. This is much more an old problem, I guess, and something that in my experience has always been a…I don’t know if shortcoming is the right word but it’s always been difficult when talking to customers about open source technology because the traditional vendors of telecommunication equipment have solutions to this that have been more technically complicated for us to do in the open source world.

So I’ve been brainstorming this with some of the Sangoma folks for a number of years and other folks that I’ve met throughout the community. And we came up with something that is simple, which is usually when you know you’re on to something that has a lot of potential, and I wanted to talk about that with you folks today because I think this is a really significant evolution in making it simple to provide high availability systems.

So I want to talk a little bit about high availability first, so we all have an understanding of where I’m coming from. So availability, I categorize that in four ways. I got this – part of what I got came from Wikipedia but there’s also a book that I’ve read which is called “Blueprints for High Availability” and that’s a very useful book on some of this.

You have normal availability, which is really nothing special. It’s just anything that you buy is available or it’s not available, depending on whether it’s functioning or not. You have redundancy where you’re going to provide especially components that are prone to fail. You’re going to build redundant components in the hopes that if a component fails, the redundant component will be available.

Then you have high availability which is based on redundancy but high availability looks at the problem more from the user’s perspective. So there are redundant components in there but you’re also looking at the environment and most importantly probably, you’re looking at the processes around that so that it’s not just enough to have a hot spare standing by but that all the people involved in keeping that system available know what it is they’re supposed to do when a failure occurs.

And then finally disaster recovery, which is essentially taking high availability to the next level where, if for some reason, you have a disaster that causes the entire facility to be unavailable, you will have a redundant facility somewhere else that can be brought up in short order and made available to your users.

Excuse me, I’m a little parched here. I’m going to get my water so that I don’t dry out on you. That feels better.

Availability is popularly described as a percentage. I expect that most folks have heard the term “five nines” – it’s a very popular term. And what it means is that over the course of the year, the system is 99.999% available. So if users are able to use the system, it is said to be available. It’s important to note that it’s from the user’s perspective because it’s not enough for the components to be working. Just because your server is up, it doesn’t mean that it’s available.

And it’s difficult to come up with what a clear definition of high availability is unless you understand the parameters that you’re using to define high availability. For instance, if you have a home internet connection, one of your users, their home internet connection is down for three days. Does that mean that your high availability system is not available? For that user it’s not but that might not be relevant in the big picture.

On the other hand, if you’re providing a critical service of some kind and that user is a critical part of the system, then that could indeed be a factor in that system not being available. So you have to take the big picture into consideration.

So five nines is marketing terminology and it’s really important to understand that it doesn’t really mean anything, in a way. But it is a well-known term, so it’s useful when talking to people to use that term to sort of define what the parameters are that we’re looking for.

So here’s a chart that I got off of Wikipedia and it just talks about what exactly we’re talking about when we’re talking about various percentages. And what you can see is that five nines means that over the course of the year, the system cannot be unavailable for more than 5.26 minutes.

I see a lot of marketing on websites where they say, “Oh, our stuff, we guarantee it to be 99.9% available” and I think that’s a big mistake. If that’s the best you can do, you really shouldn’t be bragging about it. Because that means that over the course of a year, it’s entirely possible that you can be out of service for an entire business day.

And since we’re in Germany here, I think there’s one more nine that we can talk about. The other thing I want to talk about – I don’t want to talk about it, I just want to mention it. Getting to six nines is enormously more complicated and this is probably one of the reasons why five nines has sort of become the catch-all. Because if you’re going to go to only 31.5 seconds of down-time per year, it’s going to get a lot more costly to make things work, and complex.
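The figures on the chart follow from simple arithmetic. The following is a minimal sketch of that calculation, assuming a 365-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Maximum downtime per year permitted at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, percent in [
        ("three nines", 99.9),
        ("four nines", 99.99),
        ("five nines", 99.999),
        ("six nines", 99.9999),
    ]:
        seconds = downtime_seconds_per_year(percent)
        print(f"{label:<12} {percent:>8}%  {seconds / 60:>9.2f} min  {seconds:>10.1f} s")
```

Under that assumption, three nines allows roughly 8.76 hours of downtime per year, four nines about 52.6 minutes, five nines about 5.26 minutes and six nines about 31.5 seconds.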

So high availability is a system design protocol and associated implementation that ensures a certain absolute degree of operational continuity during a given measurement period. I got that from Wikipedia. Again, high availability can only be defined in terms of the needs of the users. It doesn’t mean anything unless you know what it is that you’re trying to accomplish.

It’s not about servers. You have to look at it from the perspective of the users; if they can’t use it, it’s not available for whatever reason. You really can’t do it with a single server; you can’t do it with just two servers. You have to take into account all of the factors. I’m still working on getting all of the right graphics for this because I haven’t been able to include power and environment in here and those are all things that need to be taken into account. We’ve got routers and switches and servers here.

So you have to consider the system as a whole. The system is not a box; it’s not a machine. It’s one part of the system, but the system includes all of the processes and the people and the technology that goes into it.

Complexity is generally not a good thing in high availability. If you have complex procedures, people are less likely to understand them and won’t be able to follow them, or won’t be able to find them. If you have complex software interactions, where before the failover can happen this database lookup has to happen and that script has to be run and these phones have to ring and who knows what, you’re more likely to have a failure where your high availability is not going to succeed in getting you going again.

So always try to focus on simplicity. And this is one of the reasons why what I want to talk to you about today is exciting for me.

Don’t assume anything. Many people will put in a redundant solution and they’ll have two servers and the one server will run for three years and then it will fail and the secondary server won’t be running because it failed two years ago and nobody tested it. So testing has to be part of your high availability solution or you can’t have any confidence that you’ve actually got high availability. I’m drying out again.

So how do we achieve it? This is something that I find a lot of people – sorry for you having to listen to me fill up my water – don’t want to pay attention to: the power and the environment – everything that is actually feeding into the systems. And what a lot of people will do is they’ll provide redundancy in the servers and they’ll plug them both into the same UPS unit and they’ll plug that same UPS unit into the same outlet in the wall. And that outlet happens to be the one that’s shared with the microwave in the kitchen, because nobody took a look at that. So all those things need to be taken into account.

The UPS should be a separate UPS. They should be on separate electrical circuits, maybe on a separate electrical phase on the electrical panel. They might even go out separate building entrances so even if you have a failure of electrical power in the building, you still have redundancy there. You have to look at the cost; you have to look at the needs. But you need to consider this stuff.

Redundant cooling – if your air conditioner fails and you don’t have a redundant air conditioner, your servers are going to start going down pretty quickly when the heat rises. You want a generator; that’s something to be considered. Oops, well I’m all done.

Network and cabling: another thing I see people do is they take a redundant pair of systems and they plug them into the same switch. Then the switch goes down; you’re out of business. So you need to make that redundant as well: redundant switches.

One of the things I’ve seen done which is fairly common – well it’s not fairly common but it’s a good idea when you’re doing this – is to stagger your phones. So if you have five phones in your department, they will alternate between one switch and another switch. So when a switch fails, only half the department will be out of service. So if you do that throughout the organization, any failure of a component won’t take the whole department offline, it will only take half the users of that department offline. It makes cabling a little more complicated, but it means that a failure is less impacting; you’re not losing a whole floor or a whole section of your company.
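The staggering idea amounts to a round-robin assignment of neighbouring phones to switches. The sketch below illustrates it; the phone and switch names are invented.

```python
from itertools import cycle


def stagger(phones, switches):
    """Assign neighbouring phones to alternating switches (round-robin)."""
    return {phone: switch for phone, switch in zip(phones, cycle(switches))}


# Hypothetical department of five phones cabled to two switches.
department = ["desk-1", "desk-2", "desk-3", "desk-4", "desk-5"]
plan = stagger(department, ["switch-A", "switch-B"])

# Phones that keep working if switch-A fails.
survivors = [phone for phone, switch in plan.items() if switch != "switch-A"]
print(plan)
print(survivors)
```

With an odd number of phones the split cannot be exactly even, so one switch carries one phone more than the other.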

This is something I’ve been thinking about and sometimes I just think I think too much. But I think if you have two servers on top of each other and somebody spills a coffee on them, they’re both gone, right? If you put them into a separate rack, have you increased your safety? I’m not sure because the other side of it is, you have to run a cable between the two of them and it’s a lot safer to have a short cable between those two servers rather than one running between two racks. I’m still up in the air as to whether actually physically separating the two servers in the same room provides any increased reliability.

Multiple carriers: In Canada – and I imagine this is true in the rest of the world as well – hospitals provision multiple building entrances for all of their critical infrastructure coming in. So they will have a completely separate feed at either end of the building for their electrical, for their telephone, everything. So if someone accidentally digs up the cable on the street, they are still in business because on the other street, that’s still intact. Very expensive to do this but it will increase your availability and will help you to accomplish high availability.

And then again, the thing that everybody seems to forget to do is to have procedures and plans and to test those procedures and plans and to check those on a regular basis to make sure that new people who join the organization know what’s going on.

So it’s really quite simple, how do you start to achieve high availability with an open source PBX? Obviously you have a redundant system and you duplicate all the critical components. The challenge was, if you were using a PRI circuit, there was no way inherent in a PRI circuit to accomplish a failover. There were devices that you could put in front that might have that capability but that either got expensive or that itself became a point of failure.

So what the Sangoma folks came up with is something that is very simple. And what is done is that the transmitter on the secondary T1/E1 interface is switched off. When a failure occurs, Linux Heartbeat switches over to the secondary server, turns the transmitter on and starts the telephony software. And this takes less than 10 seconds.

So your loss of availability is measured in seconds and now you can start to talk about five nines. It doesn’t mean that you can put this together and say, “Hey, we have five nines!” but you can certainly say that you’re able to approach that. Because in a traditional situation you’ve either got to have a box in front that can do that or you’ve got to get someone to run down there and move the plug over. Well this way, Heartbeat can do that.

It’s currently only in the latest beta of Wanpipe, but this is something that will be rolled into all future releases. Is it in stable now? My apologies, that’s awesome.

So this is just a very simple graphic that sort of visually shows what’s going on. The secondary server, the Wanpipe is running and because the receive is happening, as far as the T1 is concerned the circuit is functioning correctly but it’s not transmitting so that it’s not confusing the other end. Because obviously if both circuits were transmitting at the same time, your circuit would be in an alarm state. So that’s what’s so simple about it; it’s that the transmitter is off.

To configure this, you just make one line change to your Wanpipe configuration file. You just change the Tristate from “no” to “yes”. What’s important to note is as soon as you do that, when you start up Wanpipe, it will start with the transmitter off. So if you just make that change, you’ve just broken your system because your T1 won’t come in correctly because the transmitter is off. So this is the command that you will do; this is the command that Heartbeat needs to perform to actually switch the transmitter on.
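The behaviour described here can be pictured with a toy model. This is a sketch only: it does not call Wanpipe or Heartbeat, the class and method names are invented, and it assumes both servers are configured with Tristate set to yes so that Heartbeat decides which one transmits.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of one PBX server attached to the bridged circuit."""
    name: str
    tristate: bool                  # models the Tristate setting from the talk
    transmitter_on: bool = False
    telephony_running: bool = False

    def start_wanpipe(self) -> None:
        # With Tristate set to yes, the port starts with its transmitter off.
        self.transmitter_on = not self.tristate

    def fail(self) -> None:
        self.transmitter_on = False
        self.telephony_running = False

    def take_over(self) -> None:
        # What Heartbeat triggers: transmitter on, then the telephony software.
        self.transmitter_on = True
        self.telephony_running = True


def circuit_state(nodes) -> str:
    """Exactly one transmitter may drive the shared circuit."""
    transmitting = sum(node.transmitter_on for node in nodes)
    if transmitting == 0:
        return "down"
    if transmitting == 1:
        return "up"
    return "alarm"


primary = Node("primary", tristate=True)
standby = Node("standby", tristate=True)
states = []

primary.start_wanpipe()
standby.start_wanpipe()
states.append(circuit_state([primary, standby]))  # nobody transmitting yet

primary.take_over()
states.append(circuit_state([primary, standby]))

primary.fail()
states.append(circuit_state([primary, standby]))

standby.take_over()
states.append(circuit_state([primary, standby]))

print(states)  # ['down', 'up', 'down', 'up']
```

The first "down" is the trap mentioned above: with Tristate enabled and nothing switching the transmitter on, the circuit never comes up.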

So this is a drawing I did of Heartbeat. I’m not completely happy with it, but for now it’s helpful. What Heartbeat does is it creates sort of a virtual environment as far as the addressing is concerned, and all of your devices register to that. And then which server is actually providing that is invisible to the set, so if you have a failure, the secondary will take over and, other than a short interruption, the sets won’t know that happened.

Now you will lose calls in progress but any new call that gets generated will simply be responded to by whatever server is the primary at that point. The other piece that’s important is the connection between the two servers; this is where the Heartbeat’s actually happening between them. Don’t run that through a switch because if the switch fails you obviously just killed your Heartbeat connection, so you use a crossover cable.

The best practice is also to use a null modem cable between the serial ports at the same time. So you actually have two cables using two completely different technologies, synchronizing those two servers to make sure that they know about each other. Because if they lose connection to each other, each one is going to assume that the other one is down and they’re both going to try to come up, which is not ideal.
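The point of running two independent links can be expressed as a rule: treat the peer as dead only when every link has gone quiet. The sketch below is an illustration, not Heartbeat’s actual implementation; the link names, timestamps and the ten-second threshold are assumptions (the name `deadtime` is borrowed from Heartbeat’s configuration vocabulary).

```python
def peer_is_dead(last_heard: dict, now: float, deadtime: float) -> bool:
    """Declare the peer dead only if every heartbeat link has been
    silent for longer than `deadtime` seconds."""
    if not last_heard:
        raise ValueError("at least one heartbeat link is required")
    return all(now - heard > deadtime for heard in last_heard.values())


# Hypothetical timestamps (in seconds) of the last heartbeat seen per link.
one_cable_pulled = {"ethernet-crossover": 100.0, "serial-null-modem": 128.0}
both_silent = {"ethernet-crossover": 100.0, "serial-null-modem": 105.0}

print(peer_is_dead(one_cable_pulled, now=130.0, deadtime=10.0))  # False
print(peer_is_dead(both_silent, now=130.0, deadtime=10.0))       # True
```

With a single link, a pulled cable looks exactly like a dead peer; with two links, one pulled cable does not trigger a takeover.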

Man: [inaudible question]

Jim: Ah, yes, definitely it is. That’s a good question and it’s true. That’s one of the reasons to have more than one of them. You can have multiple synchronizations. The other thing you can do is you can do this with more than two servers. You can have three, four. But yes, that is a danger. If someone gets to the back of those and starts pulling on cables, you could have a problem on your hands. Yes, exactly. If they cut both of them, then you’re in the same…

But you’re right. Someone is more likely to unplug an Ethernet cable than they are, say, a serial cable because serial cables are now old and people don’t dare to touch them, whereas Ethernet – Oh, I’ll just move this around quickly, right?

Man: [inaudible question]

Jim: Yeah, I don’t know if I’d be comfortable with that but I kind of like where you’re going. Because really, anything that allows a reliable Heartbeat to take place…Maybe that’s a great idea; you can’t cut a cable like that. Use a serial as your primary synch, Ethernet as your secondary and your tertiary could be something wireless. That would be kind of neat. But the point being, and it’s a great point, is that you need to make sure that you’ve got a really reliable Heartbeat connection between those two machines.

So really Heartbeat doesn’t have to do a heck of a lot. It has to manage the shared virtual IP address and it has to manage the state of the transmitter. This is something that is probably worth looking at doing. There is a child application to Heartbeat called STONITH (http://en.wikipedia.org/wiki/STONITH). STONITH stands for “Shoot the other node in the head” and it allows Heartbeat to perform an action that absolutely makes sure that the primary is down.

So where STONITH could be useful, for instance, if you had an intelligent powerbar, you could have the secondary send a command to the powerbar to disable the power connection to the primary. So if somebody, say, unplugged the synch cable between the two of them, the STONITH on the secondary would say, “Kill the power on the primary” and the primary would go down. Maybe it hadn’t actually failed, but at least that way you don’t have a disaster on your hands. That does of course increase complexity. Is the intelligent powerbar reliable? Does it work well? Is that now a single point of failure? They are all things that have to be considered.
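The fencing idea can be sketched as "power off the primary first, take over second". Everything in the example below is hypothetical: the `FakePowerBar` class stands in for an intelligent powerbar, and the rule that the secondary stays passive when fencing fails is an assumption of this sketch rather than something stated in the talk.

```python
from typing import Callable


class FakePowerBar:
    """Stand-in for an intelligent powerbar (hypothetical interface)."""

    def __init__(self) -> None:
        self.outlets = {"primary": True, "secondary": True}

    def switch_off(self, outlet: str) -> bool:
        self.outlets[outlet] = False
        return not self.outlets[outlet]  # True once the outlet is off


def fail_over(fence_primary: Callable[[], bool],
              take_over: Callable[[], None]) -> bool:
    """Take over only after the primary is confirmed powered off."""
    if not fence_primary():
        return False  # fencing failed: stay passive rather than risk two transmitters
    take_over()
    return True


powerbar = FakePowerBar()
events = []
succeeded = fail_over(
    fence_primary=lambda: powerbar.switch_off("primary"),
    take_over=lambda: events.append("secondary active"),
)
print(succeeded, powerbar.outlets, events)
```

The ordering is the point: the secondary’s transmitter is only switched on once the primary can no longer be driving the circuit.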

So this is kind of Asterisk-centric, but I hope that it’s obvious that most of what I’ve talked about so far can be used with any open source PBX. It’s not an Asterisk-centric thing. But in Asterisk you would have to make sure that /etc/asterisk was updated regularly. This is really important: the Asterisk database, which stores the registration state for devices. If you don’t keep that synchronized on the secondary, if the secondary comes up it won’t know where all your dynamic end points are. And then you should keep your voicemail synchronized because otherwise that’s going to be out-of-date. There might be something else, but those are the key ones I thought of.

Rsync is probably all that’s needed. If you start having to do database synchronization, that can get tricky. I don’t recommend using Heartbeat to do that or rsync. There’s all kinds of other ways to do database synchronization.
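As a rough sketch of what such a synchronization job could look like, the snippet below only builds the rsync command lines and does not run them. The paths are the usual default Asterisk locations and the host name is a placeholder; both are assumptions to check against the actual installation. Copying the Asterisk database file while Asterisk is writing to it may yield an inconsistent copy, so treat that entry with care.

```python
import shlex

# Assumed default locations; adjust to the actual installation.
SYNC_PATHS = [
    "/etc/asterisk/",                  # configuration files
    "/var/lib/asterisk/astdb",         # Asterisk database (registration state)
    "/var/spool/asterisk/voicemail/",  # voicemail messages
]


def rsync_commands(standby_host: str, paths=SYNC_PATHS):
    """Build one rsync invocation per path, pushing primary to standby."""
    return [
        ["rsync", "-a", "--delete", path, f"{standby_host}:{path}"]
        for path in paths
    ]


# "pbx-standby" is a hypothetical host name.
commands = rsync_commands("pbx-standby")
for command in commands:
    print(shlex.join(command))
```

Each command list could be handed to a scheduler such as cron, or to `subprocess.run`, to repeat the copy every minute or so.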

Man: [inaudible question]

Jim: Well /etc/asterisk is – those files are being read as they’re being…like they’re read once and once they’re read they just sit there. If you’re doing the sync while the reload was happening, frankly, I don’t know. That’s an interesting question. I mean it’s controlled by the Linux file system. Usually you can set that to synchronize every minute or so because it takes moments. It’s not a high bandwidth process to replicate that data; there’s just some text files.

Man: [inaudible question]

Jim: That’s a good point. I can’t say with any confidence. If anybody else has any ideas on that, because it is a good point.

[inaudible]

Jim: That’s an interesting idea. You know, a lot of what I’m talking about here is sort of a new concept because prior to being able to turn the transmitter off, it just wasn’t possible to do this sort of thing. But some of that is real-time replication challenges that have nothing to do with the PBX per se.

The cabling side of it is something that has to be thought about and it’s not particularly complicated but it can get a little tricky. One of the things that I’ve done to myself is: don’t try to do anything with the cabling until you’ve got your first E1 up and running; until you have your circuit up and running. Because otherwise you’re going to be sitting here thinking you have a cabling problem where in fact you don’t. Your circuit is just not up. So get the circuit up with a normal cable and don’t do anything until it’s working and you know it’s working. Then you’ll always have a stable point to go back to when you’re trying to get the bridge done.

Keep in mind that what you’re doing is – it’s a hack. Most telecom professionals are going to look at this and say, “What have you done? You’ve split the E1 circuit. This is ridiculous; you’ve just broken your machine.” So you have to make sure that it’s very easy to prove that what you’re doing is fundamentally sound.

So this is a simple diagram. This describes a crossover cable. You might not be using a crossover, but basically your transmit and receive pairs are fed into the primary system and they are bridged and simultaneously fed into the secondary system. And that’s pretty much all there is to the cable. It’s not difficult to do. It can get a little thorny though, with polarity and things like that.

Usually I would run this through either a cross-connect block or a patch panel. You could just splice these together but it’s going to look messy and it’s going to scare somebody. If you do it through a patch panel or something like that, you want to make it look neat so if someone from the carrier is looking at it they’re not going to wonder what is going on.

So availability has to be from the user perspective. This is just a really important fundamental concept of high availability. In order to accomplish this, you have to be able to disable the transmitter on a T1/E1 card. So any T1/E1 card that is capable of having the transmitter dynamically disabled and enabled by Heartbeat by sending a command in theory should be able to do this.

It has to be simple in order to invoke the failover. It’s got to be rapid. The systems have to be synchronized; they have to be containing the same data. It’s much more than just hardware redundancy. It’s really important to take the time to implement good policies and have testing procedures and make sure you keep those things up-to-date.

I put “question” but in fact I’m still learning this myself and I’m in a room full of people who have an awesome amount of talent and knowledge in this area. This is still a fairly new concept, at least from my experience in making it very simple to deliver a high degree of availability without having to get into complicated multiple servers.

So if anybody has any questions or anyone has any comments or really anything you can think of, I’d be very pleased to hear them.

Man: [inaudible question]

Jim: The AstDB does that. Or, it doesn’t do that, it stores it. It registers with the virtual IP that Heartbeat controls, so when the primary is disabled, the secondary will assume that IP address. If you don’t have the AstDB copied over and current, then Asterisk won’t know about that set because it didn’t register to that. That’s why AstDB stores registration information, so that on the startup of Asterisk it knows what it knew before about set registration.

Man: [inaudible question]

Jim: Asterisk is not running. Wanpipe is running and Zaptel is running. Wanpipe is running, Zaptel is running, the transmitter is turned off. But Zaptel doesn’t know that. So Zaptel is still transmitting, but it’s just not going out the wire. So what happens is the transmitter is turned on and Asterisk is started. Yes?

Man: [inaudible question]

Jim: No, I do not. But I can’t see any reason why that wouldn’t work. Heartbeat itself is capable of doing that. In theory I don’t know how many of these you could actually wire up but you should be able to do more than one – again, as long as the transmitter is turned off. Yes, Tim? What was that? I want to get this into the microphone. An impedance mismatch? And that you’re saying no?

Man: [inaudible question]

Jim: So is there a danger in increasing…like the more of these you put on there, the greater…? OK, so a logging application uses high impedance the same way. Yeah, you’re just kind of passively sitting there. You’re not…yeah. Yes?

Man: [inaudible question]

Jim: In theory, because the receiver is working you are hearing what’s happening on the circuit but now you need to have this secondary application that’s interpreting what it’s seeing. Then if the secondary has to come into play, that application needs to get out of the way so that the secondary can get involved. So I would think that conceptually it’s possible. I’d be concerned that it increases complexity but it might be a good idea to do that. I’m not saying it’s not a good idea. It’s something to think about.

Man: [inaudible question]

Jim: Well you are doing HDLC, but Wanpipe is not analyzing what it’s seeing on…but I like the way you’re thinking because it does increase the…You know, one of the things that concerns me is: what if the failure in the primary is not catastrophic? It’s affecting the users but in fact most of the applications appear to be working. The secondary is not going to know that it needs to get involved because everything appears to be functioning correctly. So what you’re suggesting is an application that’s got a little intelligence and is actually hearing what’s happening on the primary and doing something about it.

But maybe you have somebody in there doing service or something and they…yeah. There’s no panacea here; there’s no one solution and I think that’s kind of what Kevin was alluding to before my talk had even started. Five nines is a bit of a joke because it doesn’t actually mean anything. You can’t put a stamp on something and go “We have five nines!” You don’t know that because if it blows up one day and it takes three days to get it back up, then you don’t have five nines. [laughs] Exactly, you’ve got to reset the nines’ clock to zero. Yes?
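The three-day example can be put into numbers. The sketch below assumes a 365-day year and no further downtime after the outage; the function names are invented for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: a 365-day year


def availability_percent(downtime_seconds: float, period_seconds: float) -> float:
    """Availability achieved over a period with the given total downtime."""
    return 100 * (1 - downtime_seconds / period_seconds)


def years_to_recover(downtime_seconds: float, target_percent: float) -> float:
    """Years of service, with no further downtime, before the long-run
    average climbs back to the target availability."""
    allowed_fraction = 1 - target_percent / 100
    return downtime_seconds / allowed_fraction / SECONDS_PER_YEAR


outage = 3 * SECONDS_PER_DAY
print(f"{availability_percent(outage, SECONDS_PER_YEAR):.3f}% over one year")
print(f"{years_to_recover(outage, 99.999):.0f} years to average five nines again")
```

A single three-day outage leaves the year at roughly 99.18% availability, and averaging back up to five nines would take on the order of eight hundred years of uninterrupted service.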

Man: [inaudible question]

Jim: I agree, that’s right. That’s why it’s so important to look at high availability from the user’s perspective. Because you’re going to say to the users, “Hey, we’ve got our five nines because…” and you’re going to give them some mathematical formula; they’re going to shoot you because they don’t care. My phone is not reliable.

I agree with you. That’s why the five nines is not a…this isn’t about a mathematical formula, it’s about making sure you’re doing everything that you can to have your systems available to your users. You don’t want that failure to happen and if it does, I lost my phone call. I’m going to be upset about that. But at least when I made the call again, I reconnected my phone call, I’m back. It worked. As opposed to, “Oh no, the phone system is down. Who do I choke?” Do you understand what I’m saying? It’s not about the numbers, it’s about the availability, about the application.

Man: [inaudible question]

Jim: That’s a great question and I would say right now I can’t think of a way to do it. Yeah, because on a traditional PBX, you can have completely redundant system core and when you have a failure you don’t even lose call processing. Right? But that’s not possible, not at this current state. Theoretically? I guess it could be possible to do that but wow, I don’t know how you’d accomplish synchronizing the switchover from the primary E1 circuit to the secondary without any application confusion. I think that would be incredibly difficult to accomplish. Yeah, I mean there…no. Yes?

Man: [inaudible comment]

Jim: Yeah, one of the things I remember you telling me is even on an SS7 circuit, typically you will lose the call. Even though SS7 can handle that, nobody is actually building trunking facilities that way, or it’s rare.

[unmiked conversation]

Jim: Unless he’s the CEO, then it’s a big problem. The thing about this is that you can now talk to somebody – I’m sort of trying to give everybody – trying to make sure it’s clear what five nines means. Because you can go out there and say, “Hey we can sell you all five nines in a row with the decimal place in the correct location.” But what this is all about is being able to go to the customer and say, “We can very inexpensively provide you with a very high amount of hot failover capabilities without having to break the bank.”

To actually accomplish this is the cost of the second server, the cost of all of the duplication of the cards and whatever configuration is required to make them synchronize and then some professional services to make sure that the processes are in place.

So if we take the technology away from it from a sales perspective this is very compelling because it is relatively simple to implement, it’s relatively inexpensive. It does provide a very high degree of hot failover. Yeah.

Man: [inaudible question]

Jim: I was thinking about BRI and I’m not that familiar with it. In North America, it’s a different standard and it’s a mess and it’s almost never used. But here it’s delivered on an S/T interface, right, on a four wire? Is that not a bus interface? So in theory, could you not have a secondary that…

Man: [inaudible comment]

Jim: Somebody asked me about doing this with analog and I said, “Well, there’s no need to.” Because analog, you can put as many devices as you want on the line, just don’t answer the phone.

Man: [inaudible comment]

Jim: Yeah, and that’s the part where I don’t know how it’s commonly implemented. But it makes it a lot less expensive to look at, providing high availability…I’m always reluctant to use the term “high availability” because I know it means a heck of a lot more than just, “Oh, when my server dies the other one hopefully starts running.” Yes?

Man: [inaudible question]

Jim: That’s a good question. Yeah, probably you have to end up doing something along those lines. The immediate question that pops into my head is: Are you increasing complexity? Which sometimes you have to do, it’s a necessary evil. But that would be always the question, is: Am I increasing complexity because I have no choice? Or am I just not thinking about making it simple. So I would expect you would probably have to…

Man: [inaudible comment]

Jim: Yeah, I’m not strong enough on databases to feel comfortable lecturing on the right way to do that.

Man: [inaudible comment]

Jim: Awesome! Not so good for me, but…Tim?

Man: [inaudible comment]

Jim: OK, great. So that’s all kind of relevant to that stuff. Yeah. Wonderful.

Man: [inaudible question]

Jim: You just have to put your tap in front of the split in the wire.

Man: [inaudible comment]

Jim: I would guess you would have a third machine for that that would have nothing to do with the…

Man: [inaudible comment]

Jim: Well, we’re out of time. Thank you very much and I know your time is precious. I appreciate you spending it with me. I hope you found it informative.

[applause]