Why you must not use Azure if you want to keep your sanity

I’ve been working with Azure for about a month now, with maybe 20-30 hours spent directly on it, but it is already clear that I don’t need to waste any more time on it to reach this conclusion:

Azure is shit.

Azure is the worst sysadmin nightmare come true.

Azure is like a bad marriage where there’s no love to begin with, but there’s an accidental pregnancy, and you’re both from religious families, so you have no choice but to marry, and then you hate your life forever.

Allow me to explain with some technical details.

The main problem with Azure is that it suffers from a multitude of limitations that are badly documented or not documented at all, so they bite you after you’ve done 80% of the work, and you discover you have to wipe it all and start from scratch just to do one little thing differently at the beginning of the process. Just two examples:

  • If you create a VM with a single NIC, you can’t add more. If you create a VM with several NICs, you can’t remove them so that you’d end up with fewer than two. You built something, spent time customizing it, and now want a second NIC? Delete the VM, start from scratch. Microsoft doesn’t give a fuck about your time.
  • A static VPN can’t have overlapping networks on the two sides of the tunnel. Cisco can do it. Juniper can do it. AWS basically requires you to do it (the local side of an AWS IPSEC tunnel is 0/0). Try doing this in Azure – it will fail. Moreover, it will fail with no useful diagnostics whatsoever if you’re setting up the VPN through the Azure portal (thankfully, someone at Microsoft created a decent CLI that produces actual error messages – the only good thing about working with Azure). And since there’s either a bug or a misfeature that prevents modifying the set of subnets assigned to the local peer, you have to delete the VPN connection, delete the local peer, create a new peer with a different set of subnets, and then create the VPN connection again – roughly as sketched below.
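
For reference, here’s roughly what that delete-and-recreate dance looks like with the current az CLI (the resource group, resource names, addresses, and prefixes below are placeholders, and the post itself used the older cross-platform CLI, but the workflow is the same):

# tear down the connection and the local peer (the "local network gateway")
az network vpn-connection delete -g my-rg -n onprem-connection
az network local-gateway delete -g my-rg -n onprem-peer
# recreate the peer with the new set of subnets
az network local-gateway create -g my-rg -n onprem-peer \
  --gateway-ip-address 203.0.113.10 \
  --local-address-prefixes 10.10.0.0/16 10.20.0.0/16
# recreate the connection
az network vpn-connection create -g my-rg -n onprem-connection \
  --vnet-gateway1 my-vnet-gw --local-gateway2 onprem-peer \
  --shared-key 'MyPresharedKey'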

Then there are general ecosystem issues (or actually just one basic philosophical one):

  • Azure documentation is worthless, for two reasons. First, they haven’t bothered to spend enough time on it: everything is barebones, and some features and behaviors are not explained at all. Second, Microsoft writes documentation for losers. Their target audience is people with no ability to think whatsoever. A typical document does not explain the feature, show how it works, or show what can be done with it. It is a walkthrough: do this and only this, don’t think, just follow our directions. You get stuck? Open a support case. Just don’t think.
  • Another aspect of the same problem: there are no good discussions or blogs on Azure that would help to solve problems or do deep dives to help you understand how things work. Microsoft is your only help. Don’t think, just go ask them.
  • They believe all you need for interoperability are sample configuration files (this one is a more specific issue encountered with VPN setup). Don’t expect any explanation why the configuration is exactly like that. You have samples, use them. Don’t think.
  • Did I mention you’re not supposed to think when working in Azure?

When I started using AWS, my most pronounced thought was “holy shit, I’ll never learn to use everything this thing can do.” With Azure, after just a few weeks, I feel like I’m running through a tiny little maze with no exit, designed by an evil clown.

Oh, and that unwanted pregnancy? Microsoft Office and the parent company’s choice. No way out of this marriage.

Categories: Uncategorized

Cisco ASA Group Policy Access Lists Are Evil

Anyone who ever had to use a Cisco ASA firewall knows how access control works. You have to create ACLs, which are collections of rules that specify allowed and denied traffic. Every rule specifies the source of the traffic and its destination. If the source and destination IPs and/or ports of the packet match the source and destination IPs and/or ports of a permissive rule, it’s allowed to pass. If the match is with a denying rule, the packet is not allowed. Simple, right?

Not exactly. There’s an exception to this principle: the ACLs used for filtering traffic going through VPN tunnels. They are called “group policy access lists” or just “vpn filters” (from the “vpn-filter” CLI command used to configure them). In those ACLs, the definition of source and destination is turned upside down – or rather, you no longer quite know what they are at all. Source is now everything on the remote side of the tunnel, and destination is everything local to the firewall. The logic is now very different, even though the syntax remains the same.

Now, let’s take a look at what this means for our security. We will consider the same ACL, first its meaning when used for standard access rules, and then its vpn filter meaning.

access-list web-access extended permit tcp 192.168.1.0 255.255.255.0 host 5.5.5.5 eq http

This rule allows all our internal users on the 192.168.1.0/24 network to access the external web server at 5.5.5.5. It’s very straightforward, does exactly what we want it to, no surprises here. But if that web server happens to sit on the other side of a VPN tunnel, everything changes. We need to rewrite the rule to put the web server as the source, since it’s on the remote end, and our internal network as the destination:

access-list web-access extended permit tcp host 5.5.5.5 eq http 192.168.1.0 255.255.255.0
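
For completeness, this is roughly how such an ACL gets attached to a tunnel as a vpn filter (the group-policy name, tunnel-group name, and peer address are placeholders):

! attach the ACL as a vpn filter in the tunnel's group policy
group-policy partner-gp attributes
 vpn-filter value web-access
!
tunnel-group 203.0.113.2 general-attributes
 default-group-policy partner-gp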

It achieves the same result, but also something else, completely unintended and very insecure. Now anyone who has root access to that web server will be able to reach any TCP port on any host on the 192.168.1.0 network, as long as they can open the connection with port 80 as the source. The ASA no longer knows which side is supposed to initiate the connection and has no means of distinguishing a legitimate client-to-web-server connection from someone running netcat against the client network. The client network is suddenly no more secure than that remote web server.
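
To make the exposure concrete: with root on the web server, something as simple as the following (the internal address and port are hypothetical) sails through the tunnel, because the connection’s source port matches the “eq http” part of the rule:

# from the web server at 5.5.5.5: reach RDP on an internal host, using TCP source port 80
nc -p 80 192.168.1.50 3389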

And this is why VPN filters are evil.

P.S. VPN filters actually have one useful capability, which is conspicuously absent from the standard ASA access control mechanism: by virtue of being applied to traffic between two specific sets of networks, they effectively implement the concept of zones, so popular (and so useful!) in other vendors’ firewalls. I’ll cover this concept – or rather the lack of it and the nastiness of the necessary workarounds – in a future post.

Categories: Networking

Linux, EMC SANs, and TCP Delayed ACKs

December 21, 2011

One of the relatively well-known issues with using EMC (and some other vendors’) SANs over iSCSI is the arrays’ dislike for TCP delayed ACKs. The reasons for the dislike are best described in this VMware KB article. EMC also has several articles discussing delayed ACKs on Primus, but overall the picture is confusing. With this post, I’ll try to clear up the confusion.

(Since you can’t deep-link to EMC Powerlink pages, I’m just going to give article numbers.) Out of the many articles returned by a Powerlink search for “delayed ack”, we can consider emc245445 the starting point, since it discusses general best practices for improving Clariion iSCSI performance and provides references to the articles covering Windows and ESX hosts specifically. About Linux hosts, it has only a strange “may also apply” statement with no further explanation or instructions. Looking at the Windows article (emc150702), we see very detailed instructions for tweaking TCP stack settings; the articles for ESX (emc191777 and emc273003) lead to the VMware article mentioned above, which tells us where to find the magic checkbox that disables delayed ACKs. But the best Powerlink can do for Linux is emc264171, which reads like an exam answer from a mediocre student who remembers some things from the lectures but can’t put them together into a coherent response. So, what’s going on?

The issue is not trivial and was probably never researched by EMC deeply enough to produce a useful Primus article. In Linux, the fix is applied per socket, via the TCP_NODELAY socket option (strictly speaking, TCP_NODELAY disables Nagle’s algorithm, the sending-side half of the Nagle/delayed-ACK interaction that causes the stalls). It is set by the application that wants to use the socket and is not a simple system-wide setting like in Windows or ESX. Therefore, in Linux, there are many different places where this option may end up being specified.

If the Linux server is using software iSCSI, it is the responsibility of the iSCSI initiator code to set this option. Fortunately, open-iscsi, which is used by most (all?) modern Linuxes, does the right thing: TCP_NODELAY is hardcoded in the function that handles TCP connections.

In the case of hardware iSCSI initiators (a proper HBA, or the iSCSI offload provided by Broadcom NICs and the like), things get more complicated. Since these adapters implement their own TCP stacks, delayed ACKs need to be disabled through the driver. And, of course, every driver has its own setting for that. For example, Broadcom’s bnx2i takes “options bnx2i en_tcp_dack=0” in modprobe.conf (see the sketch below). For other iSCSI implementations, you will need to consult the documentation or contact the vendor (or just Google it).
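
As a rough illustration for the bnx2i case (the parameter comes straight from the Broadcom driver as described above, but verify it against your driver version before relying on it):

# check that the driver exposes the parameter
modinfo -p bnx2i | grep en_tcp_dack
# make the setting persistent (RHEL 4/5 vintage; newer distros use /etc/modprobe.d/)
echo "options bnx2i en_tcp_dack=0" >> /etc/modprobe.conf
# reload the driver (disruptive - plan for downtime) or reboot for it to take effect
modprobe -r bnx2i && modprobe bnx2i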

Categories: Operating Systems, Storage

VMware vCenter Server 5 Upgrade and Service Accounts

When upgrading our vCenter Server instances, I encountered an annoying quirk in the vCenter Server installer, which may or may not depend on how the database backend is configured. At least in our case, with the transition from a local SQL Express instance to a separate SQL database server, the installer didn’t let me choose the account used to run vpxd and tomcat. Obviously, running the services under the personal account of whoever happened to be installing them is a bad idea. So here’s a quick recipe for changing the service account:

  • Create the database per http://www.vmware.com/files/pdf/techpaper/vSphere-5-Upgrade-Best-Practices-Guide.pdf, pg. 20. Run SQL Management Studio under the AD account that you will later use to install vCenter Server; first make sure to give that account sufficient privileges in SQL Server to create the database (you can even make it a sysadmin – it’s temporary, so you can drop the privileges later). When creating the database, leave the owner at the default.
  • On the system where you’ll install vCenter Server, create a DSN (see the same document), making sure to specify Windows Authentication.
  • Run vCenter Server installer.
  • After the installation is finished, install vCenter Client, confirm that everything works.
  • Stop VirtualCenter Server and Web Management services.
  • In each service’s properties, on the Log On tab, specify the dedicated account you want to use for this purpose.
  • Go to SQL Management Studio again. Create a regular user associated with the same AD account, then execute the following statement against VCDB: “ALTER AUTHORIZATION ON DATABASE::VCDB TO <useraccount>”. This changes the database’s owner to this user.
  • Start the two vCenter services.

It seems there’s a similar issue with VUM: its service gets installed under the local SYSTEM account, but your default DSN configuration will most likely use Windows Authentication and therefore require a proper AD account. So, go to the service’s properties, change its account to the same one you used for vCenter, and restart the service (or script it – see the sketch below). This issue is sufficiently widespread to have earned itself a KB article: http://kb.vmware.com/kb/1011858.
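
If you prefer to script the account change instead of clicking through services.msc, something along these lines should work from an elevated prompt (the short service name “vpxd” is my best guess for the vCenter Server service and should be confirmed with sc query first; the account and password are placeholders):

rem list installed services and note the SERVICE_NAME of the vCenter/VUM entries
sc query state= all | findstr /i "DISPLAY_NAME SERVICE_NAME"
rem re-point a service at the dedicated account, then restart it
sc config "vpxd" obj= "MYDOMAIN\svc-vcenter" password= "ThePassword"
net stop "vpxd" && net start "vpxd"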

Update 10/19/12 (this wasn’t a problem until the upgrade to 5.1, where we had to start using the vSphere Web Client): use the same account to run the Inventory Service and the Profile-Driven Storage Service.

Categories: Virtualization

Why pfSense is not production ready

November 2, 2010

(Caveat: everything said below applies only to pfSense 1.2.3, since that is the only version I have ever used.)

pfSense is a great piece of software. Easy to install, easy to configure, very powerful, lightweight, stable. It’s no surprise that so many people use it when they need a software firewall or router. But after running it in production for about half a year, I have come to the conclusion that using it in a critical role was the wrong decision. And here’s why.

Over this period, I had exactly three issues with pfSense. One of them, the breakage of CARP due to multicasts coming back over teamed physical adapters, is mostly VMware’s fault, and I’m not going to count it against pfSense. The other two, however, are clearly a reflection of the FOSS mindset (or rather of a lack of resources).

The first of the two is the default size of the state table: 10,000 entries. This number is fine for home use or a small startup’s web site, but any organization beyond infancy will have more traffic and will need to increase the table size. The change is simple and can be made on the fly, so it may not seem like a problem, but it’s easy to miss and difficult to troubleshoot: connections just randomly time out or take a long time to establish, while pfSense happily keeps its system logs free of any notifications (see below for a quick way to check). Considering that each table entry occupies just 1K of memory, it would make a lot of sense to set the default to a much larger number or, better yet, implement dynamic table resizing.
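
If you suspect you’re hitting the limit, the quickest check is from the shell (these are stock pf commands, so they behave the same on the pfSense appliance); the limit itself can then be raised under System > Advanced in the web GUI:

# current number of entries in the state table
pfctl -si | grep current
# configured hard limit on states
pfctl -sm | grep states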

The second problem is much nastier: something is broken in IP fragmentation handling. In our specific case it affected EDNS responses (DNSSEC-enabled servers now return 2-3KB responses, which necessarily get fragmented). pfSense’s scrub feature would reassemble them for analysis, then send them on to the destination, again in fragmented form, and the second fragment would arrive with a broken checksum, which made reassembly at the destination or at any intermediary firewall impossible. There are some hints that this may actually be a problem with em driver checksum offload (a quick way to test that theory is below), but at this point it’s irrelevant: if pfSense can’t do something as basic as IP fragment processing, regardless of the underlying drivers and hardware (in this case it was actually the pfSense-distributed virtual appliance, so no compatibility issues should be expected), it doesn’t qualify as a production-ready firewall.
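
If you want to test the checksum-offload theory on your own installation, disabling offload on the em interfaces is a quick and reversible experiment (the interface name is an example):

# turn off hardware checksum offload on the fly; re-enable with "ifconfig em0 txcsum rxcsum"
ifconfig em0 -txcsum -rxcsum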

I expect it to be gone from our environment in about two weeks.

Categories: Networking

Fixing VM-based pfSense CARP announcement echoes when using teamed network adapters

A few people who have tried to run pfSense as a virtual appliance on an ESX(i) host have found that CARP may refuse to work. Both pfSense nodes remain in “Backup” state – neither of them is willing to take the Master role and start serving the VIP. This problem is observed only if the VIP belongs to a virtual network interface that has multiple underlying physical adapters in a teamed configuration.

The best clue to the solution can be found in the pfSense logs. There, one can see that the primary pfSense node actually tries to become Master, but every time it does, it receives a CARP announcement with the same advertisement frequency as its own, which makes it drop back to Backup state (a router must see only lower-frequency advertisements to remain Master). In fact, those advertisements are its own.

So, the real question is: why does the router VM receive IP packets that it has just sent? To answer that, we need to remember that, first, CARP advertisements are multicast, and, second, the typical ESX teaming setup uses the “Route based on the originating virtual switch port ID” option. This setting means that any given vNIC will consistently use the same pNIC unless a hardware failure occurs. When the host uses this setting, the corresponding ports on the physical switch the pNICs are connected to are left in their default configuration, with no link aggregation.

Now, what happens when a CARP advertisement is sent? It exits the host on one pNIC, travels to the switch, where, being a multicast, it’s sent to other switch ports, including the other pNICs in the same team as the originating pNIC. The multicast comes back into the host, where it’s sent to all VMs on the same vSwitch, including the originating router. Oops.

We can argue whether or not this ESX behavior is correct, but the important fact is that VMware doesn’t seem to be interested in changing it (the problem existed in 3.5 and still exists in 4.0), and there was no VMware-side solution until 4.0U2. If you’re on that version, you can use the new Net.ReversePathFwdCheckPromisc option (refer to the 4.0U2 release notes and the sketch below). Or we can fix it ourselves, in a very simple way: make the switch aware of the teamed nature of the pNICs involved, so that it won’t send the multicast packets back to the host.
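
For those on 4.0U2, the option can be set from the service console (or through the vSphere Client under Advanced Settings, Net); the syntax below is from memory, so double-check it against the release notes:

# enable the reverse path forwarding check for promiscuous/reflected traffic (1 = on)
esxcfg-advcfg -s 1 /Net/ReversePathFwdCheckPromisc
# verify the current value
esxcfg-advcfg -g /Net/ReversePathFwdCheckPromisc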

I will let you figure out the correct setting for the switch (different vendors use different names for the same thing: Cisco has EtherChannel, Nortel calls it MultiLink Trunking, etc.). As for the host side, change the load balancing algorithm from the default to “Route based on IP hash”. Just keep in mind that until you have made the changes on both ends, that connection may not work, so make sure you’re not transferring anything important to/from the VMs on the same vSwitch while you’re making the changes. (I’m assuming your management network is on a different vSwitch; otherwise you’re on your own.)

Update: Thank you to Anne Jan Elsinga for pointing out that 4.0U2 provides a new option that can be used to solve the problem. I’ve modified the post to reflect this.

Missing access points after upgrade of Cisco Wireless LAN Controller to Release 5.2

If you upgrade your Cisco Wireless LAN Controller to Release 5.2, and suddenly some or all access points disappear, go to Controller, Advanced, Master Controller Mode, check the box, and power cycle the missing access points.

Root cause: 5.2 introduces the CAPWAP protocol as the replacement for LWAPP. Some access points don’t transition to the new protocol unless the Master controller tells them to. Unfortunately, Master Controller Mode gets disabled after every reboot of the controller. Since you always have to reboot to apply a new software release, the problematic access points are left without guidance until you manually check that Master Controller Mode box.
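
If you manage the controller over SSH rather than the web interface, I believe the CLI equivalent is the following (verify on your own controller, since command syntax shifts between WLC releases):

config network master-base enable
save config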

Categories: Networking

Solution of the Clariion plaid issue

The problem was not with plaids as such, but rather with network congestion easily triggered by plaids.

The default configuration of PowerPath is to make all available paths active. With two iSCSI ports per SP, there are four paths, each 1Gbps wide, 4Gbps total. However, that’s true only on the array side; the host has only two 1G NICs. So, every time the array starts firing on all cylinders (and using plaids built from LUNs owned by different SPs is guaranteed to make it do so), it is pumping twice as much data as the host interfaces can handle. The result: network congestion, frame discards, and severely degraded throughput.

Solution: change the mode of two of the four paths from active to standby, choosing them so that each SP keeps one active path and each host NIC carries exactly one active path (see the powermt sketch below). Alternatively, add two more NICs so that the host bandwidth matches the array bandwidth (though this may require four NICs on all other hosts as well, since using multiple host NICs on the same VLAN is not recommended).
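
The path mode change itself is done with powermt, roughly like this (the device and HBA numbers are examples – pick the actual paths from the display output so that the active/standby split matches your layout):

# list every path with its HBA number, array port, mode, and state
powermt display dev=all
# example: put the paths of one device that go through HBA 2 into standby mode
powermt set mode=standby hba=2 dev=emcpowera
# persist the configuration across reboots
powermt save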

Expect a new Primus KB and some changes to EMC iSCSI documentation.

Clariion plaids on RHEL 4

We will begin with a quick mention of the problem currently under investigation with EMC support:

  • Host: Dell PowerEdge R710, PowerPath, RHEL 4.8.
  • SAN: Clariion CX4-120
  • Connectivity: iSCSI – two ports per SP, two ports on the host, two VLANs on Nortel ERS5520.

Problem: if we present two LUNs to the host, put them into a single VG, and then create an LV with no striping (the LUNs get concatenated on that volume), everything is fine. If we take the same LUNs and put them into a striped LV per various EMC Best Practices documents (since the LUNs themselves are striped across physical disks, EMC calls this layout “plaids”), read performance suffers badly, dropping to about 4MB/sec, while write performance stays perfect at over 100MB/sec. The two layouts are created roughly as sketched below.
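
For reference, here is how the two LVM layouts being compared can be created (the VG/LV names, size, and stripe size are placeholders):

# concatenated LV - no striping, reads are fine
lvcreate -L 200G -n lv_concat vg_san
# striped "plaid" LV across the two LUNs - this is the one whose reads drop to ~4MB/sec
lvcreate -L 200G -i 2 -I 64 -n lv_plaid vg_san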