Archive for the ‘Storage’ Category

Linux, EMC SANs, and TCP Delayed ACKs

December 21, 2011

One of the relatively well-known issues when using EMC (and some other vendors’) SANs over iSCSI is the arrays’ dislike for TCP delayed ACKs. The reasons for the dislike are best described in this VMware KB article. EMC also has several articles discussing delayed ACKs on Primus, but overall the issue is confusing. With this post, I’ll try to clear up the confusion.

(Since you can’t deep-link to EMC Powerlink pages, I’m just going to give article numbers.) Out of the many articles returned by a Powerlink search for “delayed ack”, we can consider emc245445 the starting point, since it discusses general best practices for improving Clariion iSCSI performance and references the articles covering Windows and ESX hosts specifically. About Linux hosts, it has only a vague “may also apply” statement with no further explanation or instructions. The Windows article (emc150702) gives very detailed instructions on tweaking TCP stack settings; the articles for ESX (emc191777 and emc273003) lead to the VMware article mentioned above, which tells us where to find the magic checkbox that disables delayed ACKs. But the best Powerlink can do for Linux is emc264171, which reads like the exam answer of a mediocre student who remembers some material from the lectures but can’t put it together into a cohesive response. So, what’s going on?

The issue is not trivial and was probably never researched by EMC deeply enough to produce a useful Primus article. In Linux, there is no simple system-wide setting like in Windows or ESX. Strictly speaking, the TCP_NODELAY socket option disables Nagle’s algorithm on the sending side rather than delayed ACKs themselves (Linux controls those with TCP_QUICKACK), but since the stalls come from the interaction between Nagle and delayed ACKs, setting TCP_NODELAY on the socket avoids the problem. The option is set per socket by the application that owns it, so in Linux there are many different places where it may be specified.
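
To make the per-socket nature of the option concrete, here is a minimal sketch (assuming python3 is available on the host) that sets TCP_NODELAY on a throwaway socket and reads it back. Any application, an iSCSI initiator included, has to do the equivalent on its own connections:

```shell
# Minimal illustration (not part of any iSCSI stack): TCP_NODELAY is a
# per-socket option that each application must set for itself.
python3 - <<'EOF'
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm on this one socket only
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # prints 1
s.close()
EOF
```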

If the Linux server is using software iSCSI, it is the responsibility of the iSCSI initiator code to set this option. Fortunately, open-iscsi, which is used by most (all?) modern Linuxes, does the right thing: TCP_NODELAY is hardcoded in the function that handles TCP connections.

In the case of hardware iSCSI initiators (a proper HBA, or the iSCSI offload provided by Broadcom NICs and the like), things get more complicated. Since these adapters implement their own TCP stacks, delayed ACKs need to be disabled through the driver, and, of course, every driver has its own setting for that. For example, Broadcom’s bnx2i will take “options bnx2i en_tcp_dack=0” in modprobe.conf. For other iSCSI implementations, you will need to consult the documentation or contact the vendor (or just Google it).
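
As a sketch of the bnx2i case (the module and parameter names come from the post itself; the exact file under /etc/modprobe.d/ is an assumption and varies by distribution, with older systems using a single /etc/modprobe.conf):

```shell
# Persist the option so bnx2i loads with delayed ACKs disabled
echo "options bnx2i en_tcp_dack=0" > /etc/modprobe.d/bnx2i.conf

# Confirm the driver actually exposes the parameter before relying on it
modinfo -p bnx2i | grep en_tcp_dack

# On kernels that export module parameters via sysfs, verify the live value
cat /sys/module/bnx2i/parameters/en_tcp_dack
```

The driver has to be reloaded (or the host rebooted) for the new value to take effect.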

Categories: Operating Systems, Storage

Solution of the Clariion plaid issue

The problem was not with plaids as such, but rather with network congestion easily triggered by plaids.

The default configuration of PowerPath is to make all available paths active. With two iSCSI ports per SP, there are four paths, each 1Gbps wide: 4Gbps total. However, that is true only on the array side; the host has only two 1G NICs. So, every time the array starts firing on all cylinders (and using plaids built from LUNs owned by different SPs is guaranteed to make it do so), it pumps twice as much data as the host interfaces can handle. The result: network congestion, frame discards, and severely impacted throughput.

Solution: change the mode of two of the four paths from active to standby, choosing them so that each SP keeps one active path and each NIC carries exactly one. Alternatively, add two more NICs so that the host bandwidth matches the array bandwidth (though this may require four NICs on all other hosts as well, since using multiple host NICs on the same VLAN is not recommended).
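
With PowerPath, the path mode change would look roughly like this (a sketch only: the hba number and pseudo-device name are hypothetical examples and must be taken from the `powermt display` output on the actual host):

```shell
# Inspect all paths first and note which HBA/SP-port combinations exist
powermt display dev=all

# Put one of the paths into standby mode (hba and dev values are examples;
# repeat for the second path so that each SP keeps one active path,
# each on a different host NIC)
powermt set mode=standby hba=2 dev=emcpowera

# Persist the configuration across reboots
powermt save
```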

Expect a new Primus KB and some changes to EMC iSCSI documentation.

Clariion plaids on RHEL 4

We will begin with a quick mention of the problem currently under investigation with EMC support:

  • Host: Dell PowerEdge R710, PowerPath, RHEL 4.8
  • SAN: Clariion CX4-120
  • Connectivity: iSCSI – two ports per SP, two ports on the host, two VLANs on a Nortel ERS5520

Problem: if we present two LUNs to the host, put them into a single VG, and create an LV with no striping (the LUNs get concatenated on that volume), everything is good. If we take the same LUNs and put them into a striped LV per various EMC Best Practices documents (since the LUNs themselves are already striped across physical disks, EMC calls this layout “plaids”), read performance suffers badly, dropping to about 4MB/sec, while write performance stays perfect at over 100MB/sec.
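
For reference, the two LV layouts from the problem description can be reproduced with stock LVM commands along these lines (a sketch: the pseudo-device names, VG/LV names, sizes, and the 64KB stripe size are hypothetical placeholders):

```shell
# Both LUNs come in as PowerPath pseudo-devices
pvcreate /dev/emcpowera /dev/emcpowerb
vgcreate plaidvg /dev/emcpowera /dev/emcpowerb

# Variant 1: linear LV -- the LUNs are simply concatenated (performs fine)
lvcreate -L 100G -n linearlv plaidvg

# Variant 2: striped LV across both LUNs, i.e. the "plaid" layout
# (-i 2 = two stripes, -I 64 = 64KB stripe size)
lvcreate -L 100G -i 2 -I 64 -n plaidlv plaidvg
```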