Last week I encountered a strange networking problem: I could ssh into a server and work fine, but when I tried to copy something with scp it was very, very slow. It seemed that large packets were being dropped.
First, some background: a few months ago we set up a new application with a somewhat complex architecture, deployed on a two-host VMware cluster (several Linux web application servers published through a Linux load balancer). For the past few weeks the application owners had been complaining that copying files over SSH/SCP across the WAN from one of the servers was extremely slow.
At first I thought the cause was an MTU issue due to a restrictive firewall configuration, or some problem with the VM or its custom gateway, because between servers in the same VLAN everything was OK. Then the users found out that other servers had the same problem, and that all of them were on the same host. So we started taking packet captures to see what was going on.
So it seemed that the gateway was receiving packets larger than 1500 bytes with the DF (Don’t Fragment) bit set, and the app server ignored the ICMP messages and kept on sending large non-fragmentable packets! At first I didn’t understand what could be going on. We tried uninstalling VMware Tools and it magically started working. We installed them back and it stopped working. During the Linux boot process I noticed a message about activating TCP Segmentation Offload (TSO). I disabled it using ethtool -K eth0 tso off, and BINGO! Copying over SSH started to work!
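To make the symptom easier to reason about, here is a minimal sketch of the decision a router makes for each packet. It is a simplified model, not the gateway's actual code: the Packet class, the forward function and the packet sizes used below are all hypothetical names and values picked for illustration. It shows why small interactive SSH packets got through while the oversized scp packets did not.

```python
from dataclasses import dataclass

ETHERNET_MTU = 1500  # bytes, the usual Ethernet MTU


@dataclass
class Packet:
    size: int            # total IP packet length in bytes (hypothetical values)
    dont_fragment: bool  # state of the DF bit in the IP header


def forward(packet: Packet, next_hop_mtu: int = ETHERNET_MTU) -> str:
    """Simplified model of what a router does with one IPv4 packet."""
    if packet.size <= next_hop_mtu:
        return "forward"
    if packet.dont_fragment:
        # Too big and not allowed to fragment: the packet is dropped and the
        # sender is told via ICMP Destination Unreachable, code 4.
        return "drop + ICMP type 3 code 4 (fragmentation needed)"
    return "fragment"


if __name__ == "__main__":
    # A small interactive packet versus an oversized one with DF set.
    for pkt in (Packet(100, True), Packet(2948, True)):
        print(pkt, "->", forward(pkt))
```

A well-behaved sender reacts to that ICMP message by sending smaller packets. In our case the app server kept sending the same oversized ones, so they kept being dropped.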
So it seems that the VMware enhanced driver optimizes intra-host communication by passing the whole TCP segment to the next virtual machine, but ignores the possibility that this VM is a router and that the packet is marked as non-fragmentable! We are taking this to VMware support; I don’t think it should behave like this.
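For comparison, this is roughly what segmentation offload is expected to produce before anything hits a 1500-byte link. It is only a sketch of the general idea, not VMware's driver logic: the segment_sizes function is a hypothetical helper, and the 20-byte IP and TCP header sizes assume no header options.

```python
def segment_sizes(payload_len: int, mtu: int = 1500,
                  ip_header: int = 20, tcp_header: int = 20) -> list[int]:
    """Split a large TCP payload into packets that each fit within the MTU.

    Returns the total IP packet size of every resulting segment.
    Header sizes assume no IP or TCP options (an assumption of this sketch).
    """
    mss = mtu - ip_header - tcp_header  # largest payload per segment
    sizes = []
    remaining = payload_len
    while remaining > 0:
        chunk = min(mss, remaining)
        sizes.append(chunk + ip_header + tcp_header)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    # A 4000-byte payload handed down by the TCP stack in one piece.
    print(segment_sizes(4000))
```

With TSO the operating system hands one large segment to the (virtual) NIC and leaves this splitting to it. If the large segment is instead delivered unsplit to another VM that has to route it onwards, that VM ends up holding a packet bigger than the MTU of its outgoing link.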
Update: VMware support responded with this: http://kb.vmware.com/kb/1010939
So it’s a known issue.



