This week I encountered a really nasty problem with the Team Foundation Server Build Agents. Before I walk you through the details, let me first explain the situation.
Scenario
At a customer we are building a new build “farm” (a little farm 😉 ). Currently we have 2 build servers, each with 2 agents (A, B, C, D). The customer runs VMware ESX 5.1 and, because we will have more build servers in the future, the build servers are clones of 1 parent image. Important to know is that the image is a clean OS (Windows Server 2012) with the Build Agent installed but not configured. The VMware solution makes a “smart copy” of the parent server. I checked, and the 2 build servers that were created each have a unique MAC address, NetBIOS name and IP address. After cloning the image, the Build Agents are configured on the clones.
Problem 1: Non-responding Build Agent
When starting a new build, the agent begins by getting sources from Source Control. When the workspace (into which the agent gets the sources) is not yet present (deleted, first run, changed workspace) and more than 1 agent is enabled on the build server, the build hangs. It does not get anything, it does not create the workspace, and eventually the build times out. When we disable the second agent and have only 1 agent active, it runs OK. When the workspace already exists, the build also runs OK.
Workaround: We disable 1 agent and run the build, then disable the other agent and run the build. Not ideal…
Problem 2: The CRC in GZip footer does not match the CRC calculated from the decompressed data.
This one is scarier. When downloading large files or large chunks (this happens when running a definition for the first time, when it gets all sources), we regularly get the message: The CRC in GZip footer does not match the CRC calculated from the decompressed data. What this actually means is that the data retrieved from Source Control is corrupt.
Workaround: Log in on the build server, do a Get Latest with Visual Studio, and rerun the build. Not ideal.
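To make the CRC message a bit more concrete, here is a minimal sketch of the check behind it. It uses Python's standard `gzip` module rather than the .NET code the TFS client runs, so treat it as an illustration of the mechanism only. The payload is made up, and the single flipped bit stands in for whatever the network did to the real download.

```python
import gzip
import struct
import zlib

# Made-up payload standing in for a file fetched from Source Control.
payload = b"source file contents " * 1000
blob = gzip.compress(payload)

# A gzip stream ends with an 8-byte footer: the CRC-32 of the
# uncompressed data, then its length modulo 2**32, both little-endian.
stored_crc, stored_size = struct.unpack("<II", blob[-8:])
footer_matches = (stored_crc == zlib.crc32(payload)
                  and stored_size == len(payload))

# Simulate corruption in transit by flipping one bit of the stored CRC.
damaged = bytearray(blob)
damaged[-8] ^= 0x01

try:
    gzip.decompress(bytes(damaged))
    crc_error_detected = False
except gzip.BadGzipFile as exc:
    # The decompressor recomputes the CRC and refuses the data on mismatch.
    crc_error_detected = True
    print("decompression failed:", exc)
```

The receiver recomputes the checksum over what it decompressed and compares it with the footer. A mismatch means the bytes that arrived are not the bytes that were sent.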
Solution
As you can read, we had some workarounds, but they were not great. I had some contact with my network colleagues and with Microsoft, and they all stated it had something to do with the network. But how? It was all running on virtual machines on VMware.
At first I suspected that it had something to do with the cloning. Maybe a build server was not unique, etc. But after some searches on the internet I found some useful posts, at first posts talking about related issues on Hyper-V. Then I thought: what if we search for network issues in combination with VMware? I found this KB article from VMware. It stated that there were known issues with the default e1000e network card in combination with Windows Server 2012. Searching some more, I found this and this article, which explained the problem even more from a VMware perspective.
The solution was actually very simple. Changing the network cards from e1000e to VMXNET3 solved the issues! Build agents are responsive again and the CRC error has vanished!
Hope this helps!
UPDATE 10-03-2015:
I hit this problem again. However, the error was different. When getting the latest version of about 10 GB of sources, I received:
The underlying connection was closed: An unexpected error occurred
AND
“’.’, hexadecimal value 0x00, is an invalid character. Line 1, position 99902.”
The solution above also worked in these cases.
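If you have several cloned build servers, it can be handy to check which ones still use the e1000e adapter. Below is a rough sketch that scans the text of a VM's .vmx file for adapter types. It assumes the adapter type is stored in `ethernetN.virtualDev` entries, which is how I understand the .vmx layout. The sample content and the helper names are hypothetical, so verify against your own files. It only reports; it does not change anything.

```python
import re

# Hypothetical .vmx excerpt; a real file contains many more settings.
SAMPLE_VMX = '''\
displayName = "build-01"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000e"
ethernet1.present = "TRUE"
ethernet1.virtualDev = "vmxnet3"
'''

# Assumption: adapter types appear as ethernetN.virtualDev = "<type>".
ADAPTER_LINE = re.compile(
    r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"',
    re.IGNORECASE | re.MULTILINE,
)


def adapter_types(vmx_text):
    """Return {adapter name: device type} for each virtualDev entry."""
    return {name: dev.lower() for name, dev in ADAPTER_LINE.findall(vmx_text)}


def adapters_to_change(vmx_text, bad_type="e1000e"):
    """List the adapters that still use the problematic device type."""
    types = adapter_types(vmx_text)
    return sorted(name for name, dev in types.items() if dev == bad_type)


print(adapters_to_change(SAMPLE_VMX))
```

For a real VM you would read the .vmx file from the datastore and pass its text to `adapters_to_change`.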
Thank you for this solution. It worked for our environment perfectly!
This solution worked for me as well. We were on TFS 2013 running a VM with Windows 2008 and using the e1000e network adapter and never had a problem. When we created a new VM with Windows 2012 and upgraded to TFS 2015 this is when all of our problems started. I’ve been troubleshooting this for the past week and even had Microsoft online for a different TFS error. As soon as I changed the network adapter to VMXNET3 our dev team and build servers were finally able to get the source reliably.
Thanks so much for sharing.