2014-06 try server update

Chatting with Aki the other day, I realized that word of all the wonderful improvements to the try server issue has not been publicized. A lot of folks have done a lot of work to make things better - here’s a brief summary of the good news.

Try server pushes could appear to take up to 4 hours, during which time others would be locked out.
The major time taker has been found and eliminated: ancestor processing. And we understand that the remaining occasional slowdowns are related to caching. Fortunately, there are some steps that developers can take now to minimize delays.

What folks can do to help

The biggest remaining slowdown is caused by rebuilding the cache. The cache is only invalidated if the push is interrupted. If you can avoid causing a disconnect until your push is complete, that helps everyone! So, please, no Ctrl-C during the push! The other changes should address the long wait times you used to see.
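If an accidental Ctrl-C is a real risk for you, one option is to run the push through a tiny wrapper. What follows is only a sketch under stated assumptions, not something we have deployed: the name run_shielded is made up for illustration, it assumes a POSIX system, and it assumes your push needs no interactive prompts (for example, your ssh key is already loaded in an agent), since the child is started in its own session without a controlling terminal. The wrapper starts the command away from the terminal's foreground process group, so a Ctrl-C typed in the terminal only reaches the wrapper, which ignores it and keeps waiting.

```python
import subprocess
import sys


def run_shielded(cmd):
    """Run cmd so that a Ctrl-C typed in the terminal does not reach it.

    Hypothetical helper for illustration only. POSIX only.
    """
    # start_new_session=True puts the child in its own session, so the
    # terminal's SIGINT (Ctrl-C) is delivered to this wrapper, not the child.
    proc = subprocess.Popen(cmd, start_new_session=True)
    while True:
        try:
            # Block until the child finishes and hand back its exit code.
            return proc.wait()
        except KeyboardInterrupt:
            # Ctrl-C interrupted our wait; the child is untouched, so warn
            # and go back to waiting.
            print("command still running; ignoring Ctrl-C", file=sys.stderr)


# Only runs when a command is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_shielded(sys.argv[1:]))
```

Saved as, say, shielded.py (again, a made-up name), you would put it in front of your usual push command: python shielded.py hg push, followed by whatever arguments you normally use. It does nothing about network drops; it only guards against the stray keystroke.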

What has been done to infrastructure

There has long been a belief that many of our hg problems, especially on try, came from the fact that we had r/w NFS mounts of the repositories across multiple machines (both hgssh servers & hgweb servers). For various historical reasons, a large part of this was due to the way pushlog was implemented.

Ben did a lot of work to get sqlite off NFS, and much of the work to synchronize the repositories without NFS has been completed.

What has been done to our hooks

All along, folks have been discussing our try server performance issues with the hg developers. A key confusing issue was that we saw processes “hang” for VERY long times (45 min or more) without making a system call. Kendall managed to observe an hg process in such an infinite-looking-loop-that-eventually-terminated a few times. A stack trace would show it was looking up an hg ancestor without making any system calls or library accesses. In discussions, this confused the hg team, as they did not know of any reason the ancestor code should be invoked during a push.

Thanks to lots of debugging help from glandium one evening, we found and disabled a local hook that invoked the ancestor function on every commit to try. \o/ team work!

Caching – the remaining problem

With the ancestor-invoking-hook disabled, we still saw some longish periods of time where we couldn’t explain why pushes to try appeared hung. Granted, it was a much shorter time, and it always self-corrected, but it was still puzzling.

A number of our old theories, such as “too many heads” were discounted by hg developers as both (a) we didn’t have that many heads, and (b) lots of heads shouldn’t be a significant issue – hg wants to support even more heads than we have on try.

Greg did a wonderful bit of sleuthing to find the impact of ^C during push. Our current belief is that once the caching is fixed upstream, we’ll be in a pretty good spot. (Especially with the inclusion of some performance optimizations that also become possible with the new cache-fixed version.)

What is coming next

To take advantage of all the good stuff upstream hg versions have, including the bug fixes we want, we’re going to be moving towards removing roadblocks to staying closer to the tip. Historically, we had some issues due to HTTP header sizes and load balancers; ancient Python or hg client versions; and similar. The client issues have been addressed, and a proper testing/staging environment is on the horizon.

There are a few competing priorities, so I’m not going to predict a completion date. But I’m positive the future is coming. I hope you have a glimpse into that as well.