Discussion:
Cygwin multithreading performance
Mark Geisert
2015-12-09 05:31:20 UTC
(Maybe cygwin-developers is a better list for this? It's pretty obscure.)
Yes, cygwin-developers is fine since it's gory implementation details.
Here are some mutex lock stats I've been talking about providing. These are
from the OP's original testcase 'git repack -a -f' running over a clone of
the newlib-cygwin source tree. Run on a 2-core, 4-HT machine under Windows
7 x64. I'm running a slightly modified cygwin1.dll that has 3 one-line mods
to thread.cc.
Which I'd like to see a patch of, just to know what you mean.
I'm considering adding the tools that produced these displays to the
cygutils package. I'm unsure if the cygwin1.dll mods I've made locally
should be shipped generally; I don't know how much extra CPU they use, if
any.
Well, let's have a look. This is open source after all :)
Here's my patchlet against Cygwin 2.3.1-1...

/oss/src/winsup/cygwin diff -u thread.cc.safe thread.cc
--- thread.cc.safe 2015-11-14 03:41:15.000000000 -0800
+++ thread.cc 2015-12-04 03:49:03.463794000 -0800
@@ -1752,6 +1752,7 @@
 int
 pthread_mutex::lock ()
 {
+  int tstamp = strace.microseconds ();
   pthread_t self = ::pthread_self ();
   int result = 0;
 
@@ -1772,8 +1773,8 @@
       result = EDEADLK;
     }
 
-  pthread_printf ("mutex %p, self %p, owner %p, lock_counter %d, recursion_counter %u",
-                  this, self, owner, lock_counter, recursion_counter);
+  pthread_printf ("mutex %p, self %p, owner %p, lock_counter %d, recursion_counter %u, tstamp %d, caller %p",
+                  this, self, owner, lock_counter, recursion_counter, tstamp, caller_return_address ());
   return result;
 }
 
@@ -1801,8 +1802,8 @@
       res = 0;
     }
 
-  pthread_printf ("mutex %p, owner %p, self %p, lock_counter %d, recursion_counter %u, type %d, res %d",
-                  this, owner, self, lock_counter, recursion_counter, type, res);
+  pthread_printf ("mutex %p, owner %p, self %p, lock_counter %d, recursion_counter %u, type %d, res %d, caller %p",
+                  this, owner, self, lock_counter, recursion_counter, type, res, caller_return_address ());
   return res;
 }

The pthread_printf() call modifications don't add any CPU unless there's a
relevant strace in progress, so they should be acceptable. The other mod, the
strace.microseconds() call added at the start of the function, is the one I
was a little concerned about. That call needn't be made unless an strace is
in progress, but I don't know how expensive it is. It appears to be a
QueryPerformanceCounter() call underneath. If that's cheap, the mod is OK as-is.
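
If it's not cheap, the call could be made conditional on tracing, along these
lines (a one-line sketch; I'm assuming the strace class's active () predicate
is the right "is an strace in progress" test):

  int tstamp = strace.active () ? strace.microseconds () : 0;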

..mark
Mark Geisert
2015-12-09 06:10:24 UTC
caller 0x018014CC77, count 1, L, /oss/src/winsup/cygwin/thread.cc:475
caller 0x018014CD00, count 1, U, /oss/src/winsup/cygwin/thread.cc:496
caller 0x018014CDAF, count 432, L, /oss/src/winsup/cygwin/thread.cc:971
caller 0x018014CDE6, count 432, U, /oss/src/winsup/cygwin/thread.cc:982
caller 0x018014D07E, count 1, L, /oss/src/winsup/cygwin/thread.cc:1946
caller 0x018014D090, count 1, U, /oss/src/winsup/cygwin/thread.cc:1951
caller 0x018014D7E6, count 1, L, /oss/src/winsup/cygwin/thread.cc:525
caller 0x018014D7FF, count 1, U, /oss/src/winsup/cygwin/thread.cc:533
caller 0x018014EDD7, count 1, U, /oss/src/winsup/cygwin/thread.cc:2400
caller 0x018014EE97, count 1, L, /oss/src/winsup/cygwin/thread.cc:2389
This is interesting. I'm not sure if anything in the rest of the
output shows how much is wasted on the above two calls, though.
thread.cc:971 and thread.cc:982 are pthread_setcancelstate, and it's
called pretty often as part of stdio functions. Every stdio function
which has to lock the FILE structure also calls pthread_setcancelstate
to disable and reenable cancellation before and after locking. That's
almost any stdio function.
This may be one of the problems which lower performance, but there's no
easy or quick way around that, AFAICS.
There's also the fact that, even for tools using __fsetlocking to disable
stdio locking, pthread_setcancelstate will still be called unconditionally.
The question here is whether that's wrong and pthread_setcancelstate should
be skipped if the application sets FSETLOCKING_BYCALLER.
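
To make the pattern concrete, here's roughly what each locking stdio call
expands to (a simplified sketch; the macro names in newlib differ, but the
shape is this):

#include <pthread.h>
#include <stdio.h>

/* The two pthread_setcancelstate calls are the source of the
   thread.cc:971/982 lock/unlock pairs in the stats above, since
   each call currently takes and releases the thread's own mutex. */
int
stdio_op_sketch (FILE *fp)
{
  int c, oldstate;
  pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, &oldstate);
  flockfile (fp);             /* skipped under FSETLOCKING_BYCALLER */
  c = getc_unlocked (fp);     /* the actual work */
  funlockfile (fp);
  pthread_setcancelstate (oldstate, NULL);
  return c;
}

A tool that does its own locking can call __fsetlocking (fp,
FSETLOCKING_BYCALLER) (from <stdio_ext.h>) to skip the
flockfile/funlockfile pair; the open question above is whether the two
pthread_setcancelstate calls should then be skipped as well.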
For a start, I simply removed the mutex lock/unlock in calls to
pthread_setcancelstate and pthread_setcanceltype. These locks are
completely unnecessary. These functions are only called for the current
thread anyway.
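
In code terms the change amounts to this (a simplified sketch of
pthread::setcancelstate; the real code in thread.cc validates a bit more
and does the same for setcanceltype):

int
pthread::setcancelstate (int state, int *oldstate)
{
  if (state != PTHREAD_CANCEL_ENABLE && state != PTHREAD_CANCEL_DISABLE)
    return EINVAL;
  /* mutex.lock () was here -- unnecessary, because cancelstate is only
     ever read and written by the thread the object belongs to.  */
  if (oldstate)
    *oldstate = cancelstate;
  cancelstate = state;
  /* mutex.unlock () was here.  */
  return 0;
}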
I'm just creating a developer snapshot which I'll upload to
https://cygwin.com/snapshots/ in half an hour at the latest. Please
have a look if your testcase behaves better now.
I'll give the snapshot a try, thanks, but it's hard for me to see :) how
eliminating a few hundred locks and unlocks that are 75% under 1 usec anyway
will help much. I may have unduly influenced anybody following this by only
showing detailed stats for the last two locks that I picked at random.

What I'm more focused on are the locks being heavily used by git and heavily
contended by the four git threads in OP's testcase. Here are the caller stats
for this set of locks...

lock latency buckets (usec): <=1 <=10 <=100 <=1000 >1000
mtx 4908/06002A00F0 lks 155066, ulks 155066, 66359 4395 78895 4742 675
caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
caller 0x010045557F, count 15, U, /usr/src/git/builtin/pack-objects.c:1760
caller 0x010045588D, count 15, L, /usr/src/git/builtin/pack-objects.c:1944
caller 0x01004558AC, count 15, U, /usr/src/git/builtin/pack-objects.c:1947
caller 0x01004598AC, count 15, L, /usr/src/git/builtin/pack-objects.c:2039
caller 0x01004599B9, count 11, U, /usr/src/git/builtin/pack-objects.c:2078
caller 0x0100459A7B, count 4, U, /usr/src/git/builtin/pack-objects.c:2078
caller 0x018014DB4F, count 244, U, /oss/src/winsup/cygwin/thread.cc:1254
caller 0x018014DB93, count 15, L, /oss/src/winsup/cygwin/thread.cc:1286
caller 0x018014DBD3, count 229, L, /oss/src/winsup/cygwin/thread.cc:1286

mtx 4908/0600FA6860 lks 154745, ulks 154745, 56092 3217 64883 25435 5118
caller 0x0100455659, count 1011, L, /usr/src/git/builtin/pack-objects.c:1640
caller 0x0100455687, count 1011, U, /usr/src/git/builtin/pack-objects.c:1642
caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630

mtx 4908/0603498600 lks 316825, ulks 316825, 146254 4986 155345 9686 554
caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
caller 0x01004552C4, count 15148, U, /usr/src/git/builtin/pack-objects.c:1705
caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837

If you look at the ">1000" column, you see that these 3 locks account for
675+5118+554 = 6347 waits of at least 1000 usec each, i.e. at least 6.347 sec
spent waiting for these locks, just for this one latency bucket. The next
column to the left (a faster bucket, for latencies from 101 usec up to 1000
usec) has 4742+25435+9686 = 39863 waits; at a best case of 101 usec each,
that gives another 4.026 sec spent waiting for these locks.
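
Spelling the arithmetic out (a throwaway computation; the bucket floors are
simply the column labels):

#include <stdio.h>

int
main (void)
{
  /* >1000us and 101..1000us bucket counts for the three mutexes above */
  long over1000[] = { 675, 5118, 554 };
  long upto1000[] = { 4742, 25435, 9686 };
  long long us = 0;
  for (int i = 0; i < 3; i++)
    us += over1000[i] * 1000LL   /* each such wait took >= 1000 us */
        + upto1000[i] * 101LL;   /* each such wait took >= 101 us */
  printf ("lower bound on wait time: %.3f sec\n", us / 1e6);
  /* prints: lower bound on wait time: 10.373 sec */
  return 0;
}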

I probably need to modify my stats gathering program to display the total time
spent waiting to make it more clear how things add up for each mutex across the
whole testcase. Again, these stats are not timing how long locks are being
held, but rather how long it takes just to acquire the lock.

It may be, when all is said and done, that this is the price one pays for POSIX
compliant pthread_mutexes. It just kind of bugs me that the testcase works fine
under MinGW using MS CriticalSection ops but we don't seem to have something
similarly light-weight in our pthread_mutex implementation. Or I'm missing it :).
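
For comparison, Git for Windows maps pthread mutexes onto critical sections
in its compat layer, roughly like this (a sketch, not git's exact code; the
uncontended path is a user-mode interlocked operation, with a kernel wait
only under contention):

#include <windows.h>

typedef CRITICAL_SECTION mutex_sketch_t;

static void mutex_init   (mutex_sketch_t *m) { InitializeCriticalSection (m); }
static void mutex_lock   (mutex_sketch_t *m) { EnterCriticalSection (m); }
static void mutex_unlock (mutex_sketch_t *m) { LeaveCriticalSection (m); }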
Thanks for reading,

..mark
Corinna Vinschen
2015-12-09 09:20:05 UTC
Post by Mark Geisert
I'll give the snapshot a try, thanks, but it's hard for me to see :) how
eliminating a few hundred locks and unlocks that are 75% under 1 usec anyway
will help much. I may have unduly influenced anybody following this by only
showing detailed stats for the last two locks that I picked at random.
What I'm more focused on are the locks being heavily used by git and heavily
contended by the four git threads in OP's testcase. Here are the caller
stats for this set of locks...
lock latency buckets (usec): <=1 <=10 <=100 <=1000 >1000
mtx 4908/06002A00F0 lks 155066, ulks 155066, 66359 4395 78895 4742 675
caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
[...]
mtx 4908/0600FA6860 lks 154745, ulks 154745, 56092 3217 64883 25435 5118
caller 0x0100455659, count 1011, L, /usr/src/git/builtin/pack-objects.c:1640
caller 0x0100455687, count 1011, U, /usr/src/git/builtin/pack-objects.c:1642
caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630
mtx 4908/0603498600 lks 316825, ulks 316825, 146254 4986 155345 9686 554
caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
caller 0x01004552C4, count 15148, U, /usr/src/git/builtin/pack-objects.c:1705
caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837
If you look at the ">1000" column, you see that these 3 locks account for
675+5118+554 = 6347 waits of at least 1000 usec each, i.e. at least 6.347 sec
spent waiting for these locks, just for this one latency bucket. The next
column to the left (a faster bucket, for latencies from 101 usec up to 1000
usec) has 4742+25435+9686 = 39863 waits; at a best case of 101 usec each,
that gives another 4.026 sec spent waiting for these locks.
I probably need to modify my stats gathering program to display the total
time spent waiting to make it more clear how things add up for each mutex
across the whole testcase. Again, these stats are not timing how long locks
are being held, but rather how long it takes just to acquire the lock.
But acquiring a lock includes the time it actually has to wait for
another thread holding the lock, so you're *also* measuring the time
locks are held, just by another thread. Locking and unlocking a
pthread_mutex in Cygwin might be slower than using a critical section,
but it's not *that* slow.

The underlying point is, mutexes are not used as an end in themselves.
They guard a shared resource. What we ultimately need to know here is

1. What resource are the pthread_mutexes guarding?

2. Why does accessing the resource take so long that threads have to wait
so long?


Corinna
--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat
Mark Geisert
2015-12-10 08:00:58 UTC
Post by Corinna Vinschen
Post by Mark Geisert
I'll give the snapshot a try, thanks, but it's hard for me to see :) how
eliminating a few hundred locks and unlocks that are 75% under 1 usec anyway
will help much. I may have unduly influenced anybody following this by only
showing detailed stats for the last two locks that I picked at random.
What I'm more focused on are the locks being heavily used by git and heavily
contended by the four git threads in OP's testcase. Here are the caller
stats for this set of locks...
lock latency buckets (usec): <=1 <=10 <=100 <=1000 >1000
mtx 4908/06002A00F0 lks 155066, ulks 155066, 66359 4395 78895 4742 675
caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
[...]
mtx 4908/0600FA6860 lks 154745, ulks 154745, 56092 3217 64883 25435 5118
caller 0x0100455659, count 1011, L, /usr/src/git/builtin/pack-objects.c:1640
caller 0x0100455687, count 1011, U, /usr/src/git/builtin/pack-objects.c:1642
caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630
mtx 4908/0603498600 lks 316825, ulks 316825, 146254 4986 155345 9686 554
caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
caller 0x01004552C4, count 15148, U, /usr/src/git/builtin/pack-objects.c:1705
caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837
If you look at the ">1000" column, you see that these 3 locks account for
675+5118+554 = 6347 waits of at least 1000 usec each, i.e. at least 6.347 sec
spent waiting for these locks, just for this one latency bucket. The next
column to the left (a faster bucket, for latencies from 101 usec up to 1000
usec) has 4742+25435+9686 = 39863 waits; at a best case of 101 usec each,
that gives another 4.026 sec spent waiting for these locks.
I probably need to modify my stats gathering program to display the total
time spent waiting to make it more clear how things add up for each mutex
across the whole testcase. Again, these stats are not timing how long locks
are being held, but rather how long it takes just to acquire the lock.
But acquiring a lock includes the time it actually has to wait for
another thread holding the lock, so you're *also* measuring the time
locks are held, just by another thread. Locking and unlocking a
pthread_mutex in Cygwin might be slower than using a critical section,
but it's not *that* slow.
The underlying point is, mutexes are not used as an end in themselves. They
guard a shared resource. What we ultimately need to know here is
1. What resource are the pthread_mutexes guarding?
2. Why does accessing the resource take so long that threads have to wait
so long?
Silly me; I assumed they were guarding short runs of instructions that would
take the same amount of time on MinGW vs Cygwin, and thus that any differences
must be in the locking implementations. That does seem to be the case for the
3rd lock listed above, a guard around cache accounting that doesn't do any I/O.
But the 1st lock guards display_progress() (how time-consuming could *that*
be?) and the 2nd lock guards read_sha1_file(), which sounds like it involves
I/O, so I need to dig deeper there.
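
For anyone mapping the pack-objects.c line numbers above to resources, the
three locks look roughly like this in git's source of that era (paraphrased,
not an exact quote):

/* builtin/pack-objects.c (threaded delta search), paraphrased */
static pthread_mutex_t read_mutex;      /* guards read_sha1_file() */
#define read_lock()       pthread_mutex_lock (&read_mutex)
#define read_unlock()     pthread_mutex_unlock (&read_mutex)

static pthread_mutex_t cache_mutex;     /* guards delta-cache accounting */
#define cache_lock()      pthread_mutex_lock (&cache_mutex)
#define cache_unlock()    pthread_mutex_unlock (&cache_mutex)

static pthread_mutex_t progress_mutex;  /* guards display_progress() */
#define progress_lock()   pthread_mutex_lock (&progress_mutex)
#define progress_unlock() pthread_mutex_unlock (&progress_mutex)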

..mark
Mark Geisert
2016-01-21 09:05:11 UTC
Post by Corinna Vinschen
Post by Mark Geisert
I probably need to modify my stats gathering program to display the total
time spent waiting to make it more clear how things add up for each mutex
across the whole testcase. Again, these stats are not timing how long locks
are being held, but rather how long it takes just to acquire the lock.
But acquiring a lock includes the time it actually has to wait for
another thread holding the lock, so you're *also* measuring the time
locks are held, just by another thread. Locking and unlocking a
pthread_mutex in Cygwin might be slower than using a critical section,
but it's not *that* slow.
The underlying point is, mutexes are not used as an end in themselves. They
guard a shared resource. What we ultimately need to know here is
1. What resource are the pthread_mutexes guarding?
2. Why does accessing the resource take so long that threads have to wait
so long?
That sanity check was much appreciated. I worked a bit longer on this, then
set it aside to wait for further inspiration while doing other work. No such
inspiration was forthcoming :) so I'll note what I found and close this off.

Bottom line: for the OP's original testcase (git repacking a locally cloned
Cygwin source tree), the 4 parallel threads on my test system spent no more
than 6% of their time waiting for locks, and that includes the time those
locks were held by some other thread. So this line of investigation doesn't
seem to answer the questions posed by the OP (i.e., why don't we see ~100%
CPU utilization during a presumably CPU-bound operation, and why does
utilization seem so much lower with Cygwin git than with MinGW git?). I could
speculate, but I won't.

I'm now looking into modifying the Cygwin profiling code to support the
profiling of multi-threaded applications. I'll post anew when I've got
something to report.
Thanks & Cheers,

..mark
