identifying the cpu a goroutine is running on?


matt

unread,
Jan 28, 2013, 6:00:36 AM1/28/13
to golan...@googlegroups.com
Is there any way of telling which CPU/core a goroutine is actually running on?

Dave Cheney

unread,
Jan 28, 2013, 6:02:07 AM1/28/13
to matt, golan...@googlegroups.com
Possibly by scrobbling around in /proc (on Linux), but that may cause your
thread to move to another CPU.

What is the problem you are trying to solve ?

Dave

On Mon, Jan 28, 2013 at 10:00 PM, matt <matthew....@gmail.com> wrote:
> Is there anyway of telling which cpu/core a goroutine is actually running
> on?
>

matt

unread,
Jan 28, 2013, 6:07:16 AM1/28/13
to golan...@googlegroups.com
So if I generate 100 goroutines on a 4-core machine, I'd like to know how many of them end up on each core. I can readily profile the runtimes, but I can't tell whether they're all on a single core, split across all four, or somewhere in between. When the goroutines finish (and emit a value to an exit channel), I'd like them to also emit the ID of the core they were last running on.

I'm running some simple locking tests and would like the insight that, say, the getAffinity syscall would give; but I don't think that's in the syscall package in Go.

Thanks!

Dave Cheney

unread,
Jan 28, 2013, 6:12:18 AM1/28/13
to matt, golan...@googlegroups.com
Like most things, it depends. On the value of GOMAXPROCS, on how many
blocking syscalls are currently executing, on the other things going
on on your host at that time, on the overhead of whatever profiling
you are using, etc.

On Mon, Jan 28, 2013 at 10:07 PM, matt <matthew....@gmail.com> wrote:
> So if I generate 100 goroutines on a 4-core machine I'd like to know how
> many threads exist on each core. I can readily profile the runtimes, but I
> can't grok whether they're all on a single core or split across all four or
> some variety in between. When the goroutines hang up (and emit a value to an
> exit channel I'd like them to emit the id of the core they were last present
> on).
>
> I'm running some simple locking tests and would like the insight that say
> the syscall getAffinity would show; but I don't think thats in the syscall
> package in go.

Even if syscall getAffinity were implemented, it wouldn't work
correctly; blocking syscalls spawn a new thread to do the blocking.

I think you should be very clear about what you are trying to achieve.
You appear to be trying to benchmark lock contention, is that correct?
>
> Thanks!
>
>
> On Monday, 28 January 2013 11:00:36 UTC, matt wrote:
>>
>> Is there anyway of telling which cpu/core a goroutine is actually running
>> on?
>

matt

unread,
Jan 28, 2013, 6:23:56 AM1/28/13
to golan...@googlegroups.com, matt
Yes, I could have been clearer: I am trying to do some simple benchmarking of lock contention. I appreciate all the assumptions/dependencies etc., but it would be good to know if this is possible at all (without hacking the Go runtime source or making C calls). My first pass through the packages and exposed APIs suggests no.

Thanks

Dave Cheney

unread,
Jan 28, 2013, 6:29:44 AM1/28/13
to matt, golang-nuts

Have you checked out the benchmarks in the runtime and sync packages? You may be able to adapt them to your purpose.

Donovan Hide

unread,
Jan 28, 2013, 6:30:39 AM1/28/13
to matt, golang-nuts
Yes, I could have been clearer, I am trying to do some simple benchmarking of lock contention. I appreciate all the assumptions/dependencies etc but it would be good to know if this is possible at all (without hacking the  go src in runtime or making C calls). My first grok through the packages and exposed APIs suggests no.

If you are using tip, you can profile blocking:


matt

unread,
Jan 28, 2013, 9:30:36 AM1/28/13
to golan...@googlegroups.com, matt
Thanks, but I don't think either the benchmarks or the profiling provide the information I'm after, i.e. which core the goroutine executed on. Contention for a lock will clearly be different if both routines run on the same core, CPU or socket, as opposed to running distributed across cores/CPUs/sockets in a shared-memory system.

It's not a problem, just curious to know if there was a way to observe it.

// Matt

Donovan Hide

unread,
Jan 28, 2013, 11:31:55 AM1/28/13
to matt, golang-nuts
I spotted this, might be of interest:


Also had a go at doing it in a very hacky, unportable low-level way with cpuid directly:


The trick is doing cpuid with 0xB in AX; the current processor's APIC ID is reported in DX.

This could probably be made portable. I've only tested on a 4-core (8 threads with hyper-threading) Intel Sandy Bridge, not on AMD. Obviously it would work much better if it were integrated into the Block Profile.
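
Roughly, the hack is a tiny two-file package like the one below (a sketch only: amd64-specific, the names are made up, and the reported APIC ID is only a hint, since it need not be a dense 0..NumCPU-1 index and the kernel may migrate the thread at any moment):

// cpu.go
package cpu

// CurrentCPU reports the x2APIC ID of the logical processor the calling
// thread happened to be on when CPUID executed. Implemented in cpu_amd64.s.
func CurrentCPU() uint32

// cpu_amd64.s
#include "textflag.h"

// func CurrentCPU() uint32
// CPUID leaf 0x0B (extended topology enumeration) returns the initial
// x2APIC ID of the executing logical processor in EDX.
TEXT ·CurrentCPU(SB), NOSPLIT, $0-4
	MOVL	$0x0B, AX
	XORL	CX, CX
	CPUID
	MOVL	DX, ret+0(FP)
	RET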



bryanturley

unread,
Jan 28, 2013, 1:39:54 PM1/28/13
to golan...@googlegroups.com, matt
On Monday, January 28, 2013 8:30:36 AM UTC-6, matt wrote:
Thanks, but I don't think either the benchmarks or the profiling provide the information I'm after i.e. which core did the go routine execute on. Contention for a lock will clearly be different if both routines run in the same core, cpu or same socket as opposed to running distributed across cores/cpu/sockets in a shared memory system.

It's not a problem, just curious to know if there was a way to observe it.

// Matt


Isn't this what the upcoming race detector is for?

If not:
perf list --help
+ your CPU's manual for the specific performance monitoring counters.
There are counters for cache contention, which will be caused by lock contention.

Haven't tried either of these methods myself though.
I don't think there is currently a way to force a goroutine to stay on a predetermined CPU.
The scheduler is likely to be updated pretty frequently in both the near and long term, which should help well-written programs.

Using the cpuid instruction might not be the best way, since the kernel can and will switch your thread to another CPU just after you get the cpuid data back, or just before you ask for it.
Just because your function starts on one CPU doesn't mean it will end on the same CPU (http://en.wikipedia.org/wiki/Preemptive_multitasking#Preemptive_multitasking).

You would need to set the goroutine's CPU affinity or use HTM (http://en.wikipedia.org/wiki/Transactional_memory#Hardware) to have Donovan's method work 100% of the time.
That said, working ~99% of the time might be enough to find your problem.

Donovan Hide

unread,
Jan 28, 2013, 2:44:43 PM1/28/13
to bryanturley, golang-nuts, matt
Isn't this what the up coming race detector is for?

I think it is more to find unsafe access to values by multiple goroutines.
 
If not
perf list --help
+ your cpu's manual for the specific performance monitoring counter.
There are counters for cache contention which will be caused by lock contention.

There seems to be a specific option to just monitor lock contention:


perf is very cool!
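
A typical session looks something like this (./locktest is a placeholder for your test binary; note that perf lock traces kernel-level lock events, so it needs a kernel and perf build with lock event tracing enabled, and it won't see Go's user-space fast paths directly):

$ perf lock record ./locktest    # run the workload while recording lock events
$ perf lock report               # summarize acquired/contended counts and wait times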

Dmitry Vyukov

unread,
Jan 28, 2013, 10:58:02 PM1/28/13
to golan...@googlegroups.com, matt
On Monday, January 28, 2013 6:30:36 PM UTC+4, matt wrote:
Thanks, but I don't think either the benchmarks or the profiling provide the information I'm after i.e. which core did the go routine execute on. Contention for a lock will clearly be different if both routines run in the same core, cpu or same socket as opposed to running distributed across cores/cpu/sockets in a shared memory system.

It's not a problem, just curious to know if there was a way to observe it.



If you have GOMAXPROCS=N, at least N cores and at least N runnable goroutines, the runtime will distribute the goroutines across threads, and the OS will distribute the threads across cores.
So you just need to do a series of tests like:
$ GOMAXPROCS=1 taskset 1 ./test
$ GOMAXPROCS=2 taskset 3 ./test
$ GOMAXPROCS=2 taskset 17 ./test
$ GOMAXPROCS=4 taskset 15 ./test
$ GOMAXPROCS=4 taskset 85 ./test
etc.

matt

unread,
Jan 29, 2013, 9:53:57 AM1/29/13
to golan...@googlegroups.com, matt
Thanks for the suggestions, getcpuid and the instrumented mutex are the closest to what I needed.

matt

unread,
Jan 29, 2013, 9:54:33 AM1/29/13
to golan...@googlegroups.com, matt
It's a shame not all architectures allow user-level access to something like cpuid.

bryanturley

unread,
Jan 29, 2013, 4:26:23 PM1/29/13
to golan...@googlegroups.com, matt
On Tuesday, January 29, 2013 8:54:33 AM UTC-6, matt wrote:
It's a shame not all architectures allow user level access to getting cpuid.


Meh... cpuid style instructions are abused quite frequently.
 

matt

unread,
Jan 29, 2013, 4:34:34 PM1/29/13
to golan...@googlegroups.com, matt
In response to the code you listed on GitHub: you probably want to use a slice (or array) for the counters inside the instrumented mutex. The map introduces a 10x hit on lock/unlock of the mutex. Additionally, you probably only need to update the counter in the Lock method.
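
Roughly what I have in mind (a sketch only; currentCPU() stands in for your CPUID-based helper and is stubbed out here, so the names are illustrative, not an existing API):

package lockstat

import (
	"runtime"
	"sync"
)

// currentCPU is a placeholder: swap in the CPUID-based helper discussed above.
func currentCPU() uint32 { return 0 }

// CountingMutex counts Lock acquisitions per CPU in a fixed slice
// instead of a map, and only updates the counter in Lock.
type CountingMutex struct {
	sync.Mutex
	counts []uint64 // acquisitions observed per CPU
}

func NewCountingMutex() *CountingMutex {
	return &CountingMutex{counts: make([]uint64, runtime.NumCPU())}
}

func (m *CountingMutex) Lock() {
	m.Mutex.Lock()
	// Safe to increment without atomics: we hold the mutex.
	if id := int(currentCPU()); id < len(m.counts) {
		m.counts[id]++
	}
}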

In case you plan to keep using it.

On Monday, 28 January 2013 16:31:55 UTC, Donovan wrote:

Donovan Hide

unread,
Jan 29, 2013, 5:01:36 PM1/29/13
to matt, golang-nuts
In response to the code you listed on github, you probably want to use a slice (or array) for the counter inside the instrumented mutex. The map introduces a 10x hit on lock/unlock of the mutex. Additionally you probably only need to update the counter on the Lock method.

I have no use for it; perf does all I need, I just like playing with assembly. I think there might well have been a data race on the Unlock map access as well...

What are you testing? Sounds like you might be having fun abusing locks :-)

 

j...@tsp.io

unread,
Apr 26, 2015, 8:01:17 PM4/26/15
to golan...@googlegroups.com
Hi all!

Sorry to revive this fairly old thread, but I have come up with a use-case for this feature that I believe is fairly compelling.

Consider the case of RWMutexes when data is being read frequently, but written infrequently. The standard RWMutex implementation will add a fair amount of overhead if locks are only held for short periods of time, because every lock acquisition requires inter-core communication (for the shared reader-count increment).

A better technique to use here would be to have an RWMutex *per core*, where goroutines only call RLock() on *their core's lock*, and instead require writers to call Lock() on the locks of *every* core. This way, many readers on different cores can operate concurrently without lots of cross-core cache line bouncing, at the cost of marginally slowing down writers. This technique will also work even if a goroutine is moved to a different core, because it is still taking out an RLock(). It'll incur some unnecessary cross-core traffic, but only at the moment when it is moved.

Unfortunately, this is not possible in Go today without the ability for a goroutine to determine which CPU core it is running on. Exposing it, even just to sync and RWMutex, could yield significant performance improvements in fast multi-core applications. I don't know if adding a runtime.WhichCoreAmIOn() function would even be possible to provide without adding significant complexity, but if it is, maybe it'd be worth investigating further?
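
Roughly, the shape I have in mind is something like this (a sketch only; whichCore stands in for the hypothetical runtime.WhichCoreAmIOn() and is stubbed to 0 here, which is always a correct, if slow, fallback):

package drwmutex

import (
	"runtime"
	"sync"
)

// whichCore stands in for the hypothetical runtime.WhichCoreAmIOn();
// returning 0 is always correct, it just forfeits the scalability win.
func whichCore() int { return 0 }

// paddedRWMutex keeps each lock on its own cache line so one core's
// RLock traffic doesn't invalidate the line holding another core's lock.
type paddedRWMutex struct {
	sync.RWMutex
	_ [64]byte
}

type DRWMutex []paddedRWMutex

func New() DRWMutex {
	return make(DRWMutex, runtime.NumCPU())
}

// Reader returns the lock for the core the caller appears to run on.
// The caller must RUnlock the same lock it RLocked, even if the
// goroutine has since moved to another core.
func (m DRWMutex) Reader() *sync.RWMutex {
	return &m[whichCore()%len(m)].RWMutex
}

// Lock acquires every per-core lock, so a writer excludes all readers.
func (m DRWMutex) Lock() {
	for i := range m {
		m[i].Lock()
	}
}

func (m DRWMutex) Unlock() {
	for i := range m {
		m[i].Unlock()
	}
}

A reader does l := mu.Reader(); l.RLock(); ...; l.RUnlock(), so correctness never depends on the core guess being right.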

Thoughts?
Jon

Ian Lance Taylor

unread,
Apr 26, 2015, 9:17:28 PM4/26/15
to Jon Gjengset, golang-nuts
On Sun, Apr 26, 2015 at 5:01 PM, <j...@tsp.io> wrote:
>
> Sorry to revive this fairly old thread, but I have come up with a use-case
> for this feature that I believe is fairly compelling.
>
> Consider the case of RWMutexes when data is being read frequently, but
> written infrequently. The standard RWMutex implementation will add a fair
> amount of overhead if the locks are kept for short periods of time, because
> every lock acquisition requires inter-core communication (for the global
> variable increment).
>
> A better technique to use here would be to have an RWMutex *per core*, where
> goroutines only call RLock() on *their core's lock*, and instead require
> writers to call Lock() on the locks of *every* core. This way, many readers
> on different cores can operate concurrently without lots of cross-core cache
> line bouncing, at the cost of marginally slowing down writers. This
> technique will also work even if a goroutine is moved to a different core,
> because it is still taking out an RLock(). It'll incur some unnecessary
> cross-core traffic, but only at the moment when it is moved.
>
> Unfortunately, this is not possible in Go today without the ability for a
> goroutine to determine which CPU core it is running on. Exposing it, even
> just to sync and RWMutex, could yield significant performance improvements
> in fast multi-core applications. I don't know if adding a
> runtime.WhichCoreAmIOn() function would even be possible to provide without
> adding significant complexity, but if it is, maybe it'd be worth
> investigating further?

Providing a function that returns which core a goroutine is running on
does not help to implement your more efficient lock, because a
goroutine may move to a different core between the Lock and Unlock
calls. I have no particular objection to your idea, but
WhichCoreAmIOn does not seem like the right primitive.

Given a limited number of worker goroutines, you can implement what
you want more awkwardly by using a []*sync.RWMutex, assigning one to
each goroutine, having that goroutine use that for read locks, and for
write locks acquiring all the mutexes.

Ian

Jon Gjengset

unread,
Apr 26, 2015, 10:11:53 PM4/26/15
to Ian Lance Taylor, golang-nuts
> > ... have an RWMutex *per core*, where goroutines only call RLock()
> > on *their core's lock*, and instead require writers to call Lock()
> > on the locks of *every* core.

> Providing a function that returns which core a goroutine is running on
> does not help to implement your more efficient lock, because a
> goroutine may move to a different core between the Lock and Unlock
> calls. I have no particular objection to your idea, but WhichCoreAmIOn
> does not seem like the right primitive.

While it is true that a goroutine *may* be moved to a different core,
the scheduler presumably still attempts to avoid moving a goroutine
unnecessarily to not completely trash the L123 caches? As long as this
is the case, this optimization would help (and not violate correctness).
I agree it might not be the right primitive, but I haven't been able to
come up with a better alternative.

Also, even if this is not the case with Go's current scheduler, it seems
likely that this will be added sooner rather than later. It is becoming
increasingly expensive to move computation between cores as the number
of CPU cores increases, since cores are necessarily further apart,
making cross-core communication more expensive. The cache hierarchy
between (especially some) cores also makes moving computations more
expensive. For example, in a NUMA system, you wouldn't want goroutines
to bounce all over the place, as this would kill performance.

> Given a limited number of worker goroutines, you can implement what
> you want more awkwardly by using a []*sync.RWMutex, assigning one to
> each goroutine, having that goroutine use that for read locks, and for
> write locks acquiring all the mutexes.

I don't see how that would work?
The writer would have to lock the list to ensure that new readers aren't
added while it is doing work on shared data, and readers would have to
RLock (and thus increment) that "meta"-lock, which would have the same
problem as the original.

Jon

Josh Bleecher Snyder

unread,
Apr 26, 2015, 10:20:58 PM4/26/15
to Jon Gjengset, golang-nuts
> Consider the case of RWMutexes when data is being read frequently, but
> written infrequently.

You might find atomic.Value useful: http://golang.org/pkg/sync/atomic/#Value

-josh

Jon Gjengset

unread,
Apr 26, 2015, 10:25:57 PM4/26/15
to Josh Bleecher Snyder, golang-nuts
I'm not sure how atomic.Value would work as a replacement for a lock
except by doing busy-waiting with runtime.Gosched()? The lock should
*block* if someone else is writing.

Jon

Josh Bleecher Snyder

unread,
Apr 26, 2015, 10:31:08 PM4/26/15
to Jon Gjengset, golang-nuts
You can use a separate sync.Mutex to coordinate writes (which are
posited to be rare), and do copy-on-write.
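
Roughly like this (a sketch, with a map standing in for whatever structure you actually share):

package store

import (
	"sync"
	"sync/atomic"
)

type Store struct {
	mu  sync.Mutex   // serializes the rare writers
	val atomic.Value // always holds a map[string]string
}

func New() *Store {
	s := &Store{}
	s.val.Store(map[string]string{})
	return s
}

// Get is lock-free: readers just Load the current snapshot.
func (s *Store) Get(k string) (string, bool) {
	m := s.val.Load().(map[string]string)
	v, ok := m[k]
	return v, ok
}

// Set copies the current map, applies the change, and publishes the copy.
func (s *Store) Set(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.val.Load().(map[string]string)
	next := make(map[string]string, len(old)+1)
	for kk, vv := range old {
		next[kk] = vv
	}
	next[k] = v
	s.val.Store(next) // readers see the old or the new map, never a partial one
}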

It's not right for all uses, thus the "might" in "you might find it to
be useful". It has the advantage of being available for use right now
and scaling across cores very well.

-josh

Jon Gjengset

unread,
Apr 26, 2015, 10:48:41 PM4/26/15
to golang-nuts
Josh Bleecher Snyder wrote:
> >> > Consider the case of RWMutexes when data is being read
> >> > frequently, but written infrequently.
> >>
> >> You might find atomic.Value useful: http://golang.org/pkg/sync/atomic/#Value
> >
> > I'm not sure how atomic.Value would work as a replacement for a lock
> > except by doing busy-waiting with runtime.Gosched()? The lock should
> > *block* if someone else is writing.
>
> You can use a separate sync.Mutex to coordinate writes (which are
> posited to be rare), and do copy-on-write.

Ah, of course. Yes, you're right, I just didn't think that one step
further. For applications that update a single value, or where copies
can be made reasonably cheap, this would work fine. It does potentially
require significant code changes though.

> It's not right for all uses, thus the "might" in "you might find it to
> be useful". It has the advantage of being available for use right now
> and scaling across cores very well.

Unfortunately, in my particular case, this approach wouldn't work. I
have a number of goroutines that are all allowed to change the
underlying data structure concurrently, but then one particular
goroutine that needs an exclusive lock. The concurrent modifiers all
take out RLocks, but the "special" writer needs a "real" WLock where
all readers are prevented from accessing the critical section.

Jon

Jon Gjengset

unread,
Apr 26, 2015, 10:51:25 PM4/26/15
to golang-nuts
Wouldn't Load()s from atomic.Value also come at the additional cost of
interface{} conversion, including (potentially) an extra memory
allocation per read?

Josh Bleecher Snyder

unread,
Apr 26, 2015, 11:38:52 PM4/26/15
to golang-nuts
Writes may alloc, but reads shouldn't. Yes, there's an interface
conversion, but they're pretty cheap, and they'll be cheaper yet in Go
1.5.

-josh

Ian Lance Taylor

unread,
Apr 27, 2015, 6:10:35 PM4/27/15
to Jon Gjengset, golang-nuts
On Sun, Apr 26, 2015 at 7:11 PM, Jon Gjengset <j...@thesquareplanet.com> wrote:
>> > ... have an RWMutex *per core*, where goroutines only call RLock()
>> > on *their core's lock*, and instead require writers to call Lock()
>> > on the locks of *every* core.
>
>> Providing a function that returns which core a goroutine is running on
>> does not help to implement your more efficient lock, because a
>> goroutine may move to a different core between the Lock and Unlock
>> calls. I have no particular objection to your idea, but WhichCoreAmIOn
>> does not seem like the right primitive.
>
> While it is true that a goroutine *may* be moved to a different core,
> the scheduler presumably still attempts to avoid moving a goroutine
> unnecessarily to not completely trash the L123 caches? As long as this
> is the case, this optimization would help (and not violate correctness).
> I agree it might not be the right primitive, but I haven't been able to
> come up with a better alternative.

I don't understand how your suggestion works if a goroutine moves
between cores. On core 1 the goroutine locks the lock attached to
that core. Then it moves to core 2. Then it unlocks the lock
attached to that core, but that's the wrong lock.

I guess you would suggest that the goroutine should call a function to
get a pointer to the lock, where the pointer is based on the current
CPU, and should then use that pointer to acquire and release the lock.
I suppose that would work.

You could do that today on GNU/Linux using
syscall.Syscall(syscall.SYS_GETCPU, ...)
Might be interesting to try.
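
Something along these lines (a sketch; it uses the raw syscall number in case syscall.SYS_GETCPU isn't defined for your target -- 309 is getcpu(2) on linux/amd64 -- and errors are ignored):

package getcpu

import (
	"syscall"
	"unsafe"
)

// sysGetCPU is getcpu(2) on linux/amd64; use syscall.SYS_GETCPU where it exists.
const sysGetCPU = 309

// getCPU returns the CPU the calling thread was on when the syscall ran;
// the scheduler may of course have moved it since.
func getCPU() int {
	var cpu, node uint32
	syscall.Syscall(sysGetCPU,
		uintptr(unsafe.Pointer(&cpu)),
		uintptr(unsafe.Pointer(&node)),
		0) // errors ignored in this sketch
	return int(cpu)
}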


> Also, even if this is not the case with Go's current scheduler, it seems
> likely that this will be added sooner rather than later. It is becoming
> increasingly expensive to move computation between cores as the number
> of CPU cores increases, since cores are necessarily further apart,
> making cross-core communication more expensive. The cache hierarchy
> between (especially some) cores also makes moving computations more
> expensive. For example, in a NUMA system, you wouldn't want goroutines
> to bounce all over the place, as this would kill performance.

All true, but, because goroutines can block for arbitrary lengths of
time, we are always going to want to reserve the right to move them
between cores when necessary.


>> Given a limited number of worker goroutines, you can implement what
>> you want more awkwardly by using a []*sync.RWMutex, assigning one to
>> each goroutine, having that goroutine use that for read locks, and for
>> write locks acquiring all the mutexes.
>
> I don't see how that would work?
> The writer would have to lock the list to ensure that new readers aren't
> added while it is doing work on shared data, and readers would have to
> RLock (and thus increment) that "meta"-lock, which would have the same
> problem as the original.

True, I'm assuming not only a limited number of worker goroutines, but
a fixed number.

Ian

Jon Gjengset

unread,
Apr 27, 2015, 7:36:38 PM4/27/15
to golang-nuts
> > While it is true that a goroutine *may* be moved to a different core,
> > the scheduler presumably still attempts to avoid moving a goroutine
> > unnecessarily to not completely trash the L123 caches? As long as this
> > is the case, this optimization would help (and not violate correctness).
> > I agree it might not be the right primitive, but I haven't been able to
> > come up with a better alternative.
>
> I don't understand how your suggestion works if a goroutine moves
> between cores. On core 1 the goroutine locks the lock attached to
> that core. Then it moves to core 2. Then it unlocks the lock
> attached to that core, but that's the wrong lock.
>
> I guess you would suggest that the goroutine should call a function to
> get a pointer to the lock, where the pointer is based on the current
> CPU, and should then use that pointer to acquire and release the lock.
> I suppose that would work.

I was thinking of the goroutine just keeping track of which lock it
acquired (i.e. its index in the array of locks), but using pointers for
it works too.

> You could do that today on GNU/Linux using
> syscall.Syscall(syscall.SYS_GETCPU, ...)
> Might be interesting to try.

I don't see a syscall.SYS_GETCPU in the syscall package?

I'm also worried that doing a syscall on every lock acquisition might
introduce quite a lot of overhead in and of itself (potentially more so
than the atomic add performed in normal RWMutex), and so it might not be
a good proxy for the gains one might see with a "proper" solution.
Presumably the runtime knows what CPU it's running on already, and
wouldn't need to ask the kernel?

FWIW, after poking around a bit, it seems like this approach to RW
locking is far from new. Linux calls them "big reader locks"
(https://lwn.net/Articles/378911/). Benchmarks have shown that this kind
of locking displays much better scaling properties than a shared RW lock
(http://joeduffyblog.com/2009/02/20/a-more-scalable-readerwriter-lock-and-a-bit-less-harsh-consideration-of-the-idea/),
though it does also add overhead for writers.

There's some interesting newer work in this field too:
https://www.usenix.org/system/files/conference/atc14/atc14-paper-liu.pdf
though that is beyond the scope of the change I want to make.

> > Also, even if this is not the case with Go's current scheduler, it
> > seems likely that this will be added sooner rather than later.
>
> All true, but, because goroutines can block for arbitrary lengths of
> time, we are always going to want to reserve the right to move them
> between cores when necessary.

Oh, sure, that's true. I was more referring to the fact that the
scheduler will (hopefully?) not move goroutines unless some core is
being starved for work.

> >> Given a limited number of worker goroutines, you can implement what
> >> you want more awkwardly by using a []*sync.RWMutex
> >
> > I don't see how that would work?
>
> True, I'm assuming not only a limited number of worker goroutines, but
> a fixed number.

Ah, I see. I suppose the application might spawn GOMAXPROCS goroutines
on startup, each with its own lock, and reap some of the benefits above.
I suspect it might be hard to program that way though, as you now lose
some of the neat benefits of lightweight threads (e.g. you can't spawn
off a bunch of readers of the lock).

Ian Lance Taylor

unread,
Apr 27, 2015, 8:06:16 PM4/27/15
to golang-nuts
On Mon, Apr 27, 2015 at 4:36 PM, Jon Gjengset <j...@thesquareplanet.com> wrote:
>
>> You could do that today on GNU/Linux using
>> syscall.Syscall(syscall.SYS_GETCPU, ...)
>> Might be interesting to try.
>
> I don't see a syscall.SYS_GETCPU in the syscall package?

That's interesting, it's there for every GNU/Linux target except
amd64. That's an odd oversight.


> I'm also worried that doing a syscall on every lock acquisition might
> introduce quite a lot of overhead in and of itself (potentially more so
> than the atomic add performed in normal RWMutex), and so it might not be
> a good proxy for the gains one might see with a "proper" solution.
> Presumably the runtime knows what CPU it's running on already, and
> wouldn't need to ask the kernel?

The Go runtime does not in fact know which CPU a goroutine is running
on. Why would it? The runtime knows which thread a goroutine is
running on, but that is not, of course, the same thing. The runtime
leaves it to the kernel to keep threads preferentially on a single
CPU.

Ian

Ingo Oeser

unread,
Apr 28, 2015, 4:19:45 AM4/28/15
to golan...@googlegroups.com, j...@thesquareplanet.com
On Tuesday, April 28, 2015 at 1:36:38 AM UTC+2, Jon Gjengset wrote:
FWIW, after poking around a bit, it seems like this approach to RW
locking is far from new. Linux calls then "big reader locks"
(https://lwn.net/Articles/378911/). 

Linux has mostly replaced the "big reader locks" with a better algorithm for this use case, called "RCU", due to scalability issues
(see http://en.wikipedia.org/wiki/Read-copy-update for more details).

This algorithm might also be implemented in Go, since you have the building blocks for it.
The necessary quiescent period, which all goroutines must pass through, could currently be achieved with
a sufficient number of runtime.GC() calls, and I hope the new garbage collector will either retain those semantics or add a runtime function for this use case.

More info on that specific locking topic regarding the Linux Kernel are at http://lxr.free-electrons.com/source/Documentation/locking/lglock.txt#L148


Ingo

Jon Gjengset

unread,
Apr 28, 2015, 11:03:04 AM4/28/15
to golan...@googlegroups.com
Ingo Oeser wrote:
> On Tuesday, April 28, 2015 at 1:36:38 AM UTC+2, Jon Gjengset wrote:
> >
> > FWIW, after poking around a bit, it seems like this approach to RW
> > locking is far from new. Linux calls then "big reader locks"
> > (https://lwn.net/Articles/378911/).
> >
>
> Linux has mostly replaced the "big reader locks" with a better
> algorithm for this use case called "RCU" due to scale-ability issues

RCU is great if the data you are replacing can be copied cheaply, but
not so much if you are changing a small part of a large dataset
(especially if it contains pointers internally). See also the discussion
with Josh elsewhere in this thread.

RCU also doesn't work for cases where you need a "dual-mode" lock
more-so than a RWlock. For example, in my particular case, the "readers"
all modify the underlying data, but they do so in a way that is safe in
the face of concurrent operations. The "writers" perform operations that
require the underlying data not to change while they are operating,
meaning they really do need an exclusive lock.

> This algorithm might also be implemented using Go, since you have the
> building blocks for it.

Sure, but I don't think sync.RWMutex could be re-implemented solely
using RCU, in particular because it's not right for all use-cases for RW
locks.

Jon

Joubin Houshyar

unread,
Apr 28, 2015, 11:13:20 AM4/28/15
to golan...@googlegroups.com
RCU seems to require a non-preemptive scheduler. The original Go runtime .. 

(Aside to gophers: are there plans for having flags for switching to a pure CSP scheduler? RCU is a use-case.)

Dmitry Vyukov

unread,
Apr 28, 2015, 11:16:07 AM4/28/15
to Joubin Houshyar, golang-nuts
What is a CSP scheduler? And why does RCU require a non-preemptive
scheduler (the Linux kernel is obviously preemptive)?

Dmitry Vyukov

unread,
Apr 28, 2015, 11:20:10 AM4/28/15
to matt, golang-nuts
I implemented a distributed RW mutex for Go back in 2011:
https://groups.google.com/d/msg/golang-nuts/q59mZRC16KE/WeNJhVvZxRAJ
Nobody was interested enough to push it into the std lib.

Jon Gjengset

unread,
Apr 28, 2015, 11:30:47 AM4/28/15
to golang-nuts
'Dmitry Vyukov' via golang-nuts wrote:
> I've implemented a distributed RW mutex for Go back in 2011:
> https://groups.google.com/d/msg/golang-nuts/q59mZRC16KE/WeNJhVvZxRAJ
> Nobody was interested enough to push it into std lib.

Ah, that's interesting.

My guess is that the fact that your solution requires more integration
into the Go runtime than mine does makes it a less likely candidate for
adoption (at least in the short term); I just need to know which CPU I'm
currently on, whereas you need thread-local storage and guarantees about
what happens if a goroutine is moved to another thread.

Admittedly, your solution might also have applications beyond RWMutex;
knowing what CPU a goroutine is currently on seems like a feature with
very limited applicability.

Jon

Joubin Houshyar

unread,
Apr 28, 2015, 12:02:00 PM4/28/15
to Dmitry Vyukov, golang-nuts
In the original Go runtime, goroutines would run to completion or IO boundaries (e.g. channels). 

Note “seems” in my OP )) That said: “Examples of nonpreemptive software environments include the Linux kernel when built with CONFIG_PREEMPT=n …”
RCU: https://queue.acm.org/detail.cfm?id=2488549  // acm diagrams are rather ungainly but +interesting article

Dmitry Vyukov

unread,
Apr 28, 2015, 12:20:12 PM4/28/15
to Joubin Houshyar, golang-nuts
On Tue, Apr 28, 2015 at 7:01 PM, Joubin Houshyar <jhou...@gmail.com> wrote:
> In the original Go runtime, goroutines would run to completion or IO
> boundaries (e.g. channels).
> CSP: http://www.usingcsp.com/cspbook.pdf


No, they could always have been preempted for GC at certain points. And
since several threads can run in parallel, these points were not even
deterministic.

> Note “seems” in my OP )) That said: “Examples of nonpreemptive software
> environments include the Linux kernel when built with CONFIG_PREEMPT=n …”
> RCU: https://queue.acm.org/detail.cfm?id=2488549 // acm diagrams are rather
> ungainly but +interesting article

I know what RCU is. It works fine with preemptive schedulers.

Joubin Houshyar

unread,
Apr 28, 2015, 12:27:29 PM4/28/15
to Dmitry Vyukov, golang-nuts

> On Apr 28, 2015, at 12:19 PM, Dmitry Vyukov <dvy...@google.com> wrote:
>
> On Tue, Apr 28, 2015 at 7:01 PM, Joubin Houshyar <jhou...@gmail.com> wrote:
>> In the original Go runtime, goroutines would run to completion or IO
>> boundaries (e.g. channels).
>> CSP: http://www.usingcsp.com/cspbook.pdf
>
>
> No, they always could have been preempted for GC at some points. And
> since several thread can run in parallel these points were non even
> deterministic.

On GC completion, the pre-empted goroutine would not get resumed?

>
>> Note “seems” in my OP )) That said: “Examples of nonpreemptive software
>> environments include the Linux kernel when built with CONFIG_PREEMPT=n …”
>> RCU: https://queue.acm.org/detail.cfm?id=2488549 // acm diagrams are rather
>> ungainly but +interesting article
>
> I know what it RCU. It works fine with preemptive schedulers.

Not questioning your knowledge, Dmitry. Ref was to the source of my info.

Dmitry Vyukov

unread,
Apr 28, 2015, 12:40:48 PM4/28/15
to Joubin Houshyar, golang-nuts
On Tue, Apr 28, 2015 at 7:27 PM, Joubin Houshyar <jhou...@gmail.com> wrote:
>
>> On Apr 28, 2015, at 12:19 PM, Dmitry Vyukov <dvy...@google.com> wrote:
>>
>> On Tue, Apr 28, 2015 at 7:01 PM, Joubin Houshyar <jhou...@gmail.com> wrote:
>>> In the original Go runtime, goroutines would run to completion or IO
>>> boundaries (e.g. channels).
>>> CSP: http://www.usingcsp.com/cspbook.pdf
>>
>>
>> No, they always could have been preempted for GC at some points. And
>> since several thread can run in parallel these points were non even
>> deterministic.
>
> On GC completion, the pre-empted goroutine would not get resumed?

Most likely it will be replaced by another goroutine from the runqueue.
This effectively gives preemptive scheduling (at least the worst parts,
without the good parts).

Jon Gjengset

unread,
Apr 30, 2015, 7:04:25 PM4/30/15
to golang-nuts
Ian Lance Taylor wrote:
> On Mon, Apr 27, 2015 at 4:36 PM, Jon Gjengset <j...@thesquareplanet.com> wrote:
> >
> >> You could do that today on GNU/Linux using
> >> syscall.Syscall(syscall.SYS_GETCPU, ...)
> >> Might be interesting to try.
> >
> > I don't see a syscall.SYS_GETCPU in the syscall package?
>
> That's interesting, it's there for every GNU/Linux target except
> amd64. That's an odd oversight.

I was right that the syscall approach would slow down the lock too much,
but I managed to get something working using the CPUID assembly
instruction instead. The performance improvements are quite dramatic:

https://gist.github.com/jonhoo/05774c1e47dbe4d57169

Jon

Tamás Gulácsi

unread,
May 1, 2015, 4:20:04 AM5/1/15
to golan...@googlegroups.com
Sorry, but I didn't get it: what's the matter with []*sync.RWMutex, each goroutine using one? This way correctness is a given (no need to know which CPU the goroutine is running on ATM); performance I don't know.

Jon Gjengset

unread,
May 1, 2015, 11:13:12 AM5/1/15
to golan...@googlegroups.com
This would require you to know *in advance* how many goroutines you are
going to have. If you have n goroutines and k cores, you will also need
n > k locks, which means both increased time for writers and additional
storage overhead.

Tamás Gulácsi

unread,
May 1, 2015, 11:27:09 AM5/1/15
to golan...@googlegroups.com
Why would that be so bad? You won't have more than 2*GOMAXPROCS goroutines doing computation, will you?
But you can always limit the number of locks, with a small increase in contention on some of the locks.

As I understand it, having multiple locks is meant to slice contention, each additional lock slices less, and having more locks than CPUs is waste.

My suggestion is to try this simple and correct solution before trying hard to circumvent the runtime.

Jon Gjengset

unread,
May 1, 2015, 11:39:39 AM5/1/15
to golan...@googlegroups.com
> Why would that be so bad? You won't have more than 2*GOMAXPROCS
> goroutines doing computation, do you?

If you use goroutines in the "Go way" where you spawn off a goroutine to
handle each request, you might have thousands of goroutines, but only a
handful of cores. This is also beneficial in that it allows overlapping
computation and IO, which having a single (or very few) goroutines per
core would not.

What you are proposing is keeping a worker pool, which introduces
additional complexity to request handling, and additional communication
through some shared queue that keeps incoming requests. While this is
possible to implement, it is fairly uncommon to see in Go where
lightweight threads often obviate the need for such pools.

> But you can always limit the number of locks, with a small increase of
> contention on some of the locks.

That is true, though then you're back to the original problem of a
lock's memory location bouncing back and forth between the cores that
have readers of that lock. My guess is that you would still see a
significant slowdown once goroutines on different cores (or even
different NUMA nodes) share the same lock.

> As I understand, having multiple locks is to slice contention, and
> each additional lock slices less, and more than CPU locks is waste.

Well, it's not quite to slice contention, because an RWMutex is in a
sense uncontended if you only have readers. The problem is the way
RWMutex is implemented: since RLock still does a memory write, the CPUs
are contending for write access to a shared memory location, causing
them to constantly invalidate each other's caches.
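
You can see this directly with a tiny benchmark (a sketch; the names are illustrative -- drop it into a _test.go file and run go test -bench RLockContention -cpu 1,2,4,8, and watch ns/op climb instead of staying flat):

package rwbench

import (
	"sync"
	"testing"
)

var (
	mu     sync.RWMutex
	shared int
)

// Read-only load: no writers at all, yet every RLock/RUnlock does an
// atomic write to the mutex's reader count, so the cache line holding
// mu bounces between cores as -cpu grows.
func BenchmarkRLockContention(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			_ = shared
			mu.RUnlock()
		}
	})
}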

> My suggestion is to try this simple and correct solution, before try
> hard to circumvent the runtime.

The solution I proposed is *always* correct, because the reader will
unlock the same lock as they locked. Even if cpu() returns the wrong
value, correctness is guaranteed. A wrong cpu() value will only impact
performance, as it will cause a CPU to invalidate another CPU's cache.

I'm not sure I'd call this circumventing the runtime. In a sense, this
has nothing to do with the runtime at all. While it would be nice if the
runtime provided a runtime.GetActiveCPU() function, this is not
necessary for a better RWMutex to be implemented; all it means is that
the RWMutex would have to implement the function to get the current CPU
itself.

It is also worth pointing out that an entirely safe fallback
implementation of runtime.GetActiveCPU() is "return 0", because any
implementation that relies on it *must* be prepared for the case where a
goroutine has been moved between the function call and when it is
executing subsequent instructions. This will reduce performance, but not
impact correctness.

Jon

Tamás Gulácsi

unread,
May 1, 2015, 4:37:11 PM5/1/15
to golan...@googlegroups.com
Thanks for your patience, and detailed answer, Jon!

Then instead of a CurrentCPU() call, a goroutine "identifier" (a number from a sequence, given at start time, maybe hidden in the per-goroutine lock implementation) would do, taken modulo the number of underlying RWLocks, which would be equal to the number of cores?

Jon Gjengset

unread,
May 1, 2015, 4:52:50 PM5/1/15
to golan...@googlegroups.com
> Thanks for your patience, and detailed answer, Jon!

No worries at all. It is good to get all these thoughts out in writing
so that people can fully understand the change.

> Then instead of a CurrentCPU() call, a goroutine "identifier" (a
> number from a sequence, given at start time, maybe hidden in the
> per-goroutine lock implementation) would do modulo the number of
> underlying RWLocks, which would be equal with the new number of cores?

This would still have the issue that goroutines running on different
cores access the same lock, which is what causes the performance
problems at a large number of cores.

Tamás Gulácsi

unread,
May 2, 2015, 12:57:23 AM5/2/15
to golan...@googlegroups.com
So we only need the cpuid to select which lock we acquire; for releasing it, we stick with that same lock.

Now (I think) I understand why your proposal is correct even if the CurrentCPU function returns bogus data (say, checking the CPU only on every 10th call).

Thanks for the enlightenment!

Jesper Louis Andersen

unread,
May 3, 2015, 9:49:42 AM5/3/15
to Tamás Gulácsi, golang-nuts
Note that the RWMutex primitive you mention here, the "Big Reader" RWMutex, also sees quite some use in the Erlang VM at the lowest level (under the name "reader optimized" RWLock). It's implemented in the same way: one cache line per scheduler/core.

I agree this is a nice primitive, and I also agree it is safe even in the case of goroutines moving around on cores, as long as the ID of the core is remembered and the corresponding lock is released.

Safety aside, there are two differences though:

* You have no access to the RWLock from the language, by design.
* When the lock is held, the process won't be preempted, but data access can of course race against other cores.

One example where this lock is used is if you declare a shared ETS table between processes, and tell the VM it will be read far more than written[0]:

ets:new(foo, [named_table, public, set, {read_concurrency, true}]).

[0] Implementation is a hash table variant which can grow gradually in chunks.





--
J.