From uhler at mips.com Tue Feb 1 10:10:50 2005
From: uhler at mips.com (Michael Uhler)
Date: Tue, 1 Feb 2005 02:10:50 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050131205639.GK30888@nevyn.them.org>
Message-ID: <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM>

The one area that I'm concerned about is the use of rdhwr to return the pointer. There are several reasons why I'm not sure this is the right thing to do:

- rdhwr is a MIPS32/64 Release 2 instruction. No existing MIPS I-IV implementation has this instruction and probably never will. Even existing MIPS32/64 Release 2 implementations don't have the internal register to hold the data. This means that it will be years before any hardware will support the feature, and that support depends on an architecture decision (see next item).

- We have some concerns at the architecture level about using rdhwr for this purpose rather than using a GPR under the umbrella of an ABI re-work that some of you are involved in.

- Some preliminary work at MIPS suggests that a tuned handler for syscall is faster than one for handling rdhwr at the reserved instruction handler. This means that we're betting on having actual hardware implementations of rdhwr out there in sufficient volume to make up for the fact that we're penalizing everybody else by using an RI trap handler vs. a syscall trap handler.

To me, all of these suggest that we may want to use syscall rather than rdhwr to get the pointer, at least until we can decide whether to dedicate a GPR for this purpose.

By the way, sorry for the late response to the original posting. There was some confusion within MIPS as to who was going to respond. I've asked our UK team to run point on the interaction on TLS from now on, so we'll be more responsive.

/gmu

---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.
Email: uhler at mips.com
1225 Charleston Road      Voice:  (650)567-5025  FAX:   (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870  Admin: (650)567-5085

-----Original Message-----
From: Daniel Jacobowitz [mailto:dan at codesourcery.com]
Sent: Monday, January 31, 2005 12:57 PM
To: mips-tls at codesourcery.com
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI

Hi folks,

We have the 32-bit port of NPTL basically working now, and I have been preparing the first patches for submission - which means binutils. So I have been going over the ABI in some detail looking for things that we want to finalize before the patches are integrated. I found three points...

First, a minor correction: LE was using "ori" where the equivalent LD sequence used "addiu". I've updated this on the Wiki. It was a leftover from an earlier draft.

Second, the %tpoff operator is currently ambiguous. When we say %dtpoff, we are always talking about the offset from the base of this module's DTV entry to the location of the variable; currently this is always used with %hi and %lo. However, %tpoff can be used with %hi and %lo (Local Exec model, refers to variable offset from thread pointer) or without (Initial Exec model, refers to GOT slot holding the variable offset).

How about this instead?
  R_MIPS_TLS_TPOFF -> R_MIPS_TLS_GOTTPOFF
  %tpoff(x)($28)   -> %gottpoff(x)($28)

This also frees up %tpoff in case we want to use that the way PowerPC uses @tprel. foo@tprel@l is the low 16 bits of the offset; foo@tprel is the low 16 bits also, but signals an error if the offset does not fit entirely in 16 bits. The alternate sequence III of Local Exec could take advantage of this.

Third, the Design Choices section of the specification has this to say:

* The compiler is not allowed to schedule the sequences below.

The sequences below must appear exactly as written in the code generated by the compiler.
This restriction is present because we have not yet determined what linker optimizations may be possible. In order to facilitate adding linker optimizations in the future, without recompiling current code, the compiler is restricted from scheduling these sequences.

I'd like to settle this one way or the other before finalizing the spec. For reference, the possible linker optimizations are:

  General Dynamic -> Initial Exec (whenever linking an exec)
  Local Dynamic -> Local Exec (ditto)
  Initial Exec -> Local Exec (when the symbol is known to live in the executable; can be applied starting at GD too)

The major advantage here is replacing a call to __tls_get_addr with a rdhwr instruction. These are, in theory, doable for MIPS. Here's an example o32 GD -> IE, probably the most important one:

  0x00 lw $25, %call16(__tls_get_addr)($28)    R_MIPS_CALL16 g
  0x04 jalr $25
  0x08 addiu $4, $28, %tlsgd(x)                R_MIPS_TLS_GD x
  0x0c $gp restore (not mentioned in the TLS ABI)

  0x00 rdhwr $4, $5
  0x04 lw $5, %tpoff(x1)($28)                  R_MIPS_TLS_TPOFF x1
  0x08 addu $4, $4, $5
  0x0c the $gp restore can be nop'd out

There are a couple of other quirks for this; the only one I can think of offhand is MIPS-I load delay slots, which would mean neither sequence could be used as-is.

The immediate disadvantage is that the compiler can not schedule the sequences. I don't know what all the tradeoffs are here, I just know that the compiler implementation would be simpler if we did not make the sequences fixed and unschedulable.

So I'd like to ditch that unless folks think that
  (A) the linker optimizations are useful
  (B) the linker optimizations are feasible
  (C) someone is likely to implement the linker optimizations

Any opinions? I see that Alpha does implement the TLS linker relaxations; on the other hand, Alpha already had a linker relaxation mechanism in place, and the GNU tools for MIPS don't.
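
To illustrate why a General Dynamic access can in principle be rewritten as an Initial Exec access, here is a minimal C model of the two address computations. It is only a sketch under stated assumptions: the struct layouts, the `model_` names and the fixed-size DTV are hypothetical and are not taken from the ABI draft, and the model assumes the module's TLS block sits at a fixed distance from the thread pointer (which is what makes the rewrite valid for an executable).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these names come from the ABI draft. */
#define MODEL_MAX_MODULES 4

struct model_thread {
    uintptr_t tp;                      /* thread pointer (what rdhwr would return) */
    uintptr_t dtv[MODEL_MAX_MODULES];  /* dtv[m]: base of module m's TLS block */
};

struct model_tls_index {
    size_t module;  /* module ID */
    size_t offset;  /* offset of the variable inside that module's block */
};

/* General Dynamic: a resolver call looks up (module, offset) through the DTV. */
static uintptr_t model_general_dynamic(const struct model_thread *t,
                                       struct model_tls_index ti)
{
    assert(ti.module < MODEL_MAX_MODULES);
    return t->dtv[ti.module] + ti.offset;
}

/* Initial Exec: load a thread-pointer-relative offset from a GOT slot
 * and add it to the thread pointer; no call is needed. */
static uintptr_t model_initial_exec(const struct model_thread *t,
                                    intptr_t got_tpoff)
{
    return t->tp + (uintptr_t)got_tpoff;
}

/* The value that would be placed in the GOT slot after the rewrite.
 * Assumption: the module's block is at a fixed distance from tp. */
static intptr_t model_got_tpoff(const struct model_thread *t,
                                struct model_tls_index ti)
{
    assert(ti.module < MODEL_MAX_MODULES);
    return (intptr_t)(t->dtv[ti.module] + ti.offset - t->tp);
}
```

With a thread whose pointer is 0x10000 and whose module 1 block starts at 0x10040, a variable at offset 8 resolves to the same address through either function; the Initial Exec path simply replaces the call with one load and one add.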
-- Daniel Jacobowitz From dom at mips.com Tue Feb 1 13:28:12 2005 From: dom at mips.com (Dominic Sweetman) Date: Tue, 1 Feb 2005 13:28:12 +0000 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050131205639.GK30888@nevyn.them.org> References: <20050131205639.GK30888@nevyn.them.org> Message-ID: <16895.33772.807508.339803@doms-laptop.algor.co.uk> Daniel, > We have the 32-bit port of NPTL basically working now, and I have been > preparing the first patches for submission - which means binutils. So > I have been going over the ABI in some detail looking for things that > we want to finalize before the patches are integrated. I found three > points... Thanks, ace work. I hope we'll be in a position to pick this stuff up fairly soon and beat on it. > Second, the %tpoff operator is currently ambiguous. When we say > %dtpoff, we are always talking about the offset from the base of this > module's DTV entry to the location of the variable; currently this is > always used with %hi and %lo. However, %tpoff can be used with %hi and > %lo (Local Exec model, refers to variable offset from thread pointer) > %or without (Initial Exec model, refers to GOT slot holding the > variable offset). > > How about this instead? > R_MIPS_TLS_TPOFF -> R_MIPS_TLS_GOTTPOFF > %tpoff(x)($28) -> %gottpoff(x)($28) Sounds better. > Third, the Design Choices section of the specification has this to say: > > * The compiler is not allowed to schedule the sequences below. > > The sequences below must appear exactly as written in the code > generated by the compiler. This restriction is present because we have > not yet determined what linker optimizations may be possible. In order > to facilitate adding linker optimizations in the future, without > recompiling current code, the compiler is restricted from scheduling > these sequences. > > I'd like to settle this one way or the other before finalizing the > spec. Why not sit on the fence? See below. 
> The major advantage here is replacing a call to __tls_get_addr with a > rdhwr instruction. These are, in theory, doable for MIPS. Mike Uhler responded separately about MIPS Technologies' position on using rdhwr for this purpose. For NUBI we plan to reserve a general purpose register for a thread pointer, making the optimization a bigger win. > There are a couple of other quirks for this; the only one I can think > of offhand is MIPS-I load delay slots, which would mean neither > sequence could be used as-is. There are relatively few MIPS-I machines out there. Would it be unacceptable if the standard NPTL system failed to work on them? > The immediate disadvantage is that the compiler can not schedule the > sequences. I don't know what all the tradeoffs are here, I just > know that the compiler implementation would be simpler if we did not > make the sequences fixed and unschedulable. > > So I'd like to ditch that unless folks think that > (A) the linker optimizations are useful > (B) the linker optimizations are feasible > (C) someone is likely to implement the linker optimizations > > Any opinions? I see that Alpha does implement the TLS linker > relaxations; on the other hand, Alpha already had a linker relaxation > mechanism in place, and the GNU tools for MIPS don't. Why not re-write the spec to say "unless you generate this sequence exactly like this, you'll probably prevent any future linker optimization from working" - and then leave it to the compiler toolchain. My opinion is that linker optimizations could be very valuable, given a cheap way of accessing a thread pointer. But MIPS Technologies are very unlikely to do heroic work on the linker for the o32 ABI - but we do intend to at least try that out for our NUBI ABI. 
-- Dominic Sweetman, MIPS Technologies (UK) The Fruit Farm, Ely Road, Chittering, CAMBS CB5 9PH, ENGLAND phone: +44 1223 706205 / fax: +44 1223 706250 / swbrd: +44 1223 706200 http://www.mips.com From dan at codesourcery.com Tue Feb 1 19:43:42 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Tue, 1 Feb 2005 14:43:42 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <16895.33772.807508.339803@doms-laptop.algor.co.uk> References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk> Message-ID: <20050201194340.GV30888@nevyn.them.org> On Tue, Feb 01, 2005 at 01:28:12PM +0000, Dominic Sweetman wrote: > > There are a couple of other quirks for this; the only one I can think > > of offhand is MIPS-I load delay slots, which would mean neither > > sequence could be used as-is. > > There are relatively few MIPS-I machines out there. Would it be > unacceptable if the standard NPTL system failed to work on them? There are still plenty of non-MIPS-I configurations compiled for MIPS-I. I doubt that will change any time soon, so we have to be able to cope with the load delay slots. > > The immediate disadvantage is that the compiler can not schedule the > > sequences. I don't know what all the tradeoffs are here, I just > > know that the compiler implementation would be simpler if we did not > > make the sequences fixed and unschedulable. > > > > So I'd like to ditch that unless folks think that > > (A) the linker optimizations are useful > > (B) the linker optimizations are feasible > > (C) someone is likely to implement the linker optimizations > > > > Any opinions? I see that Alpha does implement the TLS linker > > relaxations; on the other hand, Alpha already had a linker relaxation > > mechanism in place, and the GNU tools for MIPS don't. 
> > Why not re-write the spec to say "unless you generate this sequence > exactly like this, you'll probably prevent any future linker > optimization from working" - and then leave it to the compiler > toolchain. > > My opinion is that linker optimizations could be very valuable, given > a cheap way of accessing a thread pointer. But MIPS Technologies are > very unlikely to do heroic work on the linker for the o32 ABI - but we > do intend to at least try that out for our NUBI ABI. Fine by me; if no one else responds, I'll do this. Note that when I make this change, I'm also going to let the compiler schedule the sequences; whoever implements the linker optimizations can go back and undo that. -- Daniel Jacobowitz From dan at codesourcery.com Tue Feb 1 19:49:19 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Tue, 1 Feb 2005 14:49:19 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> Message-ID: <20050201194918.GW30888@nevyn.them.org> On Tue, Feb 01, 2005 at 02:10:50AM -0800, Michael Uhler wrote: > The one area that I'm concerned about is the use of rdhwr to return the > pointer. There are several reasons why I'm not sure this is the right thing > to do: I'm getting a lot of conflicting feedback about this. > - rdhwr is a MIPS32/64 Release 2 instruction. No existing MIPS I-IV > implementation has this instruction and probably never will. Even existing > MIPS3264 Release 2 implementations don't have the internal register to hold > the data. This means that it will be years before any hardware will support > the feature, and that support depends on an architecture decision (see next > item) Compare this to a syscall. There is no existing implementation which will implement the syscall efficiently, and _never_ will be. 
> - We have some concerns at the architecture level about using rdhwr for this > purpose rather than using a GPR under the umbrella of an ABI re-work that > some of you are involved in. This objection is way too vague for me to respond to. Also, using rdhwr does not prevent future use of a GPR, in an ABI that doesn't exist yet. Nice thing about read-only state; you can keep it in multiple places easily. > - Some preliminary work at MIPS suggests that a tuned handler for syscall is > faster than one for handling rdhwr at the reserved instruction handler. > This means that we're betting on having actual hardware implementations of > rdhwr out there in sufficient volume to make up for the fact that we're > penalizing everybody else by using an RI trap hander vs. a syscall trap > handler. So is this one. Can you be more specific? The only substantial difference in overhead that I am familiar with is the additional register save/restores; note that this is a substantial _advantage_ for userland which would otherwise have to save and restore additional registers. Keeping the save/restores in the kernel is a win for code size and complexity. > To me, all of these suggest that we may want to use syscall rather than > rdhwr to get the pointer, at least until we can decide whether to dedicate a > GPR for this purpose. In any case, I am putting Ralf on the hot seat here. 
I'm going to do whatever he likes, anyway, since it's no good to me if the kernel doesn't support it :-) -- Daniel Jacobowitz From mark at codesourcery.com Tue Feb 1 20:07:45 2005 From: mark at codesourcery.com (Mark Mitchell) Date: Tue, 01 Feb 2005 12:07:45 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050201194918.GW30888@nevyn.them.org> References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> <20050201194918.GW30888@nevyn.them.org> Message-ID: <41FFE191.8090603@codesourcery.com> Daniel Jacobowitz wrote: > On Tue, Feb 01, 2005 at 02:10:50AM -0800, Michael Uhler wrote: > >>The one area that I'm concerned about is the use of rdhwr to return the >>pointer. There are several reasons why I'm not sure this is the right thing >>to do: > > > I'm getting a lot of conflicting feedback about this. From our point of view, we've already got a validated implementation using rdhwr. We'd like to avoid having to rework our code and then revalidate. Realistically, if rdhwr isn't officially blessed, some vendors might still use our implementation. Or, things might just languish. In other words, I'm somewhat afraid that we've missed the technical window to debate this particular technical point. As Dan says, the new MIPS ABI can do better in this regard, as in others. Furthermore, if the kernel adds a syscall that can be used by the o32 ABI, then the tools can be updated to work with that too. I think the only immutable aspect of this existing design is that if/once our implementation escapes into the wild, then kernels forevermore may have to support the rdhwr solution, even if most programs no longer use it. I think that's a relatively small price to pay to get NPTL working on MIPS. 
-- Mark Mitchell CodeSourcery, LLC mark at codesourcery.com (916) 791-8304 From ica2_ts at csv.ica.uni-stuttgart.de Tue Feb 1 21:58:59 2005 From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer) Date: Tue, 1 Feb 2005 22:58:59 +0100 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> Message-ID: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de> Michael Uhler wrote: > The one area that I'm concerned about is the use of rdhwr to return the > pointer. There are several reasons why I'm not sure this is the right thing > to do: > > - rdhwr is a MIPS32/64 Release 2 instruction. No existing MIPS I-IV > implementation has this instruction and probably never will. Even existing > MIPS3264 Release 2 implementations don't have the internal register to hold > the data. This means that it will be years before any hardware will support > the feature, and that support depends on an architecture decision (see next > item) Yes. This means we will have a TLS register which is a bit slower than a regular GPR for MIPS{32,64}R2 and a relatively slow emulated register for older implementations. If we use a pseudo-syscall instead, we'll have only the second variant with less performance potential. > - We have some concerns at the architecture level about using rdhwr for this > purpose rather than using a GPR under the umbrella of an ABI re-work that > some of you are involved in. The ABI re-work surely isn't mutually exclusive to o32 TLS. The main reason for o32 TLS is to get rid of unmaintained linuxthreads while maintaining source-level compatibility. It will also be available soon (the same rationale applies to n32/n64 TLS, of course). The ABI re-work is much more ambitious. 
> - Some preliminary work at MIPS suggests that a tuned handler for syscall is
> faster than one for handling rdhwr at the reserved instruction handler.
> This means that we're betting on having actual hardware implementations of
> rdhwr out there in sufficient volume to make up for the fact that we're
> penalizing everybody else by using an RI trap hander vs. a syscall trap
> handler.

That's surprising, at least for the current Linux implementation. The basic exception handler is the same, and the syscall path is already time-critical and loaded with ABI dispatch etc. Adding another path to it will penalize syscalls further. RI has no critical path yet, so adding the rdhwr emulation should be fast and relatively straightforward. Extracting the instruction from mapped space could get slow if it interferes with TLB handling, but I don't think that's the common case.

> To me, all of these suggest that we may want to use syscall rather than
> rdhwr to get the pointer,

In any case it should be easy to try both and see what works better, once the rest of the Userland implementation is working.

> at least until we can decide whether to dedicate a GPR for this purpose.

Is there some data available on how much pressure has to be expected for the TLS register vs. a normal GPR? I would guess it is significantly lower, since there are several Linux implementations which use emulation successfully. In that case, rdhwr might even be beneficial for the ABI re-work, and free up a GPR which can be used for better things.
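
The rdhwr emulation in a reserved-instruction handler discussed above can be sketched in C as follows. This is an illustration rather than kernel code: it assumes the MIPS32/64 Release 2 encoding of rdhwr (SPECIAL3 major opcode, function field 0x3b, destination GPR in bits 20..16, hardware register number in bits 15..11), and the function names and the `tls_hwreg` parameter are hypothetical. Hardware register 5 appears in the test values only because the example sequence in this thread uses `rdhwr $4, $5`. The sketch starts from an instruction word that has already been fetched, so it leaves out the load from the faulting address that a real handler would need to perform safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Field masks for a 32-bit MIPS instruction word (assumed R2 encoding). */
#define OPCODE_MASK 0xfc000000u
#define FUNC_MASK   0x0000003fu
#define OP_SPECIAL3 0x7c000000u  /* major opcode 0x1f in bits 31..26 */
#define FN_RDHWR    0x0000003bu  /* function field in bits 5..0 */

struct rdhwr_fields {
    unsigned rt;  /* destination general purpose register */
    unsigned rd;  /* hardware register number being read */
};

/* Return true and fill *out if insn is an rdhwr instruction. */
static bool decode_rdhwr(uint32_t insn, struct rdhwr_fields *out)
{
    assert(out != NULL);
    if ((insn & OPCODE_MASK) != OP_SPECIAL3 || (insn & FUNC_MASK) != FN_RDHWR)
        return false;
    out->rt = (insn >> 16) & 0x1f;
    out->rd = (insn >> 11) & 0x1f;
    return true;
}

/* Emulate rdhwr for one hypothetical TLS hardware register.
 * Returns false when the instruction is not handled, in which case a
 * real handler would fall through to its normal RI processing. */
static bool emulate_rdhwr(uint32_t insn, unsigned tls_hwreg,
                          uint32_t thread_pointer, uint32_t gpr[32])
{
    struct rdhwr_fields f;

    if (!decode_rdhwr(insn, &f) || f.rd != tls_hwreg)
        return false;
    if (f.rt != 0)            /* register $0 always reads as zero */
        gpr[f.rt] = thread_pointer;
    return true;
}
```

For example, the word 0x7c04283b decodes as rt = 4, rd = 5, so emulating it with `tls_hwreg` set to 5 stores the thread pointer into `gpr[4]`.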
Thiemo From dan at codesourcery.com Tue Feb 1 22:01:57 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Tue, 1 Feb 2005 17:01:57 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de> References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de> Message-ID: <20050201220156.GX30888@nevyn.them.org> On Tue, Feb 01, 2005 at 10:58:59PM +0100, Thiemo Seufer wrote: > I nany case it should be easy to try both and see what works better, > once the rest of the Userland implementation is working. FYI, the userland implementation is complete. I hope to start by posting the binutils patches this week, followed by GCC bits to queue for 4.1. -- Daniel Jacobowitz From ica2_ts at csv.ica.uni-stuttgart.de Tue Feb 1 22:02:11 2005 From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer) Date: Tue, 1 Feb 2005 23:02:11 +0100 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <16895.33772.807508.339803@doms-laptop.algor.co.uk> References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk> Message-ID: <20050201220211.GN15265@rembrandt.csv.ica.uni-stuttgart.de> Dominic Sweetman wrote: [snip] > > There are a couple of other quirks for this; the only one I can think > > of offhand is MIPS-I load delay slots, which would mean neither > > sequence could be used as-is. > > There are relatively few MIPS-I machines out there. Would it be > unacceptable if the standard NPTL system failed to work on them? But there are some, and a generic userland should IMHO support them. It's a small price to keep o32 working everywhere. 
Thiemo

From dan at codesourcery.com Wed Feb 2 17:59:36 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Wed, 2 Feb 2005 12:59:36 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050201194340.GV30888@nevyn.them.org>
References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk> <20050201194340.GV30888@nevyn.them.org>
Message-ID: <20050202175934.GD30888@nevyn.them.org>

On Tue, Feb 01, 2005 at 02:43:42PM -0500, Daniel Jacobowitz wrote:
> On Tue, Feb 01, 2005 at 01:28:12PM +0000, Dominic Sweetman wrote:
> > Why not re-write the spec to say "unless you generate this sequence
> > exactly like this, you'll probably prevent any future linker
> > optimization from working" - and then leave it to the compiler
> > toolchain.
> >
> > My opinion is that linker optimizations could be very valuable, given
> > a cheap way of accessing a thread pointer. But MIPS Technologies are
> > very unlikely to do heroic work on the linker for the o32 ABI - but we
> > do intend to at least try that out for our NUBI ABI.
>
> Fine by me; if no one else responds, I'll do this.
>
> Note that when I make this change, I'm also going to let the compiler
> schedule the sequences; whoever implements the linker optimizations can
> go back and undo that.

Unfortunately, I've realized that we can't sit on this fence. Here's a somewhat contrived example.

  int __thread a;
  int *bar (void);
  int *foo (int use_tls)
  {
    if (use_tls)
      return &a;
    else
      return bar ();
  }

  foo:
          ...
          .set noreorder
          beqz $4, .Lfunccall
          lw $25, %call16(bar)($28)
          .set reorder
          lw $25, %call16(__tls_get_addr)($28)
          addu $4, $28, %tlsgd(a)
  .Lfunccall:
          jal $25
          ...

i.e. the two calls have been tail merged. A perfectly valid optimization, and one that GCC theoretically could perform, though I do not know offhand if it does.

Note that the valid TLS GD sequence, exactly as defined by the ABI, occurs here.
Yet we can't remove the call instruction without breaking the function. The only way to make this work is either to mandate that an ABI-conforming compiler can not optimize the TLS access sequences, or to define additional relaxation marker relocations to mark valid sequences. Some platforms do one, some do the other. My preference is to do the latter, which we can postpone until someone is ready to implement them. -- Daniel Jacobowitz From uhler at mips.com Thu Feb 3 01:00:03 2005 From: uhler at mips.com (Michael Uhler) Date: Wed, 2 Feb 2005 17:00:03 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de> Message-ID: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM> Rather than respond to each email individually, I'm including one global response to everybody's feedback. >> The one area that I'm concerned about is the use of rdhwr to return >> the pointer. There are several reasons why I'm not sure this is the >> right thing to do: Dan> I'm getting a lot of conflicting feedback about this. Mark> From our point of view, we've already got a validated implementation Mark> using rdhwr. We'd like to avoid having to rework our code and then Mark> revalidate. Mark> Mark> Realistically, if rdhwr isn't officially blessed, some vendors might Mark> still use our implementation. Or, things might just languish. In other Mark> words, I'm somewhat afraid that we've missed the technical window to Mark> debate this particular technical point. Mark> Mark> As Dan says, the new MIPS ABI can do better in this regard, as in Mark> others. Furthermore, if the kernel adds a syscall that can be used by Mark> the o32 ABI, then the tools can be updated to work with that too. 
I Mark> think the only immutable aspect of this existing design is that if/once Mark> our implementation escapes into the wild, then kernels forevermore may Mark> have to support the rdhwr solution, even if most programs no longer use Mark> it. I think that's a relatively small price to pay to get NPTL working Mark> on MIPS. In terms of the ship leaving the dock, is the issue one of specifically rdhwr, or could we use another instruction which also traps as an RI (or something else that isn't a syscall)? I'll talk more about rdhwr below, but it's important for me to understand whether it's the instruction, or the mechanism that makes you believe that the technical window has passed. >> - rdhwr is a MIPS32/64 Release 2 instruction. No existing MIPS I-IV >> implementation has this instruction and probably never will. Even >> existing MIPS3264 Release 2 implementations don't have the internal >> register to hold the data. This means that it will be years before >> any hardware will support the feature, and that support depends on an >> architecture decision (see next >> item) Dan> Compare this to a syscall. There is no existing implementation which will implement the syscall efficiently, and _never_ will be. Thiemo> Yes. This means we will have a TLS register which is a bit slower than a regular GPR for MIPS{32,64}R2 and a relatively slow emulated register for older implementations. If we use a pseudo-syscall instead, we'll have only the second variant with less performance potential. I take your point on syscall vs. something else that traps as an ri. So let me try to explain my concern about the use of rdhwr specifically. Compliance with the MIPS32/MIPS64 architectures (which is what's required for implementations by both MIPS Technologies and MIPS architecture licensees) requires passing a set of tests. These tests check the corner cases of the architecture at each revision. 
We do this to prevent fragmentation of the architecture and make your (you == the community of people writing software for implementations of the architecture) life easier.

In the particular case of rdhwr, we explicitly check that this instruction generates a reserved instruction exception on implementations of Release 1 of the architecture, and that all reserved encodings of rdhwr registers (which is what you're proposing to use) cause a reserved instruction exception on implementations of Release 2. This means that there will never be a real implementation of rdhwr on Release 1 implementations. With the current architecture spec, a Release 2 implementation that provided this register would be non-compliant with the architecture unless we make an architecture change.

Changes to existing architecture can certainly be done, but we don't take them lightly because we need to get comment from those people who thought they had a stable architecture from which to implement. The fact that it's rdhwr makes it somewhat simpler because we would make the TLS register optional, and optional registers would cause a reserved instruction exception anyway. But the point is, the decision to use a particular instruction for the TLS pointer means that the architecture has to change. To do that is going to require some time while we consult with all of the architecture licensees.

Once that happens, somebody would have to actually implement the register on an implementation of Release 2 of the architecture. It will be years (probably at least 2-3) before the first implementation appears with the TLS register implemented via rdhwr, and the total population of those implementations is going to be small. The vast majority of MIPS implementations will continue to trap with a reserved instruction exception, which will fundamentally limit the performance of NPTL on MIPS.

The alternatives seem to be to use a GPR (but this requires an ABI change) or to park the TLS pointer someplace in the address space.
I wondered to Mark at one point whether we could put it at the base of the stack, then down-align sp to access it. We played with this a bit, but couldn't come up with anything that was relatively clean.

So my feedback on the use of rdhwr (or any other instruction that traps) is that as long as this is a short-term solution and/or we understand the performance implications of how often that trap happens, it's OK. Depending on rdhwr to appear in a real implementation any time in the next 2-3 years simply isn't going to happen.

If we do decide to use rdhwr (as opposed to another trapping instruction - see further comments below), we're probably going to have to change whatever RDHWR register number you're using now. You can't just pick one at random as that will conflict with the architecture as we add new registers.

>> - We have some concerns at the architecture level about using rdhwr
>> for this purpose rather than using a GPR under the umbrella of an ABI
>> re-work that some of you are involved in.

Dan> This objection is way too vague for me to respond to. Also, using rdhwr does not prevent future use of a GPR, in an ABI that doesn't exist yet. Nice thing about read-only state; you can keep it in multiple places easily.

Thiemo> The ABI re-work surely isn't mutually exclusive to o32 TLS. The main reason for o32 TLS is to get rid of unmaintained linuxthreads while maintaining source-level compatibility. It will also be available soon (the same rationale applies to n32/n64 TLS, of course). The ABI re-work is much more ambitious.

I personally believe that the ABI change is the long-term solution, but I take your point about the needs for o32. I've talked about my concerns about the use of rdhwr above.

My general concern is about the widespread use of any instruction whose emulation requires reading the instruction from memory (which would be pretty much anything but syscall, which has a dedicated exception vector and passes arguments via register).
We had occasion to have to debug a problem with another operating system and a MIPS core from a different manufacturer. We discovered that this particular implementation did not guarantee that a load done off the EPC value would always hit in the TLB. In fact, it missed and the kernel didn't use a guarded load, so it took a nested exception and crashed. You could say that this is a bug in the implementation, but we started to look more broadly and concluded that it is possible for implementations, particularly those that implement a virtual instruction cache, to wind up at the reserved instruction handler and not have the instruction page mapped. The advantage of syscall is that the argument is in a register, and no instruction read is required to interpret the instruction. One can certainly use another instruction (e.g., rdhwr) whose emulation requires reading the instruction, but the read needs to be guarded. If this is true of the Linux RI handler, we're all set. If not, this needs to be considered in the selection of the instruction that's going to trap. While a TLB miss isn't going to happen very often (maybe never on some processors), the code has to deal with the case to ensure correctness. When thinking about the choice of rdhwr or something else that traps, we should consider this situation. >> - Some preliminary work at MIPS suggests that a tuned handler for >> syscall is faster than one for handling rdhwr at the reserved >> instruction handler. This means that we're betting on having actual >> hardware implementations of rdhwr out there in sufficient volume to >> make up for the fact that we're penalizing everybody else by using an >> RI trap hander vs. a syscall trap handler. Dan> So is this one. Can you be more specific? The only substantial difference in overhead that I am familiar with is the additional register save/restores; note that this is a substantial _advantage_ for userland which would otherwise have to save and restore additional registers. 
Keeping the save/restores in the kernel is a win for code size and complexity.

Thiemo> That's surprising, at least for the current Linux implementation. The basic exception handler is the same, and the syscall path is already time-critical and loaded with ABI dispatch etc. Adding another path to it will penalize syscalls further. RI has no critical path yet, adding the rdhwr emulation should be fast and relatively straightforward. Extracting the instruction from mapped space could get slow if it interferes with TLB handling, but I don't think that's the common case.

In thinking this over, I realized that I was combining performance with the correctness issue that I mentioned above. If one ignores the need to do a guarded load, the performance delta between syscall and ri was very small. I think that Macro did the original testing, and we can show you the code. But I acknowledge that there isn't much difference in the two implementations.

It's been awhile since I looked at the code, but I thought we could hide the additional instructions required to do this with syscall under the current code for almost all implementations of the architecture. That is, knowing that all implementations are pipelined and that certain things create holes in the pipeline, I seem to recall thinking that it would add no more cycles (as opposed to instructions) to the syscall flow. But as I said, it's been awhile.

>> To me, all of these suggest that we may want to use syscall rather
>> than rdhwr to get the pointer, at least until we can decide whether to
>> dedicate a GPR for this purpose.

Dan> In any case, I am putting Ralf on the hot seat here. I'm going to do whatever he likes, anyway, since it's no good to me if the kernel doesn't support it :-)

Thiemo> Is there some data available how much pressure has to be expected for the TLS register vs. normal GPR? I would guess it is significantly lower, since there are several Linux implementations which use emulation successfully.
In that case, rdhwr might even be beneficial for the ABI re-work, and free up a GPR which can be used for better things.

We have started some experiments on register usage and pressure as part of the ABI work. We'll let you know as we get more data.

/gmu

---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road      Voice: (650)567-5025  FAX: (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870  Admin: (650)567-5085

From mark at codesourcery.com Fri Feb 4 05:56:03 2005
From: mark at codesourcery.com (Mark Mitchell)
Date: Thu, 03 Feb 2005 21:56:03 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM>
References: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM>
Message-ID: <42030E73.7050603@codesourcery.com>

Michael Uhler wrote:

> In terms of the ship leaving the dock, is the issue one of specifically
> rdhwr, or could we use another instruction which also traps as an RI (or
> something else that isn't a syscall)? I'll talk more about rdhwr below, but
> it's important for me to understand whether it's the instruction, or the
> mechanism that makes you believe that the technical window has passed.

I think it's the mechanism, but Daniel could answer more definitively. I know that there's a kernel patch out there to interpret the rdhwr instruction -- but from the toolchain point of view I can't see any reason to think that any single instruction couldn't be used just as well.

> The alternatives seem to be to use a GPR (but this requires an ABI change)
> or to park the TLB pointer someplace in the address space. I wondered to
> Mark at one point whether we could put it at the base of the stack, then
> down-align sp to access it. We played with this a bit, but couldn't come up
> with anything that was relatively clean.

I don't remember quite what happened to the idea of putting the value at some known location in memory.
I think that Dan shot this down relatively effectively, but I can't remember on quite what basis. One downside is that it means that all implementations will always be somewhat inefficient; you're going to take a memory access hit, no matter what improvements are made to the architecture.

> So my feedback on the use of rdhwr (or any other instruction that traps) is
> that as long as this is a short-term solution and/or we understand the
> performance implications of how often that trap happens, it's OK. Depending
> on rdhwr to appear in a real implementation any time in the next 2-3 years
> simply isn't going to happen.

I think that matches our expectations. My understanding is that we'll have a new MIPS ABI, probably with a GPR for the thread pointer, by then. So, I think we should view this as a short-term hack to o32.

--
Mark Mitchell
CodeSourcery, LLC
mark at codesourcery.com
(916) 791-8304

From dom at mips.com Fri Feb 4 10:13:44 2005
From: dom at mips.com (Dominic Sweetman)
Date: Fri, 4 Feb 2005 10:13:44 +0000
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050202175934.GD30888@nevyn.them.org>
References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk> <20050201194340.GV30888@nevyn.them.org> <20050202175934.GD30888@nevyn.them.org>
Message-ID: <16899.19160.280516.549900@gargle.gargle.HOWL>

Daniel Jacobowitz (dan at codesourcery.com) writes:

> > > Why not re-write the spec to say "unless you generate this sequence
> > > exactly like this, you'll probably prevent any future linker
> > > optimization from working" - and then leave it to the compiler
> > > toolchain.
> > >
> > > My opinion is that linker optimizations could be very valuable, given
> > > a cheap way of accessing a thread pointer. But MIPS Technologies are
> > > very unlikely to do heroic work on the linker for the o32 ABI - but we
> > > do intend to at least try that out for our NUBI ABI.
> > > > Fine by me; if no one else responds, I'll do this. > > > > Note that when I make this change, I'm also going to let the compiler > > schedule the sequences; whoever implements the linker optimizations can > > go back and undo that. > > Unfortunately, I've realized that we can't sit on this fence. Here's a > somewhat contrived example... A very devilishly cunning example, too. > The only way to make this work is either to mandate that an > ABI-conforming compiler can not optimize the TLS access sequences, or > to define additional relaxation marker relocations to mark valid > sequences. Some platforms do one, some do the other. > > My preference is to do the latter, which we can postpone until someone > is ready to implement them. I agree. -- Dominic Sweetman MIPS Technologies The Fruit Farm, Ely Road, Chittering, CAMBS CB5 9PH, ENGLAND phone +44 1223 706205/fax +44 1223 706250/swbrd +44 1223 706200 http://www.mips.com From dan at codesourcery.com Fri Feb 4 14:12:54 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Fri, 4 Feb 2005 09:12:54 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <16899.19160.280516.549900@gargle.gargle.HOWL> References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk> <20050201194340.GV30888@nevyn.them.org> <20050202175934.GD30888@nevyn.them.org> <16899.19160.280516.549900@gargle.gargle.HOWL> Message-ID: <20050204141250.GA3829@nevyn.them.org> On Fri, Feb 04, 2005 at 10:13:44AM +0000, Dominic Sweetman wrote: > > Daniel Jacobowitz (dan at codesourcery.com) writes: > > Unfortunately, I've realized that we can't sit on this fence. Here's a > > somewhat contrived example... > > A very devilishly cunning example, too. Why thank you! 
:-) > > The only way to make this work is either to mandate that an > > ABI-conforming compiler can not optimize the TLS access sequences, or > > to define additional relaxation marker relocations to mark valid > > sequences. Some platforms do one, some do the other. > > > > My preference is to do the latter, which we can postpone until someone > > is ready to implement them. > > I agree. Great. This gives me room to move forward with the GCC patches, pending the continuing discussion about rdhwr. -- Daniel Jacobowitz From dan at codesourcery.com Mon Feb 7 20:49:02 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Mon, 7 Feb 2005 15:49:02 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM> References: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de> <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM> Message-ID: <20050207204900.GH3829@nevyn.them.org> On Wed, Feb 02, 2005 at 05:00:03PM -0800, Michael Uhler wrote: > In terms of the ship leaving the dock, is the issue one of specifically > rdhwr, or could we use another instruction which also traps as an RI (or > something else that isn't a syscall)? I'll talk more about rdhwr below, but > it's important for me to understand whether it's the instruction, or the > mechanism that makes you believe that the technical window has passed. I don't care what the trapping instruction is; I would prefer not to move away from an RI. As Mark wrote, that code is tested and working (although not quite finalized). > >> - rdhwr is a MIPS32/64 Release 2 instruction. No existing MIPS I-IV > >> implementation has this instruction and probably never will. Even > >> existing MIPS3264 Release 2 implementations don't have the internal > >> register to hold the data. 
This means that it will be years before > >> any hardware will support the feature, and that support depends on an > >> architecture decision (see next > >> item) > > Dan> Compare this to a syscall. There is no existing implementation which > will implement the syscall efficiently, and _never_ will be. > > Thiemo> Yes. This means we will have a TLS register which is a bit slower > than a regular GPR for MIPS{32,64}R2 and a relatively slow emulated register > for older implementations. If we use a pseudo-syscall instead, we'll have > only the second variant with less performance potential. Thiemo, you may already be clear on this point, but I'm going to highlight it for the discussion anyway: the rdhwr solution does not use a real register on MIPS32r2. It will trap on every existing CPU. > I take your point on syscall vs. something else that traps as an ri. So let > me try to explain my concern about the use of rdhwr specifically. > > Compliance with the MIPS32/MIPS64 architectures (which is what's required > for implementations by both MIPS Technologies and MIPS architecture > licensees) requires passing a set of tests. These tests check the corner > cases of the architecture at each revision. We do this to prevent > fragmentation of the architecture and make your (you == the community of > people writing software for implementations of the architecture) life > easier. > > In the particular case of rdhwr, we explicitly check that this instruction > generates a reserved instruction on implementations of Release 1 of the > architecture, and that all reserved encodings of rdhwr registers (which is > what you're proposing to use) cause a reserved instruction exception on > implementations of Release 2. This means that there will never be a real > implementation of rdhwr on Release 1 implementations. With the current > architecture spec, Release 2 implementations will be non-compliant with the > architecture unless we make an architecture change. 
Changes to existing > architecture can certainly be done, but we don't take them lightly because > we need to get comment from those people who thought they had a stable > architecture from which to implement. The fact that it's rdhwr makes it > somewhat simpler because we would make the TLS register optional, and > optional registers would cause a reserved instruction anyway. No, I don't think you are looking at this from the right side. The decision to use a reserved rdhwr encoding for the thread pointer, AT SOME FUTURE TIME, does not mean that Release 2 has any need to change. The RI can be trapped on current processors, and it can be added to a future architecture revision. On the other hand, if the performance benefits are compelling enough, that leaves you room to change the architecture. The phrase "Release 2 implementations will be non-compliant" only applies to "Release 2 implementations with this hypothetical register, of which I expect there to be none". > But the point is, the decision to use a particular instruction for the TLS > pointer means that the architecture has to change. To do that is going to > require some time while we consult with all of the architecture licensees. > Once that happens, somebody would have to actually implement the register on > an implementation of Release 2 of the architecture. It will be years > (probably at least 2-3) before the first implementation appears with the TLS > register implemented via rdhwr, and the total population of those > implementations is going to be small. The vast majority of MIPS > implementations will continue to trap with a reserved instruction, which > will fundamentally limit the performance of NPTL on MIPS. > > The alternatives seem to be to use a GPR (but this requires an ABI change) As many people have pointed out, waiting for the ABI change isn't practical. 
In a sense, that would also fundamentally limit the performance of NPTL on MIPS :-)

> or to park the TLB pointer someplace in the address space. I wondered to
> Mark at one point whether we could put it at the base of the stack, then
> down-align sp to access it. We played with this a bit, but couldn't come up
> with anything that was relatively clean.

You can't do it that way. This is what LinuxThreads used to do and it imposes impossible limits on your stack alignment and sizing.

Want to talk to me more about using a parked TLB entry? I spoke with someone (Ralf or Jun, probably) about the idea originally; I was told it wasn't possible on MIPS SMP implementations to make this work, or that there was some other reason why it was undesirable. If that's not accurate then we could use a reserved memory location. However, that makes the TLS model dependent on details of Linux's memory mapping - not good for a hopefully generally useful ABI.

Note that I haven't been doing thread benchmarking, but the performance overhead from emulating rdhwr has not been significant in casual testing. I'm not weeping for the lost speed either way.

> So my feedback on the use of rdhwr (or any other instruction that traps) is
> that as long as this is a short-term solution and/or we understand the
> performance implications of how often that trap happens, it's OK. Depending
> on rdhwr to appear in a real implementation any time in the next 2-3 years
> simply isn't going to happen.

I understand that.

> If we do decide to use rdhwr (as opposed to another trapping instruction -
> see further comments below), we're probably going to have to change whatever
> RDHWR register number that you're using now. You can't just pick one at
> random as that will conflict with the architecture as we add new registers.

Hint: that's why I asked MIPS for feedback, so that we could get a non-conflicting register number assigned.
The only reason I picked $5 was because it was unassigned in the MIPS32r2 spec and I couldn't find any reference to plans for it. > I've talked about my concerns about the use of rdhwr above. My general > concern is about the widespread use of any instruction whose emulation > requires reading the instruction from memory (which would be pretty much > anything but syscall, which has at dedicated exception vector and passes > arguments via register). We had occasion to have to debug a problem with > another operating system and a MIPS core from a different manufacturer. We > discovered that this particular implementation did not guarantee that a load > done off the EPC value would always hit in the TLB. In fact, it missed and > the kernel didn't use a guarded load, so it took a nested exception and > crashed. > > You could say that this is a bug in the implementation, but we started to > look more broadly and concluded that it is possible for implementations, > particularly those that implement a virtual instruction cache, to wind up at > the reserved instruction handler and not have the instruction page mapped. > > The advantage of syscall is that the argument is in a register, and no > instruction read is required to interpret the instruction. One can > certainly use another instruction (e.g., rdhwr) whose emulation requires > reading the instruction, but the read needs to be guarded. If this is true > of the Linux RI handler, we're all set. If not, this needs to be considered > in the selection of the instruction that's going to trap. While a TLB miss > isn't going to happen very often (maybe never on some processors), the code > has to deal with the case to ensure correctness. When thinking about the > choice of rdhwr or something else that traps, we should consider this > situation. All reads from userspace are always protected in Linux; anything else is a bug, plain and simple. This is a non-issue. 
Your point about the performance overhead of reading and decoding an instruction is worth keeping in mind, but hasn't been a big slowdown. > It's been awhile since I looked at the code, but I thought we could hide the > additional instructions required to do this with syscall under the current > code for almost all implementations of the architecture. That is, knowing > that all implementations are pipelined and that certain things create holes > in the pipeline, I seem to recall thinking that it would add no more cycles > (as opposed to instructions) to the syscall flow. But as I said, it's been > awhile. I've seen pretty strong negative push for adding more complexity to the already extremely complex syscall path. A syscall which didn't trash the same set of registers would be a lot of complexity. In any case, I'll note that the binutils and GCC portions of TLS support are ready for submission to the FSF and I'm moving on to glibc. The GCC bits include the instruction under debate, whatever it turns out to be. I don't want to bog this down in discussion longer than necessary, so I hope we can come to an agreement in the next few days. -- Daniel Jacobowitz From nigel at mips.com Wed Feb 9 13:11:50 2005 From: nigel at mips.com (Nigel Stephens) Date: Wed, 9 Feb 2005 13:11:50 +0000 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI Message-ID: <16906.3094.245677.488728@mips.com> Daniel Jacobowitz wrote: > Note that I haven't been doing thread benchamarking, but the > performance overhead from emulating rdhwr has not been significant > in casual testing. I'm not weeping for the lost speed either way. One reason why MIPS is concerned about using rdhwr is that it may condemn the whole MIPS architecture to poor multi-threading performance relative to other architectures for some number of years. If the architecture gains a poor reputation in this area, then the mud could stick. 
As part of our own experiments Maciej implemented a "fast path" rdhwr emulation, which he promises he will post to this list today. It has a typical emulation time of between 30 and 60 cycles, depending on the CPU, and assuming a fixed destination register for rdhwr (e.g. only rdhwr $2,$5). Not too bad, but not brilliant either if thread pointer access time turns out to be critical to the performance of some threaded applications. But I don't yet understand how important the thread pointer access time is for a typical threaded program. Can we get a better idea of the dynamic frequency of thread register loads? Does anyone know some suitable application programs or benchmarks which exercise the TLS mechanism, and from which we could extract some statistics? With regard to TLS on other architectures, people might like to read this email conversation http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-January/026468.html and continued here http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-February/thread.html#26620. Ah, I see that since I last looked, you've been contributing to that discussion already Daniel. For other readers, here's a summary. ARM have developed a new ABI (EABI) which adds a user-accessible coprocessor thread register, but that register is only available on ARM architecture v6 and above (shades of rdhwr!), and requires trap-based kernel emulation on older CPUs. Since the majority of ARM CPUs are pre-v6, the ARM Linux developers aren't happy about the cost of this emulation and are discussing alternative mechanisms which would work better on old ARM CPUs using the old ABI. The favorite seems to be mapping a read-only page which holds a copy of the thread pointer at a fixed virtual address. 
Since the ARM MMU architecture apparently cannot handle this on SMP configurations it looks like ARM Linux is considering splitting into two TLS "universes", depending on the compiler/linker options used to build the application and libraries: 1) Old ABI and memory-mapped thread pointer for pre ARM v6 architectures, which can't run SMP anyway. For compatibility, old memory-mapped code running on ARM v6 SMP systems will generate a page fault when accessing the magic page, and the thread pointer load will be emulated by the kernel. 2) New EABI only for SMP ARM v6 and above, using the new user coprocessor reg. The new EABI can't run well on old CPUs anyway, because it uses other new instructions which aren't available on pre-ARM v6 (atomic instructions). Again, for compatibility purposes, access to the new thread coprocessor register can be trapped and emulated by the kernel. So ARM Linux may before too long have well-performing TLS implementations for all of its Linux-capable CPUs. > Want to talk to me more about using a parked TLB entry? I spoke with > someone (Ralf or Jun, probably) about the idea originally; I was told > it wasn't possible on MIPS SMP implementations to make this work, or > that there was some other reason why it was undesirable. If that's not > accurate then we could use a reserved memory location. However, that > makes the TLS model dependent on details of Linux's memory mapping - > not good for a hopefully generally useful ABI. While it's true that all threads within the same address space must share a single page table, which would prevent such a trick being implemented using the normal TLB refill mechanism, I think that it might indeed be possible to implement this by using a per-thread wired (parked) TLB entry, updated on a context switch. This would map a magic page containing a copy of the thread pointer to a fixed virtual address: if in the bottom 32K then it would need just one instruction (e.g. lw $v0,0x1000($0)). 
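The "bottom 32K" condition follows from the MIPS load encoding: the offset field of lw is a signed 16-bit immediate added to the base register, so with $0 as the base only a small window of addresses is reachable in one instruction. As a rough illustration only (the helper names are made up for this sketch and are not part of any patch discussed here), the following Python checks whether an address can be reached from $0 and builds the corresponding instruction word:

```python
# Sketch only: helper names are hypothetical, not from the kernel or toolchain.

LW_OPCODE = 0x23  # MIPS I-type major opcode for lw


def reachable_from_zero(addr):
    """True if a 32-bit address can be formed as $0 + signed 16-bit offset."""
    addr &= 0xFFFFFFFF
    # Offsets 0..0x7fff cover the bottom 32K. Negative offsets wrap around
    # to the top 32K of the address space, which is kernel-only on MIPS32.
    return addr <= 0x7FFF or addr >= 0xFFFF8000


def encode_lw(rt, offset, base=0):
    """Encode `lw rt, offset(base)`; offset must fit in signed 16 bits."""
    if not -0x8000 <= offset <= 0x7FFF:
        raise ValueError("offset does not fit in a signed 16-bit immediate")
    return (LW_OPCODE << 26) | (base << 21) | (rt << 16) | (offset & 0xFFFF)


if __name__ == "__main__":
    # lw $v0,0x1000($0): $v0 is register 2
    print(hex(encode_lw(2, 0x1000)))
    print(reachable_from_zero(0x1000), reachable_from_zero(0x8000))
```

A magic page placed above 0x7fff in user space would need a lui to build the upper half of the address first, i.e. two instructions instead of one.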
A wired TLB entry like this should be fairly straightforward to implement on even a simple RTOS, since it doesn't require a full-blown VM system. I don't see why it wouldn't work on an SMP system too, since each CPU has its own TLB, and therefore a unique wired TLB entry per active thread.

To maintain single-threaded performance we'd perhaps want to increment the Wired count only when running real TLS code, so as not to reduce the number of TLB entries available for random replacement for the majority single-threaded applications.

I've spoken to Ralf about this idea, and though he's not thrilled by it, he hasn't (yet) said that it couldn't work.

But a notable downside of this technique is that it won't work on the new generation of explicitly multi-threaded CPUs, which could be executing instructions from many threads within the same address space "simultaneously", using a single, shared TLB. On such CPUs the memory-mapped thread pointer would require that the kernel not map the wired page, and then emulate a faulting thread pointer load in the page fault handler - rather defeating the point of having a CPU designed to accelerate multi-threading.

For multi-threaded CPUs the ultimate solution is the new ABI which reserves a GPR for the thread register. But in the short term, before that new ABI is fully supported, a rdhwr-based implementation could be argued for.

How do people feel about supporting two different TLS implementations on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs, and another for multi-threaded CPUs - similar to what ARM Linux seems to be considering for SMP? In each case the kernel could provide binary compatibility, at reduced performance, for the incompatible thread access mechanism, i.e. emulating the memory-mapped load, or the rdhwr, as appropriate.
It might also be sensible to provide two variants of libc.so (per ABI), compiled for the different TLS mechanisms, much as the x86 Linux dynamic linker automatically loads either /usr/lib/i486/libc.so or /usr/lib/i686/libc.so depending on the CPU type. > All reads from userspace are always protected in Linux; But not if you are trying to accelerate the emulation code by handling it at exception level, so as to avoid a full register save/restore and kernel context setup. But Maciej's patch does include a fix to the nested TLBL exception handler to make this work. > Your point about the performance overhead of reading and decoding an > instruction is worth keeping in mind, but hasn't been a big > slowdown. Measured how? Nigel From dan at codesourcery.com Wed Feb 9 16:21:26 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Wed, 9 Feb 2005 11:21:26 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <16906.3094.245677.488728@mips.com> References: <16906.3094.245677.488728@mips.com> Message-ID: <20050209162119.GA8011@nevyn.them.org> On Wed, Feb 09, 2005 at 01:11:50PM +0000, Nigel Stephens wrote: > Daniel Jacobowitz wrote: > > Note that I haven't been doing thread benchamarking, but the > > performance overhead from emulating rdhwr has not been significant > > in casual testing. I'm not weeping for the lost speed either way. > > One reason why MIPS is concerned about using rdhwr is that it may > condemn the whole MIPS architecture to poor multi-threading > performance relative to other architectures for some number of > years. If the architecture gains a poor reputation in this area, then > the mud could stick. > > As part of our own experiments Maciej implemented a "fast path" rdhwr > emulation, which he promises he will post to this list today. It has a > typical emulation time of between 30 and 60 cycles, depending on the > CPU, and assuming a fixed destination register for rdhwr (e.g. only > rdhwr $2,$5). 
Not too bad, but not brilliant either if thread pointer > access time turns out to be critical to the performance of some > threaded applications. Can you compare this to the normal cost of an emulated instruction? I'm not sure if I've posted the rdhwr emulation patch anywhere; I know Ralf has a copy. I'm not thrilled about hardcoding the target register but if that's what ya gotta do... > But I don't yet understand how important the thread pointer access > time is for a typical threaded program. Can we get a better idea of > the dynamic frequency of thread register loads? Does anyone know some > suitable application programs or benchmarks which exercise the TLS > mechanism, and from which we could extract some statistics? TLS is a user level feature, so it's very difficult to predict access patterns. Glibc uses it for: - errno - locale data - pthread_self There are, roughly, twentyish TLS traps in the startup of a typical single-threaded application. I do not know anything about thread benchmarking so I can't give you much there. > With regard to TLS on other architectures, people might like to read > this email conversation > http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-January/026468.html > and continued here > http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-February/thread.html#26620. > Ah, I see that since I last looked, you've been contributing to that > discussion already Daniel. I wrote most of that except for Nico's reimplementation. > > Want to talk to me more about using a parked TLB entry? I spoke with > > someone (Ralf or Jun, probably) about the idea originally; I was told > > it wasn't possible on MIPS SMP implementations to make this work, or > > that there was some other reason why it was undesirable. If that's not > > accurate then we could use a reserved memory location. However, that > > makes the TLS model dependent on details of Linux's memory mapping - > > not good for a hopefully generally useful ABI. 
> > While it's true that all threads within the same address space must > share a single page table, which would prevent such a trick being > implemented using the normal TLB refill mechanism, I think that it > might indeed be possible to implement this by using a per-thread wired > (parked) TLB entry, updated on a context switch. This would map a > magic page containing a copy of the thread pointer to a fixed virtual > address: if in the bottom 32K then it would need just one instruction > (e.g. lw $v0,0x1000($0)). > > A wired TLB entry like this should be fairly straightforward to > implement on even a simple RTOS, since it doesn't require a full-blown > VM system. I don't see why it wouldn't work on an SMP system too, > since each CPU has its own TLB, and therefore a unique wired TLB > entry per active thread. > > To maintain single-threaded performance we'd perhaps want to increment > the Wired count only when running real TLS code, so as not to reduce > the number of TLB entries available for random replacement for the > majority single-threaded applications. > > I've spoken to Ralf about this idea, and though he's not thrilled by > it, he hasn't (yet) said that it couldn't work. You can't do this for "only single-threaded code" - see above about errno and locales. TLS is either used by glibc or not. When built for NPTL, it's used. I'm pretty dubious about the tradeoff of wiring a TLB entry for this. Benchmarks will not accurately show the cost of having fewer available. > But a notable downside of this technique is that it won't work on the > new generation of explicitly multi-threaded CPUs, which could be > executing instructions from many threads within the same address space > "simultaneously", using a single, shared TLB. 
On such CPUs the
> memory-mapped thread pointer would require that the kernel not map the
> wired page, and then emulate a faulting thread pointer load in the
> page fault handler - rather defeating the point of having a CPU
> designed to accelerate multi-threading.

Well, that's a bit of a full stop! Now we've got _three_ TLS mechanisms to contend with. That's too many.

> How do people feel about supporting two different TLS implementations
> on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs,
> and another for multi-threaded CPUs - similar to what ARM Linux seems
> to be considering for SMP?

I think it's a horrible, horrible idea. Complexity has a cost too.

> > Your point about the performance overhead of reading and decoding an
> > instruction is worth keeping in mind, but hasn't been a big
> > slowdown.
>
> Measured how?

As I said, only casually.

From this discussion I am still inclined to stay with rdhwr. 30-60 cycles is not very long.

--
Daniel Jacobowitz
CodeSourcery, LLC

From macro at mips.com Wed Feb 9 18:32:36 2005
From: macro at mips.com (Maciej W. Rozycki)
Date: Wed, 9 Feb 2005 18:32:36 +0000 (GMT)
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050209162119.GA8011@nevyn.them.org>
References: <16906.3094.245677.488728@mips.com> <20050209162119.GA8011@nevyn.them.org>
Message-ID:

On Wed, 9 Feb 2005, Daniel Jacobowitz wrote:

> > As part of our own experiments Maciej implemented a "fast path" rdhwr
> > emulation, which he promises he will post to this list today. It has a
> > typical emulation time of between 30 and 60 cycles, depending on the
> > CPU, and assuming a fixed destination register for rdhwr (e.g. only
> > rdhwr $2,$5). Not too bad, but not brilliant either if thread pointer
> > access time turns out to be critical to the performance of some
> > threaded applications.
>
> Can you compare this to the normal cost of an emulated instruction?
For the 24Kf processor the cost of doing a normal emulation is about 550% of that of my fast path. For the 4Kc one it's 1975%... > I'm not sure if I've posted the rdhwr emulation patch anywhere; I know > Ralf has a copy. I'm not thrilled about hardcoding the target > register but if that's what ya gotta do... You can have a fast path for the dedicated target register and normal emulation for the others to keep the semantics consistent. The cost rise from doing a computed goto to emulate a write to an arbitrary target register is about 25%, i.e. the total cost is about 125% of the original. Maciej From dan at codesourcery.com Wed Feb 9 18:52:35 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Wed, 9 Feb 2005 13:52:35 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: References: <16906.3094.245677.488728@mips.com> <20050209162119.GA8011@nevyn.them.org> Message-ID: <20050209185227.GB8011@nevyn.them.org> On Wed, Feb 09, 2005 at 06:32:36PM +0000, Maciej W. Rozycki wrote: > On Wed, 9 Feb 2005, Daniel Jacobowitz wrote: > > > > As part of our own experiments Maciej implemented a "fast path" rdhwr > > > emulation, which he promises he will post to this list today. It has a > > > typical emulation time of between 30 and 60 cycles, depending on the > > > CPU, and assuming a fixed destination register for rdhwr (e.g. only > > > rdhwr $2,$5). Not too bad, but not brilliant either if thread pointer > > > access time turns out to be critical to the performance of some > > > threaded applications. > > > > Can you compare this to the normal cost of an emulated instruction? > > For the 24Kf processor the cost of doing a normal emulation is about 550% > of that of my fast path. For the 4Kc one it's 1975%... > > > I'm not sure if I've posted the rdhwr emulation patch anywhere; I know > > Ralf has a copy. I'm not thrilled about hardcoding the target > > register but if that's what ya gotta do... 
> > You can have a fast path for the dedicated target register and normal > emulation for the others to keep the semantics consistent. The cost rise > from doing a computed goto to emulate a write to an arbitrary target > register is about 25%, i.e. the total cost is about 125% of the original. For GCC and ABI purposes, this means we might as well define in the TLS ABI which register has to be used, and we can open it up when we look back in ten years and everyone has the register :-) Thanks for the numbers. I think that working with the fast-path emulation and rdhwr is our best bet at this time. It also has a substantial locality (i.e. all the code in one place) benefit over playing with the TLB... -- Daniel Jacobowitz CodeSourcery, LLC From macro at mips.com Wed Feb 9 19:40:59 2005 From: macro at mips.com (Maciej W. Rozycki) Date: Wed, 9 Feb 2005 19:40:59 +0000 (GMT) Subject: Linux TLS pointer access reference emulation Message-ID: Hello, I have published a Linux patch and a small test user program that I've used for performance evaluation of a few possible TLS pointer access methods. The software is available at: "ftp://ftp.linux-mips.org/pub/linux/mips/people/macro/tls.tar.bz2". The patch implements an emulation of "rdhwr $2, $4" and syscall #0x10000, both retrieving a member of "struct thread_info" associated with the current process (the patch uses an arbitrary one; of course for the TLS pointer that should be replaced with a meaningful struct field). The patch applies to Linux 2.6.9-rc1, specifically to the malta CVS repository at linux-mips.org as of Oct 20th, 2004. It should work with the corresponding version from the main repository as well, but using it with the current revision requires adjusting it to these synthesized TLB handlers. The patch has its shortcomings, most notably it's been written for the 32-bit kernel only. For 64-bit ones it needs to be aware of the XTLB refill handler. 
It may actually be done quite nicely with these synthesized handlers; also avoiding the need to fetch the shadow of the EBase cp0 register. The userland software consists of a small program that benchmarks the available methods in tight loops. Keeping caches warm this should provide a reasonable optimistic execution time estimate. The program expects two arguments, a CPU frequency (which you can obtain from `dmesg' or failing this -- from your system's specs) and a number of loops to execute. It provides a number of outputs which are essentially raw as it's not really been meant for general use, but most importantly you are after "cycle count" reports for "scall" and "rdhwr" (these should be self explaining); perhaps "instr" as well (which counts cycles used for a single instruction and may not be accurate depending on the implementation of your processor's execution pipeline(s)). There are actually two programs included -- "time-0" and "time-1"; the former is what I've used for benchmarking and the latter is mainly for verification of proper operation with VIVT I-caches (but its output is meaningful, too). With Linux from the malta repository the programs can be trivially modified to benchmark a full instruction emulation. This version of Linux already emulates "rdhwr" for other registers, so all that has to be done is to replace "rdhwr $2, $4" with e.g. "rdhwr $2, $2" in rdhwr.h; that's how I did these additional benchmarks for Daniel. Alongside the software there are a few reports provided that I've obtained with a Malta board for CPUs I've had immediately available on core cards. Please let me know if you have any further questions regarding this package. 
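The cycle-count arithmetic behind such a benchmark can be sketched as follows. This is only an illustration under stated assumptions: the function name and the sample figures are hypothetical, and nothing here is taken from the actual time-0/time-1 sources, which may compute their reports differently.

```python
def cycles_per_iteration(elapsed_seconds, cpu_hz, loops):
    """Average CPU cycles spent per loop iteration.

    Assumption: the benchmark times a tight loop and converts the elapsed
    wall-clock time to cycles using the CPU frequency given on the command
    line. The real program may well work differently.
    """
    if loops <= 0:
        raise ValueError("loops must be positive")
    return elapsed_seconds * cpu_hz / loops


# Hypothetical run: 1,000,000 iterations taking 0.25 s on a 200 MHz CPU.
print(cycles_per_iteration(0.25, 200_000_000, 1_000_000))
```

With warm caches a figure obtained this way is an optimistic estimate, as the message itself notes.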
Maciej From ralf at linux-mips.org Wed Feb 9 19:50:13 2005 From: ralf at linux-mips.org (Ralf Baechle) Date: Wed, 9 Feb 2005 20:50:13 +0100 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <16906.3094.245677.488728@mips.com> References: <16906.3094.245677.488728@mips.com> Message-ID: <20050209195013.GD5740@linux-mips.org> On Wed, Feb 09, 2005 at 01:11:50PM +0000, Nigel Stephens wrote: > One reason why MIPS is concerned about using rdhwr is that it may > condemn the whole MIPS architecture to poor multi-threading > performance relative to other architectures for some number of > years. If the architecture gains a poor reputation in this area, then > the mud could stick. I'm not too concerned here because currently there seems to be very little code available that exploits TLS. As time progresses more such code will be written and that I hope would be sufficient time for the hardware folks, if we decide to take that route. > But I don't yet understand how important the thread pointer access > time is for a typical threaded program. Can we get a better idea of > the dynamic frequency of thread register loads? Does anyone know some > suitable application programs or benchmarks which exercise the TLS > mechanism, and from which we could extract some statistics? This is not a very quantitative statement but the Alpha people are apparently very satisfied with their PAL code exception based solution. At least so said rth when I last spoke to him. > To maintain single-threaded performance we'd perhaps want to increment > the Wired count only when running real TLS code, so as not to reduce > the number of TLB entries available for random replacement for the > majority single-threaded applications. > > I've spoken to Ralf about this idea, and though he's not thrilled by > it, he hasn't (yet) said that it couldn't work.
My general feeling is that it's the kind of tradeoff that is sort of the equivalent of juggling razor blades with closed eyes ;-) Or translated into plain English, it's one of those optimizations which I think have a strong potential to backfire and actually result in a loss. It's hard to say without an actual application and without knowing the size of the workload in advance. > For multi-threaded CPUs the ultimate solution is the new ABI which > reserves a GPR for the thread register. But in the short-term, before > that new ABI is fully supported, a rdhwr-based implementation could be > argued for. A new ABI is a lot of work, will take time and, not least, convincing. We want something sooner than that could happen. > How do people feel about supporting two different TLS implementations > on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs, > and another for multi-threaded CPUs - similar to what ARM Linux seems > to be considering for SMP? It's a bit like hardware and software floating point, ll/sc and non-ll/sc binaries which are flavours which we already have. One can only hope that the number of binary variants won't reach the actual worst case number. Ralf From uhler at mips.com Wed Feb 9 20:41:49 2005 From: uhler at mips.com (Michael Uhler) Date: Wed, 9 Feb 2005 12:41:49 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050209162119.GA8011@nevyn.them.org> Message-ID: <003601c50ee7$c763d510$0202a8c0@MIPS.COM> > From this discussion I am still inclined to stay with rdhwr. > 30-60 cycles is not very long. The problem that I'm having with this email thread is that I no longer understand what problem is being solved. You're getting push-back from the MIPS folks, including me, about a solution that uses emulation of an instruction which is unlikely to be actually implemented in hardware any time soon, if ever. It sounds like the ARM people are having the same heartburn for the same reason.
There appears to be no appreciable benchmarking that suggests that a trap-and-emulate approach will be fine, or not so fine in terms of performance of the solution. If the problem we're trying to solve is to get NPTL to work on MIPS, period the end, then we can stop the email thread now. That certainly seems to be the path we're taking as of now. If we're trying to find a solution to the problem such that MIPS isn't at a competitive disadvantage, perhaps we should start looking for such a solution, rather than rejecting suggestions because of the changes to the current software implementation. So can you please crisply state what problem we're trying to solve here, and the bounds on an acceptable solution from your point of view. I'm really confused. /gmu --- Michael Uhler, Chief Technology Officer MIPS Technologies, Inc. Email: uhler at mips.com 1225 Charleston Road Voice: (650)567-5025 FAX: (650)567-5225 Mountain View, CA 94043 Mobile: (650)868-6870 Admin: (650)567-5085 From dan at codesourcery.com Wed Feb 9 21:13:41 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Wed, 9 Feb 2005 16:13:41 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <003601c50ee7$c763d510$0202a8c0@MIPS.COM> References: <20050209162119.GA8011@nevyn.them.org> <003601c50ee7$c763d510$0202a8c0@MIPS.COM> Message-ID: <20050209211337.GC8011@nevyn.them.org> On Wed, Feb 09, 2005 at 12:41:49PM -0800, Michael Uhler wrote: > > > From this discussion I am still inclined to stay with rdhwr. > > 30-60 cycles is not very long. > > The problem that I'm having with this email thread is that I no longer > understand what problem is being solved. You're getting push-back from the > MIPS folks, including me, about a solution that uses emulation of an > instruction which is unlikely to be actually implemented in hardware any > time soon, if ever. It sounds like the ARM people are having the same > heartburn for the same reason. 
There appears to be no appreciable > benchmarking that suggests that a trap-and-emulate approach will be fine, or > not so fine in terms of performance of the solution. > > If the problem we're trying to solve is to get NPTL to work on MIPS, period > the end, then we can stop the email thread now. That certainly seems to be > the path we're taking as of now. > > If we're trying to find a solution to the problem such that MIPS isn't at a > competitive disadvantage, perhaps we should start looking for such a > solution, rather than rejecting suggestions because of the changes to the > current software implementation. > > So can you please crisply state what problem we're trying to solve here, and > the bounds on an acceptable solution from your point of view. I'm really > confused. My point is for NPTL to work. Specifically: - Tying it to a new ABI is unacceptable in the short term, and possibly in the longer term, because of community pushback against additional ABIs. - Using a wired TLB entry gives kernel developers the shakes, because it restricts the available TLB slots, which can have complex effects on the performance of existing applications. That only leaves methods which enter the kernel. Our choices are via a load from an unmapped page, via a syscall, or via an RI exception. Only one of those models is compatible with acceleration via future hardware. Using rdhwr does not provide cripplingly slow - or even perceptibly slow - performance, and I'm using a much more expensive emulation layer than Maciej's. If the MIPS folks can clearly justify their concerns about this solution, and some relevant ideas to benchmark the problem, then maybe we can make forward progress. As was just pointed out to me, Maciej's numbers are on the same order of magnitude as a load miss. Note that the rdhwr is never needed more than once per function with an optimizing compiler. I think that at the cost of a load miss, you're getting a bargain. 
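The cost figures quoted in this thread can be combined in a rough back-of-the-envelope sketch. The percentages and the 30-60 cycle range come from the messages above; the helper names and the pairing of range endpoints with percentages are illustrative assumptions only, since the thread does not say which CPU corresponds to which end of the range.

```python
def scaled_cost(fast_path_cycles, percent_of_fast_path):
    """Cost of an emulation variant given as a percentage of the fast path."""
    return fast_path_cycles * percent_of_fast_path / 100


def amortized_cost(trap_cycles, tls_accesses_per_function):
    """Per-access cost when one rdhwr serves every TLS access in a function."""
    return trap_cycles / tls_accesses_per_function


# Fast-path range reported in the thread: 30 to 60 cycles.
for fast in (30, 60):
    print(fast, "cycle fast path:")
    print("  arbitrary target register (125%):", scaled_cost(fast, 125))
    print("  normal emulation at 550%:", scaled_cost(fast, 550))
    print("  normal emulation at 1975%:", scaled_cost(fast, 1975))

# Hypothetical function making four TLS accesses after a single 60-cycle trap.
print("amortized per access:", amortized_cost(60, 4))
```

None of this says how often real programs load the thread pointer, which is the open question in the discussion.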
-- Daniel Jacobowitz CodeSourcery, LLC From mark at codesourcery.com Wed Feb 9 21:19:25 2005 From: mark at codesourcery.com (Mark Mitchell) Date: Wed, 09 Feb 2005 13:19:25 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <003601c50ee7$c763d510$0202a8c0@MIPS.COM> References: <003601c50ee7$c763d510$0202a8c0@MIPS.COM> Message-ID: <420A7E5D.3050008@codesourcery.com> Michael Uhler wrote: >>From this discussion I am still inclined to stay with rdhwr. >>30-60 cycles is not very long. > > The problem that I'm having with this email thread is that I no longer > understand what problem is being solved. I'll try to summarize what I understand the situation to be. CodeSourcery has done an implementation for one of its customers using rdhwr. We'd like to get that implementation, or a variant of it, out to the broader community. From our point of view, it's just as easy to use any other single instruction that loads the thread pointer, instead of "rdhwr"; that could be "lw" or some other trapping load. There's no technical problem making that change. What is important is that we get buy-in from the rest of the Linux MIPS community. We really want to avoid multiple different versions of this stuff out there. Concern about that issue is right now preventing Daniel from posting his patches; he doesn't want to see things fragment. Daniel feels rdhwr is the best choice, as it would seem to avoid complexity down the road; MIPS seems to feel that the number of cycles required to access the thread pointer on current hardware is more important. I don't think I have an opinion, but I would point out that (a) optimizations for TLS models mean that you can often make multiple accesses to TLS with a single thread pointer load, and (b) a lot of threaded programs make very little use of TLS, but one still needs *some* implementation of TLS in order to get NPTL off the ground.
Both of these points suggest that cycles-per-thread-pointer-load may not be that important a metric. A compromise solution would be to say that the ABI requires "rdhwr", but mark the instruction with a dynamic relocation that the dynamic loader could modify into a different instruction if that makes sense on a particular system. Ultimately, I think what would be most helpful would be for the Linux MIPS community to decide what single instruction needs to go in that slot, and then let us know. If, in the end, it's not rdhwr, that's OK. However, Daniel's validated his changes using the rdhwr solution, and we don't have the resources to re-validate with another solution, under our existing contracts. So, we would provide our patches to MIPS, and let MIPS handle the upstream submission. -- Mark Mitchell CodeSourcery, LLC mark at codesourcery.com (916) 791-8304 From ica2_ts at csv.ica.uni-stuttgart.de Wed Feb 9 23:26:41 2005 From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer) Date: Thu, 10 Feb 2005 00:26:41 +0100 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <003601c50ee7$c763d510$0202a8c0@MIPS.COM> References: <20050209162119.GA8011@nevyn.them.org> <003601c50ee7$c763d510$0202a8c0@MIPS.COM> Message-ID: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de> Michael Uhler wrote: > > > From this discussion I am still inclined to stay with rdhwr. > > 30-60 cycles is not very long. > > The problem that I'm having with this email thread is that I no longer > understand what problem is being solved. The short term goal is still to add TLS to the existing ABIs without breaking source compatibility. > You're getting push-back from the > MIPS folks, including me, about a solution that uses emulation of an > instruction which is unlikely to be actually implemented in hardware any > time soon, if ever. It sounds like the ARM people are having the same > heartburn for the same reason.
Yes, and nobody backs up this concern with hard data. Maciej's numbers are good to know, but they tell nothing about how heavily the TLS implementation will be used in real-world applications. > There appears to be no appreciable > benchmarking that suggests that a trap-and-emulate approach will be fine, or > not so fine in terms of performance of the solution. I looked for good NPTL benchmarks and decent performance comparisons in the meantime. I found none at all for emulated TLS vs. TLS register. Numbers were cited by - mail(s) from Ulrich Drepper to the Linux Kernel Mailing List, which compares against linuxthreads, and uses some rather unrealistic loads like 100000 threads. The NPTL source has some highly synthetic benchmarks which might have been used to get those numbers. - Some mails on developer mailing lists which cite a 1-2% performance improvement over linuxthreads for larger Java applications. For that case the difference between emulated and hardware TLS appears to be negligible. > If the problem we're trying to solve is to get NPTL to work on MIPS, period > the end, then we can stop the email thread now. That certainly seems to be > the path we're taking as of now. > > If we're trying to find a solution to the problem such that MIPS isn't at a > competitive disadvantage, perhaps we should start looking for such a > solution, rather than rejecting suggestions because of the changes to the > current software implementation. > > So can you please crisply state what problem we're trying to solve here, and > the bounds on an acceptable solution from your point of view. I'm really > confused. The assumption of "competitive disadvantage" is AFAICS unproven, and the schemes suggested to improve TLS performance may well turn out as over-engineering. Using e.g. a wired TLB is known to have a performance impact for all applications, the same is (probably to a lesser extent) true for reserving a GPR.
Thiemo From uhler at mips.com Wed Feb 9 23:46:33 2005 From: uhler at mips.com (Michael Uhler) Date: Wed, 9 Feb 2005 15:46:33 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de> Message-ID: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> > Yes, and nobody backs up this concern with hard data. > Maciej's numbers are good to know, but they tell nothing > about how heavily the TLS implementation will be used in > real-world applications. > The assumption of "competitive disadvantage" is AFAICS > unproven, and the schemes suggested to improve TLS > performance may well turn out as over-engineering. Using e.g. > a wired TLB is known to have a performance impact for all > applications, the same is (probably to a lesser extent) true > for reserving a GPR. Both points are valid. But they assume that if we DO have a performance problem, we'll be able to go back and fix that problem with an alternative method (something other than a new ABI). It was my impression that we were discussing something that was not going to be easy to change once defined. So if we are proposing something that uses trap-and-emulate of rdhwr (whose register number will still have to change - I'll figure out what that is) as an initial proposal, and we're prepared to change it to address any performance problem that we find, I'm OK with that. If we're not prepared to change things if we find a performance problem, I wonder why the burden of proof is on us to prove that we're not at a competitive disadvantage (something the ARM folks are obviously concerned about also) vs. the burden of proof being on having sufficient performance analysis to know that it will be fine. To me, it all comes down to whether this is a final, unchangeable, solution or not. /gmu --- Michael Uhler, Chief Technology Officer MIPS Technologies, Inc.
Email: uhler at mips.com 1225 Charleston Road Voice: (650)567-5025 FAX: (650)567-5225 Mountain View, CA 94043 Mobile: (650)868-6870 Admin: (650)567-5085 From mark at codesourcery.com Thu Feb 10 00:04:51 2005 From: mark at codesourcery.com (Mark Mitchell) Date: Wed, 09 Feb 2005 16:04:51 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> Message-ID: <420AA523.4060307@codesourcery.com> Michael Uhler wrote: > Both points are valid. But they assume that if we DO have a performance > problem, we'll be able to go back and fix that problem with an alternative > method (something other than a new ABI). It was my impression that we were > discussing something that was not going to be easy to change once defined. That's why I suggested, as a possible compromise, that we require that compilers/linkers mark the rdhwr instruction with a relocation. That would allow dynamic linkers to make appropriate changes to the code, if appropriate. To me, this seems like a very practical way of moving forward with our current implementation, while hedging our bets; what do you and others think? > If we're not prepared to change things if we find a performance problem, I > wonder why the burden of proof is on us to prove that we're not at a > competitive disadvantage (something the ARM folks are obviously concerned > about also) vs. the burden of proof being on having sufficient performance > analysis to know that it will be fine. I desperately want to avoid getting into an ARM/MIPS controversy. (CodeSourcery is not an advocate for one architecture over the other; we are pleased to work with many major semiconductor vendors, and while loyal to each of our customers, neutral overall.) However, I will say that the ARM GNU/Linux community and ARM, Ltd., are not necessarily of one mind on this topic.
I believe (though I certainly cannot speak for them) that ARM, Ltd., is OK with the requirement that, to get maximum NPTL/TLS performance, you use an ARM V6 chip and the new ARM ABI, including a coprocessor-read instruction to access the thread pointer. The ARM GNU/Linux community seems to have a greater attachment to the old ABI and a greater desire to support hardware without the appropriate coprocessors. -- Mark Mitchell CodeSourcery, LLC mark at codesourcery.com (916) 791-8304 From dan at codesourcery.com Thu Feb 10 00:10:10 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Wed, 9 Feb 2005 19:10:10 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> References: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de> <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> Message-ID: <20050210001007.GE8011@nevyn.them.org> On Wed, Feb 09, 2005 at 03:46:33PM -0800, Michael Uhler wrote: > > > Yes, and nobody backs up this concern with hard data. > > Maciej's numbers are good to know, but they tell nothing > > about how heavily the TLS implementation will be used in > > real-world applications. > > > The assumption of "competitive disadvantage" is AFAICS > > unproven, and the schemes suggested to improve TLS > > performance may well turn out as over-engineering. Using e.g. > > a wired TLB is known to have a performance impact for all > > applications, the same is (probably to a lesser extent) true > > for reserving a GPR. > > Both points are valid. But they assume that if we DO have a performance > problem, we'll be able to go back and fix that problem with an alternative > method (something other than a new ABI). It was my impression that we were > discussing something that was not going to be easy to change once defined.
> > So if we are proposing something that uses trap-and-emulate of rdhwr (whose > register number will still have to change - I'll figure out what that is) as > an initial proposal, and we're prepared to change it to address any > performance problem that we find, I'm OK with that. > > If we're not prepared to change things if we find a performance problem, I > wonder why the burden of proof is on us to prove that we're not at a > competitive disadvantage (something the ARM folks are obviously concerned > about also) vs. the burden of proof being on having sufficient performance > analysis to know that it will be fine. > > To me, it all comes down to whether this is a final, unchangable, solution > or not. There's a couple of different possibilities covered by "final" and "unchangeable". The one thing I'm most desperate to avoid is something that becomes impossible or extremely difficult to support later. Once we publish this as part of an ABI, there will soon be deployed systems using it; I don't want to have to force them to transition to something else. The TP access instruction ends up in both system libraries and user applications, so there's real legacy impact. Rdhwr could end up in this state, if MIPS determines that there is not an available register encoding or that devoting one to this usage is a bad idea. If that's the case, let us know - we'll have to go back to the drawing board. [FYI: ARM actually uses a function call for application TP access. I think this is a bad design decision, which is why I haven't proposed it for MIPS. Having to make a function call tacks even more overhead onto IE/LE access, particularly userspace register pressure.] However, if we're willing to support whatever we choose here for the unspecified future, I don't see a big barrier to choosing a new, superior approach later. Probably none of the toolchain components would be affected; they could seamlessly transition to a new model. 
The kernel would have to continue to support the existing model and the new model; I'll leave that answer to the kernel developers reading this, but I believe that Maciej's patches would not be infeasible to maintain. If this is OK with you, I'd appreciate it if you could get back to us about the choice of register number. I've got at least a couple more days worth of work before the port will be ready. It won't cripple me to sit on it for a couple weeks after that, but I'd love to have it submitted - I've already started to get requests for the code. -- Daniel Jacobowitz CodeSourcery, LLC From dan at codesourcery.com Thu Feb 10 00:18:41 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Wed, 9 Feb 2005 19:18:41 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <420AA523.4060307@codesourcery.com> References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> <420AA523.4060307@codesourcery.com> Message-ID: <20050210001837.GF8011@nevyn.them.org> On Wed, Feb 09, 2005 at 04:04:51PM -0800, Mark Mitchell wrote: > Michael Uhler wrote: > >Both points are valid. But they assume that if we DO have a performance > >problem, we'll be able to go back and fix that problem with an alternative > >method (something other than a new ABI). It was my impression that we were > >discussing something that was not going to be easy to change once defined. > > That's why I suggested, as a possible compromise, that we require that > compilers/linkers mark the rdhwr instruction with a relocation. That > would allow dynamic linkers to make appropriate changes to the code, if > appropriate. > > To me, this seems like a very practical way of moving forward with our > current implementation, while hedging our bets; what do you and others > think? I don't think it's worthwhile, since it doesn't hedge bets very well. An alternative sequence could easily turn out to be more than one instruction, but still faster than a trap. 
-- Daniel Jacobowitz CodeSourcery, LLC From mark at codesourcery.com Thu Feb 10 00:22:27 2005 From: mark at codesourcery.com (Mark Mitchell) Date: Wed, 09 Feb 2005 16:22:27 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050210001837.GF8011@nevyn.them.org> References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> <420AA523.4060307@codesourcery.com> <20050210001837.GF8011@nevyn.them.org> Message-ID: <420AA943.2040901@codesourcery.com> Daniel Jacobowitz wrote: > I don't think it's worthwhile, since it doesn't hedge bets very well. > An alternative sequence could easily turn out to be more than one > instruction, but still faster than a trap. (I was thinking that thus far all the sequences seem to have been single-instruction; the options thus far seem to have been "rdhwr", "lw $x, 0x1000($0)", and some other reserved-instruction sequence.) Anyhow, there goes my attempt at cutting through this particular Gordian knot. :-) Thanks, -- Mark Mitchell CodeSourcery, LLC mark at codesourcery.com (916) 791-8304 From ica2_ts at csv.ica.uni-stuttgart.de Thu Feb 10 00:42:48 2005 From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer) Date: Thu, 10 Feb 2005 01:42:48 +0100 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> References: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de> <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> Message-ID: <20050210004248.GB11812@rembrandt.csv.ica.uni-stuttgart.de> Michael Uhler wrote: > > > Yes, and nobody backs up this concern with hard data. > > Maciej's numbers are good to know, but they tell nothing > > about how heavily the TLS implementation will be used in > > real-world applications. > > > The assumption of "competitive disadvantage" is AFAICS > > unproven, and the schemes suggested to improve TLS > > performance may well turn out as over-engineering. Using e.g.
> > a wired TLB is known to have a performance impact for all > > applications, the same is (probably to a lesser extent) true > > for reserving a GPR. > > Both points are valid. But they assume that if we DO have a performance > problem, we'll be able to go back and fix that problem with an alternative > method (something other than a new ABI). It was my impression that we were > discussing something that was not going to be easy to change once defined. From a technical POV it's not that hard to change the method. But for adoption of NPTL it would be a disaster to create incompatible variants after the initial deployment. > So if we are proposing something that uses trap-and-emulate of rdhwr (whose > register number will still have to change - I'll figure out what that is) as > an initial proposal, and we're prepared to change it to address any > performance problem that we find, I'm OK with that. The current state is already well beyond the proposal phase. A potential change of the current implementation would need to happen soon, and would also need a sound reasoning. (Mark's idea of attaching a marker relocation to the instruction is probably the best to keep an emergency exit open. It would trade startup time for a TLS access change.) > If we're not prepared to change things if we find a performance problem, I > wonder why the burden of proof is on us to prove that we're not at a > competitive disadvantage (something the ARM folks are obviously concerned > about also) vs. the burden of proof being on having sufficient performance > analysis to know that it will be fine. The burden of proof is always on the side who wants to change a working solution.
:-) Thiemo From uhler at mips.com Thu Feb 10 00:57:01 2005 From: uhler at mips.com (Michael Uhler) Date: Wed, 9 Feb 2005 16:57:01 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050210004248.GB11812@rembrandt.csv.ica.uni-stuttgart.de> Message-ID: <003601c50f0b$6e54df40$cb14a8c0@MIPS.COM> > The burden of proof is always on the side who wants to change > a working solution. :-) Oh, please. Tell that to any open source maintainer who is looking for the right solution, not just a working solution. /gmu --- Michael Uhler, Chief Technology Officer MIPS Technologies, Inc. Email: uhler at mips.com 1225 Charleston Road Voice: (650)567-5025 FAX: (650)567-5225 Mountain View, CA 94043 Mobile: (650)868-6870 Admin: (650)567-5085 From ralf at linux-mips.org Thu Feb 10 00:58:01 2005 From: ralf at linux-mips.org (Ralf Baechle) Date: Thu, 10 Feb 2005 01:58:01 +0100 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <420AA523.4060307@codesourcery.com> References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> <420AA523.4060307@codesourcery.com> Message-ID: <20050210005801.GA10366@linux-mips.org> On Wed, Feb 09, 2005 at 04:04:51PM -0800, Mark Mitchell wrote: > Michael Uhler wrote: > >Both points are valid. But they assume that if we DO have a performance > >problem, we'll be able to go back and fix that problem with an alternative > >method (something other than a new ABI). It was my impression that we were > >discussing something that was not going to be easy to change once defined. > > That's why I suggested, as a possible compromise, that we require that > compilers/linkers mark the rdhwr instruction with a relocation. That > would allow dynamic linkers to make appropriate changes to the code, if > appropriate. > > To me, this seems like a very practical way of moving forward with our > current implementation, while hedging our bets; what do you and others > think? So we're now close to a consensus.
Given that, I'd now accept a kernel patch that does the right thing, which probably means taking Maciej's patch and polishing it to work with the latest kernel. Ralf From ica2_ts at csv.ica.uni-stuttgart.de Thu Feb 10 01:32:13 2005 From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer) Date: Thu, 10 Feb 2005 02:32:13 +0100 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <003601c50f0b$6e54df40$cb14a8c0@MIPS.COM> References: <20050210004248.GB11812@rembrandt.csv.ica.uni-stuttgart.de> <003601c50f0b$6e54df40$cb14a8c0@MIPS.COM> Message-ID: <20050210013213.GD11812@rembrandt.csv.ica.uni-stuttgart.de> Michael Uhler wrote: > > > The burden of proof is always on the side who wants to change > > a working solution. :-) > > Oh, please. Tell that to any open source maintainer who is looking for the > right solution, not just a working solution. Well, so far the rdhwr emulation looks like it is both. A different solution would IMHO need to show an improvement beyond theoretical considerations. Thiemo From uhler at mips.com Thu Feb 10 21:25:13 2005 From: uhler at mips.com (Michael Uhler) Date: Thu, 10 Feb 2005 13:25:13 -0800 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <20050210001007.GE8011@nevyn.them.org> Message-ID: <005b01c50fb7$0285c480$0c02a8c0@MIPS.COM> > If this is OK with you, I'd appreciate it if you could get > back to us about the choice of register number. I've got at > least a couple more days worth of work before the port will > be ready. It won't cripple me to sit on it for a couple > weeks after that, but I'd love to have it submitted - I've > already started to get requests for the code. I wouldn't exactly say that it's OK with us. The impression that I get is that it's too late to change, and even if it weren't we'd have to prove that the trap-and-emulate approach had performance problems.
We have to trade that off with our own desire to get NPTL supported, even if we have a feeling that the implementation may cause problems in the future. So, I have allocated RDHWR register 29 (decimal) for use as the pseudo-TLS pointer. What this means is that we have changed the architecture documents to indicate that this register is used for an ABI-related activity such that it will never be re-allocated for another purpose. At this point, we do not intend to implement this as a hardware register, nor will other MIPS implementations be doing so. We'll revisit this (as an architecture change) once we measure the performance impact of the proposal and compare that with other potential changes to the ABI. Based on the email thread, I'm not sure if Mark's suggestion of a compromise by marking the RDHWR with a relocation has benefit or not. If it does, it would be nice to have some hedge in the future. /gmu --- Michael Uhler, Chief Technology Officer MIPS Technologies, Inc. Email: uhler at mips.com 1225 Charleston Road Voice: (650)567-5025 FAX: (650)567-5225 Mountain View, CA 94043 Mobile: (650)868-6870 Admin: (650)567-5085 From dan at codesourcery.com Thu Feb 10 21:58:24 2005 From: dan at codesourcery.com (Daniel Jacobowitz) Date: Thu, 10 Feb 2005 16:58:24 -0500 Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI In-Reply-To: <005b01c50fb7$0285c480$0c02a8c0@MIPS.COM> References: <20050210001007.GE8011@nevyn.them.org> <005b01c50fb7$0285c480$0c02a8c0@MIPS.COM> Message-ID: <20050210215822.GC12253@nevyn.them.org> On Thu, Feb 10, 2005 at 01:25:13PM -0800, Michael Uhler wrote: > > > If this is OK with you, I'd appreciate it if you could get > > back to us about the choice of register number. I've got at > > least a couple more days worth of work before the port will > > be ready. It won't cripple me to sit on it for a couple > > weeks after that, but I'd love to have it submitted - I've > > already started to get requests for the code. 
> > I wouldn't exactly say that it's OK with us. The impression that I get is > that it's too late to change, and even if it weren't we'd have to prove that > the trap-and-emulate approach had performance problems. We have to trade > that off with our own desire to get NPTL supported, even if we have a > feeling that the implementation may cause problems in the future. Let me quote a message you didn't directly answer: My point is for NPTL to work. Specifically: - Tying it to a new ABI is unacceptable in the short term, and possibly in the longer term, because of community pushback against additional ABIs. - Using a wired TLB entry gives kernel developers the shakes, because it restricts the available TLB slots, which can have complex effects on the performance of existing applications. That only leaves methods which enter the kernel. Our choices are via a load from an unmapped page, via a syscall, or via an RI exception. Only one of those models is compatible with acceleration via future hardware. Using rdhwr does not provide cripplingly slow - or even perceptibly slow - performance, and I'm using a much more expensive emulation layer than Maciej's. If the MIPS folks can clearly justify their concerns about this solution, and some relevant ideas to benchmark the problem, then maybe we can make forward progress. As was just pointed out to me, Maciej's numbers are on the same order of magnitude as a load miss. Note that the rdhwr is never needed more than once per function with an optimizing compiler. I think that at the cost of a load miss, you're getting a bargain. I'm not asking for "proof" that there are performance problems. I'm asking for something that I can understand as more than a "feeling", since I "feel" that there aren't. I'm trying to be responsive to your concerns, but I'm still having trouble getting a handle on why (whether?) you think any of the other ideas presented are better than emulating rdhwr. Please, if you have a better proposal... 
> So, I have allocated RDHWR register 29 (decimal) for use as the pseudo-TLS > pointer. What this means is that we have changed the architecture documents > to indicate that this register is used for an ABI-related activity such that > it will never be re-allocated for another purpose. At this point, we do not > intend to implement this as a hardware register, nor will other MIPS > implementations be doing so. We'll revisit this (as an architecture change) > once we measure the performance impact of the proposal and compare that with > other potential changes to the ABI. Thank you. We'll need to choose a preferred GPR for the fast-path emulation also; Thiemo suggested $3, which sounds reasonable to me. The choice does not make a great deal of difference. > Based on the email thread, I'm not sure if Mark's suggestion of a compromise > by marking the RDHWR with a relocation has benefit or not. If it does, it > would be nice to have some hedge in the future. I don't think it adds any additional value. It would only be useful if we wanted to replace the one instruction with any single other instruction in legacy code; most things will be rebuildable and I do not see the application startup time overhead as preferable to the kernel emulation for legacy code. Does anyone else see a scenario in which this would be good to have? -- Daniel Jacobowitz CodeSourcery, LLC
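[Editorial sketch] The trap-and-emulate scheme discussed in this thread can be illustrated in C. This is only a rough sketch under stated assumptions, not Maciej's kernel patch or any actual handler: the names decode_rdhwr, emulate_rdhwr_tls and struct fake_regs are hypothetical, and a real reserved-instruction handler would also have to fetch the faulting word from user memory, deal with branch delay slots and advance the EPC. It assumes the MIPS32 Release 2 encoding of rdhwr (SPECIAL3 opcode 0x1f, function 0x3b) and hardware register 29 as allocated in the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MIPS32 Release 2 encoding of "rdhwr rt, rd":
 * opcode SPECIAL3 (0x1f) in bits 31..26, rs = 0 in bits 25..21,
 * rt in bits 20..16, rd in bits 15..11, zero in bits 10..6,
 * function RDHWR (0x3b) in bits 5..0. */
#define OPCODE_SPECIAL3 0x1fu
#define FUNCT_RDHWR     0x3bu
#define HWR_TLS         29u  /* hardware register number allocated in this thread */

/* Hypothetical stand-in for the saved user-mode register state. */
struct fake_regs {
    uint32_t gpr[32];
};

/* Returns true if insn is an rdhwr; on success stores the destination GPR
 * number in *rt and the hardware register number in *rd. */
bool decode_rdhwr(uint32_t insn, unsigned *rt, unsigned *rd)
{
    if ((insn >> 26) != OPCODE_SPECIAL3)
        return false;
    if ((insn & 0x3fu) != FUNCT_RDHWR)
        return false;
    if (((insn >> 21) & 0x1fu) != 0 || ((insn >> 6) & 0x1fu) != 0)
        return false;
    *rt = (insn >> 16) & 0x1fu;
    *rd = (insn >> 11) & 0x1fu;
    return true;
}

/* Emulates "rdhwr rt, $29" by copying the thread pointer into GPR rt.
 * Returns false for any other instruction, which a real handler would
 * treat as a genuine reserved-instruction fault. */
bool emulate_rdhwr_tls(uint32_t insn, struct fake_regs *regs,
                       uint32_t thread_pointer)
{
    unsigned rt, rd;

    if (!decode_rdhwr(insn, &rt, &rd) || rd != HWR_TLS)
        return false;
    if (rt != 0)  /* GPR $0 is hard-wired to zero */
        regs->gpr[rt] = thread_pointer;
    return true;
}
```

Under this encoding, "rdhwr $3, $29" is the word 0x7c03e83b, so a fast path keyed to the preferred GPR $3 suggested above could compare the faulting word against that single constant before falling back to the general decode.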