| Version Number | Date of Revision | Document status |
| --- | --- | --- |
| 0.1 | 2004/03/29 | First draft |
| 0.2 | 2004/04/05 | Draft |
| 0.3 | 2004/04/30 | Guidelines draft |
Currently, the runtime execution of NPTL-based multithreaded programs can only be observed and debugged with gdb, or with another debugging tool that relies on breakpoints and on reading static state. For multithreaded programs, such information is often of little use: one would rather get a trace of what happened during execution, without interfering with normal execution. It is of course possible to make a program trace itself, but one cannot get feedback from the NPTL's internal synchronization mechanisms without modifying the NPTL.
This document describes ideas for adding such a tracing mechanism to the NPTL, from the point of view of both developers of multithreaded programs and NPTL maintainers.
This document is inspired by other proposals for NPTL trace mechanisms. It tries to take every need and wish into account, while nevertheless making a few definitive decisions. It is written in the hope of getting comments and of becoming a design document before the tracing system is actually implemented, since such a system seems to be wanted, whether as part of the official branch or as a patch.
Tracing the NPTL is relevant when a multithreaded program that uses it malfunctions, or when a developer wants to check whether threads behave and synchronize as expected. Traces should therefore help check thread concurrency and detect unwanted behavior or common problems such as deadlocks. Traces should also give enough information to track down the cause of the problems that are detected. Finally, traces should help check whether the NPTL itself behaves as expected.
Requirements to achieve these goals are :
There are three entities, three parts to study for building up a NPTL tracing system :
The NPTL itself. It should be modified to generate traces and submit them to an NPTL trace handler. It should also be modified to support a way to choose which handler to use, and to pass some parameters to the tracing system (such as the trace detail level). Modifications should be as few as possible: their only purpose is to make the NPTL traceable.
The NPTL trace handler. It should be as customizable as possible; its only mandatory feature at tracing time is a function that receives generated traces. As the NPTL operates mostly in user space, the handler should not have to use kernel space features, at least while collecting traces. This is to avoid disturbing normal thread scheduling while collecting traces. Besides, most system calls must explicitly be avoided inside a POSIX thread primitive function. Enabling the handler to add its own traces to the NPTL stream might be beneficial, for instance if it stores traces in a buffer that needs to be flushed periodically, which disturbs the whole process.
The traced program. It should not have to do anything more than it already does to use the NPTL. A user might want to enable a program to add traces to the NPTL trace stream, so this should be considered.

End-users willing to get an execution trace of their multithreaded programs should put in place :
To trace a multithreaded program by dynamically linking it with a "trace-enabled" NPTL, the user should define an environment variable that indicates the trace handler to use, optional parameters for it, the trace verbosity, and possibly optional parameters for the NPTL tracing mechanism itself (reserved for future extensions only.)
The variable name : PTHREAD_NPTL_TRACE.
Of course, naming is an issue. In this name, "NPTL" is not supposed to mean the system is for use with NPTL only, but it should be merely a name, avoiding a variable PTHREAD_TRACE that could be used by another pthread tracing convention.
Variable content formatting : <handler>[:<handler_arg>(,<handler_arg>)*][:<verbosity>][:<tracing_arg>]
"Handler" is either the name of a normally accessible "libhandler.so" or an absolute or relative path to a shared object that is a handler. Handler_args are expected in a fixed order (unless the handler takes responsibility for accepting them in any order.)
Verbosity is a number indicating the level of trace detail the user wishes. The greater the value, the more detailed the trace; the minimum value is 1 and there is no maximum (to leave room for future extensions.) Each defined traceable event is assigned a detail level, and is reported during tracing if the requested verbosity is greater than or equal to that level. The detail level of an event will be indicated in that event's documentation (still to be established.)
"Tracing_arg" is left unspecified for now. Examples :
PTHREAD_NPTL_TRACE=my_handler:arg1,arg2:1
PTHREAD_NPTL_TRACE=my_handler
PTHREAD_NPTL_TRACE=/usr/local/my_handler/libmy_handler.so:arg1
PTHREAD_NPTL_TRACE=my_handler:arg1:2:nptl_arg1
PTHREAD_NPTL_TRACE=my_handler::1
PTHREAD_NPTL_TRACE=my_handler:::nptl_arg1
Thus, starting and tracing a multithreaded program should look like :
PTHREAD_NPTL_TRACE=my_handler:arg1 my_command
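The variable format described above can be parsed with a small field splitter. The following is only a sketch under stated assumptions: the names nptl_trace_parse, struct nptl_trace_config and NPTL_TRACE_MAX_ARGS are hypothetical, a missing verbosity is reported as 0 so that the caller can pick a default, and the handler path is assumed not to contain a ':' character. Empty fields, as in my_handler::1, must be preserved, which is why strtok() is not used.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define NPTL_TRACE_MAX_ARGS 8   /* hypothetical limit, not part of the proposal */

/* Hypothetical container for the parsed fields. All pointers refer to
   `buf`, so the structure must not be copied by value. */
struct nptl_trace_config {
    char  buf[256];                               /* private copy of the variable */
    char *handler;                                /* never NULL after success */
    char *handler_args[NPTL_TRACE_MAX_ARGS + 1];  /* NULL-terminated list */
    int   verbosity;                              /* 0 means "not specified" */
    char *tracing_arg;                            /* NULL when absent */
};

/* Return the field starting at *cursor and cut it at the next `sep`.
   Empty fields are preserved. *cursor becomes NULL after the last field. */
static char *next_field(char **cursor, char sep)
{
    char *start = *cursor;
    if (start == NULL)
        return NULL;
    char *end = strchr(start, sep);
    if (end != NULL) {
        *end = '\0';
        *cursor = end + 1;
    } else {
        *cursor = NULL;
    }
    return start;
}

/* Parse <handler>[:<args>][:<verbosity>][:<tracing_arg>].
   Returns 0 on success, -1 on a malformed value. */
int nptl_trace_parse(const char *value, struct nptl_trace_config *cfg)
{
    assert(value != NULL && cfg != NULL);
    if (strlen(value) >= sizeof cfg->buf)
        return -1;
    memset(cfg, 0, sizeof *cfg);
    strcpy(cfg->buf, value);

    char *cursor = cfg->buf;
    cfg->handler = next_field(&cursor, ':');
    if (cfg->handler[0] == '\0')
        return -1;                      /* a handler name is mandatory */

    char *args = next_field(&cursor, ':');
    char *verbosity = next_field(&cursor, ':');
    cfg->tracing_arg = cursor;          /* remainder of the string, or NULL */

    if (args != NULL && args[0] != '\0') {
        char *arg;
        int n = 0;
        while ((arg = next_field(&args, ',')) != NULL) {
            if (n == NPTL_TRACE_MAX_ARGS)
                return -1;
            cfg->handler_args[n++] = arg;
        }
    }
    if (verbosity != NULL && verbosity[0] != '\0') {
        char *end;
        long v = strtol(verbosity, &end, 10);
        if (*end != '\0' || v < 1 || v > INT_MAX)
            return -1;                  /* the minimum verbosity is 1 */
        cfg->verbosity = (int)v;
    }
    return 0;
}
```

In this sketch, the NPTL initialization code would call nptl_trace_parse(getenv("PTHREAD_NPTL_TRACE"), &cfg) only when the variable is set, and would treat a return value of -1 as "tracing not requested".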
It should be possible to tell whether a trace comes from the NPTL, from the trace handler or from the traced program. Traces should indicate :
A program could be traced with different kinds of handlers. For instance, a handler could :
The final output presentation and format will essentially depend on the trace handler that the user chose. However, the default handler we plan to develop will output the traces in a file. We plan to output in a format that can be read with LTT's standard visualizer, if possible. We keep in mind that LTT and this trace mechanism do not trace the same things and that visualizers do not necessarily need the same features (for instance we do not indicate which processor generated which event, and we are interested in viewing only events that deal with a given mutex.) The final decision will be taken once a few trace hooks are successfully inserted in NPTL code and a trace handler is able to output it in a trivial format.
To enable developers to trace and debug their NPTL-based multithreaded programs, the NPTL must at least provide internal hooks. These hooks can then call external, customizable handler functions, but adding hooks to the NPTL itself is the very least required to provide an NPTL tracing mechanism. Such modifications would probably be appreciated in the official NPTL tree (i.e. the official glibc tree), but they will first be provided as a patch.
Here is a proposal for a minimal modification to be done.
At NPTL initialization, a function should check and obey the PTHREAD_NPTL_TRACE environment variable, load the requested handler shared object if there is one, attach trace hooks to its functions, and call a nptl_trace_init(char**) function of the handler, passing its parameters.
Some boolean global variable of the NPTL should indicate whether an execution trace has been requested. For instance, int __nptl_trace_enabled. Some integer global variable should indicate the requested detail level, for instance int __nptl_trace_detail_level.
At NPTL termination, a function should call a nptl_trace_fini() function of the handler, then unload it.
At each point in code where a significant event occurs, a trace must be generated if its detail level is less than or equal to the requested one. It could be done by building a struct nptl_trace_event and calling a nptl_trace_handler_getevent(struct nptl_trace_event) function from the handler.
Ideally, whether a trace is generated should be decided both at compile time and at runtime. Thus, before building the structure and calling the handler function, a single test on the __nptl_trace_enabled and __nptl_trace_detail_level variables should be made. This way, the library's normal behavior should be only slightly disturbed when traces are not enabled. In addition, a preprocessor check on an ENABLE_NPTL_TRACE definition around each event generator keeps the ability to build the library with no trace support at all.
Thus, we would define some global function that looks like :
static inline int __must_generate_trace_level(int level) {
#ifdef ENABLE_NPTL_TRACE
    if (__nptl_trace_enabled)
        if (__nptl_trace_detail_level >= level)
            return 1;
    return 0;
#else
    return 0;
#endif
}
And each event occurrence in the code would look like :
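if (__must_generate_trace_level(event_detail_level)) {
    /* source, domain, event, param_size and param are placeholders */
    struct nptl_trace_event ev = { source, domain, event, param_size, param };
    (*nptl_trace_handler_getevent)(ev);
}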
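The runtime check and the hook call can be put together as follows. This is a minimal sketch, not the actual NPTL code: the globals are defined locally, the numeric identifiers, the detail level 2 and the function example_mutex_lock_hook are hypothetical, and the counting handler only exists to make the effect of the verbosity visible. Names starting with a double underscore are reserved for the implementation, which is appropriate inside glibc but not in application code.

```c
#include <assert.h>
#include <stddef.h>

#define ENABLE_NPTL_TRACE 1     /* compile-time switch from the proposal */

struct nptl_trace_event {
    char   source;
    char   domain;
    char   event;
    size_t param_size;
    void  *param;
};

/* Globals named in the proposal, defined here only for the sketch. */
int __nptl_trace_enabled;
int __nptl_trace_detail_level;
void (*nptl_trace_handler_getevent)(struct nptl_trace_event);

static inline int __must_generate_trace_level(int level)
{
#ifdef ENABLE_NPTL_TRACE
    return __nptl_trace_enabled && __nptl_trace_detail_level >= level;
#else
    (void)level;
    return 0;
#endif
}

/* Hypothetical identifiers, for illustration only. */
enum { EX_SOURCE_NPTL = 0, EX_DOMAIN_MUTEX = 1, EX_EVENT_MUTEX_LOCK = 1 };

/* What a hook in a mutex locking path might look like; the event is
   assumed to have detail level 2. */
void example_mutex_lock_hook(void *mutex)
{
    if (__must_generate_trace_level(2)) {
        assert(nptl_trace_handler_getevent != NULL);
        struct nptl_trace_event ev = {
            .source = EX_SOURCE_NPTL,
            .domain = EX_DOMAIN_MUTEX,
            .event = EX_EVENT_MUTEX_LOCK,
            .param_size = sizeof mutex,
            .param = &mutex,        /* only valid until the handler returns */
        };
        (*nptl_trace_handler_getevent)(ev);
    }
}

/* Toy handler that only counts the events it receives. */
int example_events_seen;

void example_counting_handler(struct nptl_trace_event ev)
{
    (void)ev;
    example_events_seen++;
}
```

With tracing disabled, or with a requested verbosity of 1, the hook costs one test and the handler is never called; with a verbosity of 2 or more, each call to the hook submits one event.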
An NPTL trace event structure is defined. It contains an enumerated value identifying the event source (NPTL, handler or program), two others identifying the event itself (the event domain and the event identity within that domain), and a void pointer to the memory area holding the event parameters when needed, along with the size of those parameters.
struct nptl_trace_event {
    char source;
    char domain;
    char event;
    size_t param_size;
    void* param;
};
We deliberately left out timestamps, because they are the trace handler's concern and may also be irrelevant. Instead, we plan to use an event counter that is incremented atomically.
The format of the trace parameters depends on the traced event itself, so the structure cannot be more precise. The NPTL should be allowed to free the memory holding the parameters as soon as nptl_trace_handler_getevent returns, though: it is the handler's responsibility to copy and store them.
The domain and event fields together identify an event, while keeping it reasonably simple to add events inside the NPTL. The domain identifies a function or a group of functions (for example, the functions dealing with mutexes), and the event identifies a unique event within that domain.
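Because the parameter memory may be released once the handler returns, a handler has to copy the parameters before returning. The sketch below shows one possible way, assuming a fixed-size storage slot; struct stored_event, SLOT_PARAM_MAX and store_event are hypothetical names, and rejecting oversized parameters is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct nptl_trace_event {
    char   source;
    char   domain;
    char   event;
    size_t param_size;
    void  *param;
};

#define SLOT_PARAM_MAX 32       /* hypothetical per-event parameter limit */

/* Hypothetical storage slot owning a private copy of the parameters. */
struct stored_event {
    char          source;
    char          domain;
    char          event;
    size_t        param_size;
    unsigned char param[SLOT_PARAM_MAX];
};

/* Copy `ev`, including its parameters, into `slot`.
   Returns 0 on success, -1 if the parameters do not fit. */
int store_event(struct stored_event *slot, struct nptl_trace_event ev)
{
    assert(slot != NULL);
    if (ev.param_size > SLOT_PARAM_MAX)
        return -1;
    slot->source = ev.source;
    slot->domain = ev.domain;
    slot->event = ev.event;
    slot->param_size = ev.param_size;
    if (ev.param_size > 0)
        memcpy(slot->param, ev.param, ev.param_size);
    return 0;
}
```

After store_event returns, the caller may overwrite or free the original parameter memory without affecting the stored copy.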
Thanks to a field indicating the source of each event in the event structure, supporting the possibility to add traces that are specific to the trace handler and the traced program is trivial. The trace handler may add its own traces to the stream without needing to communicate with the other two entities. The traced program will need a function to generate events and add them to the stream. As it seems a complication to make the traced program aware of the trace handler, an nptl_trace_send_user_event(struct nptl_trace_event) function should be provided to the program by the NPTL, which will report it to the trace handler.
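A possible shape for nptl_trace_send_user_event is sketched below. The enumeration values and the decision to overwrite the source field, so that a program cannot submit events that look as if they came from the NPTL, are assumptions of this example and not part of the proposal. The recording handler is only there to show what the trace handler would receive.

```c
#include <assert.h>
#include <stddef.h>

struct nptl_trace_event {
    char   source;
    char   domain;
    char   event;
    size_t param_size;
    void  *param;
};

/* Hypothetical values for the three possible event sources. */
enum nptl_trace_source {
    NPTL_TRACE_SOURCE_NPTL,
    NPTL_TRACE_SOURCE_HANDLER,
    NPTL_TRACE_SOURCE_PROGRAM
};

/* Globals named in the proposal, defined here only for the sketch. */
int __nptl_trace_enabled;
void (*nptl_trace_handler_getevent)(struct nptl_trace_event);

/* Entry point offered to the traced program: it forwards the event to
   whichever handler was loaded, so the program never needs to know it. */
void nptl_trace_send_user_event(struct nptl_trace_event ev)
{
    if (!__nptl_trace_enabled || nptl_trace_handler_getevent == NULL)
        return;                               /* tracing not requested */
    ev.source = NPTL_TRACE_SOURCE_PROGRAM;    /* assumption: always forced */
    (*nptl_trace_handler_getevent)(ev);
}

/* Toy handler remembering the last event it received. */
struct nptl_trace_event example_last_event;
int example_events_seen;

void example_recording_handler(struct nptl_trace_event ev)
{
    assert(ev.param_size == 0 || ev.param != NULL);
    example_last_event = ev;
    example_events_seen++;
}
```

The domain and event values remain entirely under the program's control; only the source is set by the NPTL side.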
The trace handler is in charge of collecting the traces generated during execution. To do that, it must implement a set of entry points that the modified NPTL calls to submit traces of occurring events. Theoretically, only this is mandatory, but a few constraints must be kept in mind to produce something usable. As the only features we plan to add inside the NPTL itself are the hooks that generate traces and the initialization of the trace mechanism, the trace handler carries all the remaining complexity.
Since the trace handler's entry points are called on thread events, they run in the calling threads, inside an NPTL primitive function. This implies that these entry points and the trace collecting mechanisms must respect a few constraints :
Low cost in time. We aim for tracing to add no more than 1% execution time on average, compared with running without tracing. Therefore, traces should not require processing by complex algorithms.
No system call. A system call forces context switches and probably severely disturbs normal scheduling.
No cancellation point. This is because a lot of POSIX thread functions explicitly must not be cancellation points, and such functions may generate traces. This constraint is another reason to avoid system calls, as a lot of them are cancellation points. It also prevents the use of POSIX thread functions inside the handler entry point, as a few of these functions are cancellation points as well.
As no system call can be made while tracing, it seems necessary to store collected traces in memory and to output them after program completion. An immediate issue with this method is the size of the buffer storing the traces : the buffer cannot be resized during execution, as resizing is at the very least time-consuming and thread-unsafe.
A proposal is to create a "big enough" circular buffer at initialization time, and to share the buffer memory with another process in charge of reading and outputting the buffer content. How exactly traces are output is where the handler becomes really customizable : they could be written to standard output in a readable format, stored in a file for later analysis, or used by some runtime analyzer tool.

How exactly the buffer is organized is an open issue.
One "big" buffer to store all traces from all processors and all threads. This eases the task of the process in charge of reading and outputting the buffer, since the order in which events occurred is preserved. Adding a trace to the buffer may raise atomicity problems, though.
For each thread, one buffer to store its traces. This avoids the atomicity problems, but raises a few others. One cannot guess at initialization time how many threads will be created, and theoretically there can be a large number of them. It is not acceptable to allocate buffer memory for each "possibly existing" thread of a program. It is possible, and seems reasonable, to allocate a buffer when a "thread creation" event occurs. This disturbs normal execution, but only at thread creation, which is acceptable in some cases.
RelayFS already implements a lockless, thread-safe, in-memory storage algorithm, and it is claimed that making it run in user space without any system call is fairly doable. We have not studied this method much, but we plan to do so when we try to implement a smarter trace handler.
To make it possible to start implementing the NPTL's trace hooks and initialization, as well as the trace handler's entry points, we will implement a trivial trace handler that simply reserves a big buffer and stops storing new traces when it is full. It will still write an output file in an obvious format at program termination, so that at least the beginning of the execution can be traced. Of course, this method is only meant for preliminary debugging.
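The trivial handler described above could look like the following sketch. It reserves each slot with a single atomic increment, which is one possible answer to the atomicity concern of a shared buffer, and it simply counts the events that arrive once the buffer is full. All names prefixed with trivial_ and the tiny capacity are hypothetical, the output format is arbitrary, and copying of event parameters is left out to keep the example short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct nptl_trace_event {
    char   source;
    char   domain;
    char   event;
    size_t param_size;
    void  *param;
};

#define TRIVIAL_CAPACITY 4      /* tiny on purpose; a real buffer would be large */

struct trivial_record {
    char source;
    char domain;
    char event;
};

static struct trivial_record trivial_buffer[TRIVIAL_CAPACITY];
static atomic_size_t trivial_next;      /* index of the next free slot */
static atomic_size_t trivial_dropped;   /* events lost once the buffer is full */

/* Entry point called by the NPTL hooks: no lock, no system call. */
void nptl_trace_handler_getevent(struct nptl_trace_event ev)
{
    size_t slot = atomic_fetch_add_explicit(&trivial_next, 1,
                                            memory_order_relaxed);
    if (slot >= TRIVIAL_CAPACITY) {     /* full: stop storing new traces */
        atomic_fetch_add_explicit(&trivial_dropped, 1, memory_order_relaxed);
        return;
    }
    trivial_buffer[slot].source = ev.source;
    trivial_buffer[slot].domain = ev.domain;
    trivial_buffer[slot].event = ev.event;
}

/* Number of records actually stored. */
size_t trivial_stored(void)
{
    size_t n = atomic_load(&trivial_next);
    return n < TRIVIAL_CAPACITY ? n : TRIVIAL_CAPACITY;
}

/* Write the buffer in a plain text format. Meant to be called at program
   termination, once no thread generates traces any more. */
size_t trivial_dump(FILE *out)
{
    assert(out != NULL);
    size_t n = trivial_stored();
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%zu source=%d domain=%d event=%d\n", i,
                trivial_buffer[i].source, trivial_buffer[i].domain,
                trivial_buffer[i].event);
    return n;
}
```

In this sketch, a handler's nptl_trace_fini() function would open the output file and call trivial_dump on it; system calls are acceptable at that point because tracing is over.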
Once traces have been generated by the NPTL, submitted to the trace handler, stored in a buffer and read back by a parallel process, they need to be output somehow. It is the trace handler's responsibility to decide how to output traces, but we can list a few methods and directions.
"Trace file for later analysis" vs "live analysis." The trace handler's parallel process may store collected traces on disk in a given file, or it may directly present them in a useful manner. The latter seems hardly convenient : the process may need to compute too much while reading the buffer, and the user cannot study the result later. Therefore, we will rather study an approach based on generating a "trace file", whose format depends on the trace handler's preference.
Output type. It is possible to write collected traces to the standard output, but this would be little more than a toy, or a way to test the whole trace mechanism. More reasonably, we need a tool that lets the user "browse" the collected traces. For instance, the user might want to see all traces, or only traces from a given thread, or only traces reporting events of a given type, or only traces related to a given mutex... To ease such browsing, a GUI based on interpreting a "trace file" would probably be ideal.
LTT's visualizer compliance. As stated earlier, we want to study whether collected NPTL traces can be displayed by the Linux Trace Toolkit visualizer. This may require changes in the visualizer, or it may work out of the box thanks to LTT's user-defined traces feature. For now, priorities and studies lie elsewhere.
| Event name | Parameters |
| --- | --- |
| NEW_THREAD_CREATED | void* address of start routine, pthread_t created thread id |
| THREAD_BEGIN | none |
| THREAD_END | void* pthread_exit() returned value (not so useful, but might help in some cases) |
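The parameters listed in the table could be carried through the param and param_size fields of the event structure, for instance as small per-event payload structures. In the sketch below, only the event names come from the table; their numeric values, the domain identifier, the source value 0, the payload structure names and the helper make_thread_event are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct nptl_trace_event {
    char   source;
    char   domain;
    char   event;
    size_t param_size;
    void  *param;
};

/* Hypothetical numbering of the thread domain and of its events. */
enum { NPTL_DOMAIN_THREAD = 0 };
enum nptl_thread_event { NEW_THREAD_CREATED, THREAD_BEGIN, THREAD_END };

/* Payload for NEW_THREAD_CREATED, following the table. */
struct new_thread_created_params {
    void     *start_routine;    /* address of the start routine */
    pthread_t created;          /* id of the created thread */
};

/* Payload for THREAD_END; THREAD_BEGIN carries no parameters. */
struct thread_end_params {
    void *retval;               /* value returned through pthread_exit() */
};

/* Build an event of the thread domain. Events without parameters pass
   NULL and 0. The payload must stay valid until the handler returns. */
struct nptl_trace_event
make_thread_event(enum nptl_thread_event id, void *params, size_t size)
{
    assert((params == NULL) == (size == 0));
    struct nptl_trace_event ev = {
        .source = 0,            /* assumed to mean "generated by the NPTL" */
        .domain = NPTL_DOMAIN_THREAD,
        .event = (char)id,
        .param_size = size,
        .param = params,
    };
    return ev;
}
```

A handler that knows this layout can recover the payload by checking the domain and event fields, then casting param to the matching structure.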