 |
nptl.BullOpenSource.Org NPTL Tests & Trace project. Homepage.
|
| View previous topic :: View next topic |
| Author |
Message |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Tue Aug 31, 2004 1:48 pm Post subject: SCHED_RR scheduling policy issue |
|
|
Update 09/28
This post is dedicated to a linux kernel scheduler problem.
This problem was first spotted with the sample described bellow (sched_rr.c) and then isolated by Loïc with its (test_schedrr.c) sample.
The issue is the following: with a recent kernel (2.6.x), it looks like the SCHED_RR scheduling policy won't behave as expected.
The basics of the test cases is to create a thread with a high priority, and some other threads with lower priority. The expected behavior should be that the high priority thread executes all the time, and the lower priority threads share the remaining CPUs on a time sharing basis. What we can see is that some of the low-priority threads are never executed, or either are scheduled only once.
The rest of this forum thread is the history of those two samples.
End of update
Here is a sample to demonstrate a possible issue in the linux scheduler.
The internal is:
-> Main creates {Thread1} with SCHED_RR and max priority (99)
-> {Thread1} creates N * {thread2} with SCHED_RR and lesser priority (9
-> {Thread1} creates {thread3} with SCHED_RR and min priority (0)
-> {Thread1} does some dummy task.
-> {Thread2} does some dummy task.
-> {Thread3} does some dummy task.
If N+1 is >= the # of CPU, thread 3 is expected not to run. In any case, thread1 is expected to run until completion.
What we can see:
-> If N is big (ex: 100 on a 2xCPU machine), thread 1 won't execute.
-> If thread1 acquires some mutex (even without contension) thread 3 will execute.
Both issues are out-of-spec according to the POSIX standard.
| Description: |
| This is the version 2 of Loïc's test. See mailing list archive for details and internals. |
|
 Download |
| Filename: |
test_schedrr.c.gz |
| Filesize: |
4 KB |
| Downloaded: |
2088 Time(s) |
| Description: |
V0.3---2004/09/08
(*)Semaphore to make sure HP started
V0.2---2004/09/07
(*) removed the for(i...); loop
(*)Sched policy is now in a macro
V0.1---- 2004/08/31
Original version |
|
 Download |
| Filename: |
sched_rr.c.gz |
| Filesize: |
3.13 KB |
| Downloaded: |
2109 Time(s) |
Last edited by tHeDoc on Tue Sep 28, 2004 1:54 pm; edited 3 times in total |
|
| Back to top |
|
 |
PthreadsIsFun Active contributor

Joined: 31 Aug 2004 Posts: 5
|
Posted: Tue Aug 31, 2004 11:03 pm Post subject: |
|
|
Hi,
For one thing: memory/cache coherency might be a big issue on SMP machines. It might happen (on machines using write back policy, which is often the case) that ctrl is only stored to the CPU local cache running the thread threaded. Possibly the other
hp_func threads running on other CPUs may work only with a copy of ctrl (i.e. 0) and might miss the new value ctrl=1.
That would explain that you need that SIGARLM signal in order to stop your process...
I shall devise a program that check that issue...
I believe, if we want to make a clean test, we might need a read-write lock or something to ensure the coherency of ctrl on all threads. read-write lock is better than mutex, because from my understanding the contention should occur - if ever - only at two
points:
- if lp_func thread runs, or
- if we're done with the tests.
Oder?
Just my 2 Cents,
Loic Domaigné.
|
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Wed Sep 01, 2004 10:11 am Post subject: |
|
|
Hmm I thought the cache problem was something possible but which did not happen
Well, I'll experiment with a rw-lock (or is the asm("lock"); possible?) but this might lead to another issue: nothing garanties that a writer will get the lock as long as readers are around. (I thought at first that when some thread tries to get the write lock, any other thread asking for the read lock would be served after the writer, but when testing it appears that the readers will get it and the writer will keep on waiting... and never acquire the lock :/ )
Well , I think I'll have to experiment more...
Thanks!
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
PthreadsIsFun Active contributor

Joined: 31 Aug 2004 Posts: 5
|
Posted: Wed Sep 01, 2004 6:03 pm Post subject: |
|
|
| Quote: |
Hmm I thought the cache problem was something possible but which did not happen
|
So I did some tests... Looks like that we can throw the "cache coherency theory" away...
I must admit, cache issues on SMP isn't really my area of expertise
But I found an answer to the following:
| Quote: |
What we can see:
-> If N is big (ex: 100 on a 2xCPU machine), thread 1 won't execute.
|
The threads hp_func are pure compute bound. Furthermore, you must loop at least one time in order to see that ctrl is not 0 anymore. On my machine, one round takes about 7.7s...
If I would start 100 hp_func threads then I will only run ~ 0.1s each (assuming that all got started and got the CPU equally, which is another issue) on my 2xCPUs box. Because threaded monopolizes one CPU during DURATION seconds long (10s).
After DURATION, I got one CPU more (threaded is joining the threads). However I still need ~ (7.7*100-10)/2 = 380 seconds to complete... SIGALRM occurs after 30s, so no wonder that lp_func didn't get a chance to run...
That's explain what you are seeing: if you double N, you approximatively double the time needed to complete....
Isn't that exactly the defintion of pure compute bound thread, actually?
How long does one round of the while() loop in the hp_func last on your machine?
Furthermore, an other issue of your code: Does it assume that the thread starts as soon as pthread_create() return? Pthreads doesn't guarantee anything in that direction...
I have attached some tracing of your program run on my 2xCPU box. This illustrated
what I am explaining here. You shall also find a patch to sched_rr.c in order to get those traces.
... But the first question should have been: what are you trying to achieve with this test?
Cheers,
Loic.
| Description: |
| some sample of traces that explains the behavior of the sched_rr program |
|
 Download |
| Filename: |
sched_rr.explanation.tgz |
| Filesize: |
2.37 KB |
| Downloaded: |
2068 Time(s) |
|
|
| Back to top |
|
 |
PthreadsIsFun Active contributor

Joined: 31 Aug 2004 Posts: 5
|
Posted: Thu Sep 02, 2004 8:49 am Post subject: |
|
|
| Quote: |
Furthermore, an other issue of your code: Does it assume that the thread starts as soon as pthread_create() return? Pthreads doesn't guarantee anything in that direction...
|
I doubt about that part:
| Code: |
for (i=0; i<NCPU; i++)
{
ret = pthread_create(&hpth[i], &ta, hp_func, &ctrl);
if (ret != 0) { UNRESOLVED(ret, "Failed to create enough threads"); }
}
/* Create the low-priority thread */
sp.sched_priority = sched_get_priority_min(SCHED_RR);
ret = pthread_attr_setschedparam(&ta, &sp);
ret = pthread_create(&lpth, &ta, lp_func, &ctrl);
|
From my understanding, we don't know at which point of times the X_func threads shall be executed. When pthread_create() returns, it is merely a confirmation that "your request has been successfully received, and that every information (thread ID, attribute etc.) needed for thread creation have been saved".
What I want to say: using default SCHED_OTHER scheduling policy, it might be well that lp_func executes prior to the hp_func ones...
Of course, once one KSE bound to hp_func exists, then scheduling must follow POSIX/SUS. There might be an gap between the time at which pthread_create() returns and the time at which the corresponding KSE exists from a kernel viewpoint. During that gap, lp_func might start (theoritically, at least).
Traces above would tend to confirm that? From my experiments, each time that the test failed, none of the hp_func thread have been started.
I Do not know if POSIX/SUS says something in that direction... I have to check that.
Could you reproduce that behavior that lp_func threads runs on other OS / Pthreads?
Or perhaps all of this I just again a stupid theory from me
Cheers,
Loic.
| Description: |
| Trace showing that lp_func executes prior to any hp_func threads. |
|
 Download |
| Filename: |
trace.txt.gz |
| Filesize: |
297 Bytes |
| Downloaded: |
1945 Time(s) |
|
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Thu Sep 02, 2004 5:31 pm Post subject: |
|
|
Sorry for the late answer, I've just had to change computer :/ I'll have a look at what you describe here.
At first sight, it looks like I made a mistake in the hp_func, as the loop time was meant to be negligible. Did I put some kind of "for (i=0 to 100000000) noop;" ? That was a mistake it should be removed!
What I was trying to test actually was that the thread really execute with the SCHED_RR scheduling policy -- so that a low-priority thread won't execute as long as high-priority are around and geopardizing the CPUs.
I'll look at your code as soon as I can and give you a status.
Thanks!
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Mon Sep 06, 2004 1:42 pm Post subject: |
|
|
Hi again!
Well, for the loop duration you're 100% right, I added a noop loop but thas was a mistake; and the line shall be removed from the testcase. This way, a loop inside an HP thread will be very quick.
For the trace addition, the problem is that it can change the behavior, as the trace will use some synchronization (either the mutex in the output routine, or some hidden lock for the terminal output). That is why there is so few traces in the original sample.
For the delayed thread creation, I don't think you're right. IMHO, as soon as pthread_create() returns, the thread shall be at least in the "runnable" state (and possibly have already started or even terminated). Otherwise, the SUS would say something like "when the function returns, the news thread shall be scheduled for creation" but it tells: "shall be created".
Then I agree that with SCHED_OTHER sched policy, the low-priority could perfectly run before the first high-priority thread. But, when you have both HP and LP threads in the runnable state, and everybody has the SCHED_RR or SCHED_FIFO policy, the LP shall not run if one of the HP is not running. (That's what is tested in this test).
The only two possibilities remaining are:
-> either you "cache coherence theory" is right -- but it seems you were telling it's not
-> or the Linux scheduler is wrong.
The problem is that I have the same behavior with Aix5L and Solaris -- hard to believe they're both broken!
Well... it comes to my mind that the SUS is broken It's the best explanation, what do you think ?
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Tue Sep 07, 2004 8:51 am Post subject: |
|
|
I just updated the sample in the OP, and removed the time-consuming loop. I'll have to test again on Solaris and Aix5L as soon as possible.
I also added the possibility to switch to SCHED_FIFO to make comparison.
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Wed Sep 08, 2004 9:40 am Post subject: |
|
|
A little follow-up from the mailing list:
| Loic wrote: |
Hello,
Well. Suprisingly enough, the BUGs weren't that hard to find out... And so I
have even a few minutes left during launch's break to newsgroup's a bit...
Surely something *really* bad on the way this afternoon
Let's see...
| Quote: |
For the trace addition, the problem is that it can change the behavior, as
the trace will use some synchronization (either the mutex in the output
routine, or some hidden lock for the terminal output). That is why there is
so few traces in the original sample.
|
I know that issue. This is why traces should be thread-bound, not process
bound. Otherwise one introduces "artificial" synchronization points -- which
sometimes makes the program works, but that's another story
| Quote: |
For the delayed thread creation, I don't think you're right.
|
Unfortunately *I AM*
| Quote: |
IMHO, as soon as pthread_create() returns, the thread shall be at least in
the "runnable" state (and possibly have already started or even terminated).
|
AFAIK, there is no reference anywhere in SUS about that fact!
| Quote: |
Otherwise, the SUS would say something like "when the function returns, the
news thread shall be scheduled for creation" but it tells: "shall be
created".
|
SUS doesn't distinguish between thread creation and thread execution. An ADA
guys might see this as a mistake... But anyway...
That's the way it is. See for instance the rational of the
pthread_create() XSH:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_create.html
<copy>
A suggested alternative to pthread_create() would be to define two
separate operations: create and start. Some applications would find such
behavior more natural. Ada, in particular, separates the "creation" of a
task from its "activation".
...
...
</copy>
| Quote: |
Then I agree that with SCHED_OTHER sched policy, the low-priority could
perfectly run before the first high-priority thread. But, when you have both
HP and LP threads in the runnable state, and everybody has the SCHED_RR or
SCHED_FIFO policy, the LP shall not run if one of the HP is not running.
(That's what is tested in this test).
|
It might happen for instance that both HP threads have a page fault for
an short period of time. That might be enough for LP to run...
The behavior you are describing above is "undefined"/"unspecified".
|
| Steve wrote: |
} I'd like to take the opportunity of having of RT-expert....
} Suppose I have a code like the one below on a SMP machine (say 2xCPUs).
} All threads are using a SCHED_RR policy.
Aah, I just loooove SMP plus priorities. It's a tough area to get right.
} main therad running at prio x+1
} - create 2 threads running prio x
} - create 1 thread running at prio x-1
}
} Accordingly to POSIX/SUS, /pthread_create()/ is asynchronous. There is
} no relationship between /pthread_create()/ ordering and the scheduling
} on the CPU of the underlying KSE (Kernel Schedulable Entity).
Yup.
} From my understanding, is it possible that the thread with prio x-1
} starts to run prior to the two ones with prio x? ( Of course, at soon as
} one thread of prio x exists from a kernel perspective, it should preempt
} the thread of prio x-1... )
}
} Is my understanding correct? Or does POSIX/SUS specify the behavior?
} Something like:
}
} "threads runs accordingly to the scheduling policy/prio given to
} pthread_create()".
Based on my reading, it's indeterminate, as far as POSIX is concerned.
Depending on the details of how the scheduler works, it's possible that
one of the new mid-prio threads would kick something else off the other
processor, take, say, a stack fault, and then the lower prio thread
might get some cycles before the high priority ones.
A real system shouldn't care about something like this, because it's
hard (impossible?) to detect without violating memory visibility.
|
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Wed Sep 08, 2004 9:49 am Post subject: |
|
|
I just updated the sample in the 1st post of this thread (now is V0.3) to take into account the new comments.
- Removed this stupid loop from the hp_func, so now 1 loop is only ~10 cpu cycle.
- Added a semaphore to make sure the hp threads are running before we even create the lp thread.
I get exactly the same strange issues as before:
- If NCPU is small (ex: NCPU=6 on a 2-way CPU box), the test will FAIL. This means the LP thread executes while some HP threads should be monopolizing the CPUs.
- If NCPU is big (ex: NCPU=128 on the same 2x CPU), the test will TIMEOUT. This means the highest priority thread will not execute anymore -- or that we have a memory visibility issue, but this is not likely to happen (??).
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Thu Sep 09, 2004 2:06 pm Post subject: |
|
|
Just an addition concerning the hang when the test is executed as a user.
It appears that pthread_create will:
-> clone and create the new thread. The new thread waits on some futex.
-> try to change the new thread' scheduling. This fails.
-> cancel the new thread (sends a signal).
-> unlock or destroy (I did not check) the thread futex, data, etc....
-> return an error.
But the new thread apparently can work for some time before receiving the signal. In my case, it was able to lock the trace mutex. So, when pthread_create returns, it hangs in the output() routine.
I've uploaded a new sample which show this behavior into bugzilla here.
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
PthreadsIsFun Active contributor

Joined: 31 Aug 2004 Posts: 5
|
|
| Back to top |
|
 |
tHeDoc Project Maintener

Joined: 21 Apr 2004 Posts: 237 Location: Echirolles, France
|
Posted: Thu Sep 16, 2004 3:34 pm Post subject: |
|
|
Here are a few more results with this same testcase:
(*) 2x IA32 (Xeon); Linux kernel 2.6.8.1 (kernel.org, no patch); FC2
-> Same results as you got: as soon as nthreads >= #CPU, the test sample hangs from time to time, and the more threads, the more likely it hangs.
(*) 8x IA32 (Xeon); Linux kernel 2.6.6-1.406smp; FC2
-> same result: with nthreads >= 8, we have hangs (harder to reproduce than with 2x cpu)
(*) 2x IA32; Solaris 10:
-> Everything is OK ("test 1024 50" works fine)
(*) 4x PPC; Aix5.3:
-> Everything is OK
(*) AIX5.2
-> Unable to test: I got minprio = maxprio = -1.
(*) 4x IA64; Linux 2.6.7 Bull Advanced Server (based on RHAS)
-> Everything is OK.
(*) 2x IA64; Linux 2.4.21 RHEL 3 - U1
-> Everything is OK.
A first conclusion is that the problem is either related to IA32 + 2.6x kernels, either to FC2 distribution. It does not depend on: the # of CPU, the exact release of the kernel, the gcc release, the test "priority" parameter.
_________________ Sebastien Decugis. |
|
| Back to top |
|
 |
PthreadsIsFun Active contributor

Joined: 31 Aug 2004 Posts: 5
|
Posted: Sat Oct 16, 2004 7:48 pm Post subject: |
|
|
Good Evening,
This is the latest version of my test.
It should work on both NPTL and LinuxThreads.
Functionally, it is the same than the version "3" I have posted on the
NPTL Bull mailing list, with some slight improvements (mainly, in the comments).
Testers, please use this tests.
Thanks,
Loic.
| Description: |
| Latest Loic's test for the SCHED_RR issue. |
|
 Download |
| Filename: |
test_schedrr3.c.gz |
| Filesize: |
4.4 KB |
| Downloaded: |
2028 Time(s) |
|
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum You cannot attach files in this forum You can download files in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|