nptl.BullOpenSource.Org Forum Index nptl.BullOpenSource.Org
NPTL Tests & Trace project. Homepage.
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

SCHED_RR scheduling policy issue

 
Post new topic   Reply to topic    nptl.BullOpenSource.Org Forum Index -> Tests - General
View previous topic :: View next topic  
Author Message
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Tue Aug 31, 2004 1:48 pm    Post subject: SCHED_RR scheduling policy issue Reply with quote

Update 09/28
This post is dedicated to a linux kernel scheduler problem.

This problem was first spotted with the sample described bellow (sched_rr.c) and then isolated by Loïc with its (test_schedrr.c) sample.

The issue is the following: with a recent kernel (2.6.x), it looks like the SCHED_RR scheduling policy won't behave as expected.

The basics of the test cases is to create a thread with a high priority, and some other threads with lower priority. The expected behavior should be that the high priority thread executes all the time, and the lower priority threads share the remaining CPUs on a time sharing basis. What we can see is that some of the low-priority threads are never executed, or either are scheduled only once.

The rest of this forum thread is the history of those two samples.

End of update




Here is a sample to demonstrate a possible issue in the linux scheduler.

The internal is:

-> Main creates {Thread1} with SCHED_RR and max priority (99)
-> {Thread1} creates N * {thread2} with SCHED_RR and lesser priority (9Cool
-> {Thread1} creates {thread3} with SCHED_RR and min priority (0)
-> {Thread1} does some dummy task.
-> {Thread2} does some dummy task.
-> {Thread3} does some dummy task.

If N+1 is >= the # of CPU, thread 3 is expected not to run. In any case, thread1 is expected to run until completion.

What we can see:
-> If N is big (ex: 100 on a 2xCPU machine), thread 1 won't execute.
-> If thread1 acquires some mutex (even without contension) thread 3 will execute.

Both issues are out-of-spec according to the POSIX standard.



test_schedrr.c.gz
 Description:
This is the version 2 of Loïc's test. See mailing list archive for details and internals.

Download
 Filename:  test_schedrr.c.gz
 Filesize:  4 KB
 Downloaded:  2088 Time(s)


sched_rr.c.gz
 Description:
V0.3---2004/09/08
(*)Semaphore to make sure HP started

V0.2---2004/09/07
(*) removed the for(i...); loop
(*)Sched policy is now in a macro

V0.1---- 2004/08/31
Original version

Download
 Filename:  sched_rr.c.gz
 Filesize:  3.13 KB
 Downloaded:  2109 Time(s)



Last edited by tHeDoc on Tue Sep 28, 2004 1:54 pm; edited 3 times in total
Back to top
View user's profile Send private message Send e-mail Visit poster's website
PthreadsIsFun
Active contributor


Joined: 31 Aug 2004
Posts: 5

PostPosted: Tue Aug 31, 2004 11:03 pm    Post subject: Reply with quote

Hi,

Exclamation For one thing: memory/cache coherency might be a big issue on SMP machines. It might happen (on machines using write back policy, which is often the case) that ctrl is only stored to the CPU local cache running the thread threaded. Possibly the other
hp_func threads running on other CPUs may work only with a copy of ctrl (i.e. 0) and might miss the new value ctrl=1.

That would explain that you need that SIGARLM signal in order to stop your process...
I shall devise a program that check that issue...

I believe, if we want to make a clean test, we might need a read-write lock or something to ensure the coherency of ctrl on all threads. read-write lock is better than mutex, because from my understanding the contention should occur - if ever - only at two
points:

- if lp_func thread runs, or
- if we're done with the tests.

Oder?

Just my 2 Cents,
Loic Domaigné.
Back to top
View user's profile Send private message
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Wed Sep 01, 2004 10:11 am    Post subject: Reply with quote

Hmm I thought the cache problem was something possible but which did not happen Whistle

Well, I'll experiment with a rw-lock (or is the asm("lock"); possible?) but this might lead to another issue: nothing garanties that a writer will get the lock as long as readers are around. (I thought at first that when some thread tries to get the write lock, any other thread asking for the read lock would be served after the writer, but when testing it appears that the readers will get it and the writer will keep on waiting... and never acquire the lock :/ )

Well Think , I think I'll have to experiment more...

Thanks!

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
PthreadsIsFun
Active contributor


Joined: 31 Aug 2004
Posts: 5

PostPosted: Wed Sep 01, 2004 6:03 pm    Post subject: Reply with quote

Quote:

Hmm I thought the cache problem was something possible but which did not happen

So I did some tests... Looks like that we can throw the "cache coherency theory" away...
I must admit, cache issues on SMP isn't really my area of expertise Anxious

But I found an answer to the following:
Quote:

What we can see:
-> If N is big (ex: 100 on a 2xCPU machine), thread 1 won't execute.

The threads hp_func are pure compute bound. Furthermore, you must loop at least one time in order to see that ctrl is not 0 anymore. On my machine, one round takes about 7.7s...

If I would start 100 hp_func threads then I will only run ~ 0.1s each (assuming that all got started and got the CPU equally, which is another issue) on my 2xCPUs box. Because threaded monopolizes one CPU during DURATION seconds long (10s).

After DURATION, I got one CPU more (threaded is joining the threads). However I still need ~ (7.7*100-10)/2 = 380 seconds to complete... SIGALRM occurs after 30s, so no wonder that lp_func didn't get a chance to run...

That's explain what you are seeing: if you double N, you approximatively double the time needed to complete....
Isn't that exactly the defintion of pure compute bound thread, actually?
Dancing

How long does one round of the while() loop in the hp_func last on your machine?

Furthermore, an other issue of your code: Does it assume that the thread starts as soon as pthread_create() return? Pthreads doesn't guarantee anything in that direction...

I have attached some tracing of your program run on my 2xCPU box. This illustrated
what I am explaining here. You shall also find a patch to sched_rr.c in order to get those traces.

... But the first question should have been: what are you trying to achieve with this test?

Cheers,
Loic.



sched_rr.explanation.tgz
 Description:
some sample of traces that explains the behavior of the sched_rr program

Download
 Filename:  sched_rr.explanation.tgz
 Filesize:  2.37 KB
 Downloaded:  2068 Time(s)

Back to top
View user's profile Send private message
PthreadsIsFun
Active contributor


Joined: 31 Aug 2004
Posts: 5

PostPosted: Thu Sep 02, 2004 8:49 am    Post subject: Reply with quote

Quote:

Furthermore, an other issue of your code: Does it assume that the thread starts as soon as pthread_create() return? Pthreads doesn't guarantee anything in that direction...


I doubt about that part:
Code:

   for (i=0; i<NCPU; i++)
   {
      ret = pthread_create(&hpth[i], &ta, hp_func, &ctrl);
      if (ret != 0)  {  UNRESOLVED(ret, "Failed to create enough threads");  }
   }
   /* Create the low-priority thread */
   sp.sched_priority = sched_get_priority_min(SCHED_RR);
   ret = pthread_attr_setschedparam(&ta, &sp);
   ret = pthread_create(&lpth, &ta, lp_func, &ctrl);

From my understanding, we don't know at which point of times the X_func threads shall be executed. When pthread_create() returns, it is merely a confirmation that "your request has been successfully received, and that every information (thread ID, attribute etc.) needed for thread creation have been saved".

What I want to say: using default SCHED_OTHER scheduling policy, it might be well that lp_func executes prior to the hp_func ones...

Of course, once one KSE bound to hp_func exists, then scheduling must follow POSIX/SUS. There might be an gap between the time at which pthread_create() returns and the time at which the corresponding KSE exists from a kernel viewpoint. During that gap, lp_func might start (theoritically, at least).

Question Traces above would tend to confirm that? From my experiments, each time that the test failed, none of the hp_func thread have been started. Question

I Do not know if POSIX/SUS says something in that direction... I have to check that.
Could you reproduce that behavior that lp_func threads runs on other OS / Pthreads?

Or perhaps all of this I just again a stupid theory from me Wink

Cheers,
Loic.



trace.txt.gz
 Description:
Trace showing that lp_func executes prior to any hp_func threads.

Download
 Filename:  trace.txt.gz
 Filesize:  297 Bytes
 Downloaded:  1945 Time(s)

Back to top
View user's profile Send private message
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Thu Sep 02, 2004 5:31 pm    Post subject: Reply with quote

Sorry for the late answer, I've just had to change computer :/ I'll have a look at what you describe here.

At first sight, it looks like I made a mistake in the hp_func, as the loop time was meant to be negligible. Did I put some kind of "for (i=0 to 100000000) noop;" ? That was a mistake Whistle it should be removed!

What I was trying to test actually was that the thread really execute with the SCHED_RR scheduling policy -- so that a low-priority thread won't execute as long as high-priority are around and geopardizing the CPUs.

I'll look at your code as soon as I can and give you a status.

Thanks!

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Mon Sep 06, 2004 1:42 pm    Post subject: Reply with quote

Hi again!

Well, for the loop duration you're 100% right, I added a noop loop but thas was a mistake; and the line shall be removed from the testcase. This way, a loop inside an HP thread will be very quick.

For the trace addition, the problem is that it can change the behavior, as the trace will use some synchronization (either the mutex in the output routine, or some hidden lock for the terminal output). That is why there is so few traces in the original sample.

For the delayed thread creation, I don't think you're right. IMHO, as soon as pthread_create() returns, the thread shall be at least in the "runnable" state (and possibly have already started or even terminated). Otherwise, the SUS would say something like "when the function returns, the news thread shall be scheduled for creation" but it tells: "shall be created".

Then I agree that with SCHED_OTHER sched policy, the low-priority could perfectly run before the first high-priority thread. But, when you have both HP and LP threads in the runnable state, and everybody has the SCHED_RR or SCHED_FIFO policy, the LP shall not run if one of the HP is not running. (That's what is tested in this test).

The only two possibilities remaining are:
-> either you "cache coherence theory" is right -- but it seems you were telling it's not
-> or the Linux scheduler is wrong.

The problem is that I have the same behavior with Aix5L and Solaris -- hard to believe they're both broken!

Well... it comes to my mind that the SUS is broken Not talking It's the best explanation, what do you think ? Very Happy

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Tue Sep 07, 2004 8:51 am    Post subject: Reply with quote

I just updated the sample in the OP, and removed the time-consuming loop. I'll have to test again on Solaris and Aix5L as soon as possible.

I also added the possibility to switch to SCHED_FIFO to make comparison.

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Wed Sep 08, 2004 9:40 am    Post subject: Reply with quote

A little follow-up from the mailing list:
Loic wrote:

Hello,

Well. Suprisingly enough, the BUGs weren't that hard to find out... And so I
have even a few minutes left during launch's break to newsgroup's a bit...
Surely something *really* bad on the way this afternoon Wink

Let's see...

Quote:

For the trace addition, the problem is that it can change the behavior, as
the trace will use some synchronization (either the mutex in the output
routine, or some hidden lock for the terminal output). That is why there is
so few traces in the original sample.


I know that issue. This is why traces should be thread-bound, not process
bound. Otherwise one introduces "artificial" synchronization points -- which
sometimes makes the program works, but that's another story Wink


Quote:

For the delayed thread creation, I don't think you're right.


Unfortunately *I AM*


Quote:

IMHO, as soon as pthread_create() returns, the thread shall be at least in
the "runnable" state (and possibly have already started or even terminated).



AFAIK, there is no reference anywhere in SUS about that fact!


Quote:

Otherwise, the SUS would say something like "when the function returns, the
news thread shall be scheduled for creation" but it tells: "shall be
created".


SUS doesn't distinguish between thread creation and thread execution. An ADA
guys might see this as a mistake... But anyway...
That's the way it is. See for instance the rational of the
pthread_create() XSH:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_create.html

<copy>
A suggested alternative to pthread_create() would be to define two
separate operations: create and start. Some applications would find such
behavior more natural. Ada, in particular, separates the "creation" of a
task from its "activation".
...
...
</copy>

Quote:

Then I agree that with SCHED_OTHER sched policy, the low-priority could
perfectly run before the first high-priority thread. But, when you have both
HP and LP threads in the runnable state, and everybody has the SCHED_RR or
SCHED_FIFO policy, the LP shall not run if one of the HP is not running.
(That's what is tested in this test).


It might happen for instance that both HP threads have a page fault for
an short period of time. That might be enough for LP to run...

The behavior you are describing above is "undefined"/"unspecified".


Steve wrote:

} I'd like to take the opportunity of having of RT-expert....
} Suppose I have a code like the one below on a SMP machine (say 2xCPUs).
} All threads are using a SCHED_RR policy.

Aah, I just loooove SMP plus priorities. It's a tough area to get right.

} main therad running at prio x+1
} - create 2 threads running prio x
} - create 1 thread running at prio x-1
}
} Accordingly to POSIX/SUS, /pthread_create()/ is asynchronous. There is
} no relationship between /pthread_create()/ ordering and the scheduling
} on the CPU of the underlying KSE (Kernel Schedulable Entity).

Yup.

} From my understanding, is it possible that the thread with prio x-1
} starts to run prior to the two ones with prio x? ( Of course, at soon as
} one thread of prio x exists from a kernel perspective, it should preempt
} the thread of prio x-1... )
}
} Is my understanding correct? Or does POSIX/SUS specify the behavior?
} Something like:
}
} "threads runs accordingly to the scheduling policy/prio given to
} pthread_create()".

Based on my reading, it's indeterminate, as far as POSIX is concerned.
Depending on the details of how the scheduler works, it's possible that
one of the new mid-prio threads would kick something else off the other
processor, take, say, a stack fault, and then the lower prio thread
might get some cycles before the high priority ones.

A real system shouldn't care about something like this, because it's
hard (impossible?) to detect without violating memory visibility.

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Wed Sep 08, 2004 9:49 am    Post subject: Reply with quote

I just updated the sample in the 1st post of this thread (now is V0.3) to take into account the new comments.

  • Removed this stupid loop from the hp_func, so now 1 loop is only ~10 cpu cycle.
  • Added a semaphore to make sure the hp threads are running before we even create the lp thread.

I get exactly the same strange issues as before:
  1. If NCPU is small (ex: NCPU=6 on a 2-way CPU box), the test will FAIL. This means the LP thread executes while some HP threads should be monopolizing the CPUs.
  2. If NCPU is big (ex: NCPU=128 on the same 2x CPU), the test will TIMEOUT. This means the highest priority thread will not execute anymore -- or that we have a memory visibility issue, but this is not likely to happen (??).

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Thu Sep 09, 2004 2:06 pm    Post subject: Reply with quote

Just an addition concerning the hang when the test is executed as a user.

It appears that pthread_create will:
-> clone and create the new thread. The new thread waits on some futex.
-> try to change the new thread' scheduling. This fails.
-> cancel the new thread (sends a signal).
-> unlock or destroy (I did not check) the thread futex, data, etc....
-> return an error.

But the new thread apparently can work for some time before receiving the signal. In my case, it was able to lock the trace mutex. So, when pthread_create returns, it hangs in the output() routine.

I've uploaded a new sample which show this behavior into bugzilla here.

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
PthreadsIsFun
Active contributor


Joined: 31 Aug 2004
Posts: 5

PostPosted: Thu Sep 16, 2004 9:56 am    Post subject: Reply with quote

Hello,

As already annonced on the mailing list, the program to test the SCHED_RR issue
on Linux.

On a dual IA32 with FC-2 it fails with 4 threads.
Sometimes it works immediately, sometimes there is a delay of 10 (or more) seconds...
The more threads, the more likely the problem occurs.

On a dual Alpha EV56 with Tru64 4.0F it worked (tested up to 400 threads).

eagerly wating for other data Very Happy

Regards,
Loic.



test_schedrr.c.gz
 Description:
test the creation of SCHED_RR threads

Download
 Filename:  test_schedrr.c.gz
 Filesize:  3.23 KB
 Downloaded:  1954 Time(s)

Back to top
View user's profile Send private message
tHeDoc
Project Maintener


Joined: 21 Apr 2004
Posts: 237
Location: Echirolles, France

PostPosted: Thu Sep 16, 2004 3:34 pm    Post subject: Reply with quote

Here are a few more results with this same testcase:

(*) 2x IA32 (Xeon); Linux kernel 2.6.8.1 (kernel.org, no patch); FC2
-> Same results as you got: as soon as nthreads >= #CPU, the test sample hangs from time to time, and the more threads, the more likely it hangs.

(*) 8x IA32 (Xeon); Linux kernel 2.6.6-1.406smp; FC2
-> same result: with nthreads >= 8, we have hangs (harder to reproduce than with 2x cpu)

(*) 2x IA32; Solaris 10:
-> Everything is OK ("test 1024 50" works fine)

(*) 4x PPC; Aix5.3:
-> Everything is OK

(*) AIX5.2
-> Unable to test: I got minprio = maxprio = -1.

(*) 4x IA64; Linux 2.6.7 Bull Advanced Server (based on RHAS)
-> Everything is OK.

(*) 2x IA64; Linux 2.4.21 RHEL 3 - U1
-> Everything is OK.


Eh?
A first conclusion is that the problem is either related to IA32 + 2.6x kernels, either to FC2 distribution. It does not depend on: the # of CPU, the exact release of the kernel, the gcc release, the test "priority" parameter.

_________________
Sebastien Decugis.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
PthreadsIsFun
Active contributor


Joined: 31 Aug 2004
Posts: 5

PostPosted: Sat Oct 16, 2004 7:48 pm    Post subject: Reply with quote

Good Evening,

This is the latest version of my test.
It should work on both NPTL and LinuxThreads.

Functionally, it is the same than the version "3" I have posted on the
NPTL Bull mailing list, with some slight improvements (mainly, in the comments).

Testers, please use this tests.

Thanks,
Loic.



test_schedrr3.c.gz
 Description:
Latest Loic's test for the SCHED_RR issue.

Download
 Filename:  test_schedrr3.c.gz
 Filesize:  4.4 KB
 Downloaded:  2028 Time(s)

Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    nptl.BullOpenSource.Org Forum Index -> Tests - General All times are GMT + 1 Hour
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Powered by phpBB © 2001, 2005 phpBB Group