This small kernel patch enables fine-grained scheduler timing measurements by using the kernel function get_cycles(), which on x86 CPU families uses the rdtsc instruction to read the CPU cycle counter. A new character device /dev/latsched ( MAJOR = 10 - MINOR = 117 ) has been introduced to control the behaviour of, and fetch data from, the kernel scheduler measurement code. Besides measuring scheduler latency, this patch can be used to study process scheduling and migration between CPUs. To use the patch, a new kernel must be built ( with the patch applied ) and the character device /dev/latsched must be created with :
# mknod /dev/latsched c 10 117
The code that will make use of the LatSched patch must open the device with :
if ((lsfd = open("/dev/latsched", O_RDWR)) == -1) {
	...
}
The next step is to set the size of the circular sample buffer with :
if ((res = ioctl(lsfd, LS_SAMPLES, samples))) {
	...
}
Then the code has to instruct the sampler to start collecting scheduler timings with :
if ((res = ioctl(lsfd, LS_START, 0))) {
	...
}
To stop the sampling process a new ioctl() call is necessary :
if ((res = ioctl(lsfd, LS_STOP, 0))) {
	...
}
At this point collected data is held inside the scheduler data buffers and must be fetched with something like this :
int cpu, ncpus, ii;
struct lsctl_getdata lsgd;

ncpus = sysconf(_SC_NPROCESSORS_CONF);
memset(&lsgd, 0, sizeof(lsgd));
lsgd.size = samples;
lsgd.data = (struct latsched_sample *)
	malloc(samples * sizeof(struct latsched_sample));
for (cpu = 0; cpu < ncpus; cpu++) {
	lsgd.cpu = cpu;
	lsgd.size = samples;
	if ((res = ioctl(lsfd, LS_FETCH, &lsgd))) {
		...
	}
	for (ii = 0; ii < lsgd.rsize; ii++) {
		...
	}
}
The kernel patch can be downloaded here :
This is a simple program that interfaces with the LatSched kernel patch to fetch data from the scheduler cycle sampler. Its code is also an example of how to use the LatSched kernel patch.
Build:
gcc -o schedcnt schedcnt.c
Use:
schedcnt [--samples n] [--ttime s] [--sttime u] [--klimit k] [-- cmdpath [arg] ...]
--samples = Set the size of the sample buffer
--ttime = Set the sample time in seconds
--sttime = Set the sample time in microseconds
--klimit = Set the cut factor for samples. Samples > AVSC*klimit are not evaluated in KAVS
-- = Separates the optional command to be executed during the test time
cmdpath = Command to be executed
arg = Command arguments
The output is, for each CPU :
CPU NSAMP
SAMP[0]
...
SAMP[NSAMP-1]
AVSC CHSQ KAVS
CPU = CPU number
NSAMP = Number of samples read
AVSC = Average schedule() cycles
CHSQ = Chi-square of the CSCH distribution
KAVS = High-cut average schedule() cycles
SAMP[i] is in the form :
CENT CEXT CSCH PPID RTIM
CENT = Cycle counter at schedule() entry
CEXT = Cycle counter at schedule() exit
CSCH = Cycle duration of schedule()
PPID = New scheduled PID
RTIM = Cycle run time ( this may sometimes be incorrect due to schedule() nesting )
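The AVSC and KAVS statistics can be reproduced from the CSCH samples with a small sketch like the one below. The function names are mine, not part of schedcnt; the cut rule simply follows the --klimit description above ( samples greater than AVSC*klimit are dropped from the high-cut average ):

```c
#include <stddef.h>

/* Plain average of the schedule() cycle samples (AVSC). */
static double avg_cycles(const unsigned long *csch, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += (double) csch[i];
	return n ? sum / (double) n : 0.0;
}

/* High-cut average (KAVS): samples > AVSC * klimit are not evaluated. */
static double kavg_cycles(const unsigned long *csch, size_t n, double klimit)
{
	double avsc = avg_cycles(csch, n), sum = 0.0;
	size_t i, kept = 0;

	for (i = 0; i < n; i++)
		if ((double) csch[i] <= avsc * klimit) {
			sum += (double) csch[i];
			kept++;
		}
	return kept ? sum / (double) kept : 0.0;
}
```

For example, with samples { 100, 100, 100, 100, 1000 } and klimit = 2, AVSC is 280 and the 1000-cycle outlier is cut, giving a KAVS of 100.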
The SchedCnt code can be downloaded here :
This program creates runqueue load to let the LatSched kernel patch work in different scheduler environments.
Build:
gcc -o cpuhog cpuhog.c
Use:
cpuhog [--ntasks n] [--ttime s] [--size b]
--ntasks = Set the number of tasks ( runqueue length )
--ttime = Set the time cpuhog should run in seconds
--size = Set the cache drain size in Kb
The CpuHog code can be downloaded here :
Linux kernel scheduler latency test.
To make the test work with a number of tasks less than or equal to the number of CPUs, the sys_sched_yield() optimization must be removed from kernel/sched.c. This program does not make use of the LatSched patch, and the accuracy of its results can vary with the cache load size. When the size becomes large, the ratio between the schedule() latency and the time spent loading the cache becomes very small, and the accuracy of the measurement could be compromised.
Build:
gcc -O0 -o lat-sched lat-sched.c
Use:
lat-sched [--ntasks n] [--ttime s] [--size b] [--stksize b] [--stkalign b] [--max-eck k] [--verbose] [--help]
--ntasks = Set the number of tasks ( runqueue length )
--ttime = Set the test measure time in seconds
--size = Set the cache drain size in Kb
--stksize = Set the stack size for tasks
--stkalign = Set the shift each task stack will have over stksize
--max-eck = Set the maximum correction factor above which the measure is invalid
--verbose = Activate verbose mode
--help = Print help screen
Output: NRTASK SBENCH MSRUN CSSEC LATSCH AVGWRK THRZ CHISQR
NRTASK = Number of test tasks
SBENCH = Scheduler benchmark == counter / (time * ncpu)
MSRUN = Number of milliseconds the test ran
CSSEC = Number of context switches / sec
LATSCH = Number of seconds for a context switch
AVGWRK = Number of context switches / NRTASK
THRZ = Number of tasks with zero context switches
CHISQR = Context-switch chi-square over tasks
In case the measure turns out to be invalid ( according to eck ), the output line will begin with a '*' character. On UP systems this tool gives precise timings, while on MP systems lat-sched can only be used as a benchmark, by means of the SBENCH field. Messages are printed on <stderr> while results are printed on <stdout>.
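The relations between those fields can be restated in a few lines of C. This is a sketch of the definitions above, not code taken from lat-sched; in particular, LATSCH is taken here as the inverse of CSSEC, which is one plausible reading of "number of seconds for a context switch":

```c
/* Derived lat-sched metrics, restated from the field definitions above. */
struct latsched_result {
	double sbench;	/* counter / (time * ncpu) */
	double cssec;	/* context switches per second */
	double latsch;	/* seconds per context switch (assumed 1 / CSSEC) */
	double avgwrk;	/* context switches per task */
};

static struct latsched_result
latsched_metrics(double counter, double time_sec, int ncpus, int nrtask)
{
	struct latsched_result r;

	r.sbench = counter / (time_sec * (double) ncpus);
	r.cssec = counter / time_sec;
	r.latsch = time_sec / counter;
	r.avgwrk = counter / (double) nrtask;
	return r;
}
```

For example, 1000 switches counted over 2 seconds on 2 CPUs with 4 tasks give SBENCH = 250, CSSEC = 500, LATSCH = 0.002 and AVGWRK = 250.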
The Lat-Sched code can be downloaded here :
This patch tries to achieve a better goodness() calculation through the introduction of the concept of 'CPU time spent on a processor', aka the probable cache footprint of the task. Right now, if we have a couple of CPU-bound tasks together with the classical editor keyboard ticking, we can see the two CPU-bound tasks swapped when the editor task kicks in. This is wrong, because a CPU-bound task has probably run for a bunch of milliseconds and hence has a quite big memory footprint on the CPU, while the editor task very likely ran for a very short time, keeping the footprint of the CPU-bound tasks almost intact. This patch also addresses, in a better way, the processor change penalty, which right now is constant. It's obvious that, having to choose between the CPU-bound task and the editor task, the better candidate is the one with the smaller memory footprint, aka the editor task. So we can change the constant penalty into a constant plus an estimate of the memory footprint.
The patch tracks the number of jiffies a task has run as :
JR = Jout - Jin
where Jin is the jiffies value when the task is kicked in and Jout is the value when it's kicked out. At a given time, the value of the footprint is :
W = JR > (J - Jout) ? JR - (J - Jout) : 0
where J is the current time in jiffies. This means that if a task ran for 10 ms, 10 ms ago, its current weight will be zero. This is quite clear, because the longer a process does not run, the more footprint it loses.
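The two formulas can be transcribed literally into a small C helper ( the function names are mine, chosen for illustration ):

```c
/* JR = Jout - Jin: jiffies the task ran during its last slot. */
static unsigned long task_run_jiffies(unsigned long jin, unsigned long jout)
{
	return jout - jin;
}

/* W = JR > (J - Jout) ? JR - (J - Jout) : 0
 * The footprint decays by one jiffy for every jiffy the task
 * stays off the CPU, down to zero. */
static unsigned long task_footprint(unsigned long jin, unsigned long jout,
				    unsigned long j)
{
	unsigned long jr = task_run_jiffies(jin, jout);
	unsigned long idle = j - jout;

	return jr > idle ? jr - idle : 0;
}
```

So a task that ran for 10 jiffies and was kicked out 10 jiffies ago has weight 0, while the same task only 2 jiffies after being kicked out still has weight 8.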
The patch can be downloaded here :
This is an old patch for 2.2.14 that divided the runqueue into priority-based clusters and achieved "pretty good"(tm) performance with long run queues while maintaining the same behavior with short ones :
This patch plugs into the Balanced Multi Queue Scheduler ( BMQS ) to enable run queue length sampling. To use the patch, a new device file must be created :
# mknod /dev/rqlsamp c 10 116
This is an example of how to use it :
if ((rqfd = open("/dev/rqlsamp", O_RDONLY)) == -1) {
	...
}

#define MAX_CPUS 64

int i, sreaded, retcpus;
int samples[MAX_CPUS][2];

if ((sreaded = read(rqfd, samples, sizeof(samples))) < 0) {
	...
}
retcpus = sreaded / (2 * sizeof(int));
for (i = 0; i < retcpus; i++) {
	cpu = samples[i][0];
	rqlen = samples[i][1];
}
The returned data is a set of integer pairs: the first value is the CPU logical number ( -1 == global real time tasks queue ) and the second is the run queue length. The patch is available here :
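The decoding of the returned buffer can be wrapped in a small helper that works on an already-read buffer ( the function is my own illustration, not part of the patch ):

```c
#include <stddef.h>

/* Decode the (cpu, runqueue length) pairs returned by a read() of
 * /dev/rqlsamp. 'nbytes' is the read() return value. Returns the
 * number of decoded pairs; a cpu value of -1 marks the global
 * real time tasks queue. */
static int rql_decode(const int (*samples)[2], size_t nbytes,
		      int *cpus, int *rqlens, int maxpairs)
{
	int i, retcpus = (int) (nbytes / (2 * sizeof(int)));

	if (retcpus > maxpairs)
		retcpus = maxpairs;
	for (i = 0; i < retcpus; i++) {
		cpus[i] = samples[i][0];
		rqlens[i] = samples[i][1];
	}
	return retcpus;
}
```

A read() returning 24 bytes on a 32-bit int system, for instance, decodes to three (cpu, rqlen) pairs.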
This is a simple program that uses the "Run Queue Length Sampler Patch" to print out data read from the /dev/rqlsamp device file. The program is available here :
This scheduler implementation splits the task's time slice from the task's dynamic priority, to improve on the current scheduler behavior that uses the task->counter field as both the time slice and the dynamic priority value. The recalculation loop is now performed only for running tasks, and a global variable rcl_curr keeps track of the number of recalculation loops done :
[kernel/sched.c]
static unsigned long rcl_curr = 0;
asmlinkage void schedule(void)
{
...
/* Do we need to re-calculate counters? */
if (unlikely(!c)) {
++rcl_curr;
list_for_each(tmp, &runqueue_head) {
p = list_entry(tmp, struct task_struct, run_list);
p->time_slice = NICE_TO_TICKS(p->nice);
p->rcl_last = rcl_curr;
}
goto repeat_schedule;
}
...
}
and the function add_to_runqueue() has now become :
[kernel/sched.c]
static inline void add_to_runqueue(struct task_struct * p)
{
p->dyn_prio += rcl_curr - p->rcl_last;
p->rcl_last = rcl_curr;
if (p->dyn_prio > MAX_DYNPRIO)
	p->dyn_prio = MAX_DYNPRIO;
list_add(&p->run_list, &runqueue_head);
nr_running++;
}
A new task struct field, rcl_last, is used to keep track of the last loop seen by the task, and is used to add priority to the woken-up task. The goodness() function has been changed to do :
[kernel/sched.c]
...
if (!p->time_slice)
return 0;
weight = p->dyn_prio + 1;
...
using, in this way, the dyn_prio task struct member as the dynamic priority value. The update_process_times() function has been changed to :
[kernel/timer.c]
void update_process_times(int user_tick)
{
struct task_struct *p = current;
int cpu = smp_processor_id(), system = user_tick ^ 1;
update_one_process(p, user_tick, system, cpu);
if (p->pid) {
expire_task(p);
if (p->nice > 0)
kstat.per_cpu_nice[cpu] += user_tick;
else
kstat.per_cpu_user[cpu] += user_tick;
kstat.per_cpu_system[cpu] += system;
} else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
kstat.per_cpu_system[cpu] += system;
}
[kernel/sched.c]
void expire_task(struct task_struct *p)
{
if (!p->time_slice)
p->need_resched = 1;
else {
if (!--p->time_slice) {
if (p->dyn_prio > 0) {
--p->time_slice;
--p->dyn_prio;
}
p->need_resched = 1;
} else if (p->time_slice < -NICE_TO_TICKS(p->nice)) {
p->time_slice = 0;
p->need_resched = 1;
}
}
}
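The expire_task() logic can be exercised in user space with a small mock; the struct and the MOCK_TICKS constant below are simplified stand-ins of my own, not the kernel's definitions:

```c
/* Minimal user-space stand-in for the patched task bookkeeping. */
struct mock_task {
	int time_slice;
	int dyn_prio;
	int need_resched;
};

/* Fixed stand-in for NICE_TO_TICKS(p->nice). */
#define MOCK_TICKS 6

/* Mirrors the expire_task() logic above on the mock structure. */
static void mock_expire_task(struct mock_task *p)
{
	if (!p->time_slice)
		p->need_resched = 1;
	else if (!--p->time_slice) {
		if (p->dyn_prio > 0) {
			--p->time_slice;	/* go negative: extra slice */
			--p->dyn_prio;
		}
		p->need_resched = 1;
	} else if (p->time_slice < -MOCK_TICKS) {
		p->time_slice = 0;
		p->need_resched = 1;
	}
}
```

Ticking a task that has one slice left and dyn_prio = 2 shows the trade: the slice goes negative ( the extra slice ), dyn_prio drops to 1, and a reschedule is requested.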
Basically, the dyn_prio field accumulates dynamic priority when the task misses the recalculation loop and this, like the current scheduler, gives I/O-bound tasks a nice interactive feel. But unlike the current scheduler, where a task with a high counter value can exercise its CPU time all at once, this implementation grants tasks with dyn_prio > 0 the ability to use an extra time slice by decreasing their dynamic priority by one. This maintains the same interactive feel but prevents tasks with a huge counter accumulation from freezing the system by exercising their time slices all at once. Another improved behavior of this implementation is that CPU-bound tasks, having dynamic priority zero, will be correctly sorted out by the goodness() function using mm affinities :
[kernel/sched.c]
...
if (p->mm == this_mm ||
!p->mm)
weight += 1;
...
where the +1 given, this time, matters. The function sys_sched_yield() can be written like :
[kernel/sched.c]
asmlinkage long sys_sched_yield(void)
{
int nr_pending = nr_running;
#if CONFIG_SMP
int i;
for (i = 0; i < smp_num_cpus; i++) {
int cpu = cpu_logical_map(i);
if (aligned_data[cpu].schedule_data.curr != idle_task(cpu))
nr_pending--;
}
#else
nr_pending--;
#endif
if (nr_pending) {
struct task_struct *ctsk = current;
if (ctsk->policy == SCHED_OTHER)
ctsk->policy |= SCHED_YIELD;
ctsk->need_resched = 1;
ctsk->time_slice = 0;
++ctsk->dyn_prio;
}
return 0;
}
where the task's time slice is zeroed and its dynamic priority is increased. This fixes the infinite-switch behavior of sys_sched_yield(), and the increased dynamic priority will grant the yielding task its time slice later. To show the effect of this patch on sys_sched_yield(), two schedcnt dumps are reported. The test has been run by creating two CPU hog processes with cpuhog and by running lat-sched with two tasks using sched_yield() to switch. The first dump is the one generated by the current scheduler, where tasks 847 and 848 are the yield()ing tasks and tasks 843 and 844 are the CPU hog tasks. It clearly shows the huge number of switches that the system has to perform ( wasting CPU cycles ) before the CPU hog task has the opportunity to run. The second dump is generated using the proposed scheduler, where tasks 1763 and 1764 are the yield()ing tasks and tasks 1758 and 1759 are the CPU hog tasks. It's clear that this scenario is much better: the CPU hog tasks get the CPU way earlier than in the previous dump, without wasting CPU time and without loading the scheduler with tons of switches. The patch is available here :
This program uses the LatSched patch to measure the scheduler latency
for real time tasks. The source code is available here :