arch, sched: Fix global IRQ control logic for SMP
Summary:
- This commit fixes the global IRQ control logic for SMP.
- In the previous implementation, g_cpu_irqset for a remote CPU was set
  in sched_add_readytorun(), sched_remove_readytorun() and
  up_schedule_sigaction().
- In this implementation, those assignments are removed.
- Instead, the pause handler calls enter_critical_section(), which will
  call up_cpu_paused() and then acquire g_cpu_irqlock.
- So if a new task with irqcount > 1 restarts on the remote CPU, that
  CPU simply takes the critical section itself.  Thus, the issue
  described in 'POSSIBLE FOR TWO CPUs TO HOLD A CRITICAL SECTION' is
  resolved.
- Fix nxsched_resume_scheduler() so that it does not call spin_clrbit()
  if the CPU does not hold a bit in g_cpu_irqset.
- Fix nxtask_exit() so that it acquires g_cpu_irqlock.
- Update TODO.

Impact:
- All SMP implementations

Testing:
- Tested smp and ostest with the following configurations:
  - spresense:wifi_smp (NCPUS=2,4)
  - sabre-6quad:smp (QEMU, dev board)
  - maix-bit:smp (QEMU)
  - esp32-core:smp (QEMU)
  - lc823450-xgevk:rndis

Signed-off-by: Masayuki Ishikawa <Masayuki.Ishikawa@jp.sony.com>
parent a24905059e
commit 409c65ce0b
14 changed files with 131 additions and 193 deletions
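For orientation before the per-file hunks: the reworked pause handlers in the ARM, cxd56, lc823450, RISC-V and Xtensa ports all follow the same shape, condensed in the sketch below. This is only an illustration assembled from the hunks in this commit, written against the ARM handler name; the header list is approximate and the cxd56-specific re-check of the pause request is omitted.

    #include <nuttx/arch.h>   /* up_cpu_index(), up_cpu_pausereq(), up_cpu_paused() (approx.) */
    #include <nuttx/irq.h>    /* irqstate_t, enter/leave_critical_section() */
    #include <assert.h>       /* DEBUGVERIFY() */

    int arm_pause_handler(int irq, void *context, void *arg)
    {
      int cpu = up_cpu_index();

      /* Has another CPU requested that this CPU pause? */

      if (up_cpu_pausereq(cpu))
        {
          /* enter_critical_section() sees the pending request and calls
           * up_cpu_paused() internally; once the requesting CPU releases
           * g_cpu_irqlock, this CPU acquires it and owns the critical
           * section, so no external manipulation of g_cpu_irqset is
           * needed any more.
           */

          irqstate_t flags = enter_critical_section();

          /* By now the pause request must have been handled */

          DEBUGVERIFY(!up_cpu_pausereq(cpu));

          leave_critical_section(flags);
        }

      return OK;
    }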
TODO (68 changed lines)
@@ -1,4 +1,4 @@
-NuttX TODO List (Last updated November 20, 2020)
+NuttX TODO List (Last updated November 26, 2020)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This file summarizes known NuttX bugs, limitations, inconsistencies with
@@ -10,7 +10,7 @@ issues related to each board port.
 nuttx/:
 
  (16) Task/Scheduler (sched/)
-  (3) SMP
+  (2) SMP
   (1) Memory Management (mm/)
   (0) Power Management (drivers/pm)
   (5) Signals (sched/signal, arch/)
@@ -485,70 +485,6 @@ o SMP
               an bugs caused by this.  But I believe that failures are
               possible.
 
-  Title:       POSSIBLE FOR TWO CPUs TO HOLD A CRITICAL SECTION?
-  Description: The SMP design includes logic that will support multiple
-               CPUs holding a critical section.  Is this necessary?  How
-               can that occur?  I think it can occur in the following
-               situation:
-
-               The log below was reported is NuttX running on two cores
-               Cortex-A7 architecture in SMP mode.  You can notice see that
-               when nxsched_add_readytorun() was called, the g_cpu_irqset is 3.
-
-                 nxsched_add_readytorun: irqset cpu 1, me 0 btcbname init, irqset 1 irqcount 2.
-                 nxsched_add_readytorun: nxsched_add_readytorun line 338 g_cpu_irqset = 3.
-
-               This can happen, but only under a very certain condition.
-               g_cpu_irqset only exists to support this certain condition:
-
-               a. A task running on CPU 0 takes the critical section.  So
-                  g_cpu_irqset == 0x1.
-
-               b. A task exits on CPU 1 and a waiting, ready-to-run task
-                  is re-started on CPU 1.  This new task also holds the
-                  critical section.  So when the task is re-restarted on
-                  CPU 1, we than have g_cpu_irqset == 0x3
-
-               So we are in a very perverse state!  There are two tasks
-               running on two different CPUs and both hold the critical
-               section.  I believe that is a dangerous situation and there
-               could be undiscovered bugs that could happen in that case.
-               However, as of this moment, I have not heard of any specific
-               problems caused by this weird behavior.
-
-               A possible solution would be to add a new task state that
-               would exist only for SMP.
-
-               - Add a new SMP-only task list and state.  Say,
-                 g_csection_wait[].  It should be prioritized.
-               - When a task acquires the critical section, all tasks in
-                 g_readytorun[] that need the critical section would be
-                 moved to g_csection_wait[].
-               - When any task is unblocked for any reason and moved to the
-                 g_readytorun[] list, if that unblocked task needs the
-                 critical section, it would also be moved to the
-                 g_csection_wait[] list.  No task that needs the critical
-                 section can be in the ready-to-run list if the critical
-                 section is not available.
-               - When the task releases the critical section, all tasks in
-                 the g_csection_wait[] needs to be moved back to
-                 g_readytorun[].
-               - This may result in a context switch.  The tasks should be
-                 moved back to g_readytorun[] highest priority first.  If a
-                 context switch occurs and the critical section to re-taken
-                 by the re-started task, the lower priority tasks in
-                 g_csection_wait[] must stay in that list.
-
-               That is really not as much work as it sounds.  It is
-               something that could be done in 2-3 days of work if you know
-               what you are doing.  Getting the proper test setup and
-               verifying the change would be the more difficult task.
-
-  Status:      Open
-  Priority:    Unknown.  Might be high, but first we would need to confirm
-               that this situation can occur and that is actually causes
-               a failure.
-
 o Memory Management (mm/)
   ^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -212,9 +212,22 @@ int arm_pause_handler(int irq, FAR void *context, FAR void *arg)
    * been processed then g_cpu_paused[cpu] will not be locked.
    */
 
-  if (spin_islocked(&g_cpu_paused[cpu]))
+  if (up_cpu_pausereq(cpu))
     {
-      return up_cpu_paused(cpu);
+      /* NOTE: The following enter_critical_section() will call
+       * up_cpu_paused() to process a pause request to break a deadlock
+       * because the caller held a critical section. Once up_cpu_paused()
+       * finished, the caller will proceed and release the g_cpu_irqlock.
+       * Then this CPU will acquire g_cpu_irqlock in the function.
+       */
+
+      irqstate_t flags = enter_critical_section();
+
+      /* NOTE: the pause request should not exist here */
+
+      DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+      leave_critical_section(flags);
     }
 
   return OK;
@@ -314,17 +314,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
           DEBUGASSERT(tcb->irqcount < INT16_MAX);
           tcb->irqcount++;
 
-          /* In an SMP configuration, the interrupt disable logic also
-           * involves spinlocks that are configured per the TCB irqcount
-           * field.  This is logically equivalent to
-           * enter_critical_section().  The matching call to
-           * leave_critical_section() will be performed in
-           * arm_sigdeliver().
+          /* NOTE: If the task runs on another CPU(cpu), adjusting
+           * global IRQ controls will be done in the pause handler
+           * on the CPU(cpu) by taking a critical section.
+           * If the task is scheduled on this CPU(me), do nothing
+           * because this CPU already took a critical section
            */
 
-          spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
-                      &g_cpu_irqlock);
-
           /* RESUME the other CPU if it was PAUSED */
 
           if (cpu != me)
@@ -362,17 +362,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
           DEBUGASSERT(tcb->irqcount < INT16_MAX);
           tcb->irqcount++;
 
-          /* In an SMP configuration, the interrupt disable logic also
-           * involves spinlocks that are configured per the TCB irqcount
-           * field.  This is logically equivalent to
-           * enter_critical_section().  The matching call to
-           * leave_critical_section() will be performed in
-           * arm_sigdeliver().
+          /* NOTE: If the task runs on another CPU(cpu), adjusting
+           * global IRQ controls will be done in the pause handler
+           * on the CPU(cpu) by taking a critical section.
+           * If the task is scheduled on this CPU(me), do nothing
+           * because this CPU already took a critical section
            */
 
-          spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
-                      &g_cpu_irqlock);
-
           /* RESUME the other CPU if it was PAUSED */
 
           if (cpu != me)
@@ -283,6 +283,7 @@ int up_cpu_paused(int cpu)
 int arm_pause_handler(int irq, void *c, FAR void *arg)
 {
   int cpu = up_cpu_index();
+  int ret = OK;
 
   DPRINTF("cpu%d will be paused \n", cpu);
 
@@ -295,12 +296,37 @@ int arm_pause_handler(int irq, void *c, FAR void *arg)
    * interrupt by calling up_cpu_paused.
    */
 
-  if (spin_islocked(&g_cpu_paused[cpu]))
+  if (up_cpu_pausereq(cpu))
     {
-      return up_cpu_paused(cpu);
+      /* NOTE: The following enter_critical_section() would call
+       * up_cpu_paused() to process a pause request to break a deadlock
+       * because the caller held a critical section. Once up_cpu_paused()
+       * finished, the caller will proceed and release the g_cpu_irqlock.
+       * Then this CPU will acquire g_cpu_irqlock in the function.
+       */
+
+      irqstate_t flags = enter_critical_section();
+
+      /* NOTE: Normally, we do not call up_cpu_paused() here because
+       * the above enter_critical_section() would call up_cpu_paused()
+       * inside because the caller holds a critical section.
+       * However, cxd56's remote IRQ control logic also uses this handler
+       * and a caller might not take a critical section to avoid a deadlock
+       * during up_enable_irq() and up_disable_irq(). This is allowed
+       * because IRQ control logic does not interact with the scheduler.
+       * This means that if the request was not handled above, we need
+       * to call up_cpu_paused() here again.
+       */
+
+      if (up_cpu_pausereq(cpu))
+        {
+          ret = up_cpu_paused(cpu);
+        }
+
+      leave_critical_section(flags);
     }
 
-  return OK;
+  return ret;
 }
 
 /****************************************************************************
@@ -209,9 +209,22 @@ int lc823450_pause_handler(int irq, void *c, FAR void *arg)
    * interrupt by calling up_cpu_paused.
    */
 
-  if (spin_islocked(&g_cpu_paused[cpu]))
+  if (up_cpu_pausereq(cpu))
     {
-      return up_cpu_paused(cpu);
+      /* NOTE: The following enter_critical_section() will call
+       * up_cpu_paused() to process a pause request to break a deadlock
+       * because the caller held a critical section. Once up_cpu_paused()
+       * finished, the caller will proceed and release the g_cpu_irqlock.
+       * Then this CPU will acquire g_cpu_irqlock in the function.
+       */
+
+      irqstate_t flags = enter_critical_section();
+
+      /* NOTE: the pause request should not exist here */
+
+      DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+      leave_critical_section(flags);
     }
 
   return OK;
@@ -215,9 +215,22 @@ int riscv_pause_handler(int irq, void *c, FAR void *arg)
    * interrupt by calling up_cpu_paused.
    */
 
-  if (spin_islocked(&g_cpu_paused[cpu]))
+  if (up_cpu_pausereq(cpu))
     {
-      return up_cpu_paused(cpu);
+      /* NOTE: The following enter_critical_section() will call
+       * up_cpu_paused() to process a pause request to break a deadlock
+       * because the caller held a critical section. Once up_cpu_paused()
+       * finished, the caller will proceed and release the g_cpu_irqlock.
+       * Then this CPU will acquire g_cpu_irqlock in the function.
+       */
+
+      irqstate_t flags = enter_critical_section();
+
+      /* NOTE: the pause request should not exist here */
+
+      DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+      leave_critical_section(flags);
     }
 
   return OK;
@@ -353,17 +353,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
           DEBUGASSERT(tcb->irqcount < INT16_MAX);
           tcb->irqcount++;
 
-          /* In an SMP configuration, the interrupt disable logic also
-           * involves spinlocks that are configured per the TCB irqcount
-           * field.  This is logically equivalent to
-           * enter_critical_section().  The matching call to
-           * leave_critical_section() will be performed in
-           * up_sigdeliver().
+          /* NOTE: If the task runs on another CPU(cpu), adjusting
+           * global IRQ controls will be done in the pause handler
+           * on the CPU(cpu) by taking a critical section.
+           * If the task is scheduled on this CPU(me), do nothing
+           * because this CPU already took a critical section
            */
 
-          spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
-                      &g_cpu_irqlock);
-
           /* RESUME the other CPU if it was PAUSED */
 
           if (cpu != me)
@@ -190,9 +190,22 @@ void xtensa_pause_handler(void)
    * interrupt by calling up_cpu_paused.
    */
 
-  if (spin_islocked(&g_cpu_paused[cpu]))
+  if (up_cpu_pausereq(cpu))
     {
-      up_cpu_paused(cpu);
+      /* NOTE: The following enter_critical_section() will call
+       * up_cpu_paused() to process a pause request to break a deadlock
+       * because the caller held a critical section. Once up_cpu_paused()
+       * finished, the caller will proceed and release the g_cpu_irqlock.
+       * Then this CPU will acquire g_cpu_irqlock in the function.
+       */
+
+      irqstate_t flags = enter_critical_section();
+
+      /* NOTE: the pause request should not exist here */
+
+      DEBUGVERIFY(!up_cpu_pausereq(cpu));
+
+      leave_critical_section(flags);
     }
 }
 
@@ -339,17 +339,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver)
           DEBUGASSERT(tcb->irqcount < INT16_MAX);
           tcb->irqcount++;
 
-          /* In an SMP configuration, the interrupt disable logic also
-           * involves spinlocks that are configured per the TCB irqcount
-           * field.  This is logically equivalent to
-           * enter_critical_section().
-           * The matching call to leave_critical_section() will be
-           * performed in up_sigdeliver().
+          /* NOTE: If the task runs on another CPU(cpu), adjusting
+           * global IRQ controls will be done in the pause handler
+           * on the CPU(cpu) by taking a critical section.
+           * If the task is scheduled on this CPU(me), do nothing
+           * because this CPU already took a critical section
            */
 
-          spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
-                      &g_cpu_irqlock);
-
           /* RESUME the other CPU if it was PAUSED */
 
           if (cpu != me)
@@ -318,47 +318,12 @@ bool nxsched_add_readytorun(FAR struct tcb_s *btcb)
                       &g_cpu_schedlock);
         }
 
-      /* Adjust global IRQ controls.  If irqcount is greater than zero,
-       * then this task/this CPU holds the IRQ lock
+      /* NOTE: If the task runs on another CPU(cpu), adjusting global IRQ
+       * controls will be done in the pause handler on the new CPU(cpu).
+       * If the task is scheduled on this CPU(me), do nothing because
+       * this CPU already has a critical section
        */
 
-      if (btcb->irqcount > 0)
-        {
-          /* Yes... make sure that scheduling logic on other CPUs knows
-           * that we hold the IRQ lock.
-           */
-
-          spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
-                      &g_cpu_irqlock);
-        }
-
-      /* No.. This CPU will be relinquishing the lock.  But this works
-       * differently if we are performing a context switch from an
-       * interrupt handler and the interrupt handler has established
-       * a critical section.  We can detect this case when
-       * g_cpu_nestcount[me] > 0.
-       */
-
-      else if (g_cpu_nestcount[me] <= 0)
-        {
-          /* Do nothing here
-           * NOTE: spin_clrbit() will be done in sched_resumescheduler()
-           */
-        }
-
-      /* Sanity check.  g_cpu_netcount should be greater than zero
-       * only while we are within the critical section and within
-       * an interrupt handler.  If we are not in an interrupt handler
-       * then there is a problem; perhaps some logic previously
-       * called enter_critical_section() with no matching call to
-       * leave_critical_section(), leaving the non-zero count.
-       */
-
-      else
-        {
-          DEBUGASSERT(up_interrupt_context());
-        }
-
       /* If the following task is not locked to this CPU, then it must
        * be moved to the g_readytorun list.  Since it cannot be at the
        * head of the list, we can do this without invoking any heavy
@@ -260,47 +260,12 @@ bool nxsched_remove_readytorun(FAR struct tcb_s *rtcb)
                       &g_cpu_schedlock);
         }
 
-      /* Adjust global IRQ controls.  If irqcount is greater than zero,
-       * then this task/this CPU holds the IRQ lock
+      /* NOTE: If the task runs on another CPU(cpu), adjusting global IRQ
+       * controls will be done in the pause handler on the new CPU(cpu).
+       * If the task is scheduled on this CPU(me), do nothing because
+       * this CPU already has a critical section
        */
 
-      if (nxttcb->irqcount > 0)
-        {
-          /* Yes... make sure that scheduling logic on other CPUs knows
-           * that we hold the IRQ lock.
-           */
-
-          spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
-                      &g_cpu_irqlock);
-        }
-
-      /* No.. This CPU will be relinquishing the lock.  But this works
-       * differently if we are performing a context switch from an
-       * interrupt handler and the interrupt handler has established
-       * a critical section.  We can detect this case when
-       * g_cpu_nestcount[me] > 0.
-       */
-
-      else if (g_cpu_nestcount[me] <= 0)
-        {
-          /* Do nothing here
-           * NOTE: spin_clrbit() will be done in sched_resumescheduler()
-           */
-        }
-
-      /* Sanity check.  g_cpu_netcount should be greater than zero
-       * only while we are within the critical section and within
-       * an interrupt handler.  If we are not in an interrupt handler
-       * then there is a problem; perhaps some logic previously
-       * called enter_critical_section() with no matching call to
-       * leave_critical_section(), leaving the non-zero count.
-       */
-
-      else
-        {
-          DEBUGASSERT(up_interrupt_context());
-        }
-
       nxttcb->task_state = TSTATE_TASK_RUNNING;
 
       /* All done, restart the other CPU (if it was paused). */
@@ -136,8 +136,11 @@ void nxsched_resume_scheduler(FAR struct tcb_s *tcb)
     {
       /* Release our hold on the IRQ lock. */
 
-      spin_clrbit(&g_cpu_irqset, me, &g_cpu_irqsetlock,
-                  &g_cpu_irqlock);
+      if ((g_cpu_irqset & (1 << me)) != 0)
+        {
+          spin_clrbit(&g_cpu_irqset, me, &g_cpu_irqsetlock,
+                      &g_cpu_irqlock);
+        }
     }
 #endif /* CONFIG_SMP */
 }
@@ -164,6 +164,13 @@ int nxtask_exit(void)
 
   if (rtcb->irqcount == 0)
     {
+      /* NOTE: Need to acquire g_cpu_irqlock here again before
+       * calling spin_setbit() because it was released in
+       * the above nxsched_resume_scheduler()
+       */
+
+      DEBUGVERIFY(irq_waitlock(this_cpu()));
+
       spin_setbit(&g_cpu_irqset, this_cpu(), &g_cpu_irqsetlock,
                   &g_cpu_irqlock);
     }