When changing the size of the FreeRTOS timer queue somehow breaks the CAN bus

This writeup describes my debugging process: finding a main feature of our device broken, diagnosing the problem, and finally hunting down the cause almost a week later. To no one’s surprise, the solution was a change to a single line of code.

My approach to debugging is to slowly chip away at small bits of uncertainty to reduce the problem to the smallest possible number of unknowns. This approach helped me debug an issue that came up at work recently and at first glance made absolutely no sense.

The issue

The systems that I work on are connected over ethernet and receive commands using the Art-Net protocol sent over UDP. When doing some testing one day, I realized the devices weren’t responding to any commands I was sending. After a couple of checks to make sure I wasn’t doing anything extra dumb (cough forgetting to plug in the ethernet cable), I realized there was an actual issue.

My first step was to find the last known good state. I flashed a random commit from a couple of weeks ago, and all commands registered perfectly. Using git-bisect I then narrowed the problem down to a single, specific commit from a few days prior.

The commit in question made a couple of small tweaks, but mainly centered on adjusting the ISRs and UART configuration for the debug CLI input and output byte streams. I started reverting changes file-by-file in that commit using git checkout HEAD~ <file>, and of course, the very last file to revert was the one with the actual problem. The reason it was the last one I checked is that it seemed so utterly unrelated. The file was the FreeRTOS configuration file, FreeRTOSConfig.h, and the single change was this:

#define configTIMER_QUEUE_LENGTH        5

to this:

#define configTIMER_QUEUE_LENGTH        10

Hm. Nowhere in any of the command processing code are FreeRTOS timers used. No idea how this is connected, but at least the source was now known.

Finding the cause of the issue

Now that the source of the problem was known, I needed to figure out what specifically in the command processing steps was failing. A description of “changing the timer queue length breaks commands” is about as vague as it gets and not helpful to start debugging the problem.

My very first check was to look at the stack allocations. Random changes causing other random errors smells a lot like a stack overflow from a task corrupting other memory. We allocate everything statically in this firmware, so any unaccounted-for allocations can blow through the stack initially allocated for the task. I’ve integrated a CLI over UART into our firmware to make controlling the firmware state much easier, and I had previously added a command to print the high watermark of every running task stack.
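The command is just a thin wrapper around the FreeRTOS trace APIs. A minimal sketch of the idea (assuming configUSE_TRACE_FACILITY is enabled; the names and the task-count cap here are illustrative, not our actual CLI code):

#include <stdio.h>
#include "FreeRTOS.h"
#include "task.h"

/* Sketch of the watermark dump command - names and the task cap are
   illustrative. Requires configUSE_TRACE_FACILITY == 1. */
#define MAX_TASKS_TO_REPORT    16

static void cli_cmd_stack_watermarks( void )
{
    TaskStatus_t xStatus[ MAX_TASKS_TO_REPORT ];
    UBaseType_t uxCount = uxTaskGetSystemState( xStatus, MAX_TASKS_TO_REPORT, NULL );

    for( UBaseType_t i = 0; i < uxCount; i++ )
    {
        /* usStackHighWaterMark is the smallest amount of stack (in words) that
           has ever been free for this task - closer to zero means closer to overflow. */
        printf( "%-16s %u words free at worst\r\n",
                xStatus[ i ].pcTaskName,
                ( unsigned ) xStatus[ i ].usStackHighWaterMark );
    }
}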

FreeRTOS tracks stack high watermarks. The closer the number is to zero, the closer it is to overflowing.

Well, that looks totally fine. Everything has a solid amount of space (some way over-allocated), but I gave the timer task a stack bump to see if that fixed anything (it didn’t). I also adjusted the CANBusRX task because, spoiler, that’s part of the command processing. That also resulted in no change in behavior.

Okay, so the easy answer wasn’t the solution. The next step was to take three steps back and figure out specifically what the problem was.

I started with a breakpoint in the function responsible for taking the command and calling out to different handlers based on its value. This is the most high-level part of the command-processing chain. That breakpoint was never hit.

I moved that breakpoint down the call stack level by level, first from high-level command processing to packet unpacking, then to the CAN bus transfer of received bytes to the packet unpacking, and finally into the vendor CAN code for checking the RX FIFO and signaling a waiting packet or not.

Not a single breakpoint was triggered when I sent a command to the device. To sanity check myself I put a scope on the incoming CAN lines to 100% check that valid CAN was getting to the processor, even though there should be no way firmware changes affect that (it was).

So I determined that something was going wrong between the hardware CAN peripheral in the chip and the SDK code accessing that memory. I also realized here that CAN is a bit annoying to debug in this regard because all of the data reception is purely in the hardware peripheral and not interrupt-driven like a UART peripheral. With UART you can get a bit closer to the hardware by verifying that the hardware was triggering an interrupt for a single char. For CAN, all I could do was look at the memory of the FIFO that the CAN peripheral was initialized with to see if data had appeared (and that wasn’t as simple as just looking at a block of memory in the debugger).

A bit more context on this specific device: the CAN bus is used for communication with only one other microcontroller. That MCU is on a PCB responsible for parsing the UDP command packets and forwarding them on the CAN bus. It’s not a multi-drop bus with many things listening and transmitting, so something was clearly going on in this board since the external CAN bus packets were deterministic and only sent when I sent a command.

I couldn’t examine a ‘byte RX’ interrupt like UART, but I could add breakpoints in every error CAN ISR flag handler to see if any were being set by the peripheral.
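The vendor ISR is basically a chain of flag checks, one branch per interrupt source, which makes it an easy place to drop breakpoints. A boiled-down sketch of that pattern (the register, flag, and handler shape below are invented for illustration, not the vendor code):

#include <stdint.h>

/* Illustrative only - not the vendor code, but the real ISR is a similar
   chain of flag checks with one branch per interrupt source. */
extern volatile uint32_t CAN1_INT_STATUS;            /* hypothetical status register */

#define CAN_FLAG_RX_NEW_MESSAGE     ( 1u << 0 )
#define CAN_FLAG_RX_FIFO_OVERFLOW   ( 1u << 1 )
#define CAN_FLAG_BUS_ERROR          ( 1u << 2 )

void CAN1_InterruptHandler( void )
{
    uint32_t status = CAN1_INT_STATUS;

    if( status & CAN_FLAG_RX_NEW_MESSAGE )
    {
        /* normal path: signal the CANBusRX task that a packet is waiting */
    }
    if( status & CAN_FLAG_RX_FIFO_OVERFLOW )
    {
        /* breakpoint here */
    }
    if( status & CAN_FLAG_BUS_ERROR )
    {
        /* breakpoint here */
    }
}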

Every possible CAN peripheral interrupt flag for the MCU (don't judge the extra whitespace, it's vendor code)

Voila - when the commands weren’t being received, the breakpoint inside the CAN RX FIFO overflow flag handler was triggered.

I put a breakpoint inside the if-statement in the standard CAN_Recieve function that would trigger if a valid packet was found in the CAN RX FIFO. That breakpoint still never triggered. We don’t have a proper watchdog set up that monitors every task individually, so the CAN RX task was dying in the overflow handler and spinning forever, but the rest of the device functioned perfectly normally. That’s why it took me a few days to even realize something was broken in the firmware.

So bytes were arriving and being inserted correctly into the FIFO, but nothing was ever removing and processing them. I had already been leaning towards some memory corruption or an out-of-bounds access, and given how unrelated the FreeRTOS timer queue and the CAN packet processing were, I was now quite convinced. I just had to find what was breaking the rules.

The hunt for the offending code

I went back and triple-checked all of the static memory allocation for the timer task, both for the task stack and even in the FreeRTOS timer code itself to ensure that the queue was correctly created.
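For context on what that static allocation looks like: with configSUPPORT_STATIC_ALLOCATION enabled, the application supplies the timer task’s TCB and stack through a hook like the one below, and the timer command queue itself is created inside FreeRTOS’s timers.c with storage sized by configTIMER_QUEUE_LENGTH. A rough sketch of the standard hook (not necessarily our exact code):

#include "FreeRTOS.h"

/* Standard static-allocation hook for the timer service task - buffer names
   are illustrative. FreeRTOS calls this to get the TCB and stack it should use. */
static StaticTask_t xTimerTaskTCB;
static StackType_t  uxTimerTaskStack[ configTIMER_TASK_STACK_DEPTH ];

void vApplicationGetTimerTaskMemory( StaticTask_t **ppxTimerTaskTCBBuffer,
                                     StackType_t  **ppxTimerTaskStackBuffer,
                                     uint32_t     *pulTimerTaskStackSize )
{
    *ppxTimerTaskTCBBuffer   = &xTimerTaskTCB;
    *ppxTimerTaskStackBuffer = uxTimerTaskStack;
    *pulTimerTaskStackSize   = configTIMER_TASK_STACK_DEPTH;
}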

I then tried bumping the size of the CAN FIFO itself, as it was sized to perfectly fit one of our packets. I wasn’t sure if something about the timer change was causing non-deterministic effects in other tasks, making them run slower or get scheduled less often. I thought maybe that was causing the CAN RX task to take just long enough to process the first packet that a second one overflowed the FIFO, but that change didn’t fix anything.

I went back to my breakpoint where the vendor CAN code was checking for new messages in the FIFO and this time really inspected it while running. Comparing the actual memory bytes of the CAN handles and FIFO buffers between the working firmware and the broken firmware, I found a small difference.

In the vendor CAN code, there are 3 CAN FIFOs - one for transmitting, and two for receiving. Pointers to the main FIFO data buffers are stored in an array, but there’s also a pair of arrays that hold small structs with metadata for each TX and RX FIFO. One of the two fields in each metadata struct is an index into the FIFO data buffer array, used to find the data buffer for that specific CAN channel (TX, RX1, or RX2) when checking for data.
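In other words, the lookup is two hops: metadata struct, then index, then buffer pointer. A stripped-down sketch of the shape of this data (the type, field, and variable names are mine for illustration, not the vendor's):

#include <stdint.h>

/* Simplified shape of the vendor structures - names are invented for illustration. */
#define NUM_FIFOS    3            /* 0 = TX, 1 = RX1, 2 = RX2 */

typedef struct
{
    uint8_t fifoIndex;            /* index into fifoBuffers[] - the field that got zeroed */
    uint8_t pendingCount;         /* the other metadata field (illustrative) */
} FifoMeta;

static uint8_t  txBuffer[ 64 ];
static uint8_t  rx1Buffer[ 64 ];
static uint8_t  rx2Buffer[ 64 ];

static uint8_t *fifoBuffers[ NUM_FIFOS ] = { txBuffer, rx1Buffer, rx2Buffer };

static FifoMeta rxMeta[ 2 ] =
{
    { .fifoIndex = 1 },           /* RX1 -> fifoBuffers[ 1 ] */
    { .fifoIndex = 2 },           /* RX2 -> fifoBuffers[ 2 ] */
};

/* The receive path dereferences through the metadata: if rxMeta[] gets zeroed,
   both RX channels end up polling fifoBuffers[ 0 ], which is the TX buffer. */
static uint8_t *rx_fifo_for_channel( unsigned channel )
{
    return fifoBuffers[ rxMeta[ channel ].fifoIndex ];
}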

Diagram showing how the metadata array structs have an index field that indexes into the buffer pointer array which has pointers to the actual buffers (it's pointers all the way down)

In the working firmware, the RX metadata array contained 2 structs with the expected indexes of 1 and 2 for RX1 and RX2 respectively (index 0 in the FIFO buffer array is the TX FIFO). In the broken firmware, however, this small metadata array was totally zeroed out, so the index fields of those structs all held 0 instead of the correct values of 1 and 2. As a result, the code checked the FIFO buffer at index 0 for new messages for all three FIFOs instead of each FIFO’s proper buffer.

The channel field corrupted from the proper values of 1 and 2 (both RX FIFO indexes) to 0 (the TX FIFO index)

Memory bytes of the metadata struct array, normal (left) and corrupted (right). Red box is the first metadata struct holding an index field of value 1, blue box is the second with an index field of value 2. The corrupted structs hold only zeros

Since the FIFO at index zero was the TX buffer, that check always found no new messages. That caused an RX overflow when more than one CAN packet was received, because the first was still in the FIFO and no one was checking to process it out. It was unfortunate that the overwritten values in the metadata struct were zeros (a valid index) instead of random bytes, as the latter would have been more likely to fail immediately when trying to access out-of-bounds memory.

The same diagram, but portions in red are the corrupted bytes causing all metadata structs to index to the TX FIFO

Alright, it only took one full workday to find the cause of the problem… nice. Next step: find out what’s corrupting the array and ask it politely to stop.

I made the metadata array variable non-static so the symbol would be global, and then checked the map file to see what was located near it in memory. I was hoping to see a nice little array right before it in memory that was overflowing its buffer, but it wasn’t that clear.

MPLAB and the Microchip compilers do some weird stuff where they break the .bss and .data sections up into tons of small section labels, which results in .bss and .data memory being mingled closely. The following is what the memory around the corrupted array looked like.

At first, I thought I found a handle for a FreeRTOS timer directly before it in memory which seemed highly suspicious. I looked a bit closer and realized this was the annoying Microchip compiler behavior interleaving the .bss and .data sections, since those .bss addresses weren’t anywhere close to the CAN FIFO one.

The objects at the actual addresses before the RX FIFO metadata were another timer, the compiled vendor ADC code, and a compiled external vendor Modbus library. The timer was again suspicious, but I was doubtful it would overwrite memory belonging to two separate libraries without more serious impacts on the firmware.

This didn’t show me anything definitive, but now I at least had an idea that something related to the ADC or external Modbus library might be the problem.

My next and final stop on the debug train: data breakpoints. These are so helpful when dealing with memory corruption issues, but they can be really finicky to get working well, especially because Microchip’s MPLAB debugging experience is a nightmare even on the best of days. That’s why I tried to leave this step until the very end.

A data breakpoint is just like a normal breakpoint in that it halts the execution of the processor at a specific instruction. Unlike the most common type of breakpoint, which breaks on a specific line of code (a specific assembly instruction if we’re being pedantic), a data breakpoint halts when a specific memory address is read or written. That means I could chuck a breakpoint on the address of the FIFO metadata and it would break every time something changed that memory.

That was perfect for memory that’s initialized one time and never touched again, like this metadata array, because theoretically that breakpoint should only ever trigger once. I first re-ran the firmware in the working case with this data breakpoint. It halted once early on in some assembly code with no source mapping. I stepped around and glanced at the disassembly to see if there were any clues, but it was all fairly foreign and had no familiar labels. I assumed this was the C runtime startup initializing variables and copying the data from flash to RAM. I continued and the breakpoint wasn’t hit again, as expected since the array is never altered.
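That first hit lines up with what typical embedded startup code does before main(): copy the .data image from flash into RAM and zero out .bss, which writes every initialized global exactly once, this metadata array included. Conceptually it looks like the sketch below (the real routine is assembly, and the section symbol names vary by toolchain and linker script):

#include <stdint.h>

/* Conceptual C version of the startup section init - names are illustrative. */
extern uint32_t _etext, _sdata, _edata, _sbss, _ebss;

void startup_init_sections( void )
{
    const uint32_t *src = &_etext;                   /* .data load image in flash */

    for( uint32_t *dst = &_sdata; dst < &_edata; )   /* copy .data into RAM */
    {
        *dst++ = *src++;                             /* the one expected watchpoint hit */
    }

    for( uint32_t *dst = &_sbss; dst < &_ebss; )     /* zero .bss */
    {
        *dst++ = 0u;
    }
}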

I re-flashed the non-working firmware and re-ran. The first break in assembly occurred again, but when I continued I got a second break. This one came from inside a file in the external Modbus library code. As obvious as that might seem, I still wasn’t 100% convinced this was the problem. As I mentioned, the debugging in MPLAB is really shaky, and a lot of the time the call stack can be corrupted or the assembly isn’t mapped accurately to the source code lines it displays. I stepped past a couple of the lines in question but didn’t see any changes in the actual memory of the chip at the address of the buffer getting corrupted. But after continuing and examining the memory once the system was fully set up, the array was once again corrupted.

This is the code surrounding where the external Modbus library halted on the data breakpoint, with a comment indicating the halted line:

if( !bIsInitalized )
{
    DEBUG_INIT;
    HDL_RESET( &xSerialHdls[USART_USART0_IDX] );
    HDL_RESET( &xSerialHdls[USART_USART1_IDX] );  // <------ data breakpoint halt here
    bIsInitalized = TRUE;
}

First I looked at what the HDL_RESET macro did. All it did was initialize the serial handle struct fields to some basic initial values. I then looked at the declaration of the xSerialHdls array and…

#define NUARTS                          ( 1 )
...
STATIC xMBPSerialIntHandle xSerialHdls[NUARTS];

Looks like this array is declared with room for a single element, but the setup code accesses two hardcoded indexes during initialization.

Sure enough, examining the xSerialHdls variable in memory showed that it sits 2 bytes directly before the CAN RX FIFO metadata buffer. When HDL_RESET was called on the imaginary second entry of the xSerialHdls array, it overwrote the CAN metadata array with its zeroed-out initialization values.

The fix was as simple as removing that second HDL_RESET call, as the external Modbus library only used a single UART.
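For what it’s worth, an equally small but slightly more defensive version of the fix is to loop over NUARTS instead of hardcoding indexes, so the init can never walk off the end of the array again if it gets resized (just a sketch, reusing the library’s existing names):

if( !bIsInitalized )
{
    DEBUG_INIT;
    /* Reset exactly as many handles as the array holds - the hardcoded second
       index is what walked off the end of xSerialHdls in the first place. */
    for( unsigned int i = 0; i < NUARTS; i++ )
    {
        HDL_RESET( &xSerialHdls[ i ] );
    }
    bIsInitalized = TRUE;
}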

Moral of the story

I went back through the git history of this repo (which I inherited in this state, by the way) and found that the external library was initially checked in with an NUARTS value of 3, so originally there was not only a second but also a third element in this array.

At some point, someone who worked at my studio previously had gone through and edited huge swaths of the vendor code to get the integration working and clearly had not caught this.

I was…less than pleased to find this vendor code with large pieces edited/commented out/newly added, and zero comments anywhere. If you are going to edit third-party code and libraries, here are some steps to follow:

  1. Don’t
  2. Seriously, it’s usually better in the long run to work around it in your own code for reasons exactly like this
  3. If you must make edits, make extremely clear and descriptive comments around every section you change so people can see that it was intentionally changed from the original
  4. Write better commit messages than “updated Modbus lib” when committing those changes

Okay but I still don’t understand what’s going on here

How did not only increasing but also decreasing the size of the FreeRTOS timer queue cause this CAN corruption? I’m honestly not too sure. This overflow has apparently been in the firmware for a long time but I’m guessing it was overwriting something inconsequential or something that was re-initialized to correct values by the firmware itself anyway.

If this behavior had only been seen when the queue was one particular length and everything worked otherwise (aka the opposite of the actual situation), I could understand it: the memory allocation kept growing until this buffer overflow lined up with the FIFO metadata array. The super weird part for me was that the exact same problem was seen at any timer queue length other than 4 or 5.

3? Still happened.

10, even 16? Still happened the exact same way.

I’m guessing that once the queue was bigger than 5 elements, it was moved to a different section of memory with more space, leaving the UART handle and the metadata array next to each other again. Similarly with the 3-element queue: it was probably small enough to get shifted somewhere else, once again leaving the two variables adjacent.

Bugs like these make me miss the days of high-level software engineering where I was presented with beautiful stack traces and never worried about memory allocation at all. Once I finally solve the problem though, I admit that I really do love doing this type of debugging much more.