Aqara E1: Zigbee2MQTT, the NXP Zigbee stack, and OTA updates

This is the final part of a five-part post detailing the design and implementation of new firmware for the Aqara E1 Zigbee light switches:

Replacing the firmware in the Aqara E1 light switch
Reverse engineering the PCB
Planned new firmware features
Writing new firmware
Zigbee2MQTT, the NXP Zigbee stack, and OTA updates

Part of my project to write new firmware for the Aqara E1 Zigbee light switches involved implementing the Zigbee Cluster Library OTA Update function. It didn’t go particularly smoothly, with two main obstacles. Firstly:

How are OTA updates set up in Zigbee2MQTT?

I followed the Zigbee2MQTT instructions to set up a firmware index override file and placed the OTA image in the same folder. After connecting the device, the network traffic showed:

Device asks for a new image (QueryNextImage)
Zigbee2MQTT responds with NO_IMAGE_AVAILABLE, despite logging that it had found a valid updated OTA image.

After checking the Zigbee2MQTT source, it became clear that this sort of auto-updating is not supported. In response to a QueryNextImage command from the device, Zigbee2MQTT:

Searches for available updates.*
Checks the firmware against the current version.
Outputs new firmware available.
Responds to the device with NO_IMAGE_AVAILABLE.

* If the last check was < update_check_interval minutes ago, it actually just ignores the request. This is not helpful as the device will retry a few times. I patched Zigbee2MQTT to fix this.

I assume the reason for this is to ensure that OTA updates are only performed when explicitly requested by the user. In response to requests from the client device, Zigbee2MQTT updates its internal database of available firmware, but doesn’t start an update. The user can then navigate to the OTA tab in the UI and trigger the update manually, or send an update request via MQTT.

So I powered up the device, waited for it to fail after 3 QueryNextImage attempts, and then hit the “Update” button on the OTA page, which sent out an ImageNotify command to the device.

Firmware attempt 1, firmware fix 1

Despite the device sending out QueryNextImage commands correctly, it did not respond to the incoming ImageNotify command.

After much frustration, I discovered that although the client (output) OTA Upgrade cluster was implemented and included in the Zigbee configuration, the endpoint in the NXP Zigbee configuration editor needs the server (input) cluster included also. Without this, it won’t respond to any incoming commands from the external OTA server.

When an endpoint with an output cluster sends data, the receiving endpoint must have an input cluster in order to receive the data, otherwise the stack will reject it and will not notify the receiving endpoint. However, the Default cluster can be added to the endpoint in order to deal with received data that is destined for input clusters not supported by the endpoint (see the Note below this procedure).

…

Note: In the above procedure, you may want to add the Default cluster (with a Cluster ID of 0xFFFF) as an input cluster. The inclusion of the Default cluster means that received messages that were intended for input clusters not supported by the endpoint will still be passed to the application. The messages must, however, come from defined application profiles, otherwise they are discarded.

ZigBee 3.0 Stack User Guide
Rev. 4.1 - 2 March 2023

After adding the OTA Upgrade input cluster, the device began responding to the commands from the server.

Firmware attempt 2, firmware fix 2

After fixing the first issue, the device responded to the incoming ImageNotify command as expected:

<- ImageNotify     | QueryJitter : 100

-> QueryNextImgReq | FileVersion : 0x64

<- QueryNextImgRsp | FileVersion : 0x65
                   | ImageSize   : 268446

-> BlockReq        | FileVersion : 0x65
                   | MfrCode     : jennic
                   | ImageType   : 0x01
                   | FileOffset  : 0
                   | MaxDataSize : 48

<- BlockRsp        | FileVersion : 0x65
                   | MfrCode     : jennic
                   | ImageType   : 0x01
                   | FileOffset  : 0
                   | DataSize    : 48
                   | ImageData   : xxx (48-bytes)

-> BlockReq        | FileVersion : 0x65
                   | MfrCode     : jennic
                   | ImageType   : 0x01
                   | FileOffset  : 48
                   | MaxDataSize : 48

<- BlockRsp        | FileVersion : 0x65
                   | MfrCode     : jennic
                   | ImageType   : 0x01
                   | FileOffset  : 48
                   | DataSize    : 48
                   | ImageData   : xxxxxx (48-bytes)

Looking good 👍

To avoid saturating the network, the SDK sample code requests one block every RND_u32GetRand(10,20) seconds by default. This is far too slow. At one 48-byte block every 15 seconds on average, that would be 24 hours for our 270k image.

The timing is determined by the OTA_TIME_INTERVAL_BETWEEN_REQUESTS macro in the zcl_options.h file. I set it to 0, which results in one block every second (the SDK code adds 1 to this value). That brings the time down to around 1.5 hours. Tolerable.

Let it run

I let it run overnight and checked the logs in the morning. Wireshark showed that all block requests were made and satisfied 🥳, and then the device sent an UpgradeEndReq to the server (coordinator - i.e. Zigbee2MQTT).

-> UpgradeEndReq   | Status      : 0x00 (success)
                   | ImageType   : 0x01 (mfr specific)
                   | MfrCode     : jennic
                   | FileVersion : 0x65

This command was sent about 8 minutes after the last block was received, and then repeated every 6 seconds for a very long time. There was no response from Zigbee2MQTT in the Wireshark log.

The Zigbee2MQTT log showed why:

[02:53:02] info:     zhc:ota: [0x0123456789abcdef | e1_moult_a] Request offsets: fileOffset=274942 pageOffset=0 maximumDataSize=64
[02:53:02] info:     zhc:ota: [0x0123456789abcdef | e1_moult_a] Payload offsets: start=274942 end=274990 dataSize=50
[02:55:32] info:     z2m: Update of '0x0123456789abcdef' failed (Error: [0x0123456789abcdef | e1_moult_a] Timeout. Device did not start/finish firmware download after being notified. (Timeout - 12415 - 100 - undefined - 25 - 3 after 150000ms))
[02:55:32] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/0x0123456789abcdef', payload '{"linkquality":214,"state_1":"OFF","state_2":"OFF","update":{"installed_version":100,"latest_version":101,"state":"available"}}'
[02:55:32] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/response/device/ota_update/update', payload '{"data":{},"error":"Update of '0x0123456789abcdef' failed ([0x0123456789abcdef | e1_moult_a] Timeout. Device did not start/finish firmware download after being notified. (Timeout - 12415 - 100 - undefined - 25 - 3 after 150000ms))","status":"error","transaction":"3d36n-6"}'
[02:55:32] error:    z2m: Update of '0x0123456789abcdef' failed ([0x0123456789abcdef | e1_moult_a] Timeout. Device did not start/finish firmware download after being notified. (Timeout - 12415 - 100 - undefined - 25 - 3 after 150000ms))

So Zigbee2MQTT gives the device 150 seconds to do something after receiving the last block, but the device takes several minutes. Why? What is it doing? It’s “verifying” the image is what. Although the device UART log missed the transition from last block -> verify (only 80k lines of history), it did catch the end of the verification process, where it declared the image valid and sent the UpgradeEndReq.

The ZCL spec doesn’t appear to specify any maximum time limit between the final block request and upgrade end request, but 150 seconds seems plenty. So why does the device take 8 minutes?

Logging is not cost free

I’m not sure what happens during the verification process, but it involves reading the new image from the flash in 16-byte chunks, as can be seen in the debug output:

EP100_callback: [E_ZCL_CBET_CLUSTER_CUSTOM]
Custom evt: ep: 100, cluster: 25
OTA: OTA_CLIENT_EVENT: INTERNAL_COMMAND_FREE_FLASH_MUTEX
EP100_callback: [E_ZCL_CBET_CLUSTER_CUSTOM]
Custom evt: ep: 100, cluster: 25
OTA: OTA_CLIENT_EVENT: INTERNAL_COMMAND_LOCK_FLASH_MUTEX
vOtaFlashLockRead()
- Read 16 bytes at 0x08431a0

With 16-byte chunks, that’s ~17k chunks for the entire image, and for each chunk, the debug log outputs the above 334-byte message. That got me thinking - How long does it take to output 334-bytes over a 115200 baud UART 17k times? I’m guessing around 8 minutes. Let’s check:

334-byte msg * 17k chunks
 = 5,740,416 characters

UART
 = 115200 baud / 8 data bits / no parity / 1 stop bit
 = 10 bits per character including the start bit
 = 11,520 characters per second

5,740,416 / 11,520
 = 498 seconds
 = 8.3 minutes

Sounds about right.
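As a cross-check of the estimate, here is a minimal sketch in C. It assumes standard 8N1 framing, where each character costs 10 bits on the wire (1 start bit, 8 data bits, 1 stop bit). The function name is made up for illustration and is not part of the SDK.

```c
#include <stdint.h>

/* Whole seconds needed to push `chars` characters through a blocking UART.
 * bits_per_char is the full frame size: 10 for 8N1 (start + 8 data + stop).
 * Hypothetical helper, not an NXP SDK function. The result is rounded down. */
uint32_t uart_tx_seconds(uint64_t chars, uint32_t baud, uint32_t bits_per_char)
{
    return (uint32_t)((chars * bits_per_char) / baud);
}
```

With the figures above, uart_tx_seconds(5740416, 115200, 10) gives 498 seconds, a little over 8 minutes, which is in line with the delay seen in Wireshark.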

The SDK function DBG_vPrintf is undocumented (unless you include 2016 docs from a prior chip). If you trace through the source though, you find that it is an alias for PRINTF, which is in turn an alias for DbgConsole_Printf. I discussed this in the introduction to the NXP Zigbee SDK article. So it’s:

DBG_vPrintf -> PRINTF -> DbgConsole_Printf

The documentation for the DbgConsole_Printf function is in the MCUXpresso SDK Reference Manual, and makes no mention of whether the function blocks, or whether it writes to a buffer and output is via interrupt/RTOS task/polled function etc.

Side note: It was already clear that it blocked, as there didn’t appear to be any output dropped, and the device does not have a fraction of the memory required to buffer that much data.

As with most things in this SDK, if you want to know what’s going on, you need to go to the source. There you’ll find that the symbol DEBUG_CONSOLE_TRANSFER_NON_BLOCKING determines whether it blocks or not. Further searching reveals the fsl_debug_console_conf.h file, which includes the following:

/*! @brief If Non-blocking mode is needed, please define it at project setting,
 * otherwise blocking mode is the default transfer mode.
 * Warning: If you want to use non-blocking transfer,please make sure the corresponding
 * IO interrupt is enable, otherwise there is no output.
 * And non-blocking is combine with buffer, no matter bare-metal or rtos.
 * Below shows how to configure in your project if you want to use non-blocking mode.
 * For IAR, right click project and select "Options", define it in "C/C++ Compiler->Preprocessor->Defined symbols".
 * For KEIL, click "Options for Target…", define it in "C/C++->Preprocessor Symbols->Define".
 * For ARMGCC, open CmakeLists.txt and add the following lines,
 * "SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -DDEBUG_CONSOLE_TRANSFER_NON_BLOCKING")" for debug target.
 * "SET(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -DDEBUG_CONSOLE_TRANSFER_NON_BLOCKING")" for release target.
 * For MCUxpresso, right click project and select "Properties", define it in "C/C++ Build->Settings->MCU C
 * Complier->Preprocessor".
 */

And there is the answer: it blocks by default.

Firmware attempt 3 🥳

Logging will be disabled in production and so won’t cause this issue. It’s needed in testing though, and so I upped the UART baud rate 8x to 921,600 and halved the number of characters logged per chunk, reducing the output time required to about 30 seconds.

The chip reference manual suggests that a baud rate up to core clock / 16 is possible, which is 3,000,000 based on the 48 MHz clock, so 921,600 should be fine over the few centimeters from the device to the USB<->serial converter.

I reran the process, and after a 36 second verification, the UpgradeEndReq was received and processed:

[03:05:02] info:     zhc:ota: [0x0123456789abcdef | e1_moult_a] Payload offsets: start=273502 end=273550 dataSize=50
[03:05:38] info:     zhc:ota: [0x0123456789abcdef | e1_moult_a] Got upgrade end request: {"status":0,"manufacturerCode":4151,"imageType":1,"fileVersion":101}
[03:05:38] info:     zhc:ota: [0x0123456789abcdef | e1_moult_a] Update successful. Waiting for device announce...
[03:05:38] info:     z2m: Update of '0x0123456789abcdef' at 100.00%
[03:05:38] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/0x0123456789abcdef', payload '{"linkquality":211,"state_1":"OFF","state_2":"OFF","update":{"installed_version":100,"latest_version":101,"progress":100,"remaining":29,"state":"updating"}}'
[03:07:38] info:     zhc:ota: [0x0123456789abcdef | e1_moult_a] Timed out waiting for device announce, update considered finished.
[03:07:38] info:     z2m: Finished update of '0x0123456789abcdef'
[03:07:38] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/0x0123456789abcdef', payload '{"linkquality":211,"state_1":"OFF","state_2":"OFF","update":{"installed_version":101,"latest_version":101,"state":"idle"}}'
[03:07:38] info:     z2m: Device '0x0123456789abcdef' was updated from '{"dateCode":"20241201","softwareBuildID":"1000-0001"}' to '{"dateCode":"20241201","softwareBuildID":"1000-0001"}'
[03:07:38] info:     z2m: Configuring '0x0123456789abcdef'
[03:07:38] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/response/device/ota_update/update', payload '{"data":{"from":{"date_code":"20241201","software_build_id":"1000-0001"},"id":"0x0123456789abcdef","to":{"date_code":"20241201","software_build_id":"1000-0001"}},"status":"ok","transaction":"zcbcy-2"}'
[03:07:39] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/0x0123456789abcdef', payload '{"linkquality":211,"state_1":"OFF","state_2":"OFF","update":{"installed_version":101,"latest_version":101,"state":"idle"}}'
[03:07:39] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/0x0123456789abcdef', payload '{"linkquality":211,"state_1":"OFF","state_2":"OFF","update":{"installed_version":101,"latest_version":101,"state":"idle"}}'
[03:07:39] info:     z2m: Successfully configured '0x0123456789abcdef'

To confirm the update was successful, I had added “New Version” to the startup log message in the updated firmware file. Checking the log:

...
OTA_CL_EVT: INT_SAVE_CONTEXT
vOTAPersist()
EP100_cb: [CSTC]
- cl 25
OTA_CL_EVT: INT_RESET_TO_UPG
E_CLD_OTA_INTERNAL_COMMAND_RESET_TO_UPGRADE

***********************************************
* ROUTER RESET New Version                    *
***********************************************
SYSCON->MEMORYREMAP = 0x0e400486
APP: APP_vInitialise()
APP: Initializing ZTimers
APP: Initializing ZQueues
APP: PDM initialized: 17/46 segments occupied
...

Faster iteration = faster development

90 minutes per update is fine in production, but it’s too slow for testing, so the next step was to speed it up. The code that times the block requests uses the u32ZCL_GetUTCTime function, which has seconds resolution. The documentation states:

This function obtains the current time (UTC) that is stored in the ZCL (‘ZCL time’)

**Returns**
The current time (UTC), in seconds, obtained from the ZCL

NXP ZCL User Guide (JN-UG-3115)

It returns a 32-bit unsigned integer, which is the number of seconds since…..? It’s the ZCL lib, so I took a look at the ZCL spec, and:

UTCTime is an unsigned 32-bit value representing the number of seconds since 0 hours, 0 minutes, 0 seconds, on the 1st of January, 2000 UTC (Universal Coordinated Time). The value that represents an invalid value of this type is 0xffffffff.

ZCL Revision 7

So along with PostgreSQL, they use a 2000-based epoch. None of this matters for us though, as we only care about relative time, and in any case, it’s never even set, and so starts at 0 and is incremented by a 1s timer.

After each block, the next block is scheduled at u32ZCL_GetUTCTime() + OTA_TIME_INTERVAL_BETWEEN_REQUESTS + 1.

I decided to change the resolution from seconds to milliseconds

You can’t trust the NXP docs

I searched around the Zigbee stack/MCUXpresso SDK/Connectivity Framework docs and came across the TMR_GetTimestamp function in the Connectivity Framework, which:

Description:
Returns the absolute time at the moment of the call

Returns:
Timestamp (in ms).

Connectivity Framework Reference Manual Rev.14

Seems perfect. The functionality must first be initialized with TMR_TimeStampInit (this is not mentioned in the function documentation).

What is also not mentioned, is that the source code indicates that the returned value is not milliseconds, but microseconds:

TimersManager.c

/*! -------------------------------------------------------------------------
 * \brief  return an 64-bit absolute timestamp
 * \return absolute timestamp in us
 *---------------------------------------------------------------------------*/
uint64_t TMR_GetTimestamp(void)
{
#if mTMR_PIT_Timestamp_Enabled_d
    return TMR_PITGetTimestamp();
#elif  gTimestampUseWtimer_c
    return Timestamp_Get_uSec();
#else
#if defined (gMWS_UseCoexistence_d) && (gMWS_UseCoexistence_d) ||\
    defined(gWCI2_UseCoexistence_d) && (gWCI2_UseCoexistence_d)
    return TMR_CTGetTimestamp();
#else
    return TMR_RTCGetTimestamp();
#endif
#endif
}

And then further down in the same source file:

/*! -------------------------------------------------------------------------
 * \brief  Returns the absolute time at the moment of the call.
 * \return Absolute time at the moment of the call in microseconds.
 *---------------------------------------------------------------------------*/
uint64_t TMR_RTCGetTimestamp(void)
{
#ifdef CPU_JN518X
    uint64_t sec = (uint64_t)RTC_GetTimeSeconds(RTC);
    return sec * 1000000L;
#else
...

So the docs say milliseconds, the code is actually microseconds, and for the JN518x, although the resolution is microseconds, the accuracy is actually only seconds 🤦.

Keep digging

I found this undocumented function in the Connectivity Framework code:

middleware/wireless/framework/OSAbstraction/Interface/fsl_os_abstraction.h

/*!
 * @brief This function gets current time in milliseconds.
 *
 * @retval current time in milliseconds
 */
uint32_t OSA_TimeGetMsec(void);

It’s shown as part of the MCUXpresso SDK API ref here, but is not present in the same document that is included with the SDK download. The whole OSA Adapter section is missing from that document, and seems to have been moved to the Wireless Framework doc and renamed as OS Abstraction. There are many OSA_ functions (e.g. OSA_TimeDelay) present there, but not this one.

The OSA_TimeDelay function is documented, relies on OSA_TimeGetMsec, and is widely used so it seems safe.

For a Zigbee app on the JN5189, the implementation simply returns the value of a 32-bit counter that is incremented in the SysTick handler.

I also came across zbPlatGetTime, which is again undocumented, but is used in the ZTimer code. It seems to just call OSA_TimeGetMsec.

32-bit overflow

A 32-bit unsigned integer incrementing every millisecond will overflow after about 50 days. Unsigned integer overflow in C/C++ is well defined and wraps around on a modulo basis.

So what happens if an OTA upgrade is in progress during the overflow? Nothing good.

If the next block request time itself overflowed, but the timer hadn’t before the next check, then the next block request would go out early (next req time = 0, current time = 4 billion).
If the next block request time was set and then the timer overflowed before the next check, then the update would stall (next req time = 4 billion, current time = 0).

This second possibility is a problem, as it would likely require resetting the device. The current code won’t time out and so won’t accept any new update request.

Just check the elapsed time?

The usual solution to this issue is to just check the elapsed time. Since unsigned arithmetic in C/C++ ignores overflow and truncates the result to the (correct) lowest bits, we can just do this:

static const uint32_t min_req_interval_ms = 200;
static uint32_t last_req_time_ms = 0;
uint32_t now_ms = getNow();

if (now_ms - last_req_time_ms > min_req_interval_ms) {
    last_req_time_ms = now_ms;
    // do it
}
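To see why the subtraction survives a wrap, here is a small self-contained sketch. The function names are made up, and getNow() is replaced by explicit arguments so the wrap case can be exercised directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive absolute comparison: stalls if the counter wraps after the
 * scheduled time has been set near the top of the 32-bit range. */
bool due_absolute(uint32_t now_ms, uint32_t scheduled_ms)
{
    return now_ms >= scheduled_ms;
}

/* Elapsed-time comparison: unsigned subtraction is modulo 2^32, so the
 * difference is still correct when now_ms has wrapped past last_ms. */
bool due_elapsed(uint32_t now_ms, uint32_t last_ms, uint32_t interval_ms)
{
    return (uint32_t)(now_ms - last_ms) > interval_ms;
}
```

With last_ms = 0xFFFFFF00 and now_ms = 0x00000100 (512 ms later, across the wrap), due_elapsed(now_ms, last_ms, 200) is true. The absolute form, with the request scheduled at 0xFFFFFFC8, returns false for the same now_ms and keeps doing so until the counter climbs all the way back up.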

The SDK code doesn’t check the elapsed time though, it sets the scheduled_req_block_time_ms in many different places and then checks if now_ms >= scheduled_req_block_time_ms. So we need a separate overflow check:

if( (req_time_ms <= now_ms) && (req_time_ms != 0)) {
    requestNextBlock();
}
else if ( (req_time_ms - now_ms) > 120'000) {
    req_time_ms = now_ms;
}

Refactoring the code to avoid this check would be preferred, but is much more involved and not necessary here.
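For reference, a common way to compare against a scheduled time without a separate overflow check is to look at which half of the 32-bit range the wrapped difference falls in. This is not what the SDK does, nor the check shown above. It is a sketch of the alternative, with a hypothetical function name, and it assumes the two times are never more than 2^31 ms (about 24.8 days) apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* True once now_ms has reached or passed deadline_ms, even across a wrap.
 * The wrapped difference is treated as a signed distance: values below 2^31
 * mean the deadline has passed, larger values mean it is still in the future.
 * Comparing against 0x80000000 avoids an implementation-defined signed cast. */
bool deadline_reached(uint32_t now_ms, uint32_t deadline_ms)
{
    return (uint32_t)(now_ms - deadline_ms) < 0x80000000u;
}
```

Both problem cases described earlier come out right with this form: a deadline that wrapped to a small value is not treated as already due, and a deadline set just before the wrap is recognised as passed once the counter restarts from zero.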

I revised the code, set the block delay to 50ms, and reran the process.

The OTA code is not designed for request delays < server response time

Side-rant

The OTA code in the NXP SDK is convoluted and difficult to parse. It is based on a multi-level state machine implemented with function pointers, with separate state machines on the application and SDK side. Communication is via various variables and flags, with the client side polling for changes.

Compounding this, it is dense with conditional compilation macros based on symbols defined in the makefile, and MCUXpresso is often not smart enough to know what is defined when displaying the code in the editor.

Problems became apparent immediately. Wireshark showed multiple block request commands going out before the response to the first came back. I had assumed that the OTA state machine would handle the request -> response -> request... states, but it does not. The multiple requests were for the same block, as the request code relies on the response handler incrementing the offset.

Side-note: The Aqara stock firmware appears to have this problem. Wireshark shows multiple requests for the same block going out prior to receiving the response.

I never intended to have parallel block requests. What was needed was a mutex to ensure that new block requests weren’t sent before the prior request’s response was received. Looking at the code, it seemed that a signal was already implemented. It just wasn’t being used.

void vOtaHandleTimedBlockRequest(...)
   ...
   if(psOTA_Common->sOTACallBackMessage.sPersistedData.bWaitForNextBlockReq == FALSE)
   {
       vOtaRequestNextBlock(....

The bWaitForNextBlockReq flag is never set anywhere and is always false. So I set it immediately before calling vOtaRequestNextBlock above and cleared it in the response handling code.

Once that was done, everything worked as expected, with sequential request -> response -> request -> response.... The transfer time fell to 27 minutes. Much better, but still surprisingly high given that I had set the delay to 50ms.
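The gating logic reduces to something like the following sketch. It is a simplified model rather than the SDK code: the struct, field, and function names are invented for illustration, and the actual radio transmission is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the OTA client's persisted state. */
typedef struct {
    uint32_t file_offset;     /* offset of the next block to request */
    bool     waiting_for_rsp; /* plays the role of bWaitForNextBlockReq */
} ota_model_t;

/* Called from the timer: sends a request only if none is outstanding.
 * Returns true if a request for file_offset was (notionally) sent. */
bool ota_model_try_request(ota_model_t *s)
{
    if (s->waiting_for_rsp) {
        return false;          /* previous block not answered yet */
    }
    s->waiting_for_rsp = true; /* set before the request goes out */
    return true;
}

/* Called from the block response handler: advance the offset and re-arm. */
void ota_model_on_response(ota_model_t *s, uint32_t data_size)
{
    s->file_offset += data_size;
    s->waiting_for_rsp = false;
}
```

However often the timer fires, only one request per block goes out, because the offset only advances and the flag only clears in the response path.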

Working, but why so slow?

Wireshark showed the limiting factor to be Zigbee2MQTT. The delay between the device receiving a response and sending the next block request was ~50ms. The delay between sending the request and receiving the response was 250ms. This set a cap of about 3-4 blocks/second.

In an effort to waste more time, I decided to delve into the Zigbee2MQTT source.

Always read the manual first

The OTA code is in the zigbee-herdsman-converters project. Thankfully it only took me about a minute to determine that I should have just read the manual:

/** Use to reduce network congestion by throttling response if necessary */
export const DEFAULT_IMAGE_BLOCK_RESPONSE_DELAY = 250;
...
let imageBlockResponseDelay = DEFAULT_IMAGE_BLOCK_RESPONSE_DELAY;
...
initialMaximumDataSize = settings.defaultMaximumDataSize || DEFAULT_MAXIMUM_DATA_SIZE;
...
const sendImageBlockResponse = async (
    ...
// Reduce network congestion by throttling response if necessary
    clearTimeout(lastBlockTimeout);

    const now = Date.now();
    const timeSinceLast = now - lastBlockResponseTime;
    const delay = imageBlockResponseDelay - timeSinceLast;
    ...

The configuration instructions for OTA updates are well written and make it quite clear why throughput maxes out at about 4 blocks per second:

You can increase or decrease the default speed (the minimum delay between two chunks) at which Zigbee2MQTT responds during the update process to send chunks of images. The default is 250ms….
ota:
   image_block_response_delay: 250
Zigbee2MQTT OTA Updates - Advanced configuration

I changed it to 50ms. I also changed the maximum block size to 64 (the NXP SDK requires a multiple of 16) and updated the firmware to use the same.

ota:
  zigbee_ota_override_index_location: firmware-index.json
  default_maximum_data_size: 64
  image_block_response_delay: 50

7 minutes!

The transfer time fell to 7 minutes. With no delay at all on the device, the limit is about 3.5 minutes for the 270k image, but the SDK software is designed to pause for each block, and if there’s no pause then it’s always running a busy loop waiting on the response. 7 minutes is fine for testing.
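These numbers are easy to sanity-check. Assuming each block costs one device-side delay plus one server-side delay, and taking the 268,446-byte image size from the earlier trace, a rough model is sketched below. The function name is made up and air time is ignored.

```c
#include <stdint.h>

/* Rough OTA transfer time in milliseconds: number of blocks (rounded up)
 * times the per-block round trip. Hypothetical helper; ignores air time. */
uint64_t ota_transfer_ms(uint32_t image_bytes, uint32_t block_bytes,
                         uint32_t device_delay_ms, uint32_t server_delay_ms)
{
    uint64_t blocks = ((uint64_t)image_bytes + block_bytes - 1) / block_bytes;
    return blocks * ((uint64_t)device_delay_ms + server_delay_ms);
}
```

ota_transfer_ms(268446, 48, 50, 250) is 1,677,900 ms (about 28 minutes) and ota_transfer_ms(268446, 64, 50, 50) is 419,500 ms (about 7 minutes), both close to the observed times. Dropping the device delay to 0 gives 209,750 ms, which is the 3.5 minute floor.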

Updating the original Aqara firmware OTA

This project is part of my attempt to replace the firmware on the Aqara E1 series of Zigbee switches. I have several of these and once the firmware is done, would like to avoid having to disassemble each and program it manually over ISP.

Zigbee devices will typically refuse firmware not made for them

Zigbee OTA images have a defined header. For the existing firmware to accept a new OTA image, there are several things in this header that (typically) need to match the existing one:

Manufacturer code  - 16-bit number                | new image == old image
Image type         - 16-bit number, 0x0000-0xffbf | new image == old image
File version       - 32-bit number                | new image > existing image
OTA header string  - 32-byte string               | new image == old image

I write “typically” because in the end, it is up to the application code on the device to decide if the image is suitable. Some may accept lower version numbers and others (e.g. the NXP SDK) do additional checks such as the link key.
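As a sketch, the typical checks in the table could look like the following. The struct and function are hypothetical and only mirror the four fields listed above. Real firmware, including the NXP SDK, may apply different or additional rules.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The header fields a device typically compares (illustrative type). */
typedef struct {
    uint16_t manufacturer_code;
    uint16_t image_type;
    uint32_t file_version;
    char     header_string[32];
} ota_ident_t;

/* True if `offered` passes the typical checks against `running`:
 * same manufacturer, image type and header string, and a newer version. */
bool ota_typically_accepted(const ota_ident_t *running, const ota_ident_t *offered)
{
    return offered->manufacturer_code == running->manufacturer_code
        && offered->image_type == running->image_type
        && offered->file_version > running->file_version
        && memcmp(offered->header_string, running->header_string,
                  sizeof running->header_string) == 0;
}
```

An image with the same version as the running one is rejected here, as is one with a different image type, which is why the values from the stock firmware matter.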

The Zigbee2MQTT project hosts firmware images for many devices, and helpfully indexes these values for them:

Model	modelId	imgType	mfrCode	hdrString
QBKG40LM	lumi.switch.b1nc01	6280	4447	ROUTERX-----JN5180--ENCRYPTED000
QBKG41LM	lumi.switch.b2nc01	6408	4447	ROUTERX-----JN5180--ENCRYPTED000
ZNQBKG31LM	lumi.switch.acn040	????	????	?????

Metadata from the Zigbee2MQTT firmware repository for the E1 switches

I don’t like the look of those headers

It was immediately apparent upon seeing these values that OTA’ing the new firmware was unlikely.

The header strings for the two listed models (ROUTERX-----JN5180--ENCRYPTED000) indicated that they’re encrypted, and a quick hexdump of the OTA files confirmed it:

[me@here ota-firmware]$ hexdump -C -n 0x210 20240731114846_OTA_lumi.switch.b2nc01_0.0.0_0029_20240729_4EDD09.ota
00000000  1e f1 ee 0b 00 01 38 00  00 00 5f 11 08 19 1d 00  |......8..._.....|
00000010  00 00 02 00 52 4f 55 54  45 52 58 2d 2d 2d 2d 2d  |....ROUTERX-----|
00000020  4a 4e 35 31 38 30 2d 2d  45 4e 43 52 59 50 54 45  |JN5180--ENCRYPTE|
00000030  44 30 30 30 e0 72 04 00  00 00 c0 6b 04 00 e0 5f  |D000.r.....k..._|
00000040  01 04 01 02 00 00 23 02  00 00 6d c8 00 00 85 c8  |......#...m.....|
00000050  00 00 75 c8 00 00 7d c8  00 00 18 7a fb fb 02 79  |..u...}....z...y|
00000060  44 98 a0 6b 04 00 17 b4  1b 14 2d 02 00 00 2f 02  |D..k......-.../.|
00000070  00 00 00 00 00 00 31 02  00 00 c1 e9 00 00 9d 03  |......1.........|
00000080  00 00 25 d6 00 00 95 a6  00 00 65 a6 00 00 75 a6  |..%.......e...u.|
00000090  00 00 7d a6 00 00 85 a6  00 00 8d a6 00 00 21 02  |..}...........!.|
000000a0  00 00 9d a6 00 00 a5 a6  00 00 51 9a 03 00 b9 9a  |..........Q.....|
000000b0  03 00 ad a6 00 00 b5 a6  00 00 bd a6 00 00 c5 a6  |................|
000000c0  00 00 21 02 00 00 21 02  00 00 21 02 00 00 21 02  |..!...!...!...!.|
*
000000e0  00 00 21 02 00 00 21 02  00 00 21 02 00 00 cd a6  |..!...!...!.....|
000000f0  00 00 d5 a6 00 00 21 02  00 00 21 02 00 00 21 02  |......!...!...!.|
00000100  00 00 21 02 00 00 21 02  00 00 5d a6 00 00 6d a6  |..!...!...]...m.|
00000110  00 00 21 02 00 00 21 02  00 00 21 02 00 00 21 02  |..!...!...!...!.|
00000120  00 00 21 02 00 00 35 02  00 00 59 02 00 00 8d fd  |..!...5...Y.....|
00000130  03 00 21 02 00 00 21 02  00 00 21 02 00 00 21 02  |..!...!...!...!.|
00000140  00 00 c9 d3 00 00 21 02  00 00 21 02 00 00 21 02  |......!...!...!.|
00000150  00 00 21 02 00 00 21 02  00 00 21 02 00 00 14 57  |..!...!...!....W|
00000160  04 00 00 06 00 04 88 10  00 00 14 57 04 00 00 00  |...........W....|
00000170  02 04 00 00 00 00 90 1a  00 04 24 a9 00 00 00 00  |..........$.....|
00000180  02 04 40 37 00 00 ff ff  ff ff ff ff ff ff 00 00  |..@7............|
00000190  00 10 11 12 13 14 15 16  17 18 00 00 00 00 38 fb  |..............8.| <- Header should be visible here
000001a0  38 1f 87 e1 14 b8 b6 45  92 03 76 e0 fa 24 94 9c  |8......E..v..$..|
000001b0  c6 72 c9 55 90 4f e0 3b  46 aa 0d fe 2b 85 d9 82  |.r.U.O.;F...+...|
000001c0  bc 04 be 0e e2 2c 54 df  93 b9 4b 16 f7 21 44 c8  |.....,T...K..!D.|
000001d0  9c 95 fb 76 11 3e 10 57  5f 49 4a 84 1b 04 8b 11  |...v.>.W_IJ.....|
000001e0  a2 36 25 63 44 68 8f 88  43 78 d6 01 22 a9 c1 1c  |.6%cDh..Cx.."...|
000001f0  84 15 65 8e 32 44 1f 55  2f 97 b3 b5 d4 d4 ef 78  |..e.2D.U/......x|
00000200  68 c0 de da e2 10 d6 1d  6b 5c 92 8c d9 64 fb 1c  |h.......k\...d..|

During image prep, the OTA header is placed just after a bunch of static setup code at 0x0160. It is then also prepended to the start of the image, and if encryption is enabled, then everything after 0x019e (the start of the original OTA header) gets encrypted.

The header is easily seen in a hex dump due to the ASCII-encoded header string. The above dump shows it at the beginning of the image, but the second copy is clearly encrypted.
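The unencrypted header at the start of the file can also be decoded by hand. The sketch below assumes the standard Zigbee OTA header layout (little-endian fields, file identifier 0x0BEEF11E, no optional fields present). The type and function names are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t header_length;
    uint16_t manufacturer_code;
    uint16_t image_type;
    uint32_t file_version;
    char     header_string[33]; /* 32 bytes plus a terminator for printing */
    uint32_t total_image_size;
} ota_header_t;

/* Little-endian readers for the packed on-air fields. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Parse the mandatory 56-byte part of an OTA header. Returns false if the
 * buffer is too short or the file identifier does not match. */
bool ota_parse_header(const uint8_t *buf, size_t len, ota_header_t *out)
{
    if (len < 56 || rd32(buf) != 0x0BEEF11Eu) {
        return false;
    }
    out->header_length     = rd16(buf + 6);
    out->manufacturer_code = rd16(buf + 10);
    out->image_type        = rd16(buf + 12);
    out->file_version      = rd32(buf + 14);
    memcpy(out->header_string, buf + 20, 32);
    out->header_string[32] = '\0';
    out->total_image_size  = rd32(buf + 52);
    return true;
}
```

Fed the first 56 bytes of the dump above, this yields manufacturer code 4447 (0x115f), image type 6408 (0x1908), file version 29 and the header string ROUTERX-----JN5180--ENCRYPTED000, matching the Zigbee2MQTT index entry for the QBKG41LM.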

Once encrypted, always encrypted

The on-device image is not encrypted. When an encrypted image is sent to it during an OTA update, the device decrypts it on the fly and writes the unencrypted blocks to flash. After about 1k has been received, it checks if the link key - which is stored at the beginning of the image - matches its own link key. If it doesn’t, then it terminates the transfer as either the encryption key or link key is incorrect.

Because of this, it’s not possible to send an unencrypted image to the device, and it’s not possible to send an image encrypted with a different key.

Duk.io


Getting Zigbee over-the-air updates working with the NXP Zigbee SDK and Zigbee2MQTT

Apr 20, 2025
last updated: Jun 13, 2025

