Hi, In our project we have Redis synchronization issues since several months. Mainly we have not al

Alberto Reyer · December 2019

Hi,

For several months our project has had Redis synchronization issues. Mainly, not all of the data in a given *_storage table ends up in Redis.
To check whether all data is synchronized to Redis, I’ve built a tool that compares the row count of a *_storage table with the respective key count in Redis. With this tool it became pretty obvious that we have a synchronization problem.
I have already run several experiments and searched the logs, but I could not find anything that explains the missing entries.
What I’ve validated so far:

- Redis: maxmemory = 0
- Redis: maxmemory-policy = noeviction
- Writing 100000 entries with random strings (all writes succeeded)
- Manually comparing counts from redis-cli against the DB

Have you had similar experiences in any of your Spryker projects, or do you have a hint about what else we could look into?

One thing that caught our attention in this regard: after we set up a test Kubernetes cluster with a managed Postgres database in Azure, but still a self-hosted Redis, the issue did not come up again. However, I couldn’t find a problem with loading the data from the database either.
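
A comparison like the one described above can go one step further than counts and report which keys are missing. The following is only a minimal sketch, not the tool from this post: the function name, the key names, and the assumption that both sides expose their keys as plain strings in the same form (any prefix added by the Redis client already stripped) are all hypothetical.

```python
from fnmatch import fnmatchcase


def compare_storage_to_redis(db_keys, redis_keys, pattern="*"):
    """Compare keys from a *_storage table with keys found in Redis.

    db_keys and redis_keys are iterables of strings. Only keys matching
    the glob pattern are considered on either side.
    """
    db = {k for k in db_keys if fnmatchcase(k, pattern)}
    rd = {k for k in redis_keys if fnmatchcase(k, pattern)}
    return {
        "db_count": len(db),
        "redis_count": len(rd),
        "missing_in_redis": sorted(db - rd),
        "orphaned_in_redis": sorted(rd - db),
    }


# Hypothetical sample data standing in for a DB query and a Redis key scan.
sample_db_keys = ["product:de:1", "product:de:2", "product:de:3"]
sample_redis_keys = ["product:de:1", "product:de:3", "category:de:9"]

report = compare_storage_to_redis(sample_db_keys, sample_redis_keys, "product:*")
print(report)
```

With a real setup, the two iterables would come from a query on the storage table's key column and from an incremental key scan on Redis (for example `scan_iter(match=...)` in redis-py) rather than from hard-coded lists.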

Alberto Reyer · December 2019

We’ve also checked that RabbitMQ is neither the bottleneck nor otherwise faulty. So we are pretty confident that the problem can only be in Redis itself or in the step where data is written to Redis.

Unknown · December 2019

Hi Alberto, did you have a look into the spy_queue_process database table?

Alberto Reyer · December 2019

@UK5EG6PBM In which regard? There are several entries in spy_queue_process and events do get processed. There is also a lot of data in our Redis, just not all of the data that is in the related storage table.

Unknown · December 2019

That database table should be empty once the queue runs out of messages. If entries remain in that table, it could be a hint that something is wrong.

Alberto Reyer · December 2019

OK, but this isn’t our problem. Depending on when I look at the table, it is sometimes empty, and we see in RabbitMQ that all queues are processed successfully, but data is still missing in Redis.

Unknown · December 2019

There are circumstances in which Redis doesn't behave as you would expect. It can happen that Redis doesn't accept writes while it's persisting data to disk. This, however, should be visible in the logs somehow.
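
One known case of this kind: with `stop-writes-on-bgsave-error yes`, Redis rejects writes after a failed RDB snapshot. Whether that applies here is an open question. As a rough sketch, the persistence fields of `INFO` can be checked for failures; the helper name and the sample values below are hypothetical, while the field names are the ones Redis reports in the persistence section.

```python
def persistence_warnings(info):
    """Return warnings derived from a mapping of Redis INFO persistence fields."""
    warnings = []
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        warnings.append("last RDB snapshot failed")
    if info.get("aof_last_write_status", "ok") != "ok":
        warnings.append("last AOF write failed")
    if info.get("aof_last_bgrewrite_status", "ok") != "ok":
        warnings.append("last AOF rewrite failed")
    if int(info.get("rdb_bgsave_in_progress", 0)) == 1:
        warnings.append("RDB snapshot currently in progress")
    return warnings


# Hypothetical sample values, not taken from a real server.
sample_info = {
    "rdb_last_bgsave_status": "err",
    "aof_last_write_status": "ok",
    "aof_last_bgrewrite_status": "ok",
    "rdb_bgsave_in_progress": 0,
}
print(persistence_warnings(sample_info))
```

Against a live server, the mapping could be obtained with redis-py via `redis.Redis().info("persistence")` (assuming that client is what you use) and checked repeatedly while a sync is running.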

Unknown · December 2019

Until now, we have always been able to identify the causes of synchronization issues: network, limited pod resources, various Kubernetes-related problems, queue issues, ...

Unknown · December 2019

Maybe not related, but we had an issue with CMS pages that had no store relation and were (silently) skipped.

Stanislav Matveyev · December 2019

One more possible reason (if you have custom publishers) is a race condition between write and delete Redis messages.

When your publisher DELETEs from a *_storage table, it generates jobs in RabbitMQ to DELETE the Redis records. If the publisher then INSERTs new records with the same keys that were just removed, it generates jobs in RabbitMQ to WRITE the Redis records.

Because RabbitMQ jobs can be processed asynchronously, it can happen that one worker WRITEs a key and the next worker then DELETEs it.

As a result, the DB has the records, but Redis is missing some of them.
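
The effect of such a reordering can be shown with a toy simulation. This is only an illustrative sketch: a plain dict stands in for Redis, a list of tuples stands in for the queued jobs, and the key name is made up. It does not use RabbitMQ or any Spryker code.

```python
def apply_jobs(jobs):
    """Apply (operation, key, value) jobs in the given order to an empty store."""
    store = {}
    for op, key, value in jobs:
        if op == "write":
            store[key] = value
        elif op == "delete":
            store.pop(key, None)
    return store


# The publisher first deletes the key, then re-inserts it with a new value.
published = [
    ("delete", "product:de:1", None),
    ("write", "product:de:1", "new value"),
]

in_order = apply_jobs(published)                   # jobs handled as published
reordered = apply_jobs(list(reversed(published)))  # write handled before delete

print(in_order)   # key present
print(reordered)  # key missing although the DB row exists
```

In the reordered case the final state has no key at all, which matches the symptom of rows existing in the *_storage table without a counterpart in Redis.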

Ehsan Zanjani · December 2019

Hi @UL6DGRULR,
if events are processed successfully, you might see some messages in the sync.storage.* queues. These sync queues are then consumed so that Redis gets updated.

Alberto Reyer · December 2019

Thanks for all the answers. We could already rule out RabbitMQ as the point of failure, as our sync-all tool writes directly into Redis without involving RabbitMQ.

@UK5EG6PBM The hint that Redis sometimes does not accept writes is the most promising one for me, but sadly there was nothing related in the logs, even when running Redis at debug log level. So far we have found that in our production cluster the disk becomes pretty slow under a lot of writes. The throughput of Azure disks is a bad joke, it’s even slower than USB 1.0. So this might be the issue.

Unknown · December 2019

Good luck on your bug hunt 🍀
