Hello everyone, We are encountering an issue when trying to write a large amount of data to RabbitM

UNTTWV4JK
UNTTWV4JK Posts: 63 🧑🏻‍🚀 - Cadet
edited December 2020 in Help

Hello everyone,

We are encountering an issue when trying to write a large amount of data to RabbitMQ using the sync:data command.
We call the command from an installer that has a timeout of 10h, and the sync:data process ends up being killed by that timeout. Before that happens the application “stalls”: nothing happens on the database or RabbitMQ side, yet the sync process is still running.
After investigating, we noticed 2 things happening:
• An exception is logged from php-amqplib (https://github.com/php-amqplib/php-amqplib/blob/9429243609cd40c2afab1d8bafc115188f8dc2f9/PhpAmqpLib/Wire/IO/StreamIO.php#L275), which means fwrite returned a result !== false, but the stream metadata reported a timeout.
• After the exception is logged, the application does not exit until it is killed by the main process timeout.
Our questions are:
• Is there any way of getting rid of the timeout or some sort of retry mechanism in Spryker in case writing to RabbitMQ fails?
• Do you have any idea why the process is not exiting when the exception occurs? This issue happens only on production; we have not yet been able to reproduce it on the dev machines. We hope someone else has encountered this and managed to fix it.
Thank you!

Comments

  • Thomas Lehner
    Thomas Lehner Support Engineer @ Spryker Posts: 289 🏛 - Council (mod)

    @UL65CH0MC something you encountered as well, maybe?

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    Yes Thomas. It’s just the same behaviour that we had.

  • Thomas Lehner
    Thomas Lehner Support Engineer @ Spryker Posts: 289 🏛 - Council (mod)
    edited December 2020

    Correct me if I am wrong, but you managed to work around this by adjusting both EXPORT_CHUNK_SIZE and STORAGE_SYNC_CHUNK_SIZE to 20K in your case?

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    Yes. For our environment and data volume, 20k is not a large amount. But at 50k+ we get this “stall” and the exceptions from the rmq lib.

  • UNTTWV4JK
    UNTTWV4JK Posts: 63 🧑🏻‍🚀 - Cadet

    Thank you for your answers. We will try to change the chunk size and see what happens.

  • UK7TM6CQJ
    UK7TM6CQJ Posts: 123 🧑🏻‍🚀 - Cadet

    Hi giovanni, what kind of stall behavior have you seen?

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    Hi Lucian,
    I just saw that queue:worker took a very long time (more than 20 min) while no process was running on the server: no data was written to the db, no events were consumed... nothing. The worker was in a "stall" on the sync.storage.product queue (it was the only queue running, but nothing happened). After ca. 15 min I got the first error from amq, and ca. 1-2 min later a lot of "channel closed" exceptions from amq (every unacked event triggered an exception).
    I have also attached some screenshots from yesterday...

  • UNTTWV4JK
    UNTTWV4JK Posts: 63 🧑🏻‍🚀 - Cadet

    @tom.lehner Do you have any idea why the default value for AMQP_STREAM_CONNECTION_READ_WRITE_TIMEOUT is 130? (https://github.com/spryker/rabbit-mq/blob/05b18242a564897a153ffc0f13e9493e37ec764c/src/Spryker/Client/RabbitMq/RabbitMqConfig.php#L23). The default socket read_write timeout is 60. I’m wondering why the value is 130 on the Spryker side.

  • Thomas Lehner
    Thomas Lehner Support Engineer @ Spryker Posts: 289 🏛 - Council (mod)

    @UNTTWV4JK I can't tell you off the top of my head. Do you suspect this to be a cause of the problems?

  • UNTTWV4JK
    UNTTWV4JK Posts: 63 🧑🏻‍🚀 - Cadet

    If you check the method in amqplib, you’ll see that line is reached only when the write takes too long without failing outright. I suspect that RabbitMQ flow control comes into play and delays the write (https://www.rabbitmq.com/flow-control.html), which can hit the timeout limit on the client side, and that is when the timeout exception occurs. I don’t know yet why the application stalls after the exception occurs.
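The retry idea raised in the original question can be sketched roughly as follows. This is only an illustration in Python, not a Spryker or php-amqplib feature: `WriteTimeout`, `publish_with_retry` and the `publish` callable are hypothetical names, and a real implementation would probably also need to reopen the connection before retrying, since the state of the stream after a timed-out write is unknown.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class WriteTimeout(Exception):
    """Hypothetical stand-in for a client-side write timeout."""


def publish_with_retry(
    publish: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call publish(), retrying with exponential backoff on WriteTimeout."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return publish()
        except WriteTimeout:
            if attempt == max_attempts:
                # Give up and let the caller fail fast instead of stalling.
                raise
            # Back off to give the broker time to drain: 1x, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Re-raising after the last attempt is deliberate: the process should exit with an error rather than keep running until an outer timeout kills it.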

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    I also noticed this read_write_timeout yesterday... 130 seems strange to me too... I suppose it is related to the worker timeout being set to 1 min.

    But I would not change this value without knowing which side effects could happen.
    There is also a keep_alive config that might speed up the process a little, but again I'm not sure what would happen...

    What is missing here is documentation from Spryker about the configuration possibilities of some processes/services.

  • Thomas Lehner
    Thomas Lehner Support Engineer @ Spryker Posts: 289 🏛 - Council (mod)

    We have a holiday event today, but I asked whether there is a subject matter expert available that we can pull into this thread.

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    ops... Corona @tom.lehner 😉

  • Thomas Lehner
    Thomas Lehner Support Engineer @ Spryker Posts: 289 🏛 - Council (mod)
    edited December 2020

    Don't worry, it's safe and virtual.

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    🙂 good.. without u and spryker for 2 weeks it will be not cool 🙂

  • UNTTWV4JK
    UNTTWV4JK Posts: 63 🧑🏻‍🚀 - Cadet

    We are trying to understand what is happening so that we can make an educated decision about the production configuration.

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    @UNTTWV4JK did you already try to reduce the chunk_size? Do you still have the same problem or not?
    May I ask how your chunk_size for storage is configured?
    How many events?

    I can say that in my case I had ca. 2.7M events in sync.storage.product; with a chunk of 50K I had the problem, with a chunk of 20K no longer. And what I see is that it works fine when the export chunk size is configured the same as the storage chunk size (in my case also the sync search chunk size).
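To make the numbers above concrete, here is a small Python sketch of what chunking amounts to, assuming the chunk size is simply the number of messages handled per batch. `chunked` and `chunk_count` are hypothetical helper names for illustration, not Spryker APIs.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items with at most chunk_size elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


def chunk_count(total: int, chunk_size: int) -> int:
    """Number of batches needed for total messages (ceiling division)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return -(-total // chunk_size)


# Figures taken from the thread: ca. 2.7M events, 50K vs 20K chunks.
print(chunk_count(2_700_000, 50_000))  # 54 batches
print(chunk_count(2_700_000, 20_000))  # 135 batches
```

Smaller chunks mean more batches, but each individual write is smaller and therefore less likely to run into the client-side write timeout.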

  • UNTTWV4JK
    UNTTWV4JK Posts: 63 🧑🏻‍🚀 - Cadet

    Not yet, but it is on our list.

  • giovanni.piemontese
    giovanni.piemontese Spryker Solution Partner Posts: 871 🧑🏻‍🚀 - Cadet

    I would try this asap... everything will work fine and, in our case, also faster...
    because I think a solution from Spryker could take a long time here... it is not simple to reproduce this case.