Custom State Stores for Apache Spark to keep data between micro-batches for stateful stream processing.
Out of the box, Apache Spark has only one implementation of state store providers. It’s
HDFSBackedStateStoreProvider which stores all of the data in memory, what is a very memory consuming approach. To avoid
OutOfMemory errors, this repository and custom state store providers were created.
To use the custom state store provider for your pipelines use the following additional configuration for the submit script:
Here is some more information about it: https://docs.databricks.com/spark/latest/structured-streaming/production.html
With semantics similar to those of
FlatMapGroupWithState, state timeout features have been built directly into the custom state store.
Important points to note when using State Timeouts,
GroupState, the timeout is not eventual as it is independent from query progress
To configure state timeout, additional configuration can be added,
--conf spark.sql.streaming.stateStore.stateExpirySecs=5 --conf spark.sql.streaming.stateStore.strictExpire=true
Other state timeout related points,
You’re welcome to submit pull requests with any changes for this repository at any time. I’ll be very glad to see any contributions.
The standard Apache 2.0 license is used for this project.