202201181349 - Sagemaker - Problems
SageMaker problems
- Hard 60-second timeout on endpoint invocations.
- SageMaker binds endpoint configurations to a name instead of an ID. Whenever we create an endpoint, it looks for an endpoint configuration already created under that name before deciding whether to create a new one. There is no 'edit configuration' feature, so the configuration that already exists is used instead of what is in the scripts. To make it work as expected, we always have to delete and recreate the endpoint together with its configuration.
- Whenever a model stops responding or freezes outright, SageMaker gives no warning, and nothing in the logs shows that the problem has happened or is happening. The logs and the instance metrics provide no useful information for debugging. Most of the time it resolves itself, but sometimes we have to intervene and delete everything.
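
A minimal sketch of the delete-and-recreate workaround for the configuration problem above, not the actual scripts. It assumes a boto3 SageMaker client is passed in as `sm`. The function names, the `AllTraffic` variant name, and the default instance type are placeholders. It also assumes that SageMaker reports a missing endpoint or configuration as a `ValidationException`.

```python
def _delete_if_exists(sm, delete_call, **kwargs):
    """Call a delete API; return False if the resource was not there."""
    try:
        delete_call(**kwargs)
        return True
    except sm.exceptions.ClientError as err:
        # Assumption: a missing endpoint/config surfaces as ValidationException.
        if err.response["Error"]["Code"] != "ValidationException":
            raise
        return False


def recreate_endpoint(sm, endpoint_name, config_name, model_name,
                      instance_type="ml.m5.large", instance_count=1):
    """Drop any stale endpoint and config, then create both from scratch."""
    # Endpoint deletion is asynchronous, so wait before reusing the name.
    if _delete_if_exists(sm, sm.delete_endpoint, EndpointName=endpoint_name):
        sm.get_waiter("endpoint_deleted").wait(EndpointName=endpoint_name)

    # Remove the old config so the new settings are not silently ignored.
    _delete_if_exists(sm, sm.delete_endpoint_config,
                      EndpointConfigName=config_name)

    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # placeholder variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    )
    sm.create_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=config_name)
    sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
```

It would be called as `recreate_endpoint(boto3.client("sagemaker"), "my-endpoint", "my-endpoint-config", "my-model")`, where all three names are hypothetical. The same routine also covers the "delete everything" step when a model freezes.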
Links:
MLOps
202112011100 - FTG Models
2022-01-18
tags:
#work