Oftentimes these videos are made by modifying the source code to output structures at given points.
An alternative that might work more generally is to use the PyMolObserver to capture snapshots of the pose across the protocol. Whether this works satisfactorily will depend on the particular protocol you’re attempting to annotate. Even then, you’re likely to lose the correspondence between particular steps in the protocol and the output structures: you’ll probably get a sampling of the structure across the protocol, rather than snapshots at specifically defined points.
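
If you do control the protocol code (say, a script you wrote yourself), the "output structures at given points" approach can be sketched roughly as below. This is only an illustration, not how any particular Rosetta protocol does it: `run_with_snapshots`, the `(label, step)` pairs, and the frame naming scheme are all hypothetical names made up for this example. The point is that each frame's filename carries the step that produced it, which is the correspondence you tend to lose with an observer.

```python
from pathlib import Path


def run_with_snapshots(state, steps, dump, out_dir):
    """Run labelled steps in order, writing one frame after each step.

    state   -- the object being modified (e.g. a pose); hypothetical here
    steps   -- iterable of (label, callable) pairs; each callable takes state
    dump    -- callable(state, path) that writes the current structure to path
    out_dir -- directory for the numbered frames
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    frames = []
    for index, (label, step) in enumerate(steps):
        step(state)
        # Zero-padded index keeps frames in order; label records the step.
        frame = out_dir / f"frame_{index:04d}_{label}.pdb"
        dump(state, str(frame))
        frames.append(frame)
    return frames
```

With PyRosetta, I'd expect the steps to be things like `("relax", some_mover.apply)` and the dump function to be `lambda pose, path: pose.dump_pdb(path)`, but treat that wiring as an assumption to check against your own setup. The numbered files can then be loaded into PyMOL in order to build the movie.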