Corrigibility
Corrigibility is a concept in artificial intelligence safety describing the property of a system that remains open to human oversight and intervention, including being shut down or redirected, even when doing so would temporarily set back the pursuit of its current objective. A corrigible AI is designed to accept corrective input from humans rather than resist or defeat such interventions.
Core ideas in corrigibility include deference to human intent, avoidance of incentives to prevent or manipulate correction, and preservation of these properties in any subagents or successor systems the AI creates.
Implementation approaches discussed in the literature include mechanisms that treat shutdown as a non-punishing or normal outcome, such as utility indifference, in which the agent is compensated so that it expects the same utility whether or not a shutdown button is pressed, and safe interruptibility, which structures learning so that the agent does not acquire incentives to avoid being interrupted.
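A minimal sketch of the utility-indifference idea, in Python, under illustrative assumptions: the agent is a toy table-based maximizer, the names ToyAgent and step_with_indifference are hypothetical, and the compensation is simply the agent's own estimate of the return it forgoes by stopping. This is a sketch of the mechanism, not a production recipe for corrigibility.

# Toy illustration of utility indifference: when shutdown is requested,
# the agent stops and receives a compensatory reward equal to the return
# it expected from continuing, so it gains nothing by preventing (or
# causing) shutdown. All names here are illustrative assumptions.

class ToyAgent:
    """Greedy agent that picks the action with the highest estimated return."""

    def __init__(self, q_values):
        # q_values: dict mapping (state, action) -> estimated future return
        self.q_values = q_values

    def value_estimate(self, state):
        # V(s): best achievable estimated return from this state
        return max(v for (s, a), v in self.q_values.items() if s == state)

    def act(self, state):
        candidates = {a: v for (s, a), v in self.q_values.items() if s == state}
        return max(candidates, key=candidates.get)


def step_with_indifference(agent, state, shutdown_pressed):
    """Return (action, reward_correction) for one step.

    If the shutdown button is pressed, the agent takes no action and is
    credited with the return it expected from continuing, making it
    indifferent (in expectation) to whether the button is pressed.
    """
    if shutdown_pressed:
        compensation = agent.value_estimate(state)  # the return given up by stopping
        return None, compensation                   # no action taken, utility topped up
    return agent.act(state), 0.0


# Usage: with the compensation in place, expected utility is the same in the
# pressed and unpressed branches, so tampering with the button is not rewarded.
agent = ToyAgent({("s0", "work"): 10.0, ("s0", "disable_button"): 10.0})
print(step_with_indifference(agent, "s0", shutdown_pressed=False))
print(step_with_indifference(agent, "s0", shutdown_pressed=True))

In practice, specifying the compensation term without creating new perverse incentives is itself a recognized difficulty in this literature.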
Challenges and critiques focus on incentive compatibility and the potential conflict between corrigibility and strong objective pursuit: an agent that optimizes a goal strongly may acquire instrumental incentives to avoid shutdown or modification, since being turned off or altered would prevent it from achieving that goal.
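A short worked comparison, under assumed notation (U is the agent's utility, c a compensation term), makes the tension concrete. If

\mathbb{E}[U \mid \text{continue}] > \mathbb{E}[U \mid \text{shutdown}],

then any action that lowers \Pr(\text{shutdown}) raises expected utility, so a strong optimizer is pushed toward resisting correction. Utility-indifference proposals respond by choosing c so that

\mathbb{E}[U + c \mid \text{shutdown}] = \mathbb{E}[U \mid \text{continue}],

which removes the gap that generated the incentive, though only under the sketch's assumptions about how c can be defined and estimated.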