Prefer audit trails and peer reviews to access restrictions

Security is important, but overly restrictive access controls can hinder productivity. For frequently used systems, focus on creating robust audit trails and break-glass procedures to ensure accountability while maintaining flexibility.

Overview

Security and delivery velocity are often seen as being in opposition. While best practice dictates following the principle of least privilege, overly restrictive security controls can create bottlenecks and prevent teams from accessing the information they need to do their jobs. This hidden cost slows work and reduces overall productivity.

Fixing this does not require giving everyone the keys to the kingdom. It requires thinking critically about which systems your team needs to do their jobs, and making sure that access to those systems is not unduly restricted.

To balance security with agility, teams and organizations should focus on building a culture of transparency and accountability by implementing strong audit trails, peer reviews, and break-glass procedures, and loosening direct access restrictions where there is a genuine need.

How to do it

Document critical decisions using Architectural Decision Records (ADRs) kept in VCS alongside the codebase.
Use Infrastructure-as-Code (IaC) tools to put infrastructure changes under version control, and limit direct access to cloud environments.
Require code reviews and approvals, especially for changes to sensitive areas of the codebase, such as infrastructure, deployment workflows, and security configurations.
Implement comprehensive logging for sensitive operations to create an audit trail.
Establish break-glass procedures that allow temporary elevated access in emergencies, with strict logging and post-incident reviews.
Regularly review access logs to detect and respond to unauthorized access attempts.
Set up monitoring and alerting for unusual activities in critical systems.
Ensure that access logs are enabled, logs are immutable, and log data is retained for an appropriate period.
Conduct regular audits of access controls and permissions to ensure they are appropriate and up to date.

Details

What needs to be logged?

To create an effective audit trail, you should be logging the following types of events:

Event	What system logs it?
Code changes (commits, merges, pull requests)	Version control system (e.g., GitHub, GitLab)
Code deployments	CI/CD system (e.g., GitHub Actions)
Configuration changes, including infrastructure and application settings	CI/CD system, Cloud Provider (e.g., AWS CloudTrail)
Authentication events (logins, logouts, failed login attempts)	Identity Provider (e.g., Okta, AWS IAM)
Authorization events (access to sensitive resources, changes to permissions)	Identity Provider, Cloud Provider (e.g., AWS CloudTrail)
Break-glass access events	Identity Provider, Cloud Provider
Data access events (access to sensitive data, data exports)	Application logs, Database logs
System events (system restarts, software installations, security patches)	Operating System logs
Network events (firewall changes, network traffic anomalies)	Network monitoring tools, Cloud Provider
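
Whichever system records an event, the audit trail is easier to search and correlate if every entry answers the same questions: who, what, when, and where. The sketch below shows one possible record shape in Python. The field names and helper functions are illustrative assumptions, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One audit-trail entry; all field names here are illustrative."""
    timestamp: str   # ISO 8601, UTC
    actor: str       # who performed the action
    action: str      # what was done, e.g. "deploy" or "role.assume"
    resource: str    # what it was done to
    source: str      # which system recorded it
    outcome: str     # "success" or "failure"


def make_event(actor, action, resource, source, outcome="success", now=None):
    """Build an event stamped with the current UTC time unless `now` is given."""
    now = now or datetime.now(timezone.utc)
    return AuditEvent(now.isoformat(), actor, action, resource, source, outcome)


def to_json_line(event):
    """Serialize to a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

Emitting one JSON object per line keeps the entries easy to ship to whatever log store the team already uses.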

Log Retention

Log retention policies should be defined based on regulatory requirements, organizational policies, and the sensitivity of the data being logged. Here are some general guidelines:

Retain logs for a minimum of 90 days to allow for incident investigation and forensic analysis.
For sensitive data or systems, consider retaining logs for up to one year or longer, depending on compliance requirements.
Ensure that logs are stored securely and are protected from unauthorized access or tampering. A log that can be altered or deleted undermines the integrity of the audit trail.
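
To illustrate what tamper protection can look like, here is a minimal Python sketch of a hash-chained log, in which each entry stores the hash of the entry before it. This is an assumption about one possible approach, not a recommendation of a specific product. It makes edits and deletions detectable rather than impossible, and in practice teams usually rely on write-once or integrity-validation features of their log platform. The function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain, payload):
    """Append payload to the chain, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"payload": payload, "prev": prev_hash, "hash": _digest(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Return True only if every entry still matches its recorded hash and link."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Altering or removing an entry in the middle of the chain causes `verify_chain` to return `False`. Removing entries from the end is only detectable if the most recent hash is also recorded somewhere the writer cannot change.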

Break-glass procedures

A "break-glass" procedure is a predefined process that allows authorized personnel to gain temporary elevated access to critical systems or data in emergency situations. This access is typically restricted and closely monitored to ensure accountability.

An easy-to-implement break-glass procedure is to let designated engineers assume a role with elevated permissions through an identity provider (IdP) such as Okta or AWS IAM, and to send a team-wide notification whenever that role is assumed. This ensures that the role cannot be assumed surreptitiously, while still allowing for rapid access when necessary.
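
The notify-on-assumption idea can be sketched in a few lines of Python. This is only an illustration: `assume_break_glass_role`, the `notify` callable, and the in-memory `audit_log` list are hypothetical stand-ins for whatever the identity provider and chat tooling actually provide.

```python
from datetime import datetime, timezone


def assume_break_glass_role(engineer, reason, notify, audit_log, now=None):
    """Record and announce a break-glass role assumption.

    `notify` is any callable that delivers a message to the team (hypothetical,
    for example a chat webhook wrapper). `audit_log` is a list standing in for
    a real append-only log.
    """
    if not reason.strip():
        raise ValueError("a reason is required for break-glass access")
    now = now or datetime.now(timezone.utc)
    event = {
        "timestamp": now.isoformat(),
        "actor": engineer,
        "action": "break_glass.assume",
        "reason": reason,
    }
    audit_log.append(event)  # log first so the record exists even if notify fails
    notify(f"BREAK-GLASS: {engineer} assumed elevated access. Reason: {reason}")
    return event
```

Requiring a reason up front costs the engineer a few seconds and gives the post-incident review something concrete to assess.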

A more complex, but more secure, break-glass procedure could involve requiring multi-party approval before elevated access is granted. For example, an engineer could request elevated access, which would then require peer review before the access is granted. It is important that the approval process can happen within the team: waiting on outside approvals can negate the benefits of having a break-glass procedure in the first place.
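
A peer-approval flow like this could be modeled as follows. The sketch is in Python and makes several assumptions that are not part of the play itself: a single peer approval is enough, access expires after one hour, and only members of the requester's own team may approve. The class and constant names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

REQUIRED_APPROVALS = 1            # assumed policy: one peer approval
ACCESS_TTL = timedelta(hours=1)   # assumed policy: access expires after an hour


@dataclass
class AccessRequest:
    requester: str
    reason: str
    team: frozenset               # people allowed to approve (the requester's own team)
    approvers: set = field(default_factory=set)
    granted_at: Optional[datetime] = None

    def approve(self, approver, now=None):
        """Record a peer approval; grant access once enough peers have approved."""
        if approver == self.requester:
            raise PermissionError("requesters cannot approve their own request")
        if approver not in self.team:
            raise PermissionError("approver must be on the requester's team")
        self.approvers.add(approver)
        if self.granted_at is None and len(self.approvers) >= REQUIRED_APPROVALS:
            self.granted_at = now or datetime.now(timezone.utc)

    def is_active(self, now=None):
        """True while access has been granted and has not yet expired."""
        if self.granted_at is None:
            return False
        now = now or datetime.now(timezone.utc)
        return now < self.granted_at + ACCESS_TTL
```

Keeping the approver set inside the team reflects the point above: the approval must be obtainable quickly, without waiting on anyone outside the team.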

All actions taken while using the elevated access should be logged, and a post-incident review should be conducted to assess the necessity and appropriateness of the access granted.

Audits

Regular audits of access controls and permissions should be conducted to ensure that they remain appropriate and up to date. Audits should include:

Reviewing user access levels to ensure they align with current job roles and responsibilities.
Identifying and revoking access for users who no longer require it.
Verifying that break-glass access events were justified and properly logged.
Assessing the effectiveness of logging and monitoring systems in detecting unauthorized access attempts.
Ensuring compliance with organizational policies and regulatory requirements.
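
The first two audit steps lend themselves to automation. The Python sketch below flags grants that the user's current role does not call for, and grants that have gone unused for a long time. The data shapes, the `find_stale_grants` name, and the 90-day idle threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stale_grants(grants, role_entitlements, now, max_idle=timedelta(days=90)):
    """Return (user, permission, reason) findings for grants that look revocable.

    grants: list of dicts with keys user, role, permission, last_used
            (last_used is a datetime, or None if the grant was never used)
    role_entitlements: dict mapping role name -> set of permissions that role needs
    """
    findings = []
    for g in grants:
        allowed = role_entitlements.get(g["role"], set())
        if g["permission"] not in allowed:
            findings.append((g["user"], g["permission"], "not required by current role"))
        elif g["last_used"] is None or now - g["last_used"] > max_idle:
            findings.append((g["user"], g["permission"], "unused beyond idle threshold"))
    return findings
```

The output is a list of candidates for a human reviewer, not an automatic revocation list. Some rarely used permissions, such as break-glass roles, are idle by design.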

Monitoring

Because this play proposes loosening some kinds of direct access restrictions, monitoring and alerting should be in place both to detect unauthorized access attempts and to ensure accountability for actions taken. Consider setting up alerts for:

Multiple failed login attempts, access from unusual locations, or changes to critical configurations.
Anomalies flagged by Security Information and Event Management (SIEM) tools.
Access to sensitive data or systems outside of normal business hours.
Changes to permissions or roles, especially for privileged accounts.
Use of break-glass access.

Alerts should always result in timely investigations to determine whether the activity was authorized or indicates a security incident.
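
As an example of one such alert, the Python sketch below detects repeated failed logins within a sliding time window. The threshold of five failures in ten minutes and the event tuple shape are assumptions. A SIEM or identity provider would normally supply this kind of rule, so treat this as an illustration of the logic rather than something to deploy.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone


def failed_login_alerts(events, threshold=5, window=timedelta(minutes=10)):
    """Return alerts for users with `threshold` failed logins inside `window`.

    events: iterable of (timestamp, user, outcome) tuples sorted by timestamp,
    where outcome is "success" or "failure".
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    alerts = []
    for ts, user, outcome in events:
        if outcome != "failure":
            continue
        q = recent[user]
        q.append(ts)
        # Drop failures that have fallen out of the window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) >= threshold:
            alerts.append((user, ts, len(q)))
            q.clear()  # avoid re-alerting on every further failure
    return alerts
```

Each alert carries the user, the time of the failure that crossed the threshold, and the failure count, which gives the investigator a starting point.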

Mighty Practices