AWS Resilience Hub Unveils Next Generation with Advanced Application Modeling, Dependency Discovery, and AI-Powered Failure Analysis

AWS has announced the next generation of AWS Resilience Hub, an expanded experience that adds a new application model, dependency discovery assessment, generative AI-powered failure mode analysis, modular resilience policies, and organization-wide reporting. The release changes how organizations manage the availability and robustness of their cloud-based applications.
For businesses operating a multitude of applications on Amazon Web Services (AWS), a persistent challenge has been the inconsistent approach to managing application availability. While availability remains a paramount concern, establishing uniform resilience goals, accurately measuring progress toward those goals, and demonstrating compliance across an entire application portfolio has proven difficult. This inconsistency often stems from development teams adopting disparate standards, utilizing different tooling, and facing hurdles in effectively communicating and verifying whether applications meet predefined expectations.

The enhanced AWS Resilience Hub is intended to address this. It gives Site Reliability Engineers (SREs) and development teams a structured framework for agreeing on resilience policy expectations, helps application teams meet those standards, and lets them demonstrate compliance through integrated testing. A key advancement is its integration with AWS Organizations, which allows resilience to be evaluated at organizational scale: identifying potential failure modes, discovering previously unknown dependencies, and generating enterprise-wide progress reports.
The new iteration of Resilience Hub is organized around a small set of core concepts, among them resilience policies, systems, services, and failure mode assessments. These are meant to make resilience management more actionable for technical teams.
The Next Generation of AWS Resilience Hub in Action
Getting started with the next generation of AWS Resilience Hub follows a short workflow. Users first configure a resilience policy, then set up their initial system and service. They then run a failure mode assessment, review its results, and implement the recommended changes.

Before starting configuration, users must set up the invoker IAM role, which grants Resilience Hub read-only access to the organization’s AWS resources. Environments that do not use AWS Organizations require cross-account roles; when AWS Organizations is in use, service-linked roles (SLRs) are used instead. The integration with AWS Organizations is what allows the platform to scale, because resilience can be managed across the organization from a single, designated administrator account rather than by logging into individual accounts to assess each one. Detailed prerequisite information can be found in the AWS Resilience Hub User Guide.
Defining Resilience Policies: A Granular Approach
To establish a resilience policy, users navigate to the "Policies" menu within the AWS Resilience Hub console and select "Create policy." They then enter a policy name and description and specify the resilience criteria. For instance, an organization might create a reusable policy for multi-Region disaster recovery, a critical requirement for financial applications. Such a policy could mandate a Service Level Objective (SLO) of 99.95% availability, a Recovery Time Objective (RTO) of 15 minutes, and a Recovery Point Objective (RPO) of 5 minutes for multi-Region disaster recovery scenarios.
The policy configuration also extends to data recovery requirements. Users can define specific data recovery time objectives for restoring data from backups for each service that is associated with the policy, ensuring comprehensive data protection strategies are integrated.
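
To make the example numbers concrete, the sketch below works out what a 99.95% availability SLO means in minutes of allowed downtime and compares it with the 15-minute RTO. The `ResiliencePolicy` class, its field names, and the policy name are hypothetical stand-ins for illustration; they are not the Resilience Hub policy schema or API.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class ResiliencePolicy:
    """Hypothetical in-memory stand-in for a resilience policy (not an AWS type)."""
    name: str
    availability_slo_percent: float
    rto_minutes: int
    rpo_minutes: int

    def downtime_budget_minutes(self, window_days: int = 30) -> float:
        """Minutes of downtime the SLO tolerates over the given window."""
        unavailable_fraction = 1 - self.availability_slo_percent / 100
        return unavailable_fraction * window_days * MINUTES_PER_DAY

    def recovery_fits_budget(self, window_days: int = 30) -> bool:
        """True if a single outage lasting the full RTO stays within the budget."""
        return self.rto_minutes <= self.downtime_budget_minutes(window_days)


# Values taken from the example policy in the text; the name is made up.
policy = ResiliencePolicy("multi-region-dr-financial", 99.95, 15, 5)

print(round(policy.downtime_budget_minutes(), 1))  # 21.6 minutes per 30 days
print(policy.recovery_fits_budget())               # True
print(policy.recovery_fits_budget(window_days=7))  # False: the 7-day budget is about 5 minutes
```

Under these assumptions, one recovery that uses the full 15-minute RTO consumes most of a 30-day downtime budget, which is one reason the SLO and RTO are worth setting together.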

Systems and Services: Modeling Business Applications
The next step involves creating a "system," which represents a business application. This is accomplished by selecting "Create a system" within the "Systems" menu. For enhanced manageability, the option to enable AWS Organizations account access for this system can be selected during this phase.
After creating the system, users define a "service," which represents a deployable unit such as a microservice, and associate it with the parent system. During service creation, users specify where the resources that make up the service are located. This can be done through resource tags, AWS CloudFormation stack references, Terraform state file locations, or Amazon EKS cluster and namespace details. The user also selects the relevant resilience policy and the name of the invoker IAM role. The service’s AWS Regions can also be specified so that the resilience strategy matches where the service is deployed.
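
The relationship between systems, services, and resource sources described above can be sketched as a small data model. Every class, field, and value below is a hypothetical illustration of that structure; none of it is taken from the Resilience Hub API, and the ARN uses a placeholder account ID and stack name.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class SourceType(Enum):
    """The four ways the article says service resources can be located."""
    RESOURCE_TAGS = "resource_tags"
    CLOUDFORMATION_STACK = "cloudformation_stack"
    TERRAFORM_STATE = "terraform_state"
    EKS_NAMESPACE = "eks_namespace"


@dataclass
class ResourceSource:
    type: SourceType
    locator: str  # tag expression, stack ARN, state file location, or cluster/namespace


@dataclass
class Service:
    """A deployable unit, such as a microservice."""
    name: str
    policy_name: str
    invoker_role_name: str
    regions: list[str]
    sources: list[ResourceSource]


@dataclass
class System:
    """A business application made up of one or more services."""
    name: str
    organizations_access: bool = False
    services: list[Service] = field(default_factory=list)

    def add_service(self, service: Service) -> None:
        # Service names are assumed to be unique within a system.
        if any(existing.name == service.name for existing in self.services):
            raise ValueError(f"service {service.name!r} already exists in {self.name!r}")
        self.services.append(service)


# Illustrative values only.
system = System(name="payments", organizations_access=True)
system.add_service(
    Service(
        name="ledger-api",
        policy_name="multi-region-dr-financial",
        invoker_role_name="ResilienceHubInvokerRole",
        regions=["us-east-1", "us-west-2"],
        sources=[
            ResourceSource(
                SourceType.CLOUDFORMATION_STACK,
                "arn:aws:cloudformation:us-east-1:111122223333:stack/ledger-api/EXAMPLE",
            )
        ],
    )
)
print([s.name for s in system.services])
```
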
A significant new feature is the ability to enable dependency discovery for a service. When this feature is activated, AWS analyzes VPC flow logs for the VPCs associated with the service’s resources. This proactive analysis helps uncover interdependencies that might otherwise remain hidden. Users retain the flexibility to disable this feature at any time through the dependency discovery settings within the service’s details page.
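
As a rough illustration of how flow logs can reveal dependencies, the sketch below parses records in the default VPC Flow Logs format and counts accepted outbound connections from a service's addresses. It is a simplified heuristic written for this article, not a description of how Resilience Hub performs its analysis; the sample records, IP addresses, and the ephemeral-port cutoff are assumptions.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

# Assumption: destination ports at or above this value are treated as
# ephemeral (reply traffic to a client), not as a dependency endpoint.
EPHEMERAL_PORT_START = 32768


def parse_record(line):
    """Split one space-separated record into a dict, or return None if malformed."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def candidate_dependencies(lines, service_ips):
    """Count accepted flows from the service to (address, port) pairs outside it."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record is None or record["action"] != "ACCEPT":
            continue
        if not record["dstport"].isdigit():
            continue  # records without data carry "-" in place of values
        port = int(record["dstport"])
        outbound = record["srcaddr"] in service_ips and record["dstaddr"] not in service_ips
        if outbound and port < EPHEMERAL_PORT_START:
            counts[(record["dstaddr"], port)] += 1
    return counts


# Made-up sample records for a service whose only address is 10.0.1.15.
sample = [
    "2 111122223333 eni-0a1b2c3d 10.0.1.15 10.0.2.40 49152 5432 6 12 3400 1700000000 1700000060 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 10.0.1.15 10.0.2.40 49160 5432 6 9 2100 1700000060 1700000120 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 10.0.2.40 10.0.1.15 5432 49152 6 10 2900 1700000000 1700000060 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 10.0.1.15 203.0.113.9 443 51000 6 8 900 1700000000 1700000060 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 10.0.1.15 10.0.3.7 49153 6379 6 4 300 1700000000 1700000060 REJECT OK",
]

dependencies = candidate_dependencies(sample, service_ips={"10.0.1.15"})
print(dict(dependencies))  # {('10.0.2.40', 5432): 2}
```

In this sample, the only candidate dependency is the host listening on port 5432; the inbound reply, the response to an external client, and the rejected connection are all filtered out.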

Failure Mode Assessment: Proactive Identification and Mitigation
With the service created and a resilience policy applied, users can initiate their first assessment by selecting "Run failure mode assessment" on the service page. The assessment process is designed to be comprehensive and automated.
During the assessment, AWS Resilience Hub assumes the configured invoker role to access and read resources from the specified input sources. It then identifies parent-child relationships between resources and queries the application topology service to map connections. This mapping creates a detailed topology that visualizes data flow, containment relationships, and permission structures within the application architecture.
The "Service topology" view provides a graphical representation of service resources, intelligently grouped by their functional roles within the application. This information can be presented in graph, table, or JSON formats, catering to different user preferences and analytical needs.
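
The three presentation formats are views over the same underlying relationships. The sketch below renders one small, invented topology as an adjacency graph, a table, and JSON; the resource names, relation labels, and JSON keys are hypothetical and do not reflect the format Resilience Hub actually emits.

```python
import json
from collections import defaultdict

# Invented edges covering the three relationship kinds mentioned in the text:
# containment, data flow, and permissions.
EDGES = [
    ("vpc-main", "contains", "ecs-ledger-api"),
    ("ecs-ledger-api", "sends_data_to", "rds-ledger-db"),
    ("role-ledger-task", "grants_access_to", "rds-ledger-db"),
]


def as_graph(edges):
    """Adjacency view: source -> list of (relation, target)."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return dict(graph)


def as_table(edges):
    """Fixed-width text rows, header first."""
    rows = [f"{'SOURCE':<18}{'RELATION':<18}TARGET"]
    rows += [f"{s:<18}{r:<18}{t}" for s, r, t in edges]
    return rows


def as_json(edges):
    """JSON array with one object per edge."""
    return json.dumps(
        [{"source": s, "relation": r, "target": t} for s, r, t in edges],
        indent=2,
    )


print(as_graph(EDGES)["ecs-ledger-api"])
print("\n".join(as_table(EDGES)))
print(as_json(EDGES))
```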

The "Failure mode guidance" feature allows for the addition of assertions. These assertions act as guiding principles for the agents performing the failure mode assessment, helping to refine the accuracy and relevance of the analysis. Assertions can be automatically generated by the assessment agents or manually added and updated by users to enhance the precision of the findings.
Upon completion of the assessment, a detailed report of findings and recommendations is presented in the "Assessment" tab of the service page. Each identified finding clearly articulates the nature of the failure mode, its potential impact on the architecture, actionable remediation steps, and the specific policy requirement it relates to. Users can then choose to "Mark as resolved" once a recommendation has been implemented or "Mark as irrelevant" if a particular finding does not apply to their specific use case.
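
The triage step can be pictured as a simple state change on each finding. The sketch below models the four pieces of information the article says a finding carries, plus a status that mirrors the two console actions. The `Finding` class, its methods, and the sample findings are hypothetical and are not drawn from real assessment output.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    IRRELEVANT = "irrelevant"


@dataclass
class Finding:
    """Hypothetical record of one assessment finding."""
    failure_mode: str
    impact: str
    remediation: str
    policy_requirement: str
    status: Status = Status.OPEN

    def mark_resolved(self):
        self.status = Status.RESOLVED

    def mark_irrelevant(self):
        self.status = Status.IRRELEVANT


def open_findings(findings):
    """Findings that still need attention."""
    return [f for f in findings if f.status is Status.OPEN]


# Invented findings for illustration.
findings = [
    Finding("Database runs in a single Availability Zone", "Zone failure takes the service offline",
            "Enable a multi-AZ deployment", "Availability SLO"),
    Finding("No cross-Region backup copy", "Regional outage loses recent data",
            "Copy backups to a second Region", "RPO"),
    Finding("Batch job has no retry", "Transient errors drop work",
            "Add retries with backoff", "Availability SLO"),
]

findings[0].mark_resolved()    # recommendation implemented
findings[2].mark_irrelevant()  # does not apply to this use case

print([f.failure_mode for f in open_findings(findings)])  # ['No cross-Region backup copy']
```
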
For existing AWS Resilience Hub customers, migration APIs have been introduced to simplify the transition to the new platform. These APIs automate the conversion of previous assessment policies into the new resilience policy format and map legacy application structures to the new model, which supports multiple related applications within a single system composed of multiple services.

Broader Implications and Availability
The next generation of AWS Resilience Hub marks a shift toward a more proactive and integrated approach to cloud application resilience. By centralizing resilience management, automating dependency discovery, and applying AI to failure mode analysis, AWS aims to help organizations build and maintain more reliable cloud environments.
The enhanced capabilities are particularly impactful for large enterprises with complex, distributed application architectures. The ability to gain organization-wide visibility, set consistent resilience standards, and automate the detection of vulnerabilities addresses long-standing challenges in cloud operations. This not only reduces operational risk but also facilitates better compliance with industry regulations and internal governance policies.
The next generation of AWS Resilience Hub is now generally available in all AWS commercial Regions where Resilience Hub is currently offered. Information regarding regional availability and the future roadmap can be accessed via the AWS Capabilities by Region portal.

AWS Resilience Hub has transitioned to a new service-based pricing model. This model includes two free failure mode assessments per month for services, with options for automated dependency assessment available. AWS is offering a trial period, allowing users to explore the platform’s capabilities without initial cost. Detailed pricing information is available on the AWS Resilience Hub pricing page.
Organizations are encouraged to explore the new AWS Resilience Hub through the Resilience Hub console. Feedback and inquiries can be directed to AWS re:Post for Resilience Hub or through standard AWS Support channels. This release underscores AWS’s commitment to providing advanced tools that enable customers to build and operate resilient cloud solutions at scale. The platform’s evolution reflects the growing importance of resilience in the face of increasingly complex digital landscapes and the constant demand for high availability.







