AI Agent Operations Engineer

São Paulo, Brazil

We are a global digital agency composed of strategists, creatives, media experts, data scientists, and engineers driven by one common purpose: accelerating business growth through marketing and digital transformation. Named a top 3% Google Premier Partner and recognized by Inc. 5000 and Adweek’s 75 Fastest Growing Companies, we’re constantly looking for “A” players to join our team.

We attribute our rapid growth to our strongest asset: our people. Our teams are highly collaborative and work closely with each client to set clear goals and objectives so that we can deliver exceptional results. Mindgruve is a place where every opinion is valued. Not only will you be empowered to contribute ideas, but you will also play a key role in executing them and in driving success for brands across a variety of industries. Sound fun? Perfect, you’ll fit right in.



The AI Agent Operations Engineer plays a critical role in deploying, maintaining, and scaling our AI agent ecosystem. This role is responsible for the production environment around our agent applications, ensuring that the underlying cloud infrastructure, retrieval systems, memory services, and data pathways are reliable, observable, secure, and performant.
This is a hybrid role: part cloud engineer, part platform engineer, and part data infrastructure partner. You will work closely with AI Agent Engineers, Data Engineers, Data Scientists, and ML Operations Engineers to operationalize agent systems built in an AWS-heavy environment. The work includes managing infrastructure as code, tuning retrieval and query performance, debugging permissions and networking issues, and helping create the guardrails required for enterprise-grade AI systems.
The role offers broad exposure across cloud architecture, data infrastructure, AI operations, and system reliability engineering. It is designed for someone who can take a promising agent application and ensure it performs consistently and safely in production.


What You'll Do Here:

  • Deploy, maintain, and scale AI agent infrastructure across AWS services including Bedrock, Bedrock AgentCore, Bedrock Knowledge Bases, Athena, Glue, S3, Lambda, and Step Functions.
  • Operationalize Retrieval-Augmented Generation (RAG) systems, including support for vector stores, knowledge-base refresh processes, and retrieval performance monitoring.
  • Manage infrastructure as code for AI agent resources using Terraform or AWS CDK.
  • Support memory-enabled agent systems, including the operational setup of short-term and long-term memory services and supporting storage patterns.
  • Monitor and improve performance of Athena-backed workflows, including partition strategy, Parquet optimization, and Glue Data Catalog alignment.
  • Evaluate AI agent performance and behavior, including monitoring prompt and context quality, data retrieval, and output quality.
  • Establish observability, logging, alerting, and failure-handling patterns for AI agent services and supporting workflows.
  • Debug and resolve AWS networking, IAM, VPC, NAT Gateway, timeout, and service-access issues affecting agent performance and reliability.
  • Collaborate with AI Agent Engineers and Data Scientists to move agent applications from prototype to production-ready systems.
  • Support deployment pipelines, environment configuration, and release processes for evolving AI applications.
  • Contribute to governance, reliability, and security practices that enable enterprise-grade AI operations.

 

We Need a Person With:

  • Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related field required.
  • Strong experience with AWS cloud infrastructure and services required.
  • Hands-on experience with AWS Bedrock and Bedrock AgentCore strongly preferred.
  • Experience supporting or deploying RAG architectures and the cloud resources that underpin them.
  • Strong experience with infrastructure as code using Terraform or AWS CDK.
  • Strong working knowledge of Athena, Glue, S3, Parquet, and data infrastructure performance optimization.
  • Experience with observability, logging, alerting, and operational troubleshooting in cloud environments.
  • Strong understanding of IAM, networking, VPCs, NAT Gateways, and secure service-to-service communication.
  • Ability to work collaboratively across engineering, analytics, and product teams.
  • Excellent verbal and written communication skills.
  • Professional and personal integrity.
  • A detail-oriented mindset focused on reliability, security, and operational excellence.

 

What We Consider a Plus:

  • Experience with LangGraph, LangChain, or Haystack in support of AI agent platforms.
  • Experience with OpenSearch Serverless, Pinecone, S3 Vectors, or other vector database technologies.
  • Experience supporting streaming applications, asynchronous workflows, or event-driven architectures.
  • Experience partnering closely with Data Scientists, MLOps Engineers, and backend engineers to productionize AI systems.
  • Experience supporting analytics, marketing, or business intelligence environments.
  • Experience with evaluation pipelines, regression monitoring, or quality assurance frameworks for AI systems.

