AI Ops Specialist
Job Summary
We are looking for a dynamic and results-driven AI Ops Specialist to join our AI & Automation team. The ideal candidate will play a pivotal role in enabling and managing production environments for our GenAI products, ensuring seamless deployment, scalability, and performance optimization. You will collaborate with cross-functional teams to integrate AI solutions into production, monitor their performance, and implement automation to streamline operational workflows.
Responsibilities
- Production Environment Enablement: Design, implement, and manage production environments for GenAI products, ensuring high availability, scalability, and security. Establish CI/CD pipelines to automate deployment and updates of AI models and services.
- AI & ML Integration: Collaborate with Data Scientists and ML Engineers to transition AI models from development to production. Optimize model performance and ensure efficient resource utilization in a production setting.
- Monitoring & Performance Optimization: Develop and maintain monitoring tools to track AI product performance, availability, and reliability. Implement automated alerting systems to detect and resolve issues proactively.
- Automation & Process Improvement: Automate operational workflows to reduce manual intervention and improve system reliability. Leverage AI Ops platforms to optimize system health and performance.
- Collaboration & Stakeholder Engagement: Work closely with Enterprise Architects and IT Operations teams to ensure AI solutions align with the company’s strategic goals. Work with external platforms such as Amazon Bedrock to customize AI models and integrate them into the production environment.
- Security & Compliance: Ensure data security, privacy, and compliance with industry standards and regulatory requirements. Implement robust security protocols to safeguard AI models and data.
Requirements
- Bachelor’s degree in Computer Science, Data Science, Telecommunications, or a related field.
- 2-4 years of experience in AI Ops, IT operations, or production environment management.
- Proficiency in cloud platforms (AWS, Azure) and AI Ops tools (Moogsoft, Splunk, Datadog).
- Strong knowledge of AI/ML deployment frameworks (TensorFlow Serving, MLflow) and containerization tools (Docker, Kubernetes).
- Experience with CI/CD pipelines and infrastructure as code (Terraform, Ansible).
- Familiarity with Amazon Bedrock or similar platforms for AI model customization.
- Excellent problem-solving skills and the ability to troubleshoot complex production issues.
- Experience with Generative AI applications in a production environment.
- Knowledge of DevOps practices and site reliability engineering (SRE) principles.
- Strong understanding of network infrastructure and security best practices.
- Excellent communication and collaboration skills.