DevOps and IT Admin Engineer

Neural Magic

Software Engineering, IT
Somerville, MA, USA
Posted on Thursday, May 30, 2024

About Neural Magic

Based in Somerville, Massachusetts, Neural Magic is a Series A startup backed by leading investors including Andreessen Horowitz, NEA, Pillar, VMware, Verizon Ventures, Comcast Ventures, and Amdocs. At Neural Magic, we believe the future of AI is open, and we are on a mission to bring the power of open-source LLMs and vLLM to every enterprise on the planet. Neural Magic accelerates AI for the enterprise and brings operational simplicity to GenAI deployments. As a leading developer and maintainer of the vLLM project and inventor of state-of-the-art techniques for model quantization and sparsification, Neural Magic provides a stable platform for enterprises to build, optimize, and scale LLM deployments.

Our Mission

Neural Magic is on a mission to bring the power of open-source LLMs and vLLM to every enterprise on the planet.

Your Role

As a DevOps and IT Admin Engineer, you will manage and scale our Kubernetes infrastructure, cloud offerings, and network storage. The role involves hands-on data center work, including troubleshooting hardware failures, racking new servers, and managing internal networking and VPN. You will collaborate with ML Ops engineers, ML researchers, and the broader engineering team to support research training runs, performance benchmarking, and CI/CD. You will also contribute to the product roadmap by providing insights on scaling inference serving loads using vLLM, Kubernetes, Helm charts, and other technologies.

Join us in shaping the future of AI!

Responsibilities

  • Kubernetes Management: Oversee and improve our Kubernetes infrastructure, ensuring optimal performance and scalability.
  • Cloud Infrastructure: Manage cloud offerings across multiple regions, ensuring fast access and reliability.
  • Network Storage: Maintain and enhance our network storage solutions, ensuring data integrity and availability.
  • Data Center Operations: Troubleshoot hardware failures, rack new servers, and manage internal networking infrastructure and VPN.
  • Collaboration: Work closely with ML Ops engineers, ML researchers, and other engineering team members to support scalable research training runs, performance benchmarking, and CI/CD.
  • Product Roadmap Contribution: Provide insights and opinions on the product roadmap, focusing on scaling inference serving loads through vLLM, Kubernetes, and Helm charts.
  • Performance Monitoring: Implement monitoring solutions to ensure the health and performance of all infrastructure components.