The MAIA System Infrastructure team is pioneering the development of the next-generation developer ecosystem for AI Accelerators. We are at the core of creating the infrastructure that enables deep observability into our proprietary MAIA chips, empowering developers to harness the full potential of these advanced AI accelerators. Our mission is to build a transparent, efficient, and powerful ecosystem that goes beyond traditional GPU observability, providing unmatched insights into the operations and performance of our AI accelerators.
We operate at the intersection of cutting-edge AI hardware, system software, and developer tools, constantly pushing the boundaries of what is possible. We not only focus on the internal execution and performance metrics of the MAIA chips but also play a crucial role in optimizing the broader data flow infrastructure, particularly over PCIe, eBPF and various frontend networks, ensuring seamless and efficient data movement between the host and accelerators. By decomposing the data flow infrastructure and reworking it into state-of-the-art designs, we aim to maximize the performance and efficiency of AI workloads, enhancing the overall ecosystem's capabilities. Our collaborative efforts span hardware architects, system engineers, and AI researchers, all aimed at building a holistic observability stack that drives the next wave of AI innovation.
As a Software Engineer II on the MAIA System Infrastructure team, you will play a crucial role in building and optimizing the infrastructure that underpins observability and data flow for MAIA AI accelerators. Your primary focus will be on developing and enhancing the mechanisms that support our complex data flows across hosts and networks, ensuring they provide accurate and actionable insights into the operations of our AI hardware. This role involves working closely with senior engineers to design and implement data flow mechanisms that are efficient, scalable, and capable of handling the intricacies of our advanced accelerator architecture.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Required Qualifications
Other Requirements
Preferred Qualifications
Software Engineering IC3 - The typical base pay range for this role across Canada is CAD $83,600 - CAD $159,600 per year.
Find additional pay information here:
https://careers.microsoft.com/v2/global/en/canada-pay-information.html
Microsoft will accept applications for the role until January 20, 2025.
In this position, you'll be hands-on in developing and optimizing the infrastructure that enables our observability and debugging tools to function seamlessly across multi-chip, multi-server environments. Your work will directly contribute to how developers interact with, analyze, and optimize AI workloads on our accelerators, ensuring that data transfer and processing are handled with maximum efficiency.
Foster an Inclusive and Collaborative Environment:
Actively contribute to a culture of inclusivity by valuing diverse perspectives, mentoring peers, and promoting open communication. Support and uplift teammates to ensure everyone can contribute their best in a high-performing, collaborative environment.
Develop and Optimize Tooling Infrastructure:
Work on the core infrastructure that supports our observability tools, focusing on the data flows and the efficient management of information between host systems and MAIA accelerators.
Implement and refine data transfer mechanisms, ensuring they are optimized for speed, reliability, and scalability across a distributed system of accelerators.
Contribute to Data Flow Efficiency:
Collaborate with senior engineers to decompose and optimize the data flow architecture over our entire hardware stack, focusing on minimizing latency and maximizing throughput.
Engage in profiling and debugging the data flow paths to identify and resolve bottlenecks, contributing to the overall performance of the AI infrastructure.
Participate in Building Robust Systems:
Assist in building and maintaining the infrastructure that allows seamless interaction between the tooling stack and the MAIA chips, ensuring reliable data collection and analysis.
Contribute to the development of internal APIs and libraries that facilitate data transfer, processing, and storage, supporting a high-performance observability ecosystem.
Engage with High-Performance Systems Design:
Work alongside a team of talented, inclusive and diverse engineers, gaining experience in the design and implementation of high-performance systems that are at the forefront of AI acceleration technology.
Develop a deep understanding of system-level interactions and learn to build infrastructure that supports real-time data analysis and feedback.