Together AI's Posts (85)

LLM Training Dataset and Checkpoint Optimization Engineer

Together.ai is a leader in developing AI infrastructure that powers the training of state-of-the-art models. We focus on creating scalable, efficient systems for handling massive datasets and managing large-scale distributed checkpoints, ensuring seamless workflows for training and fine-tuning AI models. We are seeking a Training Dataset and Checkpoint Acceleration Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. In this role, you will work at the intersection of data engineering and distributed systems, ensuring that training workflows are highly performant, reliable, and cost-efficient.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure. We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Responsibilities:

Dataset Acceleration:
- Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
- Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.
- Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).

Checkpointing Systems:
- Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
- Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.
- Develop incremental and differential checkpointing solutions to reduce storage costs.

Performance Optimization:
- Profile and debug bottlenecks in data pipelines and checkpoint systems.
- Optimize for GPU/TPU utilization by ensuring efficient data feeding and checkpoint recovery times.

Scalability and Reliability:
- Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
- Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.

Collaboration and Support:
- Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
- Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.

Must-Have:

Experience:
- 5+ years of experience in data engineering, distributed systems, or ML infrastructure.

Technical Skills:
- Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow Data, DALI).
- Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).
- Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).

Programming:
- Proficient in Python, C++, or Go for performance-critical systems.

Optimization Techniques:
- Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
- Familiarity with compression and serialization for large datasets and checkpoints.

Soft Skills:
- Analytical and problem-solving mindset.
- Strong communication and collaboration skills across teams.

Nice-to-Have:

- Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.
- Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.
- Knowledge of open-source contributions or projects related to data pipelines or checkpointing.
- Experience with incremental and real-time checkpointing solutions.
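To make the data-loading responsibilities above concrete, here is a minimal sketch of a sharded streaming dataset with worker-level sharding and prefetching in PyTorch. The shard names and the dummy record reader are illustrative placeholders, not part of any Together AI system.

```python
# Minimal sketch: sharded streaming dataset with prefetching in PyTorch.
# Shard paths and _read_shard() are hypothetical placeholders.
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedStreamingDataset(IterableDataset):
    """Streams records from a list of shard files, splitting shards across workers."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        worker = get_worker_info()
        # Each DataLoader worker reads a disjoint subset of shards.
        if worker is None:
            shards = self.shard_paths
        else:
            shards = self.shard_paths[worker.id::worker.num_workers]
        for path in shards:
            yield from self._read_shard(path)

    def _read_shard(self, path):
        # Placeholder: in practice this would stream from S3/GCS/Lustre
        # and deserialize records (e.g., Parquet row groups).
        for _ in range(1024):
            yield torch.randn(16)  # dummy record

if __name__ == "__main__":
    dataset = ShardedStreamingDataset([f"shard-{i:05d}" for i in range(8)])
    loader = DataLoader(
        dataset,
        batch_size=64,
        num_workers=4,          # parallel shard readers
        prefetch_factor=4,      # batches prefetched per worker
        pin_memory=True,        # faster host-to-GPU copies
        persistent_workers=True,
    )
    for step, batch in enumerate(loader):
        if step == 3:
            break
        print(step, batch.shape)
```

Splitting shards across DataLoader workers keeps readers independent, while prefetching and pinned memory overlap I/O and host-to-device copies with compute.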

Location: San Francisco

Salary range: $160,000 - $230,000

LLM Training Frameworks and Optimization Engineer

At Together.ai, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency. We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovations in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure. We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Responsibilities:

Framework Development and Optimization:
- Design, implement, and optimize distributed training frameworks tailored for large language models.
- Develop custom modules, plugins, and features to enhance framework scalability and performance.

Algorithmic and Systems Optimization:
- Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
- Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.

Performance Tuning:
- Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
- Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.

Scalability and Resilience:
- Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
- Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.

Collaboration and Support:
- Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
- Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.

Must-Have:

Experience:
- 5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.

Technical Skills:
- Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).
- Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).
- Familiarity with GPU/TPU hardware and deep learning performance optimizations.

Programming:
- Proficient in Python and C++ or CUDA for high-performance computing.

Optimization Techniques:
- Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).
- Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.

Soft Skills:
- Analytical problem-solving skills and a focus on performance improvement.
- Strong collaboration and communication skills across teams.

Nice-to-Have:

- Familiarity with graph optimization and compiler-level performance tuning.
- Contributions to open-source deep learning or distributed training projects.
- Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).
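As an illustration of the gradient-synchronization and mixed-precision techniques named above, here is a minimal sketch of data-parallel training with PyTorch DDP and automatic mixed precision. The toy model, data, and launch assumptions (torchrun setting RANK/WORLD_SIZE/LOCAL_RANK) are illustrative only.

```python
# Minimal sketch: data-parallel training with mixed precision in PyTorch.
# Model, data, and process-group settings are illustrative placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

    for step in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(x).pow(2).mean()
        optimizer.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()   # backward overlaps gradient all-reduce
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

DDP overlaps the gradient all-reduce with the backward pass, and the GradScaler provides the loss scaling needed for stable fp16 training.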

Location: San Francisco

Salary range: $160,000 - $230,000

LLM Training Resilience Engineer

Together.ai is at the forefront of AI infrastructure development, creating robust platforms and frameworks to support state-of-the-art large-scale machine learning training. We specialize in delivering resilient, high-performance systems that power breakthroughs in AI research and deployment. We are seeking a Large-scale Training Resilience Engineer to ensure the reliability, fault tolerance, and scalability of our large-scale training infrastructure. If you are passionate about solving complex distributed systems problems and building highly available AI training pipelines, this role is for you.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure. We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Responsibilities:

Resilience and Fault Tolerance Design:
- Develop systems to identify, isolate, and recover from failures in large-scale distributed training workloads.
- Implement proactive error-detection mechanisms, including straggler detection and fault prediction algorithms.

Distributed System Optimization:
- Ensure stability and consistency across distributed training clusters (e.g., GPU/TPU clusters).
- Optimize recovery time and throughput in the face of hardware or software failures.

Monitoring and Observability:
- Design and maintain observability systems for monitoring cluster health, training performance, and failure patterns.
- Leverage telemetry data to improve incident response and automate mitigation strategies.

Automation and Tooling:
- Build resilience-focused tooling, such as job health monitors, distributed checkpoint systems, and automated recovery workflows.
- Enhance debugging and diagnosis frameworks for distributed training jobs.

Collaboration and Documentation:
- Collaborate with platform engineers, researchers, and ML practitioners to identify pain points and resilience requirements.
- Document and communicate best practices for fault-tolerant AI training.

Must-Have:

Experience:
- 5+ years of experience in distributed systems, cloud infrastructure, or large-scale machine learning training.

Technical Skills:
- Proficiency in distributed computing frameworks (e.g., PyTorch DDP, TensorFlow, Horovod).
- Strong knowledge of resilience strategies in distributed systems (e.g., leader election, consensus, retry mechanisms).
- Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK stack).

Programming:
- Proficient in Python, Go, or a similar programming language.

Infrastructure:
- Experience working with cloud platforms (e.g., AWS, GCP, Azure) and Kubernetes for workload orchestration.

Soft Skills:
- Strong analytical, problem-solving, and debugging skills.
- Excellent collaboration and communication skills.

Nice-to-Have:

- Familiarity with GPU/TPU cluster management and scheduling.
- Experience with high-availability database systems or message queues.
- Experience with open-source contributions or community engagement.
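To illustrate the kind of fault-tolerant resume behavior described above, here is a minimal sketch of atomic checkpoint writes and resume-from-latest logic in PyTorch. The local file path and checkpoint interval are hypothetical; a production system would coordinate across ranks and write to distributed storage.

```python
# Minimal sketch: fault-tolerant checkpoint save/resume for a training loop.
# CKPT_PATH and the save interval are illustrative placeholders.
import os
import tempfile
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical local path

def save_checkpoint(model, optimizer, step, path=CKPT_PATH):
    """Write atomically: save to a temp file, then rename over the target."""
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    os.close(fd)
    torch.save(state, tmp)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a partial file

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    """Resume from the latest checkpoint if one exists; return the next step."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

if __name__ == "__main__":
    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    start = load_checkpoint(model, optimizer)
    for step in range(start, start + 100):
        loss = model(torch.randn(4, 8)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 25 == 0:
            save_checkpoint(model, optimizer, step)
```

Writing to a temporary file and renaming it into place means a crash mid-write never corrupts the last good checkpoint, so a restarted job can always resume from a consistent state.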

Location: San Francisco

Salary range: $160,000 - $230,000

Rust Systems Engineer - Inference

Together AI is seeking a Rust Systems Engineer to join our Inference Engine team, focusing on optimizing and enhancing the performance of our AI inference systems. If you are passionate about developing high-performance systems, we want to hear from you. This position offers the chance to collaborate closely with AI researchers and engineers to create cutting-edge AI solutions. Join us in shaping the future at Together AI!

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society. Together, we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI. Our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers on our journey to build the next-generation AI infrastructure. We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $240,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunities to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

- Demonstrated proficiency in the Rust programming language and ecosystem
- Strong experience with Rust frameworks, namely axum, tokio, jemalloc, serde, rayon, and/or crossbeam
- Deep understanding of concurrent programming and multiprocessing patterns
- Experience developing and optimizing orchestration and scheduling algorithms
- Knowledge of distributed systems principles and design
- Experience interfacing Rust services with RPC components
- Ability to optimize systems for performance, reliability, and resource efficiency
- Solid understanding of memory management and performance profiling
- Experience with one or more ML inference frameworks (HuggingFace TGI, NVIDIA Dynamo, Candle, Pixi, uv, llguidance)
- Background in designing and implementing high-throughput, low-latency systems
- Knowledge of disaggregated serving architectures and continuous batching algorithms for ML workloads
- Familiarity with ML/AI inference systems and performance optimization techniques
- Experience with system benchmarking and performance analysis

Location: San Francisco

Salary range: $160,000 - $240,000

Systems Research Engineer, GPU Programming

As a Systems Research Engineer specializing in GPU Programming, you will play a crucial role in developing and optimizing GPU-accelerated kernels and algorithms for ML/AI applications. Working closely with the modeling and algorithm team, you will co-design GPU kernels and model architectures to enhance the performance and efficiency of our AI systems. Collaborating with the hardware and software teams, you will contribute to the co-design of efficient GPU architectures and programming models, leveraging your expertise in GPU programming and parallel computing. Your research skills will be vital in staying up to date with the latest advancements in GPU programming techniques, ensuring that our AI infrastructure remains at the forefront of innovation.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure. We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

- Strong background in GPU programming and parallel computing, such as CUDA and/or Triton
- Knowledge of ML/AI applications and models
- Knowledge of performance profiling and optimization tools for GPU programming
- Excellent problem-solving and analytical skills
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Electrical Engineering, or equivalent practical experience
- Optimize and fine-tune GPU code to achieve better performance and scalability
- Collaborate with cross-functional teams to integrate GPU-accelerated solutions into existing software systems
- Stay up-to-date with the latest advancements in GPU programming techniques and technologies
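For a flavor of the GPU kernel work described above, here is a minimal vector-add kernel written in Triton, one of the GPU programming models named in the posting. The block size and tensor shapes are arbitrary illustrative choices.

```python
# Minimal sketch: a vector-add kernel in Triton (Python GPU DSL).
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard out-of-bounds lanes in the last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)       # one program instance per block of 1024
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    x = torch.randn(4096, device="cuda")
    y = torch.randn(4096, device="cuda")
    assert torch.allclose(add(x, y), x + y)
```

Each program instance handles one BLOCK_SIZE chunk of the output, with a mask guarding the final partial block; the same pattern generalizes to fused elementwise kernels.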

Location: San Francisco

Salary range: $160,000 - $230,000
