LeetRule

You're wiring up a capital city lookup for a legacy pipeline. The system expects minimal, machine-readable responses — no fluff, no sentences. Some questions are straightforward. Others are ambiguous or nonsensical. The pipeline needs to handle both gracefully.

16 tests

Country Info JSON

An internal API returns country information as JSON. The backend deserializes the response directly, so it must be valid JSON with a consistent shape every time. The system handles various inputs.

14 tests

Context-Only Question Answering

A retrieval-augmented QA system. The model receives context from a knowledge base along with a user question. It should answer based only on what's in the context. Outside knowledge isn't trusted here. When the context doesn't support an answer, the system needs a clear signal.

13 tests

Safe Command Filter

A safety filter that gates requests before they reach the main AI system. It classifies each request and emits a routing decision. The downstream infrastructure only understands two states. No explanations, no hedging — just the decision.

24 tests

Sum of Squares Program

medium · Python

The model generates Python code that gets executed in a pipeline. Input comes via stdin, output goes to stdout. The pipeline expects exact numeric output — nothing extra. The code should handle basic arithmetic on integers.
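For orientation, here is a minimal sketch of the kind of program this card describes. The input format is an assumption (whitespace-separated integers on stdin), as are the names `sum_of_squares` and `main`; the actual tests may expect something different.

```python
import io
import sys


def sum_of_squares(text: str) -> int:
    """Sum the squares of all whitespace-separated integers in text."""
    return sum(int(token) ** 2 for token in text.split())


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # Write only the number followed by a newline: no label, no extra text.
    stdout.write(f"{sum_of_squares(stdin.read())}\n")


# Demonstration with in-memory streams instead of real stdin/stdout.
demo_out = io.StringIO()
main(io.StringIO("1 2 3\n"), demo_out)
print(demo_out.getvalue(), end="")  # prints 14
```

In a real pipeline the script would simply call `main()` so that it reads from the process's stdin.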

Planner & Coder Math Solver

hard · Multi-Agent

A two-agent pipeline for solving math word problems. The first agent thinks through the problem, the second produces the final answer. Downstream systems only care about the numeric result from the second agent — no labels, no formatting.

15 tests

Sentiment Classifier

A customer feedback pipeline needs sentiment classification. Each piece of feedback gets routed based on sentiment. The system only understands three labels. Mixed signals and sarcasm are common.

17 tests

Email Subject Generator

An email automation system needs subject lines generated from email bodies. Subjects should be concise and informative. The downstream system has strict length limits and format requirements.

12 tests

Unit Converter

A measurement conversion API. Takes a value with a unit and converts it. Output must be just the number — the calling system adds the unit label itself.

4 tests

Date Parser

A date normalization service. Takes various natural language date expressions and converts them to a standard format. The backend database expects a specific format.

5 tests

Language Detector

A language detection endpoint for a translation pipeline. Returns ISO language codes. The routing system downstream only understands specific codes.

6 tests

One-Line Summarizer

A summarization endpoint for a news aggregator. Takes article text and produces a single-sentence summary. The UI has limited space — brevity is essential.

3 tests

FizzBuzz Generator

medium · Python

Classic programming challenge as a code generation task. The model writes Python that processes input and produces exact output. Edge cases matter. Format is strict.

7 tests
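As a rough reference for the FizzBuzz card, the sketch below shows the classic rules. The card does not state the input or output format, so the shape here (an upper bound `n`, one entry per line, standard `Fizz`/`Buzz`/`FizzBuzz` words) and the name `fizzbuzz_lines` are assumptions.

```python
def fizzbuzz_lines(n: int) -> list[str]:
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    lines = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            lines.append("FizzBuzz")  # divisible by both 3 and 5
        elif i % 3 == 0:
            lines.append("Fizz")
        elif i % 5 == 0:
            lines.append("Buzz")
        else:
            lines.append(str(i))
    return lines


# One entry per line, nothing else on stdout.
print("\n".join(fizzbuzz_lines(5)))
```

Checking divisibility by 15 first avoids the common mistake of emitting `Fizz` for multiples of both 3 and 5. A non-positive `n` yields an empty list.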

Word Counter

medium · Python

Text analysis tool. Reads text and counts words. The pipeline expects a specific output format for downstream processing.

4 tests

Intent Classifier

A chatbot's NLU layer. Messages get classified into intents before reaching specialized handlers. Ambiguous messages, typos, and off-topic queries are common. The routing system only understands specific labels.

19 tests

Priority Tagger

A ticket triage system. Support tickets get priority levels assigned based on content. The queue management system routes by priority. False urgency, spam, and ambiguous requests are common.

20 tests

Code Review Pipeline

hard · Multi-Agent

A two-agent code review system. The first agent reviews code and identifies issues. The second agent fixes the code. The final output should be working code only — no commentary, no markdown.

Log Level Normalizer

An observability pipeline needs log messages mapped to standard levels so routing rules stay simple. The model reads a raw log line and outputs a single normalized level token that downstream systems understand.

HTTP Status Mapper

An API gateway needs textual error descriptions mapped to HTTP status codes. The model sees a short summary of what happened and responds with a single numeric status code.

API Contract Validator

An internal tool checks whether incoming JSON payloads match a strict contract. The model receives a JSON string and must output either VALID or INVALID so the gateway can accept or reject the request.

Feature Flag Evaluator

A feature flag service decides whether a flag is enabled for a given user and environment. The model receives a small JSON context and must respond with ENABLED or DISABLED so callers can gate behavior.

9 tests

Rate Limit Decider

An API edge proxy decides what to do with each incoming request based on quota usage. The model reads a small JSON record and outputs ALLOW, THROTTLE, or BLOCK so the proxy can react.

Alert Router

An incident management system needs to decide where each alert should go. The model reads an alert description and outputs one of ONCALL, TICKETING, or IGNORE so the system can route it.

9 tests

Rollout Strategy Decider

A deployment planner chooses rollout strategies based on risk and blast radius. The model reads a short change description and outputs one of SIMPLE, CANARY, or BLUE_GREEN for the orchestrator.

Log Redactor

A logging pipeline must strip sensitive data before logs leave the cluster. The model receives a raw log line and must return a redacted version, replacing secrets with placeholders while leaving the rest intact.

5 tests
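To make the Log Redactor behavior concrete, here is a small rule-based sketch of "replace secrets, leave the rest intact". The key names, the `[REDACTED]` placeholder, and the `redact` function are all hypothetical; the card does not say which secrets or placeholders its tests use.

```python
import re

# Hypothetical patterns: key=value pairs with sensitive-looking keys,
# and HTTP bearer tokens. A real pipeline would need a broader list.
_KEY_VALUE = re.compile(
    r"\b(password|passwd|token|secret|api_key)=(\S+)", re.IGNORECASE
)
_BEARER = re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)


def redact(line: str) -> str:
    """Replace secret values with a placeholder, keeping other text as-is."""
    line = _KEY_VALUE.sub(lambda m: f"{m.group(1)}=[REDACTED]", line)
    line = _BEARER.sub("Bearer [REDACTED]", line)
    return line


print(redact("user=alice password=hunter2 action=login"))
# prints: user=alice password=[REDACTED] action=login
```

Only the secret value is replaced, so field names and surrounding fields survive for debugging.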

SQL Query Classifier

A database firewall classifies incoming SQL before deciding how to handle it. The model sees a single SQL statement and must output READ_ONLY, MUTATING, DDL, or UNKNOWN.
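One way to picture the four labels is a naive leading-keyword heuristic, sketched below. The keyword sets and the `classify_sql` name are assumptions, and the approach is deliberately conservative: anything it cannot place with confidence (CTEs, multiple statements, unfamiliar verbs) falls through to UNKNOWN.

```python
import re

# Assumed keyword groupings; real SQL dialects have more verbs than these.
READ_ONLY = {"SELECT", "SHOW"}
MUTATING = {"INSERT", "UPDATE", "DELETE", "MERGE"}
DDL = {"CREATE", "ALTER", "DROP", "TRUNCATE"}


def classify_sql(statement: str) -> str:
    """Classify a single SQL statement by its first keyword."""
    text = statement.strip().rstrip(";").strip()
    # Empty input or a remaining ';' (possible second statement) is not trusted.
    if not text or ";" in text:
        return "UNKNOWN"
    match = re.match(r"[A-Za-z]+", text)
    if match is None:
        return "UNKNOWN"
    keyword = match.group(0).upper()
    if keyword in READ_ONLY:
        return "READ_ONLY"
    if keyword in MUTATING:
        return "MUTATING"
    if keyword in DDL:
        return "DDL"
    return "UNKNOWN"


print(classify_sql("SELECT * FROM users;"))  # prints READ_ONLY
```

This heuristic ignores dialect-specific cases such as `SELECT ... INTO`, which can create a table in some databases, so it should be read as an illustration of the label set rather than a safe firewall rule.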

Experiment Bucketing