This repository releases the official scripts and evaluation harness for the paper 'On Calibration of Large Language Models: From Response To Capability', providing dataset construction pipelines for repeated-sampling ground truth on TriviaQA, GSM8K, MMLU and similar benchmarks, implementations of verbalized confidence, P(True), and response-consistency estimators, linear-probe training, and downstream applications such as pass@k simulation and inference-budget allocation. It advances a new 'cap
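The pass@k simulation mentioned above can be illustrated with the widely used unbiased pass@k estimator, which works from repeated-sampling counts. This is a minimal sketch, not the repository's code: the function name `pass_at_k` and its argument names are assumptions chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k):
    the probability that a random size-k subset of the n samples
    contains at least one correct response.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples per question and 5 correct, `pass_at_k(10, 5, 1)` reduces to the plain accuracy of 0.5, and larger `k` moves the estimate toward 1.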
Megatron Lite is an experimental training runtime and model implementation layer for Megatron, exposing a lightweight Python API under megatron.lite with runtime primitives, optimizer wrapping, and dedicated lite implementations for Qwen3 MoE and Qwen3.5 MoE models plus Hugging Face safetensors helpers. The work is an incremental specialization of the established Megatron-Core stack rather than a new algorithm or architecture, focusing on a simplified runtime and model ports without hybrid or F2
ProgramBench is a Python benchmark and dataset in which AI agents receive only a compiled binary plus documentation and must architect and re-implement a complete, functionally equivalent codebase. The work frames a new evaluation setting that tests LLM-driven reverse engineering and program synthesis at the whole-project level rather than isolated function completion. Its specialized focus on binary-to-source reconstruction makes widespread adoption unlikely outside AI-for-code and security-aud
Toolathlon is a benchmark with 600+ diverse real-world tools deployed in containerized software environments that evaluates language agents on long-horizon tasks requiring sequential tool calls. It introduces a decathlon-style suite that stresses realistic, multi-application workflows instead of the synthetic or short-horizon setups common in prior tool-use benchmarks. The underlying problem of assessing general tool-use competence is shared by nearly every team building production agents, so a
NanoGPT-Bench is a Docker-based benchmark harness with in-container submit validators that lets autonomous coding agents attempt to beat historical human records on the NanoGPT Speedrun under fixed H100-hour budgets. It introduces a long-horizon open-ended research evaluation methodology by turning an existing optimization leaderboard into an autonomous-agent testbed with LLM judging and statistical retiming, going beyond prior narrow agent benchmarks. Its specialized focus on frontier ML agent
This is the source code for the Agentic Learning AI Lab website at agenticlearning.ai, built with HTML, Tailwind CSS, Handlebars templates, and Node.js build scripts that fetch arXiv papers, generate thumbnails, and produce a searchable JSON index. The implementation applies conventional static-site automation and paper-onboarding tooling already common among academic groups rather than introducing new techniques. As a single-lab promotional site rather than a reusable framework or broadly-appli
Cycle-Sync supplies parallel Python and MATLAB implementations of a global camera pose estimation solver that recovers locations from pairwise directions via cycle-consistency message passing, T-AAB initialization, and Welsch reweighting under heavy corruption. The method frames a new robust translation-averaging formulation that combines cycle constraints with an annealing schedule and IRLS scheme, delivering markedly lower median errors than LUD, ShapeFit or BATA at 80% corruption levels. The narrow
This project is a local FastAPI web UI and set of Python scripts that runs MLX-LM inference on Gemma-4 models while applying weighted sums of precomputed or user-created activation steering vectors across transformer layers on Apple Silicon. It packages established activation-engineering techniques into an interactive workbench with contrastive persona-pair vector generation rather than introducing a new method or theoretical advance. The hardware-specific, model-specific, and experimental focus
Zapier MCP is a remote MCP server that exposes 9,000+ Zapier apps and 40,000+ actions as callable tools for AI clients such as Claude, ChatGPT, and Cursor, with agentic mode providing 14 meta-tools for dynamic discovery, enabling, and execution of actions. This implementation is novel in its combination of Zapier's pre-built integration library with MCP's standardized protocol plus agentic meta-tools that let the AI itself manage and invoke actions at runtime rather than relying on static tool exposure
Zapier SDK is a TypeScript client and CLI that lets apps or agents connect to 9,000+ third-party services, manage OAuth connections, invoke typed actions, and chain multi-app workflows while the SDK handles retries and auth refresh. The approach adds runtime schema discovery methods and an explicit AGENTS.md contract so language-model agents can explore and call integrations without hard-coded method names, extending Zapier's long-standing integration catalog into a discoverable, agent-first SDK
cmux is a native Swift/AppKit macOS terminal built on libghostty that adds vertical tabs, notification rings via OSC sequences, sidebar metadata for git/PR/ports, an in-app scriptable browser, and direct Claude Code Teams integration. It carves a modest niche by layering agent-focused UX on top of Ghostty rather than inventing a fundamentally new terminal or agent framework. The tool targets macOS power users running parallel AI coding sessions and could spread among that cohort if multi-agent terminal
The Google Antigravity SDK is a Python library that ships an async Agent context manager, Conversation abstraction, and compiled runtime binary for building stateful AI agents powered by Gemini, with built-in multimodal support, custom tools, MCP integration, and a declarative policy system. It refines the common agent-loop pattern with a security-focused binary runtime and event-driven triggers rather than introducing an entirely new technique. As a Google-branded SDK targeting the booming AI-
A Tower-based Rust implementation of ConnectRPC that serves Connect, gRPC, and gRPC-Web clients over HTTP with binary or JSON protobuf messages, including a protoc plugin and build.rs integration for code generation. This work delivers a meaningful incremental improvement by providing a framework-agnostic runtime that passes the full Connect conformance suite rather than duplicating an existing library. The underlying idea addresses RPC interoperability needs for Rust services in polyglot micro-
nanoproof is a minimal Python implementation of an AlphaProof-style automated theorem prover built on nanochat, featuring GPT-2 tokenization, pretraining on Nemotron-CC-Math, midtraining on Lean code, SFT on LeanTree transitions, and an MCTS prover with learned policy/value heads that interacts with remote Lean servers via the LeanTree REPL. It delivers a full open-source recreation of the HyperTree Proof Search RL loop together with dataset pipelines and multi-GPU DDP training that were only sk
Overleaf Desktop is a native macOS SwiftUI app that provides a thin shell over git to sync Overleaf projects to the local filesystem with one-click or near-real-time auto-sync and a dedicated conflict-resolution UI. It improves on the obvious git-clone-plus-cron baseline by packaging per-project locking, FSEvents-driven push-after-save, background rebase pulls, and a one-click conflict sheet into a polished desktop experience instead of fragile shell scripts. The tool addresses a real but narrow
Carbon is a family of 500M–8B causal language models trained on 1 T tokens of DNA sequences using a hybrid BPE + 6-mer tokenizer, released with reproducible zero-shot eval suites and fine-tuning scripts in Python. The hybrid tokenizer and eukaryote-heavy pretraining mixture on a curated multi-species corpus represent a meaningful incremental specialization over prior k-mer DNA LMs such as Evo and the Nucleotide Transformer. Its adoption ceiling is limited to the genomics and bioinformatics user base
BinFlash is a Triton-based Python library implementing Flash Attention kernels with binary block-mask skipping for sparse patterns (causal, sliding window, block-diagonal, etc.), acting as a direct tensor-to-tensor replacement for F.scaled_dot_product_attention on (N,N) bool masks. It introduces a preprocessing reduction of full masks to coarse block masks plus 32-bit packed fine masks, with density-aware autotuning and gathered dispatch for forward/backward passes, extending the authors' NeurIP
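The preprocessing step described above, reducing a full boolean mask to a coarse block mask plus bit-packed fine masks, can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `reduce_mask`, the row-per-word packing layout, and the choice to store fine masks only for partially filled blocks are hypothetical and may differ from BinFlash's actual kernels.

```python
def reduce_mask(mask, block=32):
    """Reduce an (N, N) boolean mask to block-level structures.

    Returns (coarse, fine):
      coarse[bi][bj] is True when the block has any allowed entry,
        so fully masked blocks can be skipped outright.
      fine[(bi, bj)] holds one packed integer per block row, with
        bit c set when column c of that row is allowed. It is stored
        only for partially filled blocks (an assumption of this sketch).
    """
    n = len(mask)
    if block < 1 or block > 32:
        raise ValueError("block must be in 1..32 to fit a 32-bit word")
    if n % block != 0 or any(len(row) != n for row in mask):
        raise ValueError("mask must be square with N divisible by block")

    blocks_per_side = n // block
    full_word = (1 << block) - 1
    coarse = [[False] * blocks_per_side for _ in range(blocks_per_side)]
    fine = {}

    for bi in range(blocks_per_side):
        for bj in range(blocks_per_side):
            words = []
            for r in range(bi * block, (bi + 1) * block):
                word = 0
                for c in range(block):
                    if mask[r][bj * block + c]:
                        word |= 1 << c
                words.append(word)
            if any(words):
                coarse[bi][bj] = True
                if any(w != full_word for w in words):
                    fine[(bi, bj)] = words
    return coarse, fine
```

For a causal mask, blocks above the diagonal come out as skippable, blocks below it are full and need no fine mask, and only the diagonal blocks carry packed words.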
wiki-sft is a Python companion repository containing dataset files, parsing scripts for the DAIR.AI Top AI Papers wiki, and Fireworks CLI commands to run end-to-end supervised fine-tuning of Qwen3-8B followed by deployment to an inference endpoint. The project applies a standard supervised fine-tuning workflow and chat-format data construction pipeline already common across hundreds of existing LLM adaptation repositories, adding only service-specific wrappers for Fireworks Agent. Because the 90
C2R is an inference-only Python release of a 14B-parameter DiT-based generative video model built on Wan2.1 that converts coarse 3D simulation videos plus text prompts into temporally consistent realistic urban crowd footage, requiring a DINOv3 adapter and control videos as mandatory inputs. The core novelty is its two-stage synthetic-real domain-hedging training that first learns a generative prior from real footage then anchors controllability via limited paired coarse-to-fine data to share sp
TokenSave is a Rust-based MCP server that builds pre-indexed semantic knowledge graphs from codebases using libSQL and tree-sitter, exposing 48 tools for search, impact analysis, health metrics, and atomic edits to AI coding agents across 14 integrations and 34 languages. It offers a meaningful incremental advance over conventional LSPs and semantic search tools by adding agent-specific MCP hooks, multi-branch graphs, and crash-isolated extraction tailored to reduce token waste in autonomous cod
SOLAR is a PyTorch Model Analysis Toolkit that performs graph extraction, einsum conversion, and hardware-aware Speed-of-Light (SOL) performance analysis, used as ground-truth reference for evaluating LLM-generated GPU kernels via SOL-ExecBench. It introduces a systematic pipeline for transforming arbitrary PyTorch models into einsum representations to enable precise theoretical SOL metric computation, extending beyond conventional profilers. Its utility is confined to ML performance engineers,
PySR is a Python/Julia library for high-performance symbolic regression that uses evolutionary search over mathematical expressions, exposed via a scikit-learn compatible API and backed by the SymbolicRegression.jl engine. It delivers a polished, highly configurable open-source reimplementation of genetic-programming symbolic regression with strong emphasis on speed, custom operators, and distributed execution rather than a fundamentally new algorithm. The tool targets scientific discovery and X
Flue is a TypeScript framework for building autonomous agents around a built-in agent harness, sessions, tools, and pluggable sandboxes (virtual, local, Cloudflare, Daytona) that deploy to Node.js, Cloudflare Workers, GitHub Actions, and similar runtimes. It advances beyond generic AI SDKs by providing an Astro/Next.js-style framework abstraction with Markdown-driven skills and context plus a headless Claude-Code-like interaction model that requires no TUI or human operator. The approach targets
Superpowers is a Shell-based agentic skills framework that installs as plugins across AI coding tools including Claude Code, Cursor, Codex, and Gemini CLI, supplying composable skills that automatically enforce a structured development methodology of brainstorming, spec approval, detailed planning, git worktrees, subagent-driven task execution, and strict red-green TDD. It packages established engineering practices into an automatically triggered skills system with two-stage reviews and cross-h
This project consists of Python scripts that fetch package metadata from ecosyste.ms, clone repositories, run the zizmor scanner on GitHub Actions workflows, and load findings into SQLite for reporting, together with Marp slides prepared for a PyCon talk. It applies an existing static-analysis tool at registry scale rather than introducing new detection methods or security problem framings. The work targets a narrow audience of conference attendees and supply-chain security researchers, offering
HRM-Text is a 1B-parameter text generation model built on a hierarchical recurrent architecture with PrefixLM packing, FlashAttention-3, and PyTorch FSDP2, accompanied by a complete pretraining and evaluation framework that trains from scratch on 4 epochs of sampled tokenized data. The work introduces a latent-space reasoning HRM that obtains competitive GSM8K/MMLU scores while claiming 130-600x lower compute and 150-900x less data than conventional scaling. If the efficiency claims hold, the public,
This repo provides a three-stage pipeline (vLLM inference on GSM-Symbolic templates, CEM-based evaluation, and Jupyter visualization) that estimates rare 5-nines failure probabilities of LLMs on structured math problems by learning a proposal distribution Q over hard instances. It introduces a novel application of the cross-entropy method from rare-event simulation to LLM reliability, adaptively concentrating samples on failure modes to produce tighter confidence bounds than uniform sampling. N=
Bun is a single-executable all-in-one toolkit for JavaScript and TypeScript that combines a Zig-based runtime using JavaScriptCore as a drop-in Node.js replacement, a bundler, test runner, and npm-compatible package manager. It delivers a meaningfully novel integration of these components with aggressive startup and memory optimizations that go beyond incremental tweaks to existing runtimes like Node or Deno. The underlying idea addresses core developer-experience and performance friction that nearly
Pi is a TypeScript monorepo providing an AI agent harness that includes an interactive coding-agent CLI, core agent runtime with tool calling and state management, a unified multi-provider LLM API, plus TUI and web-UI libraries. It adds self-extensibility and an OSS session-sharing workflow on top of the now-standard pattern pioneered by projects such as Aider and the Vercel AI SDK, yielding only incremental differentiation rather than a new technique or problem framing. The underlying idea of a
MarinFold is a Python research codebase that trains vanilla LLMs from scratch (no natural-language data) on structured protein documents using the Marin infrastructure to test whether next-token prediction alone can recover protein structures. It introduces a document-structure abstraction that both serializes PDB-style inputs into training strings and supplies a corresponding ground-truth evaluation metric, creating a clean experimental harness for this formulation. Protein-structure prediction
LifelongMemory is a Python framework that generates concise activity captions from long-form egocentric videos using LaViLa every two seconds and feeds them to LLMs such as GPT-4 or Claude for natural-language question answering and retrieval on EgoSchema and Ego4D NLQ. It adds a confidence and refinement module on top of standard LLM reasoning to improve precision over raw retrieval baselines. The system addresses a narrow egocentric-video-memory niche whose techniques would mainly interest AR,
Files.md is a local-first PWA built from a single static web/index.html plus one Go binary server that lets users manage plain Markdown files through the browser with optional Telegram bot sync and a fixed folder layout for notes, journals, habits and checklists. It adds a narrowly scoped LLM-friendly file scheme plus chat-to-note dumping on top of the well-established plain-text PKM pattern already served by tools such as Obsidian, Typora or any Dropbox-backed editor. The narrow audience of “no
Lance is a 3B-parameter native unified multimodal model supporting image and video understanding, generation, and editing within a single framework, trained from scratch via a staged multi-task recipe on a 128-A100 GPU budget. It advances prior unified multimodal work by emphasizing multi-task synergy to achieve competitive performance at unusually low active-parameter scale rather than simply scaling up existing architectures. The lightweight unified approach addresses a common industry need to
HY-World 2.0 is a Python-based multi-modal world model that ingests text, single images, multi-view images or video and outputs editable 3D assets (meshes or 3D Gaussian Splats) via a four-stage pipeline of panorama generation (HY-Pano 2.0), trajectory planning, stereo expansion (WorldStereo 2.0) and composition (WorldMirror 2.0). The approach is novel in its explicit shift from pixel-video world models to persistent, engine-importable 3D representations that maintain geometric consistency and support
Hermes Agent is a Python self-improving AI agent featuring a closed learning loop that autonomously creates and refines skills from experience, supports any LLM backend via providers like OpenRouter or local endpoints, and runs across CLI TUI plus gateways for Telegram, Discord and serverless hosts such as Modal or Daytona. It adds persistent FTS5 memory, Honcho user modeling and periodic self-nudges on top of existing agent patterns rather than inventing an entirely new architecture. The multi-
Mem0 is a Python library and platform providing a universal memory layer for AI agents, with SDKs, Docker-based self-hosted server, and cloud deployment that stores user/session/agent state via a new single-pass ADD-only extraction algorithm plus entity linking, multi-signal retrieval, and temporal reasoning. The technique advances prior memory systems by treating confirmed agent facts as first-class data and fusing semantic, keyword, and entity signals without UPDATE/DELETE operations, yielding
Glow is a Go CLI and TUI tool that renders local or remote Markdown files using Glamour styles, a less-like pager, and automatic file discovery inside directories or Git repositories. It refines existing terminal Markdown viewers with polished auto dark/light detection, custom JSON stylesheets, and one-command installs rather than inventing a new rendering approach. The underlying idea of making documentation instantly readable on any developer's terminal is broadly relevant to everyday coding,
Act is a Go CLI that reads .github/workflows files and uses the Docker API to pull or build images and execute each action step locally while faithfully replicating GitHub's environment variables and filesystem layout. It introduces a Docker-based local emulation technique that lets developers run and debug full GitHub Actions workflows without commits or remote runners. The approach directly addresses the testing friction faced by any team authoring GitHub Actions, offering a natural path to becoming a
ARR-SAC-Tool is a local Next.js + FastAPI web dashboard that connects to OpenReview to let Senior Area Chairs load ACL Rolling Review or commitment-stage venues, inspect paper status and scores, view AC rollups and critical comments, and export XLSX ranking sheets. It offers a narrowly scoped convenience wrapper around the existing OpenReview API rather than any new data model, algorithm, or interaction paradigm. The target user group is limited to a few dozen SACs per ARR cycle, creating a hard
LongCat-Video is a 13.6B-parameter Python video generation model released by Meituan that unifies text-to-video, image-to-video, and video-continuation tasks, with native support for minute-long 720p outputs using coarse-to-fine temporal-spatial generation and block-sparse attention. It adds multi-reward GRPO RLHF on top of standard large-scale video diffusion training and is distributed with inference code and weights. The underlying long-video continuation capability addresses a real need in a
gh-profiler is a Python CLI tool that examines a GitHub user's account age, profile fields, and recent PR/issue activity to give maintainers quick context on how much to invest in reviewing contributions, runnable via uvx, pip install, or as a generated GitHub Action workflow. It applies a focused set of heuristics around activity volume and patterns rather than simply wrapping the GitHub API like existing clients. The tool targets the narrow but recurring problem of contribution spam in open source,
PyPI Stats is a Flask web application using plotly.js that renders aggregate download analytics for PyPI packages, exposes a JSON API, and adds optional GitHub OAuth for maintainer-specific views. The approach is a conventional dashboard wrapper around existing PyPI download logs and closely resembles prior stats sites in other language ecosystems with only incremental UI and API polish. Its scope remains confined to Python package maintainers who already have access to official PyPI metrics, so
Cohesion is a Python command-line tool and flake8 plugin that statically analyzes classes to compute a cohesion score based on how instance and class variables are used across their methods. The approach is an incremental implementation of a decades-old software-engineering metric rather than a new technique or problem framing. Its narrow focus on one optional OOP quality signal limits it to a small subset of Python teams that already invest in custom linting, with little path to becoming a core
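One way to make the idea of a variable-usage cohesion score concrete is the following sketch built on the standard `ast` module. It is not the Cohesion tool's implementation: the function name `class_cohesion` and the exact scoring rule (mean fraction of assigned `self` attributes that each method touches) are assumptions chosen for illustration, and class-level variables are ignored for brevity.

```python
import ast


def class_cohesion(source: str) -> dict[str, float]:
    """Return a cohesion score in [0, 1] for every class in `source`.

    Instance variables are the `self.<name>` attributes assigned
    anywhere in the class. The score is the average, over methods,
    of the fraction of those variables each method reads or writes.
    """
    scores: dict[str, float] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        methods = [
            item for item in node.body
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
        ]
        instance_vars: set[str] = set()
        touched_per_method: list[set[str]] = []
        for method in methods:
            touched: set[str] = set()
            for sub in ast.walk(method):
                is_self_attr = (
                    isinstance(sub, ast.Attribute)
                    and isinstance(sub.value, ast.Name)
                    and sub.value.id == "self"
                )
                if not is_self_attr:
                    continue
                touched.add(sub.attr)
                if isinstance(sub.ctx, ast.Store):
                    instance_vars.add(sub.attr)
            touched_per_method.append(touched)
        if not methods or not instance_vars:
            scores[node.name] = 0.0
            continue
        # Intersecting with instance_vars drops method calls like self.run().
        used = sum(len(t & instance_vars) for t in touched_per_method)
        scores[node.name] = used / (len(instance_vars) * len(methods))
    return scores
```

A class whose methods all touch every instance variable scores 1.0, while a class whose methods each use a disjoint slice of state scores low, which is the signal such a linter would flag.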
LiteReality is a Python pipeline that reconstructs graphics-ready 3D scenes with PBR materials and GLB exports from RGB-D scans captured on LiDAR iPhones, chaining GroundingDINO, Qwen-VL, SAM, DinoV2 and Blender for scene-graph parsing, object retrieval from a 200 GB material database, and final rendering. It introduces a new end-to-end framing that treats material assignment and production asset export as first-class goals rather than post-processing steps, distinguishing it from prior mesh- or
nano.py is a single-file, under-200-line, zero-dependency Python coding agent that feeds repo context and discovered skill files to an OpenAI model, executes approved shell commands in a 200-step loop, and supports one-shot CLI, interactive REPL, and session resume modes. It is not a new agent architecture but instead demonstrates that the core read-run-observe loop has become trivial by distilling it to pure stdlib while preserving human approvals and platform-aware prompting. The resulting aud
Page Agent is a TypeScript library that injects a GUI agent directly into any web page as client-side JavaScript, letting developers issue natural-language commands that manipulate the DOM via text-based element descriptions without screenshots or external runtimes. It delivers a meaningful incremental advance over prior work such as browser-use by shifting the entire agent loop into the browser and removing the need for multi-modal models or browser extensions for single-page tasks. The same in
gym-pusht is a Python Gymnasium environment that implements the PushT robotics task, letting a circular agent push a T-shaped block into a goal zone under state, keypoint, or pixel observations and rendering modes. It directly ports the benchmark introduced by the Diffusion Policy paper with only packaging and Gymnasium compatibility changes rather than any new technique. The environment targets a narrow slice of imitation-learning and diffusion-policy researchers working on planar pushing and,
Codiff is a TypeScript Electron desktop app that renders a minimal native window for viewing staged and unstaged Git diffs from any local repository, with inline comments and an optional Codex walkthrough mode invoked via the -w flag. It adds only an incremental LLM ordering layer on top of long-established diff viewers such as git-diff, Delta, or VS Code’s built-in compare view. The project addresses a common but already well-served developer workflow and therefore has limited structural reach,
MiniHack is a Python sandbox framework built on the NetHack Learning Environment (NLE) that lets users rapidly design custom Gymnasium-compatible RL environments via human-readable probabilistic des-files, a browser drag-and-drop level editor, and optional language wrappers. It extends NLE by turning NetHack's rich mechanics into an easily programmable, scalable testbed for open-ended and compositional RL rather than providing another fixed benchmark suite. The underlying idea addresses a real,
Jazz is a local-first relational database written in Rust with WASM and NAPI bindings that runs embedded in browsers, React Native, and Node backends while syncing partial tables, streams and files through a global cloud. It advances the local-first space by making a full relational model with CRDT-style sync behave like ordinary reactive application state rather than requiring separate offline queues or conflict-resolution code. The same primitive solves a recurring pain point for any team that
This repository contains the bilingual blog post source, rendered HTML, and Python reproduction artifacts for 'Learning Beyond Gradients', featuring policy scripts, trial summaries, figures, and videos for Atari (Pong, Breakout, Montezuma), MuJoCo (Ant, HalfCheetah), and VizDoom environments built on EnvPool. It frames and demonstrates non-gradient exploration and heuristic search techniques in reinforcement learning that depart from conventional gradient-based policy optimization. The narrow,RL
goal is a bash/MCP system porting Codex CLI goal tracking to Claude Code, Cursor, and OpenCode via JSON state files, hook scripts for auto-continuation, turn budgets, and MCP tools such as create_goal and get_goal. It delivers a faithful reimplementation of the established Codex architecture rather than a new technique, adding only editor-specific adapters like Claude stop hooks and a Python stdio MCP server on top of the original state model and lifecycle. Persistent objectives with automatic,
Dune is an Elixir library that provides a sandbox for safely evaluating untrusted code via allowlists, isolated processes, execution limits, atom mapping to prevent leaks, and simulated modules implemented with maps of anonymous functions. It improves on prior sandboxing work such as Luerl by adding BEAM-specific protections for atoms and module definitions that would otherwise leak memory or state globally. The underlying problem of safely running user-supplied Elixir is relevant mainly to a narrow
Phoenix is a web framework for the Elixir language that supplies routing, channels, LiveView for server-rendered real-time UIs, and an integrated asset pipeline for building scalable applications from prototype to production. Its LiveView technique of pushing minimal diffs over persistent connections offers a distinct server-centric alternative to traditional client-side JavaScript frameworks. While the framework addresses concurrency and reliability needs shared by many web teams, its tight tie
This project is a Python Streamlit dashboard that simulates long-term buy-and-invest versus rent-and-invest scenarios for households in Vaud, Switzerland, incorporating detailed federal, cantonal, and commune-level tax calculations along with stochastic inflation and salary paths. It represents an incremental specialization of standard buy-versus-rent financial models by embedding official 2026 Vaud tax tables, separate impôt foncier modeling, and household filing-status logic rather than a new,
image-blaster is a TypeScript Claude skillset that turns a single image into 3D meshes (.glb/.obj), Gaussian splats (.spz), and object-specific SFX (.mp3) by orchestrating World Labs Marble, Hunyuan-3D via FAL, and ElevenLabs models, then embedding the results in Unity, Unreal, Godot, Blender, or Three.js. It offers only incremental novelty by wrapping existing third-party generative services into a guided, step-by-step Claude workflow rather than introducing new algorithms or model training. A
STARFlow is the official Python release of Apple's transformer autoregressive flow models (3B-param STARFlow for 256x256 text-to-image and 7B-param STARFlow-V for up to 480p text-to-video), including FSDP training scripts, Jacobi-accelerated sampling, T5 conditioning, and VAE integration. It presents a new deep-shallow transformer architecture that fuses autoregressive token prediction with continuous normalizing flows rather than discrete diffusion or standard AR transformers. The underlying AR
Articraft is a Python-based agentic system that leverages LLMs to generate articulated 3D assets through automated code generation of model.py files defining semantic parts, geometry, and physical joints, aimed at large-scale dataset production. It introduces a novel programmatic workflow that bypasses manual 3D tools by treating asset creation as LLM-driven code synthesis with inspection and execution steps, extending beyond prior text-to-3D methods focused on static meshes. This targets a core
OpenAI Parameter Golf is a Python-based competition to train the smallest language model that fits inside a 16 MB artifact while training in under ten minutes on 8×H100 GPUs, scored by bits-per-byte on FineWeb validation. The challenge reframes neural scaling as explicit L(N) optimization under a hard parameter budget and surfaces new techniques such as depth recurrence, aggressive parameter tying, test-time training, and mixed low-precision quantization schemes that existing NanoGPT-style speed
Exemplar Partitioning is a Python library that builds training-free, streaming Voronoi dictionaries over centered unit-norm LLM activations via leader clustering at a cosine-distance threshold, producing directly comparable feature partitions across layers and checkpoints with no learned weights. It introduces a genuinely novel unsupervised construction that replaces the optimization step of sparse autoencoders with geometry-driven exemplar anchoring, enabling 1000x lower token budgets while, in
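The streaming construction described above, leader clustering over centered unit-norm vectors at a cosine-distance threshold, can be sketched in pure Python. This is a simplified illustration, not the library's code: the names `leader_partition`, `mean`, and `threshold` are hypothetical, and assigning each vector to its nearest existing exemplar (rather than the first one within range) is an assumption made to match the Voronoi framing.

```python
import math


def _center_and_normalize(vector, mean):
    """Subtract the mean and scale to unit Euclidean norm."""
    centered = [x - m for x, m in zip(vector, mean)]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0.0:
        raise ValueError("vector equals the mean; direction is undefined")
    return [x / norm for x in centered]


def leader_partition(vectors, mean, threshold):
    """Stream over vectors once and build an exemplar dictionary.

    Each vector joins its nearest exemplar when the cosine distance
    is at most `threshold`; otherwise it becomes a new exemplar.
    Returns (exemplars, labels) with one label per input vector.
    """
    exemplars = []
    labels = []
    for vector in vectors:
        unit = _center_and_normalize(vector, mean)
        best_index, best_distance = -1, float("inf")
        for index, exemplar in enumerate(exemplars):
            cosine = sum(a * b for a, b in zip(unit, exemplar))
            distance = 1.0 - cosine
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_index >= 0 and best_distance <= threshold:
            labels.append(best_index)
        else:
            exemplars.append(unit)
            labels.append(len(exemplars) - 1)
    return exemplars, labels
```

No weights are learned here: the dictionary depends only on the data order, the centering vector, and the threshold, which is what makes partitions built with the same settings comparable across layers or checkpoints.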
open-slide is a TypeScript React framework that lets coding agents generate slide decks as arbitrary React components on a fixed 1920×1080 canvas, complete with Vite-based dev server, presenter mode, inspector for agent-applied comments, and static HTML/PDF export. Its novelty lies in the agent-native workflow with built-in skills for end-to-end deck creation and an iterative comment-to-edit loop that existing slide libraries lack. This approach could see adoption among developers and teams that
configurator is a Python repository of pre-configured tooling that wires prek/pre-commit hooks to black, prettier, toml-sort, pytest, yamllint, codespell and similar formatters/linters for consistent research-to-production workflows. It aggregates and lightly customizes long-established developer tools rather than introducing new techniques or problem framings. The underlying idea of a single strict-yet-fluent toolchain is relevant to many teams yet remains a per-repo configuration pattern with,
Effect is a TypeScript monorepo whose core package supplies a functional effect system for typed side-effect management, concurrency, and error handling, extended by packages for AI provider integrations, SQL drivers, OpenTelemetry, RPC, CLI, and platform runtimes targeting Node, Bun, browser, and edge. It incrementally advances the established effect-system pattern already seen in ZIO and fp-ts by delivering a single cohesive, production-oriented TypeScript implementation with first-class plugg
This project is an Obsidian vault template containing a minimal set of core plugins, appearance snippets, and atomic note conventions for structuring academic research around knowledge graphs. It represents only an incremental refinement of existing Obsidian academic setups by reducing plugin overhead rather than introducing any new technique or problem framing. The audience is limited to researchers already committed to Obsidian, creating a structural ceiling far below mass adoption.
Jujutsu (jj) is a Git-compatible version control system written in Rust that uses Git repositories only as a storage backend while storing branches and higher-level metadata separately and exposing a simplified model with automatic working-copy snapshots. It introduces a genuinely new approach by treating the working copy itself as a commit, maintaining a full operation log for undo, and making conflicts first-class objects that propagate automatically on rebase, synthesizing ideas from Git, hg,
Obsidian Spaced Repetition is a TypeScript plugin that turns Markdown notes into decks via #flashcards and #review tags, supports single-line, multi-line, bidirectional, and cloze card formats with rich media and LaTeX, and schedules both flashcard and whole-note reviews using a standard spaced-repetition algorithm. The project applies well-known SRS mechanics (Anki-style) with tight Obsidian-specific integration rather than introducing new algorithms or problem framings. Its ceiling is bounded:
A comparative technical analysis presented as an interactive GitHub Pages HTML report and PDF covering PyTorch hardware acceleration options for NVIDIA CUDA, AMD ROCm, Google TPU/XLA, and Apple Silicon MPS in 2025. The project synthesizes publicly available benchmarks and vendor data into an overview report without introducing new techniques or frameworks, closely resembling established industry analyses such as MLPerf summaries or annual hardware surveys. Its scope remains confined to ML infra,
Valibot is a zero-dependency TypeScript library that defines executable runtime schemas for structural data validation using many small, independent functions such as v.object, v.string, v.pipe, v.email and v.minLength, delivering static type inference plus parse/safeParse/is APIs. Its novel design replaces the conventional monolithic API with a per-action modular structure explicitly engineered for bundler tree-shaking, yielding sub-kilobyte bundles and easier extension. The technique solves a
ExploitBench is a Python framework that benchmarks AI agents on real-world vulnerability exploitation by orchestrating containerized V8 environments and OpenAI-compatible model APIs to track progress across a 16-step capability ladder from reaching vulnerable code to arbitrary execution. It introduces a novel granular ladder methodology for measuring agentic browser exploitation that extends beyond generic CTF or security benchmarks. Its specialized focus on V8 bugs and advanced agent evaluation
This repository is a collaborative Markdown guide plus templates that walks through integrating Obsidian with Zotero (via the Zotero Integration plugin) or a generic .bib file (via obsidian-citations), Pandoc for citations and export, and various LaTeX/Pandoc tweaks for equations, bibliographies, and PDF output. It collects already-public plugin configurations and workflows rather than inventing any new technique, essentially documenting the same Zotero–Obsidian–Pandoc stack that has existed for
RandOpt is the official Python codebase and training scripts for the Neural Thickets paper, implementing post-training of transformers via LoRA, black-box optimization, and neuroevolution to locate diverse task-specific experts around pretrained weights, with support for custom datasets, multi-node runs, and distillation. The core novelty is the empirical demonstration and optimization procedure showing that high-performing, functionally distinct experts form a dense thicket in the immediate low
The reproducible-trajectories project is a Python package that supplies CLI commands and git hooks to capture, filter, extract, and verify structured trajectories produced by AI coding agents such as Claude Code. It introduces a verification technique that replays Write/Edit operations from a trajectory against the parent commit state to determine whether an agent-generated commit is reproducible. The tooling addresses a specialized need for agent observability and reproducibility within the AI4
agent-trace is a Python CLI, MCP proxy, and VS Code extension that captures every prompt, tool call, file operation, and response from Claude Code, Cursor, Gemini CLI, or any MCP client, then supports replay, diff, stats, export to Datadog/Honeycomb/OTLP, and rule-based pausing. It applies the classic strace model to agent sessions rather than system calls, adding causal tracing, subagent rollups, and editor-integrated live views that existing LLM-only tracers lack. The approach solves a near-un
atproto-agent-network implements a decentralized agent communication and memory network on Cloudflare edge primitives, mapping AT Protocol DIDs, relays, and firehoses onto Durable Objects, D1, R2, Vectorize, and Queues while running Pi agents with encrypted per-agent state. It introduces a novel framing that treats each agent as a first-class federated identity with typed message passing and selective knowledge sharing, extending ATProto lexicons and MST-style repos beyond social data into multi
Futuresim is a Python multi-agent simulator that runs LLM agents on free-form forecasting questions drawn from datasets like OpenForesight, optionally retrieves via LanceDB, and scores predictions inside a time-stepped environment using scripts and YAML configs. It extends existing LLM agent patterns with a specialized forecasting harness, answer-matching cache, and leakage-safe retrieval rather than introducing a fundamentally new technique. The project addresses a narrow need for reproducible,
This repository is an awesome list curating TypeScript/JavaScript extensions, hooks, tools, and skills for the pi-mono coding agent, a JavaScript-based LLM-driven development assistant. It follows the conventional structure of awesome lists without introducing novel curation techniques or frameworks beyond aggregating community contributions for a single agent. The extensibility model for AI coding agents addresses a widespread need among developers working with agentic tools, potentially scaling
goldmark is a CommonMark 0.31.2-compliant Markdown parser written in pure Go that exposes an interface-based AST for custom block/inline parsers, paragraph and AST transformers, and renderers. Its design delivers a meaningful architectural improvement over prior Go libraries by prioritizing external extensibility and full spec compliance rather than internal struct-based implementations. Any Go project that renders or transforms Markdown—from static-site generators to documentation platforms—can
EnterpriseRAG-Bench is a dataset and benchmark of 500k synthetic enterprise documents drawn from Slack, Gmail, Linear and similar sources plus 500 categorized questions for RAG evaluation on company-internal knowledge, together with code to generate equivalent corpora for arbitrary organizations. It introduces the first public benchmark built entirely around realistic internal data via a generation pipeline that enforces cross-document coherence, realistic volume distributions, injected noise, j
sqlc is a Go-based SQL compiler that parses queries and generates type-safe code and interfaces in Go, Kotlin, Python, or TypeScript. The compiler approach of statically analyzing real SQL to emit language-native structs and methods is a meaningful incremental improvement over raw query builders or traditional ORMs. The underlying pattern solves a pervasive need for safe database access without heavy runtime frameworks and therefore has a realistic path to becoming a standard tool across backend
.genome/1.0 is an open specification and Python reference implementation for a consumer genome bundle format that uses Parquet, JSON manifests, typed columns and mandatory effect-allele binding so that general-purpose LLM agents can read and query personal genomic data without external tooling or domain-specific parsers. The approach is novel in reframing genomic representation around agent-native semantics rather than sequencing-pipeline interchange, directly tackling enumerated hallucination,拼
This repository is a raw archive of autonomous LLM agent experiments (Claude Code and Codex) competing on the modded-nanogpt track_3_optimization benchmark to reach 3.28 validation loss in minimal steps using only optimizer, schedule, and init changes, including harnesses, plans, ~10k run logs, and generated variants across three waves. The approach applies agentic planning and novelty-constrained search to an existing ML speedrun benchmark rather than introducing a fundamentally new optimizer,
Kyvo is the official code release for a decoder-only transformer that unifies text, image patches, and structured 3D scenes (object lists carrying explicit shape, pose, position, and size tokens) inside a shared vocabulary built on Llama-3.2-1B and VQGAN codebooks, with training and evaluation scripts for CLEVR, ObjaWorld, and Objectron tasks. The work introduces a genuinely new token-by-token 3D alignment technique that lets a single autoregressive model perform rendering, recognition, and 3D-a
DriftXpress is a PyTorch implementation providing training and evaluation code for an accelerated formulation of Drifting Models that uses projected RKHS fields built from landmarks and cached summaries to enable one-step image generation on datasets such as CIFAR-10, CIFAR-100, SVHN, and ImageNet. The core novelty is the replacement of repeated exact kernel-based attraction computations against the training support with a projected RKHS field while retaining exact repulsion among generated samples, a
DwarfStar 4 (ds4) is a self-contained native inference engine written in C that loads and runs only the DeepSeek V4 Flash GGUF files on Metal (primary), CUDA, or ROCm backends, exposing a server API, CLI, tool calling, and disk-persistent KV cache. It introduces a deliberately model-specific architecture that treats compressed KV state as a first-class on-disk citizen and validates against official logits rather than attempting generic GGUF execution. Because the engine is locked to a single (e)
fastokens is a fast BPE tokenizer for popular open-weight LLMs built on a high-performance Rust backend with Python bindings, serving as a drop-in replacement for the Hugging Face tokenizers library in inference pipelines such as transformers and NVIDIA Dynamo. It delivers meaningful incremental speed gains through Rust-level optimization of an established tokenization technique rather than introducing an entirely new algorithm or problem framing. By targeting the time-to-first-token bottleneck,
A Python cookbook of step-by-step guides for Prime Intellect Lab that covers environment creation, RL training, SFT warm-starts, prompt optimization, coding sandboxes, tool-use, synthetic data pipelines, and multimodal/browser agent setups. The material repackages well-known RL-for-LLMs and agent-environment patterns already present in frameworks such as OpenAI Gym, LangChain, and DeepMind’s synthetic training literature, adding only platform-specific integration details. Because agent-training,
A Python FastAPI workshop repo that scaffolds a loan underwriting pipeline where participants implement LlamaParse calls for parsing PDFs into markdown, structured extraction via Pydantic schemas, and cross-document analysis. This is a straightforward tutorial replicating standard LlamaIndex document processing patterns without introducing new techniques or problem framings. The narrow focus on financial document workflows and workshop format limits its appeal to a small audience of developers,
Raindrop Workshop is a local TypeScript daemon and Vite UI that streams live agent traces (tokens, tool calls, spans) over HTTP/WebSocket into a browser debugger while exposing commands for coding agents to instrument, replay, and evaluate code. It adds a self-healing eval loop on top of conventional LLM tracing by letting agents like Claude Code read traces, author assertions against the repo, execute the agent, and iteratively patch failures. The approach targets the universal pain of local AI
Apache ECharts is a free charting and data visualization library written in pure JavaScript, built on the zrender canvas library, that renders interactive and highly customizable charts directly in the browser via npm, CDN, or direct download. While resembling established declarative visualization approaches such as those in D3.js and Highcharts, it adds a streamlined option-based API and extensive built-in chart types that reduce boilerplate for common interactive scenarios. Data visualization,
Highcharts JS is a JavaScript charting library based on SVG and some canvas/WebGL, distributed via npm or CDN with support for custom builds, ES modules, and native iOS wrappers. It is a mature implementation that closely resembles long-established projects such as Chart.js or D3 without introducing new rendering techniques or problem framings. The underlying need for web-based data visualization is common but already served by many competing libraries, limiting the scope for this specific tool,
Vega-Lite is a TypeScript library that provides a concise declarative grammar of interactive graphics which compiles down to complete Vega specifications for rendering. It pioneered a higher-level grammar approach that abstracts away much of the boilerplate required by lower-level visualization toolkits while preserving expressiveness for common interactive analysis tasks. The underlying idea solves a recurring problem faced by any team building data dashboards, scientific plots, or analytics U+
DSPex is an Elixir library that ships auto-generated Dspy.* bindings plus a thin SnakeBridge FFI wrapper to expose Stanford DSPy 3.2.0 signatures, predictors, ChainOfThought, optimizers, and LiteLLM-backed models inside the BEAM runtime. It adds a novel two-layer surface—mirrored Python package layout for IDE navigation plus direct concurrent-safe FFI calls—that had not previously existed for DSPy outside Python. The approach solves a real but narrow problem for the small set of Elixir teams who
Windows-MCP is a Python MCP server exposing keyboard/mouse simulation, window state capture, file navigation and UI control tools so any LLM agent can operate Windows 7–11 desktops. It adds modest novelty by offering a vision-free path that works with arbitrary models plus an optional DOM-only mode for browser automation on top of standard automation primitives. The underlying need for reliable Windows desktop control is shared by every team building or using agentic LLM workflows, giving the Mc
Langfuse is an open-source TypeScript LLM engineering platform offering observability via traces, prompt versioning with caching, evaluations including LLM-as-judge, datasets, and an interactive playground, self-hostable in minutes via Docker Compose or Kubernetes with OpenTelemetry and LangChain integrations. It consolidates multiple LLMOps capabilities into a single self-hostable package rather than extending any single prior tool like LangSmith. Every team shipping production LLM applications
Renhuai123/ziwei-doushu is a TypeScript Next.js 14 engine that implements the full Zi Wei Dou Shu charting algorithm from Ni Haixia's Tian Ji system, including star placement, Si Hua transformations, 1100+ pattern rules in patterns.ts, ancient texts, and a 518400-row JSON sample dataset released for training and RAG use. It adds scale and a ready-to-use knowledge base on top of prior open calculators such as iztro rather than introducing a new technique or problem framing. The project addresses,
URIAL is a Python toolkit that supplies a small set of constant stylistic in-context examples plus inference scripts (vLLM, Hugging Face) to align any base LLM purely through 3-shot ICL without any parameter updates. The approach reframes alignment as restyled in-context learning rather than gradient-based fine-tuning or RLHF, delivering a genuinely new controlled experimental primitive that did not exist before. Because every team that runs base models faces the same costly alignment step, the
Gremlins is a Claude Code plugin deploying autonomous AI agents with distinct personalities that survey a project then execute a pitch-critique-design workflow to output feature or content ideas as local files, PRs or issues. The structured multi-agent orchestration with editable personas and workflow stages introduces a novel framing for creative exploration that goes beyond generic LLM prompting or single-agent ideation tools. Its applicability remains limited to Claude users seeking product- or
ClawMetry is a Flask-based Python dashboard installed via one pip command that auto-detects OpenClaw workspaces and renders real-time animated flow diagrams, token/cost breakdowns, session lists, logs, and memory file browsers at localhost:8900. It applies established LLM observability patterns such as trace visualization and usage analytics to the specific multi-channel architecture of OpenClaw rather than introducing a new technique. The project solves agent debugging for developers already in
OMX is a TypeScript npm package and CLI that layers predefined skills, agent roles, tmux-based HUDs, and durable .omx/ state on top of the OpenAI Codex CLI. It offers an incremental workflow framing with reusable commands such as $deep-interview, $ralplan, $ralph, and $team rather than inventing a new agent runtime or model. The approach targets developers already committed to Codex-style CLI agents and therefore faces a structural ceiling outside that specific ecosystem.
ZenithDB is a Rust columnar database engine purpose-built for AI agent observability, ingesting and querying long sparse high-cardinality JSON traces via HTTP/gRPC/OTLP endpoints while speaking both SQL and ZenithQL. Its approach is novel through five workload-specific design choices: PAX segments with trace-locality compaction, late materialization scans, inline Tantivy FTS, offset directories for wide strings, and queryable object-storage WAL. The project addresses the storage pain of emerging
Nushell is a cross-platform shell written in Rust whose pipelines operate on typed, structured data (tables and records) instead of raw text streams, with commands such as ls, ps, and open producing queryable values. It is a meaningful incremental advance over PowerShell's object pipelines, trading Windows-centric COM integration for a lighter, Unix-native design and first-class support for common data formats. Every developer and operator runs a shell daily, so a model that turns ad-hoc text wr
Nitrobrew is a PyTorch library that fuses the unembedding matmul with KL divergence computation for knowledge distillation, iterating over vocabulary chunks with online softmax accumulators to keep memory at O(B·T·chunk_V) instead of materializing full [B,T,V] logit tensors. It applies a targeted fusion of existing online-softmax techniques to the specific on-policy distillation bottleneck for heterogeneous student/teacher hidden sizes and 100k+ vocabularies. The technique solves a painful but narrow
MuonWarm is a PyTorch optimizer that routes matrix-shaped parameters through a Muon-style path using momentum orthogonalization via Newton-Schulz iterations while handling biases and norms with Adam; it caches a polar factor `muon_warm_q` and refreshes it between periodic full anchors with cheap Jacobi tangent updates plus retraction. The warm-start caching technique that reuses an approximate orthogonal direction instead of recomputing Newton-Schulz every step is a concrete incremental advance,
PiSwift is a Swift port of pi-mono that implements an in-process LLM agent framework where subagents are defined by Markdown files containing YAML frontmatter specifying name, tools, model, and output format, with support for single, parallel, and chained invocation via a dedicated subagent tool. It adds structured agent and prompt template discovery from user and project directories plus strict Swift concurrency guarantees. The approach refines existing Markdown-configured agent patterns from pi
Code release implementing Multi-Stream LLMs on Qwen2.5/Qwen3 backbones, with three self-contained sections: interleaved 2-3 stream packing for efficiency on GSM8K/MATH, multi-stream Alpaca fine-tuning for security benchmarks, and 10-stream Qwen3.5 models with per-stream Gated-DeltaNet states for monitorability. The work introduces parallel streams of thoughts, inputs, and outputs via wait-k data construction and complete weight sharing between streams. Its specialized training pipelines and high
SWE Atlas is a Shell-based benchmark repository that ships curated task data and Modal/harbor run configs for three leaderboards—Codebase QnA, Test Writing, and Refactoring—used to evaluate AI coding agents on professional software-engineering workflows. It introduces a multi-leaderboard framing that explicitly decomposes the software development cycle into complementary capabilities instead of measuring a single isolated skill such as issue resolution. The benchmark therefore targets the fast-w
This is the official repository for the paper From Directions to Regions: Decomposing Activations in Language Models via Local Geometry, providing end-to-end MFA training tutorials in Jupyter notebooks, FSDP multi-GPU code, and pretrained 8k MFA checkpoints for Gemma-2-2B and Llama-3.1-8B layers downloadable from Hugging Face. The work introduces a novel shift from directional to regional decomposition of LLM activations by explicitly modeling local geometry, offering a fresh problem framing and
Howcode is a TypeScript desktop app launched via npx that provides an opinionated environment for coding with the Pi AI, including an inbox, built-in terminal, git-ops composer, comment-based diff review, local sherpa-ONNX voice input, and in-app skill/extension management. It introduces a deliberately editor-free, agent-first workflow that prioritizes rapid YOLO sessions over conventional turn-by-turn editing. The app targets a narrow slice of AI-native developers comfortable with its specific,
scrcpy is a native C application built with FFmpeg, libav and SDL2 that mirrors an Android device's screen and audio over USB or TCP/IP, forwards keyboard/mouse/gamepad input, supports recording and virtual displays, and requires no root or installed app. It popularized a lightweight, low-latency no-root mirroring technique using standard ADB and system APIs, with later extensions such as camera mirroring and HID simulation constituting incremental improvements rather than new primitives. The no
A TypeScript monorepo that runs a read-only Express API and React+Vite frontend to provide local charts, full-text search, media browsing, and contact analytics over a WaCrawl-generated WhatsApp SQLite archive. The implementation follows the established pattern of building dedicated viewers and dashboards for exported messaging databases, adding only conventional UI polish and keyboard shortcuts on top of standard SQLite queries. Its utility is narrowly tied to the small audience already using w
MLS-Bench is a Python benchmark containing 140 tasks across 12 ML research domains that supplies agents with research scaffolds and baselines then scores them on proposing single algorithmic edits (new component, loss, optimizer, or training procedure) whose gains transfer across seeds, datasets, and scales, with Docker/Apptainer/SLURM or local Conda runtimes. It introduces a genuinely new evaluation framing that measures transferable scientific innovation rather than single-instance engineering
SERV is a bit-serial RV32I RISC-V CPU core written in Verilog that fits in 125-239 LUTs on common FPGAs or 2.1 kGE in CMOS, targeting area-constrained FPGA and ASIC designs. Its serial execution technique for a full RISC-V ISA is a genuine departure from conventional parallel pipelines and yields the smallest known compliant open-source core. The approach solves a real but narrow problem of extreme resource limits, limiting mass adoption to specialized embedded, IoT, and FPGA sensor platforms.
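The bit-serial idea in the SERV entry above can be illustrated with a small software model. This is only a sketch of the general technique (one full-adder cell reused across clock cycles with a single carry register), not SERV's actual Verilog; the function name `bit_serial_add` and the cycle counter are hypothetical and exist purely for illustration.

```python
def bit_serial_add(a: int, b: int, width: int = 32) -> tuple[int, int]:
    """Add two unsigned `width`-bit values one bit per simulated clock cycle.

    Illustrative model only: a parallel adder would produce all `width`
    result bits in one cycle using `width` full-adder cells, whereas this
    loop reuses a single full-adder cell and a one-bit carry register.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    carry = 0    # one-bit carry register, kept between cycles
    result = 0   # models a shift register collecting the sum bits
    cycles = 0

    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        # Single full-adder cell: sum bit and next carry.
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << i
        cycles += 1

    # The final carry is dropped, so the result wraps modulo 2**width.
    return result, cycles


if __name__ == "__main__":
    total, cycles = bit_serial_add(5, 7)
    print(total, cycles)
```

The model shows the trade the entry describes: logic area stays roughly constant regardless of word width, while the number of cycles per operation grows with it.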
Raiden is a Python toolkit for YAM bimanual robot arms that provides the full pipeline of camera calibration, leader-follower teleoperation, multi-camera recording with heterogeneous depth backends, and conversion to synchronized policy-ready datasets. It improves on existing robot data tools through its tight integration of manipulability-aware IK via PyRoki and J-Parse plus automated hand-eye calibration for mixed ZED/RealSense setups. The project remains confined to owners of specific YAM arm
This is a PyTorch library implementing the dynamic sequence chunking mechanism from the H-Net paper, exposing a DynamicSequenceChunker module and an HNet wrapper class that performs learned hierarchical downsampling and upsampling on token embeddings. The approach introduces a genuinely new end-to-end differentiable chunking technique for hierarchical sequence modeling that did not previously exist in this form. It solves a specialized problem in advanced sequence architectures and is therefore,