Richard Kasendu | Modernizing KYN: Performance Improvements for Binary Code Similarity Detection

KYN (Know Your Neighborhood) is a cutting-edge binary code similarity detection system that leverages graph neural networks to identify similar functions across different binaries. Recently, I had the opportunity to contribute several significant improvements to this project, focusing on performance optimization, enhanced evaluation capabilities, and better developer experience.

What is KYN?

KYN addresses a critical challenge in security research and software analysis: determining whether two binary functions are semantically similar, regardless of how they were compiled. This capability is essential for:

Malware analysis: Identifying malware variants and families
Vulnerability detection: Finding instances of known vulnerabilities across different binaries
License violation detection: Discovering unauthorized code reuse
Security patching: Tracking patch propagation across systems

The system uses a novel graph representation called "call graphlets" - specialized graphs that encode the neighborhood around each function, capturing both local context (instruction counts, variable usage) and global relationships (function calls, control flow).

The Challenge

When I started working on KYN, the project had solid foundations but faced several performance and usability challenges:

Performance bottlenecks in graph processing using NetworkX
Limited evaluation flexibility for real-world use cases
Dataset generation constraints for custom binary collections
Missing development infrastructure for reproducible environments

Key Improvements

1. Migration from NetworkX to RustworkX

The most impactful change was migrating the graph processing backend from NetworkX to RustworkX. NetworkX, while feature-rich, is written in pure Python and can be slow for large-scale graph operations.

# Before: NetworkX implementation
import networkx as nx
G = nx.DiGraph()
betweenness = nx.edge_betweenness_centrality(G)  # Dict keyed by (u, v) edge tuples; slow for large graphs

# After: RustworkX implementation
import rustworkx as rx
G = rx.PyDiGraph()
betweenness = rx.edge_betweenness_centrality(G)  # Mapping keyed by integer edge index; 10-100x faster

This migration required:

Creating a GraphWithMetadata wrapper class to maintain node/edge metadata alongside RustworkX graphs
Developing conversion utilities to transform RustworkX graphs to PyTorch Geometric format
Updating the entire dataset pipeline to work with the new graph representation

Result: 3-10x faster dataset processing, enabling work with much larger binary collections.
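To make the conversion step concrete, here is a minimal sketch of turning an index-based graph into the layout PyTorch Geometric expects: an `edge_index` of shape [2, num_edges] plus a node feature matrix. It is not KYN's actual conversion utility. The function name `to_coo`, the two-value feature vectors, and the zero-vector default are assumptions made for illustration, and plain lists stand in for tensors so the sketch runs without PyTorch.

```python
from typing import Dict, List, Sequence, Tuple


def to_coo(
    num_nodes: int,
    edges: Sequence[Tuple[int, int]],
    node_features: Dict[int, List[float]],
    feature_dim: int,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Return (edge_index, x) in COO layout.

    edge_index has shape [2, num_edges]: row 0 holds source indices,
    row 1 holds target indices. x has shape [num_nodes, feature_dim].
    """
    sources = [src for src, _ in edges]
    targets = [dst for _, dst in edges]
    # Assumption: nodes without recorded features get a zero vector.
    x = [list(node_features.get(i, [0.0] * feature_dim)) for i in range(num_nodes)]
    return [sources, targets], x


# Hypothetical three-function call graph: 0 calls 1 and 2, 1 calls 2.
# The edge list has the same (source, target) shape that an index-based
# graph library would hand back.
edge_index, x = to_coo(
    num_nodes=3,
    edges=[(0, 1), (0, 2), (1, 2)],
    node_features={0: [12.0, 3.0], 1: [40.0, 7.0]},
    feature_dim=2,
)
```

With PyTorch installed, the two results could be wrapped as `torch.tensor(edge_index, dtype=torch.long)` and `torch.tensor(x)` and passed to `torch_geometric.data.Data(x=..., edge_index=...)`.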

2. Enhanced Dataset Generation

I extended the dataset generation pipeline to support custom binary collections, making KYN more accessible for researchers working with proprietary or specialized datasets:

# New custom dataset support
python -m kyn.cli generate \
    --dataset-type custom \
    --input-dir /path/to/binaries \
    --output-dir /path/to/dataset

The improvements included:

Flexible function ID extraction for various naming conventions
Better handling of malformed or incomplete binary data
Weighted sampling based on function frequency across binaries
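As a rough illustration of the weighted sampling idea, the sketch below counts how many binaries each function appears in and samples function IDs in proportion to that count. The helper names, the example binaries, and the direction of the weighting (proportional rather than inverse) are all assumptions. The post does not specify how KYN computes its weights.

```python
import random
from collections import Counter
from typing import Dict, List, Set


def function_frequencies(binaries: Dict[str, Set[str]]) -> Counter:
    """Count in how many binaries each function ID appears."""
    counts: Counter = Counter()
    for function_ids in binaries.values():
        counts.update(function_ids)
    return counts


def sample_functions(counts: Counter, k: int, seed: int = 0) -> List[str]:
    """Draw k function IDs with replacement, weighted by frequency.

    Assumption: the weight is proportional to the number of binaries a
    function appears in; an inverse weighting would be equally plausible.
    """
    rng = random.Random(seed)
    ids = sorted(counts)  # sorted so a given seed is reproducible
    weights = [counts[function_id] for function_id in ids]
    return rng.choices(ids, weights=weights, k=k)


# Hypothetical collection: binary name -> function IDs found in it.
binaries = {
    "busybox_O0": {"memcpy", "parse_args", "main"},
    "busybox_O2": {"memcpy", "parse_args"},
    "openssl_O2": {"memcpy"},
}
counts = function_frequencies(binaries)
sample = sample_functions(counts, k=5)
```

Seeding a local `random.Random` instance keeps the draw reproducible without touching global random state.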

3. Zero-Shot Evaluation Capabilities

One of the most exciting additions was the CosineSimilarityEvaluator, enabling zero-shot and one-shot learning evaluation:

# Direct similarity computation between function embeddings
evaluator = CosineSimilarityEvaluator(model)
similarities = evaluator.compute_similarities(query_funcs, target_funcs)

This feature is crucial for real-world applications where you need to:

Search for a specific vulnerability across a large codebase
Identify similar functions without retraining the model
Perform real-time similarity queries
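The computation underneath this kind of query is simple enough to sketch in a few lines. The code below is not the `CosineSimilarityEvaluator` implementation. It only illustrates, under the assumption that each function has already been embedded as a vector, how a query embedding can be ranked against candidates without any retraining. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(
    query: Sequence[float],
    targets: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k target names with their similarity to the query."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in targets.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; real ones would come from the trained model.
query = [1.0, 0.0]
targets = {
    "strcpy_O2": [2.0, 0.0],
    "md5_update": [0.0, 1.0],
    "strcpy_O0": [1.0, 1.0],
}
ranked = rank_targets(query, targets, top_k=2)
```

Because cosine similarity ignores vector magnitude, `strcpy_O2` ranks first here even though its embedding is twice as long as the query.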

4. Training and Evaluation Improvements

Several enhancements made the training process more robust:

Validation set shuffling: Prevents overfitting to validation order
Enhanced W&B integration: Comprehensive metric tracking and hyperparameter sweeps
Better dataset filtering: Removes functions with insufficient training examples
JSON processing optimization: Switched to orjson for 3-5x faster parsing
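The dataset filtering step can be pictured roughly as follows. This is a sketch under assumptions, not KYN's code: the `(label, path)` sample shape, the helper name, and the threshold of two are hypothetical. A minimum of two is a plausible floor because pair-based similarity training presumably needs at least two variants of a function to form a positive pair.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[str, str]  # (function label, path to its graph file)


def filter_rare_functions(samples: List[Sample], min_examples: int = 2) -> List[Sample]:
    """Keep only samples whose label occurs at least min_examples times."""
    counts = Counter(label for label, _ in samples)
    return [(label, path) for label, path in samples if counts[label] >= min_examples]


# Hypothetical samples: "main" has a single example and is dropped.
samples = [
    ("memcpy", "busybox_O0/memcpy.json"),
    ("memcpy", "busybox_O2/memcpy.json"),
    ("main", "busybox_O0/main.json"),
]
kept = filter_rare_functions(samples, min_examples=2)
```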

5. Development Infrastructure

To improve the developer experience, I added:

Dockerfile: Containerization for consistent environments
IDE configurations: VSCode debugging setups
Hyperparameter sweeps: Automated tuning with Weights & Biases
Improved CLI: Better error messages and progress reporting

Technical Deep Dive: RustworkX Integration

The RustworkX migration was particularly interesting from a technical perspective. RustworkX pairs Rust's performance with Python bindings, but the migration required careful handling of the impedance mismatch between the two ecosystems.

Key challenges included:

Metadata management: RustworkX identifies nodes and edges by integer index and attaches a single payload object to each, rather than the keyed attribute dictionaries NetworkX provides
API differences: Different method names and signatures required careful translation
Memory management: Ensuring proper memory handling across the Rust-Python boundary

The solution involved creating an abstraction layer that maintains compatibility while leveraging RustworkX's performance benefits.
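One way such an abstraction layer could look is sketched below. It is an illustration, not the project's `GraphWithMetadata` class: the name `GraphWithMetadataSketch`, its methods, and the attribute names are assumptions. To keep the example runnable without RustworkX installed, a small `IndexGraph` stand-in mimics only the index-returning calls the wrapper relies on (`add_node`, `add_edge`, `edge_list`).

```python
from typing import Any, Dict, Hashable, List, Tuple


class IndexGraph:
    """Stand-in for an index-based backend such as rx.PyDiGraph.

    Only mimics the calls used below so the sketch runs on its own.
    """

    def __init__(self) -> None:
        self._payloads: List[Any] = []
        self._edges: List[Tuple[int, int]] = []

    def add_node(self, payload: Any) -> int:
        self._payloads.append(payload)
        return len(self._payloads) - 1

    def add_edge(self, parent: int, child: int, payload: Any) -> int:
        self._edges.append((parent, child))
        return len(self._edges) - 1

    def edge_list(self) -> List[Tuple[int, int]]:
        return list(self._edges)


class GraphWithMetadataSketch:
    """Maps NetworkX-style keys and attributes onto integer indices."""

    def __init__(self, backend: Any = None) -> None:
        self.graph = backend if backend is not None else IndexGraph()
        self._index_of: Dict[Hashable, int] = {}
        self.node_attrs: Dict[int, Dict[str, Any]] = {}
        self.edge_attrs: Dict[int, Dict[str, Any]] = {}

    def add_node(self, key: Hashable, **attrs: Any) -> int:
        # Re-adding a known key updates its attributes instead of duplicating it.
        if key in self._index_of:
            idx = self._index_of[key]
            self.node_attrs[idx].update(attrs)
            return idx
        idx = self.graph.add_node(key)
        self._index_of[key] = idx
        self.node_attrs[idx] = dict(attrs)
        return idx

    def add_edge(self, src_key: Hashable, dst_key: Hashable, **attrs: Any) -> int:
        # Endpoints are created on demand, as NetworkX does.
        src = self.add_node(src_key)
        dst = self.add_node(dst_key)
        edge_idx = self.graph.add_edge(src, dst, None)
        self.edge_attrs[edge_idx] = dict(attrs)
        return edge_idx


g = GraphWithMetadataSketch()
g.add_node("main", instr_count=42)
call_edge = g.add_edge("main", "parse_args", kind="call")
```

Keeping attributes in side tables keyed by index leaves the backend graph free of Python dictionaries, so graph algorithms run on the compact structure and metadata is looked up only when needed.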

Impact and Results

These improvements collectively result in:

Performance: 3-10x faster processing for large datasets
Flexibility: Support for custom datasets and evaluation modes
Robustness: Better handling of edge cases and malformed data
Usability: Improved developer experience with better tooling

The changes maintain backward compatibility while significantly enhancing the project's capabilities for production use.

Looking Forward

KYN now provides a more efficient and flexible platform for binary code similarity detection. The performance improvements enable scaling to larger codebases, while the enhanced evaluation capabilities support more diverse use cases.

For security researchers and malware analysts, these improvements mean faster analysis workflows and the ability to work with custom binary collections. The zero-shot evaluation capabilities are particularly valuable for rapid vulnerability searching across large software ecosystems.

Try It Yourself

If you're interested in binary analysis or security research, I encourage you to check out KYN. The improved performance and flexibility make it more accessible than ever for both research and production use cases.

The project demonstrates how targeted optimizations - particularly in core components like graph processing - can dramatically improve the usability of machine learning systems for security applications.
