Modernizing KYN: Performance Improvements for Binary Code Similarity Detection
KYN (Know Your Neighborhood) is a cutting-edge binary code similarity detection system that leverages graph neural networks to identify similar functions across different binaries. Recently, I had the opportunity to contribute several significant improvements to this project, focusing on performance optimization, enhanced evaluation capabilities, and better developer experience.
What is KYN?
KYN addresses a critical challenge in security research and software analysis: determining whether two binary functions are semantically similar, regardless of how they were compiled. This capability is essential for:
- Malware analysis: Identifying malware variants and families
- Vulnerability detection: Finding instances of known vulnerabilities across different binaries
- License violation detection: Discovering unauthorized code reuse
- Security patching: Tracking patch propagation across systems
The system uses a novel graph representation called "call graphlets" - specialized graphs that encode the neighborhood around each function, capturing both local context (instruction counts, variable usage) and global relationships (function calls, control flow).
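To make the idea of a neighborhood graph concrete, here is a minimal sketch of extracting one from a call graph. It is not KYN's actual graphlet construction: the one-hop radius, the `extract_graphlet` name, and the plain tuple edge list are all assumptions chosen for illustration, and the per-function features are left out.

```python
from collections import defaultdict


def extract_graphlet(call_edges, center):
    """Return the one-hop neighborhood around `center` (assumed radius).

    call_edges: iterable of (caller, callee) pairs for the whole binary.
    The result keeps `center`, its direct callers and callees, and every
    call edge whose two endpoints both fall inside that node set.
    """
    call_edges = list(call_edges)  # allow generators; we iterate twice
    callees = defaultdict(set)
    callers = defaultdict(set)
    for src, dst in call_edges:
        callees[src].add(dst)
        callers[dst].add(src)

    nodes = {center} | callees[center] | callers[center]
    edges = sorted((s, d) for s, d in call_edges if s in nodes and d in nodes)
    return nodes, edges


# Hypothetical call graph of a tiny binary.
call_edges = [
    ("main", "parse"), ("main", "run"), ("run", "parse"),
    ("parse", "read_byte"), ("run", "log"),
]
nodes, edges = extract_graphlet(call_edges, "parse")
```

Here `log` is dropped because it is two hops from `parse`, while the `main` to `run` edge is kept because both endpoints are neighbors of the center.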
The Challenge
When I started working on KYN, the project had solid foundations but faced several performance and usability challenges:
- Performance bottlenecks in graph processing using NetworkX
- Limited evaluation flexibility for real-world use cases
- Dataset generation constraints for custom binary collections
- Missing development infrastructure for reproducible environments
Key Improvements
1. Migration from NetworkX to RustworkX
The most impactful change was migrating the graph processing backend from NetworkX to RustworkX. NetworkX, while feature-rich, is written in pure Python and can be slow for large-scale graph operations.
# Before: NetworkX implementation
import networkx as nx
G = nx.DiGraph()
betweenness = nx.edge_betweenness_centrality(G) # Slow for large graphs
# After: RustworkX implementation
import rustworkx as rx
G = rx.PyDiGraph()
betweenness = rx.edge_betweenness_centrality(G) # 10-100x faster
This migration required:
- Creating a GraphWithMetadata wrapper class to maintain node/edge metadata alongside RustworkX graphs
- Developing conversion utilities to transform RustworkX graphs to PyTorch Geometric format
- Updating the entire dataset pipeline to work with the new graph representation
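The conversion step can be sketched roughly as follows. This is an illustration and not the project's converter: `to_coo` and its inputs are hypothetical, and it uses plain lists so it runs without extra dependencies. PyTorch Geometric expects a node feature matrix `x` with one row per node and an `edge_index` of shape `[2, num_edges]`, so the main job is remapping graph indices, which may have gaps, onto contiguous row positions.

```python
def to_coo(node_features, edges):
    """Build PyG-style inputs from an index-based graph.

    node_features: dict mapping node index -> list of numeric features.
        Indices may have gaps, for example after node removals.
    edges: list of (source_index, target_index) pairs.
    Returns (x, edge_index): x has one feature row per node, and
    edge_index is [sources, targets] expressed in row positions.
    """
    order = sorted(node_features)
    position = {node: row for row, node in enumerate(order)}
    x = [list(node_features[node]) for node in order]
    sources = [position[u] for u, _ in edges]
    targets = [position[v] for _, v in edges]
    return x, [sources, targets]


# Hypothetical graph whose node indices are 0, 2 and 5.
features = {0: [12.0, 3.0], 2: [40.0, 7.0], 5: [8.0, 1.0]}
edges = [(0, 2), (2, 5), (0, 5)]
x, edge_index = to_coo(features, edges)
```

Wrapping `x` and `edge_index` in tensors and passing them to a PyTorch Geometric `Data` object would be the final step in a real pipeline.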
Result: 3-10x faster dataset processing, enabling work with much larger binary collections.
2. Enhanced Dataset Generation
I extended the dataset generation pipeline to support custom binary collections, making KYN more accessible for researchers working with proprietary or specialized datasets:
# New custom dataset support
python -m kyn.cli generate \
--dataset-type custom \
--input-dir /path/to/binaries \
--output-dir /path/to/dataset
The improvements included:
- Flexible function ID extraction for various naming conventions
- Better handling of malformed or incomplete binary data
- Weighted sampling based on function frequency across binaries
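As a rough illustration of frequency-weighted sampling, the sketch below weights each function by the number of binaries it appears in. The direction of the weighting, the helper names, and the dictionary layout are assumptions; the actual pipeline may weight differently (for example, inversely).

```python
import random
from collections import Counter


def function_weights(binaries):
    """Count, for each function ID, how many binaries contain it."""
    counts = Counter()
    for functions in binaries.values():
        counts.update(set(functions))  # count each binary at most once
    return counts


def sample_functions(binaries, k, seed=0):
    """Draw k function IDs with replacement, weighted by frequency."""
    counts = function_weights(binaries)
    ids = sorted(counts)
    weights = [counts[func_id] for func_id in ids]
    rng = random.Random(seed)  # seeded for reproducible datasets
    return rng.choices(ids, weights=weights, k=k)


# Hypothetical mapping of binary name -> function IDs found in it.
binaries = {
    "libfoo_O0": ["memcpy", "parse", "parse"],
    "libfoo_O2": ["memcpy", "parse"],
    "libbar_O2": ["memcpy"],
}
counts = function_weights(binaries)
sample = sample_functions(binaries, k=5)
```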
3. Zero-Shot Evaluation Capabilities
One of the most exciting additions was the CosineSimilarityEvaluator, enabling zero-shot and one-shot learning evaluation:
# Direct similarity computation between function embeddings
evaluator = CosineSimilarityEvaluator(model)
similarities = evaluator.compute_similarities(query_funcs, target_funcs)
This feature is crucial for real-world applications where you need to:
- Search for a specific vulnerability across a large codebase
- Identify similar functions without retraining the model
- Perform real-time similarity queries
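For readers unfamiliar with the underlying computation, here is a dependency-free sketch of ranking candidate functions by cosine similarity to a query embedding. It does not reproduce the CosineSimilarityEvaluator internals; `cosine_similarity`, `rank_targets`, and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as dissimilar


def rank_targets(query, targets, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in targets.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; real ones would come from the trained model.
query = [1.0, 0.0, 1.0]
targets = {
    "memcpy_v2": [2.0, 0.0, 2.0],
    "strlen": [0.0, 1.0, 0.0],
    "memmove": [1.0, 1.0, 1.0],
}
ranked = rank_targets(query, targets, top_k=2)
```

Because the score depends only on embedding direction, new functions can be queried without retraining, which is what makes the zero-shot setting possible.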
4. Training and Evaluation Improvements
Several enhancements made the training process more robust:
- Validation set shuffling: Prevents overfitting to validation order
- Enhanced W&B integration: Comprehensive metric tracking and hyperparameter sweeps
- Better dataset filtering: Removes functions with insufficient training examples
- JSON processing optimization: Switched to orjson for 3-5x faster parsing
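The dataset filtering step can be pictured with a short sketch. The `filter_rare_functions` name, the `(function_id, graph)` sample layout, and the default threshold of two are assumptions; the intuition is that a function seen only once cannot form a positive pair for similarity training.

```python
from collections import Counter


def filter_rare_functions(samples, min_examples=2):
    """Keep only samples whose function ID occurs at least min_examples times.

    samples: list of (function_id, graph) pairs, where graph is any object.
    """
    counts = Counter(func_id for func_id, _ in samples)
    return [(func_id, graph) for func_id, graph in samples
            if counts[func_id] >= min_examples]


# Hypothetical samples: the same function compiled at different settings.
samples = [
    ("parse", "graph_O0"),
    ("parse", "graph_O2"),
    ("log", "graph_O0"),  # appears once, so it is dropped
]
kept = filter_rare_functions(samples)
```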
5. Development Infrastructure
To improve the developer experience, I added:
- Dockerfile: Containerization for consistent environments
- IDE configurations: VSCode debugging setups
- Hyperparameter sweeps: Automated tuning with Weights & Biases
- Improved CLI: Better error messages and progress reporting
Technical Deep Dive: RustworkX Integration
The RustworkX migration was particularly interesting from a technical perspective. RustworkX provides Rust's performance with Python bindings, but it required careful handling of the impedance mismatch between the two ecosystems.
Key challenges included:
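- Metadata management: RustworkX addresses nodes and edges by integer index and stores a single payload object on each, so NetworkX-style attribute dictionaries keyed by node name have to be maintained separately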
- API differences: Different method names and signatures required careful translation
- Memory management: Ensuring proper memory handling across the Rust-Python boundary
The solution involved creating an abstraction layer that maintains compatibility while leveraging RustworkX's performance benefits.
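One way such an abstraction layer could look is sketched below. It is not the GraphWithMetadata class itself: `NamedGraph` is a hypothetical facade, and `IndexGraph` is a small stand-in for an index-based backend so the example runs without dependencies. Its `add_node(payload)` and `add_edge(source, target, payload)` methods mirror the shape of the corresponding `rx.PyDiGraph` calls, which return integer indices.

```python
class IndexGraph:
    """Stand-in for an index-based backend such as rx.PyDiGraph."""

    def __init__(self):
        self.nodes = []
        self.edges = []

    def add_node(self, payload):
        self.nodes.append(payload)
        return len(self.nodes) - 1  # integer node index

    def add_edge(self, source, target, payload):
        self.edges.append((source, target, payload))
        return len(self.edges) - 1  # integer edge index


class NamedGraph:
    """Name-keyed facade that keeps attributes beside an index-based graph."""

    def __init__(self, backend=None):
        self.backend = backend if backend is not None else IndexGraph()
        self.index_of = {}    # node name -> backend index
        self.node_attrs = {}  # node name -> attribute dict
        self.edge_attrs = {}  # (source name, target name) -> attribute dict

    def add_node(self, name, **attrs):
        # Create the backend node once, then merge attributes on repeat calls.
        if name not in self.index_of:
            self.index_of[name] = self.backend.add_node(name)
            self.node_attrs[name] = {}
        self.node_attrs[name].update(attrs)
        return self.index_of[name]

    def add_edge(self, source, target, **attrs):
        u = self.add_node(source)
        v = self.add_node(target)
        self.backend.add_edge(u, v, attrs)
        self.edge_attrs[(source, target)] = attrs


graph = NamedGraph()
graph.add_node("sub_401000", instructions=42)
graph.add_edge("sub_401000", "sub_401200", call_count=3)
```

Callers keep a NetworkX-like, name-based interface, while graph algorithms can run on the backend using the integer indices stored in `index_of`.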
Impact and Results
These improvements collectively result in:
- Performance: 3-10x faster processing for large datasets
- Flexibility: Support for custom datasets and evaluation modes
- Robustness: Better handling of edge cases and malformed data
- Usability: Improved developer experience with better tooling
The changes maintain backward compatibility while significantly enhancing the project's capabilities for production use.
Looking Forward
KYN now provides a more efficient and flexible platform for binary code similarity detection. The performance improvements enable scaling to larger codebases, while the enhanced evaluation capabilities support more diverse use cases.
For security researchers and malware analysts, these improvements mean faster analysis workflows and the ability to work with custom binary collections. The zero-shot evaluation capabilities are particularly valuable for rapid vulnerability searching across large software ecosystems.
Try It Yourself
If you're interested in binary analysis or security research, I encourage you to check out KYN. The improved performance and flexibility make it more accessible than ever for both research and production use cases.
The project demonstrates how targeted optimizations - particularly in core components like graph processing - can dramatically improve the usability of machine learning systems for security applications.