File Scanner 1.0.0
A high-performance C++ malicious file scanner.
File Scanner

License: MIT

A high-performance, multithreaded command-line utility for scanning files against a database of malicious MD5 hashes.

This project is a demonstration of modern C++ software engineering principles, designed to be robust, scalable, and maintainable. It is built with a clean, decoupled architecture to ensure high cohesion, low coupling, and excellent testability.

Made for the Kaspersky SafeBoard internship.

Features

  • High-Performance & Concurrent: Utilizes all available CPU cores to hash files in parallel, capable of scanning hundreds of thousands of files per minute.
  • Scalable: Calculates MD5 hashes by streaming each file in chunks, so memory use stays bounded and files of any size (even those larger than available RAM) can be processed.
  • Clean Architecture: Strictly separates concerns into Domain, Application, and Infrastructure layers. This makes the core logic independent of external details like filesystems and databases.
  • Modern C++: Written in C++17, leveraging modern features like smart pointers, std::filesystem, std::thread, atomics, and move semantics.
  • Fully Tested: Includes a comprehensive suite of unit and integration tests using the Google Test framework to ensure correctness and reliability.
  • CI/CD: Features a complete Continuous Integration pipeline using GitHub Actions for automated cross-platform builds (Windows, macOS, Linux), testing, code coverage analysis, and documentation deployment.
  • Well-Documented: Provides a full API reference generated by Doxygen and hosted on GitHub Pages.

Performance

The scanner is designed for high throughput on modern hardware. Benchmarks were conducted on a machine equipped with an Apple M1 Pro (8 performance cores).

The benchmark consists of scanning a directory containing 100,000 files (1KB to 128KB each) with a total size of approximately 6.2 GB.

Metric                 Result
Total Execution Time   ~19.5 seconds
Throughput             ~5,100 files/second

During the scan, the utility successfully utilizes all available CPU cores, demonstrating the efficiency of the multithreaded architecture.
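
One simple way to keep every core busy is to let worker threads pull the next file index from a shared atomic counter. The sketch below shows that idea only; the project itself ships a `ThreadPool` (thread_pool.h) whose interface is not shown here, and `parallel_for_index` is a hypothetical helper.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Runs task(i) once for every i in [0, count), spread over `workers` threads.
// The task must be safe to call concurrently and must not throw.
template <typename Task>
void parallel_for_index(std::size_t count, unsigned workers, Task task) {
    if (workers == 0) workers = 1;  // std::thread::hardware_concurrency() may return 0
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> threads;
    threads.reserve(workers);
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&next, &task, count] {
            for (;;) {
                // Each fetch_add hands out a unique index, so no item is processed twice.
                const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= count) break;
                task(i);
            }
        });
    }
    for (auto& t : threads) t.join();
}
```

A caller would usually pass `std::thread::hardware_concurrency()` as `workers` and hash the i-th file inside the task. Because indices are claimed one at a time, a few large files do not leave other threads idle.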

To run the benchmark on your own machine, use the benchmark target:

cmake --build build --target benchmark

Requirements

  • CMake (version 3.14 or higher)
  • A C++17-compliant compiler (e.g., GCC 9+, Clang 10+, MSVC 2019+)
  • Python 3 (for running the performance benchmark)
  • Doxygen & Graphviz (optional, for generating documentation locally)

Building and Testing

The project uses a standard CMake workflow. All C++ dependencies (Google Test, md5-lib) are fetched automatically.

# 1. Clone the repository
git clone https://github.com/GregoryKogan/file-scanner.git
cd file-scanner
# 2. Configure the project
cmake -S . -B build
# 3. Build all targets (library, executable, tests)
cmake --build build
# 4. Run all unit and integration tests
ctest --test-dir build --output-on-failure

Usage

The utility is run from the command line with three required arguments.

Launch Command:

# From the 'build' directory after compiling
./bin/scanner --path /path/to/scan --base /path/to/database.csv --log /path/to/report.log

Arguments:

  • --path <directory>: The absolute or relative path to the root directory to be scanned.
  • --base <file.csv>: The path to the CSV file containing malicious signatures.
  • --log <file.log>: The path to the file where detection reports will be written.
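
For illustration, the three flags could be parsed along the following lines. This is a sketch, not the scanner's actual parser: `CliOptions` and `parse_args` are hypothetical names, and rejecting unknown flags is an assumption about the desired behavior.

```cpp
#include <optional>
#include <string>

// Hypothetical container for the three required arguments.
struct CliOptions {
    std::string path;
    std::string base;
    std::string log;
};

// Returns the parsed options, or std::nullopt if a flag is unknown,
// a value is missing, or any of the three arguments was not supplied.
inline std::optional<CliOptions> parse_args(int argc, const char* const argv[]) {
    CliOptions opts;
    for (int i = 1; i < argc; ++i) {
        const std::string flag = argv[i];
        std::string* target = nullptr;
        if (flag == "--path") target = &opts.path;
        else if (flag == "--base") target = &opts.base;
        else if (flag == "--log") target = &opts.log;
        if (target == nullptr || i + 1 >= argc) return std::nullopt;
        *target = argv[++i];  // consume the value that follows the flag
    }
    if (opts.path.empty() || opts.base.empty() || opts.log.empty()) return std::nullopt;
    return opts;
}
```

In `main`, a `std::nullopt` result would normally lead to a usage message and a non-zero exit code.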

Example base.csv Format

The signature database is a simple text file with one entry per line. Each line contains an MD5 hash and a verdict, separated by a semicolon.

a9963513d093ffb2bc7ceb9807771ad4;Exploit
ac6204ffeb36d2320e52f1d551cfa370;Dropper

Example report.log Output

Detections are logged in the JSON Lines (JSONL) format, which is structured and machine-readable.

{"path": "/path/to/scan/bad_file1.exe", "hash": "a9963513d093ffb2bc7ceb9807771ad4", "verdict": "Exploit"}
{"path": "/path/to/scan/nested/bad_file2.dll", "hash": "ac6204ffeb36d2320e52f1d551cfa370", "verdict": "Dropper"}
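
Producing one such line per detection might look like the sketch below. `json_escape` and `to_jsonl` are hypothetical helpers, not the project's logger; the escaping covers quotes, backslashes, and control characters, which is what file paths most often need.

```cpp
#include <cstdio>
#include <string>

// Escapes the characters that may not appear raw inside a JSON string.
inline std::string json_escape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {  // remaining control characters become \u00XX
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Formats one detection as a single JSON object without a trailing newline.
inline std::string to_jsonl(const std::string& path, const std::string& hash,
                            const std::string& verdict) {
    return "{\"path\": \"" + json_escape(path) + "\", \"hash\": \"" + json_escape(hash) +
           "\", \"verdict\": \"" + json_escape(verdict) + "\"}";
}
```

Each record is written followed by a newline, so the log can be appended to safely and parsed line by line.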

Example Console Report

After the scan is complete, a summary is printed to the console.

--- Scan Report ---
Processed files: 15032
Malicious detections: 2
Errors: 1
Execution time: 3451 ms
-------------------

Architecture Overview

The project follows the principles of Clean Architecture to ensure a separation of concerns.

  • src/scanner_lib (Core Library - DLL):
    • Domain (domain.h): Contains the core data structures (ScanResult). It has no dependencies.
    • Application (scanner.h, thread_pool.h): Contains the application-specific logic and orchestration (IScanner, ThreadPool). Depends only on the Domain.
    • Infrastructure (csv_hash_database.h, etc.): Contains the concrete implementations of external-facing components. It implements interfaces defined in the Application layer.
  • src/scanner_cli (Presentation Layer - EXE):
    • The command-line interface. It is the "Composition Root" that uses the library's public Builder API to assemble the application and present the results. It depends on the library but not on its internal details.
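
The dependency direction described above can be sketched in a few lines. Only `ScanResult` is a name taken from the project; its fields, `IHashDatabase`, `InMemoryHashDatabase`, `Scanner`, and `ScannerBuilder` are hypothetical and greatly simplified, and the real Builder API may look quite different.

```cpp
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Domain: plain data with no dependencies (field names are assumptions).
struct ScanResult {
    std::string path;
    std::string hash;
    std::string verdict;
};

// Application: the interface the scanning logic depends on.
class IHashDatabase {
public:
    virtual ~IHashDatabase() = default;
    virtual std::optional<std::string> find_verdict(const std::string& hash) const = 0;
};

// Application: orchestration that sees only the interface, never a concrete database.
class Scanner {
public:
    explicit Scanner(std::unique_ptr<IHashDatabase> db) : db_(std::move(db)) {}
    std::optional<ScanResult> check(const std::string& path, const std::string& hash) const {
        if (auto verdict = db_->find_verdict(hash)) return ScanResult{path, hash, *verdict};
        return std::nullopt;
    }

private:
    std::unique_ptr<IHashDatabase> db_;
};

// Infrastructure: one concrete implementation of the Application-layer interface.
class InMemoryHashDatabase : public IHashDatabase {
public:
    explicit InMemoryHashDatabase(std::unordered_map<std::string, std::string> entries)
        : entries_(std::move(entries)) {}
    std::optional<std::string> find_verdict(const std::string& hash) const override {
        const auto it = entries_.find(hash);
        if (it == entries_.end()) return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> entries_;
};

// Builder used by the composition root to wire the pieces together.
class ScannerBuilder {
public:
    ScannerBuilder& with_database(std::unique_ptr<IHashDatabase> db) {
        db_ = std::move(db);
        return *this;
    }
    Scanner build() {
        if (!db_) throw std::logic_error("ScannerBuilder: no hash database configured");
        return Scanner(std::move(db_));
    }

private:
    std::unique_ptr<IHashDatabase> db_;
};
```

Because `Scanner` depends only on `IHashDatabase`, a test can inject an in-memory database while the CLI injects a CSV-backed one, without the scanning logic changing.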