Parse AST for specific function checking #21

import-pandas-as-numpy · 2023-06-25T18:52:57Z

Specification

Implement a feature which allows us to utilize Python's abstract syntax tree to match our current YARA rules against.

This feature should selectively run on files ending in ".py".
This feature should parse each ".py" file into its AST and visit every node.
This feature should treat AST nodes as separate files ending in ".py" for the purposes of our current ruleset.
This feature should provide a way to target specific rules at specific node types.
This feature should report which AST node a rule matched in, the type of that node, and the underlying code for that node.
This feature should compile this information to indicate in the original code which nodes matched which rules.
This feature might contain functionality to selectively ignore or regard constants and arguments as a default behavior specified in a rule.
- Ignore Arguments: Bool -- Causes args/kwargs in AST nodes to be disregarded, so that function calls match regardless of their arguments.
- Ignore Constants: Bool -- Causes constants in AST nodes to be disregarded, so that function calls match regardless of the constants they contain.
This feature might spawn a separate folder of YARA rules that are only compiled and invoked against AST nodes to prevent contamination of current rulesets.
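
A minimal sketch of the per-node matching described above, using Python's own `ast` module. Everything named here (`RULES`, `NodeMatch`, `scan_nodes`, `strip_call_arguments`) is hypothetical, and plain regexes stand in for compiled YARA rules; the real scanner would presumably hand each node's source text to YARA instead.

```python
import ast
import re
from dataclasses import dataclass

# Hypothetical rule table: node type -> list of (rule name, pattern).
# Regexes are a stand-in for YARA rules in this sketch.
RULES = {
    ast.Call: [("os_system_call", re.compile(r"\bos\.system\s*\("))],
    ast.Import: [("imports_socket", re.compile(r"\bimport\s+socket\b"))],
}


@dataclass
class NodeMatch:
    rule: str        # name of the rule that matched
    node_type: str   # e.g. "Call", "Import"
    lineno: int      # line in the original file where the node starts
    code: str        # source text of the node


def strip_call_arguments(node: ast.Call) -> str:
    """Render a call with its args/kwargs dropped (the 'Ignore Arguments' idea)."""
    bare = ast.Call(func=node.func, args=[], keywords=[])
    return ast.unparse(bare)  # requires Python 3.9+


def scan_nodes(source: str, rules=RULES, ignore_arguments: bool = False) -> list[NodeMatch]:
    """Treat each AST node's source text as its own 'file' and match rules by node type."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        for name, pattern in rules.get(type(node), []):
            segment = ast.get_source_segment(source, node) or ""
            text = segment
            if ignore_arguments and isinstance(node, ast.Call):
                text = strip_call_arguments(node)
            if pattern.search(text):
                matches.append(NodeMatch(name, type(node).__name__, node.lineno, segment))
    return matches
```

One thing this sketch surfaces: because a parent node's source text contains its children, a nested call such as `print(os.system("x"))` matches twice unless arguments are ignored, so the spec may need to say how nested matches are deduplicated.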

Motivation

The abstract syntax tree offers much more context about what Python understands a function to be doing. The use of a string in one context doesn't mean that string is an indicator across the entire program. Passing something like rm -rf / to a subprocess or system call is far riskier than finding that string in a docstring, but current YARA conventions leave us with two options: either we check that the string isn't in a docstring (currently impossible, since there are no lookaheads/lookbehinds), or we spell out in the regex itself the specific contexts in which this command must flag. (In this case, we would have to look for subprocess calls with those arguments passed as a list.)

Additionally, this would be a significant quality-of-life enhancement for PyPI staff, who would now be pointed to the specific line exhibiting malicious behavior.

Precedent for this exists in two forms: Semgrep and YARA itself. Semgrep is able to comprehend far more of the language's semantics and so derive the context in which something is used. YARA's PE module lets you reference specific sections of a PE file, so behavior can be evaluated in the context in which it appears. (For instance, a .rsrc section containing a malware.dll is something YARA can currently detect.)

Open Questions

Are we reinventing Semgrep? Semgrep does not currently exist in Rust, but it already provides much of the functionality we're aiming to replicate here.
Will we need to spin up a new ruleset for this? If we don't, we stand to pollute our current rules with additional, needless metadata fields that specify the behavior of these AST parsers.
Should we carve out functionality for the deobfuscators that Stickie and IlluminatiFish are working on while we're writing this feature? They make heavy use of the AST, and we'll likely want to bake this into the scanner at some point.
Will this be something we can easily extend to other languages? If we ever elect to scan another ecosystem such as NPM, using an AST might be useful there too. Ideally, we avoid footgunning ourselves by abstracting this so that support for another language can be dropped in.

Requirements

Answer Open Questions
Help me develop requirements
Do those requirements

import-pandas-as-numpy · 2023-06-30T01:49:51Z

@Robin5605 @AbooMinister25 @jonathan-d-zhang @Recursive-Error
Review/eyes requested.

AbooMinister25 · 2023-06-30T02:00:49Z

for this question ~

Will this be something we can easily extend to other languages? If we ever elect to scan another ecosystem such as NPM, using an AST might be useful there too. If we can avoid footgunning ourselves by abstracting this in a way that makes drop in functionality useful.

Considering that the semantics of the languages differ, I imagine that identifying what specific nodes to apply specific rules to would change as well - depending on how we structure the API, I suppose maybe something like providing mappings of nodes to a set of rules or whatever, differing per language, might be feasible.

Robin5605 · 2023-07-01T03:15:44Z

Will this be something we can easily extend to other languages? If we ever elect to scan another ecosystem such as NPM, using an AST might be useful there too. If we can avoid footgunning ourselves by abstracting this in a way that makes drop in functionality useful.

Superficially, technically yes. If we go with something like tree-sitter, for instance, it supports parsing a whole bunch of languages.

import-pandas-as-numpy added the enhancement New feature or request label Jun 25, 2023

import-pandas-as-numpy added this to Dragonfly Roadmap Jun 25, 2023

import-pandas-as-numpy moved this to 🔎 Discovery in Dragonfly Roadmap Jun 25, 2023
