Edge cases and Gotchas #2

jdesrosiers · 2024-02-27T18:16:52Z

Here are a few things to look out for when implementing something like this.

The set of properties that are considered keywords depends on the dialect

In the following example, additionalItems should not be highlighted as a keyword because it was removed in 2020-12.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "prefixItems": [true],
  "additionalItems": false, // <- not a keyword
  "items": false,
  "definitions": {}, // <- not a keyword
  "aaa": 42 // <- not a keyword
}

When we change the dialect, the set of properties that are considered keywords changes.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "prefixItems": [true], // <- not a keyword
  "additionalItems": false,
  "items": false,
  "definitions": {},
  "aaa": 42 // <- not a keyword
}
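A minimal sketch of a per-dialect keyword lookup, assuming the tables are written by hand. The names `KEYWORDS_BY_DIALECT` and `is_keyword` are hypothetical, and the keyword sets are deliberately partial (just enough to cover the two examples above), not complete lists for either dialect.

```python
# Deliberately partial keyword tables, for illustration only. A real
# highlighter would need the complete list for every dialect it supports.
KEYWORDS_BY_DIALECT = {
    "https://json-schema.org/draft/2020-12/schema": frozenset(
        {"$schema", "$id", "$defs", "prefixItems", "items", "type", "properties"}
    ),
    "http://json-schema.org/draft-07/schema#": frozenset(
        {"$schema", "$id", "definitions", "additionalItems", "items", "type", "properties"}
    ),
}


def is_keyword(dialect_uri: str, property_name: str) -> bool:
    """Return True if property_name is a keyword in the given dialect.

    Returning False for a dialect that is not in the table is a choice
    made for this sketch.
    """
    return property_name in KEYWORDS_BY_DIALECT.get(dialect_uri, frozenset())
```

With these tables, `additionalItems` is a keyword only under the draft-07 URI, `prefixItems` only under the 2020-12 URI, and `aaa` under neither.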

Properties are only keywords inside schemas

Not every object in a JSON Schema document is a schema, so you need to know when you're in a schema and when you're not. Here are a couple of examples. In the first one, $id is a property name under properties, not a keyword.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "properties": {
    "$id": "foo" // <- not a keyword
  }
}

In the next example, type isn't considered a keyword because definitions isn't a keyword in 2020-12. Therefore, the values under definitions aren't schemas and the properties of those values shouldn't be considered keywords.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "definitions": {
    "foo": {
      "type": "string" // <- not a keyword
    }
  }
}
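One way to track "am I in a schema?" is to record, for each keyword, what its value holds, and to descend only through values that are schemas. The following is a sketch under that assumption, working on an already-parsed document rather than on lexer tokens. `KEYWORD_KINDS` and `keyword_paths` are hypothetical names, and the table is a partial one for 2020-12 only.

```python
from typing import Any, Iterator

# How a keyword's value relates to subschemas (hypothetical categories).
SCHEMA = "schema"            # the value is a schema
SCHEMA_MAP = "schema-map"    # the value is an object whose values are schemas
SCHEMA_LIST = "schema-list"  # the value is an array of schemas
OTHER = "other"              # the value is plain data

# Partial table for 2020-12, for illustration only.
KEYWORD_KINDS = {
    "$schema": OTHER, "$id": OTHER, "type": OTHER,
    "items": SCHEMA,
    "prefixItems": SCHEMA_LIST,
    "properties": SCHEMA_MAP, "$defs": SCHEMA_MAP,
}


def _escape(token: str) -> str:
    """Escape one JSON Pointer reference token."""
    return token.replace("~", "~0").replace("/", "~1")


def keyword_paths(schema: Any, path: str = "") -> Iterator[str]:
    """Yield a JSON Pointer for each property that sits in keyword position."""
    if not isinstance(schema, dict):
        return  # boolean schemas and non-schema values hold no keywords
    for name, value in schema.items():
        kind = KEYWORD_KINDS.get(name)
        if kind is None:
            continue  # not a keyword in this dialect, so do not descend
        here = f"{path}/{_escape(name)}"
        yield here
        if kind == SCHEMA:
            yield from keyword_paths(value, here)
        elif kind == SCHEMA_MAP and isinstance(value, dict):
            # The keys of a schema map are names, not keywords.
            for key, sub in value.items():
                yield from keyword_paths(sub, f"{here}/{_escape(key)}")
        elif kind == SCHEMA_LIST and isinstance(value, list):
            for index, sub in enumerate(value):
                yield from keyword_paths(sub, f"{here}/{index}")
```

Because the walk never descends through a property it does not recognize, everything under definitions in the 2020-12 example above is skipped, and the $id key under properties is treated as a property name.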

Embedded schemas can have a different dialect

It's possible for embedded schemas to have a different dialect than their parent schema. In the following example, the same keywords are highlighted differently depending on which schema resource the keyword appears in.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "prefixItems": [true],
  "additionalItems": false, // <- not a keyword
  "items": false,
  "$defs": {
    "foo": {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "$id": "https://example.com/schema/embedded",
      "prefixItems": [true], // <- not a keyword
      "additionalItems": false,
      "items": false,
      "definitions": {}
    }
  }
}

sudo-jarvis · 2024-02-27T18:23:32Z

@jdesrosiers, the current implementation in #1 supports only the latest dialect, not multiple dialects or previous dialects. Any idea how to dynamically fetch the keywords for each dialect?

jdesrosiers · 2024-02-27T19:46:27Z

Any idea how to dynamically fetch the keywords for each dialect?

There isn't a convenient list anywhere you can just fetch. You'll need to build the lists yourself from the spec or meta-schemas or whatever other source you can find.

Julian · 2024-02-28T15:59:29Z

My plan is probably for the simple list of keywords to eventually live in the jsonschema-specifications project, which essentially represents "give me the JSON Schema specifications in Python at runtime".

But that plan includes also writing type annotations for them, so it's a bit medium term.

For now simply copying / writing them down is the right thing.

Julian · 2024-02-28T15:59:59Z

(Oh and definitely awesome! Thanks again Jason for sharing your learnings!)

sudo-jarvis · 2024-02-29T13:18:15Z

@Julian To support multiple dialects, once the lexer gives us a list of tokens we could iterate over it from left to right while maintaining a stack, and use that stack to find, for each keyword, the nearest $schema to its left.

We'd push each token onto the stack and, on encountering a }, pop everything back to the matching {. That way, even with nesting, the nearest $schema still on the stack is the one that applies to the current token.

Then, once we know the dialect, we'd check that dialect's dict to decide whether the token should be treated as a keyword.
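The stack idea can be sketched roughly as follows. This is an illustration only: the `(kind, text)` token tuples are a hypothetical simplification, not Pygments' real token types, and `keys_with_dialect` is an invented name. Instead of pushing every token, the sketch keeps one stack entry per open object, which holds that object's own $schema value once it has been seen.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical token shape: (kind, text), where kind is "punct", "key" or "value".
Token = Tuple[str, str]


def keys_with_dialect(tokens: Iterable[Token]) -> Iterator[Tuple[str, Optional[str]]]:
    """Pair each object key with the nearest enclosing $schema seen so far."""
    stack: List[Optional[str]] = []  # one entry per open object
    awaiting_schema = False          # True right after a "$schema" key
    for kind, text in tokens:
        if kind == "punct":
            if text == "{":
                stack.append(None)
            elif text == "}" and stack:
                stack.pop()
            if text != ":":
                # A $schema value must follow its key directly.
                awaiting_schema = False
        elif kind == "key":
            awaiting_schema = text == "$schema"
            # The innermost open object that declared $schema wins.
            dialect = next((d for d in reversed(stack) if d is not None), None)
            yield text, dialect
        elif kind == "value" and awaiting_schema and stack:
            stack[-1] = text
            awaiting_schema = False
```

Two limitations are worth noting. Keys that appear before $schema in the same object are paired with the outer dialect, because only tokens to the left are considered. The sketch also applies every $schema it sees, without checking for an identifier, which is the complication raised later in this thread.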

Julian · 2024-02-29T13:26:25Z

Does Pygments' JSON lexer not already handle the recursion? It presumably must, since it notices when an object literal is encountered, so the stack you're talking about must already be there. "All" we should have to do is intercept that object literal parsing once it's done, look at the $schema keyword if present, and then decide how to handle the other keywords, I'd think. But clearly I haven't looked closely.

sudo-jarvis · 2024-02-29T13:34:55Z

@Julian, yes, you are right, we can do that when the whole document uses the same dialect. However, I was talking about the case where, say, the outer object declares the 2020-12 dialect and some inner object declares draft-07.

As mentioned by @jdesrosiers here:

It's possible for embedded schemas to have a different dialect than their parent schema. In the following example, the same keywords are highlighted differently depending on which schema resource the keyword appears in.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "prefixItems": [true],
  "additionalItems": false, // <- not a keyword
  "items": false,
  "$defs": {
    "foo": {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "$id": "https://example.com/schema/embedded",
      "prefixItems": [true], // <- not a keyword
      "additionalItems": false,
      "items": false,
      "definitions": {}
    }
  }
}

Julian · 2024-02-29T14:16:42Z

Yes I know that bit of course, but I forgot Pygments doesn't do any AST parsing, just a flat list of tokens, so it doesn't tell us where objects start and end... OK, that's unfortunate, but what you say sounds fine then. And you can get the list of keywords for each dialect by adding a dependency on jsonschema -- the keywords you need are then jsonschema.Draft202012Validator.VALIDATORS.keys().

jdesrosiers · 2024-02-29T18:28:28Z

find for each keyword which is its nearest $schema to the left.

It's a little more complicated than that. $schema only has an effect when it's at the root of a schema resource. The presence of an identifier ($id or id, depending on the dialect) is what makes a subschema a schema resource.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "prefixItems": [true],
  "additionalItems": false, // <- not a keyword
  "items": false,
  "$defs": {
    "foo": {
      "$schema": "http://json-schema.org/draft-07/schema#", // <- no $id, so this keyword has no effect
      "prefixItems": [true],
      "additionalItems": false, // <- not a keyword
      "items": false,
      "definitions": {} // <- not a keyword
    }
  }
}

Keep in mind that you can't just look for $id or id, you have to look for the one appropriate to the dialect.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "prefixItems": [true],
  "additionalItems": false, // <- not a keyword
  "items": false,
  "$defs": {
    "foo": {
      "$schema": "http://json-schema.org/draft-04/schema#", // <- no id, so this keyword has no effect
      "$id": "https://example.com/schema/embedded", // <- $id doesn't apply for draft-04
      "prefixItems": [true],
      "additionalItems": false, // <- not a keyword
      "items": false,
      "definitions": {} // <- not a keyword
    }
  }
}

Unfortunately, there's an ambiguous situation that you're going to have to figure out how to deal with. Imagine that $schema declares a dialect that you don't recognize.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "prefixItems": [true],
  "additionalItems": false, // <- not a keyword
  "items": false,
  "$defs": {
    "foo": {
      "$schema": "https://example.com/unknown-dialect",
      "$id": "https://example.com/schema/embedded", // <- is this an identifier or not?
      "prefixItems": [true], // <- is this a keyword or not?
      "additionalItems": false, // <- is this a keyword or not?
      "items": false, // <- is this a keyword or not?
      "definitions": {} // <- is this a keyword or not?
    }
  }
}

If you don't understand the dialect, you don't know what keyword is used for identifying a schema resource. Therefore, it's ambiguous whether $schema should be respected or not. My solution for the purpose of highlighting is to treat an unknown dialect as an embedded schema even though I don't know if it declares an identifier and treat all properties as non-keywords. It's not perfect, but it's the best we can do with limited information.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "prefixItems": [true],
  "additionalItems": false, // <- not a keyword
  "items": false,
  "$defs": {
    "foo": {
      "$schema": "https://example.com/unknown-dialect",
      "$id": "https://example.com/schema/embedded", // <- not a keyword
      "prefixItems": [true], // <- not a keyword
      "additionalItems": false, // <- not a keyword
      "items": false, // <- not a keyword
      "definitions": {} // <- not a keyword
    }
  }
}
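The rules in this comment can be condensed into one decision function. This is a sketch, not a definitive implementation: `effective_dialect`, `ID_KEYWORD_BY_DIALECT` and the `UNKNOWN_DIALECT` sentinel are hypothetical names, the dialect table is partial, and the `is_root` flag is an assumption added so that the document root can honor $schema without an identifier.

```python
# Identifier keyword for each dialect this sketch recognizes (partial table).
ID_KEYWORD_BY_DIALECT = {
    "http://json-schema.org/draft-04/schema#": "id",
    "http://json-schema.org/draft-07/schema#": "$id",
    "https://json-schema.org/draft/2020-12/schema": "$id",
}

# Sentinel meaning "highlight no property as a keyword".
UNKNOWN_DIALECT = "<unknown>"


def effective_dialect(subschema: dict, parent_dialect: str, is_root: bool = False) -> str:
    """Decide which dialect governs the keywords of subschema."""
    declared = subschema.get("$schema")
    if not isinstance(declared, str):
        return parent_dialect   # no $schema here, so inherit
    id_keyword = ID_KEYWORD_BY_DIALECT.get(declared)
    if id_keyword is None:
        # Unrecognized dialect: we cannot tell which keyword identifies a
        # schema resource, so treat it as embedded with no keywords at all.
        return UNKNOWN_DIALECT
    if is_root or id_keyword in subschema:
        return declared         # the root of a schema resource
    return parent_dialect       # $schema without its identifier has no effect
```

A caller would pass the result down as `parent_dialect` for nested subschemas, so a draft-04 `$schema` next to `$id` (rather than `id`) leaves the enclosing 2020-12 dialect in force, matching the example above.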

sudo-jarvis · 2024-03-01T05:15:00Z

@jdesrosiers, so basically we first need to look at the dialect, which tells us whether id or $id is the identifier keyword. Then the presence of that identifier tells us whether the keywords in that subschema are treated according to that dialect or according to the dialect of the enclosing schema?

jdesrosiers · 2024-03-01T22:04:23Z

Correct, but don't forget to also handle the case where you don't know the dialect that's specified (the ambiguous situation described in my last comment).

Julian mentioned this issue Feb 28, 2024

Create the Json Schema lexer using pygments #1

Merged
