[Opta] Ordering of events #267

Open
probberechts opened this issue Dec 27, 2023 · 14 comments

@probberechts
Contributor

I noticed that Opta events can sometimes be slightly out of order. The F24 docs specify that the following attributes (in the given order) should be used to order each team's match events chronologically:

[image: excerpt from the F24 documentation listing the attributes used for ordering]

Only sorting by timestamp does not always give the same result. For example:

<Event id="1889768843" event_id="358" type_id="1" period_id="1" min="32" sec="3" player_id="59062" team_id="174" outcome="0" x="21.6" y="39.2" timestamp="2018-08-20T21:32:27.98" last_modified="2018-08-20T21:32:28" version="1534797148460"></Event>         
<Event id="1592827425" event_id="228" type_id="1" period_id="1" min="32" sec="4" player_id="80908" team_id="957" outcome="0" x="60.4" y="52.0" timestamp="2018-08-20T21:32:27.635" last_modified="2018-08-21T16:43:18" version="1534866198424"></Event>

Since the Opta deserializer currently only parses the "timestamp" field, it does not seem possible to order events chronologically.

@koenvo
Contributor

koenvo commented Dec 27, 2023

Are there any details on how to sort correctly while maintaining millisecond precision?

A solution could be to derive the timestamp from the “min” and “sec” attributes, but then we lose the precision.

@probberechts
Contributor Author

My documentation doesn't mention the precision of the "timestamp" field. However, my version of the documentation is extremely outdated. Maybe @JanVanHaaren has something more up-to-date.

I find it strange that the "timestamp" field does not align with the "min" and "sec" fields. If the precision of the "timestamp" field were inferior to that of the "min" and "sec" fields, I don't see why we would infer an (incorrect) millisecond precision from it.

@probberechts
Contributor Author

probberechts commented Dec 27, 2023

Looking at a few more timestamps, I now realize that Opta does not add leading zeros to the milliseconds. So, "2018-08-20T21:32:27.98" is actually "2018-08-20T21:32:27.098000".

Python's %f pads zeros to the right, while we should pad zeros to the left to parse the Opta timestamp. We should simply adapt the timestamp parser and then it should work.

%f is an extension to the set of format characters in the C standard (but implemented separately in datetime objects, and therefore always available). When used with the strptime() method, the %f directive accepts from one to six digits and zero pads on the right.
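
A minimal sketch of such an adapted parser, assuming the fractional part is always a millisecond count with its leading zeros dropped. `parse_opta_timestamp` is a hypothetical helper for illustration, not kloppy's actual implementation.

```python
from datetime import datetime, timedelta


def parse_opta_timestamp(raw: str) -> datetime:
    """Parse an Opta F24 timestamp whose milliseconds lack leading zeros."""
    base, _, fraction = raw.partition(".")
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    if fraction:
        # Assumption: the fraction is a millisecond count, so ".98" means 98 ms.
        parsed += timedelta(milliseconds=int(fraction))
    return parsed


naive = datetime.strptime("2018-08-20T21:32:27.98", "%Y-%m-%dT%H:%M:%S.%f")
fixed = parse_opta_timestamp("2018-08-20T21:32:27.98")
print(naive.microsecond)  # 980000: "%f" pads zeros on the right
print(fixed.microsecond)  # 98000: zeros padded on the left instead
```

With this interpretation, the first event in the example above (27.098) is logged before the second one (27.635), which agrees with their "sec" values.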

@JanVanHaaren
Collaborator

The min and sec fields on one hand and the timestamp field on the other hand provide different pieces of information about an event. The min and sec fields provide the game time in minutes and seconds when the event occurred, whereas the timestamp field provides the date and time when the event was logged in UK time. Hence, the timestamp field can be used as a tie-breaker to order events but not to derive the time when the event occurred in the match.

Documentation Opta F24

  • timestamp - "The UK time/date at which this event was initially entered into Opta’s database"
  • min - "Minute of the event"
  • sec - "Second of the event"

Documentation Stats Perform MA3

  • timestamp - "The UK time/date at which this event was initially entered into Opta's database"
  • timeMin - "Game time in minutes"
  • timeSec - "Game time in seconds"

@probberechts
Contributor Author

So, to conclude, would it be okay to fill the "timestamp" field in Kloppy with min + sec and order events based on min + sec + timestamp?
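
As a rough sketch of that ordering, applied to the two events from the first comment (given here in reverse order and trimmed to the relevant attributes): the game clock comes first and the logging timestamp only breaks ties. Including `period_id` in the key is an assumption on my part, and `logged_at` and `sort_key` are hypothetical helper names, not part of kloppy.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

F24_SNIPPET = """
<Events>
  <Event id="1592827425" period_id="1" min="32" sec="4" team_id="957" timestamp="2018-08-20T21:32:27.635" />
  <Event id="1889768843" period_id="1" min="32" sec="3" team_id="174" timestamp="2018-08-20T21:32:27.98" />
</Events>
"""


def logged_at(raw: str) -> datetime:
    # Assumption: the fraction is a millisecond count without leading zeros.
    base, _, fraction = raw.partition(".")
    moment = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return moment + timedelta(milliseconds=int(fraction or 0))


def sort_key(event: ET.Element):
    # Game clock first; the logging timestamp is only a tie-breaker.
    return (
        int(event.get("period_id")),
        int(event.get("min")),
        int(event.get("sec")),
        logged_at(event.get("timestamp")),
    )


events = sorted(ET.fromstring(F24_SNIPPET), key=sort_key)
# Game time in seconds, derived from min + sec only.
game_seconds = [60 * int(e.get("min")) + int(e.get("sec")) for e in events]
print([e.get("id") for e in events], game_seconds)
```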

@JanVanHaaren
Collaborator

That suggestion sounds good to me. The Wyscout V3 deserializer also fills the timestamp field based on the minute and second fields, although it would probably be better to use the provided matchTimestamp field. The StatsBomb deserializer uses the provided timestamp.

Should we explicitly store a sequence number for each event as well? StatsBomb and Wyscout explicitly provide a sequence number in the index and eventIndex fields, respectively.

@probberechts
Contributor Author

> Should we explicitly store a sequence number for each event as well? StatsBomb and Wyscout explicitly provide a sequence number in the index and eventIndex fields, respectively.

I would just make sure that the records in a dataset are chronologically ordered. Storing a sequence number then does not provide any added value since you would be able to infer it from the position in the list of records.

@koenvo
Contributor

koenvo commented Dec 27, 2023

Small question about the timestamp vs min/sec: when the record is not altered afterwards, does the timestamp match the min/sec?
So the timestamp only loses its value when the record is altered, correct?

@JanVanHaaren
Collaborator

> Small question about the timestamp vs min/sec: when the record is not altered afterwards, does the timestamp match the min/sec? so only when the record is altered the timestamp loses value, correct?

My understanding is that the timestamp field is never updated. The timestamp field reflects the time when the event was initially entered in the database and the last_modified field reflects the time when the event was last updated in the database.

I suspect that the timestamp field is reasonably accurate for events that are recorded live. However, not all event data is recorded live and events can occasionally be inserted at a later time during the match or even after the match.

@probberechts
Contributor Author

Although, according to my old documentation, the timestamp field reflects the time at which the event occurred within the match. 😕

[image: excerpt from the older documentation describing the timestamp field]

@JanVanHaaren
Collaborator

I will contact the Stats Perform support desk. The official documentation is confusing.

Documentation website

  • timestamp - "The UK time/date at which this event was initially entered into Opta's database"
  • timestamp_utc - "The UTC timestamp of when the event occurred, or when the data was entered in Opta DB"
  • last_modified - "The UK time/date at which this event was last modified by Opta"

@JanVanHaaren
Collaborator

I haven't heard back yet from Stats Perform, but I think I finally understand how the timestamps work. I suspect the meaning of the timestamp field depends on the coverage level. The event timestamps are detailed to the millisecond for some but not all coverage levels.

For example, the event data for this friendly match between Salzburg and Ried has coverage level 14. The game took place on 12 October 2023, but the timestamp for the kick-off event is 2023-10-15T08:49:39.373Z.

{
	"id": "9130ocq9mdrosrd4mv7a666tw",
	"coverageLevel": "14",
	"date": "2023-10-12Z",
	"time": "12:00:00Z",
	"localDate": "2023-10-12",
	"localTime": "14:00:00",
	"numberOfPeriods": 2,
	"periodLength": 45,
	"overtimeLength": 15,
	"lastUpdated": "2023-11-25T12:46:38Z",
	"description": "Salzburg vs Ried",
	...
},
{
	"id": 2604454267,
	"eventId": 3,
	"typeId": 1,
	"periodId": 1,
	"timeMin": 0,
	"timeSec": 0,
	"contestantId": "do3l4dhs0ooog6se728jxc06z",
	"playerId": "3rmiekqhf431q783nhdc2m12h",
	"playerName": "W. Eza",
	"outcome": 1,
	"x": 49.8,
	"y": 50.0,
	"timeStamp": "2023-10-15T08:49:39.373Z",
	"lastModified": "2023-10-16T00:39:15Z",
	"qualifier": [
		...
	]
},
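
One way to flag such matches might be to compare the kick-off event's timestamp with the scheduled start time. This is only a sketch under assumptions: the one-hour threshold `MAX_LIVE_LAG` is arbitrary, `parse_utc` is a hypothetical helper, and the dictionaries contain just the relevant fields from the example above.

```python
from datetime import datetime, timedelta


def parse_utc(raw: str) -> datetime:
    # Swap the trailing "Z" so fromisoformat also works before Python 3.11.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


match_info = {"date": "2023-10-12Z", "time": "12:00:00Z"}
kick_off_event = {"timeMin": 0, "timeSec": 0, "timeStamp": "2023-10-15T08:49:39.373Z"}

# Combine the separate date and time fields into one UTC datetime.
scheduled = parse_utc(match_info["date"].rstrip("Z") + "T" + match_info["time"])
logged = parse_utc(kick_off_event["timeStamp"])
lag = logged - scheduled

# Hypothetical threshold: a kick-off logged more than an hour away from the
# scheduled start was probably not collected live.
MAX_LIVE_LAG = timedelta(hours=1)
collected_live = abs(lag) <= MAX_LIVE_LAG
print(lag, collected_live)
```

Here the kick-off was logged almost three days after the scheduled start, so the timestamps of this match cannot reflect when the events occurred.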

@probberechts
Contributor Author

The question is rather whether they can be used as a reliable way to measure the relative time that has passed since the "period start" event.

@JanVanHaaren
Collaborator

I don't know yet, but my feeling is that it should be possible for the highest coverage levels. I'll investigate a few more matches. Unfortunately, I don't have access to much event data that was collected at lower coverage levels.

probberechts changed the title from "Ordering of Opta events" to "[Opta] Ordering of events" on Dec 30, 2023
probberechts added a commit to probberechts/kloppy that referenced this issue Jan 20, 2024
Opta does not zero-pad milliseconds. Therefore, they were incorrectly parsed
by Python's default "%f" format code.

See also PySport#267