Fido identifying some XLSX, PPTX, and DOCX as fido-fmt/{x} #152

Open
ross-spencer opened this issue Feb 24, 2019 · 5 comments

@ross-spencer

ross-spencer commented Feb 24, 2019

Dev Effort

1D

Description

Via @sromkey the MS-Office Open XML files in this Archivematica test data zip are being identified as fido-fmt/{x} in Fido:

ross-spencer@artefactual:~/git/artefactual-labs/am/src/archivematica-sampledata/SampleTransfers/OfficeDocsExtracted/objects$ fido *
FIDO v1.3.12 (formats-v94.xml, container-signature-20180920.xml, format_extensions.xml)
OK,14,fido-fmt/189.ppt,"Microsoft Office Open XML - Powerpoint","Microsoft Office Open XML - Powerpoint",47215,"MS-OfficeOpenXML-samples/samplepptx.pptx","None","signature"
OK,10,fido-fmt/189.word,"Microsoft Office Open XML - Word","Microsoft Office Open XML - Word",14860,"MS-OfficeOpenXML-samples/sampledocx.docx","None","signature"
OK,11,fido-fmt/189.xl,"Microsoft Office Open XML - Excel","Microsoft Office Open XML - Excel",12050,"MS-OfficeOpenXML-samples/samplexlsx.xlsx","None","signature"
FIDO: Processed      9 files in 343.28 msec, 26 files/sec

If the fido-fmt/{x} entries are removed, as described here: #36 (comment), then the closest match seems to be generic OOXML:

ross-spencer@artefactual:~/Desktop/temp/ndsa/office-samples-and-skeletons/samples$ fido *
FIDO v1.3.12 (formats-v94.xml, container-signature-20180920.xml, format_extensions.xml)
OK,150,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",14860,"sampledocx.docx","None","signature"
OK,8,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",47215,"samplepptx.pptx","None","signature"
OK,9,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",12050,"samplexlsx.xlsx","None","signature"
FIDO: Processed      3 files in 206.92 msec, 14 files/sec

Unfortunately, the Skeleton Suite looks like it won't help with debugging here, as the extracted samples (three per PUID) all identify correctly.

I have extracted the samples and the skeleton files here for easy access.

NB. Also noted by Sarah is that Siegfried will identify the formats correctly:

ross-spencer@artefactual:~/git/artefactual-labs/am/src/archivematica-sampledata/SampleTransfers/OfficeDocsExtracted/objects$ sf *
---
siegfried   : 1.7.11
scandate    : 2019-02-24T12:22:11+01:00
signature   : default.sig
created     : 2019-02-16T11:10:03+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V94.xml; container-signature-20180917.xml'
---
filename : 'MS-OfficeOpenXML-samples/sampledocx.docx'
filesize : 14860
modified : 2007-08-14T23:29:00+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/412'
    format  : 'Microsoft Word for Windows'
    version : '2007 onwards'
    mime    : 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    basis   : 'extension match docx; container name [Content_Types].xml with byte match at 460, 94 (signature 1/3)'
    warning : 
---
filename : 'MS-OfficeOpenXML-samples/samplepptx.pptx'
filesize : 47215
modified : 2007-08-14T23:51:16+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/215'
    format  : 'Microsoft Powerpoint for Windows'
    version : '2007 onwards'
    mime    : 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
    basis   : 'extension match pptx; container name [Content_Types].xml with byte match at 2326, 96 (signature 1/3)'
    warning : 
---
filename : 'MS-OfficeOpenXML-samples/samplexlsx.xlsx'
filesize : 12050
modified : 2007-08-14T23:50:24+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/214'
    format  : 'Microsoft Excel for Windows'
    version : '2007 onwards'
    mime    : 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    basis   : 'extension match xlsx; container name [Content_Types].xml with byte match at 676, 88 (signature 1/3)'
    warning : 
---
@ghost ghost added the bug A product defect that needs fixing label Mar 13, 2019
@carlwilson carlwilson self-assigned this Mar 13, 2019
@ghost ghost assigned ablwr and unassigned carlwilson Mar 13, 2019
@ghost ghost added the P1 High priority issues to be scheduled in the upcoming release label Mar 13, 2019
@ghost ghost added this to the v1.4.0-m4 milestone Mar 13, 2019
@ablwr
Contributor

ablwr commented Oct 29, 2019

If this is an error for this handful of files but not for other files of their type, would it be a better outcome for FIDO to return the generic Microsoft OOXML with the standard PRONOM fmt/189 ID, rather than the custom fido-fmt ID?

I'm asking because I get the same results in master, but I don't necessarily have the bandwidth to fully investigate, change, and test a larger solution for these Microsoft files. I can, however, remove the custom fido-fmts, which will produce fmt/189 results (better for preservation..?)

@carlwilson carlwilson removed this from the v1.4.0-m4 milestone May 5, 2020
@carlwilson carlwilson assigned replaceafill and unassigned ablwr May 5, 2020
@carlwilson carlwilson added this to the v1.6 milestone May 5, 2020
@replaceafill
Contributor

@carlwilson I investigated this at commit 6211d66 of the rc/1.6 branch, with signature versions FIDO v1.4.1 (formats-v97.xml, container-signature-20200121.xml, format_extensions.xml) and the office-samples-and-skeletons.zip file shared by Ross.

From what I can see, fido finds three signatures for the files in the office-samples-and-skeletons/samples directory. For example, for the samplexlsx.xlsx file, the match_formats method initially gets a list similar to:

 [('x-fmt/263', 'ZIP format'), ('fmt/189', 'Microsoft Office Open XML'), ('fido-fmt/189.xl', 'Microsoft Office Open XML - Excel')]

Then the priority logic determines that ('fido-fmt/189.xl', 'Microsoft Office Open XML - Excel') from the format_extensions.xml file is the best match.
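A minimal sketch of how such priority filtering might behave. This is not fido's actual code: the `HAS_PRIORITY_OVER` table and the `apply_priority` function are hypothetical names, and the priority relations are assumed for illustration rather than read from the signature files.

```python
# Hypothetical priority relations: each key is assumed to have priority
# over every PUID in its set. Not read from fido's signature files.
HAS_PRIORITY_OVER = {
    "fmt/189": {"x-fmt/263"},
    "fido-fmt/189.xl": {"fmt/189", "x-fmt/263"},
}


def apply_priority(matches, priority=HAS_PRIORITY_OVER):
    """Drop every match that another match in the list has priority over."""
    superseded = set()
    for puid, _name in matches:
        superseded |= priority.get(puid, set())
    return [(puid, name) for puid, name in matches if puid not in superseded]


matches = [
    ("x-fmt/263", "ZIP format"),
    ("fmt/189", "Microsoft Office Open XML"),
    ("fido-fmt/189.xl", "Microsoft Office Open XML - Excel"),
]

# With the custom entry present, only the fido-fmt match survives.
best = apply_priority(matches)

# With the custom entry absent, the generic fmt/189 match survives.
without_custom = apply_priority(matches[:2])
```

Under these assumed relations, dropping the custom entry leaves fmt/189 as the winner, which is consistent with the result Ross reported after removing the fido-fmt entries.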

The difference with the files in the office-samples-and-skeletons/skeleton directory is that only one signature is found, which makes fido detect the format using the container signature file container-signature-20200121.xml instead. For example, for the fmt-214-container-signature-id-2030.xlsx file, the match_formats method gets a list similar to:

[('x-fmt/263', 'ZIP format')]

From that, the container type ZIP is determined, and the format is then taken from the [Content_Types].xml file contained in the xlsx file.
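As a rough illustration of that container step, the sketch below opens a ZIP and looks inside [Content_Types].xml for a main content type. It is a simplification, not fido's implementation: `identify_ooxml`, `make_sample`, and the substring table are hypothetical, the real container signatures use byte sequences from the container signature file, and the PUIDs are simply the ones Siegfried reports above.

```python
import io
import zipfile

# Assumed mapping from a substring of [Content_Types].xml to a PUID.
# The substrings are a simplification of the real container signatures.
CONTENT_TYPE_TO_PUID = {
    b"wordprocessingml.document.main+xml": "fmt/412",
    b"presentationml.presentation.main+xml": "fmt/215",
    b"spreadsheetml.sheet.main+xml": "fmt/214",
}


def identify_ooxml(data):
    """Return a PUID for OOXML bytes, or None if nothing matches."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            content_types = archive.read("[Content_Types].xml")
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP at all, or a ZIP without a [Content_Types].xml member.
        return None
    for needle, puid in CONTENT_TYPE_TO_PUID.items():
        if needle in content_types:
            return puid
    return None


def make_sample(main_content_type):
    """Build a tiny in-memory ZIP holding only a [Content_Types].xml."""
    xml = (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Override PartName="/xl/workbook.xml" ContentType="%s"/></Types>'
        % main_content_type
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", xml)
    return buffer.getvalue()


xlsx_like = make_sample(
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"
)
puid = identify_ooxml(xlsx_like)
```

The in-memory sample stands in for a skeleton file so the sketch runs without the zip shared by Ross.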

Do you have any advice on how to proceed with this?

@carlwilson carlwilson self-assigned this Jun 15, 2022
@carlwilson carlwilson modified the milestones: v1.6, v1.8 Jun 15, 2022
@carlwilson carlwilson modified the milestones: v1.8, OPF Hackathon 2023 Tasks Jun 19, 2023
@carlwilson
Member

Hackathon 2023 Review: Selected for initial tasks. @replaceafill, sorry to do this again, but you're already here. I suggest prioritising this over #94, as it's likely a quicker win.

@replaceafill
Contributor

@carlwilson if we remove these custom fido-fmt/... entries from format_extensions.xml to get fmt/189 for all the mentioned sample files, as explained by Ross and Ashley above, what would be an appropriate way to write a test for that?
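One possible shape for such a test, offered only as a sketch: parse the CSV lines that fido prints and assert on the PUID column. The helper `puids_by_file` is hypothetical, the column order is assumed from the output shown earlier in this issue, and the sample text is the output Ross pasted rather than a fresh run. How the output is captured from fido is left out.

```python
import csv
import io


def puids_by_file(fido_output):
    """Map each reported file path to the PUID on its OK line.

    Assumes the third column is the PUID and the seventh is the file
    path, as in the output pasted earlier in this issue.
    """
    results = {}
    for row in csv.reader(io.StringIO(fido_output)):
        # Banner and summary lines have too few fields or no OK status.
        if len(row) >= 7 and row[0] == "OK":
            results[row[6]] = row[2]
    return results


# Output copied from the run with the fido-fmt entries removed.
SAMPLE_OUTPUT = """\
FIDO v1.3.12 (formats-v94.xml, container-signature-20180920.xml, format_extensions.xml)
OK,150,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",14860,"sampledocx.docx","None","signature"
OK,8,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",47215,"samplepptx.pptx","None","signature"
OK,9,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",12050,"samplexlsx.xlsx","None","signature"
FIDO: Processed      3 files in 206.92 msec, 14 files/sec
"""


def test_ooxml_samples_get_generic_puid():
    results = puids_by_file(SAMPLE_OUTPUT)
    assert set(results) == {"sampledocx.docx", "samplepptx.pptx", "samplexlsx.xlsx"}
    assert all(puid == "fmt/189" for puid in results.values())


results = puids_by_file(SAMPLE_OUTPUT)
test_ooxml_samples_get_generic_puid()
```

A test written this way only checks fido's printed output, so it would not depend on the internals of match_formats.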

@carlwilson
Member

That's a good question, @replaceafill, and one I'm a little too busy to think about right now. Feel free to have a think and suggest something; if not, I'll give this some serious thought in the week starting 31/7.

@carlwilson carlwilson modified the milestones: OPF Hackathon 2023 Tasks, v1.8 Sep 9, 2024

4 participants