specifying the persona #236
…cause the information is redundant with the flag.
Hi, do you have an example of when a persona would be useful? Personally, I think it would complicate the GEDCOM X model. Isn't the "confidence level" of conclusions already doing something like that? I thought, and I prefer, to have only one person and all of the "information" about it, right or wrong (using confidence level to make the difference). |
@stoicflame - would personas be N-tier? My use case in #232 might benefit from a tiered persona construct. A through D would then be personas, they would be gathered as AB + CD personas, and finally a top-level Person gathering the evidence with associated confidence levels. |
Now, I'm only an amateur (close to novice) genealogist, but as I do my best to understand the needs of professional researchers, they express the need for more fidelity with the genealogical research process. Part of the formality of this process is a separation between the act of gathering information and compiling evidence. The process of gathering information includes digging through all the potentially relevant sources and recording what they say. After you've gone through the process of gathering information, you compile evidence that supports your conclusions about a person. The process of gathering evidence includes a bunch of analysis of the information you've gathered. You do your best to determine whether a given information item is applicable as evidence, based on (among other things) where it came from and what it said. So that's a pretty quick-and-dirty overview, but the concept of a
Yes, but you're describing an activity that is pretty strictly in the "evidence" side of the world. I believe the model supports what you describe, but the concept of a
Yes, they are one of the basic "units" of the N-tier architecture.
Indeed, that's the idea.
Yes, although I would say that the act of identifying A and B as the same person is part of the conclusion-making process and therefore the thing that binds AB would be considered a "person" and not a "persona". |
Ok, that makes sense. I like this idea so far, I'll think on the consequences some more. |
Since it hasn't been mentioned yet, this is related to #149, #72, and #138, among several others that were discussed at excruciating length last year. As for the present proposal, it's seriously incomplete: If you flag a Person as a Persona, how then do you connect to a conclusional Person? Via a SourceReference? |
I'm not sure what you mean by "connect to". Do you mean "cite as a source"? If so, then yeah, via source reference. Or do you mean "make a conclusion that two personas are the same person"? If so, then via an |
I meant "make a conclusion...", but I'll wait for the changes to N-tier before commenting further. |
Thank you for the answer. I understand the need for evidence; it's just that I really don't see what the best way to model it is. So with this change it would be possible either to make conclusions directly from sources or to define information, is that it? I think the boolean choice is a good idea, so we can pass from information to conclusion, and conversely, quickly and without duplicating information if it is not needed. Did I understand correctly? Another question: why was there "extractedConclusions" in SourceDescription? Can't we find them with the SourceReference in each conclusion? |
I would like to attempt to restate the use case, requirements, and proposal, then perhaps re-solicit feedback. But first, I want to attempt to be strict in my use of a couple of words. I like to think of a source as a container of information (e.g. a death record source might include information about birth, death, burial, parents, etc.). When we identify information in a source as helpful to answering a question, the selected information becomes evidence in an answer to that question.
The Use Case
We often create digital representations of our sources -- things like image copies, extracts, abstracts, transcriptions, indexes, etc. Combined with appropriate software, these digital representations become useful in the research process (e.g., finding aids, mechanisms for sharing sources so that our work can be peer reviewed, etc.). One desired representation of the information in a source is a lineage-linked representation of the persons and relationships found there -- a "micro-tree" of sorts. Ideally, this lineage-linked data would be constructed using GEDCOM X entities --
As a researcher, I find sources and create digital representations of the information in those sources -- including lineage-linked representations. Alongside the representations of this information is the data about what I have concluded -- the conclusions that represent the result of correlating the evidence I've selected. My conclusions are also represented with GEDCOM X entities --
Requirement(s)
In exchanging data, it is required that the data representing conclusions be distinct from the data representing information in sources. It is also desired that the same model entities be used to describe both conclusions and their informational equivalents.
The Current Model
We designate the objects intended to represent information in a source as such by adding references to them to the
NOTE: Given a list of
The Proposal
@stoicflame has proposed that we remove the
NOTE: Given a list of
Comments
I understand the value of representing both information and conclusions using the same types of entities. I also believe that it will be important to distinguish the conclusions from the information. I do not think that either mechanism -- the current or the proposed -- makes distinguishing conclusions and information particularly easy. Some feel that pushing the marker to the
We could mark the top-level entities (e.g.,
What are some other ways we might look at these issues?
The reason for this constraint is that the personas are intended to represent the information in a single source. |
I hope you will indulge me as I have managed to be silent on this topic, which is of great interest to me, for many months! So here are a few sentences on my personal views about personas and related concepts. A persona is a record in a database. Its fields contain information extracted from evidence found in a source. A persona has only one source because it holds information extracted from a single item of evidence. When a researcher decides that two personas represent the same person a new person record is created that links to the two persona records. The two persona records are permanent records. They are never destroyed. They are not merged into the body of the new person record. The new person record does not need a source because the persona records already hold complete source information. The new person record does not need any fields at all, really, since it inherits things like name, gender, birth date from the personas. If there are conflicts in the data in the two personas then the preferred or chosen or even modified values can be added to the person records, which then take precedence over the values in the personas. The person record obviously represents a conclusion. In a sense that conclusion is the "source" of the person. Whereas personas need source references, a person should have a conclusion. When you set an attribute of the person record, say the person's name, which may be different in the different personas, you are making the conclusion that this is the better name for the person. What is described here is a two tier, binary system. There is no need to be binary. A person record may contain links to many persona records. This is obviously necessary as new evidence is found, and that evidence is codified into more persona records. And there is no need for the system to be limited to two tiers. Say you decide that two of your person records (each referring to multiple persona records) represent the same real person.
Two obvious approaches exist. First, all the personas from the two persons could be grouped together into one person record, replacing the two persons. Or a new person record could be created that refers to the two person records, adding a tier; the two person records are not modified in any way. There are advantages to both approaches. In the former things remain two tier. The persona level is always a codification of evidence, and the person level is always the codification of conclusions and decision making. And two tiers are simple and make good sense. In the latter case, the history of decision making is maintained. Each interior node in an n-tier tree keeps its own conclusion, so you end up with a "conclusion tree" that clearly shows how you made your decisions about who was who. Another advantage of the n-tier approach is its reversibility. You can undo decisions easily. All this depends on the idea that we decide we want to codify our evidence into persistent database records. I don't put a value judgement on that. I want to be able to do it, because it is how I do my own research and models how I view the research process. But others do just as well by only keeping the conclusion person records around, adding information from new sources directly to those conclusion records, which grow larger and larger as new evidence is found. |
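The two-tier arrangement described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Persona` and `Person`, the `overrides` and `conclusion` fields, and the `resolve` method are hypothetical and are not part of the GEDCOM X model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Persona:
    """Information extracted from exactly one source (hypothetical shape)."""
    source: str
    facts: dict[str, str]


@dataclass
class Person:
    """A conclusion that links personas and may override conflicting values."""
    personas: list[Persona]
    overrides: dict[str, str] = field(default_factory=dict)
    conclusion: str = ""  # prose explaining why the personas were linked

    def values(self, key: str) -> list[str]:
        """Every distinct value the linked personas give for `key`, in order."""
        seen: list[str] = []
        for persona in self.personas:
            value = persona.facts.get(key)
            if value is not None and value not in seen:
                seen.append(value)
        return seen

    def resolve(self, key: str) -> Optional[str]:
        """An override wins; otherwise inherit only if the personas agree."""
        if key in self.overrides:
            return self.overrides[key]
        found = self.values(key)
        return found[0] if len(found) == 1 else None


# Made-up sample data: two personas that disagree on the spelling of the name.
baptism = Persona("baptism register", {"name": "Jon Smith", "birth": "1743"})
burial = Persona("burial register", {"name": "John Smith", "birth": "1743"})
person = Person([baptism, burial], conclusion="Same parents and parish.")
person.overrides["name"] = "John Smith"  # the researcher's chosen value
```

The persona records are never modified here. The person only links to them and records the preferred value, which matches the "permanent records" idea above.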
In an n-tier system, which is the system I prefer, there is no need, in my opinion, for a tag to specify whether a person record is a persona record or a conclusion record. If a record has tiers below it, it must be a conclusion. If a record is a leaf in a tree (or a stand alone record) we WANT it to be a persona, but there is no way to require it to be. A persona record could be defined operationally as any person record with a source reference. But if you want a tag there's not too much to complain about from my point of view. I always prefer simplicity in a model, knowing that things always complexify enough! |
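As a rough sketch of the "operational" definition in this comment, the role of a record can be inferred from its position in the tree and from whether it cites a source. The `Record` type and the `classify` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A person record in an n-tier tree (hypothetical shape)."""
    label: str
    sources: list[str] = field(default_factory=list)
    children: list["Record"] = field(default_factory=list)


def classify(record: Record) -> str:
    """Infer a record's role from structure alone, without an explicit tag."""
    if record.children:
        return "conclusion"    # it has tiers below it
    if record.sources:
        return "persona"       # leaf or stand-alone record with a source reference
    return "undetermined"      # no tiers and no sources: structure cannot decide


leaf = Record("baptism entry", sources=["parish register"])
root = Record("John Smith", children=[leaf])
orphan = Record("imported record")
```

The `"undetermined"` branch is the case where structure alone gives no answer, which is one argument for an explicit flag.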
Thanks, @thomast73. I'm so glad you're there to help fill in my many gaps. And, thanks @ttwetmore for taking the time to expound on the n-tier model and particularly to compare it to the two-tier model. I'd like to say again that we intend GEDCOM X to be able to support an n-tier model to accommodate applications that implement such a model. I hope to be able to initiate a proposal at #149 within the next few days.
So what about person records that are "stand alone"? How would you be able to tell whether applications should treat such a record as an "information item" (i.e. the record shouldn't have more than one source and shouldn't be modified in such a way so as to conflict with what that source says) or a "conclusion record"? |
First off, I love the amount of detail that just showed up here tonight. To quote Hemingway, this is fascinating as obscenity. My comments:
Thanks, that helped a lot on forming my thoughts, diffuse as they may appear.
I very much agree that a "person" composed of "personas" should show all data from all contained "personas". Say I have five different personas that I consider the same person. It is then up to me and my software to sort e.g. their birth year data based on confidence and conflicts/lack thereof (say all except one say born in 1743, the last one says born in 1753). The method would be something like: sort events of type T by confidence > average numeric values/display values color-coded by strength of evidence. Obviously that is too complex to encode in a standard, that's just me talking idealism, but the idea is as @ttwetmore says: All data is always there, and it is up to the user & software to decide what is right and what is wrong - to show to the end user at first glance. Generally data will somewhat agree, and if they do not match up at all, the user is probably at fault, putting two obviously conflicting "personas" into the same "person" without any confidence check.
I suppose those could be classified by being 1 ref from an "actual" source - i.e., any person deriving itself from conclusions on other persons is not a persona. Any person derived from a piece of paper or a picture is a persona. |
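One possible reading of the confidence-based ordering sketched in this comment is shown below. The function name `rank_values` and the numeric confidence weights are assumptions made for illustration. Nothing here is defined by GEDCOM X, and the comment itself notes that such logic belongs in software rather than in the standard.

```python
from collections import defaultdict


def rank_values(claims: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Group identical values and order them by total confidence, highest first."""
    totals: dict[str, float] = defaultdict(float)
    for value, confidence in claims:
        totals[value] += confidence
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Five personas' birth years with made-up confidence weights in [0, 1].
birth_years = [
    ("1743", 0.8),
    ("1743", 0.6),
    ("1743", 0.7),
    ("1743", 0.5),
    ("1753", 0.9),
]
ranking = rank_values(birth_years)  # 1743 ranks first; 1753 is kept, not discarded
```

Every value stays in the ranking, so the software can still display the conflicting 1753 claim alongside the preferred one.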
If a person record has a source link it is a persona. If a stand-alone person record does not have a source reference it is either a "lazy" persona (user didn't bother to add source info) or it is an "old-fashioned" conclusion record (as in today's systems, in which case one hopes that at least some of the individual attributes/fields/properties within the record will have source references). If we do end up with applications that support the research process by using evidence based records (e.g., personas) and conclusion based records (e.g., "today's" person records), then at certain times the user interface will concentrate on personas (just the facts, ma'am), and sometimes on conclusions (showing the user the "roots of the person trees", with options to "dig deeper" into the facts). Where does a stand-alone person record fit into this UI scheme? I think the user might want to see these in both contexts. Certainly if the record has a source reference it is a fact and should be shown with them. Certainly if it doesn't have a source reference it should be shown with the conclusions. If I were writing such software I would also have a mode where I could see all and only the stand-alone records. The issue is probably whether to allow a stand-alone record to not have a source reference. One could imagine a very strict, research based application, that simply insists that all stand-alone records must be personas, and be done with it -- that's just the way it is. I would want a more flexible system that would allow stand-alones that could be "imported conclusion records." But wanting such flexibility might be another example of my desire to resist restrictions, even in places where they are the best way. But this is probably all moot. I have no real objection to the tag at all, other than my natural contrariness toward any kind of rule or restriction before it is fully considered. |
So that doesn't make sense to me. Why shouldn't we accommodate the notion of a "conclusion" person that cites sources? Most applications today (which are decidedly not n-tier) don't even provide much of a UX that allows a user to gather "information" in the form of a persona, instead just allowing users to put all their conclusions together and cite the sources they used. So maybe what you're saying is that a media type that enforced the n-tier architecture wouldn't need to have the Just to be clear, though: when I say "we intend GEDCOM X to be able to support an n-tier model," I'm not saying that we intend GEDCOM X to enforce an n-tier model. GEDCOM X needs to (also) accommodate implementations that allow "conclusion" persons to cite sources. Hence the need for a |
Ryan, Yes, I think the whole thing comes down to the "model" that an application supports. Today a person record (in a typical desktop or on-line system) is a conclusion record and it contains possibly many PFACTs (properties, facts, attributes, characteristics, traits), with the pfacts extracted from multiple sources, and each pfact "should" have a source reference to indicate where it was found. I believe this is the base model we are all comfortable with. Therefore it is a model that GEDCOM-X should support. When I said a conclusion record doesn't need source references, I was referring to conclusion records as they might exist in a two-tier or n-tier system, in which the conclusion records, instead of containing pfacts, contain references to persona records that have the pfacts and the source references. In such a research based system the conclusion records get their source references indirectly through their personas. This is a model that I hope GEDCOM-X will be able to support, and I am happy to see that you are supporting the idea. I agree with you that GEDCOM-X, if it is to be a generic model, must not restrict things to any particular genealogical software model. So a person record should be able to have a source reference that applies to all the pfacts in the record inclusively, and/or each pfact should be able to have its own source reference. This generic approach is the one I use in the DeadEnds model. I guess one of the reasons that I don't like the tag idea (which really is a fine idea, I just don't like it), is that it kind of admits to the world exactly what kind of a model the software is choosing to use. I think I'm just being dumb about this as that's not such a big deal. Maybe it is important, given that GEDCOM-X will be able to support different genealogical models, that there be tags to indicate the fact. |
And then what happens if you decide that the original two persons were wrong altogether? Consider Persona 1 and 2 referenced by Person A, and Persona 3 and 4 referenced by Person B. These are merged into Person C, referencing Person A and B. An astute researcher realizes the proper grouping is actually Person D referencing Persona 1 and 3, and Person E referencing Persona 2 and 4. (You better believe this will happen, especially in shared trees with novice researchers.) It seems like the only logical conclusion is to delete Persons A, B and C and instead provide Persons D and E. How do you retain any of the advantages of the N-tier model when it reduces back to 2 tiers the minute a genealogist makes a mistake and the tree of Persons must be re-arranged by a more careful researcher? ------ EDIT ------- I see from another thread that perhaps the right thing to do here is to modify Person A and Person B pointing to ALL the personas, including the wrong ones, and write a proof statement for each person that includes the conflicting evidence. If they end up being the same person, you can merge them into person C, and if that is the wrong thing to do you break that merge and retain the original two persons, with revised proof statements. |
@zappala, I agree with your analysis (it would be hard to disagree!). But I don't believe it applies to the 2-tier vs n-tier issue. N-tier only makes sense (to me) as a way to record the history of sequential decision making. As your example points out, when you have to undo earlier decisions and make fundamental changes to the structure of the records, you lose that history, in the sense that it is no longer present in the structure of the records. If it is important to you to record every decision, the ones that break down older decisions, as well as the ones that build up new decisions, you will have to find another way to do it. Notes in the person records come to mind. I don't wish to sound preachy about the n-tier approach. If it were available I think I would use it. I have used the same structure in a real-world, non-genealogical application that had to automatically join billions of persona records into 100s of thousands of person records (the algorithms used by the Zoominfo Company). Because of the sheer mass of data, and the need to solve the O(n*n) problems of comparing billions of records to billions of records, a multi-phase approach based on n-tiers (one tier per phase) was the only practical solution I could come up with. This is certainly not an argument that a similar structure is needed in a genealogical application that deals with orders of magnitude fewer records. There are enough analogies however to make it intriguing to think about. A two-tier approach will solve the problems of codifying research excellently. The real issue boils down to whether or not we decide that our genealogical databases should hold our evidence in some explicit record-based format, or whether we only wish to copy items of evidence directly from the source material to our conclusion records with no intermediary (e.g., persona) stage. |
Believe it or not, I can relate to that. I had the same concern (but didn't articulate it the same way) and pushed us to where we are today as @thomast73 articulated. But it turned out to be confusing, hard to explain, and added potential for data integrity violations. Hence the proposal here to just use a flag because it's clearer. |
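The regrouping scenario in this comment can be made concrete with a small sketch. It models each person as the set of personas it ultimately gathers and checks which of the old groupings survive the correction. The identifiers `P1` to `P4`, the dictionary layout, and the helper names are illustrative assumptions only.

```python
def leaves(tree: dict[str, list[str]], node: str) -> frozenset[str]:
    """Collect the personas reachable from `node`; ids absent from `tree` are personas."""
    if node not in tree:
        return frozenset({node})
    return frozenset().union(*(leaves(tree, child) for child in tree[node]))


def groupings(tree: dict[str, list[str]]) -> dict[str, frozenset[str]]:
    """Map every person in the tree to the set of personas it gathers."""
    return {name: leaves(tree, name) for name in tree}


# Original decisions: A and B each group two personas, C merges A and B.
before = {"A": ["P1", "P2"], "B": ["P3", "P4"], "C": ["A", "B"]}
# Corrected decisions: the personas are regrouped into D and E.
after = {"D": ["P1", "P3"], "E": ["P2", "P4"]}

old = groupings(before)
new = groupings(after)

# Old persons whose exact persona set still exists after the correction.
survivors = {name for name, members in old.items() if members in new.values()}
```

In this sketch `survivors` is empty: none of A, B, or C corresponds to a grouping in the corrected tree, while all four personas are still present. That is the loss of decision history the comment describes.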
Note that the changes in #149 include the changes here, so it's easier to review there. |
I continue to worry that the only marker being considered is in In some cases, we use information from a source to make a case, and the information is not directly associated with any To repeat myself...
I would add to this list For But what about an |
So if we're going to add a flag to more than just
Here's an idea: instead of distinguishing what resources are "information items", we could distinguish which resources are "working conclusions". Then the flag would be the inverse and might be more easily named...
Or what about:
What do you think? |
Let's step back a bit and think about what we're trying to model. What Thad is pointing out is that a proof argument may need to take into account a wide variety of evidence, some of which may not directly mention the historical person under discussion and therefore doesn't generate a Persona instance but nevertheless bears on determining the fact or perhaps just in writing a good biographical sketch. The BCG crowd advocates a tree of prose analyses culminating in a proof argument that is attached to one or more events and facts associated with a person. For the most part they also advocate doing that work outside of the genealogy database program, because none of those programs provide any support for recording the analysis. Is that where you want to go? |
How about a flag called evidence? |
How about using the extractedConclusion list on the SourceDescription like we already decided in #202? We've been around this block before. |
That is, indeed, an option. But I'd (personally) vote against it because I've had to explain it to too many people who get confused about it. After going through the rounds of explanation until they finally get it, their question is usually: why not just provide a flag? |
OK, but on Conclusion and perhaps called "abstracted"? Or are you still stuck on it applying only to Persons? Can we lose the extractedConclusion list on SourceDescription in exchange? |
Huh. Yeah. I guess I kind of like that. Anybody else? So @jralls I like your suggestion, but I still can't tell if you would prefer to just not have a flag. What's your preference? |
KISS. I like a flag that says "this conclusion is a verbatim abstract of the cited source" a lot better than a list on the source of "conclusions which are verbatim abstracts of this object" because the former is better data encapsulation: One shouldn't have to look at another object to get a complete description of the object at hand. But what is the motivation for such a tag? Does anyone besides Tom contemplate writing a program that makes use of the difference? Does it really add anything to an n-tier program? Tom doesn't think so:
|
The words "extract" and "abstract" already have meaning in the genealogical community. We have already received push-back on associating these names with the concept of a "verbatim" representation of information in a source. We tried to get around that concern by combining two words to form "extractedConclusions". Perhaps the flag could be "extractedConclusion"? Here are a few more attempts at a name...none of which I am truly happy with...just hoping to spur ideas:
|
"extractedConclusion" is less ugly than the others. My mac's Thesaurus produces the following synonyms for "abstract" I like 'digest'. |
I like the word "extracted", but to put a property named "extractedConclusion" on a data type called "Conclusion" seems redundant. I'd prefer just "extracted" so that the accessor would look something like "conclusion.extracted". |
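To compare the two marker designs discussed in this thread, here is a minimal sketch with the flag on the conclusion and the list on the source description side by side. The Python field names (`extracted`, `extracted_conclusions`) and both helper functions are assumptions for illustration; they mirror the names floated in the discussion, not a finalized GEDCOM X API.

```python
from dataclasses import dataclass, field


@dataclass
class Conclusion:
    id: str
    sources: list[str] = field(default_factory=list)  # ids of cited source descriptions
    extracted: bool = False  # proposed flag; its name was still under discussion


@dataclass
class SourceDescription:
    id: str
    # List-based marker: ids of the conclusions extracted from this source.
    extracted_conclusions: list[str] = field(default_factory=list)


def is_extracted_by_flag(conclusion: Conclusion) -> bool:
    """With the flag, the conclusion alone answers the question."""
    return conclusion.extracted


def is_extracted_by_list(
    conclusion: Conclusion, descriptions: dict[str, SourceDescription]
) -> bool:
    """With the list, every cited source description must be consulted."""
    return any(
        conclusion.id in descriptions[source_id].extracted_conclusions
        for source_id in conclusion.sources
        if source_id in descriptions
    )


persona = Conclusion("c1", sources=["s1"], extracted=True)
catalog = {"s1": SourceDescription("s1", extracted_conclusions=["c1"])}
```

The flag version reads as `conclusion.extracted`, which is the encapsulation argument made above: no second object is needed to describe the one at hand.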
Doesn't 'extracted' elicit the same objections as 'extract' and 'abstract'? |
I like some of the dictionary definitions I saw for this word, and thought they were applicable. But without reading those definitions, I did not see an immediate connection. So I worry about adopting this name.
So...I went back and looked for the original objection...and found it here? If this was it, it was not raised exactly like I remember it, so I apologize. And despite the objection, the "extractedConclusions" name was eventually adopted. @stoicflame argues that the "conclusion" part of the name was more meaningful when its context was the At this point, I guess I would also lean toward using the "extracted" name. |
With "here" being in #202, so we are indeed revisiting last year's work. Rather different from the usual meaning of "pushback" in this forum, where you usually mean the anonymous group of "outsiders" who occasionally veto consensus arrived at after (sometimes weeks) of discussion in an issue.
Well, Ryan exercised his executive privilege and committed a change with that in it; it wasn't because of anything resembling consensus. In any case the objection was something of an aside in a long and rather circuitous discussion. With that cleared up, "Extracted" is fine with me. But to repeat an earlier question, is this replacing the extractedConclusion list on SourceDescriptions? |
Yes. That is the current plan. |
…tion*__ to conclusions.
I have tried to update the specification to reflect the current state of the proposal. |
Looks good, but there are still a couple of references to "persona" that should be redone. |
See d28209d, which formally defines the "persona" concept using the "extracted information" concept. (I think there is still value in formalizing the notion of "persona", even if it's just for convenience.) |
OK. I don't see why we need a special term for it, but it's harmless. "Encloses" sounds odd. What was wrong with "contains"? While I'm nit-picking wording, how about "Extracted Conclusion Constraints" instead of "Extracted Information Constraints"? The constraints are on Conclusions, and "information" isn't a defined term in the spec. |
+1 I'd like to have @thomast73 comment on that, though. |
One of the purposes for adding this flag to the GEDCOM X model is to add a provision in the model for what is usually termed "information" in the Genealogical Proof Standard (GPS) literature—see Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace, 2d Ed. (Baltimore, Maryland: Genealogical Publishing Company, 2009), 24. Given this, and after trying to get the constraints written, etc., I would actually like to modify the proposal so that the flag is called |
The GPS itself is a bit loose about information, data, and evidence, using all three in the same bullet-item without any distinction about which means what. She goes on to define evidence as "our interpretation of information we consider relevant to the research question or problem" [again, italics hers], and to explain direct, indirect, and negative evidence. Two pages later, she presents the "five essential parts" of a proof argument, where the third is "presentation of evidence, supported by thorough source citations and analyses" and the fourth "explicit discussion of any conflicting evidence". ISTM you want to use 'evidence' here (as in |
Well said, John. I think what Thad's trying to express is that he'd like to use the term "information" so that those who use the vocabulary as you've explained it can more easily identify how those concepts are supported in GEDCOM X. I think Thad would like to identify the "information" and then refer to that information as "evidence" from the (working) conclusion. So, the piece that's not on the table yet is a new (forthcoming) proposal to introduce a new concept called something like "evidence reference" which is used to refer, for example, to personas from persons instead of using the For my comments, I could get behind the name |
It's a better name, but it's not really a new concept, is it?
That's the point, actually, and is why using "information" isn't appropriate. Information is abstract; as soon as you make concrete bits of it it becomes evidence. There's another viewpoint buried in here, though, and that's that we're writing a spec for programmers, not genealogists, and the name of the object being modified is "conclusion", not "evidence" or "information". In OO speak, Evidence subclasses Conclusion, and to my mind that's conceptually clearer and therefore easier to explain than applying constraints to Conclusion regardless of how you label them. |
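The "Evidence subclasses Conclusion" framing might look like the following in code. This is a hedged sketch only: the class names follow the wording of the comment, and the single-source check is an assumption drawn from the persona constraint mentioned earlier in the thread, not from the specification text.

```python
from dataclasses import dataclass, field


@dataclass
class Conclusion:
    """Base type: may cite any number of sources."""
    sources: list[str] = field(default_factory=list)


@dataclass
class Evidence(Conclusion):
    """A conclusion restricted to what a single source says (assumed constraint)."""

    def __post_init__(self) -> None:
        if len(self.sources) != 1:
            raise ValueError("evidence must cite exactly one source")


working = Conclusion(sources=["census 1850", "family bible"])
extracted = Evidence(sources=["census 1850"])
```

Here the constraint lives in the subtype rather than in a flag on the base type, which is the conceptual difference being argued.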
I am going to rescind my most recent addendum to the proposal. The distinctions between information and evidence in Evidence Explained are not as tight as the pundits (e.g., Tom Jones) are currently preaching, but I do not know where to find the current preaching in writing to be able to discuss and cite it in a public forum. I think the |
…ing another concept in the conceptual model that must be defined.
Conflicts: specifications/conceptual-model-specification.md
Your comments are invited on the attached changes to the specification which define the notion of a persona and provide a way to indicate data that is identified as a persona.