Two Genealogy Models

Overview

This document briefly describes the Record and Conclusion Models within GEDCOMX. It also explains why two models are needed, how they differ, and why they are kept separate.

Background

For decades, GEDCOM has remained the de facto standard for communicating genealogical data between people and systems. Although the format has significant limitations, it has been adequate enough to persist despite various attempts to replace it. GEDCOM was the first standard. Prior to the Internet, people would perform painstaking research and (even more painfully) attempt to put their evidence and conclusions together into GEDCOM files. They had little choice besides:

1) Manually searching and finding historical evidence at libraries, archives, and other institutions.
2) Entering Conclusions based on the evidence they found
3) Attempting to document their Evidence in a GEDCOM file by:
   a. Referencing the evidence with a bibliographic citation, OR
   b. Photo-scanning the evidence and embedding the content (there was nowhere else to put these), OR
   c. Both
4) Copying their file to another floppy disk
5) Giving the copy to anyone who cared

With the emergence of the web, email replaced steps 4 and 5 above, but there were still two problems with:

1) Conclusions - Since there wasn’t a common repository of conclusions, people weren’t aware of research completed by others, so work was being duplicated over and over by people related to the same ancestors.

2) Evidence - If you only had a bibliographic citation, you still couldn’t see an adequate representation of the evidence without either:

   a. Driving to the facility that kept the evidence to see it for yourself, OR
   b. Having a photo-copy manually sent to you somehow, either through email or postal service.

So, websites were built to attempt to solve these problems. Specifically, they provided:

1) Conclusions - A common database for everyone to store and share their conclusions, AND

2) Evidence - A massive warehouse containing representations of all the world’s historical documents!

This is where we are today, but unfortunately, we still have two problems with:

1) Conclusions - The idea of a common conclusion database is great, unless there are several of them. Today, anyone who wants to switch systems or use multiple systems must export (guess what?) a GEDCOM file from System A and import or upload it to System B. In today’s web-oriented world, this should be better facilitated with modern formats (e.g. XML, JSON) and modern technologies (e.g. web services).

2) Evidence - It will take decades for any single organization to complete the digitization of all the world’s historical records, if that ever happens. What do we do in the meantime? What if no single organization ever digitizes all the records itself? Also, some archives want to be the exclusive online repository for their own records so they can charge fees to cover their costs. How can data consumers easily integrate with these systems while still serving the interests of the producer?

These problems, as with other genealogical problems in the past, must be solved with new innovation.

The Solution

For Conclusions, the GEDCOMX Conclusion Model addresses the need to share conclusion data by modernizing the integration format and technologies used between various conclusion-based systems. GEDCOM’s replacement is essentially the GEDCOMX Conclusion Model.

For Evidence, there are significant integration barriers across disparate evidence repositories. Archives and genealogy companies need to share and reference each other’s evidence data in a common format to overcome these barriers. The GEDCOMX Record Model is intended to be this format. It allows for the common exchange of record data (including digital transcriptions and URLs of evidence images). At the same time, an archive can keep its digital images behind a “pay wall” and let the record data it has published to other systems drive traffic to its images. The GEDCOMX Record Model has no predecessor within the legacy GEDCOM format. It is a new model for sharing and referencing genealogical evidence, not conclusions, in a world of individual online archives.

These two models, while targeting very different domains, are intended to be compatible. Conclusions will naturally refer to evidence; so too, conclusion data should be able to reference, and even consume, evidence data. In this way, the Conclusion Model is aware of, references, and consumes the Record Model.
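As a rough illustration of the exchange described above, the following Python sketch shows an archive publishing record data (a transcription plus an image URL) as JSON, and a conclusion that only holds a reference to that record. The class names `PublishedRecord` and `Conclusion`, their fields, and the `archive.example.org` URLs are hypothetical and are not taken from the GEDCOMX models; this is a sketch under those assumptions, not the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PublishedRecord:
    """Hypothetical evidence payload an archive might share with other systems."""
    record_uri: str                  # assumed identifier for the record
    transcription: Dict[str, str]    # field label -> transcribed text
    image_url: Optional[str] = None  # may sit behind the archive's pay wall

    def to_json(self) -> str:
        # Serialize to a modern interchange format for sharing.
        return json.dumps(asdict(self), sort_keys=True)


@dataclass
class Conclusion:
    """Hypothetical conclusion that references evidence instead of embedding it."""
    statement: str
    evidence_uris: List[str] = field(default_factory=list)

    def cite(self, record: PublishedRecord) -> None:
        # Store only the reference; the image stays with the archive.
        if record.record_uri not in self.evidence_uris:
            self.evidence_uris.append(record.record_uri)


census = PublishedRecord(
    record_uri="https://archive.example.org/records/123",
    transcription={"name": "John Smith", "residence": "Ohio"},
    image_url="https://archive.example.org/images/123",
)
claim = Conclusion("John Smith lived in Ohio")
claim.cite(census)
payload = census.to_json()
```

Because the conclusion keeps only the record's URI, anyone following the citation is sent back to the archive that hosts the image.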

Why Not One?

Since all we’re talking about is genealogy, why not have one “genealogy model”? Both models need the same type of information; why can’t they be the same objects? Looking at the surface, these models have a lot of common terms. There are Relationships, Names, Facts, Dates, Places, Genders, etc. in both models. Additionally, the structures of these two models appear to be almost identical. For example, in both models, a Person(a) has Names, Facts, and Gender; a Relationship has Facts and references two Person(a)s; and a Fact has a Date and Place. These are the same domain, no?

No. Common terms do not equate to common domain. Name, Account, Dollar Amount, Tax, and Transaction Date are also common terms in accounting systems, but that doesn’t mean Accounts Payable, Accounts Receivable, and Payroll systems should share the same schema.

The Record and Conclusion Models are, semantically, trying to accomplish very different things. The Conclusion Model is trying to justify the conclusion that a person existed with associated information and specific relationships to other persons. The Record Model is documenting the existence of a record concerning some persona. If we define the domain as “genealogy” instead of “genealogical conclusions” and “genealogical evidence”, then we could try to put all these terms into one all-purpose, generic “genealogy model” to handle the wide breadth of use cases and processes within the industry. Combining these domains would make for a smaller, more generic model, but one that would confuse the “person existed” assertion with the “record existence” documentation. This would also increase implementation complexity: additional code would have to be written across multiple systems to differentiate between objects used for transcriptions and objects used for conclusions. These objects have very different requirements. In the context of software design principles, merging these objects would violate the “Single responsibility principle” and “Interface segregation principle” within SOLID object-oriented design. See http://en.wikipedia.org/wiki/SOLID_(object-oriented_design)

Here’s a list of some of the dichotomies between the two models:

| Record Model | Conclusion Model |
| --- | --- |
| Evidence-based | Conclusion-based |
| Generally, a snapshot in time | Across time |
| Static, with few re-edits | Dynamic, with many re-edits |
| Field class: is a transcription | Normalizable Interface: is a composite or conclusion of multiple transcriptions |
| Transcription-specific members of Field class: original, interpreted, label | Conclusion-specific members of Normalizable Interface: value |
| Record class | (nothing equivalent) |
| Can effectively model some non-person information | Not very conducive to non-person information |
| Records are self-contained; they don’t reference each other | Conclusions reference records and other conclusions |
| Do not have negative assertions; they only positively assert what’s on the record | Can have negative assertions |
| Age class | (nothing equivalent) |
| Facts exist on Records that are not at the Persona or Relationship level. For example, film/image number, page number, etc. | (nothing equivalent) |
| (nothing equivalent) | NameForm class |
| (nothing equivalent) | Name has one Primary Form and multiple Alternate Forms |
| Primary Fact and Principal Persona within a Record | (nothing equivalent) |
| Has DatePart, PlacePart, AgePart for identifying when Fields on an image are separated and when they’re combined | (nothing equivalent) |
| Can be used by Archives and Genealogy Companies | Only used by Genealogy Companies |
| Processes are more tailored to fast data-entry use cases. It is more valuable for users to key breadth over depth. That is, we’re more interested in data-entry users keying more records with fewer, high-quality fields such as name, date, and place, instead of keying fewer records that go deeper into lower-value fields. Entering how confident a user is in their transcription of each field, what they concluded, why they concluded it, etc. will give us deeper records, but significantly fewer of them. We’re more interested in helping people find the evidence image, not rationalizing the existence of the evidence data. Yes, this may be heresy to some genealogists, but it’s the most pragmatic approach for most organizations. | Processes are targeted to accurately capturing research and conclusions with high attribution, citation, and confidence support. Highly genealogically-sound processes are intended to be supported. |

The above table is based on version 0.10 of both models.
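A few of the dichotomies in the table above can be made concrete with a small Python sketch. The member names `original`, `interpreted`, `label`, and `value` come from the table; everything else (the `ConcludedFact` class, the `negated` flag, the `source_uris` list, and the example URLs) is an assumption made for illustration and is not the actual GEDCOMX class design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Record Model side: one field as transcribed; frozen because records are static."""
    label: str                         # what the form calls the field
    original: str                      # text as written on the record
    interpreted: Optional[str] = None  # the transcriber's reading, if any


@dataclass
class ConcludedFact:
    """Conclusion Model side (hypothetical): a researcher's claim, open to re-edits."""
    fact_type: str
    value: Optional[str] = None
    negated: bool = False              # True models "this did NOT happen"
    source_uris: List[str] = field(default_factory=list)


# A record only states positively what the document says.
birth_field = Field(
    label="Date of Birth",
    original="3 Jany 1850",
    interpreted="3 January 1850",
)

# A conclusion may assert a negative and may reference several records.
never_married = ConcludedFact(
    fact_type="Marriage",
    negated=True,
    source_uris=[
        "https://archive.example.org/records/123",
        "https://archive.example.org/records/456",
    ],
)
```

Merging the two classes into one would force every transcribed field to carry conclusion-only members such as `negated`, which is the kind of confusion the separation is meant to avoid.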

The challenge that these two models present is that, while very different, one must consume the other. To this end, we’ve attempted to keep the Record Model very consumable by the Conclusion Model. This partially explains why they have common terms and similar structures: not because they are inherently the same, but because they’re very different and we’re trying to make one mappable to the other. While maintaining mappability is crucial, providing it by keeping the process models (classes) of the two models in lock step would foster harmful complexity.
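One way to picture this one-directional mapping is the Python sketch below, in which a conclusion-side `Person` absorbs record-side `Persona` objects. The class shapes, the `absorb` function, and the example data are hypothetical and only illustrate the idea of the Conclusion Model consuming the Record Model; they are not the GEDCOMX definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Persona:
    """Record Model side: a person as described by a single record."""
    record_uri: str
    name: str
    facts: Dict[str, str]


@dataclass
class Person:
    """Conclusion Model side: a researcher's composite view of a person."""
    names: List[str] = field(default_factory=list)
    facts: Dict[str, List[str]] = field(default_factory=dict)
    sources: List[str] = field(default_factory=list)


def absorb(person: Person, persona: Persona) -> Person:
    """Map one persona into a person, keeping a reference to its record."""
    if persona.name not in person.names:
        person.names.append(persona.name)
    for fact_type, value in persona.facts.items():
        values = person.facts.setdefault(fact_type, [])
        if value not in values:
            values.append(value)
    if persona.record_uri not in person.sources:
        person.sources.append(persona.record_uri)
    return person


census = Persona(
    "https://archive.example.org/records/1",
    "Jno. Smith",
    {"birthplace": "Ohio"},
)
baptism = Persona(
    "https://archive.example.org/records/2",
    "John Smith",
    {"birthplace": "Ohio", "baptism_date": "1850"},
)

person = Person()
for persona in (census, baptism):
    absorb(person, persona)
```

The mapping runs in one direction only: `Persona` knows nothing about `Person`, so the record side stays simple while the conclusion side does the consuming.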

Industry Need and Best Practice

Most archives don’t care what a researcher concludes about their records. They present the evidence in the most consumable format (e.g. first-hand viewing, photo-copies, digital representations, etc.) and let the viewer judge for themselves. Archives only need systems and schemas that present the evidence in its most objective form. We must provide a simple model for them to participate with the genealogical industry and publish the world’s records to all in a common format. Archives have no need to host systems or schemas that allow people to build conclusion towers on top of their evidence.

For conclusion towers, we rely on conclusion systems hosted by genealogical companies. Fortuitously, many of these same companies also host their own evidence-based systems to support their conclusion systems, and we can look to these industry-leading organizations for insight into the generally accepted best practice for combining or separating these systems and models. Today, all industry-leading genealogical websites separate their conclusion systems from their evidence systems, while of course allowing conclusions to link to and reference evidence. Without enumerating them here, an entirely different set of processes, requirements, and use cases exists for transcribing a record than for documenting one’s genealogical research. These different use cases have independently brought the same evolutionary outcomes across multiple genealogy companies, thus objectively proving the need for two systems and two domains.

In other industries, such as within detective or investigative domains, there exists a similar division between the evidence system and the hypothesis/theory system. It’s simply not effective to mix the objective evidence (including recordings and measurements) with everything we’re hypothesizing based on that evidence.

Yes, in a purely academic sense, one can say even a transcription is a conclusion, particularly in cases where the original record creator used illegible writing; but this is more of an intellectually interesting stretch than an accurate portrayal of reality, since most hand-written records are quite legible and form-based, so names, dates, and places are clearly identifiable. Moreover, saying “transcriptions are conclusions” for records created with typewriters is downright silly, and it’s beyond ridiculous when referring to digitally-born evidence. The exceptional “bad handwriting” example shouldn’t redefine a domain at the expense of the majority case.

Summary

This document has attempted to explain the rationale behind the existence and separation of the Conclusion and Record Models within GEDCOMX. While our industry benefits from compatibility between the two models, they must still be developed within their own semantic domains. As with other industries, segregating domains is ultimately more effective in solving the unique needs of each domain.