Data Mining in Genealogy – John Ellingsworth

In this paper, I suggest that applying data mining to large data sets and repositories of genealogical information will enable historians and genealogical researchers to conduct more efficient and effective research. I also outline the potential benefits of data mining to both researchers and the organizations that support genealogical research efforts.

Background

Genealogical research usually begins with a small data set describing the relationships, history and origins of a particular family. In time, the amount of research material discovered and collected regarding the members of a single family can grow substantially. For some researchers, the interest in genealogy grows to the point that they build single-surname repositories and turn the research into a profession (Guild, 2008). With this increasing amount of data and information, new methods of discovery need to be considered, including the application of data mining techniques for seeking out previously undiscovered patterns and clusters of information.

Genealogical data repositories for a single family can quickly grow into the gigabytes once copies of original records, digital photographs, video collections, references, and various electronic documents are stored. Storage needs grow further still when a researcher takes on multiple surnames within a family, or collects information on every individual bearing a single surname. Additional data resources for the genealogist include existing electronic resources, online databases and repositories, OCR-compatible documents, and other digital archives. Going through all of this raw data and finding relevant information becomes increasingly difficult and time-consuming for individual researchers.

Data mining bears many similarities to statistical analysis and information extraction, and as a result it can be very useful in the analysis of large data repositories. In genealogical repositories, data mining can be used to extract information about previously unknown relationships, to infer implicit relationships within large collections of people, and to extract potentially useful information through an automated system (Elmasri, 2007).

Data Mining in Genealogy

For data mining to be effective, we must have specific goals for, or applications of, the data discovered. In the case of genealogical data, we hope to identify historic patterns, familial relationships between persons for whom no explicit relationship has been recorded, geographic patterns of distribution within family groups, and patterns specific to particular time periods. Identifying these classes of patterns will enable the researcher to spend less time manually searching through data sets, and more time verifying the relationships discovered through data mining.

Since genealogical data repositories often have some inherent structure, creating models for discovery is less cumbersome than it is for less structured data sets, and clusters within the data are easier to form. Data modeling is “the act of building a model in one situation where you know the answer and then applying it to another situation that you don’t” (Thearling, 2008). For instance, large genealogical data sets are often stored using the GEDCOM (GEnealogical Data COMmunication) standard. The GEDCOM standard was created “to provide a flexible, uniform format for exchanging computerized genealogical data” (LDS, 1996). This structured data model allows class labels to be created from the attributes within it, such as surname, birth place, and birth date. However, because genealogical collections are often disparate mixes of text documents, database sources, binary files, and so on, clustering techniques should prove to be a more reliable source of relevant information than hierarchies and decision trees.
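As a rough illustration of how class labels might be drawn from GEDCOM attributes, the following Python sketch reads a few individual records and tallies them by surname and birth decade. It assumes the basic GEDCOM 5.5 line layout (level, optional cross-reference, tag, value) and handles only the INDI, NAME, BIRT, DATE and PLAC tags. The sample records, the function name parse_individuals and the choice of labels are invented for this example and are not part of the standard.

```python
import re
from collections import Counter

# Hand-written sample in GEDCOM 5.5 line layout: LEVEL [@XREF@] TAG [VALUE].
# The people and places are invented for illustration.
SAMPLE = """\
0 @I1@ INDI
1 NAME John /Ellis/
1 BIRT
2 DATE 12 MAR 1850
2 PLAC Dover, Kent, Delaware
0 @I2@ INDI
1 NAME Mary /Ellis/
1 BIRT
2 DATE 3 JUN 1852
2 PLAC Dover, Kent, Delaware
0 @I3@ INDI
1 NAME Thomas /Ward/
1 BIRT
2 DATE 1881
2 PLAC Leeds, Yorkshire, England
"""

LINE = re.compile(r"^(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s+(.*))?$")


def parse_individuals(text):
    """Return one dict per INDI record with surname, birth year and birth place."""
    people = []
    current = None   # the individual being filled in, if any
    context = None   # the level-1 tag we are currently inside (e.g. BIRT)
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        level = int(match.group(1))
        xref, tag, value = match.group(2), match.group(3), match.group(4) or ""
        if level == 0:
            current, context = None, None
            if tag == "INDI":
                current = {"id": xref, "surname": None,
                           "birth_year": None, "birth_place": None}
                people.append(current)
        elif current is None:
            continue  # line belongs to a record type this sketch ignores
        elif level == 1:
            context = tag
            if tag == "NAME":
                # GEDCOM marks the surname with slashes: John /Ellis/
                surname = re.search(r"/([^/]*)/", value)
                current["surname"] = surname.group(1) if surname else None
        elif level == 2 and context == "BIRT":
            if tag == "DATE":
                year = re.search(r"\b(\d{4})\b", value)
                current["birth_year"] = int(year.group(1)) if year else None
            elif tag == "PLAC":
                current["birth_place"] = value
    return people


people = parse_individuals(SAMPLE)

# Two simple class labels derived from the parsed attributes.
surname_counts = Counter(p["surname"] for p in people)
decade_counts = Counter((p["birth_year"] // 10) * 10
                        for p in people if p["birth_year"])
print(surname_counts)
print(decade_counts)
```

A real GEDCOM file contains many more tags, continuation lines and date formats than this sketch handles, so it should be read as a starting point rather than a complete parser.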
Hierarchies and clustering are both well suited to mining genealogical collections. However, hierarchies and decision trees require specific models to be created in advance of data mining, which is not always feasible for the unstructured data often found in genealogy repositories. Clustering partitions data without a predefined training class; it groups similar records together and places dissimilar records in separate groups (Elmasri, 2007).
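To make the idea of clustering without a predefined training class concrete, here is a minimal Python sketch of single-linkage clustering cut at a distance threshold. The records, the dissimilarity measure (birth-year gap scaled by an assumed 25-year generation, plus a penalty for a different birth place) and the threshold of 0.5 are all assumptions chosen for illustration. Surname is deliberately left out of the measure, so a spelling variant can fall into the same group as the records it resembles.

```python
from itertools import combinations

# Invented records: (surname, birth_year, birth_place).
RECORDS = [
    ("Ellis", 1850, "Dover"),
    ("Ellis", 1852, "Dover"),
    ("Ellice", 1849, "Dover"),
    ("Ward", 1881, "Leeds"),
    ("Ward", 1884, "Leeds"),
    ("Ellis", 1921, "Chicago"),
]


def dissimilarity(a, b):
    """Hypothetical measure: a 25-year gap or a place mismatch each add 1.0."""
    year_gap = abs(a[1] - b[1]) / 25.0
    place_gap = 0.0 if a[2] == b[2] else 1.0
    return year_gap + place_gap


def cluster(records, threshold=0.5):
    """Group records so that any pair within the threshold shares a cluster.

    Linking every sufficiently close pair and taking connected components
    is equivalent to single-linkage clustering cut at the threshold.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if dissimilarity(records[i], records[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


clusters = cluster(RECORDS)
for group in clusters:
    print(group)
```

No group labels are supplied in advance; the groups emerge from the data and the chosen measure. Any relationship suggested this way would still need to be verified by the researcher against the source records.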

Data mining can also be used in the development of a genealogical data repository or warehouse to find meaningful patterns within existing data sets and information collections. By using data mining on small collections of data in the early stages, we can better define the elements that will structure a future data warehouse, where applicable. Data mining can also be used after the creation of a data warehouse to find different rules and patterns, since by then the data has been cleansed and transformed into the structure needed for analysis (Betz, 2006).

In genealogical repositories it is common for much of the data to exist in an unstructured format, such as a text paragraph in a PDF file. This free-form text is not part of the typical data mining environment, so data analysts must spend more time imposing some type of structure on the data before and after processing. Domain expertise will facilitate this process, as will the interpretative power of the researcher. Ultimately, discovering relational patterns unknown a priori may both improve extraction accuracy and uncover informative trends in the data (Betz, 2006).
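A very simple way to impose structure on free-form text is pattern matching. The Python sketch below pulls name, date and place fields out of sentences of one fixed shape. It is a hand-written rule for an invented passage, not the probabilistic extraction approach described by Betz et al., and real narrative text would need many more patterns or a trained model.

```python
import re

# Invented passage of the kind found in a family history narrative.
TEXT = (
    "John Ellis was born on 12 March 1850 in Dover, Delaware. "
    "His sister Mary Ellis was born on 3 June 1852 in Dover, Delaware."
)

# Matches only sentences shaped "<First> <Last> was born on <d Month yyyy> in <Place>".
BIRTH = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) was born on "
    r"(?P<date>\d{1,2} [A-Z][a-z]+ \d{4}) in "
    r"(?P<place>[A-Z][a-z]+(?:, [A-Z][a-z]+)*)"
)

# Each match becomes one structured row ready for further analysis.
rows = [match.groupdict() for match in BIRTH.finditer(TEXT)]
for row in rows:
    print(row)
```

Rows produced this way can then be fed into the same analysis as records that were structured from the start.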

Future Implications

The introduction of data mining into the realm of genealogical research opens new doors for businesses to develop new customer models that assist with research, and it allows individual researchers to leverage advances in storage capacity and computing power. Large organizations that have built up massive data collections, such as Ancestry.com, Rootsweb.com or The Church of Jesus Christ of Latter-day Saints (Familysearch.org), stand to gain the most economically from creating a data mining infrastructure. Because these organizations host user-contributed genealogical repositories, each of them could build web services that mine the vast collections of material they have accumulated, and charge a fee for access to and use of the data mining results.

Another area for further research is the contribution of data from social networking sites to the genealogical data repository. By leveraging the vast amount of information being published online in applications such as Facebook, MySpace or LinkedIn, genealogy researchers can mine the relationships and social contexts of living family members to develop an enhanced understanding of family dynamics.

Genealogical data mining creates a number of privacy concerns for the researcher or organization that wishes to leverage it as part of a business model. While many organizations and researchers use the tools available to them to hide important information about living persons, it is always possible that some part of the data has not been fully sanitized of personal information. This is an important consideration for organizations that intend to provide public access to genealogical data in general and to data mining services in particular.
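The kind of sanitizing step mentioned above might look like the following Python sketch, which blanks out anyone presumed to be living before records are shared or mined. The rule used here (no recorded death, and either no birth year or a birth less than 100 years before the current year) is an assumption for illustration. The record fields and the function name redact_living are likewise hypothetical and not taken from any particular genealogy tool.

```python
def redact_living(people, current_year, max_age=100):
    """Return copies of the records with presumed-living persons blanked out."""
    cleaned = []
    for person in people:
        no_death = person.get("death_year") is None
        birth = person.get("birth_year")
        # Err on the side of privacy when the birth year is unknown.
        recent_or_unknown = birth is None or current_year - birth < max_age
        if no_death and recent_or_unknown:
            cleaned.append({"id": person["id"], "name": "Living",
                            "birth_year": None, "death_year": None})
        else:
            cleaned.append(dict(person))
    return cleaned


# Invented records for illustration.
PEOPLE = [
    {"id": "I1", "name": "John Ellis", "birth_year": 1850, "death_year": 1912},
    {"id": "I2", "name": "Anna Ellis", "birth_year": 1975, "death_year": None},
    {"id": "I3", "name": "Ruth Ward", "birth_year": None, "death_year": None},
    {"id": "I4", "name": "Thomas Ward", "birth_year": 1881, "death_year": None},
]

public = redact_living(PEOPLE, current_year=2008)
for person in public:
    print(person)
```

A rule like this only covers structured fields; names and dates of living persons that appear inside notes or attached documents would still need separate review.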

A final concern for many genealogists who have labored to collect their data is the issue of ownership and copyright. Many researchers have spent countless hours compiling their data repositories and are very reluctant to part with their work. This creates a dilemma for large, user-populated repositories, which are otherwise ideal candidates for data mining. In these cases, it is best that both the researchers doing the data mining and the genealogical researchers who contributed the data exercise due diligence in keeping the data clean and its ownership clearly delineated.

References

Betz, Jonathan, Culotta, Aron, and McCallum, Andrew. Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text. 2006. http://www.cs.umass.edu/~culotta/pubs/culotta06integrating.pdf

Elmasri, R. and Navathe, S.B. Fundamentals of Database Systems, 5th ed. 2007. Addison Wesley.

Guild of One-Name Studies. 2008. http://www.one-name.org/

LDS. The GEDCOM Standard Release 5.5. 1996. Family History Department, The Church of Jesus Christ of Latter-day Saints. http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

Thearling, Kurt. An Introduction to Data Mining. http://www.thearling.com/text/dmwhite/dmwhite.htm