Archiving FLEx Data

Any discussion about archiving needs to set out three things:

What is it that is being archived?
What does it mean to "archive", as a process or activity?
What is an archive?

The academic discussion around these three things is all over the map, mostly because funding seekers are trying to meet the requirements of the institutions offering funding. The following discussion provides a road map and a discussion of all three elements listed above.

What is being archived?

FLEx is many things to many people. Some use the software to maintain a list of words in the language being studied. Some only use the parsing feature to parse texts. Others use FLEx to generate dictionaries, and yet others use it to store texts as if it were the final location where those texts belong. So when we talk about archiving FLEx data we need to talk about it broadly, as if the FLEx user were using it to store words, texts, relationships among words, relationships between words and texts, audio files, video files, and photos, along with the relationships of those media to the words or texts.

While I suggest that FLEx databases are more than just dictionaries in database format, many of the questions I have encountered from SIL users of FLEx who are looking to archive FLEx databases have related to FLEx as a lexical resource. Therefore I want to look for a moment at the dichotomy of lexical resources.

1. All Lexical Resources

Linguists, archivists, and data wranglers all think in different terms (worldviews). It takes an active act of translation to communicate the semantics, concepts and implications from one community to another: for instance, communicating to linguists what it means to archivists to archive a lexical resource, or communicating to data wranglers what resources are in the archive and how they are constructed. It is important to have this conversation so that, as archivists and as linguists (submitters), we know where in the spectrum of lexical resources FLEx and Toolbox data sets fit.

Why do we need a typology of resource types?

Marketing of products (on the behalf of linguists and economic partners, including the matching of resources to SIL's services).
Communication between submitters and archivists.
Effective matching of metadata to other metadata systems.
The longevity (Data maintenance strategy) and upgrading of data formats. (Because these strategies are dependent on the data formats and the file types and the data resource types.)

The audience of the archive (and of LSDev too), namely linguists, often thinks in terms of "the dictionary" or "the wordlist". This is an end-product orientation. It has been well argued (by Steve Echerd, Gary Simons and others) that SIL would rather see its staff oriented towards "the lexical database", because that is the source. From such a source multiple end products can be produced. While it would be ideal to flip a switch and change these linguists' orientation, such a switch does not readily present itself. Therefore, the SIL services which deal with lexical data must be clear, persuasive and educational in their communications.

The distinction between a lexical data set and an output end product like a "dictionary" must also be clear in the archiving records. That is, on the back end of the archiving service, the architecture and organization of the archive record need to reflect the derivative product relationships. Since there can be multiple derivative products per lexical data set, in a DSpace architecture it seems that both objects should be items connected by a relationship "is_derivative_of". So, a dictionary item is_derivative_of the lexical data set item. This is discussed in section 1.3 below.
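The relationship described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ArchiveItem`, `relate` and `derivatives_of` are hypothetical names invented here, not part of the DSpace API, and the item IDs and titles are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveItem:
    """A hypothetical stand-in for an archive item; this is not the DSpace API."""
    item_id: str
    title: str
    resource_type: str  # e.g. "lexical data set" or "dictionary"
    # Maps a relationship name to the IDs of the items it points to.
    relations: dict = field(default_factory=dict)

    def relate(self, name, target):
        """Record that this item stands in the named relationship to target."""
        self.relations.setdefault(name, []).append(target.item_id)


def derivatives_of(source, items):
    """Return every item that declares is_derivative_of the source item."""
    return [
        item for item in items
        if source.item_id in item.relations.get("is_derivative_of", [])
    ]


# Invented sample items: one data set and two end products derived from it.
dataset = ArchiveItem("item-1", "Example lexical data set", "lexical data set")
print_dict = ArchiveItem("item-2", "Example print dictionary", "dictionary")
learner_dict = ArchiveItem("item-3", "Example learner's dictionary", "dictionary")

print_dict.relate("is_derivative_of", dataset)
learner_dict.relate("is_derivative_of", dataset)

for item in derivatives_of(dataset, [dataset, print_dict, learner_dict]):
    print(item.title)
```

Keeping the link on the derived item, rather than on the data set, means a new dictionary can be added later without editing the record of its source.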

Linguists (especially American linguists) are trained deconstructionists. This means that one of their first questions is, "What do you mean by 'lexical database' or 'lexical data set'?" Clarifying for them what we mean is important as we strive to provide clear services to this class of consumers.

1.1 Resource Types

A consistent typology of lexical resources is challenging for several reasons. One of those reasons is that lexical resources are usually at the apices of several intersecting continuums. Some of these continuums are presented below.

Wordlists	Encyclopedic entries
Monolingual	Poly-lingual
Print	Non-Print (Oral)
Single mode (i.e. textual only)	Multi-mode (i.e. text + audio, images, video)
Physical	Digital
Edited	Non-Edited
Single Author	Collaborative Production
Single IP	Multiple IP
Corpus Based	Non-Corpus Based

Beyond these continuums there is also purpose, both of the data collection and of the output product. It is in this purpose that the interactive ideal is established (linguists, like many other classes of individuals, often leave this idea unstated). What do I mean by purpose? If we take the dictionary as an example, then there is the "Learner's dictionary", the "Bi-lingual dictionary", the "Picture dictionary", the "Domain specialist dictionary", etc.

1.2 Database Types

Beyond the description of the thing-ness of lexical databases using the continuums above, there is the technical description of the database. We can talk about character encodings (UTF-8, UTF-16, etc.), and we can also talk about the description of "the thing" by the application which we used to create "the thing". So it might be a ToolBox database or a FLEx database, etc. But even within these descriptions there are issues, like database schemas or customizations, which need to be documented if we are going to think about passing our data on to other users.

1.2.1 "The Things"

So, what is this "thing" that we (linguists) actually need to submit to the archive, or that we (archivists) need to expect from linguists?

In a complete Toolbox project file one should expect to find the following.

Some-zipped-toolbox-project.zip
├── .typ - File defining the database structure
├── .lng - File defining the language encoding
├── .prj - Project file
└── Datafile - with one of the following file types
    ├── .db
    ├── .dic
    ├── null - meaning no file ending
    ├── .txt
    └── .xml
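An archivist receiving such a zip file could check it against the listing above before accepting it. The following is a minimal sketch using only the Python standard library; the function name `missing_components` and the sample file names are assumptions made for illustration, and the extension lists are taken directly from the listing.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Extensions taken from the listing above; "" stands for a data file with no ending.
REQUIRED_SUFFIXES = {".typ", ".lng", ".prj"}
DATAFILE_SUFFIXES = {".db", ".dic", ".txt", ".xml", ""}


def missing_components(archive):
    """List the expected Toolbox components that a zip archive lacks."""
    with zipfile.ZipFile(archive) as zf:
        suffixes = {
            PurePosixPath(name).suffix.lower()
            for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
        }
    missing = sorted(REQUIRED_SUFFIXES - suffixes)
    if not suffixes & DATAFILE_SUFFIXES:
        missing.append("data file")
    return missing


# Build a small in-memory zip that lacks the .lng file, then check it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("project/example.prj", "")
    zf.writestr("project/example.typ", "")
    zf.writestr("project/lexicon.db", "")
buffer.seek(0)
print(missing_components(buffer))  # ['.lng']
```

The check looks only at file endings, so any file without an ending is accepted as a possible data file. It says nothing about whether the files are valid or belong together.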

In a complete FLEx 6 and previous project file one should expect to find the following.

-- Some tree of files and what those files represent or include and why

In a complete FLEx 7 and Newer project file one should expect to find the following.

-- Some tree of files and what those files represent or include and why

In a complete FLEx 8 and Newer project file one should expect to find the following.

-- Some tree of files and what those files represent or include and why

1.2.2 Are the same "Things" Equivalent?

Inter-version non-equivalence.

It follows, then, that as we look at various databases (for instance a FLEx database) as produced by various versions of software (for instance FLEx 6 vs. FLEx 8), the thing-ness of the digital object changes. From a reusability standpoint, this means that the things are different. Notice that I am not talking about users changing the data in their databases over time; rather, I am talking about the technical composition of the object. This variation suggests that the archive should have some method of grouping like "things" together. So, one should be able to get a report on all the "FLEx 6" databases or all the "FLEx 8" databases.
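Such a report is simple to produce once the creating application and its version are recorded with each item. Here is a minimal sketch; the record layout, the field names, and the sample version numbers are all invented for illustration and do not reflect the archive's actual metadata schema.

```python
from collections import defaultdict

# Hypothetical catalogue records; the field names and values are illustrative only.
records = [
    {"id": "item-1", "application": "FLEx", "version": "6.0"},
    {"id": "item-2", "application": "FLEx", "version": "8.0.3"},
    {"id": "item-3", "application": "Toolbox", "version": "1.5"},
    {"id": "item-4", "application": "FLEx", "version": "8.1"},
]


def group_by_major_version(items):
    """Group item IDs under labels such as 'FLEx 8'."""
    groups = defaultdict(list)
    for item in items:
        major = item["version"].split(".")[0]
        groups[f'{item["application"]} {major}'].append(item["id"])
    return dict(groups)


report = group_by_major_version(records)
for label in sorted(report):
    print(label, report[label])
```

Grouping on the major version is a design choice made for this sketch. If the technical composition of the object also changes between minor releases, the grouping key would need to be finer.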

Same version non-equivalence.

A second level of non-equivalence exists and may not be obvious to people who do not use the applications (especially archivists). To this point in the discussion we have been talking about FLEx and Toolbox databases and data sets as if they were only databases, or grids of words and their relationships to grammar and meaning. However, both applications can be used in multiple ways (and indeed are, by various linguists). Let me take FLEx as an example, because it is more familiar to me (though in our communications with linguists we should provide examples from both ToolBox and FLEx). A FLEx 8 database used by anthropologists may include texts, but rather than word-level annotations about meaning and grammar, it may hold a plethora of annotations on culture and anthropology, with very little marked in the database for grammar. This kind of FLEx database stands in contrast to the dictionary resource, which is mostly focused on grammar and meaning.

However, the example of the anthropologist using FLEx with texts points to a larger challenge when considering and categorizing the output of tools like FLEx and ToolBox. These tools are not just grids of words and meanings; they also have texts in them. I refer to these texts as bit-texts because they are in the written mode rather than in the oral or video mode. This pluralistic function of these resources is an important element to highlight and to make discoverable for linguists. In archiving terms it is as if the FLEx item contains other items which may not be archived independently. A FLEx database may have over 100 parsed and glossed bit-texts embedded inside the "FLEx database". Therefore a database built through a rapid word collection strategy is very different in content from a database built from bit-texts.

This is an important element to communicate when describing the nature of the archived database to linguists. It is also important for data transfer and for an archive's data preservation strategy. In the transition from FLEx 7.2.7 to FLEx 8 I have seen at least two discussions on the FLEx users group where data migration was botched because the texts were lost. The ability of FLEx to handle texts is also a point of critique by well-established Toolbox users. That is, some ToolBox users either don't understand the current power of FLEx to process (bit-)texts, or they don't understand how to move (bit-)texts processed in ToolBox to FLEx, or ToolBox really is more flexible in processing (bit-)texts than FLEx. Either way, both applications have bit-text elements as well as grid-like elements.

1.3 The Archive Record

As discussed above, the archive record needs to consider the dictionary as an item, but also the data used to create that dictionary. As we saw in 1.2.2, bit-texts may be a part of that foundation. I think the crucial question to ask is: Is a dictionary a lexical database? Are a lexical database and a dictionary the same thing? If they are not, should they be put in the same record (Item), or should they be independent items with a relationship connecting them?

Archive Institution
└── DSpace
    ├── Community 1
    │   ├── Collection 1
    │   │   ├── Item 1
    │   │   │   ├── Bitstream 1
    │   │   │   └── Bitstream 2
    │   │   └── Item 2
    │   │       ├── Bitstream 1
    │   │       └── Bitstream 2
    │   └── Collection 2
    └── Community 2

Once we have an answer to the question "Is a dictionary a lexical database, or are a lexical database and a dictionary different things?", we can move on to asking what each record needs to contain. In many respects this is like the existing package development going on in ILPT for training resources, and with respect to typesetters and the products and outputs they have.

1.3.1 What does an archive's catalogue entry for a dictionary need to look like?

Best practices for file archiving of dictionaries in SIL's Archive (or: what should the dictionary package include?)

All dictionaries should have a lexical database associated with them.
All dictionaries should have a PDF with them.
All dictionaries should have the cover or jacket PDF (if one was created, if not then a comment to that effect should be in the description).
All fonts and scripts used to format the lexical data into the PDF should be included.
All dictionaries should have a write up of which materials in the Lexical database were included in the dictionary and how this was decided.
All dictionaries with more than lexical content should include source files for those pages (portions) of the dictionary.
All dictionaries with images should include the original source images in this archive package.

1.3.2 What does an archive's catalogue entry for a Lexical Data Set need to look like?

Best practices for file archiving of lexical databases in SIL's Archive (or: what should the lexical database package include?)

All lexical data sets should have a write up explaining which custom fields are used and for what they are used. ****

All lexical data sets should have in their description the texts which are included in their texts portion. (These texts should also get their own item description.)

Not all lexical data sets have a dictionary output. All lexical data sets should have a .lift output (even though .lift does not capture everything in a FLEx data set; for example, LIFT does not include bit-texts).

All ShoeBox files should have_____ file ending
- A remark about SFM v.s MDF (the Schema used)
All ToolBox Files should have_____ file ending
- In all ToolBox files should be ______ components.
- A remark about SFM v.s MDF (the Schema used)
All FLEx databases should have_____ file ending
- All FLEx databases should have a remark about the FLEx version.
- What is included in a FLEx archived package?
- What is included in a FLEx back-up package?
- What is transferred to Language Depot? Is this the same as what is included in a FLEx Backup file?
  - How long is data on Language Depot kept?
  - Who owns the data on Language Depot?
  - What is the license of the Data on Language Depot?
  - Who has access to the files on Language Depot?
- Is Language Depot Use considered Archiving?

**** The guidance currently provided by the archive is confusing. As a surveyor, I could put all the words I collected into a FLEx database and, given my task goal, the database would be "complete"; an encyclopedic lexicographer, however, would not consider it complete. I feel that the single piece of guidance currently provided by the archive is trying to answer two separate questions.

The first is coverage. It should be a statistical feature of the application to determine how many headwords are in the lexical database, and then to look at those headwords and determine how many fields are used for each lexical item. If the database has 1500 items, and on average each item has 5 other fields with data in them, but across the database a total of 30 fields are used, with many of the remaining 25-odd fields used fewer than 10 times, then the database report should be able to quantify which fields are used what percent of the time, and give the complete list of named fields used. For instance: 1500 headwords, 1495 definitions, 1374 pronunciation fields, 1500 English glosses, 300 French glosses, 500 example sentences, etc.

The second is review and accuracy, because coverage is only one metric of "completeness". If only 300 of those 1500 items have been reviewed by a second speaker or a lexicography consultant, then that is a separate part of the report, and it needs to be treated separately in the instructions to those archiving lexical databases. By adding a stage meter at the entry level in the architecture of FLEx, we could then easily query the "average" stage of completeness, or graph the state of the data set by completeness: 300 entries consultant reviewed, 400 entries verified by more than one speaker, 800 entries in initial draft stage.
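The two measures in this note, field coverage and review stage, can be computed separately, as in the following sketch. It is not a FLEx feature or API: the entries are plain dictionaries, and the field names (`gloss_en`, `stage`, and so on) and stage labels are hypothetical, chosen only to mirror the examples in the note.

```python
from collections import Counter

# Hypothetical entries; the field names and the "stage" values are illustrative.
entries = [
    {"headword": "a", "definition": "...", "gloss_en": "...", "stage": "consultant reviewed"},
    {"headword": "b", "gloss_en": "...", "gloss_fr": "", "stage": "initial draft"},
    {"headword": "c", "definition": "...", "gloss_en": "...", "stage": "initial draft"},
    {"headword": "d", "gloss_en": "...", "example": "...", "stage": "verified by second speaker"},
]


def coverage(items, ignore=("stage",)):
    """Count, per field, how many entries hold a non-empty value."""
    counts = Counter()
    for item in items:
        for name, value in item.items():
            if name not in ignore and value:
                counts[name] += 1
    return counts


def stage_summary(items):
    """Count entries per review stage."""
    return Counter(item.get("stage", "unknown") for item in items)


total = len(entries)
for name, count in coverage(entries).most_common():
    print(f"{name}: {count}/{total} ({count / total:.0%})")
print(dict(stage_summary(entries)))
```

Reporting the two counters side by side keeps a well-filled but unreviewed database distinguishable from a sparse but carefully checked one.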

A second thing to think about is data licensing --- talk here about the onion model

===Data maintenance strategy===
All Shoebox, ToolBox and FLEx databases should be archived once a year, at the project's end, and prior to conversion to another format or version (for example, conversion to a FLEx database).

All data conversion should first be attempted by the active project. All data from inactive projects should be updated annually, in step with the release cycles of newer versions of FLEx. This could be scripted and conducted as a collaboration between the SIL Archive and the SIL Lexicography Data Conversion Service.
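As a rough illustration of how the archiving triggers above might be scripted, the sketch below flags projects that are due for a fresh archive copy. The record layout, the `pending_event` values, and the sample dates are assumptions invented for this example; no existing SIL service is being described.

```python
from datetime import date, timedelta

# Hypothetical project records; names, dates and event labels are invented.
projects = [
    {"name": "project-a", "last_archived": date(2013, 1, 15), "pending_event": None},
    {"name": "project-b", "last_archived": date(2013, 11, 2), "pending_event": "conversion"},
    {"name": "project-c", "last_archived": date(2014, 3, 20), "pending_event": None},
    {"name": "project-d", "last_archived": date(2014, 2, 1), "pending_event": "project_end"},
]

# Events that call for an archive copy regardless of age.
ARCHIVE_EVENTS = {"conversion", "project_end"}


def due_for_archiving(items, today, max_age=timedelta(days=365)):
    """Return names of projects archived over a year ago or facing a trigger event."""
    return [
        item["name"]
        for item in items
        if item["pending_event"] in ARCHIVE_EVENTS
        or today - item["last_archived"] > max_age
    ]


print(due_for_archiving(projects, today=date(2014, 5, 20)))
# ['project-a', 'project-b', 'project-d']
```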

Lexical content Browser

Anatomy of archived lexical data sets

In a complete toolbox project file one should expect to find the following.
.zip
├── Database structure - .typ
├── Datafile - with one of the following file types
│ ├── .db
│ ├── .dic
│ ├── null - meaning no file ending
│ ├── .txt
│ └── .xml
├── Language encoding file - .lng
└── Project file - .prj

Last thing

We are happy to help (or be helped as it were).

Below is the text we have been sending out.

Subject line

Lexical Database Archiving Questionnaire

Email Text Body

Last year the SIL Archive did an analysis of what kinds of materials are being submitted to SIL's Language and Culture Archive. During the 2012 year only 4 FLEx data sets were submitted to the archive (and about 6 Toolbox datasets). The archive is looking to see if this is a broader trend among linguists (and more generally lexical database users) or if it is unique to SIL contexts.

We are particularly hoping for responses from people who work with minority languages, but all lexicographic database users are welcome to respond to the questionnaire. We are asking 4 questions and we estimate the whole thing takes about three minutes to complete:

If you are a creator or user of lexical databases (like Toolbox, FLEx, or Lexus, etc.):
Please take a quick moment to fill out the following online questionnaire: http://bit.ly/19QSPMb

For those in a bandwidth-restricted situation, feel free to reply by email with answers to the questions below: Hugh_Paterson [ at ] sil.org

Four questions:

1. What is your Lexical Database Management solution? Options generally include: FLEx, ToolBox, Lexus, other:____, etc.:
2. What is the ISO code of the language you are using it with?:
3. Have you ever archived a version of your current Lexical Database at an official archive? (An archive like SIL's L & CA -REAP-, or SOAS's ELAR, or MPI's TLA, or PARADISEC. - Though it doesn't have to be one of these four.) - Yes / No:
4. Have you ever produced a Print or Digital Publication from your Lexical data (like a Glossary or a Dictionary)? If so, we would like to hear about it. Got a link or a citation?:
One entry per language. If you work with more than one language, feel free to submit one answer per language or add a comment to that effect with a list of the ISO 639-3 codes or language names.
Personal details (email address and name) will be kept confidential, other data and generalizations of trends may be published.

Thank you for the work you do and the effort you make to serve speakers of minority languages.

https://wiki.insitehome.org/display/~HUGH_PATERSON/2014/05/20/Most+fascinating+use+case+of+Lexical+Database+Archiving

Created by Paterson, Hugh

Unlike the title of this post, the following use case is not written in satire. It is a conversation I have had with PARADISEC, an archive in Australia. I have copied over the raw email transcripts. Note, the names have not been changed to protect the guilty. I hope that by sharing this dialog internally in SIL circles that critical players in SIL publicity, SIL communications strategy, and SIL product development, will come to understand that SIL needs to take a more active role in helping language archives understand the kinds of language artifacts that are being created by SIL's tools and methods, and how to effectively archive those artifacts - including any data migration strategies needed to update materials to make them useable with current tools (or current versions of the same tool).

An email to the collections manager of PARADISEC and to the depositor. The item in question has a creation date somewhere around 2010-2011.

Greetings,

I was browsing, the Papuan Languages Collection in which the following item is a part:
http://catalog.paradisec.org.au/collections/DD1/items/028
I noticed that some of the FLEx databases were turned over to the archive as .zip files. However, the files listed in the file list for the parts of the collection are .xml files. I know that FLEx can export .xml LiFT files, but I also know that FLEx can export "project backups". In the "project backups" there is sometimes more data including texts and parsing rules created by FLEx. Therefore these "project backups" can contain more information than just the LiFT file contains. Which was submitted to the archive? LiFT .xml files or the "export/ Project Backup"? and if it was "export/ Project Backup" does this mean that the archive (PARADISEC), "unziped" the backup file before putting the content in its archive?
thanks for the clarification,

Hugh Paterson III

Reply from the PARADISEC collections manager.

Dear Hugh,

This is really a question for the depositor so I hope Don will chime in. We prefer to have the most open and accessible version of the material we can. But, as we are not resourced to examine all content as it comes in, we rely on the depositor to create the best archival version of their materials. Zip and other compressed formats are not suitable for archiving, but text (including XML) is fine.

All the best,
Nick Thieberger

Reply from Don Daniels the depositor,

Hi Hugh and Nick,

The files I archived were old FLEx backups (made with version 6 or 7, probably), back when the program created a .zip file. When PARADISEC received them, they unzipped them and archived the .xml file that was inside. If there was any other information in the original .zip files, I don't know what happened to it. I'm actually preparing new databases for archiving, though, so this will all be moot soon enough. The new files will be created with FLEx 8, which creates its own file type (.fwbackup, I believe), and I don't expect those will be changed.

Best,
Don

Finally, my reply:

Thanks Don,

This is what I suspected had happened. I was actually looking for a use case for an upcoming discussion I am having with the lexicography and software teams at SIL. It is interesting how generally safe assumptions about .zip files being unsuitable for archiving actually can cause archivists to weaken the integrity of the objects entrusted to their archives. Not that the FLEx website, or SIL in general makes this point about .zip files being the most suitable format for archiving FLEx databases at all clear. My understanding is that there is indeed more in the .zip file of those older FLEx backup files. I think texts, parsing rules, and keyboard files could also have been in there.

As a FLEx 8 user, I trust you already realize that there is more contained in a ".fwbackup" than there is in the LiFT .xml exports. Again my understanding is that texts, parsing rules, media associated with the lexical entry, and keyboard files could also be in the .fwbackup but will not be present in the LiFT.xml export file (and the LiFT export file might even contain a limited set of the total entries based on any filters active when the export is made).

Anyways, thank you for clarifying this unfortunate, but hopefully, useful case. I plan on taking it back to SIL to start a much needed discussion about how archives (including SIL's own archive) can respect the "thing-ness" of the variations in FLEx databases when the application changes versions.

Hugh Paterson III

One interesting element of this discussion is that I have heard the SIL Language and Culture Director say that if an SIL team is using LanguageDepot, they don't need to archive their lexical database in REAP because the content is already on "SIL data stores". However, my understanding of LanguageDepot's limitations is that media over 1 MB are not uploaded to LanguageDepot, texts stored in FLEx are not uploaded, and parsing rules and keyboards are also not uploaded. In my mind this makes the content on LanguageDepot significantly different from the content of a .fwbackup file, different enough that it does not qualify as an "archived" version of the lexical database. Additionally, data permissions, which may include permissions from speakers, are not included in the content on LanguageDepot, whereas REAP has mechanisms for the licensing and permissions of data.
