Mashups with lexical data
One of the commonly cited issues in publishing data, especially lexical data (and even more especially in the kinds of projects that SIL International becomes involved with), is the copyright and licensing of the volumes created from the work done to collect the lexical data. Most people become vocal about the volumes created from lexical databases because they are not familiar with concepts about licensing the data itself.
A word on the issues of licensing. While I was in the Americas Area there were several cases which raised legitimate questions about the ethics of licensing.
- As the story was related to me by Pat Kelley: there is a man who has a Waroni parent. He keeps asking Pat for the dictionary. Pat's sense is that he wants to print it and sell it to the Waroni. So the question is:
Is it ethical to give this man the PDF of the dictionary so that he can sell it to the tribe? If SIL is unwilling to "profit" off of the "product", is it even ethical to give the "data" to someone who will profit off of the data?
- There are several general points to note here. As a non-profit organization, should the organization take a position where it creates economic ventures in order to prevent certain economic activities related to the language and to products produced for the benefit of, or in the process of benefiting, the language community? Should the organization sit on the data if it cannot make viable products of high utility to the language community?
- Contrast the above example with the dictionary project that Doris Payne has been doing on a language in Africa. She has received U.S. federal grant money (NSF, I think) to conduct it and is conducting it partially under SIL and partially under the University of Oregon. There are several conflicts here. What is SIL's policy for handling data which is acquired with public funding? And what is SIL's policy for datasets collected when the person leading the project has dual allegiances, or belongs to agencies which could each claim that the individual is "working for hire"?
But both of these questions are really side issues, because Doris would say that copyright should belong to the consultants who helped build it, even though they were paid to build it. This view stands in opposition to the view that the dataset should belong to the language community[1]. All of these points of view have merit, and each often comes with its own set of ethical assumptions. None of these assumptions is, strictly speaking, vindictive towards ethnolinguistic minorities, nor are they in some way detrimental. The best way to summarize these various kinds of issues is that the people involved in language development have different economic models which they would like to see take root in the communities, and they are willing to bind their work (and the data created through that work) to a particular economic model.
The open source software movement and the free culture movement both have something to say which is worth hearing. Economic models which surround open source and free culture activities still exist. However, these economic models clearly delineate that certain parts of the product chain can or cannot be used for certain activities. Agreeing on the economic model gives various organizations the freedom to contribute to a product and its path of development, and then also to mutually benefit from the use of that same product[2].
The ideas of the open source movement can be applied to data. This happens when the product to be produced is not software but rather a data source or a database of content. Lexical data can be such a source of data. Many of the foundations which have been set up to steward open source software require individual contributors to sign a release of their contribution to the foundation[3]. This sort of activity makes a lot of sense if we look at how some data collection processes for lexical data work. For instance, the rapid word collection method is one where lots of people provide input to the creation of a lexical dataset. If we also take the FLEx send and receive model as a distributed model for dictionary dataset creation (i.e. via the web or syncing), then we need to look at who is contributing what to make the "dataset".
One way to look at these activities is segmentally, like an onion. The "headwords" are contributed by someone, then someone else contributes the definitions, and then a third party adds images, etc. It is possible to layer copyright per layer and then require Creative Commons licensing for all contributions to the project. In the end the whole is greater than the sum of its parts. In some ways this is the same model that software developers use when they license a library which gets included in an application.
Each part can be copyrighted by its contributor, and the whole dataset lives on because of licensing. The challenge is that Creative Commons does not cover datasets; it would cover the output products like a PDF or a book. This is why the OpenData license is needed: http://opendatacommons.org/ (same principles as Creative Commons but for data).
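The onion model described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Layer` and `Entry`, the field names, the placeholder contributors, and the allow-list of licenses are all hypothetical and do not describe any existing SIL or FLEx schema. The license strings are SPDX-style identifiers for the Open Data Commons licenses.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list: open data licenses a project might choose to accept.
ACCEPTED_LICENSES = {"ODbL-1.0", "ODC-By-1.0", "PDDL-1.0"}


@dataclass(frozen=True)
class Layer:
    """One onion layer of an entry: a single contribution with its own rights holder."""
    kind: str         # e.g. "headword", "definition", "image"
    content: str
    contributor: str  # copyright stays with this contributor
    license: str      # license under which the layer is shared


@dataclass
class Entry:
    """A dictionary entry assembled from independently contributed layers."""
    layers: list = field(default_factory=list)

    def add_layer(self, layer: Layer) -> None:
        # Refuse contributions that are not under one of the agreed licenses.
        if layer.license not in ACCEPTED_LICENSES:
            raise ValueError(f"unaccepted license: {layer.license!r}")
        self.layers.append(layer)

    def attributions(self) -> list:
        # One credit line per layer, so every contributor stays visible.
        return [f"{l.kind}: {l.contributor} ({l.license})" for l in self.layers]


entry = Entry()
entry.add_layer(Layer("headword", "headword-1", "Contributor A", "ODbL-1.0"))
entry.add_layer(Layer("definition", "definition-1", "Contributor B", "ODbL-1.0"))
entry.add_layer(Layer("image", "image-1.png", "Contributor C", "ODC-By-1.0"))
for line in entry.attributions():
    print(line)
```

The design choice here is that rights are recorded per layer rather than per dictionary, so the question "who contributed what?" can be answered from the data itself.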
In fact, I suggested using the onion model of licensing for AMA's involvement with the Rapashana project in South America. In the case of the Rapashana there was a lexical database which was created by an SIL team (unpublished and unarchived, if I remember correctly) and someone in the community wanted to "enrich" the dataset (a.k.a. "work on the dictionary"). So I asked the question: How is the enrichment going to be archived and fed back into the original dataset?
I also asked: is this person working for SIL as they "enrich the dataset"? The answer was "No, they are not working for SIL". (So, technically the Language and Culture archive cannot receive data from a non-SIL project or person, but an entity's archive can choose to archive it if they want to.) AMA did not find a licensing solution and ended up just giving the data to the individual. So my question is this. If this individual does publish a dictionary and it contains SIL's data, then the credit goes to that individual (I am not even concerned about the credit). But let's say that after they publish (and they copyright the work) someone else in the community wants to work on a non-copyrighted dictionary "for the community", and they write to SIL and ask for a copy of the SIL work. How does SIL prove that it still has first authorship rights and rights of distribution for the content in this "original" dataset?
In the Rapashana case there was a lot of resistance from the AMA Linguistics Services Coordinator to using Creative Commons licensing. (I am not sure exactly why. Is this more of an "SIL does not endorse CC or OpenData licensing" kind of thing, or was this an individual opinion and judgment call?)
It seems to me that SIL International will continue to have its hands tied, and be unable to efficiently and clearly communicate licensing issues, practices and service agreements to the Lexicography Service users, as long as various entity licensing practices exist. Entity directors are the ones accountable for intellectual property created and licensed under their administrative units. At the core of this licensing question is also the question: is the customer/consumer of the service a customer of SIL International, or are they the customer of the "local" entity? If SIL can clear up some of these challenges in operational practices, it would really help resolve some of the issues that Mike Cahill brings up that are challenging GPS. They look for guidance from the local projects partially because there is significant variation in operations.
Comment #4: this comment is in reply to the section on Ownership/Copyright
Who owns the dictionaries? (Copyright is the publisher's, but what rights do individuals and even communities have?) Does SIL have exclusive or non-exclusive rights to dictionaries compiled by SIL members? Is there a concern that language communities might perceive SIL as benefiting financially from their data? Is there a way to compensate language communities for sharing their data?
Before we can talk about ownership, I think we need to come to a common understanding of what we mean when we say "dictionary". At this point we are still talking in theoretical terms (both "lexicography service" and "dictionary" are open-ended ideas).
So, let me ask: What is the dictionary?
- Is it the technology which presents the data to the user of the data (FLEx, Toolbox, LexiquePro, Webonary, the website we build on SIL.org)?
- or is it the data being presented (an entry and/or a series of entries and their supporting data)?
- or is it the structures which are created to relate the data together in a database (the schema)?
- or is a dictionary the paper binding (a physical product)?
- or is it a digital product (PDF, Ebook, .epub, an app for a phone)?
- or is a dictionary the rules which bring together text strings to do spell checking in various kinds of third party applications?
- or is a dictionary a wordlist?
- or is a dictionary a lexical database?
- or is a dictionary a shoebox of index cards?
I think we also have to ask the question here: What does SIL want to see happen to the data? How is running this lexicography service and relating to these people furthering the ends, accomplishing the goals, and feeding into the other operations of SIL? The funny thing about data is that it needs to breathe and live. If it doesn't flow, it becomes stagnant and old. Data is also like money: just as one needs money to make money, one needs data to breed data.
Part of the answer to the question of what SIL wants to happen to the data lies in the financial model that the service employs. We haven't yet heard whether SIL is looking to broker a deal with Google for Google to license the data from SIL for use in their Google Translate app (as http://www.dict.cc is rumored to do), or whether SIL is looking to be the exclusive location on the web where this data can be obtained. Is SIL looking to give away lexical databases to all those who ask for a copy? Is it part of the model of the service that a lexicography consultant will be flown anywhere in the world to help any people group set up a dictionary process and use the website features? Some return on investment analysis must be completed before we can say anything definitive about licensing. Return on investment is not just about dollars, because data breeds data, lots of data may breed good PR or opportunities (even to share the gospel), and good data can feed into other service lines and other products. So, there are some returns on investment that cannot be bought but must be obtained relationally, even if that relationship is through the internet via a website.
These datasets come to us through multiple sources, right? We are talking about:
- some coming from survey trips
- some coming from field work
- some coming from lexicographer curated data sets
- some coming from an interactive website
- some coming from other non-SIL researchers or institutions.
So, the ethics and legalities of what is right will change from scenario to scenario. But SIL should still try to group these scenarios into typological sets and have a framework in place for dealing with said typological sets.
So, for instance, if SIL is receiving data from an individual via the web on an entry by entry basis, can that individual download their contributions to various dictionaries? This sounds reasonable to me. Does that person's participation in using "the Website" also allow SIL to use their work in other products? Maybe. These are terms which need to be worked out in the EULA. But even the EULA is set up to support the answer to the question: What does SIL want to see happen to the data? What is the desired social outcome?
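As a minimal sketch of what "download their contributions" could mean in practice, assume each web contribution is stored as a record carrying the contributor's name and the dictionary it belongs to. The record layout, the field names, and the function `export_contributions` are assumptions made for illustration and not a description of any existing website.

```python
import json


def export_contributions(records, contributor):
    """Return a JSON string of every record submitted by one contributor."""
    mine = [r for r in records if r["contributor"] == contributor]
    # ensure_ascii=False keeps non-Latin orthographies readable in the export.
    return json.dumps(mine, ensure_ascii=False, indent=2)


# Hypothetical records spanning two dictionaries.
records = [
    {"dictionary": "dict-1", "headword": "headword-1", "contributor": "Contributor A"},
    {"dictionary": "dict-1", "headword": "headword-2", "contributor": "Contributor B"},
    {"dictionary": "dict-2", "headword": "headword-3", "contributor": "Contributor A"},
]

exported = export_contributions(records, "Contributor A")
print(exported)
```

Whether SIL may also reuse those same records in other products is a separate question which the export itself cannot answer; that is what the EULA terms would need to settle.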
If we take a different use case, say the SIL language team who has been working for 30 years and has amassed 15,000 words in a Toolbox file (assuming that it was actually the couple and not a helper who input the 15,000 entries)... Who owns the Toolbox file? Well, SIL should. But did SIL make that clear to the team when they set out? Does not informing the team of:
A) corporate policy regarding generated artistic or intellectual works
B) changes in corporate policy with regards to artistic or intellectual works
generated by the team during their time working for SIL give them cause to claim that the "work" (the Toolbox file) does not belong to SIL? It might.
Obviously these are only two of a plethora of use cases, some of which might include MTT (compensated and uncompensated) participation in the creation of a single lexical dataset.
What is not covered above is the joint creation of a single dataset, a mashup, where two datasets are merged to create a single product. This is what I have been referring to as the onion model. I first wrote about it here (in #2): Permalink and then also put in the diagram which is in the reply to Albert, B's comment: Permalink
... if we look at the rapid word collection method and look at a distributed model for dictionary dataset creation (i.e. via the web or syncing) then we need to look at who is contributing what to make the "dataset". One way to look at this is segmentally, like an onion. The "headwords" are contributed by someone, then someone else contributes the definitions, and then a third party adds images, etc. It is possible to layer copyright per layer and then require Creative Commons licensing for all contributions to the project. In the end the whole is greater than the sum of its parts. In some ways this is the same model that software developers use when they license a library which gets included in an application. Each part can be copyrighted by its contributor, and the whole dataset lives on because of licensing. The challenge is that Creative Commons does not cover datasets; it would cover the output products like a PDF or a book. This is why the OpenData license is needed: http://opendatacommons.org/ (same principles as Creative Commons but for data).
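A mashup of two datasets can keep track of where each piece came from. The sketch below is one possible approach, not a description of how FLEx or any SIL tool merges data. It assumes each dataset is a plain mapping from headword to fields, and the function name `mash_up`, the source names, and the "first contributor wins" rule are all hypothetical choices made for illustration.

```python
def mash_up(*datasets):
    """Merge (source_name, entries) pairs into one dataset with per-field provenance.

    entries maps headword -> {field_name: value}. When two sources supply the
    same field for the same headword, the earlier source is kept (an assumed
    policy; a real project would need to agree on its own rule).
    """
    merged = {}
    for source, entries in datasets:
        for headword, fields in entries.items():
            target = merged.setdefault(headword, {})
            for name, value in fields.items():
                # setdefault leaves an existing field (and its source) untouched.
                target.setdefault(name, {"value": value, "source": source})
    return merged


# Hypothetical inputs: an original team database and a community enrichment.
team_db = ("team_db", {"word-1": {"gloss": "gloss-1"}})
community = ("community", {
    "word-1": {"gloss": "gloss-1b", "image": "word-1.png"},
    "word-2": {"gloss": "gloss-2"},
})

merged = mash_up(team_db, community)
print(merged["word-1"])
```

Because every field records its source, the merged product can still answer which layer came from the original dataset and which from the enrichment.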
In all of this there is a dynamic at play which I think the Lexicography Service will find challenging. It will find it challenging because SIL has not adequately addressed it in its planning. Here is the issue: the long view...
To some, the long view means that a translation took 40 years to complete. This is not entirely what I mean. With the advent of Vision 2025, the emphasis has been on measurable success and on completing projects and programs. In relational terms this means representatives of the SIL corporation developing relationships with indigenous speakers for a period of time to complete a specified task. Upon the completion of the task the relationship between the corporation and the indigenous speaker ends, even if said representatives maintain personal connections with the community. Not a lot of thought has gone into what SIL wants the relationship between that people group and the corporation to be after those specified goals are met and after the people involved in bringing those goals about leave that community. I have often heard it said that our job there is finished so "SIL leaves" (coming from AMA, the lingo has been "passing the torch" and "finishing the work"). This intense missional focus on translation or scripture use blinds us to, or brings out of focus, the longer term relationship which is required for true language development. Some of the things which might fit in this longer term relationship are access to archived materials and access to Lexicography services. Neither of these services requires in-community representatives, and both can be part of an internet facilitated relationship.
Does SIL have exclusive or non-exclusive rights to dictionaries compiled by SIL members?
I think this question should be asked with a "should", as in: Should SIL have exclusive or non-exclusive rights... I say this because even if the policies are written clearly, 1) membership does not agree with them, 2) membership does not know about them, and 3) entities do not enforce the policies, or do not value intellectual property. Because SIL administrators often view intellectual property as a liability rather than an asset, there are 100 different stories for every 100 different dictionaries and lexical databases SIL members have created. This is another reason that SIL should embrace OpenData. So, in some sense the question to ask is: What is SIL willing to enforce? Take a licensing posture no greater than SIL is willing to enforce. (Or take a licensing posture which contributes to the business model.) This way the greatest amount of legal activity is encouraged. I personally like the idea of non-exclusive rights which can be obtained through OpenData licensing. But non-exclusive means that SIL needs to claim its copy of the data too.
Is there a concern that language communities might perceive SIL as benefiting financially from their data?
If this is a reality, I think it is an indication of a broken relationship between SIL and the language community (supposing that the language community is a small one). If the Lexicography Service sells anything, it should sell the graphical experience of the data and not access to the dataset. Wikipedia's data is freely downloadable. Linguists download it all the time so that they have a corpus to work with. This does not make Wikipedia any less valuable or any less used. It makes it more valuable as a service. What keeps Wikipedia valuable is the experience that one gets as one accesses the data. Otherwise someone else would have downloaded all the data, created a better experience, and drawn customers away from Wikipedia. With appropriate and consistent presentation of our work we can sell access to our data and give the data away for free at the same time.
Is there a way to compensate language communities for sharing their data?
If we try to answer this question, we start to tell language communities that there is money to be made off of their language. How much is SIL willing to pay for each word that SIL has used in Bible translations in the past? What is the Lexicography Service unit's vision? Is it to help language communities monetize their language? How will you make sure that everyone who has contributed to the dictionary gets recompensed? If we become advocates of OpenData (which we should), then we are taking the stance that language and language diversity are part of world heritage.
1. If we could ever define or agree on what the language community actually is and who represents it. ↩
2. In general I rarely see any two language development organizations agreeing on the same economic model for language development. Rather, I see a lot of various ideas about which economic ideas are best and then a lot of championing for those methods to be exclusively used. ↩
3. For example look at this release from the Apache Foundation: https://www.apache.org/licenses/icla.txt ↩