Information Discovery and Access in Digital Language Archives

Exploring Methods and Techniques for Facilitating Access to Digital Language Archives 

From 2019 to 2020, PI Oksana Zavalina led this IMLS-funded study with co-PIs Shobhana Chelliah and Mark E. Phillips and Research Assistant Mary Burke. Project goals included the following:

  • Identifying
    • information organization tools and practices currently employed by language data archives across the United States
    • the needs of depositors and end-users (linguistics researchers, instructors, and students) for information organization functionalities in these archives
  • Providing empirical data to support planning of a future large-scale collaborative project focused on developing more efficient and user-friendly tools for access to digital language archives.

The team interviewed language archive users, managers, and depositors to address the following research questions:

  • How is information currently organized in the language archives?
  • What are the needs of actual and potential depositors of language data with regard to information organization in language data archives, and how do they correlate with available information organization functionality?
  • What are the needs of end users of language data (researchers, educators, students) with regard to information organization in language data archives, and how do they correlate with available information organization functionality?

Explore our findings in publications and presentations related to this project: 

Burke, M., & Zavalina, O. (2019). Exploration of information organization in language archives. Proceedings of the Association for Information Science and Technology, 56(1), 364-367.

Burke, Mary & Zavalina, O. L. (2020). Identifying challenges for information organization in language archives: Preliminary findings. In A. Sundqvist, G. Berget, J. Nolin, & K. I. Skjerdingstad (Eds.), Sustainable digital communities: 15th international conference, iConference 2020, Borås, Sweden, March 23–26, 2020, proceedings (pp. 622–629). Springer International Publishing.

Burke, Mary, Zavalina, Oksana L., Phillips, Mark & Chelliah, Shobhana. (2021). Organization of knowledge and information in digital archives of language materials. Journal of Library Metadata, 20(4), 185-217. doi: 10.1080/19386389.2020.1908651.

Burke, Mary, Zavalina, Oksana L., Chelliah, Shobhana L., & Phillips, Mark. (2022). User needs in language archives: Findings from interviews with language archive managers, depositors, and end-users. Language Documentation & Conservation, 16, 1-24.

The Lang Arc Workshop  

Lang Arc is an interactive workshop aimed at exploring a broad scope of issues related to digital language archives. The workshop aims at bringing together researchers, practitioners, educators, and students from around the world who are currently working or are interested in working in different areas related to collecting, archiving, curating, organizing, and providing access to born-digital or digitized language data, and evaluation of digital language archives. The 1st International Workshop on Digital Language Archives (Lang Arc) 2021 was organized by the University of North Texas and held online on September 30, 2021. This workshop was one of five workshops held as part of the IEEE/ACM Joint Conference on Digital Libraries co-organized by the University of Illinois. 

Explore workshop papers and presentations here:

Pedagogy research 

Our team continues to explore how to bring information on language archives into existing Information Science and Linguistics curricula.  We experimented with some combined courses and modules:

  • Spring 2020 combined course: LING 5030 Languages of South Asia / INFO 5224 Advanced Topics in Metadata with curriculum developed by Shobhana Chelliah, Oksana Zavalina, and Mary Burke
  • Spring 2021 experimental course: INFO 5224 Advanced Topics in Metadata, with special emphasis on language archives with curriculum developed by Oksana Zavalina and Mary Burke

We report on those efforts in the following publications:

Buchanan, S.A., Babalola, N.A., Chelliah, S.L., Kriesberg, A., Pratt, S., Wisser, K.M., & Zavalina, O.L. (2020). Transforming the archival classroom for a connected reality. In K. Dali, S. Hawamdeh, & H.G. Gunderman (Eds.), Proceedings of the Association for Library and Information Science Education Annual Conference: ALISE 2020, p. 396. ALISE.

Roeschley, Ana, Buchanan, Sarah A., Graf, Ann, Burke, Mary, & Zavalina, Oksana L. (2020). Considering individual and community contexts within information pedagogy, scholarship, and practice. Panel presented at the 83rd Annual Meeting of the Association for Information Science and Technology, 23-29 October 2020. 


Encouraging re-use of archival material 

Using archival deposits to analyze and improve orthography

This project includes the analysis and comparison of orthographic representations and spelling variations with spoken forms of languages spoken in the Northeastern Indian region, including Lamkang, Hakha Lai, Boro, and Manipuri. These languages have complex morphological structures which trigger numerous morpho-phonological processes and can prove challenging to represent in the orthography. These communities are at different stages of orthography development, which is influenced by linguistic and non-linguistic factors, such as identity and usability of the orthography in digital spaces.

The current project includes analysis of primary texts and writings found in the CoRSAL archive, as well as collaboration with native speakers to assess potentially ideal orthographic standards based on literacy processing and comprehension. The long-term purpose of the project is to gain better insight into the linguistic processes that may result in spelling variation and to provide information on ideal forms of representation.

Orthography development is important for both language revitalization and documentation. Having a written language is critical for endangered languages because it allows communities to create formal teaching materials such as textbooks, as well as magazines and newspapers, signs and menus, and even instruction manuals. Written language is also critical for transmission of the language to future generations and can help with adoption of the language by younger members of the community and the use of the language on social media. Orthographic standards also allow speakers to be active participants in the documentation process, which is critical for documentation efforts.

To reach an orthographic standard, linguists and language communities must make decisions about orthographic representation, but guidelines and research to inform these decisions are lacking. This research could lead to preliminary guidelines to aid in the development of orthographies for indigenous languages. In addition, this research could increase our understanding of literacy processing in general. The majority of literacy research has been on English, though that trend has changed somewhat in recent years. Indigenous languages are a rich part of the world’s larger linguistic landscape, but literacy research with indigenous language speakers is almost non-existent.

Explore our findings in publications and presentations related to this project: 

Chelliah, Shobhana & Garton, Rachel. (2023, in production). Orthography Development for Tibeto-Burman Languages of the South Central Branch: Lessons from Lamkang. Himalayan Linguistics, 20(1). 

Garton, Rachel, Dale, Merrion, Roy, L. Somi, & Basumatary, Prafulla. Endangered Languages in the Digital Public Sphere: A case study of the writing systems of Boro and Manipuri. Paper presented at Grapholinguistics in the 21st Century Conference, June 10, 2022.

Depositor engagement with collections

Social media for engagement with archives

We consider how social media may be used to increase community engagement with collections.  The following Facebook pages are moderated by depositors and are used by them to increase engagement:

The following presentation and publication describe our efforts. 

Dale, Merrion. (2021). Examining the influence of social media on community engagement with language archives. Paper presentation at the 3rd Annual Multidisciplinary Information Research Symposium, 10 April 2021. University of North Texas. Denton, Texas.

Dale, Merrion. (2022). Case study of using Facebook groups to connect community users to archived CoRSAL content. Language Documentation & Conservation, 16, 399-416.

Enriching metadata through community engagement

We consider how depositor and non-depositor community members can contribute to metadata enrichment. Related activities are described here:

Burke, Mary. (2021). Collaborating with language community members to enrich ethnographic descriptions in a language archive. In O.L. Zavalina & S.L. Chelliah (Eds.). Proceedings of the 1st International Workshop on Digital Language Archives, pp. 18-21, University of North Texas. doi: 10.12794/langarc1851172.

Language Archive Metadata Quality 

A series of articles and presentations reviews metadata quality in language archives, including CoRSAL as a case study of metadata quality challenges.

Burke, Mary & Zavalina, Oksana L. (2020). Descriptive richness of free-text metadata: A comparative analysis of three language archives. Proceedings of the 83rd annual meeting of the Association for Information Science & Technology 2020, 57:e429.

Burke, Mary. (2020). Obstacles overcome in the creation of the Lamkang Language Resource: Implications for archiving Tibeto-Burman languages. Paper presented at the North East Indian Linguistics Society 11, 7-9 February 2020. Kokrajhar, India.

Burke, Mary. (2021). Evaluating language data: Lessons from curating the Computational Resource for South Asian Languages (CoRSAL). Paper presented at the 3rd Annual Multidisciplinary Information Research Symposium, 10 April 2021. University of North Texas. Denton, Texas.

Interoperability of Archival Data

Burke, Mary & Chelliah, Shobhana L. (2021). Cross-Language Comparison of Mismatched Annotation in Interlinear-Glossed Texts. Poster presented at the 95th Annual Meeting of the Linguistic Society of America. 7-11 January 2021. San Francisco, California.

Burke, Mary & Chelliah, Shobhana L. (2021). Challenges to representing personal names and language names in language archives: Examples from Northeast India. Proceedings of the 1st International Workshop on Digital Language Archives 2021. University of North Texas.

Encouraging re-use of archival materials for linguistic analysis

Intellectual access to connected texts such as traditional stories is often provided through clause, word, and morpheme analysis presented in bundles called interlinear glosses. The glossing follows the principles of the Leipzig Glossing Rules. However, there is a clear need for language-specific glossing rules. These are developed in a series of articles and an open access textbook.

Chelliah, Shobhana, Mary Burke, and Marty Heaton. (2021). Using interlinear gloss texts to facilitate cross-language comparison and improve language description. Indian Linguistics, 82(1-2): 1-24. 

Chelliah, Shobhana and Mary Burke. (2019, December 10). IGT as a Computational Resource for South Central Tibeto-Burman Languages. Presented at the Recent Advances in Kuki Chin, part of the seminar series Asia Beyond Boundaries, sponsored by an ERC Synergy Grant. School of Oriental and African Studies, London.

Chelliah, Shobhana and Samson Lotven. (2021). (release date March 2022) From Source to Analysis: A fieldworker’s guide to Annotation. UNT Open Books.

As part of this effort, CoRSAL has developed a partnership with the UNT Libraries' imprint Aquiline Books to publish the online, open access CoRSAL Occasional Publications series. This series will feature text collections to accompany the language collections in CoRSAL. The first publications in the series are the following:

Haokip, Pauthang. (2021). Annotated Texts of the Languages of the Barak Valley: Thadou, Saihriem, Hrangkhol, Ranglong. (S. Chelliah, M. Burke & M. Heaton, Eds.). CoRSAL Occasional Publications.

Matisoff, James A. (2022). Window onto a Vanished World: Lahu texts from Thailand in the 1960’s (J. B. Lowe & C. Zhang, Eds.). CoRSAL Occasional Publications.

Interdisciplinary Infrastructure Development 

CoRSAL uses the UNT Digital Library's infrastructure, as described in these presentations and publications:

Phillips, Mark, Burke, Mary, Tarver, Hannah & Zavalina, Oksana. (2021). Leveraging digital library infrastructure to build a language archive. In O.L. Zavalina & S.L. Chelliah (Eds.). Proceedings of the 1st International Workshop on Digital Language Archives, pp. 15-17, University of North Texas. doi: 10.12794/langarc1851182.

Phillips, Mark, Burke, Mary, Tarver, Hannah & Zavalina, Oksana. (2022). Utilizing existing metadata standards and tools for a digital language archive: A balancing act. The Electronic Library 40(5): 579-593.