This readme file was generated on 2023-12-14 by Nicholas Wolf

GENERAL INFORMATION

Title of Dataset:

An Gaodhal Newspaper (1881-1898) Full-Text OCR Output Files

Author/Principal Investigator Information

Name: Nicholas Wolf

ORCID: 0000-0001-5512-6151

Institution: New York University

Email: nicholas.wolf@nyu.edu

Author/Associate/Co-investigator/Collaborator Information

Name: Deirdre Ní Chonghaile

ORCID: 0000-0001-8147-9874

Institution: New York University

Email: aransongs@gmail.com

Date of data collection:

2023-01-03 / 2023-12-14

Funding sources:

Robert David Lion Gardiner Foundation, Irish Institute of New York, University of Galway, New York University Glucksman Ireland House

***

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data:

Creative Commons Attribution Share Alike 4.0 International

Links/relationships to supporting or related data sets:

https://digital.library.universityofgalway.ie/p/ms/categories/an-gaodhal

Was data derived from another source?

If yes, list source(s):

Digitized raster images of original print volume newspaper pages, made available by the James Hardiman Library at the University of Galway on its Digital Collections platform.

Recommended citation for this dataset:

Ní Chonghaile, D., Dereza, O., & Wolf, N. (2023). An Gaodhal Newspaper Full-Text OCR Output Files [Data set]. New York University. https://doi.org/10.58153/5ya5n-mc504

***

DATA & FILE OVERVIEW

File List:

AnGaodhal_alto.zip

AnGaodhal_transkribusPage.zip

AnGaodhal_pageMetadata.csv

Relationship between files, if important:

The files present the OCR output in two XML formats: 1) Alto, a standard maintained by the Library of Congress; and 2) a similar “page” XML output produced by the OCR software Transkribus. In both cases, outputs are presented at page level, one XML file per page. A third file, pageMetadata, offers page-level information about the contents of each page, such as the language used on the page (English, Irish, or mixed) and the presence of tables or advertisements.

Filenames are prefixed with a page image order, ranging from 0001 to 2298. The remainder of each filename follows the naming convention used by the University of Galway Digital Collections for its An Gaodhal collection, enabling a quick link back to the digital image. That filename consists of “gaodhal” plus a sequential year, volume, and page number (example: 2298_gaodhal_0013_0003-0016_alto.xml refers to XML output number 2298 in the sequence, in Alto form, referencing volume 13, issue 3, digital collection page sequence number 16).
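As an illustration of the naming convention, the following minimal sketch splits a filename into its numeric parts. The pattern is inferred only from the single example filename given above, and the field names (sequence, volume, issue, page) and the function name are illustrative assumptions, not part of the dataset.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one example filename in this readme
# (2298_gaodhal_0013_0003-0016_alto.xml); treat it as an assumption.
FILENAME_RE = re.compile(
    r"^(?P<sequence>\d{4})_gaodhal_"
    r"(?P<volume>\d{4})_(?P<issue>\d{4})-(?P<page>\d{4})_"
)


class PageRef(NamedTuple):
    sequence: int  # page image order prefix (0001 onward)
    volume: int
    issue: int
    page: int      # digital collection page sequence number


def parse_page_filename(name: str) -> Optional[PageRef]:
    """Return the numeric parts of a filename, or None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return PageRef(*(int(match.group(field)) for field in PageRef._fields))
```

Called on the example filename above, parse_page_filename returns sequence 2298, volume 13, issue 3, and page 16.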

Are there multiple versions of the dataset?

Yes. Subsequent versions represent improved XML files as OCR corrections are completed by researcher review.

If yes, name of file(s) that was updated:

AnGaodhal_alto.zip

AnGaodhal_transkribusPage.zip

AnGaodhal_pageMetadata.csv

Why was the file updated?

Files were updated as additional corrections were made to improve OCR accuracy.

***

METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data:

Among the project goals was the extraction of full text to the highest possible degree of accuracy via Optical Character Recognition (OCR) software. The source for the text was digitized images of the An Gaodhal newspaper spanning 1881 to 1898, which were provided especially to the NYU team by the James Hardiman Library at the University of Galway. This digitization was done using a uniquely complete holding of the newspaper that was compiled, bound, and sent to Galway in 1924 by Rev. Daniel J. Murphy (Domhnall Ó Morchadha) of Philadelphia. The marginalia throughout this series are his and relate to his scholarship on sean-nós song. The corresponding Rev. Murphy manuscript collection followed An Gaodhal to Galway in 1936.

Challenges involved in performing this OCR work include: as a bilingual newspaper, pages feature Irish (Gaeilge) only, English only, or both languages together; the use of two different orthographies throughout — the Irish cló Gaelach typeface and Roman typeface — with some infrequent changes of font and sometimes using cló Gaelach for English content and Roman lettering for Irish content; the pre-standardized spelling of the Irish language in the late nineteenth century; variations in spelling and vocabulary reflecting the three major dialects of Irish; variations in spelling reflecting the language aptitude of each contributor, many of whom were learners of the language or were gaining literacy in Irish for the first time; and layout conventions reflecting the artisanal nature of the letterpress printing operation, which was small and domestic in scale and style, produced by the founder and editor Michael J. Logan entirely on a pro-bono basis, and funded chiefly by subscriptions and advertisements.

Only a handful of OCR training models attuned to cló Gaelach, pre-standardized spelling, are in existence (see, for example, https://github.com/kscanne/tesseract-gle-uncial), and none available had been trained on bilingual texts. Thus, the team had to create and train such a bilingual Irish-English OCR model from scratch. The software selected for this process was READ-COOP’s Transkribus software (https://readcoop.eu/transkribus). The team proceeded as follows:

  1. Automate identification of predominantly Irish-language lines on pages. This was done using Amazon’s Textract software (https://aws.amazon.com/textract), which could quickly and accurately produce token-detection and line segmentation regardless of language; the resulting OCR outputs were then categorized into Irish and non-Irish texts on a line-by-line basis by evaluating the dictionary-word accuracy of each line output; those scoring high as containing properly spelled English words were deemed “non-Irish,” leaving a clear corpus of Irish-language pages to train an initial model. The team “masked” the resulting English-language lines on pages using overlaid opaque rectangles, enabling the creation of monolingual Irish-only page images.

  2. Train an OCR recognition model for Irish-only pages. From the masked Irish-only pages, the team selected 60 pages at random; after excluding pages dominated by images or advertisements, a total of 57 proved viable for training. The team transcribed these texts manually and created a model in Transkribus for Irish-language detection named “An Gaodhal Gaeilge Model 1” (Transkribus Model ID#50036). That model incorporated 18,533 tokens and achieved a CER (character error rate) of 0.1%.

  3. Train an OCR recognition model on bilingual Irish/English pages. The team selected 100 pages randomly from the entire collection, removing all masks to present fully bilingual texts. Pages predominantly in Irish were run through the Irish-only model; pages predominantly in English were run for text recognition using Transkribus’s Print M1 model (ID#39995), which has been trained on over 5 million tokens and which also reflects the historical typographical conventions of the corpus. The resulting pages were then corrected manually, which provided the necessary content to train a bilingual OCR model. This bilingual model titled “An Gaodhal Bilingual Gaeilge/English Model 1” (ID#51080) achieved a 0.01% CER on 54,406 input training tokens.

  4. Correction of outputs. Three different OCR models were run on the full 2,298 pages of the corpus as appropriate to the language profile of each page: Transkribus’s Print M1 (ID#39995) on English-only or mostly English pages; An Gaodhal Gaeilge Model 1 (Transkribus Model ID#50036) on the Irish-only or mostly Irish pages; and An Gaodhal Bilingual Gaeilge/English Model 1 (ID#51080) on bilingual pages. To date, half of these pages (prioritizing the Irish-only pages) have been corrected by human review and text entry.

  5. Collection of supplementary page-level information. The team reviewed and recorded key attributes of each page. The results of this review are presented in the CSV file. These include: the presence of a table, advertisement, or image on each page; the language profile of the page — Irish, English, or bilingual; and the occurrence of verse (song or poem) or letters. This detail provides scope for further analysis of the content of the corpus.
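The line-by-line language sorting described in step 1 can be sketched roughly as follows. This is not the team's actual code: the word list, the 0.6 threshold, and the function names are assumptions chosen only to illustrate scoring each OCR line by the share of its tokens found in an English dictionary.

```python
import re
import unicodedata


def english_word_ratio(line: str, english_words: set) -> float:
    """Share of alphabetic tokens in the line that appear in english_words."""
    # Normalize so dotted consonants are single letters, not letter + mark.
    normalized = unicodedata.normalize("NFC", line).lower()
    tokens = re.findall(r"[^\W\d_]+", normalized)  # runs of letters only
    if not tokens:
        return 0.0  # lines with no alphabetic tokens score zero
    hits = sum(1 for token in tokens if token in english_words)
    return hits / len(tokens)


def classify_line(line: str, english_words: set, threshold: float = 0.6) -> str:
    """Label a line "non-Irish" when enough tokens are dictionary English.

    The threshold is a hypothetical value, not one reported by the project.
    """
    if english_word_ratio(line, english_words) >= threshold:
        return "non-Irish"
    return "Irish"
```

In practice the word list would be a full English dictionary, and the threshold would need tuning against a hand-checked sample of lines.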

OCR Quality Details:

As a monthly newspaper, An Gaodhal contained 12 numbers per volume. The corpus totals 147 issues, from volume 1, number 1, to volume 13, number 3, and is complete and intact, i.e. there are no missing pages. Most issues contain 16 pages; some contain 14, 12, or 8 pages. Page tears, ink spots, and blemishes are rare. Where such characteristics impair the legibility of text, human review relied on consulting the printed artifact or other extant samples of the relevant text, e.g. song lyrics.

The cló Gaelach typeface used in An Gaodhal compares with the early nineteenth-century typeface attributed to Richard Watts of London. It is understood to have derived from matrices and punches that had been obtained by the Boston Philo-Celtic Society by 1879; the resulting type — then the only variety available in America — was sold at a cost of forty-two cents per pound, which was then equal to the price of Roman type. All known samples of printing with this particular typeface emerged solely from America.

The set of cló Gaelach type used by Logan appears complete. A contemporary New York newspaper edited by Irish-born printer James Haltigan, Celtic Monthly (1879-1884), used a set of type that appears identical; however, the characters Ḃ, Ċ, Ḋ, Ḟ, Ġ, Ṁ, Ṗ, Ṡ, and Ṫ are applied variably therein. In lieu of dotted capital consonants, Haltigan and his colleagues sometimes rendered Ḃ, Ċ, and Ḋ as Bh, Ch, Dh, etc., a common substitution at this time and later where access to cló Gaelach type was not guaranteed. To ensure that such nuances of contemporary typesetting and spelling conventions in a given printed artifact are preserved in the text extraction, the two new OCR models were trained to match a single unicode character to each printed glyph; manually substituting Ḃ, Ċ, and Ḋ with Bh, Ch, and Dh, etc. was eschewed. Logan rarely adopted such substitutions and, in Irish-language texts, chose to adhere to the relevant orthography, spelling some English words phonetically e.g. “Nuaḋ Ġorc” for New York. In the present text extraction, the selected unicode characters do not replicate exactly the design of the cló Gaelach type (such as Gaelchló provides https://www.gaelchlo.com/); rather, in deference to long-standing practice, Roman typeface characters — including those with diacritics (dots or accents) above the x-height or cap height e.g. ú or Ṁ — were chosen, thus ensuring interoperability between this dataset and others (see http://corpas.ria.ie/).
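Because the dataset preserves dotted consonants as single Unicode characters, users who want to match text against sources that use the "Bh, Ch, Dh" convention may wish to derive a normalized copy. The following is a minimal sketch of one way to do so downstream; it is not part of the project workflow, and the function name is an assumption. It relies on Unicode decomposition, in which a dotted consonant becomes a base letter followed by the combining dot above (U+0307).

```python
import re
import unicodedata

# A lenitable consonant followed by the combining dot above (U+0307).
_DOTTED_CONSONANT = re.compile("([BCDFGMPSTbcdfgmpst])\u0307")


def dots_to_h(text: str) -> str:
    """Rewrite dotted consonants with a following 'h' (e.g. dotted d -> dh)."""
    decomposed = unicodedata.normalize("NFD", text)      # split letter + mark
    replaced = _DOTTED_CONSONANT.sub(r"\1h", decomposed)  # drop dot, add h
    return unicodedata.normalize("NFC", replaced)        # recompose accents
```

Applied to the example above, "Nuaḋ Ġorc" becomes "Nuadh Ghorc"; accented vowels such as ú are left unchanged. Keeping the original dotted text alongside any normalized copy preserves the typesetting evidence the dataset was designed to retain.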

Printing errors are uncommon. Sometimes individual pieces of moveable type were placed in the printer’s composing stick in the wrong order or upside-down, or supplies of particular letters e.g. a, á, e, é, ran short and were substituted with alternatives from either of the two orthographies. On occasion, insufficient ink or loose type rendered gaps. Corrections arising were tagged as “supplied” or “unclear” or “gap” as appropriate to the word or line in question. Smaller pica sizes, which occurred only in the English-language fonts and most often in advertisements, proved challenging to the OCR software and thus prompted manual text entry.

Eight pages of the dataset — 121, 234-235, 1242, 1291-1292, and 1835-1836 — represent items created and/or added by Rev. Daniel J. Murphy to the printed newspaper issues before they were bound together in hardback volumes. These pages include: a drawing; a hand-drawn diagram; two printed extracts from another publication featuring song extracts; and song verses and their source typed or handwritten on small slips of paper. The latter two varieties of text-based items were included in the OCR run.

The newspaper’s decorative lithographic frontispiece, which was reproduced on most of its cover pages, was excluded from the OCR run.

Handwritten marginalia corresponding to Rev. Murphy’s handwriting occur on 495 of the remaining 2290 pages and were included in the OCR run. Appearing in black, blue, and red ink and in pencil, Rev. Murphy’s annotations supply additional data including references to published books, journals, and newspapers; identify alternate song titles and associated song airs or melodies; and suggest corrections to the printed text content.

Tables present a particular challenge because of their idiosyncratic styling and their variable placement throughout. The text extraction sought to render tables as logically as possible in text regions where formal table structures could not be applied. Formal table structures appear on 193 pages and were included in the OCR run. Differences in the cell separation from table to table respond to the nature of the text on each page with the aim of extracting the most accurate text possible and preserving the internal logic of each table.

Individual verses and choruses of songs and poems are rendered in separate text regions to facilitate the potential for search functionality focusing on first lines. One or two acrostic compositions occur. Some text is printed vertically or rotated left or right but most text is printed horizontally.

Footnotes appear at the end of some articles, often referring to an accompanying glossary to aid language learners. Among the characters used in a given text to refer to such footnotes are numbers, lower case Roman letters, and standard markings including asterisks and daggers. The use of superscript characters is rare. The sequencing of text regions ensures data output is comprehensible and readers can navigate to the appropriate footnote.

Abbreviations reflecting conventions of the period occur throughout, many of them serving to conserve space and type in printed matter: in English, “Jas.” meaning “James” and “Patk.” meaning “Patrick.” The names of individual states throughout America are frequently abbreviated, with and without period marks and/or spacing e.g. “RI” and “R. I.” for Rhode Island. In Irish, Logan frequently abbreviated ‘agus’ (meaning ‘and’) to the digit ‘7’ in lieu of the Tironian symbol for the Latin ‘et’ (⁊). In correcting text extraction, human review reverted to the ampersand symbol (&) instead to avoid confusion with the digit ‘7.’

Long numbers, such as population figures, are sometimes rendered in text rather than numerical digits e.g. six million instead of 6,000,000. This reflects the compositor’s effort to conserve type — in this case, zeros — as well as space within the confines of a newspaper column.

The approach to correcting OCR output was curatorial, not editorial. As the newspaper was edited by the same individual from start to finish and printed under his guidance (potentially with the assistance of his young son Edward J. Logan who later became a professional printer), there is a notable consistency of style throughout. Corrections were applied rarely and only then in the interests of ensuring discoverability. Non-standard forms of Irish-language spellings throughout prompted a strict adherence to the printed artifact as did printer’s abbreviations — both conventional and idiosyncratic — that represent efforts to maximize space or optimize readability.

Punctuation and typographical conventions are generally preserved. However, some commas were inserted where printing appeared to render a period mark in the middle of a sentence; tilde marks (≈) used in hyphenated compound words were replaced with a standard n-dash (-) to avoid confusion with the mathematical sign ‘equal to’ (=); and spaces were inserted on either side of m-dashes (—) to ensure that words on either side were recognized as separate entities. Some lines of text were justified from time to time but many more end with a word that is split between the end of that line and the start of the next, reflecting the physical restrictions of manual type-setting. In the printed artifact, the split is bridged by a n-dash (-). Excluding hyphenated compound words, human review of the text output replaced such examples with the character ¬ (called a ‘soft hyphen’ or ‘optional hyphen’). Such amendments aim to facilitate comprehension and deliver consistency of meaning for machine-reading tasks.
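For machine-reading tasks, words split across lines can be rejoined by using the ¬ marker. The sketch below assumes the marker is always the final character of the line on which the split occurs; the function name is illustrative.

```python
def join_split_words(lines, marker="\u00ac"):
    """Join lines into one string, gluing words split with the marker."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if out.endswith(marker):
            # Previous line ended mid-word: drop the marker, add no space.
            out = out[: -len(marker)] + line
        elif out:
            out += " " + line
        else:
            out = line
    # A marker on the very last line is left in place, since there is
    # no following line to supply the rest of the word.
    return out
```

Hyphenated compound words keep their ordinary hyphen in the dataset and are therefore unaffected by this step.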

As is common in OCR workflows, layout detection was important to overall accuracy, especially given that columns and paragraph structures were used by the printers throughout. Print block detection and layout analysis models offered by Transkribus were applied consecutively to each page to yield accurate structure and baseline recognition. The occurrence of two columns on most pages, tables, advertisements, images, marginalia, and fine print demanded careful review of the page layout and sometimes required manual treatment including adjusting baselines and box boundaries and hand-drawing baselines for vertical text and marginalia.

The language profile of each page determined which of the three selected OCR models ought to be applied. Pages featuring a majority of English content required text entry for any Irish content therein where the English-only OCR model failed to render the Irish orthography. Likewise, pages featuring a majority of Irish content required text entry for any English content therein where the Irish-only OCR model failed to render the English orthography. The bilingual Irish-English model performed best when the content featured almost equal quantities of both languages and when the languages were confined to separate sections. Where the languages were intermixed in individual lines — in lists of translated Irish vocabulary or language instructional texts, for instance — the OCR output required more correction where the model failed to adjust to the rhythm of the orthographic exchanges on the page. Where OCR failed to render complete lines or word boxes, these were entered manually. Lines were sometimes joined or split to maximize comprehensibility of the extracted text. Corrections were provided at word-level, not simply at line-level, to enable future application of language-based technologies.

Methods for processing the data:

Submitted data represents the direct exports from Transkribus of the resulting full text. The files are presented in two forms: 1) Alto-format XML files that provide bounding box regions for text locations (at the individual token level) of separately tokenized pages; and 2) Page-format XML files, which carry similar information to the Alto files but use an output format specific to the Transkribus software.

Instrument- or software-specific information needed to interpret the data:

All files are in stable, easily discernible XML formats that can be parsed by a variety of software options once uncompressed.
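The XML files can also be read directly from the compressed archives without unpacking them to disk. The following minimal sketch uses only the Python standard library; the function name is an assumption, and the internal folder layout of the archives is not assumed (every member ending in .xml is read).

```python
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_roots(zip_source):
    """Yield (member_name, root_element) for each XML file in an archive.

    zip_source may be a path such as "AnGaodhal_alto.zip" or an open
    binary file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Sorting by name follows the page image order prefix.
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".xml"):
                yield name, ET.fromstring(archive.read(name))
```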

***

DATA-SPECIFIC INFORMATION FOR FILES:

XML Files:

XML files are internally self-describing, with tags providing names of fields.

“Page” Transkribus output format files are organized on a per-page basis into regions (\<TextRegion>) or tables (\<TableRegion>), then lines (\<TextLine>) or table cells (\<TableCell>) respectively, and finally words (\<Word>). Regions are also labeled according to a structure type: paragraph (\<TextRegion type="paragraph">), page-number (\<TextRegion type="page-number">), or marginalia (\<TextRegion type="marginalia">). These distinguish between the standard printed newspaper text, a page number printed on the page, and handwritten marginalia added to the original printed artifact.
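A minimal sketch of reading line-level text from a Page-format file follows. It assumes the usual PAGE layout in which each line carries its own TextEquiv/Unicode child and each region carries the type attribute described above; the namespace wildcard ({*}) requires Python 3.8 or later, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def page_lines(root):
    """Yield (region_id, region_type, line_id, text) for each text line."""
    for region in root.findall(".//{*}TextRegion"):
        for line in region.findall("{*}TextLine"):
            # Direct child only, so word-level TextEquiv elements are skipped.
            unicode_el = line.find("{*}TextEquiv/{*}Unicode")
            text = ""
            if unicode_el is not None and unicode_el.text:
                text = unicode_el.text
            yield region.get("id"), region.get("type"), line.get("id"), text
```

Filtering on the second value (for example, keeping only "marginalia") separates handwritten annotations from the printed text.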

Each structural element maps to the image uploaded to the software, reading each of the newspaper’s two columns left to right from top to bottom. Exceptions arise where the usual layout deviates according to the printer’s prerogative; for instance, when a reader’s eye moves at intervals over and back between the two columns. In such rare cases, human review prompted the re-ordering of the sequence to ensure the extracted text output was as logical and comprehensible as the experience of reading the printed artifact.

Each region, line, and word has a unique identifier derived from its logical sequence on the page. Thus, for example, word id “r5l1w2” refers to region 5, line 1, word 2. Tables, table cells, and words conform to the same style of sequencing e.g. region id “tbl_4_4” refers to a table appearing between Regions 3 and 4 of standard text areas, and the relevant word id entries appear per line and word (left to right) as “r_4_1_1” and “r_4_1_2” etc.
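The word identifier pattern can be decoded with a short helper such as the one sketched below. It handles only the "r5l1w2" style shown above; table-related identifiers such as "tbl_4_4" and "r_4_1_1" follow a different pattern and are deliberately not interpreted here. The function name is an assumption.

```python
import re
from typing import Optional, Tuple

WORD_ID_RE = re.compile(r"^r(\d+)l(\d+)w(\d+)$")


def parse_word_id(word_id: str) -> Optional[Tuple[int, int, int]]:
    """Return (region, line, word) for ids like "r5l1w2", else None."""
    match = WORD_ID_RE.match(word_id)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```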

Additional identifiers indicating separators (\<Separator>) are retained in the data. The separator ID numbers do not conform to the sequence of identifiers mapping all other page elements; rather, they retain the identifiers generated automatically by the initial layout analysis. Hence, they appear somewhat random — two consecutive separators might appear as “r_25” and “r_39” — and are typically grouped together at the end of the page metadata. Each separator corresponds to a decorative hairline rule or border demarcating different elements of the printed page, separating articles or advertisements from each other. Such decorative elements aid the reader’s navigation of a printed page. In a digital environment, an equivalent distinction is provided by the structural tags applied during text extraction. As such, separators were deemed surplus to the requirements of text extraction. In addition, such was the quantity of separators throughout, time did not allow for the re-sequencing of each individual separator between different text regions as they appear on each page.

Bounding coordinates for polygons and locations of points making up text baselines are oriented to an origin point (0,0) at the top left of the page, mapping each element to the image in question. X,Y coordinates are given as pairs in the form x1,y1; x2,y2; etc.
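Coordinate strings can be converted to numeric pairs as sketched below. The parser accepts pairs separated by whitespace, semicolons, or both, since the exact separator in a given export should be checked against the files themselves; integer pixel values are assumed, and the function names are illustrative.

```python
import re


def parse_points(points: str):
    """Turn "x1,y1 x2,y2 ..." (or "x1,y1; x2,y2") into [(x1, y1), ...]."""
    pairs = [p for p in re.split(r"[;\s]+", points.strip()) if p]
    return [tuple(int(value) for value in pair.split(",")) for pair in pairs]


def bounding_box(coords):
    """Return (min_x, min_y, max_x, max_y); origin is the top left of the page."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)
```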

“Alto” output format files follow the XML schema maintained by the Library of Congress at https://www.loc.gov/standards/alto/v4/alto-4-4.xsd (for version 4.4, the most current). These files follow a similar region, line, string structure, with the token provided in the CONTENT attribute of each \<String> element.

See also the Alto documentation at https://www.loc.gov/standards/alto/techcenter/elementSet/index.html#element_OCRProcessing.
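A minimal sketch of extracting tokens and their bounding boxes from an Alto file follows. It relies on the standard ALTO attributes CONTENT, HPOS, VPOS, WIDTH, and HEIGHT on each String element; whether every String in these exports carries all four position attributes is an assumption, so missing values default to 0. The function name is illustrative, and the namespace wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def alto_tokens(root):
    """Yield (token, (hpos, vpos, width, height)) for each String element."""
    for string_el in root.findall(".//{*}String"):
        box = tuple(
            float(string_el.get(attr, "0"))  # default to 0 if absent
            for attr in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        yield string_el.get("CONTENT", ""), box
```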

Accompanying CSV file:

The accompanying CSV provides additional metadata on a per-page basis that were recorded in the course of page layout review. Page metadata appear in rows with columns distinguishing between the following elements: page filename; the language profile of the page — Gaeilge (Irish), English, or Bilingual; presence of skew or tight gutters; and whether or not a page contains marginalia, images, advertisements, verse, or letters.

Variables include:

pageFilename: XML OCR output to which row data refer

skew_gutter_fallaway: Yes/No on presence of a skew, gutter, or fallaway on digitized page that might affect OCR quality

hasTable: Yes/No on presence of a table or table-like arrangement of tokens on page (includes list and list-like structures)

language: Gaeilge/English/Mix, predominant language on page

isCover: Yes/No on whether this page is the issue start (i.e. cover) page

hasMarginalia: Yes/No on whether handwritten margin notes are present

hasSong_Poem: Yes/No on whether a song or poem, or part thereof, is present

hasAdvert: Yes/No on whether an advertisement is present

hasLetter: Yes/No on whether a letter, or part thereof, is present

hasImage: Yes/No on whether an image is present
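The page metadata can be used to select subsets of the corpus, for example Irish-language pages that contain a song or poem. The sketch below uses the column names and Yes/No values listed above; the function name is an assumption, and the exact spelling of values should be checked against the CSV itself.

```python
import csv


def select_pages(csv_lines, language=None, **flags):
    """Return pageFilename values for rows matching the given criteria.

    csv_lines is any iterable of text lines, such as an open file.
    flags are column=value pairs, e.g. hasSong_Poem="Yes".
    """
    selected = []
    for row in csv.DictReader(csv_lines):
        if language is not None and row["language"] != language:
            continue
        if all(row.get(column) == wanted for column, wanted in flags.items()):
            selected.append(row["pageFilename"])
    return selected
```

For example, select_pages(open("AnGaodhal_pageMetadata.csv", newline="", encoding="utf-8"), language="Gaeilge", hasSong_Poem="Yes") would list the matching XML filenames, assuming the file is UTF-8 encoded.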