Continued searching for poetry treasure hidden in Latin prose

Description

Roman prose authors show a great fondness for Roman and Greek poetry, which they reference throughout their texts. Cicero and Seneca, for example, include 327 and 156 references to poetry in their respective corpora, ranging from the texts of Aeschylus and Accius to Euripides and Ennius, from Homer and Hesiod to Pacuvius and Plautus, and from Solon and Sophocles to Valgius and Vergil. In addition to the philosophers Cicero and Seneca, historians, biographers and grammarians like Ammianus Marcellinus, Suetonius and Varro also include numerous citations and paraphrases from poetry.

However, an author did not always name the poet they referenced, often because this was deemed unnecessary: poetry was popular, so the audience would know the source. For a modern reader, an unattributed quotation is hidden in plain sight; the only clue is its metrical form, which is hard to detect within a piece of prose.

This project aims to apply a neural network developed by the University of Leiden to prose texts in order to detect hidden poetry passages. The program should read a piece of text, create a set of poetry candidates, apply the neural network to these candidates, evaluate whether each resulting scansion fits a specific poetic meter (with fault tolerance, as the model is only 90% accurate), and return all matching candidates with their poetic meter and level of certainty.
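The fault-tolerant meter check could be sketched as below for the hexameter. This is only an illustration under stated assumptions: the labels `l` (long) and `s` (short), the `x` marker for the final syllable of indifferent length, the function names and the `max_errors` threshold are all hypothetical and do not describe the actual output format of the LSTM. The sketch enumerates every hexameter template (five feet that are each a dactyl or a spondee, followed by a long syllable and a final syllable of either length) and accepts a labeling when it differs from the closest template of the same length in at most `max_errors` positions.

```python
from itertools import product

DACTYL, SPONDEE = "lss", "ll"  # assumed labels: l = long, s = short


def hexameter_patterns():
    """Enumerate all 32 hexameter templates; 'x' marks the free final syllable."""
    return ["".join(feet) + "lx" for feet in product((DACTYL, SPONDEE), repeat=5)]


def mismatches(labels, pattern):
    """Count positions where a label disagrees with the template ('x' matches anything)."""
    return sum(1 for got, want in zip(labels, pattern) if want != "x" and got != want)


def match_hexameter(labels, max_errors=1):
    """Return (template, certainty) for the closest template, or None if too far off."""
    best = None
    for pattern in hexameter_patterns():
        if len(pattern) != len(labels):
            continue  # only compare templates with the same syllable count
        errors = mismatches(labels, pattern)
        if best is None or errors < best[1]:
            best = (pattern, errors)
    if best is None or best[1] > max_errors:
        return None
    pattern, errors = best
    # certainty = share of syllables that agree with the template
    return pattern, 1 - errors / len(labels)
```

Comparing only templates of equal length keeps the check simple, but it means a wrongly merged or split syllable is not tolerated; an edit-distance comparison would be one way to relax that.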

Expected MVP

The following steps should be implemented for an MVP:

1. Ingest prose text
2. Clean prose text
3. Split prose text into words
4. Syllabify the words in the text (scanning Latin poetry means labeling each syllable)
5. Create poetry candidates from the text (e.g. a hexameter is everything between 12 and 17 syllables)
6. Feed candidates to the neural network (LSTM)
7. Check each labeling of the LSTM for a metrical pattern (including fault tolerance)
8. Add checks for 'good' poetry (12 long syllables is a valid hexameter, but not a good one)
9. Return all remaining candidates with their meter. E.g. "ărūndı̆nı̆s ūmor" -> hexameter (100% match)
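Steps 2, 3 and 5 could look roughly like the sketch below. It makes several assumptions: the function names are hypothetical, the cleaning step assumes plain unaccented text and simply discards everything that is not a letter, and syllabification (step 4) is taken as given, so `candidate_spans` expects each word as a list of syllables. Candidates are built from whole words only, on the assumption that a quoted verse starts and ends at a word boundary.

```python
import re


def clean_text(text):
    """Lowercase the text and replace every non-letter character with a space."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def split_words(text):
    """Clean the text and split it into words."""
    return clean_text(text).split()


def candidate_spans(words, min_syll=12, max_syll=17):
    """Yield (first_word, end_word, syllables) for every run of whole words
    whose total syllable count lies between min_syll and max_syll.

    `words` is a list of words, each given as a list of syllables.
    """
    for start in range(len(words)):
        syllables = []
        for end in range(start, len(words)):
            syllables.extend(words[end])
            if len(syllables) > max_syll:
                break  # adding more words only makes the span longer
            if len(syllables) >= min_syll:
                yield start, end + 1, list(syllables)
```

The default bounds of 12 and 17 syllables are the hexameter range given in step 5; other meters would need their own bounds.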

This can all be done in a Python script. At a minimum, we would like to find hexameters, half hexameters, iambic trimeters and trochaic septenarii in the prose text.

For a fully completed version of the project, the text should be provided through our Angular frontend, either as a .txt file or via a text field, and sent to our Flask API for processing. The candidates should then be returned to Angular and neatly displayed on the frontend, with a save-to-file function. Everything should work within the Docker environment we currently have. (Improving the LSTM is also very welcome.)
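As a rough idea of what the API could send back to the frontend, the sketch below builds a JSON payload from a list of detected candidates. It uses only the standard library and deliberately leaves out the Flask route itself; the `Candidate` fields, the `build_response` name and the `min_certainty` cut-off are assumptions made for illustration, not part of the existing codebase.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Candidate:
    text: str         # the words of the candidate passage
    meter: str        # e.g. "hexameter"
    certainty: float  # share of syllables matching the meter, between 0 and 1


def build_response(candidates, min_certainty=0.9):
    """Keep sufficiently certain candidates, best first, and serialize them to JSON."""
    kept = [c for c in candidates if c.certainty >= min_certainty]
    kept.sort(key=lambda c: c.certainty, reverse=True)
    # ensure_ascii=False keeps macrons and breves readable in the output
    return json.dumps({"candidates": [asdict(c) for c in kept]}, ensure_ascii=False)
```

The same JSON string could serve both the on-screen display and the save-to-file function, so the frontend would not need a second request.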