Summer School
Digging for Gold - Knowledge Extraction from Text
9-11th May 2023, Madrid, Spain
- Session 1: Digging for Gold I: Introduction to Python
- Session 2: Digging for Gold II: Extracting useful information from a corpus pt1 - Cleaning
- Session 3: Digging for Gold III: Extracting useful information from a corpus pt2
- Session 4: Digging for Gold IV: Word Embeddings
- Session 5: Digging for Gold V: Vector Semantics and Embeddings
- Session 6: Finding Gold I: Stylometry - Distances and differences
- Session 7: Finding Gold II: Stylometry with R
- Session 8: Finding Gold III: Keywords and associations
- Session 9: Finding Gold IV: Case-study - Stylometry applied to Old Spanish Poetry
- Session 10: Showing Gold I: Visualisation
- Session 11: Storing Gold II: Lindat/Teitok
Digging for Gold I: Introduction to Python
This lesson provides an introduction to variables, operators, loops, lists, and dictionaries. Students will learn how to use lists and dictionaries to manipulate data in their programs. By the end of the lesson, students will have a foundational understanding of Python’s core concepts.
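As a small illustration of the kind of exercise this lesson covers, the sketch below counts words with a list, a loop, and a dictionary. The sample data and variable names are made up for this example and are not taken from the course materials.

```python
# Count how often each word occurs, using a list and a dictionary.
words = ["gold", "text", "gold", "corpus", "text", "gold"]  # made-up sample data

counts = {}
for word in words:
    # dict.get returns 0 for words we have not seen yet
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs from most to least frequent
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```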
Speaker for this session
Salvador Ros
Salvador Ros is an Associate Professor at the School of Computer Science at UNED (National Distance Education University). He is currently the Technical Director of the POSTDATA ERC Starting Grant and the LyrAIcs proof-of-concept project, and Director of the Master of Big Data's architectures and technologies and Data Science. He was Director of Learning Technologies at UNED for six years and Vice Dean of Technologies at the School of Computer Science for six years. He received the Extraordinary Doctoral Award at UNED for his PhD dissertation, as well as two special best paper awards. He is a strategy and innovation manager in the public sector. He completed the Leadership Program for Public Sector Management at IESE Business School, Universidad de Navarra; the Strategic Senior Management for Universities programme at Universidad de Nebrija and Politécnica de Barcelona; and the Leadership Program for Innovation and Entrepreneurship in the Public Sector at Deusto Business School, Universidad de Deusto. He has been a senior member of the IEEE Education Society since 2007. His research and professional activity focuses on enhanced learning technologies for distance learning scenarios, learning analytics, big data, AI applied to science and the humanities, and strategic consulting for the public sector.
Digging for Gold II: Extracting useful information from a corpus pt1 - Cleaning
In this lesson, students will learn how to use the Natural Language Toolkit (nltk) to download a corpus and then use Python and regular expressions to clean the text data. Students will learn how to remove unwanted characters and symbols, tokenise the text into individual words, and how to remove stop words. This lesson will equip students with the skills to clean and preprocess text data for further analysis and natural language processing tasks.
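The cleaning steps described above can be sketched with the standard library alone. This is a minimal illustration, not the session's actual code: it uses a tiny hand-written stop word list instead of the NLTK one, and the regular expression assumes unaccented English text. The function name `clean_and_tokenise` is hypothetical.

```python
import re

# Tiny illustrative stop word list (an assumption; NLTK ships a much longer one)
STOP_WORDS = {"the", "a", "of", "and", "in", "is"}

def clean_and_tokenise(text):
    """Lowercase the text, strip non-letter characters, split into words, drop stop words."""
    text = text.lower()
    # Replace anything that is not an ASCII letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

sample = "The gold, in the corpus, is hidden: dig & find it!"
print(clean_and_tokenise(sample))
```

For texts in Spanish or other languages with accented characters, the character class would need to be widened, since `[^a-z\s]` would strip letters such as "á" or "ñ".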
Speakers for this session
Salvador Ros
Alvaro Pérez
Álvaro Pérez Pozo is a computational linguist at UNED and has published work on topics such as automatic stanza classification in Spanish poetry and artificial intelligence applications for the humanities. He also has experience obtaining, cleaning, and compiling very large text collections.
Digging for Gold III: Extracting useful information from a corpus pt2
This session continues lesson 2. Students will learn how to use more advanced Python data structures, how to download corpora with a Project Gutenberg Python library, and how to clean the text. Finally, students will learn what the pandas library is and how to use it to work with and process very large corpora.
Speakers for this session
Alvaro Pérez
Salvador Ros
Digging for Gold IV: Word Embeddings
In this lesson, students will use the natural language processing library, spaCy, to extract all the “scary” verbs from a corpus of horror books. They will then use the similarity function in spaCy to determine which verbs are most closely related to the concept of fear. Additionally, they will determine which book is the “scariest” by calculating the ratio of scary verbs to total words in each book. This exercise will provide students with an understanding of how natural language processing can be used to analyse and compare different works of literature based on a specific set of criteria.
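The "scariest book" calculation can be sketched as a simple ratio. The example below is a simplified stand-in: it uses a hypothetical hand-made set of scary verbs and pre-split toy texts rather than spaCy's tagging and similarity functions, and all names and data are invented for illustration.

```python
# Hypothetical list of "scary" verbs; in the session these would come from spaCy
SCARY_VERBS = {"scream", "tremble", "shudder", "haunt"}

def scary_ratio(tokens):
    """Share of tokens that appear in SCARY_VERBS (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    scary = sum(1 for t in tokens if t in SCARY_VERBS)
    return scary / len(tokens)

# Toy "books", already lowercased and split into words
books = {
    "book_a": "they scream and tremble in the dark".split(),
    "book_b": "the garden was quiet and warm today".split(),
}

ratios = {title: scary_ratio(tokens) for title, tokens in books.items()}
scariest = max(ratios, key=ratios.get)
print(ratios, scariest)
```

A real pipeline would lemmatise the text first, so that inflected forms such as "screamed" are counted as well.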
Speaker for this session
Salvador Ros
Digging for Gold V: Vector Semantics and Embeddings
Embedding spaces and computational literary studies have emerged as a fruitful convergence of computer science and literary analysis. This presentation introduces the concept of embedding spaces, which represent textual data as high-dimensional vectors, capturing semantic relationships and contextual information. Specifically, word embeddings enable tasks such as sentiment analysis and authorship attribution, while sentence and document embeddings facilitate analysis at larger text units. By leveraging computational methods, researchers can delve into the complexities of literature, offering quantitative insights and novel research avenues. This interdisciplinary approach holds great promise for revolutionising the study of literature and deepening our understanding of its intricacies.
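To make the idea of semantic relationships in an embedding space concrete, the sketch below computes cosine similarity between word vectors. The three-dimensional vectors are made-up toy values chosen for illustration; real embeddings have hundreds of dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, not taken from any trained model)
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])
print(sim_king_queen, sim_king_apple)
```

With these toy values, "king" ends up closer to "queen" than to "apple", which is the kind of relationship trained embeddings are expected to capture.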
Speaker for this session
Javier de la Rosa
Javier de la Rosa is Senior Research Scientist at the Artificial Intelligence Laboratory of the National Library of Norway, and former postdoc in Natural Language Processing (NLP) at UNED. He holds a PhD specialising in Digital Humanities and an MSc in Artificial Intelligence. His interests are in natural language processing applied to historical and literary texts, with a focus on large language models. He has previously worked at Stanford and the University of Western Ontario.
Finding Gold I: Stylometry - Distances and differences
This is an introductory overview of the field of stylometry and multivariate text analysis. We discuss classic approaches to text representation, such as bag of words, and show the ability of word frequencies to reflect meaningful cultural and social conditions of texts: genre, chronology and authorship.
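As a rough illustration of the bag-of-words approach, the sketch below represents each text by the relative frequencies of the most frequent words and compares texts with a Manhattan distance. This is a simplified stand-in, not the stylo implementation; the toy texts, the vocabulary size, and the function names are assumptions made for this example.

```python
from collections import Counter

def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in a non-empty token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocabulary]

def manhattan(u, v):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Toy texts, already tokenised
texts = {
    "text_a": "the cat and the dog and the bird".split(),
    "text_b": "the cat and the dog and the fish".split(),
    "text_c": "of war of peace of love of loss".split(),
}

# Use the five most frequent words of the whole collection as features
all_tokens = [t for tokens in texts.values() for t in tokens]
vocabulary = [w for w, _ in Counter(all_tokens).most_common(5)]

profiles = {name: relative_frequencies(tokens, vocabulary) for name, tokens in texts.items()}
d_ab = manhattan(profiles["text_a"], profiles["text_b"])
d_ac = manhattan(profiles["text_a"], profiles["text_c"])
print(d_ab, d_ac)
```

Measures used in stylometry, such as Burrows's Delta, additionally standardise each word's frequency across the corpus before taking the distance.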
R and RStudio
The most important thing to do is to install R and RStudio on your machine. We will use the stylo library for most of the day; if you want, you can look at the step-by-step introduction to stylo for beginners, or at the more extensive HOW TO, but we will cover all the basics. NB: If you cannot or prefer not to install R and stylo locally, you will be able to run the analysis from Colab. Be aware that this will require more coding, not less, because stylo's graphical interface does not work in Colab. You are free to bring your own collection of texts to the workshop, but you can also find plain-text fiction collections in various languages on the Computational Stylistics website.
Materials
We will use a Google Drive folder that holds all the necessary materials. Your options are:
- Download it and work locally
- Download the folder and re-upload it to work in Colab's virtual machine with R (or just open the .ipynb notebook, then File -> Save a copy in Drive, which creates a copy of the notebook on your own Drive)
To copy the folder to your own Google Drive, do this:
- Download CLS_Madrid_Folder
- Extract the downloaded .zip
- Upload the extracted folder back to your Google Drive
Speakers for this session
Artjoms Šeļa
Dr. Artjoms Šeļa is a postdoctoral researcher at the Methodology department of the Institute of Polish Language (PAN, Kraków) and is a research fellow at the University of Tartu (Estonia). He holds a PhD in Russian Literature and uses computational methods to understand historical change in literature and culture. His main research interests include stylometry, verse studies and cultural evolution. Sometimes he forays into digital preservation and history of quantitative methods in humanities.
Joanna Byszuk
Joanna Byszuk is a research associate and a member of Computational Stylistics Group at the Institute of Polish Language, Polish Academy of Sciences, Kraków. She has worked on 'Foundations of Computational Stylistics' (2018-2022) and 'CLS Infra' (2022-2025) projects, focusing on cross-lingual computational stylistics and advancing stylometric methodology and its understanding, especially locating method limitations and developing evaluation procedures. She was also engaged in the COST Action Distant Reading, where she was leading Working Group 2 'Methods and Tools' (2020-2022), and in 'Deep Learning in the Computational Stylistics' collaboration with the University of Antwerp. She is interested in discourse analysis and sociolinguistics, especially in connection to 'big data' and multimodal perspective, establishing in her dissertation a methodology of multimodal stylometry for the study of audiovisual works.
Finding Gold II: Stylometry with R
This session introduces the ‘stylo’ library for R, which supports a range of stylometric analyses on a collection of documents. We briefly introduce the R language and walk through the graphical user interface of ‘stylo’ to show the practicalities of feature selection, distance metrics, cluster analysis and sampling.
Note - Download instructions for software and materials can be found in Session 6 (Finding Gold I: Stylometry - Distances and differences)
Speakers for this session
Maciej Eder
Prof. Maciej Eder is the director of the Institute of Polish Language at the Polish Academy of Sciences, chair of the Committee of Linguistics at the Polish Academy of Sciences, vice-chair of the COST Action 'Distant Reading', co-founder of the Computational Stylistics Group, and the main developer of the R package 'Stylo' for performing stylometric analyses. He is interested in European literature of the Renaissance and the Baroque, classical heritage in early modern literature, and quantitative approaches to style variation. These include measuring style using statistical methods, authorship attribution based on quantitative measures, as well as 'distant reading' methods to analyse dozens (or hundreds) of literary works at a time.
Artjoms Šeļa
Joanna Byszuk
Finding Gold III: Keywords and associations
In this session, we look at the idea of word ‘keyness’ and at ways to understand which features differ between texts and corpora. We also show how to trace features that might underlie text groupings and clusters, and how to detect potential bias or systematic error in a corpus.
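One common way to quantify keyness is the log-likelihood (G2) statistic, which compares a word's frequency in a target corpus with its frequency in a reference corpus. The sketch below is a minimal illustration under that assumption; the session may use a different measure, and the toy corpora and function name are invented for this example.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 keyness score.

    a, b: frequency of the word in the target and reference corpus
    c, d: total number of tokens in the target and reference corpus
    """
    # Expected frequencies if the word were equally common in both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Toy corpora, already tokenised
target = "ghost ghost ghost night house night ghost".split()
reference = "house garden night house tea garden house".split()

t_counts, r_counts = Counter(target), Counter(reference)
keyness = {
    word: log_likelihood(t_counts[word], r_counts[word], len(target), len(reference))
    for word in sorted(set(target) | set(reference))
}
top_word = max(keyness, key=keyness.get)
print(top_word, round(keyness[top_word], 3))
```

The score itself does not say in which corpus a word is over-represented, so in practice it is read together with the relative frequencies in each corpus.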
Note - Download instructions for software and materials can be found in Session 6 (Finding Gold I: Stylometry - Distances and differences)
Speakers for this session
Joanna Byszuk
Artjoms Šeļa
Finding Gold IV: Case-study - Stylometry applied to Old Spanish Poetry
In this session, we will try to find out whether the Old Spanish version of the Book of Alexander was written by Gonzalo de Berceo, as one manuscript states. We will gather the data from the Old Spanish Textual Archive (OSTA), a lemmatised and PoS-tagged corpus of Old Spanish texts, and then use the stylo package to investigate the question. Afterwards, we will examine whether rhyming words are a good indicator of the authorship of the Book of Alexander.
Speaker for this session
José Manuel Fradejas Rueda
José Manuel Fradejas Rueda is a Professor of Spanish Language at the Universidad de Valladolid. Prof. Fradejas Rueda has been engaged with computational text analysis (including text editing, mostly of medieval sources) since the late 1980s, and has recently become more interested in these digital approaches after discovering R and how useful it can be for a literary or linguistic scholar.
Showing Gold I: Visualisation
This session aims to introduce visualisation grammars and their applications to accelerate exploratory data analysis in digital humanities projects by allowing users to create interactive visualisations with minimal effort in a Jupyter notebook. To this end, a real application will be built by replicating the design process of a visualisation system based on machine-annotated textual data, in a similar setup to many digital humanities research projects.
You can download the course materials for this session here: CLICK HERE
Speaker for this session
Alejandro Benito-Santos
Alejandro Benito-Santos is a postdoctoral researcher in the School of Computer Science, UNED, working at the intersection of textual analysis, digital humanities, information visualisation and HCI. His background includes working with unstructured and semistructured text and he designed an interactive visual analytics system that allowed users to navigate a historical dictionary given in TEI format as part of his postgraduate degree.
Storing Gold II: Lindat/Teitok
Corpus data should be kept FAIR: findable, accessible, interoperable and reusable. This lecture will show how to do that using an example set-up in which the LINDAT repository is used for findability (as well as long-term preservation), TEITOK for accessibility (as well as maintenance and annotation), and TEI for interoperability and reusability. We will also show how to interact with these tools and standards both via a GUI (online) and via an API (to interact programmatically).
Speaker for this session
Maarten Janssen
With a background in computational linguistics, Maarten has been involved in many corpus projects. Over the course of time he has developed the TEITOK environment, which is intended to allow linguists to build, maintain, and improve their own corpus without the need for extensive computational skills. Maarten is currently employed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, at Charles University in Prague.