We expected to dive into these seasons and find ways in which the show had changed. We had few expectations beyond the obvious results, such as the main family members being prominent, and a sense of which characters appeared the most. We went in largely blind so that we could better understand how to treat large-scale corpora as data rather than focus on their actual contents.
We made many interesting findings, several of which were surprisingly high counts across our four selected seasons of the show. We also learned a lot about how to examine large heaps of data and how to filter it more effectively.
Our biggest issue was that we initially wanted to bite off more than we could chew. Some of the files were messy and needed a lot of cleaning; there may be web standards, but that does not mean everyone follows them. Another issue is that we cannot know how accurate the transcripts are without watching each episode alongside its document and making sure the two line up. We also had to find workable ways to tag our files with XML and decide exactly what we wanted to find.
At the start, we had greater plans: to look over nearly every episode, from Season 1 in 1989 up to the present day in 2023, and gather the same numbers we found here so that we would have denser information. We were quickly humbled by the lack of usable online transcripts and by the sheer amount of work that a full 30-plus seasons would have required. Even so, we found many interesting things. We learned a lot about what it would take to do something like this on an even larger scale, and about the headaches involved in fixing corpora of that size. Overall, we enjoyed the process and walked away satisfied and happy to have learned and developed new skills.