Emily18 Com Full Sets – 2021
An Exploratory Analysis of the Complete 2021 Collection
Limitations –
Future Work – We propose extending the analysis with:
| Cluster ID | Dominant Modality | Size (items) | Representative Themes (LDA keywords) | |------------|-------------------|--------------|---------------------------------------| | C1 | Text‑heavy (70 % transcripts) | 322 | “memory”, “family”, “childhood”, “storytelling”, “nostalgia” | | C2 | Image‑centric | 254 | “landscape”, “architecture”, “light”, “color”, “composition” | | C3 | Audio‑rich (58 % MP3) | 210 | “interview”, “soundscape”, “ambient”, “dialogue”, “field‑recording” | | C4 (Noise) | Mixed | 12 | — | Emily18 Com Full Sets -2021-
Visual inspection of the UMAP plots shows clear separation between C1–C3, confirming that multimodal embeddings preserve thematic distinctions.
The Emily18 Com series (2021) comprises a comprehensive set of 1 018 digital artifacts ranging from high‑resolution images and audio clips to annotated textual transcripts. Although the series was released publicly in late 2021, no systematic scholarly assessment of its content, structure, or potential research applications has yet been published. This paper presents a descriptive and exploratory analysis of the full 2021 collection. We detail the provenance of the data, the taxonomy of its constituent items, and the preprocessing pipeline required for reproducible research. Using a combination of statistical profiling, network‑based clustering, and topic modelling, we uncover three dominant thematic clusters—Personal Narrative, Technical Demonstration, and Community Interaction—and illustrate how these clusters map onto temporal release patterns. The findings highlight the collection’s value for studies in digital humanities, media archaeology, and human‑computer interaction, and we provide a publicly available Python‑based toolkit to facilitate further investigation.
Release dates (extracted from metadata) were plotted against cluster membership to examine whether certain themes corresponded to particular periods of 2021. Emily18 Com Full Sets – 2021 An Exploratory
| Step | Tool | Description |
|------|------|-------------|
| File catalogue | os, pathlib | Generated a master CSV (catalogue.csv) with file path, type, size, SHA‑256. |
| Image feature extraction | torchvision (ResNet‑50) | Produced a 2048‑dimensional embedding for each PNG. |
| Audio feature extraction | librosa | Computed MFCCs (20 coefficients, mean & variance) → 40‑dim vector. |
| Text preprocessing | spaCy (en_core_web_md) | Tokenisation, lemmatisation, stop‑word removal; generated TF‑IDF vectors (max‑features = 2 000). |
| Metadata aggregation | pandas | Merged JSON fields (author, tags, release_date) into a unified table. |
All intermediate artefacts are stored under /processed/ and version‑controlled with Git LFS.
The Emily18 Com series, curated by the independent digital‑arts collective Emily18, was launched in 2021 as a “full‑set” archive of the group’s output for that calendar year. The set includes: Limitations –
| Media type | Quantity | Typical size | Example | |------------|----------|--------------|---------| | PNG images | 312 | 2–8 MB each | “village‑scene‑01.png” | | MP3 audio | 184 | 5–30 MB each | “interview‑s01‑track‑04.mp3” | | TXT transcripts | 276 | < 500 KB each | “chatlog‑2021‑03‑15.txt” | | JSON metadata | 246 | 1–2 KB each | “metadata‑001.json” | | Total | 1 018 items | — | — |
Despite the collection’s breadth, it has received limited scholarly attention. Researchers interested in longitudinal digital culture, multimodal narrative analysis, or archival practices would benefit from a systematic description of the data and an initial analytical baseline. This paper aims to fill that gap.
Research questions