8. Introducing the Archie Comics Multiclass dataset
Hello again! Today marks the start of an exciting new series, where we look at image classification and other computer vision (CV)-related tasks using the Archie Comics Multi-class (ACMC) dataset that I created.
Now, right at the start, let me address one issue that some of you may have - why comics? Why not something with more real-world applicability? Well, firstly, because comics are inherently a sketched representation of the real world, almost any CV-related task that can be done on photo datasets can be done on comics datasets - image classification, object detection, image segmentation, image captioning, image reconstruction, you name it. Indeed, we shall look at several of these applications in later posts. The only real difference is that instead of, say, training our model on photos of faces, cars, animals, etc., we will train on drawings of these instead. The second reason is that we can often obtain a large number of comics sketches much more easily than we could obtain photographs of a particular subject. In the ACMC dataset, for instance, we shall see that we have hundreds of sketches of certain characters, a number that may not be easy to obtain for any real-life subjects. Thirdly, while comics may be representative of reality, they are often an exaggerated version of reality, providing us with poses and expressions that are not common in real life. Depending on your point of view, this may or may not be a plus, but I see it as increasing the diversity of the dataset. Fourth, and probably the main reason I chose to go down this path - it’s fun! Look, doing machine learning (ML) stuff is often hard and frustrating, and so I preferred to choose a subject that is at least interesting to me, rather than some dataset I couldn’t care less about!
OK, so then comes the next question - why Archie comics in particular? Once again, we have a list of reasons…starting with the fact that I have been reading them for almost three decades now, and so have both an extensive amount of material and an intimate knowledge of the characters. The second reason is that they have been published for over 80 years now. Why does that matter? Well, aside from the abundance of available material, this leads to a point that may be better appreciated by those familiar with the comics than those who are not. You see, dozens of artists have drawn these characters over the decades, bringing their own drawing style to the table. At the same time, dressing styles and other aspects of daily life have also changed tremendously over this period. Despite this, the artists are required to maintain a continuity with past representations of the characters, so that readers can read a 2020 Archie story as easily as they can a 1950 story (and stories from different decades often appear together in the same issue). In other words, the characters have to look similar enough for human readers to be able to readily recognise them. This, then, is a great CV challenge - can the model learn the character feature representations well enough to tell that, though they look somewhat different, this sketch from 1962 and this from 2008 both show Reggie?
Thirdly, they have a large cast of characters - over two dozen recurring characters, easily. This makes it especially interesting for classification-type tasks. Fourthly, the frequency of appearance of these characters is wildly different, with those of the lead characters often orders of magnitude higher than some of the niche characters. This makes the dataset very imbalanced…and even more intriguing!
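To make 'imbalanced' a little more concrete, here is a minimal sketch of how the imbalance in a list of labels could be measured. The label list below is made up purely for illustration and does not reflect the actual ACMC counts, and the function name `imbalance_summary` is my own. The inverse-frequency weights it computes are just one common way of compensating for imbalance, not necessarily the approach used in later posts.

```python
from collections import Counter


def imbalance_summary(labels):
    """Return per-class counts, the largest/smallest count ratio,
    and inverse-frequency class weights for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    # Ratio between the most and least frequent class
    ratio = max(counts.values()) / min(counts.values())
    # Inverse-frequency weighting: rarer classes get larger weights
    weights = {name: total / (n_classes * c) for name, c in counts.items()}
    return counts, ratio, weights


# Hypothetical labels for illustration only (NOT real ACMC counts)
labels = ["Archie"] * 8 + ["Jughead"] * 4 + ["Smithers"] * 1
counts, ratio, weights = imbalance_summary(labels)
print(counts)
print(ratio)
print(weights)
```

With a real dataset, the label list would come from wherever the image labels are stored, and the ratio gives a quick feel for how lopsided the classes are.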
So let’s look at the dataset, which can be found here, in a little more detail. Before that, an obvious disclaimer - all rights to the images belong to Archie Comic Publications Inc., with the dataset only being used by me for educational purposes. I have created the dataset from clippings from various Archie comic books and newspaper strips, with some minor editing occasionally done to remove the lettering in dialogue boxes, etc.
Multi-class versus multi-label
Before anything else, I should mention that I have actually created two datasets - one multi-class, and one multi-label. We will deal with the multi-label dataset later, but since people sometimes get confused about the difference between multi-class and multi-label classification, let me explain this here briefly. Multi-class is when every image has exactly one label, with the label being one of a number of possible classes. Multi-label is when an image can have several labels at the same time.
An example of multi-class image classification is this:
Cat
Dog
© 2022 Agneev Mukherjee
Simple, right? On the other hand, the images below, with the labels given below them, may be used for multi-label classification:
[Grass; Sand; Sea; Boats; Sunny]
[Cars; Bicycle; Buildings; Trees; Lamp post; Street; Grass; Shrubs; Cloudy]
Of course, the exact labels for the figures will depend on the application, but you get the drift. So the dataset we are dealing with now is multi-class, that is, each picture only has a single label - the name of a particular character - attached to it. Later we will deal with the multi-label dataset, with several characters in each image.
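In code, the difference usually shows up in how the targets are encoded. The snippet below is a minimal sketch, assuming a small hypothetical class list (a handful of character names picked for illustration, not the full set of classes in the dataset) and two helper functions of my own naming: a multi-class target is a single class index, while a multi-label target is a multi-hot vector with one slot per class.

```python
# Hypothetical class list: a small subset chosen for illustration only
CLASSES = ["Archie", "Betty", "Jughead", "Reggie", "Veronica"]


def encode_multiclass(label):
    """Multi-class: exactly one label per image, stored as one class index."""
    return CLASSES.index(label)


def encode_multilabel(labels):
    """Multi-label: any number of labels per image, stored as a multi-hot vector."""
    present = set(labels)
    unknown = present - set(CLASSES)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if name in present else 0 for name in CLASSES]


single = encode_multiclass("Reggie")
multi = encode_multilabel(["Archie", "Veronica"])
print(single)  # 3
print(multi)   # [1, 0, 0, 0, 1]
```

The single index says 'this image is Reggie and nobody else', whereas the multi-hot vector can switch on as many characters as appear in the image.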
Brief look at the characters
As I said earlier, Archie comics has a venerable history stretching back over 80 years. While nowhere near as popular (or ubiquitous) as in its heyday, it continues to have legions of fans, with new ‘properties’ like the animated series Archie’s Weird Mysteries, the zombie comics title Afterlife with Archie, and the TV series Riverdale coming out every now and then. There are also different versions of the characters like Little Archie, The New Archies, the ‘new look’ series, etc. The ACMC, however, deals with the classic version of the Archie comics characters - as we shall see, there is more than enough varied content in this to satiate us.
I was originally planning to provide a fairly detailed description of each character in the dataset, but then realised that this would be superfluous and irrelevant to the task at hand. I therefore redirect you to Wikipedia if you would like to know more about them. Here, let’s just see a couple of pix of each character, one ‘old’ and one ‘new’.
Archie Andrews (An archetypal average American teenager):
Jughead Jones (Archie’s best friend, usually wears a beanie, smart but lazy, girl hater, has an insatiable appetite):
Betty Cooper (Sweet girl, good at studies and sports, has crush on Archie):
Veronica Lodge (Rich, spoilt girl, Archie’s crush):
Reggie Mantle (Archie’s main rival):
The above are, in my opinion, the 5 most important characters in the ‘Archies Universe’ - and, in fact, the members of the band ‘The Archies’! The ‘Riverdale gang’, on the other hand, has several other members, among whom the most notable are:
Dilton Doiley (Stereotypical teenage geeky genius):
Moose Mason (Stereotypical jock, with near-superhuman strength and a meagre intellect):
Midge Klump (Moose’s girlfriend):
Ethel Muggs (Chases Jughead):
Chuck Clayton (Talented athlete and cartoonist):
Nancy Woods (Chuck’s girlfriend):
The Riverdale gang studies at Riverdale High, whose most important staff members are:
Waldo Weatherbee (The school principal):
Geraldine Grundy (Usually an English teacher, although often also shown teaching Maths, History and other subjects):
Mr. Flutesnoot (Usually a science, especially chemistry, teacher):
Coach Kleats (Head physical education teacher):
Coach Clayton (Chuck’s father, physical education and history teacher):
Mr. Svenson (School janitor):
Ms. Beazley (School cafeteria cook):
Of the parents of the Riverdale gang, two characters make appearances far more frequently than others:
Hiram Lodge (Veronica’s father):
Fred Andrews (Archie’s father):
Being as rich as they are, it is no surprise that the Lodges have a butler, Smithers, who makes semi-regular appearances:
And what is the gang’s favourite hangout? Pop Tate’s Chok’lit Shoppe, of course. Here’s Pop:
So that was a round-up of all the major characters in the ACMC dataset. However, the dataset actually has one more class - ‘Others’. This, as you can guess, is a medley of images of random characters, and the aim is for a model to put any images that it cannot classify as a member of any of the other classes into this category. Let’s have a look at some of the images under this heading:
Archie comics fans will recognise several familiar faces in that gallery: Jughead’s father, Gaston, Archie’s mother, Cheryl Blossom, Betty’s father, Jellybean, and Ms. Haggly. The other images are of non-recurring characters.
Conclusion
Looking at the pictures above serves to highlight some of the challenges of working with this dataset. I used a range of materials to make the dataset, collected over a long period, and hence the images vary widely in size and image quality. How did I decide whether an image belongs in the dataset or not? The primary criterion was that a ‘human expert’, in this case an Archie comics fan, should be able to look at the image in question and identify it without any further cues. In some cases, this is difficult - without prior knowledge, for example, it is hard to categorise the Midge images above as belonging to the same person. Still, this is better than if I had put Jughead’s mother as a category - believe it or not, both the images below are of her. These images actually appear in the ‘Others’ category, meaning that their belonging to the same character is moot.
Another reason for the dataset being challenging - well, a look at some Jughead images below should suffice…
The images above can all easily be identified by any Jughead fan, but for an ML model, it might not be as straightforward.
And finally, as I said at the start, the dataset is also very imbalanced, which introduces its own challenges - and provides the opportunity to test some novel techniques. We shall look at this and other aspects in detail next time, so goodbye for now!