⇠ 7. Modelling water bodies - 4

9. Diving into the Archie Comics Multiclass dataset ⇢


Categories:

Machine Learning   Computer Vision   Python


Tags:

Created dataset


Hello again! Today marks the start of an exciting new series, where we look at image classification and other computer vision (CV)-related tasks using the Archie Comics Multi-class (ACMC) dataset that I created.

Now, right at the start, let me address one issue that some of you may have - why comics? Why not something with more real-world applicability? Well, firstly, because comics are inherently a sketched representation of the real world, almost any CV-related task that can be done on photo datasets can be done on comics datasets - image classification, object detection, image segmentation, image captioning, image reconstruction, you name it. Indeed, we shall look at several of these applications in later posts. The only real difference is that instead of, say, training our model on photos of faces, cars, animals, etc., we will train it on drawings of these. Secondly, we can often obtain a large number of comics sketches much more easily than we could obtain photographs of a particular subject. In the ACMC dataset, for instance, we shall see that we have hundreds of sketches of certain characters, a number that may not be easy to obtain for any real-life subject. Thirdly, while comics may be representative of reality, they are often an exaggerated version of it, providing us with poses and expressions that are not common in real life. Depending on your point of view, this may or may not be a plus, but I see it as increasing the diversity of the dataset. Fourthly, and probably the main reason I chose to go down this path - it’s fun! Look, doing machine learning (ML) stuff is often hard and frustrating, and so I preferred to choose a subject that is at least interesting to me, rather than some dataset I couldn’t care less about!

OK, so then comes the next question - why Archie comics in particular? Once again, we have a list of reasons…starting with the fact that I have been reading them for almost three decades now, and so have both an extensive amount of material and an intimate knowledge of the characters. The second reason is that they have been published for over 80 years now. Why does that matter? Well, aside from the abundance of available material, this leads to a point that may be better appreciated by those familiar with the comics than by those who are not. You see, dozens of artists have drawn these characters over the decades, each bringing their own drawing style to the table. At the same time, dressing styles and other aspects of daily life have also changed tremendously over this period. Despite this, the artists are required to maintain continuity with past representations of the characters, so that readers can read a 2020 Archie story as easily as they can a 1950 story (and stories from different decades often appear together in the same issue). In other words, the characters have to look similar enough for human readers to be able to readily recognise them. This, then, is a great CV challenge - can the model learn the character feature representations well enough to tell that, though they look somewhat different, this sketch from 1962 and this one from 2008 both show Reggie?

Thirdly, they have a large cast of characters - over two dozen recurring characters, easily. This makes it especially interesting for classification-type tasks. Fourthly, the frequency of appearance of these characters is wildly different, with the lead characters often appearing orders of magnitude more frequently than some of the niche characters. This makes the dataset very imbalanced…and even more intriguing!
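To make the imbalance point a little more concrete, here is a minimal sketch of one common way of quantifying it: counting the images per class and deriving inverse-frequency class weights, so that rarer classes get larger weights. The label list below is made up purely for illustration and does not reflect the real ACMC counts, and the function name `inverse_frequency_weights` is my own. This is just one standard heuristic, not necessarily what we will use later in the series.

```python
from collections import Counter

# Hypothetical labels for a tiny, made-up sample; NOT the real ACMC counts.
labels = ["Archie"] * 6 + ["Jughead"] * 3 + ["Smithers"] * 1


def inverse_frequency_weights(labels):
    """Return weight = n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        name: n_samples / (n_classes * count)
        for name, count in counts.items()
    }


weights = inverse_frequency_weights(labels)
for name, weight in sorted(weights.items()):
    # Rarer classes end up with larger weights.
    print(f"{name}: {weight:.2f}")
```

With weights like these, a mistake on a rare character would count for more during training than a mistake on a lead character, which is one simple way of stopping the model from ignoring the niche classes.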

So let’s look at the dataset, which can be found here, in a little more detail. Before that, an obvious disclaimer - all rights to the images belong to Archie Comic Publications Inc., with the dataset only being used by me for educational purposes. I have created the dataset from clippings from various Archie comic books and newspaper strips, with some minor editing occasionally done to remove the lettering in dialogue boxes, etc.

Multi-class versus multi-label

Before anything else, I should mention that I have actually created two datasets - one multi-class, and one multi-label. We will deal with the multi-label dataset later, but since people sometimes get confused about the difference between multi-class and multi-label classification, let me explain this here briefly. In multi-class classification, every image has exactly one label, chosen from a number of possible classes. In multi-label classification, every image can have several labels at once.

An example of multi-class image classification is this:

Image_1

        Cat

Image_2

        Dog

© 2022 Agneev Mukherjee


Simple, right? On the other hand, the images below, with the labels given below them, may be used for multi-label classification:

Image_3
[Grass; Sand; Sea; Boats; Sunny]




Image_4
[Cars; Bicycle; Buildings; Trees; Lamp post; Street; Grass; Shrubs; Cloudy]



Of course, the exact labels for the figures will depend on the application, but you get the drift. So the dataset we are dealing with now is multi-class, that is, each picture only has a single label - the name of a particular character - attached to it. Later we will deal with the multi-label dataset, with several characters in each image.
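The difference also shows up in how the targets are typically encoded. The sketch below, which uses the example labels from the images above, encodes a multi-class target as a one-hot vector (exactly one entry is 1) and a multi-label target as a multi-hot vector (any number of entries can be 1). The helper names `one_hot` and `multi_hot` and the class lists are my own choices for illustration, not part of the dataset.

```python
def one_hot(label, classes):
    """Multi-class target: exactly one position is set to 1."""
    if label not in classes:
        raise ValueError(f"Unknown class: {label}")
    return [1 if name == label else 0 for name in classes]


def multi_hot(labels, classes):
    """Multi-label target: one position is set to 1 per label present."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"Unknown classes: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in classes]


# Illustrative class lists, built from the example labels above.
animal_classes = ["Cat", "Dog"]
scene_classes = ["Grass", "Sand", "Sea", "Boats", "Sunny", "Cloudy"]

dog_target = one_hot("Dog", animal_classes)
beach_target = multi_hot(["Grass", "Sand", "Sea", "Boats", "Sunny"], scene_classes)

print(dog_target)    # [0, 1]
print(beach_target)  # [1, 1, 1, 1, 1, 0]
```

In the multi-class case the entries always sum to one, which is why such models usually end in a softmax layer; in the multi-label case each entry is an independent yes/no decision.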

Brief look at the characters

As I said earlier, Archie comics has a venerable history stretching back over 80 years. While nowhere near as popular (or ubiquitous) as in its heyday, it continues to have legions of fans, with new ‘properties’ like the animated series Archie’s Weird Mysteries, the zombie comics title Afterlife with Archie, and the TV series Riverdale coming out every now and then. There are also different versions of the characters like Little Archie, The New Archies, the ‘new look’ series, etc. The ACMC, however, deals with the classic version of the Archie comics characters - as we shall see, there is more than enough varied content in this to satiate us.

I was originally planning to provide a fairly detailed description of each character in the dataset, but then realised that this would be superfluous and irrelevant to the task at hand. I therefore redirect you to Wikipedia if you would like to know more about them. Here, let’s just see a couple of pix of each character, one ‘old’ and one ‘new’.

Archie Andrews (An archetypal average American teenager):

Image_5         Image_6


Jughead Jones (Archie’s best friend, usually wears a beanie, smart but lazy, girl hater, has an insatiable appetite):

Image_7         Image_8


Betty Cooper (Sweet girl, good at studies and sports, has a crush on Archie):

Image_9         Image_10


Veronica Lodge (Rich, spoilt girl, Archie’s crush):

Image_11         Image_12


Reggie Mantle (Archie’s main rival):

Image_13         Image_14


The above are, in my opinion, the 5 most important characters in the ‘Archies Universe’ - and, in fact, the members of the band ‘The Archies’! The ‘Riverdale gang’, on the other hand, has several other members, among whom the most notable are:

Dilton Doiley (Stereotypical teenage geeky genius):

Image_15         Image_16


Moose Mason (Stereotypical jock, with near-superhuman strength and a meagre intellect):

Image_17         Image_18


Midge Klump (Moose’s girlfriend):

Image_19         Image_20


Ethel Muggs (Chases Jughead):

Image_21         Image_22


Chuck Clayton (Talented athlete and cartoonist):

Image_23         Image_24


Nancy Woods (Chuck’s girlfriend):

Image_25         Image_26


The Riverdale gang studies at Riverdale High, whose most important staff members are:

Waldo Weatherbee (The school principal):

Image_27         Image_28


Geraldine Grundy (Usually an English teacher, although often also shown teaching Maths, History and other subjects):

Image_29         Image_30


Mr. Flutesnoot (Usually a science, especially chemistry, teacher):

Image_31         Image_32


Coach Kleats (Head physical education teacher):

Image_33         Image_34


Coach Clayton (Chuck’s father, physical education and history teacher):

Image_35         Image_36


Mr. Svenson (School janitor):

Image_37         Image_38


Ms. Beazley (School cafeteria cook):

Image_39         Image_40


Of the parents of the Riverdale gang, two characters make appearances far more frequently than others:

Hiram Lodge (Veronica’s father):

Image_41         Image_42


Fred Andrews (Archie’s father):

Image_43         Image_44


Being as rich as they are, it is no surprise that the Lodges have a butler, Smithers, who makes semi-regular appearances:

Image_45         Image_46


And what is the gang’s favourite hangout? Pop Tate’s Chok’lit Shoppe, of course. Here’s Pop:

Image_47         Image_48


So that was a round-up of all the major characters in the ACMC dataset. However, the dataset actually has one more class - ‘Others’. This, as you can guess, is a medley of images of random characters, and the aim is for a model to put any images that it cannot classify as a member of any of the other classes into this category. Let’s have a look at some of the images under this heading:

Image_49 Image_50 Image_51 Image_52 Image_53


Image_54 Image_55 Image_56 Image_57 Image_58


Image_59 Image_60 Image_61 Image_62 Image_63


Archie comics fans will recognise several familiar faces in that gallery: Jughead’s father, Gaston, Archie’s mother, Cheryl Blossom, Betty’s father, Jellybean, and Ms. Haggly. The other images are of non-recurring characters.
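To recap the class structure in code form, here is a minimal sketch of a class-to-index mapping in which any character without a class of their own falls back to ‘Others’. The class names are taken from the character list in this post, but their exact spellings and order in the dataset files are assumptions on my part, as is the helper name `label_index`.

```python
# Class names follow the character list in this post; the spellings and
# ordering used in the actual dataset files are assumptions.
NAMED_CLASSES = [
    "Archie", "Jughead", "Betty", "Veronica", "Reggie",
    "Dilton", "Moose", "Midge", "Ethel", "Chuck", "Nancy",
    "Weatherbee", "Grundy", "Flutesnoot", "Kleats", "Coach Clayton",
    "Svenson", "Beazley", "Hiram Lodge", "Fred Andrews",
    "Smithers", "Pop Tate",
]
CLASSES = NAMED_CLASSES + ["Others"]
CLASS_TO_INDEX = {name: index for index, name in enumerate(CLASSES)}


def label_index(character):
    """Map a character name to a class index, falling back to 'Others'."""
    return CLASS_TO_INDEX.get(character, CLASS_TO_INDEX["Others"])


print(len(CLASSES))                   # 23
print(label_index("Reggie"))          # 4
print(label_index("Cheryl Blossom"))  # 22, i.e. 'Others'
```

By my count of the list above, this gives 22 named classes plus ‘Others’, so 23 classes in all.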

Conclusion

Looking at the pictures above serves to highlight some of the challenges of working with this dataset. I used a range of materials to make the dataset, collected over a long period, and hence the images vary widely in size and quality. How did I decide whether an image belongs in the dataset or not? The primary criterion was that a ‘human expert’, in this case an Archie comics fan, should be able to look at the image in question and identify the character without any further cues. In some cases, this is difficult - without prior knowledge, for example, it is hard to tell that the two Midge images above show the same person. Still, this is better than if I had made Jughead’s mother a category - believe it or not, both the images below are of her. These images actually appear in the ‘Others’ category, so whether the model can tell that they show the same character is moot.

Image_64         Image_65


Another reason for the dataset being challenging - well, a look at some Jughead images below should suffice…


Image_66   Image_67   Image_68   Image_69   Image_70   Image_71


The images above can all easily be identified by any Jughead fan, but for an ML model, it might not be as straightforward.

And finally, as I said at the start, the dataset is also very imbalanced, which introduces its own challenges - and provides the opportunity to test some novel techniques. We shall look at this and other aspects in detail next time, so goodbye for now!
