Comparing TV Cast Members Using IMDB Data


Introduction


Seinfeld and Malcolm in the Middle are two great programmes, and fans of both may have noticed that they share a few actors. Bryan Cranston plays Tim Whatley in Seinfeld and Hal in Malcolm in the Middle, and Daniel von Bargen plays Kruger and Commandant Edwin Spangler respectively, but there are other familiar faces too. Stevie’s mother, Kitty Kenarban, is played by the actress who was the bookshop cashier in Seinfeld, the one who wouldn’t accept the book George tried to return after taking it into the toilet. The actress who played Sue Ellen Mischke pops up, and of course Jason Alexander makes an appearance in a more prominent role. Perhaps IMDB can provide more detailed information on exactly who was in both programmes?


The data


In another project (see the links at the bottom of the page) the downloadable IMDB database held all the information needed to solve the problem at hand. Unfortunately the cast lists are not part of that download, so the only option is to extract them from the HTML of the website. The Mojolicious project provides the tools needed to do all this, starting with the Mojo::UserAgent module, which can be used to download the web pages.


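use v5.14;	# enables say and strict
use Mojo::UserAgent;

my $ua;
# Only download the page if there is no saved copy yet
unless (-f 'seinfeld_cast.html') {
	say "Downloading Seinfeld cast...";
	$ua = Mojo::UserAgent->new;
	$ua->get('https://www.imdb.com/title/tt0098904/fullcredits/')
	->result->save_to('seinfeld_cast.html');
}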
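
One thing to watch with the snippet above: result only dies on connection errors, so an HTTP error response would be saved to disk and then treated as the cached cast page on every later run. A minimal sketch of a more defensive variant follows. The helper name fetch_cached is made up for illustration and is not part of the original script; it assumes it is handed a Mojo::UserAgent object.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): download $url to
# $file unless a saved copy already exists, refusing to cache error pages.
sub fetch_cached {
	my ($ua, $url, $file) = @_;

	# A saved copy exists, so no request is made
	return $file if -f $file;

	# result dies on connection errors but returns 4xx/5xx responses normally
	my $res = $ua->get($url)->result;
	die "GET $url failed: " . $res->code . ' ' . $res->message . "\n"
		unless $res->is_success;

	$res->save_to($file);
	return $file;
}

# Usage (performs a network request on the first run):
#   use Mojo::UserAgent;
#   my $ua = Mojo::UserAgent->new;
#   fetch_cached($ua, 'https://www.imdb.com/title/tt0098904/fullcredits/',
#       'seinfeld_cast.html');
```

The second programme's page would be fetched the same way, with its own title ID and file name.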

Extracting the cast


Looking through the HTML shows a table with the CSS class cast_list, containing rows that look something like this (cleaned up):


<tr class="odd">
<td><a href="/name/nm0001431/">Wayne Knight</a></td>
<td class="character"><a href="/title/tt0098904/characters/nm0001431">Newman</a>
<a href="#" class="toggle-episodes" data-n="43">43 episodes, 1991-1998</a>
</td>
</tr>

So the CSS selector 'table.cast_list tr td.character' matches the relevant TD element for each character in the cast, and the parent TR of that element contains the A elements holding the actor, the character name and the number of appearances. Using Mojo::DOM, the actor and character can be told apart by the value of the href attribute, and the episode count by the class attribute.


use Mojo::DOM;

sub extract_cast {
	my $filename = shift;

	# Slurp the saved page in one read
	open my $fh, '<', $filename or die "Can't open $filename: $!";
	read $fh, my $html, -s $fh;
	close $fh;

	my @castlist;
	my $dom = Mojo::DOM->new($html);
	for ($dom->find('table.cast_list tr td.character')->each) {
		my $actor = "";
		my $character = "";
		my $num_eps = 0;
		for ($_->parent->find('a')->each) {
			my $href = $_->attr->{href} // '';	# some links may lack an href
			my $class = $_->attr->{class} // '';
			if ($href =~ m|^/name/nm\d+/|) {
				$actor = $_->text =~ s/(^\s+|\s+$)//gr;
			} elsif ($href =~ m|/title/.*/characters/|) {
				$character = $_->text;
			}
			if ($class eq 'toggle-episodes') {
				$num_eps = $_->attr->{"data-n"};
			}
		}
		next unless $actor && $character;
		push @castlist, { actor => $actor, character => $character, n => $num_eps };
	}
	return \@castlist;
}
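
To see what the selector and the step up to the parent row actually yield, here is a small sketch that runs them on a row shaped like the sample above. The surrounding table markup is reconstructed for illustration and is far simpler than the real page.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# One sample row, wrapped in the table the selector expects
my $html = <<'HTML';
<table class="cast_list">
<tr class="odd">
<td><a href="/name/nm0001431/">Wayne Knight</a></td>
<td class="character"><a href="/title/tt0098904/characters/nm0001431">Newman</a>
<a href="#" class="toggle-episodes" data-n="43">43 episodes, 1991-1998</a>
</td>
</tr>
</table>
HTML

my $dom = Mojo::DOM->new($html);
my @links;
for my $td ($dom->find('table.cast_list tr td.character')->each) {
	# Step up from the character cell to its row, then collect every link
	for my $link ($td->parent->find('a')->each) {
		push @links, {
			href => $link->attr('href'),
			text => $link->text,
			n    => $link->attr('data-n'),	# undef except on the episodes link
		};
	}
}

say "$_->{href} => $_->{text}" for @links;
```

The row gives three links: the actor, the character and the episode toggle. These are the three cases that extract_cast distinguishes.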

Putting it together


The rest is just running over the two lists, collecting the overlaps into an array, then using the Term::Table module from CPAN to print them out. Download the full script below!


the full script
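
For a rough idea of that last step without opening the script, here is a minimal sketch. It is not the code from the full script: the function name common_cast, the column headings and the sample entries are all invented for illustration, and the cast data is placeholder data rather than real credits. The lists are assumed to have the shape returned by extract_cast.

```perl
use strict;
use warnings;
use feature 'say';

# Return one row per (actor, role, role) combination found in both lists.
# Each list is an array ref of { actor, character, n } hashes.
sub common_cast {
	my ($first, $second) = @_;

	# Index the first cast by actor; one actor may play several characters
	my %by_actor;
	push @{ $by_actor{ $_->{actor} } }, $_ for @$first;

	my @rows;
	for my $s (@$second) {
		for my $f (@{ $by_actor{ $s->{actor} } // [] }) {
			push @rows, [ $s->{actor}, $f->{character}, $f->{n},
				$s->{character}, $s->{n} ];
		}
	}
	return \@rows;
}

# Placeholder data, NOT real credits
my $show_one = [
	{ actor => 'Actor A', character => 'Dentist', n => 5 },
	{ actor => 'Actor B', character => 'Postman', n => 43 },
];
my $show_two = [
	{ actor => 'Actor A', character => 'Father',  n => 151 },
	{ actor => 'Actor C', character => 'Teacher', n => 2 },
];

my @header = ('Actor', 'First role', 'Eps', 'Second role', 'Eps');
my $rows = common_cast($show_one, $show_two);

if (eval { require Term::Table; 1 }) {
	say for Term::Table->new(header => \@header, rows => $rows)->render;
} else {
	# Plain fallback when Term::Table is not installed
	say join ' | ', @$_ for \@header, @$rows;
}
```

Indexing one list by actor name first avoids comparing every entry of one cast against every entry of the other.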


Links


a similar project using the downloadable IMDB database


Mojo - web development toolkit

Term::Table on CPAN
