Note: the order of episodes is often different between broadcast and DVD releases. IMDB usually uses the broadcast order, so using this script to rename your files without checking may cause strange results.
A discussion of getting television episode data from the Internet Movie Database (IMDB), importing it into an SQLite database, and using a Perl script to rename video files and set their metadata based on it.
Often when making legitimate backup copies of videos, the files end up with names like title_1. When the videos are a television series it’s relatively easy to rename them by season and episode, but adding the episode titles can be tedious and time-consuming. The Internet Movie Database (IMDB) holds all this data online, so why not use it to automate the process?
IMDB does not have an open API to access the title and episode data, and ‘scraping’ the HTML is fragile and error-prone. The underlying database is made available, however, as a download for personal or non-commercial use.
The important dataset in this case is “title.basics” which contains the following interesting fields:
tconst: the primary key that identifies this episode or series
titleType: for our purposes “tvSeries” and “tvEpisode” are the values we're interested in
primaryTitle: the series name, such as “Arrested Development” or an episode name such as “Let ’Em Eat Cake”.
The episodes are linked to their parent series via the table “title.episode” which contains:
tconst: the key of the episode
parentTconst: the key of the series
seasonNumber
episodeNumber
The IMDB data comes as TSV (tab separated values) files, which when uncompressed weigh in at 664MB for the title.basics file and 147MB for the title.episode file. Working directly with these files can prove a little unwieldy. Perl has a module called Text::CSV (which also supports TSV), but when loading the larger of the two files it gets as far as allocating 10GB of memory before the process is killed; for me at least, using Text::CSV_XS version 1.46.
As an alternative, the data can be imported into an SQLite database (note that the files must be gunzipped first). So that the season and episode numbers are numeric, a temporary table is created during the import and the data moved across to the final one; this is because SQLite does not support changing column types.
```sql
.mode ascii
.separator "\t" "\n"
.import title.basics.tsv title_basics
.import title.episode.tsv title_episode_temp

CREATE TABLE title_episode(
    tconst TEXT PRIMARY KEY,
    parentTconst TEXT,
    seasonNumber INTEGER,
    episodeNumber INTEGER
);

INSERT INTO title_episode (tconst, parentTconst, seasonNumber, episodeNumber)
    SELECT tconst, parentTconst, seasonNumber, episodeNumber
    FROM title_episode_temp;

DROP TABLE title_episode_temp;
```
Once this SQLite database is set up, the following query gets you a list of season and episode numbers, along with their titles, for a given series identifier:
```sql
SELECT seasonNumber, episodeNumber, primaryTitle
FROM title_episode
JOIN title_basics ON title_episode.tconst = title_basics.tconst
WHERE parentTconst = ?
ORDER BY seasonNumber, episodeNumber
```
The series identifier can be found either by searching the IMDB website and looking in the series URL for the code that starts with “tt” followed by a number, or by making an SQL query to search for series identifiers:
```sql
SELECT COUNT(DISTINCT seasonNumber) seasons,
       b.tconst AS tconst,
       b.primaryTitle AS primaryTitle,
       b.startYear AS startYear
FROM title_basics b
JOIN title_episode e ON b.tconst = e.parentTconst
WHERE titleType = 'tvSeries'
  AND primaryTitle LIKE ?
GROUP BY b.tconst
ORDER BY primaryTitle
```
This query gives you back a list of title identifiers, names, years, and the number of seasons. Remember in SQL the wildcard is the percent symbol %.
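As a usage sketch, here is the same search with a concrete pattern substituted for the placeholder (the pattern is just an example; any LIKE expression works):

```sql
-- Example only: find series whose titles start with "Arrested"
SELECT COUNT(DISTINCT seasonNumber) seasons,
       b.tconst, b.primaryTitle, b.startYear
FROM title_basics b
JOIN title_episode e ON b.tconst = e.parentTconst
WHERE titleType = 'tvSeries'
  AND b.primaryTitle LIKE 'Arrested%'
GROUP BY b.tconst
ORDER BY b.primaryTitle;
```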
If you try this with your favourite show, for example Arrested Development, title “tt0367279”, you may notice it takes quite a long time. Here are the actual numbers.
```
> time ./mkepisodenames.pl -dry-run -t tt0367279 S01E01.txt
'S01E01.txt' -> './S01E01 Pilot.txt'

________________________________________________________
Executed in    7.12 secs      fish           external
   usr time    6.57 secs    0.28 millis    6.57 secs
   sys time    0.50 secs    1.07 millis    0.50 secs
```
So it took 7 seconds, which is pretty slow by modern standards. If we go into SQLite we can get it to explain what happens with the query, by putting EXPLAIN QUERY PLAN in front of the SELECT statement.
```
sqlite> EXPLAIN QUERY PLAN
   ...> SELECT seasonNumber, episodeNumber, primaryTitle, titleType
   ...> FROM title_episode
   ...> JOIN title_basics ON title_episode.tconst = title_basics.tconst
   ...> WHERE parentTconst = 'tt264235'
   ...> ORDER BY seasonNumber, episodeNumber;
QUERY PLAN
|--SCAN title_episode
|--SEARCH title_basics USING AUTOMATIC COVERING INDEX (tconst=?)
`--USE TEMP B-TREE FOR ORDER BY
```
This means it scans the entire episode table row-by-row, which is obviously slow. The tconst columns are also textual, so the database has to do a text comparison, which is presumably slower than an integer comparison. We can tweak the database so that the tconst fields are INTEGER types, and create an index for them.
```sql
CREATE TABLE title_episode(
    tconst INTEGER PRIMARY KEY,
    parentTconst INTEGER,
    seasonNumber INTEGER,
    episodeNumber INTEGER
);

INSERT INTO title_episode (tconst, parentTconst, seasonNumber, episodeNumber)
    SELECT CAST(SUBSTR(tconst, 3) AS INTEGER),
           CAST(SUBSTR(parentTconst, 3) AS INTEGER),
           seasonNumber, episodeNumber
    FROM title_episode_temp;

CREATE INDEX title_episode_parentTconst ON title_episode(parentTconst);
```
We do a similar thing for title_basics, although creating an index for it seems not to make any difference. You can try running “EXPLAIN QUERY PLAN” again to see for yourself that it uses the index, but let’s try the script now with the tuned database.
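The title_basics conversion might look something like this (a sketch only: it keeps just the columns used in this article, while the real dataset and the full script include more, such as originalTitle and genres):

```sql
-- Sketch: convert title_basics keys from TEXT ("tt0367279")
-- to INTEGER (367279) by stripping the "tt" prefix.
CREATE TABLE title_basics2(
    tconst INTEGER PRIMARY KEY,
    titleType TEXT,
    primaryTitle TEXT,
    startYear INTEGER
);

INSERT INTO title_basics2 (tconst, titleType, primaryTitle, startYear)
    SELECT CAST(SUBSTR(tconst, 3) AS INTEGER),
           titleType, primaryTitle, startYear
    FROM title_basics;

DROP TABLE title_basics;
ALTER TABLE title_basics2 RENAME TO title_basics;
```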
```
> time ./mkepisodenames.pl -dry-run -t 367279 S01E01.txt
'S01E01.txt' -> './S01E01 Pilot.txt'

________________________________________________________
Executed in   81.81 millis    fish           external
   usr time   66.77 millis  313.00 micros   66.46 millis
   sys time   14.79 millis  919.00 micros   13.87 millis
```
So from 7.12 seconds to 0.082 seconds, or roughly 87× faster! The database is also a lot smaller now, 498MB instead of 917MB, probably because the TEXT key columns became INTEGERs.
Note: in the snippets below, a lot of the error checking is left out, but it can be found in the full script linked for download at the bottom of the page.
Perl has good modules for accessing databases in the form of DBI, which has an SQLite driver.
The code for accessing the database is pretty self-explanatory: we open it read-only simply because that is our intention. The return value is a reference to an array of hashes, where each array element represents a row and each hash is keyed by column name.
```perl
sub get_episode_titles ($title_id) {
    my $dbfile = dirname($0) . "/imdb.db";
    my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", undef, undef, {
        sqlite_open_flags => SQLITE_OPEN_READONLY,
    });

    my $sth = $dbh->prepare("
        SELECT seasonNumber, episodeNumber, primaryTitle
        FROM title_episode
        JOIN title_basics ON title_episode.tconst = title_basics.tconst
        WHERE parentTconst = ?
        ORDER BY seasonNumber, episodeNumber");

    $sth->bind_param(1, $title_id);
    $sth->execute();

    return $sth->fetchall_arrayref({});
}
```
Before we go ahead and start renaming files, we need to find out which ones to rename and what TV series we're interested in. The Getopt::Long package is very useful here: we set up our option variables, call it and it fills them in based on the command line.
We need a title ID to be able to search the database, and we need some files to actually work on. There's an option for help, and we can add more comprehensive help messages and a manual later. After GetOptions, the filenames are left in @ARGV so if it's empty none were supplied.
```perl
my $dry_run  = '';
my $title_id = '';
my $help     = '';

GetOptions(
    'help|h'       => \$help,
    'title-id|t=s' => \$title_id,
    'dry-run|n'    => \$dry_run) or die "Bad options";

die "Usage: $0 [-dry-run] <-t title-id> <files>" if $help;
die "Title ID required" unless $title_id;
die "No filenames given" unless @ARGV;
```
So now we have a list of files and the title ID from the command line. We'll have to assume the files we have are already named in a sensible way; a common convention is SxxExx, denoting the series (or season) and episode number respectively. These can easily be captured using a regular expression, and we can use the grep function to search the rows returned from the database for the relevant episode. It's unlikely, but possible, that this pattern could appear somewhere in the path other than the filename, so we split the path into its components before matching. Named captures are stored in the hash %+. The regex at the end of the fileparse call is for matching the file extension; it simply grabs the last run of characters after a dot.
```perl
my ($filename, $dirs, $suffix) = fileparse($_, qr/\.[^.]*/);

$filename =~ /S(?<season>\d+)E(?<episode>\d+)/;

my @row = grep {
    $_->{seasonNumber} == $+{season} &&
    $_->{episodeNumber} == $+{episode}
} @$data;
```
We're making a few assumptions here about how the input files are named, and how the user wants them to be renamed. It's a balance between offering millions of options and doing something reasonable.
Now we have the matching row: a single-element array containing a hash reference, from which we can build the new filename. The $& variable contains the whole match, e.g. S01E01.
```perl
my $outfile = "$dirs$& - " . $row[0]{primaryTitle} . $suffix;

die "Output file '$outfile' exists" if -e $outfile;
rename($_, $outfile) or die "unable to rename '$_': $!";
```
Perl has its own documentation format called POD which stands for “plain old documentation”. It's a simple markup language, and can be translated into multiple formats such as man pages. There is also a useful module called Pod::Usage which can turn supplied POD into a usage message or a man page.
POD markup is ignored by the interpreter in the same way that comments are, so it can be placed inline or at the end of the script.
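A minimal inline POD block for a script like this might look as follows (the name and text here are illustrative, not the actual script's documentation):

```perl
=head1 NAME

mkepisodenames.pl - rename video files using IMDB episode titles

=head1 SYNOPSIS

mkepisodenames.pl [-dry-run] <-t title-id> <files>

=cut

# Pod::Usage can then render the POD as a usage message, e.g.:
# pod2usage(-verbose => 1) if $help;
```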
The database creation can be automated too. The first thing is to programmatically derive the path of the database from the path of the script ($0), which can be done like this:
```perl
my $dbfile = abs_path(dirname($0)) . '/imdb.db';
```
The SQL itself can be part of the script by placing it after an __END__ directive. A global file handle called DATA exists that allows access to this data; the line
```perl
my $sql = do { local $/; <DATA> };
```
reads all of it in one go. This is a Perl-ish way of slurping a whole file: $/ (the input record separator) defaults to "\n" so that a file is normally read line-by-line; declaring it with local sets it to undef within the enclosing scope, so the read returns the entire file. Here is the whole setup subroutine:
```perl
sub setup_database {
    my $base    = 'https://datasets.imdbws.com/';
    my @files   = qw(title.basics title.episode);
    my $tempdir = tempdir( CLEANUP => 1 );
    my $sql     = do { local $/; <DATA> };

    my $ua = LWP::UserAgent->new( timeout => 10 );
    $ua->env_proxy;

    for (@files) {
        say "Downloading '$_'...";
        my $req = HTTP::Request->new( GET => "$base$_.tsv.gz" );
        my $res = $ua->request( $req, "$tempdir/$_.tsv.gz" );

        say 'Decompressing...';
        gunzip "$tempdir/$_.tsv.gz", "$tempdir/$_.tsv"
            or die "gunzip failed: $GunzipError";
        unlink("$tempdir/$_.tsv.gz");
    }

    # now call the setup sql script etc...
    say "Creating script in '$tempdir'...";
    open(my $fh, '>', "$tempdir/import_imdb.sql")
        or die 'Unable to create SQL import script';
    print $fh $sql;
    close $fh;

    say 'Calling SQL script...';
    my @args = ('sh', '-c',
        "cd $tempdir && sqlite3 $dbfile <'$tempdir/import_imdb.sql'");
    system(@args) == 0 or die "sqlite3 failed: $?";
}
```
Put it all together in a single file with the appropriate directives and here it is!