Note: the order of episodes is often different between broadcast and DVD releases. IMDB usually uses the broadcast order, so using this script to rename your files without checking may cause strange results.
A discussion of getting television episode data from the Internet Movie Database (IMDB), importing it into an SQLite database, and using a Perl script to rename video files and set their metadata based on it.
Often when making legitimate backup copies of videos, the files end up with names like title_1. When the videos are a television series it’s relatively easy to rename them by season and episode, but adding the episode titles can be tedious and time-consuming. The Internet Movie Database (IMDB) holds all this data online, so why not use it to automate the process?
IMDB does not have an open API to access the title and episode data, and ‘scraping’ the HTML is fragile and error-prone. The underlying database is made available, however, as a download for personal or non-commercial use.
The important dataset in this case is “title.basics” which contains the following interesting fields:
tconst: the primary key that identifies this episode or series
titleType: for our purposes “tvSeries” and “tvEpisode” are the values we're interested in
primaryTitle: the series name, such as “Arrested Development” or an episode name such as “Let ’Em Eat Cake”.
The episodes are linked to their parent series via the table “title.episode” which contains:
tconst: the key of the episode
parentTconst: the key of the series
seasonNumber
episodeNumber
The IMDB data comes as TSV (tab separated values) files, which when uncompressed weigh in at 664MB for the title.basics file and 147MB for the title.episode file. Working directly with these files can prove a little unwieldy. Perl has a module called Text::CSV (which also supports TSV), but when loading the larger of the two files it gets as far as allocating 10GB of memory before the process is killed; for me at least, using Text::CSV_XS version 1.46.
As an alternative, the data can be imported into an SQLite database (note that the files must be gunzipped first). So that the season and episode numbers are numeric, a temporary table is created during the import and the data moved across to the final one; this is because SQLite does not support changing column types.
```sql
.mode ascii
.separator "\t" "\n"
.import title.basics.tsv title_basics
.import title.episode.tsv title_episode_temp

CREATE TABLE title_episode(
    tconst TEXT PRIMARY KEY,
    parentTconst TEXT,
    seasonNumber INTEGER,
    episodeNumber INTEGER
);

INSERT INTO title_episode (tconst, parentTconst, seasonNumber, episodeNumber)
    SELECT tconst, parentTconst, seasonNumber, episodeNumber
    FROM title_episode_temp;

DROP TABLE title_episode_temp;
```
Once this SQLite database is set up, the following query gets you a list of season and episode numbers, along with their titles, for a given series identifier:
```sql
SELECT seasonNumber, episodeNumber, primaryTitle
FROM title_episode
JOIN title_basics ON title_episode.tconst = title_basics.tconst
WHERE parentTconst = ?
ORDER BY seasonNumber, episodeNumber
```
The series identifier can be found either by searching the IMDB website and looking in the series URL for the code that starts with “tt” followed by a number, or by making an SQL query to search for series identifiers:
```sql
SELECT COUNT(DISTINCT seasonNumber) seasons,
       b.tconst AS tconst,
       b.primaryTitle AS primaryTitle,
       b.startYear AS startYear
FROM title_basics b
JOIN title_episode e ON b.tconst = e.parentTconst
WHERE titleType = 'tvSeries'
  AND primaryTitle LIKE ?
GROUP BY b.tconst
ORDER BY primaryTitle
```
This query gives you back a list of title identifiers, names, years, and the number of seasons. Remember in SQL the wildcard is the percent symbol %.
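As a usage sketch, here is the same search with a concrete pattern substituted for the placeholder (the pattern is just an example; any LIKE expression works):

```sql
-- Example only: find series whose titles start with "Arrested"
SELECT COUNT(DISTINCT seasonNumber) seasons,
       b.tconst, b.primaryTitle, b.startYear
FROM title_basics b
JOIN title_episode e ON b.tconst = e.parentTconst
WHERE titleType = 'tvSeries'
  AND b.primaryTitle LIKE 'Arrested%'
GROUP BY b.tconst
ORDER BY b.primaryTitle;
```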
If you try this with your favourite show, for example Arrested Development, title “tt0367279”, you may notice it takes quite a long time. Here are the actual numbers.
```
> time ./mkepisodenames.pl -dry-run -t tt0367279 S01E01.txt
'S01E01.txt' -> './S01E01 Pilot.txt'

________________________________________________________
Executed in    7.12 secs      fish           external
   usr time    6.57 secs    0.28 millis    6.57 secs
   sys time    0.50 secs    1.07 millis    0.50 secs
```
So it took 7 seconds, which is pretty slow by modern standards. If we go into SQLite we can get it to explain what happens with the query, by putting EXPLAIN QUERY PLAN in front of the SELECT statement.
```
sqlite> EXPLAIN QUERY PLAN
   ...> SELECT seasonNumber, episodeNumber, primaryTitle, titleType
   ...> FROM title_episode
   ...> JOIN title_basics ON title_episode.tconst = title_basics.tconst
   ...> WHERE parentTconst = 'tt264235'
   ...> ORDER BY seasonNumber, episodeNumber;
QUERY PLAN
|--SCAN title_episode
|--SEARCH title_basics USING AUTOMATIC COVERING INDEX (tconst=?)
`--USE TEMP B-TREE FOR ORDER BY
```
This means it scans the entire episode table row-by-row, which is obviously slow. The tconst columns are also textual, so the database has to do a text comparison, which is presumably slower than an integer comparison. We can tweak the database so that the tconst fields are INTEGER types, and create an index for them.
```sql
CREATE TABLE title_episode(
    tconst INTEGER PRIMARY KEY,
    parentTconst INTEGER,
    seasonNumber INTEGER,
    episodeNumber INTEGER
);

INSERT INTO title_episode (tconst, parentTconst, seasonNumber, episodeNumber)
    SELECT CAST(SUBSTR(tconst, 3) AS INTEGER),
           CAST(SUBSTR(parentTconst, 3) AS INTEGER),
           seasonNumber, episodeNumber
    FROM title_episode_temp;

CREATE INDEX title_episode_parentTconst ON title_episode(parentTconst);
```
We do a similar thing for title_basics, although creating an index for it seems not to make any difference. You can try running “EXPLAIN QUERY PLAN” again to see for yourself that it uses the index, but let’s try the script now with the tuned database.
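The title_basics conversion might look something like this (a sketch only: it keeps just the columns used in this article, while the real dataset and the full script include more, such as originalTitle and genres):

```sql
-- Sketch: convert title_basics keys from TEXT ("tt0367279")
-- to INTEGER (367279) by stripping the "tt" prefix.
CREATE TABLE title_basics2(
    tconst INTEGER PRIMARY KEY,
    titleType TEXT,
    primaryTitle TEXT,
    startYear INTEGER
);

INSERT INTO title_basics2 (tconst, titleType, primaryTitle, startYear)
    SELECT CAST(SUBSTR(tconst, 3) AS INTEGER),
           titleType, primaryTitle, startYear
    FROM title_basics;

DROP TABLE title_basics;
ALTER TABLE title_basics2 RENAME TO title_basics;
```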
```
> time ./mkepisodenames.pl -dry-run -t 367279 S01E01.txt
'S01E01.txt' -> './S01E01 Pilot.txt'

________________________________________________________
Executed in   81.81 millis    fish           external
   usr time   66.77 millis  313.00 micros   66.46 millis
   sys time   14.79 millis  919.00 micros   13.87 millis
```
So from 7.12 seconds to 0.082 seconds, or roughly 87× faster! The database is also a lot smaller now, 498MB instead of 917MB, probably because the TEXT key columns became INTEGERs.
Note: in the snippets below, a lot of the error checking is left out, but it can be found in the full script linked for download at the bottom of the page.
Perl has good modules for accessing databases in the form of DBI, which has an SQLite driver.
The code for accessing the database is pretty self-explanatory: we open it read-only simply because that is our intention. The return value is a reference to an array of hashes, where each array element represents a row and each hash is keyed by column name.
```perl
sub get_episode_titles ($title_id) {
    my $dbfile = dirname($0) . "/imdb.db";
    my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", undef, undef, {
        sqlite_open_flags => SQLITE_OPEN_READONLY,
    });

    my $sth = $dbh->prepare("
        SELECT seasonNumber, episodeNumber, primaryTitle
        FROM title_episode
        JOIN title_basics ON title_episode.tconst = title_basics.tconst
        WHERE parentTconst = ?
        ORDER BY seasonNumber, episodeNumber");

    $sth->bind_param(1, $title_id);
    $sth->execute();

    return $sth->fetchall_arrayref({});
}
```
Before we go ahead and start renaming files, we need to find out which ones to rename and what TV series we're interested in. The Getopt::Long package is very useful here: we set up our option variables, call it and it fills them in based on the command line.
We need a title ID to be able to search the database, and we need some files to actually work on. There's an option for help, and we can add more comprehensive help messages and a manual later. After GetOptions, the filenames are left in @ARGV so if it's empty none were supplied.
```perl
my $dry_run  = '';
my $title_id = '';
my $help     = '';

GetOptions(
    'help|h'       => \$help,
    'title-id|t=s' => \$title_id,
    'dry-run|n'    => \$dry_run) or die "Bad options";

die "Usage: $0 [-dry-run] <-t title-id> <files>" if $help;
die "Title ID required" unless $title_id;
die "No filenames given" unless @ARGV;
```
So now we have a list of files and the title ID from the command line. We'll have to assume the files we have are already named in a sensible way; a common convention is SxxExx, denoting the series (or season) and episode number respectively. These can easily be captured using a regular expression, and we can use the grep function to search the rows returned from the database for the relevant episode. It's unlikely, but possible, that this pattern could appear somewhere in the path other than the filename, so we split the path into its components before matching. Named captures are stored in the hash %+. The regex at the end of the fileparse call is for matching the file extension; it simply grabs the last run of characters after a dot.
```perl
my ($filename, $dirs, $suffix) = fileparse($_, qr/\.[^.]*/);

$filename =~ /S(?<season>\d+)E(?<episode>\d+)/;

my @row = grep {
    $_->{seasonNumber} == $+{season} &&
    $_->{episodeNumber} == $+{episode}
} @$data;
```
We're making a few assumptions here about how the input files are named, and how the user wants them to be renamed. It's a balance between offering millions of options and doing something reasonable.
Now we have the matching row: a single-element array containing a hash reference, from which we can build the new filename. The $& variable contains the whole match, e.g. S01E01.
```perl
my $outfile = "$dirs$& - " . $row[0]{primaryTitle} . $suffix;

die "Output file '$outfile' exists" if -e $outfile;
rename($_, $outfile) or die "unable to rename '$_': $!";
```
Perl has its own documentation format called POD which stands for “plain old documentation”. It's a simple markup language, and can be translated into multiple formats such as man pages. There is also a useful module called Pod::Usage which can turn supplied POD into a usage message or a man page.
POD markup is ignored by the interpreter in the same way that comments are, so it can be placed inline or at the end of the script.
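A minimal inline POD block for a script like this might look as follows (the name and text here are illustrative, not the actual script's documentation):

```perl
=head1 NAME

mkepisodenames.pl - rename video files using IMDB episode titles

=head1 SYNOPSIS

mkepisodenames.pl [-dry-run] <-t title-id> <files>

=cut

# Pod::Usage can then render the POD as a usage message, e.g.:
# pod2usage(-verbose => 1) if $help;
```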
The database creation can be automated too. The first thing is to programmatically derive the path of the database from the path of the script ($0), which can be done like this:
```perl
my $dbfile = abs_path(dirname($0)) . '/imdb.db';
```
The SQL itself can be part of the script by placing it after an __END__ directive. A global file handle called DATA exists that allows access to this data; the line
```perl
my $sql = do { local $/; <DATA> };
```
reads all of it in one go. This is a Perl-ish way of slurping a whole file: $/ (the input record separator) defaults to "\n" so that a file is normally read line-by-line; declaring it with local sets it to undef within the enclosing scope, so the read returns the entire file. Here is the whole setup subroutine:
```perl
sub setup_database {
    my $base    = 'https://datasets.imdbws.com/';
    my @files   = qw(title.basics title.episode);
    my $tempdir = tempdir( CLEANUP => 1 );
    my $sql     = do { local $/; <DATA> };

    my $ua = LWP::UserAgent->new( timeout => 10 );
    $ua->env_proxy;

    for (@files) {
        say "Downloading '$_'...";
        my $req = HTTP::Request->new( GET => "$base$_.tsv.gz" );
        my $res = $ua->request( $req, "$tempdir/$_.tsv.gz" );

        say 'Decompressing...';
        gunzip "$tempdir/$_.tsv.gz", "$tempdir/$_.tsv"
            or die "gunzip failed: $GunzipError";
        unlink("$tempdir/$_.tsv.gz");
    }

    # now call the setup sql script etc...
    say "Creating script in '$tempdir'...";
    open(my $fh, '>', "$tempdir/import_imdb.sql")
        or die 'Unable to create SQL import script';
    print $fh $sql;
    close $fh;

    say 'Calling SQL script...';
    my @args = ('sh', '-c',
        "cd $tempdir && sqlite3 $dbfile <'$tempdir/import_imdb.sql'");
    system(@args) == 0 or die "sqlite3 failed: $?";
}
```
Put it all together in a single file with the appropriate directives and here it is!