As a Perl beginner, I wrote many scripts that created text files to store
snippets of data. Eventually I had hundreds of little text files in a data
directory, each one holding one piece of information in the form of a text
string. I decided to move from snippets to a single database. Here's how I did
it.
I didn't go all the way to creating a relational database with SQL, in part
because I had not yet learned to use Perl modules to work with such databases.
Plus I wanted to try the apparently simpler technique described in Chapters 11
and 14 of Perl Cookbook,
2nd Edition, namely the use of one of the DBM libraries. I was intrigued by
the idea of creating and using a database without needing to learn SQL or run a
database server such as PostgreSQL or MySQL.
One of my scripts created HTML for a clickable banner ad; the HTML would be
inserted into an HTML email newsletter I publish daily. For any particular
issue, the script pulled data from three different text files, each of which
held a different piece of information in the form of a text string on the
file's first and only line:
A different script created those files and named them according to (i) the
type of data in the file and (ii) the date of the newsletter in which the ad
would appear. For example, data for the December 9, 2005 ad was contained in
three files named url_2005_12_09, gif_2005_12_09 and
headline_2005_12_09.
First I considered the kind of data structure needed to hold the data. I
looked at the relationship between the three text files. It became clear that I
basically had a hash for each ad. The data for any particular ad consisted of
three different keys and their values: URL, gif and headline. The name of any
particular file would provide the key (the type of information contained in the
file). The contents of the file would provide the key's value.
I thought about finding some way to store those hashes as an array of
anonymous hashes (one hash per ad), but then I realized that an array wouldn't
let me access a particular ad's data easily. The hashes would be in the order
in which I saved them into the array, but that wouldn't translate easily to the
ad for a particular date. For example, how would I know where to find the data
for next Monday's newsletter? Is it in $array[8]
or
$array[17]
?
Hmm. Each anonymous hash could be identified by a particular date--the key
(!) to locating the ad for any particular date. What kind of data structure
associates a unique key with a value? A hash, of course! My data would fit
nicely into a hash of hashes.
In code, this is how the data for two days of newsletters could be
represented as a hash of hashes:
The keys of the named hash are 2005_12_08 and 2005_12_09. Each key's value
is a reference to an anonymous hash that contains its own keys and values. When
a hash is created using braces instead of parentheses, its value is a reference
to that unnamed, "anonymous" hash. I need to use a reference because a hash is
permitted to contain only scalar keys and scalar values; another hash can't be
stored as a value. A reference to that hash works, because it acts like a
scalar.
Storing the Data
Now I knew what the data structure would be; but how would I store it? Being
Perl, of course there's more than one way to store data to disk. Some of those
solutions involve the parsing of potentially large amounts of data and then the
need to programmatically find a particular desired piece of data, however.
I came across a single-file database solution in Recipes 11.14 and 14.6 of
Perl Cookbook, 2nd Edition. These describe the use of a DBM file on
disk. A Perl script can read data from a DBM file as if it were contained in a
hash in memory. For data having unique keys, it's a great fit. In addition to
data persistence from one run of your script to the next, using a DBM database
file means your script won't need to read all the data into memory. Instead,
you pull up only data associated with a particular key. This applies especially
to large data sets; a DBM database file on disk might hold data for hundreds of
thousands of keys.
A straightforward but tedious way to get data into a DBM file is shown
below. The code creates a database called ad_data.db and inserts data for two
newsletters.
#!/usr/local/bin/perl
use strict;
use warnings;
use MLDBM qw( DB_File Storable );
use Fcntl qw( :flock O_CREAT O_RDWR );
my $db = '/www/cgi-bin/databases/ad_data.db';
my %data_for_ad_on;
tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
or die "Trouble opening $db, stopped: $!";
%data_for_ad_on = (
'2005_12_09' => { 'url' => 'http://acme.com/index.html',
'gif' => 'http://myserver.com/banners/acme_banner.gif',
'headline' => 'Look to Acme for quality, inexpensive widgets!',
},
'2005_12_08' => { 'url' => 'http://roadrunners-r-us.com/index.html',
'gif' => 'http://myserver.com/banners/roadrunners_banner.gif',
'headline' => 'Use Roadrunners R Us for speedy, reliable deliveries!',
},
);
After the script is run, the data contained in %data_for_ad_on
are saved automatically into the DBM database file named
/www/cgi-bin/databases/ad_data.db
.
The secret is Perl's tie
function. It associates a particular
hash (here named %data_for_ad_on
) with a "class" and a file. A
class that works for "complex" data (data containing references) is the MLDBM module, available from
CPAN.
Note: if the values being stored are scalars only, a tied hash using the
DB_File module as the class, rather than the MLDBM
module, will be
much easier to work with. I'm using MLDBM
because my values
include references to data structures. I'll discuss some of
MLDBM
's limitations later in this article.
The script shown above uses the Fcntl module to make it easier
to create or update the DBM database file. The imported
O_CREAT|O_RDWR
parameters tell the script to create the database file if
it doesn't yet exist or to update (read/write) the file if it exists.
The DB_File
and Storable
parameters passed to
MLDBM
specify the particular kind of DBM database to use and the
serialization module to use. They handle the behind-the-scenes conversion of
references into strings. (A recent version of MLDBM
is necessary
to be able to specify such parameters.) DB_File
won't work with
every system, however, because it depends upon your system's configuration,
like all DBM modules. In particular, the Berkeley DB library (a C library) must
be present; see the DB_File
documentation for details. You can get
another DBM module from CPAN if DB_File
won't work for you.
Manually entering hash keys and their values into a script is time-consuming
and error-prone, of course. I prefer to programmatically add data created or
copied from existing files, and used a script to copy data from my hundreds of
text files into the DBM database file.
#!/usr/local/bin/perl -T
use strict;
use warnings;
use MLDBM qw( DB_File Storable );
use Fcntl qw( :flock O_CREAT O_RDWR );
my $db = '/www/cgi-bin/databases/ad_data.db';
my %data_for_ad_on;
tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
or die "Can't open $db: $!";
my $data_dir = '/www/cgi-bin/newsletter_data_text_files';
opendir my $dir_handle, $data_dir
or die "Can't open $data_dir, stopped: $!";
# All of the data files are in the same directory; each file's name
# is url_YYYY_MM_DD, gif_YYYY_MM_DD or headline_YYYY_MM_DD (where
# YYYY is the year, MM is the month, with a leading zero if needed,
# and DD is the day of the month, with a leading zero if needed).
# No other files in that directory end with YYYY_MM_DD.
my @data_files = grep { /\d\d\d\d_\d\d_\d\d$/ } readdir $dir_handle;
close $dir_handle;
my $entry;
foreach my $file ( @data_files ) {
# Test the filename to see if the file contains the advertiser's
# URL...
if ($file =~ /^url_(\d\d\d\d_\d\d_\d\d)/ ) {
$entry = $data_for_ad_on{$1};
open (FILE, "$data_dir/$file") or die;
my $url = do { local $/; <FILE> };
close FILE;
$entry->{url} = $url;
$data_for_ad_on{$1} = $entry;
print "Advertiser's URL for $1 has been saved.\n";
# Test the filename to see if the file contains the .gif location...
} elsif ($file =~ /^gif_(\d\d\d\d_\d\d_\d\d)/ ) {
$entry = $data_for_ad_on{$1};
open (FILE, "$data_dir/$file") or die;
my $gif = do { local $/; <FILE> };
close FILE;
$entry->{gif} = $gif;
$data_for_ad_on{$1} = $entry;
print ".gif file location for $1 has been saved.\n";
# Test the filename to see if the file contains the headline...
} elsif ($file =~ /^headline_(\d\d\d\d_\d\d_\d\d)/ ) {
$entry = $data_for_ad_on{$1};
open (FILE, "$data_dir/$file") or die;
my $headline = do { local $/; <FILE> };
close FILE;
$entry->{headline} = $headline;
$data_for_ad_on{$1} = $entry;
print "Headline for $1 has been saved.\n";
}
}
A limitation of MLDBM
is that you can't work piecemeal with the
data structures (the values for keys in the tied hash). You must retrieve a
copy of the data structure in the form of a reference ($entry
in
the example); use that reference to add, update or delete parts of the data
structure; and then save the reference back to the database in order for
MLDBM
to put the changed data structure back onto disk. This is
because the data structure is no longer tied, once it has been retrieved from
disk.
For example, the script above works piecemeal with the data structure: it
builds one key at a time in the anonymous hash. A temporary variable called
$entry holds a reference to the retrieved anonymous hash, which now exists in
memory; then $entry
is used to dereference the anonymous hash (and
to select a named key in the hash, on the right side of the infix operator);
then the script alters a part of that data structure (assigning a value to the
named key); then the now-changed anonymous hash is saved back into the database
file on disk by assigning $entry
to the key.
As shown in earlier examples, an entire data structure can be stored as the
value of a key in the tied hash without fetching the existing data structure
(if any) associated with that key. In such a case, you're overwriting any
existing data structure completely, so there is no need to fetch the existing
data structure first or to work with it using the temporary variable
technique.
Note: a recently added CPAN module called DBM_Deep
allows tied
hashes of complex data as does MLDBM
, but appears to avoid the
need for fetching data first in order to work with it through a temporary
variable.
By the way, MLDBM
allows stored data structures to be arrays,
not just hashes. Hence MLDBM
might be a good disk-based storage
solution for you if a hash of arrays works best to hold your data. Plus,
whether you're storing a hash of hashes or a hash of arrays, those stored
hashes or arrays are permitted to contain values that consist of even more
hashes or arrays!
Another thing to know about MLDBM
is that file locking is not
provided automatically. The solution presented in this article works well in a
single-user environment, but data corruption is possible when more than one
instance of the script (or more than one script) needs to access the disk-based
data while another process is writing to it. See section 10.2.1 of CGI Programming with Perl for
an example of file locking with the DB_File DBM module.
Reading HoH Data
Here is a portion of a script that prints a report for my stored newsletter
data.
I reach into the %data_for_ad_on
hash by specifying a key (the
date of a particular newsletter); the key's associated value is a reference to
the anonymous hash holding ad data for that date. Use infix (arrow) notation
to dereference the anonymous hash and also to access a specified key.
tie %data_for_ad_on, 'MLDBM', $db, O_RDWR, 0644
or die "Can't open $db: $!";
my @newsletter_dates = sort keys %data_for_ad_on;
foreach my $date ( @newsletter_dates ) {
my $url = $data_for_ad_on{$date}->{url};
my $gif = $data_for_ad_on{$date}->{gif};
my $headline = $data_for_ad_on{$date}->{headline};
print "\nHere are details for the banner ad in the $date newsletter:\n";
print " Advertiser's URL: $url\n";
print " Location of banner: $gif\n";
print " Headline above banner: $headline\n";
}
A temporary variable to work with the data was unnecessary because I was
merely reading data. I had no need to save a changed data structure back to
disk after working with it. A temporary variable would be needed as a reference
to the data structure, as noted earlier, if I were adding, updating or deleting
something in the data structure, in order to save the changed data structure
back to disk.
Nevertheless, a temporary variable is useful for storing a fetched data
structure even when a tied hash is being used only to read data from disk. The
MLDBM
documentation points out that disk fetch occurs each time a
script names a key in the tied hash; a new copy of the entire data structure
for that key gets created with each fetch. For efficiency, scripts that merely
read data work better if the data structure for a particular key is stored
first into a temporary variable and the script works with that single copy
of data in memory. For example, the lines in the preceding code that get
values for $url
, $gif
, and $headline
should be rewritten to cut the number of fetches from three to one:
my $entry = $data_for_ad_on{$date};
my $url = $entry->{url};
my $gif = $entry->{gif};
my $headline = $entry->{headline};
Updating HoH Data
My next task was to update existing data. An advertiser informed me that its
web page had changed. The advertiser's banner ad in a particular upcoming
newsletter needed to point to the new page rather than to the URL I had stored
earlier for the ad. So I tried this:
tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
or die "Can't open $db: $!";
$data_for_ad_on{'2005_12_14'}->{url}
= 'http://roadrunners-r-us.com/new_home_page.html'; # WRONG
Despite the apparently correct syntax, the line marked "# WRONG" did not
change the data in the tied hash. MLDBM
can't add, update or
delete just a part of a data structure stored on disk; it must retrieve the
entire data structure first. Unfortunately that doesn't happen automatically
when a part of the tied hash is used as an lvalue (the left side of an
assignment).
Here is the correct syntax for updating part of a data structure:
my $entry = $data_for_ad_on{'2005_12_14'}; # Get
$entry->{url} = 'http://roadrunners-r-us.com/new_home_page.html'; # Change
$data_for_ad_on{'2005_12_14'} = $entry; # Save
A different kind of update would be to add a new kind of information to an
existing data structure. Until now I have been working with each ad's data as
an anonymous hash of three keys because I only had three pieces of information
per ad. Here's how I can modify the data structure for any particular ad--that
is, the value of any particular key in the tied hash--to include a fourth piece
of information. I'll record whether or not a particular ad has been billed.
tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
or die "Can't open $db: $!";
my $entry = $data_for_ad_on{'2005_12_02'}; # Get
$entry->{has_been_billed} = 'yes'; # Change (by creating a key and value)
$data_for_ad_on{'2005_12_02'} = $entry; # Save
Adding a fourth key as shown is not remarkable because it's done the same
way I created the original three keys. But note how I've added a fourth key to
just one of the ads (for the 2005_12_02 newsletter). I don't need to define
such a fourth key in each of the other anonymous hashes because
MLDBM
doesn't require that the anonymous hashes have the same
structure.
Now that I've added a fourth key to one of the ads, I want to see that key's
information when I run my report, even though the information for each of the
other newsletters is incomplete. Just as with a hash in memory, the
exists
operator can test for the existence of a key in a tied
hash, or test for the existence of a key in one of the stored anonymous hashes.
The earlier reporting script can be improved:
foreach my $date ( @newsletter_dates ) {
my $entry = $data_for_ad_on{$date};
my $url = $entry->{url};
my $gif = $entry->{gif};
my $headline = $entry->{headline};
my $has_been_billed = exists ( $entry->{has_been_billed} ) ?
$entry->{has_been_billed}
: 'Billing status is unknown';
print "\nHere are details for the banner ad in the $date newsletter:\n";
print " Advertiser's URL: $url\n";
print " Location of banner: $gif\n";
print " Headline above banner: $headline\n";
print " Has advertiser been billed? $has_been_billed\n";
}
Note how a default value is provided for $has_been_billed
(the
string 'Billing status is unknown'). If
$entry->{has_been_billed}
had been assigned to
$has_been_billed
without providing a default value, the script (if
it is using the warnings
pragma) would report "Use of
uninitialized value in concatenation (.) or string" every time it processes the
"Has advertiser been billed?" statement for an anonymous hash that has no
'has_been_billed'
key. The value of a nonexistent key in a hash is
undefined, whether or not the hash is tied.
Because the value of a nonexistent key in a hash is undefined, remember when
working with MLDBM
not to assume your tied hash's keys will return
either a reference to a data structure or a reference to an empty data
structure. For example, this code produces a runtime error if the '2005_12_32'
key does not exist in the tied hash (there is no December 32) and the script is
using the strict pragma:
my $entry = $data_for_ad_on{2005_12_32};
my %named_hash = %{ $entry };
if ( $named_hash{url} ) {
print "The advertiser's URL for the ad dated 2005_12_32 is $named_hash{url}.\n";
} else {
print "We have no URL on file for the ad dated 2005_12_32.\n";
}
The runtime error would be 'Can't use string ("") as a HASH ref while "strict refs" in use' on the line number of my %named_hash = %{ $entry }
. The code doesn't create a named hash that has no elements.
Conclusion
I hope this article helps you move from a swamp of weedy text files into the simplicity of a single disk-based database file. A DBM database file enables you to work with your data as if it were a single hash of keys and their values.
The MLDBM
module extends the "hash on a disk" capability by letting you store a key's value as a hash, as illustrated in this article, or store a key's value as an array.
MLDBM
requires a special technique for working piecemeal with
stored hashes (or stored arrays). Also, the solution described works well in my
single-user environment; file locking to protect the disk-based data would
become important in a multi-user environment such as a script executed by a
publicly accessible web server. Still, you get all this hash-like goodness for
key-based data without running a relational database server or using SQL,
without parsing large amounts of disk-based data, and without storing all the
data in memory first.
Happy hashing!