
Store urls into database for later search

Thank you for visiting. This page has been updated at another link: Store urls into database for later search
If you have millions of urls to save for later search, you will probably have to fight with database performance, simply because urls are saved as text or long char attributes in the database and are therefore hard to index.
Here is a way to save, load and search them quickly. I use each url's md5 checksum as the index key, so later on I can compute a url's md5 checksum again to find its row in the database.
It is true that different urls could end up with the same md5 checksum, but the chance is quite small. The returned md5 checksum string is 22 characters long and contains characters from this set: 'A'..'Z', 'a'..'z', '0'..'9', '+' and '/'. An md5 digest is 128 bits, so there are 2^128 (about 3.4 x 10^38) possible values, which is a huge number. See more detail in my other article, how to calculate md5 of a file/string in perl
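
As a minimal sketch of the key step described above, the snippet below uses the core Digest::MD5 module to turn a url into its 22-character base64 md5 key. The helper name url_key and the example url are made up for illustration; they are not taken from the attached script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

# Hypothetical helper: derive a fixed-length lookup key from a url.
# md5_base64 returns the 128-bit digest as 22 base64 characters
# (no '=' padding), whatever the length of the input.
sub url_key {
    my ($url) = @_;
    return md5_base64($url);
}

# The url below is an invented example.
my $key = url_key('http://www.example.com/some/very/long/path?with=query');
print length($key), " characters: $key\n";
```

Because the key always has the same short length, it is much easier for a database to index than the url text itself.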

So below is just an example I did using BerkeleyDB (version 1). If you want to do more complicated things with another database, just switch to DBI and use the same method described here; it won't be difficult. Let me know if you have trouble.
The example is the tool I use to save a filepath-to-id mapping; it works the same way for urls. On my desktop, I loaded 9M file mappings into the db in less than 300 secs.
 $./ -r yes -if /home/idofpath/idspath
reading input file at 1378507569 ...
time elapsed 98 secs for md5 compute 9264153 pnfs mapping
time elapsed 164 secs for loading 9264153 pnfs mapping

As you can see, the map file was about 2GB before, but after loading, the db file is only about 680MB. So this saves you a lot of disk space, not just time.
$ls -l /home/idofpath/idspath
-rw-r--r--. 1 cindy cindy 2108873748 Sep  7  2013 /home/idofpath/idspath
$ls -l /home/iddb/idof.db
-rw-rw-r-- 1 cindy cindy 687144960 Sep  7 13:52 /home/iddb/idof.db

The idspath file looks like the line below: the url and the id are separated by '|'. The id has a special meaning for me, but in your case you can store anything you want.
|000100000000000018B0B918
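
A rough sketch of reading one such line, assuming the id itself never contains '|'. Splitting on the last '|' keeps urls that contain a '|' intact. The function name parse_map_line and the url in the sample line are invented for illustration; only the id comes from the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one map line into (url, id) on the LAST '|'.
# Returns an empty list when the line has no separator at all.
sub parse_map_line {
    my ($line) = @_;
    chomp $line;
    my $pos = rindex($line, '|');
    return if $pos < 0;                      # malformed line
    return (substr($line, 0, $pos), substr($line, $pos + 1));
}

# Sample line with a made-up url that itself contains a '|'.
my ($url, $id) = parse_map_line("http://www.example.com/a|b.html|000100000000000018B0B918\n");
print "$url => $id\n";
```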

To retrieve a mapping:
./ -durl
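
The save and retrieve steps can be sketched roughly as below. This is not the attached script: it assumes the DB_File module (Perl's interface to Berkeley DB 1.x) is installed, and the names open_db, save_mapping and find_id, the temporary db file and the example url are all hypothetical.

```perl
use strict;
use warnings;
use Fcntl;                          # O_RDWR, O_CREAT
use DB_File;                        # assumed available; exports $DB_HASH
use Digest::MD5 qw(md5_base64);
use File::Temp qw(tempdir);

# Open (or create) a hash-type db file and return it as a tied hash ref.
sub open_db {
    my ($file) = @_;
    my %db;
    tie %db, 'DB_File', $file, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "cannot open $file: $!";
    return \%db;
}

# Save: the key is the url's md5, the value is the id.
# Only the id is stored, so two urls with the same md5 would
# overwrite each other (very unlikely, as discussed above).
sub save_mapping {
    my ($db, $url, $id) = @_;
    $db->{ md5_base64($url) } = $id;
}

# Retrieve: recompute the md5 and look it up; undef if not found.
sub find_id {
    my ($db, $url) = @_;
    return $db->{ md5_base64($url) };
}

# Demo on a throwaway file; a real tool would use a fixed db path.
my $file = tempdir(CLEANUP => 1) . '/demo.db';
my $db   = open_db($file);
save_mapping($db, 'http://www.example.com/a/b.html', '000100000000000018B0B918');
print find_id($db, 'http://www.example.com/a/b.html'), "\n";
untie %$db;                         # flush and close the db file
```

For another database through DBI, the same idea would apply: keep the md5 string in a short indexed column and search on that column instead of on the url text.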

The whole Perl script is attached at the bottom; give it a try.