Newznab – Adventures in indexing

With the recent closing of Newzbin and, a few days later, NZBMatrix, the heat is obviously on for Usenet indexers. The funny thing is, though, that these sites, like any search engine, just aggregate metadata. The internet climate has obviously gotten extremely poisonous lately, where the mechanisms that the law provides (DMCA takedowns, etc.) don't satisfy the rights holders and sites need to be bullied out of existence with enormous lawsuits. A bit of a downer for us honest Linux folk, looking to download the latest releases from a.b.cd.image.linux or a.b.linux.iso.

The software that does this, however, is extremely simple, and the unrefined data is available to anyone with a Usenet account. So I figured I'd give the whole thing a twirl and installed Newznab! I'm not about to run an indexer, obviously, but I was curious about it all. I'm glad to say that the traffic this generates isn't too horrible at any rate. The whole thing hinges on the mechanism that combs through the data and matches up the different files into a release. This relies heavily on regexes, or regular expressions: strings of text that filter the data out of the different tags these files are marked with. The classic version has 2 basic regexes, but these did not yield any results in my tests. Wondering what I was doing wrong set me on a quest, and as I don't like to fail or quit, I debugged some, found out I had downloaded a faulty zip, installed the proper file and went looking for a regex that would at least turn out some results. Some chatting on the IRC channel provided the following gem and allowed me to test the software to its fullest.

/^[.*?(?P<name>[^([]#"][A-Z0-9.-_()]{10,}-[A-Z0-9&]+).*?(?P<parts>\d{1,3}\/\d{1,3})/i

This turned out a wad of refined data, but when looking at the raw data, it missed more than it found! The software supports a lot of different regexes and you obviously need a variety of those to match up the enormous amount of data on these newsgroups.
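
Before adding a regex to Newznab, it helps to sanity-check it against a dump of raw subjects first. A minimal sketch, assuming you have exported one header subject per line to a file called subjects.txt (both the file name and the trimmed-down regex below are just examples, not what Newznab itself ships with):

# Count how many raw subject lines a candidate release regex matches.
# subjects.txt is assumed to hold one Usenet header subject per line.
total=$(wc -l < subjects.txt)
hits=$(grep -icP '(?P<name>[A-Z0-9._()-]{10,}-[A-Z0-9&]+).*(?P<parts>\d{1,3}/\d{1,3})' subjects.txt)
echo "$hits of $total subjects matched"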

Curious about it all, I decided to try the plus version, available to anyone who cares enough about the project to donate a small sum. This adds an interesting wad of functionality, among which the option to import NZB files into your search engine! An interesting option that will spare your system the time-consuming activity of re-indexing those files! The documentation talks about asking a friend for these files or even trying a simple Google search! Depending on how good a friend you asked or how spectacular your search was, this can yield a lot of files, and importing them will take a while. A strenuous enough process to justify an article and some extra code included in the plus package: "How to backfill newznab safely without bloating your database". The short reason to read it is that the import reads the files, imports the raw data they contain into your database and extracts fresh refined data from there. If you do that with a gigabyte of data, things won't be too pretty 🙂 The altered import script paces the import to a more convenient rate.
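
Before letting an import loose on the database, it is also handy to know how much you are actually dealing with. A quick sizing sketch (the path is just an example, matching the default used further down):

# How big is the dump and how many files are in it?
du -sh /home/you/somefolder/withfiles/
find /home/you/somefolder/withfiles/ -type f | wc -l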

Unpacking the tar files I found for import wasn't pretty either. But there's a simple enough solution for that. A bit of bash scripting and a bit of patience solves anything 🙂

for a in *.gz; do tar xvzf "$a"; mv "$a" "$a.OK"; done

This line loops over all .gz files in the directory it's executed in, unpacks the data and moves the original file out of the way. Should the archives contain more zipped files, you can just execute this again and again until all .gz files are gone. Since the "mv" part renames the original archives after unpacking, you won't waste time by unpacking the same file twice.
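
If your archives are tucked away in subfolders, a recursive variant along these lines should do the trick too (a sketch of the same idea, with find doing the walking):

# Recursive variant: unpack every .gz found in subdirectories as well,
# extracting each archive next to where it was found.
find . -name '*.gz' -print0 | while IFS= read -r -d '' a; do
        tar xvzf "$a" -C "$(dirname "$a")"
        mv "$a" "$a.OK"
done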

As the article points out, however, the import process is not recursive, and if your Samaritan put the files in convenient little folders, you will wish it was! Again, a wee bit of bash scripting solves a lot! Also, I want to run the import all day until it's done, but I only want to update the binary data at night.

#!/bin/bash
# Analyses all files in all subdirectories of the path given as a parameter, or, if no parameter is given, of the default map set below.
# Set the different paths and commands to match your install.
# The command status is available in great detail on stdout and in short form in $map/status
# Run this in a screen session to ensure continuous processing.

map=$*

if [ -z "$map" ];
then
map=/home/you/somefolder/withfiles/ ;
fi

php="sudo -u www-data /usr/bin/php ";
importmap=/var/www/nnplus/www/admin/ ;
import=nzb-importmodified.php ;
updatemap=/var/www/nnplus/misc/update_scripts/;
update=update_releases.php;
binaries=update_binaries.php;
binary_time=" 01 02 03 04 05 06 07 ";


function import_nzbs () {
        local map=$*;
        echo Import NZBs $map;

        count=$( ls $map | wc -l); 
        old=$(( $count +1 )); 

        echo $( date ) - $map  :  $count >> $map../status
        
        while [ $old -gt $count ] ; 
        do 
                date; 

                echo Scan $count files in $map;
                cd $importmap ;
                $php $import $map true ;

                if [ ! -z "$( echo $binary_time | grep $( date +%H ) )" ]; 
                        then 
                        echo $( date ) - Updating Binaries >> $map../status
                        echo Updating binaries.
                        cd $updatemap;
                        $php $binaries ;
                fi

                echo Releases for $map; 
                cd $updatemap;
                $php $update ;

                echo Counting down from $count ;
                old=$count; 
                count=$( ls $map | wc -l); 
                echo $( date ) - $map  :  $count >> $map../status
                echo eth0: $( ifconfig eth0 | grep "RX bytes" ) >> $map../status
        done
}

for a in $map*/; 
do 
        echo $( date ) - Start $a >> $a../status
        import_nzbs $a;
        echo $( date ) - Stop  $a >> $a../status
done

The script loops through all maps and runs the import until the file count no longer goes down (the import script deletes successfully imported data). It first does a file count, then analyzes a new batch of files (100 by default, more about that later), downloads the fresh binaries if the hour is in the $binary_time list, generates refined data, does a fresh file count and starts again until all files in all subdirectories are done. Quite a bit more convenient to me than the proposed altered screen approach. Not that that one's bad, mind you.. Just not ideal for what I need 🙂 Also, my script doesn't run the database optimization script, which would be a good idea to add.
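
To run it the way the header comment suggests, something like this does the job (assuming you saved the script as import_all.sh and made it executable; the name and path are just examples):

# Start a detachable screen session and kick off the import in it.
screen -S nzbimport
./import_all.sh /home/you/somefolder/withfiles/
# Detach with Ctrl-a d, reattach later with: screen -r nzbimport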

Which takes us to the final thing worth mentioning.
The comments on the aforementioned article talk about altering the number of files per batch to find the sweet spot and work through the data as quickly as possible. The default is 100, and considering the overhead in the other commands, you can probably put a higher number in there. I tried some settings to find the sweet spot for my set-up. These won't necessarily be the same on your machine, but it shows that it's worth checking out! (My setup is a server running from an SSD, with all the data, the Samaritan files and the refined files, on GlusterFS clustered network storage. One downside of this is that an "ls" will take a while, certainly with lots of files, causing extra overhead and making a bigger batch of files worthwhile.)

The files setting at the bottom of the altered import script:
line 192: originally "if ($nzbCount == 100)"

Stats for the different nzbCount settings
100: 1100 files / hour
200: 1800 files / hour
400: 2400 files / hour
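
If you want to try a bigger batch, the change itself is a one-liner; a sketch, assuming the stock line 192 quoted above (double-check your own copy before pointing sed -i at it):

# Bump the import batch size from 100 to 400 NZBs per pass.
sed -i 's/if ($nzbCount == 100)/if ($nzbCount == 400)/' /var/www/nnplus/www/admin/nzb-importmodified.php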

There. Some data and scripts that would certainly have helped me when I was looking for info about the process. One thing I was curious about initially, but haven't gotten around to finding out, is how much up/download this scraping generates.. I've got 1.4 GB for yesterday, but there is a wad of data in there from me accessing data on the server, so never mind that number. Probably more like 400 MB.

And as an encore, a list of relevant links. SABnzbd; Sick Beard; CouchPotato; Headphones; Newznab; Derefer.me.

Published by Gert

Person-at-large.

2 thoughts on "Newznab – Adventures in indexing"

  1. Just launched my newznab site http://nzbid.org/ and I must say the Newznab script is amazing. In just 2 weeks about 800k releases grabbed and the site is very powerful with lots of information. Can only advise all to try it.

  2. Just checked out your nzbid.org site and it seems that no matter what I type into 'search', nothing comes up?

    I'm trying to set up my own newznab server and came across this post. In particular I'm trying to index a.b.e-books and a.b.ebook, but I'm having problems with the regexes and backfilling.
