meld > diff

I recently was assigned a task at work that required finding the differences between one directory full of configuration files and another directory full of configuration files. Normally I’d use diff to look at the differences between the files, and figure out what has changed. With a directory of over 100 files, this wasn’t really feasible.

Enter meld, an awesomely awesome GUI-fied tool for Linux that not only lets you compare files, it also lets you compare whole directories of files, or three files at the same time. Very cool. It is also in the Ubuntu repos, which is a plus.

In less time than it would take to interpret a diff on a single file, I quickly saw which files only existed in the original directory, which only existed in the new directory, and which were in both but different. The diff for each file was only a double-click away.

Pretty cool app. If you are on an Ubuntu-based system, check it out by running: sudo apt-get install meld
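If you ever need the same three-way summary in a script instead of a GUI, here is a minimal sketch using Python's standard filecmp module. This is not how meld works internally; the function name summarize is made up for illustration.

```python
import filecmp


def summarize(old_dir, new_dir):
    """Return (only_in_old, only_in_new, differing) file names for two directories."""
    comparison = filecmp.dircmp(old_dir, new_dir)
    # left_only / right_only: names present in just one directory.
    # diff_files: names present in both whose contents differ. By default
    # dircmp does a shallow check first (file type, size, mtime) and only
    # reads the contents when those signatures differ.
    return (
        sorted(comparison.left_only),
        sorted(comparison.right_only),
        sorted(comparison.diff_files),
    )
```

Calling summarize("/path/to/old", "/path/to/new") gives three sorted lists, which is roughly what meld's directory view shows at a glance.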

Posted in Linux

All The Domains in the World

After waiting months I’ve finally been given access to the .com TLD zone file! (And the .net, but who cares about .net, right?)

So what is a zone file, you ask? Basically this file keeps track of how to access every .com domain name in the entire world. Well, technically not all of them, just the ones that have name servers associated with them, but practically all of them. So how many .com’s are there in the zone file? Over 88 million! Holy cow that is a lot of domains!

So the first problem is how to use the zone file. It is a 6.5 GB text file so you can’t just open it up and say “hey is jacoballred.com taken?” and expect a quick reply. On top of that, it isn’t even designed to give you a list of taken domain names, that is just a happy side effect of keeping track of how to access all the .com’s in the world.

My solution was to preprocess the data using a few Linux utilities, then load it into MySQL.

The zone file looks a little like this:

 NS E.GTLD-SERVERS.NET.
 NS M.GTLD-SERVERS.NET.
$TTL 172800
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
SELF-DRIVE-CAR-RENTAL NS NS3.IZP

None of the domains in the file have .com on the end. Each of these lists a nameserver after it. There are also non-TLD domains (nameservers) that I don’t care about, and other random markers in the file ($TTL). All I want are the domain names, so I use Linux to strip out the stuff I don’t want:

sed -e '/^[^A-Z0-9]/d' -e '/^$/d' -e 's/ .*$//' -e '/[^A-Z0-9-]/d' com.zone \
| sort -u \
| awk '{ print > ("com.zone.split." substr($0, 1, 1)) }'

Wow, doesn’t that look fun? So let’s go over it.

That first line uses sed to load in the zone file (com.zone) and remove all lines that don’t start with A-Z or 0-9 (the only valid characters for the first character of a .com domain), then it removes blank lines, then it removes all but the first word on each line (gets rid of the nameserver after the domain name), and finally removes any line that has characters that aren’t allowed in a domain (anything other than A-Z, 0-9, or a dash). This gets a list of JUST the domain names (without the .com), but has duplicates and they aren’t in any particular order.

The next line sorts the list of domains and removes duplicates.

The last line uses awk to split the list of domains into 36 separate files, one for each starting character (A-Z, 0-9). This isn’t technically needed but makes things more convenient.
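For anyone who finds the pipeline hard to read, the same filtering logic can be sketched in Python. This is only an illustration of the steps described above, not the command I ran; the names extract_domains and split_by_first_char are made up, and Python's sort order may differ from sort -u depending on your locale.

```python
import re
from collections import defaultdict

# Same rule as the sed expressions: the first word must start with A-Z or 0-9
# and contain nothing but A-Z, 0-9, and dashes.
LABEL = re.compile(r"[A-Z0-9][A-Z0-9-]*")


def extract_domains(lines):
    """Return the sorted, de-duplicated domain labels found in zone-file lines."""
    seen = set()
    for line in lines:
        # Keep only the text before the first space (drops the "NS ..." part).
        first_word = line.rstrip("\n").split(" ", 1)[0]
        if LABEL.fullmatch(first_word):
            seen.add(first_word)
    return sorted(seen)


def split_by_first_char(domains):
    """Group labels by first character, like the per-character split files."""
    buckets = defaultdict(list)
    for domain in domains:
        buckets[domain[0]].append(domain)
    return dict(buckets)
```

Feeding it the sample lines from earlier leaves just ENERCONTECHNOLOGIES and SELF-DRIVE-CAR-RENTAL, in the E and S buckets.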

My server is pretty wussy (1GB of RAM), so I’m preprocessing on my much faster desktop at home (8GB of RAM). I kick off that command, and 20 minutes later I have files ready to be loaded into MySQL.

My table structure is pretty basic. I have 1 table for each letter/number (for performance) that has a numeric primary key and a varchar for the domain name. So I run this for each letter and number:

DROP TABLE IF EXISTS `com_A`;

CREATE TABLE `com_zone`.`com_A` (
    `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    `name` VARCHAR( 255 ) NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;

Next I use LOAD DATA INFILE to quickly pull in the data:

LOAD DATA INFILE '/path/to/file/com.zone.split.A' INTO TABLE `com_A` (name);

This step took about 5 minutes total for all the processed files. It is super super fast, but we still have one step left. Without an index on the name field, queries are really slow (about 2 seconds for a single domain). So we add an index to each table:

ALTER TABLE `com_A` ADD UNIQUE `name` ( `name` ( 255 ) );

This step was painfully slow, about 40 minutes, but once it was done I could do pretty much any query in a fraction of a second.

The final step was to turn off MySQL on my desktop, copy the MyISAM files to my server, then restart MySQL on my server so it could use them. Woot! I now have nearly every .com in the world on my server, ready to tell any web app I want whether a domain is available, with a high degree of confidence. I have a couple of really fun webpages in the works that will use this.
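A web app's lookup against these tables could look something like the sketch below. To keep it runnable without a MySQL server it uses Python's built-in sqlite3 as a stand-in, so treat the connection handling as an assumption; the helper names table_for and is_registered are made up for illustration.

```python
import sqlite3


def table_for(domain):
    """Map 'jacoballred.com' to ('com_J', 'JACOBALLRED')."""
    label = domain.upper()
    if label.endswith(".COM"):
        label = label[:-4]
    first = label[:1]
    # Only A-Z and 0-9 are allowed, which also keeps the table name safe
    # to interpolate into the SQL string below.
    if not ("A" <= first <= "Z" or "0" <= first <= "9"):
        raise ValueError("not a valid .com label: %r" % domain)
    return "com_" + first, label


def is_registered(conn, domain):
    """True if the domain's label is present in its per-character table."""
    table, label = table_for(domain)
    row = conn.execute(
        "SELECT 1 FROM %s WHERE name = ? LIMIT 1" % table, (label,)
    ).fetchone()
    return row is not None
```

Because the table is picked from the first character, each query only touches one of the 36 indexes.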

Well that was a bit of a ramble but should be enough to get someone else in my position on the road to domain goodness!

Posted in Linux, Web Dev

CloudFlare Rocks!

I recently got invited to try out CloudFlare, a new free service in closed beta. I checked out the homepage and was greeted with a “Wouldn’t it be cool if your website were protected by ninjas?” header. Why, yes, that would be cool.

So, not really understanding exactly what it was, I signed up and configured it for the Fake Name Generator.

After some poking around and letting it do its thing for a few days, this is what I’ve discovered it does for me:

  • Provides free DNS management. This is included for free with many registrars, but it just so happened that the domain I’m trying this out on didn’t come with DNS management so I’ve been paying $10 a year for it. This alone makes CloudFlare worth using for me.
  • Serves my content on a CDN-like infrastructure. This makes my site faster for some users, which is always a good thing.
  • Caches my static content (like images and JavaScript). This dramatically reduces my server load, and makes my site faster. My LAMP server with only 1GB of RAM is currently serving about 100,000 pageviews per day and running millions of queries in offline processes. With CloudFlare, my load average rarely goes above 0.10.
  • Blocks bad guys. This is a huge deal for me. Everyone and their mom thinks it is okay to scrape my site for data. Bots love to hit my site to try to find exploits. CloudFlare does a great job at identifying these people and blocking them for me, or providing a way for them to enter a captcha to prove they aren’t a bot.
  • Provides geolocation data on all visitors. I haven’t started using this yet, but CloudFlare adds a request header with the visitor’s geographic location. This makes it easier to target content to visitors from certain parts of the world.
  • Makes me more profitable. All around, CloudFlare has made my business more profitable. My site requires fewer server resources, which means I can keep my site running on my relatively cheap tiny server. Fewer bots are loading my ads, which means my click-through rates are higher, which means I get paid more. My pages respond faster, which means I’m ranked higher in the search engines, which means I get more visitors.
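As a rough idea of what using the geolocation header might look like, here is a small sketch. It assumes the header is named CF-IPCountry and carries a two-letter country code; the post above doesn't name the header, so check CloudFlare's documentation before relying on it. The function name visitor_country is made up.

```python
def visitor_country(headers, default=None):
    """Return the visitor's country code from request headers, if present.

    Assumes CloudFlare's geolocation header is named CF-IPCountry.
    Header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "cf-ipcountry":
            return value.strip().upper()
    return default
```

A page could then branch on the result, for example showing a different default name set to visitors whose code is "DE".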

One problem I ran into, however, is occasionally a screen scraper gets through their blocks and starts hitting my site. In the past I would use iptables to block them, but the way CloudFlare works makes that impossible (at least with my limited knowledge of iptables). CloudFlare provides a way to block a specific IP, but it can take several minutes to go into effect.

The solution I came up with is to use Apache to give visitors from the offending IP a 403 error:

<VirtualHost *>

 # The pattern is a regex: anchor it and escape the dots so only this exact IP matches
 SetEnvIf CF-Connecting-IP ^98\.17\.241\.185$ GoAway=1

 <Directory "/path/to/your/website">
 Order allow,deny
 Allow from all
 Deny from env=GoAway
 </Directory>

</VirtualHost>

This snippet, properly placed in the Apache config file, will cause Apache to look at a header set by CloudFlare, and if it matches the offending IP (in this case 98.17.241.185), it denies access to the site. You can add a nearly unlimited number of SetEnvIf statements to block any number of IPs.
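If the block list grows, typing the SetEnvIf lines by hand gets old. Here is a minimal sketch that generates them from a list of IPs. Since SetEnvIf treats its second argument as a regular expression, the sketch anchors and escapes each address so the pattern matches only that exact IP. The function name setenvif_lines is made up for illustration.

```python
import ipaddress
import re


def setenvif_lines(ips, env_var="GoAway"):
    """Build one SetEnvIf directive per IP address."""
    lines = []
    for ip in ips:
        # Raises ValueError on typos like "98.17.241" before they reach Apache.
        ipaddress.ip_address(ip)
        pattern = "^" + re.escape(ip) + "$"
        lines.append("SetEnvIf CF-Connecting-IP %s %s=1" % (pattern, env_var))
    return lines
```

Printing the result for ["98.17.241.185"] gives a single directive you can paste inside the VirtualHost block.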

Anyways, if you get an invite to CloudFlare, check it out! It is definitely worth it!

Posted in Free Stuff, Linux, My Sites, Web Dev

Protecting Your Server with DenyHosts

Yesterday I noticed my server’s load average was a bit higher than usual. Normally when this happens it means someone is screen scraping the Fake Name Generator, so I started reviewing the logs to figure out who it was so I could block them.

Disappointingly, I couldn’t find anybody that was scraping my site, which means I had to dig deeper. The next step was to use top to figure out what processes are stealing all my resources. To my surprise (and exceedingly great alarm) I saw that there were about a dozen sshd processes running. For those that are not Linux server savvy, there should not be about a dozen sshd processes running.

SSH is the protocol that Linux server admins use to connect to their servers. When connecting, an sshd process will run. When a dozen are showing up, that means a dozen people are connected or trying to connect, which is very very disturbing for a server like mine where I’m the only one that should ever be on it.

I quickly turned to the logs and found thousands of failed login attempts. Someone was trying to hack my box. Yikes!

I quickly used iptables to block the most flagrantly offending IP, but I knew that wouldn’t hold back a committed attacker. Enter my hero: DenyHosts!

DenyHosts is a free chunk of code written in Python that periodically scans your log files, determines if someone looks like they are trying to break in, and blocks them. If you are really paranoid then you can even have it talk to other servers to find out who is trying to hack them, so you can preemptively block the bad guys.
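To give a feel for the log-scanning idea, here is a toy sketch that counts failed SSH password attempts per IP. It is not DenyHosts' actual code; it assumes the usual OpenSSH "Failed password for ... from IP port N" log lines, and the threshold of 5 is an arbitrary number picked for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   sshd[1234]: Failed password for invalid user admin from 203.0.113.7 port 51234 ssh2
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def offenders(log_lines, threshold=5):
    """Return the IPs with at least `threshold` failed password attempts."""
    counts = Counter()
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

The real tool does a lot more (it keeps state between runs and writes the block list for you), but the core loop is the same shape: scan, count, compare against a limit.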

Installation and configuration literally took about 3 minutes, and it is even easier to set up if you are using Ubuntu or Linux Mint because it is in the repos. As soon as I started it, all the bad guys were blocked and my load averages started to drop. I highly recommend it for anyone who administers Linux servers.

Posted in Free Stuff, Linux

My First Firefox Add-on

Update (7/16/2010): The add-on got reviewed and approved! Yay!

I’ve been wanting to make a Firefox add-on FOREVER, but have never found something worth making that hasn’t already been made (or isn’t way beyond my abilities to make). So I decided I’d start simple and make a search provider add-on for ABA Number Lookup.

I found a page in the Mozilla developer wiki that explains how to make an OpenSearch plugin. It is super easy. Basically it is just a snippet of XML that tells the browser where to send queries to, and how to get search suggestions. It also contains basic information like the name of the search engine and who wrote it. Easy stuff.
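For the curious, here is a minimal sketch that builds that kind of XML snippet with Python's standard library. The element names come from the OpenSearch 1.1 description format; the function name is made up, and whatever URLs you pass in are placeholders, not the real ABA Number Lookup endpoints.

```python
import xml.etree.ElementTree as ET

NS = "http://a9.com/-/spec/opensearch/1.1/"


def opensearch_description(short_name, description, search_url, suggest_url):
    """Return an OpenSearch description document as a string.

    Both URL templates should contain the {searchTerms} placeholder,
    which the browser replaces with what the user typed.
    """
    ET.register_namespace("", NS)  # serialize with a default xmlns, no prefix
    root = ET.Element("{%s}OpenSearchDescription" % NS)
    ET.SubElement(root, "{%s}ShortName" % NS).text = short_name
    ET.SubElement(root, "{%s}Description" % NS).text = description
    # Where queries go.
    ET.SubElement(root, "{%s}Url" % NS, type="text/html", template=search_url)
    # Where search suggestions come from.
    ET.SubElement(
        root, "{%s}Url" % NS,
        type="application/x-suggestions+json", template=suggest_url,
    )
    return ET.tostring(root, encoding="unicode")
```

Any & characters in the URL templates are escaped automatically when the document is serialized.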

It took about 30 minutes to modify ABA Number Lookup to be able to return search suggestions and to send the proper MIME header for the OpenSearch XML file. After that was done, I went to the “submit an addon” page, followed the instructions, and wham! My search engine add-on is available to the world!
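The server side of search suggestions is small. Below is a hedged sketch of the response body, assuming the common OpenSearch suggestions format: a JSON array holding the query followed by a list of completions. The MIME type constants are the ones generally used for OpenSearch files; the function name and the prefix-matching rule are made up for illustration and are not the site's real logic.

```python
import json

# Content-Type for the OpenSearch description XML file.
OPENSEARCH_XML_MIME = "application/opensearchdescription+xml"
# Content-Type for the suggestions response.
SUGGESTIONS_MIME = "application/x-suggestions+json"


def suggestions_body(query, candidates, limit=10):
    """Return the JSON body [query, [completions...]] for a suggestion request."""
    prefix = query.lower()
    matches = [c for c in candidates if c.lower().startswith(prefix)]
    return json.dumps([query, matches[:limit]])
```

The web app would send this body with the SUGGESTIONS_MIME content type, and serve the XML file itself with OPENSEARCH_XML_MIME.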

Mozilla puts all new add-ons in a sandbox to make it a little harder to push malicious code. This is just a search add-on so I think it should pretty quickly get approved and the download page will look a little less scary.

Anyways, it was fun! I might write one for Rhyta Arcade just for fun, though I doubt anyone would actually use that one.

Posted in My Sites, Web Dev