Tracking Users By IP Address

Michael Sparks zathras at
Fri Oct 8 00:23:58 CEST 2004

Fuzzyman wrote:
> Assuming that each visitor will have a unique IP address is probably
> not 100% accurate but will have a reasonably low margin of error !! Is
> there any reason not to use this approach ?

For this task things are worse than they sound:
   * Many users are behind firewalls and/or proxies
   * In the case of a simple NATting firewall, all accesses from an office
     will appear to come from a single IP (or small set of IPs).
   * In the case of proxies that set either Client-IP or X-Forwarded-For
     headers you can't guarantee that the IPs are passed through intact
     (depending on the paranoia/privacy settings of the proxy)
   * Even if you *can* see the IP, many people connect from systems where
     their IP changes regularly - meaning you break your request streams so
     you don't get the full behaviour over time and means that people
   * Many ISPs that generate large amounts of traffic proxy all their
     traffic, meaning you have millions of users from tens or hundreds of
     IPs.
   * The "My TiVO thinks I'm Gay" effect. PVRs often have settings to say "I
     like", "I hate", which allows the PVR to determine the tastes of the
     owner. In theory this sounds great, but because it assumes everyone in
     the same household likes the same thing it jumps to the wrong
     conclusions. If you consider homes with several people, you can't even
     rely on accesses from the same _machine_ to come from the same user.
     (If they share the same login there's nothing you can do obviously)

When you put all these together, using IPs _looks_ very bad.

What's an alternative? First a few points:
   * Well, you want to track users. This means by EU regs (based on your
     email this might apply) you have to let users of your site know you're
     doing this.
   * Cookies can get by a lot of the issues listed above
   * You're only really interested in people who can reliably send you the
     same cookie twice. (If they refuse cookies, you can't track them using
     cookies obviously, and the above implies IP won't work brilliantly for
   * Relying on cookies also means you can allow your users to opt out from
     being tracked. 

One alternative: (pseudocode)

Receive request
If no-cookie-received:
   Set Cookie: "NEWUSER"
else:
   if cookie-received == "NEWUSER":
      # We know they can send us cookies back
      id = gen-id()
      Set Cookie: id
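
For what it's worth, here is a minimal Python sketch of that handshake. It
is only one way of writing it: the cookie name "visitor", the function name
handle_request and the use of uuid4 for gen-id() are my assumptions, not
anything fixed by the pseudocode. It takes the raw Cookie header and returns
the value to set (if any) and the id to log (if any). One deliberate choice:
on the request where the probe comes back, it logs the freshly issued id
rather than the literal "NEWUSER".

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor"  # hypothetical cookie name, pick your own
PROBE = "NEWUSER"        # placeholder value from the pseudocode


def handle_request(cookie_header):
    """Return (value_to_set_or_None, id_to_log_or_None) for one request."""
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    morsel = jar.get(COOKIE_NAME)

    if morsel is None:
        # No cookie received: send the probe and log nothing yet.
        return PROBE, None

    if morsel.value == PROBE:
        # The probe came back, so we know they can send us cookies.
        new_id = uuid.uuid4().hex
        return new_id, new_id

    # Established visitor: nothing to set, log under the existing id.
    return None, morsel.value
```

A client that refuses cookies just keeps getting the probe and never shows
up in the log under an id, which is the behaviour described above.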

Then just log requests with the received cookie. Trackable users will have
a unique id whether their IP changes, they share a system, or they sit
behind NATting firewalls, etc. This allows you to track unique users that
are trackable using cookies. If you have a particularly large number of
users accessing your site, you can tie in sampling as well (perhaps
something like density biased sampling), something like this:

new-cookie = None
If no-cookie-received:
   new-cookie = "NEWUSER"
else:
   if cookie-received == "NEWUSER":
      # We know they can send us cookies back
      id = gen-id()
      new-cookie = id

if add-to-sample-set(request):
   tag = "SAMPLE"
   new-cookie = current-cookie or new-cookie
else:
   tag = "NOSAMPLE"

if new-cookie:
   Set Cookie: tag new-cookie

(Or something like that IYSWIM - ie get the user population to indicate if
they're being sampled - again, this allows your users to easily opt out,
and also means the memory/etc required to determine whether to track the
user or not isn't dependent on the number of requests your site gets -
meaning that you can keep analysis costs for your site under control. If
you've only got a small site this probably doesn't matter to you, but 
worth bearing in mind).
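
A rough Python sketch of the "let the cookie carry the sampling decision"
idea follows. To be clear about what it is not: this is plain uniform
sampling based on a hash of the id, swapped in as a simple stand-in, and not
density biased sampling. The 10% rate, the "TAG:id" cookie layout and all
three function names are assumptions of mine for illustration.

```python
import hashlib

SAMPLE_RATE = 0.1  # hypothetical: track roughly 10% of visitors


def in_sample(visitor_id, rate=SAMPLE_RATE):
    """Decide membership from a hash of the id, so no per-user state is kept."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    # Compare the first 64 bits of the hash with the rate scaled to 64 bits.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def tag_cookie(visitor_id, rate=SAMPLE_RATE):
    """Build the cookie value, prefixing the id with its sampling tag."""
    tag = "SAMPLE" if in_sample(visitor_id, rate) else "NOSAMPLE"
    return f"{tag}:{visitor_id}"


def should_log(cookie_value):
    """Log only requests whose cookie says the visitor is in the sample."""
    tag, _, visitor_id = cookie_value.partition(":")
    return tag == "SAMPLE" and bool(visitor_id)
```

Because the tag travels with the request, deciding whether to log is a
string check per request, and opting a user out is just a matter of handing
them a "NOSAMPLE" value.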

The interesting thing about this from my perspective is that if you do 
take a cookie approach like this, it actually allows you to figure out how
much error there actually is between IP and cookie - rather than just guess.
The other nicety is it allows your users to opt-out very easily - since they
can either switch off cookies, or you can send them a "NOSAMPLE" cookie.
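
As a sketch of how you might put a number on that error, assuming your log
gives you (ip, cookie id) pairs for the trackable requests: count how many
IPs carry more than one id, and how many ids turn up from more than one IP.
The function name, the record layout and the result keys are all
hypothetical.

```python
from collections import defaultdict


def ip_cookie_mismatch(records):
    """records: iterable of (ip, visitor_id) pairs, one per logged request."""
    ids_per_ip = defaultdict(set)
    ips_per_id = defaultdict(set)
    for ip, visitor_id in records:
        ids_per_ip[ip].add(visitor_id)
        ips_per_id[visitor_id].add(ip)

    return {
        "distinct_ips": len(ids_per_ip),
        "distinct_ids": len(ips_per_id),
        # IPs shared by several cookie ids (NAT, proxies, shared machines).
        "shared_ips": sum(1 for ids in ids_per_ip.values() if len(ids) > 1),
        # Cookie ids seen from several IPs (dynamic addresses, roaming).
        "roaming_ids": sum(1 for ips in ips_per_id.values() if len(ips) > 1),
    }
```

Comparing distinct_ips with distinct_ids, alongside the two mismatch counts,
gives you a measured figure for your own site instead of a guess.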

Also, at present comments in this thread revolve around "this isn't
reliable because of x,y and z". If you take this sort of approach you
can find out the margin of error and then decide whether you're happy
with it or not. Also as you can see from above this doesn't really have
to be a very complex operation (unless you're in a high volume scenario
with lots of distinct users and need to add in the sampling aspect).

Best Regards,

