difflib qualm

Gabriel Genellina gagsl-py at yahoo.com.ar
Fri Jan 26 02:33:33 CET 2007


At Thursday 25/1/2007 21:49, Larry Bates wrote:
>Gabriel Genellina wrote:
> > At Wednesday 24/1/2007 23:05, Sick Monkey wrote:
> >
> >> I am trying to write a python script that will compare 2 files which
> >> contains names (millions of them).
> >>
> >> More specifically, I have 2 files (Files1.txt and Files2.txt).
> >> Files1.txt contains 180 thousand names and Files2.txt contains 34
> >> million names.
>
>Put the big list of names in a database and create soundex keys for the names
>and make the soundex keys an index so you can search quickly.  Databases
>are really good at storing data that is searchable via an index.  If you
>REALLY need speed you can consider an in-memory database.
>
>Create soundex keys for each name in your small list and query the database
>with this key into the table in the DB that is indexed on soundex keys.
>If you get a hit, the key is sufficiently "alike" to be a candidate.  I'll
>leave the remainder to you.  Perhaps there is other information that will
>help determine if there is a match?

Soundex was designed around English pronunciation, so it only works
reasonably well for English words. For non-English names it is almost
useless, so use it with caution, if at all.
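
To make both points concrete, here is a rough sketch, not a tuned
implementation. It assumes the common "American Soundex" rules and an
in-memory SQLite table; the names soundex(), build_index() and candidates()
are made up for this example, and so are the sample names.

```python
import sqlite3

# Letter -> digit table for American Soundex (assumed variant).
_CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex key, or "" if name has no a-z letters."""
    # Anything outside a-z (accented letters included) is silently dropped,
    # which is already a problem for many non-English names.
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (key + "000")[:4]


def build_index(names):
    """Load names into an in-memory table indexed on the Soundex key."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE names (name TEXT, sx TEXT)")
    db.execute("CREATE INDEX idx_sx ON names (sx)")
    db.executemany("INSERT INTO names VALUES (?, ?)",
                   ((n, soundex(n)) for n in names))
    return db


def candidates(db, name):
    """Return stored names sharing a Soundex key with name (candidates only)."""
    rows = db.execute("SELECT name FROM names WHERE sx = ? ORDER BY name",
                      (soundex(name),))
    return [r[0] for r in rows]


big_list = ["Robert", "Rupert", "Gimenez", "Smith"]
db = build_index(big_list)
print(candidates(db, "Rubert"))   # English-style names collapse to one key
print(candidates(db, "Jimenez"))  # same sound as Gimenez in Spanish, no hit
```

The second lookup shows one way the English bias bites: Soundex always keeps
the first letter, so "Jimenez" and "Gimenez" get different keys even though
they sound the same in Spanish. Every hit is only a candidate and still needs
a real comparison afterwards (difflib, for instance).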


-- 
Gabriel Genellina
Softlab SRL 


	

	
		



