[Tutor] gensim to generate document vectors

Alan Gauld alan.gauld at btinternet.com
Mon Aug 17 21:42:44 CEST 2015

On 17/08/15 18:50, Joshua Valdez wrote:
> Okay, so I'm trying to use Doc2Vec to simply read in a a file that is a
> list of sentences like this:

This list is for folks learning the core Python language and the
standard library.
Doc2Vec is not part of that library, so you might find you get more
responses asking on the gensim community forums.
A quick Google search suggests there are several
to choose from.

You might get lucky here, but it's not an area we discuss often.

> What I want to do is generate two files: one with unique words from these
> sentences and another file that has one corresponding vector per line (if
> there's no vector output I want to output a vector of 0's)

Don't assume anyone here will know about your area of specialism.
What is a vector in this context?

> I'm getting the vocab fine with my code but I can't seem to figure out how
> to print out the individual sentence vectors, I have looked through the
> documentation and haven't found much help. Here is what my code looks like
> so far.

It seems to have gotten somewhat messed up.
I suspect you are using rich text or HTML formatting.
Try posting again in plain text.

> sentences = []
> for uid, line in enumerate(open(filename)):
>     sentences.append(LabeledSentence(words=line.split(),
>                                      labels=['SENT_%s' % uid]))
>
> model = Doc2Vec(alpha=0.025, min_alpha=0.025)
> model.build_vocab(sentences)
> for epoch in range(10):
>     model.train(sentences)
>     model.alpha -= 0.002
>     model.min_alpha = model.alpha
>
> sent_reg = r'[SENT].*'
> for item in model.vocab.keys():
>     sent = re.search(sent_reg, item)
>     if sent:
>         continue
>     else:
>         print item
>
> ### I'm not sure how to produce the vectors from here and this doesn't work ##
> sent_id = 0
> for item in model:
>     print model["SENT_"+str(sent_id)]
>     sent_id += 1
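For what it's worth, the "vector of 0's" fallback described above can be
sketched without knowing gensim's API at all. In the sketch below, a plain
dict stands in for the trained model's label-to-vector lookup; the dict
contents, VECTOR_SIZE, and num_sentences are all made up for illustration.
With a real Doc2Vec model the lookup would go through the model object
instead, so treat this only as the shape of the file-writing logic:

```python
# Hypothetical stand-in for a trained model: maps each sentence
# label to its vector (a list of floats). SENT_2 is deliberately
# missing to show the zeros fallback in action.
VECTOR_SIZE = 3  # assumed dimensionality for the sketch

model = {
    "SENT_0": [0.1, 0.2, 0.3],
    "SENT_1": [0.4, 0.5, 0.6],
}

num_sentences = 3  # assumed number of input sentences
lines = []
for sent_id in range(num_sentences):
    label = "SENT_%d" % sent_id
    # fall back to a vector of 0's when the label has no vector
    vector = model.get(label, [0.0] * VECTOR_SIZE)
    lines.append(" ".join(str(x) for x in vector))

# one vector per line, ready to be written to the output file
for line in lines:
    print(line)
```

The same loop would write the second output file by replacing the
print with a file write, one space-separated vector per line.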

Alan G
Author of the Learn to Program web site
Follow my photo-blog on Flickr at:

More information about the Tutor mailing list