[Tutor] Newbie Trouble Processing SRT Strings In Text File

Matt Varner matt.l.varner at gmail.com
Fri Oct 31 12:07:25 CET 2014


TL;DR - Skip to 'My Script: "subtrans.py"'

<beg>

Optional Links to (perhaps) Helpful Images:
1. The SRT download button:
http://i70.photobucket.com/albums/i82/RavingNoah/Python%20Help/tutor1_zps080f20f7.png

2. A visual comparison of my current problem (see 'Desire Versus
Reality' below):
http://i70.photobucket.com/albums/i82/RavingNoah/Python%20Help/newline_problem_zps307f8cab.jpg

============
The SRT File
============

The SRT file that you can download for every lesson that has a video
contains the caption transcript data and is organized according to
text snippets with some connecting time data.

========================================
Reading the SRT File and Outputting Something Useful
========================================

There may be a hundred different ways to read one of these file types.
The reliable method I chose was to use a handy editor for the purpose
called Aegisub.  It will open the SRT file and let me immediately
export a version of it, without the time data (which I don't
need...yet).  The result of the export is a plain-text file containing
each string snippet and a newline character.
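As an aside, the same "text only" export could probably be done in
Python as well.  The following is only a minimal sketch, assuming the
usual SRT layout (a numeric index line, a timing line containing "-->",
one or more text lines, then a blank line).  The function name
srt_text_lines and the sample timings are made up for illustration;
this is not what Aegisub does internally.

```python
def srt_text_lines(srt_text):
    """Return only the caption text lines from SRT-formatted text."""
    kept = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank separator between cues
        if stripped.isdigit():
            continue  # cue index line (assumed to be purely numeric)
        if "-->" in stripped:
            continue  # timing line, e.g. 00:00:00,500 --> 00:00:02,000
        kept.append(stripped)
    return kept


# Hypothetical two-cue sample; the timings are invented.
sample = (
    "1\n"
    "00:00:00,500 --> 00:00:02,000\n"
    "In this video, we'll go over the\n"
    "\n"
    "2\n"
    "00:00:02,000 --> 00:00:04,000\n"
    "common numeric and textual values\n"
)
text_lines = srt_text_lines(sample)
```

One known weakness of this sketch: a caption that consists only of
digits would be mistaken for an index line and dropped.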

==========================
Dealing with the Text File
==========================

One of these text files can be anywhere from 130 to 500 lines or
longer, depending (obviously) on the length of its attendant video.
For my purposes, as a springboard for extending my own notes for each
module, I need to concatenate each string with an acceptable format.
My desire for this is to interject spaces where I need them and kill
all the newline characters so that I get just one big lump of properly
spaced paragraph text.  From here, I can divide up the paragraphs how
I see fit and I'm golden...

==============================
My first Python script: Issues
==============================

I did my due diligence.  I have read the tutorial at www.python.org.
I went to my local library and got a copy of "Python Programming for
the Absolute Beginner, 3rd Edition" by Michael Dawson.  I started
collecting what seemed like logical little bits here and there from
examples found using Uncle Google, but none of the examples anywhere
were close enough, contextually, to be automatically picked up by my
dense 'noobiosity.'  For instance, when discussing string
methods...almost all operations taught to beginners are done on
strings generated "on the fly," directly inputted into IDLE, but not
on strings that are contained in an external file.  There are other
examples for file operations, but none of them involved doing string
operations afterward.  After many errors about not being able to
directly edit strings in a file object, I finally figured out that
lists are used to read and store strings kept in a file like the one
I'm sourcing from...so I tried using that.  Then I spent hours
unsuccessfully trying to call strings using index numbers from the
list object (I guess I'm dense).  Anyhow, I put together my little
snippets and have been banging my head against the wall for a couple
of days now.
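
For reference, a minimal sketch (assuming Python 3) of the pattern
described above: read a file into a list, pick one string out by index,
and run a string method on it.  The temporary file and its two lines
are stand-ins so the snippet is self-contained; the variable names are
made up.

```python
import os
import tempfile

# Write a tiny stand-in file (hypothetical content) to read back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Let's get started.\nSo, here we have a simple HTML page\n")
    path = tmp.name

# readlines() returns a list of strings, one per line, each still
# carrying its trailing "\n".
with open(path) as f:
    lines = f.readlines()
os.remove(path)

first = lines[0]                 # index into the list to get one string
shouted = first.strip().upper()  # string methods return a NEW string
```

Strings are immutable, so first is left unchanged; the result of a
string method has to be stored somewhere (here, shouted) to be used.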

After many frustrating attempts, I have NEARLY produced what I'm
looking to achieve in my test file.

================
Example - Source
================

My test file contains just twelve lines of a much larger (but no more
complex) file that is typical for the SRT subtitle caption file, of
which I expect to have to process a hundred...or hundreds, depending
on how many there are in all of the courses I plan to take
(coincidentally, there is one on Python).

Line 01: # Exported by Aegisub 3.2.1
Line 02: [Deep Dive]
Line 03: [CSS Values & Units Numeric and Textual Data Types with
Guil Hernandez]
Line 04: In this video, we'll go over the
Line 05: common numeric and textual values
Line 06: that CSS properties can accept.
Line 07: Let's get started.
Line 08: So, here we have a simple HTML page
Line 09: containing a div and a paragraph
Line 10: element nested inside.
Line 11: It's linked to a style sheet named style.css
Line 12: and this is where we'll be creating our new CSS rules.

========================
My Script: "subtrans.py"
========================

# Open the target file, create file object
f = open('tmp.txt', 'r')

# Create an output file to write the changed strings to
o = open('result.txt', 'w')

# Create a list object that holds all the strings in the file object
lns = f.readlines()

# Close the source file you no longer
# need now that you have your strings
f.close()

# Import sys to get at stdout (standard output) - "print" results will
# be written to file
import sys

# Associate stdout with the output file
sys.stdout = o

# Try to print strings to output file using loop variable (line)
# and the list object
for line in lns:
    if ".\n" in line:
        a = line.replace('.\n','.  ')
        print(a.strip('\n'))
    else:
        b = line.strip('\n')
        print(b + " ")

# Close your output file
o.close()

=================
Desire Versus Reality
=================

The source file contains a series of strings with newline characters
directly following whatever the last character in the snippet is...with
absolutely no whitespace.  This is a problem for me if I want to
concatenate it back together into paragraph text to use as the
jumping off point for my additional notes.  I've been literally taking
four hours to type explicitly the dialogue from the videos I've been
watching...and I know this is going to save me a lot of time and get
me interacting with the lessons faster and more efficiently.
However...

My script succeeds in processing the source file and adding the right
number of spaces for each line, the rule being "two spaces added
following a period, and one space added following a string with no
period in it" (technically, a period/newline pairing, which was the
only way I could figure out not to target the period in 'example.css'
or 'version 2.3.2').

But, even though it successfully kills these additional newlines that
seem to form in the list-making process...I end up with basically a
non-concatenated file of strings...with the right spaces I need, but
not one big chunk of text, like I expect using the s.strip('\n')
functionality.
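
A likely explanation, sketched under the assumption that this is
Python 3: the newlines in the output do not come from the list.  Each
call to print() appends its own "\n" after whatever it is given, unless
told otherwise through the end argument.  The snippet below uses an
in-memory io.StringIO buffer as a stand-in for result.txt so the
difference can be inspected directly.

```python
import io

# Default behaviour: print() adds "\n" after each call.
buf = io.StringIO()  # in-memory stand-in for the output file
print("common numeric and textual values ", file=buf)
print("that CSS properties can accept.  ", file=buf)
with_default_end = buf.getvalue()

# With end="", print() writes nothing after the text.
buf = io.StringIO()
print("common numeric and textual values ", end="", file=buf)
print("that CSS properties can accept.  ", end="", file=buf)
with_empty_end = buf.getvalue()
```

In the script above, the equivalent change would be passing end="" to
both print() calls, or calling o.write(...) on the output file, which
never adds a newline of its own.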

============================================================
What I'm Holding Out For - This is what my output should look like
(for the test file)
============================================================

# Exported by Aegisub 3.2.1 [Deep Dive] [CSS Values & Units
Numeric and Textual Data Types with Guil Hernandez] In this video,
we'll go over the common numeric and textual values that CSS
properties can accept.  Let's get started.  So, here we have a simple
HTML page containing a div and a paragraph element nested inside.
It's linked to a style sheet named style.css and this is where we'll
be creating our new CSS rules.
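
One possible way to get that single block of text, as a minimal sketch
rather than a definitive fix: build the pieces in a list and join them
at the end.  The function name join_captions is made up for
illustration, and the spacing rule is the one described earlier (two
spaces after a line ending in a period, one space otherwise).

```python
def join_captions(lines):
    """Join caption lines into one paragraph string."""
    parts = []
    for line in lines:
        text = line.rstrip("\n")
        # Only a period at the very END of a line gets two spaces, so
        # 'style.css' or 'version 2.3.2' mid-line are left alone.
        parts.append(text + ("  " if text.endswith(".") else " "))
    # Drop the spacing added after the final line.
    return "".join(parts).rstrip()


sample = [
    "that CSS properties can accept.\n",
    "Let's get started.\n",
    "So, here we have a simple HTML page\n",
]
paragraph = join_captions(sample)
```

The list passed in could be the result of f.readlines(), and the
returned string could be written out with a single o.write(...) call.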

===========================
Thank You For Your Time and Efforts!
===========================

</beg>

