Please show me a program that converts a text file into HTML files.

Question

kawamori-takumi


Computers / Internet

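Please show me a program that converts a text file into HTML files.

(In Perl, Ruby, or Python)

I have a text file that I want to split at a given number of lines (or characters) and turn into HTML files.
[sample.txt (split into 10) → 1.html to 10.html]

There seem to be free splitting tools, but I do not plan to use them, because:
・I want to split and convert to HTML in one step
・I want a page number matching the file name at the bottom of each page
　(for example, <p>1ページ</p>, meaning "page 1", at the bottom of 1.html)
・I have been studying programming lately and would like to see a variety of code

Thank you very much in advance.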

Answer conditions

Up to 2 answers per person

Registered: 2007/11/02 11:27:56
Closed: 2007/11/09 11:30:03


kawamori-takumi 2007/11/02 12:53:03

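Let me add the intent behind this question.

What I actually want is to upload a document I wrote so that it can be read on a mobile phone.
Shown as a single page it is too long, so I want to split it into several pages.

I have already made an HTML template,
and I want to embed the split text into it.
I also plan to add previous/next links based on the file number,
but I think I can add that part myself, so I deliberately left it out of the question.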

Example: 2.html

<html>
<template>

Split text: part 2 (no line breaks needed)

<template>

<a href="1.html" accesskey="1">1. Prev</a>  Page 2  <a href="3.html" accesskey="3">3. Next</a>

</html>
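
The layout described above could be sketched in Python roughly as follows. This is only an illustration: the names `split_by_lines`, `render_page` and `write_pages`, and the template text, are made up here and do not come from any of the answers. The real template is not shown in the thread, so a trivial one stands in for it.

```python
import html
from string import Template

# Hypothetical stand-in for the asker's template.
PAGE = Template("""<html>
<body>
<p>$text</p>
<p>$nav</p>
</body>
</html>
""")


def split_by_lines(text, lines_per_page):
    """Return pages of at most lines_per_page lines, joined without line breaks."""
    lines = text.splitlines()
    return ["".join(lines[i:i + lines_per_page])
            for i in range(0, len(lines), lines_per_page)]


def render_page(text, no, total):
    """Build the HTML for page `no` (1-based) out of `total` pages."""
    parts = []
    if no > 1:
        parts.append('<a href="%d.html" accesskey="1">1. Prev</a>' % (no - 1))
    parts.append("Page %d" % no)
    if no < total:
        parts.append('<a href="%d.html" accesskey="3">3. Next</a>' % (no + 1))
    # Escape the text so that "<" or "&" in the document cannot break the page.
    return PAGE.substitute(text=html.escape(text), nav=" ".join(parts))


def write_pages(text, lines_per_page):
    """Write 1.html, 2.html, ... to the current directory; return the page count."""
    pages = split_by_lines(text, lines_per_page)
    for no, page in enumerate(pages, start=1):
        with open("%d.html" % no, "w", encoding="utf-8") as out:
            out.write(render_page(page, no, len(pages)))
    return len(pages)
```

The first page gets no "Prev" link and the last page gets no "Next" link. The access keys 1 and 3 simply follow the example above.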
しおり 2007/11/02 18:01:22

> The third usage example did not work...

In what way did it not work?
I forgot to mention that the script is written for Shift_JIS, so to use it with EUC-JP or UTF-8,
change the first line accordingly.
For example:
EUC-JP: #!ruby -Ke
UTF-8: #!ruby -Ku
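
An alternative to editing the first line is to convert the input text to the encoding the script expects before running it. A minimal Python sketch of that idea, assuming the whole file fits in memory (`convert_encoding` is a hypothetical helper name, not part of the script above):

```python
def convert_encoding(data, src="shift_jis", dst="utf-8"):
    """Re-encode bytes from src to dst.

    Raises UnicodeDecodeError if data is not valid in the src encoding.
    """
    return data.decode(src).encode(dst)
```

For example, `convert_encoding(raw, src="euc_jp", dst="shift_jis")` would turn EUC-JP bytes into Shift_JIS bytes.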
kawamori-takumi 2007/11/02 22:00:44

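When I ran it as in the third example,
HTML files were created,
but they contained only the template, without the original text.

I would appreciate it if you could point out anything you notice.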
しおり 2007/11/03 11:04:57

> next if char == "\n" &amp;&amp; chars.empty?

Because of a bug in the super-pre notation, "&" is displayed as "&amp;".
Did you rewrite the line as follows?

next if char == "\n" && chars.empty?
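
For anyone hitting the same thing: the damage is ordinary HTML escaping applied once too often, and it can be reproduced and undone with Python's standard `html` module. A small sketch (the variable names are just for illustration):

```python
import html

# The line as it should appear in the Ruby source.
line = 'next if char == "\\n" && chars.empty?'

# What the page showed: each "&" escaped to "&amp;".
escaped = html.escape(line, quote=False)

# Unescaping gives back the original line.
restored = html.unescape(escaped)

print(escaped)
print(restored)
```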
kawamori-takumi 2007/11/04 02:52:38

Sorry.
I was running it without that rewrite (-_-;)
It works properly now.
Thank you for pointing it out.
TransFreeBSD 2007/11/07 20:32:47

Huh, it's gone. It was there when I checked. Sorry about that.
>|perl|
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(:standard);
use Getopt::Std;

our $opt_c;    # -c N: cut at line ends, at most N bytes per page
our $opt_l;    # -l N: cut every N lines
getopts("c:l:");
$opt_l or $opt_c or $opt_l = 25;    # default: 25 lines per page

my @part;
if ($opt_l) {
    my @line = <>;
    while (@line) {
        push @part, join "", splice @line, 0, $opt_l;
    }
} else {
    local $/ = undef;    # slurp all input at once
    $_ = <>;
    @part = /.{1,$opt_c}(?:\n|$)/sg;
}

unshift @part, ""; # insert dummy so that pages are numbered from 1
for (my $i = 1; $i < @part; $i++) {
    my $part = escapeHTML($part[$i]);
    my $file = "$i.html";
    open FILE, ">$file" or die "cannot open $file: $!";
    print FILE <<__HTML__;
<html>
<body>
<pre>$part</pre>
<p>page $i</p>
</body>
</html>
__HTML__
    close FILE;
}
||<
kawamori-takumi 2007/11/08 17:07:38

Not at all.
Thank you for the code.
I will study it.


しおり · Answer 1 · 2007-11-02T17:16:05+09:00

How about something like this?

textplaintohtml.rb:

#!ruby -Ks

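# Ruby 1.8 only: jcode supplies a multibyte-aware String#each_char there.
# It was removed in Ruby 1.9, where each_char is built in.
require 'jcode'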


def put_usage
  $stderr.puts("Usage: #{$0} [-Lnum|-Cnum] file")
end

class TextPlainToHTML
  def initialize(file, template)
    @file = file
    @template = template
  end

  def convert
    no = 1
    foreach(@file) do |part|
      File.open("#{no}.html", "w") do |out|
        # Block form: backslashes in the text are not treated as backreferences.
        html = @template.sub(/>>text<</) { part }
        html.gsub!(/>>page_no<</, no.to_s)
        out.write(html)
      end
      no += 1
    end
  end
end

class TextPlainToHTMLDivideByLine < TextPlainToHTML
  def initialize(file, template, num = 10)
    super(file, template)
    @num = num
  end

 protected
  def foreach(file)
    lines = ''
    no = 0
    File.foreach(file) do |line|
      lines << line
      no += 1
      next if no < @num
      yield(lines)
      lines.replace('')
      no = 0
    end
    yield(lines) unless lines.empty?
  end
end

class TextPlainToHTMLDivideByChar < TextPlainToHTML
  def initialize(file, template, num = 100)
    super(file, template)
    @num = num
  end

 protected
  def foreach(file)
    chars = ''
    no = 0
    File.foreach(file) do |line|
      line.each_char do |char|
        next if char == "\n" && chars.empty?
        chars << char
        next if char == "\n"
        no += 1
        next if no < @num
        yield(chars)
        chars.replace('')
        no = 0
      end
    end
    yield(chars) unless chars.empty?
  end
end

template = <<'END_OF_TEMPLATE'
<html>
<body>
>>text<<
<p>>>page_no<<ページ</p>
</body>
</html>
END_OF_TEMPLATE


case ARGV.size
when 1
  type = TextPlainToHTMLDivideByLine
  num = 10
  file = ARGV[0]
when 2
  unless /\A-([LC])(\d+)\z/ =~ ARGV[0]
    put_usage
    exit 1
  end
  type = ($1 == 'L' ?
          TextPlainToHTMLDivideByLine : TextPlainToHTMLDivideByChar)
  num = $2.to_i
  file = ARGV[1]
else
  put_usage
  exit 1
end

converter = type.new(file, template, num)
converter.convert

# The bug that turns "&" into "&amp;" still has not been fixed...

Usage examples:

% ruby textplaintohtml.rb sample.txt
% ruby textplaintohtml.rb -L10 sample.txt
% ruby textplaintohtml.rb -C100 sample.txt
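
For readers who do not know Ruby, the character-based splitting in TextPlainToHTMLDivideByChar can be sketched in Python as below. This is my reading of the logic, not a line-for-line port, and `split_by_chars` is a name invented for this sketch. Python 3 strings iterate by character, so nothing like jcode is needed.

```python
def split_by_chars(text, limit):
    """Yield chunks holding `limit` characters each, not counting newlines."""
    chunk = []
    count = 0
    for ch in text:
        if ch == "\n" and not chunk:
            continue  # drop a newline that would start a chunk
        chunk.append(ch)
        if ch == "\n":
            continue  # newlines are kept but not counted
        count += 1
        if count >= limit:
            yield "".join(chunk)
            chunk = []
            count = 0
    if chunk:
        yield "".join(chunk)  # whatever is left becomes the last chunk
```

Each chunk would then be substituted into the template in place of >>text<<, as convert does above.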

Alexandre · Answer 2 · 2007-11-03T18:31:36+09:00

hoge.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from optparse import OptionParser
from string import Template
import sys, codecs, os

default_encoding = 'Shift_JIS'
template = u'''\
<html>
<meta http-equiv="content-type" content="text/html; charset=$encoding">
<body>
$text
<p>${page_no}ページ</p>
</body>
</html>
'''

def split_lines(file, lines, encoding):
    text = ''
    no = 0
    try:
        for line in codecs.open(file, 'r', encoding):
            line = line.rstrip(u'\r\n')
            text = ''.join((text, line))
            no += 1
            if lines <= no:
                yield(text)
                no = 0
                text = ''
        if text: yield(text)
    except UnicodeDecodeError:
        sys.exit('input encoding error')
    except LookupError:
        sys.exit('unknown input encoding')

def split_chars(file, chars, encoding):
    text = ''
    try:
        for line in codecs.open(file, 'r', encoding):
            line = line.rstrip(u'\r\n')
            text += line
            while len(text) >= chars:
                yield(text[:chars])
                text = text[chars:]
        if text: yield(text)
    except UnicodeDecodeError:
        sys.exit('input encoding error')
    except LookupError:
        sys.exit('unknown input encoding')

def output_html(text, page, encoding):
    s = Template(template)
    html = s.safe_substitute(encoding = encoding, text = text, page_no=page)
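    # Open inside a with block so the file is closed even if write() fails;
    # the original try/finally raised NameError when open() itself failed.
    with codecs.open(''.join((str(page),'.html')), 'w', encoding) as f:
        f.write(html)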

def main():
    parser = OptionParser(usage = '%prog [option] file')
    parser.add_option('-l', type = 'int', dest = 'lines', default = 25)
    parser.add_option('-c', type = 'int', dest = 'chars')
    parser.add_option('-i', '--input-encoding', dest = 'input_encoding', default = default_encoding) 
    parser.add_option('-o', '--output-encoding', dest = 'output_encoding')

    (options, args) = parser.parse_args()
    if len(args) == 0:
        parser.error('No file specified')
    elif not os.path.isfile(args[0]):
        parser.error('No such file')
    if options.chars:
        get_text = split_chars(args[0], options.chars, options.input_encoding)
    else:
        get_text = split_lines(args[0], options.lines, options.input_encoding)
    output_encoding = options.output_encoding
    if not output_encoding:
        output_encoding = options.input_encoding
    for page, text in enumerate(get_text):
        output_html(text, page+1, output_encoding)
    
if __name__ == '__main__':
    main()

Usage

python hoge.py sample.txt
python hoge.py -l 25 sample.txt
python hoge.py -c 400 sample.txt

TransFreeBSD · Answer 3 · 2007-11-07T12:58:32+09:00

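Since there was no Perl answer yet.

Usage

perl hoge.pl -c 3000 sample1.txt sample2.txt
perl hoge.pl -l 20 < sample.txt

Option c takes a byte count and option l takes a line count. All remaining arguments are treated as input files, which are concatenated before the content is split. If no input file is given, standard input is used.

It is somewhat lavish with memory, since it reads all the input at once. Character encodings are not considered. With -c the limit is in bytes, not characters, and the text is cut at the last line end that does not exceed the given number of bytes. If a single line exceeds the limit, that line is skipped.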
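
The byte-limit rule described above (cut at line ends, never exceed the limit, skip a line that cannot fit) might look like this in Python. It is a simplified sketch of the description, not a port of the Perl code: `split_by_bytes` is an invented name, and here the newline is counted as part of each line's length.

```python
def split_by_bytes(data, limit):
    """Split bytes into chunks of at most `limit` bytes, cutting only at line ends."""
    chunks = []
    current = b""
    for line in data.splitlines(keepends=True):
        if len(line) > limit:
            continue  # a line that can never fit is dropped
        if len(current) + len(line) > limit:
            chunks.append(current)  # the next line would overflow: close this chunk
            current = b""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Because it works on bytes, a multibyte character is never cut in half as long as the input has line breaks often enough, which is the same trade-off the answer makes.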

lunlumo · Answer 4 · 2007-11-08T01:40:51+09:00

　TransFreeBSD's example is nicely Perl-like, so I will deliberately offer code that is not very Perl-like.

#! /usr/bin/perl

package	Object;

use	strict;
use	Class::Accessor;
use	base('Class::Accessor');

sub new {
	my	($pkg) = @_;
	my	$self;
	$self = bless({},$pkg);
	$self;
}

package	Configuration;

use	strict;
use	Getopt::Std;
use	base('Object');

__PACKAGE__->mk_accessors(qw(inCode outCode outFile type length inFile template));

sub new {
	my	($pkg) = @_;
	my	$self;
	$self = $pkg->SUPER::new();
	$self->inCode('shiftjis');
	$self->outCode('shiftjis');
	$self->outFile('output');
	$self->type('line');
	$self->length(30);
	$self;
}

sub initialize {
	my	($self) = @_;
	my	$opts = {};
	getopts('i:o:f:t:l:',$opts);
	$self->inCode($opts->{'i'}) if (defined($opts->{'i'}));
	$self->outCode($opts->{'o'}) if (defined($opts->{'o'}));
	$self->outFile($opts->{'f'}) if (defined($opts->{'f'}));
	$self->type($opts->{'t'}) if (defined($opts->{'t'}));
	$self->length($opts->{'l'}) if (defined($opts->{'l'}));
	if (scalar(@ARGV) == 2) {
		$self->template($ARGV[0]);
		$self->inFile($ARGV[1]);
	} else {
		print "usage: $0 [-i IN_CHARSET] [-o OUT_CHARSET] [-f OUT_FILE_PREFIX] [-t SPLIT_TYPE] [-l SPLIT_LENGTH] TEMPLATE_FILE INPUT_FILE\r\n";
		print "\tIN_CHARSET:\t(shiftjis|eucjp|utf8)\r\n";
		print "\tOUT_CHARSET:\t(shiftjis|eucjp|utf8)\r\n";
		print "\tSPLIT_TYPE:\t(line|byte)\r\n";
		exit;
	}
	$self;
}

package	SplitterFactory;

use	strict;

sub getInstance {
	my	($pkg,$configuration) = @_;
	my	$instance;
	if ($configuration->type() eq 'byte') {
		$instance = new ByteSplitter();
	} else {
		$instance = new LineSplitter();
	}
	$instance->configuration($configuration);
	$instance;
}

package	Splitter;

use	strict;
use	Encode;
use	base('Object');

__PACKAGE__->mk_accessors(qw(configuration content));

sub load {
	my	($self) = @_;
	my	$configuration = $self->configuration();
	my	$content = '';
	my	$in;
	open($in,"<:encoding(".$configuration->inCode().")",$configuration->inFile()) || die "";
	$content .= <$in> while (!eof($in));
	close($in);
	$content =~ s/(\r\n|\r|\n)/\r\n/g;
	$self->content($content);
	$self;
}

sub split {
	die "";
}

package	LineSplitter;

use	strict;
use	base('Splitter');

sub split {
	my	($self) = @_;
	my	@contents = split(/(?:\r\n|\r|\n)/,$self->content());
	my	$line = $self->configuration()->length();
	my	@splitted = ();
	while (scalar(@contents)>0) {
		my	@temp = splice(@contents,0,$line);
		push(@splitted,join("\r\n",@temp));
	}
	@splitted;
}

package	ByteSplitter;

use	strict;
use	utf8;
use	Lingua::JA::Fold qw(fold length_half);
use	base('Splitter');

sub cutter {
	my	($pkg,$length,$string) = @_;
	my	$chars;
	my	$shortage;
	if ($length >= length_half($string)) {
		$chars = length($string);
	} else {
		$chars = int($length / 2);
		$shortage = $length - length_half(substr($string,0,$chars));
		while ($shortage != 0) {
			if ($shortage > 0) {
				$shortage = $length - length_half(substr($string,0,++$chars));
			} else {
				$chars--;
				$shortage = 0;
			}
		}
	}
	(substr($string,0,$chars),substr($string,$chars));
}

sub split {
	my	($self) = @_;
	my	$content = $self->content();
	my	$bytes = $self->configuration()->length();
	my	$temp;
	my	@splitted;
	while ($content !~ m/^(\r\n|\r|\n)?$/) {
		($temp,$content) = __PACKAGE__->cutter($bytes, $content);
		push(@splitted,$temp);
	}
	@splitted;
}

package	main;

use	strict;
use	HTML::Template;

eval {
	my	$configuration = new Configuration()->initialize();
	my	$splitter = SplitterFactory->getInstance($configuration);
	my	$template;
	my	$outFile = $configuration->outFile();
	my	$inCode = $configuration->inCode();
	my	$outCode = $configuration->outCode();
	my	$t;
	my	@contents;
	my	$no = 1;
	open ($t,'<:encoding('.$inCode.')',$configuration->template()) || die "";
	$template = new HTML::Template(
			'filehandle'	=>$t,
			'filter'		=> sub {
					my	($text) = @_;
					$$text =~ s/(\r\n|\r|\n)/\r\n/g;
				}
		);
	close($t);
	@contents = $splitter->load()->split();
	foreach (@contents) {
		my	$file;
		my	@lines = map { {'content'=>$_}; } split(/(?:\r\n|\r|\n)/,$_);
		my	$param = {'output'=>$outFile,'contents'=>\@lines,'no'=>$no};
		if ($no != 1) {
			$param->{'prev'} = 1;
			$param->{'prev_no'} = $no - 1;
		}
		if ($no != scalar(@contents)) {
			$param->{'next'} = 1;
			$param->{'next_no'} = $no + 1;
		}
		$template->clear_params();
		$template->param($param);
		open ($file,'>:encoding('.$outCode.')',"${outFile}.${no}.html") || die "";
		print $file $template->output();
		close($file);
		$no++;
	}
};
die $@ if ($@);

1;

　The following modules are required to run it.

http://search.cpan.org/~kasei/Class-Accessor-0.31/lib/Class/Acce...

http://search.cpan.org/dist/HTML-Template/

http://search.cpan.org/~hata/Lingua-JA-Fold-0.07/Fold.pm

　The part that fills in the template is probably not very important, but I assume a template like the following.

<html>
<head>
	<title>page<tmpl_var escape="html" name="no"></title>
</head>
<body>
<tmpl_loop name="contents"><tmpl_var escape="html" name="content"><br />
</tmpl_loop>
<tmpl_if name="prev"><a href="./<tmpl_var escape="html" name="output">.<tmpl_var escape="html" name="prev_no">.html">前へ</a></tmpl_if>
<tmpl_if name="next"><a href="./<tmpl_var escape="html" name="output">.<tmpl_var escape="html" name="next_no">.html">次へ</a></tmpl_if>
</body>
</html>
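
As a side note on ByteSplitter: as far as I can tell it measures text in half-width units via length_half, with a full-width character counting as 2, rather than in raw bytes. A rough Python equivalent of the cutter step, using the standard `unicodedata` module, is sketched below. The names `half_width_len` and `cut` are invented here, and the width rule (East Asian Width W or F counts as 2) is my assumption about what length_half does.

```python
import unicodedata


def half_width_len(text):
    """Count full-width (W/F) characters as 2 and everything else as 1."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


def cut(text, limit):
    """Return (head, rest), where head is the longest prefix of width <= limit."""
    width = 0
    for i, ch in enumerate(text):
        width += half_width_len(ch)
        if width > limit:
            return text[:i], text[i:]
    return text, ""
```

Note that `cut` returns an empty head when even the first character is wider than the limit, so a caller that loops until the text is used up should guard against that case.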
