Refining include processing

2020/OCT/10

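Change history

Date	Change
2020/SEP/09	Created
2020/SEP/09	Added implementation
2020/SEP/10	Added double-include prevention
2020/SEP/11	Added brush up
2020/SEP/13	Added tool; added version using kon_ut; added kon_ut-style description
2020/SEP/15	Added simple replacement feature
2020/SEP/16	Added search directories; added suppression of recursive expansion; changed the default regular expression
2020/SEP/24	Added fake files (buffers, really)
2020/SEP/26	Added automatic buffer name generation
2020/OCT/09	Added indentation; added no-error option
2020/OCT/10	Added line continuation; added option to allow double includes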

How include-line expansion works
- Example 1
- Example 2
Draft
Implementation
Tool
kon_ut-style description
- new( lst=[] )

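I wanted something like the include-line expansion done by the C preprocessor, so I wrote a simple tool.

Using the C preprocessor as-is would have settled it, but this is a record of working out the processing in Python, trying one thing and another.

How include-line expansion works

It may be old news, but here is how expanding an include line works.

When a line directing that a file be included appears in the text data, that line is replaced with the contents of the specified file.

Example 1

For example, the text of an email.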

head1.txt

kon pageの近藤です。
いつもお世話になっております。

head2.txt

kon pageの近藤でございます。
お日柄もよろしく、貴殿様におかれましてはご機嫌麗しゅう。

With these two files prepared,

mail1.txt

#include "head1.txt"

お電話頂いた件につきまして、
ご確認のほどよろしくお願い致します。

expands to

kon pageの近藤です。
いつもお世話になっております。

お電話頂いた件につきまして、
ご確認のほどよろしくお願い致します。

Likewise,

mail2.txt

#include "head2.txt"

お電話頂いた件につきまして、
ご確認のほどよろしくお願い致します。

expands to

kon pageの近藤でございます。
お日柄もよろしく、貴殿様におかれましてはご機嫌麗しゅう。

お電話頂いた件につきまして、
ご確認のほどよろしくお願い致します。

Example 2

For example, stdio.h, the header file familiar from hello world in C.

#include <stdio.h>

This line is replaced with the contents of stdio.h.

stdio.h itself includes other files.

Grepping it shows:

$ grep include /usr/include/stdio.h | head
 *	This product includes software developed by the University of
#include <sys/cdefs.h>
#include <Availability.h>
#include <_types.h>
/* DO NOT REMOVE THIS COMMENT: fixincludes needs to see:
 * __gnuc_va_list and include <stdarg.h> */
#include <sys/_types/_va_list.h>
#include <sys/_types/_size_t.h>
#include <sys/_types/_null.h>
#include <sys/stdio.h>

The includes are nested.

So replacing a line with file contents has to be done recursively.

Let's actually try the preprocessor expansion.

$ gcc -E -dD /usr/include/stdio.h | head
# 1 "/usr/include/stdio.h"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
#define __llvm__ 1
#define __clang__ 1
#define __clang_major__ 8
#define __clang_minor__ 0
#define __clang_patchlevel__ 0
#define __clang_version__ "8.0.0 (clang-800.0.38)"
#define __GNUC_MINOR__ 2

$ wc -l /usr/include/stdio.h
     495 /usr/include/stdio.h

$ gcc -E -dD /usr/include/stdio.h | wc -l
    2581

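The 495-line stdio.h expands to 2581 lines.

Draft

First, let's think loosely about how to implement include processing.

The text data to process is read from standard input, and the expanded result is written to standard output.

Get one line from standard input
Check whether it is an include directive line
- If it is, read the specified file, expand it, and write it to standard output
- If it is not, write the line to standard output as-is
Repeat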

while True:
	s = sys.stdin.readline()
	if not s:
		break
	if is_inc( s ):
		s = inc_exp( s )
	sys.stdout.write( s )

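Super easy.

Well, no, hold on.

is_inc( s ) is one thing, but inc_exp( s ) is the hard part.

The problem is inc_exp( s )

Let's think loosely about what inc_exp( s ) has to do.

Get the file path from the include directive line s
Read the specified file
Split the text read into lines
Repeat the following for each line
- Check whether it is an include directive line
  - If it is, read the specified file, expand it, and write it to standard output
  - If it is not, write the line to standard output as-is

We are back at the same processing as the main loop. It is recursion.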

def inc_exp(s):
	path = get_path( s )
	s = get_text( path )
	lines = s.split( '\n' )
	for s in lines:
		if is_inc( s ):
			s = inc_exp( s )
		sys.stdout.write( s )

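Hmm, no.

Lines returned by sys.stdin.readline() in the main loop keep their trailing newline.

Lines produced by split() do not.

Should sys.stdout.write( s ) be print( s ) instead?

Having used split(), it would be nicer to put things back together with join().

inc_exp( s ): version returning a list of lines

Let's have inc_exp( s ) return the expanded result as a list of lines.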

def inc_exp(s):
	path = get_path( s )
	s = get_text( path )
	lines = s.split( '\n' )

	ret = []
	for s in lines:
		if is_inc( s ):
			ret.extend( inc_exp( s ) )
		else:
			ret.append( s )
	return ret

With the recursion written like that,

the main loop becomes

s = sys.stdin.read()
lines = s.split( '\n' )

ret = []
for s in lines:
	if is_inc( s ):
		ret.extend( inc_exp( s ) )
	else:
		ret.append( s )
s = '\n'.join( ret )
sys.stdout.write( s )

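No, no.

The main loop and inc_exp() repeat far too much of the same code.

So let the argument s of inc_exp( s ) be multi-line text data containing newlines.

Expansion of a single line goes into an inner function, inc_exp_line( s ).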

def inc_exp(s):

	def inc_exp_line(s):
		path = get_path( s )
		s = get_text( path )
		return inc_exp( s )

	lines = s.split( '\n' )

	ret = []
	for s in lines:
		if is_inc( s ):
			ret.extend( inc_exp_line( s ) )
		else:
			ret.append( s )
	return ret

The main loop becomes

s = sys.stdin.read()
lines = inc_exp( s )
s = '\n'.join( lines )
sys.stdout.write( s )

Ah.

In that case, inc_exp() does not need to return a list of lines either.

inc_exp( s ): version returning the expanded text

def inc_exp(s):

	def inc_exp_line(s):
		path = get_path( s )
		s = get_text( path )
		return inc_exp( s )

	lines = s.split( '\n' )

	ret = []
	for s in lines:
		if is_inc( s ):
			ret.append( inc_exp_line( s ) )
		else:
			ret.append( s )
	return '\n'.join( ret )

And the main loop becomes

s = sys.stdin.read()
s = inc_exp( s )
sys.stdout.write( s )

As a one-liner:

sys.stdout.write( inc_exp( sys.stdin.read() ) )

inc_exp( s ): one more step

The two append() calls in the second half of inc_exp( s ) are the same, so:

def inc_exp(s):

	def inc_exp_line(s):
		path = get_path( s )
		s = get_text( path )
		return inc_exp( s )

	func = lambda s: inc_exp_line( s ) if is_inc( s ) else s
	return '\n'.join( map( func, s.split( '\n' ) ) )

Everything about handling a single line belongs in inc_exp_line(), so one more step.

def inc_exp(s):

	def inc_exp_line(s):
		if not is_inc( s ):
			return s
		path = get_path( s )
		s = get_text( path )
		return inc_exp( s )

	return '\n'.join( map( inc_exp_line, s.split( '\n' ) ) )

Much simpler than where we started.
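To try this final form without touching the file system, here is a minimal runnable sketch. The FILES dictionary and the '@' directive syntax are assumptions made only for illustration; the draft itself leaves is_inc(), get_path() and get_text() undefined at this point.

```python
# Hypothetical in-memory "files" standing in for real files on disk.
FILES = {
    "head1.txt": "hello from head1",
    "outer.txt": "start\n@head1.txt\nend",
}

def is_inc(s):
    # Assumed directive syntax: a line that starts with '@'.
    return s.startswith("@")

def get_path(s):
    # Everything after the leading '@' is treated as the path.
    return s[1:]

def get_text(path):
    # Look the "file" up in the dictionary instead of opening it.
    return FILES[path]

def inc_exp(s):
    def inc_exp_line(line):
        if not is_inc(line):
            return line
        # Recurse so that includes inside included text are expanded too.
        return inc_exp(get_text(get_path(line)))

    return "\n".join(map(inc_exp_line, s.split("\n")))

result = inc_exp("top\n@outer.txt\nbottom")
print(result)
```

Note that the sample texts deliberately have no trailing newline, so the sketch sidesteps the question of how newlines at the end of an included file should be handled.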

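Implementation

The placeholder functions used in the draft were as follows.

is_inc( s )	Checks whether s is an include directive line
get_path( s )	Gets the file path from the include directive line s
get_text( path )	Returns the text contents of the file at the path obtained by get_path( s )

All three depend on the directive line syntax, so let's handle them together in a single function.

inc_text( s )

Returns None if s is not an include directive line
Returns the text contents of the file if s is an include directive line
Returns False if s is an include directive line but reading the file failed
(file not found, etc.)

If the caller supplies this inc_text(), the include processing itself stays unaffected by the directive line syntax.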

def inc_exp(s, inc_text):

	def inc_exp_line(s):
		r = inc_text( s )
		return s if r in ( None, False ) else inc_exp( r, inc_text )

	return '\n'.join( map( inc_exp_line, s.split( '\n' ) ) )

The main loop becomes

s = sys.stdin.read()
s = inc_exp( s, inc_text )
sys.stdout.write( s )

inc_text( s )

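This is the implementation of the part that depends on the include line syntax.

For now, let's try a simple syntax.

An include line must start with '@' at the beginning of the line
Immediately after the @ comes the absolute path of the file to include, or a path relative to the current directory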

hello
@head1.txt
world
@/tmp/head2.txt

It is used like this.

The syntax is simple, so the implementation is very easy too.

import os

def inc_text(s):
	if not s.startswith( '@' ):
		return None
	path = s[ 1: ]

	r = False
	if not os.path.exists( path ):
		return r

	with open( path, 'rb' ) as f:
		try:
			r = f.read().decode()
		except:
			pass
	return r

Implementation 1

Let's put it together as a first-cut inc1.py and try it.

Text data is limited to the ASCII range for now.

inc1.py

smp1.txt

hello
@foo.txt
world

foo.txt

foo
FOO

$ cat smp1.txt | ./inc1.py
hello
foo
FOO

world

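Oops, there is an unintended blank line between FOO and world.

"@foo.txt + newline" becomes "@foo.txt" after split().
A plain join() would turn it back into "@foo.txt + newline", but...
"@foo.txt" is expanded to "foo + newline + FOO + newline",
so after join() it becomes "foo + newline + FOO + newline + newline".

Naively, we want "@foo.txt + newline" to expand to "foo + newline + FOO + newline".

It is not an elegant fix, but...

If the file being read ends with a newline, that newline is removed in anticipation of the one join() will add.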
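The newline arithmetic above can be checked in a few lines. This is only an illustration with the file contents hard-coded as strings; it is not part of inc1.py or inc2.py.

```python
text = "hello\n@foo.txt\nworld\n"   # contents of smp1.txt
foo = "foo\nFOO\n"                  # contents of foo.txt, ending with a newline

lines = text.split("\n")            # ['hello', '@foo.txt', 'world', '']

# Substituting the file contents as-is leaves a doubled newline after FOO.
naive = "\n".join(foo if s == "@foo.txt" else s for s in lines)

# Dropping the file's trailing newline lets join() supply exactly one.
fixed = "\n".join(foo[:-1] if s == "@foo.txt" else s for s in lines)

print(repr(naive))
print(repr(fixed))
```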

inc2.py

Diff

--- inc1.py     2020-09-09 22:22:57.000000000 +0900
+++ inc2.py     2020-09-09 22:22:59.000000000 +0900
@@ -15,6 +15,8 @@
        with open( path, 'rb' ) as f:
                try:
                        r = f.read().decode()
+                       if r[ -1 ] == '\n':
+                               r = r[ : -1 ]
                except:
                        pass
        return r

$ cat smp1.txt | ./inc2.py
hello
foo
FOO
world

OK.

$ cat smp1.txt | python2 inc2.py
hello
foo
FOO
world

$ cat smp1.txt | python3 inc2.py
hello
foo
FOO
world

Both python2 and python3 are OK.

Japanese text in UTF-8

Let's try it.

Using the earlier

head1.txt

kon pageの近藤です。
いつもお世話になっております。

$ nkf -g head1.txt
UTF-8

smp2.txt

hello
@head1.txt
world

The file being expanded, smp2.txt, contains ASCII only.

$ cat smp2.txt | python2 inc2.py
hello
@head1.txt
world

Ugh. With python2 the include is not expanded.

$ cat smp2.txt | python3 inc2.py
Traceback (most recent call last):
  File "inc2.py", line 35, in <module>
    sys.stdout.write( s )
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-19: ordinal not in range(128)

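And python3 raises an error.

Right.

As covered in the article on Japanese string handling in Python 2 and Python 3 (Python2とPython3での日本語文字列対応について), this gets complicated.

First, in python2,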

r = f.read().decode()

the decode() call fails.

For now, use decode( 'utf-8' ).

For the final output, changing it to

sys.stdout.write( s.encode( 'utf-8' ) )

works with python2.

$ cat smp2.txt | python2 inc3.py
hello
kon pageの近藤です。
いつもお世話になっております。
world

It does not work with python3.

$ cat smp2.txt | python3 inc3.py
Traceback (most recent call last):
  File "inc3.py", line 35, in <module>
    sys.stdout.write( s.encode( 'utf-8' ) )
TypeError: write() argument must be str, not bytes

Bringing in nkf.py from "Python utility programs, winter 2020" (pythonのユーティリティ・プログラム 2020冬) might be overkill.

import six

Let's see whether this is enough to get through.

inc3.py

Diff

--- inc2.py     2020-09-09 22:22:59.000000000 +0900
+++ inc3.py     2020-09-09 22:23:52.000000000 +0900
@@ -2,6 +2,7 @@

 import sys
 import os
+import six

 def inc_text(s):
        if not s.startswith( '@' ):
@@ -14,7 +15,7 @@

        with open( path, 'rb' ) as f:
                try:
-                       r = f.read().decode()
+                       r = f.read().decode( 'utf-8' )
                        if r[ -1 ] == '\n':
                                r = r[ : -1 ]
                except:
@@ -32,5 +33,7 @@
 if __name__ == "__main__":
        s = sys.stdin.read()
        s = inc_exp( s, inc_text )
-       sys.stdout.write( s )
+
+       f = sys.stdout if six.PY2 else sys.stdout.buffer
+       f.write( s.encode( 'utf-8' ) )
 # EOF

$ cat smp2.txt | python2 inc3.py
hello
kon pageの近藤です。
いつもお世話になっております。
world

$ cat smp2.txt | python3 inc3.py
hello
kon pageの近藤です。
いつもお世話になっております。
world

OK.

Next, make the file being expanded UTF-8 text as well, and handle reading from standard input accordingly.

inc4.py

Diff

--- inc3.py     2020-09-09 22:23:52.000000000 +0900
+++ inc4.py     2020-09-09 22:24:27.000000000 +0900
@@ -31,7 +31,9 @@
        return '\n'.join( map( inc_exp_line, s.split( '\n' ) ) )

 if __name__ == "__main__":
-       s = sys.stdin.read()
+       f = sys.stdin if six.PY2 else sys.stdin.buffer
+       s = f.read().decode( 'utf-8' )
+
        s = inc_exp( s, inc_text )

        f = sys.stdout if six.PY2 else sys.stdout.buffer

smp3.txt

関係各位
@head1.txt
以上

$ nkf -g smp3.txt
UTF-8

$ cat smp3.txt | python2 inc4.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

$ cat smp3.txt | python3 inc4.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

Both python2 and python3 are OK.

We got through it somehow.
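If python2 support is not needed, the same byte-level handling can be written without six. The following is a minimal Python 3 sketch under that assumption; expand_stream() is a hypothetical helper, and its expand argument stands in for inc_exp().

```python
import io

def expand_stream(src, dst, expand):
    """Read UTF-8 bytes from src, expand them as text, write UTF-8 bytes to dst."""
    text = src.read().decode("utf-8")
    dst.write(expand(text).encode("utf-8"))

# Demonstration with in-memory byte streams instead of stdin/stdout.
src = io.BytesIO("関係各位\n以上\n".encode("utf-8"))
dst = io.BytesIO()
expand_stream(src, dst, lambda s: s.replace("以上", "以上です"))
out = dst.getvalue().decode("utf-8")
print(out, end="")
```

In a script the call would be expand_stream( sys.stdin.buffer, sys.stdout.buffer, ... ), which avoids depending on the locale's default encoding.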

Error cases

$ echo @foo.txt | ./inc4.py
foo
FOO

In contrast, when the file is not found, for example,

$ echo @bar.txt | ./inc4.py
@bar.txt

The include directive line is left in the output as-is.

Some kind of error action would be nice.

inc5.py

Diff

--- inc4.py     2020-09-09 22:24:27.000000000 +0900
+++ inc5.py     2020-09-09 22:25:00.000000000 +0900
@@ -22,11 +22,17 @@
                        pass
        return r

+def err(s):
+       sys.stderr.write( 'err ' + s + '\n' )
+       sys.exit( 1 )
+
 def inc_exp(s, inc_text):

        def inc_exp_line(s):
                r = inc_text( s )
-               return s if r in ( None, False ) else inc_exp( r, inc_text )
+               if r == False:
+                       err( s )
+               return s if r == None else inc_exp( r, inc_text )

        return '\n'.join( map( inc_exp_line, s.split( '\n' ) ) )

$ echo @bar.txt | ./inc5.py
err @bar.txt

Now we can at least tell that an error occurred.

It would be nice to also report the file being processed and the line number.

inc6.py

Diff

--- inc5.py     2020-09-09 22:25:00.000000000 +0900
+++ inc6.py     2020-09-09 22:25:28.000000000 +0900
@@ -18,23 +18,25 @@
                        r = f.read().decode( 'utf-8' )
                        if r[ -1 ] == '\n':
                                r = r[ : -1 ]
+                       return ( r, path ) #  !
                except:
                        pass
        return r

-def err(s):
-       sys.stderr.write( 'err ' + s + '\n' )
+def err(s, path, i):
+       sys.stderr.write( 'err {}, {} L{}\n'.format( s, path, i+1 ) )
        sys.exit( 1 )

-def inc_exp(s, inc_text):
+def inc_exp(s, inc_text, path='-'):

-       def inc_exp_line(s):
+       def inc_exp_line(i_s):
+               (i, s) = i_s
                r = inc_text( s )
                if r == False:
-                       err( s )
-               return s if r == None else inc_exp( r, inc_text )
+                       err( s, path, i )
+               return s if r == None else inc_exp( r[ 0 ], inc_text, r[ 1 ] )

-       return '\n'.join( map( inc_exp_line, s.split( '\n' ) ) )
+       return '\n'.join( map( inc_exp_line, enumerate( s.split( '\n' ) ) ) )

 if __name__ == "__main__":
        f = sys.stdin if six.PY2 else sys.stdin.buffer

$ echo @bar.txt | ./inc6.py
err @bar.txt, - L1

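An error at @bar.txt on line 1 of '-' (standard input). OK.

Oh, we have not checked nesting up to this point.

Also, the return value of inc_text() has changed slightly.

inc_text( s )	Before (inc5.py)	Returns None if s is not an include directive line. Returns the text contents of the file if s is an include directive line. Returns False if s is an include directive line but reading the file failed (file not found, etc.).
inc_text( s )	After (inc6.py)	Returns None if s is not an include directive line. Returns a tuple of the file's text contents and the file path string if s is an include directive line. Returns False if s is an include directive line but reading the file failed (file not found, etc.).

Checking nesting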

nest.txt

nest test

--> smp1.txt
@smp1.txt
<--

--> foo.txt
@foo.txt
<-- foo.txt

$ cat nest.txt | ./inc6.py
nest test

--> smp1.txt
hello
foo
FOO
world
<--

--> foo.txt
foo
FOO
<-- foo.txt

OK.

ng.txt

ng test start

@bar.txt

ng test end

$ cat ng.txt | ./inc6.py
err @bar.txt, - L3

$ echo @ng.txt | ./inc6.py
err @bar.txt, ng.txt L3

It reports an error at @bar.txt on line 3 of ng.txt.

OK.
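As an alternative to calling sys.exit() deep inside the recursion, the same file name and line number can be carried by an exception and handled by the caller. This is only a sketch under assumptions: IncludeError and the in-memory FILES dictionary are hypothetical and not part of inc6.py.

```python
# Hypothetical in-memory files; bar.txt is deliberately missing.
FILES = {"ng.txt": "ng test start\n\n@bar.txt\n\nng test end"}

class IncludeError(Exception):
    """Raised when an include directive names a file that cannot be read."""

def inc_exp(s, path="-"):
    out = []
    for i, line in enumerate(s.split("\n")):
        if not line.startswith("@"):
            out.append(line)
            continue
        name = line[1:]
        if name not in FILES:
            # Same message format as inc6.py: directive, current file, 1-based line.
            raise IncludeError("err {}, {} L{}".format(line, path, i + 1))
        # Pass the included file's name down so nested errors report it.
        out.append(inc_exp(FILES[name], name))
    return "\n".join(out)

try:
    inc_exp("@ng.txt")
    message = None
except IncludeError as e:
    message = str(e)

print(message)
```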

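Preventing double includes

If a file includes itself, or two files have include directive lines that refer to each other, expansion never ends and eventually exhausts resources.

Since we are at it, let's actually try.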

$ cat loop.txt
@loop.txt

$ cat loop.txt | ./inc6.py
Traceback (most recent call last):
  File "./inc6.py", line 45, in <module>
    s = inc_exp( s, inc_text )
  File "./inc6.py", line 39, in inc_exp
    return '\n'.join( map( inc_exp_line, enumerate( s.split( '\n' ) ) ) )
  File "./inc6.py", line 37, in inc_exp_line
    return s if r == None else inc_exp( r[ 0 ], inc_text, r[ 1 ] )
  File "./inc6.py", line 39, in inc_exp
    :
  File "./inc6.py", line 23, in inc_text
    pass
RecursionError: maximum recursion depth exceeded while calling a Python object

The mutual case as well:

$ cat loop_a.txt
@loop_b.txt

$ cat loop_b.txt
@loop_a.txt

$ cat loop_a.txt | ./inc6.py
Traceback (most recent call last):
  File "./inc6.py", line 45, in <module>
    s = inc_exp( s, inc_text )
  File "./inc6.py", line 39, in inc_exp
    return '\n'.join( map( inc_exp_line, enumerate( s.split( '\n' ) ) ) )
    :
    r = inc_text( s )
  File "./inc6.py", line 23, in inc_text
    pass
RecursionError: maximum recursion depth exceeded while calling a Python object

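Let's keep a history of paths and take action when a double include is detected.

As the action, rather than stopping with an error, including a file that is already being expanded is just quietly skipped.

Somehow that feels like the better choice.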

inc7.py

Diff

--- inc6.py     2020-09-09 22:25:28.000000000 +0900
+++ inc7.py     2020-09-10 12:43:31.000000000 +0900
@@ -23,18 +23,28 @@
                        pass
        return r

-def err(s, path, i):
-       sys.stderr.write( 'err {}, {} L{}\n'.format( s, path, i+1 ) )
+paths = [ '-' ]
+
+def err(s, i):
+       sys.stderr.write( 'err {}, {} L{}\n'.format( s, paths[ -1 ], i+1 ) )
        sys.exit( 1 )

-def inc_exp(s, inc_text, path='-'):
+def inc_exp(s, inc_text):

        def inc_exp_line(i_s):
                (i, s) = i_s
                r = inc_text( s )
                if r == False:
-                       err( s, path, i )
-               return s if r == None else inc_exp( r[ 0 ], inc_text, r[ 1 ] )
+                       err( s, i )
+               if r == None:
+                       return s
+               (s, path) = r
+               if path in paths:
+                       return ''  # !
+               paths.append( path )
+               s = inc_exp( s, inc_text )
+               paths.pop()
+               return s

        return '\n'.join( map( inc_exp_line, enumerate( s.split( '\n' ) ) ) )

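The path argument is dropped, and the chain of files currently being expanded is kept as a list in the global variable paths.

For a small source file, handling it with a global variable like this seems acceptable.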
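The global list can also be avoided by passing the chain of files down the recursion. Here is a minimal sketch of that idea, assuming in-memory files (FILES is hypothetical) and the '@' syntax; it is not the inc7.py code itself.

```python
# Hypothetical in-memory files that include each other.
FILES = {
    "loop_a.txt": "a start\n@loop_b.txt\na end",
    "loop_b.txt": "b start\n@loop_a.txt\nb end",
}

def inc_exp(s, stack=("-",)):
    def inc_exp_line(line):
        if not line.startswith("@"):
            return line
        path = line[1:]
        if path in stack:
            # Already being expanded further up: quietly skip it.
            return ""
        # A new tuple per call, so nothing needs to be popped afterwards.
        return inc_exp(FILES[path], stack + (path,))

    return "\n".join(map(inc_exp_line, s.split("\n")))

result = inc_exp("@loop_a.txt")
print(result)
```

As in inc7.py, a skipped directive becomes an empty string, so an empty line remains where the directive was.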

$ cat loop.txt | ./inc7.py

$ cat loop_a.txt | ./inc7.py

In these cases the input had nothing but include directive lines, so the expanded result is empty as well.

Is it really working?

Let's try with more complicated data.

loop_c.txt

loop_c start
@loop_d.txt
loop_c ck1
@loop_c.txt
loop_c end

loop_d.txt

loop_d start
@loop_e.txt
loop_d ck1
@loop_d.txt
loop_d ck2
@loop_c.txt
loop_d end

loop_e.txt

loop_e start
@loop_e.txt
loop_e ck1
@loop_d.txt
loop_e ck2
@loop_c.txt
loop_e end

$ cat loop_c.txt | ./inc7.py

Result

loop_c start   ;;; stdin
loop_d start
loop_e start
               ;;; here: @loop_e.txt in loop_e , OK
loop_e ck1
               ;;; here: @loop_d.txt in loop_e , OK
loop_e ck2
loop_c start   ;;; the first loop_c.txt came from standard input, so in the path history this is the first loop_c
               ;;; here: @loop_d.txt in loop_c , OK
loop_c ck1
               ;;; here: @loop_c.txt in loop_c , OK
loop_c end
               ;;; here: @loop_c.txt in loop_e , OK
loop_e end
loop_d ck1
               ;;; here: @loop_d.txt in loop_d , OK
loop_d ck2
loop_c start
               ;;; here: @loop_d.txt in loop_c , OK
loop_c ck1
               ;;; here: @loop_c.txt in loop_c , OK
loop_c end
               ;;; here: @loop_c.txt in loop_d , OK
loop_d end
loop_c ck1     ;;; stdin
loop_c start
loop_d start
loop_e start
               ;;; here: @loop_e.txt in loop_e , OK
loop_e ck1
               ;;; here: @loop_d.txt in loop_e , OK
loop_e ck2
               ;;; here: @loop_c.txt in loop_e , OK
loop_e end
loop_d ck1
               ;;; here: @loop_d.txt in loop_d , OK
loop_d ck2
               ;;; here: @loop_c.txt in loop_d , OK
loop_d end
loop_c ck1
               ;;; here: @loop_c.txt in loop_c , OK
loop_c end
               ;;; here: @loop_c.txt in loop_c (stdin) , OK
loop_c end     ;;; stdin

OK. Ah, that was complicated.

Checking error handling

$ ls bar.txt
ls: bar.txt: No such file or directory

$ echo @bar.txt | ./inc7.py
err @bar.txt, - L1

$ cat ng.txt
ng test start

@bar.txt

ng test end

$ echo @ng.txt | ./inc7.py
err @bar.txt, ng.txt L3

Checking the normal case

$ cat smp3.txt
関係各位
@head1.txt
以上

$ cat head1.txt
kon pageの近藤です。
いつもお世話になっております。

$ cat smp3.txt | ./inc7.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

OK.

brush up

Let's trim it down further.

inc8.py

Diff

--- inc7.py     2020-09-10 12:43:31.000000000 +0900
+++ inc8.py     2020-09-12 02:24:06.000000000 +0900
@@ -4,46 +4,43 @@
 import os
 import six

-def inc_text(s):
+def inc_text(s, paths):
        if not s.startswith( '@' ):
                return None
        path = s[ 1: ]

        r = False
-       if not os.path.exists( path ):
-               return r
-
-       with open( path, 'rb' ) as f:
-               try:
-                       r = f.read().decode( 'utf-8' )
-                       if r[ -1 ] == '\n':
-                               r = r[ : -1 ]
-                       return ( r, path ) #  !
-               except:
-                       pass
+       if os.path.exists( path ):
+               with open( path, 'rb' ) as f:
+                       try:
+                               r = f.read().decode( 'utf-8' )
+                               paths.append( path )
+                       except:
+                               pass
        return r

 paths = [ '-' ]

+is_dbl_paths = lambda : paths[ -1 ] in paths[ : -1 ]
+
 def err(s, i):
        sys.stderr.write( 'err {}, {} L{}\n'.format( s, paths[ -1 ], i+1 ) )
        sys.exit( 1 )

+cut_tail_nl = lambda s: s[ : -1 ] if s and s[ -1 ] == '\n' else s
+
 def inc_exp(s, inc_text):

        def inc_exp_line(i_s):
                (i, s) = i_s
-               r = inc_text( s )
+               r = inc_text( s, paths )
                if r == False:
                        err( s, i )
-               if r == None:
-                       return s
-               (s, path) = r
-               if path in paths:
-                       return ''  # !
-               paths.append( path )
-               s = inc_exp( s, inc_text )
-               paths.pop()
+               if r != None:
+                       s = ''
+                       if not is_dbl_paths():
+                               s = inc_exp( cut_tail_nl( r ), inc_text )
+                       paths.pop()
                return s

        return '\n'.join( map( inc_exp_line, enumerate( s.split( '\n' ) ) ) )

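The inc_text() contract changed a little.

Before (inc6.py)	inc_text( s )	Returns None if s is not an include directive line. Returns a tuple of the file's text contents and the file path string if s is an include directive line. Returns False if s is an include directive line but reading the file failed (file not found, etc.).
After (inc8.py)	inc_text( s, paths )	Returns None if s is not an include directive line. Appends the file path string to the end of the list paths and returns the file's text contents if s is an include directive line. Returns False if s is an include directive line but reading the file failed (file not found, etc.).

Only inc_text(s, paths) depends on the directive line syntax, so I want to be able to move it out to an external file later.

That is why it does not use the global variable paths directly.

Checking the normal case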

$ cat smp3.txt | ./inc8.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

Checking error handling

$ echo @ng.txt | ./inc8.py
err @bar.txt, ng.txt L3

Checking double-include prevention

$ cat loop_c.txt | ./inc7.py > nest_res.txt

$ cat loop_c.txt | ./inc8.py | diff -u nest_res.txt -
$

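Tool

Let's shape this into a tool.

So far the code has been structured so that a different include directive syntax can be supported by swapping the inc_text(s, paths) function.

By preparing a few inc_text(s, paths) functions for directive syntaxes that are likely to be used often, I am aiming for a reasonably usable tool.

Better still if directive syntaxes could be added in fine detail from the command line, a configuration file, or environment variables.

Include directive line syntax

What we have tried so far is a super simple syntax.

An include line must start with '@' at the beginning of the line
Immediately after the @ comes the absolute path of the file to include, or a path relative to the current directory

Basically, one line specifying one file is an unshakable constraint.

For example, for the C preprocessor:

The line starts with '#'
Zero or more spaces are allowed, followed by include and one or more spaces
Then comes a file name, possibly containing part of a path, enclosed in quotes
The quotes are either double quotes or the '<' ... '>' form

Something like that, I suppose.

And the file name that possibly contains part of a path is searched for under the directories given as include paths.

Include paths are what you specify at compile time in the form "-I directory".

Expanding C header files is not the goal here, so something simpler would be plenty, but...

Determine whether the line is an include directive line
Extract the file path from the line
Check that the file at that path exists
Read the file contents from that path

These basics do not change.

Once the first two are taken care of, the last two work as before.

Determining whether a line is an include directive

Does the line, from its beginning, match a specified pattern?

In the end it seems to boil down to that.

Regular expressions, then.

That means the age-old grep and sed commands.

In python, that would be the re module.

I have hardly used it before, so let's take this opportunity to try it.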

$ python
>>> import re
>>> p = re.compile( r'^@' )
>>> p.match( 'foo' )
>>> p.match( '@foo' )
<_sre.SRE_Match object; span=(0, 1), match='@'>

>>> p2 = re.compile( r'^# *include +' )
>>> p2.match( 'foo' )
>>> p2.match( '#include <stdio.h>' )
<_sre.SRE_Match object; span=(0, 9), match='#include '>
>>> p2.match( '#  include  "stdio.h"' )
<_sre.SRE_Match object; span=(0, 12), match='#  include  '>
>>> p2.match( ' #include <stdio.h>' )
>>>

Hmm.

Extracting the file path from the line

With sed:

$ echo foo | sed -n -e 's/^@\(.*\)$/\1/p'
$

$ echo @foo.txt | sed -n -e 's/^@\(.*\)$/\1/p'
foo.txt

$ echo @/tmp/foo.txt | sed -n -e 's/^@\(.*\)$/\1/p'
/tmp/foo.txt
$

$ echo "#include <stdio.h>" | sed -n -e 's/^# *include  *<\(.*\)>$/\1/p'
stdio.h

$ echo '# include "stdio.h"' | sed -n -e 's/^# *include  *"\(.*\)"$/\1/p'
stdio.h

With the python re module:

$ python
>>> import re

>>> p = re.compile( r'^@(.+)$' )
>>> p.sub( r'\1', '@foo.txt' )
'foo.txt'
>>> p.sub( r'\1', '@/tmp/foo.txt' )
'/tmp/foo.txt'

>>> p2 = re.compile( r'^# *include +<(.*)>$' )
>>> p2.sub( r'\1', '#include <stdio.h>' )
'stdio.h'
>>> p2.sub( r'\1', '# include  <stdio.h>' )
'stdio.h'

>>> p3 = re.compile( r'^# *include +"(.*)"' )
>>> p3.sub( r'\1', '#include "stdio.h"' )
'stdio.h'
>>> p3.sub( r'\1', '#include "sys/time.h"' )
'sys/time.h'

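Hmm, I see.

With the re module, supplying a single regular expression string looks like enough.

Whether it matches decides whether the line is an include directive.

If it matches, the regular expression should be written so that r'\1' extracts the file path.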
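Both steps can be folded into one small helper: a match means the line is a directive, and group 1 is the path. The sketch below uses only the standard re module; make_matcher() is a hypothetical name introduced here for illustration.

```python
import re

def make_matcher(pattern):
    """Return a function mapping a line to its captured path, or None."""
    p = re.compile(pattern)

    def match_path(s):
        m = p.match(s)
        # group(1) is the first parenthesized part, i.e. what r'\1' refers to.
        return m.group(1) if m else None

    return match_path

at = make_matcher(r'^@(.+)$')
angle = make_matcher(r'^# *include +<(.*)>$')

print(at('@/tmp/foo.txt'))
print(at('foo'))
print(angle('#include <stdio.h>'))
print(angle(' #include <stdio.h>'))
```

Using the match object's group( 1 ) instead of sub( r'\1', ... ) also avoids a second pass over the line.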

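Tool default specification

The name is inc.py.
For the include directive regular expressions, the following patterns are tried in order.
- r'^# *include +(.*)$'
- r'^@ *include +(.*)$'
- r'^@ *inc +(.*)$'
- r'^@(.*)$'
Quotes or '<', '>' around the file path are removed after extraction.
Relative file paths are searched for in the following directories, in order.
- The current directory
- The directory containing inc.py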
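The first half of this specification can be sketched as follows. The pattern list is the one given above; cut_quote() here is only a guess at the behavior described ("remove surrounding quotes or '<', '>'"), since the tool's actual helper is not shown in this article and may differ.

```python
import re

DEFAULT_PATTERNS = [
    r'^# *include +(.*)$',
    r'^@ *include +(.*)$',
    r'^@ *inc +(.*)$',
    r'^@(.*)$',
]

def cut_quote(s):
    # Assumed behavior: strip one matching pair of "...", '...' or <...>.
    for (a, b) in (('"', '"'), ("'", "'"), ('<', '>')):
        if len(s) >= 2 and s.startswith(a) and s.endswith(b):
            return s[1:-1]
    return s

def match_path(s):
    # Try the patterns in order; the first one that matches wins.
    for pattern in DEFAULT_PATTERNS:
        m = re.match(pattern, s)
        if m:
            return cut_quote(m.group(1))
    return None

print(match_path('# include  <head1.txt>'))
print(match_path('@inc "a b.txt"'))
print(match_path('@foo.txt'))
print(match_path('hello'))
```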

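Coding it as a tool

First version, based on inc8.py.

inc.py

Until now, inc_text() avoided referring to the global variable paths directly and received paths as an argument, but...

Now that the part depending on the directive syntax has been factored out as regular expression strings, the plan of supplying inc_text() from a separate file no longer makes sense.

So let's write it with direct references to the global variable paths.

Checking the normal case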

$ cat smp3.txt | ./inc.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

$ pwd
/Users/kondoh/kon_page/inc

$ ( cd /tmp ; echo "# include  <head1.txt>" | /Users/kondoh/kon_page/inc/inc.py )
kon pageの近藤です。
いつもお世話になっております。

Checking error handling

$ echo @ng.txt | ./inc.py
err @bar.txt, ng.txt L3

Checking double-include prevention

$ cat loop_c.txt | ./inc7.py > nest_res.txt

$ cat loop_c.txt | ./inc.py | diff -u nest_res.txt -
$

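Version using kon_ut

Being limited to UTF-8 for Japanese is not great, so nkf.py from "Python utility programs, winter 2020" (pythonのユーティリティ・プログラム 2020冬) is brought in.

Wherever else possible, parts are pulled in from kon_ut as well.

v1/inc_ut.py

And inc_ut.py is added to kon_ut as a component.

Checking the normal case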

$ cat smp3.txt | ./inc_ut.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

$ pwd
/Users/kondoh/kon_page/inc

$ ( cd /tmp ; echo "# include  <head1.txt>" | /Users/kondoh/kon_page/inc/inc_ut.py )
kon pageの近藤です。
いつもお世話になっております。

$ nkf -j smp3.txt | nkf -g
ISO-2022-JP

$ nkf -j smp3.txt | ./inc_ut.py | nkf -g
ISO-2022-JP

$ nkf -j smp3.txt | ./inc_ut.py | nkf -u
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

Checking error handling

$ echo @ng.txt | ./inc_ut.py
err @bar.txt, ng.txt L3

Checking double-include prevention

$ cat loop_c.txt | ./inc7.py > nest_res.txt

$ cat loop_c.txt | ./inc_ut.py | diff -u nest_res.txt -
$

Simple replacement feature

I added a little more.

v2.patch

diff -ur v1/inc_ut.py v2/inc_ut.py
--- v1/inc_ut.py	2020-09-13 22:47:44.000000000 +0900
+++ v2/inc_ut.py	2020-09-15 04:10:46.000000000 +0900
@@ -28,23 +28,35 @@

 def new( lst=[] ):
 	lst += [
-		r'^# *include +(.*)$',
-		r'^@ *include +(.*)$',
-		r'^@ *inc +(.*)$',
-		r'^@(.*)$',
+		r'^# *include +(\S*)',
+		r'^@ *include +(\S*)',
+		r'^@ *inc +(\S*)',
+		r'^@(\S*)',
 	]
 	pat_lst = list( map( re.compile, lst ) )

 	def match_path(s):
 		for p in pat_lst:
-			if p.match( s ):
-				return cut_quote( p.sub( r'\1', s ) )
-		return None
+			m = p.match( s )
+			if m and m.groups():
+				path = cut_quote( m.group( 1 ) )
+				opts = p.sub( '', s ).split()
+				return ( path, opts )
+		return ( None, '' )

 	paths = [ '-' ]

+	def cvt_opts(r, opts):
+		for opt in opts:
+			delim = '='
+			if delim in opt:
+				i = opt.index( delim )
+				(k, v) = ( opt[ : i ], opt[ i + len( delim ) : ] )
+				r = r.replace( k, v )
+		return r
+
 	def inc_text(s):
-		path = match_path( s )
+		(path, opts) = match_path( s )
 		if path == None:
 			return None

@@ -60,6 +72,7 @@
 				if r == False:
 					return False
 				paths.append( path )
+				r = cvt_opts( r, opts )
 				return cut_tail_nl( r )
 		return False

@@ -83,11 +96,10 @@
 	inc = new()

 	b = nkf.get_stdin()
-	opt = nkf.guess( b )
-	s = nkf.dec( nkf.cvt( b, '-u' ) )
+	(s, opt) = nkf.to_str( b )

 	s = inc.exp( s )

-	b = nkf.cvt( nkf.enc( s ), opt )
+	b = nkf.str_to( s, opt )
 	nkf.put_stdout( b )
 # EOF

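String replacements can now be specified after the file name on an include directive line.

A replacement is specified in the following form.

string-before=string-after

Several can be given, separated by spaces.

Parsing is kept simple, so neither the string before nor the string after may contain spaces.

Enclosing them in quotes does not help either.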
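The parsing described above can be sketched with str.split() and str.partition(). This is an illustrative variant rather than the code in v2.patch; in particular, the check that skips an empty string before '=' is an addition of this sketch.

```python
def cvt_opts(text, opts):
    """Apply each 'before=after' option to text, in order."""
    for opt in opts:
        before, sep, after = opt.partition("=")
        # Skip tokens without '=' (such as flags) and empty 'before' strings.
        if sep and before:
            text = text.replace(before, after)
    return text

# The part of the directive line after the file name, split on whitespace.
opts = "FOO=BAR o=0".split()
result = cvt_opts("foo\nFOO", opts)
print(result)
```

Because the options are split on whitespace before anything else happens, a quoted value such as "a b" is already broken into two tokens by the time it is examined, which is why quoting cannot help.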

Checking the replacement feature

$ echo "@foo.txt" | ./inc_ut.py
foo
FOO
$ echo "@foo.txt FOO=BAR o=0" | ./inc_ut.py
f00
BAR

Checking the normal case

$ cat smp3.txt | ./inc_ut.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

$ pwd
/Users/kondoh/kon_page/inc

$ ( cd /tmp ; echo "# include  <head1.txt>" | /Users/kondoh/kon_page/inc/inc_ut.py )
kon pageの近藤です。
いつもお世話になっております。

$ nkf -j smp3.txt | nkf -g
ISO-2022-JP

$ nkf -j smp3.txt | ./inc_ut.py | nkf -g
ISO-2022-JP

$ nkf -j smp3.txt | ./inc_ut.py | nkf -u
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

Checking error handling

$ echo @ng.txt | ./inc_ut.py
err @bar.txt, ng.txt L3

Checking double-include prevention

$ cat loop_c.txt | ./inc7.py > nest_res.txt

$ cat loop_c.txt | ./inc_ut.py | diff -u nest_res.txt -
$

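Search directories

Until now, the directories searched for included files were fixed by the following rule.

Relative file paths are searched for in the following directories, in order.

The current directory
The directory containing inc.py

I wanted to be able to add others, such as the directory of the importing source, so I updated it.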

v3.patch

diff -ur v2/inc_ut.py v3/inc_ut.py
--- v2/inc_ut.py	2020-09-15 04:10:46.000000000 +0900
+++ v3/inc_ut.py	2020-09-16 04:13:58.000000000 +0900
@@ -55,6 +55,8 @@
 				r = r.replace( k, v )
 		return r

+	search_dirs = [ os.path.dirname( __file__ ) ]
+
 	def inc_text(s):
 		(path, opts) = match_path( s )
 		if path == None:
@@ -62,7 +64,7 @@

 		path_lst = [ path ]
 		if not path.startswith( '/' ):
-			path_lst.append( os.path.join( os.path.split( __file__ )[ 0 ], path ) )
+			path_lst += list( map( lambda dir_: os.path.join( dir_, path ), search_dirs ) )

 		for path in path_lst:
 			if os.path.exists( path ):

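A list is provided as the object's attribute .search_dirs.

When exp() runs and a file given by a relative path is not found in the current directory, the file is searched for in the directories of search_dirs, in order from the head of the list.

The initial value is

search_dirs = [ os.path.dirname( __file__ ) ]

so it contains only the directory where inc_ut.py resides.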
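The lookup order can be sketched as a pair of small functions. candidates() and find_first() are hypothetical names for this illustration, and os.path.isabs() is used here where the patch tests path.startswith( '/' ).

```python
import os

def candidates(path, search_dirs):
    """List the paths to try, in order, for an include file name."""
    if os.path.isabs(path):
        return [path]
    # The bare relative path is tried first, i.e. the current directory.
    return [path] + [os.path.join(d, path) for d in search_dirs]

def find_first(path, search_dirs):
    """Return the first candidate that exists, or None."""
    for p in candidates(path, search_dirs):
        if os.path.exists(p):
            return p
    return None

print(candidates("foo.txt", ["/tmp/bar", "/home/kondoh/kon_page/kon_ut"]))
```

Inserting a directory at the head of search_dirs, as in the session below, simply moves its candidate ahead of the others.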

Checking search directories

$ cd /tmp
$ ls -l foo.txt
ls: cannot access 'foo.txt': No such file or directory

$ cat /home/kondoh/kon_page/kon_ut/foo.txt
foo
FOO

$ cat /tmp/bar/foo.txt
foo_foo

$ python
>>> import inc_ut
>>> inc = inc_ut.new()
>>> inc.search_dirs
['/home/kondoh/kon_page/kon_ut']

>>> print( inc.exp( '@foo.txt' ) )
foo
FOO

>>> inc.search_dirs.insert( 0, '/tmp/bar' )
>>> inc.search_dirs
['/tmp/bar', '/home/kondoh/kon_page/kon_ut']

>>> print( inc.exp( '@foo.txt' ) )
foo_foo

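Suppressing recursive expansion

Early on I casually set out this syntax:

An include line must start with '@' at the beginning of the line
Immediately after the @ comes the absolute path of the file to include, or a path relative to the current directory

Because of it, trying to include-expand a diff file such as the following runs into trouble.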

diff -ur v2/inc_ut.py v3/inc_ut.py
--- v2/inc_ut.py        2020-09-15 04:10:46.000000000 +0900
+++ v3/inc_ut.py        2020-09-16 04:13:58.000000000 +0900
@@ -55,6 +55,8 @@
                                r = r.replace( k, v )
                return r

+       search_dirs = [ os.path.dirname( __file__ ) ]
+
  :

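The fourth line,

@@ -55,6 +55,8 @@

unintentionally matches the regular expression

r'^@(\S*)'

The tool then looks for a file named after the captured text (with this pattern, just "@"), fails to find it, and stops with an error. orz

So an option to suppress recursive expansion is added.

-nr

Adding -nr to an include directive line disables recursive expansion of the included text.

It stands for "no recursive".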
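The effect of the option can be sketched in isolation. This is not the v4 code: FILES is a hypothetical in-memory stand-in for real files, and the directive is parsed by a plain split() instead of the regular expression list.

```python
# Hypothetical in-memory files.
FILES = {
    "nest.txt": "--> foo.txt\n@foo.txt\n<--",
    "foo.txt": "foo\nFOO",
}

def exp(s):
    def exp_line(line):
        if not line.startswith("@"):
            return line
        fields = line[1:].split()
        if not fields:
            return line
        path, opts = fields[0], fields[1:]
        text = FILES[path]
        # With -nr the included text is inserted verbatim, directives and all.
        return text if "-nr" in opts else exp(text)

    return "\n".join(map(exp_line, s.split("\n")))

print(exp("@nest.txt"))
print(exp("@nest.txt -nr"))
```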

v4.patch

diff -ur v3/inc_ut.py v4/inc_ut.py
--- v3/inc_ut.py	2020-09-16 04:13:58.000000000 +0900
+++ v4/inc_ut.py	2020-09-16 05:06:21.000000000 +0900
@@ -74,8 +74,7 @@
 				if r == False:
 					return False
 				paths.append( path )
-				r = cvt_opts( r, opts )
-				return cut_tail_nl( r )
+				return empty.new( s=cut_tail_nl( r ), opts=opts )
 		return False

 	def exp_line(i, s):
@@ -86,7 +85,9 @@
 			return s
 		if r == True:
 			return ''
-		s = exp( r )
+		s = cvt_opts( r.s, r.opts )
+		if '-nr' not in r.opts:
+			s = exp( s )
 		paths.pop()
 		return s

Checking suppression of recursive expansion

$ echo "@nest.txt" | ./inc_ut.py
nest test

--> smp1.txt
hello
foo
FOO
world
<--

--> foo.txt
foo
FOO
<-- foo.txt

$ echo "@nest.txt -nr" | ./inc_ut.py
nest test

--> smp1.txt
@smp1.txt
<--

--> foo.txt
@foo.txt
<-- foo.txt

Changing the default regular expressions

Suppressing recursive expansion alone did not settle things.

For example,

$ cat head1.txt
kon pageの近藤です。
いつもお世話になっております。

$ ./inc_ut.py <<EOF
> @head1.txt
> ソースコード中の
> #include <stdio.h>
> の件について
>   :
> EOF

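Do something like this, and the "#include <stdio.h>" line matches too; the tool tries to expand the file stdio.h, cannot find it, and stops with an error.

There is also the earlier problem of diff output lines that start with '@@ ', so the list of default regular expressions is revised.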

def new( lst=[] ):
	lst += [
		r'^@([^@\s]+)',
	]
	pat_lst = list( map( re.compile, lst ) )
		:

Only this one pattern remains.

It matches a line that starts with @ followed by one or more characters that are neither '@' nor whitespace.

That run of characters is the file specification.
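The new pattern can be checked directly against the problem lines mentioned so far. path_of() is a hypothetical helper for this illustration.

```python
import re

p = re.compile(r'^@([^@\s]+)')

def path_of(s):
    """Return the captured file specification, or None if s is not a directive."""
    m = p.match(s)
    return m.group(1) if m else None

for line in ['@head1.txt', '@@head1.txt', '@@ -55,6 +55,8 @@',
             '#include <stdio.h>', '@foo.txt FOO=BAR']:
    print(repr(line), '->', repr(path_of(line)))
```

Because the character right after the first @ must be neither '@' nor whitespace, both '@@...' lines and a lone '@ ' are rejected, while trailing options after the file name are left for separate parsing.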

v5.patch

diff -ur v4/inc_ut.py v5/inc_ut.py
--- v4/inc_ut.py	2020-09-16 05:06:21.000000000 +0900
+++ v5/inc_ut.py	2020-09-16 08:36:04.000000000 +0900
@@ -28,10 +28,7 @@

 def new( lst=[] ):
 	lst += [
-		r'^# *include +(\S*)',
-		r'^@ *include +(\S*)',
-		r'^@ *inc +(\S*)',
-		r'^@(\S*)',
+		r'^@([^@\s]+)',
 	]
 	pat_lst = list( map( re.compile, lst ) )

Checking the default regular expression change

$ ./inc_ut.py <<EOF
> @head1.txt
> ソースコード中の
> #include <stdio.h>
> の件について
>   :
> EOF
kon pageの近藤です。
いつもお世話になっております。
ソースコード中の
#include <stdio.h>
の件について
  :

$ echo "@head1.txt" | ./inc_ut.py
kon pageの近藤です。
いつもお世話になっております。

$ echo "@@head1.txt" | ./inc_ut.py
@@head1.txt

$ echo "@@ -55,6 +55,8 @@" | ./inc_ut.py
@@ -55,6 +55,8 @@

Checking the normal case

$ cat smp3.txt | ./inc_ut.py
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

$ pwd
/Users/kondoh/kon_page/inc

$ ( cd /tmp ; echo "@<head1.txt>" | /Users/kondoh/kon_page/inc/inc_ut.py )
kon pageの近藤です。
いつもお世話になっております。

$ nkf -j smp3.txt | nkf -g
ISO-2022-JP

$ nkf -j smp3.txt | ./inc_ut.py | nkf -g
ISO-2022-JP

$ nkf -j smp3.txt | ./inc_ut.py | nkf -u
関係各位
kon pageの近藤です。
いつもお世話になっております。
以上

Checking error handling

$ echo @ng.txt | ./inc_ut.py
err @bar.txt, ng.txt L3

Checking double-include prevention

$ cat loop_c.txt | ./inc7.py > nest_res.txt

$ cat loop_c.txt | ./inc_ut.py | diff -u nest_res.txt -
$

Checking search directories

$ cd /tmp
$ ls -l foo.txt
ls: cannot access 'foo.txt': No such file or directory

$ cat /home/kondoh/kon_page/kon_ut/foo.txt
foo
FOO

$ cat /tmp/bar/foo.txt
foo_foo

$ python
>>> import inc_ut
>>> inc = inc_ut.new()
>>> inc.search_dirs
['/home/kondoh/kon_page/kon_ut']

>>> print( inc.exp( '@foo.txt' ) )
foo
FOO

>>> inc.search_dirs.insert( 0, '/tmp/bar' )
>>> inc.search_dirs
['/tmp/bar', '/home/kondoh/kon_page/kon_ut']

>>> print( inc.exp( '@foo.txt' ) )
foo_foo

Checking suppression of recursive expansion

$ echo "@nest.txt" | ./inc_ut.py
nest test

--> smp1.txt
hello
foo
FOO
world
<--

--> foo.txt
foo
FOO
<-- foo.txt

$ echo "@nest.txt -nr" | ./inc_ut.py
nest test

--> smp1.txt
@smp1.txt
<--

--> foo.txt
@foo.txt
<-- foo.txt

Fake files (buffers, really)

This strays from include expansion proper, but I added a mechanism that accumulates text in a buffer and then include-expands that text.

v6.patch

diff -ur v5/inc_ut.py v6/inc_ut.py
--- v5/inc_ut.py	2020-09-16 08:36:04.000000000 +0900
+++ v6/inc_ut.py	2020-09-24 21:32:53.000000000 +0900
@@ -7,6 +7,38 @@
 import nkf
 import dbg

+def buf_new():
+	dic = {}
+	ks = []
+
+	def set(s):
+		get_k = lambda : s[ 2 : ]
+
+		if s.startswith( '@>' ):
+			ks.append( get_k() )
+			return True
+		if s.startswith( '@<' ):
+			if ks:
+				ks.pop()
+			return True
+		if s.startswith( '@!' ):
+			k = get_k()
+			if k in dic:
+				dic.pop( k )
+			return True
+		if ks:
+			k = ks[ -1 ]
+			if k not in dic:
+				dic[ k ] = ''
+			dic[ k ] += s + '\n'
+			return True
+
+		return False
+
+	get = dic.get
+
+	return empty.new( locals() )
+
 def read_str(path):
 	r = False
 	with open( path, 'rb' ) as f:
@@ -32,6 +64,8 @@
 	]
 	pat_lst = list( map( re.compile, lst ) )

+	buf = buf_new()
+
 	def match_path(s):
 		for p in pat_lst:
 			m = p.match( s )
@@ -55,10 +89,18 @@
 	search_dirs = [ os.path.dirname( __file__ ) ]

 	def inc_text(s):
+		if buf.set( s ):
+			return True
+
 		(path, opts) = match_path( s )
 		if path == None:
 			return None

+		r = buf.get( path )
+		if r != None:
+			paths.append( path )
+			return empty.new( s=cut_tail_nl( r ), opts=opts )
+
 		path_lst = [ path ]
 		if not path.startswith( '/' ):
 			path_lst += list( map( lambda dir_: os.path.join( dir_, path ), search_dirs ) )
@@ -81,14 +123,17 @@
 		if r == None:
 			return s
 		if r == True:
-			return ''
+			return True
 		s = cvt_opts( r.s, r.opts )
 		if '-nr' not in r.opts:
 			s = exp( s )
 		paths.pop()
 		return s

-	exp = lambda s: '\n'.join( map( lambda i_s: exp_line( *i_s ), enumerate( s.split( '\n' ) ) ) )
+	def exp(s):
+		f_map = lambda i_s: exp_line( *i_s )
+		f_filter = lambda s: s != True
+		return '\n'.join( filter( f_filter, map( f_map, enumerate( s.split( '\n' ) ) ) ) )

 	return empty.new( locals() )

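Until now, an include directive line was

a line that starts with @ at the beginning of the line, followed by one or more characters that are neither @ nor whitespace.

In addition, the following three patterns are now treated specially.

Line starting with '@>'	Starts registration into the buffer named by the text after '@>'
Line starting with '@<'	Ends registration into the current buffer
Line starting with '@!'	Deletes the buffer named by the text after '@!'

For example, if the input text contains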

  :
@>foo
hello
world
@<
  :

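then the two lines hello and world are registered in the buffer named foo.

The lines from '@>xxx' through '@<' are not output.

If, later in the input, there is an include directive line such as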

  :
@foo
  :

then the contents of the buffer named foo are expanded, without searching for a file named foo.

If no buffer named foo has been registered, a file named foo is searched for as before.
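
To make the three special lines concrete, here is a minimal standalone sketch of the registration logic only. It is a simplification, not the code from the patch: the name collect_buffers and the returned list of passed-through lines are assumptions for illustration, and the real tool interleaves this with include expansion.

```python
def collect_buffers(text):
    """Split text into registered buffers and the lines that pass through."""
    buffers = {}   # buffer name -> stored text
    stack = []     # names of buffers currently being registered
    passed = []    # lines not consumed by the buffer mechanism

    for line in text.split('\n'):
        if line.startswith('@>'):
            stack.append(line[2:])           # start registration
        elif line.startswith('@<'):
            if stack:
                stack.pop()                  # end registration
        elif line.startswith('@!'):
            buffers.pop(line[2:], None)      # delete the named buffer
        elif stack:
            name = stack[-1]
            buffers[name] = buffers.get(name, '') + line + '\n'
        else:
            passed.append(line)
    return buffers, passed

buffers, passed = collect_buffers(':\n@>foo\nhello\nworld\n@<\n:')
print(buffers)
print(passed)
```

The lines between '@>foo' and '@<' end up only in the buffer, which is why they do not appear in the output.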

buf.txt

にせファイル(というかバッファ)の使用例

@>hoge
みなさまお聴きの放送は
FREQ kHz CS NAME です。
@<

@hoge FREQ=1179 CS=MBS NAME=毎日放送

@hoge FREQ=1008 CS=ABC NAME=朝日放送

@hoge FREQ=1314 CS=OBC NAME=ラジオ大阪

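With the input above, where the buffer hoge holds a two-line template with the placeholders FREQ, CS, and NAME, running the tool gives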

$ ./inc_ut.py < buf.txt
にせファイル(というかバッファ)の使用例


みなさまお聴きの放送は
1179 kHz MBS 毎日放送 です。

みなさまお聴きの放送は
1008 kHz ABC 朝日放送 です。

みなさまお聴きの放送は
1314 kHz OBC ラジオ大阪 です。

The template is expanded once per include directive, as shown.

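To suppress output while registering into a buffer, I changed the existing behavior slightly.

Previously, when a double include was detected and the directive was quietly ignored, the directive line was still output as a single empty line.

Now a double-include directive line is treated as a deleted line and is not output at all.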

Automatic generation of buffer names

v7.patch

diff -ur v6/inc_ut.py v7/inc_ut.py
--- v6/inc_ut.py	2020-09-24 21:32:53.000000000 +0900
+++ v7/inc_ut.py	2020-09-26 02:54:32.000000000 +0900
@@ -11,20 +11,34 @@
 	dic = {}
 	ks = []

+	e = empty.new()
+	e.last = ''
+
+	def anon_k():
+		id = 0
+		get = lambda : 'tMp_{}'.format( id )
+		while get() in dic:
+			id += 1
+		return get()
+
 	def set(s):
-		get_k = lambda : s[ 2 : ]
+		def get_k():
+			k = s[ 2 : ]
+			return k if k else anon_k()

 		if s.startswith( '@>' ):
 			ks.append( get_k() )
 			return True
 		if s.startswith( '@<' ):
 			if ks:
-				ks.pop()
+				e.last = ks.pop()
 			return True
 		if s.startswith( '@!' ):
 			k = get_k()
 			if k in dic:
 				dic.pop( k )
+			if k == e.last:
+				e.last = ''
 			return True
 		if ks:
 			k = ks[ -1 ]
@@ -37,7 +51,9 @@

 	get = dic.get

-	return empty.new( locals() )
+	to_last_key = lambda s, k='$$': s.replace( k, e.last )
+
+	return empty.add( e, locals() )

 def read_str(path):
 	r = False
@@ -96,6 +112,9 @@
 		if path == None:
 			return None

+		path = buf.to_last_key( path )
+		opts = list( map( buf.to_last_key, opts ) )
+
 		r = buf.get( path )
 		if r != None:
 			paths.append( path )

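Line starting with '@>'	Starts registration into the buffer named by the text after '@>'
Line starting with '@<'	Ends registration into the current buffer
Line starting with '@!'	Deletes the buffer named by the text after '@!'

I changed '@>' so that, when no name is given at the start of registration, a suitable non-colliding name is generated automatically.

Line starting with '@>'	Starts registration into the buffer named by the text after '@>'. If no name is given, a non-colliding name is generated automatically
Line starting with '@<'	Ends registration into the current buffer
Line starting with '@!'	Deletes the buffer named by the text after '@!'

The name of the buffer most recently closed with '@<' can now be referenced with the keyword '$$'.

'$$' can be used only on include directive lines, either in the path string or in the option strings that follow it.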
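
The following sketch shows how an unnamed buffer and '$$' might be used together. It is a simplified re-implementation under stated assumptions, not the patched code: the function name expand_buffers is hypothetical, only buffers are looked up (no files), '@!' is left out, and options of the form old=new are assumed to be applied as plain string replacements.

```python
import re

DIRECTIVE = re.compile(r'^@([^@\s]+)(.*)')

def expand_buffers(text):
    buffers = {}   # buffer name -> stored text
    stack = []     # buffers currently being registered
    last = ''      # name of the buffer most recently closed with '@<'
    out = []

    def anon_name():
        # First 'tMp_N' that is not already a buffer name.
        n = 0
        while 'tMp_{}'.format(n) in buffers:
            n += 1
        return 'tMp_{}'.format(n)

    for line in text.split('\n'):
        if line.startswith('@>'):
            stack.append(line[2:] or anon_name())
            continue
        if line.startswith('@<'):
            if stack:
                last = stack.pop()
            continue
        if stack:
            buffers[stack[-1]] = buffers.get(stack[-1], '') + line + '\n'
            continue
        m = DIRECTIVE.match(line)
        name = m.group(1).replace('$$', last) if m else None
        if name not in buffers:
            out.append(line)          # not a buffer include; passed through as-is
            continue
        body = buffers[name][:-1]     # drop the final newline
        for opt in m.group(2).split():
            old, sep, new = opt.replace('$$', last).partition('=')
            if sep and old and not old.startswith('-'):
                body = body.replace(old, new)
        out.append(body)
    return '\n'.join(out)

print(expand_buffers('@>\nhello NAME\n@<\n@$$ NAME=world'))
```

Here the unnamed '@>' gets the generated name tMp_0, and '@$$' on the last line refers back to it, so the caller never has to invent a buffer name.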

Indentation

v8.patch

diff -ur v7/inc_ut.py v8/inc_ut.py
--- v7/inc_ut.py	2020-09-26 02:54:32.000000000 +0900
+++ v8/inc_ut.py	2020-10-09 23:21:17.000000000 +0900
@@ -74,6 +74,23 @@
 				return s[ 1 : -1 ]
 	return s

+def indent(s, opts):
+	def get_i():
+		k = '-s'
+		for opt in opts:
+			if opt.startswith( k ):
+				r = opt[ len( k ) : ]
+				if not r:
+					return 1
+				elif r.isdigit():
+					return int( r )
+		return 0
+
+	i = get_i()
+	if i > 0:
+		s = '\n'.join( map( lambda s: ' ' * i + s, s.split( '\n' ) ) )
+	return s
+
 def new( lst=[] ):
 	lst += [
 		r'^@([^@\s]+)',
@@ -146,6 +163,7 @@
 		s = cvt_opts( r.s, r.opts )
 		if '-nr' not in r.opts:
 			s = exp( s )
+		s = indent( s, r.opts )
 		paths.pop()
 		return s

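I added an option for indenting the text that an include directive expands to.

Adding '-s' to an include directive line prepends one space to each line of the expanded text.

To prepend two spaces, put a number right after -s, as in '-s2'.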
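
As a quick way to try the option parsing in isolation, the logic of the v8 patch can be restated as two small standalone functions. The split into indent_width and indent is my own arrangement for illustration; the behavior is meant to follow the patch.

```python
def indent_width(opts):
    """Return the number of spaces requested by a '-s' or '-sN' option, else 0."""
    for opt in opts:
        if opt.startswith('-s'):
            rest = opt[2:]
            if not rest:
                return 1            # bare '-s' means one space
            if rest.isdigit():
                return int(rest)    # '-sN' means N spaces
    return 0

def indent(text, opts):
    """Prefix every line of text with the requested number of spaces."""
    n = indent_width(opts)
    if n > 0:
        text = '\n'.join(' ' * n + line for line in text.split('\n'))
    return text

print(indent('foo\nFOO', ['-s2']))
```

Note that every line is prefixed, including empty ones, so an indented block may contain lines that consist only of spaces.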

No-error option

v9.patch

diff -ur v8/inc_ut.py v9/inc_ut.py
--- v8/inc_ut.py	2020-10-09 23:21:17.000000000 +0900
+++ v9/inc_ut.py	2020-10-09 23:21:53.000000000 +0900
@@ -147,9 +147,13 @@
 					return True
 				r = read_str( path )
 				if r == False:
-					return False
+					break
 				paths.append( path )
 				return empty.new( s=cut_tail_nl( r ), opts=opts )
+
+		if '-q' in opts:
+			return True
+
 		return False

 	def exp_line(i, s):

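By default, if the file to include cannot be found or cannot be opened, the tool prints an error and exits.

Adding '-q' (quiet) to an include directive line makes the tool quietly ignore such an error, treat the directive line as deleted, and carry on with the rest of the input.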
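
The difference between the two behaviors can be sketched as follows. This is a simplified illustration, not the tool's code: the names include_or_skip and expand_lines are hypothetical, the sketch raises an exception where the real tool prints an error and exits, and search directories and recursion are left out.

```python
def include_or_skip(path, opts):
    """Return the file's text, or None when it cannot be read and '-q' was given."""
    try:
        with open(path, encoding='utf-8') as f:
            return f.read().rstrip('\n')
    except OSError:
        if '-q' in opts:
            return None     # quiet: the caller drops the directive line
        raise               # default: the failure is reported

def expand_lines(lines):
    out = []
    for line in lines:
        tokens = line[1:].split() if line.startswith('@') else []
        if not tokens:
            out.append(line)
            continue
        body = include_or_skip(tokens[0], tokens[1:])
        if body is not None:
            out.append(body)
    return out

print(expand_lines(['before', '@no_such_file_for_this_sketch.txt -q', 'after']))
```

With '-q' the missing file leaves no trace in the output, not even an empty line.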

Line-end continuation

v10.patch

diff -ur v9/inc_ut.py v10/inc_ut.py
--- v9/inc_ut.py	2020-10-09 23:21:53.000000000 +0900
+++ v10/inc_ut.py	2020-10-10 18:42:36.000000000 +0900
@@ -87,8 +87,11 @@
 		return 0

 	i = get_i()
-	if i > 0:
-		s = '\n'.join( map( lambda s: ' ' * i + s, s.split( '\n' ) ) )
+
+	tc = ' \\' if '-tc' in opts else ''
+
+	if i > 0 or tc:
+		s = '\n'.join( map( lambda s: ' ' * i + s + tc, s.split( '\n' ) ) )
 	return s

 def new( lst=[] ):

Adding '-tc' (tail continued) to an include directive line appends one space and a backslash to the end of each line of the expanded text.
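
One plausible use is pulling a list of arguments into a multi-line shell command. The sketch below restates the v10 logic with explicit parameters instead of an option list; the name decorate and its parameters are assumptions for illustration.

```python
def decorate(text, indent=0, tail_continue=False):
    """Apply the '-sN' style indent and the '-tc' style line-end continuation."""
    tail = ' \\' if tail_continue else ''
    if indent > 0 or tail:
        text = '\n'.join(' ' * indent + line + tail for line in text.split('\n'))
    return text

print('cc \\')
print(decorate('-O2\n-Wall', indent=2, tail_continue=True))
print('  main.c')
```

The backslash is appended to every expanded line, including the last one, so whatever follows the include directive in the host text becomes part of the same continued line.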

Option to allow double includes

v11.patch

diff -ur v10/inc_ut.py v11/inc_ut.py
--- v10/inc_ut.py	2020-10-10 18:42:36.000000000 +0900
+++ v11/inc_ut.py	2020-10-10 21:24:30.000000000 +0900
@@ -146,7 +146,7 @@

 		for path in path_lst:
 			if os.path.exists( path ):
-				if path in paths:
+				if path in paths and not '-ad' in opts:
 					return True
 				r = read_str( path )
 				if r == False:

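The double-include guard quietly skips the directive when it detects a double include.

However, thanks to the simple replacement feature, a recursive include can sometimes terminate properly on its own.

So I added '-ad' (allow double include) as an option that deliberately permits double includes.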
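
Here is one way such a terminating case could look, sketched with an in-memory table of files instead of the real file system. Everything in it is an assumption for illustration: the function expand, the files ping.txt, pong.txt, and end.txt, and the way old=new options are parsed. The point is only that the replacement rewrites the nested directive, so the cycle ends.

```python
import re

DIRECTIVE = re.compile(r'^@([^@\s]+)(.*)')

def expand(text, files, active=None):
    """Expand include directives against an in-memory dict of files."""
    active = [] if active is None else active   # includes currently being expanded
    out = []
    for line in text.split('\n'):
        m = DIRECTIVE.match(line)
        if not m or m.group(1) not in files:
            out.append(line)
            continue
        name, opts = m.group(1), m.group(2).split()
        if name in active and '-ad' not in opts:
            continue                             # double include: drop the line
        body = files[name]
        for opt in opts:
            old, sep, new = opt.partition('=')
            if sep and old and not old.startswith('-'):
                body = body.replace(old, new)
        active.append(name)
        out.append(expand(body, files, active))
        active.pop()
    return '\n'.join(out)

files = {
    'ping.txt': 'ping\n@pong.txt',
    'pong.txt': 'pong\n@ping.txt -ad @pong.txt=@end.txt',
    'end.txt': 'end',
}
print(expand('@ping.txt', files))
```

On the second pass through ping.txt the directive '@pong.txt' has been rewritten to '@end.txt', so the expansion stops. Without a replacement that breaks the cycle, '-ad' would recurse without end, which is why the guard stays on by default.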

kon_ut-style description

kon_ut

inc_ut.py

new( lst=[] )

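Used to expand text data that contains include directive lines.

Pass lst, a list of regular expressions for include directive lines, and it returns an object for include expansion.

The following regular expression is built in as the default, and the lst passed as an argument is inserted in front of it. (The struck-through patterns were defaults up to v4 and were removed in v5.)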

~~r'^# *include +(\S*)'~~
~~r'^@ *include +(\S*)'~~
~~r'^@ *inc +(\S*)'~~
r'^@([^@\s]+)'
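
One Python detail may be worth keeping in mind here: since the signature is new( lst=[] ) and the defaults are added with +=, the list object that is passed in (or the shared default list) is extended in place on every call. The following is a minimal sketch of a non-mutating way to combine user patterns with the default, and of how a '#include "..."' style could be added back through lst. The names DEFAULT_PATTERNS, build_patterns, and match_path as written here are illustrative, not the module's API.

```python
import re

DEFAULT_PATTERNS = [r'^@([^@\s]+)']

def build_patterns(user_patterns=None):
    """Compile user patterns followed by the defaults, without modifying the input list."""
    patterns = list(user_patterns or []) + DEFAULT_PATTERNS
    return [re.compile(p) for p in patterns]

def match_path(line, compiled):
    """Return the first capture group of the first pattern that matches, else None."""
    for pat in compiled:
        m = pat.match(line)
        if m:
            return m.group(1)
    return None

mine = [r'^# *include +"([^"]+)"']
compiled = build_patterns(mine)
print(match_path('#include "foo.txt"', compiled))
print(match_path('@foo.txt', compiled))
print(len(mine))
```

Because user patterns come first, they take priority over the default when both could match a line.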

Main methods of the object

Method	Description
exp(s)	Given a string s that may contain newlines, returns the string with its include directive lines expanded.

Example

$ cat foo.txt
foo
FOO

$ python
>>> import inc_ut
>>> inc = inc_ut.new()

>>> s = '''bar
... hoge
... @foo.txt
... fuga'''

>>> s
'bar\nhoge\n@foo.txt\nfuga'

>>> inc.exp( s )
'bar\nhoge\nfoo\nFOO\nfuga'

>>> print( inc.exp( s ) )
bar
hoge
foo
FOO
fuga

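Simple replacement feature

After the path on an include directive line, you can add string replacements in the following form.

string-before-replacement=string-after-replacement

Several replacements can be given, separated by spaces.

Because the parsing is kept simple, neither the string before replacement nor the string after it can contain spaces.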

$ cat foo.txt
foo
FOO

$ python
>>> import inc_ut
>>> inc = inc_ut.new()
>>> print( inc.exp( '@foo.txt FOO=BAR o=0' ) )
f00
BAR
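
The replacement step itself can be pictured roughly as below. This is a sketch under assumptions, not the tool's actual parser: the name apply_replacements is hypothetical, options starting with '-' are assumed to be flags, and replacements are assumed to be applied left to right with plain str.replace.

```python
def apply_replacements(text, opts):
    """Apply each 'old=new' option to text, in the order given."""
    for opt in opts:
        old, sep, new = opt.partition('=')
        if sep and old and not old.startswith('-'):
            text = text.replace(old, new)
    return text

directive = '@foo.txt FOO=BAR o=0'
path, *opts = directive[1:].split()
print(apply_replacements('foo\nFOO', opts))
```

Splitting the directive on whitespace is also what makes spaces impossible inside either side of a replacement.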

Search directories

The object has a list attribute, .search_dirs.

Its initial value is

search_dirs = [ os.path.dirname( __file__ ) ]

so it contains only the directory where inc_ut.py lives.
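
The lookup order can be sketched as follows: the path is tried as given first, and for a non-absolute path each entry of search_dirs is then tried in list order. The function name resolve is an assumption for illustration; the candidate order follows the path_lst construction shown in the v6 diff context.

```python
import os
import tempfile

def resolve(path, search_dirs):
    """Return the first existing candidate for path, or None if there is none."""
    candidates = [path]
    if not path.startswith('/'):
        candidates += [os.path.join(d, path) for d in search_dirs]
    for cand in candidates:
        if os.path.exists(cand):
            return cand
    return None

# Demo with two throwaway directories; the file exists only in the second one.
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    name = 'resolve_sketch_demo.txt'
    open(os.path.join(second, name), 'w').close()
    print(resolve(name, [first, second]) == os.path.join(second, name))
```

Inserting a directory at index 0 of search_dirs therefore gives it priority over the directory of inc_ut.py.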

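Suppressing recursive expansion

Adding '-nr' to an include directive line turns off recursive expansion for that include.

Buffers

Line starting with '@>'	Starts registration into the buffer named by the text after '@>'. If no name is given, a non-colliding name is generated automatically
Line starting with '@<'	Ends registration into the current buffer
Line starting with '@!'	Deletes the buffer named by the text after '@!'

If a buffer with the same name as the path on an include directive line exists, the buffer's contents are expanded without searching for a file.

On an include directive line, the keyword '$$' expands to the name of the buffer most recently closed with '@<'.

Indentation

Adding '-s' to an include directive line prepends one space to each line of the expanded text.

To prepend two spaces, put a number right after -s, as in '-s2'.

Line-end continuation

Adding '-tc' (tail continued) to an include directive line appends one space and a backslash to the end of each line of the expanded text.

No-error option

Adding '-q' (quiet) to an include directive line makes the tool quietly ignore an error, treat the directive line as deleted, and carry on with the rest of the input.

Option to allow double includes

Normally, when a double include is detected, the second include directive line is ignored.

Adding '-ad' (allow double include) to an include directive line lets processing continue even when a double include is detected.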