Rails String Handling: Encoding and Debugging

紅寶鐵軌客
2018/07/08


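Writing dynamic web pages inevitably means transforming strings back and forth. This article looks at a few key topics:

String escaping

Let's start with the easy part: in Ruby, how do we handle special characters inside a string, such as single quotes, double quotes, and newlines? The following examples make it clear:

# Single vs. double quotes: how \n is handled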
irb > puts "hello \n world"
hello 
 world

irb > puts 'hello \n world'
hello \n world

irb > puts '\n'
\n

irb > puts "\n"
# prints an empty line


# Quotes nested inside single- and double-quoted strings
irb > puts 'Hello \'world\'!'
Hello 'world'!

irb > puts "Hello \"world\"!"
Hello "world"!

# Single vs. double quotes: variable interpolation
irb > s = "world"
=> "world"
irb > "Hello #{s}"
=> "Hello world"
irb > 'Hello #{s}'
=> "Hello \#{s}" # single quotes do not interpolate #{}
irb > %('Hello #{s}') # %() behaves like double quotes
=> "'Hello world'"
irb > %(Hello #{s})
=> "Hello world"
irb > %Q(Hello #{s}) # %Q behaves like double quotes
=> "Hello world"
irb > %q(Hello #{s}) # %q behaves like single quotes
=> "Hello \#{s}"


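HTML Escape / Unescape (escaping special characters)

The main use is displaying strings in a web page. It really just converts the five characters  & " ' < >  back and forth, for example: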

  • <p>test</p>\r\n<p>&nbsp;</p>

After escaping it becomes:

  • &lt;p&gt;test&lt;/p&gt;\r\n&lt;p&gt;&amp;nbsp;&lt;/p&gt;

Usage looks like this:

irb > s="<p>hello world!</p>"
=> "<p>hello world!</p>"

irb > CGI::escapeHTML(s)
=> "&lt;p&gt;hello world!&lt;/p&gt;"

irb > helper.html_escape(s) # gives the same result as CGI::escapeHTML above
=> "&lt;p&gt;hello world!&lt;/p&gt;"

# Difference between html_safe and raw
irb > s.html_safe
=> "<p>hello world!</p>"

irb > helper.raw s
=> "<p>hello world!</p>"

irb > nil.html_safe # html_safe cannot be called on nil
NoMethodError: undefined method 

irb > helper.raw nil # raw converts its argument to a string first, then calls html_safe
=> ""
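
For a quick round trip outside Rails, here is a minimal sketch that uses only Ruby's standard cgi library. The sample string is made up for illustration; it simply contains all five characters mentioned above.

```ruby
require "cgi"

# Hypothetical sample containing all five special characters: & " ' < >
raw = %q(<a href="/?a=1&b=2">Tom's page</a>)

# Escape for safe display inside an HTML page
escaped = CGI.escapeHTML(raw)
puts escaped
# &lt;a href=&quot;/?a=1&amp;b=2&quot;&gt;Tom&#39;s page&lt;/a&gt;

# unescapeHTML reverses the conversion
restored = CGI.unescapeHTML(escaped)
puts restored == raw # true
```

Note that escaping an already escaped string escapes the & again, which is how &nbsp; turned into &amp;nbsp; in the example earlier.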

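As the examples above show:

  1. For page content, what we need is the unescaped form. If you fully trust the string, you can output it directly with raw.
  2. For escaping, the usual tool is CGI::escapeHTML. Escaping also matters a great deal in URLs, and when it comes to URLs, the most annoying problem is encoding Chinese characters. That brings in another term: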

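URI encode (URL encoding)

We read URLs all the time. A URL must be ASCII, so non-ASCII characters (such as Chinese, Japanese, or Korean) get URI encoded. That is the %E7%B6%B2 style string you often see, for example: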

  • https://www.abc.com/article/網路時代2017-之一

After URI encoding it becomes:

  • https://www.abc.com/article/%E7%B6%B2%E8%B7%AF%E6%99%82%E4%BB%A32017-%E4%B9%8B%E4%B8%80

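As you can see, in a URL containing non-English text, only the non-ASCII part needs encoding (here the last path segment; the same goes for Params). The scheme and host stay as they are.

How do you URI encode? Ruby's URI module does the job nicely:

# URI encode for Params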
irb > URI.encode_www_form([["q", "中文"], ["lang", "en"]])
=> "q=%E4%B8%AD%E6%96%87&lang=en"

# Break a URL into its parts
irb > s="https://www.abc.com/edit?q=%E4%B8%AD%E6%96%87&lang=en"
irb > uri = URI.parse(s)
=> #<URI::HTTPS https://www.abc.com/edit?q=%E4%B8%AD%E6%96%87&lang=en>
irb > uri.scheme
=> "https"
irb > uri.host
=> "www.abc.com"
irb > uri.path
=> "/edit"
irb > uri.query
=> "q=%E4%B8%AD%E6%96%87&lang=en"
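
Going the other way, or encoding a single piece of a URL, can be sketched with the same URI module. This is a minimal example, not the only way to do it: encode_www_form_component follows form encoding rules (a space becomes +), which is harmless here because the sample contains no spaces.

```ruby
require "uri"

# Encode one component (the article's path segment from above)
segment = "網路時代2017-之一"
encoded = URI.encode_www_form_component(segment)
puts encoded
# %E7%B6%B2%E8%B7%AF%E6%99%82%E4%BB%A32017-%E4%B9%8B%E4%B8%80

# Decode a single component back to UTF-8
decoded = URI.decode_www_form_component(encoded)
puts decoded == segment # true

# Decode a whole query string into key/value pairs
pairs = URI.decode_www_form("q=%E4%B8%AD%E6%96%87&lang=en")
p pairs # [["q", "中文"], ["lang", "en"]]
```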

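As an aside, JavaScript has two related functions, encodeURI() and encodeURIComponent(). Don't use the old escape() any more.

  1. encodeURI(): encodes a whole URL. It does not mangle the leading https://www.scrivinor.com part into https%3A%2F%2Fwww.scrivinor.com. Characters left unencoded: ~!@#$&*()=:/,;?+'
  2. encodeURIComponent(): encodes URL parameters. Characters left unencoded: ~!*()'

One more thing comes up often in dynamic HTML pages, namely how to:


Remove HTML tags and ASCII control characters from a string

Very often we need to strip the HTML tags out of an HTML string. This is super easy in Rails, for example: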

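s1 = "<p>test</p>\r\n<p>&nbsp;</p>"
s2 = helper.strip_tags(s1).squish # => "test"

We often need to display a short excerpt of a string that was originally HTML. For that, use this:

helper.truncate(helper.strip_tags(original_string).squish, escape: false, length: desired_length)

This is really handy. No more chains of gsub!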


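The most annoying part: string encoding

String encoding is a genuinely tedious topic. Text on computers carries far too much historical baggage, and going through all of it would be like an old-timer reminiscing about the past, without much point. Want to know how many text encodings your lovely Ruby knows about? You may be surprised: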

irb > Encoding.name_list
=> ["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", 
"UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", 
"UTF8-MAC", "EUC-JP", "Windows-31J", "Big5", "Big5-HKSCS", 
"Big5-UAO", "CP949", "Emacs-Mule", "EUC-KR", "EUC-TW", "GB2312", 
"GB18030", "GBK", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", 
"ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", 
"ISO-8859-8", "ISO-8859-9", "ISO-8859-10", "ISO-8859-11", 
"ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16", 
"KOI8-R", "KOI8-U", "Shift_JIS", "Windows-1250", "Windows-1251", 
"Windows-1252", "BINARY", "IBM437", "CP437", "IBM737", "CP737", "IBM775", 
"CP775", "CP850", "IBM850", "IBM852", "CP852", "IBM855", 
"CP855", "IBM857", "CP857", "IBM860", "CP860", "IBM861", 
"CP861", "IBM862", "CP862", "IBM863", "CP863", "IBM864", 
"CP864", "IBM865", "CP865", "IBM866", "CP866", "IBM869", 
"CP869", "Windows-1258", "CP1258", "GB1988", "macCentEuro", 
"macCroatian", "macCyrillic", "macGreek", "macIceland", 
"macRoman", "macRomania", "macThai", "macTurkish", "macUkraine", 
"CP950", "Big5-HKSCS:2008", "CP951", "IBM037", "ebcdic-cp-us", 
"stateless-ISO-2022-JP", "eucJP", "eucJP-ms", "euc-jp-ms", 
"CP51932", "EUC-JIS-2004", "EUC-JISX0213", "eucKR", 
"eucTW", "EUC-CN", "eucCN", "GB12345", "CP936", "ISO-2022-JP", 
"ISO2022-JP", "ISO-2022-JP-2", "ISO2022-JP2", "CP50220", 
"CP50221", "ISO8859-1", "ISO8859-2", "ISO8859-3", "ISO8859-4", 
"ISO8859-5", "ISO8859-6", 
"Windows-1256", "CP1256", "ISO8859-7", "Windows-1253", "CP1253", 
"ISO8859-8", "Windows-1255", "CP1255", "ISO8859-9", "Windows-1254", 
"CP1254", "ISO8859-10", "ISO8859-11", "TIS-620", "Windows-874", 
"CP874", "ISO8859-13", "Windows-1257", "CP1257", "ISO8859-14", 
"ISO8859-15", "ISO8859-16", "CP878", "MacJapanese", "MacJapan", 
"ASCII", "ANSI_X3.4-1968", "646", "UTF-7", "CP65000", "CP65001", 
"UTF-8-MAC", "UTF-8-HFS", "UCS-2BE", "UCS-4BE", "UCS-4LE", 
"CP932", "csWindows31J", "SJIS", "PCK", "CP1250", "CP1251", 
"CP1252", "UTF8-DoCoMo", "SJIS-DoCoMo", "UTF8-KDDI", "SJIS-KDDI", 
"ISO-2022-JP-KDDI", 
"stateless-ISO-2022-JP-KDDI", "UTF8-SoftBank", "SJIS-SoftBank", 
"locale", "external", "filesystem", "internal"]

Fortunately, these days we use almost nothing but UTF-8. Let's first look at a simple encoding conversion (... which is not simple at all):

irb > s = "R\xC3\xA9sum\xC3\xA9"
=> "Résumé"
irb > s.encoding # check the encoding
=> #<Encoding:UTF-8> # it is UTF-8
irb > s.encode "ISO-8859-1" # convert to another encoding
=> "R\xE9sum\xE9"
irb > s.encoding # no effect on s: encode returns a new string, s is still UTF-8
=> #<Encoding:UTF-8>
irb > s = "R\xC3\xA9sum\xC3\xA9".encode(Encoding::ISO_8859_1) # assign the converted result
=> "R\xE9sum\xE9"
irb > s
=> "R\xE9sum\xE9" # why isn't it "Résumé"?
irb > s.encoding
=> #<Encoding:ISO-8859-1> # because it is no longer UTF-8
irb > s.force_encoding(Encoding::UTF_8)
=> "R\xE9sum\xE9" # force the encoding
irb > s.encoding
=> #<Encoding:UTF-8> # the label changed
irb > s
=> "R\xE9sum\xE9" # still not "Résumé"?
irb > s1 = s.encode!(Encoding::ISO_8859_1) # try converting back
Encoding::InvalidByteSequenceError: "\xE9" followed by "s" on UTF-8 # fails: the bytes are not valid UTF-8
irb > s.force_encoding(Encoding::ISO_8859_1) # force the label back
=> "R\xE9sum\xE9"
irb > s1 = s.encode!(Encoding::ISO_8859_1) # now it converts
=> "R\xE9sum\xE9"
irb > s1 = s.encode!(Encoding::UTF_8) # and it converts to UTF-8 too
=> "Résumé"
irb > s1 = s.encode("UTF-8", invalid: :replace, undef: :replace)
=> "Résumé"

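That was long-winded, because it really isn't simple. Still, the takeaways are just these:

  1. Encodings cannot be converted arbitrarily. If a character has no counterpart in the target encoding, or the source bytes are invalid, the conversion raises an error.
  2. force_encoding only changes the encoding "label" and leaves the content untouched, which is why R\xE9sum\xE9 still cannot be displayed as Résumé.
  3. To convert the content properly you must use encode, which maps the string's content to the encoding you specify.

That is also the difference between force_encoding and encode.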
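
The difference is easiest to see by looking at the bytes. The following is a small sketch in plain Ruby; the variable names are only for illustration.

```ruby
utf8 = "R\u00E9sum\u00E9"           # "Résumé" as UTF-8
latin1 = utf8.encode("ISO-8859-1")  # encode transcodes: the bytes change

p utf8.bytes   # [82, 195, 169, 115, 117, 109, 195, 169]
p latin1.bytes # [82, 233, 115, 117, 109, 233]

# force_encoding only swaps the label; the bytes stay exactly the same
relabeled = latin1.dup.force_encoding("UTF-8")
p relabeled.bytes == latin1.bytes # true
p relabeled.valid_encoding?       # false

# Put the correct label back, then transcode, and the text is readable again
fixed = relabeled.force_encoding("ISO-8859-1").encode("UTF-8")
p fixed == utf8 # true
```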

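People using Rails rarely run into encoding problems at first, but once a site has been live for a while they certainly will. These bugs are hard to catch because a conversion does not always fail. It fails only when a character has no counterpart in the target encoding or the bytes are invalid. That is why the last line of the session, encode with options to replace invalid and undefined characters, is so practical.

If your site has no outside exposure, lucky you, there is little to worry about. That is unlikely, though, unless you build intranet apps. As soon as outsiders can read your site, lots and lots of bots will crawl your pages, and you will soon run into: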
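
To make the "does not always fail" point concrete, here is a minimal sketch with a made-up sample string. The é converts fine, while the Chinese characters have no ISO-8859-1 counterpart.

```ruby
text = "caf\u00E9 \u4E2D\u6587" # "café 中文"

# Strict conversion: raises because 中 and 文 do not exist in ISO-8859-1
begin
  text.encode("ISO-8859-1")
  strict_error = nil
rescue Encoding::UndefinedConversionError => e
  strict_error = e.class
end
p strict_error # Encoding::UndefinedConversionError

# Lenient conversion: anything that cannot be converted becomes "?"
lossy = text.encode("ISO-8859-1", invalid: :replace, undef: :replace, replace: "?")
p lossy.bytes # [99, 97, 102, 233, 32, 63, 63]
```

The lenient form never raises, at the cost of losing the characters it replaced, so it fits display code better than code that stores data.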


incompatible character encodings: UTF-8 and ASCII-8BIT

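Strictly speaking, this really isn't your fault, and it isn't Rails' fault either. It is yet another piece of historical baggage. As mentioned earlier, a URL must be ASCII, so as soon as your site reads the URL a visitor requested, congratulations: that external URL arrives in Rails as ASCII-8BIT. If you want the details, read: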

rack request environment variables are allways encoded in ascii-8bit regardless of default_external Encoding  GitHub

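This problem typically shows up when you use request.original_url to read the URL a visitor requested. The string is ASCII-8BIT encoded, and both the content and the encoding come from Rack, the layer underneath Rails. A URL is indeed supposed to be ASCII, so that is not wrong in itself, but once you start using the value you will eventually hit an error. Most of the time the cause is bots that assemble their own byte sequences, which then fail when we combine them with UTF-8 internally. A common place for this to happen is where we set meta tags: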

<meta property="og:url" content="<%= request.original_url %>" /> 

The fix is simple: turn it into UTF-8. If you don't care whether the content is exactly right, just force it:

<meta property="og:url" content="<%= request.original_url.force_encoding('utf-8') %>" />

This kind of bug is hard to track down. It took me two days to find, which is why I wrote it up here. I hope it helps.
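
The failure can be reproduced in plain Ruby without Rails. The sketch below simulates the Rack value with a binary string; the URL, the stray \xE9 byte, and the helper name utf8_url are all made up for illustration. The helper also scrubs after relabeling, so the result is always valid UTF-8.

```ruby
# Simulated request URL: tagged ASCII-8BIT (binary), with one stray byte
raw_url = "https://www.abc.com/search?q=\xE9".b

# Combining it with non-ASCII UTF-8 text raises the error in the heading
begin
  "網址: " + raw_url
  mix_error = nil
rescue Encoding::CompatibilityError => e
  mix_error = e.class
end
p mix_error # Encoding::CompatibilityError

# Hypothetical helper: relabel a copy as UTF-8, then replace invalid bytes
def utf8_url(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub
end

safe = utf8_url(raw_url)
p safe.encoding        # #<Encoding:UTF-8>
p safe.valid_encoding? # true
puts "網址: " + safe    # no error now
```

A binary string that contains only ASCII bytes combines with UTF-8 without complaint, which is why the error appears only for the occasional odd request.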

invalid byte sequence in UTF-8

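When this error appears, someone has most likely submitted bytes that are not valid UTF-8. If you step in with a debugger, you will usually see garbage like \xA6a\xC4y\xAEMøӣū\xADp. The simplest fix is the scrub method, which replaces each invalid byte sequence (with U+FFFD by default):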

"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"

Personally I quite like str.scrub('_'), which swaps in an underscore!
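
Wrapped up as a small helper, that could look like the sketch below. The name clean_utf8 is hypothetical, and the helper assumes the input is already tagged as UTF-8.

```ruby
# Hypothetical helper: return the string untouched when it is valid,
# otherwise replace every invalid byte sequence with the given text
def clean_utf8(str, replacement = "_")
  return str if str.valid_encoding?
  str.scrub(replacement)
end

clean = clean_utf8("hello")
p clean # "hello"

dirty = "abc\u3042\x81"
p dirty.valid_encoding? # false

cleaned = clean_utf8(dirty)
p cleaned == "abc\u3042_" # true
```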



Created: 2018/07/08; last updated: 2021/02/08.