Special Features of Strings

Special or Control Characters

As in most other high-level or scripting languages, a backslash paired with another single character indicates a "special" character, usually a non-printable one; the two-character pair is replaced by the single special character. These are the special characters discussed above that are not interpreted when the raw string operator precedes a string containing them. In addition to well-known characters such as NEWLINE ( \n ) and (horizontal) TAB ( \t ), specific characters may be given by their ASCII values: \OOO or \xXX, where OOO and XX are the character's octal and hexadecimal ASCII values, respectively. Here are the base 10, 8, and 16 representations of 0, 65, and 255:

    Decimal        0       65      255
    Octal          \000    \101    \377
    Hexadecimal    \x00    \x41    \xFF
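The representations of 0, 65, and 255 can be checked directly in the interpreter. The following is a minimal sketch, not taken from the original text: it assumes a current Python 3 interpreter rather than the 1.6/2.0 releases this section describes, and the helper name escape_forms is hypothetical.

```python
def escape_forms(value):
    """Return the (octal, hexadecimal) escape spellings for a value in 0..255.

    escape_forms is a hypothetical helper used only for this illustration.
    """
    return "\\%03o" % value, "\\x%02X" % value


for value in (0, 65, 255):
    octal_form, hex_form = escape_forms(value)
    print("%3d  %s  %s" % (value, octal_form, hex_form))

# The octal escape, the hexadecimal escape, and chr() all name the same character.
assert "\000" == "\x00" == chr(0)
assert "\101" == "\x41" == chr(65) == "A"
assert "\377" == "\xFF" == chr(255)
```

Each assertion compares one-character strings, so the octal and hexadecimal spellings are interchangeable wherever a string literal is written.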
Special characters, including the backslash-escaped ones, can be stored in Python strings just like regular characters. Another way in which Python strings differ from those in C is that they are not terminated by the NUL (\000) character (ASCII value 0). NUL is just like any other backslash-escaped character: a Python string may contain any number of NULs, and they may occur anywhere within it. They are no more special than any other control character. Table 6.7 summarizes the escape characters supported by most versions of Python.
As mentioned before, explicit ASCII octal or hexadecimal values can be given as well, and a NEWLINE can be escaped to continue a statement on the next line. All valid character values are between 0 and 255 (octal 0377, hexadecimal 0xFF).
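To illustrate the point about NUL, the short sketch below builds a string with embedded \000 characters and shows that len() and the ordinary string methods treat them like any other character. It is an illustration only: it assumes a current Python 3 interpreter, and the sample value record is made up.

```python
# A made-up string containing three NUL characters: two inside, one at the end.
record = "abc\000def\000ghi\000"

# NUL does not terminate the string, so every character is counted.
print(len(record))            # 12
print(record.count("\000"))   # 3

# repr() shows the NULs as escapes instead of writing them out raw.
print(repr(record))

# Because NUL is an ordinary character, it can be passed to split();
# the trailing NUL leaves an empty string as the last field.
fields = record.split("\000")
print(fields)                 # ['abc', 'def', 'ghi', '']
```

A C routine such as strlen() would stop at the first NUL and report 3; here the full length of 12 is preserved.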
One use of control characters in strings is to serve as delimiters. In database or Internet/Web processing, most printable characters are likely to be allowed as data items, which means they do not make good delimiters: it becomes difficult to tell whether a given character is a delimiter or data. Moreover, using a printable character such as a colon (:) as a delimiter limits the set of characters allowed in your data, which may not be desirable. One popular solution is to employ seldom-used, non-printable ASCII values as delimiters. These make ideal delimiters, freeing up the colon and the other printable characters for more important uses.

Triple Quotes

Although strings can be delimited by single or double quotes, it is often difficult to manipulate strings containing special or non-printable characters, especially the NEWLINE character. Python's triple quotes come to the rescue by allowing strings to span multiple lines, including verbatim NEWLINEs, TABs, and any other special characters.

The syntax for triple quotes consists of three consecutive single or double quotes (used in pairs, naturally):

    >>> para_str = """this is a long string that is made up of
    ... several lines and non-printable characters such as
    ... TAB ( \t ) and they will show up that way when displayed.
    ... NEWLINEs within the string, whether explicitly given like
    ... this within the brackets [ \n ], or just a NEWLINE within
    ... the variable assignment will also show up.
    ... """

Triple quotes let the developer avoid playing quote and escape character games, all the while bringing at least a small chunk of text closer to WYSIWYG (what you see is what you get) format.

The example below shows what happens when we use the print statement to display the contents of this string. Note how every single special character has been converted to its printed form, right down to the last NEWLINE at the end of the string between the "up." and the closing triple quotes. Also note that NEWLINEs occur either with an explicit carriage return at the end of a line or with the escape code (\n):

    >>> print para_str
    this is a long string that is made up of
    several lines and non-printable characters such as
    TAB (    ) and they will show up that way when displayed.
    NEWLINEs within the string, whether explicitly given like
    this within the brackets [
     ], or just a NEWLINE within
    the variable assignment will also show up.

We introduced the len() built-in sequence type function earlier; for strings, it gives the total number of characters in the string.

    >>> len(para_str)
    307

Applying that function to our string gives a result of 307, which includes the NEWLINE and TAB characters.

Another way to look at the string within the interactive interpreter is simply to give the interpreter the name of the object in question. Here we see the "internal" representation of the string, without the special characters being converted to printable ones. If that last NEWLINE we looked at above (after the final word "up" and before the closing triple quotes) is still elusive to you, take a look at the way the string is represented internally below. You will observe that the last character of the string is the aforementioned NEWLINE.

    >>> para_str
    'this is a long string that is made up of\012several lines and non-printable characters such as\012TAB ( \011 ) and they will show up that way when displayed.\012NEWLINEs within the string, whether explicitly given like\012this within the brackets [ \012 ], or just a NEWLINE within\012the variable assignment will also show up.\012'

String Immutability

In Section 4.7.2, we discussed how strings are immutable data types, meaning that their values cannot be changed or modified.
This means that if you do want to update a string, whether by taking a substring, concatenating another string onto the end, or concatenating the string in question onto the end of another string, a new string object must be created for it.

This sounds more complicated than it really is. Since Python manages memory for you, you won't really notice when this occurs. Any time you modify a string or perform any operation that is contrary to immutability, Python allocates a new string for you. In the following example, Python allocates space for the strings 'abc' and 'def'. But when the addition operation is performed to create the string 'abcdef', new space is allocated automatically for the new string.

    >>> 'abc' + 'def'
    'abcdef'

Assigning values to variables is no different:

    >>> string = 'abc'
    >>> string = string + 'def'
    >>> string
    'abcdef'

In the above example, it looks like we assigned the string 'abc' to string, then appended the string 'def' to string. To the naked eye, strings look mutable. What you cannot see, however, is that a new string was created when the operation "string + 'def'" was performed, and that the new object was then assigned back to string. The old string 'abc' was deallocated.

Once again, we can use the id() built-in function to help show us exactly what happened. If you recall, id() returns the "identity" of an object. This value is as close to a "memory address" as we can get in Python.

    >>> string = 'abc'
    >>>
    >>> id(string)
    135060856
    >>>
    >>> string = string + 'def'
    >>> id(string)
    135057968

Note how the identities are different for the string before and after the update. Another test of mutability is to try to modify individual characters or substrings of a string. We will now show that updating a single character or a slice is not allowed:

    >>> string
    'abcdef'
    >>>
    >>> string[2] = 'C'
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    AttributeError: __setitem__
    >>>
    >>> string[3:6] = 'DEF'
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    AttributeError: __setslice__

Both operations result in an error. To perform the actions that we want, we have to create new strings using substrings of the existing string, then assign those new strings back to string:

    >>> string
    'abcdef'
    >>>
    >>> string = string[0:2] + 'C' + string[3:]
    >>> string
    'abCdef'
    >>>
    >>> string[0:3] + 'DEF'
    'abCDEF'
    >>>
    >>> string = string[0:3] + 'DEF'
    >>> string
    'abCDEF'

So for immutable objects like strings, we observe that the only valid expression on the left-hand side of an assignment (to the left of the equals sign [ = ]) is the variable representing an entire object such as a string, not a single character or substring. There is no such restriction on the expression on the right-hand side.

Unicode Support

Unicode string support, introduced to Python in version 1.6, is used to convert between multiple double-byte character formats and encodings, and includes as much functionality to manage these strings as possible. With the addition of string methods (see Section 6.6), Python strings are fully featured to handle a much wider variety of applications requiring Unicode string storage, access, and manipulation. At the time of this writing, the exact Python specifications have not been finalized. We will do our best here to give an overview of native Unicode 3.0 support in Python:

unicode() Built-in Function: The unicode() built-in function should operate in a manner similar to that of the Unicode string operator (u/U). It takes a string and returns a Unicode string.
encode() Built-in Methods: The encode() built-in methods take a string and return an equivalent encoded string. encode() exists as a method for both regular and Unicode strings in 2.0, but only for Unicode strings in 1.6.
Unicode Type: There is a new Unicode type named unicode that is returned when a Unicode string is sent as an argument to type(), i.e., type(u'').
Unicode Ordinals: The standard ord() built-in function should work the same way. It was enhanced recently to support Unicode objects. The new unichr() built-in function returns a Unicode object for a character (provided it is a 32-bit value); a ValueError exception is raised otherwise.
Coercion: Mixed-mode string operations require that standard strings be converted to Unicode objects.
Exceptions: UnicodeError is defined in the exceptions module as a subclass of ValueError. All exceptions related to Unicode encoding/decoding should be subclasses of UnicodeError. Also see the string encode() method.
RE Engine Unicode-aware: The new regular expression engine should be Unicode-aware. See the re Code Module sidebar in the next section (6.8).
String Format Operator: For Python format strings, '%s' performs str(u) on Unicode objects embedded in Python strings, so the output will be u.encode(<default encoding>). If the format string is a Unicode object, all parameters are coerced to Unicode first and then put together and formatted according to the format string. Numbers are first converted to strings and then to Unicode. Python strings are interpreted as Unicode strings using the <default encoding>. Unicode objects are taken as is. All other string formatters should work accordingly. Here is an example:

    u"%s %s" % (u"abc", "abc") => u"abc abc"

Specific information regarding Python's support of Unicode strings can be found in Misc/unicode.txt in the distribution. The latest version of this document is always available online at:

    http://www.starship.python.net/~lemburg/unicode-proposal.txt

For more help and information on Python's Unicode strings, see the Python Unicode Tutorial at:

    http://www.reportlab.com/il8n/python_unicode_tutorial.html

No Characters or Arrays in Python

We mentioned in the previous section that Python does not support a character type. We can also say that C does not support string types explicitly. Instead, strings in C are merely arrays of individual characters. Our third fact is that Python does not have an "array" type as a primitive (although the array module exists if you really have to have one). Implementing strings as character arrays is also deemed unnecessary because strings already provide sequential access.

In choosing between single characters and strings, Python wisely uses strings as types. It is much easier to manipulate the larger entity as a "blob," since most applications operate on strings as a whole rather than on individual characters.
Applications will convert strings to integers, ask users to input strings, perform regular expression matches on substrings, search files for specific strings, and even sort a set of strings such as names. How often are individual characters operated on, except for searches (e.g., search-and-replace, search-for-delimiter)? Probably not often as far as most applications are concerned. However, such functionality should still be available to the Python programmer. Search-and-replace can be done with regular expressions and the re module; searching for delimiters and breaking up strings on them can be accomplished with split(); searching for substrings can be accomplished using find() and rfind(); and plain old character membership in a string can be verified with the in and not in sequence operators.

We are going to quickly revisit the chr() and ord() built-in functions, which convert between ASCII integer values and their equivalent characters, and describe one of the "features" of C that has been lost to Python because characters are not integer types in Python as they are in C: the ability to perform numerical calculations directly on characters, e.g., 'A' + 3. This is allowed in C because both 'A' as a char and 3 as an int are integers (1-byte and 2/4-bytes, respectively), but it would be a type mismatch in Python because 'A' is a string, 3 is a plain integer, and no such addition ( + ) operation exists between numeric and string types.

    >>> 'B'
    'B'
    >>> 'B' + 1
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    TypeError: illegal argument type for built-in operation
    >>>
    >>> chr('B')
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    TypeError: illegal argument type for built-in operation
    >>>
    >>> ord('B')
    66
    >>> ord('B') + 1
    67
    >>> chr(67)
    'C'
    >>> chr(ord('B') + 1)
    'C'

Our failure scenario occurred when we attempted to increase the ASCII value of 'B' by 1 to get 'C' by addition.
Rather than 1-byte integer arithmetic, our solution in Python involves using the chr() and ord() built-in functions.
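The chr(ord(...)) idiom generalizes naturally to a small helper. The sketch below is illustrative only: shift_char is a hypothetical name, not a built-in, and the code assumes a current Python 3 interpreter, where adding an integer to a string still fails with a TypeError (the wording of the message differs from the older output shown above).

```python
def shift_char(ch, offset):
    """Return the character `offset` positions after the single character `ch`.

    shift_char is a hypothetical helper written for this illustration.
    """
    return chr(ord(ch) + offset)


print(shift_char("B", 1))   # C
print(shift_char("A", 3))   # D

# Adding an integer directly to a string is still a type mismatch.
try:
    "B" + 1
except TypeError as exc:
    print("TypeError:", exc)
```

Wrapping the conversion in a function keeps the round trip through ord() and chr() in one place, which is the closest Python equivalent to the C expression 'A' + 3.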
© 2002, O'Reilly & Associates, Inc.