ASPN ActiveState Programmer Network
  ActiveState
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups | Web Services
SEARCH

Reference
ActivePython 2.5
Python Documentation
Language Reference
Front Matter
Contents
1. Introduction
2. Lexical analysis
2.1 Line structure
2.2 Other tokens
2.3 Identifiers and keywords
2.4 Literals
2.5 Operators
2.6 Delimiters
3. Data model
4. Execution model
5. Expressions
6. Simple statements
7. Compound statements
8. Top-level components
A. History and License
Index
About this document ...

MyASPN >> Reference >> ActivePython 2.5 >> Python Documentation >> Language Reference
ActivePython 2.5 documentation


2. Lexical analysis

A Python program is read by a parser. Input to the parser is a stream of tokens, generated by the lexical analyzer. This chapter describes how the lexical analyzer breaks a file into tokens.

Python uses the 7-bit ASCII character set for program text. New in version 2.3: An encoding declaration can be used to indicate that string literals and comments use an encoding different from ASCII. For compatibility with older versions, Python only warns if it finds 8-bit characters; those warnings should be corrected by either declaring an explicit encoding, or using escape sequences if those bytes are binary data, instead of characters.

The run-time character set depends on the I/O devices connected to the program but is generally a superset of ASCII.

Future compatibility note: It may be tempting to assume that the character set for 8-bit characters is ISO Latin-1 (an ASCII superset that covers most western languages that use the Latin alphabet), but it is possible that in the future Unicode text editors will become common. These generally use the UTF-8 encoding, which is also an ASCII superset, but with very different use for the characters with ordinals 128-255. While there is no consensus on this subject yet, it is unwise to assume either Latin-1 or UTF-8, even though the current implementation appears to favor Latin-1. This applies both to the source character set and the run-time character set.



See About this document... for information on suggesting changes.

Privacy Policy | Email Opt-out | Feedback | Syndication
© ActiveState 2004 All rights reserved