Index
1. XML Fundamentals
2. XML Markup
3. Well-formed and Valid Documents
4. XML Information Models
5. DTDs
6. XML Schemas
7. XSLT
8. XML Processing

1. XML Fundamentals

XML is structured data. It is not a programming language but a data format. XML is extensible because it is not a fixed format like HTML. XML uses tags to describe information content and has a hierarchical data structure. XML is a simplified subset of SGML.

Programming languages specify calculations, actions and decisions to be carried out, whereas markup languages (like XML) describe information for storage, transmission, or processing by a program. XML files cannot be run on their own; they need programs to create, display or process them.

XML removes the dependence on a single, inflexible document type (HTML) and also removes the complexity of SGML. XML does not replace HTML. HTML has been redefined as an XML vocabulary instead of an SGML vocabulary; this XML-based HTML is known as XHTML. XHTML 1.0 is a reformulation of HTML 4 in XML. Existing HTML files will work with XML tools only if they are well-formed.

XML Family:
- Display: XHTML, XSLT, XSL
- Modeling: DTD, XML Schema
- Manipulating: DOM, SAX
- Querying and linking: XLink, XQL, XPath

Strengths of XML:
- Robust data representation
- Easily mutable (APIs are easily accessed by existing code)
- Platform independent
- Works with existing technologies

The path of a standard at W3C: Notes -> Working Drafts -> Candidate Recommendations -> Proposed Recommendations -> Recommendations

2. XML Markup

Prolog
The prolog provides initial parameters to the XML processor:
<?xml version="1.0" encoding="UTF-16" standalone="yes" ?>
The default value for standalone is "no".

Elements
Element types are the document (root) element, other elements and empty elements. Empty elements are a common way of including multimedia files in a document.

Attributes
Attributes provide additional information about an element.

PCDATA vs CDATA
PCDATA is parsed character data and CDATA is unparsed character data.
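The PCDATA/CDATA distinction can be seen with a small sketch. It assumes Python's standard xml.etree.ElementTree parser; the sample document and element names are made up for illustration. The same text is written once as PCDATA (markup characters escaped) and once inside a CDATA section (markup characters left as-is).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: identical text as escaped PCDATA and as a CDATA section.
DOC = """<?xml version="1.0"?>
<samples>
  <pcdata>if (a &lt; b &amp;&amp; b &gt; c)</pcdata>
  <cdata><![CDATA[if (a < b && b > c)]]></cdata>
</samples>"""

root = ET.fromstring(DOC)
pcdata_text = root.find("pcdata").text  # parser resolves &lt; &amp; &gt;
cdata_text = root.find("cdata").text    # parser passes the characters through

# A raw "<" in PCDATA is treated as markup, so this document is rejected.
try:
    ET.fromstring("<pcdata>a < b</pcdata>")
    unescaped_accepted = True
except ET.ParseError:
    unescaped_accepted = False

print(pcdata_text)
print(cdata_text)
print(unescaped_accepted)
```

Both forms reach the application as the same string; the only difference is how the author has to write reserved characters.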
A CDATA section is used to pass data that contains characters reserved for markup. This is a useful technique when using data from a legacy system.

Processing instructions
Processing instructions are passed on to the processing application:
<?TargetName any sequence of characters?>

Comments
Comments are often stripped by the parser and not passed on to the application. The processor ignores them.
<!-- Here are comments -->

Namespaces
Namespaces allow us to disambiguate names in our document.

3. Well-formed and Valid Documents

XML constraints:
- XML prolog at the top (the XML declaration, when present, must be the first thing in the document)
- Only one root element
- Elements must nest properly
- Attribute values must be quoted
- Every start tag must have an end tag (names are case sensitive)

Well-formed XML documents must obey these basic XML constraints. If a document is not well-formed, it is not an XML document.

Structural/semantic constraints are defined in the information model (DTD, XML Schema):
- What element (tag) names are allowed
- What attributes are used with each element
- Which child elements belong to which parent elements
- What order child elements can appear in

If a document's structure and tag names match the information model, it is "valid". Validation is optional. A valid document is always well-formed.

4. XML Information Models

Purpose:
- Data control: ensure that elements in a document follow the required order
- Processing: check that a document matches a prescribed schema
- Definition: provide a way to define a schema
- Authoring: assist authors in creating valid documents

Information models enforce rules. Rules allow standardized documents. Standardized documents allow companies to exchange data.

Types of information models:
- DTDs
- XML Schemas

Schemas vs DTDs:
- Schemas support namespaces.
- A schema is written in XML syntax.
- Schemas provide extensive datatype support, whereas DTDs have very limited datatype support.
- Schemas have full object-oriented extensibility, whereas DTDs are extended via string substitutions.
- Schemas support open, closed or refinable content models, whereas DTDs support closed models only.

5. DTDs

A Document Type Definition is a series of statements in which document component names and the relationships between them are defined.

DTDs can be internal or external. The file dtdname.dtd is a DTD definition and <!DOCTYPE rootelement SYSTEM "dtdname.dtd"> is a DTD declaration. An internal DTD is defined and declared within the XML document. An external DTD is defined in a .dtd file and then declared in the XML document. Identifiers for external DTDs are SYSTEM (a location) or PUBLIC (a publicly registered identifier).

Element names
Names must start with a letter, "_" (underscore) or ":" (colon). Allowed following characters include letters, digits, "_" (underscore), "-" (hyphen), ":" (colon) and "." (dot).

Naming tips:
- Do not use cryptic names
- Avoid unwieldy names
- Keep a consistent naming scheme
- Do not use numbers
- Do not append the name of the parent to an element name

Element content
DTD syntax allows control of element content:
- Type: for example EMPTY
- Order: components separated by commas must appear only once and in the listed order; car|truck|bike means car or truck or bike
- Multiplicity: * zero or more, ? zero or one, + one or more, no symbol once and once only

When mixing component types, separate components with pipes; #PCDATA must be declared first.

Attributes
Sub-elements and machine-readable codes often end up as attributes.

Attribute types: string, tokenized (varying lexical and semantic constraints), enumerated (list of valid values).

Attribute qualifiers:
- #FIXED (a default value must be given and cannot be overridden)
- #IMPLIED (the attribute is optional in the XML document and has no default value)
- #REQUIRED (no default allowed in the DTD; a value is required in the XML document)
- Enumerated (optional default value)

Tokenized attributes:
- An ID attribute value uniquely identifies an element within the document. Similar to a primary key.
- IDREF is a pointer to an ID. Similar to a foreign key.
- IDREFS is a pointer to multiple IDs separated by spaces.
- ENTITY is a pointer to an external entity.
- ENTITIES is a list of entities separated by white space.
- NMTOKEN (name token) contains a single name-token value.
- NMTOKENS is a list of NMTOKENs separated by white space.

Entities
Entities are storage units or storage objects. They can be parsed or unparsed. Parsed entities are used as replacement text and invoked by name (for example &abc;). Unparsed entities are non-XML resources and are invoked by name via an ENTITY attribute. The declaration of an unparsed entity must reference a notation, which specifies the format type of the information and what application should handle it.

Types of entities:
- Pre-defined by the parser (for example &lt;)
- Internal general entity (text substitution)
- External general entity (uses a system identifier)
- Internal parameter entities (used within DTDs; use the % sign)

DTD weaknesses:
- Cryptic syntax
- Everything is treated as text
- Performance impact (validation requires extra time)
- Inconsistent parser support for entities
- Limited capacity for data validation
- Can define only hierarchical relationships
- Poorly suited for automation

Elements vs attributes:
- It is easier to edit/display element content than attribute values.
- Processors can check attribute values more easily than element content.
- It is easier to extract information from attributes than from sub-elements.
- Attributes can have default values, elements cannot.
- Elements define content, attributes describe content.

6. XML Schemas

XML Schema is a specification defined and maintained by the W3C. It is both a syntax and a model for describing the structure of XML documents.
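Because a schema is itself written in XML syntax, ordinary XML tooling can read it. The sketch below is an illustration only: the book schema is hypothetical, it uses the element names and namespace of the final W3C XML Schema Recommendation, and it merely lists declarations with Python's xml.etree.ElementTree. It does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # XML Schema namespace (W3C Recommendation)

# Hypothetical schema for a <book> element, used only as sample input.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book" type="BookType"/>
  <xs:complexType name="BookType">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="price" type="xs:decimal"/>
    </xs:sequence>
    <xs:attribute name="isbn" type="xs:string"/>
  </xs:complexType>
</xs:schema>"""

schema_root = ET.fromstring(SCHEMA)

# Element declarations in document order, as (name, type) pairs.
declared_elements = [
    (e.get("name"), e.get("type")) for e in schema_root.iter(f"{{{XS}}}element")
]
# Named complex types that other declarations can reference.
named_types = [t.get("name") for t in schema_root.iter(f"{{{XS}}}complexType")]

print(declared_elements)
print(named_types)
```

The same approach would not work for a DTD, which has its own non-XML syntax and needs a dedicated parser.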
XML Schemas: Highlights
- Enhanced datatypes
- Written in XML
- Object oriented
- Can express sets (child elements can occur in any order)
- Can specify element content as being unique (keys)
- Can define multiple elements with the same name but different content
- Can define elements with null content
- Can create equivalent elements (subway element equals train element)
- Open, closed and refinable content models
- Namespace support
- Grouping (attributes etc.)

XML Schema components
An XML schema has 3 components: declarations, types and type definitions.

Element types can be classified as simple types and complex types. Simple types have no children and no attributes. Complex types allow element children and attributes. Users can build new simple and complex types. Some simple types, such as boolean, string and decimal, are built in.

Named types and anonymous types
An anonymous type does not have a name and so is used in one element only. A named type has a name and can be referenced.

Inline declarations and out-of-line declarations
Inline declarations are done top to bottom and out-of-line declarations are done bottom to top. In inline declarations the tree is declared first with a ref to the branch. The branch is then declared with a ref to the leaf.

Custom data types
New data types can be created from an existing data type (called the base type). For example:
<simpleType name="name" base="source">
  <facet value="value"/>
  <facet value="value"/>
</simpleType>
source can be any one of string, boolean, float, double, decimal, timeDuration, recurringDuration, uriReference ...
facet can be any one of pattern, enumeration, length, maxlength, minlength ...

The default value of minOccurs is 1. The default value of maxOccurs is 1 when minOccurs is 0 or 1, and equal to minOccurs when the value of minOccurs is greater than 1.

7. XSLT

XSL vs XSLT
XSL is a styling language and XSLT is a specification used for transformation.
The conventional namespace prefix for XSL formatting objects is fo: and for XSLT is xsl:

CSS vs XSLT
CSS can only style XML documents, but XSLT can do the styling and can also:
- Reorder nodes in the input document
- Transform nodes in the input document
- Sort nodes in the input document
- Add and remove nodes in the input document
- Transform both attributes and elements

An .xsl file is actually an XSLT transform file.

Namespaces
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
xsl is the namespace prefix conventionally used for stylesheets; the namespace URI is case sensitive. Namespaces may or may not be present in the source and target documents. XSLT uses namespaces in the transformation file to differentiate between XSLT instructions and literal result elements.

XSLT components
The top-level element xsl:stylesheet can be replaced by xsl:transform.
xsl:apply-templates selects nodes to be processed, and processes them. For example, the XSLT instructions below select all child nodes of the root and process them:
<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

XSLT elements
XSLT elements can be classified into top-level elements and instruction elements.
Examples of top-level elements are <xsl:template>, <xsl:param>, <xsl:output>.
Instruction elements are only found inside the template body, for example <xsl:apply-templates>, <xsl:for-each>, <xsl:element>, <xsl:value-of>.

XSLT patterns
Patterns are used for node processing, template matching etc. XSLT patterns are described in the XPath spec. A pattern defines a condition that a node must satisfy in order to be selected. Examples of patterns are "/course/topic/slide" and "book/@isbn". Patterns are used in the match attribute of xsl:template, the select attribute of xsl:for-each ...

The general form of XSLT patterns is /step/step/step.
The expanded form is /axis::nodetest[predicate]/axis::nodetest[predicate]
- axis can be parent, child, self etc.
- nodetest can test the node name or the node type.
- Predicates potentially reduce the node list; zero or more are allowed and they are evaluated from left to right.
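The step and predicate form can be tried outside an XSLT engine. This is a rough sketch using Python's xml.etree.ElementTree, which supports only a limited subset of XPath: child steps, attribute predicates and positional predicates work, but attribute steps such as "book/@isbn" do not (read the attribute with .get() instead). The course document and its id attributes are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input shaped like the "/course/topic/slide" pattern in the notes.
DOC = """<course>
  <topic id="t1">
    <slide>Intro</slide>
    <slide>Markup</slide>
  </topic>
  <topic id="t2">
    <slide>DTDs</slide>
  </topic>
</course>"""

course = ET.fromstring(DOC)  # the root element, i.e. /course

# step/step: every slide child of every topic child, in document order.
all_slides = [s.text for s in course.findall("./topic/slide")]

# A predicate narrows the node list: only topics whose id attribute is "t2".
t2_slides = [s.text for s in course.findall("./topic[@id='t2']/slide")]

# Positional predicate (1-based): the first slide within each topic.
first_per_topic = [s.text for s in course.findall("./topic/slide[1]")]

print(all_slides)
print(t2_slides)
print(first_per_topic)
```

Each predicate is applied to the node list produced by the step it is attached to, which is why slide[1] returns one slide per topic and not a single slide overall.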
XML nodes
XML trees are composed of nodes, which have a type and may have a name. Nodes may have child nodes, a parent node etc.

XSLT extensions
XSLT engines may offer custom extensions to the basic W3C XSLT spec.

XSLT instructions
- <xsl:for-each> is similar to a "for loop".
- <xsl:sort> sorts the elements.
- <xsl:if test="something"> is similar to an "if statement". There is no <xsl:else> in XSLT; <xsl:choose> is used instead.
- <xsl:choose> is similar to "switch", <xsl:when> is similar to "case" and <xsl:otherwise> is similar to "default".
- <xsl:variable> is similar to a "final" variable. A variable has a scope limited to the element it is defined in.
- <xsl:param> declares a parameter, either at the top level of the stylesheet or at the start of a template.
- <xsl:call-template> calls a named template and passes parameters using <xsl:with-param>.
- <xsl:element> and <xsl:attribute> are used to create elements and attributes.
- <xsl:comment> emits comments.
- <xsl:processing-instruction> emits processing instructions.
- <xsl:text> emits text.
- <xsl:copy> does a shallow copy.
- <xsl:copy-of> does a deep copy.
- <xsl:number> is used to apply numbers to the output document. The count and level attributes of xsl:number are used to calculate node numbers.
- <xsl:output method="html"> tells the XSLT processor the type of output. The legal values of method are html, text and xml. Other attributes of xsl:output include version, encoding, indent etc.
- <xsl:strip-space> and <xsl:preserve-space> tell the XSLT processor whether to strip or preserve white space.

XSLT templates
Examples of template modes are "debug" and "production".
If more than one template can handle a given node, there is a conflict. The XSLT engine resolves this by assigning a priority to each template. The user can also assign a priority to a template. System-assigned priorities are from -0.5 to +0.5. User-assigned priorities are usually > +1.0. If two priorities match, the XSLT engine will typically pick the last one.

External stylesheets
Breaking stylesheets up into separate modules allows reuse.
Stylesheets can be included or imported. Importing a stylesheet is similar to subclassing.

XSLT functions
XSLT datatypes are string, number, boolean, node-set, result tree fragment ...
XSLT functions are called inside XSLT elements.

String functions:
- string(): conversion to string
- concat(): concatenates two or more strings
- starts-with(): takes two strings, returns true if the first string starts with the second string
- substring(): note that the first character is at position 1, not 0
- string-length(): returns the length of the string
- name(): returns the name of the node
- contains(): takes two strings and returns true if string1 contains string2
- document(): accesses other XML documents; takes a URL as argument
- id(): returns the elements whose ID matches the argument
- key(): returns the nodes that match a key defined with xsl:key
- translate(): takes three string arguments; each character of the first argument that occurs in the second argument is replaced by the character at the same position in the third argument
- normalize-space(): returns a string after trimming it and replacing sequences of white space with a single space
- substring-before(): takes two arguments, returns the substring of the first argument that precedes the first occurrence of the second argument; substring-after() works the same way for the text that follows

Boolean functions (all return a boolean):
boolean(), not(), true(), false(), lang()

Number functions (all return a number):
- number(), sum(), floor(), ceiling(), round()
- last(): returns the last position in the current node list
- position(): returns the position of the current node in the list
- count(): returns the number of nodes in the node-set passed as argument

Arithmetic operators available are +, -, *, div, mod

8. XML Processing

Interface vs implementation
The interface specifies "what" and the implementation specifies "how". The interface is the "spec" and the implementation is the "code". Different providers can provide different implementations for the same interface.

Parser
A set of software components designed for reading, processing and creating XML documents.
Parsers expose the structures and tags within an XML document, thus making it easy to process XML documents.

Types of parsers:
- SAX (Simple API for XML) parsers: event based
- DOM (Document Object Model) parsers: tree (object) based
- Validating parsers
- Non-validating parsers: faster

SAX vs DOM
- SAX is event-based and DOM is tree-based.
- SAX was developed by the XML-Dev mailing list and DOM is a W3C recommendation.
- DOM constructs a tree in memory and SAX does not.
- SAX fires events (streaming) and DOM reads the entire document.
- DOM is harder to use than SAX, but is flexible.
- SAX is read-only and DOM is read-write.
- SAX uses less memory and is fast and efficient.
- SAX is preferable for large documents. DOM is preferable for non-sequential processing.
- DOM maintains history. SAX does not.
- The DOM spec is written in CORBA IDL, the SAX spec is written in Java.
- The DOM spec is 500 pages whereas the SAX spec is only 20 pages.
- SAX gives control to the user during parsing, DOM gives control only after the parse.
- DOM provides range support, traversal support, HTML DOM support and CSS/stylesheet support.

Parser errors are of three types:
- warning: problems that are not errors as defined by the XML specification.
- error: errors defined by the XML specification. Recoverable.
- fatalError: defined by the XML specification. Non-recoverable.

SAX parsing
SAX has 5 interfaces and ~30 methods.

SAX interfaces:
- ContentHandler
- LexicalHandler
- DTDHandler
- DeclHandler
- ErrorHandler

Some of the methods which handle events are listed here, grouped by interface.

ContentHandler interface (the main SAX interface)
public void startDocument()
public void endDocument()
public void setDocumentLocator(Locator locator)
public void startElement(String uri, String localName, String qName, Attributes atts)
public void endElement(String uri, String localName, String qName)
public void characters(char[] ch, int start, int length)
public void ignorableWhitespace(char[] ch, int start, int length)
public void processingInstruction(String target, String data)
public void skippedEntity(String name)
public void startPrefixMapping(String prefix, String uri)
public void endPrefixMapping(String prefix)

LexicalHandler interface (for entities and CDATA)
public void startDTD(String name, String publicId, String systemId)
public void endDTD()
public void startEntity(String name)
public void endEntity(String name)
public void startCDATA()
public void endCDATA()
public void comment(char[] ch, int start, int length)

DTDHandler interface (DTD processing)
public void notationDecl(String name, String publicId, String systemId)
public void unparsedEntityDecl(String name, String publicId, String systemId, String notationName)

DeclHandler interface

ErrorHandler interface

DOM parsing
DOM provides two complementary views of the parse tree:
- Flat view: everything is a node
- Object-oriented view: objects

DOMImplementation methods: createDocument(), createDocumentType(), hasFeature()

Other parts of the DOM API:
- Document methods
- Node types
- Node methods
- NodeList interface
- NamedNodeMap interface
- CharacterData interface
- Element interface (manages attributes)
- Attr interface

DocumentFragment is a lightweight implementation of the document object which does not require a root element and will be inserted into a larger document.
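The SAX and DOM styles can be compared on one small input. This is a sketch using Python's standard xml.sax and xml.dom.minidom modules, not the Java API listed above: Python's ContentHandler.startElement(name, attrs) is the non-namespace counterpart of the Java startElement(uri, localName, qName, atts). The order document and the EventLog class are hypothetical names chosen for the example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical input document.
DOC = b"<order id='42'><item>pen</item><item>ink</item></order>"


class EventLog(xml.sax.ContentHandler):
    """Records SAX callbacks in the order the parser fires them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append(f"start:{name}")

    def characters(self, content):
        if content.strip():  # skip whitespace-only chunks
            self.events.append(f"chars:{content}")

    def endElement(self, name):
        self.events.append(f"end:{name}")

    def endDocument(self):
        self.events.append("endDocument")


# SAX: the handler is called while parsing; no tree is kept afterwards.
handler = EventLog()
xml.sax.parseString(DOC, handler)

# DOM: the whole document is read into a tree that can be queried and changed.
dom = minidom.parseString(DOC)
item_texts = [n.firstChild.data for n in dom.getElementsByTagName("item")]
order_id = dom.documentElement.getAttribute("id")

# A DocumentFragment has no root element; appending it moves its children in.
fragment = dom.createDocumentFragment()
new_item = dom.createElement("item")
new_item.appendChild(dom.createTextNode("paper"))
fragment.appendChild(new_item)
dom.documentElement.appendChild(fragment)
count_after = len(dom.getElementsByTagName("item"))

print(handler.events)
print(item_texts, order_id, count_after)
```

The SAX half only ever sees one event at a time, which is why it suits large documents; the DOM half can revisit and modify any node, at the cost of holding the tree in memory.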