Sunday, January 28, 2007

A meaningful Web for humans and machines, Part 2: Explore the parallel Web

Explore the parallel Web
Can you present two formats with one data source?
In this series of articles, we present a thorough, example-filled examination of the existing and emerging technologies that enable machines and humans to easily access the wealth of Web-published data. In this article, we examine the concept of the parallel Web and look at two techniques that Web content publishers use to put both human-readable and machine-consumable content on the Web: the HTML link element and HTTP content negotiation. With these two techniques, content consumers can choose among a variety of different formats of the data on a Web page. Review the history of these techniques, see how they are currently deployed on the Web, and learn how you might use the parallel Web to integrate calendar, banking, and photo data within an example scenario, MissMASH. Finally, we evaluate the parallel Web and determine that, while these techniques are mature and widely deployed, there are disadvantages to separating machine-readable data from the corresponding human-readable content.
");
}
}
}
//-->
In this article, we will cover our notion of the parallel Web. This term refers to the techniques that help content publishers represent data on the Web with two or more addresses. For example, one address might hold a human-consumable format and another a machine-consumable format. Additionally, we include within the notion of the parallel Web those cases where alternate representations of the same data or resource are made available at the same location, but are selected through the HTTP protocol (see Resources for links to this and other material).
HTTP and HTML are two core technologies that enable the World Wide Web, and the specifications of each contain a popular technique that enables the discovery and negotiation of alternate content representations. Content negotiation is available through the HTTP protocol, the mechanism that allows user agents and proxies/gateways on the Internet to exchange hypermedia. This technique corresponds mostly to the scenario in which alternate representations are found at the same Web address. In HTML pages, the link element indicates a separate location containing an alternate representation of the page. In the remainder of this article, we'll look at some of the history behind these two techniques, examine their current deployment and usage today, explain how they might be applied to help solve the MissMASH example scenario, and evaluate the strengths, weaknesses, and appropriate use cases for both.

The techniques necessary to weave a parallel Web have been available almost since the inception of the World Wide Web itself, and they were codified within the original HTTP/1.0 and HTML 1.0 specifications. However, even today their deployment, implementation, and usage vary. For example, the Apache Web server is the most widely used HTTP server on the Internet today (see Resources) and has had support for content negotiation for almost ten years. But few sites actually include content negotiation configurations and alternate representations of resources. In contrast, the number of Web pages that use the link element to point to alternate representations of pages has grown exponentially, especially with the advent of online journals and Weblogs. Let us briefly examine the origins of these two techniques so as to better understand how they have evolved and helped create a meaningful Web for both humans and machines.

Since the Web was conceived for access by humans across the globe and from all walks of life, it had to provide mechanisms to access hypermedia in a wide variety of languages, character sets, and media types. For example, a visually impaired individual might configure his user agent (his Web browser, usually) to prefer plain text representations of images on Web pages. The plain-text representation would contain information that described the image and that could be used by screen reading software, possibly including details such as the location, names of individuals, or dates. On the other hand, a standard Web browser intended for sighted individuals would request image resources as specific image media types that the browser is capable of rendering (image/png or image/jpeg, for instance). The potential use cases for alternate encodings of information on the Web are as diverse as the populations of human beings served by the Web, and the HTTP specification recognizes this and provides content negotiation to address it.
Content negotiation as a phrase did not appear until the completion of the HTTP/1.1 specification, but the core of its functionality was defined in the HTTP/1.0 specification (see Resources for links to both) -- albeit in a rather underspecified format. The HTTP/1.0 specification provided implementors with a small set of miscellaneous headers: Accept, Accept-Charset, Accept-Encoding, Accept-Language, and Content-Language. (We will go into more technical detail of these headers and their usage in the next section of this article.)
Historically, it is important to note that in its original form, the content negotiation mechanism left it completely up to the server to choose the best representation from all of the available combinations, given the choices sent by the user agent. In the next (and current) version of content negotiation, which arrived with HTTP/1.1, the specification introduced choices with respect to who makes the final decision on the alternate representation's format, language, and encoding. The specification mentions server-driven, agent-driven, and transparent negotiation.
Server-driven negotiation is very similar to the original content negotiation specification, with some improvements. Agent-driven negotiation is new and allows the user agent (possibly with help from the user) to choose the best representation out of a list supplied by the server. This option suffers from underspecification and the need to issue multiple HTTP requests to the server to obtain a resource; as such, it really hasn't taken off in current deployments. Lastly, transparent negotiation is a hybrid of the first two types, but is completely undefined and hence is also not used on the Web today.
Amidst this sea of underspecification, however, HTTP/1.1 introduced quality values: floating point numbers between 0 and 1 that indicate the user agent's preference for a given parameter. For example, the user agent might use quality values to indicate to the server its user's fluency level in several languages. The server could then use that information along with the faithfulness of its alternate language representations to choose the best translation for that request and deliver it as per server-driven content negotiation.
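As a concrete sketch of how quality values work, consider a request from a browser whose user reads Spanish fluently and English reasonably well. The host name, paths, and translated variants below are purely illustrative, not taken from any real exchange:

GET /parallelweb HTTP/1.1
Host: www.example.com
Accept: text/html
Accept-Language: es;q=1.0, en;q=0.7

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Content-Language: es
Vary: Accept-Language

Here the server weighs the stated language preferences against the translations it has available and, per server-driven negotiation, returns the Spanish variant while advertising (via Vary) that the response depends on the Accept-Language header.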
The examples you've seen of content negotiation so far have highlighted its uses to serve the heterogeneous mix of people accessing the World Wide Web. But just as diverse humans can use content negotiation to retrieve Web content in a useful fashion, so also can diverse software programs request content in machine-readable formats to further their utility on the Web. You will see how to use content negotiation to further a meaningful Web for humans and machines later in this article.

As with content negotiation, the link element has gone through some evolutionary stages in order to meet the requirements for a parallel Web. In the first stages of the HTML specification, the link element only possessed a small number of attributes, which mainly served to indicate the relationship between the current page and the location of the resource to which the element's href attribute refers. Most of these relationships were originally designed for navigational functionality, but the link element's relationships were always meant to be extensible. Although the semantics and governance of these relationships have never been officially defined, their usage has exploded in recent years. There's now even a process and a site at the Internet Assigned Numbers Authority (IANA) to register new relationships (see Resources).
Most notably, the HTML 4.01 specification introduced new attributes that can indicate the media type and character set of the resource referenced by the link element. It is thanks to these additions that the link element has received wider adoption, especially by Weblogs. Later in this article, you will see that while many of the traditional use cases for the link element revolve around the needs of humans, the element is increasingly used to reference machine-consumable content in an attempt to further software's capabilities on the Web.

To start, look at a series of example uses of the link element and see how its usage has evolved over time to provide machines with more accurate access to the data encoded in HTML pages. Listing 1 uses three of the standard link relationships defined within the HTML 4.01 specification (see Resources for all available link types) to give navigational hints to Web page readers. If publishers were to include this additional metadata in their pages, navigating certain types of documents, such as articles, books, or help manuals, would be much easier than the typical situation today, where users must fish for an anchor link, possibly labeled next, or possibly labeled something completely different.
Listing 1. Navigational link elements
"http://www.w3.org/TR/html4/strict.dtd">


The Parallel Web




...the rest of the article...
In Listing 2, you see the most popular use of the link element today. Most modern browsers have a decent amount of support for the Cascading Style Sheets specifications, allowing publishers to separate structure from presentation in their HTML pages and making it easier to change their visual presentation without having to recode the entire page. This example in Listing 2 makes use of a new attribute: media, which specifies the intended destination medium for the associated style information. The default value is screen, but you can also use print, or, to target mobile devices, handheld. (See Resources for more on these values.) Beware: while you can use multiple values in the media attribute, they must be separated by commas, as opposed to the spaces you'd use for values of the rel attribute.
Listing 2. Providing links to stylesheet information

<html>
<head>
<title>The Parallel Web</title>
<link rel="stylesheet"
      media="screen,print"
      type="text/css"
      href="../dw.css">
</head>
<body>
...the rest of the article...
</body>
</html>
Our next example uses the internationalization capabilities of the HTML specification to give browsers or user agents language-specific navigational information in order to best match the reader with her preferred tongue. In the case of this article series, the authors can, between them, speak two languages fluently. If we had a large Spanish readership, we could provide those readers with a link to the article in Spanish by using the link element.
Notice a few of the attributes and their values that we use in Listing 3. First, we use the alternate link type to express the relationship between this article and our Spanish version. Because we are using it in combination with the hreflang attribute, we imply that the referenced Web page is a translated, alternate version of the current document. We also add a lang attribute to specify the language of the title attribute. Finally, we add a type attribute value, which is optional and is only there as an advisory hint to the user agent. (A Web browser might use this hint, for example, to launch the registered application that can deal with the specified media type.)
Listing 3. Language-specific navigational link elements

<html>
<head>
<title>The Parallel Web</title>
<link rel="alternate"
      lang="es"
      title="La Red Paralela"
      type="text/html"
      hreflang="es"
      href="es/redparalela.html">
</head>
<body>
...the rest of the article...
</body>
</html>
The previous examples mostly served to show the most common uses of the link element, but, with the exception of the language-specific use of link, they didn't really show you how to provide alternate representations of the data found on the original page. And even in the language example given in Listing 3, the alternate representation is intended for consumption by humans and not by machines. The final link example in Listing 4 shows how blogs use the link element to refer to feeds. A feed is an XML representation of its HTML counterpart that carries more structured semantics than the HTML contents of a blog as rendered for a person. Blog-reading software can easily digest an XML feed in order to group blog entries by categories, dates, or authors, and to facilitate powerful search and filter functionality. Without the link element, software would not auto-discover a machine-readable version of a blog's contents and could not provide this advanced functionality. This is a good example of how content generated by humans (even through blog publishing tools) is not easily understood by machines without the use of a parallel Web technique. The feed auto-discovery mechanism enabled by the link element helps humans co-exist with machines on the Web.
In Listing 4, you begin to see the power of the link element emerge to provide machines with valuable information in decoding content on the Web today. We again make use of the "alternate" relationship to denote a new URL where you can find a substitute representation of the content currently being viewed. Through the use of media types, we can give the user agent the ability to choose the format it understands best -- whether it be Atom or RSS feeds or any other media format.
The link elements shown in Listing 4 create a parallel Web with three nodes. First, at the current URL is an HTML representation of this article intended for human consumption. Second, the relative URL ../feed/atom contains a machine-readable version of the same content represented in the application/atom+xml format. And finally, the relative URL ../feed/rss contains a different machine-readable version of the same content, represented this time in the application/rdf+xml format.
Listing 4. Providing auto-discovery links to feed URLs in blogs

<html>
<head>
<title>The Parallel Web</title>
<link rel="alternate"
      title="Atom 1.0 Feed"
      type="application/atom+xml"
      href="../feed/atom">
<link rel="alternate"
      title="RSS Feed"
      type="application/rdf+xml"
      href="../feed/rss">
</head>
<body>
...the rest of the article...
</body>
</html>
We must note that the values of the rel attributes are not limited to a closed set; the Web community is constantly experimenting with new values that eventually become the de facto standard in expressing a new relationship between the HTML page and the location specified in the href attribute. For instance, the pingback value used by the WordPress blog publishing system (see Resources) is now very widely used; it provides machines with a pointer to a service endpoint that connects Weblog entries with one another through their comment listings.
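In practice, such a pingback declaration is just one more link element in the document head, along these lines (the endpoint URL shown is illustrative; WordPress installations commonly expose the service through their xmlrpc.php script):

<link rel="pingback" href="http://www.example.com/blog/xmlrpc.php">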
As simple REST-based Web services permeate the Web, we foresee a large number of rel values being defined to point to many different service interfaces. Such an explosion of uses for the link element greatly enhances the functionality of traditional Web browsers but also comes with other problems involving governance and agreement of relationship types that are beyond the scope of this article. Find more information on the HTML link element in Resources.

We will cover content negotiation in somewhat less detail than the link element because of its possible complexity, but we'll go into enough detail to illustrate how it is used to create a parallel Web. In this discussion, we will focus on how content negotiation is configured using the popular Apache HTTP Web server software. Other Web server software works similarly, and the underlying concepts are the same.
Typically on the Web, URLs contain a file extension that indicates either the method used to produce the HTML (for example, .html for static pages, .php for PHP scripts, or .cgi for various scripting languages using the Common Gateway Interface) or the media format of the content at the URL (such as .png, .pdf, or .zip). While this approach is adequate in many cases, it requires that you create and maintain one URL for each different representation of the same content. In turn, this requires that you use some mechanism to link the multiple URLs together (such as the HTML link element described above, or the human-traversable a anchor link element) and that software know how to choose between the different URLs according to its needs.
Content negotiation allows different representations of data to be sent in response to a request for a single URL, depending on the media types, character sets, and languages a user agent indicates that it can (and wants to) handle. Whereas the link element constructs a parallel Web by assigning multiple URLs to multiple representations of the same content, content negotiation enables a parallel Web wherein multiple formats of the same content all are available at a single URL. The Resources in this article contain links to more detail on behind-the-scenes HTTP content negotiation.
Apache provides a content negotiation module that offers two methods to select resource variants depending on the HTTP request: type maps and MultiViews. A type map, referenced from the Apache configuration, maps a URL to file names based on any combination of language, content type, and character encoding. MultiViews works very similarly, except that the administrator does not have to create a type map file; instead, the map is created on the fly based on other settings specified throughout the server configuration. For brevity, in this article we will only show an example of a type map, but please see the Resources for links to very detailed Apache content negotiation guides.
In Listing 5, we list the contents of a type map used to provide the mapping necessary to serve different versions of our article based on language and media type. In line 1, we specify the single URI used to serve multiple format and language representations of the same article. Lines 3 and 4 establish that by default we will serve the file parallelweb.en.html in response to all requests for the text/html media type. In lines 6 through 9, we have a special case in which we can serve the Spanish version of our article if desired by the user agent. Next, in lines 10 and 11, we provide a printable PostScript version of our article. Finally, you see in lines 13 and 14 the ability for software to retrieve an Atom feed version of our article. Content negotiation allows blog-reading software requesting our article in the application/atom+xml content type to receive a representation of the article that the software can understand more easily than the standard HTML version.
Listing 5. Using type maps in Apache to serve multiple language and format representations
[1] URI: parallelweb
[2]
[3] URI: parallelweb.en.html
[4] Content-type: text/html
[5]
[6] URI: parallelweb.es.html
[7] Content-type: text/html
[8] Content-language: es
[9]
[10] URI: parallelweb.ps
[11] Content-type: application/postscript; qs=0.8
[12]
[13] URI: parallelweb.xml
[14] Content-type: application/atom+xml
In summary, this extra configuration on our server allows a single URL to be distributed for our article, while still permitting both human-friendly and machine-readable data to be available as long as the user agent requests it.
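To make the effect of Listing 5 concrete, here is a sketch of the exchange that might occur when blog-reading software asks for the Atom representation of the article. The host name is illustrative, and the exact response headers depend on how the server is configured:

GET /parallelweb HTTP/1.1
Host: www.example.com
Accept: application/atom+xml

HTTP/1.1 200 OK
Content-Type: application/atom+xml
Content-Location: parallelweb.xml
Vary: negotiate,accept

The same URL requested with Accept: text/html would instead return the parallelweb.en.html variant described in lines 3 and 4 of the type map.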
Now, let's move on to our example application and hypothesize how we might build it by using parallel Web techniques.

In our series overview (see Resources for a link), we proposed an example application called MissMASH that would illustrate the different techniques covered in the series. The application's ultimate goal is for the user to gain access to personal information hosted in different locations, all displayed using a calendar-based interface. Photographs could be hosted on Flickr, online banking statements could come from Citibank, and scheduling details from Google Calendar. (See Resources for links to all three.) Our focus is not to walk you through actual running code for this application, but to highlight the main mechanisms that would facilitate the building of such an application using software not managed by any of the hosting providers of our sample data.
Please keep in mind that parts of our scenario will be, by necessity, fictitious due to the mere fact that not all of our sample data providers employ the techniques covered in this series. The location of the software that powers MissMASH does not matter either. It could be written as a traditional server-side Web application using any Web application framework du jour, or it could be a Web-based client-side application, making use of Ajax techniques (built on the newly rediscovered XMLHttpRequest JavaScript object; see Resources) to compose the data. In any case, the key to this kind of application -- and the key question that we are asking -- is how MissMASH finds, identifies, and uses the information necessary to accomplish its purpose. Ideally, the less information it needs from the (human) user, the better. Let's see how the parallel Web might allow us to build MissMASH.
One of the very first pieces of information needed by MissMASH is some sort of online calendar account information. As you might imagine, this might become unwieldy in no time if we required MissMASH to understand every possible public and/or private calendar service on the Internet. But if these services make use of parallel Web techniques, it is as simple as entering the URL you see in your browser when accessing that service into MissMASH. Consider, for example, Google Calendar. (For the purposes of this example, we'll ignore user authentication details.) Imagine that a user configuring MissMASH navigates in a Web browser to his Google Calendar. Once there, he copies the URL in his browser's address bar and pastes it into the MissMASH configuration. MissMASH would request that URL, parse the returned HTML, and find inside a gem of the parallel Web: a link element.
Google offers multiple versions of its online calendar: iCalendar (see Resources), Atom feeds, and HTML. Unfortunately, it only provides a link element to the Atom feed version of the calendar, but it's just as described in Listing 4. The same would occur for Flickr, which happens to also provide a link to an Atom feed of your most recent photos. In either case, MissMASH would get a second URL from the href attribute of the link element and would resolve that URL to receive a machine-readable representation of the user's event and photo information. Because these representations contain (among other data) dates and locations that software can easily recognize, MissMASH can use these alternate representations to render calendar and photo data together.
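The autodiscovery element that MissMASH finds in the calendar page's head would look much like the markup in Listing 4 -- something along these lines, where the feed URL is invented for illustration and is not the actual address Google uses:

<link rel="alternate"
      title="Calendar Feed"
      type="application/atom+xml"
      href="http://calendar.example.com/feeds/private/basic">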
For illustrative purposes, let's also imagine that Citibank used content negotiation on the user's account URL. MissMASH can then negotiate with the Citibank server for any of the different supported formats, such as Atom feeds or the Quicken Interchange Format (see Resources).
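A sketch of that negotiation might look like the following request. The host name is fictitious, and because the Quicken Interchange Format has no registered MIME type, the application/x-qif value here is purely hypothetical:

GET /accounts/statement HTTP/1.1
Host: online.example-bank.com
Accept: application/x-qif;q=1.0, application/atom+xml;q=0.8, text/html;q=0.2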
We now have almost all of the information necessary to render our external personal data sources in MissMASH. Of course, for MissMASH to use the parallel Web in this way, it must understand both the structure and the semantics of the machine-readable representations that it retrieves. If MissMASH receives iCal data from Google Calendar, an Atom feed from Flickr, and Quicken Interchange Format data from Citibank, then it must understand how to parse and interpret all three of these machine-readable formats in order to merge the information. The parallel Web techniques do not themselves help resolve this matter; instead, the parallel Web's key benefit is that the user does not have to visit obscure configuration Web pages of the source-data services to find the URLs to one of the data formats supported by MissMASH. While MissMASH must still support a small number of different formats, this is far superior to requiring MissMASH to understand both a variety of data formats and a variety of application-specific, human-oriented services for retrieving that data.
Future articles in this series will examine techniques to create a meaningful Web that impose a smaller burden on the data consumer to understand multiple machine-readable formats.

As with most things on the Web, our conclusions about the evaluation categories we discuss in this section will not be clear cut. There are no black-and-white answers, which is why such a multitude of competing and complementary techniques exists. However, we will try our best to help you understand the reasoning behind our evaluation in order to understand what is required for machines to happily exist with humans on the Web. In these evaluations, we use the following summary key:
Good -- This technique fulfills most or all of the requirements of this evaluation criterion.
Fair -- This technique fulfills some of the requirements of this evaluation criterion, but has some significant drawbacks.
Poor -- This technique does not satisfy the requirements of this evaluation criterion.

Both of the parallel Web techniques that we have presented involve a data consumer retrieving alternative representations of content originally targeted for human beings. In the case of the link element, the link between the human- and machine-readable formats is contained within the HTML page itself. Therefore, the relationship is endorsed by the content creator, and we can be confident that the alternative representation is a true representation of the data shown to humans through the HTML Web page. With content negotiation, the alternative representation is chosen and returned by the Web server that controls the content, and hence we can similarly rest assured that the machine-readable data is true to the semantics of the human-oriented content.
Because the parallel Web techniques do not constrain the specific formats of alternative data representations, in theory, these techniques might provide machine-readable data that is not only equivalent but even richer in content and semantics than the original human-friendly Web page. However, we must point out that most online services use this technique to link to Atom feeds, which often only capture a small amount of the original data in machine-readable formats. For example, an Atom feed of banking transactions might capture the date, location, and title of the transaction in a machine-readable fashion, whereas the (not widely used on the parallel Web) Quicken Interchange Format contains much richer semantic data, such as account numbers, transaction types, and transaction amounts.
Our evaluation: The parallel Web provides authoritatively sound data, but in common usage does not provide machine-readable data that completely represents the authoritative human-oriented content.

The parallel Web techniques do not scale well for the purpose of expressivity and extensibility. These techniques achieve their extensibility by flexibly allowing for multiple alternative content representations. To accomplish this, however, either many URLs (in the case of the link element) or many server-side documents and their corresponding type maps (in the case of content negotiation) must be maintained. A publisher offering these techniques must also maintain code that generates each alternative representation from the underlying original data source (often data in a relational database).
All is not lost, however. Because the parallel Web techniques do not mandate one specific alternate representation, implementors are free to choose a representation that itself supports great extensibility and expressivity. In practice, however, the de facto formats used today on the Internet with parallel Web techniques do not offer much flexibility for extension to new data formats and richer semantics.
Our evaluation: The parallel Web is flexible enough to accommodate new data and new representations, but existing implementations do not scale easily and existing data formats are not particularly extensible.

The principle of "don't repeat yourself" (DRY) mandates that, if you must create both human- and machine-readable versions of data, you shouldn't need to maintain the two data formats separately. While a naive implementation of the parallel Web techniques might result in maintaining multiple files on a Web server that contain duplicate information, most real-world implementations derive the various representations from a single underlying data source (often a relational database).
Our evaluation: While the algorithms to generate the various data formats must be maintained, the actual data need only be represented a single time.

This is one area where the parallel Web falls short. In a scenario that makes use of the link element, the human-friendly and machine-readable versions of the relevant information are found at two different resource locations. In scenarios that make use of content negotiation, the same URL is used for both, but the human-friendly and machine-readable content are still received completely separately. In both cases, it is not usually feasible to merge and match up the data within two very different formats (for example, HTML and the Quicken Interchange Format). As such, there is no way to determine which machine-readable data corresponds with which human-oriented visual elements. Consequently, there is no easy way for humans or software to select just a small amount of content (both human and machine content) for reuse in other situations.
Our evaluation: The parallel Web provides no data locality whatsoever.

Any existing Web sites that already use content negotiation or the link element do so in a manner consistent with the parallel Web techniques described in this article. On the flip side, software that is aware of the parallel Web will not assign any meaning to Web pages beyond what is already there, and therefore there is no danger of assigning incorrect semantics to existing content.
Our evaluation: The parallel Web is a technique already in use, and adopting it does not threaten the meaning of any current Web sites.

By their very nature, our parallel Web techniques are standards compliant. The link element and content negotiation are specified in the HTML and HTTP specifications, respectively, and they are widely accepted by many development communities around the world. It's also important to note the standardization of MIME types at IANA and general agreement on many of the widely used formats on the Web today. However, what is not commonly standardized is the way in which developers should expose more specific information about the content of the linked resource beyond the specified MIME type -- for example, what the individual items within an Atom feed actually contain.
Our evaluation: Developers and implementors can rest assured that the parallel Web is grounded firmly in accepted standards.

The parallel Web benefits from a wealth of tooling that makes it attractive to developers. The Internet is filled with tools that either directly or indirectly support the parallel Web, making it very suitable for wider deployment. As you have seen, the Apache HTTP server contains extensive support for content negotiation, as do all other major Web server packages. HTML editor software supports creating link elements, and Web browsers and blog-reading software support their consumption. Additionally, given the specialized communities behind each of the formats, plenty of toolkits are available in many different languages and under many licenses to do most of the heavy lifting during application development.
Our evaluation: The parallel Web has been around for a while and a great deal of mature tooling is available to produce and consume it.

In general, the overall complexity of employing parallel Web techniques is moderate. The concepts are rather straightforward, and -- thanks to their standards compliance and the tooling available for them -- it is not very hard to implement the techniques. Additionally, there are several advantages to separating the specification of machine-readable information from the human-readable version:
As a designer, you are freed from the need to maintain both data and presentation.
You can optimize content to suit the needs of the target consumer.
At the same time, we have already touched on a few areas in which the parallel Web is more complex than other approaches. Using the link element requires that you maintain multiple URLs for multiple data representations. In a world in which one "404 Not Found" error can be the difference between a sale for you or a sale for your competitor, the maintenance of numerous active URLs can be a burdensome task. Content negotiation itself requires the maintenance of moderately complex server-side configurations. And because both techniques support multiple representations of the same data, you must write and maintain code to generate these representations from the core data.
Overall evaluation: The parallel Web is widely understood and used today, but it does require a significant amount of ongoing maintenance.

The parallel Web has long been used to further humans' experiences on the World Wide Web, and recently it has also been used a great deal to power software's ability to consume data on the Web. The parallel Web acknowledges that humans and machines have different content needs, and it keeps those needs separate by allowing some representations for humans and others for machines. It maintains those heterogeneous representations in parallel, however, either through the proliferation of many URLs united at the (human-oriented) root by the HTML link element, or by using content negotiation to mask the representations behind a single URL. The techniques that comprise the parallel Web are widely available and commonly used, and will continue to be used for the foreseeable future.
Nevertheless, you've seen in this article that both of these approaches have shortcomings that, to date, have limited them to semantically poor use cases. Use of the link element requires maintenance of multiple URLs and requires multiple HTTP requests to get at machine-readable data. Content negotiation hides different representations at a single URL and makes it difficult to unite the human view of the content with the machine view. Further, the lack of uniformity across machine-readable representations and the lack of extensibility of Atom, one of the only common machine-readable representations in use today, hamper the adoption of the parallel Web in cases such as MissMASH in which the data being produced and consumed is not consistently structured.

In the rest of this series, we'll examine techniques that strive to achieve a meaningful Web for humans and machines without maintaining two (or more) strands of a parallel Web. These techniques begin with an HTML Web page and derive semantics from the content of that page. Thus, they share the benefits of not requiring multiple URLs for multiple representations and of containing both the human- and machine-targeted content within the same payload. You will see, however, that these techniques still differ substantially in their approaches, and have their own advantages and disadvantages, which we will evaluate.
In the next article, we'll examine the algorithmic approach, in which third parties use structural and heuristic techniques to derive machine-readable semantics from HTML Web pages targeted at people.

Friday, January 26, 2007

A meaningful Web for humans and machines, Part 1: How humans can share the wealth of the Web

Explore techniques for coexistence with human- and machine-friendly data

In this series of articles we'll examine the existing and emerging technologies that enable machines and humans to easily access the wealth of Web-published data. We'll discuss the need for techniques that derive the human and machine-friendly data from a single Web page. Using examples, we will explore the relationships between the different techniques and will evaluate the benefits and drawbacks of each approach. The series will examine, in detail: a parallel Web of data representations, algorithmic approaches to generating machine-readable data, microformats, GRDDL, embedded RDF, and RDFa.
In this first article, you meet the human-computer conflict, learn the criteria used to evaluate different technologies, and find a brief description of the major techniques used today to enable machine-human coexistence on the Web.

The World Wide Web empowers human beings like never before. The sheer amount and diversity of information you encounter on the Web is staggering. You can find recipes and sports scores; you share calendars and contact information; you read news stories and restaurant reviews. You can constantly consume data on the Web that's presented in a variety of appealing ways: charts and tables, diagrams and figures, paragraphs and pictures.
Yet this content-rich, human-friendly world has a shadowy underworld. It's a world in which machines attempt to benefit from this wealth of data that's so easily accessible to humans. It's the world of aggregators and agents, reasoners and visualizations, all striving to improve the productivity of their human masters. But the machines often struggle to interpret the mounds of information intended for human consumption.
The story is not all bleak, however. Even if you were unaware of this human-computer conflict, there is no need to worry. By the end of this series, you'll have enough knowledge to choose intelligently among a myriad of possible paths to bridge between the data-presentation needs of machine and human data consumers.

In the early 1990s, Tim Berners-Lee invented HTML, HTTP, and the World Wide Web. He initially designed the Web to be an information space for more than just human-to-human interactions. He intended the Web to be a semantically rich network of data that could be browsed by humans and acted upon by computer programs. This vision of the Web is still referred to as the Semantic Web.
Semantic Web: A mesh of information linked up in such a way as to be easily processed by machines, on a global scale. The Semantic Web extends the Web by using standards, markup languages, and related processing tools.
However, by the very nature of its inhabitants, the Web grew exponentially and gave priority to content consumable mostly by humans rather than machines. Steadily, users' lives became more reliant on the Web, and we transitioned from a Web of personal and academic homepages to a Web of e-commerce and business-to-business transactions. Even as more and more of the world's most vital information flowed through its links, most Web-enabled interactions still required human interpretation. And, as expected, the growing presence of Internet-connected devices in people's lives has increased our dependence on those devices' software being able to understand data on the Web.


Clearly, machines were interacting with each other long before the Web existed. And if the Web has come so far so quickly while primarily targeted at human consumption, it's natural to wonder what's to be gained by developing techniques for machines to share the Web with humans as an information channel. To explore this question, imagine what the current Web would look like if machines did not understand it at all.
The top three Web sites (as of this article's writing) according to Alexa traffic rankings are Yahoo!, MSN, and Google -- all search engines. Each of these sites is powered by an army of software-driven Web crawlers that apply various techniques to index human-generated Web content and make it amenable to text searches. Without these companies' vast arrays of algorithmic techniques for consuming the Web, your Web-navigation experiences would be limited to following explicitly declared hypertext links.
Next, consider the 5th most trafficked site on Alexa's list: eBay. People commonly think of eBay as one of the best examples of humans interacting on the Web. However, machines play a significant role in eBay's popularity. Approximately 47% of eBay's listings are created using software agents rather than with the human-driven forms. During the last quarter of 2005, the machine-oriented eBay platform handled eight billion service requests. Also in 2005, the number of eBay transactions through Web Services APIs increased by 84% annually. It's clear that without the services eBay provides to let software agents participate equally with humans, the online auction business would not be nearly as manageable for people dealing with significant numbers of sales or purchases.
For a third example, we turn to Web feeds. Content-syndication formats such as Atom and RSS have empowered a new generation of news-reading software that frees you from the tedious, repetitive, and inefficient reliance on bookmarked Web sites and Web browsers to stay in touch with news of interest. Without the machine-understandable content representation embodied by RSS and Atom, these news readers could not exist.
In short, imagine a World Wide Web where a Web site could only contain content authored by humans exclusively for that site. Content could not be shared, remixed, and reused between Web sites. To intelligently aggregate, combine, and act on Web-based content, agents, crawlers, readers, and other devices must be able to read and understand that content. This is why it's necessary to take an in-depth look at the different mechanisms available today to improve the interactions between machines and human-generated content in Web applications.


Consider a scenario from the page of the Semantic Web activity at the W3C. Most people have some personal information that can be accessed on the Web. You can see your bank statements, access your online calendar applications, and post photos online through different photo-sharing services. But can you see your photos in your calendar to remind yourself of the place and purpose of those photos? Can you see your bank-statement line items displayed in your calendar too?
Creating new data integrations of this sort requires that the software driving the integration be able to understand and interpret the data on particular Web pages. This software must be able to retrieve the Web pages that display your photos from Flickr and discover the dates, times, and descriptions of your photos. It also needs to understand how to interpret the transactions from your online bank statement. The same software must be able to understand various views of your online calendar (daily, weekly, and monthly), and figure out which parts of the Web page represent which dates and times.
The example in Figure 1 shows how embedded metadata might benefit your end-user applications. You begin with your data stored in several places. Flickr hosts your photographs, Citibank provides access to your banking transactions, and Google Calendar manages your daily schedule. You wish to experience all of this data in a single calendar-based interface (missMASH), such that the photos from your Sunday at the State Park appear in the same weekly view as your credit card transaction from Wednesday's grocery shopping. To do this, the software that powers missMASH must have some way to understand the data from your Flickr, Citibank, and Google Calendar accounts in order to remix the data in an integrated environment.

A full spectrum of technologies gives application authors the ability to do such integrations. Some of the technologies are well established, while others are still fledgling and not as well understood. The barriers to entry for the technologies vary, and some of the technologies will provide a higher level of utility than others.
In this series, we'll examine how you might implement the scenario discussed above using the different mechanisms available for human-computer coexistence on the Web. We will introduce and explain each technology, then show how the technologies might be used to integrate bank statements, photos, and calendars. We will also evaluate the strengths and weaknesses of each technology, and hopefully make it easier for you to decide between the options.

Evaluation criteria

When you embark on a comparison of technologies, it is helpful to first outline the criteria to evaluate the technologies. The list below describes properties desirable of methods that facilitate a Web that is both human-friendly and machine-readable. We don't expect to find a single technology that succeeds in all these facets, nor should we. In the end, we hope to build a matrix that will aid you in choosing the right tool for the job. Here are the criteria we will use:

Authoritative data

Whatever the particulars are for a technique, you end up with machine-readable data that supposedly is equivalent to what shows up on the corresponding human-friendly Web page. You want the techniques that get you to that point to ensure the fidelity of this relationship; you want to be able to trust that the data really corresponds to what you read on the Web page.
The authority of the data is one axis along which to measure this trust. We consider one representation of data to be authoritative if the representation is published by the owner of the data. A data representation might be non-authoritative if it is derived from a different representation of the data by a third party.

Expressivity and extensibility

If you use one technique to create a Web page with both a human-readable and a machine-readable version of directions to your home, you'd rather not need a different technique to add the weather forecast to the same Web page. We hope that this criterion will help minimize the number of software components involved in any particular application, which in turn increases the robustness and maintainability of the application.
Along these lines, we appreciate it if the techniques accommodate new data in an elegant manner, and are expressive enough to instill confidence that in the future we can represent previously unforeseen data.

Don't repeat yourself (DRY)

If the same datum is referenced twice on a Web page, you don't want to write it down twice. Repetition leads to inconsistencies when changes are required, and you don't want to jump through too many hoops to create Web pages that both humans and computers can understand.
This does not necessarily apply to repetition within multiple data representations, if those representations are generated from a single data store.

Data locality

Multiple representations of the same data within the same document should be in as self-contained a unit of the document as possible. For example, if one paragraph within a large Web page is the only part of the page that deals with the ingredients of a recipe, then you want a technique to contain all of the machine-readable data about those ingredients to that paragraph.
In addition to being easy to author and to read, data locality allows visitors to the Web page to copy and paste both the human and machine representations of the data together, rather than requiring multiple discontinuous segments of the Web page be copied to fully capture the data of interest. This, in turn, promotes wider adoption and re-use of the techniques and the data. (Just ask anyone who has ever learned HTML with the generous use of the View Source command.)

Existing content fidelity

You want the techniques to work without requiring authors to rewrite their Web sites. The more that a technique can make use of existing clues about the intended (machine-consumable) meaning of a Web page, the better. But a caveat: techniques shouldn't be so liberal with their interpretations that they ascribe incorrect meanings to existing Web pages.
For example, a new technique that prescribed that text enclosed in the HTML u tag represent the name of a publication might lead to correct semantics in some cases, but might also license many incorrect interpretations. While the markup in this example might be authoritative (because it originates from the owner of the data), it is still incorrect because the Web page author did not intend to use the u HTML tag in this manner.

Standards compliance

You should be able to use techniques without losing the ability to adhere to accepted Web standards such as HTML (HTML 4 or XHTML 1), CSS, XML, XSLT, and more.

Tooling

Creating a Web of data that humans can read and machines can process is of little value if no tools understand the techniques used. We prefer techniques that have a large base of tools already available. Failing that, we prefer techniques for which one can easily implement new tools. Tools should be available both to help author Web pages that make use of a technique, and also to consume the machine-readable data as specified by the technique.

Overall complexity

The Web has been around for a while now, and it's only recently that the need to share data between humans and machines is receiving a lot of attention. The vast landscapes of content available on the Web are authored and maintained by a wide variety of people, and it is important that whatever techniques we promote be easily understandable and adoptable by as many Web authors as possible.
The best techniques are as worthless as no technique if they are too complex to adopt. The most desirable techniques will have a low barrier to entry and a shallow learning curve. They should not require extensive coding to implement, nor require painstaking efforts to maintain once implemented.

Coexistence options
This section provides a brief introduction to the major techniques used today to enable machine-human coexistence on the Web. Subsequent articles in this series will explore these techniques in detail.

In this world view, the data is represented on the Web with (at least) two addresses (URLs): one address holds a human-consumable format, and one a machine-consumable format. Technologies to enable the parallel Web include the HTML link element and HTTP content negotiation. Those involved in the creation of the HTML specifications saw the need for two linking elements in HTML: the a element that is visible and can only appear in the body of a Web page, and the link element that is invisible and can only appear in the head of a Web page. The HTML specification designers reasoned that agents, depending on their purpose and target audience, would interpret the links in the head based on their rel (relationship) attribute and perform interesting functions with them.
For example, Web feeds and feed readers have empowered humans to keep up with the vast amount of information being published today. When you use a feed reader, you initialize it with the address (URL) of an XML file -- usually an RSS or Atom file. In most cases, the machine-consumable data within such a feed has a parallel URL on the Web, where you can find a human-readable representation of the same content. There are a variety of techniques to achieve this parallel Web in a useful and maintainable fashion. Part 2 of this series will discuss the parallel Web in detail, including the benefits and drawbacks of having the same data available at more than one Web address. Future installments of this series will cover techniques that allow multiple data representations to be contained within a single Web address.

Algorithmic approaches encompass software that produces machine-consumable data from human-readable Web pages by application of an arbitrary algorithm. In general, the algorithms tend to fall into two categories:
Scrapers, which extract data by examining the structure and layout of a Web page
Natural-language processors, which attempt to read and understand a Web page's content in order to generate data
These techniques are designed for situations where the structure or content of a Web page is highly predictable and unlikely to change. The algorithms are usually developed by the person seeking to consume the data, and as such they are not governed by any standards organization. Often, these algorithms are an integrator's only option when faced with accessing data whose owner does not publish a machine-readable representation of the data. Stay tuned for details on the algorithmic approach in Part 3 of this series.

Microformats are a series of data formats that use existing (X)HTML and CSS constructs to embed raw data within the markup of a human-targeted Web page. Microformats are guided by a set of design principles and are developed by community consensus. The Microformats community's goal is to add semantics to the existing (X)HTML class attribute, originally intended mostly for presentation.
As with the algorithmic approach, microformats differ from many of the other techniques in our series because they are not part of a standards process in organizations such as the W3C or IETF. Instead, their principles focus on specific problems and leverage current behaviors and usage patterns on the Web. This has given microformats a great start towards their goal of improving Web microcontent (blogs, for example) publishing in general. The main examples of microformat success have been the hCard and hCalendar specifications. These specifications allow microcontent publishers to easily embed attributes in their HTML content that allow machines to pick out small nuggets of information, such as business cards or event information, from microcontent Web sites.
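As a brief preview of the kind of markup involved, a minimal hCard might be embedded in ordinary page content like this (the name, organization, and URL are invented for illustration):

<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Example Corp.</span>
  (<a class="url" href="http://www.example.com/~jane">home page</a>)
</div>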

Gleaning Resource Descriptions from Dialects of Languages (GRDDL) allows Web page publishers to associate their XHTML or XML documents with transformations that take the Web page as input and then output machine-consumable data. GRDDL can use XSL transformations to extract specific vocabularies from a Web page. GRDDL also allows the use of profile documents, which in turn reference the appropriate transformation algorithms for a particular class of Web pages and data vocabularies.
GRDDL has great potential for bridging the gap between humans and machines by enabling authoritative on-the-fly transformations of content. While this is similar to the parallel Web, there are significant differences. GRDDL provides a general mechanism for machines to transform content on demand, and GRDDL does not create permanent versions of alternative data representations. The W3C has recently chartered a GRDDL working group to produce a recommended specification for GRDDL.
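A GRDDL-enabled XHTML document typically declares the data-view profile in its head and links to the transformation that extracts its data, roughly as follows (the stylesheet URL is illustrative):

<head profile="http://www.w3.org/2003/g/data-view">
  <title>The Parallel Web</title>
  <link rel="transformation"
        href="http://www.example.com/xsl/extract-events.xsl">
</head>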

Embedded RDF is a technique for embedding RDF data within XHTML documents using existing elements and attributes. eRDF attempts to balance ease of markup with extensibility and expressivity. Along with RDFa and, to a lesser extent, GRDDL, it explicitly makes use of the Resource Description Framework (RDF) to model the machine-consumable data that it encodes. eRDF shares with microformats the principle of reusing existing vocabularies for the purpose of embedding metadata within XHTML documents. eRDF seeks to scale beyond a small set of formats and vocabularies by using namespaces and the arbitrary RDF graph data model.
Embedded RDF is not currently developed by a standards body. Similar to microformats, eRDF is capable of encoding data within Web pages to help machines extract contact, event, and location information (and other types of data) to enable powerful software agents.

RDFa, formerly known as RDF/A, is another mechanism for including RDF data directly within XHTML. RDFa uses a fixed set of existing and new XHTML elements and attributes to allow a Web page to contain an arbitrary amount and complexity of machine-readable semantic data, alongside the standard XHTML content that is displayed to humans. RDFa is currently developed by the W3C RDF-in-XHTML task force, a joint product of the XHTML and Semantic Web Deployment working groups.
As with eRDF, RDFa takes advantage of namespaces and the RDF graph data model to enable the representation of many data structures and vocabularies within a single Web page. RDFa seeks to be a general-purpose solution to the inclusion of arbitrary machine-readable data within a Web page.
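To give a flavor of the approach, a fragment of calendar data might be annotated roughly as follows. The syntax was still being finalized at the time of writing, so the namespace declaration and property names are shown only as a sketch:

<div xmlns:cal="http://www.w3.org/2002/12/cal/ical#">
  <span property="cal:summary">Sunday at the State Park</span>
  on <span property="cal:dtstart" content="20070121">January 21, 2007</span>
</div>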

In summary

This article motivated and explained the challenge of creating a World Wide Web that is accessible to both humans and to machines. We developed an example integration scenario that could be enabled by any of the myriad of coexistence mechanisms. We also discussed the criteria with which to compare and evaluate the techniques that we will cover in more detail in the rest of this series.
Stay tuned for Part 2, which will explore in detail the widely used parallel Web technique.

Monday, January 08, 2007

Exploiting Open Functionality in SMS-Capable Cellular Networks

Accepted at the 12th ACM Conference on Computer and Communications Security (CCS'05), November 7-11, 2005, Alexandria, VA, USA

Cellular networks are a critical component of the economic and social infrastructures in which we live. In addition to voice services, these networks deliver alphanumeric text messages to the vast majority of wireless subscribers. To encourage the expansion of this new service, telecommunications companies offer connections between their networks and the Internet. The ramifications of such connections, however, have not been fully recognized.

This research evaluates the security impact of the Short Messaging Service (SMS) interface on the availability of the cellular phone network. Specifically, we demonstrate the ability to deny voice service to large metropolitan areas with little more than a cable modem. Moreover, attacks targeting the entire United States are feasible with resources available to medium-sized zombie networks.

We characterize network behavior and explore a number of reconnaissance techniques aimed at effectively targeting attacks on these systems. We also discuss a number of countermeasures that mitigate or eliminate the threats introduced by these attacks that must be implemented by cellular service providers in the near future.
For more information, follow this link: http://www.smsanalysis.org/

Google Vulnerability A Sign Of Web 2.0 Weakness by Larry Greenemeier

Managers must weigh security risks and protect systems as employees use Web applications from workplace computers.
A design flaw discovered earlier this week in Web-based Google applications spotlights a troublesome security trend for IT departments: what to do about protecting internal systems and data as workers access Web-based e-mail and collaborative applications using their employer's PCs.

Google's problem, first reported by the Googlified Web site and since patched by Google, resulted from the way Google software stored information in a JavaScript file on the company's servers. Prior to the patch, an attacker could overwrite the JavaScript Object Notation, or JSON, that Google used to send information from its servers to a user's client device and gain access to all of the contact information stored in a user's Gmail account, as long as that user was logged on to any Google application. This is known as a "cross-site request forgery." JSON is what makes it possible for a Web mail application to, among other things, fill in the "To:" field in an e-mail from a user's address book after the user has typed in just a few characters.

Google acknowledged that, over the New Year's weekend, it was notified of a vulnerability related to the use of JSON objects that affected several of the company's products. "These objects, if abused, can expose information unintentionally," Google said.

While most security experts agree that guarding Web applications, a notorious security soft spot today, is crucial for the overall well-being of systems and data, they debate whether security vulnerabilities in consumer-focused Web apps such as Web mail, instant messaging, and social networking sites such as MySpace and Facebook are a great threat to business IT systems.

Employees use Web mail and other Web-based services from their work computers, and IT managers have little control over how securely those Web applications are written. Yankee Group senior analyst Andrew Jaquith says that it's what we don't know about Web applications that makes them so dangerous. "Because they aren't fully understood, they're going to attract a lot of attention from hackers," he says, adding that this should concern IT managers because "consumer-grade applications are increasingly becoming de facto parts of corporate IT infrastructures."

This means employees may be mixing IT work with pleasure in their cubicles, potentially adding work-related information to the vast repositories managed by Web mail systems. For example, whenever a user can't remember a password for a given Web site, they'll typically have that password mailed to a Web mail account because they can access that account from any computer with an Internet connection. If these passwords are for work-related sites, Web mail security becomes a problem.

"Web mail accounts give you access to everything," says Jeremiah Grossman, founder and CTO of WhiteHat Security, a maker of Web application security assessment software. Grossman, who also worked at Yahoo as its security officer, notes that cross-site request forgeries can be used for more than poaching information from Web mail accounts. "An attacker can gain access to any account the user is logged on to," he says. "This includes Web mail address books and even bank accounts."

Under another scenario, a Web mail user's ID and password could be stolen and then used by the attacker to send bogus messages to the victim's co-workers. "All the attacker has to do is send a Web mail saying 'I'm working from home today; use my Web mail account'," McGraw says. This trick could divert all sorts of business-related information to a Web mail account.

Yet other security experts see Web mail as more of a danger for users purposely or inadvertently leaking data out of their employers' IT environments, rather than as an attack vector for malware. "Applications that your employees are going to use that are not under the control of your IT department are definitely a security concern," says 451 Group senior analyst Nick Selby. But, "if an attacker is using malware, that's already being addressed by checking endpoints and isolating infected endpoints," he adds.

Google, Yahoo, and other Web companies that rely on the fancy Web 2.0 features enabled by JavaScript will most likely continue to respond quickly to security vulnerabilities, although it's less comforting to know that a site known as Googlified was the first to point out the most recent problem. While it's unrealistic for IT managers to stop the use of Web applications, they should be aware of the potential threats to their IT systems and data.

'Infomania' worse than marijuana

Workers distracted by email and phone calls suffer a fall in IQ more than twice that found in marijuana smokers, new research has claimed. The study for computing firm Hewlett Packard warned of a rise in "infomania", with people becoming addicted to email and text messages.

Researchers found 62% of people checked work messages at home or on holiday. The firm said new technology can help productivity, but users must learn to switch computers and phones off.

Losing sleep

The study, carried out at the Institute of Psychiatry, found excessive use of technology reduced workers' intelligence. Those distracted by incoming email and phone calls saw a 10-point fall in their IQ - more than twice that found in studies of the impact of smoking marijuana, said researchers.

More than half of the 1,100 respondents said they always responded to an email "immediately" or as soon as possible, with 21% admitting they would interrupt a meeting to do so.

The University of London psychologist who carried out the study, Dr Glenn Wilson, told the Daily Mail that unchecked infomania could reduce workers' mental sharpness. Those who are constantly breaking away from tasks to react to email or text messages suffer similar effects on the mind as losing a night's sleep.

Monday, January 01, 2007

Configuring Your Home Computer to Run Apache Server, PHP, MySQL, ColdFusion, and IIS

This tutorial for installing WAMP Server (Apache, PHP, MySQL, and phpMyAdmin) was written for the students in the Web Page Design program at the Contra Costa ROP. This tutorial will cover basic server installation for web development testing purposes on your home computer. This tutorial will not cover all of the necessary security settings used to create a public web server.

We currently use three texts for web database development:

Macromedia Dreamweaver MX 2004 - ISBN 0-619-21420-1 - Covers Dreamweaver from the ground up. This text also includes an introduction to web database development using PHP/MySQL and ASP/Access. This book is required for all students in the program.

Dreamweaver MX 2004 with ASP, Coldfusion, and PHP - ISBN 0-321-24157-6 - Covers some basic Dreamweaver functions, but the reader should have some prior experience using Dreamweaver to fully comprehend the text. Students can learn the basics of three different scripting languages, PHP/MySQL, ASP/Access, and Coldfusion/Access, by completing this book. Elective course.

Sams Teach Yourself PHP, MySQL, and Apache - ISBN 0-672-32489-X - For students who would like to learn how to hand-code PHP/MySQL applications. Elective course.
Definition of Terms

PHP: Hypertext Preprocessor is a scripting language that is embedded in HTML. PHP scripting code is frequently used to connect web pages to MySQL databases to create dynamic web sites. Some popular examples of PHP-driven web sites are blogs, message boards, and wikis. PHP files are actually plain text files; they can be created in TextPad or any other plain text editor. In a text editor, set the Save As type to text and save the file with a .php extension. Note: You must type the .php yourself. PHP files can also be created with Dreamweaver MX 2004. PHP is "open source" and therefore free. We use the Apache Server to work with PHP files in the classroom.

MySQL is a relational database system. MySQL uses "Structured Query Language" (SQL), which is the most common database language. It is the most popular "open source" database in the world. MySQL content can be managed via a command line or a web interface. MySQL is "open source" and therefore free. We use the Apache Server to work with MySQL databases in the classroom.
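
For a taste of the Structured Query Language mentioned above, here is a small sketch that connects to a local MySQL server and runs a single query. It uses Python's PyMySQL driver purely for illustration (the tutorial itself works with PHP and phpMyAdmin), and the host, credentials, and database name are assumptions you would adjust for your own installation.

# Illustrative only: connect to a local MySQL server and run one SQL query.
# The PyMySQL driver, the empty root password, and the 'test' database are
# assumptions; this tutorial manages MySQL through PHP and phpMyAdmin instead.
import pymysql

connection = pymysql.connect(host="localhost", user="root", password="", database="test")
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT VERSION()")   # any SQL statement can go here
        print("MySQL version:", cursor.fetchone()[0])
finally:
    connection.close()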

Coldfusion is another scripting language used to connect to databases. It performs essentially the same functions as PHP, but it is produced by Macromedia. There is a free developer edition of Coldfusion available on Macromedia's web site. We use the Apache Server to work with Coldfusion files in the classroom.

phpMyAdmin is a popular PHP driven web interface that allows you to manage your MySQL database content via a web form. It is much easier to learn to use phpMyAdmin to manage your MySQL database content than it is to edit it from the command line. phpMyAdmin is "open source" and therefore free.

Apache is a popular web server that many ISPs and individuals use to host web pages. When you install Apache on your system, your machine becomes a web server. Pages stored on your system in a "special folder" are accessible on the Internet via the machine's IP address. In order for pages to be viewed on the Internet, the files must be stored in a special directory; this directory is usually called htdocs, public_html, or www. If you use a web host, you probably upload your files to a directory with one of these names. If someone else wants to access your web pages, they must know the IP address of your machine, e.g., 179.199.9.100. Apache is "open source" and therefore free.

IIS or Internet Information Server is Microsoft's web server. It performs the same functions as Apache and is used primarily with ASP and Access files, but it can also run PHP and MySQL. It can be installed on Windows based machines (Windows 2000, Windows XP, etc.). IIS is available on the Windows installation CD. Typically you will install Apache or IIS; both services should not be running on your machine at the same time. IIS is needed to run ASP/Access applications.

Windows Services are used to run these servers and modules. Once services are installed they can be started and stopped via Start > Control Panel > Administrative Tools (switch to classic view if you do not see the Administrative tools) > Services. These Services can be configured to automatically start or to be started manually. The Apache Service and the IIS service should not run at the same time. This will be discussed in more detail later in the tutorial.
PHP and Coldfusion Testing Requirements

In order to test PHP and Coldfusion pages you must have Apache, MySQL, and the PHP and/or Coldfusion module installed on your own computer or have a host that supports PHP or Coldfusion. You can use your f2o.org account, but it is much easier to test your PHP pages on your local machine. If you do not have these modules installed, you would have to upload your PHP pages to your ISP to view them. The beauty of Apache, MySQL, and PHP is that they're free. If you are working in the classroom and you are creating PHP web pages, you must start the Apache Web Server at the beginning of class each day.

follow this link to get more

WAMP Installation Guide

WAMP (Windows-Apache-MySQL-PHP) is an all-in-one package that installs the basic programs you will need to get a localhost running and to be able to build and run PHP scripts.

This guide will walk you through installing a WAMP package for the first time.

Follow this link to get more information: http://www.eacomm.com/web/support/wamp_install/index.html

An Introduction to Salesforce.com's AppExchange, by Tony Stubblebine

This is part one of a three-part series on how to build and distribute applications on Salesforce's AppExchange.

I attended Salesforce's Dreamforce conference last month because I'd heard that Salesforce has been making a big effort to build a platform that was friendly to developers. I expected to be confronted with a pile of corporate-speak and a lot of vaporware, but what I found was much more surprising. Six different keynote presenters talked about mashups, and one-third of customers in attendance talked about wanting to build or purchase mashups. There was some corporate-speak, which these articles should cut through. The technology, however, was powerful and easy.

Two main things set Salesforce apart from other companies building development platforms. The first is that their platform is entirely "on demand", meaning there's no installed software; everything runs across the Internet in a software-as-a-service model. The other is the directory of Salesforce applications called AppExchange. You can build apps and keep them within your organization -- if, for example, you work in a corporate IT department. But if you want to share your application with the world, for profit or otherwise, the AppExchange directory is the answer. It's integrated directly into all Salesforce accounts. If you build an application inside your own account, you can package it directly to the AppExchange directory. Customers who find your application on AppExchange can install it directly into their accounts. Salesforce wants to remove the burden of customer acquisition and distribution so that developers can focus on what they do best: finding and solving problems. I met a number of small companies and individual consultants who all said that Salesforce is making it easier to sell software to the corporate world.

This series of articles will show you how to build and distribute an application on AppExchange. If you're not familiar with Salesforce, then there's some basic information that you should know. I'll lay that out in this article, as well as the process for setting yourself up as a Salesforce developer, the interfaces available for building applications, and the major sources of news and reference. The next two articles will lead you through the steps of building an application and distributing it on AppExchange.
What Is Salesforce.com?

If you're not in sales, you might not even know what Salesforce.com does or even what Customer Relationship Management (CRM) is. Salesforce started out primarily offering software for sales groups to manage their customer relationships. This included simple tools like address books, which are called contacts in the Salesforce world, and more complicated processes to track potential customers from lead to sale. Sales people also like reports with charts and graphs, so those are part of the package.

The interesting point from a developer's perspective is that the underlying technology is based around database concepts, with default actions and views. Salesforce has almost completely opened up this infrastructure. Even novice users can create custom objects that are the equivalent of database tables, and add or remove fields from the default objects.

This infrastructure includes a built-in customer base, built-in distribution through the AppExchange directory, built-in data and authentication models, developer support on the AppExchange Developer Network, and a slew of programming tools. That means developers can focus on solving new problems and not reinventing solutions to old ones.

This infrastructure has been flexible enough to allow Salesforce to branch out into other business applications like marketing and customer support. It has also been flexible enough to allow customers to build their own applications in areas like financial services and human resources.

Signing Up

To get a feel for what Salesforce customers experience, you need to sign up for an account. This will also serve as a sandbox for developing your own applications. You might be tempted to sign up for the 30-day, free trial offer that is prominently advertised on their home page, but signing up for a developer account from the AppExchange Developer Network will give you an account that never expires. The account does come with a few limitations. You can only have two users, one an admin account so that you can build and install applications and the other a normal user account so you can test your work from the perspective of a normal user. The account has a 2MB data limit, which is enough space to add roughly 1500 contacts. There seem to be some other limitations--you can't send mass emails, for example--but I didn't notice any that would hinder development.

After you sign up for an account you'll be sent a confirmation email. Following the link in the email will give you your first look at your new Salesforce account.


Browse the tabs to get a sense of the default functionality that comes with a Salesforce account.

Customizing

For most Salesforce customers, customizing their Salesforce account is a common and exciting experience that feels like application building. This functionality is available in the Setup area. Here users can add and remove fields, customize templates, and even create new database tables that come with automatically created forms and views. Salesforce provides a web UI for all of this functionality.

Don't worry, as a developer you're going to be able to build applications much more powerful than what most users are creating using the Salesforce customization forms. You should, however, know how to use these. You will need this familiarity if you want to extend the data schema, install AppExchange applications, or package your own AppExchange applications.

Start by clicking the Setup link located above the tabs.

Later in this article, we'll be using this area to install an application from AppExchange. For now let's get a feel for how the Setup area works by adding a Website field to the Contacts object, as many of our friends have their own websites.

Start by choosing Customize -> Contacts -> Fields in the left navigation bar. This will show a long list of the standard fields for the Contacts object.

[Screenshot: Edit Contact Fields]

At the bottom of the list of fields is a list of custom fields and a New button. Clicking this button will start you on a four-step process for adding a field to the Contacts object.

The first step is to select a field type. I was expecting to make this a text field, but found that Salesforce has an explicit URL field type that ensures that the URL will be displayed as a link. Choose the URL field type.

The second step is to give the field a label and a name. The label is shown on displays and reports alongside the field contents. The name is what you're going to use to reference the field when you're writing code. I chose "Website" for the label and "website" for the name.

The last two steps are important if you're managing a Salesforce account with thousands of users. If that were the case, you would want to spend some time setting the access controls and templates. For our purposes, you should just choose the defaults and then save your changes. To see your handiwork, visit the Contacts tab and click the New button. You should see that your field has been automatically added to the form.

Installing an AppExchange Application

Before we can build our own application for the AppExchange, we need to figure out how to install other people's applications.

The first step is to visit the AppExchange. There are over 400 applications in the directory so far. Some of the applications require that you pay for them. Some require that you install software on your own server. But many are completely free applications that run entirely in your Salesforce instance. The Pricing section of each application entry will let you know if the application is free or not.

If you see a Get It Now button then you can install the application directly into your Salesforce account. Other applications have Download buttons instead. These are applications that you run on your desktop or server that access your Salesforce account through the API.

Let's install Salesforce for Google AdWords, an application that lets you create and track Google AdWords campaigns from within Salesforce. It's a good example of a mashup that combines an external service with the database and reporting features of Salesforce.

[Screenshot: AdWords App Page]

Click the Get It Now button to start the installation process. This will take you through a series of confirmation screens asking you to review legal terms, the contents of the package, and security settings. On step 3, choose "Grant access to all users," since we're installing into an account that doesn't have any users.

It turns out that you still have a little more work to do once you've finished the install process. You need to customize your account through the Setup area so that the new application is visible. It would be nice if applications came with a help document that walked you through this process. However, in my experience the next step is usually the same. Most applications come with their own tab, which you need to add to your visible tabs. In this case click the arrow on your last tab, which takes you to the All Tabs tab. Then click the Customize My Tabs button.

[Screenshot: Customize Tabs]

This will give you a form for moving available tabs to your selected tabs. Choose Search Campaigns from the Available Tabs list and move it to the Selected Tabs list.

[Screenshot: Choose Search Tab]

Once you have saved your choice you'll see a new Search Campaigns tab that will let you create and track AdWords campaigns.
Building Native Applications

The most common form of Salesforce application development is done entirely through the Salesforce.com interface. Native applications are built by extending Salesforce's data schema, by writing custom HTML, and by writing JavaScript. These native applications can be bundled and shared through AppExchange.

Extending the data schema can be as simple as adding a field to an existing object like we did when we added a website field to the Contact object. It can also mean creating your own objects. These are essentially database tables, called "custom objects" in Salesforce parlance. Whenever you make a schema change, Salesforce automatically creates or updates the related forms and pages. You don't need to build the standard CRUD actions like create, update, show, or list.

Like most customizations, you can add custom objects through the Setup area. Get started by visiting Setup and then Build -> Custom Objects in the left sidebar.

[Screenshot: Add Custom Object]

Salesforce has just released a book that shows how to build an example recruiting application using native application techniques. They gave away hard copies of the book for free at their Dreamforce conference, but they've also just started including a link to a PDF copy as part of an advertising campaign on TechCrunch. Until they produce a more official page, you can download the book for free from their advertisement landing page.
Building Custom Pages with S-Controls

Congratulations on graduating from the basics. Most Salesforce development happens entirely through the Salesforce web UI, which puts incredible power to customize and build applications into the hands of people who wouldn't normally have any control over their applications. You, however, probably want to write code. For native applications, that starts with S-Controls.

S-Controls let you write your own HTML in order to build pages and forms. S-Controls include templating variables that let you access Salesforce data. The application-building part can be done in any web application technology, like Java applets or Flash. However, most people use JavaScript.

Salesforce is putting a lot of work into their Ajax toolkit. This toolkit allows you to call back to the Salesforce API in order to read and write data to the database. The beta release includes two excellent tutorials to get you started.

The Ajax toolkit will graduate from beta in Salesforce's Winter '07 release, due out in the next few months.
Accessing Data with SOQL

Many times when you're developing for Salesforce you'll be treating it like a database. Salesforce provides an SQL-like query language called SOQL that you use in combination with the API to query data. SOQL has been constrained by Salesforce's need to host the high volume of unpredictable queries its users generate. The Winter '07 release will eliminate the biggest limitation, finally allowing users to join multiple tables. The other difference you'll notice right away is that queries won't return the full result of a large data set. Instead you'll have to call queryMore until you've retrieved everything you need.

The developer documentation has the basic syntax.
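
To make the query/queryMore pattern concrete, here is a hedged Python sketch. The client object and its query and query_more methods are hypothetical stand-ins for whichever API toolkit you use; the done flag, queryLocator, and records fields mirror the shape of the QueryResult the API returns in batches.

# Sketch of the query/queryMore pattern. 'client' is a hypothetical wrapper
# around a Salesforce API toolkit; done, queryLocator, and records mirror
# the QueryResult structure that the API returns one batch at a time.
def fetch_all_contacts(client):
    soql = "select FirstName, LastName, Email from Contact"
    result = client.query(soql)              # first batch only
    records = list(result["records"])

    while not result["done"]:
        # Keep asking for the next batch until the server reports it is done.
        result = client.query_more(result["queryLocator"])
        records.extend(result["records"])

    return records
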
The Salesforce API

So far, we've been talking about building applications that are limited to filling out web forms and writing JavaScript. If you'd rather build an application in your favorite programming language, then you should plan on hosting the application on your own server and treating Salesforce as a database that you access through their API. This way you can access the data that your company or client is entering into the Salesforce application without having to be an expert yourself.

The Salesforce Projects and Toolkits page lists toolkits for almost every language including Java, .NET, Perl, PHP, and Ruby. Many of the toolkits hide the API behind a traditional Object Relational Model.

Before you get started, you need to generate and download a Salesforce WSDL file. Do this by going to the Setup area and then to Integrate -> AppExchange API.

Here's an example Perl script that would list the names, emails, and websites of all of our contacts. You could get going with a similar few lines of code in almost any language.

use strict;
use warnings;
use WWW::Salesforce::Simple;

# Salesforce credentials (replace with your own developer account login).
my $user = 'you@example.com';
my $pass = 'password';

# Log in to the Salesforce API.
my $sforce = WWW::Salesforce::Simple->new(
    'username' => $user,
    'password' => $pass,
);

# SOQL query; website__c is the custom field we added earlier.
my $query = "select FirstName, LastName, Email, website__c from Contact";
my $res   = $sforce->do_query($query);

# Print each contact's name, email, and website.
foreach my $field ( @{ $res } ) {
    print $field->{'FirstName'} . "\n"
        . $field->{'LastName'} . "\n"
        . $field->{'Email'} . "\n"
        . $field->{'website__c'} . "\n";
}

The developer docs will probably be your main reference when using the API.
Getting More Information

Developers will probably spend most of their research time referring to the documentation on the AppExchange Developer Network. However, there are several other important sources of information.

Salesforce hosts very active developer forums. You can get most any question about the API or the toolkit answered there. Many members of Salesforce's API and development teams also spend time there, so you're likely to get the inside scoop or even influence the direction of the product.

There are also two popular blogs that cover Salesforce and AppExchange development. Mark Mangano summarizes Salesforce news at SalesForceWatch by monitoring nearly a hundred RSS feeds. Scott Hemmeter talks about his experiences as a developer and Salesforce consultant at Perspectives on Salesforce. You should probably also subscribe to the official AppExchange blog.

If you're looking for a job as a Salesforce consultant or developer, you will want to watch the job board section of the developer forums. If you're looking to build a product that you can sell on AppExchange, then you should definitely check out Salesforce's IdeaExchange. Salesforce built IdeaExchange so that customers could submit feature requests and other customers could rate the requests, Digg-style. This is a gold mine for product ideas. Salesforce knows they can't develop all of these ideas themselves. That's why they're putting so much effort into encouraging an active developer community.
The Future

The Winter ’07 release of Salesforce will be out within a few months. This release will introduce many powerful new development features including inline S-Controls, improved Ajax support, improved SOQL, and external outbound messaging. Soon after the release, Salesforce will release the beta of a new programming language called Apex that runs on the Salesforce servers. This will let you write validation rules, triggers, and stored procedures. They’re also re-branding the development platform as Apex, not to be confused with the new programming language of the same name.

On the development roadmap is a JavaScript proxy that will let you access external APIs from JavaScript running on Salesforce hosted pages. This will eliminate one of the major challenges to building JavaScript-powered mashups.

One clearly missing feature of AppExchange is the ability to bill your customers. If you want to charge for your product, you're going to have to build the infrastructure yourself. The good news is that the people I talked to at Salesforce recognized this as a key service to enable developers to make a living by focusing on what they do best. It's not on the development road map yet, but it should be.

In the next article, we'll build a Salesforce application using the Salesforce API and add that application to the AppExchange.

(http://www.oreillynet.com/pub/a/network/2006/11/13/an-introduction-to-saleforcecoms-appexchange.html)