Understanding Online Social Network Usage from a Network Perspective
BY NIKOLAOS ZOURMPAKIS | HY-558 | May 12, 2016

Summary

Online Social Networks (OSNs) are a large part of the online community, with more than half a billion users. They form online communities among people with common interests, activities, backgrounds and/or friendships. However, our understanding of which features attract a user to an OSN in the first place is quite inadequate. Interest in this question is not only academic: ISPs want to improve their connectivity, researchers and developers try to identify trends and possible future designs, and the OSNs themselves try to improve and scale up their systems.

This study analyzes OSNs not through surveys or interviews, which are limited in scope, but through clickstreams, extracted by passively monitoring the network traffic of four different OSNs and reverse engineering it into user interactions. The primary goals are to gather details about feature popularity, session characteristics, impact on the network, user behavior and the dynamics within OSN sessions.

To describe an OSN session, some basic terms have to be established. The time between login and logout is an authenticated OSN session, while the time before and after is an offline OSN session. The overall time from one logout to the next is an OSN subsession. Most actions appear as request-response pairs (rr-pairs), separated into “active” and “inactive” according to whether they were triggered by a user’s click or by the interface itself. Indirect requests associated with an active request are considered active.

The data was collected with the help of two ISPs from different geographical regions. Using an HTTP analyzer framework, we extracted rr-pairs, focusing on those that belong to OSN sessions.
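The login/logout terminology above can be illustrated with a minimal sketch. It assumes a time-ordered list of rr-pairs and two hypothetical URI prefixes, `/login` and `/logout`; real OSNs each use their own URIs, and the `RRPair` type and function name are illustrative, not part of the study's software.

```python
from dataclasses import dataclass

# Hypothetical URI prefixes; every real OSN uses its own login/logout URIs.
LOGIN_URI = "/login"
LOGOUT_URI = "/logout"


@dataclass
class RRPair:
    timestamp: float  # seconds since the start of the trace
    uri: str          # request URI of the rr-pair


def label_session_state(rr_pairs):
    """Label each rr-pair as 'authenticated' or 'offline'.

    Everything from a login request up to and including the matching
    logout request is treated as authenticated; the rest is offline.
    Labels are returned in timestamp order.
    """
    state = "offline"
    labels = []
    for pair in sorted(rr_pairs, key=lambda p: p.timestamp):
        if pair.uri.startswith(LOGIN_URI):
            state = "authenticated"
            labels.append(state)
        elif pair.uri.startswith(LOGOUT_URI):
            # The logout request itself still belongs to the authenticated period.
            labels.append("authenticated")
            state = "offline"
        else:
            labels.append(state)
    return labels


# Made-up example trace, for illustration only.
trace = [
    RRPair(0, "/home"),
    RRPair(5, "/login"),
    RRPair(9, "/profile/42"),
    RRPair(20, "/logout"),
    RRPair(30, "/home"),
]
print(label_session_state(trace))
```

The sketch only looks at URIs. It ignores cookies and HTTPS, both of which matter for identifying users and logins in real traces.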
The grouping was achieved through the OSN cookies. The rr-pairs were separated into active and indirect ones, using the features of an OSN as categories. To validate the results, manual traces were recorded and compared. Given the complexity of OSNs, the analysis software had to be easily customizable and highly flexible. Below we break our methodology down into a number of parts:

OSN session handling: Tracking an OSN user relies mostly on the (non-standard) cookie or cookies set by each OSN. Through them we can distinguish the authenticated from the offline periods in a session. A login or logout can be identified by looking for the appropriate URI. If HTTPS is involved, we augment our HTTP traces with flow traces to get an indication of a possible login or logout. In cases where a session started before or ended after our tracing period, we either search for the cookie or presume that the last observed request is the end of the session.

Rr-pairs classification: Inspecting rr-pairs by URI or by the HTTP Referer header is a good way to tell them apart and build suitable patterns (the specifics differ for each OSN). If a cookie is missing, an rr-pair can be classified along with the last known rr-pair; otherwise it is marked UNKNOWN. Misclassification, though possible, is unlikely. Lastly, each rr-pair has to be labeled as active or indirect.

Customization: This concerns the manual traces developed for narrowing the trace collection to the relevant subset of traffic. Through them we were able to identify the various cookies, as well as the corresponding handshakes in the case of HTTPS, and to construct the patterns necessary for tracing active rr-pairs back to user actions.

Validation: The logical next step was to apply our methodology to the manual traces in order to verify that it works correctly.
The results were as expected: requests were assigned to the appropriate categories (both active and indirect), with almost none guessed or marked as UNKNOWN.

The lessons learned from the above points were that our approach is viable for a number of OSNs, with the only bottleneck being the adjustment of the manual traces to the special characteristics of each OSN. Even after a major reorganization of an OSN, the existing patterns change only slightly. However, the same principles cannot be applied to other Web 2.0 sites, since the growing adoption of HTTPS for security reasons prevents us from acquiring such network traces.

Feature popularity differs by location and service, and has to be analyzed from different perspectives:

All OSN requests: There are drastic differences from one category to the next, showing how difficult it is to approximate actual usage patterns with other techniques such as crawlers. The impact is more apparent when considering the byte distribution across categories. The rr-pairs related to photos are generally few in number, but when high-quality pictures are uploaded the bandwidth demand becomes apparent.

Differences across time and between users: The general behavior is as expected, with minor differences here and there. Regarding the users, the results indicate that they favor specific features on particular OSNs but also use some others consistently across all ISPs and OSNs (the usage of the profile category, across all the OSNs and the subsessions within them, shows a homogeneous distribution).

When comparing the general traffic characteristics of OSN sessions to those of other Web services in terms of size and duration, a couple of things stand out. For size, there is a heavy-tailed distribution, implying that a small fraction of the OSN sessions is responsible for most of the bytes on the network.
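One simple way to check the heavy-tail observation on session sizes is to compute the share of all bytes carried by the largest sessions. The following is a minimal sketch; the function name, the 10% cut-off and the sample sizes are illustrative assumptions, not values from the study.

```python
def top_share(session_bytes, top_fraction=0.1):
    """Return the fraction of all bytes carried by the largest sessions.

    session_bytes: iterable of per-session byte counts.
    top_fraction:  share of sessions (by count) treated as the "top" group.
    """
    ordered = sorted(session_bytes, reverse=True)
    total = sum(ordered)
    if not ordered or total == 0:
        raise ValueError("need at least one session with a non-zero size")
    # Always include at least one session in the top group.
    k = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:k]) / total


# Made-up session sizes in bytes, for illustration only.
sizes = [1000, 50, 40, 30, 20, 20, 15, 10, 10, 5]
print(f"top 10% of sessions carry {top_share(sizes):.0%} of the bytes")
```

A share close to 1 for a small `top_fraction` is consistent with a heavy tail, though it is only a summary statistic and does not replace inspecting the full distribution.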
As for the duration, the data confirms that users spend most of their session authenticated (relative to the total time spent in the OSN). It was also quite common to see multiple sessions per IP address, probably due to multiple computers per DSL line or multiple users per computer.

Lastly, we turn our focus to how users behave within a session, from two standpoints:

1. Active vs inactive time: A user is considered inactive after a set amount of time without activity (for example, 5 minutes). This also depends on the total duration of the session: shorter sessions are usually considered active, while longer ones tend to end up inactive. The most interesting point is what happens after inactive periods, where a preference for specific categories emerges (messaging after 5 minutes; home and offline after 10 minutes or more).

2. Feature sequence: There is a tendency toward specific feature sequences in user clickstreams, with home following messaging and profile being the most frequent one. Still, the dominant pattern is for the user to keep using one feature for a prolonged period before switching to another. The most time-consuming category is, as expected, messaging, because of the time it takes to compose a message.

All in all, we presented a customizable methodology for identifying OSN sessions and user actions, identified the features that are important to most users, and pointed out the differences from other Web services.
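The two within-session views described above (what users do after an inactive period, and which feature follows which) can be sketched as simple counts over a clickstream. The sketch assumes a time-ordered list of `(timestamp_seconds, feature)` tuples for active requests. The function names are hypothetical, and the 5-minute threshold is the example value from the text, not necessarily the one used in the study.

```python
from collections import Counter

INACTIVITY_THRESHOLD = 5 * 60  # seconds; the 5-minute example from the text


def first_feature_after_inactivity(clicks, threshold=INACTIVITY_THRESHOLD):
    """Count which feature is used first after each inactive period.

    clicks: list of (timestamp_seconds, feature) tuples sorted by time.
    A gap of at least `threshold` seconds between two consecutive clicks
    counts as an inactive period.
    """
    counts = Counter()
    for (prev_t, _), (t, feature) in zip(clicks, clicks[1:]):
        if t - prev_t >= threshold:
            counts[feature] += 1
    return counts


def feature_transitions(clicks):
    """Count consecutive (previous_feature, next_feature) pairs."""
    counts = Counter()
    for (_, prev_feature), (_, next_feature) in zip(clicks, clicks[1:]):
        counts[(prev_feature, next_feature)] += 1
    return counts


# Made-up clickstream, for illustration only.
clicks = [
    (0, "home"),
    (30, "messaging"),
    (60, "messaging"),
    (500, "messaging"),
    (520, "home"),
    (1300, "home"),
]
print(first_feature_after_inactivity(clicks))
print(feature_transitions(clicks))
```

Self-transitions such as `("messaging", "messaging")` correspond to the pattern of staying on one feature before switching to another.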