Feb. 27, 2022
Here is the link.
Background
Users in most social networking sites are describable in terms of a social graph. The relationships between users are represented by adjacency lists. If Jack and Jill are friends, they are said to be adjacent. This is known as an "edge" in the graph.
Determining Importance
You'll likely want to rank edges by importance rather than simply the most recent updates, meaning that you need to calculate some sort of score. Facebook's EdgeRank was described by the formula ∑e = ue we de, wherein ∑e is the sum of the edge's rank, ue is the affinity score with the user who created the edge, we is the weight for the content type, and de is a time decay factor.
Calculating a friend's affinity score can be done something like this: ∑i = li ni wi, wherein ∑i is the sum of the interactions with that friend, li is the time since your last interaction (this would need to be weighted so that 1 day > 30 days), ni is the number of interacts, and wi is the weight of those interactions. This method allows you to rank friends in a separate database and then perhaps only show ten updates from the ten closest friends, which isn't a bad idea considering few of us are likely to have more close friends than this.
What to Store
Determining what data to store depends on your front-end (including what activities your users participate in) and your back-end. I'll describe some general information you can store. Italics are special, optional information you might want or need depending on your schema.
Activity(id, user_id, source_id, activity_type, edge_rank, parent_id, parent_type, data, time)
- user_id - user who generated activity
- source_id - record activity is related to
- activity_type - type of activity (photo album, comment, etc.)
- edge_rank - the rank for this particular activity
- parent_type - the parent activity type (particular interest, group, etc.)
- parent_id - primary key id for parent type
- data - serialized object with meta-data
Assuming you're using MySQL as your database store, you can index on (user_id, time) and then perform your basic queries. An example feed row for a photo would be:
(id: 1, user_id: 1, source_id: some_source, activity_type:PHOTO, data: (photo_id: 1, photo_name: Getting married)).
In MySQL, your tables would be heavily denormalized since performing joins will hurt performance.
Potential Problems
- Visibility - must show interesting activities
- Performance - sorting time must be minimized
- Publishing - multiples points of failure depending on your publish method
Publishing Methods
"Push" Model, or Fan-out-on-write
This method involves denormalizing the user's activity data and pushing the metadata to all the user's friends at the time it occurs. You store only one copy of the data as in the schema above, then push pointers to friends with the metadata. The problem with this method is that if you have a large fan-out (a large number of followers), you run the risk of this breaking while your feed accumulates a backlog. If you go with this strategy, you also risk a large number of disk seeks and random writes. You'll want some sort of write-optimized data store such as Cassandra, HBase, or BigTable.
"Pull" Model, or Fan-out-on-load
This method involves keeping all recent activity data in memory and pulling in (or fanning out) that data at the time a user loads their home page. Data doesn't need to be pushed out to all subscribers as soon as it happens, so no back-log and no disk seeks. The problem with this method is that you may fail to generate a user's news feed altogether. To mitigate this risk, you should have a fallback mechanism in place that approximates the user's feed or serves as a good alternative.
Some Suggestions
- If you're using MySQL, you'll be want to be sure that your activities table is compact as possible, your keys are small, and that it's indexed appropriately.
- You may want to use Redis for fast access to fresh activity stream data. Redis is read-optimized and stores all data in memory. This is a good approach for the "Push" model described above.
Conclusions
While this is by no means an exhaustive answer, I'm trying to summarize as much information as I can. My sources for this answer are collected in the links below, so any information in this answer sadly goes without direct attribution. Special thanks, however, goes to Ari Steinberg for his very detailed answer to What are the scaling issues to keep in mind while developing a social network feed?
As I said at the beginning, I would love to get comments, feedback, or alternative approaches on this answer.
No comments:
Post a Comment